Merging and Splicing Time Series¶
This tutorial demonstrates the usage and difference between ts_merge
and ts_splice
, two methods for folding together time series into a combined data structure.
``ts_merge`` blends multiple time series together based on priority, filling missing values. It potentiallyu uses all the input series at all timestamps.
``ts_splice`` stitches together time series in sequential time blocks without mixing values.
We will describe the effect on regularly sampled series (which have the freq
attribute) and on irregular. We will also explore the ``names`` argument, which controls how columns are selected or renamed in the merging/splicing process. There is a file-level command line tools for this as well in the dms_datastore
package.
Prioritized filling on regular series¶
Let’s begin by showing how ts_merge
and ts_splice
fold together two regular series but gappy series on a prioritized basis.
Here are the sample series:
[1]:
# ========================================
# 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)
# ========================================
idx1 = pd.date_range("2023-01-01", periods=10, freq="1D")
idx2 = pd.date_range("2023-01-01", periods=12, freq="1D")
idx3 = pd.date_range("2022-12-31", periods=14, freq="1D")
series1 = pd.Series([1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10], index=idx1, name="A")
series2 = pd.Series([np.nan, 2, np.nan, 4, np.nan, np.nan, 7, np.nan, np.nan, np.nan,3.,4.], index=idx2, name="A")
series3 = pd.Series([1000.,1001., 1002., np.nan, 1004., np.nan, np.nan, 1007., np.nan, np.nan, np.nan,1005.,1006.,1007.], index=idx3, name="A")
print("Series 1 (Primary):")
display(series1)
print("\nSeries 2 (Secondary - Fills Gaps):")
display(series2)
print("\nSeries 3 (Tertiary - Fills Gaps):")
display(series3)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[1], line 4
1 # ========================================
2 # 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)
3 # ========================================
----> 4 idx1 = pd.date_range("2023-01-01", periods=10, freq="1D")
5 idx2 = pd.date_range("2023-01-01", periods=12, freq="1D")
6 idx3 = pd.date_range("2022-12-31", periods=14, freq="1D")
NameError: name 'pd' is not defined
And here is what it looks like spliced instead of merged.
[ ]:
# ========================================
# 2️⃣ Using `ts_merge()` with Prioritization
# ========================================
merged_series = ts_merge((series1, series2, series3))
print("\nMerged Series with Prioritization:")
display(merged_series)
Merged Series with Prioritization:
2022-12-31 1000.0
2023-01-01 1.0
2023-01-02 2.0
2023-01-03 3.0
2023-01-04 4.0
2023-01-05 5.0
2023-01-06 6.0
2023-01-07 7.0
2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 3.0
2023-01-12 4.0
2023-01-13 1007.0
Freq: D, Name: A, dtype: float64
Splicing¶
Splicing marches through the prioritized list of input time series and exclusively uses values for the higher priority series one during the entire span of that series.
[ ]:
spliced_series = ts_splice((series1, series2, series3))
print("\nSpliced Series with Prioritization and default `prefer last`:")
display(spliced_series)
spliced_first = ts_splice((series1, series2, series3),transition="prefer_first")
print("\nSpliced Series with Prioritization, Prefer first:")
display(spliced_first)
Spliced Series with Prioritization:
2022-12-31 1000.0
2023-01-01 1001.0
2023-01-02 1002.0
2023-01-03 NaN
2023-01-04 1004.0
2023-01-05 NaN
2023-01-06 NaN
2023-01-07 1007.0
2023-01-08 NaN
2023-01-09 NaN
2023-01-10 NaN
2023-01-11 1005.0
2023-01-12 1006.0
2023-01-13 1007.0
Freq: D, Name: A, dtype: float64
Spliced Series with Prioritization, Prefer first:
2023-01-01 1.0
2023-01-02 NaN
2023-01-03 3.0
2023-01-04 NaN
2023-01-05 5.0
2023-01-06 6.0
2023-01-07 NaN
2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 3.0
2023-01-12 4.0
2023-01-13 1007.0
Freq: D, Name: A, dtype: float64
Irregular series¶
Now we will look at some irregular series and see the difference in output from ts_merge (which shuffles) and ts_splice (which exclusively uses values from one series at a time based on the span of the series and its priority)
[ ]:
# ========================================
# 3️⃣ Creating Irregular Time Series (No Freq Attribute)
# ========================================
idx_irreg1 = pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-07", "2023-01-10"])
idx_irreg2 = pd.to_datetime(["2023-01-02", "2023-01-04", "2023-01-08", "2023-01-11"])
series_irreg1 = pd.Series([1, np.nan, 3, 4], index=idx_irreg1, name="A")
series_irreg2 = pd.Series([10, 20, np.nan, 40], index=idx_irreg2, name="A")
print("\nIrregular Series 1:")
display(series_irreg1)
print("\nIrregular Series 2:")
display(series_irreg2)
# ========================================
# 4️⃣ Using `ts_merge()` with Irregular Time Series
# ========================================
merged_irregular = ts_merge((series_irreg1, series_irreg2))
print("\nMerged Irregular Series (May Shuffle Timestamps):")
display(merged_irregular)
# ========================================
# 5️⃣ Using `ts_splice()` with Irregular Time Series
# ========================================
spliced_irregular = ts_splice((series_irreg1, series_irreg2), transition="prefer_last")
print("\nSpliced Irregular Series (prefer_last):")
display(spliced_irregular)
Irregular Series 1:
2023-01-01 1.0
2023-01-03 NaN
2023-01-07 3.0
2023-01-10 4.0
Name: A, dtype: float64
Irregular Series 2:
2023-01-02 10.0
2023-01-04 20.0
2023-01-08 NaN
2023-01-11 40.0
Name: A, dtype: float64
Merged Irregular Series (May Shuffle Timestamps):
2023-01-01 1.0
2023-01-02 10.0
2023-01-03 NaN
2023-01-04 20.0
2023-01-07 3.0
2023-01-08 NaN
2023-01-10 4.0
2023-01-11 40.0
Name: A, dtype: float64
Spliced Irregular Series (prefer_last):
2023-01-01 1.0
2023-01-02 10.0
2023-01-04 20.0
2023-01-08 NaN
2023-01-11 40.0
Name: A, dtype: float64
Names
argument¶
Finally let’s look at some more intricate examples with mixed series and dataframes with differing numbers of columns and see how names
can be used to make selections or unify poorly coordinated labels. Here are the series:
[ ]:
import pandas as pd
import numpy as np
from vtools import ts_merge, ts_splice # Assuming these functions are in merge.py
# Create irregular time series
idx1 = pd.date_range("2023-01-01", periods=5, freq="2D")
idx2 = pd.date_range("2023-01-02", periods=5, freq="2D")
series1 = pd.Series([1, np.nan, 3, 4, 5], index=idx1, name="A")
series2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name="B")
df1 = pd.DataFrame({"A": [1, np.nan, 3, 4, 5], "B": [10, 20, 30, 40, 50]}, index=idx1)
df2 = pd.DataFrame({"A": [10, 20, np.nan, 40, 50], "B": [100, 200, np.nan, 400, 500]}, index=idx2)
df3 = pd.DataFrame({"A": [310, 320, np.nan, 340, np.nan],
"B": [100, 200, np.nan, 400, 500],
"C": [3100, 3200, np.nan, 3400, 3500]
}, index=idx2)
# Display Data
print("Series 1:")
display(series1)
print("Series 2:")
display(series2)
print("DataFrame 1:")
display(df1)
print("DataFrame 2:")
display(df2)
print("DataFrame 3:")
display(df3)
Series 1:
2023-01-01 1.0
2023-01-03 NaN
2023-01-05 3.0
2023-01-07 4.0
2023-01-09 5.0
Freq: 2D, Name: A, dtype: float64
Series 2:
2023-01-02 10.0
2023-01-04 20.0
2023-01-06 30.0
2023-01-08 NaN
2023-01-10 50.0
Freq: 2D, Name: B, dtype: float64
DataFrame 1:
A | B | |
---|---|---|
2023-01-01 | 1.0 | 10 |
2023-01-03 | NaN | 20 |
2023-01-05 | 3.0 | 30 |
2023-01-07 | 4.0 | 40 |
2023-01-09 | 5.0 | 50 |
DataFrame 2:
A | B | |
---|---|---|
2023-01-02 | 10.0 | 100.0 |
2023-01-04 | 20.0 | 200.0 |
2023-01-06 | NaN | NaN |
2023-01-08 | 40.0 | 400.0 |
2023-01-10 | 50.0 | 500.0 |
DataFrame 3:
A | B | C | |
---|---|---|---|
2023-01-02 | 310.0 | 100.0 | 3100.0 |
2023-01-04 | 320.0 | 200.0 | 3200.0 |
2023-01-06 | NaN | NaN | NaN |
2023-01-08 | 340.0 | 400.0 | 3400.0 |
2023-01-10 | NaN | 500.0 | 3500.0 |
Here are some example usage:
[ ]:
# Example: Using `names` to rename output columns
# Merging without a rename
merged_series_named = ts_merge((series1, series2))
print("Merged Series not renamed:")
display(merged_series_named)
# Rename a single column
merged_series_named = ts_merge((series1, series2), names="Renamed_A")
print("Merged Series renamed:")
display(merged_series_named)
# Select specific columns in DataFrame
try:
merged_df_named = ts_merge((df1, df2, df3), names=None)
except:
print("Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match")
#display(merged_df_named)
# Select specific columns in DataFrame
merged_df_named = ts_merge((df1, df2), names=None)
print("Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:")
display(merged_df_named)
# Select specific columns in DataFrame
merged_df_named = ts_merge((df1, df2, df3), names=["A"])
print("Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)")
display(merged_df_named)
# Rename column in splicing
spliced_series_named = ts_splice((series1, series2), names="Renamed_A", transition="prefer_last")
print("Spliced Series with Renamed Column:")
display(spliced_series_named)
Merged Series not renamed:
2023-01-01 1.0
2023-01-02 2.0
2023-01-03 3.0
2023-01-04 4.0
2023-01-05 5.0
2023-01-06 6.0
2023-01-07 7.0
2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 3.0
2023-01-12 4.0
Freq: D, Name: A, dtype: float64
Merged Series renamed:
2023-01-01 1.0
2023-01-02 2.0
2023-01-03 3.0
2023-01-04 4.0
2023-01-05 5.0
2023-01-06 6.0
2023-01-07 7.0
2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 3.0
2023-01-12 4.0
Freq: D, Name: Renamed_A, dtype: float64
Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match
Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:
A | B | |
---|---|---|
2023-01-01 | 1.0 | 10.0 |
2023-01-02 | 10.0 | 100.0 |
2023-01-03 | NaN | 20.0 |
2023-01-04 | 20.0 | 200.0 |
2023-01-05 | 3.0 | 30.0 |
2023-01-06 | NaN | NaN |
2023-01-07 | 4.0 | 40.0 |
2023-01-08 | 40.0 | 400.0 |
2023-01-09 | 5.0 | 50.0 |
2023-01-10 | 50.0 | 500.0 |
Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)
A | |
---|---|
2023-01-01 | 1.0 |
2023-01-02 | 10.0 |
2023-01-03 | NaN |
2023-01-04 | 20.0 |
2023-01-05 | 3.0 |
2023-01-06 | NaN |
2023-01-07 | 4.0 |
2023-01-08 | 40.0 |
2023-01-09 | 5.0 |
2023-01-10 | 50.0 |
Spliced Series with Renamed Column:
2023-01-01 1.0
2023-01-02 2.0
2023-01-03 NaN
2023-01-04 4.0
2023-01-05 NaN
2023-01-06 NaN
2023-01-07 7.0
2023-01-08 NaN
2023-01-09 NaN
2023-01-10 NaN
2023-01-11 3.0
2023-01-12 4.0
Freq: D, Name: Renamed_A, dtype: float64
Summary¶
Use ``ts_merge`` when you want to blend time series together, filling missing values in order of priority.
Use ``ts_splice`` when you want to keep each time series separate and transition from one to another based on time.
The ``names`` argument allows you to rename output columns or select specific columns when merging/splicing DataFrames.
This notebook provides a clear comparison to help you decide which method best suits your use case.