Merging and Splicing Time Series

This tutorial demonstrates the usage and difference between ts_merge and ts_splice, two methods for folding together time series into a combined data structure.

  • ``ts_merge`` blends multiple time series together based on priority, filling missing values. It potentiallyu uses all the input series at all timestamps.

  • ``ts_splice`` stitches together time series in sequential time blocks without mixing values.

We will describe the effect on regularly sampled series (which have the freq attribute) and on irregular. We will also explore the ``names`` argument, which controls how columns are selected or renamed in the merging/splicing process. There is a file-level command line tools for this as well in the dms_datastore package.

Prioritized filling on regular series

Let’s begin by showing how ts_merge and ts_splice fold together two regular series but gappy series on a prioritized basis.

Here are the sample series:

[1]:
# ========================================
# 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)
# ========================================
idx1 = pd.date_range("2023-01-01", periods=10, freq="1D")
idx2 = pd.date_range("2023-01-01", periods=12, freq="1D")
idx3 = pd.date_range("2022-12-31", periods=14, freq="1D")

series1 = pd.Series([1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10], index=idx1, name="A")
series2 = pd.Series([np.nan, 2, np.nan, 4, np.nan, np.nan, 7, np.nan, np.nan, np.nan,3.,4.], index=idx2, name="A")
series3 = pd.Series([1000.,1001., 1002., np.nan, 1004., np.nan, np.nan, 1007., np.nan, np.nan, np.nan,1005.,1006.,1007.], index=idx3, name="A")

print("Series 1 (Primary):")
display(series1)

print("\nSeries 2 (Secondary - Fills Gaps):")
display(series2)

print("\nSeries 3 (Tertiary - Fills Gaps):")
display(series3)




---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 4
      1 # ========================================
      2 # 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)
      3 # ========================================
----> 4 idx1 = pd.date_range("2023-01-01", periods=10, freq="1D")
      5 idx2 = pd.date_range("2023-01-01", periods=12, freq="1D")
      6 idx3 = pd.date_range("2022-12-31", periods=14, freq="1D")

NameError: name 'pd' is not defined

And here is what it looks like spliced instead of merged.

[ ]:
# ========================================
# 2️⃣ Using `ts_merge()` with Prioritization
# ========================================
merged_series = ts_merge((series1, series2, series3))
print("\nMerged Series with Prioritization:")
display(merged_series)

Merged Series with Prioritization:
2022-12-31    1000.0
2023-01-01       1.0
2023-01-02       2.0
2023-01-03       3.0
2023-01-04       4.0
2023-01-05       5.0
2023-01-06       6.0
2023-01-07       7.0
2023-01-08       8.0
2023-01-09       9.0
2023-01-10      10.0
2023-01-11       3.0
2023-01-12       4.0
2023-01-13    1007.0
Freq: D, Name: A, dtype: float64

Splicing

Splicing marches through the prioritized list of input time series and exclusively uses values for the higher priority series one during the entire span of that series.

[ ]:
spliced_series = ts_splice((series1, series2, series3))
print("\nSpliced Series with Prioritization and default `prefer last`:")
display(spliced_series)
spliced_first = ts_splice((series1, series2, series3),transition="prefer_first")
print("\nSpliced Series with Prioritization, Prefer first:")
display(spliced_first)

Spliced Series with Prioritization:
2022-12-31    1000.0
2023-01-01    1001.0
2023-01-02    1002.0
2023-01-03       NaN
2023-01-04    1004.0
2023-01-05       NaN
2023-01-06       NaN
2023-01-07    1007.0
2023-01-08       NaN
2023-01-09       NaN
2023-01-10       NaN
2023-01-11    1005.0
2023-01-12    1006.0
2023-01-13    1007.0
Freq: D, Name: A, dtype: float64

Spliced Series with Prioritization, Prefer first:
2023-01-01       1.0
2023-01-02       NaN
2023-01-03       3.0
2023-01-04       NaN
2023-01-05       5.0
2023-01-06       6.0
2023-01-07       NaN
2023-01-08       8.0
2023-01-09       9.0
2023-01-10      10.0
2023-01-11       3.0
2023-01-12       4.0
2023-01-13    1007.0
Freq: D, Name: A, dtype: float64

Irregular series

Now we will look at some irregular series and see the difference in output from ts_merge (which shuffles) and ts_splice (which exclusively uses values from one series at a time based on the span of the series and its priority)

[ ]:

# ======================================== # 3️⃣ Creating Irregular Time Series (No Freq Attribute) # ======================================== idx_irreg1 = pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-07", "2023-01-10"]) idx_irreg2 = pd.to_datetime(["2023-01-02", "2023-01-04", "2023-01-08", "2023-01-11"]) series_irreg1 = pd.Series([1, np.nan, 3, 4], index=idx_irreg1, name="A") series_irreg2 = pd.Series([10, 20, np.nan, 40], index=idx_irreg2, name="A") print("\nIrregular Series 1:") display(series_irreg1) print("\nIrregular Series 2:") display(series_irreg2) # ======================================== # 4️⃣ Using `ts_merge()` with Irregular Time Series # ======================================== merged_irregular = ts_merge((series_irreg1, series_irreg2)) print("\nMerged Irregular Series (May Shuffle Timestamps):") display(merged_irregular) # ======================================== # 5️⃣ Using `ts_splice()` with Irregular Time Series # ======================================== spliced_irregular = ts_splice((series_irreg1, series_irreg2), transition="prefer_last") print("\nSpliced Irregular Series (prefer_last):") display(spliced_irregular)

Irregular Series 1:
2023-01-01    1.0
2023-01-03    NaN
2023-01-07    3.0
2023-01-10    4.0
Name: A, dtype: float64

Irregular Series 2:
2023-01-02    10.0
2023-01-04    20.0
2023-01-08     NaN
2023-01-11    40.0
Name: A, dtype: float64

Merged Irregular Series (May Shuffle Timestamps):
2023-01-01     1.0
2023-01-02    10.0
2023-01-03     NaN
2023-01-04    20.0
2023-01-07     3.0
2023-01-08     NaN
2023-01-10     4.0
2023-01-11    40.0
Name: A, dtype: float64

Spliced Irregular Series (prefer_last):
2023-01-01     1.0
2023-01-02    10.0
2023-01-04    20.0
2023-01-08     NaN
2023-01-11    40.0
Name: A, dtype: float64

Names argument

Finally let’s look at some more intricate examples with mixed series and dataframes with differing numbers of columns and see how names can be used to make selections or unify poorly coordinated labels. Here are the series:

[ ]:
import pandas as pd
import numpy as np
from vtools import ts_merge, ts_splice  # Assuming these functions are in merge.py

# Create irregular time series
idx1 = pd.date_range("2023-01-01", periods=5, freq="2D")
idx2 = pd.date_range("2023-01-02", periods=5, freq="2D")

series1 = pd.Series([1, np.nan, 3, 4, 5], index=idx1, name="A")
series2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name="B")

df1 = pd.DataFrame({"A": [1, np.nan, 3, 4, 5], "B": [10, 20, 30, 40, 50]}, index=idx1)
df2 = pd.DataFrame({"A": [10, 20, np.nan, 40, 50], "B": [100, 200, np.nan, 400, 500]}, index=idx2)
df3 = pd.DataFrame({"A": [310, 320, np.nan, 340, np.nan],
                    "B": [100, 200, np.nan, 400, 500],
                    "C": [3100, 3200, np.nan, 3400, 3500]
                    }, index=idx2)

# Display Data
print("Series 1:")
display(series1)

print("Series 2:")
display(series2)

print("DataFrame 1:")
display(df1)

print("DataFrame 2:")
display(df2)

print("DataFrame 3:")
display(df3)

Series 1:
2023-01-01    1.0
2023-01-03    NaN
2023-01-05    3.0
2023-01-07    4.0
2023-01-09    5.0
Freq: 2D, Name: A, dtype: float64
Series 2:
2023-01-02    10.0
2023-01-04    20.0
2023-01-06    30.0
2023-01-08     NaN
2023-01-10    50.0
Freq: 2D, Name: B, dtype: float64
DataFrame 1:
A B
2023-01-01 1.0 10
2023-01-03 NaN 20
2023-01-05 3.0 30
2023-01-07 4.0 40
2023-01-09 5.0 50
DataFrame 2:
A B
2023-01-02 10.0 100.0
2023-01-04 20.0 200.0
2023-01-06 NaN NaN
2023-01-08 40.0 400.0
2023-01-10 50.0 500.0
DataFrame 3:
A B C
2023-01-02 310.0 100.0 3100.0
2023-01-04 320.0 200.0 3200.0
2023-01-06 NaN NaN NaN
2023-01-08 340.0 400.0 3400.0
2023-01-10 NaN 500.0 3500.0

Here are some example usage:

[ ]:
# Example: Using `names` to rename output columns

# Merging without a rename
merged_series_named = ts_merge((series1, series2))
print("Merged Series not renamed:")
display(merged_series_named)

# Rename a single column
merged_series_named = ts_merge((series1, series2), names="Renamed_A")
print("Merged Series renamed:")
display(merged_series_named)

# Select specific columns in DataFrame
try:
    merged_df_named = ts_merge((df1, df2, df3), names=None)
except:
    print("Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match")
#display(merged_df_named)

# Select specific columns in DataFrame
merged_df_named = ts_merge((df1, df2), names=None)
print("Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:")
display(merged_df_named)


# Select specific columns in DataFrame
merged_df_named = ts_merge((df1, df2, df3), names=["A"])
print("Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)")
display(merged_df_named)


# Rename column in splicing
spliced_series_named = ts_splice((series1, series2), names="Renamed_A", transition="prefer_last")
print("Spliced Series with Renamed Column:")
display(spliced_series_named)

Merged Series not renamed:
2023-01-01     1.0
2023-01-02     2.0
2023-01-03     3.0
2023-01-04     4.0
2023-01-05     5.0
2023-01-06     6.0
2023-01-07     7.0
2023-01-08     8.0
2023-01-09     9.0
2023-01-10    10.0
2023-01-11     3.0
2023-01-12     4.0
Freq: D, Name: A, dtype: float64
Merged Series renamed:
2023-01-01     1.0
2023-01-02     2.0
2023-01-03     3.0
2023-01-04     4.0
2023-01-05     5.0
2023-01-06     6.0
2023-01-07     7.0
2023-01-08     8.0
2023-01-09     9.0
2023-01-10    10.0
2023-01-11     3.0
2023-01-12     4.0
Freq: D, Name: Renamed_A, dtype: float64
Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match
Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:
A B
2023-01-01 1.0 10.0
2023-01-02 10.0 100.0
2023-01-03 NaN 20.0
2023-01-04 20.0 200.0
2023-01-05 3.0 30.0
2023-01-06 NaN NaN
2023-01-07 4.0 40.0
2023-01-08 40.0 400.0
2023-01-09 5.0 50.0
2023-01-10 50.0 500.0
Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)
A
2023-01-01 1.0
2023-01-02 10.0
2023-01-03 NaN
2023-01-04 20.0
2023-01-05 3.0
2023-01-06 NaN
2023-01-07 4.0
2023-01-08 40.0
2023-01-09 5.0
2023-01-10 50.0
Spliced Series with Renamed Column:
2023-01-01    1.0
2023-01-02    2.0
2023-01-03    NaN
2023-01-04    4.0
2023-01-05    NaN
2023-01-06    NaN
2023-01-07    7.0
2023-01-08    NaN
2023-01-09    NaN
2023-01-10    NaN
2023-01-11    3.0
2023-01-12    4.0
Freq: D, Name: Renamed_A, dtype: float64

Summary

  • Use ``ts_merge`` when you want to blend time series together, filling missing values in order of priority.

  • Use ``ts_splice`` when you want to keep each time series separate and transition from one to another based on time.

  • The ``names`` argument allows you to rename output columns or select specific columns when merging/splicing DataFrames.

This notebook provides a clear comparison to help you decide which method best suits your use case.