{ "cells": [ { "cell_type": "markdown", "id": "436d5ccf", "metadata": {}, "source": [ "# Merging, Splicing and Blending Time Series\n", "This tutorial demonstrates the use of `ts_merge` and `ts_splice`, two methods for folding time series together into a combined data structure, and the difference between them.\n", "\n", "- **`ts_merge`** blends multiple time series together based on priority, optionally filling missing values in higher-priority series with entries from lower-priority series. It potentially uses all the input series at all timestamps. See the [`strict_priority`](#ts_merge-strict-priority-option) option below for advanced control over NaN-filling between priorities.\n", "- **`ts_splice`** stitches together time series in sequential time **blocks** without mixing values.\n", "\n", "We will describe the effect on regularly sampled series (which have the `freq` attribute) and on irregular series. We will also explore the **`names`** argument, which controls how columns are selected or renamed in the merging/splicing process. 
There is also a file-level command-line tool for this in the `dms_datastore` package.\n", "\n", "## Prioritized filling on regular series\n", "Let's begin by showing how `ts_merge` and `ts_splice` fold together regular but gappy \n", "series on a prioritized basis.\n", "\n", "Here are the sample series:" ] }, { "cell_type": "code", "execution_count": 23, "id": "e52fb077", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Series 1 (Primary):\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 NaN\n", "2023-01-03 3.0\n", "2023-01-04 NaN\n", "2023-01-05 5.0\n", "2023-01-06 6.0\n", "2023-01-07 NaN\n", "2023-01-08 8.0\n", "2023-01-09 9.0\n", "2023-01-10 10.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Series 2 (Secondary - Fills Gaps):\n" ] }, { "data": { "text/plain": [ "2023-01-01 NaN\n", "2023-01-02 2.0\n", "2023-01-03 NaN\n", "2023-01-04 4.0\n", "2023-01-05 NaN\n", "2023-01-06 NaN\n", "2023-01-07 7.0\n", "2023-01-08 NaN\n", "2023-01-09 NaN\n", "2023-01-10 NaN\n", "2023-01-11 3.0\n", "2023-01-12 4.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Series 3 (Tertiary - Fills Gaps):\n" ] }, { "data": { "text/plain": [ "2022-12-31 1000.0\n", "2023-01-01 1001.0\n", "2023-01-02 1002.0\n", "2023-01-03 NaN\n", "2023-01-04 1004.0\n", "2023-01-05 NaN\n", "2023-01-06 NaN\n", "2023-01-07 1007.0\n", "2023-01-08 NaN\n", "2023-01-09 NaN\n", "2023-01-10 NaN\n", "2023-01-11 1005.0\n", "2023-01-12 1006.0\n", "2023-01-13 1007.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from vtools import ts_merge, ts_splice\n", "# ========================================\n", "# Creating Regular Time Series (1D 
Frequency with Missing Data)\n", "# ========================================\n", "idx1 = pd.date_range(\"2023-01-01\", periods=10, freq=\"1D\")\n", "idx2 = pd.date_range(\"2023-01-01\", periods=12, freq=\"1D\")\n", "idx3 = pd.date_range(\"2022-12-31\", periods=14, freq=\"1D\")\n", "\n", "series1 = pd.Series([1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10], index=idx1, name=\"A\")\n", "series2 = pd.Series([np.nan, 2, np.nan, 4, np.nan, np.nan, 7, np.nan, np.nan, np.nan,3.,4.], index=idx2, name=\"A\")\n", "series3 = pd.Series([1000.,1001., 1002., np.nan, 1004., np.nan, np.nan, 1007., np.nan, np.nan, np.nan,1005.,1006.,1007.], index=idx3, name=\"A\")\n", "\n", "print(\"Series 1 (Primary):\")\n", "display(series1)\n", "\n", "print(\"\\nSeries 2 (Secondary - Fills Gaps):\")\n", "display(series2)\n", "\n", "print(\"\\nSeries 3 (Tertiary - Fills Gaps):\")\n", "display(series3)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "b2175099", "metadata": {}, "source": [ "Here is the result of merging the three series with `ts_merge`. Each timestamp takes its value from the highest-priority series that has data there." 
] }, { "cell_type": "code", "execution_count": 24, "id": "5dd08914", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Merged Series with Prioritization:\n" ] }, { "data": { "text/plain": [ "2022-12-31 1000.0\n", "2023-01-01 1.0\n", "2023-01-02 2.0\n", "2023-01-03 3.0\n", "2023-01-04 4.0\n", "2023-01-05 5.0\n", "2023-01-06 6.0\n", "2023-01-07 7.0\n", "2023-01-08 8.0\n", "2023-01-09 9.0\n", "2023-01-10 10.0\n", "2023-01-11 3.0\n", "2023-01-12 4.0\n", "2023-01-13 1007.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# ========================================\n", "# 2️⃣ Using `ts_merge()` with Prioritization\n", "# ========================================\n", "merged_series = ts_merge((series1, series2, series3))\n", "print(\"\\nMerged Series with Prioritization:\")\n", "display(merged_series)" ] }, { "cell_type": "markdown", "id": "fab7780c", "metadata": {}, "source": [ "## Splicing\n", "Splicing marches through the list of input time series and, over the entire span of the preferred series, uses values exclusively from that series, so gaps inside the span stay `NaN` instead of being filled from another series. Where spans overlap, the `transition` argument decides which series is preferred. In the examples below, `prefer_last` (the default) switches to the later-listed series as soon as its span begins, while `prefer_first` stays with the earlier-listed series until its span ends." ] }, { "cell_type": "code", "execution_count": 25, "id": "ae88f210", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Spliced Series with Prioritization and default `prefer last`:\n" ] }, { "data": { "text/plain": [ "2022-12-31 1000.0\n", "2023-01-01 1001.0\n", "2023-01-02 1002.0\n", "2023-01-03 NaN\n", "2023-01-04 1004.0\n", "2023-01-05 NaN\n", "2023-01-06 NaN\n", "2023-01-07 1007.0\n", "2023-01-08 NaN\n", "2023-01-09 NaN\n", "2023-01-10 NaN\n", "2023-01-11 1005.0\n", "2023-01-12 1006.0\n", "2023-01-13 1007.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Spliced Series with Prioritization, Prefer first:\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 NaN\n", "2023-01-03 3.0\n", "2023-01-04 NaN\n", "2023-01-05 5.0\n", "2023-01-06 6.0\n", "2023-01-07 NaN\n", "2023-01-08 8.0\n", "2023-01-09 9.0\n", "2023-01-10 10.0\n", "2023-01-11 3.0\n", "2023-01-12 4.0\n", "2023-01-13 1007.0\n", "Freq: D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "spliced_series = ts_splice((series1, series2, series3))\n", "print(\"\\nSpliced Series with Prioritization and default `prefer last`:\")\n", "display(spliced_series)\n", "spliced_first = ts_splice((series1, series2, series3),transition=\"prefer_first\")\n", "print(\"\\nSpliced Series with Prioritization, Prefer first:\")\n", "display(spliced_first)" ] }, { "cell_type": "markdown", "id": "4a2478d5", "metadata": {}, "source": [ "## Irregular series\n", "\n", "Now we will look at some irregular series and see the difference in output between `ts_merge` (which interleaves timestamps from all inputs) and `ts_splice` (which uses values exclusively from one series at a time, based on the span of each series and its priority)." ] }, { "cell_type": "code", "execution_count": 26, "id": "9a1d0dae", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "\n", "Irregular Series 1:\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-03 NaN\n", "2023-01-07 3.0\n", "2023-01-10 4.0\n", "Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Irregular Series 2:\n" ] }, { "data": { "text/plain": [ "2023-01-02 10.0\n", "2023-01-04 20.0\n", "2023-01-08 NaN\n", "2023-01-11 40.0\n", "Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Merged Irregular Series (May Shuffle Timestamps):\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 10.0\n", "2023-01-03 NaN\n", "2023-01-04 20.0\n", "2023-01-07 3.0\n", "2023-01-08 NaN\n", "2023-01-10 4.0\n", "2023-01-11 40.0\n", "Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Spliced Irregular Series (prefer_last):\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 10.0\n", "2023-01-04 20.0\n", "2023-01-08 NaN\n", "2023-01-11 40.0\n", "Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "# ========================================\n", "# 3️⃣ Creating Irregular Time Series (No Freq Attribute)\n", "# ========================================\n", "idx_irreg1 = pd.to_datetime([\"2023-01-01\", \"2023-01-03\", \"2023-01-07\", \"2023-01-10\"])\n", "idx_irreg2 = pd.to_datetime([\"2023-01-02\", \"2023-01-04\", \"2023-01-08\", \"2023-01-11\"])\n", "\n", "series_irreg1 = pd.Series([1, np.nan, 3, 4], index=idx_irreg1, name=\"A\")\n", "series_irreg2 = pd.Series([10, 20, np.nan, 40], index=idx_irreg2, name=\"A\")\n", "\n", "print(\"\\nIrregular Series 1:\")\n", "display(series_irreg1)\n", "\n", "print(\"\\nIrregular Series 2:\")\n", "display(series_irreg2)\n", "\n", "# ========================================\n", "# 4️⃣ Using 
`ts_merge()` with Irregular Time Series\n", "# ========================================\n", "merged_irregular = ts_merge((series_irreg1, series_irreg2))\n", "print(\"\\nMerged Irregular Series (May Shuffle Timestamps):\")\n", "display(merged_irregular)\n", "\n", "# ========================================\n", "# 5️⃣ Using `ts_splice()` with Irregular Time Series\n", "# ========================================\n", "spliced_irregular = ts_splice((series_irreg1, series_irreg2), transition=\"prefer_last\")\n", "print(\"\\nSpliced Irregular Series (prefer_last):\")\n", "display(spliced_irregular)" ] }, { "cell_type": "markdown", "id": "de365104", "metadata": {}, "source": [ "## `names` argument\n", "\n", "Finally, let's look at some more intricate examples that mix series and DataFrames with differing numbers of columns, and see how `names` can be used to select columns or unify poorly coordinated labels. Here are the inputs:" ] }, { "cell_type": "code", "execution_count": 27, "id": "35cfc422", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Series 1:\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-03 NaN\n", "2023-01-05 3.0\n", "2023-01-07 4.0\n", "2023-01-09 5.0\n", "Freq: 2D, Name: A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Series 2:\n" ] }, { "data": { "text/plain": [ "2023-01-02 10.0\n", "2023-01-04 20.0\n", "2023-01-06 30.0\n", "2023-01-08 NaN\n", "2023-01-10 50.0\n", "Freq: 2D, Name: B, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DataFrame 1:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>A</th><th>B</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-01</th><td>1.0</td><td>10</td></tr>\n",
"<tr><th>2023-01-03</th><td>NaN</td><td>20</td></tr>\n",
"<tr><th>2023-01-05</th><td>3.0</td><td>30</td></tr>\n",
"<tr><th>2023-01-07</th><td>4.0</td><td>40</td></tr>\n",
"<tr><th>2023-01-09</th><td>5.0</td><td>50</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " A B\n", "2023-01-01 1.0 10\n", "2023-01-03 NaN 20\n", "2023-01-05 3.0 30\n", "2023-01-07 4.0 40\n", "2023-01-09 5.0 50" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DataFrame 2:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>A</th><th>B</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-02</th><td>10.0</td><td>100.0</td></tr>\n",
"<tr><th>2023-01-04</th><td>20.0</td><td>200.0</td></tr>\n",
"<tr><th>2023-01-06</th><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2023-01-08</th><td>40.0</td><td>400.0</td></tr>\n",
"<tr><th>2023-01-10</th><td>50.0</td><td>500.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " A B\n", "2023-01-02 10.0 100.0\n", "2023-01-04 20.0 200.0\n", "2023-01-06 NaN NaN\n", "2023-01-08 40.0 400.0\n", "2023-01-10 50.0 500.0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DataFrame 3:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-02</th><td>310.0</td><td>100.0</td><td>3100.0</td></tr>\n",
"<tr><th>2023-01-04</th><td>320.0</td><td>200.0</td><td>3200.0</td></tr>\n",
"<tr><th>2023-01-06</th><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2023-01-08</th><td>340.0</td><td>400.0</td><td>3400.0</td></tr>\n",
"<tr><th>2023-01-10</th><td>NaN</td><td>500.0</td><td>3500.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " A B C\n", "2023-01-02 310.0 100.0 3100.0\n", "2023-01-04 320.0 200.0 3200.0\n", "2023-01-06 NaN NaN NaN\n", "2023-01-08 340.0 400.0 3400.0\n", "2023-01-10 NaN 500.0 3500.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from vtools import ts_merge, ts_splice\n", "\n", "# Create two regular indexes with a 2-day frequency, offset from each other by one day\n", "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"2D\")\n", "idx2 = pd.date_range(\"2023-01-02\", periods=5, freq=\"2D\")\n", "\n", "series1 = pd.Series([1, np.nan, 3, 4, 5], index=idx1, name=\"A\")\n", "series2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"B\")\n", "\n", "df1 = pd.DataFrame({\"A\": [1, np.nan, 3, 4, 5], \"B\": [10, 20, 30, 40, 50]}, index=idx1)\n", "df2 = pd.DataFrame({\"A\": [10, 20, np.nan, 40, 50], \"B\": [100, 200, np.nan, 400, 500]}, index=idx2)\n", "df3 = pd.DataFrame({\"A\": [310, 320, np.nan, 340, np.nan], \n", " \"B\": [100, 200, np.nan, 400, 500],\n", " \"C\": [3100, 3200, np.nan, 3400, 3500]\n", " }, index=idx2)\n", "\n", "# Display Data\n", "print(\"Series 1:\")\n", "display(series1)\n", "\n", "print(\"Series 2:\")\n", "display(series2)\n", "\n", "print(\"DataFrame 1:\")\n", "display(df1)\n", "\n", "print(\"DataFrame 2:\")\n", "display(df2)\n", "\n", "print(\"DataFrame 3:\")\n", "display(df3)\n" ] }, { "cell_type": "markdown", "id": "219a7aa6", "metadata": {}, "source": [ "Here are some usage examples:" ] }, { "cell_type": "code", "execution_count": 28, "id": "c6949e66", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original univariate series\n", "2023-01-01 1.0\n", "2023-01-03 NaN\n", "2023-01-05 3.0\n", "2023-01-07 4.0\n", "2023-01-09 5.0\n", "Freq: 2D, Name: A, dtype: float64\n", "2023-01-02 10.0\n", "2023-01-04 20.0\n", "2023-01-06 30.0\n", "2023-01-08 NaN\n", "2023-01-10 50.0\n", "Freq: 2D, Name: B, dtype: float64\n", "Merged univariate series 
renamed:\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 10.0\n", "2023-01-03 NaN\n", "2023-01-04 20.0\n", "2023-01-05 3.0\n", "2023-01-06 30.0\n", "2023-01-07 4.0\n", "2023-01-08 NaN\n", "2023-01-09 5.0\n", "2023-01-10 50.0\n", "Name: C, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match\n", "Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>A</th><th>B</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-01</th><td>1.0</td><td>10.0</td></tr>\n",
"<tr><th>2023-01-02</th><td>10.0</td><td>100.0</td></tr>\n",
"<tr><th>2023-01-03</th><td>NaN</td><td>20.0</td></tr>\n",
"<tr><th>2023-01-04</th><td>20.0</td><td>200.0</td></tr>\n",
"<tr><th>2023-01-05</th><td>3.0</td><td>30.0</td></tr>\n",
"<tr><th>2023-01-06</th><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2023-01-07</th><td>4.0</td><td>40.0</td></tr>\n",
"<tr><th>2023-01-08</th><td>40.0</td><td>400.0</td></tr>\n",
"<tr><th>2023-01-09</th><td>5.0</td><td>50.0</td></tr>\n",
"<tr><th>2023-01-10</th><td>50.0</td><td>500.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " A B\n", "2023-01-01 1.0 10.0\n", "2023-01-02 10.0 100.0\n", "2023-01-03 NaN 20.0\n", "2023-01-04 20.0 200.0\n", "2023-01-05 3.0 30.0\n", "2023-01-06 NaN NaN\n", "2023-01-07 4.0 40.0\n", "2023-01-08 40.0 400.0\n", "2023-01-09 5.0 50.0\n", "2023-01-10 50.0 500.0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>A</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-01</th><td>1.0</td></tr>\n",
"<tr><th>2023-01-02</th><td>10.0</td></tr>\n",
"<tr><th>2023-01-03</th><td>NaN</td></tr>\n",
"<tr><th>2023-01-04</th><td>20.0</td></tr>\n",
"<tr><th>2023-01-05</th><td>3.0</td></tr>\n",
"<tr><th>2023-01-06</th><td>NaN</td></tr>\n",
"<tr><th>2023-01-07</th><td>4.0</td></tr>\n",
"<tr><th>2023-01-08</th><td>40.0</td></tr>\n",
"<tr><th>2023-01-09</th><td>5.0</td></tr>\n",
"<tr><th>2023-01-10</th><td>50.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " A\n", "2023-01-01 1.0\n", "2023-01-02 10.0\n", "2023-01-03 NaN\n", "2023-01-04 20.0\n", "2023-01-05 3.0\n", "2023-01-06 NaN\n", "2023-01-07 4.0\n", "2023-01-08 40.0\n", "2023-01-09 5.0\n", "2023-01-10 50.0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Spliced Series with Renamed Column:\n" ] }, { "data": { "text/plain": [ "2023-01-01 1.0\n", "2023-01-02 10.0\n", "2023-01-04 20.0\n", "2023-01-06 30.0\n", "2023-01-08 NaN\n", "2023-01-10 50.0\n", "Name: Renamed_A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Example: Using `names` to rename output columns\n", "print(\"Original univariate series\")\n", "print(series1)\n", "print(series2)\n", "\n", "# Merge univariate series that have different names, using `names` to set the output name\n", "merged_series_named = ts_merge((series1, series2), names=[\"C\"])\n", "print(\"Merged univariate series renamed:\")\n", "display(merged_series_named)\n", "\n", "\n", "# With names=None, DataFrames whose columns do not match raise an error\n", "try:\n", "    merged_df_named = ts_merge((df1, df2, df3), names=None)\n", "except Exception:\n", "    print(\"Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match\")\n", "#display(merged_df_named)\n", "\n", "# With names=None, DataFrames with matching columns merge column by column\n", "merged_df_named = ts_merge((df1, df2), names=None)\n", "print(\"Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:\")\n", "display(merged_df_named)\n", "\n", "\n", "# Select column A, which all three DataFrames share\n", "merged_df_named = ts_merge((df1, df2, df3), names=[\"A\"])\n", "print(\"Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)\")\n", "display(merged_df_named)\n", "\n", "\n", "# Rename column in splicing\n", "spliced_series_named = ts_splice((series1, series2), names=\"Renamed_A\", transition=\"prefer_last\")\n", "print(\"Spliced Series with Renamed 
Column:\")\n", "display(spliced_series_named)\n" ] }, { "cell_type": "markdown", "id": "6baebda5", "metadata": {}, "source": [ "## Summary\n", "- **Use `ts_merge`** when you want to blend time series together, filling missing values in order of priority.\n", "- **Use `ts_splice`** when you want to keep each time series separate and transition from one to another based on time.\n", "- **The `names` argument** allows you to rename output columns or select specific columns when merging/splicing DataFrames.\n", "\n", "This notebook provides a clear comparison to help you decide which method best suits your use case.\n" ] }, { "cell_type": "markdown", "id": "d615df22", "metadata": {}, "source": [ "# `ts_merge`: strict priority option\n", "**New option**: `strict_priority` (default `False`) enforces that a higher‑priority series dominates between its `first_valid_index` and `last_valid_index`.\n", "\n", "**Semantics**\n", "- Per **column**, define the dominance window as `[first_valid_index, last_valid_index]`.\n", "- Within that window, lower‑priority series are **masked**, even if the higher‑priority value is `NaN`.\n", "- Outside those windows, merging is unchanged and lower priority may contribute.\n", "- With irregular inputs, timestamps that exist **only** in lower‑priority series **and** are fully masked inside a dominance window are dropped; timestamps from the top series' index are preserved even if all‑`NaN`.\n", "\n", "**`names` behavior** is unchanged.\n", "### Example 1 — Series with interior `NaN`\n", "\n", "```python\n", "import numpy as np, pandas as pd\n", "from vtools.functions.merge import ts_merge\n", "\n", "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n", "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n", "s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n", "s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n", "\n", "ts_merge((s1, s2)) # default\n", "ts_merge((s1, s2), 
strict_priority=True)\n", "```\n", "### Example 2 — Two columns, per‑column dominance\n", "\n", "```python\n", "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n", "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n", "df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n", "df1[\"B\"] = df1[\"A\"]\n", "df1.loc[idx1[2], \"B\"] = np.nan # interior NaN in high‑priority B\n", "df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n", "df2[\"B\"] = df2[\"A\"]\n", "\n", "ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]]\n", "```\n", "### Example 3 — Irregular inputs\n", "\n", "```python\n", "idx1 = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n", "idx2 = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n", "s1 = pd.Series([1.,2.,3.,4.], index=idx1, name=\"A\")\n", "s2 = pd.Series([10.,20.,30.,40.], index=idx2, name=\"A\")\n", "\n", "ts_merge((s1, s2), strict_priority=True)\n", "```\n" ] }, { "cell_type": "code", "execution_count": 29, "id": "d31654ba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Example 1 strict=False:\n", "2023-01-01 1.0\n", "2023-01-02 2.0\n", "2023-01-03 10.0\n", "2023-01-04 4.0\n", "2023-01-05 5.0\n", "2023-01-06 NaN\n", "2023-01-07 50.0\n", "Freq: D, Name: A, dtype: float64\n", "Example 1 strict=True:\n", "2023-01-01 1.0\n", "2023-01-02 2.0\n", "2023-01-03 NaN\n", "2023-01-04 4.0\n", "2023-01-05 5.0\n", "2023-01-06 NaN\n", "2023-01-07 50.0\n", "Freq: D, Name: A, dtype: float64\n", "\n", "Example 2 strict=True:\n", " A B\n", "2023-01-01 1.0 1.0\n", "2023-01-02 NaN NaN\n", "2023-01-03 3.0 NaN\n", "2023-01-04 4.0 4.0\n", "2023-01-05 5.0 5.0\n", "2023-01-06 40.0 40.0\n", "2023-01-07 50.0 50.0\n", "\n", "Example 3 strict=True:\n", "2023-01-01 1.0\n", "2023-01-03 2.0\n", "2023-01-07 3.0\n", "2023-01-10 4.0\n", "2023-01-11 40.0\n", "Name: A, dtype: float64\n" ] } ], "source": [ "import numpy as 
np, pandas as pd\n", "from vtools.functions.merge import ts_merge\n", "\n", "# Example 1\n", "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n", "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n", "s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n", "s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n", "print(\"Example 1 strict=False:\")\n", "print(ts_merge((s1, s2)))\n", "print(\"Example 1 strict=True:\")\n", "print(ts_merge((s1, s2), strict_priority=True))\n", "\n", "# Example 2\n", "df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n", "df1[\"B\"] = df1[\"A\"]; df1.loc[idx1[2], \"B\"] = np.nan\n", "df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n", "df2[\"B\"] = df2[\"A\"]\n", "print(\"\\nExample 2 strict=True:\")\n", "print(ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]])\n", "\n", "# Example 3\n", "idx1i = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n", "idx2i = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n", "s1i = pd.Series([1.,2.,3.,4.], index=idx1i, name=\"A\")\n", "s2i = pd.Series([10.,20.,30.,40.], index=idx2i, name=\"A\")\n", "print(\"\\nExample 3 strict=True:\")\n", "print(ts_merge((s1i, s2i), strict_priority=True))\n" ] }, { "cell_type": "markdown", "id": "77eb1ac4", "metadata": {}, "source": [ "## Blending near gaps: `ts_blend`\n", "\n", "The functions shown above (`ts_merge` and `ts_splice`) perform *hard* selections:\n", "\n", "- **`ts_merge`** picks the first non-NaN value in priority order at each timestamp.\n", "- **`ts_splice`** constructs a piecewise record by switching sources at explicit transition times.\n", "\n", "In some workflows, however, abrupt switches in the merged product create undesirable jumps.\n", "Often the *higher-priority* series is preferred, but it may contain gaps. 
In those regions it is\n", "useful to **fade in** the lower-priority series near the edges of gaps rather than switching\n", "immediately.\n", "\n", "`ts_blend` implements exactly that:\n", "\n", "- Takes a list of Series/DataFrames (higher priority first).\n", "- Aligns them onto a common union index.\n", "- Inside gaps of the high-priority series: **falls back** to lower-priority data (just like `ts_merge`).\n", "- On the *shoulders* of gaps: computes the **distance to the nearest gap** in the high-priority\n", " series and applies a linear taper.\n", "\n", "For a shoulder point at distance $d$ from the nearest NaN, with $0 < d < L$ for a user-specified blending\n", "radius $L$:\n", "\n", "$$\n", "\\tilde t = \\frac{L - d}{L}, \\qquad\n", "w_{\\mathrm{lo}} = 0.5 \\tilde t, \\qquad\n", "w_{\\mathrm{hi}} = 1 - w_{\\mathrm{lo}}.\n", "$$\n", "\n", "The output is $w_{\\mathrm{hi}}\\,x_{\\mathrm{hi}} + w_{\\mathrm{lo}}\\,x_{\\mathrm{lo}}$. For example, with $L = 3$ a point one sample from a gap has $\\tilde t = 2/3$, so $w_{\\mathrm{lo}} = 1/3$.\n", "\n", "Thus:\n", "\n", "- The lower-priority weight grows toward **50%** as a point approaches the gap edge ($d \\to 0$).\n", "- Points at a distance of `blend_length` or more use **100%** of the high-priority value.\n", "- Inside gaps, the lower-priority series is used exactly.\n", "- If the lower-priority series is also missing inside a gap, the output remains NaN; on a shoulder where it is missing, the high-priority value is used unchanged.\n", "\n", "`blend_length` can be:\n", "\n", "- an **integer** → interpreted as a *number of samples*, or\n", "- a **timedelta-like string** (e.g. `\"2h\"`, `\"1d\"`) → interpreted as a time window\n", " (requires a regular `DatetimeIndex` with `.freq` set).\n", "\n", "Setting `blend_length=None` makes `ts_blend` behave like a standard priority merge.\n" ] }, { "cell_type": "code", "execution_count": 30, "id": "d4eca9fc", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>hi</th><th>lo</th><th>blended</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-01-01</th><td>1.0</td><td>0.5</td><td>0.916667</td></tr>\n",
"<tr><th>2023-01-02</th><td>2.0</td><td>1.5</td><td>1.833333</td></tr>\n",
"<tr><th>2023-01-03</th><td>NaN</td><td>2.5</td><td>2.500000</td></tr>\n",
"<tr><th>2023-01-04</th><td>4.0</td><td>NaN</td><td>4.000000</td></tr>\n",
"<tr><th>2023-01-05</th><td>5.0</td><td>4.5</td><td>4.916667</td></tr>\n",
"<tr><th>2023-01-06</th><td>6.0</td><td>5.5</td><td>5.833333</td></tr>\n",
"<tr><th>2023-01-07</th><td>NaN</td><td>6.5</td><td>6.500000</td></tr>\n",
"<tr><th>2023-01-08</th><td>8.0</td><td>NaN</td><td>8.000000</td></tr>\n",
"<tr><th>2023-01-09</th><td>9.0</td><td>NaN</td><td>9.000000</td></tr>\n",
"<tr><th>2023-01-10</th><td>10.0</td><td>9.5</td><td>10.000000</td></tr>\n",
"<tr><th>2023-01-11</th><td>11.0</td><td>10.5</td><td>11.000000</td></tr>\n",
"<tr><th>2023-01-12</th><td>12.0</td><td>11.5</td><td>12.000000</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " hi lo blended\n", "2023-01-01 1.0 0.5 0.916667\n", "2023-01-02 2.0 1.5 1.833333\n", "2023-01-03 NaN 2.5 2.500000\n", "2023-01-04 4.0 NaN 4.000000\n", "2023-01-05 5.0 4.5 4.916667\n", "2023-01-06 6.0 5.5 5.833333\n", "2023-01-07 NaN 6.5 6.500000\n", "2023-01-08 8.0 NaN 8.000000\n", "2023-01-09 9.0 NaN 9.000000\n", "2023-01-10 10.0 9.5 10.000000\n", "2023-01-11 11.0 10.5 11.000000\n", "2023-01-12 12.0 11.5 12.000000" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "from vtools.functions.blend import ts_blend\n", "\n", "# Example 12-point daily series\n", "idx = pd.date_range(\"2023-01-01\", periods=12, freq=\"D\")\n", "\n", "# High-priority with two gaps\n", "hi = pd.Series(\n", " [1, 2, np.nan, 4, 5, 6, np.nan, 8, 9, 10, 11, 12],\n", " index=idx,\n", " name=\"hi\",\n", ")\n", "\n", "# Low-priority with different gaps\n", "lo = pd.Series(\n", " [0.5, 1.5, 2.5, np.nan, 4.5, 5.5, 6.5, np.nan, np.nan, 9.5, 10.5, 11.5],\n", " index=idx,\n", " name=\"lo\",\n", ")\n", "\n", "blend_length = 3 # 3-sample blending shoulder\n", "\n", "out = ts_blend((hi, lo), blend_length=blend_length,names=\"blended\")\n", "\n", "# Identify shoulder points where blending had an effect\n", "shoulder_mask = (~hi.isna()) & (~lo.isna()) & (out != hi)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 6))\n", "\n", "ax.plot(hi.index, hi.values, \"o-\", label=\"High priority (hi)\", linewidth=2)\n", "ax.plot(lo.index, lo.values, \"o--\", label=\"Low priority (lo)\", linewidth=2)\n", "ax.plot(out.index, out.values, \"o-\", label=\"Blended output\", linewidth=3)\n", "\n", "# Highlight blend-affected points\n", "ax.scatter(\n", " out.index[shoulder_mask],\n", " out.values[shoulder_mask],\n", " s=140,\n", " color=\"tab:red\",\n", " label=\"Blended shoulder region\",\n", " zorder=5,\n", ")\n", "\n", "ax.set_title(f\"Blending near gaps with ts_blend 
(blend_length={blend_length})\")\n", "ax.grid(alpha=0.3)\n", "ax.legend()\n", "fig.tight_layout()\n", "plt.show()\n", "\n", "out_df = pd.DataFrame({\"hi\": hi, \"lo\": lo, \"blended\": out})\n", "out_df\n" ] }, { "cell_type": "markdown", "id": "dafe9d6d", "metadata": {}, "source": [ "Or, using a time-based `blend_length`:" ] }, { "cell_type": "code", "execution_count": 31, "id": "cbf75106", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>hi_hourly</th><th>lo_hourly</th><th>blended</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2023-02-01 00:00:00</th><td>1.0</td><td>0.0</td><td>1.00</td></tr>\n",
"<tr><th>2023-02-01 01:00:00</th><td>2.0</td><td>1.0</td><td>1.75</td></tr>\n",
"<tr><th>2023-02-01 02:00:00</th><td>NaN</td><td>2.0</td><td>2.00</td></tr>\n",
"<tr><th>2023-02-01 03:00:00</th><td>4.0</td><td>3.0</td><td>3.75</td></tr>\n",
"<tr><th>2023-02-01 04:00:00</th><td>5.0</td><td>NaN</td><td>5.00</td></tr>\n",
"<tr><th>2023-02-01 05:00:00</th><td>6.0</td><td>5.0</td><td>5.75</td></tr>\n",
"<tr><th>2023-02-01 06:00:00</th><td>NaN</td><td>6.0</td><td>6.00</td></tr>\n",
"<tr><th>2023-02-01 07:00:00</th><td>8.0</td><td>7.0</td><td>7.75</td></tr>\n",
"<tr><th>2023-02-01 08:00:00</th><td>9.0</td><td>8.0</td><td>9.00</td></tr>\n",
"<tr><th>2023-02-01 09:00:00</th><td>10.0</td><td>NaN</td><td>10.00</td></tr>\n",
"<tr><th>2023-02-01 10:00:00</th><td>11.0</td><td>10.0</td><td>11.00</td></tr>\n",
"<tr><th>2023-02-01 11:00:00</th><td>12.0</td><td>11.0</td><td>12.00</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " hi_hourly lo_hourly blended\n", "2023-02-01 00:00:00 1.0 0.0 1.00\n", "2023-02-01 01:00:00 2.0 1.0 1.75\n", "2023-02-01 02:00:00 NaN 2.0 2.00\n", "2023-02-01 03:00:00 4.0 3.0 3.75\n", "2023-02-01 04:00:00 5.0 NaN 5.00\n", "2023-02-01 05:00:00 6.0 5.0 5.75\n", "2023-02-01 06:00:00 NaN 6.0 6.00\n", "2023-02-01 07:00:00 8.0 7.0 7.75\n", "2023-02-01 08:00:00 9.0 8.0 9.00\n", "2023-02-01 09:00:00 10.0 NaN 10.00\n", "2023-02-01 10:00:00 11.0 10.0 11.00\n", "2023-02-01 11:00:00 12.0 11.0 12.00" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Time-based example with hourly data\n", "\n", "idx_h = pd.date_range(\"2023-02-01\", periods=12, freq=\"h\")\n", "\n", "hi_h = pd.Series(\n", " [1, 2, np.nan, 4, 5, 6, np.nan, 8, 9, 10, 11, 12],\n", " index=idx_h,\n", " name=\"hi_hourly\",\n", ")\n", "lo_h = pd.Series(\n", " [0, 1, 2, 3, np.nan, 5, 6, 7, 8, np.nan, 10, 11],\n", " index=idx_h,\n", " name=\"lo_hourly\",\n", ")\n", "\n", "out_h = ts_blend((hi_h, lo_h), blend_length=\"2h\",names=\"blended\")\n", "\n", "shoulder_mask_h = (~hi_h.isna()) & (~lo_h.isna()) & (out_h != hi_h)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 5))\n", "ax.plot(hi_h.index, hi_h.values, \"o-\", label=\"High priority (hi_hourly)\", linewidth=2)\n", "ax.plot(lo_h.index, lo_h.values, \"o--\", label=\"Low priority (lo_hourly)\", linewidth=2)\n", "ax.plot(out_h.index, out_h.values, \"o-\", label=\"Blended (2H window)\", linewidth=3)\n", "\n", "ax.scatter(\n", " out_h.index[shoulder_mask_h],\n", " out_h.values[shoulder_mask_h],\n", " s=140,\n", " color=\"tab:red\",\n", " zorder=5,\n", " label=\"Blended shoulder region\",\n", ")\n", "\n", "ax.set_title(\"Hourly ts_blend with time-based blend_length='2H'\")\n", "ax.grid(alpha=0.3)\n", "ax.legend()\n", "fig.tight_layout()\n", "plt.show()\n", "\n", "pd.DataFrame({\"hi_hourly\": hi_h, \"lo_hourly\": lo_h, \"blended\": out_h})\n" ] } ], "metadata": { "kernelspec": { "display_name": "schism", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" } }, "nbformat": 4, "nbformat_minor": 5 }