{
"cells": [
{
"cell_type": "markdown",
"id": "436d5ccf",
"metadata": {},
"source": [
"# Merging and Splicing Time Series\n",
"This tutorial demonstrates the usage and difference between `ts_merge` and `ts_splice`, two methods for folding together time series into a combined data structure.\n",
"\n",
"- **`ts_merge`** blends multiple time series together based on priority, optionally filling missing values in higher priority series with entries from lower priority. It potentially uses all the input series at all timestamps. See the [`strict_priority`](#ts_merge-strict-priority-option) option below for advanced control over nan-filling between priorities.\n",
"- **`ts_splice`** stitches together time series in sequential time **blocks** without mixing values.\n",
"\n",
"We will describe the effect on regularly sampled series (which have the `freq` attribute) and on irregular. We will also explore the **`names`** argument, which controls how columns are selected or renamed in the merging/splicing process. There is a file-level command line tools for this as well in the `dms_datastore` package.\n",
"\n",
"## Prioritized filling on regular series\n",
"Let's begin by showing how `ts_merge` and `ts_splice` fold together two regular series but gappy \n",
"series on a prioritized basis.\n",
"\n",
"Here are the sample series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e52fb077",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'pd' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[1], line 4\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m# ========================================\u001b[39;00m\n\u001b[0;32m 2\u001b[0m \u001b[38;5;66;03m# 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)\u001b[39;00m\n\u001b[0;32m 3\u001b[0m \u001b[38;5;66;03m# ========================================\u001b[39;00m\n\u001b[1;32m----> 4\u001b[0m idx1 \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241m.\u001b[39mdate_range(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m2023-01-01\u001b[39m\u001b[38;5;124m\"\u001b[39m, periods\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m, freq\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m1D\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 5\u001b[0m idx2 \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mdate_range(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m2023-01-01\u001b[39m\u001b[38;5;124m\"\u001b[39m, periods\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m12\u001b[39m, freq\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m1D\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 6\u001b[0m idx3 \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mdate_range(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m2022-12-31\u001b[39m\u001b[38;5;124m\"\u001b[39m, periods\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m14\u001b[39m, freq\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m1D\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[1;31mNameError\u001b[0m: name 'pd' is not defined"
]
}
],
"source": [
"# ========================================\n",
"# 1️⃣ Creating Regular Time Series (1D Frequency with Missing Data)\n",
"# ========================================\n",
"idx1 = pd.date_range(\"2023-01-01\", periods=10, freq=\"1D\")\n",
"idx2 = pd.date_range(\"2023-01-01\", periods=12, freq=\"1D\")\n",
"idx3 = pd.date_range(\"2022-12-31\", periods=14, freq=\"1D\")\n",
"\n",
"series1 = pd.Series([1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10], index=idx1, name=\"A\")\n",
"series2 = pd.Series([np.nan, 2, np.nan, 4, np.nan, np.nan, 7, np.nan, np.nan, np.nan,3.,4.], index=idx2, name=\"A\")\n",
"series3 = pd.Series([1000.,1001., 1002., np.nan, 1004., np.nan, np.nan, 1007., np.nan, np.nan, np.nan,1005.,1006.,1007.], index=idx3, name=\"A\")\n",
"\n",
"print(\"Series 1 (Primary):\")\n",
"display(series1)\n",
"\n",
"print(\"\\nSeries 2 (Secondary - Fills Gaps):\")\n",
"display(series2)\n",
"\n",
"print(\"\\nSeries 3 (Tertiary - Fills Gaps):\")\n",
"display(series3)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "b2175099",
"metadata": {},
"source": [
"And here is what it looks like spliced instead of merged."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dd08914",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Merged Series with Prioritization:\n"
]
},
{
"data": {
"text/plain": [
"2022-12-31 1000.0\n",
"2023-01-01 1.0\n",
"2023-01-02 2.0\n",
"2023-01-03 3.0\n",
"2023-01-04 4.0\n",
"2023-01-05 5.0\n",
"2023-01-06 6.0\n",
"2023-01-07 7.0\n",
"2023-01-08 8.0\n",
"2023-01-09 9.0\n",
"2023-01-10 10.0\n",
"2023-01-11 3.0\n",
"2023-01-12 4.0\n",
"2023-01-13 1007.0\n",
"Freq: D, Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ========================================\n",
"# 2️⃣ Using `ts_merge()` with Prioritization\n",
"# ========================================\n",
"merged_series = ts_merge((series1, series2, series3))\n",
"print(\"\\nMerged Series with Prioritization:\")\n",
"display(merged_series)"
]
},
{
"cell_type": "markdown",
"id": "fab7780c",
"metadata": {},
"source": [
"## Splicing\n",
"Splicing marches through the prioritized list of input time series and exclusively uses values for the higher priority series one during the entire span of that series. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae88f210",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Spliced Series with Prioritization:\n"
]
},
{
"data": {
"text/plain": [
"2022-12-31 1000.0\n",
"2023-01-01 1001.0\n",
"2023-01-02 1002.0\n",
"2023-01-03 NaN\n",
"2023-01-04 1004.0\n",
"2023-01-05 NaN\n",
"2023-01-06 NaN\n",
"2023-01-07 1007.0\n",
"2023-01-08 NaN\n",
"2023-01-09 NaN\n",
"2023-01-10 NaN\n",
"2023-01-11 1005.0\n",
"2023-01-12 1006.0\n",
"2023-01-13 1007.0\n",
"Freq: D, Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Spliced Series with Prioritization, Prefer first:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 NaN\n",
"2023-01-03 3.0\n",
"2023-01-04 NaN\n",
"2023-01-05 5.0\n",
"2023-01-06 6.0\n",
"2023-01-07 NaN\n",
"2023-01-08 8.0\n",
"2023-01-09 9.0\n",
"2023-01-10 10.0\n",
"2023-01-11 3.0\n",
"2023-01-12 4.0\n",
"2023-01-13 1007.0\n",
"Freq: D, Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"spliced_series = ts_splice((series1, series2, series3))\n",
"print(\"\\nSpliced Series with Prioritization and default `prefer last`:\")\n",
"display(spliced_series)\n",
"spliced_first = ts_splice((series1, series2, series3),transition=\"prefer_first\")\n",
"print(\"\\nSpliced Series with Prioritization, Prefer first:\")\n",
"display(spliced_first)"
]
},
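{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `transition` settings used above can be pictured with a small pandas-only sketch. This is not the `vtools` implementation: the helper `splice_two_sketch` is hypothetical, handles only two series, and assumes the breakpoint sits at the first timestamp of the later series (`prefer_last`) or the last timestamp of the earlier one (`prefer_first`).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def splice_two_sketch(first, second, transition=\"prefer_last\"):\n",
"    # Illustrative only: splice two series at a single breakpoint\n",
"    if transition == \"prefer_last\":\n",
"        # the later series takes over from its first timestamp\n",
"        cut = second.index[0]\n",
"        return pd.concat([first[first.index < cut], second])\n",
"    # \"prefer_first\": keep the earlier series through its last timestamp\n",
"    cut = first.index[-1]\n",
"    return pd.concat([first, second[second.index > cut]])\n",
"\n",
"a = pd.Series([1.0, np.nan, 3.0], index=pd.date_range(\"2023-01-01\", periods=3, freq=\"D\"))\n",
"b = pd.Series([20.0, 30.0, 40.0], index=pd.date_range(\"2023-01-02\", periods=3, freq=\"D\"))\n",
"\n",
"print(splice_two_sketch(a, b, \"prefer_last\").tolist())   # [1.0, 20.0, 30.0, 40.0]\n",
"print(splice_two_sketch(a, b, \"prefer_first\").tolist())  # [1.0, nan, 3.0, 40.0]\n",
"```\n",
"\n",
"In both cases the gap in `a` is never filled from `b`; only the location of the handover changes."
]
},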
{
"cell_type": "markdown",
"id": "4a2478d5",
"metadata": {},
"source": [
"## Irregular series\n",
"\n",
"Now we will look at some irregular series and see the difference in output from ts_merge (which shuffles) and ts_splice (which exclusively uses values from one series at a time based on the span of the series and its priority)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a1d0dae",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Irregular Series 1:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-03 NaN\n",
"2023-01-07 3.0\n",
"2023-01-10 4.0\n",
"Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Irregular Series 2:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-02 10.0\n",
"2023-01-04 20.0\n",
"2023-01-08 NaN\n",
"2023-01-11 40.0\n",
"Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Merged Irregular Series (May Shuffle Timestamps):\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 10.0\n",
"2023-01-03 NaN\n",
"2023-01-04 20.0\n",
"2023-01-07 3.0\n",
"2023-01-08 NaN\n",
"2023-01-10 4.0\n",
"2023-01-11 40.0\n",
"Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Spliced Irregular Series (prefer_last):\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 10.0\n",
"2023-01-04 20.0\n",
"2023-01-08 NaN\n",
"2023-01-11 40.0\n",
"Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"# ========================================\n",
"# 3️⃣ Creating Irregular Time Series (No Freq Attribute)\n",
"# ========================================\n",
"idx_irreg1 = pd.to_datetime([\"2023-01-01\", \"2023-01-03\", \"2023-01-07\", \"2023-01-10\"])\n",
"idx_irreg2 = pd.to_datetime([\"2023-01-02\", \"2023-01-04\", \"2023-01-08\", \"2023-01-11\"])\n",
"\n",
"series_irreg1 = pd.Series([1, np.nan, 3, 4], index=idx_irreg1, name=\"A\")\n",
"series_irreg2 = pd.Series([10, 20, np.nan, 40], index=idx_irreg2, name=\"A\")\n",
"\n",
"print(\"\\nIrregular Series 1:\")\n",
"display(series_irreg1)\n",
"\n",
"print(\"\\nIrregular Series 2:\")\n",
"display(series_irreg2)\n",
"\n",
"# ========================================\n",
"# 4️⃣ Using `ts_merge()` with Irregular Time Series\n",
"# ========================================\n",
"merged_irregular = ts_merge((series_irreg1, series_irreg2))\n",
"print(\"\\nMerged Irregular Series (May Shuffle Timestamps):\")\n",
"display(merged_irregular)\n",
"\n",
"# ========================================\n",
"# 5️⃣ Using `ts_splice()` with Irregular Time Series\n",
"# ========================================\n",
"spliced_irregular = ts_splice((series_irreg1, series_irreg2), transition=\"prefer_last\")\n",
"print(\"\\nSpliced Irregular Series (prefer_last):\")\n",
"display(spliced_irregular)"
]
},
{
"cell_type": "markdown",
"id": "de365104",
"metadata": {},
"source": [
"## `Names` argument\n",
"\n",
"Finally let's look at some more intricate examples with mixed series and dataframes with differing numbers of columns and see how `names` can be used to make selections or unify poorly coordinated labels. Here are the series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35cfc422",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series 1:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-03 NaN\n",
"2023-01-05 3.0\n",
"2023-01-07 4.0\n",
"2023-01-09 5.0\n",
"Freq: 2D, Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series 2:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-02 10.0\n",
"2023-01-04 20.0\n",
"2023-01-06 30.0\n",
"2023-01-08 NaN\n",
"2023-01-10 50.0\n",
"Freq: 2D, Name: B, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataFrame 1:\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 2023-01-01 | \n",
" 1.0 | \n",
" 10 | \n",
"
\n",
" \n",
" 2023-01-03 | \n",
" NaN | \n",
" 20 | \n",
"
\n",
" \n",
" 2023-01-05 | \n",
" 3.0 | \n",
" 30 | \n",
"
\n",
" \n",
" 2023-01-07 | \n",
" 4.0 | \n",
" 40 | \n",
"
\n",
" \n",
" 2023-01-09 | \n",
" 5.0 | \n",
" 50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2023-01-01 1.0 10\n",
"2023-01-03 NaN 20\n",
"2023-01-05 3.0 30\n",
"2023-01-07 4.0 40\n",
"2023-01-09 5.0 50"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataFrame 2:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 2023-01-02 | \n",
" 10.0 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 2023-01-04 | \n",
" 20.0 | \n",
" 200.0 | \n",
"
\n",
" \n",
" 2023-01-06 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2023-01-08 | \n",
" 40.0 | \n",
" 400.0 | \n",
"
\n",
" \n",
" 2023-01-10 | \n",
" 50.0 | \n",
" 500.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2023-01-02 10.0 100.0\n",
"2023-01-04 20.0 200.0\n",
"2023-01-06 NaN NaN\n",
"2023-01-08 40.0 400.0\n",
"2023-01-10 50.0 500.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataFrame 3:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 2023-01-02 | \n",
" 310.0 | \n",
" 100.0 | \n",
" 3100.0 | \n",
"
\n",
" \n",
" 2023-01-04 | \n",
" 320.0 | \n",
" 200.0 | \n",
" 3200.0 | \n",
"
\n",
" \n",
" 2023-01-06 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2023-01-08 | \n",
" 340.0 | \n",
" 400.0 | \n",
" 3400.0 | \n",
"
\n",
" \n",
" 2023-01-10 | \n",
" NaN | \n",
" 500.0 | \n",
" 3500.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"2023-01-02 310.0 100.0 3100.0\n",
"2023-01-04 320.0 200.0 3200.0\n",
"2023-01-06 NaN NaN NaN\n",
"2023-01-08 340.0 400.0 3400.0\n",
"2023-01-10 NaN 500.0 3500.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from vtools import ts_merge, ts_splice # Assuming these functions are in merge.py\n",
"\n",
"# Create irregular time series\n",
"idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"2D\")\n",
"idx2 = pd.date_range(\"2023-01-02\", periods=5, freq=\"2D\")\n",
"\n",
"series1 = pd.Series([1, np.nan, 3, 4, 5], index=idx1, name=\"A\")\n",
"series2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"B\")\n",
"\n",
"df1 = pd.DataFrame({\"A\": [1, np.nan, 3, 4, 5], \"B\": [10, 20, 30, 40, 50]}, index=idx1)\n",
"df2 = pd.DataFrame({\"A\": [10, 20, np.nan, 40, 50], \"B\": [100, 200, np.nan, 400, 500]}, index=idx2)\n",
"df3 = pd.DataFrame({\"A\": [310, 320, np.nan, 340, np.nan], \n",
" \"B\": [100, 200, np.nan, 400, 500],\n",
" \"C\": [3100, 3200, np.nan, 3400, 3500]\n",
" }, index=idx2)\n",
"\n",
"# Display Data\n",
"print(\"Series 1:\")\n",
"display(series1)\n",
"\n",
"print(\"Series 2:\")\n",
"display(series2)\n",
"\n",
"print(\"DataFrame 1:\")\n",
"display(df1)\n",
"\n",
"print(\"DataFrame 2:\")\n",
"display(df2)\n",
"\n",
"print(\"DataFrame 3:\")\n",
"display(df3)\n"
]
},
{
"cell_type": "markdown",
"id": "219a7aa6",
"metadata": {},
"source": [
"Here are some example usage:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6949e66",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Merged Series not renamed:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 2.0\n",
"2023-01-03 3.0\n",
"2023-01-04 4.0\n",
"2023-01-05 5.0\n",
"2023-01-06 6.0\n",
"2023-01-07 7.0\n",
"2023-01-08 8.0\n",
"2023-01-09 9.0\n",
"2023-01-10 10.0\n",
"2023-01-11 3.0\n",
"2023-01-12 4.0\n",
"Freq: D, Name: A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Merged Series renamed:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 2.0\n",
"2023-01-03 3.0\n",
"2023-01-04 4.0\n",
"2023-01-05 5.0\n",
"2023-01-06 6.0\n",
"2023-01-07 7.0\n",
"2023-01-08 8.0\n",
"2023-01-09 9.0\n",
"2023-01-10 10.0\n",
"2023-01-11 3.0\n",
"2023-01-12 4.0\n",
"Freq: D, Name: Renamed_A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match\n",
"Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 2023-01-01 | \n",
" 1.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2023-01-02 | \n",
" 10.0 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 2023-01-03 | \n",
" NaN | \n",
" 20.0 | \n",
"
\n",
" \n",
" 2023-01-04 | \n",
" 20.0 | \n",
" 200.0 | \n",
"
\n",
" \n",
" 2023-01-05 | \n",
" 3.0 | \n",
" 30.0 | \n",
"
\n",
" \n",
" 2023-01-06 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2023-01-07 | \n",
" 4.0 | \n",
" 40.0 | \n",
"
\n",
" \n",
" 2023-01-08 | \n",
" 40.0 | \n",
" 400.0 | \n",
"
\n",
" \n",
" 2023-01-09 | \n",
" 5.0 | \n",
" 50.0 | \n",
"
\n",
" \n",
" 2023-01-10 | \n",
" 50.0 | \n",
" 500.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2023-01-01 1.0 10.0\n",
"2023-01-02 10.0 100.0\n",
"2023-01-03 NaN 20.0\n",
"2023-01-04 20.0 200.0\n",
"2023-01-05 3.0 30.0\n",
"2023-01-06 NaN NaN\n",
"2023-01-07 4.0 40.0\n",
"2023-01-08 40.0 400.0\n",
"2023-01-09 5.0 50.0\n",
"2023-01-10 50.0 500.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" 2023-01-01 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2023-01-02 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2023-01-03 | \n",
" NaN | \n",
"
\n",
" \n",
" 2023-01-04 | \n",
" 20.0 | \n",
"
\n",
" \n",
" 2023-01-05 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 2023-01-06 | \n",
" NaN | \n",
"
\n",
" \n",
" 2023-01-07 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 2023-01-08 | \n",
" 40.0 | \n",
"
\n",
" \n",
" 2023-01-09 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 2023-01-10 | \n",
" 50.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A\n",
"2023-01-01 1.0\n",
"2023-01-02 10.0\n",
"2023-01-03 NaN\n",
"2023-01-04 20.0\n",
"2023-01-05 3.0\n",
"2023-01-06 NaN\n",
"2023-01-07 4.0\n",
"2023-01-08 40.0\n",
"2023-01-09 5.0\n",
"2023-01-10 50.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Spliced Series with Renamed Column:\n"
]
},
{
"data": {
"text/plain": [
"2023-01-01 1.0\n",
"2023-01-02 2.0\n",
"2023-01-03 NaN\n",
"2023-01-04 4.0\n",
"2023-01-05 NaN\n",
"2023-01-06 NaN\n",
"2023-01-07 7.0\n",
"2023-01-08 NaN\n",
"2023-01-09 NaN\n",
"2023-01-10 NaN\n",
"2023-01-11 3.0\n",
"2023-01-12 4.0\n",
"Freq: D, Name: Renamed_A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Example: Using `names` to rename output columns\n",
"\n",
"# Merging without a rename\n",
"merged_series_named = ts_merge((series1, series2))\n",
"print(\"Merged Series not renamed:\")\n",
"display(merged_series_named)\n",
"\n",
"# Rename a single column\n",
"merged_series_named = ts_merge((series1, series2), names=\"Renamed_A\")\n",
"print(\"Merged Series renamed:\")\n",
"display(merged_series_named)\n",
"\n",
"# Select specific columns in DataFrame\n",
"try:\n",
" merged_df_named = ts_merge((df1, df2, df3), names=None)\n",
"except:\n",
" print(\"Merged DataFrame without Selected Columns (names=None) results in an error if the columns don't match\")\n",
"#display(merged_df_named)\n",
"\n",
"# Select specific columns in DataFrame\n",
"merged_df_named = ts_merge((df1, df2), names=None)\n",
"print(\"Merged DataFrame without selected columns (names=None) for input DataFrames with matched columns:\")\n",
"display(merged_df_named)\n",
"\n",
"\n",
"# Select specific columns in DataFrame\n",
"merged_df_named = ts_merge((df1, df2, df3), names=[\"A\"])\n",
"print(\"Merged DataFrame with Selected Columns A merges that column ([A,B] would have been OK too)\")\n",
"display(merged_df_named)\n",
"\n",
"\n",
"# Rename column in splicing\n",
"spliced_series_named = ts_splice((series1, series2), names=\"Renamed_A\", transition=\"prefer_last\")\n",
"print(\"Spliced Series with Renamed Column:\")\n",
"display(spliced_series_named)\n"
]
},
{
"cell_type": "markdown",
"id": "6baebda5",
"metadata": {},
"source": [
"## Summary\n",
"- **Use `ts_merge`** when you want to blend time series together, filling missing values in order of priority.\n",
"- **Use `ts_splice`** when you want to keep each time series separate and transition from one to another based on time.\n",
"- **The `names` argument** allows you to rename output columns or select specific columns when merging/splicing DataFrames.\n",
"\n",
"This notebook provides a clear comparison to help you decide which method best suits your use case.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `ts_merge`: strict priority option\n",
"**New option**: `strict_priority` (default `False`) enforces that a higher‑priority series dominates between its `first_valid_index` and `last_valid_index`.\n",
"\n",
"**Semantics**\n",
"- Per **column**, define the dominance window as `[first_valid_index, last_valid_index]`.\n",
"- Within that window, lower‑priority series are **masked**, even if the higher‑priority value is `NaN`.\n",
"- Outside those windows, merging is unchanged and lower priority may contribute.\n",
"- With irregular inputs, timestamps that exist **only** in lower‑priority series **and** are fully masked inside a dominance window are dropped; timestamps from the top series' index are preserved even if all‑`NaN`.\n",
"\n",
"**`names` behavior** is unchanged.\n",
"### Example 1 — Series with interior `NaN`\n",
"\n",
"```python\n",
"import numpy as np, pandas as pd\n",
"from vtools.functions.merge import ts_merge\n",
"\n",
"idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
"idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
"s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n",
"s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n",
"\n",
"ts_merge((s1, s2)) # default\n",
"ts_merge((s1, s2), strict_priority=True)\n",
"```\n",
"### Example 2 — Two columns, per‑column dominance\n",
"\n",
"```python\n",
"idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
"idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
"df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n",
"df1[\"B\"] = df1[\"A\"]\n",
"df1.loc[idx1[2], \"B\"] = np.nan # interior NaN in high‑priority B\n",
"df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n",
"df2[\"B\"] = df2[\"A\"]\n",
"\n",
"ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]]\n",
"```\n",
"### Example 3 — Irregular inputs\n",
"\n",
"```python\n",
"idx1 = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n",
"idx2 = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n",
"s1 = pd.Series([1.,2.,3.,4.], index=idx1, name=\"A\")\n",
"s2 = pd.Series([10.,20.,30.,40.], index=idx2, name=\"A\")\n",
"\n",
"ts_merge((s1, s2), strict_priority=True)\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np, pandas as pd\n",
"from vtools.functions.merge import ts_merge\n",
"\n",
"# Example 1\n",
"idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
"idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
"s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n",
"s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n",
"print(\"Example 1 strict=False:\")\n",
"print(ts_merge((s1, s2)))\n",
"print(\"Example 1 strict=True:\")\n",
"print(ts_merge((s1, s2), strict_priority=True))\n",
"\n",
"# Example 2\n",
"df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n",
"df1[\"B\"] = df1[\"A\"]; df1.loc[idx1[2], \"B\"] = np.nan\n",
"df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n",
"df2[\"B\"] = df2[\"A\"]\n",
"print(\"\\nExample 2 strict=True:\")\n",
"print(ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]])\n",
"\n",
"# Example 3\n",
"idx1i = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n",
"idx2i = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n",
"s1i = pd.Series([1.,2.,3.,4.], index=idx1i, name=\"A\")\n",
"s2i = pd.Series([10.,20.,30.,40.], index=idx2i, name=\"A\")\n",
"print(\"\\nExample 3 strict=True:\")\n",
"print(ts_merge((s1i, s2i), strict_priority=True))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "schism",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}