vtools.data package

Submodules

vtools.data.dst module

Daylight Saving Time Conversion

This module provides the function dst_st() for converting a pandas Series or DataFrame with a naive DatetimeIndex that observes daylight saving time (DST) to a fixed standard time zone (e.g., PST) using POSIX conventions.

See the automatic API documentation for details: vtools.data.dst.dst_st()

dst_st(ts, src_tz: str = 'US/Pacific', target_tz: str = 'Etc/GMT+8')[source]

Convert a pandas Series with a datetime index from a timezone-unaware index that observes DST (e.g., US/Pacific) to a fixed standard time zone (e.g., Etc/GMT+8) using POSIX conventions.

Parameters:
tspandas.Series

Time series with a naive (timezone-unaware) DatetimeIndex.

src_tzstr, optional

Source timezone name (default is ‘US/Pacific’).

target_tzstr, optional

Target standard timezone name (default is ‘Etc/GMT+8’).

Returns:
pandas.Series

Time series with index converted to the target standard timezone and made naive.

Notes

  • The function assumes the index is not already timezone-aware.

  • ‘Etc/GMT+8’ is the correct tz name for UTC-8 (PST) in pytz; note the sign is reversed from what might be expected.

  • Handles ambiguous/nonexistent times due to DST transitions.

  • The returned index is naive (timezone-unaware) but represents the correct standard time.

  • If the input index is already timezone-aware, this function will raise an error.

Examples

>>> import pandas as pd
>>> from vtools import dst_st
>>> rng = pd.date_range("2023-11-05 00:00", "2023-11-05 04:00", freq="30min")
>>> ts = pd.Series(range(len(rng)), index=rng)
>>> converted = dst_st(ts)
>>> print(converted)
2023-11-05 00:00:00    0
2023-11-05 00:30:00    1
2023-11-05 01:00:00    2
2023-11-05 01:30:00    3
2023-11-05 02:30:00    5
2023-11-05 03:00:00    6
2023-11-05 03:30:00    7
2023-11-05 04:00:00    8
dtype: int64
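The conversion above can be sketched with plain pandas timezone operations. This is an illustrative equivalent, not necessarily the actual vtools implementation; a summer date is used so that no stamp falls in the ambiguous fall-back hour.

```python
import pandas as pd

# Sketch: localize the naive index to the DST-observing zone, convert to
# the fixed standard zone, then drop the timezone again.
rng = pd.date_range("2023-07-01 12:00", periods=3, freq="1h")  # summer: PDT
ts = pd.Series([0, 1, 2], index=rng)

out = ts.copy()
out.index = (
    ts.index.tz_localize("US/Pacific")   # interpret stamps as local clock time
            .tz_convert("Etc/GMT+8")     # fixed UTC-8 standard time (PST)
            .tz_localize(None)           # back to a naive index
)
print(out.index[0])  # 2023-07-01 11:00:00 -- noon PDT is 11:00 PST
```

Around DST transitions the real function must also resolve ambiguous and nonexistent times, which this sketch sidesteps.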

vtools.data.gap module

describe_null(dset, name, context=2)[source]

If dset is a DataFrame, run describe_series_gaps on each column. If it’s a Series, just run it once.

describe_series_gaps(s: Series, name: str, context: int = 2)[source]

Print gaps in a single Series s, showing context non-null points before and after each gap, with an ellipsis marker in between.

example_gap()[source]
gap_count(ts, state='gap', dtype=<class 'int'>)[source]

Count missing data. Identifies gaps (runs of missing or non-missing data) and quantifies the length of each run in number of samples, which works best for regular series. Each time point receives the length of the run it belongs to.

Parameters:
tsDataFrame

Time series to analyze

statestr one of ‘gap’|’good’|’both’

State to count. If state is ‘gap’, the block size of missing data is counted and reported for time points in the gap (every point in a given gap receives the same value); non-missing data receive a size of zero. Setting state to ‘good’ inverts this: missing blocks are reported as zero and runs of good data are counted.

dtypestr or type

Data type of output, should be acceptable to pandas astype
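The run-length idea behind gap_count can be sketched with a standard pandas grouping trick. The helper name `run_lengths` is hypothetical, for illustration only; the real vtools function may differ in details.

```python
import numpy as np
import pandas as pd

def run_lengths(s, state="gap"):
    """Assign each point the length of the NaN (or non-NaN) run it sits in."""
    isna = s.isna()
    # A new block starts whenever the null/non-null state changes.
    block = (isna != isna.shift()).cumsum()
    sizes = isna.groupby(block).transform("size")
    if state == "gap":
        return sizes.where(isna, 0).astype(int)   # non-missing points -> 0
    return sizes.where(~isna, 0).astype(int)      # state == "good"

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
print(run_lengths(s).tolist())  # [0, 2, 2, 0, 1, 0]
```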

gap_distance(ts, disttype='count', to='good')[source]

For each element of ts, compute the distance to the nearest good or bad data.

Parameters:
tsDataFrame
Time series to analyze
disttypestr one of ‘count’|’freq’
If disttype=“count” the distance is a number of values. If disttype=“freq” it is in the units of ts.freq
(so if freq == “15min” it is in minutes).
tostr one of ‘bad’|’good’
If to=“good” this is the distance to the nearest good data (which is 0 for good data).
If to=“bad”, this is the distance to the nearest nan (which is 0 for nan).
Returns:
resultDataFrame

A new regular time series with the same freq as the argument, holding the distance to good/bad data.
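The count-distance computation can be sketched with NumPy searches over the positions of good data. The helper name `dist_to_good` is hypothetical and assumes at least one non-NaN value; it mirrors the description of to=“good”, not the actual vtools code.

```python
import numpy as np
import pandas as pd

def dist_to_good(s):
    """Count-distance from each point to the nearest non-NaN value
    (0 for good data)."""
    pos = np.arange(len(s))
    good_pos = pos[s.notna().to_numpy()]
    left = np.searchsorted(good_pos, pos, side="right") - 1   # good at or before
    right = np.searchsorted(good_pos, pos, side="left")       # good at or after
    dl = np.where(left >= 0, pos - good_pos[np.maximum(left, 0)], np.inf)
    dr = np.where(right < len(good_pos),
                  good_pos[np.minimum(right, len(good_pos) - 1)] - pos, np.inf)
    return pd.Series(np.minimum(dl, dr), index=s.index)

s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])
print(dist_to_good(s).tolist())  # [2.0, 1.0, 0.0, 1.0, 0.0]
```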

gap_size(ts)[source]

Identifies gaps (runs of missing data) and quantifies the length of each gap. Each time point receives the length of its run, in time units (see the example below) or number of values along the time dimension, with non-missing data returning zero. Time is measured from when the data first becomes missing to when it first becomes non-missing.

Parameters:
tsDataFrame
Returns:
resultDataFrame

A new regular time series with the same freq as the argument holding the size of the gap.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> ndx = pd.date_range(pd.Timestamp(2017,1,1,12),freq='15min',periods=10)
>>> vals0 = np.arange(0.,10.,dtype='d')
>>> vals1 = np.arange(0.,10.,dtype='d')
>>> vals2 = np.arange(0.,10.,dtype='d')
>>> vals0[0:3] = np.nan
>>> vals0[7:-1] = np.nan
>>> vals1[2:4] = np.nan
>>> vals1[6] = np.nan
>>> vals1[9] = np.nan
>>> df = pd.DataFrame({'vals0':vals0,'vals1':vals1,'vals2':vals2},index = ndx)
>>> out = gap_size(df)
>>> print(df)
                         vals0  vals1  vals2
2017-01-01 12:00:00    NaN    0.0    0.0
2017-01-01 12:15:00    NaN    1.0    1.0
2017-01-01 12:30:00    NaN    NaN    2.0
2017-01-01 12:45:00    3.0    NaN    3.0
2017-01-01 13:00:00    4.0    4.0    4.0
2017-01-01 13:15:00    5.0    5.0    5.0
2017-01-01 13:30:00    6.0    NaN    6.0
2017-01-01 13:45:00    NaN    7.0    7.0
2017-01-01 14:00:00    NaN    8.0    8.0
2017-01-01 14:15:00    9.0    NaN    9.0
>>> print(out)
                         vals0  vals1  vals2
2017-01-01 12:00:00   45.0    0.0    0.0
2017-01-01 12:15:00   45.0    0.0    0.0
2017-01-01 12:30:00   45.0   30.0    0.0
2017-01-01 12:45:00    0.0   30.0    0.0
2017-01-01 13:00:00    0.0    0.0    0.0
2017-01-01 13:15:00    0.0    0.0    0.0
2017-01-01 13:30:00    0.0   15.0    0.0
2017-01-01 13:45:00   30.0    0.0    0.0
2017-01-01 14:00:00   30.0    0.0    0.0
2017-01-01 14:15:00    0.0    0.0    0.0

vtools.data.sample_series module

bessel_df()[source]

Sample series with bessel function signals

extra()[source]
interval(ts)[source]

Sampling interval of series

jay_flinchem_chirptest(c1=3.5, c2=5.5, c3=0.0002, c4=6.75)[source]

Approximation of the signal from Jay and Flinchem (1999), “A comparison of methods for analysis of tidal records containing multi-scale non-tidal background energy,” which has a small tide with noisy, river-influenced amplitude and a subtide.

small_subtide(subtide_scale=0.0, add_nan=False)[source]

Inspired by a large tidal flow with a small Qr undercurrent with a 72 hr period. This is a tough lowpass filtering job because the diurnal band is large and must be suppressed in order to see the more subtle subtidal amplitude.

vtools.data.timeseries module

Time series module: helpers for creating regular and irregular time series, transforming irregular series to regular, and analyzing gaps.

class PchipInterpolator(x, y, axis=0, extrapolate=None)[source]

Bases: CubicHermiteSpline

PCHIP 1-D monotonic cubic interpolation.

x and y are arrays of values used to approximate some function f, with y = f(x). The interpolant uses monotonic cubic splines to find the value of new points. (PCHIP stands for Piecewise Cubic Hermite Interpolating Polynomial).

Parameters:
xndarray, shape (npoints, )

A 1-D array of monotonically increasing real values. x cannot include duplicate values (otherwise f is overspecified)

yndarray, shape (…, npoints, …)

A N-D array of real values. y’s length along the interpolation axis must be equal to the length of x. Use the axis parameter to select the interpolation axis.

axisint, optional

Axis in the y array corresponding to the x-coordinate values. Defaults to axis=0.

extrapolatebool, optional

Whether to extrapolate to out-of-bounds points based on first and last intervals, or to return NaNs.

See also

CubicHermiteSpline

Piecewise-cubic interpolator.

Akima1DInterpolator

Akima 1D interpolator.

CubicSpline

Cubic spline data interpolator.

PPoly

Piecewise polynomial in terms of coefficients and breakpoints.

Notes

The interpolator preserves monotonicity in the interpolation data and does not overshoot if the data is not smooth.

The first derivatives are guaranteed to be continuous, but the second derivatives may jump at \(x_k\).

Determines the derivatives at the points \(x_k\), \(f'_k\), by using PCHIP algorithm [1].

Let \(h_k = x_{k+1} - x_k\) and \(d_k = (y_{k+1} - y_k) / h_k\) be the slopes at internal points \(x_k\). If the signs of \(d_k\) and \(d_{k-1}\) are different or either of them equals zero, then \(f'_k = 0\). Otherwise, it is given by the weighted harmonic mean

\[\frac{w_1 + w_2}{f'_k} = \frac{w_1}{d_{k-1}} + \frac{w_2}{d_k}\]

where \(w_1 = 2 h_k + h_{k-1}\) and \(w_2 = h_k + 2 h_{k-1}\).

The end slopes are set using a one-sided scheme [2].

References

[1]

F. N. Fritsch and J. Butland, A method for constructing local monotone piecewise cubic interpolants, SIAM J. Sci. Comput., 5(2), 300-304 (1984). :doi:`10.1137/0905021`.

[2]

see, e.g., C. Moler, Numerical Computing with Matlab, 2004. :doi:`10.1137/1.9780898717952`

Attributes:
axis
c
extrapolate
x

Methods

__call__(x[, nu, extrapolate])

Evaluate the piecewise polynomial or its derivative.

derivative([nu])

Construct a new piecewise polynomial representing the derivative.

antiderivative([nu])

Construct a new piecewise polynomial representing the antiderivative.

roots([discontinuity, extrapolate])

Find real roots of the piecewise polynomial.

__init__(x, y, axis=0, extrapolate=None)[source]
datetime_elapsed(index_or_ts, reftime=None, dtype='d', inplace=False)[source]

Convert a time series or DatetimeIndex to an integer/double series of elapsed time

Parameters:
index_or_tsDatetimeIndex

Time series or index to be transformed

reftimeDatetimeIndex or something convertible

The reference time upon which elapsed time is measured. Default of None means start of series

dtypestr like ‘i’ or ‘d’ or type like int (Int64) or float (Float64)

Data type for output, which starts out as a Float64 (‘d’) and gets converted, typically to Int64 (‘i’)

inplacebool

If input is a data frame, replaces the index in-place with no copy

Returns:
result

A new index using elapsed time from reftime as its value and of type dtype
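The elapsed-time conversion described above can be reproduced with plain pandas arithmetic. This is an illustrative sketch of the default case (reference time = start of series), not the vtools implementation itself.

```python
import pandas as pd

# Elapsed seconds from a reference time (here the first timestamp).
ndx = pd.date_range("2017-01-01 12:00", freq="15min", periods=4)
elapsed = (ndx - ndx[0]).total_seconds()
print(list(elapsed))  # [0.0, 900.0, 1800.0, 2700.0]
```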

days(d)[source]

Create a time interval representing d days

elapsed_datetime(index_or_ts, reftime=None, time_unit='s', inplace=False)[source]

Convert a time series or numerical Index to a Datetime index or series

Parameters:
index_or_tsDatetimeIndex

Time series or index to be transformed with index in elapsed seconds from reftime

reftimeDatetimeIndex or something convertible

The reference time upon which datetimes are to be evaluated.

inplacebool

If input is a data frame, replaces the index in-place with no copy

Returns:
result

A new DatetimeIndex inferred from elapsed time (in time_unit) from reftime.
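The inverse conversion can likewise be sketched with plain pandas. This illustrates the idea behind elapsed_datetime under the default time_unit of seconds; it is not the actual implementation.

```python
import pandas as pd

# Rebuild a DatetimeIndex from elapsed seconds relative to a reference time.
reftime = pd.Timestamp("2017-01-01 12:00")
seconds = [0.0, 900.0, 1800.0]
ndx = reftime + pd.to_timedelta(seconds, unit="s")
print(ndx[1])  # 2017-01-01 12:15:00
```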

example()[source]
extrapolate_ts(ts, start=None, end=None, method='ffill', val=None)[source]

Extend a regular time series to a new start and/or end using a specified extrapolation method.

Parameters:
tspandas.Series or pandas.DataFrame

The input time series with a DateTimeIndex and a regular frequency.

startdatetime-like, optional

The new starting time. If None, no extension is done before the existing data.

enddatetime-like, optional

The new ending time. If None, no extension is done after the existing data.

method{‘ffill’, ‘bfill’, ‘linear_slope’, ‘taper’, ‘constant’}, default ‘ffill’

The method used to fill new values outside the original time range:

  • ‘ffill’ : Forward-fill after the original data using its last value.

  • ‘bfill’ : Backward-fill before the original data using its first value.

  • ‘linear_slope’ : Bidirectional linear extrapolation using the first/last two points.

  • ‘taper’ : One-sided linear interpolation to/from a specified value (val).

  • ‘constant’ : One-sided constant value fill with val.

valfloat, optional

Required for ‘taper’ and ‘constant’. Specifies the value to use.

Returns:
extendedpandas.Series or pandas.DataFrame

The time series extended and filled using the selected method.

Raises:
ValueError
  • If extrapolation rules are violated based on the method.

  • If method requires or forbids val and it’s misused.

  • If frequency cannot be inferred.
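The ‘ffill’ case can be sketched with a reindex onto a wider regular index. This shows the assumed behavior of extending past the original end with the last value; details of the real function (validation, other methods) are omitted.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2023-01-01", periods=3, freq="D"))

# Reindex onto the extended range, then forward-fill past the original end.
new_index = pd.date_range("2023-01-01", "2023-01-05", freq="D")
extended = ts.reindex(new_index).ffill()
print(extended.tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0]
```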

hours(h)[source]

Create a time interval representing h hours

is_regular(ts, raise_exception=False)[source]

Check if a pandas DataFrame, Series, or xarray object with a time axis (axis 0) has a regular time index.

Regular means:
  • The index is unique.

  • The index equals a date_range spanning from the first to the last value with the inferred frequency.

Parameters:

ts : DataFrame, Series, or xarray object
raise_exception (bool) : If True, raises a ValueError when the index is not regular; otherwise, returns False.

Returns:

bool : True if the time index is regular; False otherwise.
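The two criteria above can be checked directly with pandas. The helper name `index_is_regular` is hypothetical; it mirrors the description, not the vtools source.

```python
import pandas as pd

def index_is_regular(ndx):
    """Unique index that equals a date_range spanning first to last value
    at the inferred frequency."""
    if not ndx.is_unique:
        return False
    freq = pd.infer_freq(ndx)
    if freq is None:
        return False
    return ndx.equals(pd.date_range(ndx[0], ndx[-1], freq=freq))

regular = pd.date_range("2020-01-01", periods=5, freq="1h")
gappy = regular.delete(2)  # drop one stamp -> irregular
print(index_is_regular(regular), index_is_regular(gappy))  # True False
```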

minutes(m)[source]

Create a time interval representing m minutes

months(m)[source]

Create a time interval representing m months

rts(data, start, freq, columns=None, props=None)[source]

Create a regular or calendar time series from data and time parameters

Parameters:
dataarray_like

An array/list of values. There is no restriction on data type, but not all functionality (such as addition or interpolation) will work on all data.

startPandas.Timestamp

Timestamp or a string or type that can be coerced to one.

freqtime_interval

Time interval of the series. Can also be a string representing a pandas freq.

Returns:
resultPandas.DataFrame

A regular time series with the freq attribute set
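What rts is described as producing can be sketched directly with a pandas date_range of matching length. This is an illustrative equivalent, not the vtools implementation.

```python
import numpy as np
import pandas as pd

# Build a regular series: index inferred from start, freq, and data length.
data = np.arange(4.0)
ndx = pd.date_range(start="2020-01-01", periods=len(data), freq="15min")
df = pd.DataFrame({"value": data}, index=ndx)
print(df.shape)  # (4, 1)
```

Because the index comes from date_range, its freq attribute is set, which is what downstream gap and resampling tools rely on.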

rts_formula(start, end, freq, valfunc=nan)[source]

Create a regular time series filled with constant value or formula based on elapsed seconds

Parameters:
startPandas.Timestamp

Starting Timestamp or a string or type that can be coerced to one.

endPandas.Timestamp

Ending Timestamp or a string or type that can be coerced to one.

freq_time_interval

Can also be a string representing an interval.

valfuncdict

Constant or dictionary that maps column names to lambdas evaluated on elapsed time from the start of the series. An example would be {“value”: lambda x: np.nan}

Returns:
resultPandas.DataFrame

A regular time series with the freq attribute set
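The formula-evaluation idea can be sketched as follows: build the index, compute elapsed seconds from the start, and apply each column's lambda to that array. This assumes the semantics stated in the docstring and is not the actual vtools code.

```python
import pandas as pd

ndx = pd.date_range("2020-01-01", "2020-01-01 01:00", freq="15min")
elapsed = (ndx - ndx[0]).total_seconds()

# Each column is a function of elapsed seconds from the series start.
valfunc = {"value": lambda t: t / 60.0}  # elapsed minutes, as an example
df = pd.DataFrame({name: f(elapsed) for name, f in valfunc.items()}, index=ndx)
print(df["value"].tolist())  # [0.0, 15.0, 30.0, 45.0, 60.0]
```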

seconds(s)[source]

Create a time interval representing s seconds

time_overlap(ts0, ts1, valid=True)[source]

Check for overlapping time coverage between two series. Returns a tuple with the start and end of the overlapping period. Only the time stamps of the start/end are considered, possibly ignoring NaNs at the beginning if valid=True; actual time stamp alignment is not checked.
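The overlap computation reduces to comparing endpoints. A minimal sketch of the valid=False case (NaN trimming omitted):

```python
import pandas as pd

ts0 = pd.Series(1.0, index=pd.date_range("2020-01-01", "2020-01-10"))
ts1 = pd.Series(2.0, index=pd.date_range("2020-01-05", "2020-01-20"))

# Overlapping window is [max of starts, min of ends], or None if disjoint.
start = max(ts0.index[0], ts1.index[0])
end = min(ts0.index[-1], ts1.index[-1])
overlap = (start, end) if start <= end else None
print(overlap)
```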

to_dataframe(ts)[source]
transition_ts(ts0, ts1, method='linear', create_gap=None, overlap=(0, 0), return_type='series')[source]
years(y)[source]

Create a time interval representing y years

vtools.data.vis_gap module

generate_sample_data()[source]
interactive_gap_plot(df)[source]
plot_missing_data(df, ax, min_gap_duration, overall_start, overall_end)[source]

vtools.data.vtime module

Basic ops for creating, testing and manipulating times and time intervals. This module contains factory and helper functions for working with times and time intervals.

For time intervals (or deltas), VTools uses classes that are compatible with the “freq” argument of pandas time series functions.

VTools requires a time and time interval system that is consistent (e.g., time + n*interval makes sense) and that can be applied to both calendar-dependent and calendar-independent intervals. Because this requirement is not met by any one implementation, it is recommended that you always use the factory functions in this module for creating intervals and for testing whether an interval is valid.
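The calendar-dependent vs. calendar-independent distinction is the crux here, and pandas itself illustrates it: a month must be a calendar-aware DateOffset, while hours or days can be fixed-length Timedeltas. The factory functions in this module hide that distinction.

```python
import pandas as pd

t = pd.Timestamp("2020-01-31")
# Calendar-aware: "one month later" is clipped to the end of February.
print(t + pd.DateOffset(months=1))   # 2020-02-29 00:00:00
# Fixed-length: exactly 30 days, landing in March.
print(t + pd.Timedelta(days=30))     # 2020-03-01 00:00:00
```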

days(d)[source]

Create a time interval representing d days

dst_to_standard_naive(ts, dst_zone='US/Pacific', standard_zone='Etc/GMT+8')[source]

Convert a timezone-unaware series from local (daylight-observing) time to standard time. This would be useful, say, for converting a series that is PDT during summer to one that is not. The routine mainly treats cases where the time stamps at DST interfaces are not redundant; if they are, you can probably use tz_convert and tz_localize with the ambiguous=‘infer’ option and do the job more efficiently, but many databases do not store data this way.

The choice of the standard_zone appears buggy. The defaults are supposed to convert from PST/PDT to pure PST, and the latter should be GMT-8. In a sense, this function is included before the behavior is fully understood.

Only regular series are accepted … this is a quirk of the implementation.

hours(h)[source]

Create a time interval representing h hours

minutes(m)[source]

Create a time interval representing m minutes

months(m)[source]

Create a time interval representing m months

seconds(s)[source]

Create a time interval representing s seconds

years(y)[source]

Create a time interval representing y years

Module contents