Averaging, Filtering and Low-Passing Data

VTools provides simple methods for period averaging, filtering and low-passing data, particularly tidal data. Basic period averaging, and in particular daily averaging, is easily covered by the regular Pandas API, but some of the most common tools can distort tidal data or yield unexpected results when data are missing.

As an introduction to the topic let’s start by making a single time series with three of the most common tidal constituents. This produces a series that is complex enough to be interesting and realistic, but it doesn’t have any low frequencies in it, so a reasonable tidal filtration should look like a flat line at zero.

[37]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.signal
from vtools.data.vtime import minutes,days,months
from vtools.functions.period_op import period_op
plt.style.use(['seaborn-colorblind','seaborn-talk'])
mpl.rcParams['figure.figsize'] = [12,4.]

delta_t = minutes(15)  # creating a real offset instead of a string because it will be used in algebra shortly
ndx = pd.date_range("2000-01-01","2000-02-15",freq=delta_t)
thours = 0.25*np.arange(len(ndx))
twopi = 2.*np.pi
M2 = np.cos(twopi*thours/12.4)
K1 = 0.5*np.cos(twopi*thours/23.9+1.)
O1 = 0.5*np.cos(twopi*thours/25.8+1.)
tide = M2 + O1 + K1
df = pd.DataFrame(index=ndx,data={"tide": tide})
ax=df.plot(color="black")
ax.set_ylabel("m")





[37]:
Text(0, 0.5, 'm')
../_images/notebooks_filters_1_1.png

Now consider the ways you might average or filter this data. Again, note that a reasonable/desired outcome is just a flat line at zero.

One approach is a 24-hour boxcar average. A boxcar average is a moving window average that puts equal weight on all the points in the window. This is not the same as a daily average, because the output is defined/recalculated at each of the original 15-minute points. There is no way to position a 24-hour window so that it is centered perfectly, because there is an even number of samples in 24 hours. The fact that odd numbers are needed for a centered filter may be unintuitive, but just think of your hand: there is only a middle finger because you have an odd number of fingers.

A true daily average results in one value per day. Pandas can do this with df.resample('D').mean(). The time centering of the result won’t have much meaning in this case. You also have to watch out for missing data and boundary effects. Pandas may attempt to complete the average even if given just a few values, and this can cause absurd values near data boundaries or gaps. An example is shown in the plot below, and the value is so bad it had to be truncated. The VTools period_op allows you to specify the fraction of missing data allowed, which should be set fairly low, say between 0.0 and 0.1. The fact that Pandas doesn’t have a switch for this is uncharacteristic of its programming interface, which often includes arguments (like min_periods for rolling() operations) that help you specify a minimum amount of data.

Finally, note that the boxcar and the daily average are related. A daily average can be thought of as a boxcar that is subsequently resampled once per day and then has its timestamp shifted from the middle of the period to the beginning (or sometimes the end) of the period. We’ve added markers to the VTools daily average in the plot below to show the idea of this resampling. Each green point is a snapshot of the blue boxcar shifted ~12 hours to the left. Sampling the wiggles once per 24 hours causes something called “aliasing”, in which a spurious frequency (a 14-day signal) is created by visiting the tidal wiggles once per day at different points in the tide cycle … and yes, we’ve seen these spurious signals analyzed as if they were real.

[38]:
samples_day = 96
box = df.rolling(samples_day,center=True,min_periods=85).mean()
daily = df.resample('D').mean()
daily_vtools = period_op(df,period = "D",agg="mean",max_absent_frac=0.)

ax = box.plot()
daily.plot(ax=ax)
daily_vtools.plot(ax=ax,style=['-o'])
ax.grid()
ax.set_ylim(-0.3,0.3)
ax.axhline(y=-0.07,color="black",linestyle=":")
ax.axhline(y=0.07,color="black",linestyle=":")
ax.set_ylabel('m')

plt.legend(["boxcar","daily","daily vtools"])
plt.show()
../_images/notebooks_filters_3_0.png
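The resampling relationship can also be checked numerically. Here is a quick sketch using the box and daily variables from the cell above: sampling the centered boxcar once per day at noon should nearly reproduce the daily average, apart from the 12-hour timestamp shift.

# Sketch: check the boxcar/daily-average relationship using box and daily from above.
box_at_noon = box.at_time("12:00")           # one boxcar value per day, stamped at noon
daily_shifted = daily.shift(12, freq="H")    # move the daily stamp from midnight to noon
check = pd.concat([box_at_noon, daily_shifted], axis=1)
check.columns = ["boxcar sampled at noon", "daily average (shifted)"]
print(check.head())                          # the two columns should nearly match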

Next comes the all-important question: how much do the wiggles matter? Importance depends on the signal and the context of the analysis, but in a predictable way. The strength of the wiggles will be about 5% of the tidal range of the original signal. In the present case that was a bit over 1.5m, so the size of the wiggles is about 0.07m. To show that, dotted lines are drawn at +/- 0.07m on the plot.
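The size of the wiggles in this synthetic example can be measured directly. The sketch below uses the box variable from the cell above and trims the boundary-affected ends; the peak residual it reports should land near the 0.07m dotted lines.

# Sketch: measure the residual wiggle left by the 24-hour boxcar (uses box from above).
# Away from the boundaries, the peak residual should sit near the 0.07 m dotted lines.
wiggle = float(box["tide"].iloc[200:-200].abs().max())
print(f"peak boxcar residual: {wiggle:.3f} m")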

If you are trying to filter discharge at Rio Vista on the Sacramento River, the tides have an amplitude of 150,000cfs and the mean signal you are trying to extract could be as small as 3,000cfs during a dry summer. Using the 5% rule, the spurious wiggles will be 7,500cfs and will dominate your analysis. Even worse, they will look real and will seem like a spring-neap effect. For water levels, the situation would be less bad. At Rio Vista the tidal amplitude is perhaps 1m, so the filter wiggles will be about 5cm: detectable but not dominant compared to the 1.5m of variation in water levels at that station seasonally or in flood events. Similarly, signal processing errors are often not the biggest weakness in a salinity analysis. If you are analyzing fluxes, however, it is recommended that you use a better filter.

Filters

Both daily averaging and 24-hour moving averages are problematic for tidal quantities because they are calendar/solar-day methods applied to a signal that is largely lunar in nature. The simplest filter you could use in the above case is a boxcar that is better tuned to the 24.8-hour lunar day. This might be implemented using an averaging period of 25 values for hourly data or 99 values for 15-minute data. In the code below I’ve been a bit fussier about the required data, allowing no missing data and avoiding the “crazy” values at the ends. As you can see from the blue and orange lines in the following plot, the issues are reduced by about half.

We can go a bit further and use a real tidal filter. These filters are like moving averages, but their weights are not equal the way a boxcar’s are. Below are a couple of examples. One is the so-called “Godin” filter, which gained popularity because it could be constructed from successive boxcar ingredients: one 25-hour boxcar centered on the time point under analysis and two 24-hour boxcars staggered to the left and right by one sample so that overall they produce a centered effect.
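To make the boxcar-ingredient recipe concrete, here is a minimal sketch using Pandas rolling means on an hourly subsample of the demo tide. It illustrates the idea only and is not the VTools implementation (the godin() function used below is the one to use in practice); the re-centering shift of 35 samples follows from the lags of the three trailing boxcars (11.5 + 11.5 + 12 hours).

# Sketch: a Godin-style filter built from successive boxcars (illustration only;
# use the vtools godin() function for real work).
df_hourly = df.resample('H').asfreq()                   # hourly subsample of the demo tide

stage = df_hourly.rolling(24, min_periods=24).mean()    # first 24-hour boxcar (trailing)
stage = stage.rolling(24, min_periods=24).mean()        # second 24-hour boxcar
stage = stage.rolling(25, min_periods=25).mean()        # 25-hour boxcar

# Each trailing window lags its true center (11.5 h + 11.5 h + 12 h = 35 h in total),
# so shift the result back 35 hourly samples to center the composite filter.
godin_like = stage.shift(-35)
ax = godin_like.plot(legend=False)
ax.set_ylabel("m")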

The last filter is a cosine_lanczos. More specifically, it is a filter with a Lanczos window built around a 40-hour cutoff that is “squared” in the sense that it is applied forward and backward, which produces a phase-neutral response and increases the effective order of the filtration.

Note that both the Godin and cosine-Lanczos filters produce an almost entirely flat response. Both have completely suppressed the tide. We’ll worry in a few moments about what good things they may have suppressed as well. As a practical matter, note that the Godin line has fewer missing values at the beginning and end of the series than the cosine-Lanczos. This is because it is a shorter filter, with non-zero weights over a briefer period. Tidal filters often produce NaNs at the beginning and end of the data set.

[39]:
from vtools.functions.filter import godin,cosine_lanczos
samples_tidal_day = 99
box24 = df.rolling(samples_day,center=True,min_periods=96).mean()
box25 = df.rolling(samples_tidal_day,center=True,min_periods=99).mean()
gdn = godin(df)
coslan = cosine_lanczos(df,'40H')
ax1 = box24.plot()
box25.plot(ax=ax1)
gdn.plot(ax=ax1)
coslan.plot(ax=ax1,color="0.35")
ax1.legend(['24 hour boxcar','25 hour boxcar','godin','cosine-lanczos'])
[39]:
<matplotlib.legend.Legend at 0x1fb4dc7e448>
../_images/notebooks_filters_6_1.png
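A quick way to see the difference in filter length is to count the missing values each filter leaves behind; the sketch below uses the gdn and coslan variables from the cell above.

# Sketch: count the NaNs left at the ends by each filter (uses gdn and coslan from above).
# The shorter Godin filter should leave fewer missing values than the cosine-Lanczos.
print("godin NaNs:         ", int(gdn.squeeze().isna().sum()))
print("cosine-lanczos NaNs:", int(coslan.squeeze().isna().sum()))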

So at this point we’ve tested the ability of the filters to eliminate the tides, their so-called stopband characteristics. Now what about their ability to capture or “pass” the lower frequencies that are of interest? For this we need to add a subtidal (slowly varying) component to the tide. To do this we will use a function called a “chirp”, which here transitions from a period of 12 days to a period of 4 days, in both cases much longer than a tidal day.

[40]:

# Create the chirp and show it.
f0 = 1./(24.*12.)
f1 = 1./(24.*4)
subtide = 0.5*scipy.signal.chirp(thours, f0, thours[-1], f1,
                                 method='linear', phi=0, vertex_zero=True)
dfsub = pd.DataFrame(index=df.index, data=subtide)
ax = dfsub.plot()

# Add it to the original tide
df_with_sub = df.squeeze() + dfsub.squeeze()
df_with_sub = df_with_sub.to_frame()
df_with_sub.plot(ax=ax)
ax.get_legend().remove()
plt.show()
../_images/notebooks_filters_8_0.png

Now we attempt to extract the known subtidal signal using some of the tidal averages and filters. Doing so reveals differences that may be significant in some applications. The Godin filter attenuates signals with periods of 3-7 days by as much as 30%. The 25-hour boxcar (which is one of the “ingredients” of the Godin filter) attenuates less, but adds wiggles. The cosine-Lanczos filter is better in both regards, but as we have already mentioned it amplifies gaps and boundary effects. One useful step to help minimize this is to interpolate small gaps before applying the filter. The way to do this is shown in the first step of the following snippet, though of course it does nothing here because there is no missing data.

[41]:
# Interpolate small gaps (linear by default); any gap longer than 4 samples is left alone.
# In this case there are no gaps, so this does nothing.
df_with_sub = df_with_sub.interpolate(limit=4)

box25 = df_with_sub.rolling(99,center=True).mean()   # Boxcar spanning 99 15min values
gdnsub = godin(df_with_sub)                          # Godin
cl_sub = cosine_lanczos(df_with_sub,'40H')           # cosine-Lanzos squared

ax1=dfsub.iloc[2500:3800].plot()
box25.iloc[2500:3800].plot(ax=ax1)
gdnsub.iloc[2500:3800].plot(ax=ax1)
cl_sub.iloc[2500:3800].plot(ax=ax1,color="0.35")


ax=dfsub.plot()
box25.plot(ax=ax)
gdnsub.plot(ax=ax)
cl_sub.plot(ax=ax,color="0.35")
ax.legend(["actual","boxcar","godin","cosine_lanczos"])
ax1.set_title("Close Up")
ax1.get_legend().remove()


plt.show()
../_images/notebooks_filters_10_0.png
../_images/notebooks_filters_10_1.png

The passband, stopband and fidelity of the filters can be compared by looking at their frequency response:

[42]:
from vtools.functions.frequency_response import *
compare_response(40)


../_images/notebooks_filters_12_0.png
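For comparison, the response of a plain boxcar can be computed directly with scipy. The sketch below (not part of the VTools interface) shows the 99-sample lunar-day boxcar; its sidelobes are the source of the wiggles seen earlier.

# Sketch: frequency response of the 99-sample (24.75-hour) boxcar, computed with scipy.
weights = np.ones(99) / 99.                      # equal boxcar weights
w, h = scipy.signal.freqz(weights, worN=4096)    # w is in radians per 15-minute sample

period_hours = 2. * np.pi * 0.25 / w[1:]         # convert to period in hours (skip w = 0)
plt.plot(period_hours, np.abs(h[1:]))
plt.xscale('log')
plt.xlabel('period (hours)')
plt.ylabel('amplitude response')
plt.show()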

VTools also provides a simplified interface to the Butterworth filter. However, Butterworth filters are members of the class of “infinite impulse response” (IIR) filters, which are hard to use with imperfect, gappy data.
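To see why, here is a small sketch using scipy directly (not the VTools interface): a low-pass Butterworth applied with forward-backward filtering. Because the filter is recursive, even a tiny gap typically contaminates the entire output.

# Sketch: a 40-hour low-pass Butterworth via scipy, showing why IIR filters and gaps mix badly.
dt_hours = 0.25                                  # 15-minute samples
wn = (2. * dt_hours) / 40.                       # cutoff as a fraction of the Nyquist frequency
b, a = scipy.signal.butter(4, wn, btype='lowpass')

gappy = df["tide"].copy()
gappy.iloc[1000:1004] = np.nan                   # introduce a tiny gap
filtered = scipy.signal.filtfilt(b, a, gappy.values)
print(np.isnan(filtered).sum(), "of", len(filtered), "output values are NaN")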