dms_datastore
dms_datastore is a package for downloading and managing a repository of csv files of continuous time series data, mostly focused on environmental data for the Bay-Delta.
NOTE: THE DMS_DATASTORE IS STILL UNDER CONSTRUCTION. NO RELEASE HAS BEEN MADE.
Introduction
The main functionality includes:
Automatic downloading scripts for major data providers.
Station lists and a utility to look up information from them.
Populating routine that orchestrates downloads into a repository.
Readers for downloaded formats.
Reformatting and time alignment to repackage time series in a common csv format with metadata headers.
Screening routines.
Along the way, the package provides definitions of stations, access methods, units, and names that encapsulate many of the quirks and ambiguities of individual providers.
Installation
Prerequisites
The repository depends on vtools3, matplotlib and several downloading, processing and visualization libraries.
Conda
The easiest way to install the library is using conda (typically miniconda). Some hints for using conda tools and channels for our libraries, or for doing a developer install, are given here [ref]. Assuming you follow those instructions for setting up channels, the minimal installation is:
$ conda install dms_datastore
We recommend that you do a conda installation even if you follow it up with a developer install using pip.
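For reference, once the conda dependencies are in place, a developer install from a local clone of the source is typically just an editable pip install. This is a common pattern rather than the project's documented procedure, so defer to the instructions referenced above:
$ pip install --no-deps -e .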
Quickstart: Things You Can Do Quickly with dms_datastore
Look up information on a station using a fragment of its name or standard id:
$ station_info francisco
Matches:
station_id agency agency_id name x y lat lon
id
alk alk usgs 374938122251801 San Francisco Bay at Northeast Shore Alcatraz Island 550895.2 4186802.8 37.827222 -122.421667
dum dum usgs 373015122071000 South San Francisco Bay at Dumbarton Bridge 577828.8 4151167.7 37.504000 -122.119000
dumbr dumbr usgs 373025122065901 San Francisco Bay at Old Dumbarton Bridge 578096.0 4151478.4 37.506944 -122.116389
richb richb usgs 375607122264701 San Francisco Bay at Richmond-San Rafael Bridge 548648.5 4198778.5 37.935278 -122.446389
sffpx sffpx noaa 9414290 San Francisco 547094.8 4184503.1 37.806700 -122.465000
sfp17 sfp17 usgs 374811122235001 San Francisco Bay at Pier 17 553143.4 4184169.8 37.803000 -122.397000
If you know the station id and agency (see above), you can get data for individual stations, groups of stations, or even a list:
$ download_nwis --start 2022-01-01 --end 2022-06-01 --stations osj hol --param flow --dest .
Of course this can be done programmatically as well. A discussion can be found here: [REF]
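As a rough illustration, the programmatic equivalent of the command above might look like the following sketch. The module path and argument names (dms_datastore.download_nwis.nwis_download, stations, dest_dir, param) are assumptions here rather than a documented signature, so check the package API before relying on them.

# Hypothetical sketch: the import path and call signature below are assumed.
from datetime import datetime
from dms_datastore.download_nwis import nwis_download  # assumed entry point

# Mirror the command-line example: flow data for stations osj and hol,
# January through May 2022, written to the current directory.
nwis_download(
    stations=["osj", "hol"],
    dest_dir=".",
    start=datetime(2022, 1, 1),
    end=datetime(2022, 6, 1),
    param="flow",
)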
There are three main reading routines (two implemented, one still in development): read_ts, ts_multifile_read, and read_ts_multi. They differ in how they handle filenames containing wildcards that represent time sharding by year (file name ends in _2020.csv) or by blocks of years (_2015_2019.csv):
read_ts is mostly designed around heterogeneous legacy formats. It can read wildcarded file names if the files otherwise share the same basic format (see the sketch after this list).
ts_multifile_read is a wrapper around read_ts that allows mixes of different formats. This would be useful if, for instance, historical data comes from one source/format and real-time data from another.
read_ts_multi assumes the data is all in the dms_datastore standard format and constructs multi-column views quickly.
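For example, a minimal read_ts call over yearly shards might look like the sketch below. The file pattern is a hypothetical placeholder and only the path argument is shown; consult the read_ts docstring for the full, actual signature.

# Hypothetical sketch: the file pattern is a placeholder, not a real repository file.
from dms_datastore.read_ts import read_ts

# A wildcard gathers yearly shards (e.g. station_flow_2020.csv, station_flow_2021.csv)
# into a single time series, provided the files share the same basic format.
ts = read_ts("station_flow_*.csv")
print(ts.head())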