dms_datastore

dms_datastore is a package for downloading and managing a repository of csv files containing continuous time series data, mostly focused on environmental data for the Bay-Delta.

NOTE: THE DMS_DATASTORE IS STILL UNDER CONSTRUCTION. NO RELEASE HAS BEEN MADE

Introduction

The main functionality includes:

  • Automatic downloading scripts for major data providers.

  • Station lists and a utility to look up information from the station lists.

  • Populating routine that orchestrates downloads into a repository.

  • Readers for downloaded formats.

  • Reformatting and time alignment to repackage time series in a common csv format with metadata headers.

  • Screening routines.

Along the way, the package provides definitions of stations, methods of access, units, and names in a way that encapsulates many of the quirks and ambiguities of individual providers.

Installation

Prerequisites

The repository depends on vtools3, matplotlib, and several downloading, processing and visualization libraries.

Conda

The easiest way to install the library is using conda (typically miniconda). Some hints for using conda tools and channels for our libraries or for doing a developer install are given here [ref]. Assuming you follow those instructions for setting up channels, the minimal installation is:

$ conda install dms_datastore

We recommend that you do a conda installation even if you follow it up with a developer install using pip.

Quickstart: Things You Can Do Quickly with dms_datastore

Look up information on a station using a fragment of its name or its standard id:

$ station_info francisco
Matches:
      station_id agency        agency_id                                                  name         x          y        lat         lon
id
alk          alk   usgs  374938122251801  San Francisco Bay at Northeast Shore Alcatraz Island  550895.2  4186802.8  37.827222 -122.421667
dum          dum   usgs  373015122071000           South San Francisco Bay at Dumbarton Bridge  577828.8  4151167.7  37.504000 -122.119000
dumbr      dumbr   usgs  373025122065901             San Francisco Bay at Old Dumbarton Bridge  578096.0  4151478.4  37.506944 -122.116389
richb      richb   usgs  375607122264701       San Francisco Bay at Richmond-San Rafael Bridge  548648.5  4198778.5  37.935278 -122.446389
sffpx      sffpx   noaa          9414290                                         San Francisco  547094.8  4184503.1  37.806700 -122.465000
sfp17      sfp17   usgs  374811122235001                          San Francisco Bay at Pier 17  553143.4  4184169.8  37.803000 -122.397000
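
The same lookup can be done from Python. The sketch below is not the definitive API: it assumes the station list is available as a pandas DataFrame through a helper in dms_datastore.dstore_config (here called station_dbase); check the package for the exact function name and columns.

# Hedged sketch: search the station list programmatically.
# station_dbase() is assumed to return a pandas DataFrame resembling the
# table above; verify the helper name against the installed package.
from dms_datastore.dstore_config import station_dbase

stations = station_dbase()
matches = stations[stations["name"].str.contains("francisco", case=False, na=False)]
print(matches[["agency", "agency_id", "name", "lat", "lon"]])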

If you know the station id and agency (see above), you can get data for individual or groups of stations or even from a list:

$ download_nwis --start 2022-01-01 --end 2022-06-01 --stations osj hol --param flow --dest .

Of course this can be done programmatically as well. A discussion can be found here: [REF]
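
As a rough programmatic sketch (not the library's confirmed API): it assumes the download_nwis module exposes a function whose arguments mirror the command-line flags above. The name nwis_download and its signature are guesses, so consult dms_datastore.download_nwis or the discussion referenced above for the real interface.

# Hedged sketch of a programmatic download. The function name nwis_download
# and its keyword arguments are assumptions that mirror the CLI flags above;
# check dms_datastore.download_nwis for the actual interface.
from dms_datastore.download_nwis import nwis_download

nwis_download(
    stations=["osj", "hol"],  # station ids, as in the CLI example
    dest_dir=".",             # destination directory
    start="2022-01-01",
    end="2022-06-01",
    param="flow",
)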

There are three main reading routines (two exist now, with one in development): read_ts, read_multi.ts_multifile_read, and read_ts_multi. All of them handle file names containing wildcards that represent time sharding by year (file name ends in _2020.csv) or by blocks of years (_2015_2019.csv), but they differ in the formats they expect; a brief usage sketch follows the list:

  • read_ts is mostly designed around heterogeneous legacy formats. It can read wildcarded file names if the files otherwise share the same basic format.

  • ts_multifile_read is a wrapper around read_ts that allows mixes of different formats. This would be useful if, for instance, historical data comes from one source/format and real-time data from another.

  • read_ts_multi assumes the data are all in the dms_datastore standard format and constructs multi-column views quickly.
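
A minimal sketch of the first case, assuming read_ts lives in a module of the same name and accepts a wildcarded file name; the file pattern below is hypothetical, and any keyword arguments should be checked against the read_ts documentation.

# Hedged sketch: read a series sharded into yearly files such as
# *_2020.csv, *_2021.csv. The file pattern is hypothetical.
from dms_datastore.read_ts import read_ts

ts = read_ts("usgs_osj_flow_*.csv")  # shards must share the same basic format
print(ts.head())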
