Dropbox Data Processing System

Overview

The Dropbox Data Processing System is a component of the DMS Datastore package that facilitates collection, transformation, and storage of time-series data using a configuration-driven workflow.

Key Components

dropbox_data.py

Main processing script that reads a YAML specification file and processes data according to the defined rules.
dropbox_spec.yaml

YAML configuration that defines data sources, collection parameters, and metadata inference rules.

How it works

The processing follows these steps:

Read a YAML specification.
For each entry: locate files, read time-series, augment with metadata, and write standardized output files.

Usage

Basic usage from Python:

from dms_datastore.dropbox_data import dropbox_data
dropbox_data("path/to/dropbox_spec.yaml")

Or run as a module:

python -m dms_datastore.dropbox_data

Configuration Specification

Typical keys in the spec include dropbox_home, dest, and a data list with entries containing collect and metadata. The spec also supports metadata_infer rules using regular expressions.

Example configuration snippet:

- name: USGS Aquarius flows
  skip: False
  collect:
    name: file_search
    recursive_search: True
    file_pattern: "Discharge.ft^3_s.velq@*.EntireRecord.csv"
    location: "//cnrastore-bdo/.../dropbox/usgs_aquarius_request_2020/**"
    reader: read_ts
  metadata:
    station_id: infer_from_agency_id
    source: aquarius
    agency: usgs
    param: flow
    sublocation: default
    unit: ft^3/s
  metadata_infer:
    regex: .*@(.*)\.EntireRecord.csv
    groups:
      1: agency_id

Key Classes and Functions

DataCollector — handles file discovery based on patterns.
get_spec — loads and caches the YAML spec.
populate_meta — enriches metadata using the station database.
infer_meta — extracts metadata from filenames via regex.

Output

Processed files are saved in the destination directory. Filenames follow the pattern {source}_{station_id}_{agency_id}_{param}.csv and may be chunked by year.

Additional Notes

Relies on a station database for lookups.
Standardizes time-series to include a value column and metadata headers.
Files may be year-sharded for easier management.