Time Series Repositories
This page describes the design of the repository system for developers and advanced users who need to build inventory tools, implement downloaders, or integrate with registry metadata.
Core Concepts
The repository system is built around four elements:
Repository Configuration: describes how a directory of files becomes structured data
Filename Interpretation: maps filenames ↔ structured metadata
Registry (station_dbase.csv): authoritative station metadata
Inventory & I/O: reading, writing, and summarising datasets
These form a strict, deterministic pipeline:
files → interpret_fname → metadata → registry join → inventory / read_ts_repo
Repository Configuration
A repo configuration defines how a directory of files is interpreted as structured data.
Example dstore_config.yaml repo block:
root: //.../repo/continuous/formatted
registry: continuous
provider_key: source
provider_resolution_mode: by_registry_column
filename_templates:
  - "{source}_{station_id@subloc}_{agency_id}_{param}_{year}.csv"
search:
  use_source_slot: true
  shard_style: auto
parse:
  style: legacy
Required fields: root, registry, provider_key,
provider_resolution_mode, filename_templates. Missing any of these is an error.
The site identity column is always station_id and is not configurable.
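The required-field rule can be checked with a few lines of Python. The following is a minimal sketch, not the package's actual validator: the helper name validate_repo_config and the choice of ValueError are assumptions made for illustration, and the config is represented as a plain dict as it might look after YAML loading.

```python
# Field names are taken from the "Required fields" list above.
REQUIRED_FIELDS = (
    "root",
    "registry",
    "provider_key",
    "provider_resolution_mode",
    "filename_templates",
)


def validate_repo_config(name, cfg):
    """Hypothetical helper: raise ValueError if a repo block lacks a required field."""
    missing = [field for field in REQUIRED_FIELDS if field not in cfg]
    if missing:
        raise ValueError(
            f"repo '{name}' is missing required fields: {', '.join(missing)}"
        )
    return cfg


# A repo block mirroring the example above (the root path is elided in the source).
example_cfg = {
    "root": "//.../repo/continuous/formatted",
    "registry": "continuous",
    "provider_key": "source",
    "provider_resolution_mode": "by_registry_column",
    "filename_templates": [
        "{source}_{station_id@subloc}_{agency_id}_{param}_{year}.csv"
    ],
}

validate_repo_config("formatted", example_cfg)  # passes silently
```

Raising on the first validation pass, rather than filling in defaults, matches the "missing any of these is an error" rule.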
Key terminology
station_id: Universal identity column for all sites (stations, structures, synthetic locations). Used for registry joins, inventory grouping, and read lookups.
provider_key: Data provenance (e.g., source, formerly agency). Used for distinguishing file families and applying priority ordering.
provider_resolution_mode: Defines how conflicts are resolved when multiple providers supply data for the same station. Typical mode: by_registry_column.
Source Priority Configuration
The source_priority block in dstore_config.yaml specifies preferred data
sources per agency-managed station group:
source_priority:
  ncro: ['ncro','cdec']
  dwr_ncro: ['ncro']
  des: ['des']
  dwr_des: ['des']
  usgs: ['usgs']
  noaa: ['noaa']
  usbr: ['cdec']
  dwr_om: ['cdec']
  dwr: ['cdec']
  ebmud: ['usgs','ebmud','cdec']
For example, EBMUD station data is resolved by preferring USGS, then EBMUD, then CDEC.
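One way to read a priority list is "use the first listed provider that actually has files". The sketch below illustrates that reading only; the function name pick_provider and its error types are hypothetical, and the real resolution logic may combine providers differently.

```python
# A subset of the source_priority block shown above.
SOURCE_PRIORITY = {
    "ncro": ["ncro", "cdec"],
    "usbr": ["cdec"],
    "ebmud": ["usgs", "ebmud", "cdec"],
}


def pick_provider(agency, available, priority=SOURCE_PRIORITY):
    """Hypothetical helper: return the highest-priority provider that has data.

    `available` is the set of providers for which files were found.
    """
    if agency not in priority:
        # No implicit default: an unknown agency is an error.
        raise KeyError(f"no source_priority entry for agency '{agency}'")
    for provider in priority[agency]:
        if provider in available:
            return provider
    raise LookupError(
        f"none of {priority[agency]} has data for agency '{agency}'"
    )


# USGS files are absent here, so the next entry in the EBMUD list is chosen.
chosen = pick_provider("ebmud", {"ebmud", "cdec"})
```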
Filename Templates and Interpretation
Filename templates define the bidirectional mapping between metadata and filenames.
Example template:
{source}_{station_id@subloc}_{agency_id}_{param}_{year}.csv
Given usgs_dsj_11313433_ec_2020.csv, interpretation recovers:
source: usgs
station_id: dsj
subloc: default
agency_id: 11313433
param: ec
year: 2020
Design rules:
Parsing is structural, not heuristic.
Templates must support both rendering and interpretation.
A filename that matches no template is an error.
The @ in station_id@subloc is structural; subloc defaults to "default" when absent.
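The design rules above can be illustrated by compiling a template into a regular expression and using the same template to render a filename. This is a sketch, not the package's interpret_fname: the function names are hypothetical, fields are assumed to contain neither "_" nor "@", and every value (including year) is kept as a string.

```python
import re

TEMPLATE = "{source}_{station_id@subloc}_{agency_id}_{param}_{year}.csv"
SLOT = re.compile(r"\{([^{}]+)\}")


def template_to_regex(template):
    """Turn each {slot} into a named group; a {main@sub} slot gets an optional @sub part."""
    parts = SLOT.split(template)  # even indices: literal text, odd indices: slot names
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        elif "@" in part:
            main, sub = part.split("@", 1)
            pattern += f"(?P<{main}>[^_@]+)(?:@(?P<{sub}>[^_@]+))?"
        else:
            pattern += f"(?P<{part}>[^_@]+)"
    return re.compile(pattern)


def interpret_fname_sketch(fname, template=TEMPLATE):
    """Filename -> metadata. A filename that does not match is an error."""
    match = template_to_regex(template).fullmatch(fname)
    if match is None:
        raise ValueError(f"{fname!r} does not match template {template!r}")
    # Only the optional sub-location group can be None; it becomes "default".
    return {k: ("default" if v is None else v) for k, v in match.groupdict().items()}


def render_fname_sketch(meta, template=TEMPLATE):
    """Metadata -> filename, omitting the @sub part when it is "default"."""
    def fill(match):
        slot = match.group(1)
        if "@" in slot:
            main, sub = slot.split("@", 1)
            suffix = "" if meta[sub] == "default" else f"@{meta[sub]}"
            return f"{meta[main]}{suffix}"
        return str(meta[slot])
    return SLOT.sub(fill, template)


meta = interpret_fname_sketch("usgs_dsj_11313433_ec_2020.csv")
```

Because interpretation and rendering are derived from one template string, a round trip through both functions returns the original filename.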
Registry (station_dbase.csv)
The registry provides authoritative metadata that enriches filename-derived fields.
Flow:
filename → parsed metadata → registry join → enriched metadata
The registry is authoritative; filenames are operational identifiers only.
Registry data provides spatial, descriptive, and relationship metadata.
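The enrichment step amounts to a keyed lookup on station_id. The sketch below uses only the standard library and an in-memory CSV; the registry columns (name, agency) and the row values are placeholders invented for illustration, and the helper names are hypothetical. Letting registry values override filename-derived ones is one reading of "the registry is authoritative", stated here as an assumption.

```python
import csv
import io

# Placeholder registry content; real station_dbase.csv columns may differ.
REGISTRY_CSV = """station_id,name,agency
dsj,Placeholder station name,usgs
"""


def load_registry(text):
    """Index registry rows by station_id."""
    return {row["station_id"]: row for row in csv.DictReader(io.StringIO(text))}


def enrich(parsed, registry):
    """Join filename-derived metadata with the registry row for its station."""
    station = parsed["station_id"]
    if station not in registry:
        # Fail fast rather than returning unenriched metadata.
        raise KeyError(f"station_id '{station}' is not in the registry")
    enriched = dict(parsed)             # leave the caller's dict untouched
    enriched.update(registry[station])  # assumption: registry wins on overlap
    return enriched


registry = load_registry(REGISTRY_CSV)
parsed = {"source": "usgs", "station_id": "dsj", "subloc": "default", "param": "ec"}
enriched = enrich(parsed, registry)
```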
Inventory System
Inventory converts repository files into structured summaries.
File inventory (repo_file_inventory)
Groups by file_pattern. Represents:
Physical file families
Shard coverage (years)
Provider-specific datasets
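As a rough illustration of grouping by file family, the sketch below collapses the year shard of each filename into a wildcard and collects the years covered. It assumes that a file_pattern is simply the filename with the trailing year replaced by "*", which may not match how repo_file_inventory builds its patterns; the function name and the list of filenames are illustrative.

```python
import re
from collections import defaultdict

YEAR_SHARD = re.compile(r"(?P<stem>.+)_(?P<year>\d{4})\.csv")


def file_inventory_sketch(fnames):
    """Group filenames by pattern and list the years each pattern covers."""
    coverage = defaultdict(list)
    for fname in fnames:
        match = YEAR_SHARD.fullmatch(fname)
        if match is None:
            raise ValueError(f"{fname!r} has no recognizable year shard")
        pattern = f"{match['stem']}_*.csv"
        coverage[pattern].append(int(match["year"]))
    return {pattern: sorted(years) for pattern, years in coverage.items()}


inventory = file_inventory_sketch(
    [
        "usgs_dsj_11313433_ec_2020.csv",
        "usgs_dsj_11313433_ec_2019.csv",
        "cdec_dsj_dsj_ec_2020.csv",  # illustrative second provider
    ]
)
```

Each key corresponds to one provider-specific file family, so the same station and parameter appear once per provider.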
Data inventory (repo_data_inventory)
Groups by series_id. Represents:
Unique logical time series, independent of provider.
A series_id is constructed from metadata:
[provider?] | site | subloc | param | modifier?
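A small sketch of assembling such an identifier follows. The function name make_series_id and the "|" separator without surrounding spaces are assumptions; the source only gives the field order and marks provider and modifier as optional.

```python
def make_series_id(site, subloc, param, provider=None, modifier=None):
    """Join the fields in the documented order, skipping absent optional ones."""
    parts = [provider, site, subloc, param, modifier]
    return "|".join(part for part in parts if part)


logical_id = make_series_id("dsj", "default", "ec")
with_provider = make_series_id("dsj", "default", "ec", provider="usgs")
```

Leaving the provider out yields one identifier per logical series, which is what lets the data inventory group files from several providers together.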
| Inventory type | Groups by | Purpose |
|---|---|---|
| File inventory | file_pattern | Filesystem view |
| Data inventory | series_id | Logical dataset view |
Populating the Repository
The dms_datastore Command Reference page documents the full CLI workflow. A summary:
populate_repo --dest <raw_dir>
reformat --inpath <raw_dir> --outpath <formatted_dir>
usgs_multi --fpath <formatted_dir>
auto_screen --fpath <formatted_dir> --dest <screened_dir>
read_ts_repo
The canonical function for reading a dataset from the repository.
from dms_datastore.read_multi import read_ts_repo
data = read_ts_repo("dsj", "ec", repo="formatted")
Responsibilities:
Resolve request → metadata search
Match filename patterns
Apply provider priority
Read files and merge time shards
Return time series
Design principles:
Deterministic — same inputs always produce same outputs
No guessing — no implicit provider defaults
No silent fallback — fail on ambiguity
See Reading Time Series and Metadata for full parameter reference and usage examples.
write_ts_csv
Canonical writer for repository CSV files.
Metadata modes
None: Creates a minimal header (format + timestamp only).
dict (preferred): Structured metadata dict. Must include or receive format. The dict is not mutated.
string: Legacy/manual header, for migration purposes.
Guarantees: stable formatting, idempotent round-trip, canonical YAML header.
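The three metadata modes can be pictured as a simple dispatch on the argument type. The sketch below is illustrative only and is not write_ts_csv or prep_header: the function name, the "# key: value" line layout, the date_formatted key, and the default format tag and timestamp are all placeholders chosen so the example is deterministic.

```python
def header_sketch(metadata=None, format_tag="example-format",
                  timestamp="2020-01-01T00:00:00"):
    """Return header text for metadata given as None, a dict, or a string."""
    if isinstance(metadata, str):
        # Legacy mode: the caller's header text is used as given.
        return metadata
    if metadata is None:
        fields = {}
    elif isinstance(metadata, dict):
        fields = dict(metadata)  # copy, so the caller's dict is not mutated
    else:
        raise TypeError("metadata must be None, a dict, or a str")
    # Supply format and timestamp only when the caller did not provide them.
    fields.setdefault("format", format_tag)
    fields.setdefault("date_formatted", timestamp)
    # format first, remaining keys sorted, so repeated calls give identical text.
    ordered = ["format"] + sorted(key for key in fields if key != "format")
    return "".join(f"# {key}: {fields[key]}\n" for key in ordered)


user_meta = {"param": "ec"}
minimal = header_sketch()
full = header_sketch(user_meta)
```

Fixing the key order is what makes the output stable across repeated writes of the same metadata.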
End-to-End Flow
Read path:
request → read_ts_repo → pattern search → interpret_fname → read_ts → merged time series
Inventory path:
files → interpret_fname → metadata dataframe → groupby → registry join → inventory output
Write path:
time series + metadata → write_ts_csv → prep_header → CSV file
Design Principles
- Fail fast
Bad filenames → error. Bad config → error. No implicit recovery.
- No hidden behavior
No implicit defaults, no guessing providers.
- Separation of concerns
| Concern | Component |
|---|---|
| Filename parsing | interpret_fname |
| Metadata enrichment | registry |
| Dataset lookup | read_ts_repo |
| File writing | write_ts_csv |
Architectural Evolution Notes
The repository system was refactored from an implicit agency-based model to a fully config-driven provider model. Key terminology changes:
| Old term | New term | Notes |
|---|---|---|
| agency | provider | Generalizes provenance |
| key_column | station_id (hardcoded) | Universal identity column |
| source_priority | provider_resolution_mode | Config-driven |
Configs must define provider_key and
provider_resolution_mode. The site identity column (station_id) is
hardcoded and no longer configurable. Misconfigured repos fail immediately —
there is no fallback behavior.
The parse.style = legacy option remains available for backward compatibility.