Concepts and Conventions
Overview
The overarching goal of this data organization effort is to retrieve data from data providers and store it in a common format, validate data (screened), and produce data suitable for applications such as boundary conditions (filled/aggregated or derived data), referred to as “processed” data. The system moves away from manually manipulated data in favor of standardized formats for programmatic access.
Data Repository Structure
The centralized data repository is housed in a file system-based share at
<internal shared directory server>\\Modeling_Data\\continuous_station_repo.
A mirrored copy is available at http://tinyurl.com/dmsdatastore.
Data flows through four distinct stages:
- Raw
Data is stored exactly “as downloaded” without transformation or unit changes. Raw files are unique per datastream per time block.
- Formatted
Data adheres to file naming conventions and includes prescribed metadata. Units are generally not changed at this stage.
- Screened
Data has undergone QA/QC processes including flagging data rejected by providers or users. Units are standardized and consistent here.
- Processed
Final stage data may have been filled by algorithms and is ready for specific applications like boundary conditions. These files are not necessarily unique per datastream.
Data Repository Workflow
flowchart LR
subgraph datasources ["Data Sources"]
direction LR
d1["USGS"]
d2["DWR"]
d3["NOAA"]
d4["CDEC"]
d5["USBr"]
end
subgraph dropbox ["Drop Box"]
end
subgraph repograph ["Repository"]
C2["raw/"]
D1["formatted/"]
E1["screened/"]
F1["processed/"]
end
subgraph userqaqc ["User QA/QC"]
H["Flag Editing"]
H-- updates user flag -->E1
end
datasources --> B("Download")
B --> C2
C2 --> D("Format")
D --> D1
D1 --> E("Automated Screening")
E --> E1
E1 --> F("Process")
F --> F1
F1 --> G["Modeling Applications & Boundary Conditions"]
dropbox --> repograph
User Access
Users typically have write access only to “incoming” subdirectories within raw,
screened, and processed directories. Correctly formatted submissions are then
ingested into the main, read-only directories. Users are generally not expected to access
raw data directly; formatted data allows review of original downloaded values, and
screened data includes user flags and consistent units.
Station, Sublocation, and Datastream Concepts
- Station
A well-defined concept tied to a (location, institution) pair. Physical locations may vary slightly and different agencies at the same approximate location may have subtly different platforms. The
station_dbase.csvcontains station information including ID, agency ID, name, latitude, and longitude. These locations are corrected to fit the SCHISM mesh.- Sublocation
Used when a station ID alone does not uniquely identify a datastream — for example, top/bottom sensors or different programs within the same agency measuring the same variable. The
station_subloc.csvtable defines sublocations. Thesublocconcept generalizes depths and other ambiguities.- Datastream
Describes a single sensor and is uniquely identified by the combination of (station, sublocation, variable).
File Naming Conventions
Files follow the pattern:
agency_dwrID[@subloc]_agencyID_variable[_YYYY[_9999]].csv
Example: usgs_sjj@bgc_11337190_turbidity_2016_2020.csv
Components:
- agency
The agency that collects the data, potentially including a high-level program name (e.g.,
dwr_des).- dwr ID and sublocation
The DWR station ID, followed by
@sublocif a sublocation exists (e.g.,anh@north). The@symbol is structural.- agency_id
The identifier used by the collecting agency (e.g.,
11337190for USGS).- variable
Standardized variable name (e.g.,
turbidity,temp,ec@daily).- _YYYY_9999
Time shard.
9999means the file is open-ended (actively updated).
Files use # for comments, , as separator, and ISO/CF compliant timestamps
(e.g., 2009-02-10T00:00). Metadata is included as key: value pairs in the
#-commented header.
Units and Standardization
CF Compliance: Variable names and units are intended to be CF (Climate and Forecast) compliant wherever possible.
Stage and flow: feet (ft) and cubic feet per second (cfs) respectively.
All other variables: SI units (e.g., °C for temperature).
PSU exception: Practical Salinity Unit is technically a ratio and not a true unit.
Specific Conductivity (EC): Always normalized to 25°C (µS/cm at 25°C).
Known Challenges and Exceptions
- WDL Station IDs
WDL station IDs may not match the canonical
station_iddue to appended"00"or"Q"suffixes. An internal alias is used as thestation_idto ensure uniqueness.- SWP/CVP Exports
Exports are calculated differently for hourly vs. instantaneous values, producing distinct datasets. These different calculations are treated as “sublocations”.
- USGS Multiple Instruments
USGS may have multiple instruments measuring the same variable for one station ID due to different programs or sublocations. The
raw/directory can store dual versions for QA/QC, though the processed set should ideally be unified.