Concepts and Conventions

Overview

The goal of this data organization effort is to retrieve data from providers and store it in a common format, to validate it (the “screened” data), and to produce data suitable for applications such as boundary conditions (filled, aggregated, or derived data), referred to as “processed” data. The system replaces manually manipulated data with standardized formats designed for programmatic access.

Data Repository Structure

The centralized data repository is housed in a file system-based share at <internal shared directory server>\Modeling_Data\continuous_station_repo. A mirrored copy is available at http://tinyurl.com/dmsdatastore.

Data flows through four distinct stages:

Raw: Data is stored exactly “as downloaded” without transformation or unit changes. Raw files are unique per datastream per time block.
Formatted: Data adheres to file naming conventions and includes prescribed metadata. Units are generally not changed at this stage.
Screened: Data has undergone QA/QC, including flagging of data rejected by providers or users. Units are standardized and consistent at this stage.
Processed: Final-stage data that may have been filled by algorithms and is ready for specific applications such as boundary conditions. These files are not necessarily unique per datastream.
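
The four stages can be modeled in code as an ordered enumeration. The following is a minimal sketch, not part of the repository tooling: the directory names come from the workflow diagram (raw/, formatted/, screened/, processed/), while `Stage`, `next_stage`, `stage_dir`, and the `repo_root` argument are hypothetical names introduced for illustration.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Stage(Enum):
    """Repository stages; values match the top-level directory names."""
    RAW = "raw"
    FORMATTED = "formatted"
    SCREENED = "screened"
    PROCESSED = "processed"


# Order in which data moves through the repository.
STAGE_ORDER = [Stage.RAW, Stage.FORMATTED, Stage.SCREENED, Stage.PROCESSED]


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    idx = STAGE_ORDER.index(stage)
    return STAGE_ORDER[idx + 1] if idx + 1 < len(STAGE_ORDER) else None


def stage_dir(repo_root: str, stage: Stage) -> PurePosixPath:
    """Build the directory path for a stage under a (hypothetical) repo root."""
    return PurePosixPath(repo_root) / stage.value
```

For example, `next_stage(Stage.FORMATTED)` returns `Stage.SCREENED`, which mirrors the Format, Automated Screening, Process sequence in the workflow.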

Data Repository Workflow

flowchart LR
    subgraph datasources ["Data Sources"]
        direction LR
        d1["USGS"]
        d2["DWR"]
        d3["NOAA"]
        d4["CDEC"]
        d5["USBr"]
    end
    subgraph dropbox ["Drop Box"]
    end
    subgraph repograph ["Repository"]
        C2["raw/"]
        D1["formatted/"]
        E1["screened/"]
        F1["processed/"]
    end
    subgraph userqaqc ["User QA/QC"]
        H["Flag Editing"]
        H -- updates user flag --> E1
    end
    datasources --> B("Download")
    B --> C2
    C2 --> D("Format")
    D --> D1
    D1 --> E("Automated Screening")
    E --> E1
    E1 --> F("Process")
    F --> F1
    F1 --> G["Modeling Applications & Boundary Conditions"]

    dropbox --> repograph

User Access

Users typically have write access only to “incoming” subdirectories within raw, screened, and processed directories. Correctly formatted submissions are then ingested into the main, read-only directories. Users are generally not expected to access raw data directly; formatted data allows review of original downloaded values, and screened data includes user flags and consistent units.

Station, Sublocation, and Datastream Concepts

Station: A well-defined concept tied to a (location, institution) pair. Physical locations may vary slightly, and different agencies at the same approximate location may have subtly different platforms. The station_dbase.csv table contains station information including ID, agency ID, name, latitude, and longitude. The listed locations are adjusted to fit the SCHISM mesh.
Sublocation: Used when a station ID alone does not uniquely identify a datastream, for example top/bottom sensors or different programs within the same agency measuring the same variable. The station_subloc.csv table defines sublocations. The sublocation concept generalizes depth and other sources of ambiguity.
Datastream: Describes a single sensor and is uniquely identified by the combination of (station, sublocation, variable).
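
The (station, sublocation, variable) identity can be expressed as an immutable, hashable record. This is a minimal sketch under stated assumptions: the class name `Datastream`, its field names, and the use of `None` for "no sublocation" are illustrative choices, not the repository's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Datastream:
    """One sensor, identified by (station, sublocation, variable)."""
    station: str                   # station ID, e.g. "sjj"
    variable: str                  # standardized variable name, e.g. "turbidity"
    subloc: Optional[str] = None   # None when the station needs no sublocation

    @property
    def key(self) -> Tuple[str, Optional[str], str]:
        """The identifying triple in (station, sublocation, variable) order."""
        return (self.station, self.subloc, self.variable)


# Two sensors at the same station measuring the same variable stay distinct
# because their sublocations differ ("upper"/"lower" are hypothetical names).
upper = Datastream("anh", "ec", "upper")
lower = Datastream("anh", "ec", "lower")
assert upper != lower
```

Because the dataclass is frozen, instances are hashable and can serve as dictionary keys or set members when indexing files by datastream.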

File Naming Conventions

Files follow the pattern:

agency_dwrID[@subloc]_agencyID_variable[_YYYY[_9999]].csv

Example: usgs_sjj@bgc_11337190_turbidity_2016_2020.csv

Components:

agency: The agency that collects the data, potentially including a high-level program name (e.g., dwr_des).
dwrID[@subloc]: The DWR station ID, followed by @subloc if a sublocation exists (e.g., anh@north). The @ symbol is structural: it separates the station ID from the sublocation.
agencyID: The identifier used by the collecting agency (e.g., 11337190 for USGS).
variable: Standardized variable name (e.g., turbidity, temp, ec@daily).
_YYYY_9999: Time shard, given as a start year and an end year (as in 2016_2020 in the example above). An end year of 9999 means the file is open-ended (actively updated).
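
The naming pattern can be parsed programmatically. The sketch below splits a file name from the right, which tolerates underscores in the agency field (e.g., dwr_des). It assumes that the station ID, agency ID, and variable contain no underscores and that no variable name is a bare four-digit number; these assumptions, along with the names `ParsedName` and `parse_filename`, are illustrative and not confirmed by the repository tooling.

```python
import re
from typing import NamedTuple, Optional

_YEAR = re.compile(r"^\d{4}$")


class ParsedName(NamedTuple):
    agency: str
    station_id: str
    subloc: Optional[str]
    agency_id: str
    variable: str
    start_year: Optional[int]
    end_year: Optional[int]   # 9999 marks an open-ended file


def parse_filename(fname: str) -> ParsedName:
    """Parse agency_dwrID[@subloc]_agencyID_variable[_YYYY[_9999]].csv."""
    if not fname.endswith(".csv"):
        raise ValueError(f"not a .csv file: {fname}")
    parts = fname[: -len(".csv")].split("_")

    # Peel off up to two trailing four-digit tokens as the time shard.
    years = []
    while parts and len(years) < 2 and _YEAR.match(parts[-1]):
        years.insert(0, int(parts.pop()))

    if len(parts) < 4:
        raise ValueError(f"too few fields in: {fname}")

    variable = parts.pop()
    agency_id = parts.pop()
    station_id, _, subloc = parts.pop().partition("@")
    agency = "_".join(parts)   # whatever remains, e.g. "dwr_des"

    start = years[0] if years else None
    end = years[1] if len(years) == 2 else None
    return ParsedName(agency, station_id, subloc or None,
                      agency_id, variable, start, end)


parsed = parse_filename("usgs_sjj@bgc_11337190_turbidity_2016_2020.csv")
```

Here `parsed.station_id` is `"sjj"`, `parsed.subloc` is `"bgc"`, and `parsed.end_year == 9999` would identify an actively updated file.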

Files use # for comments, a comma (,) as the separator, and ISO 8601/CF-compliant timestamps (e.g., 2009-02-10T00:00). Metadata is included as key: value pairs in the #-commented header.
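
A reader for this layout needs only the standard library. The following sketch separates the commented header from the data rows and parses the timestamps. The sample content, the metadata keys (`station_id`, `unit`), and the column names (`datetime`, `value`) are hypothetical placeholders; actual files may use different keys and columns.

```python
import csv
from datetime import datetime

# Hypothetical sample; real headers and columns may differ.
SAMPLE = """\
# station_id: sjj
# unit: FNU
datetime,value
2009-02-10T00:00,3.2
2009-02-10T00:15,3.4
"""


def read_station_csv(text, time_column="datetime"):
    """Return (metadata dict, list of row dicts) from the file's text."""
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Split on the first colon only, so values may contain colons.
            key, sep, value = stripped.lstrip("#").partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
        else:
            data_lines.append(stripped)

    rows = []
    for row in csv.DictReader(data_lines):
        # ISO timestamps such as 2009-02-10T00:00 parse directly.
        row[time_column] = datetime.fromisoformat(row[time_column])
        rows.append(row)
    return metadata, rows


metadata, rows = read_station_csv(SAMPLE)
```

Values are left as strings here; a caller would convert them to numbers once the column layout of the actual files is known.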

Units and Standardization

CF Compliance: Variable names and units are intended to be CF (Climate and Forecast) compliant wherever possible.
Stage and flow: feet (ft) and cubic feet per second (cfs), respectively.
All other variables: SI units (e.g., °C for temperature).
PSU exception: The Practical Salinity Unit (PSU) is technically a ratio rather than a true unit.
Specific Conductivity (EC): Always normalized to 25°C (µS/cm at 25°C).
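
The unit conventions above imply a conversion step when a provider reports in other units. The sketch below shows one way to express it, assuming hypothetical variable keys (`stage`, `flow`, `temp`) and unit strings (`m`, `cms`, `degF`, and so on). The conversion factors themselves are standard (1 ft = 0.3048 m exactly); the table and function names are illustrative, not the repository's implementation.

```python
FT_PER_M = 1.0 / 0.3048          # exact definition: 1 ft = 0.3048 m
CFS_PER_CMS = FT_PER_M ** 3      # 1 m^3/s is about 35.3147 ft^3/s

# Target units per variable, following the conventions in this section.
STANDARD_UNITS = {"stage": "ft", "flow": "cfs", "temp": "degC"}


def _identity(x):
    return x


# (variable, source unit) -> function converting to the standard unit.
CONVERTERS = {
    ("stage", "ft"): _identity,
    ("stage", "m"): lambda x: x * FT_PER_M,
    ("flow", "cfs"): _identity,
    ("flow", "cms"): lambda x: x * CFS_PER_CMS,
    ("temp", "degC"): _identity,
    ("temp", "degF"): lambda f: (f - 32.0) * 5.0 / 9.0,
}


def to_standard(variable, value, unit):
    """Convert `value` to the standard unit; return (value, unit)."""
    try:
        convert = CONVERTERS[(variable, unit)]
    except KeyError:
        raise ValueError(f"no conversion for {variable!r} in {unit!r}") from None
    return convert(value), STANDARD_UNITS[variable]
```

Raising an error for unknown (variable, unit) pairs, rather than passing values through, keeps unconverted data from silently entering the screened stage.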

Known Challenges and Exceptions

WDL Station IDs: WDL station IDs may not match the canonical station_id due to appended "00" or "Q" suffixes. An internal alias is used as the station_id to ensure uniqueness.
SWP/CVP Exports: Exports are calculated differently for hourly vs. instantaneous values, producing distinct datasets. These different calculations are treated as “sublocations”.
USGS Multiple Instruments: USGS may have multiple instruments measuring the same variable for one station ID due to different programs or sublocations. The raw/ directory can store dual versions for QA/QC, though the processed set should ideally be unified.