API
  • API Index
  • Batch API
  • Content Negotiation
  • Dynamic Paths
  • Hierarchy API
  • List API
  • Quick Reference
  • Time Series API
  • Views
    • Configuration
  • Overview
  • Application Configuration
  • Documentation Configuration
  • Spec Configuration
    • Features
  • API Testing
  • Asynchronous Processing
  • Cassandra Time-Series Engine
  • Cassandra Integration
  • Default Controller Customisation
  • Documentation
  • Geospatial Filtering
  • Model Documentation
  • PostgreSQL Integration
  • Excel Spreadsheet Format
    • Tutorials
  • Step-by-Step Guide
  • Archetype Setup
  • Logging
  • Prometheus
  • Sentry Logging
    • UI Customisation
  • UI Customisation
  • SASS Build Chain
  • Shared Layout
  • API Documentation
  • Model Documentation
  • Results
  • sapi-nt v current

    Cassandra time-series engine

    Cassandra stores need a table structure strongly tied to the expected query patterns. This makes a completely generic Cassandra sapi-nt engine tricky, and probably not that useful.

    The time-series Cassandra engine is designed for use just with time-series data where we impose a standardized expectation of queries and table structure.

    Data

    The store is assumed to hold a set of time-series, each labelled by some identifier. Each time-series comprises a series of simple valued observations (doubles) with an extensible set of associated attributes. In principle, if the data is not numeric, there’s no reason you can’t put values into the attribute slots and ignore the value slot.

    Each observation in the series has:

    Field Type Meaning
    measurereference text Identifier for the time series
    timestamp timestamp The result time for the observation
    date text The date of the observation (logically redundant with timestamp but allows grouping)
    value double The observed value
    attribute text Each observation can be annotated with an arbitrary set of attribute/value pairs. The values can be serialized RDFNode values, thereby supporting types. Note that in the database these additional attributes are contained by the meta column.

    The identifier for the time series is arbitrary. It typically encodes the measurement station, observed property, qualifiers for the observation and qualifiers for the series (quality controlled, interval, mean/instantaneous etc). Internally we store a simple identifier string, this can be converted to/from an external URI in the API driver.

    All metadata describing the time-series and the associated measurement points, observed properties etc are assumed to be held in a separate RDF store. It is up to the driving application to provide an integrated API to locate suitable time series and then retrieve them. The sapi-nt Cassandra time-series engine just provides facilities to query the time-series themselves.

    Schema

    Assumes two tables for indexing by date and by measures.

    CREATE TABLE dev.measures_by_date (
        date text,
        measurereference text,
        timestamp timestamp,
        meta set<text>,
        value double,
        PRIMARY KEY (date, measurereference, timestamp)
    ) WITH CLUSTERING ORDER BY (measurereference ASC, timestamp ASC) ;
    
    CREATE TABLE dev.measures_by_measurereference (
        measurereference text,
        date text,
        timestamp timestamp,
        meta set<text>,
        value double,
        PRIMARY KEY (measurereference, date, timestamp)
    ) WITH CLUSTERING ORDER BY (date ASC, timestamp ASC);
    
    

    The meta column is stores all of the attribute value pairs as a set of strings which have the format: attribute=value-encoding.

    Query capabilities

    General list queries over a set of measurements.

    Can filter by:

    Field Filter options
    measure Value, set of values
    date Value, set of values, relational operation (eg. <)
    timestamp Value, set of values, relational operation (eg. <)
    attribute Value
    latest (only for a defined set of stations)
    earliest (only for a defined set of stations)

    View defines the mapping to RDF and the subset of attributes to show.

    No ordering specification, order is fixed in the tables.

    Configuration requirements

    The data store configuration should specify:

    keyspace
    by_date table
    by_measure table
    name of measure column     (default, measurereference)
    username/password          (indirectly via credentials file?)
    

    Example Queries

    Observations for set of measures for a date range:

    select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
    and date >= '2017-10-01' and date < '2017-10-03';
    

    Observations for a set of measures since a timestamp. Have to give explicit date. Doesn’t work for set of dates (IN) or date range (at least if returning collections). Not much use either skip timestamp ranges or do in client?

    select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
    and date = '2018-01-17' and timestamp > '2018-01-17T12:00:00Z';
    

    Latest observation for a set of measures:

    select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
        GROUP BY stationreference ORDER BY date ASC, timestamp ASC;
    

    Earliest observation for a set of measures:

    select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
        GROUP BY stationreference ORDER BY date DESC, timestamp DESC;
    

    Latest observation for all measures. Can’t see a way to do this. For example, obvious ways is not allowed:

    select * from measures_by_station GROUP BY stationreference ORDER BY date ASC, timestamp ASC;
    select * from measures GROUP BY date, stationreference ORDER BY date ASC, timestamp ASC;
    

    Without the ORDER BY is allowed but times out.

    Latest observation today (or at least a specific day) is OK:

    select * from measures WHERE date = '2018-01-17' group by stationreference ORDER BY  stationreference, timestamp;
    

    Observations since timestamp for all stations:

    select * from measures  WHERE date > '2018-01-17' and timestamp > '2018-01-17T12:00:00Z' ALLOW FILTERING;