Cassandra time-series engine

Cassandra stores need a table structure strongly tied to the expected query patterns. This makes a completely generic Cassandra sapi-nt engine tricky, and probably not that useful.

The time-series Cassandra engine is designed for use just with time-series data where we impose a standardized expectation of queries and table structure.

Data

The store is assumed to hold a set of time-series, each labelled by some identifier. Each time-series comprises a series of simple valued observations (doubles) with an extensible set of associated attributes. In principle, if the data is not numeric, there’s no reason you can’t put values into the attribute slots and ignore the value slot.

Each observation in the series has:

Field	Type	Meaning
`measurereference`	text	Identifier for the time series
`timestamp`	timestamp	The result time for the observation
`date`	text	The date of the observation (logically redundant with timestamp but allows grouping)
`value`	double	The observed value
attribute	text	Each observation can be annotated with an arbitrary set of attribute/value pairs. The values can be serialized RDFNode values, thereby supporting types. Note that in the database these additional attributes are contained by the `meta` column.

The identifier for the time series is arbitrary. It typically encodes the measurement station, observed property, qualifiers for the observation and qualifiers for the series (quality controlled, interval, mean/instantaneous etc). Internally we store a simple identifier string, this can be converted to/from an external URI in the API driver.

All metadata describing the time-series and the associated measurement points, observed properties etc are assumed to be held in a separate RDF store. It is up to the driving application to provide an integrated API to locate suitable time series and then retrieve them. The sapi-nt Cassandra time-series engine just provides facilities to query the time-series themselves.

Schema

Assumes two tables for indexing by date and by measures.

CREATE TABLE dev.measures_by_date (
    date text,
    measurereference text,
    timestamp timestamp,
    meta set<text>,
    value double,
    PRIMARY KEY (date, measurereference, timestamp)
) WITH CLUSTERING ORDER BY (measurereference ASC, timestamp ASC) ;

CREATE TABLE dev.measures_by_measurereference (
    measurereference text,
    date text,
    timestamp timestamp,
    meta set<text>,
    value double,
    PRIMARY KEY (measurereference, date, timestamp)
) WITH CLUSTERING ORDER BY (date ASC, timestamp ASC);

The meta column is stores all of the attribute value pairs as a set of strings which have the format: attribute=value-encoding.

Query capabilities

General list queries over a set of measurements.

Can filter by:

Field	Filter options
measure	Value, set of values
date	Value, set of values, relational operation (eg. `<`)
timestamp	Value, set of values, relational operation (eg. `<`)
attribute	Value
latest	(only for a defined set of stations)
earliest	(only for a defined set of stations)

View defines the mapping to RDF and the subset of attributes to show.

No ordering specification, order is fixed in the tables.

Configuration requirements

The data store configuration should specify:

keyspace
by_date table
by_measure table
name of measure column     (default, measurereference)
username/password          (indirectly via credentials file?)

Example Queries

Observations for set of measures for a date range:

select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
and date >= '2017-10-01' and date < '2017-10-03';

Observations for a set of measures since a timestamp. Have to give explicit date. Doesn’t work for set of dates (IN) or date range (at least if returning collections). Not much use either skip timestamp ranges or do in client?

select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
and date = '2018-01-17' and timestamp > '2018-01-17T12:00:00Z';

Latest observation for a set of measures:

select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
    GROUP BY stationreference ORDER BY date ASC, timestamp ASC;

Earliest observation for a set of measures:

select * from measures_by_station  WHERE stationreference IN ('E72539', 'F1508') 
    GROUP BY stationreference ORDER BY date DESC, timestamp DESC;

Latest observation for all measures. Can’t see a way to do this. For example, obvious ways is not allowed:

select * from measures_by_station GROUP BY stationreference ORDER BY date ASC, timestamp ASC;
select * from measures GROUP BY date, stationreference ORDER BY date ASC, timestamp ASC;

Without the ORDER BY is allowed but times out.

Latest observation today (or at least a specific day) is OK:

select * from measures WHERE date = '2018-01-17' group by stationreference ORDER BY  stationreference, timestamp;

Observations since timestamp for all stations:

select * from measures  WHERE date > '2018-01-17' and timestamp > '2018-01-17T12:00:00Z' ALLOW FILTERING;

if timestamp but no date inject data limit/filter
if date range and not large then use IN?
when to use group by ?