Cassandra stores need a table structure strongly tied to the expected query patterns. This makes a completely generic Cassandra sapi-nt engine tricky, and probably not that useful.
The time-series Cassandra engine is designed for use just with time-series data where we impose a standardized expectation of queries and table structure.
The store is assumed to hold a set of time-series, each labelled by some identifier. Each time-series comprises a series of simple valued observations (doubles) with an extensible set of associated attributes. In principle, if the data is not numeric, there’s no reason you can’t put values into the attribute slots and ignore the value slot.
Each observation in the series has:
| Field | Type | Meaning |
|---|---|---|
measurereference |
text | Identifier for the time series |
timestamp |
timestamp | The result time for the observation |
date |
text | The date of the observation (logically redundant with timestamp but allows grouping) |
value |
double | The observed value |
| attribute | text | Each observation can be annotated with an arbitrary set of attribute/value pairs. The values can be serialized RDFNode values, thereby supporting types. Note that in the database these additional attributes are contained by the meta column. |
The identifier for the time series is arbitrary. It typically encodes the measurement station, observed property, qualifiers for the observation and qualifiers for the series (quality controlled, interval, mean/instantaneous etc). Internally we store a simple identifier string, this can be converted to/from an external URI in the API driver.
All metadata describing the time-series and the associated measurement points, observed properties etc are assumed to be held in a separate RDF store. It is up to the driving application to provide an integrated API to locate suitable time series and then retrieve them. The sapi-nt Cassandra time-series engine just provides facilities to query the time-series themselves.
Assumes two tables for indexing by date and by measures.
CREATE TABLE dev.measures_by_date (
date text,
measurereference text,
timestamp timestamp,
meta set<text>,
value double,
PRIMARY KEY (date, measurereference, timestamp)
) WITH CLUSTERING ORDER BY (measurereference ASC, timestamp ASC) ;
CREATE TABLE dev.measures_by_measurereference (
measurereference text,
date text,
timestamp timestamp,
meta set<text>,
value double,
PRIMARY KEY (measurereference, date, timestamp)
) WITH CLUSTERING ORDER BY (date ASC, timestamp ASC);
The meta column is stores all of the attribute value pairs as a set of strings which have the format: attribute=value-encoding.
General list queries over a set of measurements.
Can filter by:
| Field | Filter options |
|---|---|
| measure | Value, set of values |
| date | Value, set of values, relational operation (eg. <) |
| timestamp | Value, set of values, relational operation (eg. <) |
| attribute | Value |
| latest | (only for a defined set of stations) |
| earliest | (only for a defined set of stations) |
View defines the mapping to RDF and the subset of attributes to show.
No ordering specification, order is fixed in the tables.
The data store configuration should specify:
keyspace
by_date table
by_measure table
name of measure column (default, measurereference)
username/password (indirectly via credentials file?)
Observations for set of measures for a date range:
select * from measures_by_station WHERE stationreference IN ('E72539', 'F1508')
and date >= '2017-10-01' and date < '2017-10-03';
Observations for a set of measures since a timestamp. Have to give explicit date.
Doesn’t work for set of dates (IN) or date range (at least if returning collections).
Not much use either skip timestamp ranges or do in client?
select * from measures_by_station WHERE stationreference IN ('E72539', 'F1508')
and date = '2018-01-17' and timestamp > '2018-01-17T12:00:00Z';
Latest observation for a set of measures:
select * from measures_by_station WHERE stationreference IN ('E72539', 'F1508')
GROUP BY stationreference ORDER BY date ASC, timestamp ASC;
Earliest observation for a set of measures:
select * from measures_by_station WHERE stationreference IN ('E72539', 'F1508')
GROUP BY stationreference ORDER BY date DESC, timestamp DESC;
Latest observation for all measures. Can’t see a way to do this. For example, obvious ways is not allowed:
select * from measures_by_station GROUP BY stationreference ORDER BY date ASC, timestamp ASC;
select * from measures GROUP BY date, stationreference ORDER BY date ASC, timestamp ASC;
Without the ORDER BY is allowed but times out.
Latest observation today (or at least a specific day) is OK:
select * from measures WHERE date = '2018-01-17' group by stationreference ORDER BY stationreference, timestamp;
Observations since timestamp for all stations:
select * from measures WHERE date > '2018-01-17' and timestamp > '2018-01-17T12:00:00Z' ALLOW FILTERING;