Python Soft Skills Tutorials

Aggregate Time Series Data with TimescaleDB4 min read

Jun 28, 2022 3 min

Aggregate Time Series Data with TimescaleDB4 min read

Reading Time: 3 minutes

With millions of API connections every month, we cannot analyze trends and product performance using direct, slow queries on our huge analytics database. Instead, we need a simple means of:

  • aggregate huge amounts of data quickly.
  • produce periodical summaries.

Here comes TimescaleDB’s continuous aggregates. At Mindee, we track and store every API call with Timescaledb, a time-series SQL database, so we can understand how our APIs are performing and find ways to improve them.

TimescaleDB’s continuous aggregates collect data in PostgreSQL’s materialized views, which allows you to run faster queries. They can be automatically refreshed, either continuously or by following a policy, either on the whole database or over a specific range. The advantage is that they are very fast since they only take into account the data that has changed since the last refresh.

Prerequisites

  • TimescaleDB 2+
  • Postgres12+

Create a continuous aggregate

Let’s create a trivial continuous aggregate, which is based on an api_responses table. 

We will aggregate per hour the sum of API success responses, the sum of errors, the average file size, and the maximum computation time for each customer.

CREATE MATERIALIZED VIEW api_responses_hourly_summary
WITH (timescaledb.continuous, timescaledb.materialized_only=true) AS
SELECT time_bucket(INTERVAL '1 hour', created_at) AS time_bucket,
    customer_name AS customer_name,
    COUNT(CASE WHEN response_status = 'success' THEN 1 END) AS successes,
    COUNT(CASE WHEN response_status = 'error' THEN 1 END) AS errors,
    AVG(file_size) AS avg_file_size,
    MAX(processing_time) AS max_processing_time
FROM api_responses
GROUP BY customer_name, time_bucket
WITH NO DATA;

From the above example, the option timescaledb.materialized_only=true disables the real-time aggregation, and the option WITH NO DATA creates the continuous aggregate instantly with no data refresh.

Note: join, filter, order by, and distinct cannot be used in the creation of the aggregate.

In case you want to delete what we just did, the DROP MATERIALIZED VIEW command will delete the aggregate of your choice.

DROP MATERIALIZED VIEW api_responses_hourly_summary;

Add a retention policy

You can add a policy to automatically drop the data of your aggregate after a certain amount of time. For instance, the following command will create a policy that will drop the data of the aggregate after 6 months.

SELECT add_retention_policy('api_responses_hourly_summary', INTERVAL '6 months');

And the following will remove the retention policy.

SELECT remove_retention_policy('api_responses_hourly_summary');

Refresh the continuous aggregate

There are three ways to refresh the data for a continuous aggregate:

  • manually over a selected time window.
  • periodically following an automatic refresh policy.
  • continuously with real-time aggregation.

Manually refresh a continuous aggregate

Right now, our continuous aggregate does not have any data. Let’s fill it with a manual refresh over the last 6 months.

CALL refresh_continuous_aggregate('api_responses_hourly_summary', '2022-01-01', '2022-04-30');

The two dates in the parameters are the start and end of the time window over which the aggregate will be computed. Also, keep in mind that even for this time window, the aggregate will be calculated only for the data that really requires computation.

Since you probably have a lot of data, I suggest starting your first refresh only over a small window of time instead of over everything. In testing, doing multiple little refreshes over little windows was way faster than doing a single one over a large window.

Create a periodic refresh policy

If we want the aggregate to update its data periodically, we can add a refresh policy. The policy has three arguments:

  • start_offset: the beginning of the refresh window, relative to when the refresh runs.
  • end_offset: the end of the refresh window, relative to when the refresh runs.
  • schedule_interval: the interval between the run of two refreshes

The following policy will allow the data to be refreshed every 10 minutes over the last two hours.

SELECT add_continuous_aggregate_policy('api_responses_hourly_summary',
    start_offset => INTERVAL '2 hour',
    end_offset => INTERVAL '0 hour',
    schedule_interval => INTERVAL '10 minutes');

It can then be deleted with the following :

Opt for real-time continuous aggregation

TimescaleDB offers a real-time aggregation option, which, unlike the refresh policy, will let your aggregate always be up-to-date. The following command will enable the real-time aggregation:

ALTER MATERIALIZED VIEW api_responses_hourly_summary set (timescaledb.materialized_only = false);

Query the aggregate

Since TimescaleDB is based on PostgreSQL, continuous aggregates can be queried with any standard SQL commands.

SELECT customer_name, successes, error
FROM api_responses_hourly_summary
WHERE customer_name = 'my_number_one_customer'
GROUP BY time_bucket

Conclusion

When working with data already aggregated over buckets, some metrics like mean, median, or percentiles can be hard to compute. Fortunately, the official TimescaleDB toolkit provides many analytics tools to compute approximates of multiple metrics.

Your email address will not be published.