Skip to Content
DocumentationCachingGetting started with pre-aggregations

Getting started with pre-aggregations

Often at the beginning of an analytical application’s lifecycle - when there is a smaller dataset that queries execute over - the application works well and delivers responses within acceptable thresholds. However, as the size of the dataset grows, the time-to-response from a user’s perspective can often suffer quite heavily. This is true of both application and purpose-built data warehousing solutions.

This leaves us with a chicken-and-egg problem; application databases can deliver low-latency responses with small-to-large datasets, but struggle with massive analytical datasets; data warehousing solutions usually make no guarantees except to deliver a response, which means latency can vary wildly on a query-to-query basis.

Database TypeLow Latency?Massive Datasets?
Application (Postgres/MySQL)
Analytical (BigQuery/Redshift)

Cube provides a solution to this problem: pre-aggregations. In layman’s terms, a pre-aggregation is a condensed version of the source data. It specifies attributes from the source, which Cube uses to condense (or crunch) the data. This simple yet powerful optimization can reduce the size of the dataset by several orders of magnitude, and ensures subsequent queries can be served by the same condensed dataset if any matching attributes are found.

Pre-aggregations are defined within each cube’s data model, and cubes can have as many pre-aggregations as they require. The pre-aggregated data is stored in Cube Store, a dedicated pre-aggregation storage layer.

Pre-Aggregations without Time Dimension

To illustrate pre-aggregations with an example, let’s use a sample e-commerce database. We have a data model representing all our orders:

cubes: - name: orders sql_table: orders measures: - name: count type: count dimensions: - name: id sql: id type: number primary_key: true - name: status sql: status type: string - name: completed_at sql: completed_at type: time
cube(`orders`, { sql_table: `orders`, measures: { count: { type: `count` } }, dimensions: { id: { sql: `id`, type: `number`, primary_key: true }, status: { sql: `status`, type: `string` }, completed_at: { sql: `completed_at`, type: `time` } } })

Some sample data from this table might look like:

idstatuscompleted_at
1completed2021-02-15T12:21:11.290
2completed2021-02-25T18:15:12.369
3shipped2021-03-15T20:40:57.404
4processing2021-03-13T10:30:21.360
5completed2021-03-10T18:25:32.109

Our first requirement is to populate a dropdown in our front-end application which shows all possible statuses. The Cube query to retrieve this information might look something like:

{ "dimensions": ["orders.status"] }

In that case, we can add the following pre-aggregation to the orders cube:

cubes: - name: orders # ... pre_aggregations: - name: order_statuses dimensions: - status
cube(`orders`, { // ... pre_aggregations: { order_statuses: { dimensions: [status] } } })

Pre-Aggregations with Time Dimension

Using the same data model as before, we are now finding that users frequently query for the number of orders completed per day, and that this query is performing poorly. This query might look something like:

{ "measures": ["orders.count"], "timeDimensions": ["orders.completed_at"] }

In order to improve the performance of this query, we can add another pre-aggregation definition to the orders cube:

cubes: - name: orders # ... pre_aggregations: - name: orders_by_completed_at measures: - count time_dimension: completed_at granularity: month
cube(`orders`, { // ... pre_aggregations: { orders_by_completed_at: { measures: [count], time_dimension: completed_at, granularity: `month` } } })

Note that we have added a granularity property with a value of month to this definition. This allows Cube to aggregate the dataset to a single entry for each month.

The next time the API receives the same JSON query, Cube will build (if it doesn’t already exist) the pre-aggregated dataset, store it in the source database server and use that dataset for any subsequent queries. A sample of the data in this pre-aggregated dataset might look like:

completed_atcount
2021-02-01T00:00:00.0002
2021-03-01T00:00:00.0003

Keeping pre-aggregations up-to-date

Pre-aggregations can become out-of-date or out-of-sync if the original dataset changes. Cube uses a refresh key to check the freshness of the data; if a change in the refresh key is detected, the pre-aggregations are rebuilt. These refreshes are performed in the background as a scheduled process, unless configured otherwise.

Was this page useful?