description |
---|
This page describes the indexing techniques available in Apache Pinot. |
In this page you will learn what a star-tree index is and gain a conceptual understanding of how one works.
Unlike other index techniques which work on a single column, the star-tree index is built on multiple columns and utilizes pre-aggregated results to significantly reduce the number of values to be processed, resulting in improved query performance.
One of the biggest challenges in real-time OLAP systems is achieving and maintaining tight SLAs on latency and throughput on large data sets. Existing techniques such as sorted index or inverted index help improve query latencies, but speed-ups are still limited by the number of documents that need to be processed to compute results. On the other hand, pre-aggregating the results ensures a constant upper bound on query latencies, but can lead to storage space explosion.
Use the star-tree index to utilize pre-aggregated documents to achieve both low query latencies and efficient use of storage space for aggregation and group-by queries.
{% embed url="https://www.youtube.com/watch?v=bwO0HSXguFA" %}
Consider the following data set, which is used here as an example to discuss these indexes:
Country | Browser | Locale | Impressions |
---|---|---|---|
CA | Chrome | en | 400 |
CA | Firefox | fr | 200 |
MX | Safari | es | 300 |
MX | Safari | en | 100 |
USA | Chrome | en | 600 |
USA | Firefox | es | 200 |
USA | Firefox | en | 400 |
In this approach, data is sorted on a primary key, which is likely to appear as filter in most queries in the query set.
This reduces the time to search the documents for a given primary key value from linear scan O(n) to binary search O(logn), and also keeps good locality for the documents selected.
While this is a significant improvement over linear scan, there are still a few issues with this approach:
- While sorting on one column does not require additional space, sorting on additional columns requires additional storage space to re-index the records for the various sort orders.
- While search time is reduced from O(n) to O(logn), overall latency is still a function of the total number of documents that need to be processed to answer a query.
In this approach, for each value of a given column, we maintain a list of document id’s where this value appears.
Below are the inverted indexes for columns ‘Browser’ and ‘Locale’ for our example data set:
Browser | Doc Id |
---|---|
Firefox | 1,5,6 |
Chrome | 0,4 |
Safari | 2,3 |
Locale | Doc Id |
---|---|
en | 0,3,4,6 |
es | 2,5 |
fr | 1 |
For example, if we want to get all the documents where ‘Browser’ is ‘Firefox’, we can look up the inverted index for ‘Browser’ and identify that it appears in documents [1, 5, 6].
Using an inverted index, we can reduce the search time to constant time O(1). The query latency, however, is still a function of the selectivity of the query: it increases with the number of documents that need to be processed to answer the query.
In this technique, we pre-compute the answer for a given query set upfront.
In the example below, we have pre-aggregated the total impressions for each country:
Country | Impressions |
---|---|
CA | 600 |
MX | 400 |
USA | 1200 |
With this approach, answering queries about total impressions for a country is a value lookup, because we have eliminated the need to process a large number of documents. However, to be able to answer queries that have multiple predicates means we would need to pre-aggregate for various combinations of different dimensions, which leads to an exponential increase in storage space.
On one end of the spectrum we have indexing techniques that improve search times with a limited increase in space, but don't guarantee a hard upper bound on query latencies. On the other end of the spectrum, we have pre-aggregation techniques that offer a hard upper bound on query latencies, but suffer from exponential explosion of storage space
The star-tree data structure offers a configurable trade-off between space and time and lets us achieve a hard upper bound for query latencies for a given use case. The following sections cover the star-tree data structure, and explain how Pinot uses this structure to achieve low latencies with high throughput.
Tree structure
The star-tree index stores data in a structure that consists of the following properties:
- Root node (Orange): Single root node, from which the rest of the tree can be traversed.
- Leaf node (Blue): A leaf node can containing at most T records, where T is configurable.
- Non-leaf node (Green): Nodes with more than T records are further split into children nodes.
- Star node (Yellow): Non-leaf nodes can also have a special child node called the star node. This node contains the pre-aggregated records after removing the dimension on which the data was split for this level.
- Dimensions split order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split on for a given level in the tree.
Node properties
The properties stored in each node are as follows:
- Dimension: The dimension that the node is split on
- Start/End Document Id: The range of documents this node points to
- Aggregated Document Id: One single document that is the aggregation result of all documents pointed by this node
The star-tree index is generated in the following steps:
- The data is first projected as per the dimensionsSplitOrder. Only the dimensions from the split order are reserved, others are dropped. For each unique combination of reserved dimensions, metrics are aggregated per configuration. The aggregated documents are written to a file and served as the initial star-tree documents (separate from the original documents).
- Sort the star-tree documents based on the dimensionsSplitOrder. It is primary-sorted on the first dimension in this list, and then secondary sorted on the rest of the dimensions based on their order in the list. Each node in the tree points to a range in the sorted documents.
- The tree structure can be created recursively (starting at root node) as follows:
-
If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order corresponding to current level in the tree.
-
A star node can be created (per configuration) for the current node, by dropping the dimension being split on, and aggregating the metrics for rows containing dimensions with identical values. These aggregated documents are appended to the end of the star-tree documents.
If there is only one value for the current dimension, a star node won’t be created because the documents under the star node are identical to the single node.
-
- The above step is repeated recursively until there are no more nodes to split.
- Multiple star-trees can be generated based on different configurations (dimensionsSplitOrder, aggregations, T)
Aggregation is configured as a pair of aggregation functions and the column to apply the aggregation.
All types of aggregation function that have a bounded-sized intermediate result are supported.
Supported functions
- COUNT
- MIN
- MAX
- SUM
- SUM_PRECISION
- The maximum precision can be optionally configured in
functionParameters
using the keyprecision
. For example:{"precision": 20}
.
- The maximum precision can be optionally configured in
- AVG
- MIN_MAX_RANGE
- PERCENTILE_EST
- PERCENTILE_RAW_EST
- PERCENTILE_TDIGEST
- The compression factor for the
TDigest
histogram can be optionally configured infunctionParameters
using the keycompressionFactor
. For example:{"compressionFactor": 200}
. If not configured, the default value of100
will be used.
- The compression factor for the
- PERCENTILE_RAW_TDIGEST
- The compression factor for the
TDigest
histogram can be optionally configured infunctionParameters
using the keycompressionFactor
. For example:{"compressionFactor": 200}
. If not configured, the default value of100
will be used.
- The compression factor for the
- DISTINCT_COUNT_BITMAP
- NOTE: The intermediate result RoaringBitmap is not bounded-sized, use carefully on high cardinality columns.
- DISTINCT_COUNT_HLL
- The
log2m
value for theHyperLogLog
structure can be optionally configured infunctionParameters
, for example:{"log2m": 16}
. If not configured, the default value of8
will be used. Remember that a largerlog2m
value leads to better accuracy but also a larger memory footprint.
- The
- DISTINCT_COUNT_RAW_HLL
- The
log2m
value for theHyperLogLog
structure can be optionally configured infunctionParameters
, for example:{"log2m": 16}
. If not configured, the default value of8
will be used. Remember that a largerlog2m
value leads to better accuracy but also a larger memory footprint.
- The
- DISTINCT_COUNT_HLL_PLUS
- The
p
(precision value of normal set) andsp
(precision value of sparse set) values for theHyperLogLogPlus
structure can be optionally configured infunctionParameters
, for example:{"p": 16, "sp": 32}
. If not configured,p
will have the default value of14
andsp
will have the default value of0
.
- The
- DISTINCT_COUNT_RAW_HLL_PLUS
- The
p
(precision value of normal set) andsp
(precision value of sparse set) values for theHyperLogLogPlus
structure can be optionally configured infunctionParameters
, for example:{"p": 16, "sp": 32}
. If not configured,p
will have the default value of14
andsp
will have the default value of0
.
- The
- DISTINCT_COUNT_THETA_SKETCH
- The
nominalEntries
value for the Theta Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withDISTINCT_COUNT_THETA_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- DISTINCT_COUNT_RAW_THETA_SKETCH
- The
nominalEntries
value for the Theta Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withDISTINCT_COUNT_RAW_THETA_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- DISTINCT_COUNT_TUPLE_SKETCH
- The
nominalEntries
value for the Tuple Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withDISTINCT_COUNT_TUPLE_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH
- The
nominalEntries
value for the Tuple Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withDISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH
- The
nominalEntries
value for the Tuple Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withSUM_VALUES_INTEGER_SUM_TUPLE_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- AVG_VALUE_INTEGER_SUM_TUPLE_SKETCH
- The
nominalEntries
value for the Tuple Sketch can be optionally configured infunctionParameters
, for example:{"nominalEntries": 4096}
. If not configured, the default value of16384
will be used. Note that thenominalEntries
provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with{"nominalEntries": 8192}
can be used withAVG_VALUE_INTEGER_SUM_TUPLE_SKETCH
havingnominalEntries=8192
or less for any power of 2.
- The
- DISTINCT_COUNT_CPC_SKETCH
- The
lgK
value for the CPC Sketch can be optionally configured infunctionParameters
, for example:{"lgK": 13}
. If not configured, the default value of12
will be used. Note that thenominalEntries
provided at query time should be2 ^ lgK
in order for a star-tree index to be used. For instance, a star-tree index with{"lgK": 13}
can be used withDISTINCTCOUNTCPCSKETCH
havingnominalEntries=8192
.
- The
- DISTINCT_COUNT_RAW_CPC_SKETCH
- DISTINCT_COUNT_ULL
- The
p
value (precision parameter) for theUltraLogLog
structure can be optionally configured infunctionParameters
, for example:{"p": 20}
. If not configured, the default value of12
will be used.
- The
- DISTINCT_COUNT_RAW_ULL
- The
p
value (precision parameter) for theUltraLogLog
structure can be optionally configured infunctionParameters
, for example:{"p": 20}
. If not configured, the default value of12
will be used.
- The
Unsupported functions
- DISTINCT_COUNT
- Intermediate result Set is unbounded.
- SEGMENT_PARTITIONED_DISTINCT_COUNT:
- Intermediate result Set is unbounded.
- PERCENTILE
- Intermediate result List is unbounded.
Functions to be supported
- ST_UNION
Multiple index generation configurations can be provided to generate multiple star-trees. Each configuration should contain the following properties:
Property | Description |
---|---|
dimensionsSplitOrder | An ordered list of dimension names can be specified to configure the split order. Only the dimensions in this list are reserved in the aggregated documents. The nodes will be split based on the order of this list. For example, split at level i is performed on the values of dimension at index i in the list. |
skipStarNodeCreationForDimensions | (Optional, default empty): A list of dimension names for which to not create the Star-Node. |
functionColumnPairs | A list of aggregation function and column pairs (split by double underscore “__”). E.g. SUM__Impressions (SUM of column Impressions) or COUNT__*. |
aggregationConfigs | Check AggregationConfigs |
maxLeafRecords | (Optional, default 10000): The threshold T to determine whether to further split each node. |
{% hint style="info" %} `functionColumnPairs` and `aggregationConfigs` are interchangeable. Consider using `aggregationConfigs` since it supports additional parameters like compression. {% endhint %}
{% hint style="info" %} All aggregations of a query should be included in `aggregationConfigs` or in `functionColumnPairs` in order to use the star-tree index. {% endhint %}
Property | Description |
---|---|
columnName | (Required) Name of the column to aggregate. The column can be either dictionary encoded or raw. |
aggregationFunction | (Required) Name of the aggregation function to use. |
compressionCodec | (Optional, default PASS_THROUGH , introduced in release 1.1.0 ) Used to configure the compression enabled on the star-tree-index. Useful when aggregating on columns that contain big values. For example, a BYTES column containing HLL counters serialisations used to calculate DISTINCTCOUNTHLL . In this case setting "compressionCodec": "LZ4" can significantly reduce the space used by the index. Equivalent to compressionCodec in #raw-value-forward-index |
deriveNumDocsPerChunk | (Optional, introduced in release 1.2.0 ) Equivalent to deriveNumDocsPerChunk in #raw-value-forward-index |
indexVersion | (Optional, introduced in release 1.2.0 ) Equivalent to rawIndexWriterVersion in #raw-value-forward-index |
targetMaxChunkSize | (Optional, introduced in release 1.2.0 ) Equivalent to targetMaxChunkSize in #raw-value-forward-index |
targetDocsPerChunk | (Optional, introduced in release 1.2.0 ) Equivalent to targetDocsPerChunk in #raw-value-forward-index |
functionParameters | (Optional) A configuration map used to pass in additional configurations to the aggregation function. For example, on DISTINCTCOUNTHLL , this could look like {"log2m": 16} in order to build the star-tree index using DISTINCTCOUNTHLL with a non-default value for log2m . Note that the index will only be used for queries using the same value for log2m with DISTINCTCOUNTHLL . |
A default star-tree index can be added to a segment by using the boolean config enableDefaultStarTree under the tableIndexConfig.
A default star-tree will have the following configuration:
- All dictionary-encoded single-value dimensions with cardinality smaller or equal to a threshold (10000) will be included in the dimensionsSplitOrder, sorted by their cardinality in descending order.
- All dictionary-encoded Time/DateTime columns will be appended to the _dimensionsSplitOrder _following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the dimensionsSplitOrder.
- Include COUNT(*) and SUM for all numeric metrics in the functionColumnPairs.
- Use default maxLeafRecords (10000).
For our example data set, in order to solve the following query efficiently:
SELECT SUM(Impressions)
FROM myTable
WHERE Country = 'USA'
AND Browser = 'Chrome'
GROUP BY Locale
We may configure the star-tree index as follows:
"tableIndexConfig": {
"starTreeIndexConfigs": [{
"dimensionsSplitOrder": [
"Country",
"Browser",
"Locale"
],
"skipStarNodeCreationForDimensions": [
],
"functionColumnPairs": [
"SUM__Impressions"
],
"maxLeafRecords": 1
}],
...
}
Alternatively using aggregationConfigs
instead of functionColumnPairs
and enabling compression on the aggregation:
"tableIndexConfig": {
"starTreeIndexConfigs": [{
"dimensionsSplitOrder": [
"Country",
"Browser",
"Locale"
],
"skipStarNodeCreationForDimensions": [
],
"aggregationConfigs": [
{
"columnName": "Impressions",
"aggregationFunction": "SUM",
"compressionCodec": "LZ4"
}
],
"maxLeafRecords": 1
}],
...
}
{% hint style="info" %} Note: In above example configs maxLeafRecords is set to 1 so that all of the dimension combinations are pre-aggregated for clarity in visual below. {% endhint %}
The star-tree and documents should be something like below:
The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.
Star-tree documents
Country | Browser | Locale | SUM__Impressions |
---|---|---|---|
CA | Chrome | en | 400 |
CA | Firefox | fr | 200 |
MX | Safari | en | 100 |
MX | Safari | es | 300 |
USA | Chrome | en | 600 |
USA | Firefox | en | 400 |
USA | Firefox | es | 200 |
CA | * | en | 400 |
CA | * | fr | 200 |
CA | * | * | 600 |
MX | Safari | * | 400 |
USA | Firefox | * | 600 |
USA | * | en | 1000 |
USA | * | es | 200 |
USA | * | * | 1200 |
* | Chrome | en | 1000 |
* | Firefox | en | 400 |
* | Firefox | es | 200 |
* | Firefox | fr | 200 |
* | Firefox | * | 800 |
* | Safari | en | 100 |
* | Safari | es | 300 |
* | Safari | * | 400 |
* | * | en | 1500 |
* | * | es | 500 |
* | * | fr | 200 |
* | * | * | 2200 |
For query execution, the idea is to first check metadata to determine whether the query can be solved with the star-tree documents, then traverse the Star-Tree to identify documents that satisfy all the predicates. After applying any remaining predicates that were missed while traversing the star-tree to the identified documents, apply aggregation/group-by on the qualified documents.
The algorithm to traverse the tree can be described as follows:
- Start from root node.
- For each level, what child node(s) to select depends on whether there are any predicates/group-by on the split dimension for the level in the query.
- If there is no predicate or group-by on the split dimension, select the Star-Node if exists, or all child nodes to traverse further.
- If there are predicate(s) on the split dimension, select the child node(s) that satisfy the predicate(s).
- If there is no predicate, but there is a group-by on the split dimension, select all child nodes except Star-Node.
- Recursively repeat the previous step until all leaf nodes are reached, or all predicates are satisfied.
- Collect all the documents pointed by the selected nodes.
- If all predicates and group-by's are satisfied, pick the single aggregated document from each selected node.
- Otherwise, collect all the documents in the document range from each selected node.note
- EQ (
=
) - NOT EQ (
!=
) - IN
- NOT IN
- RANGE (
>
,>=
,<
,<=
,BETWEEN
) - AND
- REGEXP_LIKE: It is intentionally left unsupported because it requires scanning the entire dictionary.
- IS NULL: Currently
NULL
value info is not stored in star-tree index, and the dimension will be indexed as default value. A workaround is to docol = <default>
instead. - IS NOT NULL: Same as
IS NULL
. A workaround is to docol != <default>
.
- OR
- It can be applied to predicates on the same dimension, e.g.
WHERE d1 < 10 OR d1 > 50)
- It CANNOT be applied to predicates on multiple dimensions because star-tree index will double counting with pre-aggregated results.
- It can be applied to predicates on the same dimension, e.g.
- NOT (Added since
1.2.0
)- It can be applied to simple predicate and
NOT
- It CANNOT be applied on top of
AND
/OR
because star-tree index will double counting with pre-aggregated results.
- It can be applied to simple predicate and
{% hint style="info" %} In scenarios where you have a transform on a column(s) which is in the dimension split order (should include all columns that are either a predicate or a group by column in target query(ies)) AND used in a group-by, then Star-tree index will get applied automatically. If a transform is applied to a column(s) which is used in predicate (WHERE clause) then Star-tree index won't apply.
For e.g if query contains round(colA,600) as roundedValue from tableA group by roundedValue
and colA is included in dimensionSplitOrder then Pinot will use the pre-aggregated records to first scan matching records and then apply transform round()
to derive roundedValue
.
{% endhint %}