They flow in like crazy. Every system event or user click generates a log, and a company often produces tens of billions of new logs per day.
They are bulky. Logs are meant to be retained: they may not be useful until the moment they are. So a company can accumulate petabytes of log data, much of which is seldom visited yet takes up huge storage space.
They must be quick to load and find. Locating the target log for troubleshooting is like looking for a needle in a haystack, so people long for real-time log writing and real-time responses to log queries.
High-throughput real-time data ingestion: It should be able to write logs in bulk and make them visible immediately.
Low-cost storage: It should be able to store substantial amounts of logs without consuming too many resources.
Real-time text search: It should be capable of quick text search.
Inverted index (Elasticsearch): It is widely adopted for its full-text search support and high query performance. The downsides are low throughput in real-time writing and huge resource consumption during index creation.
Lightweight index / no index (Grafana Loki): It is the opposite tradeoff: it boasts high real-time write throughput and low storage cost, but delivers slow queries.
Upon data writing, the system tokenizes texts into terms and stores these terms in a posting list, which maps each term to the IDs of the rows where it appears. In text queries, the database finds the row IDs of the keyword (term) in the posting list and fetches the target rows based on those IDs. By doing so, the system doesn't have to traverse the whole dataset, which improves query speeds by orders of magnitude.
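To illustrate the idea (this is a minimal sketch, not how any particular engine implements it), a posting list can be as simple as a map from term to sorted row IDs:

#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Minimal sketch of an inverted index: each term maps to a posting list
// of row IDs in which the term appears.
class InvertedIndex {
public:
    // Tokenize a row's text into terms and append the row ID to each
    // term's posting list.
    void AddRow(uint32_t row_id, const std::string& text) {
        std::istringstream tokens(text);          // naive whitespace tokenizer
        std::string term;
        while (tokens >> term) {
            std::vector<uint32_t>& postings = index_[term];
            if (postings.empty() || postings.back() != row_id) {
                postings.push_back(row_id);       // keep row IDs sorted and unique
            }
        }
    }

    // Return the posting list for a term; only these rows need to be read,
    // so the whole dataset is never scanned.
    std::vector<uint32_t> Lookup(const std::string& term) const {
        auto it = index_.find(term);
        return it == index_.end() ? std::vector<uint32_t>{} : it->second;
    }

private:
    std::map<std::string, std::vector<uint32_t>> index_;
};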
Simply put, Elasticsearch and Grafana Loki represent different tradeoffs between high write throughput, low storage cost, and fast query performance. What if I told you there is a way to have them all? We have introduced inverted indexes in Apache Doris and further optimized them to deliver log query performance twice as fast as Elasticsearch while using only 1/5 of its storage space. Both factors combined, that makes it a 10 times more cost-effective solution.
Generally, there are two ways to implement indexes: an external indexing system or built-in indexes.
External indexing system: You connect an external indexing system to your database. During data ingestion, data is imported into both systems. After the indexing system creates its indexes, it deletes the original data within itself. When a user issues a query, the indexing system provides the IDs of the relevant data, and the database then looks up the target data based on those IDs.
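As a sketch of that two-system query path, with hypothetical IndexService and Database types standing in for the real components (placeholder bodies, for illustration only):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified interfaces for illustration; not real components.
struct IndexService {
    // The external indexing system answers keyword queries with row IDs only.
    std::vector<uint64_t> SearchIds(const std::string& keyword) const {
        (void)keyword;
        return {};   // placeholder: a real system would consult its index
    }
};

struct Database {
    // The database returns the full rows for the given IDs.
    std::vector<std::string> FetchRows(const std::vector<uint64_t>& ids) const {
        (void)ids;
        return {};   // placeholder: a real system would read its storage
    }
};

// The two-step query path: ask the index for IDs, then fetch rows by ID.
std::vector<std::string> Query(const IndexService& index, const Database& db,
                               const std::string& keyword) {
    return db.FetchRows(index.SearchIds(keyword));
}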
Data ingestion & compaction: As a segment file is written into Doris, an inverted index file is written alongside it. The index file path is determined by the segment ID and the index ID. Rows in segments correspond to docs in indexes, as do RowIDs and DocIDs.
Query: If the where clause includes a column with an inverted index, the system looks up the index file, returns a DocID list, and converts that DocID list into a RowID bitmap. Under the RowID filtering mechanism of Apache Doris, only the target rows are read. This is how queries are accelerated.
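A rough sketch of that filtering step (a simplification, not the actual Doris code): the DocID list returned by the index becomes a bitmap over RowIDs, and the scan only reads rows whose bit is set.

#include <cstdint>
#include <vector>

// Simplified sketch: turn the DocID list returned by the inverted index
// into a RowID bitmap. DocIDs correspond one-to-one to RowIDs here.
std::vector<bool> DocIdsToRowBitmap(const std::vector<uint32_t>& doc_ids,
                                    size_t num_rows) {
    std::vector<bool> bitmap(num_rows, false);
    for (uint32_t doc_id : doc_ids) {
        bitmap[doc_id] = true;
    }
    return bitmap;
}

// During the scan, rows whose bit is unset are skipped entirely.
template <typename Row, typename Consumer>
void FilteredScan(const std::vector<Row>& segment,
                  const std::vector<bool>& bitmap, Consumer consume) {
    for (size_t row_id = 0; row_id < segment.size(); ++row_id) {
        if (bitmap[row_id]) {
            consume(segment[row_id]);
        }
    }
}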
C++ Implementation and Vectorization
Different from Elasticsearch, which is written in Java, Apache Doris implements its storage module, query execution engine, and inverted indexes in C++. Compared to Java, C++ provides better performance, allows easier vectorization, and incurs no JVM GC overhead. We have vectorized every step of inverted indexing in Apache Doris, including tokenization, index creation, and queries. For perspective, in inverted indexing Apache Doris writes data at 20 MB/s per core, four times the speed of Elasticsearch (5 MB/s).
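As a toy illustration of what column-at-a-time (vectorized) execution means, and not the actual Doris kernels: evaluating a predicate over a whole column in one tight, branch-free loop gives the compiler an easy target for SIMD auto-vectorization, whereas row-at-a-time interpretation pays a per-row dispatch cost.

#include <cstdint>
#include <vector>

// Evaluate "value > threshold" over an entire column in one pass and
// produce a selection vector. The loop body is branch-free and easy for
// the compiler to auto-vectorize with SIMD instructions.
std::vector<uint8_t> GreaterThan(const std::vector<int64_t>& column,
                                 int64_t threshold) {
    std::vector<uint8_t> selection(column.size());
    for (size_t i = 0; i < column.size(); ++i) {
        selection[i] = column[i] > threshold;
    }
    return selection;
}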
Columnar Storage & Compression
Apache Lucene lays the foundation for inverted indexes in Elasticsearch. As Lucene itself is built to support file storage, it stores data in a row-oriented format. Apache Doris, by contrast, organizes its inverted index data in a columnar layout, which lends itself to much better compression. By utilizing Zstandard compression, Apache Doris realizes a compression ratio ranging from 5:1 to 10:1, with faster compression speeds and 50% less space usage than GZIP compression.
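For reference, compressing a buffer with the Zstandard library (libzstd) looks like the sketch below; the compression level is an arbitrary illustrative choice, not necessarily the one Doris uses. Build with -lzstd.

#include <zstd.h>

#include <stdexcept>
#include <string>

// Compress a buffer with Zstandard. Level 3 is an illustrative default;
// higher levels trade compression speed for a better ratio.
std::string ZstdCompress(const std::string& input, int level = 3) {
    std::string output(ZSTD_compressBound(input.size()), '\0');
    size_t written = ZSTD_compress(&output[0], output.size(),
                                   input.data(), input.size(), level);
    if (ZSTD_isError(written)) {
        throw std::runtime_error(ZSTD_getErrorName(written));
    }
    output.resize(written);
    return output;
}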
BKD Trees for Numeric / Datetime Columns
Apache Doris implements BKD trees for numeric and datetime columns. This not only improves the performance of range queries but also saves more space than converting those columns to fixed-length strings. Its other benefits include the following (a simplified sketch of the range-pruning idea follows the list):
Efficient range queries: It is able to quickly locate the target data range in numeric and datetime columns.
Less storage space: It aggregates and compresses adjacent data blocks to reduce storage costs.
Support for multi-dimensional data: BKD trees are scalable and adaptive to multi-dimensional data types, such as GEO points and ranges.
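To make the range-query benefit concrete, here is a simplified one-dimensional sketch of the pruning idea, assuming a block layout with per-block min/max statistics; an actual BKD tree partitions the data recursively and handles multiple dimensions.

#include <cstdint>
#include <vector>

// Simplified 1-D stand-in for BKD-style range pruning: each block of a
// numeric column keeps its min/max, and a range query only reads blocks
// whose [min, max] interval overlaps the query range.
struct Block {
    int64_t min_value;
    int64_t max_value;
    std::vector<int64_t> values;   // the block's actual column values
};

std::vector<int64_t> RangeQuery(const std::vector<Block>& blocks,
                                int64_t low, int64_t high) {
    std::vector<int64_t> result;
    for (const Block& block : blocks) {
        if (block.max_value < low || block.min_value > high) {
            continue;              // prune: this block cannot contain matches
        }
        for (int64_t v : block.values) {
            if (v >= low && v <= high) {
                result.push_back(v);
            }
        }
    }
    return result;
}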
Optimization for low-cardinality scenarios: We have fine-tuned the compression algorithm for low-cardinality scenarios, so decompressing and deserializing large amounts of inverted lists consumes less CPU.
Pre-fetching: For high-hit-rate scenarios, we adopt pre-fetching. If the hit rate exceeds a certain threshold, Doris skips the index lookup and filters the data directly during the scan.
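That threshold check can be sketched roughly as below; the 0.8 value and the function shape are made-up illustrations, not what Doris actually does.

#include <cstddef>

// Illustrative heuristic: when the index predicts that most rows match
// anyway, reading the index buys little, so the engine can skip it and
// evaluate the predicate during the scan instead.
bool ShouldSkipIndex(size_t estimated_hits, size_t total_rows,
                     double threshold = 0.8) {
    if (total_rows == 0) return false;
    double hit_rate = static_cast<double>(estimated_hits) / total_rows;
    return hit_rate > threshold;
}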
Create an inverted index for a specified data partition: for example, create indexes only for logs from the past seven days.
Delete the inverted index for a specified data partition: for example, delete indexes for logs from over one month ago (so as to free up index space).
Benchmarking tool: ES Rally, the official testing tool for Elasticsearch
Dataset: 1998 World Cup HTTP Server Logs (self-contained dataset in ES Rally)
Data Size (Before Compression): 32G, 247 million rows, 134 bytes per row (on average)
Query: 11 queries including keyword search, range query, aggregation, and ranking; each query was executed serially 100 times.
Environment: 3 × 16C 64G cloud virtual machines
Data: 6.7G, 28.73 million rows, the Hacker News dataset, Parquet format
Query: 3 keyword searches, counting the number of occurrences of the keywords "ClickHouse", "OLAP" OR "OLTP", and "avx" AND "sve".
Environment: 1 × 16C 64G cloud virtual machine
Result: Apache Doris was 4.7 times, 12 times, 18.5 times faster than ClickHouse in the three queries, respectively.
Dataset: one million comment records from Hacker News
Step 1: Specify the inverted index in the data table upon table creation.
Parameters:
INDEX idx_comment (`comment`): creates an index named "idx_comment" on the "comment" column
CREATE TABLE hackernews_1m
(
`id` BIGINT,
`deleted` TINYINT,
`type` String,
`author` String,
`timestamp` DateTimeV2,
`comment` String,
`dead` TINYINT,
`parent` BIGINT,
`poll` BIGINT,
`children` Array<BIGINT>,
`url` String,
`score` INT,
`title` String,
`parts` Array<INT>,
`descendants` INT,
INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("replication_num" = "1");
Note: You can add an index to an existing table via ADD INDEX idx_comment ON hackernews_1m(comment) USING INVERTED PROPERTIES("parser" = "english"). Unlike smart indexes and secondary indexes, the creation of an inverted index only involves reading the comment column, so it can be much faster.
Step 2: Retrieve the words "OLAP" and "OLTP" in the comment column with MATCH_ALL. The response time here was 1/10 of that of pattern matching with LIKE. (The performance gap widens as data volume increases.)
mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';
+---------+
| count() |
+---------+
| 15 |
+---------+
1 row in set (0.13 sec)
mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';
+---------+
| count() |
+---------+
| 15 |
+---------+
1 row in set (0.01 sec)