Synerise Terrarium - a massive-scale in-memory & disk storage engine built from scratch

Terrarium is a combined column and row store engine designed specifically for behavioral intelligence and real-time data processing, and it is the core of the Synerise platform. It processes data-heavy analytics while simultaneously executing various business scenarios in real time.

Synerise is a Polish company conducting commercial and R&D work in the field of Big Data and AI (Artificial Intelligence). We created the Synerise platform to help organizations grow rapidly.

It’s no secret that we’re an engineering-driven company that constantly pushes boundaries with fresh and innovative products. At Synerise, we believe that only organizations that innovate with passion, practice consistency, and perform intellectual experiments will stay relevant and become the leaders in their space. That's how we stay competitive against other companies with strong intellectual property – by producing the best products, based on science.

Today, I want to share a few words about our own data storage engine, which is the core of all Synerise components.

Terrarium is an "all in one" DB project that resolves the challenges no other available tool can solve. With Terrarium we didn’t need to string together several independent third party technologies, which would result in both higher system complexity and costs. Terrarium enables effective and simultaneous executions of analytical / column and row queries.

Table of contents:

  1. How did it start?
  2. Is Terrarium replaceable with other solutions today?
  3. Why do we need our own storage?
  4. Advantages of an in-house database engine
  5. What kind of DBMS features does Terrarium have?
  6. What kind of limitations does Terrarium have?
  7. Which use cases does Terrarium fit best?
  8. How is this engine better than others?
  9. What does Terrarium look like compared to other solutions?
  10. Future work

How did it start?

In early 2017 at Synerise, after searching for the perfect solution to analyze and execute business scenarios on large-scale data, we decided, together with the IT Core Team led by Milosz Balus and Marcin Pasternak, to build our own storage engine.

At that time, the landscape of #bigdata technologies was vast, and we found many different engines on the market: column stores for analytics and row stores for documents.

We defined the fundamental features that a perfect solution should have:

  • Schema-free organization of data
  • Scalable and accessible architecture
  • Large-scale data warehouse service to use with business intelligence tools and to work on real-time data
  • Extremely fast services for analytics queries made on billions of records where the time, context, and order of events are essential for business cases, e.g. funnels, segmentations, and flow diagrams such as Sankey diagrams (column store approach)
  • Optimized framework to handle time series data: each entry should be associated with a timestamp
  • Cloud-agnostic mechanisms available for bare-metal systems (on-premise Synerise distribution)
  • Data locality for the best performance of direct queries, characteristic of single-object analysis with a time context (row store approach)
  • One source of truth: one storage that can serve OLAP (analytics) queries and direct queries for single-object data analysis (in our case, customers and products) to make decisions based on data in real time (<500 ms)
  • Cost-effectiveness
  • Stored procedures and functions for fast computing

The above requirements span various types of storage: document stores, wide column stores, RDBMSs, time series DBMSs, and OLAP storage.

Before we started developing our own solution, we set a benchmark, and during tests we tried to use and adapt different kinds of technologies. We tested real use cases with MemSQL, ClickHouse, Tarantool (worker implementation), CitusData, Druid, HBase, Spark SQL, ElasticSearch, Apache Ignite, and Hazelcast. We also tried building hybrid solutions, such as Druid or ClickHouse for analytics combined with HBase and Spark for direct, single-object queries. After a few months of tests, we shortlisted a few promising technologies, among them a custom, in-house implementation of Tarantool as a single worker used in cluster mode. We learned that for the required performance, data should be local and stored both in memory and on disk. Each of the tested technologies serves a different purpose, but they all share the substantial advantages we were looking for in our final solution: data locality and in-memory & disk storage for fast computing.

None of the technologies mentioned above met all our expectations, business needs, and requirements because of problematic edge cases such as data consistency, schema requirements, the need to write different queries for analytical and direct (row store) access, performance, redundancy, complexity, and more.

Is Terrarium replaceable with other solutions today?

No. There is no "more reliable or better" solution that would handle the traffic characteristics served by Terrarium while providing the required scalability, though it is likely that such technologies will begin to emerge in the next few years.

There is currently no single solution available to replace Terrarium. None of the TOP 30 (https://db-engines.com/en/ranking) commonly used tools meet the requirements: a solution is needed that efficiently performs analytical queries (e.g. ClickHouse) while offering easy access to individual rows (e.g. Scylla). Terrarium should not be classified as a database in the Row Stores (MySQL, PostgreSQL) / Column Stores (Vertica, Greenplum, ClickHouse) / Wide Column Stores (Cassandra, Scylla, HBase) categories, but as a Cloud-Native Database (https://dzone.com/articles/cloud-native-databases-and-why-should-you-use-them) and as a solution that stores data in all of the above categories simultaneously. Just as a few years ago there was a demand for container orchestration solutions (Amazon ECS, Kubernetes) that allowed for better use of the resources provided by the cloud (VMs), the growth in the amount of generated and stored data creates a need for a new category of databases designed to operate in the cloud. This allows us to use the full potential of the cloud (Blob Storage, Spot Instances), which translates into a reduction in data storage and computing costs.

Why do we need our own storage?

Firstly, to reduce the maintenance costs of the system and to ensure the scalability of the provided solutions.

Secondly, to deliver a unified experience for our clients. The core requirement at Synerise is real-time processing, defined as access to unstructured, heterogeneous data and object metrics in milliseconds. We analyze and deliver information at the same time.

The column store approach is perfect for the data analysis our clients need to make informed decisions. At the same time, after gaining insights from data, our clients need to act on them. For example, they build segments of objects (customers, products, locations, etc.) based on the stream of events generated by those objects and deliver the output data to 3rd party solutions (POS, campaign engines, call center, etc.) in real time. They require real-time and ad-hoc analyses and data execution at the same moment.

Figure 1. Traditional approach for data analytics and business scenario execution

In a traditional legacy solution, this approach is delivered by separate technologies. First, you look for insights in one technology, e.g., some metrics. After that, you take a data snapshot and deliver it to a separate database engine where you can enrich and act on the data. This approach is not effective in terms of time, effort, or cost. Using multiple technologies creates problems with the maintenance and scalability of the provided solutions. It is also problematic for real-time requirements, e.g., personalized content on the website, counting loyalty points/metrics based on behavioral data, delivering insights to a POS, or serving AI recommendations, churn models, scoring, etc.

Working with our clients, we dive into the tech world and see the demand for technology that combines Column Stores (Vertica, Greenplum, ClickHouse) with Row Stores/Wide Column Stores (Cassandra, Scylla, HBase). The currently available solutions do not meet the requirements (schema-less), and solutions such as Vitess only partially mitigate the current restrictions; as time goes by, this will become even more apparent. Investing in technologies that will be outdated in a few years is pointless. Terrarium was built from scratch to avoid such problems.

Figure 2. High level architecture definition for Terrarium

Advantages of an in-house database engine

One needs very sound reasons to create a completely new database engine, especially considering how much has already been done with regard to column and row stores in general. That’s why we devoted ourselves to studying and grasping the complexity of our use case and its consequences for our technology. Our work resulted in many benefits and led us to develop our own database.

  1. Query flexibility, which mainly inspired our work on the database, is a key factor. We made sure our database is able to process a wide range of queries, starting from simple aggregations to machine learning algorithms.
  2. Ultimately, we reduced costs by controlling where and how data is stored. Although column-oriented databases usually ensure horizontal scalability, cost efficiency at high data volumes is our priority. While exploring and testing other databases, we found that unique behavioral analyses, our specialty and main focus, were not supported well enough. That is why having control over the underlying system is an advantage.

As there are two sides to every story, there were also significant disadvantages to this approach, such as the required engineering effort and the risk of having to recreate mechanisms responsible for reliability and scalability.

What kind of DBMS features does Terrarium have?

The main features of the Terrarium DBMS are:

  1. True column-oriented & row-oriented DBMS in one engine. The same stored data set can be processed at the same time in two separate ways, characteristic of row or column store DBMSs
  2. Linear scalability. It's possible to extend a cluster by adding servers
  3. Fault tolerance. The system is available in cluster mode with active workers, which increases performance and the ability to handle more load from public traffic
  4. Capability to store and process petabytes of data
  5. High performance
  6. Support for vector calculations. This approach allows us to achieve high CPU performance
  7. Distributed query processing is available for fully unstructured, heterogeneous data (JSON format)
  8. Data compression
  9. Hard disk drive (HDD) optimization. The system can process data that doesn’t fit in RAM.
  10. HTTP API for database connectivity
  11. Optimized for INSERT/UPDATE queries

What kind of limitations does Terrarium have?

Terrarium has some characteristics that can be considered disadvantages:

  • Currently, there is no support for DB transactions
  • SQL syntax is not yet supported (we are working on it)

Which use cases does Terrarium fit best?

Terrarium was designed to analyze behavioral data, where the order and time of data are important for making business decisions. More use cases include:

  • Querying a large number of rows extracted from the DB, but only a small subset of columns
  • Getting an object with a single query for complex business logic, where latencies of about 5 ms are allowed
  • Storing data from different systems (website, POS, ERP, etc.)
  • Analyzing the data and building internal dashboards, or performing real-time analyses for business purposes

How is this engine better than others?

Efficiency, scalability, the combination of Column Store and Row Store characteristics, cost optimization (separation of the computation layer from the data storage layer, allowing resource usage to be adjusted dynamically depending on load characteristics), and analytical calculation features.

To achieve a high compression ratio of stored data, it was necessary to use a columnar file format. Unfortunately, using existing formats such as Parquet or ORC was not possible due to the need to store schema-less data. To meet these requirements, we decided to develop a proprietary data storage format.

Each file consists of column blocks containing up to 1M values. A dictionary containing the unique values is generated for each column. Then, the dictionary-encoded values are split into chunks containing up to 65k values, and each chunk is dictionary-encoded once again. This solution allows us to achieve a very high compression ratio and selective loading of data. To filter rows containing a specified value, we first search the column dictionary. If we do not find the value we are looking for, there is no need to load the remaining data (chunk dictionaries and chunk values), and we can immediately move on to filtering the columns of the next block. Dictionary encoding with a fixed length of single encoded values also allows for easy use of vector extensions to decompress data. Compression of the column dictionary, chunk dictionaries, and encoded values can be further improved by enabling generic compression algorithms: ZSTD, LZ4, or ZLIB.
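
To illustrate the idea, here is a minimal Python sketch of the two-level dictionary encoding and the block-dictionary filtering shortcut described above. It is not the actual Terrarium implementation; the function names and the in-memory representation are assumptions made only for this example.

from typing import Any

BLOCK_SIZE = 1_000_000  # values per column block (from the description above)
CHUNK_SIZE = 65_000     # values per chunk (approx. 65k, from the description above)

def encode_column_block(values: list[Any]) -> dict:
    """Encode one column block with two levels of dictionary encoding."""
    assert len(values) <= BLOCK_SIZE
    # Level 1: block-wide dictionary of unique values.
    block_dict = {v: i for i, v in enumerate(dict.fromkeys(values))}
    block_codes = [block_dict[v] for v in values]

    # Level 2: split the codes into chunks and dictionary-encode each chunk again.
    chunks = []
    for start in range(0, len(block_codes), CHUNK_SIZE):
        chunk = block_codes[start:start + CHUNK_SIZE]
        chunk_dict = {c: i for i, c in enumerate(dict.fromkeys(chunk))}
        chunks.append({
            "dict": list(chunk_dict),                 # chunk dictionary (block codes)
            "codes": [chunk_dict[c] for c in chunk],  # fixed-width local codes
        })
    return {"dict": list(block_dict), "chunks": chunks}

def block_may_contain(encoded_block: dict, needle: Any) -> bool:
    """Selective loading: if the value is absent from the block dictionary,
    the chunk dictionaries and chunk codes never need to be read."""
    return needle in encoded_block["dict"]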

To support writing to HDFS, we decided to store file metadata in a footer. Our file format also allows us to encrypt stored data using generic encryption algorithms.

What does Terrarium look like compared to other solutions?

For our use cases, we prepared a benchmark comparing the Terrarium and ClickHouse engines - we have great respect for that technology.

ClickHouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).

ClickHouse was developed by the Russian IT company Yandex for the Yandex.Metrica web analytics service. ClickHouse allows analysis of data that is updated in real time. The system is marketed for high performance. ClickHouse is used by the Yandex.Tank load testing tool. ClickHouse was also implemented at CERN’s LHCb experiment to store and process metadata on 10 billion events with over 1000 attributes per event, and Tinkoff Bank uses ClickHouse as a data store for a project.

Hardware

We used Azure virtual machines from the L32s_v2 series with:

  • 32 x CPU
  • 256 GB RAM
  • Temporary disk: 7.2 TB NVMe (150k IOPS)

For both technologies, we prepared clusters with 10 nodes. The total capacity of a cluster was:

  • 320 x CPU
  • 2,560 GB RAM
  • Temporary disk: 72 TB NVMe (150k IOPS)

The ClickHouse configuration was tuned for the highest possible performance:

set max_block_size=262144;
set distributed_group_by_no_merge=0;
set optimize_distributed_group_by_sharding_key=1;
set distributed_aggregation_memory_efficient=1;
set optimize_skip_unused_shards=1;

Engine definition for ClickHouse schema:

dist.col_all_events_124:
  ENGINE = Distributed('analytics', 'buffers', 'col_all_events_124', client_id)

buffers.col_all_events_124:
  ENGINE = Buffer('default', 'col_all_events_124', 16, 120, 300, 10000, 50000, 10240000, 10240000)

default.col_all_events_124:
  ENGINE = CollapsingMergeTree(sign)
  PARTITION BY (xxHash32(action) % 29, toYYYYMM(event_date))
  ORDER BY (client_id, event_timestamp, client_uuid_hash, event_uuid_hash)
  SETTINGS index_granularity = 8192, storage_policy = 'nvme_jbod'

Data Sizes

For both solutions, we inserted 1.3 billion real customer events, such as page visits, transactions, and other client activities.

Query Performance

For test purposes, we defined three types of queries that reflect real business cases:

Q1 (analytical query) returns the list of clients that fit the segment definition (a simplified sketch of the per-client check follows the list):

  • In the last six months, they spent more than $1k
  • They visited pages whose URI contains "TV", with at least three such visits one after the other
  • They made a purchase, but not of a product in the "TVs" category
  • No more than 30 minutes elapsed between the first click on the "TV" category and the purchase.
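
For illustration only, the per-client check behind this segment could be expressed roughly as in the Python sketch below. It assumes a simplified, hypothetical event shape (fields ts, type, uri, category, amount) and events pre-sorted by timestamp and pre-filtered to the last six months; it is not Terrarium's actual query interface.

from datetime import timedelta

def in_segment(events: list[dict]) -> bool:
    # Rule 1: more than $1k spent in the analyzed period.
    spent = sum(e["amount"] for e in events if e["type"] == "transaction")
    if spent <= 1_000:
        return False

    # Rule 2: at least three consecutive page visits whose URI contains "TV";
    # remember the first TV click for the time-window rule below.
    first_tv_click, streak = None, 0
    for e in events:
        if e["type"] == "page.visit":
            if "TV" in e["uri"]:
                streak += 1
                first_tv_click = first_tv_click or e["ts"]
            else:
                streak = 0
        if streak >= 3:
            break
    else:
        return False

    # Rules 3 and 4: a purchase outside the "TVs" category,
    # no more than 30 minutes after the first click on the "TV" category.
    return any(
        e["type"] == "transaction"
        and e["category"] != "TVs"
        and timedelta(0) <= e["ts"] - first_tv_click <= timedelta(minutes=30)
        for e in events
    )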

Q2 (direct query) is the same as Q1, but checks whether a particular client_id is in the segment

Q3 (direct query) counts points in the loyalty program:

  • Each dollar spent is 1 point
  • If the customer purchased the "Carcass" category in the period from 4/04/2020 to 30/04/2020, we counted the points 10 times
  • If the customer has made five purchases today or for an amount higher than 10,000, we do not charge his points
  • If the customer purchased a product in the "Computers and tablets" category, we do not charge points for this category
  • We count the full value of products, rounding down, i.e. 1.99 gives 1 point and 3.20 gives 3 points
  • If, in the last two weeks, the customer has bought five products for an amount greater than 2k, they get not 1 but 100 points for every dollar
  • Points are not added if:

They visited pages whose URI contains "TV", with at least three such visits one after another (completed within 30 seconds, within the last 30 days), or they bought a product in the "TVs" category in the last 30 days.

No more than 30 minutes passed from the first entry into the TV category to the purchase of a product in this category.

Query performance summary

Table 1. Query performance with data locality for a single unit in Terrarium (worker) and ClickHouse (node)

Table 2. Query performance for a cluster mode with 10 nodes and random client_id – production load

Figure 4. Production workload (6,000 requests per second) with execution times for a single direct query in Terrarium.

Terrarium is exposed to a heavy public load. In one of our production clusters, it handles 6k rps. For direct queries like those in scenarios Q2 and Q3, the response time is less than 5 ms at the 95th percentile.

We see that ClickHouse has 2x better performance when data locality is ensured; however, it does not support native query routing for data locality. If we wanted to adopt ClickHouse for our business use cases and public load (6k rps), we would need:

  • A write service that routes queries to a particular node to ensure data-locality computation
  • An increase in cluster size from 10 to 34 nodes to handle 6k rps

ClickHouse and Terrarium comparison summary

  • The compression ratio for ClickHouse is 5.69, while for Terrarium it equals 22.2; in practice, this means that in the same disk space Terrarium can store 3.9x more data than ClickHouse
  • Under public load, 95% of Terrarium queries complete in less than 12 ms, while the same queries in ClickHouse take 459 ms, which means that Terrarium is significantly faster - 38x faster than ClickHouse
  • To handle public traffic of 6k rps on one cluster, we would need to buy 3.4x more VMs for ClickHouse. A single VM costs 1,832 EUR (June 2020). A single ClickHouse node can handle 173 rps, meaning that we would need 34 VMs for 62,288 EUR in a pay-as-you-go model, while the cost for Terrarium is 18,320 EUR. Both clusters can store the same quantity of events because the Terrarium compression ratio is 3.9x better than ClickHouse's

Future work

Our Analytics Engine is one of the key differentiators that has helped us scale to meet the needs of our customers and help them build better products. We continue to invest resources to support large volume customers, improve query latency, add new features, and reduce costs.

Auto-scaling & spot instances

A stateless architecture and spot instances decrease infrastructure costs in cloud deployments by about 60% to 80% in comparison to the current deployment. The re-architecture also caused a significant reduction in the bootstrap time for a new node.

Building support for a third data layer – object storage for unstructured data

Today, Terrarium natively supports processing of data stored in memory and on fast NVMe drives in order to meet performance requirements. In the next few months, we will implement native support for a third data source layer: object storage for unstructured data, available from all cloud providers. This solution decreases the cost of storing huge data sets without losing performance. Our customers will be able to store petabytes of data and pay only for what they use - storage and computation will be separated. Data will always be available for fast computation and execution.

Implementing SQL for querying data

For the time being, Terrarium lacks the ability to process SQL queries. However, implementing SQL is our main focus going forward, to improve the quality of work and make our tool accessible through user-friendly interfaces for people working with data, such as data scientists, analysts, and data engineers.

Soon Terrarium will debut as a standalone product.