Enhanced node synchronization with Binlogs for reliable data handling

TerrariumDB has three types of nodes: Agent, Gateway (which we named ScorpIO), and Controller:

  • A Controller is responsible for configuring Agent and Gateway nodes and for handling transactions,
  • An Agent is responsible for managing data storage and executing crucial compute operations for optimal performance,
  • ScorpIO is responsible for handling all requests and fetching data from the Agents.

To keep these nodes synchronized, we originally used a configuration file containing vital information such as database tables, shards, nodes, and more. However, as TerrariumDB continued to expand, this configuration file grew significantly, sometimes reaching several megabytes. To tackle this challenge, the TerrariumDB team has been hard at work on a set of enhancements.

Figure 1. Communication between the various types of nodes.

Old solution

In the initial approach, propagating an update to all nodes required transmitting the entire configuration, resulting in a substantial volume of data transfer (for instance, over 200 MB per transaction when dealing with 21 nodes). While this worked well for smaller configurations, it became time-consuming and inefficient for larger ones. Rather than repeatedly sending the entire configuration file, with all the redundant transfers that involves, we came up with a straightforward solution: send only a log that records each modification.

Apply log

Each log captures a specific operation type - such as ADD_TABLE, ADD_DATABASE, ADD_NODE, or REMOVE_SHARD - together with the body of that command: name, primary_keys, shard_keys, and so on. When a transaction updates the configuration, the Controller creates a new log, eliminating the need to transmit the entire configuration file. As a result, only a small amount of data, a few kilobytes, is sent to update each node according to its type. Thanks to each log's unique LSN (Log Sequence Number), the precise sequence of logs can be established, allowing Controllers to determine exactly which updates each node requires. TerrariumDB's nodes store the logs locally in JSON format, which allows quick rebooting: a node reads its local logs and then receives from the Controller only the logs it missed during downtime. Below you can find an example log:

{
 "lsn":"272",
 "operation":{
   "table":{
     "compression":"ZSTD",
     "databaseId":79,
     "id":22,
     "name":"events_automation.clientstartpath",
     "primaryKey":[
       [
         "objectId"
       ],
       [
         "timestamp"
       ],
       [
         "uuid"
       ]
     ],
     "shardKey":[
       [
         "objectId"
       ]
     ]
   },
   "type":"ADD_TABLE"
 }
}

This log describes a table with id 22 and the name events_automation.clientstartpath, and specifies the id of the database it belongs to (79). The body contains fields corresponding to the table's parameters, including compression, primaryKey, and shardKey. The last field specifies the type of log, in this case ADD_TABLE.
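To illustrate how such a log might be consumed, here is a minimal C++ sketch of a node applying logs in strict LSN order. All type and method names here are hypothetical stand-ins for illustration; they are not TerrariumDB's actual internals.

#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical log representation: an LSN, an operation type, and a JSON body.
enum class LogType { ADD_TABLE, ADD_DATABASE, ADD_NODE, REMOVE_SHARD };

struct LogEntry {
  uint64_t lsn;      // Log Sequence Number: defines the total order of logs
  LogType type;      // operation carried by this log
  std::string body;  // JSON body, e.g. name, primaryKey, shardKey
};

class Configuration {
 public:
  // Apply a single log on top of the current configuration.
  // Logs must be applied strictly in LSN order, so a gap means
  // the node has to fetch the missing logs first.
  void apply(const LogEntry& log) {
    if (log.lsn != lastAppliedLsn_ + 1)
      throw std::runtime_error("LSN gap: fetch missing logs first");
    switch (log.type) {
      case LogType::ADD_TABLE:    addTable(log.body);    break;
      case LogType::ADD_DATABASE: addDatabase(log.body); break;
      case LogType::ADD_NODE:     addNode(log.body);     break;
      case LogType::REMOVE_SHARD: removeShard(log.body); break;
    }
    lastAppliedLsn_ = log.lsn;
  }

  uint64_t lastAppliedLsn() const { return lastAppliedLsn_; }

 private:
  void addTable(const std::string&)    { /* parse JSON, register the table */ }
  void addDatabase(const std::string&) { /* ... */ }
  void addNode(const std::string&)     { /* ... */ }
  void removeShard(const std::string&) { /* ... */ }

  uint64_t lastAppliedLsn_ = 0;
};

The lastAppliedLsn value is what a node reports back, so the Controller can tell exactly which logs it still needs.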

Synchronization of logs

To ensure seamless coordination among multiple Controllers, it's crucial to synchronize logs and prevent the loss of vital information. If a leader drops, there is a risk of corrupted data and of losing details about the latest tables and databases. To address this, TerrariumDB uses ZooKeeper both for log storage and for leader election.

When a new transaction occurs, a log is created and stored in ZooKeeper before being distributed to the other nodes. If the leader drops, the newly elected Controller reads the most up-to-date logs and reconfigures the other nodes accordingly. For example, consider a leader whose latest log has an LSN of 120. If new transactions introduce 5 tables, 5 additional logs are generated and stored in ZooKeeper. If the Controller then drops - perhaps because its virtual machine was shut down - a new Controller takes charge. It reads all the logs, bringing its own LSN to 125. Recognizing that the other nodes still have an LSN of 120, the new Controller sends them the missing 5 logs, effectively safeguarding against log loss. This mechanism keeps the system resilient even if the leader shuts down at any given moment.
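The catch-up step can be sketched as follows. LogStore and NodeHandle are hypothetical stand-ins for the ZooKeeper-backed log storage and for the leader's connections to other nodes; the LSN bookkeeping is the point here, not the networking.

#include <cstdint>
#include <vector>

struct LogEntry { uint64_t lsn; /* type and body as in the example above */ };

// Stand-in for the ZooKeeper-backed log storage.
struct LogStore {
  // Returns all logs with an LSN greater than afterLsn, in order.
  std::vector<LogEntry> readAfter(uint64_t /*afterLsn*/) { return {}; }
};

// Stand-in for the leader's connection to an Agent or ScorpIO node.
struct NodeHandle {
  uint64_t reportedLsn() { return 0; }                 // last LSN the node applied
  void send(const std::vector<LogEntry>& /*logs*/) {}  // ship the missing logs
};

// After election, the new leader first catches up from ZooKeeper
// (e.g. LSN 120 -> 125), then brings every node up to its own LSN.
void catchUp(LogStore& store, std::vector<NodeHandle>& nodes, uint64_t& leaderLsn) {
  for (const LogEntry& log : store.readAfter(leaderLsn))
    leaderLsn = log.lsn;  // the leader also applies these to its own configuration

  for (NodeHandle& node : nodes) {
    const uint64_t nodeLsn = node.reportedLsn();  // e.g. still 120
    if (nodeLsn < leaderLsn)
      node.send(store.readAfter(nodeLsn));        // only the missing 5 logs
  }
}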

Figure 2. Snapshot and log synchronization.

Snapshot

To enhance efficiency and maintain data integrity, TerrariumDB transmits a snapshot of the entire configuration instead of individual logs when the number of logs to replay would be substantial, since sending them one by one can be time-consuming. To achieve this, we divide the logs into manageable segments, creating a snapshot every n logs and keeping a set number m of snapshots in ZooKeeper. For example, storing 5 snapshots, each covering 100 logs, means the snapshots in ZooKeeper collectively span 500 logs.
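A minimal sketch of that cadence, assuming hypothetical types and using n = 100 and m = 5 as in the example:

#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

struct Snapshot {
  uint64_t lsn;  // LSN of the last log folded into this snapshot
  // ... serialized full configuration ...
};

class SnapshotPolicy {
 public:
  SnapshotPolicy(uint64_t n, std::size_t m) : n_(n), m_(m) {}

  // Called after each log is persisted; a snapshot is due every n logs.
  bool snapshotDue(uint64_t lsn) const { return lsn % n_ == 0; }

  // Keep the new snapshot and drop the oldest once more than m are stored.
  void store(std::deque<Snapshot>& kept, Snapshot snap) const {
    kept.push_back(std::move(snap));
    while (kept.size() > m_) kept.pop_front();  // retain only the last m
  }

 private:
  uint64_t n_;     // snapshot every n logs (e.g. 100)
  std::size_t m_;  // number of snapshots retained (e.g. 5)
};

Keeping several snapshots rather than one is what makes the fallback described below possible.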

When a new leader assumes control, it retrieves the latest snapshot, applies the logs newer than that snapshot, and distributes any missing logs to the other nodes. If a snapshot gets corrupted, we can restore from an older snapshot kept as a backup. If a node remains offline for so long that the older logs are no longer available in ZooKeeper, we send it the latest snapshot and then apply all the missing logs on top of it.
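The restore path then amounts to: start from the newest snapshot that is readable, replay only the logs newer than it, and fall back to an older snapshot if the latest one is corrupted. A compact sketch, again with hypothetical helpers standing in for checksum verification, deserialization, and the ZooKeeper reads:

#include <cstdint>
#include <deque>
#include <stdexcept>
#include <vector>

// Reusing the shapes from the earlier sketches (all hypothetical).
struct LogEntry { uint64_t lsn; /* type and body */ };
struct Snapshot { uint64_t lsn; /* serialized configuration */ };
struct Configuration { void apply(const LogEntry&) { /* as sketched above */ } };

bool isCorrupted(const Snapshot&)          { /* e.g. checksum mismatch */ return false; }
Configuration deserialize(const Snapshot&) { return {}; }
std::vector<LogEntry> logsAfter(uint64_t)  { /* read from ZooKeeper */ return {}; }

Configuration restore(const std::deque<Snapshot>& snapshots) {
  // Walk from newest to oldest: if the latest snapshot is corrupted,
  // fall back to an older one instead of failing outright.
  for (auto it = snapshots.rbegin(); it != snapshots.rend(); ++it) {
    if (isCorrupted(*it)) continue;  // try an older snapshot
    Configuration cfg = deserialize(*it);
    for (const LogEntry& log : logsAfter(it->lsn))
      cfg.apply(log);                // snapshot plus the logs newer than it
    return cfg;
  }
  throw std::runtime_error("no usable snapshot available");
}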

This approach strikes a balance between transmitting the entire configuration and applying individual logs, while protecting the configuration from corruption. By keeping multiple snapshots, we can restore and replace any corrupted snapshot using an older one. This not only safeguards data integrity but also significantly speeds up the configuration process.

To enhance system reliability, we also store snapshots locally on each node. If an issue with the latest local snapshot prevents the node from starting up, it can fall back to an older snapshot and apply logs on top of it instead of requesting a new snapshot from the Controller.
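A node's startup decision thus becomes a short fallback chain. The sketch below again uses hypothetical functions; what matters is the order in which the sources are tried.

#include <optional>

struct Configuration { /* in-memory configuration, as above */ };

// Hypothetical helpers for the three possible sources of a configuration.
std::optional<Configuration> loadLocalSnapshot(bool /*latest*/) { return std::nullopt; }
Configuration replayLocalLogs(Configuration cfg)                { return cfg; }
Configuration fetchSnapshotFromController()                     { return {}; }

Configuration startUp() {
  // 1. Normal path: latest local snapshot plus locally stored logs.
  if (auto cfg = loadLocalSnapshot(/*latest=*/true))
    return replayLocalLogs(*cfg);
  // 2. Latest local snapshot is unusable: fall back to an older one.
  if (auto cfg = loadLocalSnapshot(/*latest=*/false))
    return replayLocalLogs(*cfg);
  // 3. Last resort: ask the Controller for a fresh snapshot.
  return fetchSnapshotFromController();
}

After starting up this way, the node still reports its LSN to the Controller and receives any logs created while it was offline.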

Conclusion

Logs and snapshots offer an efficient and reliable way to update large configurations and synchronize nodes without constantly sending the entire set of data. By storing a few snapshots, we can easily restore previous configurations if needed. To further speed up node startup, we can store binlogs locally on each node and request from the leader only the logs for transactions that occurred while the node was offline, for example during maintenance. This not only streamlines the process but also minimizes reliance on the leader for entire snapshots.

Author: Mateusz Adamski | Senior C++ Developer in Synerise