Building an Open Source Data Stack for Blockchain

TL;DR

I am building an open source data stack for blockchain. The stack will be specialized for blockchain data to improve efficiency and developer experience. It will include libraries for memory handling, probabilistic filtering, data formats, and IO. These tools will form an efficient foundation for systems that handle blockchain data, such as blockchain nodes, databases, and data pipelines.

1) Why is this needed?

Blockchain data is growing at an accelerating rate as more users generate more transactions. New applications, developers, and use cases appear all the time. Efficiency becomes more and more important as the data grows and we need more ways to process it.

Efficiency greatly improves user experience and enables new use cases. Some benefits of efficiency include:

  • Lower latency and more detailed information in a block explorer. This means clicks are more responsive, users have access to more data and analysis, and the latest data appears in the block explorer more quickly.

  • Lower latency, more detailed data, and faster sync times in a wallet application.

  • Syncing historical blockchain nodes in hours instead of days or weeks.

  • Getting data into a developer environment faster and at a lower cost. This means developers don't have to wait and disrupt their workflow during experimentation. The reduced cost also lets them run more experiments over more data.

  • Fewer servers to host the system, so lower infrastructure cost.

Having a standard base that other infrastructure tools build on top of enables continuous efficiency improvements across the entire ecosystem as the base develops. A standard base also speeds up the development of new tools and makes it easier to get them to production readiness without compromising on efficiency.

2) Why do this?

Niche optimizations are possible when handling blockchain data because it is ordered and immutable. Specialized tools that exploit these properties can achieve better efficiency than general purpose tools.

Blockchain infrastructure projects still mainly use general purpose tools that come with (mostly) unneeded features like random inserts, random updates, deletes, and compaction. These features are nice, but they come at an efficiency cost.

Specialized solutions improve efficiency by exploiting the properties of blockchain data. For example, many data providers like Tenderly are building high-performance versions of standard APIs like eth_getLogs based on proprietary and ad-hoc solutions.

These specialized solutions achieve better efficiency, but keeping them proprietary and ad-hoc stifles development of the whole ecosystem because different teams are not aware of each other's work.

A better approach is to have a set of generic open-source tools that are used to build specific products. This approach is also getting more popular in the general database ecosystem, with tools like DataFusion, Polars, Velox, and DuckDB.

I believe this is the way to go, so I'll implement open-source libraries that show these improvements are possible. I'll write about each piece as I implement it and try to integrate the libraries into existing projects to prove the idea. Development will happen in the open so anyone can contribute to or use the libraries.

3) Ideas

This section goes over some optimization ideas. Most of the optimizations are possible because blockchain data is ordered and immutable. Some optimizations are more general like using better IO APIs and managing memory carefully.

I plan to go into greater detail on each optimization in upcoming blog posts as I implement them as libraries.

a) Data Layout

Most storage engines use either an LSM tree or a B+ tree. LSM trees have background compactions, which can be a problem. B+ trees have bad random write performance because each random write rewrites an entire page, so small writes translate into much bigger disk IO.

The reality is more nuanced in practice, but in general LSM trees have better write performance while B+ trees have better read performance. Both create write amplification.

We can get away with a simpler layout for blockchain data because it is immutable, ordered by block number and available from external sources in case of a restart.

The layout I have in mind has an in-memory section that can be atomically mutated to handle rollbacks or to append new data. We don't have to write new changes to disk as soon as they happen because we can recover the data from external sources after a crash.

After the in-memory data reaches some size threshold, we build the indices, compress the data, and write everything to disk. We can atomically remove the written data from memory once the disk write is persisted.

Readers and the writer never block each other because writes are atomic without any locking, and readers naturally get a consistent view of the data. The in-memory section can be implemented with copy-on-write semantics, and the time frame during which a reader holds an in-memory view is short because it covers only a small amount of in-memory data. This ensures that memory usage doesn't grow too much because of readers holding onto old versions of the tip section.
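To make this concrete, here is a minimal sketch of a copy-on-write tip section using only the standard library. `TipData` and its contents are placeholders, and a real implementation would avoid cloning the whole tip on every append (for example by using persistent data structures).

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Default)]
struct TipData {
    // e.g. recent blocks, ordered by block number
    blocks: Vec<(u64, Vec<u8>)>,
}

#[derive(Default)]
struct Tip {
    current: RwLock<Arc<TipData>>,
}

impl Tip {
    // Readers grab a cheap Arc clone and immediately release the lock,
    // so they keep a consistent snapshot without blocking the writer.
    fn snapshot(&self) -> Arc<TipData> {
        self.current.read().unwrap().clone()
    }

    // The single writer builds a new version and swaps it in atomically.
    fn append(&self, block_number: u64, data: Vec<u8>) {
        let mut new_tip: TipData = (*self.snapshot()).clone();
        new_tip.blocks.push((block_number, data));
        *self.current.write().unwrap() = Arc::new(new_tip);
    }

    // Rollbacks work the same way: build the rolled-back version, swap it in.
    fn rollback_to(&self, block_number: u64) {
        let mut new_tip: TipData = (*self.snapshot()).clone();
        new_tip.blocks.retain(|(n, _)| *n <= block_number);
        *self.current.write().unwrap() = Arc::new(new_tip);
    }
}
```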

We don’t need compactions for read performance because the data chunks are already big enough. And because indices are built only once, we can afford indices that are expensive to build but improve read performance.

Tables are bound together by block number, so we can partition them together. This, coupled with strict ordering by block number and similar fields, makes joins fast. Because we partition the tables together, we can have indices that combine columns from multiple tables. This means we have smaller indices that can be used to skip entire sections.

Another benefit of this layout is that recent data is always in memory. This fits the blockchain use case perfectly since most queries are for recent data.

Note: It is easy enough to implement full or partial durability so writes are not lost on a restart. But the in-memory tip model is better for systems that just take raw blockchain data and store it, because it gives low write latency, less IO, and less disk wear.

State data is more challenging because it has to be in a specific layout to make proof generation and point reads fast. Firewood is an example of specializing for the blockchain use case to improve the efficiency of state data storage.

b) Indexing and Filtering

Columnar data formats generally support several types of indices: min-max, bloom filter, set, and btree.

A btree index means we build a global btree of the values that exist in a specific column; we can then query this tree to find which sections contain the data we are looking for. Reads work well when querying by a high-cardinality column like user address, so this seems to fit our use case.

But we don’t want a btree index because it introduces write amplification, which is especially bad for a field like user address that is essentially a random number. Write amplification causes slow writes.

An LSM-tree-based btree index might make sense eventually, but it doesn’t seem like the best option at the start because probabilistic filters are simpler to implement, and we are aiming for simplicity.

Set indices are just the set of values present in a section of data, like a hash set. We can skip an entire section if our query value isn’t in the set. They are useful for low-to-medium cardinality columns. We can build a perfect hash set of the values and use it as the set index. For example, we can create a set index for the transaction type column and use it when querying for a rare transaction type.

Min-max indices are nice for ordered columns like block number or timestamp. We store the minimum and maximum value of a column in a section, and we can skip that section during a query if the value we are looking for isn’t in the range. This type of index isn’t useful for queries by user address because it requires a mostly-ordered column to work well.
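To illustrate how both of these section-skipping indices work, here is a small sketch. The column choices and types are illustrative.

```rust
use std::collections::HashSet;

// Per-section metadata used to skip sections without reading their data.
struct SectionIndex {
    // Min-max index for an ordered column like block number.
    min_block: u64,
    max_block: u64,
    // Set index for a low/medium cardinality column like transaction type.
    // (A production version could use a perfect hash set instead.)
    tx_types: HashSet<u8>,
}

impl SectionIndex {
    // Skip the section if the queried block can't be in its range.
    fn may_contain_block(&self, block: u64) -> bool {
        self.min_block <= block && block <= self.max_block
    }

    // Skip the section if it contains no rows of the queried transaction type.
    fn may_contain_tx_type(&self, tx_type: u8) -> bool {
        self.tx_types.contains(&tx_type)
    }
}
```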

Probabilistic filters are the secret sauce that we are looking for in this case because they are simple and efficient for filtering by user address, block hash and similar columns. I’ll just refer to probabilistic filters as filters here.

The most commonly used filters are bloom filters. They are used in blockchains, databases, data formats and so on. A bloom filter is like a lossy hash set that doesn’t even store the keys; the filter is just an array of bits. To insert a key, we hash it, find the bit index that corresponds to the hash, and set that bit to 1. To query a key, we find the corresponding index and check whether the bit is 1. This means we definitely get a 1 if our key is in the set. But we might also get a 1 if our key isn’t in the set, because the index of our key can collide with the index of some other key that was inserted. Getting a 1 when our key isn’t in the set is a false positive.

So we can build a bloom filter from the values of a column in a section of data and use this filter to skip the entire section when querying on that column. This works especially well with high-cardinality columns like user address.
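Here is a minimal sketch of a bloom filter with multiple hash functions (the description above uses a single hash for simplicity); the sizing numbers in the example are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal bloom filter: a bit array plus k hash functions.
struct BloomFilter {
    bits: Vec<u64>, // bit array packed into u64 words
    num_bits: u64,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    // Derive the i-th bit index for a key by hashing (seed, key).
    fn bit_index<K: Hash>(&self, key: &K, seed: u64) -> u64 {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        key.hash(&mut hasher);
        hasher.finish() % self.num_bits
    }

    fn insert<K: Hash>(&mut self, key: &K) {
        for seed in 0..self.num_hashes {
            let idx = self.bit_index(key, seed);
            self.bits[(idx / 64) as usize] |= 1 << (idx % 64);
        }
    }

    // False means definitely absent; true means possibly present.
    fn may_contain<K: Hash>(&self, key: &K) -> bool {
        (0..self.num_hashes).all(|seed| {
            let idx = self.bit_index(key, seed);
            (self.bits[(idx / 64) as usize] & (1 << (idx % 64))) != 0
        })
    }
}

fn main() {
    // ~10 bits per key with 7 hashes gives roughly a 1% false positive rate.
    let mut filter = BloomFilter::new(10_000, 7);
    filter.insert(&"0xdeadbeef"); // e.g. an address seen in this section
    assert!(filter.may_contain(&"0xdeadbeef"));
    // A key that was never inserted usually (but not always) returns false.
    println!("{}", filter.may_contain(&"0xcafebabe"));
}
```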

The false positive rate coupled with query time determines how useful a filter is for our use case. We don’t care much about construction time because we construct each filter once and use it many times.

We want to optimize for space efficiency without giving up too much CPU time, so that within the same memory budget we can afford a lower false positive rate. Lowering the false positive rate lowers IO, so we can maintain low read latencies and high throughput even with a large data set.

We also want to keep all of our filters in memory if possible, so we can keep things simple and avoid implementing caching. This also makes performance more predictable and easier to measure or calculate.

A filter with false positive rate x needs at least log2(1/x) bits per key; for example, a 1% false positive rate requires at least log2(100) ≈ 6.6 bits per key. Vanilla partitioned bloom filters like the one used in Solana have about 40% space overhead over this theoretical limit, and blocked bloom filters like the one in the Parquet standard have over 50% overhead.

Blocked bloom filters are extremely fast (<10ns query time per key on a regular modern CPU), but they sacrifice space efficiency. They also suffer from false positive issues if some blocks of the filter get overcrowded.

There are filters with lower space overhead, like binary fuse filters. Binary fuse filters have about 20% overhead with fewer than 100k keys in a single filter, and this goes down to about 7% as we reach 1 million keys in a single filter.

There are also ribbon filters, which are built on the same principle as binary fuse filters but give a bit more configurability; they have variants that achieve less than 1% overhead over the theoretical limit. They hit these low overhead numbers even with 10k-100k keys in a single filter.

The ribbon concept can also be used to construct a structure similar to a hash map that returns garbage values when the queried key isn’t in the map. This can achieve even better results than a filter that only gives a true/false response.

c) Disk IO

Modern NVMe storage is fast. The main bottleneck in storage is usually the APIs we use and the architecture of our systems.

Note: I’m writing this section in the context of locally attached NVMe drives on dedicated physical machines. The type of system described here can then be used as a cache or a frontend to a bigger pool of data that primarily resides in S3 or networked storage, if needed.

Having locally attached NVMe drives in a dedicated machine means we can use direct_io and io_uring, which enable low latency, high throughput, more controlled memory usage, fewer system calls, and fewer running OS threads. All of this adds up to an efficient IO system.

direct_io enables us to read from and write to our disks without going through the kernel page cache. This means the OS won’t waste memory trying to cache the files we read and write based on its own heuristics, and we don’t pay the memory copy overhead of the page cache.
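For illustration, here is a minimal synchronous sketch of an O_DIRECT read on Linux, assuming the libc crate; with io_uring the same read would be submitted to a queue and completed asynchronously. O_DIRECT requires the buffer address, file offset, and length to be aligned to the device’s logical block size (4096 is a safe choice on most NVMe drives).

```rust
use std::fs::OpenOptions;
use std::os::unix::fs::{FileExt, OpenOptionsExt};

// A 4096-aligned block; a Vec of these gives us a correctly aligned buffer.
#[repr(align(4096))]
#[derive(Clone, Copy)]
struct AlignedBlock([u8; 4096]);

fn read_direct(path: &str, offset: u64, num_blocks: usize) -> std::io::Result<Vec<AlignedBlock>> {
    assert_eq!(offset % 4096, 0);

    let file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT) // bypass the kernel page cache
        .open(path)?;

    let mut blocks = vec![AlignedBlock([0u8; 4096]); num_blocks];
    let buf = unsafe {
        std::slice::from_raw_parts_mut(blocks.as_mut_ptr() as *mut u8, num_blocks * 4096)
    };

    // Synchronous positioned read; with io_uring this would be an async submission.
    file.read_exact_at(buf, offset)?;
    Ok(blocks)
}
```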

Consider a database that uses Parquet files as backing storage. Parquet files are fairly slow to decode, so read times might be dominated by Parquet decoding. If we use the OS cache, it caches the raw Parquet bytes for us automatically, but we still have to decode them on every access. If we use direct_io, we can cache the decoded Parquet data ourselves and avoid the decoding cost on repeated reads.
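A sketch of such a cache could look like the following, where `DecodedRowGroup` is a placeholder for whatever decoded representation we use (for example Arrow record batches). A real cache would also bound its memory and avoid holding the lock while decoding.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct DecodedRowGroup { /* decoded columns ... */ }

// Key identifies a row group: (file id, row group index).
type Key = (u64, usize);

#[derive(Default)]
struct DecodedCache {
    map: Mutex<HashMap<Key, Arc<DecodedRowGroup>>>,
}

impl DecodedCache {
    // Return the decoded row group, decoding (and caching) it on a miss.
    fn get_or_decode(
        &self,
        key: Key,
        decode: impl FnOnce() -> DecodedRowGroup,
    ) -> Arc<DecodedRowGroup> {
        let mut map = self.map.lock().unwrap();
        map.entry(key).or_insert_with(|| Arc::new(decode())).clone()
    }
}
```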

We can also cache specific data in memory, like the current state for the most frequently used slots, e.g. the USDT contract’s storage for the most active addresses.

As another example, during historical sync we can prefetch storage slots that we know will be accessed by a contract x blocks later. This kind of optimization might make it possible to remove the IO bottleneck that Ethereum nodes have during historical sync.

The main downside of this approach is that it is harder to implement than memory-mapping files or using standard read/write calls and letting the OS handle IO and caching. It is worth spending most of our complexity budget here because control over memory and IO enables the other optimizations.

d) Memory Management

Memory in a program is split into pages, which are 4KB on a typical Linux system. When we allocate memory, the OS gives us a bunch of virtual, not-yet-backed pages. When we access a page for the first time, our program is interrupted and that page’s address is mapped to an actual location in physical memory. This mechanism is called a page fault.

Let’s say we are copying a 4MB buffer, a common case when filtering large columnar data in memory. We first allocate a new 4MB buffer, which is made up of 1024 individual pages. As we copy our data, we trigger 1024 page faults: our program stops and control passes to the kernel and back 1024 times while copying a single buffer.

But we already know what we are going to do, so why not allocate 4MB of “real“ memory up front and take a single page fault? The OS doesn’t do this by default because it doesn’t know our intention and wants to avoid excessive memory usage, so it maps pages lazily.

When we access a page of memory in our code, the CPU has to know which physical location the page corresponds to. It resolves the virtual address through a mapping that is cached in a structure called the TLB. But the TLB has a finite capacity, so performance suffers if we have a large number of pages and access many different pages frequently.

Page faults and the TLB are important for performance in a system that handles significant amounts of data. A no-brainer way to reduce the number of page faults and TLB misses is to make our pages bigger, which Linux supports in the form of transparent huge pages and explicit huge pages. We can bring down the number of page faults and TLB misses by using these bigger pages and pre-faulting pages at allocation time via the MAP_POPULATE and MAP_LOCKED options. This can give a significant performance boost.
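As an illustration, here is a rough sketch of allocating a pre-faulted buffer with mmap on Linux, assuming the libc crate. Note that MAP_LOCKED needs a sufficient RLIMit_MEMLOCK and MADV_HUGEPAGE is only a hint to the kernel.

```rust
use std::ptr;

fn alloc_prefaulted(len: usize) -> *mut u8 {
    unsafe {
        let ptr = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            // MAP_POPULATE pre-faults the pages, MAP_LOCKED keeps them resident.
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_POPULATE | libc::MAP_LOCKED,
            -1, // anonymous mapping, no backing file
            0,
        );
        assert!(ptr != libc::MAP_FAILED, "mmap failed");

        // Hint that this range should be backed by transparent huge pages.
        libc::madvise(ptr, len, libc::MADV_HUGEPAGE);

        ptr as *mut u8
    }
}

fn main() {
    // 4MB buffer: all pages are mapped up front, so copying into it
    // doesn't trigger 1024 separate page faults.
    let buf = alloc_prefaulted(4 * 1024 * 1024);
    unsafe { ptr::write_bytes(buf, 0, 4 * 1024 * 1024) };
    // (A real allocator would also munmap the region when done.)
}
```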

We can also use specialized sub-allocators where it makes sense. For example, we can use a bump allocator while serving a query and free all of that memory at once when we finish handling the query. This lets us put a concrete limit on how much memory a single request can use, and it lowers the overhead of small allocations.
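As a sketch of what this could look like, here is per-query bump allocation using the bumpalo crate; the memory limit and allocation pattern are illustrative.

```rust
use bumpalo::Bump;

// Hypothetical per-query memory budget.
const QUERY_MEMORY_LIMIT: usize = 64 * 1024 * 1024;

fn handle_query(block_numbers: &[u64]) -> usize {
    // All allocations for this query come from one arena.
    let arena = Bump::new();

    let mut total = 0usize;
    for &block in block_numbers {
        // Small allocations are just pointer bumps, no per-allocation bookkeeping.
        let row: &mut [u64] = arena.alloc_slice_fill_copy(4, block);
        total += row.len();

        // Enforce a concrete per-request memory limit.
        assert!(arena.allocated_bytes() <= QUERY_MEMORY_LIMIT, "query exceeded memory budget");
    }

    total
    // The whole arena is freed at once when `arena` goes out of scope.
}
```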

e) Thread-Per-Core Architecture

Thread-per-core architecture means pinning one thread to each CPU core and sharding the data so that each thread owns and processes its own piece. This improves data locality, makes scaling natural, and lowers synchronization overhead. Thread-per-core is used in high-performance computing, high-frequency trading, and high-performance databases like ScyllaDB.

Another upside of thread-per-core is that it fits well with io_uring, the new Linux IO interface. io_uring makes disk IO more efficient, which is important for systems that handle data.

Thread-per-core isn’t widely applied because it has downsides. The biggest are implementation difficulty, difficulty utilizing the whole physical machine, and high tail latency under imbalanced load.

Blockchain data is naturally sharded by block number, so we can split a query into subqueries where each subquery scans a fixed range of blocks. This means an individual subquery has a limit on the resources it will use. Given this, we can spread subqueries evenly across CPU cores to balance the load while minimizing synchronization.
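A rough sketch of this splitting could look like the following; the shard size and round-robin assignment are illustrative choices.

```rust
// A subquery scans a fixed block range on a fixed core.
#[derive(Debug)]
struct SubQuery {
    core: usize, // which core/shard this subquery is assigned to
    blocks: std::ops::Range<u64>,
}

fn split_query(blocks: std::ops::Range<u64>, num_cores: usize, shard_size: u64) -> Vec<SubQuery> {
    let mut subqueries = Vec::new();
    let mut start = blocks.start;
    let mut i = 0;
    while start < blocks.end {
        let end = (start + shard_size).min(blocks.end);
        subqueries.push(SubQuery {
            // Round-robin assignment; each core only ever touches its own shards.
            core: i % num_cores,
            blocks: start..end,
        });
        start = end;
        i += 1;
    }
    subqueries
}

fn main() {
    // Scan blocks 18,000,000..18,010,000 across 4 cores in 1,000-block shards.
    for sq in split_query(18_000_000..18_010_000, 4, 1_000) {
        println!("{:?}", sq);
    }
}
```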

This is still more difficult to implement than applying a multi-threaded work-stealing paradigm to everything, but the difficulty is manageable since we are specializing for blockchain data and our entire system is simpler.

There are also several options for the cut-off point between the thread-per-core and work-stealing parts of the system. For example, we can implement the storage engine as thread-per-core and use a work-stealing multi-threaded architecture for the application logic.