Indexing Ethereum

This post describes an indexer we built at Subsquid that ingests Ethereum blockchain data and exposes an efficient query interface for logs and transactions. If you haven't already, read part 1 of this story first to get the broader picture.

1) Why do this?

Blockchain data is needed to implement many applications (e.g., NFT marketplaces and wallet applications), and Ethereum is the most popular blockchain for smart contracts and decentralized applications. The default source of Ethereum blockchain data is the Ethereum JSON-RPC API, which is implemented by all popular Ethereum clients (Geth, Erigon, Nethermind, etc.). The problem with this API is cost: serving it requires running an entire blockchain node, and even when developers go through a provider, the provider still has to maintain blockchain nodes to answer those requests. So developers either pay relatively high fees for rate-limited provider endpoints or host their own blockchain node, which is equally if not more costly.

Indexing the data ahead of time into a format optimized for querying, and putting an optimized query engine on top of it, can lead to a huge performance increase. Other benefits include much faster sync times, since already-indexed data can be kept in cold storage and downloaded to the servers that implement the query interface, the possibility of sharding the data among query servers for extra performance, lower system requirements, and so on.

2) The easy way of doing it

Most applications built for indexing blockchain data use an RDBMS like Postgres for storage, in large part because it provides a solid base to build on without much work. But this kind of implementation typically uses far more storage space, CPU and memory than the task requires.

Having a custom implementation allows for a much more efficient and stable indexer.

Ethereum blockchain data doesn't contain many relations, and the ones it has are very simple, like block numbers and transaction/log indexes inside a block. So a relational database with complex indexes isn't needed to store and query the data. Instead, we can store it in a format like Parquet, which is designed for big data and gives us a very solid base for the indexer.

3) What is Parquet?

Parquet is an open-source columnar data format. The benefits of using parquet are:

  • High compression ratio, because the format is columnar (similar values sit next to each other in storage).

  • Built-in indexes that allow us to load only parts of a file. It divides the file into row groups and keeps statistics (min/max, bloom filters) for each one, so row groups can be skipped when querying.

  • There are existing tools for querying it. (DataFusion, Polars, DuckDB)

  • It's just a file, so it's easy to distribute and can be stored on IPFS or a similar system, allowing for built-in decentralization.

  • Can work on a subset of columns, which is usually the case here since not every client is interested in all columns (see the sketch below).
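
As a rough illustration of the last two points, here is what a Polars lazy scan over one of these files might look like in Rust: only the selected columns are read from disk, and the filter can be pushed down so that row groups whose statistics rule it out are skipped. The column names and the hex-encoded address are assumptions made for the sketch, not the indexer's actual schema.

```rust
use polars::prelude::*;

/// Read a subset of columns from a log file, filtered by contract address.
/// Column names ("address", "block_number", "data") are assumed for illustration.
fn read_logs(path: String, address: &str) -> PolarsResult<DataFrame> {
    LazyFrame::scan_parquet(path, ScanArgsParquet::default())?
        // Projection pushdown: only these columns are read from disk.
        .select([col("address"), col("block_number"), col("data")])
        // Predicate pushdown: row groups whose min/max statistics exclude
        // this address can be skipped without being decoded.
        .filter(col("address").eq(lit(address)))
        .collect()
}
```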

4) Architecture

The indexer consists of two parts:

  • Ingester

  • Worker

The ingester is responsible for downloading data from the Ethereum RPC endpoint and writing it into parquet files. We designed it to optionally sync these files to S3-compatible storage (e.g., Filebase) so they can be distributed to workers more easily.

The worker is responsible for building auxiliary indexes over the data (e.g., bloom filters over log files keyed by the log address field) and serving the data over a REST API. It also fetches data that isn't in parquet files yet from the Ethereum RPC API and stores it locally, and it can optionally sync parquet files from S3-compatible storage.

5) Ingester

The ingester downloads data from Ethereum RPC into memory and writes it into parquet files once it reaches a target number of logs, transactions or blocks. It groups the parquet files into folders named after the block range they contain, and each folder holds log.parquet, tx.parquet and block.parquet. This structure makes the data very easy to process: if we need the block data that corresponds to a log file, we just read the block file in the same folder. We can also construct indexes per folder and treat a folder as an atomic unit of data, which helps when writing or transferring it.
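
To make the folder layout concrete, here is a minimal sketch of writing one such folder with Polars. The zero-padded folder naming, the table list and the error handling are assumptions for illustration, not the actual ingester code.

```rust
use polars::prelude::*;
use std::fs::{self, File};

/// Write one block-range folder containing block.parquet, tx.parquet and log.parquet.
/// The zero-padded "<from>-<to>" folder name is an assumption about the layout.
fn write_folder(
    root: &str,
    from_block: u64,
    to_block: u64,
    tables: &mut [(&str, DataFrame)], // e.g. [("block", blocks), ("tx", txs), ("log", logs)]
) -> Result<(), Box<dyn std::error::Error>> {
    let dir = format!("{root}/{from_block:010}-{to_block:010}");
    fs::create_dir_all(&dir)?;
    for (name, df) in tables.iter_mut() {
        let file = File::create(format!("{dir}/{name}.parquet"))?;
        // ParquetWriter consumes the file handle and writes the whole DataFrame.
        ParquetWriter::new(file).finish(df)?;
    }
    Ok(())
}
```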

6) Worker

The worker uses RocksDB, an embedded key-value store developed at Facebook, to hold a bloom filter index for each parquet folder. When executing a query, it iterates over these indexes, prunes the query using the bloom filters and only opens the actual parquet files when it has to. This provides a huge performance boost as long as the block range per parquet folder is sufficiently small and the contract/address being queried is not extremely busy.
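
Conceptually, the pruning step looks something like the sketch below. The RocksDB key layout and the `may_contain` callback are assumptions for illustration (and a reasonably recent rust-rocksdb is assumed, where iterators yield `Result` items); the real worker code differs.

```rust
use rocksdb::{Direction, IteratorMode, DB};

/// Return the block-range folders whose bloom filter *might* contain the queried
/// address. `may_contain` deserializes a stored filter and tests membership
/// (see Appendix A); only the surviving folders' parquet files are opened.
fn candidate_folders(
    db: &DB,
    may_contain: impl Fn(&[u8]) -> bool,
) -> Result<Vec<String>, rocksdb::Error> {
    const PREFIX: &[u8] = b"log_filter:";
    let mut folders = Vec::new();
    for item in db.iterator(IteratorMode::From(PREFIX, Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(PREFIX) {
            break; // left the "log_filter:" key range
        }
        if may_contain(&value) {
            // The key suffix is the folder name, e.g. "0016000000-0016001000".
            folders.push(String::from_utf8_lossy(&key[PREFIX.len()..]).into_owned());
        }
    }
    Ok(folders)
}
```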

The worker also stores the hot (new) block data that hasn't made it into parquet files yet. This lets it serve queries for the tip of the chain.

It exposes a REST API for querying logs and transactions. Queries over parquet files are executed with Polars, a DataFrame library implemented in Rust that can query parquet files directly. Queries over hot data are handled by custom code that uses the RocksDB API directly.
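
The hot-data path could look roughly like the following; the key layout ("hot_log:<block>:<index>") and the `matches` callback are illustrative assumptions, not the worker's actual storage schema.

```rust
use rocksdb::{Direction, IteratorMode, DB};

/// Scan hot logs (not yet written to parquet) for a block range, applying the
/// request's address/topic filters via `matches` on each serialized log.
fn query_hot_logs(
    db: &DB,
    from_block: u64,
    to_block: u64,
    matches: impl Fn(&[u8]) -> bool,
) -> Result<Vec<Vec<u8>>, rocksdb::Error> {
    // Zero-padded block numbers keep lexicographic key order equal to block order.
    let start = format!("hot_log:{from_block:010}:");
    let mut out = Vec::new();
    for item in db.iterator(IteratorMode::From(start.as_bytes(), Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(b"hot_log:") {
            break;
        }
        let block: u64 = key
            .get(8..18) // "hot_log:" is 8 bytes, then a 10-digit block number
            .and_then(|b| std::str::from_utf8(b).ok())
            .and_then(|s| s.parse().ok())
            .unwrap_or(u64::MAX);
        if block > to_block {
            break;
        }
        if matches(&value) {
            out.push(value.to_vec());
        }
    }
    Ok(out)
}
```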

7) How fast is it?

A full Ethereum data sync takes about 16 hours on our current setup, which uses resource-limited Kubernetes pods. Syncing a worker when the parquet files are already in S3, on the other hand, takes about 5 hours.

As for query performance, the only comparative result we have is the Gravatar squid: it fully syncs in under twenty seconds, while an equivalent subgraph takes 20 minutes. We didn't test syncing this data directly from Ethereum RPC, but I would guess it would be even slower. This result is only anecdotal, so I don't think any firm conclusions should be drawn at this point.

Appendix A: Bloom Filters

A bloom filter is a data structure that consists of a set of hash functions and a bit vector. Each hash function maps the input to an index into the bit vector. For insertion, it sets the bit at each of these indexes to one. To check whether an element is in the set, it looks at the bits at the same indexes: if any of them is zero, the element is definitely not in the set the bloom filter represents; if all of them are one, the element might be in the set. So bloom filters provide an efficient way to test set membership, at the cost of occasional false positives. They are an excellent way to index our parquet files. There are alternatives to bloom filters, like ribbon filters, but they either don't have a usable implementation in Rust or don't provide a boost for our use case.
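
For illustration, here is a minimal bloom filter in Rust that derives k indexes from two base hashes (double hashing); this is a sketch, not the implementation the worker uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal bloom filter: `k` indexes per item over a bit vector of `m` bits.
struct Bloom {
    bits: Vec<u64>,
    m: u64,
    k: u64,
}

impl Bloom {
    fn new(m_bits: u64, k: u64) -> Self {
        Bloom {
            bits: vec![0; ((m_bits + 63) / 64) as usize],
            m: m_bits,
            k,
        }
    }

    /// Double hashing: index_i = (h1 + i * h2) mod m, for i in 0..k.
    fn indexes<T: Hash>(&self, item: &T) -> Vec<u64> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        0xdead_beef_u64.hash(&mut hasher); // feed extra bytes to derive a second hash
        let h2 = hasher.finish() | 1;
        (0..self.k)
            .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % self.m)
            .collect()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for idx in self.indexes(item) {
            self.bits[(idx / 64) as usize] |= 1 << (idx % 64);
        }
    }

    /// `false` means definitely not in the set; `true` means it might be.
    fn contains<T: Hash>(&self, item: &T) -> bool {
        self.indexes(item)
            .into_iter()
            .all(|idx| (self.bits[(idx / 64) as usize] & (1 << (idx % 64))) != 0)
    }
}
```

In the worker's case, the items inserted are the log addresses seen in a folder, so `contains` answers the question "might this folder hold logs for this address?".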

Appendix B: I/O Bottleneck on the Cloud

When we deployed to the cloud, we hit a disk-speed bottleneck in both the ingestion process and worker queries: the ingester was downloading data faster than it could write it into parquet files, and worker queries were grossly bottlenecked by disk I/O. We fixed this by running a lot more reads/writes in parallel, but that in turn drove memory consumption up, so in the end we had to balance memory consumption against I/O performance. This gave the ingester a much-appreciated speed boost, so we didn't have to wait days and days to sync Polygon data, and it provided a 1x-2x performance boost to worker queries. As a side note, we also enabled io_uring support in RocksDB, which is supposed to do some nice prefetching and improve performance in this kind of environment, but as far as we could tell it made no measurable difference.
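
Conceptually, the fix is just a bounded amount of concurrent I/O, so that the number of in-flight reads/writes (and the memory they buffer) stays capped. Below is a minimal tokio sketch of the idea; `write_batch_blocking` and the batch type are hypothetical stand-ins for the actual parquet writes.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

/// Write batches concurrently, but cap the number of in-flight writes so that
/// memory consumption stays bounded while still keeping the disk busy.
async fn write_all(batches: Vec<Vec<u8>>, max_in_flight: usize) -> std::io::Result<()> {
    let permits = Arc::new(Semaphore::new(max_in_flight));
    let mut tasks = JoinSet::new();
    for batch in batches {
        // Acquiring before spawning applies backpressure to the producer side.
        let permit = permits.clone().acquire_owned().await.expect("semaphore closed");
        tasks.spawn(async move {
            let _permit = permit; // released when this task finishes
            // Hypothetical blocking parquet write, moved off the async runtime.
            tokio::task::spawn_blocking(move || write_batch_blocking(&batch))
                .await
                .expect("write task panicked")
        });
    }
    while let Some(res) = tasks.join_next().await {
        res.expect("join failed")?;
    }
    Ok(())
}

/// Hypothetical: serialize one block-range folder's worth of data to disk.
fn write_batch_blocking(_batch: &[u8]) -> std::io::Result<()> {
    Ok(())
}
```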

Appendix C: Parquet Libraries

Although the parquet format offers many compression formats, encodings and data types, not all of them are implemented by libraries. The most feature-complete and performant stack we could find was rust-arrow2 and polars. The only issues we had were that polars didn't support the arrow binary type (and we didn't want to write data to parquet files as hex strings, which doubles its size), and that polars had a bug in its predicate pushdown optimization. We implemented binary data type support for polars ourselves and helped the maintainers track down the optimization correctness issue.

Also, we plan to implement support for bloom filters in parquet files in polars, which might improve the performance of filtering logs by topics.

In my opinion, this all goes to show that these libraries aren't heavily used and aren't fully mature yet, so the ecosystem should get a lot better as the tooling matures.