Building an Ethereum (EVM) API accelerator

This post is about building an accelerator for APIs that serve EVM-based blockchain data.

ChatGPT Summary

The post talks about the problems with the current Ethereum RPC servers and PostgreSQL indexing solutions. It suggests building an accelerator that indexes the archival data and exposes an efficient and feature-full API to accelerate all of these APIs. The author also shares their experience with building eth-archive and the improvements they made to it. The post concludes with the status of the project and an invitation to contribute to its development.

Overall, the post provides a detailed analysis of the challenges with Ethereum RPC servers and PostgreSQL indexing solutions, and proposes a potential solution for improving the performance of API serving methods.

Why?

I wrote about how I built an archive for Ethereum data at my old job here.

After using a lot of RPC providers while building this archive, it seemed like the current Ethereum RPC API serving methods could be drastically improved by putting the historical data into an accelerator and offloading archival requests to it.

Also, after checking the architecture of the TheGraph indexer, I realized it might be possible to improve it by replacing the PostgreSQL-based indexing with an approach similar to the one I used when building eth-archive.

It seems like it should be possible to build a single accelerator that indexes the archival data and exposes an efficient and feature-full API to accelerate all of these APIs.

Problems with current Ethereum RPC servers

As far as I can tell, node providers host the Ethereum RPC API directly from blockchain clients. This means every request sent by users goes to these blockchain nodes (except the ones that can be served from some kind of cache, which is rarely the case for archival requests). Each blockchain node does a huge amount of work other than serving this API, e.g. consensus. They also use much more disk space than is required to just serve the API, especially for archival requests: an archival Ethereum mainnet node requires more than 3TB of disk space, which needs to be an SSD to have sensible sync speeds.

Overall these node implementations aren't designed only to serve this API, but to actually run EVM-based chains.

The node implementations are also missing some obvious improvements, like detecting that a user is requesting a run of consecutive blocks or transactions and reading that data with a single iterator over the database, instead of reading all of the entries in parallel using normal get operations. I checked Erigon's code for handling RPC requests and there seem to be no optimizations for these kinds of queries.
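
To make that concrete, here is a minimal sketch of the kind of optimization I mean. This is not Erigon's code: a std BTreeMap stands in for the node's key-value store (real clients use embedded databases like MDBX), and the function names are made up for illustration.

```rust
use std::collections::BTreeMap;

// Stand-in for the node's key-value store; real clients use an embedded
// database such as MDBX, but the access pattern is the same.
type BlockNum = u64;
type BlockBody = Vec<u8>;
type Db = BTreeMap<BlockNum, BlockBody>;

/// Naive approach: one point lookup per requested block, as if each
/// block had been requested independently.
fn get_blocks_point_lookups(db: &Db, blocks: &[BlockNum]) -> Vec<Option<BlockBody>> {
    blocks.iter().map(|b| db.get(b).cloned()).collect()
}

/// If the requested block numbers form a consecutive run, a single range
/// scan over the sorted key space avoids repeated tree traversals and
/// random I/O.
fn get_blocks_range_scan(db: &Db, blocks: &[BlockNum]) -> Vec<Option<BlockBody>> {
    let consecutive = blocks.windows(2).all(|w| w[1] == w[0] + 1);
    if blocks.is_empty() || !consecutive {
        return get_blocks_point_lookups(db, blocks);
    }
    let (start, end) = (blocks[0], *blocks.last().unwrap());
    let mut found: BTreeMap<BlockNum, BlockBody> =
        db.range(start..=end).map(|(k, v)| (*k, v.clone())).collect();
    blocks.iter().map(|b| found.remove(b)).collect()
}

fn main() {
    let db: Db = (0..100u64).map(|n| (n, vec![n as u8])).collect();
    let blocks: Vec<u64> = (10..20).collect();
    // Both paths return the same data; the range scan just touches the
    // storage in a much friendlier way.
    assert_eq!(
        get_blocks_point_lookups(&db, &blocks),
        get_blocks_range_scan(&db, &blocks)
    );
}
```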

Problems with PostgreSQL indexing solutions

Indexing solutions using a traditional RDBMS also tend to have huge system requirements compared to a more handcrafted indexing solution. For example, TheGraph's system requirements are huge compared to running an instance of eth-archive* per chain that TheGraph indexes. Contract and transaction data query performance is also much lower than eth-archive's, because it doesn't shard the data or use bloom filters to skip huge amounts of it.

* Requirements to run eth-archive for Ethereum mainnet are: 600 GB of disk space (it can be slow, e.g. a network-mounted SSD or SATA SSD; it should even be OK with an HDD, but that wasn't tested), 2 CPU cores, and 8 GB of RAM (it can work with 4 GB as well, but then it can't serve many requests at once).
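
For illustration, here is a rough sketch of the kind of file-level metadata this sharding can keep, and how a bloom filter lets a query skip whole files. The bloom filter here is a toy implementation and all names are hypothetical; this is not the actual eth-archive code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tiny illustrative bloom filter; a real archive would use a tuned
/// library implementation with proper sizing and hash functions.
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; bits], hashes }
    }
    fn index(&self, item: &[u8], seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }
    fn insert(&mut self, item: &[u8]) {
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }
    /// false => definitely absent, true => possibly present
    fn maybe_contains(&self, item: &[u8]) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(item, seed)])
    }
}

/// Per-shard metadata: which block range a parquet file covers and which
/// addresses appear in it.
struct FileIndex {
    path: String,
    from_block: u64,
    to_block: u64,
    addresses: Bloom,
}

/// Only files whose block range overlaps the query and whose bloom filter
/// might contain the address need to be opened at all.
fn files_to_scan<'a>(
    files: &'a [FileIndex],
    from_block: u64,
    to_block: u64,
    address: &[u8],
) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| f.from_block <= to_block && f.to_block >= from_block)
        .filter(|f| f.addresses.maybe_contains(address))
        .map(|f| f.path.as_str())
        .collect()
}
```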

Problems with eth-archive

After investigating memory usage and scalability issues with eth-archive, I found that they mostly came from polars, the library I used to query parquet files.

After looking into how it leverages built-in parquet indexes, I found that it doesn't do much, which makes sense: from what I understand, it is mostly designed for analytical jobs where you load the parquet data into memory and run CPU-intensive work on it. My use case is quite different: I want to go over the files once and minimize IO as much as possible. So I want to skip sections of the files wherever possible, no matter how much CPU it costs, because CPU usage is already very low for this kind of query.

After realizing I could do much better external indexing to skip sections of the files, I decided to drop polars and use arrow2 directly to query the parquet files. I added more in-depth per-file indexes to the embedded database I was already using to index the files.
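
As a rough sketch of what such an external index could look like (the names are hypothetical and the actual arrow2 row-group reading is omitted), the embedded database can record per-row-group block ranges so a query only touches the row groups that could possibly contain its block range:

```rust
/// Statistics recorded per parquet row group when the file is written;
/// stored in the embedded database alongside the file path.
struct RowGroupStats {
    row_group: usize,
    min_block: u64,
    max_block: u64,
}

/// Given a queried block range, return only the row groups whose block
/// range overlaps it; everything else is skipped without touching the
/// parquet file beyond the (already cached) index.
fn row_groups_to_read(stats: &[RowGroupStats], from_block: u64, to_block: u64) -> Vec<usize> {
    stats
        .iter()
        .filter(|s| s.min_block <= to_block && s.max_block >= from_block)
        .map(|s| s.row_group)
        .collect()
}
```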

After making these changes, IO usage dropped drastically and many queries ran at least 2x faster than in the old version. Unfortunately I made a mistake and wrote the filtering code in regular Rust, so the program was decoding arrow data into Rust arrays and filtering after that, which gave abysmal CPU performance. I never got to improve this and release that version, since I parted ways with Subsquid.

eth-archive also worked per chain, so if you wanted to index many small chains you had to run an archive instance per chain. This is not great, since there are many EVM chains and, if you spin up an archive for each of them, chances are most of them will stay idle while some of them have high load. The way to solve this would be to make a single archive instance able to handle many chains at once. This turns out to be much easier than it sounds, but I wasn't able to implement it for eth-archive either.
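
A hypothetical sketch of what a single multi-chain instance could look like: queries carry a chain id and get routed to that chain's dataset, so one process can own many mostly-idle chains.

```rust
use std::collections::HashMap;

/// Hypothetical handle to one chain's dataset (parquet files plus index).
struct ChainDataset {
    data_dir: std::path::PathBuf,
}

/// A single archive process owning many chains; each query is routed to
/// the right dataset, so idle chains cost little beyond their ingestion
/// loop and disk space.
struct MultiChainArchive {
    chains: HashMap<String, ChainDataset>,
}

impl MultiChainArchive {
    fn dataset(&self, chain_id: &str) -> Option<&ChainDataset> {
        self.chains.get(chain_id)
    }
}
```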

There were also architectural problems with eth-archive, like requiring separate processes for the ingester and the worker. The ingester wrote parquet data locally and then backed it up to S3, and the worker downloaded it from S3 into its own storage. This means the parquet data is unnecessarily duplicated. A much simpler design would be for the worker to write the parquet files itself, back them up to S3, and let other workers start from the S3 data.
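
Here is a hedged sketch of that simpler flow, using a made-up object-store trait rather than a real S3 client API:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical object-store abstraction; in practice this would wrap an
/// S3 client (e.g. the aws-sdk-s3 or object_store crates).
trait ObjectStore {
    fn upload(&self, local: &Path, key: &str) -> std::io::Result<()>;
    fn list(&self) -> std::io::Result<Vec<String>>;
    fn download(&self, key: &str, local: &Path) -> std::io::Result<()>;
}

/// The worker itself writes each finished parquet file and immediately
/// backs it up, so no separate ingester process (and no duplicated copy
/// of the data) is needed.
fn write_and_backup(store: &dyn ObjectStore, local_file: &Path, key: &str) -> std::io::Result<()> {
    // ... parquet file is written to `local_file` by the ingestion loop ...
    store.upload(local_file, key)
}

/// A fresh worker can bootstrap its local dataset straight from the bucket
/// instead of re-ingesting everything from an RPC node.
fn bootstrap(store: &dyn ObjectStore, data_dir: &Path) -> std::io::Result<()> {
    for key in store.list()? {
        let target: PathBuf = data_dir.join(&key);
        store.download(&key, &target)?;
    }
    Ok(())
}
```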

Testing coverage and the release process were also poor. These problems were mostly due to the rush to deliver features and the team consisting only of me.

The last problem was that eth-archive exposed a very inefficient API using JSON over HTTP. This was mainly due to Subsquid SDK limitations and how tasks were prioritized on eth-archive. Many transports would work better than JSON over HTTP, but the most obvious choice seems to be to just use Arrow Flight instead.

Status of the project

I started developing a solution to these problems recently. The code is here and it is in the very early stages of development. Please contact me if you want to contribute to the development, if you are interested in running it, or if you want to give feedback about the rationale behind it or the project itself.