Quant Infrastructure #3 - Performant Time-Series Storage
A guide to an extremely fast and lightweight time-series data store.
Today we look at storing and processing historical data.
We show a data store implementation suitable for loading arbitrary amounts of time-series data instantly (at the hardware performance limit) into Python and Rust. The technique can be adapted to other languages. The basic implementation is very short at around 150 lines of code and is suitable for independent practitioners and Quant teams alike.
Broadly, the data store has three roles:
Destination/warehouse for historical data.
Data source for research/exploration.
Data source for simulations/backtests.
A good implementation is valuable because performance and capacity gains here will tend to translate into (usually non-linear) gains everywhere else — most importantly in research and backtesting.
For example, we would like data access to have as little friction as possible: ideally it should be instant, with no constraint on how much data we can handle at once. We would also like our backtests not to be bottlenecked by the data feed, and the data feed not to take computing resources away from the backtest.
This can be hard because computing resources are finite and easily exhausted by market data — to the point where some ideas are impractical or impossible to research even on the most expensive AWS machines. To make them possible we have to be smart about how we manage data.
Most Quants will start by doing this wrong — typically by reaching for a third-party database or a columnar data file format. In practice, a much better solution requires barely 150 lines of code and no third-party system.
Databases are inadequate for the data sizes we need to work with. For example, a naive select query on a SQL database with 1e6 records (small by Quant standards, e.g. 1-minute data for a single perp contract) can easily take minutes to complete and parse rows into usable data types. We expect to commonly deal with 1e9 records and more at any one time.
Columnar file formats such as Apache Parquet are much better, though still complicated and very slow compared with theoretical limits and with the solution we describe here.
Instead, we show a technique that addresses all of the earlier points. Loading of most data is instant and hardware resource usage minimal. Our system is able to run within a few percent of the maximum disk throughput. It is also much simpler and orders of magnitude more performant than most (all?) third-party alternatives.
The method is portable across programming languages, here Rust and Python.
To give a demonstration of what is possible, the following Python code:
%%time
klines = read_klines('btc/usdt', 'futures', 'binance')
klines.shape
takes this long to complete with the following result:
CPU times: total: 0 ns
Wall time: 0 ns
(1926509,)
Yes, that is 0 nanoseconds. The above code loads 1,926,509 1-minute candlesticks for Binance USD-stablecoin margined futures into a Numpy array in Python.
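For context on how a load can register as 0 ns: the call above does not copy or parse anything. A minimal sketch of an implementation that behaves this way, assuming klines are stored as flat binary files of fixed-size records and memory-mapped with NumPy (the directory layout, file name, and field set here are illustrative assumptions, not the attached implementation):

import numpy as np
from pathlib import Path

# Assumed fixed-size record layout: one kline per struct, no headers, nothing to parse.
KLINE_DTYPE = np.dtype([
    ('open_time', 'u8'),
    ('open', 'f8'),
    ('high', 'f8'),
    ('low', 'f8'),
    ('close', 'f8'),
    ('volume', 'f8'),
])

DATA_ROOT = Path('data')  # assumed root directory of the store

def read_klines(symbol: str, market: str, exchange: str) -> np.ndarray:
    # np.memmap only creates a mapping; the OS pages data in lazily when the
    # array is actually touched, which is why the call itself takes ~0 time.
    path = DATA_ROOT / exchange / market / symbol.replace('/', '_') / 'klines.bin'
    return np.memmap(path, dtype=KLINE_DTYPE, mode='r')

Because such a layout is just raw structs on disk, the same file can be mapped from Rust (or any other language) with a matching record definition, which is what makes the approach portable. Under this reading, the 10.3 ms for the mean below is where the disk work actually happens: the OS faults in the mapped pages as np.mean walks the 'close' column.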
To show that our technique is fully usable, we compute the mean price and see again how long this takes:
%%time
np.mean(klines['close'])
results in:
CPU times: total: 0 ns
Wall time: 10.3 ms
26704.60773270721
This test was run on a mid-range computer from 3 years ago with an M.2 SSD.
We provide one more benchmark on the Rust side, which is more relevant for backtesting. This time we load all BTC-quoted pairs from Binance Spot and compute an EWMA for each of them:
EWMA all BTC pairs
---------------------------
Time : 18.42 seconds
Klines : 614098628
Throughput : 2798.56 MB/s
Klines/s : 33346791.62 klines/s
I redid the benchmark after the last article's announcement, having realized that I had an Ethereum node syncing in the background and eating my I/O. The result is now 2x the speed reported in that announcement.
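For readers who want a feel for the processing pattern without reading the Rust, a comparable single-pair pass in Python over the memory-mapped array might look like the following sketch (the ewma helper, its alpha parameterization, and the 'close' field name are assumptions; the attached Rust implementation streams all pairs and is what produced the numbers above):

import numpy as np

def ewma(values: np.ndarray, alpha: float) -> np.ndarray:
    # Plain exponentially weighted moving average over a 1-D array.
    out = np.empty(len(values), dtype=np.float64)
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1.0 - alpha) * out[i - 1]
    return out

# Example usage with the memory-mapped klines from earlier (symbol is illustrative):
# klines = read_klines('eth/btc', 'spot', 'binance')
# smoothed = ewma(np.asarray(klines['close'], dtype=np.float64), alpha=0.1)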
As before, the complete source code is attached to the article for the reader’s benefit. The implementation is simple and takes around 150 lines of code.
The article itself is a commentary on architecture and a how-to guide. Readers should be able to easily extend the data store to support more data types based on their needs, such as aggregated trades and order book updates, and to adapt the source code to their language of choice.