

Post-Mortem: Losing Money At 36k-Feet Above Sea Level & How Not To
A well-rounded guide to loss prevention, error management, logistics, DevOps, and related topics for the independent Quant.
Picture this: you're about to board a 10-hour flight. As you board the plane (or maybe while waiting at the gate), a notification pops up in your pocket. You're busy with other things, so you ignore it and forget about it (in fact, you're used to ignoring notifications, having decided at some point that that's the only healthy way to approach them). You walk onto the airplane, sit down, take out your phone, and turn on airplane mode, still not paying attention to the notifications.
Soon, you've made yourself comfy in your seat. V1, rotate, takeoff, all is good. As you look outside the window, you decide to quickly scan notifications and delete them all.
Then you see it:
Whoops…
So, this happened to me last Wednesday, and I figured it makes for a good opportunity to write about robustness and what one can do to avoid losing money (or a whole account) to things like this, and to errors in general.
This will not be a complete, fool-proof guide that covers every possible failure case, as that is probably impossible. Instead, it is a collection of mechanisms, both manual and automated, that together give us pretty good coverage (and better sleep).
Everything I write about here is based on my last three years of trading, observations, and what I saw others recommend. I run or have run all of these in my own systems.
Let’s begin.
So, I’m toast?
Maybe, let’s see what happened next and what lessons we can extract from it.
Back to the airplane.
Okay, I see the notification and realize what happened: Binance retired an endpoint that my old connector code used:
/fapi/v1/account
and replaced it with v2. I read the changelog earlier but didn't make the connection that I was using the old endpoint (there is perhaps a lesson here, but this is the first failure of this type in three years, so maybe it's acceptable).
The second thing I realize is that I have open positions in this account. The strategy there trades shitcoins with a relatively large size on a short time horizon. Leaving this unmanaged for the next 10 hours (or any hours really) may even liquidate the account if I’m unlucky.
Needless to say, this would suck.
Now, this is an old (and, as we will see, pretty janky) infrastructure. I hadn't updated it (blame my laziness/priorities) and wasn't sure how it would behave in this particular failure scenario.
The warning tells me that the error was classed as transient (retriable) due to the unknown error code, and therefore became a soft warning. In other words, rather than closing all positions and killing itself, the program will retry the request every few minutes.
The old infrastructure also has a quirk to it: it relies on the (now failing) request to correct its view of the inventory over time (as opposed to using the event feed, which is the proper way). As this no longer works, what it thinks its inventory is and what it actually is will likely diverge gradually.
Or so I think at the time (another lesson here: we should make sure the system can survive failing to sync inventory for X minutes).
After thinking for a bit, I decide to kill the program manually. To do that I need an internet connection, and my phone has no signal by this point. Fortunately, the airline offers an in-flight wifi service, so I buy it, and my next thought is to dig into the source code to figure out what the infrastructure will actually do.
I do a few things that for OpSec reasons will remain unwritten and end up getting SSH access to the machine running the bot from my phone (this is the second time a mobile SSH client is a lifesaver — another lesson).
I proceed to cat the relevant source files and… nope, better turn it off.
So I open Discord and type:
/kill
Alas, the bot is now dead and I can continue the flight in peace. I will fix it once I land and get some sleep. Except:
The positions weren’t closed because I messed up the code.
“Great!”— I think enthusiastically (not!) and close the positions manually via the phone app (more on this particular error in a moment since it’s important).
Okay, I’m safe.
What can we learn from this?
In terms of raw errors, there were two:
Inventory sync failed as a result of an endpoint retired by the exchange. More precisely, the infrastructure incorrectly classified the error as transient because of the unknown error code and failed to behave correctly, i.e. stop itself.
Failed orders to close positions due to bad request timestamps. This is in fact much worse than (1), because if the bot had stopped itself without me realizing it and the close orders had failed, the positions would have been left completely unmanaged.
These can be dealt with automatically (for the most part) so I will leave them for the next section and discuss the manual measures here.
There were a couple of things that were immediately useful on the manual side:
Notifications, in this case via Discord.
Commands, such as /kill.
Mobile SSH access.
Notifications
This is simply a way to conveniently monitor our system live.
I found it convenient to do this by hooking it up to an IM (such as Telegram, Discord, LINE, etc). This also plays nicely with remote commands since most IMs will let us do both notifications and commands (more on the latter in a bit).
The choice of IM depends in part on our preferences but there are a few things to think about. I’ll try to list them here:
Expected downtimes.
Rate limits.
End-to-end encryption.
Can we set a different ringtone to bypass silent mode for important alerts?
Is the API good?
I found Telegram to be good on all of these except 4 (it can’t bypass silent mode on iOS). Discord is terrible on all of these points (yes, the old infrastructure uses it).
The other problem is what to notify about.
The first time I implemented notifications I sent them for every trade, which ended up really distracting (after all why wouldn’t we want to have our eyes glued to the candles for every trade?).
It is probably better to notify only about significant changes to total equity, exposure, and leverage, and/or with a once-a-day summary. I write this as a bit of an afterthought since I haven't made this change myself yet.
Implementation-wise, we should handle notifications on a distinct thread to avoid blocking the main one. The design I use looks something like this:
# Python-esque pseudocode
def notify_thread(send_queue):
    for msg in send_queue.recv():
        telegram.send(msg)

def trading_loop():
    send_queue = channel(max_size=10)
    spawn(lambda: notify_thread(send_queue))
    loop:
        # do the trading...
        send_queue.send("Buy 10 DOGE @ $1")  # returns immediately if full
Notably, the send queue has a finite size to avoid blowing up the rate limit. When the queue is full, send_queue.send() will simply return immediately without sending.
Yes, skipping messages is undesirable, but it is also easy to spot because we will be spammed with messages first. The alternative risks accumulating a large number of unsent notifications in the queue and running out of RAM, so dropping is pretty good by comparison.
Lastly, if we want to, we can also play with priority, multiple channels, etc. The above is just a basic design.
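For readers who want something they can run, here is a concrete sketch of the same bounded-queue design using Python's standard library. The telegram.send call from the pseudocode is replaced with a deliver callback you supply; make_notifier is my own name for the wiring.

```python
import queue
import threading

def notify_thread(send_queue, deliver):
    # Drain the queue forever; None is a sentinel meaning "shut down".
    while True:
        msg = send_queue.get()
        if msg is None:
            break
        deliver(msg)

def make_notifier(deliver, max_size=10):
    # Bounded queue: when full, messages are dropped instead of
    # blocking the trading loop or growing without limit.
    send_queue = queue.Queue(maxsize=max_size)
    worker = threading.Thread(
        target=notify_thread, args=(send_queue, deliver), daemon=True
    )
    worker.start()

    def notify(msg):
        try:
            send_queue.put_nowait(msg)  # returns immediately if full
        except queue.Full:
            pass  # message dropped; the rate limit is protected
    return notify, send_queue
```

The trading loop then just calls notify("Buy 10 DOGE @ $1") and never blocks on delivery.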
Emergency alerts
Emergency alerts are a special case of notifications where we want to be notified regardless of circumstances — in the middle of the night, in a movie theatre, etc. (yes, I’ll be a dick if it’s about my trading accounts).
One way to do this is to pay for a service that lets us make phone calls via an API and set the contact to bypass silent mode.
Commands
Commands are a handy way to control the bot remotely. Most IMs will have them built in, for example, here’s a screenshot of what I currently run from Telegram (I cut off the top so people can’t find the bot and mess with it):
This may seem very useful, but in practice I mostly ended up using /kill and /info and not much more, so if we want the bare minimum, these two will suffice.
A case may also be made for /ban and /pause, which are good when something unexpected happens (as it does in Crypto) or when we need to move assets around manually without the bot interfering. In practice, most of the time I used /ban it lost me money, and I almost never used /pause.
Here is a list of the commands I have for the reader’s benefit:
Commands:
/info - Display stats and positions.
/roll - Roll a dice.
/ping - Check if bot is running.
/kill - Stop the bot and market exit all positions.
/stop - Stop the bot and limit exit all positions.
/resume - Resume trading after /stop or /hardstop.
/abort - Abort the bot process.
/pause - Cancel all outstanding orders and pause trading.
/unpause - Unpause trading.
/record - Toggle recording.
/ban - Ban an instrument from trading.
/unban - Unban an instrument from trading.
/debug - Display debug information.
/debug_instrument - Display instrument debug information.
/cancel - Cancel current command.
/sync - Reload bot state from the exchange.
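Behind the IM integration, a command layer can be as simple as a dictionary of handlers. This is a hypothetical sketch (the DummyBot-style methods and the dispatcher shape are my own, not the article's actual implementation):

```python
def make_dispatcher(bot):
    # Map command strings to handlers; `args` is the list of words
    # after the command itself (e.g. the instrument for /ban).
    commands = {
        "/ping": lambda args: "pong",
        "/info": lambda args: bot.info(),
        "/kill": lambda args: bot.kill(),
        "/ban":  lambda args: bot.ban(args[0]),
    }

    def dispatch(text):
        parts = text.strip().split()
        handler = commands.get(parts[0]) if parts else None
        if handler is None:
            return "Unknown command. Try /info or /kill."
        return handler(parts[1:])

    return dispatch
```

Wiring this to Telegram or Discord is then just a matter of feeding incoming messages into dispatch() and sending its return value back.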
Mobile SSH access
Mobile SSH access is very useful when we need to do something we haven’t implemented a command for and we only have a mobile phone at hand. This is pretty obvious so I won’t say much here.
The only thing I will say is to treat OpSec seriously and at minimum limit SSH access to a private network.
Let’s look at techniques for handling failure automatically next.
My code can’t fail, literally, almost
By this I don't just mean handling some errors individually: ideally, our approach should ensure reasonable behaviour regardless of the particular error.
After all, we don’t want that alarm in the middle of the night or in the movie theatre to actually happen, do we?
The technique described below takes us 90% there.
Isolating errors
The fundamental problem we will address is that errors tend to pop up in seemingly random places and scatter across a codebase. This makes them, and failure in general, hard to reason about (and makes our sleep worse).
Some time ago, I was doing some refactoring and noticed an interesting property that solves this problem.
I’ll try to show where the observation comes from first. I’m not sure how clear it will be to readers who don’t know Rust but I’ll try to do this as simply as possible.
Suppose we have a call chain of several functions that looks like this:
main()
calls driver.run()
calls strategy.on_event()
calls order_exec.set_inventory()
calls order_mgr.send_order()
Yes, we shouldn’t be sending orders on the main thread but let’s say we are for now.
Now, in Rust, errors are explicit: typically they are passed around by returning a Result&lt;T, E&gt; from a fallible function, a value which is either the successful result (T) or an error (E).
The observation is that:
If we don't want to handle an error at a particular level in this call chain, we must make the function at that level fallible (return a Result&lt;T, E&gt; from it).
Conversely, and more importantly, if we do handle it, then we can keep the function infallible (return only T instead of Result&lt;T, E&gt;).
For example, suppose that send_order() returns a Result<T, E>. If set_inventory() handles the error from send_order() internally (along with any other errors of its own), then it itself does not need to return a Result. Therefore, on_event(), run() and main(), also don’t need to do so.
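The same idea carries over to exception-based languages: handle the I/O error inside the boundary function, and everything above it stays infallible. A hypothetical sketch (the class and method names echo the call chain above but the bodies are mine):

```python
class OrderManager:
    """The only fallible layer: it talks to the exchange and may raise."""
    def send_order(self, symbol, qty):
        # Stand-in for a real exchange request that can fail.
        raise ConnectionError("exchange unreachable")

class OrderExecutor:
    def __init__(self, mgr):
        self.mgr = mgr
        self.pending_retries = []

    def set_inventory(self, symbol, qty):
        # Handle the I/O error right here, at the boundary, so that
        # on_event(), run() and main() above us never need to.
        try:
            self.mgr.send_order(symbol, qty)
            return True
        except ConnectionError:
            # e.g. schedule a retry, raise an alert, etc.
            self.pending_retries.append((symbol, qty))
            return False
```

Callers of set_inventory() get a plain boolean and no exceptions, which is the Python analogue of keeping them infallible.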
It turns out that this approach allows us to isolate errors to basically only the calls which deal directly with I/O. Taking it a step further, of these calls, only the exchange Connector (as well as any other feeds, if we have those) is critical to trading safely.
The exchange Connector consists of only a few functions/points of failure and each of them is (or should be) used in one place only:
Get exchange and account state.
Place, modify, cancel order.
Set initial leverage.
Event feed.
In effect, this means there are only these few calls in the entire infrastructure which need to deal with trading-related failure, so failure is isolated and significantly easier to think about than if it was scattered all over the codebase.
Quite beautiful, isn’t it?
This is also the approach we took when implementing the OrderExecutor and other components in prior articles.
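As a sketch, those few points of failure can be made explicit by gathering them behind a single interface. The method names below mirror the list above but are otherwise my own invention, not the article's actual code:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """The only component in the infrastructure allowed to do trading I/O."""

    @abstractmethod
    def get_account_state(self):
        """Exchange and account state (balances, positions)."""

    @abstractmethod
    def place_order(self, symbol, side, qty, price):
        """Place an order; modify/cancel live at this layer too."""

    @abstractmethod
    def cancel_order(self, order_id):
        """Cancel an outstanding order."""

    @abstractmethod
    def set_leverage(self, symbol, leverage):
        """Set initial leverage for an instrument."""

    @abstractmethod
    def events(self):
        """The event feed: an iterator of account/market updates."""
```

Everything behind this interface may fail; everything in front of it can be written as if failure does not exist.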
Of course, we still need sanity assertions but these deal with logic errors and bugs, not with unpredictable I/O or exchange failure of one kind or another.
And because they are assertions, they can be dealt with in bulk by a dead switch.
Dead switches
Our infrastructure will die at various points and for various reasons; these include failed sanity assertions, integer overflows, segmentation faults, out-of-memory and out-of-disk-space conditions, intentional panics, etc.
When this happens we want to (1) handle the failure in some reasonable way; (2) be notified.
As far as detecting and reacting to failure, it can be done in one of two ways:
The first way is code within the infrastructure that detects when the trading thread has died and performs some action as a result, such as cancelling all orders and exiting all positions. We saw this one fail in the story.
For example:
thread = spawn(trade)
result = thread.join()
if result.is_err() {
    loop {
        cancel_all_orders()
        close_all_positions()
        sleep(5) // Do not spam the exchange to avoid getting banned.
    }
}
This has the downside of running with the infrastructure so there is a chance that it will never actually run, for example, if the entire machine dies.
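In Python, the same in-process dead switch might look like the sketch below. The cancel/close callbacks are stand-ins for real exchange calls, and unlike the pseudocode above this version stops retrying once the cleanup succeeds:

```python
import threading
import time

def run_with_dead_switch(trade, cancel_all_orders, close_all_positions,
                         retry_delay=5):
    error = []

    def wrapper():
        try:
            trade()
        except BaseException as e:  # also catches failed assertions
            error.append(e)

    t = threading.Thread(target=wrapper)
    t.start()
    t.join()

    # If the trading thread died, flatten the account; retry with a
    # delay so we do not spam the exchange and get banned.
    while error:
        try:
            cancel_all_orders()
            close_all_positions()
            break
        except Exception:
            time.sleep(retry_delay)
```

Catching BaseException is deliberate: it covers AssertionError and KeyboardInterrupt-style exits, not just ordinary exceptions.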
The second way is to have a separate bit of code running elsewhere that pings the bot and notifies us if it fails to hear back for some short period of time.
It doesn't have the problem of potentially dying with the bot, but we should be careful about automatically cancelling orders and closing positions with it: if the bot has not actually died, it may end up re-opening them, causing the dead switch to re-close, the bot to re-open, and so on, draining the account.
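A minimal external watchdog boils down to a heartbeat check: the bot records a timestamp regularly, and a separate process alerts when that timestamp goes stale. A sketch, with the alert callback standing in for whatever notification channel you use:

```python
import time

def watchdog_check(last_heartbeat, alert, timeout=60, now=None):
    # Run this periodically from a separate process or machine.
    # `last_heartbeat` is the epoch time of the bot's last ping.
    now = time.time() if now is None else now
    stale_for = now - last_heartbeat
    if stale_for > timeout:
        alert(f"No heartbeat for {stale_for:.0f}s, bot may be dead")
        return False
    return True
```

Note that, per the caveat above, this version only notifies; it deliberately does not cancel or close anything on its own.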
As an honourable mention, some exchanges implement dead switches built into their APIs that automatically cancel open orders if we lose connection to the event feed. This alone is not sufficient, however, and it does nothing about existing exposure.
When are we toast then?
There are certain classes of logic errors to be aware of when writing trading code that are particularly dangerous to an account. The first is unwanted exposure and the second is buy-sell loops.
Unwanted exposure
The former occurs when our bot takes on more exposure than we want, for example, because its inventory or order tracking has a bug in it.
This is not helped by the fact that the default leverage limit on most perp exchanges is 20x, which can drain our account very quickly because the exchange will allow us to lever up to that multiple. It is best to lower it if we don’t intend to use it.
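Besides lowering the leverage limit on the exchange side, a cheap in-code safeguard is a hard exposure cap checked before every order. The function and its numbers below are illustrative, not a prescription:

```python
def check_order(position, order_qty, price, equity, max_leverage=3.0):
    # Refuse any order that would push notional exposure past a hard cap,
    # regardless of what the rest of the bot believes its state is.
    new_notional = abs(position + order_qty) * price
    cap = max_leverage * equity
    if new_notional > cap:
        raise AssertionError(
            f"order would take exposure to {new_notional:.2f}, cap is {cap:.2f}"
        )
    return new_notional
```

Because it raises an assertion rather than handling anything, a tripped cap simply kills the bot and lets the dead switch flatten the account.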
Buy-sell loops
The latter (buy-sell loop) occurs when we buy and sell an asset repeatedly very quickly paying fees and spread (as well as possibly depth) in the process.
To my knowledge, there are no automated ways to completely protect against either unwanted exposure or buy-sell loops (yes, we can try to detect the latter; I did, and it felt janky and unnecessary, so I leave it to the reader to decide).
Something that helps a lot however is to run the infrastructure in a simulator with the assumption that if the problem did not occur in the simulator then it also won’t occur in reality.
Of course, this does not cover all cases hence we should still be careful.
Bad request timestamps
At the beginning of the article, we saw the orders to close positions fail on account of invalid timestamps.
This occurred as a result of a significant clock difference (more than 10 seconds, in fact) between the exchange and the machine running the bot, combined with relying on the local clock for the timestamp (a newbie mistake).
The topic of time has more nuance than we would think so let’s look briefly at what matters to us.
The default system clock can be a tragically unreliable way to tell time. For one, it is not guaranteed to be monotonic. Two, most operating systems will synchronize local time with an NTP server every now and then, changing it. Three, and most importantly, nothing prevents our local time and the exchange time from drifting substantially apart, and thus our requests from failing.
To do this properly we should:
Use the exchange’s time obtained, for example, from the latest event or an API request.
Use a large receive window for requests which are not latency sensitive.
Theoretically, it should also be possible to disable NTP synchronization, measure the clock difference between the local and exchange clocks, and adjust for that difference subsequently.
However, after I did this I was still getting errors, so I'm giving it a 50-50 chance that either I wrote a bug or the exchange's clock was itself variable (syncing with NTP?). In any case, event times work well.
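The first point, using the exchange's own time, boils down to tracking an offset between the last exchange timestamp we saw and a local monotonic clock. A hypothetical sketch (the class and its names are mine):

```python
import time

class ExchangeClock:
    """Exchange time = last synced exchange timestamp + monotonic elapsed."""

    def __init__(self):
        self._exchange_ms = None
        self._mono_at_sync = None

    def sync(self, exchange_time_ms):
        # Feed this from the latest event or a server-time request.
        self._exchange_ms = exchange_time_ms
        self._mono_at_sync = time.monotonic()

    def now_ms(self):
        # Monotonic elapsed time is immune to local NTP jumps, so the
        # derived exchange time cannot suddenly step forwards or back.
        elapsed_s = time.monotonic() - self._mono_at_sync
        return int(self._exchange_ms + elapsed_s * 1000.0)
```

Request timestamps then come from now_ms() instead of the system wall clock, and each incoming event re-anchors the offset via sync().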
Bonus: How to run in production
I was asked over DMs about how I deploy a bot in production. This is quite simple so let’s talk about it.
You may consider this a bonus section.
Getting a Linux VM
To run infrastructure live we will need a computer to host it on with a static IP address which we can access via SSH. The easiest and frankly best way to do this is a VPS (for example AWS EC2).
The static IP address is required because exchanges typically require one for long-term API keys for security reasons.
There is also an open secret (or not even a secret) that we can get the best latency to Binance and many other exchanges by hosting our infrastructure on EC2 in the Tokyo region (this may change with time so do your own research).
Once we get it we can connect to it via SSH and start setting things up.
Running in the background
To keep our bot running when we are not connected via SSH, we can either run it as a system service, which is the “proper” sysadmin way, or we can run it the “I-just-want-it-working-and-don’t-care-about-the-proper-way” way with tmux.
The latter is much simpler and as far as I know, there are no real disadvantages to it compared to the former — so let’s look at it.
There are a lot of tmux tutorials out there so I'll only repeat the basics here.
We’ll make a new tmux session by typing:
tmux new-session
This will give us a persistent terminal that will stay alive after we leave SSH.
We can detach from it at any time by pressing ctrl+b followed by d.
To re-attach to it we will type:
tmux list-sessions
tmux attach-session -t <session number or name>
And that’s really it.
Conclusion
Whew, this makes for a pretty long post (and an eventful week). We have dissected a real-life algorithmic trading emergency and discussed mechanisms to monitor, prevent and handle such emergencies reasonably. We also looked at logic errors to be aware of and at deploying a trading infrastructure as a bonus.
As always, I hope this article was useful to you, thank you for reading and see you next time!
Also, let me know in X DMs (@TaiwanQuant) or chat here if you would like to see more materials like this in place of pure Quant Infrastructure tutorials.
If you are a free subscriber and would like to read the Quant Infrastructure tutorials or simply support the publication consider subscribing.
Cheers!
Disclaimer: Trading is a risky endeavour and I am not responsible for any losses you may incur implementing ideas learnt through these articles.
Disclaimer: This article is not financial advice.