The evolution, failures and design decisions behind one of the world’s largest real-time, high-frequency and low-latency streaming systems.
We run one of the largest real-time, high-frequency, low-latency streaming systems in the world with over 3 million messages per second and over 1320 billion messages per month. We call these Quotes Streamers.
This article talks about the evolution of this mammoth system and the design considerations behind it.

Fire and forget
In our case, everything is updating every few milliseconds. Anything that's a few seconds old is literally worthless. That's why we chose to design the system without any kind of storage or caching. This one decision tremendously improved scalability.

Since we had faced issues with Redis in the past, we decided to completely eliminate any use of memory except process memory, which is minimized as well.
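To make that concrete, here is a minimal Go sketch of what fire-and-forget delivery can look like under these constraints. The Tick and Subscriber types, the buffer size and the fanOut function are illustrative assumptions, not the actual Quotes Streamer code: each message is pushed to subscribers with a non-blocking send, and anything that cannot be delivered immediately is simply dropped, because a fresher tick is milliseconds away.

```go
package main

import "fmt"

// Tick is an illustrative message shape; the real payload is not described in this article.
type Tick struct {
	Symbol    string
	Price     float64
	Timestamp int64 // exchange timestamp in milliseconds
}

// Subscriber holds only a small buffered channel in process memory --
// nothing is written to disk or to an external cache.
type Subscriber struct {
	Out chan Tick
}

// fanOut delivers a tick to every subscriber with a non-blocking send.
// If a subscriber's buffer is full, the tick is dropped: stale quotes are
// worthless, and a newer one is already on its way.
func fanOut(subs []*Subscriber, t Tick) {
	for _, s := range subs {
		select {
		case s.Out <- t:
			// delivered
		default:
			// subscriber is slow; fire and forget, drop the tick
		}
	}
}

func main() {
	subs := []*Subscriber{{Out: make(chan Tick, 8)}, {Out: make(chan Tick, 8)}}
	fanOut(subs, Tick{Symbol: "RELIANCE", Price: 2875.5, Timestamp: 1700000000000})
	fmt.Println(<-subs[0].Out)
}
```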
Maintaining order

Every message is actually a trade that gets executed, and the latest message is the last traded price. You can't possibly mess up the order.

Now, the initial thought was to maintain the ordering on the server. However, any mechanism that maintains order or state in a highly concurrent system tends to become a bottleneck at scale. Instead, we chose to do the ordering on the client side: every message carries a timestamp, and the client simply discards any older message.

In classic Market Pulse fashion, we figured out a very simple solution that works well for us, and it is one of the key reasons why the Quotes Streamer is so freakishly fast.
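On the client, that ordering rule boils down to a single comparison. Below is a rough Go sketch of the idea, assuming a per-symbol map of the newest timestamp seen; the Quote type and apply function are made up for illustration, not our client code.

```go
package main

import "fmt"

// Quote is an illustrative message; every message carries the exchange timestamp.
type Quote struct {
	Symbol    string
	Price     float64
	Timestamp int64 // milliseconds
}

// latest tracks the newest timestamp applied per symbol, kept in client memory.
var latest = map[string]int64{}

// apply keeps a quote only if it is newer than what we already have;
// older or duplicate messages are simply discarded.
func apply(q Quote) bool {
	if q.Timestamp <= latest[q.Symbol] {
		return false // out-of-order or stale message, ignore it
	}
	latest[q.Symbol] = q.Timestamp
	// ...update the UI / last traded price with q.Price here...
	return true
}

func main() {
	fmt.Println(apply(Quote{Symbol: "TCS", Price: 4100.0, Timestamp: 1700000000200})) // true, first message
	fmt.Println(apply(Quote{Symbol: "TCS", Price: 4099.5, Timestamp: 1700000000100})) // false, arrived late, discarded
}
```

Because the discard happens on the client, the servers never have to coordinate or hold any ordering state, which is exactly what keeps them free of bottlenecks.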
No DNS or Load Balancers

We decided to put the LB on the client. We don't use DNS or load balancers on the server side at Market Pulse (another post on this soon). The client is served the list of available servers and load balances across them itself, using a simple round-robin algorithm.
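Here is a sketch of what client-side round robin can look like once the client holds the list of available servers; the ServerPool type, the Next method and the addresses are illustrative assumptions rather than our actual client code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ServerPool holds the list of available streamer addresses fetched by the client.
type ServerPool struct {
	addrs []string
	next  uint64
}

// Next picks the next server in simple round-robin order; safe for concurrent use.
func (p *ServerPool) Next() string {
	i := atomic.AddUint64(&p.next, 1)
	return p.addrs[int(i)%len(p.addrs)]
}

func main() {
	pool := &ServerPool{addrs: []string{"qs-1.example.com:443", "qs-2.example.com:443", "qs-3.example.com:443"}}
	for i := 0; i < 4; i++ {
		fmt.Println(pool.Next()) // qs-2, qs-3, qs-1, qs-2
	}
}
```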
Four reasons why we simply put LB on the client for QS:

Shared-nothing architecture

We wanted to make sure we could scale to hundreds of servers with no effort at all. For this, we ensured each instance was fully independent, unaware of the other instances, with nothing shared between them.
Everything is in process memory

In our design, we keep nothing on storage or in any in-memory cache; everything lives in process memory. You might say that a process could always die. And you'd be right. But we simply designed the system to be fault-tolerant.

If a process dies, it is recreated. Yes, the subscriber data and the past history of messages are lost. But that's okay. The clients reconnect to another healthy server in the meantime, within 3 seconds, and the new process is simply treated as a new addition to the cluster. There is no past history of messages and no list of subscribers to be restored from the old one.
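The recovery path is therefore entirely client-driven. Below is a hedged sketch of that reconnect loop, assuming the client already holds the server list and resubscribes from scratch after each reconnect; dial and resubscribe are placeholders for the real transport, not our actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dial and resubscribe stand in for the real transport layer (e.g. a WebSocket
// client); they are illustrative placeholders only.
func dial(addr string) (closed chan struct{}, err error) {
	fmt.Println("connecting to", addr)
	return make(chan struct{}), nil
}

func resubscribe() { fmt.Println("re-sending subscriptions") }

// run connects to one server after another; when a connection drops, the client
// backs off briefly and moves on to the next healthy server. No server-side state
// is restored -- the new server just sees a fresh subscriber.
func run(servers []string) error {
	if len(servers) == 0 {
		return errors.New("no servers available")
	}
	for i := 0; ; i++ {
		addr := servers[i%len(servers)]
		closed, err := dial(addr)
		if err != nil {
			time.Sleep(2 * time.Second) // wait briefly, then try the next server
			continue
		}
		resubscribe()
		<-closed // block until this connection dies, then fail over
	}
}

func main() {
	go run([]string{"qs-1.example.com:443", "qs-2.example.com:443"})
	time.Sleep(100 * time.Millisecond) // let the demo connect once, then exit
}
```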
Golang

We are very fond of Elixir. It could also have worked decently well for this purpose, but in our benchmarking it couldn't beat the performance of Golang for this particular use case. Had we needed distributed pub-sub or channels, Elixir would have been a much nicer fit. But given that we used a shared-nothing architecture, Golang was ideal for the need at hand.