At PEAK6 Capital Management we operate a variety of different systems in
support of our trading teams. As we improve and evolve these systems, we
sometimes run into hurdles along the way that are not all that easy to
diagnose. This is the story of one of those hurdles, which our systems and core team ran into while updating our pricing system.
As you might expect, market data and pricing systems are central to our
trading platform and they process a LOT of data (around 7MM
messages/second or about 4TB/day) just for pricing (in aggregate, our
systems can pump more than 30MM messages/second at 10Gbps).
All this data is needed so these systems can give us visibility into market
volatility, position risk, and liquidity. Additionally, even though we
are not a high-frequency trading shop, our ability to create, monitor,
and execute trading strategies relies on up to date market data and
derived calculations like NBBO, the greeks, and others. So, imagine our
surprise when a routine server hardware upgrade resulted in our
high-performance pricing system using new processors like this in our
test environments:
When HyperThreading is enabled, each physical CPU core acts as if there are two logical CPUs available, and those logical CPUs appear in tools like htop or lscpu. In the image below, you can see how HyperThreading is advertised to impact execution, essentially reducing CPU idle time by packing in two threads' worth of work.
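If you want to see this topology for yourself, lscpu summarizes it nicely. The check below is illustrative; the exact counts depend entirely on the host:

# Summarize sockets, cores per socket, threads per core (HyperThreading),
# and how the logical CPUs are split across NUMA nodes
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|NUMA node'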
At this point our engineers working on the pricing system started to question their sanity and began trying to reproduce the behavior in our testing environments. Easy enough; one variable down, let's keep asking why and see if we can get to the root cause.
This shows the pricing system running across all 80 logical CPUs with fairly
even distribution and is exactly what we’d expect the behavior to look
like. There were a few sighs of relief seeing the expected behavior, but
more than a few furrowed brows as we now wanted to know what was
different between environments.
I should mention that most of PEAK6's applications rely on Kubernetes and
are containerized. Moving our standalone apps to containers and using
an orchestrator like Kubernetes means that we have a common interface
and application runtime across all of our systems. We have around 200
nodes across four geographically isolated data centers running around
1,500 applications at any given time. Additionally, as we have systems
like pricing and use some low-level hardware features for things like
shared memory and multicast, we do end up with some highly tailored
Kubernetes configurations.
Now, in our development environments, we typically launch applications outside of Docker because we want to do local testing, debugging, etc., without building images. The downside is that we may run into differences in application behavior depending on whether the application runs locally or within Docker. When we tested the pricing system in the
development environment as shown above, we tested both natively and
using a containerized version of the application. In both cases, the
pricing system had a fairly even load distribution across CPUs. So this
was another variable we could put to rest… or so we thought.
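For what it's worth, you don't need htop screenshots to compare the two runs; a quick per-CPU utilization snapshot does the job. The interval and count below are arbitrary:

# Print per-CPU utilization once per second, five times, to eyeball how evenly
# load is spread across the logical CPUs
mpstat -P ALL 1 5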
At times like these, the best thing you can do is go full detective and
review your case notes. Let’s recap what we’ve seen so far:
It was in reviewing the above that we realized that in each case we saw every other CPU being utilized: it was always every odd or every even CPU. Why not all CPUs for a specific processor? Why not just one processor? This led us to look more at the processor stats, where we saw the following:
Diving deeper, we used the numastat tool to get more insight into how NUMA was performing, and the trail got hot as we saw the following:
Here the numa_miss and numa_foreign counts tell us that our pricing system is running almost exclusively on node0. That also explains why threads running non-locally are not executing as efficiently and why the workload is unbalanced. If we look at our dev environment, we see a much better profile:
That's no numa_miss or numa_foreign and almost 1:1 local_node memory use. This gets us to the root of the behavior: inefficient thread scheduling across NUMA-bound processors. However, it doesn't explain why it was happening.
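For anyone who wants to poke at this on their own hosts, the counters above come straight from numastat, and lscpu and numactl show how logical CPUs map onto NUMA nodes. On many dual-socket boxes the logical CPU numbers alternate between the two nodes, which is exactly the kind of layout that turns a one-node workload into an odd/even CPU pattern. The commands below are illustrative, and the process name is a placeholder:

# System-wide per-node allocation counters: numa_hit, numa_miss, numa_foreign,
# local_node, other_node
numastat
# Per-node memory breakdown for processes matching a name (placeholder pattern)
numastat -p pricing
# Which NUMA node and physical core each logical CPU belongs to
lscpu -e=CPU,NODE,SOCKET,CORE
# The same layout from the NUMA side, plus per-node memory sizes
numactl --hardware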
As a note, I'm simplifying here a bit, as there's more complexity with scheduling algorithms, balancing latency targets, time slicing, etc., that can all be tuned and could each be their own post.

Me: Knock, knock.
You: Who's there?
Me: Kubelet.
You: Kubelet who?
Me: Kubelet gonna schedule all your containers' threads randomly.

Let's recap the facts again:
The final clue in our case came as we compared Kubernetes pod configurations between environments, looking for any additional hints as to why the pricing system in development worked as expected while the Kubernetes-scheduled pricing system in UAT or Production continued to show a CPU imbalance.
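A quick way to do that comparison is to dump just the resources stanza from each environment. The contexts, namespace, and deployment name below are placeholders:

# Pull the container resource requests/limits from each cluster for a side-by-side look
kubectl --context dev -n pricing get deploy pricing-system \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
kubectl --context prod -n pricing get deploy pricing-system \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'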
As it turns out, our pod configurations for the pricing system did not include any CPU requests in UAT or Production. This seemed benign, but it resulted in the kubelet (the Kubernetes worker-node agent) assigning only the minimum CPU shares to our pricing system containers. You can see this in the source, where the conversion from milli-CPU to shares returns MinShares.
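You can also see the result of that minimum directly on a running pod by reading its cgroup. The pod name below is a placeholder, and the path assumes a cgroup v1 node (cgroup v2 exposes cpu.weight instead):

# Read the CPU shares kubelet assigned to the pricing container (placeholder pod name)
kubectl exec -it pricing-system-0 -- cat /sys/fs/cgroup/cpu/cpu.shares
# With no CPU request this prints 2 (MinShares); with requests.cpu: "70" it
# would read 70 * 1024 = 71680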
Alright, you're thinking: I get it, there were only two CPU shares defined for these containers, but what does that have to do with NUMA and CPU balancing? Here's where we go all the way down the rabbit hole.
Further evidence of this behavior can be seen by checking /proc/vmstat, which shows virtual memory statistics including numa_pte_updates and numa_huge_pte_updates. These show pages being marked for migration between NUMA nodes around 19 billion times, plus another 3,495 times for huge pages!
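These counters are easy to pull yourself; the kernel's automatic NUMA balancing exposes them alongside the rest of the virtual memory statistics:

# Show the NUMA-related counters, including numa_pte_updates,
# numa_huge_pte_updates, numa_hint_faults, and numa_pages_migrated
grep numa /proc/vmstat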
Simply adding resources.requests.cpu: "70" to the pod configuration changes everything. It allows the kubelet to properly hint that the pricing system's threads will need compute, which results in more contiguous processor allocation and better NUMA memory locality. This one-line change completely transforms the pricing system's behavior, giving us better saturation across all available resources.
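If you want to try the equivalent change imperatively before editing the manifest, kubectl can patch the request in. The namespace, deployment, and container names below are placeholders:

# Add a 70-CPU request to the pricing container; kubelet converts this to
# roughly 70 * 1024 = 71680 CPU shares
kubectl -n pricing set resources deployment/pricing-system \
  --containers=pricing --requests=cpu=70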
First, we see CPU shares appropriately set at around 72k instead of 2.

We're rather lucky here, as the Linux scheduler reacts nicely to the CPU request. Newer Kubernetes versions continue to add better support for NUMA and CPU topology awareness and assignment. If we were still running into issues after narrowing down CPU shares as the root cause, we could use finer-grained numactl controls or cgroup cpuset configurations (cpuset.cpus and cpuset.mems) to continue to tweak the allocations. Thankfully, we don't have to yet.
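If we ever did need to go further, the heavier-handed version looks roughly like this; the binary name is a placeholder and the node numbers depend on the host:

# Pin a process's threads to NUMA node 0 and force its memory allocations to
# come from node 0 as well
numactl --cpunodebind=0 --membind=0 ./pricing-engine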
After applying our one-line fix for a CPU request, we watched our pricing
system performance over the next few days in our UAT environment. The
change not only corrected the CPU performance imbalance, but also
resulted in about a 3x improvement in calculation speed along with a 75%
reduction in overall calculation queuing during market spikes. Heck
Yes!
Hopefully you’ve gotten a taste of some of the complexity that comes with
maintaining a high-performing pricing system. Along the way we talked a little about CPU features like HyperThreading and affinity, NUMA and memory allocation, and Kubernetes, and we shared how our engineers worked together to triage this issue, from reproducing it to evaluating a fix.
If you’re interested in diving into these kinds of problems or working
on systems that really scale, check out our careers page. We're always looking for passionate, curious engineers who like working on hard problems!