Local caching is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSD or memory to cache hot datasets close to Presto workers. Presto supports hash-based soft affinity scheduling to ensure that only one or two copies of the same data are cached across the entire cluster, which improves cache efficiency by allowing more hot data to be cached locally. The current hashing algorithm, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.
Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (the preferred node). The mapping from a split to a Presto worker is computed by a hash function on the split, ensuring that the same split is always hashed to the same worker. The first time a split is processed, its data is cached on the preferred worker node. When subsequent queries process the same split, they are scheduled to the same worker node again, and since the data is already cached locally, no remote read is necessary.
To improve load balancing and handle flaky workers, two preferred nodes are chosen. If the first choice is busy or unresponsive, the second one is used. As a result, data might be physically cached on two worker nodes.
For more details on soft affinity scheduling, please refer to the Presto documentation.
Soft affinity scheduling relies on a hashing algorithm to compute the mapping between splits and worker nodes. Previously, a modulo function was used:
WorkerID1 = Hash(splitID) % workerCount
WorkerID2 = (Hash(splitID) % workerCount + 1) % workerCount
This hashing strategy is simple and works well when the cluster is stable and the set of worker nodes does not change. However, if a worker is temporarily unavailable or down, the worker count changes and the split-to-worker mapping is completely reshuffled, causing the cache hit rate to drop significantly. If the problematic worker comes back online later, the reshuffle happens yet again.
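For illustration, here is a minimal sketch of modulo-based assignment (the class and method names are hypothetical, not Presto's actual code), showing how a change in the worker count moves most splits to a different preferred worker:

```java
import java.util.List;

public class ModuloSchedulingDemo {
    // Pick the preferred and secondary worker for a split using modulo hashing.
    static int[] preferredWorkers(String splitId, int workerCount) {
        int first = Math.floorMod(splitId.hashCode(), workerCount);
        int second = (first + 1) % workerCount;
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        List<String> splits = List.of("split0", "split1", "split2", "split3", "split4");
        // Going from 3 workers to 2 (e.g. one worker goes offline) changes the
        // preferred worker for most splits, so their locally cached data is wasted.
        for (String split : splits) {
            System.out.printf("%s: 3 workers -> worker%d, 2 workers -> worker%d%n",
                    split,
                    preferredWorkers(split, 3)[0],
                    preferredWorkers(split, 2)[0]);
        }
    }
}
```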
To mitigate this issue, Presto uses the total worker count rather than the active worker count when computing the modulo-based assignment. However, this only mitigates the rehashing caused by workers going temporarily offline. There are also situations where it makes sense to add or remove workers because of workload fluctuations. In these scenarios, is it possible to keep a reasonable cache hit rate without introducing massive rehashing? The solution is consistent hashing.
Consistent hashing was introduced in 1997 by David Karger as a way of distributing requests among a changing population of web servers. The technique is now widely used in load balancing, distributed hash tables, and similar systems.
Imagine that the hash output range [0, MAX_VALUE] is mapped onto a circle, so that MAX_VALUE wraps around to 0. To demonstrate how consistent hashing works, assume a Presto cluster of 3 worker nodes and 10 splits that are queried repeatedly.
First, the worker nodes are hashed onto the ring. Each split is then assigned to the worker whose position is the next one after the split's hash value on the ring.
In the scenario above, the splits are assigned as follows:
| Worker | Assigned splits |
|---|---|
| Worker1 | Split1, Split4, Split6, Split8 |
| Worker2 | Split0, Split5, Split7 |
| Worker3 | Split2, Split3, Split9 |
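The ring lookup can be sketched in a few lines of Java using a sorted map (class and method names here are illustrative, not Presto's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// A minimal consistent-hashing ring: workers and splits are hashed onto the same
// [0, 2^32) range, and a split is owned by the first worker at or after its hash,
// wrapping around to the lowest position when it falls past the last worker.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addWorker(String worker)    { ring.put(hash(worker), worker); }
    public void removeWorker(String worker) { ring.remove(hash(worker)); }

    public String workerFor(String splitId) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(hash(splitId));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        List.of("worker1", "worker2", "worker3").forEach(ring::addWorker);
        for (int i = 0; i < 10; i++) {
            System.out.println("split" + i + " -> " + ring.workerFor("split" + i));
        }
        // Taking a worker off the ring only remaps the splits it owned; every other
        // split keeps its previous worker, so its cache stays warm.
        ring.removeWorker("worker2");
        for (int i = 0; i < 10; i++) {
            System.out.println("split" + i + " -> " + ring.workerFor("split" + i));
        }
    }
}
```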
Now if worker2 goes offline for some reason, then according to the algorithm, splits 0, 5, and 7 will be scheduled to the worker at the next hash value, which is worker1:
| Worker | Assigned splits |
|---|---|
| Worker1 | Split0, Split1, Split4, Split5, Split6, Split7, Split8 |
| Worker3 | Split2, Split3, Split9 |
Only the splits that were hashed to the offline worker (worker2 in our example) need to be rehashed; the other splits are not affected. If worker2 comes back online later, splits 0, 5, and 7 will again be hashed to worker2, without affecting the hit rate on any other worker.
Now suppose the workload increases and another worker node, worker4, needs to be added to the cluster. Worker4's hash value lands on the hashing ring as follows:
In this case, split8 falls into worker4's range, while all other split assignments are unaffected, so the cache hit rate for those splits is preserved. The new assignment is:
| Worker | Assigned splits |
|---|---|
| Worker1 | Split1, Split4, Split6 |
| Worker2 | Split0, Split5, Split7 |
| Worker3 | Split2, Split3, Split9 |
| Worker4 | Split8 |
As you can see from the above, consistent hashing guarantees that when nodes change, on average only N_splits / N_nodes splits need to be rehashed. However, because the workers' positions on the ring are not evenly spread, the splits might not be uniformly distributed among the worker nodes. The concept of "virtual nodes" is introduced to mitigate this issue. Virtual nodes also redistribute a node's load to multiple other nodes when it is disconnected, which reduces load fluctuation caused by cluster instability.
Each physical worker node is mapped to multiple virtual nodes, and the virtual nodes are placed on the hashing ring. A split is assigned to the next virtual node on the ring and is thus routed to the physical node that the virtual node maps to. The following example shows a possible scenario where each physical worker node has 3 virtual nodes:
| Physical worker | Virtual node | Assigned splits |
|---|---|---|
| Worker1 | Worker1_v1 | Split6 |
| | Worker1_v2 | Split0 |
| | Worker1_v3 | Split1 |
| Worker2 | Worker2_v1 | Split5, Split7 |
| | Worker2_v2 | Split4 |
| | Worker2_v3 | Split9 |
| Worker3 | Worker3_v1 | Split2, Split3 |
| | Worker3_v2 | |
| | Worker3_v3 | Split8 |
As the number of nodes on the hashing ring increases, the hash space is more likely to be evenly partitioned.
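Extending the sketch above, virtual nodes can be added by placing each physical worker on the ring several times (3 per worker here, matching the example; names are illustrative, not Presto's implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Consistent-hashing ring with virtual nodes: each physical worker appears on the
// ring several times, so when it is removed its splits spread over several
// neighboring workers instead of all landing on one.
public class VirtualNodeHashRing {
    private static final int VIRTUAL_NODES_PER_WORKER = 3;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addWorker(String worker) {
        for (int v = 1; v <= VIRTUAL_NODES_PER_WORKER; v++) {
            // e.g. worker1 is placed at the ring positions of worker1_v1 .. worker1_v3.
            ring.put(hash(worker + "_v" + v), worker);
        }
    }

    public void removeWorker(String worker) {
        for (int v = 1; v <= VIRTUAL_NODES_PER_WORKER; v++) {
            ring.remove(hash(worker + "_v" + v));
        }
    }

    public String workerFor(String splitId) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(hash(splitId));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }
}
```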
When a physical node goes down, all virtual nodes corresponding to that physical node are rehashed. But instead of all of its splits being rehashed to the same node, they are distributed across multiple virtual nodes, and therefore across multiple physical nodes, providing better load balancing.
The following shows that when worker3 is removed, Split2 and Split3 are rehashed to worker2, while Split8 is rehashed to worker1:
| Physical worker | Virtual node | Assigned splits |
|---|---|---|
| Worker1 | Worker1_v1 | Split6 |
| | Worker1_v2 | Split0, Split8 |
| | Worker1_v3 | Split1 |
| Worker2 | Worker2_v1 | Split5, Split7 |
| | Worker2_v2 | Split2, Split3, Split4 |
| | Worker2_v3 | Split9 |
This is currently an experimental feature that we recently contributed to Presto. Feel free to contact us if you are interested in testing or collaboration.
To use this feature, first enable data caching in Presto. Make sure you choose SOFT_AFFINITY as the scheduling policy: in /catalog/hive.properties, add `hive.node-selection-strategy=SOFT_AFFINITY`. Then enable consistent hashing: in config.properties, add `node-scheduler.node-selection-hash-strategy=CONSISTENT_HASHING`.
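Putting the two settings together (file locations may vary with your deployment):

```properties
# catalog/hive.properties -- schedule splits to their preferred (cache-affine) workers
hive.node-selection-strategy=SOFT_AFFINITY

# config.properties -- pick the preferred workers with consistent hashing
node-scheduler.node-selection-hash-strategy=CONSISTENT_HASHING
```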
As illustrated above, scheduling based on consistent hashing minimizes the impact on split assignments, and therefore on the cache hit rate of existing nodes, when a cluster's worker nodes change. This makes consistent hashing a better strategy in deployments where the Presto cluster is scaled up and down according to workload, or where the deployment does not have full control of the hardware and workers may be relocated from time to time.
In the Alluxio community, we have been constantly improving the integration between Alluxio and data applications (such as Presto in this article) in both functionality and usability. With the introduction of consistent hashing in Presto scheduling, Alluxio can better leverage the potential of soft affinity in Presto with higher data locality and cache efficiency, which translates into better performance and cost efficiency. We will continue bringing further improvements and optimizations to the data ecosystem.