visit
Purpose and usage
The output of a diff algorithm is called patch or delta. The delta format might be human readable (text) or only machine readable (binary). Human readable format is usually employed for tracking and reconciling changes to human readable text like source code. Binary format is usually space optimized and used in order to save bandwidth. It transfers only the set of changes to an old version of the data already available to a recipient as opposed to transferring all the new data. The formal term for this is delta encoding.
Binary VS Text?
There seems to be a common misconception that diff algorithms are specialized based on the type of input. The truth is, diff algorithms are omnivorous and can handle any input, as long as the input can simply be treated as a . That string might consist of the English alphabet or opaque binary data. Any diff algorithm will generate a correct delta given two input strings in the same alphabet.
The misconception that a different algorithm is required to handle binary data arises from commonly used diff/merge tools treating text and binary as if they were actually different. These tools generally aim to provide a human-readable delta, and as such focus on human-readable input to the exclusion of binary data.
Equality is the only relevant output in the case of binary diffs, and as such, a simple bit-by-bit comparison is considered to be the fastest and most appropriate solution. This categorization of algorithms by the efficiency of solution causes a partitioning of inputs into different types.
Another aspect that adds to the confusion is the line-based, word-based, and character-based classification of textual diff outputs produced by diff/merge tools. A diff algorithm that is described as “line-based” gives the impression that it produces “text-only” output, and that this means that it accepts only text input and never binary data inputs.
However, line/word/character-based is not a characteristic of a diff algorithm itself; rather, it is an optimization applied to the input before feeding it to the actual diff algorithm.
Because new lines and spaces have meaning as separators in human-readable text, the diff tool can segment the string based on the hashes of the lines or words in the text.
This hash string is much shorter than the original text thus saving time at the cost of reduced granularity of the diff. Additionally, line-based granularity might actually even increase human readability of the diff in some cases.
However, if the input is known to be opaque binary data, there are no meaningful separators nor human-readable diff to display, so this optimization cannot be applied. Algorithms capable of optimizing human-readable data before it becomes an input are thus prone to getting miscast as wholly incapable of processing binary data.
Pure block move
The next generation of diff algorithms was based on seemingly small optimizations over the previous generation. The character edits were upgraded to block-of-characters edits. That is, instead of expressing the diff as operations on single characters, the diff would be expressed as operations on blocks of characters. The operations are usually copy and insert where blocks of data that appear in both inputs are recorded in the delta as copied from one input to the other. The blocks unique to one of the inputs are recorded as insertions. This approach was first proposed by .Compression based block move
How Ably generates deltas in its pub/sub messaging platform using the block move approach
At first glance, the block move approach seems like a minor optimization. However, it has significant impact once you take into account the possibility that some block(s) of characters can repeat in some or both of the inputs. Thinking about diff generation in terms of copying blocks of data and keeping an eye out for the same block repeating more than once opens the door to using compression algorithms to generate a diff and delta file.Compression algorithms do just that: find the biggest possible repeating blocks of data and replace each consecutive occurrence with a reference to the first occurrence. Blocks of data that never repeat are copied straight to the output. Compression algorithms are in essence block move algorithms.If the block move analysis done by a compression algorithm is performed on both inputs of a diff algorithm, it will easily identify the common parts of both inputs. It will also point out which blocks of data are unique, that is different in both of the inputs. Once you have this data, it is straightforward to come up with a sequence of block copy/delete operations that will convert one of the inputs to the other one.The major benefit of using compression algorithms is the greatly reduced size of the delta. A block of data will never appear more than once in the delta. It might be referred to multiple times but the actual data of the block will be contained in the delta only once. That is a major difference from the preceding approaches. It should also be mentioned that the delta size is reduced at the cost of reduced human readability.xDelta, zDelta, Bentley/McIlroy are widely used de-facto standard implementations of diff algorithms of this generation.
Latest Upgrades
This would be the latest generation of diff algorithms. Most of its members exist only in research papers and don’t have any commercial implementations yet. They are largely based on the block move approach but offer substantial implementation optimizations, which result in claims of double-digit factor improvements in speed over the previous generation.These optimizations are mostly focused on efficiently finding matching blocks of data in the two inputs. Various incremental hashing or compression-like techniques (such as suffix-trees) are used to achieve this purpose.edelta, ddelta, bsdiff could be assigned to this generation of diff algorithms.
Myers Algorithm - human readable diffs
The belongs to the string correction family and is widely used by tools fine tuned to generate human readable delta/patch files out of human readable inputs. This is used by tools such as Git Diff and GNU Diff.Original Myers time and space complexity is O(ND) where N is the sum of the lengths of both inputs and D is the size of the minimum edit script that converts one input to the other. When there number of differences is small, as is the case with edits of the same code/text file, the algorithm is fast. Various optimizations can and have been applied to the original Myers algorithm resulting in improvements of up to O(NlgN + D^2) time and O(N) space.Bentley-McIlroy
The algorithm belongs to the block move family and is focused on producing delta/patch files of optimal size. It has various implementations on different platforms and languages so it can be considered a somewhat de facto standard for scenarios where delta size matters. Google's is one of the most prominent usages of Bentley-McIlroy that is able to generate a VCDiff format delta/patch.The Bentley-McIlroy algorithm has time complexity of O(sqrt(N)*N) although the authors claim linear complexity in the average case. The memory complexity is linear.XDelta
belongs to the block move family and is focused on speed of delta generation. The algorithm sacrifices delta size for improved speed. The delta generation tool is the most prominent usage of XDelta and is also able to generate a VCDiff format delta/patch.The XDelta algorithm has linear time and space complexity.BSDiff
The algorithm belongs to the block move family and is focused on achieving minimal delta/patch size. It is also specifically optimized for executable files. The tool is the most prominent usage of the BSDiff algorithm. The bsdiff tool uses its own custom delta/patch file format.BSDiff time complexity is O((n+m)log(n)) where n and m are the sizes of both inputs. Its memory complexity is max (17n,9n+m)+O(1).Unix .patch
This is a family of delta/patch formats produced by the GNU diff tool that is aimed at human readability. The GNU diff tool has been around for a long time, and these patch formats are widely accepted/used with or without modifications by various text processing tools and source control systems.VCDiff
is the most prominent attempt at creating a data-agnostic and algorithm-agnostic delta/patch format aimed at compactness and speed of application. VCDiff gained quite the adoption related to Google’s effort. Nowadays several diff algorithm implementations are able to generate delta/patch files in VCDiff format. VCDiff delta application libraries in various states of maturity exist for most of the popular languages and platforms.Test setup
The tests were done using the latest official implementations of the algorithms found on GitHub at the time of writing this post (June 2019).Both algorithms expose a great number of tweaks and settings like memory window size that greatly affect their performance. A deliberate effort has been made to run both under the same settings but mistakes are possible.Tests used the .Test results: Average time over 3 minutes execution in a loop
This is an American transit source. If you happen to be reading this post at a time when there are no busses running - such as early in the morning in Europe - you won't see any data.
Ably is a realtime messaging platform. We deliver billions of realtime messages everyday to more than 50 million end-users across web, mobile, and IoT platforms.
Developers use Ably to build realtime capabilities in their apps with our multi-protocol (including ), , and , from across industries like transport and finance, and that extend Ably into third party clouds and systems like AWS Kinesis and RabbitMQ.
Both businesses and developers choose to build on Ably because we provide the only realtime platform architected around Four Pillars of Dependability: performance, high availability, reliability, and integrity of data. This allows our customers to focus on their code and data streams while we provide unrivaled quality of service, fault-tolerance, and scalability.
We’re for roles across our commercial and engineering teams :).
Previously published at