visit
All of the above examples represent a problem of finding a shortest path between two vertexes in a graph (a route or path between two places, a number of actions or complexity of getting a sheet of paper from one place or another). This class of shortest path problems is called SSSP (Single Source Shortest Path) and the base algorithm for solving these problems is , which has a O(n^2)
computational complexity.
But, sometimes we need to find all shortest paths between all vertexes. Consider the following example: you are creating a map for you regular movements between home, work and theatre. In this scenario you will end up with 6 routes: work ⭢ home
, home ⭢ work
, work ⭢ theatre
, theatre ⭢ work
, home ⭢ theatre
and theatre ⭢ home
(the reverse routes can be different because of one-way roads for example).
A(k, n) = n! / (n - m)!
// where n - is a number of elements,
// k - is a number of elements in permutation (in our case k = 2)
Which gives us 12 routes for 4 places and 90 routes for 10 places (which is impressive). Note… this is without considering intermediate points between places i.e., to get from home to work you have to cross 4 streets, walk along the river and cross the bridge. If we imagine, some routes can have common intermediate points… well … as the result we will have a very large graph, with a lot of vertexes, where each vertex will represent either a place or a significant intermediate point. The class of problems, where we need to find all shortest paths between all pairs of vertexes in the graph, is called APSP (All Pairs Shortest Paths) and the base algorithm for solving these problems is , which has O(n^3)
computational complexity.
Floyd-Warshall algorithm finds all shortest paths between every pair of vertexes in a graph. The algorithms was published by Robert Floyd in [1] (see “References” section for more details). In the same year, Peter Ingerman in [2] described a modern implementation of the algorithm in form of three nested for
loops:
algorithm FloydWarshall(W) do
for k = 0 to N - 1 do
for i = 0 to N - 1 do
for j = 0 to N - 1 do
W[i, j] = min(W[i, j], W[i, k] + W[k, j])
end for
end for
end for
end algorithm
// where W - is a weight matrix of N x N size,
// min() - is a function which returns lesser of it's arguments
If you never had a change to work with a graph represented in form of matrix then it could be difficult to understand what the above algorithm does. So, to ensure we are on the same page, let’s look into how graph can be represented in form of matrix and why such representation is beneficial to solve shortest path problem.
The picture bellow illustrates a graph of 5 vertexes. On the left, the graph is presented in a visual form, which is made of circles and arrows, where each circle represents a vertex and arrow represents an edge with direction. Number inside a circle corresponds to a vertex number and number above an edge corresponds to edge’s weight. On the right, the same graph is presented in form of weight matrix. Weight matrix is a form of where each matrix cell contains a “weight” – a distance between vertex i
(row) and vertex j
(column). Weight matrix doesn’t include information about a “path” between vertexes (a list of vertexes through which you get from i
to j
) – just a weight (distance) between these vertexes.
In a weight matrix we can see that cell values are equal to weights between vertexes. That is why, if we inspect paths from vertex 0
(row 0
), we will see that … there is only one path – from 0
to 1
. But, on a visual representation we can clearly see paths from vertex 0
to vertexes 2
and 3
(through vertex 1
). The reason for this is simple – in an initial state, weight matrix contains distance between adjacent vertexes only. However, this information alone is enough to find the rest.
Let’s see how it works. Pay attention to cell W[0, 1]
. It’s value indicates, there is a path from vertex 0
to vertex 1
with a weight equal to 1
(in short we can write in as: 0 ⭢ 1: 1
). With this knowledge, we now can scan all cells of row 1
(which contains all weights of all paths from vertex 1
) and back port this information to row 0
, increasing the weight by 1
(value of W[0, 1]
).
Using the same steps we can find paths from vertex 0
through other vertexes. During the search, it might happen that there are more than one path leading to the same vertex and what is more important the weights of these paths could be different. An example of such a situation is illustrated on the picture bellow, where search from vertex 0
through vertex 2
revealed one more path to vertex 3
of a smaller weight.
We have two paths: an original path – 0 ⭢ 3: 4
and a new path we have just discovered – 0 ⭢ 2 ⭢ 3: 3
(keep in mind, weight matrix doesn’t contain paths, so we don’t know which vertexes are included in original path). This is a moment when we make a decision to keep a shortest one and write 3
into cell W[0, 3]
.
algorithm FloydWarshall(W) do
for k = 0 to N - 1 do
for i = 0 to N - 1 do
for j = 0 to N - 1 do
W[i, j] = min(W[i, j], W[i, k] + W[k, j])
end for
end for
end for
end algorithm
The outermost cycle for
on k
iterates over all vertexes in a graph and on each iteration, variable k
represents a vertex we are searching paths through. Inner cycle for
on i
also iterates over all vertexes in the graph and on each iteration, i
represents a vertex we are searching paths from. And lastly an innermost cycle for
on j
iterates over all vertexes in the graph and on each iteration, j
represents a vertex we are searching path to. In combination it gives us the following: on each iteration k
we are searching paths from all vertexes i
into all vertexes j
through vertex k
. Inside a cycle we compare path i ⭢ j
(represented by W[i, j]
) with a path i ⭢ k ⭢ j
(represented by sum of W[I, k]
and W[k, j]
) and writing the shortest one back into W[i, j]
.
The source code and experimental graphs are available in on GitHub. The experimental graphs can be found in Data/Sparse-Graphs.zip
directory. All benchmarks in this post are implemented in file.
NO_EDGE = (int.MaxValue / 2) - 1
.
Regarding #1. When we speak about matrixes we define cells in terms of “rows” and “columns”. Because of this it seems natural to imagine a matrix in form of “square” or “rectangle” (Picture 4a).
i = row * row_size + col;
// where row - cell row index,
// col - cell column index,
// row_size - number of cells in a row.
Lineal array of weight matrix is a precondition for effective execution of Floyd-Warshall algorithm. The reason for that isn’t simple and detailed explanation deserves a separate post… or a few posts. However, currently, it is important to mention that such representation significantly improves , which in fact has a great impact on algorithm’s performance.
- Author’s note
Regarding #2. If you take a closer look at the algorithm pseudocode, you won’t find any checks related to existence of a path between two vertexes. Instead, pseudocode simply use min()
function. The reason is simple – originally, if there is no path between to vertexes, a cell value is set to infinity
and in all languages, except may be JavaScript, all values are less than infinity
. In case of integers it might be tempting to use int.MaxValue
as a “no path” value. However this will lead to integer overflow in cases when values of both i ⭢ k
and k ⭢ j
paths will be equal to int.MaxValue
. That is why we use a value which is one less than a half of the int.MaxValue
.
- Thoughtful reader
It is indeed possible but unfortunately it will lead to a significant performance penalty. In short, CPU keeps statistics of branch evaluation results ex. when some of if
statements evaluate to true
or false
. It uses this statistics to execute code of “statistically predicted branch” upfront before the actual if
statement is evaluated (this is called ) and therefore execute code more efficiently. However, when CPU prediction is inaccurate, it causes a significant performance loss compared to correct prediction and unconditional execution because CPU has to stop and calculate the correct branch.
Because on each iteration k
we update a significant portion of weight matrix, CPU branch statistics becomes useless because there is no code pattern, all branches are based exclusively on data. So such check will result in significant amount of .
Uhh, looks like we are done with the theoretical part – let’s implement the algorithm (we denote this implementation as Baseline
):
public void Baseline(int[] matrix, int sz)
{
for (var k = 0; k < sz; ++k)
{
for (var i = 0; i < sz; ++i)
{
for (var j = 0; j < sz; ++j)
{
var distance = matrix[i * sz + k] + matrix[k * sz + j];
if (matrix[i * sz + j] > distance)
{
matrix[i * sz + j] = distance;
}
}
}
}
}
The above code is almost an identical copy of the previously mentioned pseudocode with a single exception – instead of Math.Min()
we use if
to update a cell only when necessary.
- Thoughtful reader
The reason is simple. At the moment of writing JIT emits almost equivalent code for both if
and Math.Min
implementations. You can check it in details on but here are snippets of main loop bodies:
//
// == Assembly code of implementation of innermost loop for of j using if.
//
53: movsxd r14, r14d
//
// compare matrix[i * sz + j] and distance (Condition)
//
56: cmp [rdx+r14*4+0x10], ebx
5b: jle short 64
//
// if matrix[i * sz + j] greater than distance write distance to matrix
//
5d: movsxd rbp, ebp
60: mov [rdx+rbp*4+0x10], ebx
64: // end of loop
//
// == Assembly code of implementation of innermost loop for of j using Math.Min.
//
4f: movsxd rbp, ebp
52: mov r14d, [rdx+rbp*4+0x10]
//
// compare matrix[i * sz + j] and distance (Condition)
//
57: cmp r14d, ebx.
//
// if matrix[i * sz + j] less than distance write value to matrix
//
5a: jle short 5e
//
// otherwise write distance to matrix
//
5c: jmp short 61
5e: mov ebx, r14d
61: mov [rdx+rbp*4+0x10], ebx
65: // end of loop
You may see, regardless of whether we use if
or Math.Min
there is still a conditional check. However, in case of if
there is no unnecessary write.
var max = v * (v - 1)) / 2;
// where v - is a number of vertexes in a graph.
Let’s rock!
Method | Size | Mean | Error | StdDev |
---|---|---|---|---|
Baseline | 300 | 27.525 ms | 0.1937 ms | 0.1617 ms |
Baseline | 600 | 217.897 ms | 1.6415 ms | 1.5355 ms |
Baseline | 1200 | 1,763.335 ms | 7.4561 ms | 6.2262 ms |
Baseline | 2400 | 14,533.335 ms | 63.3518 ms | 52.9016 ms |
Baseline | 4800 | 119,768.219 ms | 181.5514 ms | 160.9406 ms |
From the results we can see, calculation time grows dramatically compared to a size of a graph – for a graph of 300 vertexes it took 27 milliseconds, for a graph of 2400 vertex – 14.5 seconds and for a graph of 4800 – 119 seconds which is almost 2 minutes!
However, in our experiments we use directed, acyclic graphs, which have a wonderful property – if there is a path from vertex 1
to vertex 2
, then there is no path from vertex 2
to vertex 1
. For us, it means, there are a lot of non-adjacent vertexes which we can skip if there is no path from i
to k
(we denote this implementation as SpartialOptimisation
).
public void SpartialOptimisation(int[] matrix, int sz)
{
for (var k = 0; k < sz; ++k)
{
for (var i = 0; i < sz; ++i)
{
if (matrix[i * sz + k] == NO_EDGE)
{
continue;
}
for (var j = 0; j < sz; ++j)
{
var distance = matrix[i * sz + k] + matrix[k * sz + j];
if (matrix[i * sz + j] > distance)
{
matrix[i * sz + j] = distance;
}
}
}
}
}
Here are the results of previous (Baseline
) and current (SpartialOptimisation
) implementations on the same set of graphs (fastest results are highlighted in bold):
Method | Size | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Baseline | 300 | 27.525 ms | 0.1937 ms | 0.1617 ms | 1.00 |
SpartialOptimisation | 300 |
12.399 ms | 0.0943 ms | 0.0882 ms | 0.45 |
Baseline | 600 | 217.897 ms | 1.6415 ms | 1.5355 ms | 1.00 |
SpartialOptimisation | 600 |
99.122 ms | 0.8230 ms | 0.7698 ms | 0.45 |
Baseline | 1200 | 1,763.335 ms | 7.4561 ms | 6.2262 ms | 1.00 |
SpartialOptimisation | 1200 |
766.675 ms | 6.1147 ms | 5.7197 ms | 0.43 |
Baseline | 2400 | 14,533.335 ms | 63.3518 ms | 52.9016 ms | 1.00 |
SpartialOptimisation | 2400 |
6,507.878 ms | 28.2317 ms | 26.4079 ms | 0.45 |
Baseline | 4800 | 119,768.219 ms | 181.5514 ms | 160.9406 ms | 1.00 |
SpartialOptimisation | 4800 |
55,590.374 ms | 414.6051 ms | 387.8218 ms | 0.46 |
My PC is equipped with Inter Core i7-7700 CPU 3.60GHz (Kaby Lake)
processor with 8 logical processors () or 4 cores with technology. Having more than one core is like having more “spare hands” which we can put to work. So, let’s see which part of work can be efficiently divided between multiple workers and then, parallelise it.
Let’s start from for of k
loop. On each iteration loop calculates paths from every vertex to every vertex through vertex k
. New and updated paths are stored in weight matrix. Looking at iterations you might notice – they can be executed in any order: 0, 1, 2, 3 or 3, 1, 2, 0 without affecting the result. However, they still have to be executed in sequence, otherwise some of the iterations won’t be able to use new or updated paths because they won’t be written into weight matrix, yet. Such inconsistency will definitely crush the results. So we need to keep looking.
The next candidate is for of i
loop. On each iteration loop reads a path from vertex i
to vertex k
(cell: W[i, k]
), a path from vertex k
to vertex j
(cell: W[k, j
]) and then checks whether known path from i
to j
(cell: W[i, j]
) is shorter than i ⭢ k ⭢ j
path (sum of: W[i, k]
+ W[k, j]
). If you look closer at the access pattern, you might notice – on each iteration i
loop reads data from k
row and updated i
row which basically means – iterations are independent and can be executed not only in any order but also in parallel!
This looks promising, so let’s implement it (we denote this implementation as SpartialParallelOptimisations
).
for of j
loop also can be paralleled. However, parallelisation of innermost cycle in this case is very inefficient. You can verify it yourself by making a few simple changes in the .
- Author’s note
public void SpartialParallelOptimisations(int[] matrix, int sz)
{
for (var k = 0; k < sz; ++k)
{
Parallel.For(0, sz, i =>
{
if (matrix[i * sz + k] == NO_EDGE)
{
return;
}
for (var j = 0; j < sz; ++j)
{
var distance = matrix[i * sz + k] + matrix[k * sz + j];
if (matrix[i * sz + j] > distance)
{
matrix[i * sz + j] = distance;
}
}
});
}
}
Here are the results of Baseline
, SpartialOptimisation
and SpartialParallelOptimisations
implementations on the same set of graphs (parallelisation is done using class):
Method | Size | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Baseline | 300 | 27.525 ms | 0.1937 ms | 0.1617 ms | 1.00 |
SpartialOptimisation | 300 | 12.399 ms | 0.0943 ms | 0.0882 ms | 0.45 |
SpartialParallelOptimisations | 300 |
6.252 ms | 0.0211 ms | 0.0187 ms | 0.23 |
Baseline | 600 | 217.897 ms | 1.6415 ms | 1.5355 ms | 1.00 |
SpartialOptimisation | 600 | 99.122 ms | 0.8230 ms | 0.7698 ms | 0.45 |
SpartialParallelOptimisations | 600 |
35.825 ms | 0.1003 ms | 0.0837 ms | 0.16 |
Baseline | 1200 | 1,763.335 ms | 7.4561 ms | 6.2262 ms | 1.00 |
SpartialOptimisation | 1200 | 766.675 ms | 6.1147 ms | 5.7197 ms | 0.43 |
SpartialParallelOptimisations | 1200 |
248.801 ms | 0.6040 ms | 0.5043 ms | 0.14 |
Baseline | 2400 | 14,533.335 ms | 63.3518 ms | 52.9016 ms | 1.00 |
SpartialOptimisation | 2400 | 6,507.878 ms | 28.2317 ms | 26.4079 ms | 0.45 |
SpartialParallelOptimisations | 2400 |
2,076.403 ms | 20.8320 ms | 19.4863 ms | 0.14 |
Baseline | 4800 | 119,768.219 ms | 181.5514 ms | 160.9406 ms | 1.00 |
SpartialOptimisation | 4800 | 55,590.374 ms | 414.6051 ms | 387.8218 ms | 0.46 |
SpartialParallelOptimisations | 4800 |
15,614.506 ms | 115.6996 ms | 102.5647 ms | 0.13 |
From the results we can see that parallelisation of for of i
loop have reduced the calculation time by 2-4 times comparing to previous implementation (SpartialOptimisation
)! This is very impressive, however, it is important to remember, parallelisation of pure computations consumes all available compute resources which can lead to resource starvation of other applications in the system.
var a = new [] { 5, 7, 8, 1 };
var b = new [] { 4, 2, 2, 6 };
var c = new [] { 0, 0, 0, 0 };
for (var i = 0; i < 4; ++i)
c[i] = a[i] + b[i];
In the oversimplified way, execution of iteration 0
of the above for
loop on a CPU level can be illustrated as following:
Here is what happening. CPU loads values of arrays a
and b
from memory into CPU registers (one register can hold exactly one value). Then CPU executes scalar addition operation on these registries and writes the result back into main memory – right into an array c
. This process is repeated four times before the loop ends.
Here is what happening. CPU loads values of arrays a
and b
from memory into CPU registers (however this time, one vector register can hold two array value). Then CPU executes vector addition operation on these registries and writes the result back into main memory – right into an array c
. Because we operate on two values at once, this process is repeated two times instead of four.
public void SpartialVectorOptimisations(int[] matrix, int sz)
{
for (var k = 0; k < sz; ++k)
{
for (var i = 0; i < sz; ++i)
{
if (matrix[i * sz + k] == NO_EDGE)
{
continue;
}
var ik_vec = new Vector<int>(matrix[i * sz + k]);
var j = 0;
for (; j < sz - Vector<int>.Count; j += Vector<int>.Count)
{
var ij_vec = new Vector<int>(matrix, i * sz + j);
var ikj_vec = new Vector<int>(matrix, k * sz + j) + ik_vec;
var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
if (lt_vec == new Vector<int>(-1))
{
continue;
}
var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
r_vec.CopyTo(matrix, i * sz + j);
}
for (; j < sz; ++j)
{
var distance = matrix[i * sz + k] + matrix[k * sz + j];
if (matrix[i * sz + j] > distance)
{
matrix[i * sz + j] = distance;
}
}
}
}
}
Vectorised code can look a bit bizarre, so let’s go through it step by step.
We create a vector of i ⭢ k
path: var ik_vec = new Vector<int>(matrix[i * sz + k])
. As the result, if vector can hold four values of int
type and weight of i ⭢ k
path is equal to 5, we will create a vector of four 5’s – [5, 5, 5, 5]. Next, on each iteration, we simultaneously calculate paths from vertex i
to vertexes j, j + 1, ..., j + Vector<int>.Count
.
Property Vector<int>.Count
returns number of elements of int
type that fits into vector registers. The size of vector registers depends on CPU model, so this property can return different values on different CPU’s.
- Author’s note
ij_vec
and ikj_vec
.ij_vec
and ikj_vec
vectors and select smallest values into a new vector r_vec
.r_vec
back to weight matrix.
While #1 and #3 are pretty straightforward, #2 requires explanation. As it was mentioned previously, with vectors we manipulate multiple values at the same time. Therefore, result of some operations can’t be singular i.e., comparison operation Vector.LessThan(ij_vec, ikj_vec)
can’t return true
or false
because it compares multiple values. So instead it returns a new vector which contains comparison results between corresponding values from vectors ij_vec
and ikj_vec
(-1
, if value from ij_vec
is less than value from ikj_vec
and 0
if otherwise). Returned vector (by itself) isn’t really useful, however, we can use Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec)
vector operation to extract required values from ij_vec
and ikj_vec
vectors into a new vector – r_vec
. This operation returns a new vector where values are equal to smaller of two corresponding values of input vectors i.e., if value of vector lt_vec
is equal to -1
, then operation selects a value from ij_vec
, otherwise it selects a value from ikj_vec
.
Besides #2, there is one more part, requires explanation – the second, non-vectorised loop. It is required when size of the weight matrix isn’t a product of Vector<int>.Count
value. In that case the main loop can’t process all values (because you can’t load part of the vector) and second, non-vectored, loop will calculate the remaining iterations.
Method | sz | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Baseline | 300 | 27.525 ms | 0.1937 ms | 0.1617 ms | 1.00 |
SpartialOptimisation | 300 | 12.399 ms | 0.0943 ms | 0.0882 ms | 0.45 |
SpartialParallelOptimisations | 300 | 6.252 ms | 0.0211 ms | 0.0187 ms | 0.23 |
SpartialVectorOptimisations | 300 |
3.056 ms | 0.0301 ms | 0.0281 ms | 0.11 |
Baseline | 600 | 217.897 ms | 1.6415 ms | 1.5355 ms | 1.00 |
SpartialOptimisation | 600 | 99.122 ms | 0.8230 ms | 0.7698 ms | 0.45 |
SpartialParallelOptimisations | 600 | 35.825 ms | 0.1003 ms | 0.0837 ms | 0.16 |
SpartialVectorOptimisations | 600 |
24.378 ms | 0.4320 ms | 0.4041 ms | 0.11 |
Baseline | 1200 | 1,763.335 ms | 7.4561 ms | 6.2262 ms | 1.00 |
SpartialOptimisation | 1200 | 766.675 ms | 6.1147 ms | 5.7197 ms | 0.43 |
SpartialParallelOptimisations | 1200 | 248.801 ms | 0.6040 ms | 0.5043 ms | 0.14 |
SpartialVectorOptimisations | 1200 |
185.628 ms | 2.1240 ms | 1.9868 ms | 0.11 |
Baseline | 2400 | 14,533.335 ms | 63.3518 ms | 52.9016 ms | 1.00 |
SpartialOptimisation | 2400 | 6,507.878 ms | 28.2317 ms | 26.4079 ms | 0.45 |
SpartialParallelOptimisations | 2400 |
2,076.403 ms | 20.8320 ms | 19.4863 ms | 0.14 |
SpartialVectorOptimisations | 2400 | 2,568.676 ms | 31.7359 ms | 29.6858 ms | 0.18 |
Baseline | 4800 | 119,768.219 ms | 181.5514 ms | 160.9406 ms | 1.00 |
SpartialOptimisation | 4800 | 55,590.374 ms | 414.6051 ms | 387.8218 ms | 0.46 |
SpartialParallelOptimisations | 4800 |
15,614.506 ms | 115.6996 ms | 102.5647 ms | 0.13 |
SpartialVectorOptimisations | 4800 | 18,257.991 ms | 84.5978 ms | 79.1329 ms | 0.15 |
From the result it is obvious, vectorisation has significantly reduced calculation time – from 3 to 4 times compared to non-parallelised version (SpartialOptimisation
). The interesting moment here, is that vectorised version also outperforms parallel version (SpartialParallelOptimisations
) on smaller graphs (less than 2400 vertexes).
If you are interested in practical application of vectorisation, I recommend you to read series of posts by where he vectorised Array.Sort
(these results later found themselves in update for Garbage Collector in ).
The last implementation combines efforts of parallelisation and vectorisation and … to be honest it lacks of individuality 🙂 Because… well, we have just replaced a body of Parallel.For
from SpartialParallelOptimisations
with vectorised loop from SpartialVectorOptimisations
:
public void SpartialParallelVectorOptimisations(int[] matrix, int sz)
{
for (var k = 0; k < sz; ++k)
{
Parallel.For(0, sz, i =>
{
if (matrix[i * sz + k] == NO_EDGE)
{
return;
}
var ik_vec = new Vector<int>(matrix[i * sz + k]);
var j = 0;
for (; j < sz - Vector<int>.Count; j += Vector<int>.Count)
{
var ij_vec = new Vector<int>(matrix, i * sz + j);
var ikj_vec = new Vector<int>(matrix, k * sz + j) + ik_vec;
var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
if (lt_vec == new Vector<int>(-1))
{
continue;
}
var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
r_vec.CopyTo(matrix, i * sz + j);
}
for (; j < sz; ++j)
{
var distance = matrix[i * sz + k] + matrix[k * sz + j];
if (matrix[i * sz + j] > distance)
{
matrix[i * sz + j] = distance;
}
}
});
}
}
All results of this post are presented bellow. As expected, simultaneous use of parallelism and vectorisation demonstrated the best results, reducing the calculation time up to 25 times (for graphs of 1200 vertexes) comparing to initial implementation.
Method | sz | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Baseline | 300 | 27.525 ms | 0.1937 ms | 0.1617 ms | 1.00 |
SpartialOptimisation | 300 | 12.399 ms | 0.0943 ms | 0.0882 ms | 0.45 |
SpartialParallelOptimisations | 300 | 6.252 ms | 0.0211 ms | 0.0187 ms | 0.23 |
SpartialVectorOptimisations | 300 | 3.056 ms | 0.0301 ms | 0.0281 ms | 0.11 |
SpartialParallelVectorOptimisations | 300 |
3.008 ms | 0.0075 ms | 0.0066 ms | 0.11 |
Baseline | 600 | 217.897 ms | 1.6415 ms | 1.5355 ms | 1.00 |
SpartialOptimisation | 600 | 99.122 ms | 0.8230 ms | 0.7698 ms | 0.45 |
SpartialParallelOptimisations | 600 | 35.825 ms | 0.1003 ms | 0.0837 ms | 0.16 |
SpartialVectorOptimisations | 600 | 24.378 ms | 0.4320 ms | 0.4041 ms | 0.11 |
SpartialParallelVectorOptimisations | 600 |
13.425 ms | 0.0319 ms | 0.0283 ms | 0.06 |
Baseline | 1200 | 1,763.335 ms | 7.4561 ms | 6.2262 ms | 1.00 |
SpartialOptimisation | 1200 | 766.675 ms | 6.1147 ms | 5.7197 ms | 0.43 |
SpartialParallelOptimisations | 1200 | 248.801 ms | 0.6040 ms | 0.5043 ms | 0.14 |
SpartialVectorOptimisations | 1200 | 185.628 ms | 2.1240 ms | 1.9868 ms | 0.11 |
SpartialParallelVectorOptimisations | 1200 |
76.770 ms | 0.3021 ms | 0.2522 ms | 0.04 |
Baseline | 2400 | 14,533.335 ms | 63.3518 ms | 52.9016 ms | 1.00 |
SpartialOptimisation | 2400 | 6,507.878 ms | 28.2317 ms | 26.4079 ms | 0.45 |
SpartialParallelOptimisations | 2400 | 2,076.403 ms | 20.8320 ms | 19.4863 ms | 0.14 |
SpartialVectorOptimisations | 2400 | 2,568.676 ms | 31.7359 ms | 29.6858 ms | 0.18 |
SpartialParallelVectorOptimisations | 2400 |
1,281.877 ms | 25.1127 ms | 64.8239 ms | 0.09 |
Baseline | 4800 | 119,768.219 ms | 181.5514 ms | 160.9406 ms | 1.00 |
SpartialOptimisation | 4800 | 55,590.374 ms | 414.6051 ms | 387.8218 ms | 0.46 |
SpartialParallelOptimisations | 4800 | 15,614.506 ms | 115.6996 ms | 102.5647 ms | 0.13 |
SpartialVectorOptimisations | 4800 | 18,257.991 ms | 84.5978 ms | 79.1329 ms | 0.15 |
SpartialParallelVectorOptimisations | 4800 |
12,785.824 ms | 98.6947 ms | 87.4903 ms | 0.11 |