Reference counting (RC) has long played a minor role among garbage collection (GC) algorithms in functional programming, as most major implementations use non-RC GC. However, several recent papers have shown that highly optimized RC GC can be efficient in functional languages, at the cost of sacrificing or restricting some language features like circular references. One of those papers introduced an efficient RC GC algorithm called Perceus, which is basically an all-in-one RC scheme.
In this section, I use the term "synchronized" to mean "marked as shared by multiple threads." Koka and Lean 4 use the term "shared" for the same concept, but I rephrase it here to reduce confusion.
In the Perceus reference counting GC, memory blocks have two main states, un-synchronized and synchronized, represented by positive and negative counts respectively. Heap blocks are synchronized before they get shared with other threads, and they are never reverted back to the un-synchronized state once synchronized. But you may wonder whether this one-way transition is necessary. If a memory block has a reference count of 1, that also means it is no longer shared with any other thread. So isn't it possible to use a common count value of 0 to represent unique references, and potentially reduce the overhead of some atomic operations?
The answer is no, because in that case we would need to synchronize memory operations on references reverted back to the un-synchronized state with the drops of those references performed by other threads using release memory ordering. For example, consider a situation where a thread A shares a reference with another thread B:

1. Thread A synchronizes the block and passes the reference to thread B.
2. Both threads now hold the reference, and thread B works with the block.
3. Thread B finishes its work, having performed side effects on memory, and drops its reference.
4. Thread A observes that the count is back to 1 and reverts the reference to the un-synchronized state, treating the block as unique again.
So if references can be reverted to the un-synchronized state, we always need to use atomic operations with acquire memory ordering at point (4) above, in order to make all the side effects performed by thread B at point (3) visible to thread A. Otherwise, thread A might free or overwrite memory locations that thread B is still trying to read! As a result, we would rather increase the overhead of atomic operations even for references that have never been synchronized.
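To make this concrete, here is a minimal Rust sketch of the dual-state count, assuming the encoding above (a positive count means un-synchronized, a negative count means synchronized). The `Header`, `dup`, `drop_ref`, and `free` names are hypothetical, and a real runtime would use plain non-atomic operations on the un-synchronized path rather than relaxed atomics:

```rust
use std::sync::atomic::{fence, AtomicI64, Ordering};

// Hypothetical heap block header: positive counts mean un-synchronized
// (owned by a single thread), negative counts mean synchronized.
struct Header {
    count: AtomicI64,
}

fn dup(h: &Header) {
    // The sign check needs no synchronization: a block never goes back
    // from synchronized to un-synchronized.
    let c = h.count.load(Ordering::Relaxed);
    if c > 0 {
        // Un-synchronized: only this thread can see the block, so a
        // plain (non-atomic in a real runtime) increment suffices.
        h.count.store(c + 1, Ordering::Relaxed);
    } else {
        // Synchronized: counts grow downwards, so "increment" subtracts.
        h.count.fetch_sub(1, Ordering::Relaxed);
    }
}

fn drop_ref(h: &Header) {
    let c = h.count.load(Ordering::Relaxed);
    if c > 0 {
        if c == 1 {
            // Unique and never shared: freed without any fence, because
            // no other thread has ever touched this block.
            free(h);
        } else {
            h.count.store(c - 1, Ordering::Relaxed);
        }
    } else {
        // Synchronized: release our writes to whichever thread drops
        // last, and acquire everyone else's writes before freeing. This
        // is exactly the release/acquire pairing of points (3) and (4).
        if h.count.fetch_add(1, Ordering::Release) == -1 {
            fence(Ordering::Acquire);
            free(h);
        }
    }
}

fn free(_h: &Header) {
    // Deallocation elided in this sketch.
}

fn main() {
    let h = Header { count: AtomicI64::new(1) }; // fresh, un-synchronized
    dup(&h);      // count: 2
    drop_ref(&h); // count: 1
    drop_ref(&h); // unique: freed, no fence on this path
}
```

Note that the acquire fence is confined to drops of already-synchronized blocks. If blocks could revert to the un-synchronized state, even the positive path would need it.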
When your language has record types and syntax for record field access, things might get a little more complex. Let's think about the following pseudo code, where we want to update a recursive data structure of type `A` in place (the code is written in Elm, but assume that it is implemented with Perceus):
type alias A =
    { x : Maybe A
    , y : Int
    }
f : A -> A
-- From here, assume that we are in a function scope rather than a module scope.
foo : A
foo = { x = Nothing, y = 42 }
bar : A
bar =
    -- Here, we want to reuse the memory block of `foo`...
    { foo
        | x =
            case foo.x of
                Nothing ->
                    Nothing

                -- There are two references to `x` during the evaluation of `f x` here!
                Just x ->
                    Just (f x)
    }
At the line `Just x -> Just (f x)`, the program applies the function `f` to the field value `x`, which originates from `foo`. However, at the point of this function application, we are still holding the record value `foo` itself, so the value of `x` has two references! Therefore, heap reuse specialization (i.e. in-place record update) cannot be applied there. In order to update the value of `x` in place, we instead need to deconstruct `foo` into its inner values first, as follows.
bar =
    let
        { x, y } =
            foo
    in
    { x =
        case x of
            Nothing ->
                Nothing

            Just x ->
                Just (f x)
    , y = y
    }
Note that, even if a language does not support record deconstruction, for self-recursive types, dropping the fields that contain the type itself is possible in most cases in practice, because otherwise values of such types could not exist at runtime unless they are dynamically generated by functions or thunks stored in those fields.
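As a rough illustration of the uniqueness test behind heap reuse specialization, here is a safe-Rust sketch (not Perceus itself, whose reuse checks are inserted by the compiler) built on `Rc::make_mut`, which reuses the existing allocation exactly when the count is 1 and clones otherwise. The extra `keep` reference below plays the same role as the live `foo` in the first version of `bar`:

```rust
use std::rc::Rc;

// The Elm record `A`, modeled with `Rc` so that reference counts are
// explicit. The `x` field is present only to mirror the Elm type.
#[derive(Clone)]
struct A {
    #[allow(dead_code)]
    x: Option<Rc<A>>,
    y: i64,
}

// Update in place when the count is 1; otherwise `make_mut` clones
// first (copy-on-write). This is the same runtime decision that heap
// reuse specialization compiles in.
fn bar(mut foo: Rc<A>) -> Rc<A> {
    let a = Rc::make_mut(&mut foo);
    a.y = 42; // stand-in for rebuilding the record with the new `x`
    foo
}

fn main() {
    // Unique reference: updated in place, no new allocation.
    let unique = Rc::new(A { x: None, y: 0 });
    assert_eq!(bar(unique).y, 42);

    // Shared reference: like still holding `foo` while evaluating
    // `f x`, the second reference forces a copy instead of reuse.
    let shared = Rc::new(A { x: None, y: 0 });
    let keep = Rc::clone(&shared);
    let copied = bar(shared);
    assert_eq!((keep.y, copied.y), (0, 42));
}
```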
Finally, here are benchmark results measured with hyperfine, comparing always-atomic RC (the "old" binaries below) with the Perceus-style dual-state RC described above (the "new" binaries). Machine specs and raw logs follow the table.

| Benchmark | Atomic RC (seconds) | Perceus RC (seconds) | Improvement (times) |
| --- | --- | --- | --- |
| Conway's game of life | 2.136 | 1.142 | 1.87 |
| Hash map insertion | 0.909 | 0.255 | 3.57 |
| Hash map update | 1.935 | 0.449 | 4.31 |
> uname -a
Linux xenon 5.13.0-1033-gcp #40~20.04.1-Ubuntu SMP Tue Jun 14 00:44:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Virtualization features:
Hypervisor vendor: KVM
Caches (sum of all):
L1d: 64 KiB (2 instances)
L1i: 64 KiB (2 instances)
L2: 512 KiB (2 instances)
L3: 55 MiB (1 instance)
> lsmem
Memory block size: 128M
Total online memory: 4G
> hyperfine -w 3 ./life-old ./life-new
Benchmark 1: ./life-old
Time (mean ± σ): 2.136 s ± 0.033 s [User: 1.699 s, System: 0.392 s]
Range (min … max): 2.097 s … 2.206 s 10 runs
Benchmark 2: ./life-new
Time (mean ± σ): 1.142 s ± 0.035 s [User: 1.087 s, System: 0.009 s]
Range (min … max): 1.102 s … 1.203 s 10 runs
Summary
'./life-new' ran
1.87 ± 0.06 times faster than './life-old'
> hyperfine -w 3 ./insert-old ./insert-new
Benchmark 1: ./insert-old
Time (mean ± σ): 909.0 ms ± 14.2 ms [User: 605.7 ms, System: 250.6 ms]
Range (min … max): 890.7 ms … 932.8 ms 10 runs
Benchmark 2: ./insert-new
Time (mean ± σ): 254.8 ms ± 5.1 ms [User: 189.1 ms, System: 15.0 ms]
Range (min … max): 247.0 ms … 263.4 ms 12 runs
Summary
'./insert-new' ran
3.57 ± 0.09 times faster than './insert-old'
> hyperfine -w 3 ./update-old ./update-new
Benchmark 1: ./update-old
Time (mean ± σ): 1.935 s ± 0.032 s [User: 1.405 s, System: 0.476 s]
Range (min … max): 1.879 s … 1.976 s 10 runs
Benchmark 2: ./update-new
Time (mean ± σ): 448.6 ms ± 14.4 ms [User: 381.5 ms, System: 15.1 ms]
Range (min … max): 430.9 ms … 484.3 ms 10 runs
Summary
'./update-new' ran
4.31 ± 0.16 times faster than './update-old'