Build system

Table of Contents

1. Build modes
2. Cross-compilation
3. Performance comparison & SIMD
4. Programming language Benchmarks
5. 1BRC

1. Build modes

Zig build modes are much easier for beginners to grasp than their C equivalents.

There are only 4 different build modes:

  • Debug (default)
  • ReleaseFast
  • ReleaseSafe
  • ReleaseSmall

You can check them in more detail here

There are also a few other options you can tweak, such as AVX, here
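
For example, a build mode can be selected directly on the command line:

zig build-exe -OReleaseSafe src/main.zig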

1.1. Automatically generate documentation

Zig has an experimental feature that automatically generates documentation. It scans all the public structures and functions and creates documentation from them. Moreover, comments starting with /// are used to attach more information to types.
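
As a minimal illustration (the Point struct here is a made-up example, not from the repository):

/// A 2D point. Doc comments like this one are picked up by the
/// documentation generator for public declarations.
pub const Point = struct {
    x: f64,
    y: f64,

    /// Returns the Euclidean distance of the point from the origin.
    pub fn norm(self: Point) f64 {
        return @sqrt(self.x * self.x + self.y * self.y);
    }
};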

This example was created with the help of the build.zig from the Zig language GitHub repository. It works the same way the Zig team auto-generates the standard library documentation. See: build.zig

const std = @import("std");

pub fn build(b: *std.Build) void {
    const autodoc_test = b.addTest(.{
        .root_source_file = .{ .path = "src/main.zig" },
    });
    const install_std_docs = b.addInstallDirectory(.{
        .source_dir = autodoc_test.getEmittedDocs(),
        .install_dir = .prefix,
        .install_subdir = "doc",
    });

    const doc_step = b.step("docs", "Generate documentation");
    doc_step.dependOn(&install_std_docs.step);
    doc_step.makeFn = generateDocumentation;
}

fn generateDocumentation(self: *std.Build.Step, progress: *std.Progress.Node) !void {
    _ = self;
    _ = progress;
    std.log.info("The documentation has been generated", .{});
}
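
With this step declared, the documentation can be generated from the project root with:

zig build docs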

1.2. Strip the output binary in Zig on Linux

ELF format: an ELF executable contains various sections that hold program and control information. For example:

  • .bss: holds uninitialized variables and data
  • .text: holds the instructions of the program
  • .debug: holds unspecified information for debugging

Removing the debug symbols reduces the executable's size, makes it harder to reverse engineer, and may improve load performance.

By default, Zig produces an executable that contains all the debug symbols. However, Zig provides a drop-in replacement for the GNU objcopy program.

To strip an output binary from the zig command line:

zig build-exe -fstrip src/main.zig

To make sure that the output executable is stripped, the file command is useful:

file main

And the output:

main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

With the Zig build system (build.zig), there is an option to strip an executable:

const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});

var exe = b.addExecutable(.{
    .name = "linkedlist",
    .root_source_file = .{ .path = "src/main.zig" },
    .target = target,
    .optimize = optimize,
});

exe.strip = true;

If the strip option is set to false, we have:

$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

And with the option set to true:

$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

1.3. Separate debug symbols from ELF executable

The section above shows how to remove all the symbols from the executable directly in the Zig build system. This section focuses instead on the Zig objcopy command. Zig has implemented its own objcopy utility to strip and remove symbols, like the objcopy from GNU binutils. (Note: for now, the Zig objcopy has fewer features.)

The advantage of having objcopy directly in Zig is that there is no need for multiple objcopy executables. In a cross-compilation workflow, each binutils build must match the target CPU architecture, so a built-in objcopy avoids juggling different toolchains and scripts for the various targets.

Here's how to display the help of Zig's objcopy:

zig objcopy --help

The following example illustrates several uses of the objcopy command: first, removing the debug symbols from the executable; then, keeping only the debug symbols in a separate file; and finally, linking the debug symbols file back to an executable that has none.

For this example, the output executable of the Zig build toolchain is named main.

$ file main
main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

The first step is to strip the debug info from the executable:

zig objcopy --strip-debug main mainStripped

The short flag alternative can also be used:

zig objcopy -g main mainNoDebugInfo

After this command, the new executable no longer has the debug info.

$ file mainNoDebugInfo
mainNoDebugInfo: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

The second step shows how to keep only the debug symbols in the main.dbg file:

zig objcopy --only-keep-debug main main.dbg

Finally, to recreate an executable with a link to a debug symbols file, the following command can be used:

zig objcopy --add-gnu-debuglink=main.dbg mainNoDebugInfo mainWithDbgSymbols

Now the executable named mainWithDbgSymbols can be used, for example in GDB, to debug the program (GDB will know where its debug symbols file is located).
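
For example, a quick check with GDB (assuming main.dbg sits next to the executable, which is where GDB looks for the debug link by default):

gdb ./mainWithDbgSymbols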

Removing the debug symbols creates a smaller executable. Below are the sizes from a mini-example:

executable name       size
main                  1.9M
main.dbg              1.2M
mainNoDebugInfo       690K
mainWithDbgSymbols    690K

The advantage of this method is that the executable can be shipped to production without debug symbols (it is harder to reverse engineer and has a smaller size). But when a bug is reported, it is still possible to debug it: all you have to do is link the debug symbols file to the executable.

2. Cross-compilation

2.1. Cross-compile for embedded Linux

The -target flag of zig build-exe selects the target architecture for cross-compilation. There are many possible targets; see {{< cite "CrossCompilationZig2024" >}} for the entire list.

Here's an example of a test that was performed. The aim was to run an executable produced by Zig (on an x86_64 host) on a NanoPi NEO Plus2, which has an arm64 architecture and runs a Linux kernel. The information about the embedded target:

uname -a
Linux csel 5.15.148 #2 SMP PREEMPT Mon Mar 4 21:21:00 UTC 2024 aarch64 GNU/Linux

Here's the command to cross-compile the executable (the executable is also stripped to have a smaller size):

zig build-exe src/main.zig -target aarch64-linux -fstrip
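
If the project uses a build.zig with b.standardTargetOptions (as in section 1.2), the same cross-compilation can be requested with the standard -Dtarget option, for example:

zig build -Dtarget=aarch64-linux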

The executable runs flawlessly on the embedded target; below is the output:

# ./main
All your codebase are belong to us.
Run `zig build test` to run the tests.

3. Performance comparison & SIMD

The following sections explain the different benchmarks and code.

3.1. Vectors

A Zig vector is a fixed-length group of values of a single child type: booleans, integers, floats, or pointers. It compiles to SIMD instructions to parallelize operations where possible.

As described in the language documentation, vectors shorter than the target machine's native SIMD size compile to a single SIMD instruction, while longer vectors compile to multiple SIMD instructions. {{< cite "DocumentationZigProgramming" >}}

The Zig vector does not work in the same way as the C++ std::vector (it is not a dynamic array). {{< cite "WelcomeZigGuide2024" >}}

We can use multiple operations on Zig vectors, as the example below shows:

  • Arithmetic
  • Bitwise operations
  • Comparison operators

const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;
fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(size, i32) = .{ 5, 6, 7, 8 };

    try stdout.writer().print("Access to an item: {d}\n", .{a[2]});

    const c = a + b;
    try print(@TypeOf(c), "c", c);

    const d = a * b;
    try print(@TypeOf(d), "d", d);

    const e = a << b;
    try print(@TypeOf(e), "e", e);

    const f = a == b;
    try print(@TypeOf(f), "f", f);
}
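
Running this example should print something like the following (the trailing separators come from the print helper):

Access to an item: 3
c: 6, 8, 10, 12,
d: 5, 12, 21, 32,
e: 32, 128, 384, 1024,
f: false, false, false, false,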

Zig supports a maximum vector length of:

\begin{equation} 2^{32} - 1 \end{equation}

But currently, on the latest Zig version, a large length (e.g. 2^20) can crash the compiler at compile time. See the documentation {{< cite "DocumentationZigProgramminga" >}}.

There are utility functions to simplify the use of vectors:

  1. @splat creates a vector of a given length with every element set to the same value. @splat
  2. @reduce reduces a vector to a scalar with a given operation. For integers, every operation is available; for booleans, the operations are .And, .Or, .Xor; and for floating points, .Min, .Max, .Add, .Mul. {{< cite "DocumentationZigProgramming" >}} @reduce
  3. @shuffle constructs a new vector by selecting elements from two vectors (a and b) according to a mask array. For each element of the result, the mask value decides where it comes from: a value of 0 or more selects that index of the first vector (a); a negative value selects from the second vector (b), with -1 meaning index 0, -2 meaning index 1, and so on (see the example below). @shuffle
  4. @select selects elements from two vectors (a and b) based on a pred vector: if pred[i] is true, the element will be a[i], otherwise b[i]. @select

const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;
fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = @splat(3);
    try print(@TypeOf(a), "a", a);

    const b = @reduce(.Add, a);
    try stdout.writer().print("b: {d}\n", .{b});

    const mask: @Vector(size, i32) = .{ 1, -2, -1, 0};
    const c: @Vector(size, u8) = .{ 't', 't', 'q', 'q'};
    const d: @Vector(size, u8) = .{ 's', 'e', 't', 'a'};
    const e = @shuffle(u8, c, d, mask);
    try print(@TypeOf(e), "e", e);

    const pred: @Vector(size, bool) = .{ true, false, false, true};
    const f: @Vector(size, u8) = .{ 'a', 'a', 'a', 'a'};
    const g: @Vector(size, u8) = .{ 'b', 'b', 'b', 'b'};
    const h = @select(u8, pred, f, g);
    try print(@TypeOf(h), "h", h);
}
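
Running this example should print something like:

a: 3, 3, 3, 3,
b: 12
e: t, e, s, t,
h: a, b, b, a,

Note how the shuffle mask assembles "test" from the two input vectors.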

3.2. Leibniz algorithm

This section compares the performance of Zig with other languages, in particular C.

The speed-comparison GitHub repository {{< cite "NiklasheerSpeedcomparisonRepo" >}} contains different code examples that implement the Leibniz formula {{< cite "LeibnizFormula2024" >}}. The purpose of this algorithm is to compute π with the following series:

\begin{equation} \frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \ldots = \sum_{k = 0}^{\infty} \frac{(-1)^k}{2k + 1} \end{equation}

Below are the examples from the speed-comparison GitHub repository. They have been modified so that they can be run in Org mode.

The C code shows the standard way to compute the Leibniz series; at each iteration, the approximation of π improves. The C compiler will optimize the code to use SIMD instructions under the hood. The line x = -1.0 + 2.0 * (i & 0x1); is written this way to enable compiler optimization: the compiler can then use the fmadd instruction, which performs a multiplication and an addition in a single operation. {{< cite "IntelIntrinsicsGuide" >}} {{< cite "ArmA64Instruction" >}}

#include <stdio.h>

int main(void)
{
    unsigned rounds = 1000000000;
    double pi = 1.0;

    rounds += 2u; // do this outside the loop

    for (unsigned i = 2u; i < rounds; ++i)
    {
        double x = -1.0 + 2.0 * (i & 0x1); // allows vectorization
        pi += (x / (2u * i - 1u)); // double / unsigned = double
    }

    pi *= 4;
    printf("%.16f\n", pi); // print 16 decimal digits of pi
    return 0;
}

The code below shows the same algorithm adapted to the Zig language.

const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;

    var pi: f64 = 1.0;
    var i: usize = 2;
    while (i < rounds) : (i += 1) {
        const isOdd: f64 = @floatFromInt(i & 0x1);
        const x: f64 = -1.0 + 2.0 * isOdd;
        const den: f64 = @floatFromInt(2 * i - 1);
        pi += (x / den);
    }
    pi *= 4;
    try std.io.getStdOut().writer().print("{}", .{pi});
}

The optimized Zig code below produces similar output to the two other versions, but it uses SIMD instructions to optimize the computation manually. The number of loop iterations is divided by 4 because four terms are computed in parallel (note that the vector length is 4). At the end of the example, a scalar loop without vectors handles the last iterations: the vectorized loop only covers the part of the round count that is divisible by 4.

const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;
    const unroll = 4;

    const x = @Vector(4, f64){ -1.0, 1.0, -1.0, 1.0 };
    var den: @Vector(4, f64) = @splat(0.0);
    const inc: @Vector(4, f64) = @splat(4.0);
    const two: @Vector(4, f64) = @splat(2.0);
    const mone: @Vector(4, f64) = @splat(-1.0);
    var ivec = @Vector(4, f64){ 2.0, 3.0, 4.0, 5.0 };
    var pivec: @Vector(4, f64) = @splat(0.0);

    const vec_end: u64 = rounds - rounds % unroll;

    // vectorized loop: computes four terms of the series per iteration
    var i: u64 = 2;
    while (i < vec_end) : (i += unroll) {
        den = mone + (two * ivec);
        ivec = ivec + inc;
        pivec = pivec + (x / den);
    }

    var _x: f64 = 1.0;
    var pi: f64 = 1.0;
    pi += @reduce(.Add, pivec);
    // scalar loop for the remaining iterations
    // (rounds is not necessarily divisible by 4)
    while (i < rounds) : (i += 1) {
        _x = -_x;
        const i_2: f64 = @floatFromInt(i);
        pi += _x / (2.0 * i_2 - 1.0);
    }
    pi *= 4;

    try std.io.getStdOut().writer().print("{d:.16}\n", .{pi});
}

3.3. Benchmark your x86_64 CPU

The code below shows the commands to compile the three examples explained above. Before executing the code block, you need to tangle the examples with the shortcut C-c C-v t.

The command to compile the C code (with gcc) is specific to an x86_64 architecture; if you want to follow the examples, make sure you have a compatible CPU. Note: if Emacs is unable to find the path to Zig during execution, the environment variable must be exported in the .zshenv file (for a zsh shell).

gcc leibniz.c -o leibniz_c -O3 -s -flto -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig -fstrip
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_simd_zig -fstrip

You will find the three executables in your directory.

Now, to benchmark the code, you need a tool to run the analysis: hyperfine {{< cite "peterHyperfine2023" >}}. You must install hyperfine and have it in your PATH to run the benchmark below: https://github.com/sharkdp/hyperfine

hyperfine --warmup=3 './leibniz_c' './leibniz_zig' './leibniz_simd_zig' -i -N

If you run the benchmark on your computer, you can see the results in the block above.

3.4. ARM benchmark example

We ran the same benchmark for an ARM CPU; here are the compiler options used. The target is a NanoPi NEO Plus2.

aarch64-buildroot-linux-gnu-gcc leibniz.c -o leibniz_c -O3 -march=armv8-a -mtune=cortex-a53 -s -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_zig_simd_aarch -target aarch64-native -mcpu cortex_a53 -fstrip
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig_aarch -target aarch64-native -mcpu cortex_a53 -fstrip

On our ARM target we cannot run hyperfine, so we measured each program 8 times and computed statistics.

Program                   Mean [s]  Max [s]  Min [s]
./leibniz_c               20.57     20.58    20.57
./leibniz_zig_aarch       19.95     19.96    19.95
./leibniz_zig_simd_aarch  14.43     14.44    14.43

There is a difference in performance between the ARM and the x86 benchmarks. Note that the instruction sets used in the two benchmarks are different, so the optimizations may differ. Moreover, the compiler version used is not the same for the C example:

  • For x86: gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7)
  • For ARM: aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08.3-dirty) 11.3.0

3.5. Different code examples benchmark

The benchmarks on my computer:

Command             Mean [ms]    Min [ms]  Max [ms]  Relative
./leibniz_zig       85.9 ± 0.4   85.0      87.0      2.02 ± 0.03
./leibniz_zig_simd  42.5 ± 0.5   41.6      43.7      1.00
./leibniz_c         43.3 ± 0.8   42.2      45.4      1.02 ± 0.02
./leibniz_rs        86.4 ± 1.5   85.2      94.3      2.03 ± 0.04
./leibniz_avx2      43.3 ± 0.5   42.1      44.3      1.02 ± 0.02

Command             Mean [ms]    Min [ms]  Max [ms]  Relative
./leibniz_zig       883.9 ± 9.5  864.1     896.0     1.99 ± 0.03
./leibniz_zig_simd  443.8 ± 5.2  437.2     452.0     1.00
./leibniz_c         481.5 ± 4.4  476.8     489.2     1.08 ± 0.02
./leibniz_rs        912.7 ± 9.7  897.1     925.1     2.06 ± 0.03
./leibniz_avx2      451.2 ± 6.3  443.8     465.7     1.02 ± 0.02

Command             Mean [s]       Min [s]  Max [s]  Relative
./leibniz_zig       8.889 ± 0.095  8.722    9.058    13.91 ± 0.33
./leibniz_zig_simd  4.386 ± 0.028  4.357    4.420    6.86 ± 0.15
./leibniz_c         0.665 ± 0.003  0.658    0.669    1.04 ± 0.02
./leibniz_rs        8.903 ± 0.053  8.855    8.973    13.93 ± 0.31
./leibniz_avx2      0.639 ± 0.013  0.618    0.657    1.00

4. Programming language Benchmarks

Another way of seeing the difference in performance between C and Zig is to use slightly more complex code. The Programming Language Benchmark {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}} git repository contains different programs written in different languages. Their website shows benchmarks made with their hardware (zig-vs-c). {{< cite "ZigVSBenchmarks" >}} We then reproduced the benchmarks between C and Zig to see whether there were any differences with our hardware.

Here are our results. CPU INFO: [x86_64][8 cores] Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (Model 142)

Here's what you need to know about file naming {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}}:

  • *-m in a file name stands for multi-threading or multi-processing
  • *-i in a file name stands for direct intrinsics usage (usage of SIMD intrinsics via libraries is not counted). It means that the SIMD instructions are used directly in the code. The instructions can be found in the Intel Intrinsics Guide.
  • You may find time < time(user) + time(sys) for some non-parallelized programs; the overhead comes from the GC or JIT compiler, which is allowed to take advantage of multiple cores, as that is closer to real-world scenarios.

You can find how the examples are compiled in the repo here.

knucleotide

Input: 2500000:

lang  code   time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
c     1-m.c  185ms  8.3ms   23.6MB    623ms       13ms       gcc 13.2.0
c     1-m.c  308ms  19ms    27.5MB    2143ms      80ms       clang 17.0.6
zig   1.zig  856ms  6.0ms   19.9MB    907ms       10ms       zig 0.12.0

Input: 250000:

lang  code   time  stddev  peak-mem  time(user)  time(sys)  compiler/runtime
c     1-m.c  33ms  1.8ms   11.9MB    153ms       0ms        gcc 13.2.0
c     1-m.c  47ms  7.8ms   16.8MB    310ms       15ms       clang 17.0.6
zig   1.zig  90ms  2.1ms   7.6MB     90ms        0ms        zig 0.12.0

nbody Input: 5000000

lang  code   time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2.zig  179ms  2.1ms   1.3MB     187ms       0ms        zig 0.12.0
c     5.c    193ms  1.7ms   2.1MB     200ms       0ms        clang 17.0.6
c     5.c    211ms  7.9ms   1.4MB     220ms       0ms        zigcc 0.12.0
c     8-i.c  215ms  2.9ms   1.4MB     223ms       0ms        zigcc 0.12.0
c     8-i.c  222ms  2.9ms   2.0MB     233ms       0ms        gcc 13.2.0
c     8-i.c  225ms  3.5ms   2.1MB     233ms       0ms        clang 17.0.6
c     2.c    225ms  1.4ms   2.0MB     233ms       0ms        clang 17.0.6
c     2.c    240ms  5.2ms   2.0MB     253ms       0ms        gcc 13.2.0
c     5.c    242ms  11ms    2.0MB     250ms       0ms        gcc 13.2.0
c     2.c    254ms  2.5ms   1.4MB     267ms       0ms        zigcc 0.12.0
zig   1.zig  265ms  7.1ms   1.3MB     280ms       0ms        zig 0.12.0

Input: 500000

lang  code   time  stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2.zig  18ms  0.2ms   1.3MB     10ms        0ms        zig 0.12.0
c     5.c    21ms  0.3ms   2.0MB     13ms        0ms        clang 17.0.6
c     8-i.c  21ms  0.3ms   1.5MB     17ms        0ms        zigcc 0.12.0
c     5.c    21ms  0.9ms   1.4MB     13ms        0ms        zigcc 0.12.0
c     8-i.c  22ms  1.4ms   2.3MB     17ms        0ms        gcc 13.2.0
c     8-i.c  25ms  0.2ms   2.1MB     20ms        0ms        clang 17.0.6
c     5.c    25ms  1.2ms   2.3MB     20ms        0ms        gcc 13.2.0
c     2.c    25ms  0.4ms   2.1MB     20ms        0ms        clang 17.0.6
c     2.c    25ms  1.0ms   2.0MB     20ms        0ms        gcc 13.2.0
zig   1.zig  27ms  0.4ms   1.1MB     20ms        0ms        zig 0.12.0
c     2.c    28ms  1.7ms   1.4MB     20ms        0ms        zigcc 0.12.0

spectral-norm Input: 8000

lang  code     time    stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  340ms   1.5ms   1.6MB     2550ms      3ms        zig 0.12.0
c     6-im.c   547ms   11ms    4.6MB     4593ms      17ms       clang 17.0.6
c     4-m.c    566ms   7.7ms   4.6MB     4780ms      13ms       clang 17.0.6
c     6-im.c   568ms   14ms    2.1MB     4820ms      0ms        gcc 13.2.0
c     4-m.c    586ms   11ms    2.1MB     4980ms      0ms        gcc 13.2.0
c     5-im.c   727ms   4.5ms   2.1MB     6203ms      0ms        gcc 13.2.0
c     3-m.c    749ms   13ms    4.6MB     6327ms      7ms        clang 17.0.6
c     5-im.c   799ms   11ms    4.6MB     6770ms      23ms       clang 17.0.6
c     3-m.c    833ms   23ms    2.1MB     7070ms      0ms        gcc 13.2.0
zig   2.zig    1168ms  3.1ms   1.5MB     1257ms      0ms        zig 0.12.0
zig   1.zig    2774ms  59ms    1.5MB     3010ms      0ms        zig 0.12.0

Input: 4000

lang  code     time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  89ms   1.6ms   1.6MB     493ms       0ms        zig 0.12.0
c     6-im.c   157ms  11ms    4.9MB     1237ms      3ms        clang 17.0.6
c     6-im.c   163ms  8.7ms   2.1MB     1327ms      0ms        gcc 13.2.0
c     4-m.c    164ms  4.2ms   4.8MB     1300ms      3ms        clang 17.0.6
c     4-m.c    176ms  5.4ms   2.0MB     1420ms      0ms        gcc 13.2.0
c     5-im.c   205ms  17ms    2.0MB     1683ms      0ms        gcc 13.2.0
c     3-m.c    224ms  26ms    4.8MB     1797ms      10ms       clang 17.0.6
c     5-im.c   226ms  2.9ms   4.9MB     1820ms      10ms       clang 17.0.6
zig   2.zig    292ms  1.1ms   1.4MB     310ms       0ms        zig 0.12.0
c     3-m.c    488ms  33ms    2.1MB     3600ms      0ms        gcc 13.2.0
zig   1.zig    720ms  11ms    1.3MB     773ms       0ms        zig 0.12.0

Input: 2000

lang  code     time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  28ms   0.6ms   1.4MB     47ms        0ms        zig 0.12.0
c     6-im.c   54ms   12ms    4.6MB     363ms       0ms        clang 17.0.6
c     3-m.c    61ms   4.1ms   4.6MB     427ms       0ms        clang 17.0.6
c     4-m.c    65ms   3.5ms   4.9MB     433ms       7ms        clang 17.0.6
zig   2.zig    72ms   1.0ms   1.4MB     70ms        0ms        zig 0.12.0
c     5-im.c   73ms   11ms    4.6MB     520ms       3ms        clang 17.0.6
c     6-im.c   78ms   9.4ms   2.0MB     577ms       0ms        gcc 13.2.0
c     3-m.c    78ms   10ms    2.1MB     600ms       0ms        gcc 13.2.0
c     4-m.c    80ms   12ms    2.1MB     590ms       0ms        gcc 13.2.0
c     5-im.c   81ms   13ms    2.3MB     620ms       0ms        gcc 13.2.0
zig   1.zig    192ms  3.6ms   1.3MB     203ms       0ms        zig 0.12.0

4.1. Results

We can see that execution times vary greatly depending on the programming language, the implementation, the target, and the compiler used. In addition, we found that Zig code that uses vectors performs better than code that does not, and that unoptimized Zig code tends to perform less well than C code.

On the other hand, we can see that the Zig programs consume less memory than the C programs. To keep the comparison fair, the allocator used in Zig is the libc allocator. Finally, we can see that performance also depends on the input.

We can't say that one language is faster than the other, as this will depend enormously on the context, the optimisations used, and the complexity of the code. We can make a hypothesis based on the measurements taken: if dynamic memory allocation is performed correctly, we can expect Zig code to generally consume less memory than C code.

As a reminder, the results obtained are specific to the machine and operating system used. In addition, we can compare the differences in performance between the benchmarks obtained on a local machine and the machine used by the Programming Language Benchmark site. For their part, the Programming Language Benchmark site uses a machine with the following information:

  • CPU INFO: [x86_64][4 cores] AMD EPYC 7763 64-Core Processor (Model 1)
  • We should also note that compiler versions have changed, which can also add variability to the results
    1. gcc 13.2.0
    2. clang 14.0.0-1ubuntu1.1
    3. zig 0.12.0-dev.2341+92211135f
    4. zigcc 0.12.0-dev.2341+92211135f

For the knucleotide example, we obtain a comparable ranking in the two results. However, we notice that our x86 target is 85 ms slower for the Zig version with an input of 2500000, but it consumes less memory (around 500 KB less).

For the nbody example, the performance obtained is on average better on our x86 target, but the memory used remains in the same order of magnitude. We also see that the 8-i.c implementation is faster on the AMD EPYC for the zigcc compiler and for gcc, and that the 5.c implementation with the Clang compiler is among the first in the x86 test but among the last on AMD. The difference between version 14 and version 17 of clang can also affect the results. This is the only example where the zigcc compiler is used for benchmarking; the performance results obtained are similar to those we can get with gcc or clang. We can also observe that the same code compiled with zigcc uses less memory than with the other compilers. Further tests would be needed to confirm this, as the nbody example does not allocate memory dynamically.

In the last example, we notice a big difference in the result of the 2.zig code for an input of 8000: it is second to last for performance on x86, while it is second best on AMD. Furthermore, the standard deviation can be higher for certain codes, but the results differ depending on the target. Finally, the 1.zig code is the worst-performing code in both benchmarks: it runs 2x slower than the second-worst code.

5. 1BRC

The One Billion Row Challenge involves calculating the min, max, and average of 1 billion rows of temperature measurements (a 12 GiB file). {{< cite "BillionRowChallenge" >}} Everyone can submit code to try to provide a better implementation. Originally the challenge was for Java, but solutions now exist in many languages.
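
Each line of the measurements file pairs a weather station name with a temperature reading, separated by a semicolon, for example (illustrative values):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8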

In this documentation, we tried to see the differences between three implementations: one in C and two in Zig.

In each repository, we can find how to compile the binary; we followed their instructions.

After building each implementation, we benchmarked it with hyperfine to get the results:

  1. First, we used hyperfine to benchmark the three implementations one after the other:
hyperfine --warmup=3 './1brc/bin/analyze ~/Documents/1brc/measurements.txt' './1brc-zig/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' './1brc-zig2/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' --export-orgmode "benchmark-1brc.org" --output null -i

And we have the results:

Command                 Mean [s]        Min [s]  Max [s]  Relative
C implementation        11.774 ± 2.051  7.582    13.034   1.00
1st Zig implementation  21.283 ± 0.051  21.228   21.390   1.81 ± 0.31
2nd Zig implementation  24.788 ± 0.187  24.332   24.931   2.11 ± 0.37

We can see the performance differences between C and Zig. We also note that the best and worst C results differ by about 6 seconds.

To see if we get different performance, we benchmarked each implementation on its own:

The first Zig implementation 1brc-zig

Command                 Mean [s]        Min [s]  Max [s]  Relative
1st Zig implementation  15.647 ± 2.206  13.301   21.649   1.00

Sometimes when we redo the benchmark we don't get this huge difference. The implementation relies on the Linux kernel to memory-map the measurements file, so depending on the computer load, some differences can appear.

Command                 Mean [s]        Min [s]  Max [s]  Relative
1st Zig implementation  12.749 ± 0.345  12.202   13.545   1.00

The second Zig implementation 1brc-zig - second implementation

Command                 Mean [s]        Min [s]  Max [s]  Relative
2nd Zig implementation  18.730 ± 2.837  15.192   24.906   1.00

And the C implementation

Command           Mean [s]       Min [s]  Max [s]  Relative
C implementation  8.898 ± 0.591  7.507    9.510    1.00

In the C implementation, we note that its author uses a custom hashmap to hash the city names: a fast multiplication hash combined with linear probing. The load factor of his hashmap is 0.5. {{< cite "BillionRowChallenge" >}}

The first Zig implementation also uses a hashmap, but one that comes from the Zig standard library. It is generic and uses the Wyhash algorithm to compute the hashes. {{< cite "wangyiWangyifudanWyhash2024" >}}
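
As an illustration, here is a minimal sketch (assumed usage of the standard hashmap, not the benchmarked code) of accumulating per-city statistics with std.StringHashMap:

const std = @import("std");

const Stats = struct { min: f64, max: f64, sum: f64, count: u64 };

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var map = std.StringHashMap(Stats).init(gpa.allocator());
    defer map.deinit();

    // A real implementation would do this for every line of the file.
    const city = "Zurich";
    const temp: f64 = 12.3;
    const entry = try map.getOrPut(city);
    if (entry.found_existing) {
        entry.value_ptr.min = @min(entry.value_ptr.min, temp);
        entry.value_ptr.max = @max(entry.value_ptr.max, temp);
        entry.value_ptr.sum += temp;
        entry.value_ptr.count += 1;
    } else {
        entry.value_ptr.* = .{ .min = temp, .max = temp, .sum = temp, .count = 1 };
    }
}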

The Hotspot program was used to create flamegraphs of the implementations:

Looking at the Zig flamegraph, you can see that 40% of the cycle time is spent inserting cities into the hashmap.

flamegraph_1brc_zig_1.png

Figure 1: Flamegraph of the Zig 1brc implementation

In C, on the other hand, the insertion time in the hashmap is not measurable because it is included in the process_chunck function and does not require any complex hash calculation: the base index maps into an array, and then, if there are collisions, the index is incremented.

flamegrah_1brc_c.png

Figure 2: Flamegraph of the C 1brc implementation

In the example below, Danny van Kooten, the author of the C One Billion Row Challenge implementation, shows how the hash is calculated for each city. {{< cite "OneBillionRows2024a" >}} He uses a simple fast multiplication hash combined with linear probing.

In the first block, he iterates over each character of the city name, multiplying the running hash by 31 and adding the character.

In the second block, he checks whether the hash can be inserted into the hashmap. If there is a collision, he iterates over the hashmap until he finds a free spot or a matching key.

// hash everything up to ';'
// assumption: key (city name) is at least 1 char
unsigned int len = 1;
unsigned int hash = (unsigned char)buf[0];
while (buf[len] != ';') {
  hash = (hash * 31) + (unsigned char)buf[len++];
}

// probe map until free spot or matching key
unsigned int idx = hashmap[hash & cap-1];
while (idx != 0 && memcmp(results[idx], buf, len) != 0) {
    hash++;
    idx = hashmap[hash & cap-1];
}

// idx is now either 0 (new entry)
// or contains the index of our key in the results array

The second Zig implementation uses a B-tree to merge data; this merge is performed because several threads process the data. However, it also uses the same standard-library hashmap as the first version.

{{< references >}}

Author: tetratrux

Created: 2024-07-22 Mon 16:25

Validate