Build system

Table of Contents

1. Build modes
2. Cross-compilation
3. Performance comparison & SIMD
4. Programming language Benchmarks
5. 1BRC

1. Build modes

Zig build modes are much easier for beginners to grasp than their C equivalents.

There are only 4 different build modes:

  • Debug (default)
  • ReleaseFast
  • ReleaseSafe
  • ReleaseSmall

You can check them in more detail here

There are also a few other options you can tweak, such as AVX, here
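
For example, a build mode can be selected directly on the command line:

zig build-exe -OReleaseSafe src/main.zig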

1.1. Automatically generate documentation

Zig has an experimental feature that automatically generates documentation. It scans all the public structures and functions and creates documentation from them. Moreover, comments starting with /// are used to attach more information to types.
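
As a minimal illustration (the Point struct here is a made-up example, not from the repository):

/// A 2D point. Doc comments like this one are picked up by the
/// documentation generator for public declarations.
pub const Point = struct {
    x: f64,
    y: f64,

    /// Returns the Euclidean distance of the point from the origin.
    pub fn norm(self: Point) f64 {
        return @sqrt(self.x * self.x + self.y * self.y);
    }
};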

This example was created with the help of the build.zig from the Zig language GitHub repository. It works the same way the Zig team auto-generates the standard library documentation. See: build.zig

const std = @import("std");

pub fn build(b: *std.Build) void {
    const autodoc_test = b.addTest(.{
        .root_source_file = .{ .path = "src/main.zig" },
    });
    const install_std_docs = b.addInstallDirectory(.{
        .source_dir = autodoc_test.getEmittedDocs(),
        .install_dir = .prefix,
        .install_subdir = "doc",
    });

    const doc_step = b.step("docs", "Generate documentation");
    doc_step.dependOn(&install_std_docs.step);
    doc_step.makeFn = generateDocumentation;
}

fn generateDocumentation(self: *std.Build.Step, progress: *std.Progress.Node) !void {
    _ = self;
    _ = progress;
    std.log.info("The documentation has been generated", .{});
}
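
With this step declared, the documentation can be generated from the project root with:

zig build docs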

1.2. Strip the output binary in Zig on Linux

ELF format: an ELF executable contains various sections that hold program and control information. For example:

  • .bss: holds uninitialized variables and data
  • .text: holds the instructions of the program
  • .debug: holds unspecified information for debugging

Removing the debug symbols reduces the executable's size, makes it harder to reverse engineer, and may improve load performance.

By default, Zig produces an executable that contains all the debug symbols. However, Zig provides a drop-in replacement for the GNU objcopy program.

To strip an output binary from the zig command line:

zig build-exe -fstrip src/main.zig

To make sure that the output executable is stripped, the file command is useful:

file main

And the output:

main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

With the Zig build system (build.zig), there is an option to strip an executable:

const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});

var exe = b.addExecutable(.{
    .name = "linkedlist",
    .root_source_file = .{ .path = "src/main.zig" },
    .target = target,
    .optimize = optimize,
});

exe.strip = true;

If the strip option is set to false, we have:

$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

And with the option set to true:

$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

1.3. Separate debug symbols from ELF executable

The section above shows how to remove all the symbols from the executable directly in the Zig build system. This section focuses instead on the Zig objcopy command. Zig has implemented its own objcopy utility to strip and remove symbols, like the objcopy from GNU binutils. (Note: for now, the Zig objcopy has fewer features.)

The advantage of having objcopy directly in Zig is that there is no need for multiple objcopy executables. In a cross-compilation workflow, each binutils build must match the target CPU architecture, so a built-in objcopy avoids juggling different toolchains and scripts for the various targets.

Here's how to display the help of Zig's objcopy:

zig objcopy --help

The following example illustrates several uses of the objcopy command: first, removing the debug symbols from the executable; then, keeping only the debug symbols in a separate file; and finally, linking the debug symbols file back to an executable that has none.

For this example, the output executable of the Zig build toolchain is named main.

$ file main
main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

The first step is to strip the debug info from the executable:

zig objcopy --strip-debug main mainStripped

The short flag alternative can also be used:

zig objcopy -g main mainNoDebugInfo

After this command, the new executable no longer has the debug info.

$ file mainNoDebugInfo
mainNoDebugInfo: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

The second step shows how to keep only the debug symbols in the main.dbg file:

zig objcopy --only-keep-debug main main.dbg

Finally, to recreate an executable with a link to a debug symbols file, the following command can be used:

zig objcopy --add-gnu-debuglink=main.dbg mainNoDebugInfo mainWithDbgSymbols

Now the executable named mainWithDbgSymbols can be used, for example in GDB, to debug the program (GDB will know where its debug symbols file is located).
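
For example, a quick check with GDB (assuming main.dbg sits next to the executable, which is where GDB looks for the debug link by default):

gdb ./mainWithDbgSymbols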

Removing the debug symbols creates a smaller executable. Below are the sizes from a mini-example:

executable name       size
main                  1.9M
main.dbg              1.2M
mainNoDebugInfo       690K
mainWithDbgSymbols    690K

The advantage of this method is that the executable can be shipped to production without debug symbols (it is harder to reverse engineer and has a smaller size). But when a bug is reported, it is still possible to debug it: all you have to do is link the debug symbols file to the executable.

2. Cross-compilation

2.1. Cross-compile for embedded Linux

The -target flag of zig build-exe selects the target architecture for cross-compilation. There are many possible targets; see {{< cite "CrossCompilationZig2024" >}} for the entire list.

Here's an example of a test that was performed. The aim was to run an executable produced by Zig (on an x86_64 host) on a NanoPi NEO Plus2, which has an arm64 architecture and runs a Linux kernel. The information about the embedded target:

uname -a
Linux csel 5.15.148 #2 SMP PREEMPT Mon Mar 4 21:21:00 UTC 2024 aarch64 GNU/Linux

Here's the command to cross-compile the executable (the executable is also stripped to have a smaller size):

zig build-exe src/main.zig -target aarch64-linux -fstrip
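
If the project uses a build.zig with b.standardTargetOptions (as in section 1.2), the same cross-compilation can be requested with the standard -Dtarget option, for example:

zig build -Dtarget=aarch64-linux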

The executable runs flawlessly on the embedded target; below is the output:

# ./main
All your codebase are belong to us.
Run `zig build test` to run the tests.

3. Performance comparison & SIMD

The following sections explain the different benchmarks and code.

3.1. Vectors

A Zig vector is a fixed-length group of values of a single child type: booleans, integers, floats, or pointers. It compiles to SIMD instructions to parallelize operations where possible.

As described in the language documentation, vectors shorter than the target machine's native SIMD size compile to a single SIMD instruction, while longer vectors compile to multiple SIMD instructions. {{< cite "DocumentationZigProgramming" >}}

The Zig vector does not work in the same way as the C++ std::vector (it is not a dynamic array). {{< cite "WelcomeZigGuide2024" >}}

We can use multiple operations on Zig vectors, as the example below shows:

  • Arithmetic
  • Bitwise operations
  • Comparison operators

const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;
fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(size, i32) = .{ 5, 6, 7, 8 };

    try stdout.writer().print("Access to an item: {d}\n", .{a[2]});

    const c = a + b;
    try print(@TypeOf(c), "c", c);

    const d = a * b;
    try print(@TypeOf(d), "d", d);

    const e = a << b;
    try print(@TypeOf(e), "e", e);

    const f = a == b;
    try print(@TypeOf(f), "f", f);
}
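
Running this example should print something like the following (the trailing separators come from the print helper):

Access to an item: 3
c: 6, 8, 10, 12,
d: 5, 12, 21, 32,
e: 32, 128, 384, 1024,
f: false, false, false, false,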

Zig supports a maximum vector length of:

\begin{equation} 2^{32} - 1 \end{equation}

But currently, on the latest Zig version, a large length (e.g. 2^20) can crash the compiler at compile time. See the documentation {{< cite "DocumentationZigProgramminga" >}}.

There are utility functions to simplify the use of vectors:

  1. @splat creates a vector of a given length with every element set to the same value. @splat
  2. @reduce reduces a vector to a scalar with a given operation. For integers, every operation is available; for booleans, the operations are .And, .Or, .Xor; and for floating points, .Min, .Max, .Add, .Mul. {{< cite "DocumentationZigProgramming" >}} @reduce
  3. @shuffle constructs a new vector by selecting elements from two vectors (a and b) according to a mask array. For each element of the result, the mask value decides where it comes from: a value of 0 or more selects that index of the first vector (a); a negative value selects from the second vector (b), with -1 meaning index 0, -2 meaning index 1, and so on (see the example below). @shuffle
  4. @select selects elements from two vectors (a and b) based on a pred vector: if pred[i] is true, the element will be a[i], otherwise b[i]. @select

const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;
fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = @splat(3);
    try print(@TypeOf(a), "a", a);

    const b = @reduce(.Add, a);
    try stdout.writer().print("b: {d}\n", .{b});

    const mask: @Vector(size, i32) = .{ 1, -2, -1, 0};
    const c: @Vector(size, u8) = .{ 't', 't', 'q', 'q'};
    const d: @Vector(size, u8) = .{ 's', 'e', 't', 'a'};
    const e = @shuffle(u8, c, d, mask);
    try print(@TypeOf(e), "e", e);

    const pred: @Vector(size, bool) = .{ true, false, false, true};
    const f: @Vector(size, u8) = .{ 'a', 'a', 'a', 'a'};
    const g: @Vector(size, u8) = .{ 'b', 'b', 'b', 'b'};
    const h = @select(u8, pred, f, g);
    try print(@TypeOf(h), "h", h);
}
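
Running this example should print something like:

a: 3, 3, 3, 3,
b: 12
e: t, e, s, t,
h: a, b, b, a,

Note how the shuffle mask assembles "test" from the two input vectors.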

3.2. Leibniz algorithm

This section compares the performance of Zig with other languages, in particular C.

The speed-comparison GitHub repository {{< cite "NiklasheerSpeedcomparisonRepo" >}} contains different code examples that implement the Leibniz formula {{< cite "LeibnizFormula2024" >}}. The purpose of this algorithm is to compute π with the following series:

\begin{equation} \frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \ldots = \sum_{k = 0}^{\infty} \frac{(-1)^k}{2k + 1} \end{equation}

Below are the examples from the speed-comparison GitHub repository. They have been modified so that they can be run in Org mode.

The C code shows the standard way to compute the Leibniz series; at each iteration, the approximation of π improves. The C compiler will optimize the code to use SIMD instructions under the hood. The line x = -1.0 + 2.0 * (i & 0x1); is written this way to enable compiler optimization: the compiler can then use the fmadd instruction, which performs a multiplication and an addition in a single operation. {{< cite "IntelIntrinsicsGuide" >}} {{< cite "ArmA64Instruction" >}}

#include <stdio.h>

int main(void)
{
    unsigned rounds = 1000000000;
    double pi = 1.0;

    rounds += 2u; // do this outside the loop

    for (unsigned i = 2u; i < rounds; ++i)
    {
        double x = -1.0 + 2.0 * (i & 0x1); // allows vectorization
        pi += (x / (2u * i - 1u)); // double / unsigned = double
    }

    pi *= 4;
    printf("%.16f\n", pi); // print 16 decimal digits of pi
    return 0;
}

The code below shows the same algorithm adapted to the Zig language.

const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;

    var pi: f64 = 1.0;
    var i: usize = 2;
    while (i < rounds) : (i += 1) {
        const isOdd: f64 = @floatFromInt(i & 0x1);
        const x: f64 = -1.0 + 2.0 * isOdd;
        const den: f64 = @floatFromInt(2 * i - 1);
        pi += (x / den);
    }
    pi *= 4;
    try std.io.getStdOut().writer().print("{}", .{pi});
}

The optimized Zig code below produces similar output to the two other versions, but it uses SIMD instructions to optimize the computation manually. The number of loop iterations is divided by 4 because four terms are computed in parallel (note that the vector length is 4). At the end of the example, a scalar loop without vectors handles the last iterations: the vectorized loop only covers the part of the round count that is divisible by 4.

const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;
    const unroll = 4;

    const x = @Vector(4, f64){ -1.0, 1.0, -1.0, 1.0 };
    var den: @Vector(4, f64) = @splat(0.0);
    const inc: @Vector(4, f64) = @splat(4.0);
    const two: @Vector(4, f64) = @splat(2.0);
    const mone: @Vector(4, f64) = @splat(-1.0);
    var ivec = @Vector(4, f64){ 2.0, 3.0, 4.0, 5.0 };
    var pivec: @Vector(4, f64) = @splat(0.0);

    const vec_end: u64 = rounds - rounds % unroll;

    // vectorized loop: computes four terms of the series per iteration
    var i: u64 = 2;
    while (i < vec_end) : (i += unroll) {
        den = mone + (two * ivec);
        ivec = ivec + inc;
        pivec = pivec + (x / den);
    }

    var _x: f64 = 1.0;
    var pi: f64 = 1.0;
    pi += @reduce(.Add, pivec);
    // scalar loop for the remaining iterations
    // (rounds is not necessarily divisible by 4)
    while (i < rounds) : (i += 1) {
        _x = -_x;
        const i_2: f64 = @floatFromInt(i);
        pi += _x / (2.0 * i_2 - 1.0);
    }
    pi *= 4;

    try std.io.getStdOut().writer().print("{d:.16}\n", .{pi});
}

3.3. Benchmark your x86_64 CPU

The code below shows the commands to compile the three examples explained above. Before executing the code block, you need to tangle the examples with the shortcut C-c C-v t.

The command to compile the C code (with gcc) is specific to an x86_64 architecture; if you want to follow the examples, make sure you have a compatible CPU. Note: if Emacs is unable to find the path to Zig during execution, the environment variable must be exported in the .zshenv file (for a zsh shell).

gcc leibniz.c -o leibniz_c -O3 -s -flto -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig -fstrip
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_simd_zig -fstrip

You will find the three executables in your directory.

Now, to benchmark the code, you need a tool to run the analysis: hyperfine {{< cite "peterHyperfine2023" >}}. You must install hyperfine and have it in your PATH to run the benchmark below: https://github.com/sharkdp/hyperfine

hyperfine --warmup=3 './leibniz_c' './leibniz_zig' './leibniz_simd_zig' -i -N

If you run the benchmark on your computer, you can see the results in the block above.

3.4. ARM benchmark example

We ran the same benchmark for an ARM CPU; here are the compiler options used. The target is a NanoPi NEO Plus2.

aarch64-buildroot-linux-gnu-gcc leibniz.c -o leibniz_c -O3 -march=armv8-a -mtune=cortex-a53 -s -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_zig_simd_aarch -target aarch64-native -mcpu cortex_a53 -fstrip
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig_aarch -target aarch64-native -mcpu cortex_a53 -fstrip

On our ARM target we cannot run hyperfine, so we measured each program 8 times and computed statistics.

Program                   Mean [s]  Max [s]  Min [s]
./leibniz_c               20.57     20.58    20.57
./leibniz_zig_aarch       19.95     19.96    19.95
./leibniz_zig_simd_aarch  14.43     14.44    14.43

There is a difference in performance between the ARM and the x86 benchmarks. Note that the instruction sets used in the two benchmarks are different, so the optimizations may differ. Moreover, the compiler version used is not the same for the C example:

  • For x86: gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7)
  • For ARM: aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08.3-dirty) 11.3.0

3.5. Different code examples benchmark

The benchmarks on my computer:

Command             Mean [ms]    Min [ms]  Max [ms]  Relative
./leibniz_zig       85.9 ± 0.4   85.0      87.0      2.02 ± 0.03
./leibniz_zig_simd  42.5 ± 0.5   41.6      43.7      1.00
./leibniz_c         43.3 ± 0.8   42.2      45.4      1.02 ± 0.02
./leibniz_rs        86.4 ± 1.5   85.2      94.3      2.03 ± 0.04
./leibniz_avx2      43.3 ± 0.5   42.1      44.3      1.02 ± 0.02

Command             Mean [ms]    Min [ms]  Max [ms]  Relative
./leibniz_zig       883.9 ± 9.5  864.1     896.0     1.99 ± 0.03
./leibniz_zig_simd  443.8 ± 5.2  437.2     452.0     1.00
./leibniz_c         481.5 ± 4.4  476.8     489.2     1.08 ± 0.02
./leibniz_rs        912.7 ± 9.7  897.1     925.1     2.06 ± 0.03
./leibniz_avx2      451.2 ± 6.3  443.8     465.7     1.02 ± 0.02

Command             Mean [s]       Min [s]  Max [s]  Relative
./leibniz_zig       8.889 ± 0.095  8.722    9.058    13.91 ± 0.33
./leibniz_zig_simd  4.386 ± 0.028  4.357    4.420    6.86 ± 0.15
./leibniz_c         0.665 ± 0.003  0.658    0.669    1.04 ± 0.02
./leibniz_rs        8.903 ± 0.053  8.855    8.973    13.93 ± 0.31
./leibniz_avx2      0.639 ± 0.013  0.618    0.657    1.00

4. Programming language Benchmarks

Another way of seeing the difference in performance between C and Zig is to use slightly more complex code. The Programming Language Benchmark {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}} git repository contains different programs written in different languages. Their website shows benchmarks made with their hardware (zig-vs-c). {{< cite "ZigVSBenchmarks" >}} We then reproduced the benchmarks between C and Zig to see whether there were any differences with our hardware.

Here are our results. CPU INFO: [x86_64][8 cores] Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (Model 142)

Here's what you need to know about file naming {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}}:

  • *-m in a file name stands for multi-threading or multi-processing
  • *-i in a file name stands for direct intrinsics usage (usage of SIMD intrinsics via libraries is not counted). It means that the SIMD instructions are used directly in the code. The instructions can be found in the Intel Intrinsics Guide.
  • You may find time < time(user) + time(sys) for some non-parallelized programs; the overhead comes from the GC or JIT compiler, which is allowed to take advantage of multiple cores, as that is closer to real-world scenarios.

You can find how the examples are compiled in the repo here.

knucleotide

Input: 2500000:

lang  code   time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
c     1-m.c  185ms  8.3ms   23.6MB    623ms       13ms       gcc 13.2.0
c     1-m.c  308ms  19ms    27.5MB    2143ms      80ms       clang 17.0.6
zig   1.zig  856ms  6.0ms   19.9MB    907ms       10ms       zig 0.12.0

Input: 250000:

lang  code   time  stddev  peak-mem  time(user)  time(sys)  compiler/runtime
c     1-m.c  33ms  1.8ms   11.9MB    153ms       0ms        gcc 13.2.0
c     1-m.c  47ms  7.8ms   16.8MB    310ms       15ms       clang 17.0.6
zig   1.zig  90ms  2.1ms   7.6MB     90ms        0ms        zig 0.12.0

nbody Input: 5000000

lang  code   time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2.zig  179ms  2.1ms   1.3MB     187ms       0ms        zig 0.12.0
c     5.c    193ms  1.7ms   2.1MB     200ms       0ms        clang 17.0.6
c     5.c    211ms  7.9ms   1.4MB     220ms       0ms        zigcc 0.12.0
c     8-i.c  215ms  2.9ms   1.4MB     223ms       0ms        zigcc 0.12.0
c     8-i.c  222ms  2.9ms   2.0MB     233ms       0ms        gcc 13.2.0
c     8-i.c  225ms  3.5ms   2.1MB     233ms       0ms        clang 17.0.6
c     2.c    225ms  1.4ms   2.0MB     233ms       0ms        clang 17.0.6
c     2.c    240ms  5.2ms   2.0MB     253ms       0ms        gcc 13.2.0
c     5.c    242ms  11ms    2.0MB     250ms       0ms        gcc 13.2.0
c     2.c    254ms  2.5ms   1.4MB     267ms       0ms        zigcc 0.12.0
zig   1.zig  265ms  7.1ms   1.3MB     280ms       0ms        zig 0.12.0

Input: 500000

lang  code   time  stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2.zig  18ms  0.2ms   1.3MB     10ms        0ms        zig 0.12.0
c     5.c    21ms  0.3ms   2.0MB     13ms        0ms        clang 17.0.6
c     8-i.c  21ms  0.3ms   1.5MB     17ms        0ms        zigcc 0.12.0
c     5.c    21ms  0.9ms   1.4MB     13ms        0ms        zigcc 0.12.0
c     8-i.c  22ms  1.4ms   2.3MB     17ms        0ms        gcc 13.2.0
c     8-i.c  25ms  0.2ms   2.1MB     20ms        0ms        clang 17.0.6
c     5.c    25ms  1.2ms   2.3MB     20ms        0ms        gcc 13.2.0
c     2.c    25ms  0.4ms   2.1MB     20ms        0ms        clang 17.0.6
c     2.c    25ms  1.0ms   2.0MB     20ms        0ms        gcc 13.2.0
zig   1.zig  27ms  0.4ms   1.1MB     20ms        0ms        zig 0.12.0
c     2.c    28ms  1.7ms   1.4MB     20ms        0ms        zigcc 0.12.0

spectral-norm Input: 8000

lang  code     time    stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  340ms   1.5ms   1.6MB     2550ms      3ms        zig 0.12.0
c     6-im.c   547ms   11ms    4.6MB     4593ms      17ms       clang 17.0.6
c     4-m.c    566ms   7.7ms   4.6MB     4780ms      13ms       clang 17.0.6
c     6-im.c   568ms   14ms    2.1MB     4820ms      0ms        gcc 13.2.0
c     4-m.c    586ms   11ms    2.1MB     4980ms      0ms        gcc 13.2.0
c     5-im.c   727ms   4.5ms   2.1MB     6203ms      0ms        gcc 13.2.0
c     3-m.c    749ms   13ms    4.6MB     6327ms      7ms        clang 17.0.6
c     5-im.c   799ms   11ms    4.6MB     6770ms      23ms       clang 17.0.6
c     3-m.c    833ms   23ms    2.1MB     7070ms      0ms        gcc 13.2.0
zig   2.zig    1168ms  3.1ms   1.5MB     1257ms      0ms        zig 0.12.0
zig   1.zig    2774ms  59ms    1.5MB     3010ms      0ms        zig 0.12.0

Input: 4000

lang  code     time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  89ms   1.6ms   1.6MB     493ms       0ms        zig 0.12.0
c     6-im.c   157ms  11ms    4.9MB     1237ms      3ms        clang 17.0.6
c     6-im.c   163ms  8.7ms   2.1MB     1327ms      0ms        gcc 13.2.0
c     4-m.c    164ms  4.2ms   4.8MB     1300ms      3ms        clang 17.0.6
c     4-m.c    176ms  5.4ms   2.0MB     1420ms      0ms        gcc 13.2.0
c     5-im.c   205ms  17ms    2.0MB     1683ms      0ms        gcc 13.2.0
c     3-m.c    224ms  26ms    4.8MB     1797ms      10ms       clang 17.0.6
c     5-im.c   226ms  2.9ms   4.9MB     1820ms      10ms       clang 17.0.6
zig   2.zig    292ms  1.1ms   1.4MB     310ms       0ms        zig 0.12.0
c     3-m.c    488ms  33ms    2.1MB     3600ms      0ms        gcc 13.2.0
zig   1.zig    720ms  11ms    1.3MB     773ms       0ms        zig 0.12.0

Input: 2000

lang  code     time   stddev  peak-mem  time(user)  time(sys)  compiler/runtime
zig   2-m.zig  28ms   0.6ms   1.4MB     47ms        0ms        zig 0.12.0
c     6-im.c   54ms   12ms    4.6MB     363ms       0ms        clang 17.0.6
c     3-m.c    61ms   4.1ms   4.6MB     427ms       0ms        clang 17.0.6
c     4-m.c    65ms   3.5ms   4.9MB     433ms       7ms        clang 17.0.6
zig   2.zig    72ms   1.0ms   1.4MB     70ms        0ms        zig 0.12.0
c     5-im.c   73ms   11ms    4.6MB     520ms       3ms        clang 17.0.6
c     6-im.c   78ms   9.4ms   2.0MB     577ms       0ms        gcc 13.2.0
c     3-m.c    78ms   10ms    2.1MB     600ms       0ms        gcc 13.2.0
c     4-m.c    80ms   12ms    2.1MB     590ms       0ms        gcc 13.2.0
c     5-im.c   81ms   13ms    2.3MB     620ms       0ms        gcc 13.2.0
zig   1.zig    192ms  3.6ms   1.3MB     203ms       0ms        zig 0.12.0

4.1. Results

We can see that execution times vary greatly depending on the programming language, the implementation, the target, and the compiler used. In addition, we found that Zig code that uses vectors performs better than code that does not, and that unoptimized Zig code tends to perform less well than C code.

On the other hand, we can see that the Zig programs consume less memory than the C programs. To keep the comparison fair, the allocator used in Zig is the libc allocator. Finally, we can see that performance also depends on the input.

We can't say that one language is faster than the other, as this will depend enormously on the context, the optimisations used, and the complexity of the code. We can make a hypothesis based on the measurements taken: if dynamic memory allocation is performed correctly, we can expect Zig code to generally consume less memory than C code.

As a reminder, the results obtained are specific to the machine and operating system used. In addition, we can compare the differences in performance between the benchmarks obtained on a local machine and the machine used by the Programming Language Benchmark site. For their part, the Programming Language Benchmark site uses a machine with the following information:

  • CPU INFO: [x86_64][4 cores] AMD EPYC 7763 64-Core Processor (Model 1)
  • We should also note that compiler versions have changed, which can also add variability to the results
    1. gcc 13.2.0
    2. clang 14.0.0-1ubuntu1.1
    3. zig 0.12.0-dev.2341+92211135f
    4. zigcc 0.12.0-dev.2341+92211135f

For the knucleotide example, we obtain a comparable ranking in the two results. However, we notice that our x86 target is 85 ms slower for the Zig version with an input of 2500000, but it consumes less memory (around 500 KB less).

For the nbody example, the performance obtained is on average better on our x86 target, but the memory used remains in the same order of magnitude. We also see that the 8-i.c implementation is faster on the AMD EPYC for the zigcc compiler and for gcc, and that the 5.c implementation with the Clang compiler is among the first in the x86 test but among the last on AMD. The difference between version 14 and version 17 of clang can also affect the results. This is the only example where the zigcc compiler is used for benchmarking; the performance results obtained are similar to those we can get with gcc or clang. We can also observe that the same code compiled with zigcc uses less memory than with the other compilers. Further tests would be needed to confirm this, as the nbody example does not allocate memory dynamically.

In the last example, we notice a big difference in the result of the 2.zig code for an input of 8000: it is second to last for performance on x86, while it is second best on AMD. Furthermore, the standard deviation can be higher for certain codes, but the results differ depending on the target. Finally, the 1.zig code is the worst-performing code in both benchmarks: it runs 2x slower than the second-worst code.

5. 1BRC

The One Billion Row Challenge involves calculating the min, max, and average of 1 billion rows of temperature measurements (a 12 GiB file). {{< cite "BillionRowChallenge" >}} Everyone can submit code to try to provide a better implementation. Originally the challenge was for Java, but solutions now exist in many languages.
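
Each line of the measurements file pairs a weather station name with a temperature reading, separated by a semicolon, for example (illustrative values):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8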

In this documentation, we tried to see the differences between three implementations: one in C and two in Zig.

In each repository, we can find how to compile the binary; we followed their instructions.

After building each implementation, we benchmarked it with hyperfine to get the results:

  1. First, we used hyperfine to benchmark the three implementations one after the other:
hyperfine --warmup=3 './1brc/bin/analyze ~/Documents/1brc/measurements.txt' './1brc-zig/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' './1brc-zig2/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' --export-orgmode "benchmark-1brc.org" --output null -i

And we have the results:

Command                 Mean [s]        Min [s]  Max [s]  Relative
C implementation        11.774 ± 2.051  7.582    13.034   1.00
1st Zig implementation  21.283 ± 0.051  21.228   21.390   1.81 ± 0.31
2nd Zig implementation  24.788 ± 0.187  24.332   24.931   2.11 ± 0.37

We can see the performance differences between C and Zig. We also note that the best and worst C results differ by about 6 seconds.

To see if we get different performance, we benchmarked each implementation on its own:

The first Zig implementation 1brc-zig

Command                 Mean [s]        Min [s]  Max [s]  Relative
1st Zig implementation  15.647 ± 2.206  13.301   21.649   1.00

Sometimes when we redo the benchmark we don't get this huge difference. The implementation relies on the Linux kernel to memory-map the measurements file, so depending on the computer load, some differences can appear.

Command                 Mean [s]        Min [s]  Max [s]  Relative
1st Zig implementation  12.749 ± 0.345  12.202   13.545   1.00

The second Zig implementation 1brc-zig - second implementation

Command                 Mean [s]        Min [s]  Max [s]  Relative
2nd Zig implementation  18.730 ± 2.837  15.192   24.906   1.00

And the C implementation

Command           Mean [s]       Min [s]  Max [s]  Relative
C implementation  8.898 ± 0.591  7.507    9.510    1.00

In the C implementation, we note that its author uses a custom hashmap to hash the city names: a fast multiplication hash combined with linear probing. The load factor of his hashmap is 0.5. {{< cite "BillionRowChallenge" >}}

The first Zig implementation also uses a hashmap, but one that comes from the Zig standard library. It is generic and uses the Wyhash algorithm to compute the hashes. {{< cite "wangyiWangyifudanWyhash2024" >}}
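
As an illustration, here is a minimal sketch (assumed usage of the standard hashmap, not the benchmarked code) of accumulating per-city statistics with std.StringHashMap:

const std = @import("std");

const Stats = struct { min: f64, max: f64, sum: f64, count: u64 };

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var map = std.StringHashMap(Stats).init(gpa.allocator());
    defer map.deinit();

    // A real implementation would do this for every line of the file.
    const city = "Zurich";
    const temp: f64 = 12.3;
    const entry = try map.getOrPut(city);
    if (entry.found_existing) {
        entry.value_ptr.min = @min(entry.value_ptr.min, temp);
        entry.value_ptr.max = @max(entry.value_ptr.max, temp);
        entry.value_ptr.sum += temp;
        entry.value_ptr.count += 1;
    } else {
        entry.value_ptr.* = .{ .min = temp, .max = temp, .sum = temp, .count = 1 };
    }
}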

The Hotspot program was used to create flamegraphs of the implementations:

Looking at the Zig flamegraph, you can see that 40% of the cycle time is spent inserting cities into the hashmap.

flamegraph_1brc_zig_1.png

Figure 1: Flamegraph of the Zig 1brc implementation

In C, on the other hand, the insertion time in the hashmap is not measurable because it is included in the process_chunck function and does not require any complex hash calculation: the base index maps into an array, and then, if there are collisions, the index is incremented.

flamegrah_1brc_c.png

Figure 2: Flamegraph of the C 1brc implementation

In the example below, Danny van Kooten, the author of the C One Billion Row Challenge implementation, shows how the hash is calculated for each city. {{< cite "OneBillionRows2024a" >}} He uses a simple fast multiplication hash combined with linear probing.

In the first block, he iterates over each character of the city name, multiplying the running hash by 31 and adding the character.

In the second block, he checks whether the hash can be inserted into the hashmap. If there is a collision, he iterates over the hashmap until he finds a free spot or a matching key.

// hash everything up to ';'
// assumption: key (city name) is at least 1 char
unsigned int len = 1;
unsigned int hash = (unsigned char)buf[0];
while (buf[len] != ';') {
  hash = (hash * 31) + (unsigned char)buf[len++];
}

// probe map until free spot or matching key
unsigned int idx = hashmap[hash & cap-1];
while (idx != 0 && memcmp(results[idx], buf, len) != 0) {
    hash++;
    idx = hashmap[hash & cap-1];
}

// idx is now either 0 (new entry)
// or contains the index of our key in the results array

The second Zig implementation uses a B-tree to merge data; this merge is performed because several threads process the data. However, it also uses the same standard-library hashmap as the first version.

{{< references >}}

Author: tetratrux

Created: 2024-07-22 Mon 16:25

Validate