Build system
1. Build modes
Zig build modes are much easier for beginners to grasp compared to C.
There are only 4 different build modes:

- Debug (default)
- ReleaseFast
- ReleaseSafe
- ReleaseSmall
You can check them in more detail here.
There are also a few other options you can tweak, like AVX, here.
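To make the mode selectable by the user, build.zig usually exposes it as an option; below is a minimal sketch, assuming the same std.Build API used later in this document (the executable name and source path are placeholders):

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    // Adds -Doptimize=Debug|ReleaseSafe|ReleaseFast|ReleaseSmall to `zig build`;
    // Debug is used when nothing is passed.
    const optimize = b.standardOptimizeOption(.{});
    const target = b.standardTargetOptions(.{});

    const exe = b.addExecutable(.{
        .name = "main",
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);
}
```

The mode can then be chosen with e.g. `zig build -Doptimize=ReleaseFast`, or directly with `zig build-exe -O ReleaseFast src/main.zig`.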
1.1. Generate documentation automatically
Zig has an experimental feature: automatically generated documentation.
It scans all the public structures and functions and creates documentation from them.
Moreover, comments starting with ///
are used to attach more information to types.
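As an illustration, doc comments might look like this on a public type (a minimal sketch; the names are made up for the example):

```zig
/// A 2D point. This doc comment is attached to the type
/// and appears in the generated documentation.
pub const Point = struct {
    x: f64,
    y: f64,

    /// Euclidean distance from the origin.
    pub fn norm(self: Point) f64 {
        return @sqrt(self.x * self.x + self.y * self.y);
    }
};
```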
This example was created with the help of the build.zig
of the Zig language GitHub repository.
It is used the same way the Zig team auto-generates the standard library documentation. See: build.zig
```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const autodoc_test = b.addTest(.{
        .root_source_file = .{ .path = "src/main.zig" },
    });
    const install_std_docs = b.addInstallDirectory(.{
        .source_dir = autodoc_test.getEmittedDocs(),
        .install_dir = .prefix,
        .install_subdir = "doc",
    });

    const doc_step = b.step("docs", "Generate documentation");
    doc_step.dependOn(&install_std_docs.step);
    doc_step.makeFn = generateDocumentation;
}

fn generateDocumentation(self: *std.build.Step, progress: *std.Progress.Node) !void {
    _ = self;
    _ = progress;
    std.log.info("The documentation has been generated", .{});
}
```
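Once this step is declared, running `zig build docs` triggers it; with the default prefix, the generated documentation should land under `zig-out/doc`.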
1.2. Strip the output binary on Linux
ELF format

In an ELF executable there are various sections that hold program and control information. For example:

- `.bss`: holds uninitialized variables and data.
- `.text`: holds the instructions of the program.
- `.debug`: holds unspecified information for debugging.
Removing the debug symbols reduces the executable's size, makes it harder to reverse engineer, and may slightly improve load times.
By default, Zig produces an executable with all the debug symbols.
However, Zig provides a drop-in replacement for the GNU objcopy program.
To strip the output on the zig command line:

```sh
zig build-exe -fstrip src/main.zig
```
To check that the output executable is indeed stripped, the file command is useful:

```sh
file main
```
And the output:

```
main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
```
With the Zig build system (build.zig) there is an option to strip an executable:
```zig
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});

var exe = b.addExecutable(.{
    .name = "linkedlist",
    .root_source_file = .{ .path = "src/main.zig" },
    .target = target,
    .optimize = optimize,
});
exe.strip = true;
```
If the option strip is set to false, we have:

```
$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped
```

And with the option set to true:

```
$ file linkedlist
linkedlist: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
```
1.3. Separate debug symbols from ELF executable
The chapter above shows how to remove all the symbols from the executable directly in the Zig build system.
This section focuses on the Zig objcopy command.
Zig has implemented its own objcopy utility to strip and remove symbols, like the objcopy from binutils (GNU utils).
(Note: for now, the Zig objcopy has fewer features.)
The advantage of having objcopy directly in Zig is that there is no need for multiple objcopy executables. Indeed, in a cross-compilation world, each binutils build must match the target CPU architecture. So this avoids juggling different toolchains and scripts to build for various targets.
Here's how to print the help of the Zig objcopy:

```sh
zig objcopy --help
```
The following example illustrates several applications of the objcopy command: first, removing the debug symbols from the executable; then, keeping only the debug symbols in a separate file; and finally, linking the debug symbols file to an executable that has none.
For this example, the output executable of the Zig build toolchain is named main.

```
$ file main
main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped
```
The first step is to strip the debug info from the executable:

```sh
zig objcopy --strip-debug main mainStripped
```
The short flag alternative can also be used:

```sh
zig objcopy -g main mainNoDebugInfo
```
After this command, the new executable no longer has the debug info.

```
$ file mainNoDebugInfo
mainNoDebugInfo: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
```
The second step shows how to keep only the debug symbols, in the main.dbg file:

```sh
zig objcopy --only-keep-debug main main.dbg
```
Finally, to recreate an executable with a link to a debug symbols file, the following command can be used:

```sh
zig objcopy --add-gnu-debuglink=main.dbg mainNoDebugInfo mainWithDbgSymbols
```
Now, the executable named mainWithDbgSymbols can be used, for example, in GDB to debug it (GDB will know where its debug symbols file is located).
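If needed, the symbols can also be loaded by hand in GDB with the `symbol-file main.dbg` command, but the debug-link section added by the command above makes this automatic.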
Removing the debug symbol will create a smaller executable. Bellow shows the process with a mini-example:
| executable name | size |
|---|---|
| main | 1.9M |
| main.dbg | 1.2M |
| mainNoDebugInfo | 690K |
| mainWithDbgSymbols | 690K |
The advantage of this method is that the executable can be shipped to production without debug symbols (it is harder to reverse engineer and has a smaller size). But when a bug is reported, it is still possible to debug it: all you have to do is link the debug symbols file to the executable.
2. Cross-compilation
2.1. Cross-compile for an embedded Linux target
The -target flag of zig build-exe selects the target architecture for cross-compilation.
There are multiple targets to choose from; see {{< cite "CrossCompilationZig2024" >}} for the entire list.
Here's an example of a test that was performed: the aim was to run an executable produced by Zig (from an x86_64 host) on a NanoPi NEO Plus2, which has an arm64 architecture and runs a Linux kernel.
The information about the embedded target:

```
$ uname -a
Linux csel 5.15.148 #2 SMP PREEMPT Mon Mar 4 21:21:00 UTC 2024 aarch64 GNU/Linux
```
Here's the command to cross-compile the executable (the executable is also stripped to have a smaller size):

```sh
zig build-exe src/main.zig -target aarch64-linux -fstrip
```
The executable runs flawlessly on the embedded target; below is its output:

```
# ./main
All your codebase are belong to us.
Run `zig build test` to run the tests.
```
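The same cross-target can also be baked into build.zig as the default; a minimal sketch, assuming the std.Build API used elsewhere in this document:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    // Default to aarch64-linux so a plain `zig build` cross-compiles;
    // it can still be overridden on the command line with -Dtarget=...
    const target = b.standardTargetOptions(.{
        .default_target = .{ .cpu_arch = .aarch64, .os_tag = .linux },
    });
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "main",
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);
}
```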
3. Performance comparison & SIMD
The following chapters explain different benchmarks and code.
3.1. Vectors
A Zig vector is an array of a child type: booleans, integers, floats, or pointers. It provides SIMD instructions for parallelizing operations where possible.
As described in the standard documentation, vectors shorter than the target machine's SIMD registers compile to a single SIMD instruction. On the other hand, vectors longer than the target machine's SIMD registers compile to multiple SIMD instructions. {{< cite "DocumentationZigProgramming" >}}
The Zig vector does not work the same way as the C++ vector. {{< cite "WelcomeZigGuide2024" >}}
We can use multiple operations on Zig vectors:
- Arithmetic
- Bitwise operations
- Comparison operators
```zig
const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;

fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(size, i32) = .{ 5, 6, 7, 8 };

    try stdout.writer().print("Access to an item: {d}\n", .{a[2]});

    const c = a + b;
    try print(@TypeOf(c), "c", c);
    const d = a * b;
    try print(@TypeOf(d), "d", d);
    const e = a << b;
    try print(@TypeOf(e), "e", e);
    const f = a == b;
    try print(@TypeOf(f), "f", f);
}
```
Zig supports a vector length of up to:

\begin{equation} 2^{32} - 1 \end{equation}

But currently, on the latest Zig version, a large length (e.g. 2^20) can crash at compile time; see the documentation {{< cite "DocumentationZigProgramminga" >}}.
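Instead of hard-coding a length, the standard library can suggest one for the build target; a small sketch (assuming `std.simd.suggestVectorLength`, available in recent Zig versions):

```zig
const std = @import("std");

pub fn main() !void {
    // Suggested number of f64 lanes for this target's SIMD registers,
    // e.g. 4 with AVX2 (256 bits / 64 bits); null if unknown, hence the fallback.
    const len = comptime std.simd.suggestVectorLength(f64) orelse 4;
    const V = @Vector(len, f64);
    const v: V = @splat(1.5);
    try std.io.getStdOut().writer().print("lanes: {d}, sum: {d}\n", .{ len, @reduce(.Add, v) });
}
```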
There are utility functions to simplify the use of vectors:

- `@splat`: creates a vector of a given length with every element set to the same value.
- `@reduce`: reduces a vector to a scalar with a given operation. For integers, every operation is available; for booleans, there are `.And`, `.Or`, and `.Xor`; and for floating points, there are also `.Min`, `.Max`, `.Add`, and `.Mul`. {{< cite "DocumentationZigProgramming" >}}
- `@shuffle`: constructs a new vector by selecting elements from two vectors (a and b) with a mask array. The function iterates over the mask: if the mask value is 0 or higher, it selects that index from the first vector (a); if the mask value is -1 or less, it selects from the second vector (b), with indices starting at -1 and decrementing (read the example below to make this clearer).
- `@select`: selects elements from two vectors (a and b) based on a `pred` vector. If `pred[i]` is true, the element is `a[i]`, otherwise `b[i]`.
```zig
const std = @import("std");

const stdout = std.io.getStdOut();
const size = 4;

fn print(comptime T: type, name: []const u8, vector: T) !void {
    try stdout.writer().print("{s}: ", .{name});
    for (0..size) |index| {
        try switch (T) {
            @Vector(size, bool) => stdout.writer().print("{}, ", .{vector[index]}),
            @Vector(size, u8) => stdout.writer().print("{c}, ", .{vector[index]}),
            else => stdout.writer().print("{d}, ", .{vector[index]}),
        };
    }
    try stdout.writer().print("\n", .{});
}

pub fn main() !void {
    const a: @Vector(size, i32) = @splat(3);
    try print(@TypeOf(a), "a", a);

    const b = @reduce(.Add, a);
    try stdout.writer().print("b: {d}\n", .{b});

    const mask: @Vector(size, i32) = .{ 1, -2, -1, 0 };
    const c: @Vector(size, u8) = .{ 't', 't', 'q', 'q' };
    const d: @Vector(size, u8) = .{ 's', 'e', 't', 'a' };
    const e = @shuffle(u8, c, d, mask);
    try print(@TypeOf(e), "e", e);

    const pred: @Vector(size, bool) = .{ true, false, false, true };
    const f: @Vector(size, u8) = .{ 'a', 'a', 'a', 'a' };
    const g: @Vector(size, u8) = .{ 'b', 'b', 'b', 'b' };
    const h = @select(u8, pred, f, g);
    try print(@TypeOf(h), "h", h);
}
```
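With these inputs, the example spells out readable strings: `e` selects a[1], b[1], b[0], a[0], i.e. the characters 't', 'e', 's', 't', and `h` comes out as 'a', 'b', 'b', 'a'.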
3.2. Leibniz algorithm
This chapter shows the performance of Zig compared to other languages, in particular C.
The speed-comparison GitHub repository {{< cite "NiklasheerSpeedcomparisonRepo" >}} contains different code examples that show the Leibniz formula {{< cite "LeibnizFormula2024" >}} in action. The purpose of this algorithm is to compute π with the following formula:
\begin{equation} \frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \ldots = \sum_{k = 0}^{\infty} \frac{(-1)^k}{2k + 1} \end{equation}

Below, we can find examples from the speed-comparison GitHub repository. They have been modified so that they can be run in Org mode.
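As a quick sense of scale (simple arithmetic, not taken from the repository): the series converges very slowly, the first four terms giving only

\begin{equation} 4 \left( 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} \right) = \frac{304}{105} \approx 2.8952 \end{equation}

and the error after n terms shrinks roughly like 1/n, which is why the benchmarks below run on the order of a billion iterations.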
The C code shows the standard way to compute the Leibniz series: at each iteration, the approximation of π improves.
The C compiler will optimize the C code to use SIMD instructions under the hood.
The line `x = -1.0 + 2.0 * (i & 0x1);` is written this way to enable the compiler to optimize it.
The compiler can then use the `fmadd` instruction, which performs a multiplication and an addition in a single operation. {{< cite "IntelIntrinsicsGuide" >}} {{< cite "ArmA64Instruction" >}}
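Zig exposes the same fused multiply-add directly through the `@mulAdd` builtin; a tiny sketch:

```zig
const std = @import("std");

pub fn main() !void {
    // @mulAdd(T, a, b, c) computes (a * b) + c with a single rounding step,
    // which backends can lower to an fmadd instruction where available.
    const r = @mulAdd(f64, 2.0, 3.0, 1.0); // 7
    try std.io.getStdOut().writer().print("{d}\n", .{r});
}
```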
```c
#include <stdio.h>

int main(void) {
    unsigned rounds = 1000000000;
    double pi = 1.0;

    rounds += 2u; // do this outside the loop
    for (unsigned i = 2u; i < rounds; ++i) {
        double x = -1.0 + 2.0 * (i & 0x1); // allows vectorization
        pi += (x / (2u * i - 1u));         // double / unsigned = double
    }
    pi *= 4;
    printf("%.16f\n", pi); // print 16 decimal digits of pi
    return 0;
}
```
The code below shows the adaptation of the C code to the Zig language.
```zig
const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;

    var pi: f64 = 1.0;
    var i: usize = 2;
    while (i < rounds) : (i += 1) {
        const isOdd: f64 = @floatFromInt(i & 0x1);
        const x: f64 = -1.0 + 2.0 * isOdd;
        const den: f64 = @floatFromInt(2 * i - 1);
        pi += (x / den);
    }
    pi *= 4;
    try std.io.getStdOut().writer().print("{}", .{pi});
}
```
The optimized Zig code below produces a similar output to the two other programs, but it uses SIMD instructions to manually optimize the computation.
The number of loop iterations is divided by 4 because the calculations are parallelized 4 at a time.
Note: you can see that the length of the vector is 4.
At the end of the example, the code does not use vectors: the last iterations are handled with scalar code.
Indeed, the vectorization only works for a number of rounds divisible by 4; the remainder (rounds % 4) is computed separately.
```zig
const std = @import("std");

pub fn main() !void {
    var rounds: u64 = 1000000000;
    rounds += 2;

    const unroll = 4;
    const x = @Vector(4, f64){ -1.0, 1.0, -1.0, 1.0 };
    var den: @Vector(4, f64) = @splat(0.0);
    const inc: @Vector(4, f64) = @splat(4.0);
    const two: @Vector(4, f64) = @splat(2.0);
    const mone: @Vector(4, f64) = @splat(-1.0);
    var ivec = @Vector(4, f64){ 2.0, 3.0, 4.0, 5.0 };
    var pivec: @Vector(4, f64) = @splat(0.0);

    const vec_end: u64 = rounds - rounds % unroll;
    var i: u64 = 2;
    while (i < vec_end) : (i += unroll) {
        den = mone + (two * ivec);
        ivec = ivec + inc;
        pivec = pivec + (x / den);
    }

    var _x: f64 = 1.0;
    var pi: f64 = 1.0;
    pi += @reduce(.Add, pivec);

    // scalar tail: handle the iterations left over when rounds % 4 != 0
    while (i < rounds) : (i += 1) {
        _x = -_x;
        const i_2: f64 = @floatFromInt(i);
        pi += _x / (2.0 * i_2 - 1.0);
    }
    pi *= 4;
    try std.io.getStdOut().writer().print("{d:.16}\n", .{pi});
}
```
3.3. Benchmark your x86_64 CPU
The code below shows the commands to compile the three examples explained above.
Before executing the code block, you need to tangle the examples with the shortcut C-c C-v t.
The command to compile the C code (with gcc) is specific to an x86_64 architecture; if you want to follow the examples, make sure you have a compatible CPU.
Note: If Emacs is unable to find the path to Zig during execution, the environment variable must be exported in the .zshenv file for a zsh shell.
```sh
gcc leibniz.c -o leibniz_c -O3 -s -flto -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig -fstrip
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_simd_zig -fstrip
```
You will find the three executables in your directory.
Now, to benchmark your code you need a tool to run the analysis: hyperfine {{< cite "peterHyperfine2023" >}}. hyperfine must be installed and available in your path to run the benchmark: https://github.com/sharkdp/hyperfine
The code below runs the benchmark with the hyperfine tool.
```sh
hyperfine --warmup=3 './leibniz_c' './leibniz_zig' './leibniz_simd_zig' -i -N
```
If you run the benchmark on your computer, you will see the results in the block above.
3.4. ARM benchmark example
We ran the same benchmark for an ARM CPU; here are the compiler options used. The target is a NanoPi NEO Plus2.

```sh
aarch64-buildroot-linux-gnu-gcc leibniz.c -o leibniz_c -O3 -march=armv8-a -mtune=cortex-a53 -s -fno-signed-zeros -fno-trapping-math -fassociative-math
zig build-exe leibniz_simd.zig -OReleaseFast -femit-bin=leibniz_zig_simd_aarch -target aarch64-native -mcpu cortex_a53 -fstrip
zig build-exe leibniz.zig -OReleaseFast -femit-bin=leibniz_zig_aarch -target aarch64-native -mcpu cortex_a53 -fstrip
```
On our ARM target we cannot execute hyperfine, so we measured each program 8 times and computed statistics.
| Program | Mean [s] | Max [s] | Min [s] |
|---|---|---|---|
| ./leibniz_c | 20.57 | 20.58 | 20.57 |
| ./leibniz_zig_aarch | 19.95 | 19.96 | 19.95 |
| ./leibniz_zig_simd_aarch | 14.43 | 14.44 | 14.43 |
We can see a difference in performance between the ARM and the x86 benchmarks. Note that the instruction sets used in the two benchmarks are different, so the optimisations may differ as well. Moreover, the version of the compiler used is not the same for the C example:
- For x86: gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7)
- For ARM: aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08.3-dirty) 11.3.0
3.5. Different code examples benchmark
Our benchmarks on our machine:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| ./leibniz_zig | 85.9 ± 0.4 | 85.0 | 87.0 | 2.02 ± 0.03 |
| ./leibniz_zig_simd | 42.5 ± 0.5 | 41.6 | 43.7 | 1.00 |
| ./leibniz_c | 43.3 ± 0.8 | 42.2 | 45.4 | 1.02 ± 0.02 |
| ./leibniz_rs | 86.4 ± 1.5 | 85.2 | 94.3 | 2.03 ± 0.04 |
| ./leibniz_avx2 | 43.3 ± 0.5 | 42.1 | 44.3 | 1.02 ± 0.02 |
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| ./leibniz_zig | 883.9 ± 9.5 | 864.1 | 896.0 | 1.99 ± 0.03 |
| ./leibniz_zig_simd | 443.8 ± 5.2 | 437.2 | 452.0 | 1.00 |
| ./leibniz_c | 481.5 ± 4.4 | 476.8 | 489.2 | 1.08 ± 0.02 |
| ./leibniz_rs | 912.7 ± 9.7 | 897.1 | 925.1 | 2.06 ± 0.03 |
| ./leibniz_avx2 | 451.2 ± 6.3 | 443.8 | 465.7 | 1.02 ± 0.02 |
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| ./leibniz_zig | 8.889 ± 0.095 | 8.722 | 9.058 | 13.91 ± 0.33 |
| ./leibniz_zig_simd | 4.386 ± 0.028 | 4.357 | 4.420 | 6.86 ± 0.15 |
| ./leibniz_c | 0.665 ± 0.003 | 0.658 | 0.669 | 1.04 ± 0.02 |
| ./leibniz_rs | 8.903 ± 0.053 | 8.855 | 8.973 | 13.93 ± 0.31 |
| ./leibniz_avx2 | 0.639 ± 0.013 | 0.618 | 0.657 | 1.00 |
4. Programming language Benchmarks
Another way of seeing the difference in performance between C and Zig is to use slightly more complex code. The Programming Language Benchmark {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}} git repository contains different codes written in different languages. Their website shows benchmarks made with their hardware (zig-vs-c). {{< cite "ZigVSBenchmarks" >}} We then reproduced the benchmarks between C and Zig to see whether there were any differences with our hardware.
Here are our results. CPU INFO: [x86_64][8 cores] Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (Model 142)
Here's what you need to know about file naming {{< cite "Hanabi1224ProgrammingLanguageBenchmarksAnother" >}}:
- *-m in a file name stands for multi-threading or multi-processing
- *-i in a file name stands for direct intrinsics usage (usage of SIMD intrinsics via libraries is not counted). It means the SIMD instructions are used directly in the code; the instructions can be found in the Intel Intrinsics Guide.
- You may find time < time(user) + time(sys) for some non-parallelized programs; the overhead is from GC or JIT compilers, which are allowed to take advantage of multiple cores as that's closer to real-world scenarios.
You can find how the examples are compiled in the repo, here.
knucleotide

Input: 2500000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| c | 1-m.c | 185ms | 8.3ms | 23.6MB | 623ms | 13ms | gcc 13.2.0 |
| c | 1-m.c | 308ms | 19ms | 27.5MB | 2143ms | 80ms | clang 17.0.6 |
| zig | 1.zig | 856ms | 6.0ms | 19.9MB | 907ms | 10ms | zig 0.12.0 |
Input: 250000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| c | 1-m.c | 33ms | 1.8ms | 11.9MB | 153ms | 0ms | gcc 13.2.0 |
| c | 1-m.c | 47ms | 7.8ms | 16.8MB | 310ms | 15ms | clang 17.0.6 |
| zig | 1.zig | 90ms | 2.1ms | 7.6MB | 90ms | 0ms | zig 0.12.0 |
nbody

Input: 5000000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| zig | 2.zig | 179ms | 2.1ms | 1.3MB | 187ms | 0ms | zig 0.12.0 |
| c | 5.c | 193ms | 1.7ms | 2.1MB | 200ms | 0ms | clang 17.0.6 |
| c | 5.c | 211ms | 7.9ms | 1.4MB | 220ms | 0ms | zigcc 0.12.0 |
| c | 8-i.c | 215ms | 2.9ms | 1.4MB | 223ms | 0ms | zigcc 0.12.0 |
| c | 8-i.c | 222ms | 2.9ms | 2.0MB | 233ms | 0ms | gcc 13.2.0 |
| c | 8-i.c | 225ms | 3.5ms | 2.1MB | 233ms | 0ms | clang 17.0.6 |
| c | 2.c | 225ms | 1.4ms | 2.0MB | 233ms | 0ms | clang 17.0.6 |
| c | 2.c | 240ms | 5.2ms | 2.0MB | 253ms | 0ms | gcc 13.2.0 |
| c | 5.c | 242ms | 11ms | 2.0MB | 250ms | 0ms | gcc 13.2.0 |
| c | 2.c | 254ms | 2.5ms | 1.4MB | 267ms | 0ms | zigcc 0.12.0 |
| zig | 1.zig | 265ms | 7.1ms | 1.3MB | 280ms | 0ms | zig 0.12.0 |
Input: 500000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| zig | 2.zig | 18ms | 0.2ms | 1.3MB | 10ms | 0ms | zig 0.12.0 |
| c | 5.c | 21ms | 0.3ms | 2.0MB | 13ms | 0ms | clang 17.0.6 |
| c | 8-i.c | 21ms | 0.3ms | 1.5MB | 17ms | 0ms | zigcc 0.12.0 |
| c | 5.c | 21ms | 0.9ms | 1.4MB | 13ms | 0ms | zigcc 0.12.0 |
| c | 8-i.c | 22ms | 1.4ms | 2.3MB | 17ms | 0ms | gcc 13.2.0 |
| c | 8-i.c | 25ms | 0.2ms | 2.1MB | 20ms | 0ms | clang 17.0.6 |
| c | 5.c | 25ms | 1.2ms | 2.3MB | 20ms | 0ms | gcc 13.2.0 |
| c | 2.c | 25ms | 0.4ms | 2.1MB | 20ms | 0ms | clang 17.0.6 |
| c | 2.c | 25ms | 1.0ms | 2.0MB | 20ms | 0ms | gcc 13.2.0 |
| zig | 1.zig | 27ms | 0.4ms | 1.1MB | 20ms | 0ms | zig 0.12.0 |
| c | 2.c | 28ms | 1.7ms | 1.4MB | 20ms | 0ms | zigcc 0.12.0 |
spectral-norm

Input: 8000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| zig | 2-m.zig | 340ms | 1.5ms | 1.6MB | 2550ms | 3ms | zig 0.12.0 |
| c | 6-im.c | 547ms | 11ms | 4.6MB | 4593ms | 17ms | clang 17.0.6 |
| c | 4-m.c | 566ms | 7.7ms | 4.6MB | 4780ms | 13ms | clang 17.0.6 |
| c | 6-im.c | 568ms | 14ms | 2.1MB | 4820ms | 0ms | gcc 13.2.0 |
| c | 4-m.c | 586ms | 11ms | 2.1MB | 4980ms | 0ms | gcc 13.2.0 |
| c | 5-im.c | 727ms | 4.5ms | 2.1MB | 6203ms | 0ms | gcc 13.2.0 |
| c | 3-m.c | 749ms | 13ms | 4.6MB | 6327ms | 7ms | clang 17.0.6 |
| c | 5-im.c | 799ms | 11ms | 4.6MB | 6770ms | 23ms | clang 17.0.6 |
| c | 3-m.c | 833ms | 23ms | 2.1MB | 7070ms | 0ms | gcc 13.2.0 |
| zig | 2.zig | 1168ms | 3.1ms | 1.5MB | 1257ms | 0ms | zig 0.12.0 |
| zig | 1.zig | 2774ms | 59ms | 1.5MB | 3010ms | 0ms | zig 0.12.0 |
Input: 4000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| zig | 2-m.zig | 89ms | 1.6ms | 1.6MB | 493ms | 0ms | zig 0.12.0 |
| c | 6-im.c | 157ms | 11ms | 4.9MB | 1237ms | 3ms | clang 17.0.6 |
| c | 6-im.c | 163ms | 8.7ms | 2.1MB | 1327ms | 0ms | gcc 13.2.0 |
| c | 4-m.c | 164ms | 4.2ms | 4.8MB | 1300ms | 3ms | clang 17.0.6 |
| c | 4-m.c | 176ms | 5.4ms | 2.0MB | 1420ms | 0ms | gcc 13.2.0 |
| c | 5-im.c | 205ms | 17ms | 2.0MB | 1683ms | 0ms | gcc 13.2.0 |
| c | 3-m.c | 224ms | 26ms | 4.8MB | 1797ms | 10ms | clang 17.0.6 |
| c | 5-im.c | 226ms | 2.9ms | 4.9MB | 1820ms | 10ms | clang 17.0.6 |
| zig | 2.zig | 292ms | 1.1ms | 1.4MB | 310ms | 0ms | zig 0.12.0 |
| c | 3-m.c | 488ms | 33ms | 2.1MB | 3600ms | 0ms | gcc 13.2.0 |
| zig | 1.zig | 720ms | 11ms | 1.3MB | 773ms | 0ms | zig 0.12.0 |
Input: 2000:

| lang | code | time | stddev | peak-mem | time(user) | time(sys) | compiler/runtime |
|---|---|---|---|---|---|---|---|
| zig | 2-m.zig | 28ms | 0.6ms | 1.4MB | 47ms | 0ms | zig 0.12.0 |
| c | 6-im.c | 54ms | 12ms | 4.6MB | 363ms | 0ms | clang 17.0.6 |
| c | 3-m.c | 61ms | 4.1ms | 4.6MB | 427ms | 0ms | clang 17.0.6 |
| c | 4-m.c | 65ms | 3.5ms | 4.9MB | 433ms | 7ms | clang 17.0.6 |
| zig | 2.zig | 72ms | 1.0ms | 1.4MB | 70ms | 0ms | zig 0.12.0 |
| c | 5-im.c | 73ms | 11ms | 4.6MB | 520ms | 3ms | clang 17.0.6 |
| c | 6-im.c | 78ms | 9.4ms | 2.0MB | 577ms | 0ms | gcc 13.2.0 |
| c | 3-m.c | 78ms | 10ms | 2.1MB | 600ms | 0ms | gcc 13.2.0 |
| c | 4-m.c | 80ms | 12ms | 2.1MB | 590ms | 0ms | gcc 13.2.0 |
| c | 5-im.c | 81ms | 13ms | 2.3MB | 620ms | 0ms | gcc 13.2.0 |
| zig | 1.zig | 192ms | 3.6ms | 1.3MB | 203ms | 0ms | zig 0.12.0 |
4.1. Results
We can see that execution times vary greatly depending on the programming language, the implementation, the target, and finally, the compiler used. In addition, we found that Zig codes that use vectors perform better than those that do not. Zig codes that are not optimised tend to perform less well than C codes.
On the other hand, we can see that Zig codes consume less memory than C codes. For the comparison to be fair, the allocator used in Zig is the libc allocator. Finally, we can see that performance also depends on the input.
We can't say that one language is faster than the other, as this depends enormously on the context, the optimisations used, and the complexity of the code. We can make a hypothesis based on the measurements taken: if dynamic memory allocation is performed correctly, we can expect Zig code to generally consume less memory than C code.
As a reminder, the results obtained are specific to the machine and operating system used. In addition, we can compare the differences between the benchmarks obtained on our local machine and on the machine used by the Programming Language Benchmark site, which has the following characteristics:
- CPU INFO: [x86_64][4 cores] AMD EPYC 7763 64-Core Processor (Model 1)
- Note that the compiler versions have also changed, which can add variability to the results:
  - gcc 13.2.0
  - clang 14.0.0-1ubuntu1.1
  - zig 0.12.0-dev.2341+92211135f
  - zigcc 0.12.0-dev.2341+92211135f
For the knucleotide example, we obtain a comparable ranking in the two sets of results. However, we notice that our x86 target is 85 ms slower for the Zig version with an input of 2500000, but it consumes less memory (around 500KB less).
For the nbody example, we note that, on average, the performance obtained is better on our x86 target, but the memory used remains in the same order of magnitude.
We also see that the 8-i.c implementation is faster on the AMD EPYC with the zigcc and gcc compilers, and that the 5.c implementation with the Clang compiler is among the first in the x86 test but among the last on AMD.
We note that the difference between version 14 and version 17 of clang can also affect the results.
This is the only example where the zigcc compiler is used for benchmarking.
We can see that the performance results obtained are similar to those we get with gcc or clang.
We can also observe that the same code compiled with zigcc uses less memory than with the other compilers.
Further tests would be needed to confirm this, as the nbody example does not allocate memory dynamically.
In the last example, we notice a big difference in the result of the 2.zig code for an input of 8000: it is second to last for performance on x86, while it is second best on AMD.
Furthermore, we can see that the standard deviation may be higher for certain codes, with different results depending on the target.
Finally, the 1.zig code is the worst-performing code in both benchmarks; it runs 2x slower than the second-worst code.
5. 1BRC
The One Billion Row Challenge involves calculating the min, max, and average of 1 billion rows of measurements (a 12 GiB file). {{< cite "BillionRowChallenge" >}} Anyone can submit code to try to provide a better implementation. Originally the challenge was for Java, but there are now solutions in many languages.
In this documentation, we tried to see the differences between three implementations: one in C and two in Zig:
- 1brc C {{< cite "kootenDannyvankooten1brc2024" >}} {{< cite "OneBillionRows2024" >}}
- 1brc-zig {{< cite "leeCandrewlee141brczig2024" >}}
- 1brc-zig - second implementation {{< cite "mykolaichukAlmmiko1brczig2024" >}}
In each repository, we can find how to compile the binary; we followed their instructions.
After building each implementation, we benchmarked it with hyperfine to get the results. First, we used hyperfine to benchmark the three implementations one after the other:
```sh
hyperfine --warmup=3 './1brc/bin/analyze ~/Documents/1brc/measurements.txt' './1brc-zig/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' './1brc-zig2/zig-out/bin/1brc-zig ~/Documents/1brc/measurements.txt' --export-orgmode "benchmark-1brc.org" --output null -i
```
And we have the results:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| C implementation | 11.774 ± 2.051 | 7.582 | 13.034 | 1.00 |
| 1st Zig implementation | 21.283 ± 0.051 | 21.228 | 21.390 | 1.81 ± 0.31 |
| 2nd Zig implementation | 24.788 ± 0.187 | 24.332 | 24.931 | 2.11 ± 0.37 |
We can see the differences in performance between C and Zig. We also note that there is a difference of around 6 seconds between the best and the worst C results.
To see if we get different performance, we benchmarked each implementation on its own.
The first Zig implementation, 1brc-zig:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| 1st Zig implementation | 15.647 ± 2.206 | 13.301 | 21.649 | 1.00 |
Sometimes, when we redo the benchmark, we don't get this huge difference. The implementation relies on the Linux kernel to memory-map the measurements file, so depending on the computer load, some differences can appear.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| 1st Zig implementation | 12.749 ± 0.345 | 12.202 | 13.545 | 1.00 |
The second Zig implementation, 1brc-zig (second implementation):

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| 2nd Zig implementation | 18.730 ± 2.837 | 15.192 | 24.906 | 1.00 |
And the C implementation:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| C implementation | 8.898 ± 0.591 | 7.507 | 9.510 | 1.00 |
In the C implementation, we note that its author uses a custom hashmap to hash the city names: a fast multiplication hash combined with linear probing. The load factor of his hashmap is 0.5. {{< cite "BillionRowChallenge" >}}
The first Zig implementation also uses a hashmap, but it comes from the Zig standard library. It is generic and uses the Wyhash algorithm to compute the hashes. {{< cite "wangyiWangyifudanWyhash2024" >}}
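For illustration, a one-shot Wyhash of a city name with the standard library might look like this (a minimal sketch; the seed 0 is arbitrary):

```zig
const std = @import("std");

pub fn main() !void {
    // std.hash.Wyhash is the default hash behind std.hash_map for byte slices.
    const h = std.hash.Wyhash.hash(0, "Zurich");
    try std.io.getStdOut().writer().print("hash: 0x{x}\n", .{h});
}
```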
The Hotspot program has been used to create flamegraphs of the implementations.
If you look at the Zig flamegraph, you can see that 40% of the cycle time is spent inserting the cities into the hashmap.

Figure 1: Flamegraph of the Zig 1brc implementation
In C, on the other hand, the insertion time into the hashmap is not measurable separately because it is included in the process_chunck function, and it does not require any complex hash calculation: the computed hash is mapped to a base index in an array and, if there are collisions, the index is incremented.

Figure 2: Flamegraph of the C 1brc implementation
In the example below, Danny van Kooten, the author of the C One Billion Row Challenge implementation, explains how the hashes are calculated for each city. {{< cite "OneBillionRows2024a" >}} He uses a simple fast multiplication hash combined with linear probing.
In the first block, he multiplies each letter of the city name by 31 and sums up the results.
In the second block, he checks whether the hash can be inserted into the hashmap. If there is a collision, he iterates over the hashmap until he finds a free spot or a matching key.
```c
// hash everything up to ';'
// assumption: key (city name) is at least 1 char
unsigned int len = 1;
unsigned int hash = (unsigned char)buf[0];
while (buf[len] != ';') {
    hash = (hash * 31) + (unsigned char)buf[len++];
}

// probe map until free spot or matching key
unsigned int idx = hashmap[hash & (cap - 1)];
while (idx != 0 && memcmp(results[idx], buf, len) != 0) {
    hash++;
    idx = hashmap[hash & (cap - 1)];
}
// idx is now either 0 (new entry)
// or contains the index of our key in the results array
```
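As a side note on the design choice: multiplying by 31 is the classic polynomial string hash (the same scheme as Java's String.hashCode), and it is cheap because 31 * h can be compiled down to a shift and a subtraction, (h << 5) - h.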
For the second Zig implementation, we note that it uses a btree for the data merge.
This merge is performed because several threads process the data.
However, it also uses the same hashmap as the one used in the first version.
{{< references >}}