Benchmarking

The general problem with benchmarks is that the benchmark and the actual application may behave differently.

The actual application has some performance characterstics - it can be CPU bound, it can be IO bound or it can be memory bound. CPU usage is a mix of integer operations, floating point operations and string operations. IO can be read only, write only or read and write. IO can be sequential or random.

The chance that the actual application is really close to any benchmark in performance characteristics is small.

Also the performance characterstics are impacted by the involved hardware and software.

The relative speed of different code may be different among different CPU architectures: x86-64, ARM, Power, RISC-V.

The relative speed of different code may even be different between different implementations of the same CPU architecture and therefore change after a hardware upgrade.

The relative speed of different code may depend on the level of optimization done by the compiler and therefore change after a compiler upgrade.

The relative speed of different code may depend on the platform (runtime library, OS etc.) and therefore change after a platform upgrade.

The only reeally valid benchmark is the actual application on the actual system.

Standard benchmarks:

CPU speed problems:

Too specialized code:

Problem: the benchmark is so small and only use a few constructs and as a result the benchmark only measure those few constructs and that may not be valid for more general cases.

Tricked by the optimizer:

Problem: sometimes the optimizers detect that there is no point in executing the core benchmark code because the result is not used for anything and as a result completely eliminate the core benchmark code showing way to good results.

Mitigation: prevent such optimization - often by calling external (non-optimizable) code with the result from the benchmark.

Wall time vs CPU time:

Problem: on a multi-user OS there may be a difference between wall time usage and CPU time usage if the process is swapped out of CPU during the benchmark for something else to run on the CPU.

Mitigation: try not to run anything else during benchmark *and* run the bencmark multiple times to avoid rare incidences of something else running.

JIT compilation:

But if we look at .NET code then the JIT compilation is happening the first time the code is called (unless NGEN is used):

Java is even worse as it starts by interpreting the byte code and first do the JIT compilation when it detects that the code is being executed a lot:

Mitigation: use "warmup" iterations - before starting measuring call the code some times to "warmup" aka trigger the JIT compilation.

Example:

The article Language Speed has benchmark code in many languages (Java, C#, C, Pascal, Ada, Python, PHP).

But special benchmnarks frameworks has been introduced to make it easier to write good (micro-)benchmarks.

package test;

public class Work {
    public static double sqrtBuiltin(double a) {
        return Math.sqrt(a);
    }
    public static double sqrtNewton(double a) {
        double x = a / 2;
        double xprev;
        do {
            xprev = x;
            x = x + (a - x * x) / (2 * x);
        } while(Math.abs(x - xprev) > 0.0000001);
        return x;
    }
}

using System;

namespace Lib
{
    public class Work
    {
        public static double SqrtBuiltin(double a)
        {
            return Math.Sqrt(a);
        }
        public static double SqrtNewton(double a)
        {
            double x = a / 2;
            double xprev;
            do
            {
                xprev = x;
                x = x + (a - x * x) / (2 * x);
            }
            while(Math.Abs(x - xprev) > 0.0000001);
            return x;
        }
    }
}

We will test with JUnit. For more about JUnit see here.

package test;

import static org.junit.Assert.*;

import org.junit.Test;

public class FuncTestWork {
    @Test
    public void testSqrtBuiltin() {
        assertEquals("sqrt(4)", 2.0, Work.sqrtBuiltin(4), 0.0000001);
        assertEquals("sqrt(9)", 3.0, Work.sqrtBuiltin(9), 0.0000001);
        assertEquals("sqrt(16)", 4.0, Work.sqrtBuiltin(16), 0.0000001);
        assertEquals("sqrt(25)", 5.0, Work.sqrtBuiltin(25), 0.0000001);
        assertEquals("sqrt(36)", 6.0, Work.sqrtBuiltin(36), 0.0000001);
    }
    @Test
    public void testIsSqrtNewton() {
        assertEquals("sqrt(4)", 2.0, Work.sqrtNewton(4), 0.0000001);
        assertEquals("sqrt(9)", 3.0, Work.sqrtNewton(9), 0.0000001);
        assertEquals("sqrt(16)", 4.0, Work.sqrtNewton(16), 0.0000001);
        assertEquals("sqrt(25)", 5.0, Work.sqrtNewton(25), 0.0000001);
        assertEquals("sqrt(36)", 6.0, Work.sqrtNewton(36), 0.0000001);
    }
}

We will test with NUnit. For more about NUnit see here.

using System;

using NUnit.Framework;

using  Lib;

namespace FuncTest
{
    [TestFixture]
    public class FuncTestWork
    {
        [Test]
        public void TestSqrtBuiltin()
        {
            Assert.AreEqual(2.0, Work.SqrtBuiltin(4), 0.0000001, "Sqrt(4)");
            Assert.AreEqual(3.0, Work.SqrtBuiltin(9), 0.0000001, "Sqrt(9)");
            Assert.AreEqual(4.0, Work.SqrtBuiltin(16), 0.0000001, "Sqrt(16)");
            Assert.AreEqual(5.0, Work.SqrtBuiltin(25), 0.0000001, "Sqrt(25)");
            Assert.AreEqual(6.0, Work.SqrtBuiltin(36), 0.0000001, "Sqrt(936");
        }
        [Test]
        public void TestIsSqrtNewton()
        {
            Assert.AreEqual(2.0, Work.SqrtBuiltin(4), 0.0000001, "Sqrt(4)");
            Assert.AreEqual(3.0, Work.SqrtNewton(9), 0.0000001, "Sqrt(9)");
            Assert.AreEqual(4.0, Work.SqrtNewton(16), 0.0000001, "Sqrt(16)");
            Assert.AreEqual(5.0, Work.SqrtNewton(25), 0.0000001, "Sqrt(25)");
            Assert.AreEqual(6.0, Work.SqrtNewton(36), 0.0000001, "Sqrt(936");
        }
    }
}

(strictly speaking the functional test is not necessary to do the benchmark, but there is really no point in benchmarking code that is not working correct, so it makes sense to verify correctness first)

package test;

import java.util.Random;

public class PerfTestManual {
    private static Random rng = new Random();
    private static void testBuiltin(int rep) {
        long t1 = System.currentTimeMillis();
        for(int j = 0; j < rep; j++) {
            double a = rng.nextDouble() * 1000.0;
            double sqrta = Work.sqrtBuiltin(a);
            if(Math.abs(a - sqrta * sqrta) > 0.0000001) throw new RuntimeException("Ooops");
        }
        long t2 = System.currentTimeMillis();
        System.out.printf("Builtin : %.1f ns\n", (t2 - t1) * 1000000.0 / rep);
    }
    private static void testNewton(int rep) {
        long t1 = System.currentTimeMillis();
        for(int j = 0; j < rep; j++) {
            double a = rng.nextDouble() * 1000.0;
            double sqrta = Work.sqrtNewton(a);
            if(Math.abs(a - sqrta * sqrta) > 0.0000001) throw new RuntimeException("Ooops");
        }
        long t2 = System.currentTimeMillis();
        System.out.printf("Newton : %.1f ns\n", (t2 - t1) * 1000000.0 / rep);
    }
    public static void main(String[] args) {
        for(int k = 0; k < 5; k++) {
            testBuiltin(10000000);
        }
        for(int k = 0; k < 5; k++) {
            testNewton(10000000);
        }
    }
}

Output:

java -cp . test.PerfTestManual
Builtin : 32.0 ns
Builtin : 31.0 ns
Builtin : 31.0 ns
Builtin : 31.0 ns
Builtin : 32.0 ns
Newton : 118.0 ns
Newton : 118.0 ns
Newton : 117.0 ns
Newton : 118.0 ns
Newton : 117.0 ns

using System;

using Lib;

namespace PerfTestManual
{
    public class Program
    {
        private static Random rng = new Random();
        private static void TestBuiltin(int rep)
        {
            DateTime t1 = DateTime.Now;
            for(int j = 0; j < rep; j++)
            {
                double a = rng.NextDouble() * 1000.0;
                double sqrta = Work.SqrtBuiltin(a);
                if(Math.Abs(a - sqrta * sqrta) > 0.0000001) throw new Exception("Ooops");
            }
            DateTime t2 = DateTime.Now;
            Console.WriteLine("Builtin : {0:F1} ns", (t2 - t1).TotalMilliseconds * 1000000.0 / rep);
        }
        private static void TestNewton(int rep)
        {
            DateTime t1 = DateTime.Now;
            for(int j = 0; j < rep; j++)
            {
                double a = rng.NextDouble() * 1000.0;
                double sqrta = Work.SqrtNewton(a);
                if(Math.Abs(a - sqrta * sqrta) > 0.0000001) throw new Exception("Ooops");
            }
            DateTime t2 = DateTime.Now;
            Console.WriteLine("Newton : {0:F1} ns", (t2 - t1).TotalMilliseconds * 1000000.0 / rep);
        }
        public static void Main(string[] args)
        {
            for(int k = 0; k < 5; k++)
            {
                TestBuiltin(10000000);
            }
            for(int k = 0; k < 5; k++)
            {
                TestNewton(10000000);
            }
        }
    }
}

Output:

PerfTestManual
Builtin : 29.0 ns
Builtin : 27.0 ns
Builtin : 27.0 ns
Builtin : 26.0 ns
Builtin : 27.0 ns
Newton : 152.0 ns
Newton : 152.0 ns
Newton : 152.0 ns
Newton : 152.0 ns
Newton : 151.0 ns

JMH (Java Microbenchmark Harness) is an open source benchmark framework for benchmarking Java code. It is created by the OpenJDK project. It is available from GitHub.

package test;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1)
@Warmup(iterations = 5, timeUnit = TimeUnit.MILLISECONDS, time = 1000)
@Measurement(iterations = 5, timeUnit = TimeUnit.MILLISECONDS, time = 1000)
public class PerfTestJMH {
    private static Random rng = new Random();
    @Benchmark
    public void testBuiltin() {
        double a = rng.nextDouble() * 1000.0;
        double sqrta = Work.sqrtBuiltin(a);
        if(Math.abs(a - sqrta * sqrta) > 0.0000001) throw new RuntimeException("Ooops");
    }
    @Benchmark
    public void testNewton() {
        double a = rng.nextDouble() * 1000.0;
        double sqrta = Work.sqrtNewton(a);
        if(Math.abs(a - sqrta * sqrta) > 0.0000001) throw new RuntimeException("Ooops");
    }
}

The only practical way to use JMH seems to be via Maven. Create the proper Maven project and build and one get an executable benchmark.jar file that are ready to run.

Output:

java -jar benchmarks.jar
# JMH version: 1.36
# VM version: JDK 11.0.16.1, Java HotSpot(TM) 64-Bit Server VM, 11.0.16.1+1-LTS-1
# VM invoker: C:\SUN\64bit\jdk-11.0.16.1\bin\java.exe
# VM options: <none>
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 1000 ms each
# Measurement: 5 iterations, 1000 ms each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: test.PerfTestJMH.testBuiltin

# Run progress: 0.00% complete, ETA 00:00:20
# Fork: 1 of 1
# Warmup Iteration   1: 38.110 ns/op
# Warmup Iteration   2: 32.375 ns/op
# Warmup Iteration   3: 31.702 ns/op
# Warmup Iteration   4: 31.723 ns/op
# Warmup Iteration   5: 31.687 ns/op
Iteration   1: 31.710 ns/op
Iteration   2: 31.707 ns/op
Iteration   3: 31.710 ns/op
Iteration   4: 31.694 ns/op
Iteration   5: 31.706 ns/op


Result "test.PerfTestJMH.testBuiltin":
  31.705 ±(99.9%) 0.027 ns/op [Average]
  (min, avg, max) = (31.694, 31.705, 31.710), stdev = 0.007
  CI (99.9%): [31.679, 31.732] (assumes normal distribution)


# JMH version: 1.36
# VM version: JDK 11.0.16.1, Java HotSpot(TM) 64-Bit Server VM, 11.0.16.1+1-LTS-1
# VM invoker: C:\SUN\64bit\jdk-11.0.16.1\bin\java.exe
# VM options: <none>
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 1000 ms each
# Measurement: 5 iterations, 1000 ms each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: test.PerfTestJMH.testNewton

# Run progress: 50.00% complete, ETA 00:00:10
# Fork: 1 of 1
# Warmup Iteration   1: 118.710 ns/op
# Warmup Iteration   2: 118.403 ns/op
# Warmup Iteration   3: 119.326 ns/op
# Warmup Iteration   4: 119.328 ns/op
# Warmup Iteration   5: 119.340 ns/op
Iteration   1: 119.344 ns/op
Iteration   2: 119.415 ns/op
Iteration   3: 119.330 ns/op
Iteration   4: 119.334 ns/op
Iteration   5: 119.335 ns/op


Result "test.PerfTestJMH.testNewton":
  119.351 ±(99.9%) 0.139 ns/op [Average]
  (min, avg, max) = (119.330, 119.351, 119.415), stdev = 0.036
  CI (99.9%): [119.213, 119.490] (assumes normal distribution)


# Run complete. Total time: 00:00:21

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), designfactorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                Mode  Cnt    Score   Error  Units
PerfTestJMH.testBuiltin  avgt    5   31.705 ± 0.027  ns/op
PerfTestJMH.testNewton   avgt    5  119.351 ± 0.139  ns/op

The last few lines are really all you need.

BenchmarkDotNet is an open source benchmark framework for benchmarking .NET code. It is available from GitHub.

using System;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

using Lib;

namespace PerfTestBMDN
{
    public class PerfTest
    {
        private static Random rng = new Random();
        [Benchmark]
        public void TestBuiltin() {
            double a = rng.NextDouble() * 1000.0;
            double sqrta = Work.SqrtBuiltin(a);
            if(Math.Abs(a - sqrta * sqrta) > 0.0000001) throw new Exception("Ooops");
        }
        [Benchmark]
        public void TestNewton() {
            double a = rng.NextDouble() * 1000.0;
            double sqrta = Work.SqrtNewton(a);
            if(Math.Abs(a - sqrta * sqrta) > 0.0000001) throw new Exception("Ooops");
        }
    }
    public class Program
    {
        public static void Main(string[] args)
        {
            BenchmarkRunner.Run(typeof(PerfTest).Assembly);
            Console.ReadKey();
        }
    }
}

BenchmarkDotNet is very easy to use in .NET Core / .NET 5+. It should also work with .NET 4.x, but I could not get it working.

Output:

dotnet run -c Release
// Validating benchmarks:
// ***** BenchmarkRunner: Start   *****
// ***** Found 2 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start   *****
...
// ***** Done, took 00:00:05 (5.63 sec)   *****
// Found 2 benchmarks:
//   PerfTest.TestBuiltin: DefaultJob
//   PerfTest.TestNewton: DefaultJob

Setup power plan (GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c FriendlyName: High performance)
// **************************
// Benchmark: PerfTest.TestBuiltin: DefaultJob
// *** Execute ***
// Launch: 1 / 1
...
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2
// GC=Concurrent Workstation
// HardwareIntrinsics=SSE4.2,POPCNT VectorSize=128
// Job: DefaultJob

OverheadJitting  1: 1 op, 277124.43 ns, 277.1244 us/op
WorkloadJitting  1: 1 op, 1198531.24 ns, 1.1985 ms/op

OverheadJitting  2: 16 op, 239131.57 ns, 14.9457 us/op
WorkloadJitting  2: 16 op, 236896.69 ns, 14.8060 us/op

WorkloadPilot    1: 16 op, 1277.07 ns, 79.8169 ns/op
WorkloadPilot    2: 32 op, 1596.34 ns, 49.8856 ns/op
WorkloadPilot    3: 64 op, 1915.61 ns, 29.9314 ns/op
WorkloadPilot    4: 128 op, 3192.68 ns, 24.9428 ns/op
WorkloadPilot    5: 256 op, 5746.82 ns, 22.4485 ns/op
WorkloadPilot    6: 512 op, 11174.37 ns, 21.8249 ns/op
WorkloadPilot    7: 1024 op, 20752.41 ns, 20.2660 ns/op
WorkloadPilot    8: 2048 op, 39908.47 ns, 19.4866 ns/op
WorkloadPilot    9: 4096 op, 83648.16 ns, 20.4219 ns/op
WorkloadPilot   10: 8192 op, 164422.91 ns, 20.0712 ns/op
WorkloadPilot   11: 16384 op, 325972.40 ns, 19.8958 ns/op
WorkloadPilot   12: 32768 op, 648752.13 ns, 19.7983 ns/op
WorkloadPilot   13: 65536 op, 1294950.11 ns, 19.7594 ns/op
WorkloadPilot   14: 131072 op, 2600436.06 ns, 19.8398 ns/op
WorkloadPilot   15: 262144 op, 5195763.83 ns, 19.8203 ns/op
WorkloadPilot   16: 524288 op, 10350980.65 ns, 19.7429 ns/op
WorkloadPilot   17: 1048576 op, 20663968.43 ns, 19.7067 ns/op
WorkloadPilot   18: 2097152 op, 41421163.05 ns, 19.7511 ns/op
WorkloadPilot   19: 4194304 op, 60674606.43 ns, 14.4660 ns/op
WorkloadPilot   20: 8388608 op, 93013558.98 ns, 11.0881 ns/op
WorkloadPilot   21: 16777216 op, 186054894.26 ns, 11.0897 ns/op
WorkloadPilot   22: 33554432 op, 371953985.85 ns, 11.0851 ns/op
WorkloadPilot   23: 67108864 op, 743663093.32 ns, 11.0814 ns/op

OverheadWarmup   1: 67108864 op, 190947672.97 ns, 2.8453 ns/op
OverheadWarmup   2: 67108864 op, 130623941.83 ns, 1.9464 ns/op
OverheadWarmup   3: 67108864 op, 130713017.54 ns, 1.9478 ns/op
OverheadWarmup   4: 67108864 op, 130717806.55 ns, 1.9478 ns/op
OverheadWarmup   5: 67108864 op, 130686199.04 ns, 1.9474 ns/op
OverheadWarmup   6: 67108864 op, 130760588.44 ns, 1.9485 ns/op
OverheadWarmup   7: 67108864 op, 130657464.94 ns, 1.9469 ns/op

OverheadActual   1: 67108864 op, 130673109.06 ns, 1.9472 ns/op
OverheadActual   2: 67108864 op, 130671512.73 ns, 1.9472 ns/op
OverheadActual   3: 67108864 op, 133605902.88 ns, 1.9909 ns/op
OverheadActual   4: 67108864 op, 130707589.98 ns, 1.9477 ns/op
OverheadActual   5: 67108864 op, 130695457.81 ns, 1.9475 ns/op
OverheadActual   6: 67108864 op, 130761226.97 ns, 1.9485 ns/op
OverheadActual   7: 67108864 op, 130718125.82 ns, 1.9479 ns/op
OverheadActual   8: 67108864 op, 130679813.69 ns, 1.9473 ns/op
OverheadActual   9: 67108864 op, 130669597.12 ns, 1.9471 ns/op
OverheadActual  10: 67108864 op, 130655549.34 ns, 1.9469 ns/op
OverheadActual  11: 67108864 op, 130659061.28 ns, 1.9470 ns/op
OverheadActual  12: 67108864 op, 130691307.33 ns, 1.9475 ns/op
OverheadActual  13: 67108864 op, 130686837.58 ns, 1.9474 ns/op
OverheadActual  14: 67108864 op, 130777190.36 ns, 1.9487 ns/op
OverheadActual  15: 67108864 op, 130686837.58 ns, 1.9474 ns/op

WorkloadWarmup   1: 67108864 op, 743604667.31 ns, 11.0806 ns/op
WorkloadWarmup   2: 67108864 op, 743614883.88 ns, 11.0807 ns/op
WorkloadWarmup   3: 67108864 op, 743785692.14 ns, 11.0833 ns/op
WorkloadWarmup   4: 67108864 op, 743542090.83 ns, 11.0796 ns/op
WorkloadWarmup   5: 67108864 op, 743537621.08 ns, 11.0796 ns/op
WorkloadWarmup   6: 67108864 op, 746775634.89 ns, 11.1278 ns/op
WorkloadWarmup   7: 67108864 op, 743703640.32 ns, 11.0820 ns/op

// BeforeActualRun
WorkloadActual   1: 67108864 op, 743610094.86 ns, 11.0807 ns/op
WorkloadActual   2: 67108864 op, 743580402.96 ns, 11.0802 ns/op
WorkloadActual   3: 67108864 op, 743660858.44 ns, 11.0814 ns/op
WorkloadActual   4: 67108864 op, 743643617.98 ns, 11.0812 ns/op
WorkloadActual   5: 67108864 op, 743647129.93 ns, 11.0812 ns/op
WorkloadActual   6: 67108864 op, 743662454.78 ns, 11.0814 ns/op
WorkloadActual   7: 67108864 op, 743865509.09 ns, 11.0845 ns/op
WorkloadActual   8: 67108864 op, 743897435.86 ns, 11.0849 ns/op
WorkloadActual   9: 67108864 op, 743752488.29 ns, 11.0828 ns/op
WorkloadActual  10: 67108864 op, 743456846.33 ns, 11.0784 ns/op
WorkloadActual  11: 67108864 op, 743575294.68 ns, 11.0801 ns/op
WorkloadActual  12: 67108864 op, 743666924.53 ns, 11.0815 ns/op
WorkloadActual  13: 67108864 op, 744301628.87 ns, 11.0910 ns/op
WorkloadActual  14: 67108864 op, 744567259.66 ns, 11.0949 ns/op
WorkloadActual  15: 67108864 op, 743596685.62 ns, 11.0805 ns/op

// AfterActualRun
WorkloadResult   1: 67108864 op, 612923257.28 ns, 9.1333 ns/op
WorkloadResult   2: 67108864 op, 612893565.38 ns, 9.1328 ns/op
WorkloadResult   3: 67108864 op, 612974020.86 ns, 9.1340 ns/op
WorkloadResult   4: 67108864 op, 612956780.40 ns, 9.1338 ns/op
WorkloadResult   5: 67108864 op, 612960292.35 ns, 9.1338 ns/op
WorkloadResult   6: 67108864 op, 612975617.20 ns, 9.1340 ns/op
WorkloadResult   7: 67108864 op, 613178671.51 ns, 9.1371 ns/op
WorkloadResult   8: 67108864 op, 613210598.29 ns, 9.1375 ns/op
WorkloadResult   9: 67108864 op, 613065650.71 ns, 9.1354 ns/op
WorkloadResult  10: 67108864 op, 612770008.75 ns, 9.1310 ns/op
WorkloadResult  11: 67108864 op, 612888457.10 ns, 9.1327 ns/op
WorkloadResult  12: 67108864 op, 612980086.95 ns, 9.1341 ns/op
WorkloadResult  13: 67108864 op, 612909848.04 ns, 9.1331 ns/op

// AfterAll
// Benchmark Process 9848 has exited with code 0.

Mean = 9.134 ns, StdErr = 0.000 ns (0.01%), N = 13, StdDev = 0.002 ns
Min = 9.131 ns, Q1 = 9.133 ns, Median = 9.134 ns, Q3 = 9.134 ns, Max = 9.138 ns IQR = 0.001 ns, LowerFence = 9.131 ns, UpperFence = 9.136 ns
ConfidenceInterval = [9.132 ns; 9.136 ns] (CI 99.9%), Margin = 0.002 ns (0.02% of Mean)
Skewness = 0.53, Kurtosis = 2.61, MValue = 2

// ** Remained 1 (50.0 %) benchmark(s) to run. Estimated finish 2023-01-08 19:13 (0h 0m from now) **
Setup power plan (GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c FriendlyName: High performance)
// **************************
// Benchmark: PerfTest.TestNewton: DefaultJob
// *** Execute ***
// Launch: 1 / 1
...
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2
// GC=Concurrent Workstation
// HardwareIntrinsics=SSE4.2,POPCNT VectorSize=128
// Job: DefaultJob

OverheadJitting  1: 1 op, 257649.10 ns, 257.6491 us/op
WorkloadJitting  1: 1 op, 1260469.19 ns, 1.2605 ms/op

OverheadJitting  2: 16 op, 234023.28 ns, 14.6265 us/op
WorkloadJitting  2: 16 op, 239131.57 ns, 14.9457 us/op

WorkloadPilot    1: 16 op, 2554.14 ns, 159.6339 ns/op
WorkloadPilot    2: 32 op, 4150.48 ns, 129.7025 ns/op
WorkloadPilot    3: 64 op, 7662.43 ns, 119.7254 ns/op
WorkloadPilot    4: 128 op, 14367.05 ns, 112.2426 ns/op
WorkloadPilot    5: 256 op, 27776.30 ns, 108.5012 ns/op
WorkloadPilot    6: 512 op, 57148.93 ns, 111.6190 ns/op
WorkloadPilot    7: 1024 op, 111105.19 ns, 108.5012 ns/op
WorkloadPilot    8: 2048 op, 217421.36 ns, 106.1628 ns/op
WorkloadPilot    9: 4096 op, 434204.18 ns, 106.0069 ns/op
WorkloadPilot   10: 8192 op, 862342.27 ns, 105.2664 ns/op
WorkloadPilot   11: 16384 op, 1731069.90 ns, 105.6561 ns/op
WorkloadPilot   12: 32768 op, 3451923.22 ns, 105.3443 ns/op
WorkloadPilot   13: 65536 op, 6901292.30 ns, 105.3054 ns/op
WorkloadPilot   14: 131072 op, 13801946.06 ns, 105.3005 ns/op
WorkloadPilot   15: 262144 op, 27687540.29 ns, 105.6196 ns/op
WorkloadPilot   16: 524288 op, 55175218.95 ns, 105.2384 ns/op
WorkloadPilot   17: 1048576 op, 107179470.32 ns, 102.2143 ns/op
WorkloadPilot   18: 2097152 op, 206276676.82 ns, 98.3604 ns/op
WorkloadPilot   19: 4194304 op, 412525577.34 ns, 98.3538 ns/op
WorkloadPilot   20: 8388608 op, 825079569.51 ns, 98.3571 ns/op

OverheadWarmup   1: 8388608 op, 23877398.62 ns, 2.8464 ns/op
OverheadWarmup   2: 8388608 op, 23884422.51 ns, 2.8472 ns/op
OverheadWarmup   3: 8388608 op, 23871651.80 ns, 2.8457 ns/op
OverheadWarmup   4: 8388608 op, 23880591.30 ns, 2.8468 ns/op
OverheadWarmup   5: 8388608 op, 23874205.94 ns, 2.8460 ns/op

OverheadActual   1: 8388608 op, 23873248.14 ns, 2.8459 ns/op
OverheadActual   2: 8388608 op, 23873886.67 ns, 2.8460 ns/op
OverheadActual   3: 8388608 op, 23872928.87 ns, 2.8459 ns/op
OverheadActual   4: 8388608 op, 23873248.14 ns, 2.8459 ns/op
OverheadActual   5: 8388608 op, 23873248.14 ns, 2.8459 ns/op
OverheadActual   6: 8388608 op, 23871651.80 ns, 2.8457 ns/op
OverheadActual   7: 8388608 op, 23876760.08 ns, 2.8463 ns/op
OverheadActual   8: 8388608 op, 23874844.48 ns, 2.8461 ns/op
OverheadActual   9: 8388608 op, 23875802.28 ns, 2.8462 ns/op
OverheadActual  10: 8388608 op, 23871332.53 ns, 2.8457 ns/op
OverheadActual  11: 8388608 op, 20873408.09 ns, 2.4883 ns/op
OverheadActual  12: 8388608 op, 16329269.80 ns, 1.9466 ns/op
OverheadActual  13: 8388608 op, 16329908.34 ns, 1.9467 ns/op
OverheadActual  14: 8388608 op, 16327992.73 ns, 1.9464 ns/op
OverheadActual  15: 8388608 op, 16327673.46 ns, 1.9464 ns/op
OverheadActual  16: 8388608 op, 16329908.34 ns, 1.9467 ns/op
OverheadActual  17: 8388608 op, 16328631.26 ns, 1.9465 ns/op
OverheadActual  18: 8388608 op, 16323203.71 ns, 1.9459 ns/op
OverheadActual  19: 8388608 op, 16341401.98 ns, 1.9480 ns/op
OverheadActual  20: 8388608 op, 16328631.26 ns, 1.9465 ns/op

WorkloadWarmup   1: 8388608 op, 824957289.95 ns, 98.3426 ns/op
WorkloadWarmup   2: 8388608 op, 825080527.32 ns, 98.3573 ns/op
WorkloadWarmup   3: 8388608 op, 824679207.72 ns, 98.3094 ns/op
WorkloadWarmup   4: 8388608 op, 825740773.08 ns, 98.4360 ns/op
WorkloadWarmup   5: 8388608 op, 824774030.25 ns, 98.3207 ns/op
WorkloadWarmup   6: 8388608 op, 825357651.75 ns, 98.3903 ns/op

// BeforeActualRun
WorkloadActual   1: 8388608 op, 825830806.59 ns, 98.4467 ns/op
WorkloadActual   2: 8388608 op, 826759556.56 ns, 98.5574 ns/op
WorkloadActual   3: 8388608 op, 825054028.09 ns, 98.3541 ns/op
WorkloadActual   4: 8388608 op, 825348712.25 ns, 98.3892 ns/op
WorkloadActual   5: 8388608 op, 825339772.75 ns, 98.3882 ns/op
WorkloadActual   6: 8388608 op, 825481527.65 ns, 98.4051 ns/op
WorkloadActual   7: 8388608 op, 825647866.16 ns, 98.4249 ns/op
WorkloadActual   8: 8388608 op, 825199933.46 ns, 98.3715 ns/op
WorkloadActual   9: 8388608 op, 825344881.04 ns, 98.3888 ns/op
WorkloadActual  10: 8388608 op, 825101279.72 ns, 98.3597 ns/op
WorkloadActual  11: 8388608 op, 825100002.65 ns, 98.3596 ns/op
WorkloadActual  12: 8388608 op, 825331791.06 ns, 98.3872 ns/op
WorkloadActual  13: 8388608 op, 826023644.33 ns, 98.4697 ns/op
WorkloadActual  14: 8388608 op, 826543093.01 ns, 98.5316 ns/op
WorkloadActual  15: 8388608 op, 825075738.30 ns, 98.3567 ns/op

// AfterActualRun
WorkloadResult   1: 8388608 op, 803458436.28 ns, 95.7797 ns/op
WorkloadResult   2: 8388608 op, 802681657.78 ns, 95.6871 ns/op
WorkloadResult   3: 8388608 op, 802976341.94 ns, 95.7222 ns/op
WorkloadResult   4: 8388608 op, 802967402.44 ns, 95.7212 ns/op
WorkloadResult   5: 8388608 op, 803109157.33 ns, 95.7381 ns/op
WorkloadResult   6: 8388608 op, 803275495.85 ns, 95.7579 ns/op
WorkloadResult   7: 8388608 op, 802827563.15 ns, 95.7045 ns/op
WorkloadResult   8: 8388608 op, 802972510.73 ns, 95.7218 ns/op
WorkloadResult   9: 8388608 op, 802728909.41 ns, 95.6927 ns/op
WorkloadResult  10: 8388608 op, 802727632.34 ns, 95.6926 ns/op
WorkloadResult  11: 8388608 op, 802959420.75 ns, 95.7202 ns/op
WorkloadResult  12: 8388608 op, 803651274.02 ns, 95.8027 ns/op
WorkloadResult  13: 8388608 op, 804170722.70 ns, 95.8646 ns/op
WorkloadResult  14: 8388608 op, 802703367.99 ns, 95.6897 ns/op

// AfterAll
// Benchmark Process 9388 has exited with code 0.

Mean = 95.735 ns, StdErr = 0.014 ns (0.01%), N = 14, StdDev = 0.051 ns
Min = 95.687 ns, Q1 = 95.696 ns, Median = 95.721 ns, Q3 = 95.753 ns, Max = 95.865 ns
IQR = 0.057 ns, LowerFence = 95.610 ns, UpperFence = 95.839 ns
ConfidenceInterval = [95.678 ns; 95.793 ns] (CI 99.9%), Margin = 0.057 ns (0.06% of Mean)
Skewness = 1.18, Kurtosis = 3.45, MValue = 2

// ** Remained 0 (0.0 %) benchmark(s) to run. Estimated finish 2023-01-08 19:13 (0h 0m from now) **
Successfully reverted power plan (GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c FriendlyName: High performance)
// ***** BenchmarkRunner: Finish  *****

// * Export *
  BenchmarkDotNet.Artifacts\results\PerfTestBMDN.PerfTest-report.csv
  BenchmarkDotNet.Artifacts\results\PerfTestBMDN.PerfTest-report-github.md
  BenchmarkDotNet.Artifacts\results\PerfTestBMDN.PerfTest-report.html

// * Detailed results *
PerfTest.TestBuiltin: DefaultJob
Runtime = .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2; GC = Concurrent Workstation
Mean = 9.134 ns, StdErr = 0.000 ns (0.01%), N = 13, StdDev = 0.002 ns
Min = 9.131 ns, Q1 = 9.133 ns, Median = 9.134 ns, Q3 = 9.134 ns, Max = 9.138 ns IQR = 0.001 ns, LowerFence = 9.131 ns, UpperFence = 9.136 ns
ConfidenceInterval = [9.132 ns; 9.136 ns] (CI 99.9%), Margin = 0.002 ns (0.02% of Mean)
Skewness = 0.53, Kurtosis = 2.61, MValue = 2
-------------------- Histogram --------------------
[9.130 ns ; 9.139 ns) | @@@@@@@@@@@@@
---------------------------------------------------

PerfTest.TestNewton: DefaultJob
Runtime = .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2; GC = Concurrent Workstation
Mean = 95.735 ns, StdErr = 0.014 ns (0.01%), N = 14, StdDev = 0.051 ns
Min = 95.687 ns, Q1 = 95.696 ns, Median = 95.721 ns, Q3 = 95.753 ns, Max = 95.865 ns IQR = 0.057 ns, LowerFence = 95.610 ns, UpperFence = 95.839 ns
ConfidenceInterval = [95.678 ns; 95.793 ns] (CI 99.9%), Margin = 0.057 ns (0.06% of Mean)
Skewness = 1.18, Kurtosis = 3.45, MValue = 2
-------------------- Histogram --------------------
[95.659 ns ; 95.892 ns) | @@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.3, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7 CPU 960 3.20GHz (Nehalem), 1 CPU, 8 logical and 4 physical cores
Frequency=3132167 Hz, Resolution=319.2678 ns, Timer=TSC
.NET SDK=6.0.401
  [Host]     : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2
  DefaultJob : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT SSE4.2


|      Method |      Mean |     Error |    StdDev |
|------------ |----------:|----------:|----------:|
| TestBuiltin |  9.134 ns | 0.0021 ns | 0.0018 ns |
|  TestNewton | 95.735 ns | 0.0573 ns | 0.0508 ns |

// * Hints *
Outliers
  PerfTest.TestBuiltin: Default -> 2 outliers were removed (11.09 ns, 11.09 ns)
  PerfTest.TestNewton: Default  -> 1 outlier  was  removed (98.56 ns)

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 ns   : 1 Nanosecond (0.000000001 sec)

// ***** BenchmarkRunner: End *****
Run time: 00:00:41 (41.11 sec), executed benchmarks: 2

Global total time: 00:00:47 (47.04 sec), executed benchmarks: 2
// * Artifacts cleanup *

The last few lines are really all you need.

Conclusion:

It is not easy to write good benchmarks. But hopefully reading this has given some insigth into potential pitfalls and how to avoid them.

Benchmarking

Content:

Introduction:

General problems:

Standard benchmarks:

CPU speed problems:

Too specialized code:

Tricked by the optimizer:

Wall time vs CPU time:

JIT compilation:

Example:

Conclusion:

Article history:

Other articles:

Comments: