Monthly Archives: August 2013

The fastest sum in F#

In this post I will explore the question of how to compute the sum of a collection in F# as fast as possible; that is, as close as optimised C++ as possible. This should give us some insights as to how the F# compiler and JIT optimize code.

So first let’s look at C++ in Visual Studio 2012. I’ve enabled optimizations, but not vector instructions (SSE) for the sake of comparison with the CLR JIT, which never vectorizes anything. That is to say, C++ can generate much more sophisticated assembly than this, but this is a good starting point from our perspective.

#include <array>
#include <iostream>

using namespace std;

template<size_t size>
int sum(array<int, size>& arr) {
	int total = 0;
	for (size_t i = 0; i < arr.size(); ++i) {
		total += arr[i];
	}
	return total;
}

int main() {
	array<int, 1000> arr = {};
	auto total = sum(arr);
	cout << total;
}

Of course the compiler inlines the function, but that doesn’t change the loop. It is compiled to the following assembly (comments by me)

001A129E  xor ecx,ecx                     // sum = 0
001A12A0  xor eax,eax                     // i = 0
001A12A2  xor edx,edx                     // sum2 = 0

001A12A4  add ecx,dword ptr [esp+eax*4]   // sum += arr[i]
001A12A7  add edx,dword ptr [esp+eax*4+4] // sum2 += arr[i + 1]
001A12AB  add eax,2                       // i += 2

001A12AE  cmp eax,3E8h                    // if i < 1000
001A12B3  jb main+34h (01A12A4h)          // goto 001A12A4  
 
001A12B5  add ecx,edx                      // sum = sum + sum2

Here the C++ compiler has decided to unroll the loop once, maintaining two running totals, and adding the two at the end. This is an interesting optimisation but it’s not very important; a strictly minimal loop could be reduced to 4 instructions:

add ecx dword ptr[esp+eax*4] // sum += arr[i]
inc eax                      // i++
cmp eax,3E8h                 // if i < 1000
jb 1                         // goto 1

Since the JIT compiler doesn’t do much in terms of advanced optimisations, this minimal loop is what we’d expect if it simply did its job correctly. Let’s see what we can get out of F#. We’ll compare four approaches: “functional” using Array.fold, “succint” using Array.sum, “imperative” using a for loop like we did in C++, and finally we’ll test the “for .. in” loop (foreach in C#).

Note that in order to see the optimized JIT assembly, you need to compile in Release mode and uncheck “Suppress JIT optimization on module load” and “Enable Just My Code” in the debugger options. This is in 32-bit mode as well.

[<EntryPoint>]
let main argv = 

    let arr = Array.zeroCreate 100

    let sum1 = Array.fold(fun acc elem -> acc + elem) 0 arr

    let sum2 = Array.sum arr

    let mutable sum3 = 0
    for i in 0 .. arr.Length - 1 do
        sum3 <- sum3 + arr.[i]
    
    let mutable sum4 = 0
    for elem in arr do
        sum4 <- sum4 + elem
    
    printfn "%d" sum1
    printfn "%d" sum2
    printfn "%d" sum3
    printfn "%d" sum4
    0

One might point out that since the array contains only zeroes, the compiler could optimize all sums away and simply print out zeroes, but fortunately for our study, it’s not smart enough to do that.

Let’s start with Array.fold. I won’t post all the assembly code because there’s quite a long prologue and epilogue which don’t really matter. The inner function is compiled to this:

00000000  mov         eax,edx 
00000002  mov         edx,dword ptr [esp+4] 
00000006  add         eax,edx 
00000008  ret         4 

This is quite good; it simply loads the next element of the array, adds it to the running total, and returns it.

The inner loop of the “fold” function is compiled to this:

0000002c  cmp         esi,ebx 
0000002e  jae         00000081 
00000030  push        dword ptr [edi+esi*4+8] 
00000034  mov         ecx,dword ptr [ebp-14h] 
00000037  mov         eax,dword ptr [ecx] 
00000039  mov         eax,dword ptr [eax+28h] 
0000003c  call        dword ptr [eax+14h] 
0000003f  mov         edx,eax 
00000041  inc         esi 
00000042  mov         eax,dword ptr [ebp-10h] 
00000045  inc         eax 
00000046  cmp         esi,eax 
00000048  jne         0000002C

I’m not going over this in detail; it basically just loads the arguments to the function in registers, calls the function, increments its counter, advances the position in the array, and loops until it hits the end of the array. This is not bad assembly code at all; it’s about as terse and efficient as you could expect a straightforward implementation of fold to be. That said, it’s definitely a lot of code compared to what we actually need, and it’s somewhat disappointing that such a short lambda function isn’t inlined and further optimisations done after that.

Let’s look at Array.sum now:

00000040  add         ebx,dword ptr [edi+esi*4+8] 
00000044  jo          000002DB 
0000004a  inc         esi 
0000004b  cmp         edx,esi 
0000004d  jg          00000040 

Whoa, now we’re talking. This is essentially our “minimal” 4 instructions loop, with the additional “jo” instruction after the “add”, which checks for overflow. If we force a jump to this 000002DB address, an OverflowException is thrown by the runtime. This makes a lot of sense; a sum of an arbitrary number of integers could easily overflow and it’s probably not a good thing in general.

The imperative loop is compiled to this:

0000005a  add         ebx,dword ptr [edi+esi*4+8] 
0000005e  inc         esi 
0000005f  cmp         edx,esi 
00000061  jg          0000005A 

Nice! This is as efficient as we’d hope the JIT to do. No overflow check is performed, so this will silently give incorrect results. If you’re really concerned about performance, your array is large, and if your array is large then overflow is quite likely. So in general I think Array.sum makes more sense than trying to bypass the overflow check to save 1 instruction, but it’s nice to see that you can avoid it if you really want to.

Finally let’s take a look at the for .. in loop:

0000006b  mov         eax,dword ptr [edi+ecx*4+8] 
0000006f  add         esi,eax 
00000071  inc         ecx 
00000072  cmp         edx,ecx 
00000074  jg          0000006B 

This is essentially the same thing, except that the “fetch array element and add to running total” operation is done in two steps instead of one, adding an unnecessary instruction. This is a bit disappointing but it certainly doesn’t change much in terms of performance.

In conclusion, we have seen that the fastest sum in F# is simply the for [index] in 0 .. [length – 1] loop, but that Array.sum gives us overflow checking and thus makes a lot more sense in general. The for .. in loop is ever so slightly less efficient, and Array.fold incurs substantial overhead and should be reserved for more complex computations. We’ve also seen that the C++ compiler can do interesting optimisations beyond what the JIT is capable of, but that the JIT can nonetheless generate quite good, if unsophisticated, assembly.