In this post I will explore the question of how to compute the sum of a collection in F# as fast as possible; that is, as close as optimised C++ as possible. This should give us some insights as to how the F# compiler and JIT optimize code.
So first let’s look at C++ in Visual Studio 2012. I’ve enabled optimizations, but not vector instructions (SSE) for the sake of comparison with the CLR JIT, which never vectorizes anything. That is to say, C++ can generate much more sophisticated assembly than this, but this is a good starting point from our perspective.
#include <array> #include <iostream> using namespace std; template<size_t size> int sum(array<int, size>& arr) { int total = 0; for (size_t i = 0; i < arr.size(); ++i) { total += arr[i]; } return total; } int main() { array<int, 1000> arr = {}; auto total = sum(arr); cout << total; }
Of course the compiler inlines the function, but that doesn’t change the loop. It is compiled to the following assembly (comments by me)
001A129E xor ecx,ecx // sum = 0 001A12A0 xor eax,eax // i = 0 001A12A2 xor edx,edx // sum2 = 0 001A12A4 add ecx,dword ptr [esp+eax*4] // sum += arr[i] 001A12A7 add edx,dword ptr [esp+eax*4+4] // sum2 += arr[i + 1] 001A12AB add eax,2 // i += 2 001A12AE cmp eax,3E8h // if i < 1000 001A12B3 jb main+34h (01A12A4h) // goto 001A12A4 001A12B5 add ecx,edx // sum = sum + sum2
Here the C++ compiler has decided to unroll the loop once, maintaining two running totals, and adding the two at the end. This is an interesting optimisation but it’s not very important; a strictly minimal loop could be reduced to 4 instructions:
add ecx dword ptr[esp+eax*4] // sum += arr[i] inc eax // i++ cmp eax,3E8h // if i < 1000 jb 1 // goto 1
Since the JIT compiler doesn’t do much in terms of advanced optimisations, this minimal loop is what we’d expect if it simply did its job correctly. Let’s see what we can get out of F#. We’ll compare four approaches: “functional” using Array.fold, “succint” using Array.sum, “imperative” using a for loop like we did in C++, and finally we’ll test the “for .. in” loop (foreach in C#).
Note that in order to see the optimized JIT assembly, you need to compile in Release mode and uncheck “Suppress JIT optimization on module load” and “Enable Just My Code” in the debugger options. This is in 32-bit mode as well.
[<EntryPoint>] let main argv = let arr = Array.zeroCreate 100 let sum1 = Array.fold(fun acc elem -> acc + elem) 0 arr let sum2 = Array.sum arr let mutable sum3 = 0 for i in 0 .. arr.Length - 1 do sum3 <- sum3 + arr.[i] let mutable sum4 = 0 for elem in arr do sum4 <- sum4 + elem printfn "%d" sum1 printfn "%d" sum2 printfn "%d" sum3 printfn "%d" sum4 0
One might point out that since the array contains only zeroes, the compiler could optimize all sums away and simply print out zeroes, but fortunately for our study, it’s not smart enough to do that.
Let’s start with Array.fold. I won’t post all the assembly code because there’s quite a long prologue and epilogue which don’t really matter. The inner function is compiled to this:
00000000 mov eax,edx 00000002 mov edx,dword ptr [esp+4] 00000006 add eax,edx 00000008 ret 4
This is quite good; it simply loads the next element of the array, adds it to the running total, and returns it.
The inner loop of the “fold” function is compiled to this:
0000002c cmp esi,ebx 0000002e jae 00000081 00000030 push dword ptr [edi+esi*4+8] 00000034 mov ecx,dword ptr [ebp-14h] 00000037 mov eax,dword ptr [ecx] 00000039 mov eax,dword ptr [eax+28h] 0000003c call dword ptr [eax+14h] 0000003f mov edx,eax 00000041 inc esi 00000042 mov eax,dword ptr [ebp-10h] 00000045 inc eax 00000046 cmp esi,eax 00000048 jne 0000002C
I’m not going over this in detail; it basically just loads the arguments to the function in registers, calls the function, increments its counter, advances the position in the array, and loops until it hits the end of the array. This is not bad assembly code at all; it’s about as terse and efficient as you could expect a straightforward implementation of fold to be. That said, it’s definitely a lot of code compared to what we actually need, and it’s somewhat disappointing that such a short lambda function isn’t inlined and further optimisations done after that.
Let’s look at Array.sum now:
00000040 add ebx,dword ptr [edi+esi*4+8] 00000044 jo 000002DB 0000004a inc esi 0000004b cmp edx,esi 0000004d jg 00000040
Whoa, now we’re talking. This is essentially our “minimal” 4 instructions loop, with the additional “jo” instruction after the “add”, which checks for overflow. If we force a jump to this 000002DB address, an OverflowException is thrown by the runtime. This makes a lot of sense; a sum of an arbitrary number of integers could easily overflow and it’s probably not a good thing in general.
The imperative loop is compiled to this:
0000005a add ebx,dword ptr [edi+esi*4+8] 0000005e inc esi 0000005f cmp edx,esi 00000061 jg 0000005A
Nice! This is as efficient as we’d hope the JIT to do. No overflow check is performed, so this will silently give incorrect results. If you’re really concerned about performance, your array is large, and if your array is large then overflow is quite likely. So in general I think Array.sum makes more sense than trying to bypass the overflow check to save 1 instruction, but it’s nice to see that you can avoid it if you really want to.
Finally let’s take a look at the for .. in loop:
0000006b mov eax,dword ptr [edi+ecx*4+8] 0000006f add esi,eax 00000071 inc ecx 00000072 cmp edx,ecx 00000074 jg 0000006B
This is essentially the same thing, except that the “fetch array element and add to running total” operation is done in two steps instead of one, adding an unnecessary instruction. This is a bit disappointing but it certainly doesn’t change much in terms of performance.
In conclusion, we have seen that the fastest sum in F# is simply the for [index] in 0 .. [length – 1] loop, but that Array.sum gives us overflow checking and thus makes a lot more sense in general. The for .. in loop is ever so slightly less efficient, and Array.fold incurs substantial overhead and should be reserved for more complex computations. We’ve also seen that the C++ compiler can do interesting optimisations beyond what the JIT is capable of, but that the JIT can nonetheless generate quite good, if unsophisticated, assembly.