The fastest sum in F#

In this post I will explore the question of how to compute the sum of a collection in F# as fast as possible; that is, as close as optimised C++ as possible. This should give us some insights as to how the F# compiler and JIT optimize code.

So first let’s look at C++ in Visual Studio 2012. I’ve enabled optimizations, but not vector instructions (SSE) for the sake of comparison with the CLR JIT, which never vectorizes anything. That is to say, C++ can generate much more sophisticated assembly than this, but this is a good starting point from our perspective.

#include <array>
#include <iostream>

using namespace std;

template<size_t size>
int sum(array<int, size>& arr) {
	int total = 0;
	for (size_t i = 0; i < arr.size(); ++i) {
		total += arr[i];
	}
	return total;
}

int main() {
	array<int, 1000> arr = {};
	auto total = sum(arr);
	cout << total;
}

Of course the compiler inlines the function, but that doesn’t change the loop. It is compiled to the following assembly (comments by me)

001A129E  xor ecx,ecx                     // sum = 0
001A12A0  xor eax,eax                     // i = 0
001A12A2  xor edx,edx                     // sum2 = 0

001A12A4  add ecx,dword ptr [esp+eax*4]   // sum += arr[i]
001A12A7  add edx,dword ptr [esp+eax*4+4] // sum2 += arr[i + 1]
001A12AB  add eax,2                       // i += 2

001A12AE  cmp eax,3E8h                    // if i < 1000
001A12B3  jb main+34h (01A12A4h)          // goto 001A12A4  
 
001A12B5  add ecx,edx                      // sum = sum + sum2

Here the C++ compiler has decided to unroll the loop once, maintaining two running totals, and adding the two at the end. This is an interesting optimisation but it’s not very important; a strictly minimal loop could be reduced to 4 instructions:

add ecx dword ptr[esp+eax*4] // sum += arr[i]
inc eax                      // i++
cmp eax,3E8h                 // if i < 1000
jb 1                         // goto 1

Since the JIT compiler doesn’t do much in terms of advanced optimisations, this minimal loop is what we’d expect if it simply did its job correctly. Let’s see what we can get out of F#. We’ll compare four approaches: “functional” using Array.fold, “succint” using Array.sum, “imperative” using a for loop like we did in C++, and finally we’ll test the “for .. in” loop (foreach in C#).

Note that in order to see the optimized JIT assembly, you need to compile in Release mode and uncheck “Suppress JIT optimization on module load” and “Enable Just My Code” in the debugger options. This is in 32-bit mode as well.

[<EntryPoint>]
let main argv = 

    let arr = Array.zeroCreate 100

    let sum1 = Array.fold(fun acc elem -> acc + elem) 0 arr

    let sum2 = Array.sum arr

    let mutable sum3 = 0
    for i in 0 .. arr.Length - 1 do
        sum3 <- sum3 + arr.[i]
    
    let mutable sum4 = 0
    for elem in arr do
        sum4 <- sum4 + elem
    
    printfn "%d" sum1
    printfn "%d" sum2
    printfn "%d" sum3
    printfn "%d" sum4
    0

One might point out that since the array contains only zeroes, the compiler could optimize all sums away and simply print out zeroes, but fortunately for our study, it’s not smart enough to do that.

Let’s start with Array.fold. I won’t post all the assembly code because there’s quite a long prologue and epilogue which don’t really matter. The inner function is compiled to this:

00000000  mov         eax,edx 
00000002  mov         edx,dword ptr [esp+4] 
00000006  add         eax,edx 
00000008  ret         4 

This is quite good; it simply loads the next element of the array, adds it to the running total, and returns it.

The inner loop of the “fold” function is compiled to this:

0000002c  cmp         esi,ebx 
0000002e  jae         00000081 
00000030  push        dword ptr [edi+esi*4+8] 
00000034  mov         ecx,dword ptr [ebp-14h] 
00000037  mov         eax,dword ptr [ecx] 
00000039  mov         eax,dword ptr [eax+28h] 
0000003c  call        dword ptr [eax+14h] 
0000003f  mov         edx,eax 
00000041  inc         esi 
00000042  mov         eax,dword ptr [ebp-10h] 
00000045  inc         eax 
00000046  cmp         esi,eax 
00000048  jne         0000002C

I’m not going over this in detail; it basically just loads the arguments to the function in registers, calls the function, increments its counter, advances the position in the array, and loops until it hits the end of the array. This is not bad assembly code at all; it’s about as terse and efficient as you could expect a straightforward implementation of fold to be. That said, it’s definitely a lot of code compared to what we actually need, and it’s somewhat disappointing that such a short lambda function isn’t inlined and further optimisations done after that.

Let’s look at Array.sum now:

00000040  add         ebx,dword ptr [edi+esi*4+8] 
00000044  jo          000002DB 
0000004a  inc         esi 
0000004b  cmp         edx,esi 
0000004d  jg          00000040 

Whoa, now we’re talking. This is essentially our “minimal” 4 instructions loop, with the additional “jo” instruction after the “add”, which checks for overflow. If we force a jump to this 000002DB address, an OverflowException is thrown by the runtime. This makes a lot of sense; a sum of an arbitrary number of integers could easily overflow and it’s probably not a good thing in general.

The imperative loop is compiled to this:

0000005a  add         ebx,dword ptr [edi+esi*4+8] 
0000005e  inc         esi 
0000005f  cmp         edx,esi 
00000061  jg          0000005A 

Nice! This is as efficient as we’d hope the JIT to do. No overflow check is performed, so this will silently give incorrect results. If you’re really concerned about performance, your array is large, and if your array is large then overflow is quite likely. So in general I think Array.sum makes more sense than trying to bypass the overflow check to save 1 instruction, but it’s nice to see that you can avoid it if you really want to.

Finally let’s take a look at the for .. in loop:

0000006b  mov         eax,dword ptr [edi+ecx*4+8] 
0000006f  add         esi,eax 
00000071  inc         ecx 
00000072  cmp         edx,ecx 
00000074  jg          0000006B 

This is essentially the same thing, except that the “fetch array element and add to running total” operation is done in two steps instead of one, adding an unnecessary instruction. This is a bit disappointing but it certainly doesn’t change much in terms of performance.

In conclusion, we have seen that the fastest sum in F# is simply the for [index] in 0 .. [length – 1] loop, but that Array.sum gives us overflow checking and thus makes a lot more sense in general. The for .. in loop is ever so slightly less efficient, and Array.fold incurs substantial overhead and should be reserved for more complex computations. We’ve also seen that the C++ compiler can do interesting optimisations beyond what the JIT is capable of, but that the JIT can nonetheless generate quite good, if unsophisticated, assembly.

Advertisements

Re: 8 Most common mistakes C# developers make

A recent blog article by Pawel Bejger lists some common mistakes made by C# developers. As noted by several in the comments, some of this information can be misleading or incorrect. Let’s review the problematic points:

String concatenation instead of StringBuilder

The author argues that using string concatenation is inefficient compared to using StringBuilder. While this is true in the general case, including the example presented, this must be taken with a grain of salt. For instance, if you just want to concatenate a list of strings as in the example presented, the easiest and fastest way is String.Concat(), or String.Join() if you need to insert something in between each pair of strings.

In addition, compile-time concatenations are automatically translated by the compiler into the appropriate calls to String.Concat(). Therefore, the use of StringBuilder should be reserved to building complex strings at runtime; don’t go blindly replacing every string concatenation code with StringBuilders.

Casting with “(T)” instead of “as (T)”

The author argues that if there’s the slightest probability a cast could fail, “as (T)” should be used instead of the regular cast. This is a very common myth and is misleading advice. First, let’s review what each cast does:

  • “(T)” works with both value and reference types. It casts the object to T, and throws an InvalidCastException if the cast isn’t valid.
  • “as (T)” works only with reference types. It returns the object casted to T if it succeeds, and null if the cast isn’t valid.

Each cast expresses a different intent. “(T)” means that you fully expect this cast to succeed, and that if it doesn’t, this is an error in the code. This is the simple and general case. “as (T)” means that you fully expect this cast NOT to succeed at least some of the time, that this is a normal occurrence, and that you will take care of manually handling it via a null check after.

The real mistake I often see is “as (T)” not followed by a null check. The developer fully expects the cast to succeed so doesn’t bother to write the null check, but the day something goes awry, no exception is thrown on the invalid cast, no null check is performed, and a hard to track bug has found its way into the code base. So my advice is to always use the regular cast “(T)” unless you intend to check yourself for the invalid cast via “as (T)” and a null check.

Using “foreach” instead of “for” for anything else than collections

The author claims that using “for” to iterate “anything that is not a collection (so through e.g. an array)””, the “for” loop is much more efficient than the “foreach” loop.

As with any claims regarding performance, this is easily verified. In this case, the verification proves the author wrong.

As the page he links to claims (and which is very outdated), the generated IL for a “foreach” loop is slightly larger than that for a “for” loop. This is no indication of its performance however, as the speed of code doesn’t correlate with its size, and IL isn’t actually executed but compiled just-in-time to assembly code, where further optimizations may happen. A very simple test performing the sum of an array of ints shows that the machine code generated for a foreach loop is in fact shorter, and the execution time also shorter, than for the for loop. Results on my machine (x86 release mode with optimisations, no debugger attached):

SumForEach: 482 ms

SumFor: 503ms

As you can see, this is in fact a very small difference, so I would advise that in general you use the loop that is the most idiomatic or readable, and not worry too much about the performance difference. But if you’re writing performance-sensitive code, don’t rely on someone else’s benchmark and test things out yourself. You may find that it’s not because  code is visually more compact or more C-like that it performs any better.

Darkstone’s MTF file format, part 2

Following my first attempt at unraveling Darkstone’s MTF file format, I suddenly found a ton of information on the format on the web. Well, so to speak. It is vaguely described at Xentax Game File Format Central, and there’s even a working open-source extractor: Dragon UnPACKer. The relevant code is in drv_default.dpr. It’s apparently Pascal, a language I had never looked at before, but that didn’t present a problem as it’s highly readable.

Remember I said some entries had invalid sizes? Well, they did not: the specified size was simply the uncompressed size, and I assumed no file compression. I had noticed that most files in DATA.MTF began with AF BE or AE BE even though they were of completely different formats, but it had not sprung to my mind that this was because they were compressed. Anyway, had I guessed, I couldn’t have figured out the technique to decompress them; even with the description at Xentax plus working source code in front of me, writing my own implementation was touchy.

My extractor now does at least as well as Dragon UnPACKer, so I’m pretty confident about having nailed the file format. Rather than try to explain it in words, I’ll refer you to my code on pastebin, which has enough comments that it should be clear even you don’t know C#.

Now that I at least extract all files properly, the internal file formats are starting to make more sense.

O3D: a file format for static meshes, can be opened at least with makeo3d.exe. The “Flyff” 3d converter (o3d2obj.exe) appears not to work with these.

DAT: mainly ASCII strings separated by long strings of 0s or other repeated numbers. I still have no real clue as to what they mean, but at least they’re partially legible now.

CDF: references O3D files by their names, separated by long strings of 0s. Maybe level descriptions?

Anyway… this is probably going nowhere as I don’t have infinite time nor much experience with reverse-engineering game files, but at least I get to enjoy the music knowing I extracted it myself! Ha.

 

 

Darkstone’s MTF file format

One of my guilty pleasures as a programmer is writing extractors for weird/old/undocumented file formats. I’m not very good at it, but I try. My latest victim was the “.MTF” format used by Delphine Software’s 1998 Darkstone, the first RPG I ever owned:

All the data files except for movies are in these obscure binary archives:

  • DATA.MTF
  • dsp001.MTF
  • dsp004.MTF
  • MUSIC.MTF
  • VOICES.MTF

I searched Google in vain for any information on the format. No, this is not the same thing as Microsoft Tape Format. The only thing I found to help was a utility called dsxtract.exe, which extracted all the mp2 music files from MUSIC.MTF. It didn’t run on Windows 7 x64 (and of course the source code is nowhere to be found), but DosBox did the trick.

Now let’s take a look at MUSIC.MTF in a hex editor:

Ugh, not even an ASCII header. We can see that starting at byte 9, there’s an ASCII string, so the first 8 bytes are probably integers.

First things first: integers are usually little-endian. This means that if you blindly paste the highlighted 4 bytes in calc.exe and make it convert from hex to decimal, you’ll get 419430400, a number that has no obvious meaning. The trick is to invert the order of bytes: 00000019 is 25 in decimal. Does “25” make any sense?

Well when I ran dsxtract.exe, it produced 25 mp2 tracks. So it would appear that this first integer is the number of entries in the file.

So let’s look at the next integer. D is 13 in decimal. Hum, ok?

What about that next string. “MUSIC\22.MP2”. That looks like a path name. And it’s 12 characters long. Hum, almost 13… wait! The next byte is 00, the null character, so this is a 13-byte null-terminated string, and the integer before it was its length! Every string at the beginning of this file has 12 characters followed by 00, and each one is prefixed with the number 13. To confirm this hypothesis, I also looked at DATA.MTF which had paths of various lengths, and each was prefixed with the number of characters + 1. So, there we go.

Between “MUSIC\22.MP2” and the next string, there are 8 bytes:

75 02 00 00 93 FC 15 00

This is most likely two integers: 629 and 1440915. They might somehow indicate where this file is located within the archive. Let’s see what we have at offset 629:

FF FD 90 04 53 33 11 11 11 11 11 11 11 11 11 24

Well, I know nothing of the MP2 file format, but if this is the start of “MUSIC\22.MP2” then it probably looks similar to any other MP2 file. Let’s look at one of the extracted files in a hex editor:

FF FD 90 04 55 22 11 11 11 11 11 11 11 11 11 24

Sweet! Now let’s look at the file size of 22.MP2: 1 440 915 bytes. That’s exactly the second number following the path name, so that would give its size.

Note that I make this look very easy; in fact I spent about 3 hours to find this. Anyway.

At this point we can write the spec down for Darkstone’s MTF file format:

4 bytes – numFiles: integer; Number of files in the archive

This is followed by [numFiles] data entries. Each is structured like so:

4 bytes – pathLength: integer; Length in bytes of the next string.

[pathLength] bytes – null-terminated ASCII string: path of the entry.

4 bytes – offset: integer; absolute offset of the data in the archive.

4 bytes – size: integer; size of the data in the archive

Then follows all the data.

I then wrote a little C# application that read an entry, fetched the corresponding data, put it in a file of the same name, and proceeded to the next entry until they were all done. That almost worked. It would throw exceptions while reading DATA.MTF, because some of the specified sizes there are invalid and result in reading past the end of the file. Ugh.

So I had to resort to a more involved approach. Instead of processing one entry at a time, I start by reading all entries (path, offset, size), making a list of that, sort it by offset, and then go over that list to fetch the corresponding data. For each entry, I check if the size makes sense; if it doesn’t, I use the next entry’s offset to calculate the “real” size. Note that the real size could actually be smaller (there are unused bytes), but I suppose that’s the best I can do.

This worked well. Extracting these archives reveals the many file formats used by Darkstone (a lot more fun in store!):

  • MP2 – well-known sound format, used for all music and speech
  • WAV – well-known sound format, used for most sounds
  • DAT – could be anything; used by a few relatively small files: “SND.DAT”, “LANGUAGE.DAT”, etc.
  • AND – looks related to 3D models, I don’t know.
  • B3D – maybe this? I don’t know.
  • BRM – no clue
  • CLD – very repetitive (FC 01 FC 01 FC C0 FC 00 01 FC 01 FC C0 FC 01 FC 01 FC C0…), but I don’t know.
  • MBR – obscure binary format, I haven’t got a clue
  • MDL – idem
  • SKA – idem
  • O3D – seems used for meshes, maybe it’s Objective-3D, about which no one knows apparently.

Damn.

Here’s the full listing if anyone wants to use it. This must be compiled with the /UNSAFE compiler option:

http://pastebin.com/hpFN93Um

Icewind Dale 2 Ultimate Installation Guide

Many people wonder if old Infinity Engine games like Icewind Dale 2, Planescape: Torment and Baldur’s Gate run well under Windows 7/Vista. The answer is yes: these games are not that old, and in many cases they will work fine out of the box like any other game. However, should you have any issues, this guide is here to help. I will show you how to install, configure, patch and mod the game for an optimal gameplay experience. In this guide, we’ll focus on Icewind Dale 2, but a lot of this information also applies to other Infinity Engine games.

Step 1) Installation

If there’s only one thing you should remember from this section, it is do not install to C:\Program Files! Windows Vista and 7 have new security restrictions that the developers of such old games did not expect, so some things may break.

Icewind Dale 2 ships on DVD, 2 CDs, or can be downloaded at www.gog.com. I will only illustrate the DVD installation here, but they are all very similar. For the CD version, insert CD 1 first, and then CD 2 when the setup asks for it, and finally CD 1 again when the setup asks for it at the end of the installation. You will use CD 2 to play the game.

After selecting Install, click your way to this screen. Select “Full” and click Next.

The default Destination Folder will start with C:\Program Files. Make sure to change this! This also applies if you are installing the game from gog.com. Here I install the game to my D drive. If you don’t have another drive, you could install, for example, under C:\Games\Icewind Dale II. As long as you’re not installing to C:\Program Files, you’re fine.

After that, click Next twice again and the game will install.

There is no need to re-install DirectX. Windows Vista and Windows 7 automatically keep your version of DirectX up-to-date.

At the end of the Setup, check the box to run the config utility. Alternatively, the config utility can be found in the Start Menu as illustrated below, and simply in your game’s installation folder. In my case it is at D:\Program Files (x86)\Black Isle\Icewind Dale II\Config.exe.

Step 2) Configuration

However you do it, here is what you should see when the Configuration Utility starts:

The three sliders under General, Graphics and Audio will be at the “Average” setting. Any computer running Windows Vista or 7 should be fast enough to run the game at its highest settings, so set all of these to “High End” as illustrated above.

Now select the Game tab.

Check “Display Quest Experience”, and, optionally, increase the Cache Size to at least 300 MB. If you have plenty of memory, feel free to set that even higher, but it doesn’t make a huge difference.

Make sure all your options are exactly like in this screenshot (“Display Movie Subtitles” is personal preference, but I like it), and click OK.

Step 3) Patching

This step should not be necessary if you are using the GOG version, as it should already be the latest version (2.01). You can double-check this by starting the game and going to the options page.

Note that the game may take a long while to start the first few times. I don’t know what causes this. Just wait patiently, after 10-30 seconds it should start.

This screenshot shows my Icewind Dale 2 in-game Options page right after installing from DVD. The version is earlier than 2.01, therefore we must install the patch.

Fortunately, there is only one patch to install, and that is the Official Icewind Dale 2 Patch v2.01 (file name is IWD2Patch201.exe). At the time of this writing, you can find it at

Anyway, simply run the executable, click through the wizard’s screen and that is that.

Step 4) Essential Mods

Icewind Dale 2 is already a highly polished game, so feel free to skip this step entirely. That said, I suggest installing the Icewind Dale 2 Tweak Pack, currently at http://www.gibberlings3.net/iwd2tweak/ . Very good installation instructions are provided in the Readme. I always use the following tweaks:

  • Unlimited Ammo Stacks
  • Unlimited Jewelry and Gem Stacks
  • Unlimited Potion Stacks
  • Unlimited Scroll Stacks
  • Bottomless Bags of Holding
  • Weapon Animation Tweaks
  • Force All Dialogue to Pause
  • 100% Learn Spells
  • Stores Sell Larger Stacks of Items

These mostly remove a lot of tedious inventory management, so you can focus more on the actually fun parts of the game.

Another popular mod is the Widescreen Mod, which will be discussed in the next section.

Step 5) Fix the resolution

Icewind Dale 2 is designed to run at 800×600, which is a 4:3 resolution. Chances are you are using a widescreen monitor, which the Infinity Engine doesn’t adapt well to. Here are two possible results:



There are basically two ways to deal with these problems: either fix the aspect ratio scaling, or install the widescreen mod.

5.1) Fix the aspect ratio scaling

If your image is stretched like in the “INCORRECT” screenshot, one option is to fix that in your video card’s driver. I cannot provide help for all drivers because these change all the time, but for what it’s worth, here’s how it works with the current NVIDIA Control Panel:

  1. Right-click the desktop and select NVIDIA Control Panel
  2. Under “Display”, select “Set desktop size and position”
  3. Choose the “width/height ratio” option
  4. Choose to do the scaling on the GPU rather than the Display
  5. Check the box below to override the scaling mode defined by games and programs.

Here’s a screenshot (mine is in French, but the options should be in the same place)

AMD drivers should have similar options. Please do not contact me for help with your video card’s drivers, instead go to their respective forums:

5.2) Install the Widescreen mod

Whether or not your image is stretched, you might want to take advantage of your entire screen rather than fiddle with drivers and then still have black bars. Well, that is possible with the Widescreen Mod, available at http://www.gibberlings3.net/widescreen/ . Installation instructions are provided there, so I will not repeat them. I suggest using the lowest available resolution that respects your monitor’s aspect ratio. Higher resolutions are of course possible, but will make everything look even smaller.

  • If your monitor is 16:9 (your highest resolution is 1280×720, 1600×900 or 1920×1080, for example), try 1280×720.
  • If your monitor is 16:10 (your highest resolution is 1440×900, 1680×1050 or 1920×1200, for example), try 1280×800.

Either way, here’s what the result looks like, for example at 1280×720:

Using the widescreen mod has several drawbacks:

  • The UI does not scale or stretch, so it ends up docked to the left with a weird black box to the right.
  • Also, everything looks quite a bit smaller than in 800×600. The higher the resolution, the worse it gets.
  • There’s also a bug in the Fell Wood (seen in the above screenshot!), where if you are running any resolution other than 800×600, the game will freeze after the battle with treants.

Therefore, while the widescreen mod allows you to play the game at a more widescreen-friendly resolution, it is by no means perfect. Whether you select this or the first option is up to you. Personally, I do not use the Widescreen Mod for IWD2.

Step 6) Fix visual artifacts and lag issues

If you haven’t done so already, start the game and see if it’s playable. Chances are everything works smoothly and you are set for a nice game of Icewind Dale 2!

However, several players report significant lag, especially when several spells animations are on-screen. Also, the fog of war may appear blocky and weird. You may try to set the game to 16-bit color mod and fiddle with the BLT Modes in the Configuration Utility (as illustrated in step 2), but even if it works, it doesn’t look as good as 32-bit color.

Here’s the real fix for all these issues. It might seem a bit complicated, but I will guide you step-by-step.

First, open up your browser and search for the Microsoft Application Compatibility Toolkit.

Click the first link. This should take you to Microsoft’s Download Center, where we want the ApplicationCompatibilityToolkitSetup.exe :

Install the toolkit by running the executable and just selecting Next all the way through, nothing special needs to be done there.

Once it’s installed, start the Compatibility Administrator (32-bit), which will be in your Start Menu under Microsoft Application Compatibility Toolkit:

Make sure a new database is selected as in the screenshot, and click Fix. Enter the name of the program and browse to its location (in this case IWD2.exe):

The next page will be “Compatibility Modes”; we don’t need anything there, so click Next again.

Under “Compatiblity Fixes”, check ForceDirectDrawEmulation. We don’t need anything else on the next 2 pages, so simply click next and then finish.

You have now created a fix. Click Save, give it some sensible name and save it anywhere you’d like. (I save it in the IWD2 install folder, but any other place is good.)

With the database saved, right-click it and select Install. You should see the following dialog:

Step 7) ???

If you still have technical issues with Icewind Dale 2 after following this guide, or if you have any comments, you can contact me and other helpful people at the Sorcerer’s Place Forums (http://www.sorcerers.net/forums/index.php), so register and come say hello! My nickname is Dr_Asik, you can send me a message directly, but you’ll probably get a quicker answer by creating a new thread under the appropriate section.