Continuing from yesterday lets look at how this simple algorithm can be made faster!

The first innovation came from XLife that shipped with the X11 X-Windows system on Unix. (personally I didn’t find out about this until later, but historically this came before)

The problem with the conventional type of loop (ignore the wrap problem, this is only an illustration):

for (int x = 0; x < max_x; x++) {
  for (int y = 0; y < max_y; y++) {
    int neighbors;
    neighbors =
      board[y-1][x-1] +
      board[y-1][x] +
      board[y-1][x+1] +
      board[y][x-1] +
      board[y][x+1] +
      board[y+1][x-1] +
      board[y+1][x] +

The problem is that there is a lot of dereferencing going on all over the place. For each cell you need to do eight double-indexed dereferences. This leads to a a lot of memory traffic on the system bus.

Now lets assume we have each cell stored in a byte and you have a double-whammy: modern processors are bad (cycle-wise) at accessing memory on a non-word boundary.

But, you can play some interesting games if you don’t mind getting closer to the bare metal of the processor! What if instead of treating each cell as a byte, you treat it as a native machine word? For modern processors that would be either 32 or 64 bits.

uint64 top = board[y-1][x]
uint64 middle = board[y][x]
uint64 bottom = board[y+1][x]

uint64 neighbors = top<<8 + top + top >>8 +
                   middle<<8 + middle>>8 +
                   bottom<<8 + bottom + bottom>>8;

This will get you the neighbor counts for the middle six cells. Of course a cell can’t have more than 8 live neighbors so you can take it a step further by storing the cells one per nibble and replace the “8”s in the shifts up there with “4”s, now you’re finding the neighbor counts for 14 cells at a time! Now the looping over the result can be handled all over the registers instead of memory.

Of course this doesn’t handle the cells at the edge, but that can be done cheaply with some messy code (left as an exercise for the reader!).

This code will run substantially faster than the simple code up top. It’s better, but it can get better still by looking at how life typically gets played instead of treating every cell the same.

Tomorrow: VLife – our own creation!