Tech Talk on Optimization

These are the slides from a tech talk I gave at the NWERC programming competition.

Topics barely covered: cache-friendly memory layouts, bit-parallelism, SIMD-within-a-register, vertical storage, DeBruijn sequences.

Next permutation

The example in the background was from Hacker's Delight:

unsigned snoob(unsigned x) {
   unsigned smallest, ripple, ones;
                                // x = xxx0 1111 0000
   smallest = x & -x;           //     0000 0001 0000
   ripple = x + smallest;       //     xxx1 0000 0000
   ones = x ^ ripple;           //     0001 1111 0000
   ones = (ones >> 2)/smallest; //     0000 0000 0111
   return ripple | ones;        //     xxx1 0000 0111
}

Given a subset (represented as a bit map), it finds the lexicographically next subset with the same number of elements. Equivalently, it finds the next integer with the same number of one-bits. Work it out, it's nice.

Slides

Some random images have been removed.

Slide 1 of ℵ0
[image missing]
Slide 2 of ℵ0
Erling Ellingsen

[email address removed]

Slide 3 of ℵ0
"know when not to optimize"
Slide 4 of ℵ0
not the topic today
Slide 5 of ℵ0
how to blatantly overoptimize
Slide 6 of ℵ0
sometimes useful
Slide 7 of ℵ0
task:
Slide 8 of ℵ0
 Foo Bar Baz
  1   2   3 
  1   4   1 
  2   4   3 
  5   3   2 
  3   5   6 
Slide 9 of ℵ0
 (foo == 1) &&   
 (bar == 2)      
Slide 10 of ℵ0
postgres can do 1 M records/second
Slide 11 of ℵ0

Slide 12 of ℵ0
[image missing]
Slide 13 of ℵ0
 Foo Bar Baz       
  1   2   3        
  1   4   1        
  2   4   3        
  5   3   2        
  3   5   6        
--------------     
struct rec {       
  byte foo,bar,baz;
};                 
rec *data;         
Slide 14 of ℵ0
for(o on data) {    
  if(o->foo == 1 && 
     o->bar == 2) { 
    ...             
  }                 
}                   
Slide 15 of ℵ0
80 M/s
Slide 16 of ℵ0
two cache lines per record
Slide 17 of ℵ0
store each field separately
Slide 18 of ℵ0
 Foo Bar Baz           
  1   2   3            
  1   4   1            
  2   4   3            
  5   3   2            
  3   5   6            
--------------         
foo: [ 1, 1, 2, 5, .. ]
bar: [ 2, 4, 4, 3, .. ]
baz: [ 3, 1, 3, 2, .. ]
Slide 19 of ℵ0
[image missing]
Slide 20 of ℵ0
 Foo Bar Baz        
  1   2   3         
  1   4   1         
  2   4   3         
  5   3   2         
  3   5   6         
--------------      
foo[0] = 0x11253....
                    
bar[0] = 0x24435....
Slide 21 of ℵ0
for(i) {                     
  a = awords[i];             
  b = bwords[i];             
  for(ofs : [0,4,8..]) {     
    if((a>>ofs) & 0xf == 1 &&
       (b>>ofs) & 0xf == 2) {
    }                        
  }                          
}                            
Slide 22 of ℵ0
3.75x
300 M/s
Slide 23 of ℵ0
[image missing]
Slide 24 of ℵ0
for(i) {                                         
  a = awords[i];                                 
  b = bwords[i];                                 
  if((a>> 0) & 0xf == 1 && (b>>0) & 0xf == 2) ...
  if((a>> 4) & 0xf == 1 && (b>>4) & 0xf == 2) ...
  if((a>> 8) & 0xf == 1 && (b>>8) & 0xf == 2) ...
}                                                
Slide 25 of ℵ0
5.8x
465 M/s
Slide 26 of ℵ0
[image missing]
Slide 27 of ℵ0
^ and &
are our friends
Slide 28 of ℵ0
32-bit words
8 4-bit values
Slide 29 of ℵ0
 0101 1100 1001 0101 ....
Slide 30 of ℵ0
we want to search for 0101
Slide 31 of ℵ0
0101 1100 1001 0101  values    
    xor                        
0101 0101 0101 0101  search key
   equals                      
0000 1001 1100 0000  wrong bits
Slide 32 of ℵ0
0101 1100 1001 0101  values        
    xor                            
1010 1010 1010 1010  NOT search key
   equals                          
1111 0110 0011 1111  matching bits 
----           ----                
Slide 33 of ℵ0
reduced problem: find all nibbles that are = 1111
Slide 34 of ℵ0
1111 0110 0011 1111                    
                    x = x & (x>>1);    
0111 0010 0001 1111                    
                    x = x & (x>>2);    
0001 0000 0000 0011                    
                    x = x & 0x11111...;
0001 0000 0000 0001                    
Slide 35 of ℵ0
8 comparisons at once
Slide 36 of ℵ0
   3    2    1    0
0000 0001 0000 0000
//                 
locate the 1 bits  
Slide 37 of ℵ0
nice trick in JDK
Slide 38 of ℵ0
    // java.lang.Integer:                            
    public static int numberOfTrailingZeros(int i) { 
        int y;                                       
        if (i == 0) return 32;                       
        int n = 31;                                  
        y = i <<16; if (y != 0) { n = n -16; i = y; }
        y = i << 8; if (y != 0) { n = n - 8; i = y; }
        y = i << 4; if (y != 0) { n = n - 4; i = y; }
        y = i << 2; if (y != 0) { n = n - 2; i = y; }
        return n - ((i << 1) >>> 31);                
    }                                                
Slide 39 of ℵ0
2.8x
1300 M/s
Slide 40 of ℵ0
[image missing]
Slide 41 of ℵ0
numberOfTrailingZeros is general
Slide 42 of ℵ0
general != fast
Slide 43 of ℵ0
the 1 bits can only be in positions 0, 4, 8...
Slide 44 of ℵ0
let's make a faster one
Slide 45 of ℵ0
first, we need to find the lowest bit
Slide 46 of ℵ0
X & -X
Slide 47 of ℵ0
000011010100110000  X        
111100101011001111  ~ X      
111100101011010000  (~ X) + 1
111100101011010000  -X       
                             
000000000000010000  X & -X   
Slide 48 of ℵ0
next:
from 24n to n
Slide 49 of ℵ0
 0x01234567 *         
 0x00000100 =         
 0x23456700           
                      
 0x23456700 >> 28 == 2
Slide 50 of ℵ0
while(x != 0) {                               
  int bit = x&-x;                             
  matched( ofs*8 + ((bit*0x01234567) >>> 28));
  x -= bit;                                   
}                                             
Slide 51 of ℵ0
18.7x
1500 M/s
Slide 52 of ℵ0
[image missing]
Slide 53 of ℵ0
before:                               
  x[0] = 0001 0111 0101 0111 1001 ....
  x[1] = ....                         
                                      
after:                                
  x[0] = 000011.                      
  x[1] = 011100.                      
  x[2] = 010101.                      
  x[3] = 111111.                      
Slide 54 of ℵ0
EGA format!
Slide 55 of ℵ0
32 matches at once
Slide 56 of ℵ0
for(ofs) {                                                  
  int matches =  x[ofs] & ~x[ofs+1] &  x[ofs+2] & ~x[ofs+3] 
              & ~y[ofs] &  y[ofs+1] & ~y[ofs+2] &  y[ofs+3];
   while(matches != 0) {                                    
    matched( ofs*32 + numberOfTrailingZeros(v) );           
    matches &= v-1;                                         
  }                                                         
}                                                           
Slide 57 of ℵ0
24.3x
1950 M/s
Slide 58 of ℵ0
more magic numbers
Slide 59 of ℵ0
int evilBitPos(x) {                         
  int bit = x & -x;                         
  return table[ (bit * 2510000032) >>> 27 ];
}                                           
Slide 60 of ℵ0
 10010101100110111000111110100000 
Slide 61 of ℵ0
(000) 10010101100110111000111110100000 
 ---  --     -----        -----        
                                       
all 5-bit substrings are different     
Slide 62 of ℵ0
31x
2500 M/s
Slide 63 of ℵ0
2G of RAM =
25M rows

10 ms

[Comment on this]

Posted November 13 2005
How to enumerate bits quickly using DeBruijn sequences.
GCC has some clever tricks up its sleeve for compiling the humble division operator.
Some simple bite-sized browser enhancements. Drag them to your bookmarks toolbar.
These little hacks were originally posted at the Medallia blog; since I no longer work there, I thought it would be more appropriate to keep them here.
More...
Nov 27 2008
15:18 ui\
15:18 wef
Nov 28 2008
22:08 cool, random art, right?
22:08 seen better, though
Nov 29 2008
19:35 wtf :)
Dec 1 2008
14:12 Kom til Trondheim da...
15:10 hi
Dec 3 2008
14:52 пп
Dec 4 2008
03:56 blabb
04:58 bleep
Dec 5 2008
06:10 yah!
Dec 11 2008
02:38 irc+code
Dec 12 2008
15:22 Not a search box, sorry. And the WAP IRC is gone forever.
Dec 28 2008
20:02 shit
Dec 31 2008
23:09 test - bb
Jan 3 2009
17:11 nice site

My CSS-fu is weak; please use a recent browser.

Some rights reserved.

Random, semi-related background image by IntangibleArts.