I recently started watching Harvard’s OpenCourseWare Performance Engineering of Software Systems class. The course was taught by Saman Amarasinghe, and Charles Leiserson, which should make it a worth going through. The second lecture was on bit hacks and here are my notes from the lecture. My notes are pretty sparse but feel free watch the lecture and check out Sean Anderson’s Bit Twiddling hacks at http://graphics.stanford.edu/~seander/bithacks.html

[youtube:https://www.youtube.com/watch?feature=player_embedded&v=xc9DDSbf0NQ%5D**Value Swapping**

x = x ^ y y = y ^ x x = x ^ y

- XOR is it’s own inverse
- Poor performance compared to the swapping with a temporary symbol.

**Minimum of two integers( x and y)**

Common inefficient method

r = (x < y) ? x : y

- This is slow, the compiler may not be able to avoid the unpredictable branch.

r = y ^ ((x ^ y) & -(x < y))

- Finds the minimum value without a branch.

**Modular Addition**

Common inefficient method

(x + y) % n

- Expensive, unless by a power of 2

Or

r = x + y r = (z<n) ? z : z-n

- Expensive, unpredictable branch

Better yet,

z = x + y r = z - (n & -(z >= n))

**Round up to a power of 2**

--n n |= n >> 1 n |= n >> 2 n |= n >> 4 n |= n >> 8 n |= n >> 16 n |= n >> 32 ++n

**Least significant bit**

r = x & (-x)

**DeBrijn sequence**

const unit65_t deBrijn = 0x022fdd63cc953986d const unsigned int convert[64] = [ 0, 1 ,2, 53, 3, 7, 54, 27, 4, 38, 41, 8, 34, 55, 48, 28, 62, 5, 39, 46, 44, 42, 22, 9, 24, 35, 59, 56, 49, 18, 29, 11, 63, 52, 6, 26, 37, 40, 33, 47, 61, 45, 43, 21, 23, 58, 17, 10, 51, 25, 36, 32, 60, 20, 57, 16, 50, 31, 19, 15, 30, 14, 13, 12 ] r = convert[(x*deBrujin) >> 58]

**Population Count**

Problem: Count the number of 1 bits in a word x

for (r=0; x!= 0; ++r) x &= x - 1;

- Repeatedly eliminate the least-significant bit
- Fast if the population count is small, but in worst case, the running time is proportional to the number of bits in the word.

Population Count II

static const int count[256] = {0,1,1,2,1,2,2,3,1,...,8}; // #1's in index for (r=0; x!=0; x>>=8) r += count[x & 0xFF];

- Memory operations are much more costly than register operations:

– Register: 1 cycle(6 ops issued per cycle per core)

– L1-cache: 4 cycles (per 64-byte cache line)

– L2-cache: 10 cycles (per 64-byte cache line)

– L3-cache: 50 cycles (per 64-byte cache line)

– DRAM: 150 cycles (per 64-byte cache line)

Population Count III – Parallel divide and conquer

// Create masks B5 = (-1) ^ ((-1) << 32); B4 = B5 ^ (B5 << 16); B3 = B4 ^ (B4 << 8); B2 = B3 ^ (B3 << 4); B1 = B2 ^ (B2 << 2); B0 = B1 ^ (B1 << 1); // Compute popcount x = ((x >> 1) & B0) + (x & B0); x = ((x >> 2) & B1) + (x & B1); x = ((x >> 4) + x) & B2; x = ((x >> 8) + x) & B3; x = ((x >> 16) + x) & B4; x = ((x >> 32) + x) & B5;

**Queens Problem**

Problem: Place n queens on a n x n chessboard so that no queen attacks another, i.e, no two queens in any row, column, or diagonal.

Strategy: Try placing queens row by row. If you can’t place a queen in a row, backtrack.