My go-to 'make code go faster' ideas

Nov 09, 2023

development

I’m not an optimization guru by any means. It’s never been something I’ve been allowed to focus on at work, sadly. At some jobs, performance is secondary to correctness and robustness, and at others, it’s secondary to flashy features.

But I’ve used the following tricks in hot loops:

Dimension reduction (esp via convolution)
Branchless calculation
SIMD

SIMD gets a lot of love, but it’s a constant-factor improvement and can be tough to coax out of the compiler (unless you use a library).

Branchless calculation is getting less relevant as a “trick”, since compilers will often turn a simple branch into a conditional move on their own. A deliberately simple example: say you have to do this:

// Example 1
if (x>=0) {
    return x*y;
} else {
    return x*b;
}

Then you can speed it up by doing this:

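// Example 2
int c = (x >= 0);              // 1 when x >= 0, else 0 (same condition as Example 1)
return x * (c*y + (1 - c)*b);  // either x*y or x*b
// Fewer multiplies: x * (c*(y - b) + b). Exact for integers (barring overflow);
// with floating point, (y - b) + b can round differently from y.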

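And if you can combine this with unrolling and SIMD, then you’re really well off. The second example works just as well coefficient-wise on Eigen vectors of length 1000 (go through Eigen’s array interface, e.g. Eigen::ArrayXd or .array(), because * on an Eigen::VectorXd is a matrix product), and now you’re really cranking out the speed.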

Something like this:

// Example 3
for (int i = 0; i < N; ++i)
{
  if (x[i] >= 0) {
      out[i] = x[i]*y[i];
  } else {
      out[i] = x[i]*b[i];
  }
}

Becomes trivial, and the loop body looks exactly like Example 2.
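Here is a minimal sketch of that rewrite using plain std::vector<double> instead of Eigen, so it compiles with nothing but the standard library. The function names scale_branchy and scale_branchless are made up for illustration. It uses the c*y + (1-c)*b form because that one stays exact for floating point when c is 0 or 1 (ignoring infinities and NaNs). Whether the branchless loop actually gets vectorized depends on your compiler and flags, so treat it as something to measure, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: the branchy loop from Example 3.
std::vector<double> scale_branchy(const std::vector<double>& x,
                                  const std::vector<double>& y,
                                  const std::vector<double>& b) {
    assert(x.size() == y.size() && x.size() == b.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] >= 0) {
            out[i] = x[i] * y[i];
        } else {
            out[i] = x[i] * b[i];
        }
    }
    return out;
}

// Branchless version: the Example 2 expression applied per element.
std::vector<double> scale_branchless(const std::vector<double>& x,
                                     const std::vector<double>& y,
                                     const std::vector<double>& b) {
    assert(x.size() == y.size() && x.size() == b.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // c is 1.0 when x[i] >= 0 and 0.0 otherwise.
        const double c = static_cast<double>(x[i] >= 0);
        // Selects y[i] or b[i] arithmetically instead of with an if/else.
        out[i] = x[i] * (c * y[i] + (1.0 - c) * b[i]);
    }
    return out;
}
```

Calling both functions on the same three equally sized vectors should give identical results, which makes the branchy version a handy check while you benchmark the branchless one.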

But even more fundamentally, those usually need to come after:

Using the right data structures
Using the right algorithms
Reframing the problem for speed, e.g., giving up a little to go a lot faster
Memoization/caching/pre-calculating

These can usually get me 80% of the way there. When they all come together, you’re staring at your machine wondering how it is done already.
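As one illustration of the pre-calculating item in the list above, here is a small sketch. The workload (gamma-correcting 8-bit pixel values) and the names gamma_direct, make_gamma_table, and apply_gamma are all hypothetical, picked only because the input has just 256 possible values. The idea is to pay for the expensive call once per possible input instead of once per element.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct version: one std::pow call per pixel value.
std::uint8_t gamma_direct(std::uint8_t v, double gamma) {
    const double scaled = 255.0 * std::pow(v / 255.0, gamma);
    return static_cast<std::uint8_t>(std::lround(scaled));
}

// Pre-calculate the result for all 256 possible inputs, once.
std::array<std::uint8_t, 256> make_gamma_table(double gamma) {
    assert(gamma > 0.0);
    std::array<std::uint8_t, 256> table{};
    for (std::size_t v = 0; v < table.size(); ++v) {
        table[v] = gamma_direct(static_cast<std::uint8_t>(v), gamma);
    }
    return table;
}

// Hot loop: a table lookup per pixel instead of a pow call.
void apply_gamma(std::vector<std::uint8_t>& pixels,
                 const std::array<std::uint8_t, 256>& table) {
    for (auto& p : pixels) {
        p = table[p];
    }
}
```

You would build the table once with make_gamma_table(2.2), then pass it to apply_gamma for every image. This only pays off when the input domain is small enough to tabulate and the loop runs over many more elements than the table has entries.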

I rarely see great returns from multi-threading or multi-processing, unless you’re doing rote calculations on a large dataset in a batch.
