I’m not an optimization guru by any means. It’s never been something I’ve been allowed to focus on at work, sadly. At some jobs, performance is secondary to correctness and robustness, and at others, it’s secondary to flashy features.
But I’ve used the following tricks in hot loops:
- Dimension reduction (esp via convolution)
- Branchless calculation
- SIMD
SIMD gets a lot of love, but it’s a constant-factor improvement and can be tough to coax out of the compiler (unless you use a library).
Branchless calculation is getting less relevant as a “trick”. A really dumb example is if you have to do this:
// Example 1
if (x >= 0) {
    return x*y;
} else {
    return x*b;
}
Then you can speed it up by doing this:
// Example 2:
int c = (x >= 0);          // 1 when x is non-negative, 0 otherwise (same test as Example 1)
return x*(c*y + (1-c)*b);  // either x*y or x*b
// better: x*(c*(y-b) + b);
And if you can combine this with unrolling and SIMD, then you’re really well off. The second example works just as well if those are Eigen::VectorXd of length 1000 (using element-wise operations), and now you’re really cranking out the speed.
Something like this:
// Example 3
for (int i = 0; i < N; ++i)
{
    if (x[i] >= 0) {
        out[i] = x[i]*y[i];
    } else {
        out[i] = x[i]*b[i];
    }
}
This becomes trivial, and looks exactly like Example 2.
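As a rough sketch of that rewrite, here is the loop from Example 3 with and without the branch. It assumes plain std::vector<double> inputs of equal length rather than Eigen types, and the function names select_mul_branchy and select_mul_branchless are made up for illustration. It uses the c*y + (1-c)*b form, because with floating-point values the c*(y-b) + b form can round differently from the branchy result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version with the branch, as in Example 3.
// (Hypothetical helper name, not from any library.)
std::vector<double> select_mul_branchy(const std::vector<double>& x,
                                       const std::vector<double>& y,
                                       const std::vector<double>& b) {
    assert(x.size() == y.size() && x.size() == b.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] >= 0) {
            out[i] = x[i] * y[i];
        } else {
            out[i] = x[i] * b[i];
        }
    }
    return out;
}

// Branchless version: the comparison becomes a 0.0/1.0 weight.
// Assumes finite inputs; 0 * infinity would produce NaN here.
std::vector<double> select_mul_branchless(const std::vector<double>& x,
                                          const std::vector<double>& y,
                                          const std::vector<double>& b) {
    assert(x.size() == y.size() && x.size() == b.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double c = static_cast<double>(x[i] >= 0);  // 1.0 or 0.0
        out[i] = x[i] * (c * y[i] + (1.0 - c) * b[i]);    // x*y or x*b
    }
    return out;
}
```

Whether the second version is actually faster depends on the compiler and the target. Compilers can often turn the branchy loop into a select on their own, so it is worth measuring both.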
But even more fundamentally, those tricks usually need to come after:
- Using the right data structures
- Using the right algorithms
- Reframing the problem for speed, e.g., giving up a little to go a lot faster
- Memoization/caching/pre-calculating
These can usually get me 80% of the way there. When they all come together, you’re staring at your machine wondering how it is done already.
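To make the memoization/caching item concrete, here is a minimal sketch. The Collatz step count is only a hypothetical stand-in for an expensive pure function, and the names collatz_steps_uncached and CollatzCache are invented for this example. The cached version stores every intermediate result, so later calls that hit an already-seen value return immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Stand-in for an expensive pure function: number of Collatz steps
// needed to reach 1. Requires n >= 1.
std::uint64_t collatz_steps_uncached(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Memoized version: each value's step count is computed once and reused.
class CollatzCache {
public:
    std::uint64_t steps(std::uint64_t n) {
        assert(n >= 1);
        if (n == 1) return 0;
        auto it = cache_.find(n);
        if (it != cache_.end()) return it->second;  // cache hit
        const std::uint64_t next = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        const std::uint64_t result = 1 + steps(next);
        cache_.emplace(n, result);                  // remember for next time
        return result;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> cache_;
};
```

The trade is memory for time, and it only pays off when the same inputs recur often enough to beat the cost of the hash lookups.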
I rarely see great returns from multi-threading or multi-processing, unless you’re doing rote calculations on a large dataset in a batch.