jlebar

Here's a technique I came up with for busy-waiting in a "nice" way. I'm probably not the first person to think of this, but this was a new idea to me.

Suppose you have a multicore machine and you're running a multithreaded program. You have small critical sections, and you find that busy-waiting on mutexes gives you a speedup over blocking. In pseudocode, your busy-wait loop might look like this.

void busy_wait(mutex& mu) { while (true) { if (mu.try_lock()) { return; // Acquired the lock! } } }

But naive busy-waiting is only beneficial when the load on your system is low. If anything running on the machine (be it your program or a different one) has "useful" work to do, you don't want to waste cores spinning.

You could mitigate this somewhat by inserting a call to sched_yield() inside the busy-wait loop, like so.

void busy_wait_with_yield(mutex& mu) { while (true) { if (mu.try_lock()) { return; } sched_yield(); // Give other threads a chance to run. } }

The sched_yled() tells the operating system to switch to another thread, if one is runnable. It also moves the yielding thread to the end of the run queue, so all other threads will have a chance to run before it runs next. And it's relatively cheap in the case that there's no other thread to switch to.

But this still isn't ideal, for a few reasons.

Suppose you have lots of busy-waiting threads and one thread with "useful" work to do. Every time the useful thread's timeslice finishes, we have to switch to each of the busy-waiting threads, only to find that they're going to yield.
Now suppose the mutex one of the busy-waiting threads is waiting on becomes available. We'd want the OS to switch to that thread immediately, but it can't! We have to cycle through all the threads stuck in busy-loops until we happen to stumble on the one that can actually make progress.

The approach I was familiar with for solving these problems is to busy-wait for some amount of time. If we haven't acquired the mutex by the end of this period of time, we block.

void busy_wait_then_block(mutex& mu) { let start = now(); while (now - start < THRESHOLD) { if (mu.try_lock()) { return; } } mu.lock(); // If all else fails, block. }

But this still has some problems.

If I have CPU cores available with nothing better to do, maybe I do want to busy-wait indefinitely.
I have to choose a threshold. As I remember Dawson Engler saying in class one day (paraphrase):

Whenver you see a paper with a magic number, you should ask yourself, how hard is it to choose that number, and how bad is it if you choose wrong? The answers are usually, "pretty hard" and "pretty bad".

So now comes my idea. Instead of busy-waiting for a set period of time, we busy-wait until sched_yield() is "slow". This indicates that the OS had another thread to run. If so, we immediately block. It looks like this.

void block_if_yield_is_slow(mutex& mu) { while (true) { if (mu.try_lock()) { return; } let start = now(); sched_yield(); if (now() - start > SLOW_YIELD_THRESHOLD) { break; } } mu.lock(); // If all else fails, block. }

This approach worked well for me on a workload of mine with ~64 threads. When I ran on a 64-core machine, it was almost as fast as pure busy-waits with no sched_yield()'s. But whereas the pure busy-wait code was very slow on my 12-core laptop, block_if_yield_is_slow was just as fast as if I had no busy-waits!

To address some possible objections:

This still has a threshold, but it's easy to set. You just want something larger than the cost of a nop sched_yield() and smaller than the scheduler's timeslice. I found that anything between 50μs and 1ms worked fine.
Reading the current time is very fast, it's usually not even a proper syscall. So this doesn't add significant overhead beyond the call to sched_yield().

jlebar

Scheduler-aware busywaiting