The Best Synchronization Is No Synchronization: Alternatives to semaphores, mutexes, and CRITICAL_SECTIONs
This essay was prompted by a recent posting in the newsgroup which took someone to task for "using messages for synchronization", saying in effect this was a Bad Thing and that there were perfectly fine mechanisms for synchronization already in place. He cited timers, events, mutexes and semaphores.
I found this a bit odd because timers are not a synchronization mechanism. Events can be used in certain situations, but are not always the best choice. And while mutexes are certainly one of the two important choices for mutual exclusion synchronization, the far more efficient CRITICAL_SECTION was not mentioned. And while semaphores can be used to synchronize the use of countable resources, they are not really a mutual exclusion primitive.
So this essay is going to talk about two mechanisms I use for interthread synchronization: messages and I/O completion ports. A discussion of the tradeoffs between these is covered in a separate essay, so this is going to discuss the philosophy of why these are actually excellent choices for synchronization.
I have seen code of the form
volatile BOOL running; // TRUE to allow thread to run, FALSE to stop it
void CSomeClass::OnBnClickedStop() { ...set running to FALSE }
Now, the problem is how to set the running flag to FALSE.
The correct answer is
running = FALSE;
yet I have seen both of these alternatives:
EnterCriticalSection(&lockRunningFlag); running = FALSE; LeaveCriticalSection(&lockRunningFlag);
and even
InterlockedExchange((volatile long*)&running, FALSE);
The first form is merely grotesque. The second form follows the same error of reasoning, but does so much more efficiently. But in neither case does it make sense to provide an interlock.
An interlock is required only if two or more threads are modifying nonscalar values, or if two or more threads are modifying a scalar in a way that depends upon the previous value in the variable. In this case, the value is a single scalar value, and it is being modified by only one thread (the thread that sets it to FALSE), no matter what the previous value is. The other thread is only reading the value.
An operation like a++, where a can be modified by more than one thread, is not multiprocessor-safe, even if it is implemented in a single instruction such as inc. In that case, two or more threads would be attempting to modify the value at the same time, and synchronization is mandatory. But it is important to realize that the hardware implicitly provides atomic reads and writes of properly aligned 32-bit scalar values, so an operation that merely stores a value which does not depend on any previous state of the variable does not require synchronization.
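For the shared-increment case, the interlocked primitives are the right tool. Here is a minimal sketch of the contrast; it is my illustration (the counter variable is hypothetical), not code from the original discussion:

#include <windows.h>

volatile LONG counter = 0;              // shared among several threads

void UnsafeBump()
{
    counter++;                          // read, add, store: another thread can interleave between the steps
}

void SafeBump()
{
    InterlockedIncrement(&counter);     // atomic read-modify-write; multiprocessor-safe
}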
An additional property that comes into play here is restricted monotonic mutability. Even if two or more threads were writing the value, they would all write the same value (FALSE), they would write it no matter what the original value was, and they would never change the value to anything else again. So the value goes from one particular value to another particular value and only goes one way (monotonic).
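For completeness, here is the reading side of that pattern: a minimal sketch of the worker loop, assuming the flag is visible to the thread function and the thread was started with something like AfxBeginThread (my illustration, not the essay's code):

#include <windows.h>

volatile BOOL running = TRUE;           // TRUE to allow the thread to run, FALSE to stop it

UINT WorkerThread(LPVOID)               // worker thread procedure
{
    while(running)                      // this thread only reads the flag; it never writes it
    {
        // ...one unit of the computation...
    }
    return 0;                           // exits shortly after the GUI thread stores FALSE
}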
If you need to do synchronization, you have already added complexity to your code. If you did not need to add the synchronization, you have added either unnecessary complexity or gratuitous complexity. But if two threads are touching the same variable in a way that requires the previous value, or two threads are trying to modify a set of values in one or more objects in a way that requires that the entire set of modifications give the illusion of being a single atomic action, then it is clear that you cannot allow two threads to operate on the values without guaranteeing that there will not be inconsistencies caused by concurrent access to the values by other threads.
One of the problems with synchronization primitives is that you must remember to actually use them. If you fail to use them, your code will malfunction, doing anything from producing erroneous results, to corrupting data, to taking access faults or storage assertions caused by fatal memory damage. So first, you have to remember to use them.
One way to handle this is to never make the synchronization visible to the programmer using the object or objects. To do this, you can conceal the synchronization within a class, and make sure that the only access to the values is via methods that are public while the values themselves remain protected. But protected is not good enough. A thread in a subclass of the object could access the fields without using synchronization. So in this case, if you anticipate subclasses will be created, you would create protected methods to access the data, but declare the actual variables private so they cannot be accessed by any subclass.
class Something {
public:
    Something() { InitializeCriticalSection(&lock); }
    ~Something() { DeleteCriticalSection(&lock); }
    ... method1(...);
    ... method2(...);
protected:
    ... method3(...);
protected:
    CRITICAL_SECTION lock;
private:
    ... the data being operated upon under control of the lock
};
and generically,
... Something::method1(...)
{
    EnterCriticalSection(&lock);
    ...computation...
    LeaveCriticalSection(&lock);
}
This works well until something within the critical section calls some other method of a class that also does locking. Then we have the potential for deadlock, but because the locking is concealed in the class, we don't actually notice it.
There are other problems with synchronization primitives. One is that we should really set locks as part of a constructor for a locking primitive and release them when the object is destroyed; this means that if exceptions are thrown, we will not leave a lock hanging. The MFC primitives such as CCriticalSection were designed with this in mind, but they have been done so poorly that many programmers (myself included) refuse to have anything to do with them. In fact, the MFC primitives are so riddled with errors that they are virtually unusable, and should be avoided completely. For example, a CMutex will not indicate if there is a failure because of an abandoned mutex, a fundamentally flawed decision that makes it impossible to use CMutex for any serious programming.
Note that the above scenario works well only if the computation cannot throw an exception and makes no calls that involve additional synchronization primitives. In MFC, where all exceptions thrown by MFC are subclasses of CException, the exception case can be handled as
EnterCriticalSection(&lock);
try {
    ...computation...
}
catch(CException * e) {
    ... make sure all state is restored correctly
    LeaveCriticalSection(&lock);
    throw;
}
LeaveCriticalSection(&lock);
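The constructor/destructor discipline mentioned above can also be expressed with a small lock-holder class, which makes the try/catch unnecessary. The sketch below is my own illustration (the name Locker is hypothetical), not one of the MFC classes:

#include <windows.h>

class Locker {
public:
    explicit Locker(CRITICAL_SECTION & cs) : section(cs) { EnterCriticalSection(&section); }
    ~Locker() { LeaveCriticalSection(&section); }   // runs even if an exception unwinds the stack
private:
    CRITICAL_SECTION & section;
    Locker(const Locker &);                         // non-copyable
    Locker & operator=(const Locker &);
};

Used in the style of the earlier example:

... Something::method1(...)
{
    Locker hold(lock);                              // the lock is released on return or on throw
    ...computation...
}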
Theoreticians of concurrency have spent a fair amount of effort defining higher-level synchronization mechanisms such as monitors (Per Brinch Hansen, in Concurrent Pascal). The theory of monitors was elucidated in a seminal paper by C.A.R. (Tony) Hoare in 1974; the concept appeared in many languages thereafter, including work by Niklaus Wirth (Modula) and Butler Lampson (Mesa), and was adapted by Jean Ichbiah as the rendezvous concept of Ada. Any Google search will give thousands of citations to higher-level synchronization objects. So the notion of using well-structured synchronization mechanisms rather than something like mutexes follows Edsger Dijkstra's original paper ("Cooperating Sequential Processes", 1968) by only a few years.
Think of mutexes as being the "assembly code" of synchronization; we would like to have the "C++" or "C#" level of conceptual description of synchronization. So in general, synchronization primitives are the technique of last resort for synchronization.
Because we have so few high-level mechanisms available to us in C++/MFC, we largely have to build our own; the message-based protocols described below are what I use. (In my opinion, the MFC classes like CMutex are just syntactic saccharine [not even as good as syntactic sugar] on the primitive concepts of synchronization, and therefore would add nothing significant even if they had been correctly implemented; given how seriously broken they are, they actually have negative value in a program.)
This is not to say that the use of mutexes and critical sections is inherently bad; in restricted scenarios, there may be no choice. The key is not to create a design that requires them to be visible. Hiding them inside other classes helps, but even then there are risks.
Synchronization represents where two threads touch. If two threads touch, just like in a mechanical system, synchronization causes friction. Friction in mechanical systems generates heat and wastes energy. Synchronization primitives add overhead without actually adding any value to the computation; they only make it correct, but they are not in and of themselves useful computation. The overhead of calling the kernel, descheduling the thread, rescheduling the thread, and so on represents what is essentially useless computation insofar as the problem domain is concerned. In situations of high lock contention there are two downsides: the aforementioned overhead of doing the synchronization, and, on NUMA (Non-Uniform Memory Access) architectures, the additional cost that hammering on that lock will cause interprocessor bus saturation as well. (The AMD64 architecture is currently the most popular NUMA architecture around; I want to emphasize that theirs is not a bad implementation, because the problem is intrinsic to all NUMA architectures, but it is representative of what we are going to see in the near future.)
The starting point is a variant of the "apartment threading" model. In apartment threading, only the thread that creates an object is permitted to operate on that object. By imposing this restriction, the need for synchronization is defined out of existence.
The problem with this model is that all computation on the object must be done by the owner thread; the owner cannot delegate the work to secondary threads, because that would violate the requirement that the object be operated on only by the thread that created it.
Positive-handoff is a model that eliminates the restriction of having the object operated on only by the owner thread. Instead, the requirement for a positive-handoff protocol is that only one thread at a time can operate on the object. When a thread has completed whatever computation it wants to perform, it hands the object off to some other thread to do work, and it never again touches that object, unless it is handed back.
I use three different implementation mechanisms for positive-handoff, but all three are the same fundamental idea: put something into a queue and forget about it.
An object is allocated, typically on the heap. The pointer is passed around. Once the pointer is sent to a thread, that thread, and that thread alone, is permitted to access the object. The sender never again uses the pointer (in fact, it is common practice to set the pointer to NULL if it is not a local variable). Ultimately some thread is responsible for the deletion of the object. This model is most commonly used for messages carrying data from secondary threads to the main GUI thread, but it is not restricted to that use alone. It can be used for any interthread communication mechanism; the key is to design the threads to work with this protocol.
If the receiving thread has a window (typically the main GUI thread), then I use PostMessage to hand off the object. This works well for a low-bandwidth handoff (discussed below). If the receiving thread is a UI thread without any windows, I will use PostThreadMessage. This requires an active message pump, so the thread must not be blocked on any operation in such a way as to interfere with the handling of these messages. You should never use PostThreadMessage to post to the main GUI thread: such messages have no window associated with them, and they will be lost if a modal message loop (a dialog, a menu, a message box) is pumping messages.
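Here is a minimal sketch of the PostMessage form of the handoff. It is my illustration of the protocol, not code from the essay; the message UWM_NEW_DATA, the structure CData, and the handler OnNewData are hypothetical names, and CSomeClass is assumed to be the window class from the earlier fragment.

#include <afxwin.h>

#define UWM_NEW_DATA (WM_APP + 1)       // user-defined message carrying the pointer

struct CData {                          // the object being handed off
    CString text;
};

// Worker thread: allocate, hand off, forget.
void SendToGui(HWND hWnd)
{
    CData * p = new CData;
    p->text = _T("result of some computation");
    if(!::PostMessage(hWnd, UWM_NEW_DATA, 0, (LPARAM)p))
        delete p;                       // handoff failed; we still own the object
    p = NULL;                           // after a successful post, never touch it again
}

// Main GUI thread: the handler now owns the object and must delete it.
// Assumes a message map entry ON_MESSAGE(UWM_NEW_DATA, OnNewData) in CSomeClass.
LRESULT CSomeClass::OnNewData(WPARAM, LPARAM lParam)
{
    CData * p = (CData *)lParam;
    // ...use p->text, e.g. display it...
    delete p;                           // the receiver is responsible for deletion
    return 0;
}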
See my essay on worker threads for examples of using the PostMessage implementation of positive handoff.
The problem with PostMessage to the main GUI thread is that you can "saturate" the main GUI message pump and interfere with WM_TIMER and WM_PAINT messages (these messages are dispatched only if the message queue is otherwise empty, and heavy use of PostMessage means that the queue might never be empty). In addition, there is a nominal limit of 10,000 messages in the queue; after that, PostMessage will discard the message and return FALSE. Even if the 10,000 limit is not hit, a backlog of even a few hundred messages can mean that the user sees poor response to keyboard and mouse input, an unacceptable situation. To solve this, I use a hybrid of PostMessage and I/O Completion Ports (see my essay on this topic) to avoid message pump saturation.
The PostMessage/PostThreadMessage solution works well when a single thread can handle all the traffic. It does not work well if you want to delegate to one of many threads (using a thread pool). I often use an I/O Completion Port for this purpose.
The I/O Completion Port is one of the underappreciated features of Windows. Not only is it the absolutely best way to handle asynchronous I/O (much better than polling via GetOverlappedResult, far simpler than using callback I/O), but it is simply a nice kernel queuing mechanism. This allows me to create thread pools quite easily. I can have many threads waiting on the I/O Completion Port (which need not have any file handles associated with it) and thus build multi-producer/multi-consumer architectures. (For the low-level implementation of producer/consumer architectures, see my essay on semaphores.)
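As a sketch of the "kernel queuing mechanism" idea (my illustration; the work-item structure and the NULL-item shutdown convention are inventions for this example, not the essay's code):

#include <windows.h>

struct CWorkItem {                      // hypothetical unit of work handed to the pool
    int id;
};

// Any producer thread: enqueue a work item and forget it.
void Enqueue(HANDLE iocp, CWorkItem * item)
{
    ::PostQueuedCompletionStatus(iocp, 0, 0, (LPOVERLAPPED)item);
}

// Each pool thread: dequeue one item at a time; the dequeuing thread owns that item.
DWORD WINAPI PoolThread(LPVOID p)
{
    HANDLE iocp = (HANDLE)p;
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;
    while(::GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
    {
        CWorkItem * item = (CWorkItem *)ov;
        if(item == NULL)                // convention: a NULL item means "shut down"
            break;
        // ...do the work described by item...
        delete item;                    // this thread is responsible for deletion
    }
    return 0;
}

// The port is created with no file handle and shared by all the pool threads:
//     HANDLE iocp = ::CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);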
The central controller model is another variant of the apartment model. In this model, only the thread that owns the object is allowed to access it, but other threads wish to operate on the object. I use this model to implement both the input and output sides of a serial port: the serial port is created (possibly by the main GUI thread) and handed off to secondary threads; specifically, its "input" side is handed off to a worker thread and its "output" side is handed off to a UI thread. After this, the only way other threads can interact with the serial port is by communicating with those two threads. The input side is handled by the worker thread, which uses the positive-handoff protocol to send buffers back to the main GUI thread; the output side is handled by the UI thread, which receives requests via PostThreadMessage and writes them to the port. No other thread in the entire application will touch the serial port. One thread is the central controller for serial port input, and one thread is the central controller for serial port output.
Another common use of this model is to manage access to any shared resource. No thread ever touches the object itself; in order to effect any modification, or perform any query, it must queue up a request to the thread that owns the resource. That thread will handle the request and, if any notification is required, will enqueue a notification back to the requesting thread. Thus there is no synchronous call/return model; initiation of a transaction is enqueued to the owner thread, and notifications of completion are asynchronously posted to the requestor thread.
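A minimal sketch of this request/notification pattern, under the assumption that the owner is an MFC UI thread; the names CRequest, UWM_REQUEST, UWM_DONE, and CControllerThread are hypothetical, not from the essay:

#include <afxwin.h>

#define UWM_REQUEST (WM_APP + 10)       // posted to the controller (owner) thread
#define UWM_DONE    (WM_APP + 11)       // posted back to the requestor's window

struct CRequest {
    HWND notify;                        // window to receive the completion notification
    CString command;                    // what the requestor wants done
};

// Any thread: enqueue a request to the controller and forget it.
void AskController(CWinThread * controller, HWND notify, const CString & cmd)
{
    CRequest * req = new CRequest;
    req->notify = notify;
    req->command = cmd;
    if(!controller->PostThreadMessage(UWM_REQUEST, 0, (LPARAM)req))
        delete req;                     // controller not running; we still own the request
}

// In the controller, a UI thread that owns the resource. Assumes a message map
// entry ON_THREAD_MESSAGE(UWM_REQUEST, OnRequest) in CControllerThread.
void CControllerThread::OnRequest(WPARAM, LPARAM lParam)
{
    CRequest * req = (CRequest *)lParam;
    // ...operate on the owned resource according to req->command...
    ::PostMessage(req->notify, UWM_DONE, 0, 0);  // asynchronous completion notification
    delete req;
}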
One of the major failures in the architecture of concurrent systems is when the programmer is so locked into the synchronous call/return model that the asynchronous event-driven model is not even considered. The artifices they impose to try to force fundamentally asynchronous events into a synchronous model are most politely referred to as dead-end kludges.
The advantage of the central controller model is that you can in general "fire-and-forget" a request. You do not need to worry about synchronization, because there is none. Only the owner of the object is working on the object, and it dequeues only one request at a time, so there is no concurrency with respect to the object. No synchronization is necessary.
Actually, that is a small lie: there is synchronization going on. It takes two forms: queue synchronization and storage-allocator synchronization. Obviously, if several threads are doing PostMessage or PostThreadMessage, someone has to maintain the integrity of the thread message queue, and indeed there is low-level synchronization going on. But it is in the Post logic, and it is very fast; because so few instructions are required to enqueue a request, it most likely uses a spin lock, which has very low overhead.
The more serious lie deals with the synchronization in the storage allocator. In Positive Handoff, the storage allocator is called by every thread creating an object to be handed off, and by every thread that ultimately deletes such an object. This creates the possibility of serious collisions in the allocator. Note that this is still lower overhead than the cost of a mutex collision, because the allocator is very fast and is protected by a CRITICAL_SECTION; a CRITICAL_SECTION will act as a spin lock for a brief period before calling the kernel, and this is typically long enough to cover most allocation requests. But in situations in which performance is critical, this potential overhead might be significant; the truth is that there is an overhead.
It is easier to reason about synchronization when you use protocols like Positive Handoff and Central Controller than when you try to reason about mutexes and semaphores. Because there is no explicit synchronization, there is no possibility of deadlock. Because there is never a case in which two or more threads are actually touching an object, there is no possibility of forgetting to lock an object before it is used, or failing to unlock it when done. Programs developed using these techniques are significantly easier to write, debug, and maintain than programs written with mutexes, CRITICAL_SECTIONs, or semaphores.
The key is to abandon traditional sequential models of thinking about problems, and adopt asynchronous, multithread-distributed models of computing, in which there is no explicit synchronization. Note that classic call/return is a form of synchronized computing: you fire off a task to perform a computation (call) and wait until it terminates (the return). This model is a single-thread model; it does not support concurrency. As soon as you introduce concurrency, introducing artificial sequentiality either defeats the advantage of the concurrency or results in program structures that are so baroque that they cannot be reasoned about or effectively debugged. Assuming the unlikely situation in which they actually are debugged, the structure of the program is fragile, and the slightest changes will result in catastrophe.
Interthread messaging is a powerful and flexible mechanism for avoiding the need for explicit synchronization. It results in programs that are easier to write, easier to debug, easier to reason about, and easier to modify and extend than programs based on the use of low-level synchronization primitives.
The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.