Optimization: Your Worst Enemy
Now there's a title to grab your attention! But I'm serious!
First, a little background on me: my PhD was one of the earliest on the automatic creation of optimizing compilers from formal machine descriptions ("Machine-Independent Generation of Optimal Local Code", CMU Computer Science Department, 1975). After my PhD, I spent three years at CMU as a senior researcher on the C.mmp multiprocessor computer system, which ran Hydra, our home-grown secure, capability-based operating system. I then went back to compiler research on the PQCC (Production Quality Compiler-Compiler) project, which ultimately led to the formation of Tartan Laboratories (later, just Tartan, and now absorbed by Texas Instruments), a compiler company, where I was part of the tooling group. I spent a decade and a half writing and using performance measurement tooling.
This essay has several parts and represents much of my own experience. The stories I tell are true. The names have not been changed, although a couple are carefully omitted.
A reasonably skilled programmer will not write a grossly inefficient program. At least not deliberately. Optimization is what you do when the performance is insufficient. Sometimes the optimizations are easy, sometimes they are hard. Sometimes they fly in the face of your original design, and sometimes they require that you grossly violate your beautiful abstractions in your class system. But always, and I repeat, always, my experience has been that no programmer has ever been able to predict or analyze where performance bottlenecks are without data. No matter where you think the time is going, you will be surprised to discover that it is going somewhere else.
You optimize because you have a problem in performance. Sometimes it is computational optimization: your bitmap manipulation is just too slow. Sometimes it is data access optimization: it just takes too long to get the data into the machine. And sometimes it is algorithmic optimization: you're doing it wrong. If you don't understand the difference between an n² sort and an n log n sort, you're probably already in trouble, but that knowledge alone is not useful.
Some years ago, I was working on a complex program which had to perform a semantic cross-check between the "statements" of the program and the "declarations" (it was actually a 4GL constraint equation system, but the details don't matter). I discovered that the operation was of n³ complexity (well, actually m × n², but m and n would be of comparable size most of the time). There are several paths you can follow at this point.
The only valid path to optimization is the engineering path. I measured the performance, and on the largest "real" example we had, I discovered that n was almost always 1, sometimes 2, rarely 3, and had exactly one instance of 4. This was too small to matter. Sure, the algorithm was n³, but with n that small, there was no need to rewrite the code. Rewriting the code would have been incredibly complex, delayed the whole project for a couple of weeks, and used up a couple of pointers in each node of the tree in an already-tight minicomputer address space.
I also wrote the storage allocator that everyone used. I spent a lot of hours tweaking its performance to be the fastest allocator of its class. These adventures are detailed in the book IDL: The Language and its Implementation (Nestor, Newcomer, Gianinni and Stone, Prentice-Hall, 1990), now, alas, out of print. One group that used it ran a "PC sampling" performance tool on Unix. This is a performance tool that samples where the program counter (PC) is executing, and over a long execution derives a "density histogram" of where the program is spending its time. It clearly showed that a massive amount of time was being spent in the storage allocator. This made no sense to me, but fingers were being pointed in my direction.
So I wrote a little hook in the allocator that counted the number of times it was called. It turns out it was called over 4,000,000 times. No call took longer than the minimum measurement interval of 10µs (approximately ten instructions on our 1-MIPS machine), but 40,000,000 microseconds is 40 seconds. Actually, it was more than that because there were 4,000,000 free operations as well, which were even faster, but still the resulting time was more than 50% of the total execution time.
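In modern C, such a hook amounts to nothing more than a pair of counters wrapped around the allocator entry points. The sketch below is mine, not the original code, and the names are hypothetical; it only shows the shape of the instrumentation:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical counting wrapper around the allocator: we can't time
       individual calls shorter than the 10us sampling interval, but we
       can certainly count them. */
    static unsigned long alloc_calls = 0;
    static unsigned long free_calls  = 0;

    void *counted_alloc(size_t size)
    {
        alloc_calls++;
        return malloc(size);
    }

    void counted_free(void *p)
    {
        free_calls++;
        free(p);
    }

    void report_alloc_counts(void)
    {
        /* Dumped at exit: 4,000,000 allocations at ~10us each is where
           the profiler's 40-plus seconds went. */
        printf("allocations: %lu, frees: %lu\n", alloc_calls, free_calls);
    }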
Why did this happen? Because, unknown to the programmers, a critical function they were calling in the inner loop of several algorithms would actually allocate a 5-to-10 byte working buffer, do its thing, and release it. When we changed this to be a 10-byte stack local, the time spent in the storage allocator dropped to about 3% of the total program time.
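Rendered as a C sketch with hypothetical names (the real library wasn't C, and the real function did something useful), the before-and-after of that fix looks like this:

    #include <stdlib.h>
    #include <string.h>

    /* Before: an inner-loop helper that allocates a tiny working buffer
       on every call -- millions of calls, millions of allocations. */
    size_t encode_slow(const char *item, char *out)
    {
        char *work = malloc(10);       /* 5-to-10 byte scratch buffer */
        size_t n = strlen(item) % 10;  /* stand-in for the real work */
        memcpy(work, item, n);
        memcpy(out, work, n);
        free(work);
        return n;
    }

    /* After: the same helper with a 10-byte stack local; the allocator
       drops out of the profile entirely. */
    size_t encode_fast(const char *item, char *out)
    {
        char work[10];                 /* scratch space on the stack */
        size_t n = strlen(item) % 10;
        memcpy(work, item, n);
        memcpy(out, work, n);
        return n;
    }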
Without the data, we would not have known why the allocator accounted for so much time. PC-sampling performance tools are a very weak class of tools and their results are intrinsically suspect. See my article "Profiling for Performance", in Dr. Dobb's Journal (18,1) January, 1993, pp.80-87.
A classic blunder in optimization was committed some years ago by one of the major software vendors. We had their first interactive timesharing system, and it was a "learning experience" in a variety of ways. One such experience was that of the FORTRAN compiler group. Now any compiler writer knows that the larger the hash table you use for a symbol table, the better your lookup performance will be. When you are writing a multipass compiler on a 32K mainframe, you end up using a relatively small symbol table, but you create a really, really good hash algorithm so that the probability of a hash collision is reduced. (Unlike a binary search, which is log n, a good hash table has constant performance up to a certain density, so as long as you keep the density below that threshold, you can typically expect a cost on the order of 1 or 2 probes to enter or look up a symbol, on the average. A perfect hash table, which is usually precomputed for constant symbols, has a constant performance between 1.0 and 1.3 or thereabouts; if it gets to 1.5 you rework the hashing to get it lower.)
So anyway, this compiler group discovers that they no longer have 32K, or 128K, or 512K. Instead, they now have a 4GB virtual address space. "Hey, let's use a really big hash table!" you can hear them saying, "Like, how about a 1MB table?" So they did. But they also had a very sophisticated compiler technology designed for small and relatively dense hash tables. So the result was that the symbols were fairly uniformly distributed over the 256 4K pages in that 1MB, which meant that essentially every symbol access caused a page fault. The compiler was a dog. When they finally went back to a 64K symbol table, they found that although the algorithm had poorer "absolute" performance from a purely algorithmic viewpoint (taking many more instructions to look up a symbol), it ran over an order of magnitude faster because it did not cause nearly as many page faults. So third-order effects do matter.
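A back-of-the-envelope sketch (my numbers, assuming 4-byte table entries and 4K pages) shows why the big table hurt: a hash that scatters symbols uniformly over a 1MB table can land on any of 256 pages on each probe, while a 64K table confines the same probes to 16 pages that quickly stay resident:

    #include <stdio.h>

    #define PAGE_SIZE 4096u

    /* Which 4K page does a given bucket land on?  (Hypothetical layout:
       a flat array of 4-byte entries.) */
    static unsigned page_of_bucket(unsigned bucket, unsigned entry_size)
    {
        return (bucket * entry_size) / PAGE_SIZE;
    }

    int main(void)
    {
        unsigned big   = 1u << 18;   /* 1MB / 4 bytes  = 262,144 buckets */
        unsigned small = 1u << 14;   /* 64KB / 4 bytes =  16,384 buckets */
        printf("1MB table spans  %u pages\n", page_of_bucket(big - 1, 4) + 1);
        printf("64KB table spans %u pages\n", page_of_bucket(small - 1, 4) + 1);
        return 0;   /* prints 256 and 16 */
    }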
Also, beware of C. No, not the speed of light. When we talk about performance, the algorithmic performance for n is expressed as a function C × f(n). Thus an n² algorithm is formally C × n², meaning that the running time is a constant multiple of the square of the number of elements being considered. We shorten this to O(n²), meaning "order of n²", and in common parlance just drop the "order of" designation. But never forget the C is there. Some years ago, I was doing a project that produced a summary set of listings, sorted in a variety of ways. In the first attempt (this was in the days before C and qsort) I just did an ordinary bubble sort, an O(n²) algorithm. After initial testing, I fed it some live data. Ten minutes later, after it had printed the console message "Starting reports", it had not yet produced any reports. A series of probes showed that most of the time was in the sort routine. OK, I was done in by my laziness. So I pulled out my trusty heapsort (an n log n algorithm) and spent an hour modifying it to work in my application (remember, I said qsort did not yet exist). Having solved the problem, I started running it again. Seven minutes into the report phase, nothing had yet appeared. Some probes revealed something significant: it was spending most of its time in the equivalent of strcmp, comparing the strings. While I'd fixed the O issue, I had seriously neglected the C issue. So what I did was sort the composite symbol table once, by name, and assign an integer ordinal to each symbol structure. Thereafter, when I had to sort a substructure, I just did an integer sort on the ordinal positions. This reduced C to the point where the entire report phase took less than 30 seconds. A second-order effect, but a significant one.
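Now that qsort does exist, the trick looks something like the sketch below (the types and names are hypothetical): pay the expensive string comparisons exactly once, over the whole table, and let every subsequent sort compare integers:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical symbol record: the ordinal is filled in once, after a
       single sort of the composite symbol table by name. */
    typedef struct {
        const char *name;
        int         ordinal;   /* position in the name-sorted master table */
    } Symbol;

    static int cmp_by_name(const void *a, const void *b)
    {
        const Symbol *sa = *(const Symbol *const *)a;
        const Symbol *sb = *(const Symbol *const *)b;
        return strcmp(sa->name, sb->name);   /* the expensive C: string compares */
    }

    static int cmp_by_ordinal(const void *a, const void *b)
    {
        const Symbol *sa = *(const Symbol *const *)a;
        const Symbol *sb = *(const Symbol *const *)b;
        return sa->ordinal - sb->ordinal;    /* the cheap C: one integer compare */
    }

    /* Pay the string-compare cost once, over the whole table... */
    void assign_ordinals(Symbol **table, size_t n)
    {
        size_t i;
        qsort(table, n, sizeof *table, cmp_by_name);
        for (i = 0; i < n; i++)
            table[i]->ordinal = (int)i;
    }

    /* ...then every per-report sort of a substructure is an integer sort. */
    void sort_subset(Symbol **subset, size_t n)
    {
        qsort(subset, n, sizeof *subset, cmp_by_ordinal);
    }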
So algorithmic performance, particularly paging performance, can matter. Unfortunately, we have neither proper tools for measuring where paging is hurting us nor tools for reorganizing code to minimize the paging of code pages.
Some performance tools measure only the time spent in user space, and treat kernel time as free. This can mask the impact the application has on kernel time. For example, a few years ago we were measuring a program whose performance was exceptionally poor. No "hotspot" showed up in terms of program time wasted. However, at one point I was looking at the trace data and noticed that the input routine was called about a million times, which is not surprising when you are reading a megabyte of data, but something seemed odd to me. I looked. Each time it was called, it called the kernel to read a single byte of the file! I changed it to read 8K bytes and unpack the buffer itself, and got a factor of 30 performance improvement! Note the lesson here: kernel time matters, and kernel transitions matter. It is not accidental that the GDI in NT 4.0 is no longer a user-level process but is integrated into the kernel; the kernel transitions dominated the performance.
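The shape of that change is the ordinary buffering idiom; a minimal sketch, assuming a POSIX read and a hypothetical Reader structure:

    #include <unistd.h>

    /* One kernel transition per 8K instead of one per byte: reading a
       megabyte costs ~128 read() calls instead of ~1,000,000. */
    #define BUF_SIZE 8192

    typedef struct {
        int     fd;
        char    buf[BUF_SIZE];
        ssize_t len;   /* bytes currently in buf */
        ssize_t pos;   /* next unread byte */
    } Reader;

    /* Returns the next byte, or -1 on end-of-file or error. */
    int reader_getc(Reader *r)
    {
        if (r->pos >= r->len) {
            r->len = read(r->fd, r->buf, BUF_SIZE);   /* the only kernel call */
            r->pos = 0;
            if (r->len <= 0)
                return -1;
        }
        return (unsigned char)r->buf[r->pos++];
    }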
So what to optimize is easy: those parts of the program that are consuming the time. But local optimizations that ignore global performance issues are meaningless. And first-order effects (total time spent in the allocator, for example), may not be the dominant effects. Measure, measure, measure.
(OK, this is really a sidebar, and a bit of an ego trip. You can skip to the next section if you don't want to read it. You've been warned).
Back in the early days of C, the C storage allocator was one of the worst-performing storage allocators in existence. It was "first fit", which meant that the way it worked was the allocator went traipsing down the free list, looking for a block at least as big as the one requested, and if it found one, it split it and returned the residue to the free list. This had the advantages of being as slow as possible and fragmenting memory as badly as possible. In fact, it was worse than you can imagine. It actually walked the list of all blocks of storage, free and allocated, and had to ignore the allocated blocks. So as you got more and more blocks, its performance degraded, and as the blocks got too small to be usable, they simply added to the overhead without adding to the value.
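A sketch of that first-fit walk (my rendering, with a hypothetical block header; the point is that the loop visits every block, allocated or not):

    #include <stddef.h>

    /* Hypothetical block header: every block, free or allocated, lives on
       one list, and allocation walks the whole thing. */
    typedef struct Block {
        size_t        size;
        int           in_use;
        struct Block *next;
    } Block;

    static Block *heap_head;   /* assumed to be set up elsewhere */

    void *first_fit_alloc(size_t want)
    {
        Block *b;
        for (b = heap_head; b != NULL; b = b->next) {  /* cost grows with total blocks */
            if (!b->in_use && b->size >= want) {
                /* A real allocator would split b here and return the residue
                   to the list; that splitting is what drives fragmentation. */
                b->in_use = 1;
                return (void *)(b + 1);
            }
        }
        return NULL;   /* nothing big enough */
    }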
I was at CMU on a one-year research contract. My first remark upon using the Unix environment was to turn to one of the people and say "How can you live this way?" The software technology in 1990 was exactly what it had been a decade previous when I left CMU, except that in the modern case the compiler didn't work (it generated incorrect code for simple C constructs), the debugger didn't work, the backtrace (being entirely a list of hex addresses with no symbols) was useless, the linker didn't work, and there wasn't anything resembling a decent document production system. Other than these minor defects, I suppose it was an OK environment. Having used Microsoft C, with CodeView, and even the earliest Visual C environment, I had certain high standards of what I expected out of my tooling, and Unix (at least in that era) fell short. By miles. I was, of course, outspoken on this subject.
One day we were discussing an algorithm, and it required doing some storage allocation. I was assured that this was unacceptable because storage allocation was very expensive. I said something like "Well, of course, if you use the brain-dead Unix allocator, you're bound to have performance problems. A decent storage allocator makes this a non-issue". One of the people at the meeting immediately challenged me: "I'm sick of hearing you put down Unix. What do you know about storage allocators, anyway?". I said "hold that thought, I'll be right back". I went to my office, where I had a copy of the IDL book, brought it back, and held it open to the chapter labeled "Storage allocation". "See this?" "Yes". "What is the title of this chapter?" "Storage allocation". I closed the book and pointed to the cover. "Recognize this name?" "Yes, of course, that's you". "Fine. I wrote that chapter. It is on how to write a high-performance, minimum-fragmentation storage allocator. So you asked what I know about storage allocation. Well, I wrote the book on it."
I was never again challenged when I put down Unix.
Just as an aside, the allocators used in NT work very much like the algorithm I described in the IDL book, which is based on the "quickfit" work that Chuck Weinstock did for his PhD dissertation at CMU around 1974.
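To give the flavor of quick-fit (this is only a sketch with made-up sizes and names, not the IDL-book code and certainly not the NT heap): keep an exact-size free list for each of the common small request sizes, so the usual allocate and free are a single list push or pop, and fall back to a general allocator for everything else:

    #include <stddef.h>
    #include <stdlib.h>

    #define GRANULE  8
    #define NUM_BINS 32   /* exact-size bins for small requests */

    typedef struct FreeNode { struct FreeNode *next; } FreeNode;
    static FreeNode *bins[NUM_BINS];

    static size_t bin_index(size_t size) { return (size + GRANULE - 1) / GRANULE; }

    void *quick_alloc(size_t size)
    {
        size_t i = bin_index(size);
        if (i < NUM_BINS && bins[i] != NULL) {   /* fast path: pop the bin */
            FreeNode *n = bins[i];
            bins[i] = n->next;
            return n;
        }
        return malloc(i < NUM_BINS ? i * GRANULE : size);   /* slow path */
    }

    void quick_free(void *p, size_t size)
    {
        size_t i = bin_index(size);
        if (i < NUM_BINS) {                      /* fast path: push onto the bin */
            FreeNode *n = p;
            n->next = bins[i];
            bins[i] = n;
            return;
        }
        free(p);
    }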
Do not do clever optimizations that have no meaning. A favorite example is people who try to "optimize" the GUI: hardwired constants, distributed enabling, cute algorithms. The result is something that is hard to develop, difficult to debug, and absolutely impossible to maintain.
Optimization is meaningless here. For more details, you might want to read my essay on the only sensible way I've found to manage dialog controls. I'll summarize the key idea here.
Why doesn't efficiency matter when you're updating menus or controls? Look at the human factors. A mouse is held approximately 2 feet from the ear. Sound travels at approximately 1100 ft/sec. This means that it takes approximately 2ms for the sound of the mouse click or keystroke to reach the ear. The neural path from the fingertip to the brain of an adult is approximately 3 feet. Propagation of nerve impulses is approximately 300 ft/sec, meaning the sensation of the mouse click or keystroke takes approximately 10ms to reach the brain. Perceptual delay in the brain can add between 50 and 250ms more.
Now, how many Pentium instructions can you execute in 2ms, 10ms, or 100ms? In 2ms, on a 500MHz machine that's 1,000,000 clock cycles, so you can execute a lot of instructions in that time. Even on a now-clunky 120MHz Pentium there is no noticeable delay in handling the controls.
This did not stop Microsoft from totally violating the C++ object model on message handlers; if you call CWnd::OnWhatever(...), instead of actually calling DefWindowProc with the parameters you provide, they re-use the parameters of the last message to call ::DefWindowProc. The goal is to "reduce the size of the MFC runtime", as if a few hundred instructions more in a massive DLL would matter! Even I can figure out how to get CWnd::OnAnything to inline-expand to a call on DefWindowProc.
Many years ago, as I indicated in the introduction, I worked on a large (16-processor) multiprocessor system. We were using custom-modified PDP-11 minicomputers, which were relatively slow. We were programming them in Bliss-11, which as far as I've been able to tell still holds the record for the world's tightest optimizing compiler (although I've seen some quite impressive optimizations in Microsoft C/C++). After doing some performance measurement, we determined that the paging algorithm was the outstanding performance bottleneck. So our first assumption was that the paging algorithm was faulty. We examined the code, and the paging algorithm maintainer rewrote it, taking our performance data into account, and had a new, and much faster, page management algorithm installed within a week.
Meanwhile, up at MIT, MULTICS was still running. They traced a serious performance problem to the paging system. The paging code was written in EPL, a PL/1-like language, and the assumption was that because it was written in a high-level language the code was suboptimal, so they launched an effort to rewrite the page management algorithm in assembly code. A year later, the code was working, and was put into the production system. Performance dropped 5%. Upon inspection, it was found that the fundamental algorithm was at fault. They took the EPL code, rewrote the algorithm, and had the improved algorithm working and installed in a few weeks. The lesson: don't optimize something that is not the problem. Understand the problem first. Then, and only then, do the optimization. Otherwise, the optimization is a waste of time and may even make the performance worse.
In the Bliss compiler, the 'register' attribute on a variable said to the compiler "You will assign this variable to a register". In C, it means "I'd like you to assign this variable to a register". Many programmers decided that they should put certain variables in registers to get better code. But the Bliss compiler was very good; it had a very sophisticated register allocation system, and in the absence of direction from the programmer it felt free to assign a variable to a register if that produced the best code. Explicitly assigning a variable to a register meant that the register was not available for common subexpressions, particularly those subexpressions implicit in data structure access. After a fair number of experiments, we determined that adding 'register' attributes to declarations almost invariably produced significantly worse code than letting the compiler do the assignment. For inner-loop code, many hours of effort would usually result in a small improvement in performance, but it was generally understood by the programmers that unless you studied the generated machine code and did a number of calibrated experiments, any attempt at optimization would make the code worse.
If you've heard of SPECmark performance figures, you may also be aware of how they are gamed. In particular, IBM wrote a program that took a FORTRAN program as input, such as the SPEC matrix-multiply benchmark, and produced another FORTRAN program as output, one optimized for the cache architecture of the machine it would run on. A small number of parameters described all the cache strategies of each of the models of the RISC 6000 product line. The "optimized" original FORTRAN program performed on one machine at 45 SPECmarks. After being transformed, it performed on the same machine at over 900 SPECmarks. That's a factor-of-20 performance improvement based solely on fourth-order effects: cache line hits. If you're doing image processing, particularly of large images, an awareness of cache issues (even relatively machine-independent ones) can buy you an order of magnitude in performance.
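The classic cache-conscious transformation of matrix multiply is loop tiling. The sketch below is mine, not IBM's tool, and assumes a dimension that divides evenly by the block size; the arithmetic and the O(n³) operation count are identical, only the memory access pattern changes:

    #define N     512   /* matrix dimension (hypothetical) */
    #define BLOCK 64    /* tile size: the one cache-dependent parameter here */

    /* Straightforward triple loop: strides through B column-wise and, for
       large N, misses the cache on nearly every inner iteration. */
    void matmul_naive(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    /* Tiled version: works on BLOCK x BLOCK tiles that fit in cache, so each
       tile of B is reused many times before being evicted. */
    void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        for (int ii = 0; ii < N; ii += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int jj = 0; jj < N; jj += BLOCK)
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++)
                            for (int j = jj; j < jj + BLOCK; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }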
A naive approach, optimizing at the code-line level, is not nearly as effective as higher-order optimizations. Paging optimizations, cache line optimizations, and memory allocation optimizations can often have vastly more significant effects than line-level optimization. Algorithmic optimizations are the next best bet, particularly if your problem is not amenable to paging/cache optimizations. Only after all of these have been done, and, of course, you have measured everything in sight, does it make sense to start doing line-level optimizations. And if your problem domain demands it, it can even make sense to recode the inner loops, particularly of such algorithms as convolution and DSP algorithms, in assembly code, most often to take advantage of instructions such as the MMX and streaming-media instructions.
Perhaps the best example of pure programmer stupidity in "optimizing" code occurred when I was porting a large library we used for our research project. Think of it as a 16-bit-to-32-bit port (it was actually an 18-bit-to-36-bit port, and the language wasn't C, but the details don't matter--you can write ghastly code in any language, and I've seen C programmers do things just as badly). The port mostly worked, but we had a really strange problem that showed up only under rare conditions and crashed the program using the library. I started looking. The heap was damaged. When I tracked down how the heap was being damaged, it turned out to be a bad pointer that allowed a store into a random place in the heap. OK, how did that pointer get bad? Four levels of storage damage down, and after 12 solid hours of debugging, I found the real culprit. But why did it fail? Another 5 hours, and I found that the programmer who wrote the constructor code for the data structure had the struct equivalent of something like {char * p1; char * p2;}, where the pointers had been 16-bit and we now used 32-bit pointers. In looking at the initialization code, instead of seeing something like something->p1 = NULL; something->p2 = NULL;, I found the moral equivalent of *(DWORD*)&something->p1 = 0! When I confronted the programmer, he justified it by explaining that it had let him zero out two pointers with only a single doubleword store instruction (it wasn't an x86 computer, but a mainframe), and wasn't that a clever optimization? Of course, when the pointers became 32-bit pointers, this optimization only zeroed one of the two pointers, leaving the other either NULL (most of the time), or, occasionally, pointing into the heap in an eventually destructive manner. I pointed out that this optimization happened once, at object creation time; that the average application using our library created perhaps six of these objects; and that according to the previous day's accounting data I'd just spent not only 17 hours of my own time but 6 hours of CPU time, so that if we fixed the bug and ran the formerly-failing program continuously, restarting it the instant it finished, for fourteen years, the time saved by his clever hack would just about break even with the CPU time required to find and fix this piece of gratuitous nonsense. Several years later he was still doing tricks like this; some people never learn.
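Rendered in C for concreteness (the original wasn't C and wasn't running on anything with a DWORD, so this is only the moral equivalent): with 16-bit pointers the two fields happened to fit in one 32-bit word, so one store zeroed both; with 32-bit pointers the same store zeroes only the first field:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        char *p1;
        char *p2;
    } Pair;

    void init_pair_clever(Pair *something)
    {
        *(uint32_t *)&something->p1 = 0;   /* the "optimization": undefined behavior,
                                              and wrong the moment sizeof(char*) > 4 */
    }

    void init_pair_correct(Pair *something)
    {
        something->p1 = NULL;   /* two ordinary stores; any decent compiler will
                                   schedule or merge them anyway */
        something->p2 = NULL;
    }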
Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you'll probably optimize the wrong thing.
The result will be obscure, hard to write, hard to debug, and hard to maintain code that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and software maintenance costs, and (b) having no performance effect at all.
Hard to beat that combination! Now do you understand what I meant in the title?
The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.