Indexed information: creation and lookup


This essay is going to cover some non-obvious issues of representation of information.  These are techniques I have developed over many years, and mostly they involve "thinking outside the box" of traditional Computer Science training.

One question I like to torture students with is a classic design question.  I have to store 10,000 integers in a "collection".  There will be a lot of insertions, deletions, and lookups.  What sort of "collection" should I use?

Most students will rule out an array or vector representation.  Why?  Because insertion is slow, and deletion is slow, because of the necessary copies to create space or compress it.  A linked list is a better choice, because insertion and deletion take constant time.

This answer is wrong.

Why is it wrong?

Well, first, the assumption that insertion is constant time.  Insertion is constant time once you know where the insertion is happening.  But if the list is being maintained in sorted order, the expected time to compute the proper insertion point is O(n/2).  So even though the insertion time is constant (call it 1), the total cost of O(n/2)+1 is dominated by the n/2 factor once n gets to interesting sizes.
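To make this concrete, here is a minimal sketch of a sorted insertion into a singly-linked list (the names are illustrative, not library code); the splice at the end is the only constant-time part:

// Sketch only: a minimal sorted singly-linked list of ints.
// Node and insert_sorted are illustrative names, not from any library.
struct Node {
    int   value;
    Node *next;
};

// Insert 'value' keeping the list sorted.  The splice at the end is O(1),
// but the loop that finds the splice point walks, on average, n/2 cells --
// and each cell it touches is a potential cache miss or page fault.
Node *insert_sorted(Node *head, int value)
{
    Node *cell = new Node{value, nullptr};
    if (head == nullptr || value < head->value) {   // insert at the front
        cell->next = head;
        return cell;
    }
    Node *prev = head;
    while (prev->next != nullptr && prev->next->value < value)
        prev = prev->next;                          // the O(n/2) part
    cell->next = prev->next;                        // the O(1) part
    prev->next = cell;
    return head;
}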

Second, and this is really fundamental to all algorithmic analysis: these performance numbers are based on an assumption of constant memory access time.  This is rarely true in modern machines.  Between the multiple levels of cache, and the concept of page faults, memory access time is not constant and can vary by seven orders of magnitude.

Here are some rough performance numbers: an L1 cache hit costs a few clock cycles; an L2 or L3 access, tens of cycles; a reference that misses the caches entirely, hundreds of cycles; and a page fault, tens of millions of cycles.  Precise L3 figures for the newer Core-series processors are not included, because Intel refuses to divulge the behavior of the L3 cache, and the numbers vary depending on the particular model (which is why only ranges can be given).

So let's look at our original problem: we have a collection with lots of insertions, deletions, and lookups, and we choose a linked list to represent it.  After thousands of allocations and deallocations, it is entirely possible that each cell access causes a page fault.  So, on average, an O(n/2) search over 10,000 values can incur roughly 5,000 page faults which, at 20,000,000 to 40,000,000 cycles apiece, is on the order of 100,000,000,000 to 200,000,000,000 clock cycles per lookup.  This is a far cry from the "5,000" that O(n/2) would suggest.  Note also that even with no page faults at all, those 5,000 steps are not 5,000 CPU clock cycles but 5,000 cell-access events, each of which involves several instructions and therefore many more clock cycles.

The problem is that big-O, the O(f(n)) for some function f, is an oversimplification.  The real formula is

Ksetup + C × f(n) + Kteardown

and this, too, is an oversimplification, but one which is usually acceptable.  The value Ksetup is the "setup time"; the simplifying assumption is that it is constant, independent of the number of objects in the collection.  Kteardown is the time required to "tear down" the computation; it, too, is generally assumed to be constant and independent of the number of objects in the collection.  The reason these terms are normally ignored in the characterization of big-O is that they are not only constant, but very small relative to f(n), particularly for large values of n.  What is also ignored, but should not be, is C, the constant of proportionality.  This is the cost of each pass through whatever loop iterates over the collection: the loop-counter manipulation or test-for-NULL-pointer time, the cost of loading up whatever pointers are involved, the cost of each index computation, and the cost of the comparison and conditional branch that determines whether the comparison has succeeded or failed.  And here's the real catch: C can be very large!  The "5,000" represents how many times the loop runs, not the cost per iteration.

For example, if the comparison involves string comparisons, the cost of a string comparison is very high.  There is a real risk in ignoring C.
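To make the point concrete, here is a minimal sketch (illustrative names only) of two linear searches with identical big-O behavior but very different C:

// Two linear searches with identical big-O but very different C.
// 'table' and 'names' are hypothetical arrays used only for illustration.
#include <cstring>

int find_int(const int *table, int n, int key)
{
    for (int i = 0; i < n; i++)          // per iteration: an integer compare and a branch
        if (table[i] == key)
            return i;
    return -1;
}

int find_name(const char *const *names, int n, const char *key)
{
    for (int i = 0; i < n; i++)          // per iteration: a pointer dereference
        if (strcmp(names[i], key) == 0)  // plus a full string comparison
            return i;
    return -1;
}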

Now let's return to our original question: what's the best collection to use?

Here are some possible choices: an unsorted array, a sorted array, a linked list, or a balanced binary tree.

So which one should we use?

If we look at the problems of lookup, we discover that for an array, the average expected time is O(n/2), and for a linked list, the average expected time is O(n/2), but a balanced tree has an average expected lookup time of O(log2(n)).  However, we can change these figures by thinking more about the representation.

Consider the use of a sorted collection.  Now things have changed.  For an array, the average expected lookup time becomes O(log2(n)) (using binary search), which is comparable to a balanced tree and substantially faster than a linked list (which cannot be binary-searched even when kept sorted, because it has no random access).

Now let's consider the insertion time.  For an array, because we have to make space for it, we have to move (on the average) O(n/2) elements.  For a linked list, we have to spend O(n/2) finding the correct place to put the new element, even if the insertion time is a constant k (nominally, we call this O(1) and let k be absorbed into the implicitly dropped C).  The cost of an insertion in a balanced tree depends on how much rebalancing is required, but for most balanced trees is O(log2(n)).  Deletion costs for a balanced tree are typically comparable to insertion costs.

Again, however, these computations naively assume constant memory access time, which we now know is simply not true.

For example, consider 10,000 integers.  For purposes of this discussion, we will assume that all the integers are 32-bit, just to have a value we can use for actual computations.  An array as a collection would then occupy 40,000 bytes, or just under 10 4,096-byte pages (most pages in most systems are 4,096 bytes).  Now, copying the elements upward to make room involves copying integers in reverse order.  For example, to insert the value "37" in the array below, a space has to be created for it:

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
    1    2    7    9   11   23   31   42   48   56   73   99

Having determined the insertion point, between the 31 at [6] and the 42 at [7], we shift everything up one position:

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11] [12]
    1    2    7    9   11   23   31   42   42   48   56   73   99

The new element can then be inserted as shown:

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11] [12]
    1    2    7    9   11   23   31   37   42   48   56   73   99

Note in the above examples, I have assumed that there is space to do the shift-up.  If there is not, a new array must be allocated, of size n+1, and the n elements must be copied into it.  We will ignore this (rather important) problem for the moment.
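Putting the search and the shift together, a sorted-array insertion might be sketched as follows (illustrative only; it assumes the caller has already guaranteed space for one more element, per the caveat above):

// Sketch: insert 'value' into a sorted array of 'count' ints, assuming the
// caller guarantees capacity for at least count+1 elements (reallocation,
// as noted above, is ignored here).
#include <cstring>   // memmove

void insert_sorted(int *a, int count, int value)
{
    // Binary search for the insertion point: O(log2(n)) comparisons.
    int lo = 0, hi = count;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (a[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    // Shift everything from the insertion point up by one slot.  memmove
    // runs through memory sequentially, which is about as cache-friendly
    // as this copy can get.
    memmove(&a[lo + 1], &a[lo], (count - lo) * sizeof(int));
    a[lo] = value;
}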

But consider the impact of caching.  In this simplified example, we will assume that a cache line is 16 bytes, holding four of our 32-bit integers (cache lines on current x86 processors are actually 64 bytes, but the smaller size keeps the diagrams manageable).  Under these assumptions, when we copy element [11], we end up caching the values in positions [8], [9], [10] and [11].  So there is a constant memory-access cost to fetch the data into a cache line, shown below as C1.

Cache line C1:   48   56   73   99

Memory:
  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
    1    2    7    9   11   23   31   42   48   56   73   99

Now, when we go to copy [11] into [12], the cost is constant, one memory access (to load the L1 cache), and this goes into cache line C2.  But the copy of [10] to [11], [9] to [10], and [8] to [9], are all essentially zero cost.

Cache line C2:   99    ?    ?    ?
Cache line C1:   48   48   56   73

Memory:
  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
    1    2    7    9   11   23   31   42   48   56   73   99

When we go to copy [7] to [8], we incur another memory cycle to get the data into the L1 cache. Because of how caches are organized, this will almost certainly not reuse any of the cache lines previously used, but will go to a different cache line, C3, as shown below.

Cache line C3:   11   23   31   42
Cache line C2:   99    ?    ?    ?
Cache line C1:   42   48   56   73

Memory:
  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
    1    2    7    9   11   23   31   42   48   56   73   99

Writing the new value, 37, is also zero cost.  The data is written to the L1 cache, and from the L1 to the L2 cache, and this has no cost to the CPU.

Cache line C3:   11   23   31   37
Cache line C2:   99    ?    ?    ?
Cache line C1:   42   48   56   73

Memory:
  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
    1    2    7    9   11   23   31   42   48   56   73   99

The cache behavior has a big impact on the value of C.  Note, by the way, that the original values in memory have not changed; the changed values are in the L1/L2 cache.  They will not be written back until those cache lines are needed.  (In the case of multiprocessor access to memory, the complexities of cache coherency would constitute a complete essay in its own right, so we will ignore that problem here.  However, note that the x86 is cache-coherent and will not have errors caused by one core having a cached value that is not reflected to memory)

Note that we have reduced memory costs by a factor of 4 with no work at all!  The update, which would appear to cost 6 memory accesses, actually costs 3; and because the search for the insertion point probably already brought elements [5]..[8] into the cache, it is more like 2.  The search itself, which you might think would cost 7 accesses, likewise costs only about 3 because of caching.

And there's more good news.  Because there are only 10 pages involved, they are very likely already in the working set, and therefore there would be no page faults in accessing this array!  But if 10,000 pages were involved, as in the worst-case linked-list scenario, it is extremely unlikely that the working set would accommodate the entire set of elements, and therefore there would be a significant number of page faults just to locate the insertion point.

In general, if you have a problem that iterates heavily across arrays of data and you rewrite your algorithm to be cache-aware, you can gain an order of magnitude in performance.
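As an illustration (not from this essay's original examples), consider summing a large two-dimensional image buffer.  The only difference between the two functions below is loop order, yet one walks memory sequentially while the other strides across cache lines, and potentially pages, on every access:

// Illustration only: the same O(N*N) sum, but one loop order walks memory
// in address order (cache-friendly) and the other strides across it.
enum { N = 2048 };
static double image[N][N];     // hypothetical large image buffer

double sum_row_major(void)     // walks memory sequentially
{
    double sum = 0.0;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += image[row][col];
    return sum;
}

double sum_column_major(void)  // same arithmetic, dramatically worse locality
{
    double sum = 0.0;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += image[row][col];
    return sum;
}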

There are many other optimizations that can be done.  For example, if the array is an array of pointers-to-values, we can indicate that a cell is unused by putting a NULL value in it.  Thus, on a deletion, we do not need to actually "compress" the array; instead, we can leave the NULL value there.  This has another advantage: when we need to slide elements around to "make a hole" to put a new value in, we only need to locate the next-higher NULL value, and move up only the elements between the insertion point and the NULL value.  (In the integer case, if we know that a value like 0x80000000 is impossible, we can use this as a similar sentinel value).

Consider the following array of pointers-to-integers (yes, this is an oversimplified and rather contrived example, but follow along for a bit):

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
   *1   *2   *7   *9  *11  *23  *31  *42  *48  *56  *73  *99

Now suppose I have deleted element [8].  The naive solution is to compress the array.

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10]
   *1   *2   *7   *9  *11  *23  *31  *42  *56  *73  *99

Instead of doing a compression, I set the values as shown below:

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
   *1   *2   *7   *9  *11  *23  *31  *42 NULL  *56  *73  *99

This means that when I decide to insert the value "37" in the array, I only have to shift element [7] to [8] to make a hole at position [7]:

  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
   *1   *2   *7   *9  *11  *23  *31  *37  *42  *56  *73  *99

Thus, in a scenario in which there are a lot of insertions and deletions, insertions go down in cost significantly, because the deletions have left a lot of holes that will terminate the up-copy required to create a hole; in some cases, there may be a hole at exactly the place the new insertion should go.

Since the array is sorted, the expected search time to find an insertion point is O(log2(n)); for this 12-element example, somewhat under 4 comparisons.  A variant of bsearch would be used to locate the position for insertion, or to do a lookup; the reason a custom routine is needed is that the NULL values have to be skipped over.
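Here is a sketch of what such a NULL-tolerant bsearch variant might look like (the names are mine; this is illustrative, not library code):

// Sketch: lookup in a sorted array of pointers-to-int in which NULL marks a
// deleted slot.  Returns the index whose value equals key, or -1 if absent.
#include <cstddef>

int find_index(int *const *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int probe = mid;
        while (probe <= hi && a[probe] == NULL)   // skip over deletion holes
            probe++;
        if (probe > hi) {                         // everything in [mid..hi] was a hole
            hi = mid - 1;
            continue;
        }
        if (*a[probe] == key)
            return probe;
        if (*a[probe] < key)
            lo = probe + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

// Deletion is just "punch a hole": no compression, no copying.
void delete_at(int **a, int i)
{
    a[i] = NULL;
}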

Now, you say, those are pointers.  This means that to access the values, the pointer must be dereferenced, and this could cause a page fault.

You would be absolutely right.  This would be a bad choice.

However, consider that the value in question is not just a pointer-to-integer, but a pointer-to-struct where an integer field of the struct is what we are searching on.  Each element access then involves dereferencing a pointer to obtain that integer field, potentially causing a page fault.  But now consider a representation where we store a <pointer, integer> pair in the array.  We have, at this point, doubled the size of the array (this assumes sizeof(int) == sizeof(void*); with 64-bit pointers the array would only grow by a factor of 1.5), but the integers are now in the array itself.  If the pointer value is NULL, the meaning of the integer component is undefined.  Thus there are no pointer dereferences at all during the search; we dereference only when we need to actually use the data in some way.  Note that doubling the array only takes it from 10 pages to 20 pages (with 64-bit pointers, from 20 pages to 30 pages).  This is still a small number of pages, which will likely fit into the working set, so there will very likely be no page faults.
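A sketch of that representation, with illustrative names (Record and Entry are mine, not from any library), might look like this:

// Sketch of the <pointer, integer> pair representation.  The point is that
// the search key lives in the array itself, so the search never dereferences
// the object pointer.
struct Record {        // the "large" structure, living wherever it lives
    int id;
    // ... many other fields ...
};

struct Entry {
    Record *obj;       // NULL marks an unused slot
    int     key;       // copy of obj->id; meaningless when obj == NULL
};

// The search touches only the Entry array (a few contiguous pages);
// obj is dereferenced only after the right entry has been found.
Record *lookup(const Entry *a, int n, int key)
{
    for (int i = 0; i < n; i++)    // or the NULL-aware binary search shown earlier
        if (a[i].obj != NULL && a[i].key == key)
            return a[i].obj;
    return NULL;
}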

If we preallocate the size of the array, we do not get into the repeated grow-and-copy problem, where adding an element takes progressively more time because each time the array has to expand, the old contents have to be copied into the new array.  Alternatively, we can use an allocation strategy much like that of std::vector in the C++ Standard Library, where excess space is allocated on each copy, and a new allocation is not required until that extra space is used up.
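For completeness, here is a minimal sketch of such a multiplicative growth policy (illustrative only; error handling is omitted):

// Sketch of a std::vector-like growth policy: when the buffer is full, grow
// by a multiplicative factor so that copies stay rare and the amortized cost
// per insertion remains constant.
struct IntBuffer {
    int    *data = nullptr;
    size_t  count = 0;
    size_t  capacity = 0;
};

void push_back(IntBuffer &b, int value)
{
    if (b.count == b.capacity) {
        size_t newcap = (b.capacity == 0) ? 16 : b.capacity * 2;  // multiplicative growth
        int *bigger = new int[newcap];
        for (size_t i = 0; i < b.count; i++)    // the copy we are amortizing
            bigger[i] = b.data[i];
        delete[] b.data;
        b.data = bigger;
        b.capacity = newcap;
    }
    b.data[b.count++] = value;
}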

After a lot of insertions and deletions, my array implementation began to run with search time O(log2(n))  (as expected) but constant insertion and deletion time.

Reducing C

There are other tricks you can use.  For example, in one situation I found that the C of strcmp was the major performance bottleneck; in addition, the array involved dereferencing a pointer (thus potentially generating a page fault), and then dereferencing the string pointer could cause another page fault.  So I used a trick.  The nature of the problem was that subsets of the symbols had to be sorted in alphabetical order (for example, subkeys under a major key were displayed in alphabetical order).

I knew the entire set of strings that existed (they were all in a symbol table).  So I did one sort of all the strings in the symbol table, using an array of <pointer-to-string, pointer-to-symbol-table-entry> pairs.  I then ran through this array in sorted order, and when I was at element [i], I reached out to the symbol table entry, where I had added a new "order" field, and wrote the index i into that field.  Then, when I built a set of objects to be sorted (there were many such sorts), instead of having to go out to the symbol table, dereference the string pointer, and do the expensive strcmp, I created a sort vector in which I represented each string by its integer from the symbol table.  Now I only had to sort integers.  The sorts ran a couple of orders of magnitude faster.  That constant C matters.
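A reconstruction of that trick might look like the following sketch (the names are mine, and the original code is not shown here):

// Step 1, done once: sort the symbol-table entries by string name and record
// each entry's rank in a new 'order' field.
#include <algorithm>
#include <cstring>
#include <vector>

struct Symbol {
    const char *name;
    int         order;    // filled in by assign_orders()
    // ... other fields ...
};

void assign_orders(std::vector<Symbol *> &symbols)
{
    std::sort(symbols.begin(), symbols.end(),
              [](const Symbol *a, const Symbol *b) {
                  return strcmp(a->name, b->name) < 0;   // the expensive compare, paid once
              });
    for (size_t i = 0; i < symbols.size(); i++)
        symbols[i]->order = (int)i;
}

// Step 2, done many times: any later sort of a subset compares the small
// integers instead of dereferencing and comparing strings.
void sort_subset(std::vector<Symbol *> &subset)
{
    std::sort(subset.begin(), subset.end(),
              [](const Symbol *a, const Symbol *b) {
                  return a->order < b->order;            // cheap integer compare
              });
}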

Is Ksetup really negligible?

Generally, yes.  However, if you are writing a page-aware, cache-aware computation, Ksetup may actually be pretty hairy.  But if expending a few hundred thousand cycles to compute the best approach offsets even a single page fault (at 20,000,000 to 40,000,000 CPU cycles), you are still well ahead of the game.

And, in many cases, you will expend Ksetup exactly once per execution, but the loops over the data may occur many times, so it is truly a one-time cost.  Once you've derived the proper parameters for your loops, you just use them, no matter how many times you need to run the computation itself.  Think of a situation where there is some kind of image-processing loop over a large image.  Once you determine the optimal page-aware and cache-aware parameters, every time the user asks for an image transformation, you simply reuse those parameters to drive the data conversion.  So no matter how large Ksetup may look to you, think instead of how many cache misses it avoids, and how many page faults it avoids, and it will be clear that the cost is irrelevant.  And therefore it is "negligible" in the Big Picture of your algorithm.

Hinting at a solution

Now, there are times when a linked list really is the best representation of the information, particularly when the objects are large.  But we might still need to access it quickly.  There are many suggestions about how to accomplish this, but many of them have significant performance problems because they assume that the auxiliary structure that provides the quick access must be kept in perfect synchronization with the contents of the primary structure (for example, that a vector index has to be kept in perfect synchronization with the linked list).

I was in a situation in which I needed to do very fast access of elements on a fairly slow machine (OK, it was an 8088).  I developed a technique based on some ideas from Xerox PARC, known as hints.

One way to manage fast access to a linked list of n elements is to create an array of n pointers into the list, kept in sorted order.  Now, the only problem with this is that every time I add an element to the list, I have to add an element to the array, in sorted order, and every time I delete an element from the list, I have to delete an element from the array.  This can be a serious performance problem.

The idea of a hint is that it is a value whose validity is easy to determine, but whose initial computation might be difficult.  For my hints I created an array of 100 elements (this was an 8088, and I only had 640K, so to avoid dynamic allocation I used a static array of 100 elements).  I then took my linked list of several thousand elements and ran a simple sampling pass over it: walking the list in order, I stored a pointer to (approximately) every n/100th element, together with its key, into successive slots of the hint array, so the hint array ended up sorted by key.
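A reconstruction of that sampling pass might look like the following sketch (the structure names and details are mine, not the original code):

// Reconstruction sketch: build the hint array by sampling every n/k-th
// element of the sorted list.
#include <cstddef>

const int HINT_COUNT = 100;

struct ListNode {
    int       key;
    ListNode *next;
    // ... the (large) payload ...
};

struct Hint {
    int       key;     // copy of node->key, so hint resolution need not dereference
    ListNode *node;    // NULL if this hint has been invalidated by a deletion
};

static Hint hints[HINT_COUNT];   // static, as in the original 640K setting
static int  hints_used;          // how many slots are actually filled

void build_hints(ListNode *head, int n)
{
    int stride = n / HINT_COUNT;
    if (stride < 1)
        stride = 1;
    hints_used = 0;
    int i = 0;
    for (ListNode *p = head; p != NULL && hints_used < HINT_COUNT; p = p->next, i++) {
        if (i % stride == 0) {               // sample every stride-th element
            hints[hints_used].key  = p->key;
            hints[hints_used].node = p;
            hints_used++;
        }
    }
}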

Now, when I add a new element to the list, I just add it.  I don't need to recompute the array.  Either it goes in after an element in the array, or within a sequence of elements between two elements of the array.  In either case, the array provides a hint about the position.  I'll discuss the "hint resolution" problem shortly, so bear with me for a moment.

Now, if I delete an element from the list, either it was pointed to by an element in the array, or not.  If no element in the array points to it, what I do is delete the element from the list, and I'm done.  If an array element points to it, which I can determine in O(k/2) where k is the number of elements in the array, I just set that array element to NULL.  Then I can delete the element from the list.

Note that I do not have to recompute the array at any time.  While I don't have to, for performance reasons it might be reasonable to consider recomputing it from time to time (more on this shortly).

To resolve a hint, I start with a key that I want to locate in the list.  I first search for the key in the hint array.  This is a bit subtler than bsearch from the C library: what I do is find the element i in the hint array such that hint_array[i].key <= key and hint_array[i+1].key > key (and yes, there is a boundary condition if i == (k-1); no big deal).  I use the hint at position i to go into the list.  If the key of the element I'm referring to is equal to my key, I'm done.  If it isn't, I start walking the list from where I am toward the element of the list referenced by the next array element.  Thus, I never have to examine more than (approximately) m elements, where m is the number of list elements between two consecutive hints (about n/k), plus whatever elements may have been inserted since the hints were built.  This allows me either to identify the position into which to insert a new element, or to find the element I want to delete (the response when a lookup does not find the key is dependent on the problem domain).
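Continuing the sketch above, hint resolution for a lookup might look like this (again, the details are mine; returning NULL on a failed lookup is just one of the domain-dependent choices mentioned above, and insertions before the first hint are ignored for brevity):

// Sketch of hint resolution, using the Hint/ListNode structures above.
// Find the hint whose key is <= the search key, then walk the list from
// there; at most about m = n/k elements (plus recent insertions) are examined.
ListNode *resolve(int key)
{
    // Binary search the hint array for the last slot with key <= search key.
    int lo = 0, hi = hints_used - 1, best = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (hints[mid].key <= key) {
            best = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    // Deletions may have NULLed this hint; fall back to an earlier one.
    while (best > 0 && hints[best].node == NULL)
        best--;
    ListNode *p = hints[best].node;
    if (p == NULL)
        return NULL;                 // degenerate case: no usable hint
    while (p != NULL && p->key < key)
        p = p->next;                 // the short walk between two hints
    return (p != NULL && p->key == key) ? p : NULL;   // NULL here means "not found"
}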

Using this technique, I could, in real time, locate any element of the list in (on the average) O(log2(k)) + O(m/2), which was always substantially smaller than O(n/2), which is what the old algorithm required.  For example, if k is 100 and n is 3000, I can find the element in < 7+15 operations, instead of 1500.

What I needed to worry about was what happened after a large number of insertions, because I could then have an "unbalanced hint set", where I could have far more than m elements between two index points.  I handled this rather simply, because the hint maintenance did not have to be done in real time: since I had found the insertion point using the hints, I already had an index into the hint table, hint[i].  After an insertion, I started at the list element pointed to by hint[i] and followed the chain until I reached the list element referenced by hint[i+1], counting the number of elements that this range spanned.  If it exceeded 2*m, I decided it was time to redo the hints; I chose this as an arbitrary threshold.  Of course, there was a special boundary condition for insertions at the end of the list, that is, following hint[k-1], and I had to deal with the effect of deletions, because hint[i+1] might be NULL.  But this is all merely programming detail.
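Continuing the same sketch, the post-insertion check might look like this (the end-of-list and NULLed-hint boundary cases are handled only minimally here):

// Sketch of the rebalance check after an insertion near hint[i]: count the
// elements between hint[i] and hint[i+1]; if the gap has grown past 2*m,
// rebuild the hints.
void check_gap(ListNode *head, int n, int i, int m)
{
    ListNode *start = hints[i].node;
    ListNode *stop  = (i + 1 < hints_used) ? hints[i + 1].node : NULL;
    if (start == NULL)
        return;                       // this hint was invalidated; skip the check
    int span = 0;
    for (ListNode *p = start; p != NULL && p != stop; p = p->next)
        span++;                       // if stop is NULL, this counts to the end of the list
    if (span > 2 * m)
        build_hints(head, n);         // arbitrary threshold, as noted above
}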

Note that the hint array can hold a pair <key, pointer>, so I don't need to dereference the pointer to obtain the key for comparison purposes, thus minimizing potential page faults.

Summary

Many of the arguments about choices naively assume constant memory access time.  They also naively assume that C doesn't matter.  Many counterproposals for performance improvements of linked lists naively assume that the auxiliary structures must be kept in perfect synchronization with the contents of the linked list.  Hopefully, I have helped dispel some of these false assumptions.  Reality is considerably different from the theoretical models.

