I feel sort of...unsorted!

A Guide to Sorting, including sorting CListCtrl, CComboBox, and CListBox

Home
Back To Tips Page

Are you feeling out-of-sorts these days? Do you want order in your life? Do you believe that A is less than B? Then you need to know how to sort!

Sorting is, on the whole, a complex task. Don Knuth devotes one entire volume of his famous "Art of Computer Programming" entirely to this topic, "Sorting and Searching". If you want the real truth, go read that. There are also many other books written on the topic.

Old Faithful: the Bubble Sort

If you have a very tiny number of objects to sort, you can use an algorithm called "bubble sort". It gets its name from the fact that elements "bubble" up to the end of the array, until the array is sorted.

The Good News: this is the easiest sort to write.

The Bad News: it is a really, really bad way to sort. (It is not the worst possible sort, but it is definitely in the running).

The problem with bubble sort is that it is, in the worst case, an "order n2 sort", that is, if it takes time T to sort k elements, then it will take time 100T to sort 10k elements. And 10,000T to sort 100k elements.As n goes up, the time goes up as n2. But if you've only got five or ten items to sort, it works fine.

template <class T> void bubble(T data[], UINT count)
   {
    BOOL changed = TRUE;
    while(changed)
      { /* scan */
       changed = FALSE;
       for(UINT i = 0; i < count - 1; i++)
         { /* compare */
          if(data[i+1] < data[i])
            { /* swap */
             T t = data[i];
             data[i] = data[i + 1];
             data[i+1] = t;
             changed = TRUE;
            } /* swap */
         } /* compare */  
      } /* scan */
   } // bubble

The State of the Art: qsort

The most commonly used sorting algorithm is the C function qsort. This is one of several algorithms known as "n log n sorts". The log function is officially log2 n, but the "2" is usually omitted in conversation.. By this measure of complexity, if it takes time T to sort k items, and I need to sort 1000k items, it will take approximately 9000T (1000 * log2 1000, noting that log2 1024 » 9). Compare that with n2 bubble sort, where the cost would be 1,000,000T. Pretty big difference!

qsort is a very sophisticated algorithm, and I don't plan to discuss how it works here. It is part of the C library. Its specification is sometimes a bit hard for a beginner to read:

void qsort(
   void *base,
   size_t num,
   size_t width,
   int (__cdecl *compare )(const void *, const void *) 
);

This is designed to be a very general algorithm, and it was designed for C, not for C++. Thus there is no "template" that would generate a family of qsort functions; the size of the object must be provided explicitly. You also have to provide a function which performs the comparison on your objects. That's the last parameter. It is passed two pointers to the elements of the array. So if you have an array of integer values and want to sort it, you will be passed int * pointers as the parameters to the function.

The function itself returns an int value. If the value is negative, it means you have determined that the first value is less than the second. If the value is positive, it means the first value is greater than the second value. If the value is 0, it means the two values are equal.

This makes it very easy to compare integers. You can write a simple function:

int compareint(const void * v1, const void * v2)
   {
    return *(int *)v1 - *(int *)v2;    
   } // compareint

Note that what happens here: you are passed the pointers to the elements of the array. You cast the pointers to int * pointers, then simply subtract. If v1 is less than v2, the result will be negative; if v1 is greater than v2, the result will be positive, and if they are the same, the value will be zero.

Suppose I have a struct of the form

typedef struct {
   CString name;
   UINT birthday;
  } myData, * pmyData;

and I have an array pointed to, and I know that there are 178 elements in the array (count). Then I could invoke the qsort function by doing

pmyData info;
UINT count;

qsort(info, count, sizeof(myData, byname);

where the function byname is defined as

int byname(const void * v1, const void * v2);

But what does that function get? It gets pointers to the array elments. Remember that this is an array of myData objects. So each time the byname function is called, it gets a pointer to two of the elements of that array. So you will need to cast them to be pointers to those objects:

int byname(const void * v1, const void * v2)
   {
    pmyData * data1 = (pmyData *)v1;
    pmyData * data2 = (pmyData *)v2;
    return _tcscmp(data1->name, data2->name);
   } // byname

The function must return a value of < 0 if v1 < v2, 0 if v1 == v2, and > 0 if v1 > v2. _tcscmp, which is the Unicode-aware version of strcmp, does exactly this.

I've represented the birthday by a UINT, which is the year of the person's birth. Suppose I wanted to sort by age. I could write

int byname(const void * v1, const void * v2)
    {
     pmyData * data1 = (pmyData *)v1;
     pmyData * data2 = (pmyData *)v2;
     return data1->birthday - data2->birthday;  // NOT QUITE RIGHT!!!
    } // byname

So this should subtract the two values. If data1->birthday < data2->birthday, then I should get a negative number, if they are equal, I should get 0, and if data1->birthday > data2->birthday I should have a positive number. So this should work. It actually gets a compiler warning.

Why? Because I declared the variable as UINT. You can't really subtract two UINTs and get a signed result. (OK, those of you who want to be pedantic will point out that the 2's complement arithmetic computes the correct bit pattern anyway, so it doesn't matter. But the compiler is unhappy).

There are a couple solutions. One is that I could declare the year of birth as an int, which would allow me negative years (if someone needs to use dates "B.C.E." (Before Christian Era) this could be useful. Or you could cast the values to int and do the arithmetic on an int, e.g.,

return (int)data1->birthday - (int)data2->birthday;

But I want more than a year. I want a day. So I instead choose to use the structure

typedef struct {
   CString name;
   COleDateTime birthday;
  } myData, * pmyData;

Now I have represented the birthday as a COleDateTime value. Why not a CTime? Because CTime has an "epoch date" of 1-Jan-1970. It wouldn't represent my birthday (I had been programming for six years at that point) or anyone else's who had been born before that time. COleDateTime has an epoch value of 1 January 1900, and is in fact a floating-point value; in addition, it allows negative values. The specified range is 1-January-100 to 31-December-9999. I would now rewrite the code as:

int byage(const void * v1, const void * v2)
   {
    pmyData * data1 = (pmyData *)v1;
    pmyData * data2 = (pmyData *)v2;
   COleDateTime deltaT = v1->birthday - v2->birthday;
   if(deltaT < 0)
     return -1;
   else
   if(deltaT > 0)
     return 1;
   else
     return 0;
   }

This is subtle. Why couldn't I just write the same different operator I had before? Why did I have to do this complicated computation?

Because it is a floating-point number! Thus if I had two birth dates, one on the late afternoon (something.75) and another early in the next morning (something+1.20), the difference would be less than a day (.45), and would be truncated to an integer (0), so the two people would look as if they had been born on the same day! So instead I have to decode the value so that the resulting sign is properly accounted for!

Using this in C++

Typically, you will be calling your sort routine from within a C++ class. However, the function must not be a class member function, one that requires a this pointer, because that is not supported in C. And qsort is a C library function, not a C++ library function.

A common misunderstanding about the C++ language forces many programmers to create a function that is not a class function. This ends up creating a function that is somehow separate from the class. This is not always a good idea. The simple way to handle this is to simply declare the function as a static member function of the class. It will not require a this pointer, and can be used as an argument to qsort.

I discuss more about callbacks in my companion essay on callback methods in C++. But the simple explanation is to simply put the function in the class.

class whatever {
     ...
    protected:
       static int compareint(const void * v1, const void * v2);
};
/* static */ int whatever::compareint(const void * v1, const void * v2)
   {
    return *(int *)v1 - *(int *)v2;
   }

Sorting a CList or CMap

You can use qsort in all kinds of interesting ways. For example, you don't always need to sort the things, you may only need to get a sorted reference to the things! Suppose you have a list of objects that you want to display in sorted order. For this discussion, we can assume either that they are unsorted, or they are sorted in a fashion that is not what you want to see. Or you might have a CMap, which essentially uses the same techniques. I will concentrate on the CList here, and leave CMap as an Exercise For The Reader.

typedef CList<pmyData, pmyData> DataList;
DataList people;

Now you want to sort this and display it in, say, name order. Do you have to sort the list? No, not at all! You can create a "side vector" that represents the sorted data, and sort that! Note that in the code below I assume the list is nonempty.

void showByName(DataList & data)
   {
    int n = data.GetCount();
    pmyData * vector = new pmyData[n];
    POSITION p = data.GetHeadPosition;
    int i = 0;
    while(p != NULL)
      { /* build vector */
       vector[i] = data.GetNext(p);
       i++;
      } /* build vector */
    qsort(vector, n, sizeof(pmyData), byname1);
    for(i = 0; i < n; i++)
      ShowAnElement(vector[i]);
    delete [] vector;
   } // showByName
int byname1(const void * v1, const void * v2)
   {
    pmyData * data1 = (pmyData *)v1;
    pmyData * data2 = (pmyData *)v2;
    return _tcscmp( (*data1)->name, (*data2)->name);
   } // byname1

Note that by creating the side vector of pointers, we introduce one more level of indirection; so what the sort routine gets is a pointer to the array element, which is itself a pointer to a myData item. So the extra level of indirection is required. You could also have written it as

int byname1(const void * v1, const void * v2)
   {
    pmyData data1 = *(pmyData *)v1;
    pmyData data2 = *(pmyData *)v2;
    return _tcscmp( data1->name, data2->name);
   } // byname1

Note that the "side vector" is valid only as long as the list remains unchanged. One thing I have done is encapsulate the list and the side vector in a single class. When I need the sorted display, I create the side vector and sort it. When I add to, or delete from, the list, I delete the side vector. I then recompute it as necessary.

template <class type, class arg_type> SortedList : public CList<type, arg_type> {
   public:
      POSITION AddHead(arg_type value) { CList<type, arg_type>::AddHead(value); vector.RemoveAll(); }
      void RemoveAll() { CList<type, arg_type>::RemoveAll(); vector.RemoveAll(); }
      ... etc
     void Sort() {
                  if(vector.GetSize() == 0)
                     { /* sort it */
                      vector.SetSize(GetCount());
                      POSITION p = GetHeadPosition();
                      int i = 0;
                      while(p != NULL)
                         { /* fill vector */
                          vector[i] = GetNext(p);
                         } /* fill vector */   
                     } /* sort it */
                    qsort(vector.GetData(), vector.GetSize(), sizeof(type *), compare); // see below
                   }
      int GetFirstSortPosition() { ASSERT(vector.GetSize() > 0); return 0; }
      int GetLastSortPosition() { ASSERT(vector.GetSize() > 0); return vector.GetSize() - 1; }
      type * GetSortAt(int n) { return vector.GetAt(n); }
  protected:
     CArray<type *, type *> vector;
     int compare(const void * v1, const void * v2) {
              if( *(type *)v1 < *(type *)v2)
                  return -1;
              else
              if( *(type *)v1 > *(type *)v2)
                  return 1;
              else
                  return 0;
             }
};

Note this assumes that there are relationship operators, > and <, defined on your type. This is a sketch of how you might do comparison. The side vector is maintained as a "cache". However, note that with this method, you must override all the operations of CList that exist; using one you have not overridden and which modifies the list without deleting the contents of the side vector would have serious consequences.

Another method is not to derive, but to embed. In this model, you provide only the methods you need, and the CList is contained entirely within the class.

template <class type, class arg_type> class SortedList {
     public:
         void AddTail(arg_type t) { list.AddTail(t); vector.RemoveAll(); }
         POSITION GetHeadPosition() { return list.GetHeadPosition(); }
         type GetNext(POSITION & p){ return list.GetNext(p); }
         void RemoveAll() { list.RemoveAll(); vector.RemoveAll(); }
     protected:
         CList<type, arg_type> list;
         CArray<type *, type *>vector;
};

Note that in this case you do not need to override all the CList methods, because the only methods that can be called on this class are the methods you export. On the whole, I find embedding to be cleaner than derivation for this purpose, and it is my preeferred mechanism for handling this situation.        

Sorting a CArray

A CArray is easy to sort. All we need is the sizes of the objects and the base pointer. The base pointer is easy. We use GetData to get that. The count is easy. That's GetSize. The width is easy. That's sizeof. So we have everything we need!

CArray<pmyData, pmyData> array;
qsort(array.GetData(), array.GetSize(), sizeof(pmyData), byname1);

Multifield sorting

Suppose I want to sort the data so that it is numeric by year, and within year, alphabetic by name. This is not terribly hard. You first sort by the main key, and then by the subkey.

int bynameinyear(const void * v1, const void * v2)
   {
    pmyData data1 = *(pmyData *)v1;
    pmyData data2 = *(pmyData *)v2;
    int dt = (int)data1->birthday - (int)data2->birthday;
    if(dt != 0)
       return dt;
    // same year
    return _tcscmp(data1->name, data2->name);    
   }

Note that this first checks to see if the years differ, If the years differ, the decision is already made. If the years are the same, we then need to compare the two values of the names. We can apply this repeatedly; each time we find an equal field, we compare the next subkey of the sort. So we could do alphabetic by state, within state by city, within city by streetname, within streetname by house number, and so on.

Sorting Control Values

There are three controls that have implicit sorting: CListBox, ComboBox, and CListCtrl. Each of these has a way of specifying a sort.

CListBox and CComboBox

These controls are basically the same. The downside is that you need to make them owner-draw controls without strings. This is because the message WM_COMPAREITEM is not sent to the parent window unless the control is both owner-draw and does not have strings. In MFC, you do not handle this in the parent window! Instead, you use the virtual method CompareItem. You use the ClassWizard or the events property (in VS7) to specify a CompareItem method. The CompareItem method gets a pointer to a COMPAREITEMSTRUCT:

typedef struct tagCOMPAREITEMSTRUCT { 
  UINT      CtlType; 
  UINT      CtlID; 
  HWND      hwndItem; 
  UINT      itemID1; 
  ULONG_PTR itemData1; 
  UINT      itemID2; 
  ULONG_PTR itemData2; 
  DWORD     dwLocaleId;
} COMPAREITEMSTRUCT; 

The normal template created when you add a CompareItem method is

int CompareItem(LPCOMPAREITEMSTRUCT lpCompareItemStruct)
   {
     return 0;
   }

The first thing I do is replace that ridiculous parameter name by something sensible:

int CompareItem(LPCOMPAREITEMSTRUCT cis)
   {
     return 0;
   }

You can then treat the itemData1 and itemData2 pointers as if they were the inputs to the compare routine we discussed for qsort. There is, however, a more serious problem. Where qsort could accept any negative value for item1 less than item2, and any positive value for item1 greater than item2, Windows requires very specific values: -1, 0, and 1. So the code looks a lot like the COleDateTime example. You create these methods in a subclass of the control you wish to implement.

In addition, this really only applies to combo boxes with the "drop list" style. You can't have an owner-drawn combo box with an edit control, but without the Has Strings option.

int CMyListBox::CompareItem(LPCOMPAREITEMSTRUCT cis)
   {
    pmyData data1 = (pmyData)cis->itemData1;
    pmyData data2 = (pmydata)cis->itemData2;
    if(data1->year < data2->year)
       return -1;
    if(data1->year > data2->year)
       return 1;
    int result = _tcscmp(data1->name, data2->name);
    if(result < 0)
      return -1;
    if(result > 0)
      return 1;
    return 0;
   }

If you only need a simple CListBox or CCombBox, the DrawItem code is quite simple. The string that is printed is based on your own transformation of the contents of your data structure. While it makes sense to print it out the fields in the order of the sort keys, I've done various kinds of decorations as well. For example, using color highlights, putting punctuation marks in, displaying a flag, and so on. And while I do it here as a single TextOut, you can columnize it nicely, or do whatever you want. I have highlighted the actual output code below; the rest is infrastructure. Note that I have also replaced the absurd name lpDrawItemStruct for the parameter with something much more sensible: dis.

void CInstrumentSelector::DrawItem(LPDRAWITEMSTRUCT dis) 
   {
    CRect r = dis->rcItem;
    CDC * dc = CDC::FromHandle(dis->hDC);
    COLORREF txcolor;
    COLORREF bkcolor;

    int saved = dc->SaveDC();

    if(dis->itemState & ODS_SELECTED)
       { /* selected */
        bkcolor = GetSysColor(COLOR_HIGHLIGHT);
        txcolor = GetSysColor(COLOR_HIGHLIGHTTEXT);
       } /* selected */
    else
       { /* unselected */
        if(dis->itemState & (ODS_DISABLED | ODS_GRAYED))
           txcolor = GetSysColor(COLOR_GRAYTEXT);
        else
           txcolor = GetSysColor(COLOR_WINDOWTEXT);
           bkcolor = GetSysColor(COLOR_WINDOW);
       } /* unselected */

    dc->SetBkColor(bkcolor);
    dc->SetTextColor(txcolor);

    dc->FillSolidRect(&r, bkcolor);

    //****************************************************************
    // Print the data out in the format
    // yyyy name...
    pmyData data = (pmyData)dis->itemData;
    CString s;
    s.Format(_T("%04d %s"), data->birthday, data->name);
    //****************************************************************

    if(dis->itemID != (UINT)-1 
       && (dis->itemState & (ODS_DISABLED | ODS_GRAYED)) == 0)
      { /* has item */
       dc->TextOut(r.left, r.top, s);
      } /* has item */

    if(dis->itemState & ODS_FOCUS && dis->itemAction != ODA_SELECT)
       dc->DrawFocusRect(&r);

    dc->RestoreDC(saved);
   }
   

The CListCtrl

The CListCtrl is very peculiar. Each element in a CListCtrl has an LPARAM value, which you can set explicitly as part of the structure you pass in, or by using the SetItemData method. When you invoke a sort request, you provide a sort routine. The sort routine is not provided with the index of the items to compare, but simply the LPARAM values.

Therefore, if you want a sortable CListCtrl, you must explicitly set the LPARAM field of each entry to point to, or contain, something useful for doing a comparison. A common technique is to put a pointer to an object in this field. For example, if I want to add the information to a CListCtrl, I might add the year as the first column (converted to a string representation), and the name in the second column, but I also set a pointer to the data object as the ItemData value. In the sort routine, I then have access to this.

void CMyListControl::sortbyageandname()
   {
    SortItems(compareByAgeAndName, NULL);
   }

The compare function must be declared as a static method, e.g.,

class CMyListControl : public CListCtrl {
   ...
   protected:
      static int compareByAgeAndName(LPARAM v1, LPARAM v2, LPARAM parm);
  };
/* static */ int CMyListControl::compareByAgeAndName(LPARAM v1, LPARAM v2, LPARAM)
   {
    pmyData data1 = (pmyData)v1;
    pmyData data2 = (pmyData)v2;
    int result = (int)data1->birthday - (int)data2->birthday;
    if(result != 0)
       return result;
    return _tcscmp(data1->name, data2->name);
   }

Note that unlike the CListBox and CComboBox, simple negative and positive values will function correctly. The specific values -1 and 1 are not required.

Suppose I want to have multiple sorts. For example, by clicking in a header control, detecting the desired column. What I do is respond to the button and set a "desired sort" field in my control. But I need to determine in the comparison routine which sort to do. But the sort function must be static, which means it can't access any class variables!

Now a naive (and incorrect) solution to this problem might be to store the desired sort in a static variable, which would make it accessible to the static method. This is so unbelievably incorrect that it is hard to imagine anyone would want to do it. Imagine if you had two controls! Which one would control the sorting? Whichever one the user selected most recently! It would set the sort style for both! This, of course, would be completely nonsensical.

There are two ways to handle this. The simplest one is to pass that variable in as the user-defined parameter. For example

class CMyListControl : public CListCtrl {
    ...
  public:
    typedef enum {SORT_BY_BIRTHDAY, SORT_BY_NAME} SortType;
    void SortByWhatever(SortType t);
  protected:
    static int compareWhatever(LPARAM v1, LPARAM v2, LPARAM parm);
    ...
};

Then to invoke the sort, you would have, in your dialog

class CMyDialog : public CDialog {
    protected:
       CMyListControl::SortType DesiredSort;
       void SortByWhatever();
};
void CMyDialog::SortByWhatever()
   {
    c_List.SortByWhatever(DesiredSort);
   }
void CMyListControl::SortByWhatever(SortType type)
   {
    SortItems(compareWhatever, (DWORD_PTR)type);
   }
/* static */ int CMyListControl::compareWhatever(LPARAM v1, LPARAM v2, LPARAM param)
   {
    pmyData data1 = (pmyData)v1;
    pmyData data2 = (pmyData)v2;
    switch((SortType)param)
       { /* SortType */
        case SORT_BY_BIRTHDAY:
          { /* birthday */
           int result = (int)data1->birthday - (int)data2->birthday;
           if(result != 0)
              return result;
           return _tcscmp(data1->name, data2->name);
          } /* birthday */
        case SORT_BY_NAME:
           return _tcscmp(data1->name, data2->name);
        default:
           ASSERT(FALSE); // unknown sort type
           return 0; // treat as same
       } /* SortType */
   }

The second method is to pass a pointer to the class and access the class member variable by casting this pointer to be the class pointer type. This seems to expose more of the class than is necessary, and is not a recommended method.

Stable and Unstable sorts

There are two kinds of sort algorithms: "stable sorts" and "unstable sorts". These terms refer to the action that is taken when two values compare as equal. If you have an array T0..size with two elements Tn and Tk for n < k, and these two elements compare equal, in a stable sort they will appear in the sorted output with the value that was in Tn preceding Tk. The output order preserves the original input order. An unstable sort, by contrast, there is no guarantee of the order of these two elements in the output.

"What difference does it make?" you may ask. After all, if I have an input sequence 1 7 3 4 3 5 2, and the output sequence is 1 2 3 3 4 5 7, what does it matter that the values 3 may be in a different order? Well, consider that what we have is not a set of simple scalar values, but a set of tuples: <year, name>. Suppose my input sequence is <1981, "Fred">, <1987, "Joe">, <1983, "Harry">, <1984, "Charles">, <1983, "Ellen">, <1985, "Mary"> and <1982, "Harriet">. Suppose that I only compare the year. Then in a stable sort, I am guaranteed that the output will be <1981, "Fred">, <1982, "Harriet">, <1983, "Harry">, <1983, "Ellen">, <1984, "Charles">, <1985, "Mary">, <1987, "Joe">. You have the guarantee that the order of the two 1983 entries in the output preserves the original order. In an unstable sort, you have no such guarantee, and those two elements might well have come out as <1983, "Ellen">, <1983, "Harry">.

Note that qsort is not specified as a stable sort. Whether it is or not is up to a particular implementation.

Engineering Tradeoffs

What is the best mechanism? Obviously, an log n sort is infinitely preferable to an n2 sort. No question. But sometimes you don't have the option. For example, I once had to keep a sorted list of objects. These objects had been stored in a file. Version v of the program stored these in most-recently-created order, although each object had a relationship to other objects which was later handled during the processing. I had added some new features to the program, which now required that the list be kept in sorted order. It was easy. I just read the list in, created a side vector, sorted the vector, then relinked the list based on the sorted order.

Unfortunately, this was on a small machine (MS-DOS, 640K). For a really large input file, by the time I read the whole list in, there was no space left to allocate the side vector for sorting. This was, of course, a fatal error; the program could not read existing large files. Fortunately, this was discovered during the beta testing. So what I did was read the file in, and try to malloc the side vector. If there was insufficient space, I got back a NULL pointer. At that point, I would do a bubble-sort of the list, swapping the positions of the elements in the list (it was a two-way linked list, so this was easy). The difference was that when I could call qsort, it only took a couple seconds (even on a 6MHz 286) to do the sort. But when I did the bubble sort, it took several minutes. So all I did was post a message saying "converting old file to new format, this may take several minutes", then sorted the file. Once it was sorted, of course, it did not need to be sorted again, so this happened exactly once for each large file read in.

In another program I wrote many years ago, I was sorting large symbol tables for various kinds of display (classes, members, etc.). Being in a hurry to get to the final stages, I tossed a bubble sort into the sort algorithm because I could write it in under five minutes, about as fast as I could type. After I'd run all the tests, I ran a "real" example. The program printed out the message "Starting reports" and ten minutes later it hadn't finished. Ouch! Obviously, the problem was my poor n2 sort. So I got out my favorite n log n sort algorithm (this was in the days before C; my program was not written in C, and qsort did not exist), dusted it off, translated it to the language I was programming in, and replugged the sort subroutine.

I then reran the program. It printed out "Starting reports", and ten minutes later it hadn't finished! A bit of poking with the debugger revealed that I was still in the sort routine. This didn't make any sense. So I tossed in my performance measurement tools, and learned something that I should not have forgotten.

When we talk about complexity, we simplify our discussions. We talk about n log n, n2, and such, but forget one thing: these designations are not accurate! The correct designation is C × (n log n) or C × n2, where C is the "constant of proportionality". My detailed measurement showed that I was actually expending all my time in the string-compare routines, the equivalent of strcmp. So while I had reduced the fundamental complexity, I had not reduced the string-compare time, so I was doing n log n string compares of somewhat lengthy strings. I had a very large value of C.

What to do? Well, because of the way I was handling the reports, I had to re-sort the information many times based on a multifield structure, where the fields were string names. So what I did was take my symbol table, build a side vector which embodied all the names in the symbol table, and for each symbol table entry, I assigned an integer which was its position in the side vector. Thereafter, when I needed to sort a set of names, instead of reaching out through its symbol table pointer and doing a string compare, I reached out and picked up the integer. If name n1 was less than name n2, then its corresponding integer value k1 was necessarily less than the integer value k2 for name n2, The result was that the sorting and report generation required less than one minute (on a machine roughly comparable to a 286 computer, with about as much memory).

A note from Kevin Nolish

Kevin Nolish, a long-time friend, read my article, and had this addendum, which I use with his permission:

I have one quibble with your Sorting Techniques note.

Quicksort is certainly the fastest sorting algorithm out there. Furthermore, it has a decent implementation in most standard C libraries. As such it is probably the best algorithm to use, except when it isn't. It's generally my first choice for non trivial sorting tasks. ( When I have to sort exactly 4 elements, well I just code a bubble sort. It's easier than thinking.)

You ought to mention that one drawback to Quicksort is that it has a perverse O(n2) case that you need to be aware of when choosing your sorting algorithm. Quicksort can be an O(n2) algorithm provided two things go
wrong:

  1. Your data set is already sorted, or mostly sorted
  2. You choose a bad pivot element. The worst pivot element is the minimal element in the sorted list.

If you have a mostly sorted array, but choose a good pivot element, you should be all right. Most experts recommend choosing an element at random. In this case you have a good probability of choosing a good pivot element and you should be fine. I don't know how the library qsort routine chooses its pivot element, but the implementation is probably good. One of the nice things about library routines that have been around for a while is that you can generally assume that the obvious stupidities have been found and fixed.

Furthermore, as a recursive algorithm, QuickSort requires some sort of intermediate storage, either on the call stack if the implementation is explicitly recursive, or as an auxiliary structure if the implementation simulates recursion via iteration. This might be a concern. I come from an embedded systems, no virtual memory, small task stacks, world so I have a bias against recursion.

So, if you're concerned about intermediate data use or think that you will generally be dealing with mostly sorted data, you should consider Heapsort. Heapsort is another C × O(log n) algorithm, with a somewhat larger value of C than Quicksort. Heapsort doesn't require the intermediate storage that Quicksort does and it's run time, while slightly slower than quicksort, is deterministic. It has no perverse cases that cause it to run in other than O(log n) time.


The worst possible sort was invented by Dr. Guy L. Steele, Jr. He calls it "bogo-sort" and it is intended to represent the worst sorting algorithm he could imagine. Do you know the game called "52 pickup"? You fill in the array in random order. You then check to see if it is in sorted order. If it is, you are done. If it isn't, you repeat the algorithm. You keep this up until you discover that the array is fully sorted.

The game of "52 pickup" is usually played with a small child. "Here's 52 cards", you say, holding the deck up. "Do you want to play 52 pickup?" The child, unless he or she has seen this before, says "Sure!". You toss all the cards on the floor. "That's the game. You pick them up". Now imagine that all the cards are face-down. Bogo-sort requires that when you assemble the cards, they are all in order by value and suit. If not, you play 52 pickup again. In bogo-sort, you are not permitted to look at the faces of the cards while picking them up.

Back to Bubble Sort

[Dividing Line Image]

The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.

Send mail to newcomer@flounder.com with questions or comments about this web site.
Copyright © 2003 The Joseph M. Newcomer Co./FlounderCraft Ltd., All Rights Reserved
Last modified: May 14, 2011