Sorting Strings :: Data Structures in C#

Sorting Strings

We conclude our discussion of sorting with a look at a sorting algorithm designed specifically for sorting multi-keyed data. In such data there is a primary key, a secondary key, and so on. We want to sort the data so that element a precedes element b if:

the primary key of a is less than the primary key of b;
or their primary keys are equal, but the secondary key of a is less than the secondary key of b;
etc.

An example of multi-keyed data is strings. The first character of a string is its primary key, its second character is its secondary key, and so on. The only caveat is that the strings may not all have the same length; hence, they may not all have the same number of keys. We therefore stipulate that a string that does not have a particular key must precede all strings that have that key.

One algorithm to sort multi-keyed data is known as multi-key quick sort. In this section, we will describe multi-key quick sort as it applies specifically to sorting strings; however, it can be applied to other multi-keyed data as well.

One problem with sorting strings using a version of quick sort described in “Split Sorts ” is that string comparisons can be expensive. Specifically, they must compare the strings a character at a time until they reach either a mismatch or the end of a string. Thus, comparing strings that have a long prefix in common is expensive. Now observe that quick sort operates by splitting the array into smaller and smaller pieces whose elements belong near each other in the sorted result. It is therefore common to have some pieces whose elements all begin with the same long prefix.

Multi-key quick sort improves the performance by trying to avoid comparing prefixes after they have already been found to be the same (though the suffixes may differ). In order to accomplish this, it uses an extra int parameter k such that all the strings being sorted match in their first k positions (and by implication, all strings have length at least k). We can safely use a value of 0 in the initial call, but this value can increase as recursive calls are made.

Because all strings begin with the same prefix of length k, we can focus on the character at location k (i.e., following the first k characters) of each string. We need to be careful, however, because some of the strings may not have a character at location k. We will therefore use an int to store the value of the character at location k of a string, letting $ -1 $ denote the absence of a character at that location. We can also let $ -2 $ denote a null element, so that these elements are placed before all non-null elements in the sorted result.

The algorithm then proceeds a lot like those described in “Split Sorts ”. If the number of elements being sorted is greater than $ 1 $, a pivot element p is found. Note that p is not a string, but an int representing a character at location k, as described above. The elements are then partitioned into groups of strings whose character at location k is less than p, equal to p, or greater than p, respectively.

After these three groups are formed, the first and third group are sorted recursively using the same value for k. Furthermore, the second group may not be completely sorted yet — all we know is that all strings in this group agree on the first k + 1 characters. Thus, unless p is negative (indicating that either these strings are all null or they all have length k, and are therefore all equal), we need to recursively sort this group as well. However, because we know that the strings in this group all agree on the first k + 1 characters, we pass k + 1 as the last parameter.

One aspect of this algorithm that we need to address is whether the recursion is valid. Recall that when we introduced recursion , we stated that in order to guarantee termination, all recursive calls must be on smaller problem instances, where the size of a problem instance is given by a nonnegative integer. In the algorithm described above, we might reach a point at which all of the strings being sorted match in location k. In such a case, the second recursive call will contain all of the strings.

By being careful how we define the size of the problem instance, however, we can show that this recursion is, in fact, valid. Specifically, we define the size of the problem instance to be the number of strings being sorted, plus the total number of characters beginning at location k in all strings being sorted. Because there is at least one string containing p at location k, the number of strings in both the first and the third recursive call must be smaller, while the total number of characters beginning at location k can be no larger. Because k increases by $ 1 $ in the second recursive call, the total number of characters past this location must be smaller, while the number of strings can be no larger. Hence, the size decreases in all recursive calls.

The fact that we are doing recursion on the length of strings, however, can potentially cause the runtime stack to overflow when we are sorting very long strings. For this reason, it is best to convert the recursive call on the second group to a loop. We can do this by changing the if-statement that controls whether the splitting will be done into a while-loop that iterates as long as the portion being sorted is large enough to split. Then at the bottom of the loop, after doing recursive calls on the first and third parts, we check to see if p is $ -1 $ — if so, we exit the loop. Otherwise, we do the following:

increment k;
change the index giving the start of the portion we are sorting to the beginning of the second part; and
change the length of the portion we are sorting to the length of the second part.

The next iteration will then sort the second part.

This algorithm can be combined with insertion sort and heap sort , as was done for introsort in the previous section . However, we should also modify insertion sort and heap sort to use the information we already have about equal prefixes when we are comparing elements. Specifically, rather than comparing entire strings, we should begin comparing after the equal prefix. Because of the way multi-key quick sort does comparisons, the result tends to perform better than the single-key versions, assuming similar optimizations are made; however, cutoffs for running insertion sort and/or heap sort may need to be adjusted.