Sorting and Searching
The
sorting problem is to rearrange a set of items in ascending order. One reason
that it is so useful is that it is much easier to search for something in a
sorted list than an unsorted one. In this section, we will consider in detail
two classical algorithms for sorting and searching, along with several
applications where their efficiency plays a critical role.
Binary search.
In the game of "twenty
questions", your task is to guess the value of a hidden number that is one
of the N integers between 0 and N-1. (For simplicity, we will assume that N is
a power of two.) Each time that you make a guess, you are told whether your
guess is too high or too low. An effective strategy is to maintain an interval
that contains the hidden number, guess the number in the middle of the
interval, and then use the answer to halve the interval size. TwentyQuestions.java implements this strategy, which is an example of the
general problem-solving method known as binary search.
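The interval-halving strategy can be sketched as follows. This is a minimal illustration, not the book's TwentyQuestions.java (which reads answers interactively); here the "answers" come from comparing against a known secret, and the class name is ours:

```java
public class TwentyQuestionsSketch {
    // Find secret in [0, n) by halving the interval [lo, hi) each round,
    // where n is a power of two. Returns the number of guesses used.
    public static int count(int secret, int n) {
        int lo = 0, hi = n, guesses = 0;
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            guesses++;
            if (secret < mid) hi = mid;   // answer "too high": keep lower half
            else              lo = mid;   // otherwise: keep upper half
        }
        return guesses;   // lo now equals secret; guesses equals lg n
    }

    public static void main(String[] args) {
        System.out.println(count(77, 128));  // prints 7
    }
}
```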
- Correctness proof. First, we have to convince ourselves that the method is correct: that it always leads us to the hidden number. We do so by establishing the following facts:
- The interval always contains the hidden number.
- The interval sizes are the powers of two, decreasing from N.
The
first of these facts is enforced by the code; the second follows by noting that
if the interval size (hi-lo) is a power of two, then the next interval size is
(hi-lo)/2, which is the next smaller power of two. These facts are the basis of
an induction proof that the method operates as intended. Eventually, the
interval size becomes 1, so we are guaranteed to find the number.
- Running time analysis. Since the size of the interval decreases by a factor of 2 at each iteration (and the base case is reached when N = 1), the running time of binary search is lg N.
- Linear-logarithm chasm. The alternative to using binary search is to guess 0, then 1, then 2, then 3, and so forth, until hitting the hidden number. We refer to such an algorithm as a brute-force algorithm: it seems to get the job done, but without much regard to the cost (which might prevent it from actually getting the job done for large problems). In this case, the running time of the brute-force algorithm is sensitive to the input value, but could be as much as N and has expected value N/2 if the input value is chosen at random. Meanwhile, binary search is guaranteed to use no more than lg N steps.
- Binary representation. If you look back to Program 1.3.7, you will recognize that binary search is nearly the same computation as converting a number to binary! Each guess determines one bit of the answer. In our example, the information that the number is between 0 and 127 says that the number of bits in its binary representation is 7, the answer to the first question (is the number less than 64?) tells us the value of the leading bit, the answer to the second question tells us the value of the next bit, and so forth. For example, if the number is 77, the sequence of answers no yes yes no no yes no immediately yields 1001101, the binary representation of 77.
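The correspondence can be checked with a short computation (an illustrative sketch with a name of our choosing; each "no" answer contributes a 1 bit, each "yes" a 0 bit):

```java
public class BinaryGuesses {
    // Recover the bits of secret (0 <= secret < 128) from the sequence of
    // binary-search answers, as in the text's example for 77.
    public static String bits(int secret) {
        StringBuilder sb = new StringBuilder();
        int lo = 0, hi = 128;
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            boolean less = secret < mid;   // answer to "is it less than mid?"
            sb.append(less ? '0' : '1');   // answer "no" means the bit is 1
            if (less) hi = mid;
            else      lo = mid;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(77));  // prints 1001101
    }
}
```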
- Inverting a function. As an example of the utility of binary search in scientific computing, we consider the problem of inverting an increasing function. To fix ideas, we refer to the Gaussian distribution Φ when describing the method. Given a value y, our task is to find a value x such that Φ(x) = y. In this situation, we use real numbers as the endpoints of our interval, not integers, but we use the same essential method as for guessing a hidden integer: we halve the size of the interval at each step, keeping x in the interval, until the interval is sufficiently small that we know the value of x to within a desired precision δ. We start with an interval (lo, hi) known to contain x and use the following recursive strategy:
- Compute m = lo + (hi - lo) / 2
- Base case: If (hi - lo) is less than δ, then return m as an estimate of x
- Recursive step: Otherwise, test whether Φ(m) > y. If so, look for x in (lo, m); if not, look for x in (m, hi)
The
key to this method is the idea that the function is increasing: for any values
a and b, knowing that Φ(a) < Φ(b) tells us that a < b, and vice
versa. In this context, binary search is often called bisection search
because we bisect the interval at each stage.
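The strategy above translates directly into code. This sketch uses f(x) = x³ as a stand-in for Φ (which is not defined in this excerpt), and the class and method names are ours; the book's Gaussian code may differ in detail:

```java
public class BisectionSketch {
    // A strictly increasing stand-in for the Gaussian CDF Phi of the text.
    static double f(double x) { return x * x * x; }

    // Find x with f(x) = y, to precision delta, given that x is in (lo, hi).
    static double inverse(double y, double lo, double hi, double delta) {
        double mid = lo + (hi - lo) / 2;
        if (hi - lo < delta) return mid;                    // base case
        if (f(mid) > y) return inverse(y, lo, mid, delta);  // x is in (lo, mid)
        else            return inverse(y, mid, hi, delta);  // x is in (mid, hi)
    }

    public static void main(String[] args) {
        System.out.println(inverse(8.0, 0.0, 10.0, 1e-9));  // approximately 2.0
    }
}
```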
- Binary search in a sorted array. During much of the last century, people would use a publication known as a phone book to look up a person's phone number. Entries appear in order, sorted by a key that identifies each one (the person's name). A brute-force solution would be to start at the beginning, examine each entry one at a time, and continue until you find the name. No one uses that method: instead, you open the book to some interior page and look for the name on that page. If it is there, you are done; otherwise, you eliminate either the part of the book before the current page or the part of the book after the current page from consideration and repeat.
- Exception filter. We now use binary search to solve the existence problem: is a given key in a sorted database of keys? For example, when checking the spelling of a word, you need only know whether your word is in the dictionary and are not interested in the definition. In a computer search, we keep the information in an array, sorted in order of the key. The binary search code in BinarySearch.java differs from our other applications in two details. First, the file size N need not be a power of two. Second, it has to allow the possibility that the item sought is not in the array. The client program implements an exception filter: it reads a sorted list of strings from a file (which we refer to as the whitelist) and an arbitrary sequence of strings from standard input and prints those in the sequence that are not in the whitelist.
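The core loop of binary search in a sorted array can be sketched as follows (a minimal version in the spirit of BinarySearch.java; it handles an arbitrary array size N and the possibility that the key is absent):

```java
import java.util.Arrays;

public class BinarySearchSketch {
    // Return the index of key in the sorted array a, or -1 if not present.
    public static int search(String key, String[] a) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(a[mid]);
            if      (cmp < 0) hi = mid - 1;  // key, if present, is to the left
            else if (cmp > 0) lo = mid + 1;  // key, if present, is to the right
            else return mid;                 // found
        }
        return -1;                           // not in the array
    }

    public static void main(String[] args) {
        String[] whitelist = { "alpha", "beta", "delta", "gamma" };
        Arrays.sort(whitelist);   // binary search requires sorted input
        System.out.println(search("delta", whitelist));  // prints 2
        System.out.println(search("omega", whitelist));  // prints -1
    }
}
```

An exception filter would print each standard-input string for which search returns -1.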
Insertion
sort.
Insertion sort is a brute-force sorting
algorithm that is based on a simple method that people often use to arrange
hands of playing cards. Consider the cards one at a time and insert each into
its proper place among those already considered (keeping them sorted). The
following code mimics this process in a Java method that sorts strings in an
array:
public static void sort(String[] a) {
   int N = a.length;
   for (int i = 1; i < N; i++)
      for (int j = i; j > 0; j--)
         if (a[j-1].compareTo(a[j]) > 0)
            exch(a, j, j-1);
         else break;
}
The outer for loop sorts the first i entries in the array; the inner for loop can complete the sort by putting a[i] into its proper position in the array.
- Mathematical analysis. The inner loop of the insertion sort code is within a double for loop, which suggests that the running time is quadratic, but we cannot immediately draw this conclusion because of the break.
- Best case. When the input array is already in sorted order, the inner for loop amounts to nothing more than a comparison (to learn that a[j-1] is less than a[j]) and the break, so the total running time is linear.
- Worst case. When the input is reverse sorted, the inner loop fully completes without a break, so the frequency of execution of the instructions in the inner loop is 1 + 2 + ... + N-1 ~ N^2 and the running time is quadratic.
- Average case. When the input array is randomly ordered, each new element to be inserted is equally likely to fall into any position among those already considered, so on average it moves about halfway to the left. Thus, we expect the running time to be 1/2 + 2/2 + ... + (N-1)/2 ~ N^2 / 4, which is still quadratic.
- Sorting other types of data. We want to be able to sort all types of data, not just strings. For sorting objects in an array, we need only assume that we can compare two elements to see whether the first is bigger than, smaller than, or equal to the second. Java provides the Comparable interface for precisely this purpose. Simply put, a class that implements the Comparable interface promises to implement a method compareTo() for objects of its type so that a.compareTo(b) returns a negative integer if a is less than b, a positive integer if a is greater than b, and 0 if a is equal to b. The precise meanings of less than, greater than, and equal to are up to the data type, though implementations that do not respect the natural laws of mathematics surrounding these concepts will yield unpredictable results. With this convention, Insertion.java implements insertion sort so that it sorts arrays of Comparable objects.
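For example, a data type whose natural order is by numeric value might implement the interface as follows (Temperature is a hypothetical class of our own, not one from the book):

```java
public class Temperature implements Comparable<Temperature> {
    private final double degrees;

    public Temperature(double degrees) { this.degrees = degrees; }

    // Natural order: by numeric value. Double.compare returns a negative
    // integer, zero, or a positive integer, as compareTo() requires.
    public int compareTo(Temperature that) {
        return Double.compare(this.degrees, that.degrees);
    }

    public static void main(String[] args) {
        Temperature a = new Temperature(98.6);
        Temperature b = new Temperature(70.0);
        System.out.println(a.compareTo(b) > 0);  // prints true: a is greater
    }
}
```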
- Empirical analysis. Program InsertionTest.java tests our hypothesis that insertion sort is quadratic for randomly ordered files. It relies on the helper data type Stopwatch.java.
- Sensitivity to input. Note that InsertionTest.java takes a command-line parameter M and runs M experiments for each array size, not just one. One reason for doing so is that the running time of insertion sort is sensitive to its input values. It is not correct to flatly predict that the running time of insertion sort will be quadratic, because your application might involve input for which the running time is linear.
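A doubling experiment of this kind can be sketched without the book's helper classes (the class name and sizes here are our own choices; timings will vary by machine, but for a quadratic algorithm the ratio between successive rows approaches 4):

```java
import java.util.Random;

public class DoublingSketch {
    // Time one insertion sort of a random array of size n, in seconds.
    public static double time(int n) {
        Random rng = new Random(12345);
        Double[] a = new Double[n];
        for (int i = 0; i < n; i++) a[i] = rng.nextDouble();
        long start = System.nanoTime();
        for (int i = 1; i < n; i++)                 // insertion sort
            for (int j = i; j > 0 && a[j-1].compareTo(a[j]) > 0; j--) {
                Double t = a[j]; a[j] = a[j-1]; a[j-1] = t;
            }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        double prev = time(1000);
        for (int n = 2000; n <= 16000; n += n) {
            double t = time(n);
            System.out.printf("%6d %8.3f %5.1f%n", n, t, t / prev);
            prev = t;
        }
    }
}
```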
Mergesort.
To develop a faster sorting method,
we use a divide-and-conquer approach to algorithm design that every
programmer needs to understand. This nomenclature refers to the idea that one
way to solve a problem is to divide it into independent parts, conquer
them independently, and then use the solutions for the parts to develop a
solution for the full problem. To sort an array with this strategy, we divide
it into two halves, sort the two halves independently, and then merge the
results to sort the full array. This method is known as mergesort. To
sort a[lo, hi), we use the following recursive strategy:
- base case: If the subarray size is 0 or 1, it is already sorted.
- recursive step: Otherwise, compute m = lo + (hi - lo)/2, sort (recursively) the two subarrays a[lo, m) and a[m, hi), and merge them to produce a sorted result.
Merge.java
is an implementation. As usual, the easiest way to understand the merge process
is to study a trace of the contents of the array during the merge.
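The recursive strategy and the merge step can be sketched as follows (an illustrative version with names of our own; the book's Merge.java may differ in detail):

```java
public class MergeSketch {
    // Sort a[lo, hi) using the divide-and-conquer strategy above.
    public static void sort(String[] a, int lo, int hi) {
        if (hi - lo <= 1) return;     // base case: size 0 or 1 already sorted
        int mid = lo + (hi - lo) / 2;
        sort(a, lo, mid);             // sort left half recursively
        sort(a, mid, hi);             // sort right half recursively
        merge(a, lo, mid, hi);        // merge the two sorted halves
    }

    // Merge sorted subarrays a[lo, mid) and a[mid, hi) via an auxiliary array.
    private static void merge(String[] a, int lo, int mid, int hi) {
        String[] aux = new String[hi - lo];
        int i = lo, j = mid;
        for (int k = 0; k < aux.length; k++) {
            if      (i >= mid)                 aux[k] = a[j++];  // left exhausted
            else if (j >= hi)                  aux[k] = a[i++];  // right exhausted
            else if (a[j].compareTo(a[i]) < 0) aux[k] = a[j++];  // right is smaller
            else                               aux[k] = a[i++];  // left is smaller or equal
        }
        for (int k = 0; k < aux.length; k++) a[lo + k] = aux[k];
    }

    public static void main(String[] args) {
        String[] a = { "to", "be", "or", "not", "to", "be", "to" };
        sort(a, 0, a.length);
        System.out.println(String.join(" ", a));  // prints be be not or to to to
    }
}
```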
- Mathematical analysis. The inner loop of mergesort is centered on the auxiliary array. The two for loops involve N iterations (and creating the array takes time proportional to N), so the frequency of execution of the instructions in the inner loop is proportional to the sum of the subarray sizes for all calls to the recursive function. The value of this quantity emerges when we arrange the calls on levels according to their size. On the first level, we have 1 call for size N, on the second level, we have 2 calls for size N/2, on the third level, we have 4 calls for size N/4, and so forth, down to the last level with N/2 calls of size 2. There are precisely lg N levels, each accounting for N instruction executions, giving the grand total N lg N for the frequency of execution of the instructions in the inner loop of mergesort. This analysis justifies a hypothesis that the running time of mergesort is linearithmic.
- Quadratic-linearithmic chasm. The difference between N^2 and N lg N makes a huge difference in practical applications. Understanding the enormousness of this difference is another critical step to understanding the importance of the design and analysis of algorithms. For a great many important computational problems, a speedup from quadratic to linearithmic makes the difference between being able to solve a problem involving a huge amount of data and not being able to effectively address it at all.
- Divide-and-conquer algorithms. The same basic approach is effective for many important problems, as you will learn if you take a course on algorithm design.
- Reduction to sorting. A problem A reduces to a problem B if we can use a solution to B to solve A. Designing a new divide-and-conquer algorithm from scratch is sometimes akin to solving a puzzle that requires some experience and ingenuity, so you may not feel confident that you can do so at first. But it is often the case that a simpler approach is effective: given a new problem that lends itself to a quadratic brute-force solution, ask yourself how you would solve it if the data were sorted in some way. For example, consider the problem of determining whether the elements in an array are all different. This problem reduces to sorting because we can sort the array, then make a linear pass through the sorted array to check whether any entry is equal to the next (if no two adjacent entries are equal, the elements are all different).
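The reduction just described takes only a few lines (a sketch; the class name is ours, and the system sort stands in for mergesort):

```java
import java.util.Arrays;

public class Distinct {
    // Determine whether all elements are different by reducing to sorting:
    // sort a copy, then check adjacent entries in one linear pass.
    public static boolean allDifferent(int[] a) {
        int[] b = a.clone();       // avoid mutating the caller's array
        Arrays.sort(b);            // N log N with the system sort
        for (int i = 1; i < b.length; i++)
            if (b[i] == b[i-1]) return false;   // equal neighbors: a duplicate
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allDifferent(new int[] { 3, 1, 4, 1, 5 }));  // prints false
        System.out.println(allDifferent(new int[] { 2, 7, 1, 8 }));     // prints true
    }
}
```

The sort costs N log N and the pass costs N, versus N^2 for comparing all pairs by brute force.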
Frequency
counts.
FrequencyCount.java reads a sequence of strings from standard input and then
prints a table of the distinct values found and the number of times each was
found, in decreasing order of the frequencies. We accomplish this by two sorts.
- Computing the frequencies. Our first step is to sort the strings on standard input. In this case, we are not so much interested in the fact that the strings are put into sorted order, but in the fact that sorting brings equal strings together. If the input is
to be or not to be to
then
the result of the sort is
be be not or to to to
with
equal strings like the three occurrences of to brought together in the array. Now, with equal strings all
together in the array, we can make a single pass through the array to compute
all the frequencies. The Counter.java
data type that we considered in Section 3.x is the perfect tool for the job.
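The single pass can be sketched as follows. This version is ours and collects word-count pairs as strings rather than using the book's Counter data type:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FrequencyPass {
    // Given a sorted array of words, compute each distinct word's frequency
    // in a single pass (sorting has brought equal words together).
    public static List<String> counts(String[] sorted) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int j = i;
            while (j < sorted.length && sorted[j].equals(sorted[i])) j++;
            result.add(sorted[i] + " " + (j - i));  // word and its frequency
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = { "to", "be", "or", "not", "to", "be", "to" };
        Arrays.sort(words);
        System.out.println(counts(words));  // prints [be 2, not 1, or 1, to 3]
    }
}
```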
- Sorting the frequencies. Next, we sort the Counter objects. We can do so in client code without any special arrangements because Counter implements the Comparable interface.
- Zipf's law. The application highlighted in FrequencyCount is elementary linguistic analysis: which words appear most frequently in a text? A phenomenon known as Zipf's law says that the frequency of the ith most frequent word in a text of M distinct words is proportional to 1/i.