Chapter 8 Sorting and Searching

@smilence 2020-03-12T20:41:51.000000Z 字数 8105 阅读 5030

"The Rules"

1.Common Sorting Algorithms
a.Merge Sort
Runtime: $O(nlogn)$ for average and worst case.Memory: $O(n)$
典型的D&C策略算法。将线性数据结构（如array, vector或list )分成两部分，假定两部分都可以得到解决，合并两部分。
由于Merge Sort不依赖于随机读写，因此具有很强的普适性，比如适合用于sort Linked List结构。且与heapsort及部分改进型quick sort相区别，merge sort一般是stable sort。

def mergeSort(arr: List[int]) -> None:
    tmp = [0] * len(arr)
    def impl(start: int, end: int) -> None:
        nonlocal arr
        nonlocal tmp
        if start >= end:
            return
        mid = start + (end - start) // 2
        # sort left
        impl(start, mid)
        # sort right
        impl(mid+1, end)
        for i in range(start, end+1):
            tmp[i] = arr[i]
        # merge left and right
        l, r, i = start, mid + 1, start
        while l <= mid and (r <= end):
            if tmp[l] <= tmp[r]:
                arr[i] = tmp[l]
                l += 1
            else:
                arr[i] = tmp[r]
                r += 1
            i += 1
        while l <= mid:
            arr[i] = tmp[l]
            i += 1
            l += 1
    impl(0, len(arr) - 1)

b.Quick Sort
Runtime： $O(nlogn)$ average, $O(n^2)$ worst case.Memory: $O(logn)$
典型的D&C策略算法，将数组用pivot值分成两部分，然后默认这两部分会自己解决sorting问题。

void qSort( int arr[], int left, int right ){
    if( left >= right )
        return;
    int index = partition( arr, left, right);
    qSort( arr, left, index - 1 );
    qSort( arr, index,right );
}
int partition( arr, left ,right ){
    int pivot = arr[( left + right )/2];
    while( left <= right ){
        while( arr[left] < pivot ) left++;
        while( arr[right] > pivot ) right--;
        if( left <= right )
            swap( arr, left++, right-- );
    }
    return left;
}

C++ STL提供API:
sort( iterator first, iterator last, Compare comp ); // can aslo be pointers here;

c.Heap Sort
Runtime： $O(nlogn)$ in all cases.Memory: $O(1)$ （除Quick Sort之外，另一个in place的排序算法）

vector<int> v;
std::make_heap (v.begin(),v.end());
int max = v.front();    // get the max;
std::pop_heap (v.begin(),v.end()); v.pop_back();  // pop the max;
std::sort_heap (v.begin(),v.end());

e.g. Merge k sorted linked lists and return it as one sorted list. (Leetcode: Merge k Sorted Lists);

d.Bin Sort, Bucket Sort 与 Radix Sort
Bin Sort： 利用table，将key的值直接作为index，将元素存入"bin"中。

for( i = 0; i < n; i++ )
    B[ getKey( A[i] ) ].append( A[i] );

Bucket Sort : 与Bin Sort类似，区别是将key在某一范围内的元素都存入一个"bin"中，再寻求后期处理。

Radix Sort : 如果元素的key可以看做一个有限位进制数（如string, vector<int>,int），那么可以按照从低位到高位（适用数值序）或高位到低位（适用词典序），根据key每一位的值进行Bin Sort，在维持其相对顺序的情况下进一步排序，直到遍历所有位。Runtime： $O(kN)$ in all cases.Memory: $O(N + k)$

相比其他Sorting方式，Bucket Sort与Radix Sort几乎不需要交换与比较。

2.Quick Select Algorithm
在无序数组内，在 $O(n)$ 时间内找到 k-th smallest ( 或 k-smallest ) item(s) 的方法(average case)，但局部无序。基本算法与Quick Sort一致，核心在于partition。

int quick_select(int *array, int left, int right, int k)
{
    if ( left == right ) return input[left];
    int index = partition(array, left, right);
    int size = index - left + 1;
    if ( size == k ) return index;  // the current index of the median
    else if ( size > k ) return quick_select(input, left, index - 1, k);
    else  return quick_select(input, index + 1, right , k - size);
}
int partition( arr, left ,right );

e.g.1 Compute the k-th largest element in an array A that runs in $O(n)$ expected time.( EPI: 11.13)
e.g.2 There are n points on a 2D plan, find the k points that are closest to origin ( x= 0, y= 0). (Facebook Interview)

3.External Sort
D&C策略。Divide the file into chunks, sort each chunk, and merge them.

Two options for merging:

a.Merge every two chunks, until there is only one file.

b.Form a min heap from the first element of all the k arrays. Extract the min element. It is the min element among all the arrays. Now take the next element from the array from which the min element has come, and put it in the min heap. Repeat the procedure.

4.Binary Search
D&C策略的搜索方式。对于已排序的有序container，如果不改变数据结构，Binary search几乎总是最优的检索方式，即将数据分割为两部分，再根据当前节点的分析进行剪枝。在使用binary search时，总是应该检查逼近方式与逼近条件的关系：

求解给定值的一个位置，则可以用 low<=high 作为搜索条件，根据A[mid] 与目标值的关系取 low = mid +1或high = mid - 1逼近; 如果没有找到给定值，那么给定值的index最终一定在[high, low]之间 (high一定小于low), 注意区间可能会超出边界比如high < 0或者low >= length.

e.g.1 Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. (LeetCode: Search Insert Position)

e.g.2 Implement int sqrt(int x). (LeetCode: Sqrt(x))

对于求解问题（无论是给定值还是范围）的下界的情况，应该在不符合条件时取low = mid+1，对应地，mid = low + (high-low)/2以保证每次循环都是“逼近”的；求解问题上界，在不符合条件时取high = mid-1，对应地，mid = high - (high-low)/2以保证“逼近”。在求上下界的情形下，都应该以 low<high 为搜索条件，结束时 low==high 一定成立。

对于局部有序的数据，也可以根据其有序局部的特性，尽可能地逼近、剪枝，使用变种Binary Search进行检索。

e.g.1 A magic index is defined to be an index such that A[i] = i.Find a magic index in a sorted array. ( CtCI 9.3 )
e.g.2 Find an element in a rotated array( was originally sorted).( CtCI 11.3， Leetcode: Search in Rotated Sorted Array II）
e.g.3 Given a sorted array of integers, find the starting and ending position of a given target value.( Leetcode: Search for a range )

e.g.4 Implement a function which takes as input a float varaible x and return its square root.( EPI: 11.9)

// 0 means equal, -1 means smaller, and 1 means larger.
int compare(double a, double b) {
  // Use normalization for precision problem.
  double diff = (a - b) / b;
  return diff < -numeric_limits<double>::epsilon()
             ? -1
             : diff > numeric_limits<double>::epsilon();
}
double square_root(double x) {
  // Decide the search range according to x.
  double l, r;
  if (compare(x, 1.0) < 0) {  // x < 1.0.
    l = x, r = 1.0;
  } else {  // x >= 1.0.
    l = 1.0, r = x;
  }
  // Keep searching if l < r.
  while (compare(l, r) == -1) {
    double m = l + 0.5 * (r - l);
    double square_m = m * m;
    if (compare(square_m, x) == 0) {
      return m;
    } else if (compare(square_m, x) == 1) {
      r = m;
    } else {
      l = m;
    }
  }
  return l;
}

e.g.5 There are two sorted arrays A and B of size m and n respectively. Find the median of the two sorted arrays. The overall run time complexity should be $O(log (m+n))$ . (Leetcode: Median of Two Sorted Arrays)

5. Distributed Computing
"Distributed systems are groups of networked computers, which have the same goal for their work."
In parallel computing, all processors may have access to a shared memory to exchange information between processors.
In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.
http://en.wikipedia.org/wiki/Distributed_computing

模式识别

1.动态数据结构的维护
BST 可以看一个动态维护的一个sorted array。建立BST（对应std::map )通常可以满足有序要求，将检索对象作为TreeNode的key，将检索内容作为TreeNode的val。利用BST维护的稳定性，BST的节点可以方便地记录其子树的信息，并在节点间传递全局信息，方便检索。

Heap结构能够很方便地维护动态数据（stream）的最大值、最小值或中位数。 priority_queue相比简单的queue适用于需要优先级的情境。与queue类似的快速push和提取，但不方便update和检索（部分有序）。
Linked List: 快速的update、插入和删除，但不方便检索。可以注意到Linked List同时也可以是queue的底层实现方式，因此适合使用queue的情境，一般也可以使用更为强大的list。
Table: 更快速的检索（ $O(1)$ )，如unordered_map以及array， set，bitset<int>。

LRU Cache: 对时间敏感的动态问题，往往可以采用这种强大的数据结构。用Linked List存储，用Hashtable辅助检索，节点按照插入或更新顺序排列，支持快速的update、删除与检索。可以想象成一个可以随时提取或加入信息并保持全局时间有序的"Cache Pool"。在JAVA语言中，用LinkedHashMap实现。http://stackoverflow.com/questions/2504178/lru-cache-design/

e.g.1 Design an algorithm for computing the k-th largest element in a sequence of elements, one presented at a time.(EPI: 11.14)
e.g.2 Implement data structures to look up the rank of a number x( the number of values less than or equal to x) (CtCI 11.8)
e.g.3 If you were designing a web crawler, how would you avoid getting into infinite loops.( CtCI 10.5 )
e.g.4 Design a caching mechanism for the most recent queires of a search engine.( CtCI 10.7 )
e.g.5 Given a server that has requests coming in. Design a data structure such that you can fetch the count of the number requests in the last second, minute and hour.( Google interview )
http://stackoverflow.com/questions/18364490/implementation-of-a-hits-in-last-second-minute-hour-data-structure

e.g.6 N nodes, each node consists of a couple fields and methods. These are:

int id; //every node has an ID. All of these IDs are sequential, and begin with 0. 
int val; //every node has a value  
int max; //max = N. Every node knows how many nodes are in the system.  
void send(int idTo, int payload);  
int recv(int idFrom);

Write a single piece of code which runs on every node simultaneously, such that when it is finished running every node in the system knows the sum of the values of all the nodes in the system.( Ebay interview )

int getSum(){    //as member function of the node
    int parent = ( id - 1)/2;
    int left = 2*id + 1;
    int right = 2*id + 2;
    // DFS
    int subsum = val;
    if( right < max ) subsum += recv(right);
    if( left < max ) subsum += recv(left);
    if(parent >= 0 ) send( parent, subsum);
    int sum = 0;
    if(parent >= 0) sum = recv( parent);
    else sum = subsum;
    send( left, sum);
    send(right, sum );
    return sum;
}

2. Key的范围有限（或大量重复）的排序问题，或者只要求Partial Order的排序问题，一般可以使用Bucket Sort，将Key值直接作为其index存入table。对于有限节点的key（如string, vector<int>,int），可以利用Radix Sort进行数值序或词典序排序。

e.g.1 Sort a large number of people by their ages.
e.g.2 Sort an array of strings that all the anagrams are next to each other.( CtCI 11.2 )

3.Scalability & Memory Limits 问题
对这类问题一般采用D&C策略，即对问题进行预处理，将问题的输入进行分割、归类（sorting），放入相应的Bucket（单机上的某一块Chunk，或者分布式系统中的一台单机），再对每个Bucket进行后期处理，最后合并结果。
整个过程中应该用hash提供全局lookup办法，对于Memory Limits问题一般可以使用简单的hash function；对Scalability问题一般可以用hash table来记录输入对象的key与machine之间的映射。

e.g.1 Find all documents that contain a list of words ( CtCI Chapter 10 Example )
e.g.2 Design the data structures for a very large social network and an algorithm to show the connection between two people.( CtCI 10.2)
e.g.3 Given an input file with one billion distinct non-negative integers, provide an algorithm to generate an integer not contained in the file.Assume you have only 10MB of Memory.(CtCI 10.3)
e.g.4 You have 10 billion URLS.How do you detect the duplicate ones?(CtCI 10.6)