CLRS9 中位数和顺序统计量

2017-05-13 | 阅读：次

介绍

第i个顺序统计量是集合中第i小的元素。

9.1 最小值和最大值

MINIMUM(A)
    min = A[1]
    for i = 2 to A.length
        if min > A[i]
            min = A[i]
    return min

需要n-1次比较。

MINIMUM-AND-MAXIMUM(A)
    if len(A) % 2 == 0
        if A[1] > A[2]
            min = A[2]
            max = A[1]
            i = 3
        else
            min = A[1]
            max = A[2]
            i = 2
    else
        min = max = A[1]
    while i < A.length
        if A[i] < A[i+1]
            f = A[i]
            s = A[i+1]
        else
            f = A[i+1]
            s = A[i]
        if f < min
            min = f
        if s > max
            max = s

这个算法最多进行3*(n//2)次比较。

练习9.1-1 不是很懂 Show that the second smallest of n elements can be found with n + floor(lgn) - 2 comparisons in the worst case. (Hint: Also find the smallest element.)

#include<iostream>
#include<queue>
using namespace std;

struct Node
{
    int v;
    Node *left;
    Node *right;
    bool dir;     //0 means left, 1 means right
    Node() : v(-1), left(nullptr), right(nullptr) {}
    Node(int _v, bool d): v(_v), left(nullptr), right(nullptr), dir(d){}
};


int second_smallest_element(int arr[], int n)
{
    queue<Node*> q;
    for (int i = 0; i < n; i++)
    {
        Node *new_node = new Node(arr[i], 0);
        q.push(new_node);
    }

    Node *root(nullptr);
    while(!q.empty())
    {
        int size = q.size();
        if (size == 1)
        {
            root = q.front();
            break;
        }
        for (int i = 0; i < size; i += 1)
        {
            if (i == size - 1)
            {
                Node *n1 = q.front(); q.pop();
                q.push(n1);
                break;
            }
            else
            {
                Node *n1 = q.front(); q.pop();
                Node *n2 = q.front(); q.pop();
                int smaller = 0;
                bool dir;
                if (n1 -> v <= n2 -> v)
                {
                    smaller = n1 -> v;
                    dir = false;
                }
                else 
                {
                    smaller = n2 -> v;
                    dir = true;
                }

                Node *new_node = new Node(smaller, dir);
                new_node->left = n1;
                new_node->right = n2;
                q.push(new_node);
            }
        }
    }

    int minimum = root -> v;
    int result(INT_MAX);
    while(root)
    {
        if (root->left && root->right)
        {
            result = min((root->left-v ^ root->right->v ^ minimum), result);
            root = root -> dir == true > root->right: root->left;
        }
        else break;
    }
    return result;
}

First we build a BST from bottom to top，each node contains the smaller one of a pair, we need n-1 comparisions to build BST. The top of this BST is minimum, then the second smallest must be in the path to generate the top. Because the second smallest can only be defeated by the smallest one. Because the height of BST does not exceed ceil(lgn) so we need ceil(lgn) - 1 comparisions.The total times is 
n - 1 + ceil(lgn) - 1 = n + ceil(lgn) - 2

练习9.1-2 Show that 3n//2-2 comparisons are necessary in the worst case to find both the maximum and minimum of n numbers. (Hint: Consider how many numbers are potentially either the maximum or minimum, and investigate how a comparison affects these counts.)

如果n是偶数，那么需要3n/2-2
如果n是奇数，需要3(n//2)
题中所给为相同的形式。

9.2 期望为线性时间的选择算法

RANDOMIZED-SELECT(A, p, r, i)
    if p == r
        return A[p]
    q = RANDOMIZED-PARTITION(A, p, r)
    k = q - p + 1
    if i == k
        return A[q]
    elif i < k
        return RANDOMIZED-SELECT(A, p, q-1, i)
    else return RANDOMIZED-SELECT(A, q+1, r, i-l)

最坏运行时间为θ(n^2),期望运行时间为θ(n)
练习9.2-1 Show that in RANDOMIZED-SELECT, no recursive call is ever made to a 0-length array.

如果划分左侧为空，i >= k
如果划分右侧为空，i < k
总是会调用非空的一侧。
两侧均空，则i == k

练习9.2-2 Argue that the indicator random variable image and the value T(max(k - 1, n - k)) are independent.

Xk的取值对T(max(k-1,n-k))没有影响。

练习9.2-3 Write an iterative version of RANDOMIZED-SELECT.

RANDOMIZED-SELECT(A, p, r, i)
    while true
        if p == r
            return A[p]
        q = RANDOMIZED-PARTITION(A, p, r)
        k = q - p + 1
        if i == k
            return A[q]
        elif i < k
            r = q - 1
        else
            p = q + 1

练习9.2-4 Suppose we use RANDOMIZED-SELECT to select the minimum element of the array A = {3, 2, 9, 0, 7, 5, 4, 8, 6, 1}. Describe a sequence of partitions that results in a worst-case performance of RANDOMIZED-SELECT.

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8   9
0 1 2 3 4 5 6 7   8 
0 1 2 3 4 5 6   7
0 1 2 3 4 5   6
0 1 2 3 4   5
0 1 2 3   4 
0 1 2   3
0 1   2
0   1
return 0

9.3 最坏情况为线性时间的选择算法

这个方法蛮巧妙的
练习9.3-1 In the algorithm SELECT, the input elements are divided into groups of 5. Will the algorithm work in linear time if they are divided into groups of 7? Argue that SELECT does not run in linear time if groups of 3 are used.

对每组是7，就像书中每组是5一样证
对每组是3，可以证T(n) > cn 对所有c成立
    T(n) = T(n/3) + T(2n/3+4) + O(n)
         > cn/3 + 2cn/3 + 2c + an
         = cn + 2c + an
         > cn
         = ω(n)

练习9.3-2 Analyze SELECT to show that if n ≥ 140, then at least ceil(n/4) elements are greater than the median-of-medians x and at least ceil(n/4) elements are less than x.

3n/10 - 6 >= ceil(n/4)
n >= 120

练习9.3-3 Show how quicksort can be made to run in O(nlgn) time in the worst case.

用线性时间找中位数，然后用中位数做Partition.

练习9.3-4 Suppose that an algorithm uses only comparisons to find the ith smallest element in a set of n elements. Show that it can also find the i - 1 smaller elements and the n - i larger elements without performing any additional comparisons.

给集合加上负无穷
所有元素取负

练习9.3-5 Suppose that you have a “black-box” worst-case linear-time median subroutine. Give a simple, linear-time algorithm that solves the selection problem for an arbitrary order statistic.

可以通过增加很小的值和很大的值的办法
把要选择的顺序统计量变成集合中位数

练习9.3-6 The kth quantiles of an n-element set are the k - 1 order statistics that divide the sorted set into k equal-sized sets (to within 1). Give an O(n lg k)-time algorithm to list the kth quantiles of a set.

QUANTILES(A, k, Q)
    if k == 1
        return
    else
        n = A.length
        i = k // 2
        x = SELECT(A, floor(i * n / k))
        PARTITION(A, x)
        Q.append(QUANTILES(A[1..floor(i*n/k)], k//2, Q))
        Q.append(QUANTILES(A[floor(i*n/k)+1..n], ceil(k/2), Q))
        return x

练习9.3-7 Describe an O(n)-time algorithm that, given a set S of n distinct numbers and a positive integer k ≤ n, determines the k numbers in S that are closest to the median of S.

1. 找出中位数
2. 所有数减去中位数再取绝对值
3. SELECT第k小的数字y
4. 遍历数组寻找小于等于y的数字

练习9.3-8 Let [1 .. n] and Y [1 .. n] be two arrays, each containing n numbers already in sorted order. Give an O(lg n)-time algorithm to find the median of all 2n elements in arrays X and Y.

中位数总是在两个小数组中位数之间。

def two_array_median(a, b):
    if len(a) == 1:
        return min(a[0], b[0])

    m = len(a) // 2 - 1 if len(a) % 2 else len(a) // 2
    i = m + 1
    if a[m] < b[m]:
        return two_array_median(a[-i:], b[:i])
    else:
        return two_array_median(a[:i], b[-i:])

练习9.3-9 Professor Olay is consulting for an oil company, which is planning a large pipeline running east to west through an oil field of n wells. From each well, a spur pipeline is to be connected directly to the main pipeline along a shortest path (either north or south), as shown in Figure 9.2. Given x- and y-coordinates of the wells, how should the professor pick the optimal location of the main pipeline (the one that minimizes the total length of the spurs)? Show that the optimal location can be determined in linear time.

We just find the median of the y coordinates. The x coordinates are irrelevant.

Let's assume that n is odd. There are n//2 south of the median and the same amount of wells north of the median. Let the pipeline pass through the median. We shall reason about why this location is optimal.

Suppose we move the pipeline one meter north. This reduces the total pipe length with n//2 meters for the pipes north of the median, but adds another n//2 for the pipes south of median, including the median itself. The more we move north, the more the total pipe length will increase. The same reasoning holds if we move the main pipeline south.

If n is even, then any location between the lower and upper median is optimal.

思考题

9-1 有序序列中的i个最大数

a. 对输入数据排序，需要nlgn时间，找出前i个最大数，需要i时间，总时间是O(nlgn)

def find1(lst, i):
    lst.sort()
    return lst[Li]

b. 对输入数据建立一个最大优先队列，并调用EXTRACT-MAX过程i次
    建立优先队列需要O(nlgn)时间，EXTRACT-MAX调用i次需要O(ilgn)
    总时间是O(nlgn)

def parent(i):
    return i//2

def left(i):
    return 2 * i

def right(i):
    return 2 * i + 1

def max_heapify(A, i):
    l = left(i)
    r = right(i)
    if l < len(A) and A[l] > A[i]:
        largest = l
    else: largest = i
    if r < len(A) and A[r] > A[largest]:
        largest = r
    if largest != i:
        A[largest], A[i] = A[i], A[largest]
        max_heapify(A, largest)

def build_max_heap(A):
    for i in range (len(A)//2, -1, -1):
        max_heapify(A, i)
    print(A)

def extract_max(A):
    largest, A[0] = A[0], A.pop()
    max_heapify(A, 0)
    return largest 

def find2(lst, i):
    build_max_heap(lst)
    res = []
    for i in range(i):
        res.append(extract_max(lst))
    return res

c. 利用一个顺序统计量算法找到第i大的元素，
    然后用它作为主元划分输入数组，再对前i个排序

我在这里用期望时间是n的顺序统计量算法(应该用O(n)的，但是我懒)


from random import randint

def randomized_partition(A, p, r):
    rand = randint(p, r)
    A[r], A[rand] = A[rand], A[r]
    i = p - 1
    pivot = A[r]
    for k in range(p, r):
        if A[k] >= pivot:
            i += 1
            A[k], A[i] = A[i], A[k]
    A[i+1], A[r] = A[r], A[i+1]
    return i + 1

def find(A, i):
    q, p, r = 0, 0, len(A) - 1
    while q != i-1:
        q = randomized_partition(A, p, r)
        if q > i-1: r = q - 1
        if q < i-1:
            p = q + 1
            i -= (q-p+1)
            
    return sorted(A[:i], reverse=True)
    

需要用O(n+ilgi)

9-2 带权中位数

a.
    如果n是奇数，那么左右都小于0.5
    如果n是偶数，那么左边小于0.5，右边等于0.5

b.
    按权排序，然后从第一个元素开始累加，直到0.5

c.
    下次做

d.
    如果在带权中位数的基础往右移动x，则有超过1/2的比重会增加d，小于1/2大地比重减少d
    同理，往左移，有超过1/2的比重会增加d，小于1/2大地比重减少d。

e.
    找一个点，纵坐标是所有y的带权中位数，横坐标是所有x的带权中位数。

9-3 Small order statistics

a. 
    如果i >= n/2, 直接调用select
    否则，两个两个比较生成小的一组，同时把大的一组保存起来，然后对小的一组递归调用U,
    找到小的一组中的第i个数，再找到这i个数对应的另一组的数，一共2i个，然后对这2i个调用
    SELECT

bcd.
    都可以推出来.