介绍
-
B树是为磁盘或其他直接存取的辅助存储设备而设计的一种平衡搜索树。B树类似于红黑树,但他们在降低磁盘I/O操作数方面要更好一些。
-
一个B树的结点通常和一个完整磁盘页一样大,并且磁盘页的大小限制了一个B树结点可以含有的孩子个数。
-
一个分支因子为1001,高度为2的B树可以存储超过10亿个关键字。
18.1 B树的定义
-
B+树把所有卫星数据都存储在叶结点中,内部结点至存放关键字和孩子指针,因此最大化了内部节点的分支因子。
-
每个叶结点具有相同的深度。
-
B树的最小度数t: 除了根结点以外的结点必须至少有t-1个关键字;每个结点至多可以有2t-1个关键字。当结点恰好有2t-1个关键字时,称该结点是满的。
-
对B树来讲,对数的底可以大很多倍。由于检查任意一个结点都需要一次磁盘访问,因此B树避免了大量的磁盘访问。
-
B*树要求每个内部结点至少2/3满(B树要求至少半满)。
-
练习18.1-1 Why don’t we allow a minimum degree of t = 1?
如果t=1,结点必须至少有0个关键字,所以不行。
- 练习18.1-2 For what values of t is the tree of Figure 18.1 a legal B-tree?
2 or 3.
- 练习18.1-3 Show all legal B-trees of minimum degree 3 that represent {1, 2, 3, 4, 5}
要记得所有叶结点都需要在同一深度。
[1 2 3 4 5]
[3]
[1,2] [4,5]
- 练习18.1-4 As a function of the minimum degree t, what is the maximum number of keys that can be stored in a B-tree of height h?
(2t)^(h+1)-1
- 练习18.1-5 Describe the data structure that would result if each black node in a red-black tree were to absorb its red children, incorporating their children with its own.
After absorbing each red node into its black parent, each black node may contain
1, 2 (1 red child), or 3 (2 red children) keys, and all leaves of the resulting tree
have the same depth.
A red-black tree will become a Btree with minimum degree t=2
i.e., a 2-3-4 tree.
18.2 B树上的基本操作
-
做两个约定: B树的根结点始终在主存中,无需对根做DISK-READ操作;任何被当作参数的结点在被传递前,都要对它们做一次DISK-READ操作。
-
搜索B树:对每个结点,做(x.n+1)路的分支选择。
B-TREE-SEARCH(x, k)
i = 1
while i <= x.n and k > x.keyi
i += 1
if i <= x.n and k == x.keyi
return (x, i)
elif x.leaf
return NIL
else DISK-READ(x, ci)
return B-TREE-SEARCH(x.ci, k)
运行时间为O(t log t n)
- 创建一棵空的B树
B-TREE-CREATE(T)
x = ALLOCATE-NODE()
x.leaf = TRUE
x.n = 0
DISK-WRITE(x)
T.root = x
需要O(1)磁盘操作和O(1)的CPU时间。
-
向B树中插入一个结点:当沿着树往下查找新的关键字所属位置时,分裂沿途遇到的每个满结点。
-
分裂B树中的结点:
B-TREE-SPLIT-CHILD(x, i)
z = ALLOCATE-NODE()
y = x.ci
z.leaf = y.leaf
z.n = t - 1
for j = 1 to t - 1
z.keyj = y.keyj+t
if not y.leaf
for j = 1 to t
z.cj = y.cj+t
y.n = t - 1
for j = x.n + 1 downto i + 1
x.cj+1 = x.cj
x.ci+1 = z
for j = x.n down to i
x.keyj+1 = x.keyj
x.keyi = y.keyj
x.n += 1
DISK-WRITE(y)
DISK-WRITE(z)
DISK-WRITE(x)
最后三行写出所有修改过的磁盘页面
占用CPU时间为Theta(t)
执行O(1)次磁盘操作
- 以沿树单程向下的方式向B树插入关键字
B-TREE-INSERT(T, k)
r = T.root
if r.n == 2t - 1
s = ALLOCATE-NODE()
T.root = s
s.leaf = FALSE
s.n = 0
s.c1 = r
B-TREE-SPLIT-CHILD(s, 1)
B-TREE-INSERT-NONFULL(s, k)
else B-TREE-INSERT-NONFULL(r, k)
需要O(h)次磁盘存取,所需CPU时间为O(th)=O(t log t n)
B-TREE-INSERT-NONFULL(x, k)
i = x.n
if x.leaf
while i >= 1 and k < x.keyi
x.keyi+1 = x.keyi
i -= 1
x.keyi+1 = k
x.n += 1
DISK-WRITE(x)
else while i >= 1 and k < x.keyi
i -= 1
i += 1
DISK-READ(x, ci)
if x.ci.n == 2t - 1
B-TREE-SPLIT-CHILD(x, i)
if k > x.keyi
B-TREE-INSERT-NONFULL(x.ci, k) i += 1
需要O(h)次磁盘存取,占用总CPU时间为O(th) = O(t log t n)
-
因为B-TREE-INSERT-NONFULL是尾递归的,所以它也可以用一个while循环来实现。从而说明在任何时刻,需要留在主存中的页面数为O(1).
-
练习18.2-1 Show the results of inserting the keys F, S, Q, K, C, L, H, T, V, W, M, R, N, P, A, B, X, Y, D, Z, E in order into an empty B-tree with minimum degree 2. Only draw the configurations of the tree just before some node must split, and also draw the final configuration.
It's easy.
- 练习18.2-2 Explain under what circumstances, if any, redundant DISK-READ or DISK-WRITE operations are performed during the course of executing a call to B-TREE-INSERT. (A redundant DISK-READ is a DISK-READ for a page that is already in memory. A redundant DISK-WRITE writes to disk a page of information that is identical to what is already stored there.)
这是不会出现的= =
- 练习18.2-3 Explain how to find the minimum key stored in a B-tree and how to find the predecessor of a given key stored in a B-tree.
find the left most leaf
FIND-PREDECESSOR
if x is not a leaf
return the maximum key in the i-th child of x
if x is a leaf
return the (i-1)th key of x
otherwise
look for the last node y (from the bottom up) and j > 0, such that x.keyi is the
leftmost key in y.cj; if j = 1, return NIL since x.keyi is the minimum key in
the tree; otherwise we return y.keyj–1.
- 练习18.2-4 Suppose that the keys {1, 2, …, n} are inserted into an empty B-tree with minimum degree 2. How many nodes does the final B-tree have?
n - 2lg(n+1)
1 1
2 1
3 1
4 3
5 3
6 4
7 4
8 5
9 7
10 8
......
- 练习18.2-5 Since leaf nodes require no pointers to children, they could conceivably use a different (larger) t value than internal nodes for the same disk page size. Show how to modify the procedures for creating and inserting into a B-tree to handle this variation.
We could set the new t(name it t') value of leaf node = 1.5t.
- 练习18.2-6 假设B-TREE-SEARCH的实现是在每个结点用二分查找,证明无论怎样选择t,这种实现所需的CPU时间都是O(lgn).
O(h) = log t n
O(lg t * log t n) = O(lg n)
- 练习18.2-7 Suppose that disk hardware allows us to choose the size of a disk page arbitrarily, but that the time it takes to read the disk page is a + bt, where a and b are specified constants and t is the minimum degree for a B-tree using pages of the selected size. Describe how to choose t so as to minimize (approximately) the B-tree search time. Suggest an optimal value of t for the case in which a = 5 milliseconds and b = 10 microseconds.
要最小化 (a+bt)*t*(log t n)
18.3 从B树中删除关键字
-
要防止因删除操作而导致树的结构违反B树性质。
-
删除操作只需要O(h)次磁盘操作,需要CPU时间为O(th) = O(t log t n)
-
练习18.3-1 Show the results of deleting C, P, and V , in order, from the tree of Figure 18.8(f).
LPTX
AEJK NO QRS UV YZ
LQTX
AEJK NO RS UV YZ
LQX
AEJK NO RSTU YZ
- 练习18.3-2 Write pseudocode for B-TREE-DELETE.
https://github.com/gzc/CLRS/blob/master/C18-B-Trees/btree.cpp
思考题
- 18-1 辅存上的栈