Konstantin V. Shvachko: S-trees

Balanced Trees.

Balanced trees are traditionally used for indexing linearly ordered data, - a task also known as the problem of maintaining dynamic dictionaries. A number of balance conditions have been developed over time.

The first example of a balanced tree, an AVL-tree, was introduced in 1962 by G.M.Adelson-Velskii and E.M.Landis. In 1970 J.Hopcroft developed 2-3-trees. A generalization of the latter, called B-trees, was proposed by R.Bayer and E.McCreight in 1971. A number of variants of the classical B-trees have been examined in literature by now. B*-trees, B+-trees, symmetric B-trees, and (a,b)-trees are among them.

The history of balanced trees.

1962 AVL-tree G.M.Adelson-Velskii and E.M.Landis
1970 2-3-tree J.Hopcroft
1971 B-tree R.Bayer, E.McCreight
1971 red-black-tree R.Bayer
B*-tree, B+tree, (a,b)-tree
1992 S(1)-tree utilization 1/2 - ε
1994 S(2)-tree utilization 2/3 - ε
1995 S(b)-tree utilization 1 - ε

Various balance conditions used in balanced trees are intended to provide fast logarithmic time access to the data.

Utilization characterizes the space overhead of storing data in a particular type of balanced trees. Utilization is defined as the ratio of the size of the original data to the space used to represent the data set as a balanced tree.

The main disadvantage of B-trees is that they provide poor utilization, when used to store variable length keys, e.g. strings. If the average length of keys stored in a B-tree is substantially smaller than the key length maximum, then it is not possible to guarantee any lower bound for utilization of B-trees greater than 0, even though that the number of elements in the tree is always more than a half of its capacity.

S(b)-tree is a generalization of B-trees.

A new type of balanced trees, called S(b)-trees, was developed in order to efficiently maintain variable length keys. Contrary to B-trees, S(b)-trees provide optimal utilization of keys of variable length, while the data access time remains logarithmic, the same as for B-trees. The difference with B-trees is that a node capacity for S(b)-trees is evaluated by the total length of keys stored in the node, rather than by their number as in B-trees. The balance condition for S(b)-trees is based on their local incompressibility. That is, any sequence consisting of b+1 neighboring nodes of the tree cannot be compressed into b well-formed nodes.

This is the summary of results proven for S(b)-trees:

S(b)-tree is a generalization of B-trees: a B-tree is a special case of a S(0)-tree.
The lower bound of utilization for S(b)-trees is 1 - ε, where ε is inversely proportional to the tree branching.
Search, insertion, and deletion of a key in a n-node S(b)-tree can be performed in time O(log n).
S(b)-trees form a shrinking hierarchy by parameter b. That is if b < b' then the class of S(b')-trees is strictly contained in the class of S(b)-trees.

See my old site, which describes an implementation of a S(b)-tree library.