Introduction

Sometime applications require us to maintain a collection of elements as collection of disjoint sets. The most important operations that would be needed of such a data structure would be querying if an element belongs to particular set and merging two sets. The thing that gets me excited about this data structure is that almost all operations run in linear time(woo hoo!!).

Applications

The textbook cites the example of connected components as a use of disjoint sets. The connected components problem requires us to identify if a two given nodes in a graph are connected. This can be extended to the practical world, for example consider the case of finding if we have a path between two cities or perhaps in a social network we can use this to find out really quickly if one individual can know to someone else very quickly. This can be used in Kruskal's algorithms to identify minimum spanning trees that have quite a few interesting real world uses, for example finding the least cost and time required to fly from one point to another using a number of intermediate airports.

The disjoint set data structure is a collection S where S = {S1, S2..., Sk} of disjoint dynamic sets. Each set is represented by an element of the set, that might selected based on some pre-specified rule or arbitrarily. This element is known called the representative.

Operations

The operations that this data structure would have to implement are

  • Make-Set(x) Creates a disjoint set whose only member and representative is x. This runs in O(1).
  • Union(x, y) Performs a union of two disjoint sets Sx and Sy.
  • Find-Set(x) Returns a pointer to a set to the representative of the set containing x. This runs in O(1).

This data structure can be represented using linked lists. The figure below shows how this is done. Make-Set and Find-Set are pretty efficient in almost all cases. The only operation that we really have to look at is Union.

* A simple implementation of Union

The most naive implementation of Union would consist of appending one list to the end of another and updating all the head pointers. However we can contrive a set of operations that make the performance for a n-1 Unions run at about Θ(n2) operations.

* Weighted union heuristic

The idea here is to append a shorter list to a longer list. So as a result this would require us to maintain information related to the length of the set in the list. It can be shown that the amortized performance of this data structure over a sequence of Make-Set, Find-Set and Unions would be O(m + n lg n).

Disjoint set forests

We represent faster implementations of disjoint sets as rooted trees where each tree represents one set. In a disjoint set forest each member points only to its parent. The root of each tree contains the representative and is its own parent.

The operations on the trees are carried out as follows.

  • Make-Set(x): Creates a tree with x as the only element and the representative. This runs in O(1).
  • Union(x, y) Performs a union of two disjoint sets Sx and Sy by making the representative of one tree point to the root of another.
  • Find-Set(x) Returns a pointer to a set to the representative of the set containing x. This is done by following the parent pointers of the nodes till we reach the root of the tree.

This good this with this representation is that we can get really fast times on the union operations. The bad thing is that if the tree turns out to be arranged linearly we get screwed quite badly on the time required to perform the query operation. Fortunately for us we have two heuristics that get the running time down to O(m) where m is the number of operations that we perform on the tree.

  • Union by rank: This heuristic is similar to the weighted union heuristic we know and love. It entails making the root of the tree with lower height point to the tree who has a higher height. This height information being stored in the tree is called rank.
  • Path Compression: The idea here pretty simple, as you go about looking for the parent of a particular node, make all the elements along the path point to the root of the tree. This can be written using a very nice recursive procedure.

The cool thing about this is that with these two heuristics the Worst case running time is O(m α(n)), where α(n) grows really, really slowly.