Introduction to C Programming - Part 10

by Björn Fahller

Introduction
After last month's digression with a linked list (have you tried writing a doubly linked list?) we'll now create the intended data structure, a binary search tree. At the end, you'll also find a link to a .zip file with the full sources for the word frequency histogram. The sources also come with an alternative tree implementation called a splay tree, which for real-world data is usually more efficient than the plain binary search tree. The binary search tree is easy to describe, though, so that's what we'll go into in this article.

By the way, this article is long. Most of the contents have been covered before, though, so I don't think it'll be that hard to get through.

A tree
The list we saw last month is a flat, or one-dimensional, data structure. A tree builds a hierarchy. Every node (i.e. element in the tree) has two or more pointers to nodes below it (called children). A node with all child pointers equal to NULL is a leaf. The number of child pointers in a node is called the arity of the tree. If the arity is two, we have a binary tree. Here's a graphical view of a binary tree:

            4
          /   \
        2       6
       / \     / \
      1   3   5   7

If you haven't found out by now, we computer types do everything upside down, even trees, with the root on top and leaves at the bottom. You might as well get used to this odd representation right away, because this is what tree structures look like in every computer science and software engineering book I have seen.

OK, that's what trees look like. How are they encoded in C? Very much like last month's list. Instead of one pointer "pNext" we now introduce two pointers, "pLeft" and "pRight".

Here's a demo program (ignoring error handling and memory deallocation) that programmatically creates the tree displayed in the image above.

For a tree to be a binary search tree, there needs to be an ordering relation between the nodes. All nodes under the "pLeft" child of a node are, in some sense, "less" than the node, and those under "pRight" are "greater". In the example program above, I coded this relation by means of the nodes' values. This makes searching easy and fast.

To look for a node with a value, compare the value of the root with your value. If your value is less than that of the root, then go to the left, otherwise go to the right. Then go on comparing, until you either find a leaf, or a node with the value you're looking for. If you found a leaf, which wasn't the value you were looking for, the value wasn't in the tree.

Inserting a new value into a tree is done by first searching for a node with that value. If it is already there, nothing more needs to be done. If it is not, we end up in a leaf, in which case we create a new node under the leaf (to the right or to the left, depending on whether our value is greater or smaller than that of the leaf).

Reason for trees
Why would we want to use sorted trees? Isn't the list from last month sufficient?

The answer is execution speed, at least for large numbers of elements. Assume that we have two sorted collections with the same data, one being a (linked) list, and one a perfectly balanced binary search tree. The number of elements in both is 1000.

For searching the list, an average of 500 nodes must be checked. [Because about half the time you'll find it before you have searched half the nodes, and about half the time you'll find it after searching more than half the nodes. Remember to search in a linked list you have to look at each one in turn, in the order they are in the list. Ed.]

For searching the (balanced) tree, a maximum of 11 nodes must be checked. As the number of elements increases, so does the gain from using a tree structure over a list.

How can I claim that the maximum number of nodes to check in the tree would be 11? Well, a perfectly balanced tree is one in which the only nodes that have just one child are those whose child is a leaf. All others have two children or are leaves. [This means that there are (almost) equal numbers of nodes below a node, to its left and to its right. Try drawing it on a bit of paper if you aren't sure. Ed.]

If we look at such a tree as a set of levels, where the root is the top level, we can see that every level (except possibly the bottom one) has twice as many nodes as the one above it. A tree would then have this characteristic:

Level   Nodes on level   Nodes in total
  1            1                1
  2            2                3
  3            4                7
  4            8               15
  5           16               31
  6           32               63
  7           64              127
  8          128              255
  9          256              511
 10          512             1023
 11         1024             2047

We see clearly that a tree containing 1000 nodes fits within 11 levels, and since in searching we only compare the value with one node per level, the maximum is 11.

A word frequency tree
Returning to our word frequency histogram, our tree should contain words as values. For the tree to be usable outside of this program, the data in it should be a generic kind of user data, just as was the case with the wordfile (lesson 8) and the function defining what a word is. We should also allow the user to define the comparison function, and thus decide the sort order; otherwise we'd have to implement several different trees where just a detail in the comparison function, like case sensitivity, differs. As a cheat, I will not implement deletion of nodes.

Semantics:
OK, time to yet again have a look at the semantics of our new abstract data type, the word tree ADT.

Tree creation:

When creating the tree, the user must define a comparison function. It is an error not to pass one. The comparison function must match this type:

int (*wordtree_comparison)(const char* word1,
                           const char* word2);

The return value from the function is interpreted as:
 * 0 : the words are equal.
 * <0 : word1 is less than word2.
 * >0 : word1 is greater than word2.

Insertion of a word:

The insertion function accepts a valid wordtree pointer and a word (const char*).

The word must be copied onto the heap (with malloc), since we cannot keep the original pointer passed; after all, it may refer to a local variable that the caller overwrites the moment after.

The insertion function returns a pointer to the userdata pointer associated with the word, or NULL if insertion for some reason fails. If the word was not in the tree before, the userdata pointer, pointed to by the return value, is NULL and can be set up by the caller.

This is one indirection more than usual, but look at it this way: inside our implementation there will be a struct representing a node. If the insert function creates a new node, it creates such a struct on the heap, reached through a pointer, say "pNode". It then initialises the struct so that the "user" part is NULL, and returns the address of the user part, like this:

pNode->user = NULL;
return &(pNode->user);

Since "user" is a "void*", the address of "user" is a pointer to "user", which then has the type "void**", i.e. a pointer to a void*.

If the insert function finds a node with a word identical to ours, it does not create a new node, but instead returns the address of the "user" part of the already initialised node struct.

Since the value returned is indeed the pointer to the "user" part, it can be updated by us, as seen in the usage example above.

Searching for a word:
The search function accepts a valid wordtree pointer and a word (const char*)

The pointer to the word must not be the NULL pointer.

As with insertion, the pointer to the user data pointer is returned. If the word is not found, the NULL pointer is returned.

Traversing the tree:
Tree traversal is a way of visiting every word in the tree in their sorted order.

The traversal function accepts a valid wordtree pointer, a visiting function, and a pointer to user-specified data that will be passed on to the visiting function for every word. This common data can for example be formatting information for printouts or a file to write to.

When calling the traverse function, the visiting function defining what to do must match the following type:

int (*wordtree_visitor)(const char* word,
                        void* userdata,
                        void* commondata);

The traversal function interprets the visitor's return value as:
 * 0 : stop traversing.
 * non 0 : go on.

The traversal function itself returns 1 when all nodes have been visited, and 0 if the traversal was interrupted by the visitor function.

Destroying a tree:
The destroy function accepts a valid wordtree pointer and a userdata handling function (which may be the NULL pointer.)

The userdata handling function must match the following type:

void (*wordtree_userdatadel)(void* userdata);

A function taking care of the user data is necessary if it is allocated on the heap; otherwise a memory leak occurs. If there is no user data, or the user data is not allocated on the heap, the function pointer passed may be the NULL pointer.

C prototypes
As usual, translating our semantics into a C header file wordtree.h gives us this:
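The header itself is not reproduced in this text. Based on the semantics above, a sketch of what wordtree.h could contain might look like this (the exact names and layout are assumptions; the sources in the ZIP file are authoritative):

```c
#ifndef WORDTREE_H
#define WORDTREE_H

/* Opaque handle; the struct is defined in wordtree.c only. */
typedef struct wordtree WORDTREE;

typedef int  (*wordtree_comparison)(const char* word1,
                                    const char* word2);
typedef int  (*wordtree_visitor)(const char* word,
                                 void* userdata,
                                 void* commondata);
typedef void (*wordtree_userdatadel)(void* userdata);

WORDTREE* wordtree_create(wordtree_comparison compare);
void**    wordtree_insert(WORDTREE* tree, const char* word);
void**    wordtree_find(WORDTREE* tree, const char* word);
int       wordtree_traverse(WORDTREE* tree,
                            wordtree_visitor visitor,
                            void* commondata);
void      wordtree_destroy(WORDTREE* tree, wordtree_userdatadel deletor);

#endif
```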

Implementation
As I mentioned in the description of how insertion is done, it begins with searching. This fact can, and should, be exploited by writing a common helper function. After all, having two functions in our program doing the same thing is asking for trouble; sooner or later one of them will need to change, and the other will not follow.

Since the insert function will need to update the node data, the common helper function must return a pointer to the node pointer found. If the node was not found, it will return a pointer to a NULL pointer. For the search function, this is an indication that it should return NULL. For the insert function, it means that the pointer pointed to must be updated to point to a newly created node. If, on the other hand, the node was found, the return value will be a pointer to a valid node pointer, which in both cases is dereferenced to access the user data component of the struct, whose address is then handed back to the caller.

So then, let's implement this baby in wordtree.c:

Traversing trees
Suppose we have functions called "pre", "post" and "in", and a traverse function that, given a non-NULL node, first calls "pre" on the node, then traverses the left child, calls "in", traverses the right child, and finally calls "post" (and that simply returns when given a NULL pointer). Traversal of a tree by a call to "traverse(pRoot)" will then proceed as follows:



Follow the calls: first, "traverse" is called on pRoot. Since the node pointer passed is not NULL, "pre" is called. Once "pre" returns, "traverse" is called again, this time on the left child. Its left node pointer is not NULL either, so "pre" is called again, and so is "traverse"; this time the pointer is NULL, so it returns immediately. After that call returns, the function "in" is called, followed by "traverse" on the right node (which is NULL, so it does nothing). When that call returns, "post" is called. When done with "post", the function returns to its caller, which was the second call to "traverse". That call now calls "in" (on the top node) and then goes on traversing the right sub-tree.

As you can see, the traversal goes around the tree counter clockwise. It calls "pre" on the left hand side of all nodes, while going down. It calls "in" while under a node, and "post" on the right hand side, while going up.

If you take a careful look at the calls to "in", you'll note that they occur from left to right, and since the tree is sorted, this means that the in-order traversal is done in the sorted order of the tree. In the ADT implemented, post-order traversal is used for destruction. Since post-order traversal is always done from the bottom up, it suits destruction perfectly: first go down a branch all the way to the leaf, then destroy the leaf, go up, destroy what the leaf was attached to, go up again, destroy the branch, and so on, until everything is destroyed. A pre-order destruction wouldn't work, since it would first destroy the root, and then not know where to go. An in-order destruction would successfully destroy the left-most branches, and then not know where to go, because the nodes containing the "pRight" pointers would already have been destroyed.

The word frequency histogram ADT
With the aid of this generic word tree, it's easy to write an ADT for word frequency histograms. It'll be by far the simplest ADT yet written; all it does is forward calls to the tree. Here is the header "wfh.h".

And now over to the very simple implementation. As I mentioned earlier, it's just a matter of forwarding the calls to the tree ADT. For traversal, though, a little trick is needed. As you can see, the type for a wfh_visitor differs from the type for a wordtree_visitor. We can cheat our way around that problem by internally using our own visitor function, conforming to the type of wordtree_visitor (since our function will traverse the tree), and our own common data. Our common data is a struct containing a pointer to the user's visitor (wfh_visitor) and the user's commondata. Whenever our function is called by the tree traversal function, we unpack our known common data and call the user's visitor function with the user's commondata. Here's how it's all implemented in wfh.c.

The program
At last, we can now finally put the pieces together and write the word frequency histogram program, making use of the just written abstract data types, and the wordfile ADT from lesson 8.

Here's one example implementation, wordfreq.c

The lie
With the word frequency histogram written, it's time to reveal the lie. So now, after having read all of the above, I'm telling you it's not true (nice chap, this one, eh?). When reading the explanation for why a tree is so much more effective than a list, did you notice the condition? It holds for "perfectly balanced" trees. So, how likely is it that a tree becomes perfectly balanced when built from real-world data, like the words in a text file? Not very. Not very likely at all. How likely is it to become reasonably balanced? Well... hard to say. Let's assume the case of a file of alphabetically sorted words. What happens when we build up the tree for the words "a binary sorted tree violator"?


  a
   \
    binary
      \
       sorted
         \
          tree
            \
             violator

Hmm... All left child pointers are NULL. Everything lies to the right. Yes, it's unfortunately true, for sorted data, a binary search tree becomes a sorted list! So much for that advantage over the list.

Splay trees
A splay tree is a binary search tree that is dynamically rebuilt (keeping the sort order, though) as nodes are inserted and searched for. It is rebuilt in a way that makes it less vulnerable to the sequence of insertions and searches. For data that forms a reasonably well balanced binary search tree, a splay tree does perform worse, but for data that forms poorly balanced search trees, the splay tree is much more effective.

I ran some tests, comparing the simple binary search tree and the splay tree. For the test, I've run the example program, with a custom compare function that counts calls, for both C:\README of Warp 4 (fairly random data, giving a reasonable search tree) and the list of all files on my hard disk, with full path names (very repetitive and sorted data). I've run the tests both with assert tests enabled and without them (NDEBUG defined). What can clearly be seen from the results is that for sorted, repetitive data, the splay tree is much faster. For the random data, however, this is not so clear. With assertions active, the splay tree is much more effective than the simple sorted tree, but not with assertions turned off. How come? Whenever a node is searched for, or inserted, the tree is rebuilt so that the node is moved to the root. The assertions do searches in the tree, so found nodes are moved to the top. When a word that has been inserted before is inserted again, the first call to wordtree_find (to get the count used by the post condition) moves the word to the top. wordtree_insert thus gets a hit on the first word it tries, and the second call to wordtree_find, for the post condition, also gets a hit on the first try.

I will not go more into the splay tree, as this article has become long enough as it is. If you're curious you can find a paper describing the operation and a CGI based demonstration, and even a java applet demo you can watch as it displays what happens when you insert/remove/search elements. You'll also find a splay variant of the "wordtree" in the ZIP file.

Recap
In this article we've seen that:
 * Trees are hierarchical recursive structures, where every node has 2 or more pointers to sub nodes, usually called children.
 * The number of pointers to children for a node is called the arity of the tree, (thus trees where every node has 2 child pointers are called binary trees).
 * A node where all child pointers are NULL is called a leaf.
 * Computer geeks make poor gardeners as they like to plant trees upside down.
 * Reasonably balanced trees are much more effective than lists for insertion and searching.
 * For sorted data, the simple sorted binary tree collapses into a sorted list.

Exercise 1
This one's a difficult one. Please write me and ask for details, but think a little bit first.

Implement a "wordtree_removeword" function, looking like this:

int wordtree_removeword(WORDTREE* tree,
                        const char* word,
                        wordtree_userdatadel deletor);

A return value of 1 indicates successful removal of the word, and 0 failure.

The problem: When you remove a node, and that node has two children, one of the children must take the place of the just deleted node, and the tree must remain sorted.

Exercise 2
This one's easier. With the above construction, you have to provide a deletor function both when removing a single word, and when destroying the entire tree. In both cases the function ought to be the same, but providing the same information in two places is a sure source of future errors (changing one, and forgetting the other). Make a change to the ADT so that you provide the deletor already in the create function, and then just use it when destroying the tree or deleting a single word.

The End
That was it. End of C intro. You now know quite a bit about programming in the C language. Let's look back at what you've learned:
 * The C syntax for one (no small feat).
 * The datatypes, int, char, unsigned char, arrays, structs, pointers, enums, and how they all interrelate, what you can do with them, and what values they can have.
 * You've learned about C loop structures like "for", "while", "do/while", and also about the "if" conditional statement.
 * You can write functions, and know how their parameters are treated, and how to return values.
 * You know how to use "programming by contract" to improve your software quality, and how to enforce the contract with "assert".
 * You can create abstract data types to make complete packages, encapsulating all implementation details from users.
 * You know how to use pointers to functions for callbacks from the ADTs to user functions, and how to use the void* to store user-specified data in your ADT.
 * You have seen how a large number of the most frequently used functions in the ANSI/ISO C library work and when to use them.
 * You also know how to write dynamic data structures like lists and trees.

All in all, I'd say you're now pretty good programmers. Just exercise your skills to gain some experience.

Are there things I have not taught? Yes, there are. Most of those, however, are obscure details that I fear would complicate matters without really bringing you anything of value. It's very unusual that I use C constructs and programming practices that I have not taught in this introductory course.

I would like to ask you a favour. Please e-mail me your impressions of this introduction. This is the first time I've written any kind of course, and I want feedback. Have I been too deep in details, or skimmed too superficially? Have I been too fast paced or too slow? Have I covered the right things to begin with? What things are you missing? What things have you appreciated? Without feedback of this kind, I will never be able to improve, and I want to improve. Already next month I'm starting with a new introduction series: an introduction to C++.

Also, as usual, if my explanations have failed somewhere, please e-mail me and ask what I meant, or ask for more details, if necessary.

Oh, I almost forgot. Here are the promised complete sources:
 * wordfreq.zip, the original as written in this article.
 * splaywtr.zip, alternative wordtree.c implemented as a splay tree. The implementation is more or less a direct copy of the algorithm described in Sleator's paper and not the recursive bottom-up algorithm displayed by the Java applet example.