Introduction to C Programming - Part 10: Difference between revisions
| Line 133: | Line 133: | ||
| </small> | </small> | ||
| The return values from the function is interpreted as: | |||
|   <small> |   <small> | ||
| Line 165: | Line 165: | ||
| </small> | </small> | ||
| This was one indirection more than usual, but look at it this way. Inside our implementation, there will be a struct representing a node, looking something like this: | |||
|   <small> |   <small> | ||
Revision as of 23:33, 22 February 2012
Written by Björn Fahller
Introduction
After last month's parenthesis with a linked list (have you tried writing a doubly linked list?) we'll now create the intended data structure, a binary search tree. At the end, you'll also find a link to a .zip file with the full sources for the word frequency histogram. The sources also come with an alternative tree implementation called a Splay tree, which for real world data usually is more effective than the binary search tree. The binary search tree is easy to describe, though, so that's what will go into in this article.
By the way, this article is long. Most of the contents have been covered before, though, so I don't think it'll be that hard to get through.
A tree
The list we saw last month is a flat, or one dimensional, data structure. The tree builds a hierarchy. Every node (i.e. element in the tree) has two or more pointers to nodes below it (called children). A node with all child pointers equal to NULL is a leaf. The number of child pointers in a node is called the arity of the tree. If the arity is two, we have a binary tree. Here's a graphical view of a binary tree:
If you haven't found out by now, us computer types do everything upside down, even trees, with the root on top, and leaves at the bottom. You might as well get used to this odd representation right away, because this is what tree structures look like in every computer science and software engineering book I have seen.
OK, that's how trees look like. How are they encoded in C? Very similar to last month's list. Instead of one pointer "pNext" we now introduce two pointers "pLeft" and "pRight".
A demo program (ignoring error handling and deallocating memory):
 #include <stdlib.h>               /* malloc */
 typedef struct inttree_struct {
   struct inttree_struct* pLeft;
   struct inttree_struct* pRight;
   int value;
 } inttree;
 int main(void)
 {
   inttree* pRoot;
   inttree* right;
   inttree* rightmost;
   inttree* left = (inttree*)malloc(sizeof(inttree));
   left->value = 3;
   left->pLeft = NULL;
   left->pRight = NULL;
   rightmost = (inttree*)malloc(sizeof(inttree));
   rightmost->value = 8;
   rightmost->pLeft =NULL;
   rightmost->pRight = NULL;
   right = (inttree*)malloc(sizeof(inttree));
   right->value = 7;
   right->pLeft = NULL;
   right->pRight = rightmost;
   pRoot = (inttree*)malloc(sizeof(inttree));
   pRoot->value = 5;
   pRoot->pLeft = left;
   pRoot->pRight = right;
 }
Here we have now programmatically created the tree displayed in the image above.
For a tree to be a binary search tree, there need to be an ordering relation between the nodes. All nodes under the "pLeft" child on the node are, in some sense "less" than the node, and those to the right are "more". In the example program above, I coded this relation by means of their values. This means that searching is easy and fast.
To look for a node with a value, compare the value of the root with your value. If your value is less than that of the root, then go to the left, otherwise go to the right. Then go on comparing, until you either find a leaf, or a node with the value you're looking for. If you found a leaf, which wasn't the value you were looking for, the value wasn't in the tree.
Inserting a new value into a tree is done by first searching for a node with that value. If it is already there, nothing more needs to be done. If it is not, we end up in a leaf, in which case we create a new node under the leaf (to the right, or to the left, depending on if our value is greater or smaller than that of the leaf).
Reason for trees
Why would we want to use sorted trees? Isn't the list from last month sufficient?
The answer is execution speed, at least for large numbers of elements. Assume that we have two sorted collections with the same data, one being a (linked) list, and one a perfectly balanced binary search tree. The number of elements in both is 1000.
For searching the list, an average of 500 nodes must be checked. [Because about half the time you'll find it before you have searched half the nodes, and about half the time you'll find it after searching more than half the nodes. Remember to search in a linked list you have to look at each one in turn, in the order they are in the list. Ed.]
For searching the (balanced) tree, a maximum of 11 nodes must be checked. As the number of elements increases, so does the gain from using a tree structure over a list.
How can I claim that the maximum nodes to check for the tree would be 11? Well, a perfectly balanced tree is one in which the only nodes that have one child are those whose child is a leaf. All others have two children or are leaves. [This means that there are (almost) equal numbers of nodes below a node, to its left and its right. Try drawing it on a bit of paper if you aren't sure. Ed.]
If we look at such a tree as a set of levels, where the root is the top level, we can see that every level (except possibly at the bottom) have twice as many nodes as that above it. A tree of 1000 nodes would then have this characteristic:
level nodes in level. Nodes in or above level 1 1 1 2 2 3 3 4 7 5 8 15 6 16 31 7 32 63 8 64 127 9 128 255 10 256 511 11 512 1023
We see clearly that such a tree containing 1000 nodes is 11 levels high, and since in searching we only compare a value with one element per level, the maximum is 11.
A word frequency tree
Returning to our word frequency histogram, our tree should contain words as values. For the tree to be usable outside of this program, the data in it should be a generic kind of user data, just as was the case with the wordfile (lesson 8) and the function defining what a word is. We should also allow the user to define the comparison function, and thus deciding the sort order; otherwise we'd have to implement several different trees where just a detail in the comparison function, like case sensitivity, differs. As a cheat, I will not implement deletion of nodes.
Semantics:
OK, time to yet again have a look at the semantics of our new abstract data type, the word tree ADT.
Tree creation:
When creating the tree, the user must define a comparison function. It is an error not to pass a comparison function. The comparison function must match this type:
 int (*wordtree_comparison)(const char* word1,
                            const char* word2);
The return values from the function is interpreted as:
0 : the words are equal. <0: word1 is less than word2. >0: word1 is greater than word2.
Insertion of word:
The insertion function accepts a valid wordtree pointer and a word (const char*)
The word must be copied onto heap (with malloc) since we cannot keep the original pointer passed. After all, the pointer may refer to a local variable that the caller might overwrite the moment after.
The insertion function returns a pointer to the userdata pointer associated with the word, or NULL if insertion for some reason fails. If the word was not in the tree before this, the userdata pointer, pointed to by the return value, is NULL, and can be set up by the caller. An example:
 userdata** ppuser = (userdata**)wordtree_insert(word);
 if (ppuser = NULL)
 {
   printf("insertion failed\n");
   exit(1);
 }
 if (*ppuser == NULL)
 {
   *ppuser = (userdata*)malloc(sizeof(userdata));
   /* set up our userdata */
 } else {
   /* update our userdata */
 }
This was one indirection more than usual, but look at it this way. Inside our implementation, there will be a struct representing a node, looking something like this:
   struct internalstructname {
      /* whatever's needed to keep the tree structure and word */
      /* ... */
void* user; };
If the insert function creates a new node, it creates such a struct on the heap. The struct is reached through a pointer, say "pNode." It then initialises the struct so that the "user" part is NULL, and returns the address of the user part, like this:
pNode->user = NULL; return &(pNode->user);
Since "user" is a "void*", the address of "user" is a pointer to "user", which then will have the type "void**", i.e. a pointer to a void*.
If the insert function finds a node with an identical word to ours, it does not create a new node, but instead returns the "user" part of the already initialised node struct.
Since the value returned is indeed the pointer to the "user" part, it can be updated by us, as seen in the usage example above.
Searching for a word:
The search function accepts a valid wordtree pointer and a word (const char*)
The pointer to the word must not be the NULL pointer.
As with insertion, the pointer to the user data pointer is returned. If the word is not found, the NULL pointer is returned.
Traversing the tree:
Tree traversal is a way of visiting every word in the tree in their sorted order.
The traversal function accepts a valid wordtree pointer, a visiting function, and a pointer to user-specified data that will be passed on to the visiting function for every word. This common data can for example be formatting information for printouts or a file to write to.
When calling the traverse function, the visiting function defining what to do must match the following type:
 int (*wordfile_wordvisitor)(const char* word,
                             void* userdata,
                             void* commondata);
The traversal function interprets the return value as:
0: stop traversing. non 0: go on.
The traversal function itself returns 1 when all nodes have been visited, and 0 if the traversal was interrupted by the visitor function.
Destroying a tree:
The destroy function accepts a valid wordtree pointer and a userdata handling function (which may be the NULL pointer.)
The userdata handling function must match the following type:
void (*wordfile_userdatadel)(void* userdata);
A function taking care of user data is necessary if it is allocated on the heap, otherwise a memory leak occurs. If no user data, or userdata not allocated on the heap is used, then the pointer passed may be the NULL pointer.
C prototypes
As usual, translating our semantics into a C header file wordtree.h gives us this:
struct wordtree_struct; typedef struct wordtree_struct WORDTREE;
 typedef int (*wordtree_comparison)(const char* word1,
                                    const char* word2);
/* comparison function used by wordtree. word1 and */ /* word2 are always valid strings. The return value */ /* is interpreted by the wordtree as: */ /* */ /* 0 : word1 is equal to word2 */ /* <0: word1 is less than word2 */ /* >0: word1 is greater than word2 */
 typedef int (*wordtree_wordvisitor)(const char* word,
                                     void* userdata,
                                     void* commondata);
/* Function called for every word when traversing */ /* the tree. */ /* word is always a valid string. */ /* userdata is the userdata associated with the */ /* word. */ /* commondata is the pointer passed to the traverse */ /* function. */ /* If the return value is 0, traversal stops. */
typedef void (*wordtree_userdatadel)(void* userdata);
/* Function called for every userdata defined in the*/ /* tree when destroying it. The function must do */ /* whatever is needed to take care of terminating */ /* the userdata or resource leaks will occur. */
WORDTREE* wordtree_create(wordtree_comparison comp);
/* Create a new wordtree. */ /* */ /* Return values: */ /* A valid wordtree pointer or */ /* NULL on failure. */ /* */ /* Preconditions: */ /* comp is not NULL */
 void** wordtree_insert(WORDTREE* tree,
                        const char* word);
/* Insert word in tree. */ /* */ /* Return values: */ /* NULL on failure to insert word. */ /* Pointer to NULL userdata if word was inserted. */ /* Pointer to userdata if word was already in */ /* the tree. */ /* */ /* Preconditions: */ /* tree != NULL */ /* word != NULL */
void** wordtree_find(WORDTREE* tree, const char* word);
/* Search for word in the tree. */ /* */ /* Return values: */ /* NULL if the word was not in the tree. */ /* Otherwise pointer to user data pointer */ /* associated with the word */ /* */ /* Preconditions: */ /* tree != NULL */ /* word != NULL */
 int wordtree_traverse(WORDTREE* tree,
                       wordtree_wordvisitor visitor,
                       void* commondata);
/* Visit all words in the tree, in their sorted */ /* order, and call visitor with the word and the */ /* userdata associated with it plus the common data */ /* for all words. The traversal ends when all words */ /* have been visited or visitor returns 0. */ /* */ /* Return values: */ /* 1: Tree traversed successfully. */ /* 0: Traversal aborted by visitor */ /* */ /* Preconditions: */ /* tree != NULL */ /* visitor != NULL */
 void wordtree_destroy(WORDTREE* tree,
                       wordtree_userdatadel deletor);
/* Destroy the tree, freeing all allocated memory, */ /* and call deletor, if not NULL, for the userdata */ /* associated with every word in the tree. */ /* */ /* Preconditions: */ /* tree != NULL */
Implementation
As I mentioned in the description of how insertion is done, it begins with searching. This fact can, and should, be exploited by writing a common helper function. After all, having two functions in our program doing the same thing is asking for trouble; sooner or later one of them will need change, and the other will not follow.
Since the insert function will need to update the node data, the common helper function must return the pointer to the node pointer found. If the node was not found, it will return a pointer to a NULL pointer. For a search, this is an indication that it should return NULL. For the insert function, this means that the returned pointer must be updated to point to a newly created node. If the node on the other hand was found, the return value will be a pointer to a valid pointer, which both cases will be dereferenced to access the user data component of the struct. In the case of the insert function, it is the pointer to the user data component that will be returned, and for the search function, it is the value of the user data component that will be returned.
So then, let's implement this baby in wordtree.c:
#include <stdlib.h> /* malloc/free */ #include <assert.h> /* assertions */ #include <string.h> /* strlen, strcpy */ #include "wordtree.h" /* the prototypes */
 typedef struct treenode_struct {
   struct treenode_struct* pLeft;
   struct treenode_struct* pRight;
   char* word;
   void* userdata;
 } TREENODE;
 struct wordtree_struct {
   TREENODE* pRoot;
   wordtree_comparison comp;
 };
 static TREENODE** findnode(TREENODE** node,
                            wordtree_comparison comp,
                            const char* word);
/* Helper function for wordtree_insert and */ /* wordtree_find. Returns the address of a pointer */ /* to a node. If a node with the right word is */ /* found, the address of the pointer to that node */ /* is returned. If no node is found, the address */ /* of the pointer to be initialised with the new */ /* node is returned. */
 static void subtreedestroy(TREENODE* node,
                            wordtree_userdatadel del);
 /* Helper function for destruction. It traverses   */
 /* tree and calls del (if not NULL) for the        */
 /* userdata of every node, before deallocating the */
 /* node itself.                                    */
 static int subtreetraverse(TREENODE* node,
                            wordtree_wordvisitor visitor,
                            void* always);
 /* Helper function for wordtree_traverse.          */
 /* Traverses the entire tree from node and         */
 /* downwards and calls visitor in sorted order     */
 WORDTREE* wordtree_create(wordtree_comparison comp)
 {
   WORDTREE* result=(WORDTREE*)malloc(sizeof(WORDTREE));
   if (result != NULL)
   {
     result->pRoot = NULL; /* initialise empty tree */
     result->comp = comp;  /* with comparison func. */
   }
   return result;
 }
 void** wordtree_insert(WORDTREE* tree,
                        const char* word)
 {
   TREENODE** theNode;
/* Preconditions: */ /* tree != NULL */ assert(tree != NULL);
/* word != NULL */ assert(word != NULL);
   /* Get pointer to pointer to the node with our */
   /* word. If the word is not in the tree, the   */
   /* pointer points to a null pointer, which we  */
   /* must initialise. Otherwise we can use the   */
   /* returned value directly.                    */
   theNode = findnode(&(tree->pRoot),
                      tree->comp,
                      word);
   if (*theNode == NULL) /* word is not yet in */
   {                     /* tree insert it now */
     *theNode = (TREENODE*)malloc(sizeof(TREENODE));
     (*theNode)->pLeft = NULL;
     (*theNode)->pRight = NULL;
     (*theNode)->userdata = NULL;
     (*theNode)->word = (char*)malloc(strlen(word)+1);
     strcpy((*theNode)->word, word);
   }
return &((*theNode)->userdata); }
 void** wordtree_find(WORDTREE* tree,
                     const char* word)
 {
   TREENODE** theNode;
/* Preconditions: */ /* tree != NULL */ assert(tree != NULL);
/* word != NULL */ assert(word != NULL);
   /* Just as with insert above, we get a pointer */
   /* to a pointer to the node. If the pointer    */
   /* pointed to is NULL, the word isn't there.   */
   theNode = findnode(&(tree->pRoot),
                      tree->comp,
                      word);
   return (*theNode == NULL) ? NULL
                             : &((*theNode)->userdata);
 }
 int wordtree_traverse(WORDTREE* tree,
                       wordtree_wordvisitor visitor,
                       void* always)
 {
   /* Preconditions:                               */
   /*   tree != NULL                               */
   assert(tree != NULL);
/* visitor != NULL */ assert(visitor != NULL);
return subtreetraverse(tree->pRoot, visitor, always); }
 void wordtree_destroy(WORDTREE* tree,
                       wordtree_userdatadel deletor)
 {
   /* Preconditions:                         */
   /*   tree != NULL                         */
   assert(tree != NULL);
/* Destroy all nodes in the tree, and the */ /* associated user data (if applicable) */ subtreedestroy(tree->pRoot, deletor);
/* Then deallocate the wordtree struct */ free(tree); }
/********************/ /* Helper functions */ /*******************/
 static TREENODE** findnode(TREENODE** node,
                            wordtree_comparison comp,
                            const char* word)
 { /* See description in beginning of file */
   TREENODE** npp = node;
   while (*npp != NULL)
   {
     /* compare word, with the word in *node */
     int result = comp(word, (*npp)->word);
     if (result == 0) /* word == (*npp)->word */
       {
         return npp; /* OK, got it, return it. */
       }
     if (result < 0) /* word < (*npp)->word go on */
     {               /* searching to the left.   */
         npp = &((*npp)->pLeft); /* Addr. of pointer */
       }
     else /* word > (*npp)->word, go on  */
     {    /* searching to the right     */
       npp = &((*npp)->pRight); /* Address of pointer */
     }
   }
   return npp;
 }
 static void subtreedestroy(TREENODE* node,
                            wordtree_userdatadel del)
 { /* See description in beginning of file, and */
   /* detailed description of tree traversal    */
   /* further down.                             */
   if (node != NULL)
   {
     /* First destroy everything to the left */
     subtreedestroy(node->pLeft, del);
     /* Then everything to the right */
     subtreedestroy(node->pRight, del);
     /* Then the user data, if del allows us */
     if (del != NULL)
       {
         del(node->userdata);
       }
     /* deallocate the word */
     free(node->word);
     /* And finally the node itself */
     free(node);
   }
 }
 static int subtreetraverse(TREENODE* node,
                            wordtree_wordvisitor visitor,
                            void* commondata)
 { /* See description in beginning of file, and */
   /* detailed description of tree traversal    */
   /* further down.                             */
   if (node == NULL)
   {
     return 1;
   }
   return subtreetraverse(node->pLeft,
                          visitor,
                          commondata)
       && visitor(node->word,
                  node->userdata,
                  commondata)
       && subtreetraverse(node->pRight,
                          visitor,
                          commondata);
/* Here we make use of C's lazy evaluation of */ /* logical expressions. */ /* */ /* If the first call to subtreetraverse */ /* returns 0, the whole expression will be */ /* zero, so nothing more needs to be done. In */ /* this case visitor and the second */ /* subtreetraverse will not be called. */ /* The same happens if visitor returns 0, */ /* then the whole expression will be zero */ /* anyway, so the last subtreetraverse will */ /* not be called. */ /* */ /* In other words, if any of the above */ /* functions returns 0, the rest of them will */ /* not be called (thus aborting the */ /* traversal) and the function will return 0. */ /* This zero return value will then be */ /* propagated upwards, immediately terminating*/ /* all traversals. */ }
Traversing trees
Suppose we have functions called "pre", "post" and "in", and a traverse function looking like this:
 void traverse(TREENODE* pNode)
 {
   if (pNode != NULL)
   {
     pre();
     traverse(pNode->pLeft);
     in();
     traverse(pNode->pRight);
     post();
   }
 }
Traversal of the following tree, by a call to "traverse(pRoot)" will look as follows:
In the places marked "t" a call to "traverse" takes place. The first time is the one at the top, near "pRoot". "traverse" is called on pRoot. Since the node pointer passed is not NULL, "pre()" is called. Once "pre()" returns, traverse is called again (the "t" below the first "pre()" from the top left.) This time it's the left most node that is traversed. The left node pointer is not NULL so "pre()" is called again, and so is "traverse()", however this time the pointer is NULL so it returns immediately. After the return of that call to traverse, the function "in()" is called, followed by "traverse()" on the right node (which is NULL, so it does nothing). When returning from that call, it calls "post()". When done with post, it returns to its caller, which was the second call to "traverse()". It now calls "in()" (below the top node) and then goes on traversing the right sub-tree.
As you can see, the traversal goes around the tree counter clockwise. It calls "pre()" on the left hand side of all nodes, while going down. It calls "in()" while under a node, and "post" on the right hand side, while going up.
If you take a careful look at the calls to "in()" you note that they occur from left to right, and since the tree is sorted, this means that the in-order traversal is done in the sorted order of the tree. In the ADT implemented, post-order traversal is done for destruction. Since post order traversal is always done from bottom and up, it suits destruction perfectly. First go down a branch all the way to the leaf, then destroy the leaf, go up, destroy what the leaf was attached to, go up again, destroy the branch, and so on, until everything is destroyed. A pre-order destruction wouldn't work, since it would first destroy the root, and then not know where to go to. An in-order destruction would successfully destroy the left most branches, and then not know where to go to, because the node containing the "pRight" pointers would then have been destroyed.
The word frequency histogram ADT
With the aid of this generic word tree, it's easy to write an ADT for word frequency histograms. It'll be the by far simplest ADT yet written. All it does is to forward calls to the tree. Here is the header "wfh.h"
typedef struct wfh_struct WFH;
 typedef int (*wfh_compare)(const char* word1,
                            const char* word2);
  /* Comparison function for words. The words   */
  /* are always valid strings. The return value */
  /* is interpreted as:                         */
  /*                                            */
  /* 0: word1 equals word2.                     */
  /* >0: word1 is greater than word2            */
  /* <0: word1 is greater than word2            */
 typedef int (*wfh_visitor)(const char* word,
                            unsigned freq,
                            void* commondata);
 /* Function for visiting all words in histogram */
 /* in their sorted order.                       */
 /* word is always a valid string                */
 /* freq is the frequency of the word.           */
 /* commondata is the pointer passed to the      */
 /* traverse function.                           */
 /* A return value of 0 terminates the traversal */
 /* of words.                                    */
WFH* wfh_create(wfh_compare comp); /* Create a new word frequency histogram. */ /* */ /* Return value: */ /* A valid pointer to a word frequency hist.*/ /* 0 on failure. */ /* */ /* Precondition: */ /* comp != 0 */
 unsigned wfh_insert(WFH* hist,
                     const char* word);
 /* Insert a word in the histogram.            */
 /*                                            */
 /* Return value:                              */
 /*   The frequency of the word.               */
 /*   0 on failure (invalidates postcond)      */
 /*                                            */
 /* Preconditions:                             */
 /*   hist != NULL                             */
 /*   word != NULL                             */
 /*                                            */
 /* Postconditions:                            */
 /* wfh_find(hist,word) ==                     */
 /*   1+old wfh_find(hist,word)                */
unsigned wfh_find(WFH* hist, const char* word); /* Find frequency of a word in histogram. */ /* */ /* Return values: */ /* The frequency of word. */ /* 0 if word not in hist. */ /* */ /* Preconditions: */ /* hist != NULL */ /* word != NULL */
 int wfh_traverse(WFH* hist,
                  wfh_visitor visitor,
                  void* commondata);
 /* Visit all words in histogram in their      */
 /* sorted order.                              */
 /*                                            */
 /* Return values:                             */
 /*   1, all words visited.                    */
 /*   0, traversal aborted by visitor.         */
 /*                                            */
 /* Preconditions:                             */
 /*   hist != NULL                             */
 /*   visitor != NULL                          */
void wfh_destroy(WFH* hist); /* Destroy the histogram. */ /* */ /* Preconditions: */ /* hist != NULL */
And now over to the very simple implementation. As I mentioned earlier, it's just a matter of forwarding the calls to the tree ADT. For traversal, though, a little trick is needed. As you can see, the type for a wfh_visitor differs from the type for a wordtree_visitor. We can cheat our way around that problem, by internally using our own visitor function, conforming to the type of wordtree_visitor (since our function will traverse the tree) and our own common data. Our common data is a struct containing a pointer to the user's visitor (wfh_visitor) and the user's commondata. Whenever our function is called by the tree traversal function, we unpack our known userdata and call the user's visitor function with the user's commondata. Here's how it's all implemented in wfh.c
#include <stdlib.h> /* malloc/free */ #include <assert.h> #include "wfh.h" #include "wordtree.h"
 struct wfh_struct {
   WORDTREE* tree;
 };
 WFH* wfh_create(wfh_compare comp)
 {
   WFH* pwfh;
/* Precondition: */ /* comp != 0 */ assert(comp != NULL);
   pwfh = (WFH*)malloc(sizeof(WFH));
   if (pwfh != NULL)
   {
     pwfh->tree = wordtree_create(comp);
     if (pwfh->tree == NULL)
     {
       free(pwfh);
       pwfh = 0;
     }
   }
   return pwfh;
 }
 unsigned wfh_insert(WFH* hist,
                     const char* word)
 {
    unsigned** ppcount;
 #ifndef NDEBUG
 /* for post condition, see lesson 9 for explanation: */
   unsigned old_freq;
 #endif
/* Preconditions: */ /* hist != NULL */ assert(hist != NULL);
/* word != NULL */ assert(word != NULL);
#ifndef NDEBUG /* for post condition: */ old_freq = wfh_find(hist, word); #endif
   ppcount = (unsigned**)wordtree_insert(hist->tree,
                                         word);
   if (ppcount == NULL)
   {
     return 0; /* insertion failed */
   }
   if (*ppcount == NULL) /* first time seen */
   {
     /* Our data associated with words is an     */
     /* unsigned int denoting the frequency. The */
     /* first time a word is seen, allocate the  */
     /* counter and initialise it to 1.          */
     *ppcount = (unsigned*)malloc(sizeof(unsigned));
     **ppcount = 1;
   }
   else
   {
     /* If the word has occurred before, just   */
     /* increment the frequency counter for it */
++(**ppcount); } /* Postconditions: */ /* wfh_find(hist,word) == */ /* 1+old wfh_find(hist,word) */ assert(wfh_find(hist, word) == 1+old_freq); return **ppcount; }
 unsigned wfh_find(WFH* hist, const char* word)
 {
   unsigned** pfreq;
   /* Preconditions:                           */
   /*   hist != NULL                           */
   assert(hist != NULL);
   /*   word != NULL                           */
assert(word != NULL);
/* wordtree_find returns the pointer to */ /* userdata pointer. If the pointer */ /* is NULL, the word isn't in the tree. */ /* If it's non-NULL, we can interpret it */ /* as a pointer to an unsigned int */ /* pointer, (since that's how we */ /* allocated it when inserting the word) */ /* and get the frequency from there. */
   pfreq = (unsigned**)wordtree_find(hist->tree,
                                     word);
return pfreq == NULL ? 0 : **pfreq; }
 struct travdata {
   wfh_visitor visitor;
   void* usercommon;
 };
/* helper function forwarding calls to the */ /* user's visitor function. */
 static int treevisitor(const char* word,
                        void* data,
                        void* commondata)
 {
   struct travdata* ptd = (struct travdata*)commondata;
/* get the visotor the user intended to call */ wfh_visitor userfunc = ptd->visitor;
unsigned count = *(unsigned*)data;
/* Call the user's visitor with the word, */ /* the frequency and the user's commondata. */ return userfunc(word, count, ptd->usercommon); }
 int wfh_traverse(WFH* hist,
                  wfh_visitor visitor,
                  void* commondata)
 {
   struct travdata ourdata;
/* Preconditions: */ /* hist != NULL */ assert(hist != NULL);
/* visitor != NULL */ assert(visitor != NULL);
/* store the user supplied visitor and */ /* commondata in ourdata. */ ourdata.visitor = visitor; ourdata.usercommon = commondata;
   /* Traverse the tree with our tree visitor  */
   /* and pass it the user supplied visitor    */
   /* and common data through ourdata passed   */
   /* as the commondata. treevisitor knows how */
   /* to unpack this and forward the call to   */
   /* the user specified visitor function.     */
   return wordtree_traverse(hist->tree,
                            treevisitor,
                            &ourdata);
 }
 void wfh_destroy(WFH* hist)
 {
   /* Preconditions:                           */
   /*   hist != NULL                           */
   assert(hist != NULL);
/* Since the data associated with every */ /* word is allocated on heap, it must be */ /* deallocated when destroying the tree. */ /* The ANSI/ISO C function "free" fits */ /* bill just perfectly. */
wordtree_destroy(hist->tree, free); free(hist); }
The program
At last, we can now finally put the pieces together and write the word frequency histogram program, making use of the just written abstract data types, and the wordfile ADT from lesson 8.
Here's one example implementation, wordfreq.c
#include <stdio.h> #include <string.h> /* strcmp */ #include "wordfile.h" #include "wfh.h"
 int printAll(const char* word,
              unsigned freq,
              void* common)
 {
   int width = *(int*)common;
   printf("%*d:%s\n", width, freq, word);
   /* %* in this case means "interpret the parameter  */
   /* on this place as an integer specifying the      */
   /* width." For example: printf("%5d", 12); yields  */
   /* the same result as   printf("%*d", 5, 12); In   */
   /* the latter case, though, the width can be       */
   /* specified by a variable and thus determined at  */
   /* run-time, where as in the former case, where    */
   /* the width must be known at compile time.        */
return 1; }
 int main(int argc, char* argv[])
 {
   WORDFILE* file;
   WFH* freqhist;
   int width = 4;
   int argcount;
   char* filename;
   char word[80];/* what language has this long words? */
   if (argc != 2)
   {
     printf("Usage: wordfreq filename\n");
     return 1;
   }
   for (argcount = 1; argcount < argc; ++argcount)
   {
     filename = argv[argcount];
     /* open our wordfile, or exit on failure */
     file = wordfile_open(filename);
     if (file == NULL)
     {
       printf("Couldn't open \"%s\"\n",
              filename);
       continue;
       /* continue means going on immediately with the */
       /* next iteration of the loop.                  */
     }
     /* OK, now create the word frequency histogram, or */
     /* exit on failure.                                */
     freqhist = wfh_create(strcmp);
     /* strcmp is a case sensitive comparison for     */
     /* strings, behaving exactly as required by the  */
     /* tree and word frequency histogram ADT. It     */
     /* differs uppercase letters from lowercase by   */
     /* considering all uppercase letters being "less"*/
     /* than any lowercase letter. Digits come before */
     /* any letter in sort order. Note that this is   */
     /* only true for letters in the English alphabet.*/
     /* For letters not in the English alphabet, the  */
     /* sort order depends on the current code page.  */
     /* It does the comparison by actually subtracting*/
     /* the values of the characters in the string,   */
     /* until one of them differs or the strings end. */
     /* OS/2 offers functions for language specific   */
     /* string comparison (Note that letters can have */
     /* different sort orders for different           */
     /* languages.) That, however is another story.   */
     if (freqhist == NULL)
     {
       printf("Failed creating histogram\n");
       return 2;
     }
     /* Read words from the file, one by one, and insert */
     /* into the histogram, until all words have been    */
     /* read                                             */
     while (wordfile_nextword(file,word,sizeof(word))>0)
     {
       wfh_insert(freqhist, word);
     }
     /* when done reading, close the file */
     wordfile_close(file);
     /* Now when the histogram is complete for the */
     /* just read file, print the frequencies.     */
     printf("\nFrequencies of the words in \"%s\"\n",
            filename);
     wfh_traverse(freqhist, printAll, (void*)&width);
     /* And when done with that, destroy the histogram */
     wfh_destroy(freqhist);
   }
   return 0;
 }
The lie
With the word frequency histogram written, it's time to reveal the lie. So now, after having read all of the above, I'm telling you it's not true (nice chap this, eh?). When reading the explanation for why a tree is so much more effective than a list, did you notice the condition? It goes for "perfectly balanced" trees. So, how likely is it that a tree becomes perfectly balanced when built up with real-world data like the words in a text file? Not very. Not very likely at all. How likely is it to become reasonably balanced? Well... hard to say. Let's assume the case of a file of alphabetically sorted words. What happens when we build up the tree for the words "a binary sorted tree violator".
Hmm... All left child pointers are NULL. Everything lies to the right. Yes, it's unfortunately true, for sorted data, a binary search tree becomes a sorted list! So much for that advantage over the list.
Splay trees
A splay tree is a binary search tree that is dynamically rebuilt (keeping the sort order, though) as nodes are inserted and searched for. It is rebuilt in a way making it less vulnerable to the sequence of insertions and searches. For data that forms a reasonable well balanced binary search tree, a splay tree does perform worse, but for data that forms poorly balanced search trees, the splay tree is much more effective.
I ran some tests, comparing the simple binary search tree and the splay tree. For the test, I've run the example program, with a custom compare function that counts calls, for both C:\README of Warp 4 (fairly random data giving a reasonable search tree) and the list of all files on my hard disk, with full path names (very repetitive and sorted data). I've run the tests with assert tests and without (NDEBUG defined). The result was as follows:
                   Calls to compare function
                   dirlist         C:\README
               splay    simple    splay  simple
       assert  3035383 16485864  114230  181169
       NDEBUG  1934204  5487971   80086   59921
What can clearly be seen is that for sorted repetitive data, the splay tree is much faster. For the random data, however, this is not so clear. With assertions active, the splay tree is much more effective than the simple sorted tree, but not with assertions turned off. How come? Whenever a node is searched for, or inserted, the tree is rebuilt so that the node is moved to the root. The assertions do searches in the tree, so found nodes are moved to the top. When a word that has been inserted before is inserted again, the first call to wordtree_find (to get the count used by the post condition) moves the word to the top. wordtree_insert thus gets a hit on the first word it tries, and the second call to wordtree_find, for the post condition, also gets a hit on the first try.
I will not go more into the splay tree, as this article has become long enough as it is. If you're curious you can find a paper describing the operation and a CGI based demonstration, and even a java applet demo you can watch as it displays what happens when you insert/remove/search elements. You'll also find a splay variant of the "wordtree" in the ZIP file.
Recap
In this article we've seen that:
- Trees are hierarchical recursive structures, where every node has 2 or more pointers to sub nodes, usually called children.
- The number of pointers to children for a node is called the arity of the tree, (thus trees where every node has 2 child pointers are called binary trees).
- A node where all child pointers are NULL is called a leaf.
- Computer geeks make poor gardeners as they like to plant trees upside down.
- Reasonably balanced trees are much more effective than lists for insertion and searching.
- For sorted data, the simple sorted binary tree collapses into a sorted list.
Exercise 1
This one's a difficult one. Please write me and ask for details, but think a little bit first.
Implement a "wordtree_removeword" function, looking like this:
 int wordtree_removeword(WORDTREE* tree,
                         const char* word,
                         wordtree_userdatadel deletor);
A return value of 1 indicates successful removal of the word, and 0 failure.
The problem: When you remove a node, and that node has two children, one of the children must take the place of the just deleted node, and the tree must remain sorted.
Exercise 2
This one's easier. With the above construction, you have to provide a deletor function both when removing a single word, and when destroying the entire tree. In both cases the function ought to be the same, but providing the same information in two places is a sure source of future errors (changing one, and forgetting the other). Make a change to the ADT so that you provide the deletor already in the create function, and then just use it when destroying the tree or deleting a single word.
The End
That was it. End of C intro. You now know quite a bit about programming in the C language. Let's look back at what you've learned:
- The C syntax for one (no small feat).
- The datatypes, int, char, unsigned char, arrays, structs, pointers, enums, and how they all interrelate, what you can do with them, and what values they can have.
- You've learned about C loop structures like the "for", "while", "do/while", and also about the "if" conditional statement.
- You can write functions, and know how their parameters are treated, and how to return values.
- You know how to use "programming by contract" to improve your software quality, and how to enforce the contract with "assert".
- You can create abstract data types to make complete packages, encapsulating all implementation details for users.
- You know how to use pointers to functions for callbacks from the ADT's to user functions, and how to use the void* to store user specified data in your ADT.
- You have seen how a large number of the most frequently used functions in the ANSI/ISO C library works and when to use them.
- You also know how to write dynamic data structures like lists and trees.
All in all, I'd say you're now pretty good programmers. Just exercise your skills to gain some experience.
Are there things I have not taught? Yes, there are. Most of those, however, are obscure details that I fear would complicate matters without really bringing you anything of value. It's very unusual that I use C constructs, and programming practices that I have not taught in this introductory course.
I would like to ask you a favour. Please e-mail me your impressions of this introduction. This is the first time I've written any kind of course, and I want feedback. Have I been too deep in details, or skimmed to superficially? Have I been fast paced or too slow? Have I covered the right things to begin with? What things are you missing? What things have you appreciated? Without feedback of this kind, I will never be able to improve, and I want to improve. Already next month I'm starting with a new introduction series: An introduction to C++.
Also, as usual, if my explanations have failed somewhere, please e-mail me and ask what I meant, or ask for more details, if necessary.
Oh, I almost forgot. Here are the promised complete sources:
wordfreq.zip, the original as written in this article. splaywtr.zip, alternative wordtree.c implemented as a splay tree. The implementation is more or less a direct copy of the algorithm described in Sleator's paper and not the recursive bottom-up algorithm displayed by the Java applet example.


