Introduction to C Programming - Part 9

by Björn Fahller

Introduction
Since we can now read words of whatever kind we like from text files, half the problem of creating a word frequency histogram is solved. The second half lies in finding out how often a word occurs. One way of storing large data structures is to use an array. The problem with the array is that we must know how many entries we want in it. How many unique words are there in a text file? That is hard to know before reading the text file.

This month we will have a look at data structures that, unlike "struct" and arrays, can have as many, or as few, elements as you wish. Last month, we saw that dynamic memory allocation ("malloc") answered the question "how many?" with "as many as you please." This can, and will, be used to create new elements for dynamic data structures when needed, and to remove them when no longer wanted. A good data structure for a dictionary will be explained next month. This month, a simple data structure will be introduced; it is very useful, although not as a dictionary.

A simple list
Two important prerequisites for dynamic data structures were introduced last month. In "wordfile.h" there was a forward declaration of the "WORDFILE" datatype:

    struct wordfile_struct;
    typedef struct wordfile_struct WORDFILE;

which was enough to declare pointers with. As you probably remember, the wordfile functions either required a "WORDFILE*" or, in the case of "wordfile_create," returned one.

The other prerequisite I've already mentioned: dynamic memory allocation with "malloc."

Since a pointer can be defined for a data type that is not yet complete, it is possible to create a "struct" which can point to itself. At the line marked /***/ in such a declaration, the data type "struct intlist_struct" is not yet complete, but it's already enough to allow pointers to it, just as with the forward declared data type "WORDFILE."
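A declaration matching this description might look like the sketch below. The names "struct intlist_struct", "intlist" and "pNext" come from the text; the member name "value" and the helper "intlist_valueAfter" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct intlist_struct
{
  int value;                     /* assumed member name */
  struct intlist_struct* pNext;  /***/
};
typedef struct intlist_struct intlist;

/* Hypothetical helper: returns the value stored in the element that
   follows "elem". It is a programming error if there is none. */
int intlist_valueAfter(const intlist* elem)
{
  assert(elem != NULL);
  assert(elem->pNext != NULL);
  return elem->pNext->value;
}
```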

This perhaps seems a bit weird, but it allows us to chain as many integers as we wish in a list (by using the intlist type). Instances of "intlist" can appear as follows:



Here the "pNext" component of the first instance points to the second instance. "pNext" of the second instance points to the third instance, and "pNext" of the third instance is NULL, denoting the end of the list.

A list of this kind, with one pointer in every element, pointing to the next one, is called a "single linked list". A more flexible list is the "double linked list" with pointers both to the next and the previous elements. This kind of list will not be introduced in this article, but please do create one as an exercise.

Here's a small example of a single linked list (which ignores checking if "malloc" returns NULL). If you compile and run this program, you should see:

    value of element#1 : 35
    value of element#2 : 40
    value of element#3 : 45

The advantage of using this kind of structure is that we don't have to know in advance how many elements we will use (as opposed to the case with arrays), but the price is in indexing. The only way to get the nth element of the list is to start at the beginning and iterate through the list until the nth element is found.
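The small example described above could look roughly like this sketch. It is written as a function taking an output stream rather than as a complete program; the function name "intlist_example" and the member name "value" are assumptions. Like the text, it does not check the result of "malloc".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct intlist_struct
{
  int value;                     /* assumed member name */
  struct intlist_struct* pNext;
};
typedef struct intlist_struct intlist;

/* Builds the list 35 -> 40 -> 45, prints every element to "out",
   frees the list and returns the number of elements printed.
   The result of malloc is deliberately not checked. */
int intlist_example(FILE* out)
{
  intlist* pFirst;
  intlist* pIter;
  int count = 0;

  assert(out != NULL);

  pFirst = malloc(sizeof(intlist));
  pFirst->value = 35;
  pFirst->pNext = malloc(sizeof(intlist));
  pFirst->pNext->value = 40;
  pFirst->pNext->pNext = malloc(sizeof(intlist));
  pFirst->pNext->pNext->value = 45;
  pFirst->pNext->pNext->pNext = NULL;  /* end of list */

  /* Iterate from the first element until the NULL end marker. */
  for (pIter = pFirst; pIter != NULL; pIter = pIter->pNext)
  {
    ++count;
    fprintf(out, "value of element#%d : %d\n", count, pIter->value);
  }

  /* Release every element; fetch pNext before freeing. */
  while (pFirst != NULL)
  {
    intlist* pTmp = pFirst->pNext;
    free(pFirst);
    pFirst = pTmp;
  }
  return count;
}
```

Calling "intlist_example(stdout)" from "main" should print the three lines quoted above.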

A list of words
We can use this knowledge to make an ADT [Abstract Data Type - EDM] of a list of words. To make the list generally useful, every word should have some user defined data associated with it, stored as a "void*." To keep the list simple, though, this extra data will not be used this month. Another simplification I will make, is to assume that "malloc" always succeeds (i.e. never returns a NULL pointer).

What operations are useful for a generic list? We need to be able to store things, find things, query things and remove things. By using something called iterators, which are used to iterate through a list, we can leave finding things to the user, in the sense that the user can iterate through the list and see whether the contents of an element are what is wanted. The iterators can also be used for deletion and even insertion. A list iterator is very similar to an array index, with the important difference that you cannot do arithmetic on iterators.

Semantics
Again, it's time to think about the list from the user perspective.

We should have two data types, WORDLIST and WORDITERATOR. The list holds the words and user data, and the iterator is used to iterate through the elements of the list.

We must be able to create a word list, so a create function is needed:

    WORDLIST* wordlist_create(void);

The value returned should either be NULL, or an empty word list.

    void wordlist_destroy(WORDLIST* wordlist);

Since the wordlist makes use of dynamically allocated data, it must be possible to destroy the structure to avoid memory leaks.

It is a programming error if "wordlist" is the NULL pointer.

    void wordlist_addFirst(WORDLIST* wordlist,
                           const char* word);

A way of adding a word directly to the list, without going through iterators, is desirable, although not necessary.

It is a programming error if any of "wordlist" or "word" is the NULL pointer.

"word" must be copied by the list.

After completion, the first word in the list will be a copy of "word".

    void wordlist_deleteFirst(WORDLIST* wordlist);

Deletes the first element in the list.

It is a programming error if "wordlist" is the NULL pointer.

It is a programming error if the list is empty.

After completion, the first word is removed; the number of elements in the list is one less than prior to the call.

An iterator referring to the first element will be invalidated by this call. It is difficult to state a rule for checking this, though.

    size_t wordlist_numberOfElements(WORDLIST* wordlist);

It can be useful to find out the size of the list.

It is a programming error if "wordlist" is the NULL pointer.

The value returned is the number of elements in the list.

    WORDITERATOR* wordlist_beginning(WORDLIST* wordlist);

To use iterators, we need a starting point, and the first word in the list seems like a good candidate.

It is a programming error if "wordlist" is the NULL pointer.

If "wordlist" is an empty list, the NULL pointer is returned.

wordlist_wordAt(wordlist_beginning(wordlist)) is the first word of the list.

    const char* wordlist_wordAt(WORDITERATOR* iter);

To make the iterator useful for the user, it must be possible to query the iterator for the word it refers to. It is (arguably) not a good idea to allow altering the word, hence it is returned as "const char*".

It is a programming error if "iter" is the NULL pointer.

The returned pointer must not be used after the associated list item is deleted.

    void wordlist_insertAfter(WORDITERATOR* iter,
                              const char* word);

If the user wants to create a list with the elements in a certain order, it must be possible to insert elements in a defined place.

It is a programming error if any of "iter" or "word" is the NULL pointer.

After insertion, wordlist_wordAt(wordlist_next(iter)) equals "word".

    void wordlist_deleteAfter(WORDITERATOR* iter);

As with destroying the entire list, deleting an individual element may be desirable.

It is a programming error if "iter" is the NULL pointer.

It is a programming error if "iter" is the last element of the list.

An iterator referring to the element after "iter" will be invalidated by this call. It is difficult to state a rule for checking this, though.

    WORDITERATOR* wordlist_next(WORDITERATOR* iter);

If we could not move through the list with an iterator, it would not be an iterator.

It is a programming error if "iter" is the NULL pointer.

If "iter" refers to the last element in a list, the returned value will be the NULL pointer.

Translated into C, with comments as in the earlier examples, the above becomes a header file, "wordlist.h".
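A header along these lines would capture the interface described above. The function signatures are those given in the text; the struct tags, the include guard name and the exact wording of the comments are assumptions, so treat this as a sketch rather than the original "wordlist.h".

```c
#ifndef WORDLIST_H
#define WORDLIST_H

#include <stddef.h>  /* size_t */

/* Forward declarations, enough to declare pointers with.
   The struct tags are assumed names. */
struct wordlist_struct;
typedef struct wordlist_struct WORDLIST;

struct worditerator_struct;
typedef struct worditerator_struct WORDITERATOR;

WORDLIST* wordlist_create(void);
/* Postcondition: returns NULL or an empty word list. */

void wordlist_destroy(WORDLIST* wordlist);
/* Precondition:  wordlist != NULL */

void wordlist_addFirst(WORDLIST* wordlist,
                       const char* word);
/* Preconditions: wordlist != NULL, word != NULL
   Postcondition: the first word in the list is a copy of word. */

void wordlist_deleteFirst(WORDLIST* wordlist);
/* Preconditions: wordlist != NULL, the list is not empty
   Postcondition: the list holds one element less than before.
   Invalidates iterators referring to the first element. */

size_t wordlist_numberOfElements(WORDLIST* wordlist);
/* Precondition:  wordlist != NULL
   Returns the number of elements in the list. */

WORDITERATOR* wordlist_beginning(WORDLIST* wordlist);
/* Precondition:  wordlist != NULL
   Returns NULL if the list is empty. */

const char* wordlist_wordAt(WORDITERATOR* iter);
/* Precondition:  iter != NULL
   The result must not be used after the element is deleted. */

void wordlist_insertAfter(WORDITERATOR* iter,
                          const char* word);
/* Preconditions: iter != NULL, word != NULL
   Postcondition: wordlist_wordAt(wordlist_next(iter)) equals word. */

void wordlist_deleteAfter(WORDITERATOR* iter);
/* Preconditions: iter != NULL, iter is not the last element
   Invalidates iterators referring to the element after iter. */

WORDITERATOR* wordlist_next(WORDITERATOR* iter);
/* Precondition:  iter != NULL
   Returns NULL if iter refers to the last element. */

#endif /* WORDLIST_H */
```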

How to?
Before diving into the implementation, we need to think about how some of the operations should be done.

Adding a word
There are two functions for adding a word: "wordlist_addFirst" and "wordlist_insertAfter". Common to both is that they insert an element at a specified place in the list. This commonality can, and should, be exploited. If we, in the list ADT, have a pointer to the first element, this pointer can be seen as an iterator referring to the first element.

So, when creating a new element, be it first in the list, or somewhere else, we can first create the element and initialise its "next" pointer. If the element created is to be the first, its "next" pointer will refer to the element that used to be the first; otherwise it will be set to the element that used to follow the passed iterator. After having created and initialised the new element, the pointer to the first element must be altered to point to this one, if it was to be the first; otherwise it's the "next" pointer of the passed iterator that needs updating. We have now exactly caught the commonality between adding first, and adding after an iterator.

OK, given a list like this:



where "iter" refers to the element containing "value1", how do we add a value "middle" after it, so we get:



A function behaving like this (very rough pseudocode, describing what's done, this is definitely not C) can do the job:

If applied to the case above, it's used like this:

    iter->pNext = createElem(iter->pNext, middle);

Or, to say it in English, "createElem" creates a new element, with a value, and a "next" element that we give it. By saying that the "next" element is the one after the one referred to by our iterator (iter->pNext) we will have our element pointing to where we want it. The next problem is to tell the element referred to by our iterator to point to the newly created object instead of the one it used to point to (the one with "value2"), and the assignment solves that.
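In real C, the idea could be sketched as follows. "createElem" and "pNext" are named in the text; the element type "ELEM", its "word" member and the two wrapper functions "insertAfter" and "addFirst" are assumptions for illustration. As elsewhere, "malloc" is assumed to succeed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct elem_struct
{
  char* word;                  /* the list's own copy of the word */
  struct elem_struct* pNext;
};
typedef struct elem_struct ELEM;

/* Creates an element holding a copy of "word" and followed by
   "pNext" (which may be NULL). */
ELEM* createElem(ELEM* pNext, const char* word)
{
  ELEM* pElem;

  assert(word != NULL);
  pElem = malloc(sizeof(ELEM));
  pElem->word = malloc(strlen(word) + 1);
  strcpy(pElem->word, word);
  pElem->pNext = pNext;
  return pElem;
}

/* Adding after an iterator: update the iterator's "next" pointer. */
void insertAfter(ELEM* iter, const char* word)
{
  assert(iter != NULL);
  iter->pNext = createElem(iter->pNext, word);
}

/* Adding first: the list's pointer to the first element plays
   exactly the role that iter->pNext plays above. */
void addFirst(ELEM** ppFirst, const char* word)
{
  assert(ppFirst != NULL);
  *ppFirst = createElem(*ppFirst, word);
}
```

Both wrappers are one-liners around "createElem", which is the commonality the text describes.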

More graphically, the steps are:

Before:



New element created and initialised:



iter->pNext set to point to new element:



Removing a word
When removing a word, you first have to change the pNext pointer, so you point past it, and then remove the element. Graphically, it appears like this:

Before:



Link past it: iter->pNext = iter->pNext->pNext;



Deallocate the middle element:
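The removal steps can be sketched in C like this. Note that a pointer to the doomed element has to be saved before linking past it, otherwise nothing refers to it when it is time to deallocate. The type "ELEM", the helper "newElem" and the function name "deleteAfter" are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct elem_struct
{
  char* word;
  struct elem_struct* pNext;
};
typedef struct elem_struct ELEM;

/* Hypothetical helper, used only to build a list to experiment with.
   malloc is assumed to succeed. */
ELEM* newElem(const char* word, ELEM* pNext)
{
  ELEM* pElem = malloc(sizeof(ELEM));

  assert(word != NULL);
  pElem->word = malloc(strlen(word) + 1);
  strcpy(pElem->word, word);
  pElem->pNext = pNext;
  return pElem;
}

/* Removes the element that follows "iter". */
void deleteAfter(ELEM* iter)
{
  ELEM* pDoomed;

  assert(iter != NULL);
  assert(iter->pNext != NULL);      /* iter must not be the last */

  pDoomed = iter->pNext;            /* remember the element to remove */
  iter->pNext = iter->pNext->pNext; /* link past it */
  free(pDoomed->word);              /* deallocate the copied word ... */
  free(pDoomed);                    /* ... and the element itself */
}
```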



Word list implementation
An implementation, "wordlist.c", may look like this:

In part 5, when "assert" was first introduced, I mentioned that it does nothing at all if the macro "NDEBUG" is defined. At /** 1 **/ in the code, the variable "old_elements" is defined and used solely for checking the post condition that the number of elements in the list has been decremented by 1. Since the post condition is checked with "assert", the post condition is not checked if "NDEBUG" is defined. Likewise, we don't want "old_elements" defined, and most notably not calculated by an expensive call to "wordlist_numberOfElements", when not needed.

The preprocessor directive "#ifndef" takes care of this. Everything between "#ifndef MACRONAME" and "#endif" is invisible to the compiler if the macro "MACRONAME" is defined (ifndef means if-not-defined). There's also a preprocessor directive "#ifdef" that checks if a macro is defined. In this case, the "#ifndef/#endif" pairs make the definition and use of "old_elements" happen if, and only if, the macro "NDEBUG" was not defined when compiling.

Recursion
"lengthOfTail" does perhaps look strange, but I assure you, it does calculate the number of elements in the list. It works this way:

If "iter" is NULL, the length is 0; otherwise the length is one more than the length of the rest of the list.

Here's an example with a 3 element list:



"lengthOfTail(iter)" is calculated as follows:

"iter" is not NULL, so the length is "1+lengthOfTail(iter->pNext)"

"iter->pNext" is not NULL, but points to the element with value "middle", thus its length is "1+lengthOfTail(iter->pNext->pNext)", and the total so far is "1+1+lengthOfTail(iter->pNext->pNext)"

"iter->pNext->pNext" is not NULL, but points to the element with value "value2", thus its length is "1+lengthOfTail(iter->pNext->pNext->pNext)", and the total so far is "1+1+1+lengthOfTail(iter->pNext->pNext->pNext)"

"iter->pNext->pNext->pNext" is NULL, so its length is 0, and the total becomes "1+1+1+0" which is 3.

This method of calling itself is called recursion. The important thing to remember with recursion, is the same as for loops: there must always be a way out.
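A recursive "lengthOfTail" following the rule above might look like this sketch. The element type "ELEM", its "word" member and the checking function "lengthOfTail_example" are assumptions; the three-element list mirrors the walk-through.

```c
#include <assert.h>
#include <stddef.h>

struct elem_struct
{
  const char* word;
  struct elem_struct* pNext;
};
typedef struct elem_struct ELEM;

/* The length is 0 for NULL, otherwise one more than the length of
   the rest of the list. The NULL test is the "way out". */
size_t lengthOfTail(const ELEM* iter)
{
  if (iter == NULL)
  {
    return 0;
  }
  return 1 + lengthOfTail(iter->pNext);
}

/* Hypothetical check using the list value1 -> middle -> value2. */
void lengthOfTail_example(void)
{
  static ELEM value2 = { "value2", NULL };
  static ELEM middle = { "middle", &value2 };
  static ELEM value1 = { "value1", &middle };

  assert(lengthOfTail(&value1) == 3);  /* 1+1+1+0 */
  assert(lengthOfTail(&middle) == 2);
  assert(lengthOfTail(NULL) == 0);
}
```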

The same function can of course also be written iteratively, with a loop that counts the elements one at a time.

I think the recursive variant is both shorter and easier to understand (recursion might take a little while to get used to, but once you do, it's neat). For both variants there is a trade off. The advantage of the iterative variant is that its memory requirements are constant, no matter the size of the list. That is not the case with the recursive variant, which needs stack space for every element and most probably is slower too. In this case, both are fairly easy to grasp, but in the next article, you'll see a function that is very easy to implement in a recursive way, but for which an iterative implementation requires quite a lot of work.
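An iterative counterpart could be sketched as below. The name "lengthOfTail_iterative", the type "ELEM" and the checking function are assumptions made so that the sketch stands on its own.

```c
#include <assert.h>
#include <stddef.h>

struct elem_struct
{
  const char* word;
  struct elem_struct* pNext;
};
typedef struct elem_struct ELEM;

/* Counts the elements from "iter" to the end of the list using a
   loop; memory use is constant regardless of the list length. */
size_t lengthOfTail_iterative(const ELEM* iter)
{
  size_t length = 0;

  while (iter != NULL)   /* the way out: the NULL end marker */
  {
    ++length;
    iter = iter->pNext;
  }
  return length;
}

/* Hypothetical check using the list value1 -> middle -> value2. */
void lengthOfTail_iterative_example(void)
{
  static ELEM value2 = { "value2", NULL };
  static ELEM middle = { "middle", &value2 };
  static ELEM value1 = { "value1", &middle };

  assert(lengthOfTail_iterative(&value1) == 3);
  assert(lengthOfTail_iterative(NULL) == 0);
}
```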

Extras
Now that you have a complete single linked list for words, you can try writing a double linked list (with pointers to both the next and previous element). With such an ADT you can move iterators both backwards and forwards, and insert/delete both before and after. With a little thought on your implementation, you can also make adding/deleting elements first and last simple and quick. Be careful with the pointers when inserting and deleting. It's easy to forget one. [Also think carefully about the order you update the pointers - EDM] Be careful with memory leaks. Use clear pre- and post-conditions, and use them rigorously. In the early phase, they help you clarify what you want the ADT to do, and later on, they help you get it to work correctly. (Question: how can you claim that a program does what it should, if it's never stated what it should do?)

Recap

 * With the aid of malloc/free, we can create data structures that can grow as needed.
 * Clear pre- and post-conditions help us when defining what we want the ADT to do, and then to make sure it really does it.
 * Stating pre- and post-conditions in terms of the interface of the ADT makes it easier for a user of it to understand violated assertions.

Next
Unless you scream/yell/beg/ask a lot, the next article will conclude the series on C programming by finally getting to the word frequency histogram. I don't think any new C constructions will be used next month (well, possibly a few details), but a new data structure will be introduced, the binary tree.