Introduction to C Programming - Part 8

by Björn Fahller

Introduction
Last month we saw how we could encapsulate the implementation of a word reading file in a separate module. The encapsulation was limited, though, in that only one word file could be open at a time, and the definition of what a word is was hard coded. This article introduces a different kind of encapsulation, usually referred to as an abstract data type (or "ADT"), which allows us to use as many word files as we please. The new word file will also be extended so that the user can change the definition of what a word is.

Definition
If you look at the implementation of last month's word file, you will notice a variable called "file" of type "FILE*", which was used when reading data. The variable either had the value NULL, or a value returned from a call to the standard function "fopen". What I did not mention last month is that it is possible to have several files open at the same time. All you have to do is to have several variables of type "FILE*", and assign each of them the value returned by a different call to "fopen". Which variable you use determines which file you operate on. That is exactly how a word file should work. We need a data type called "WORDFILE*" (or something like that), and variables of that type get their values from "wordfile_open". "FILE*" is what is called an abstract data type: abstract because we don't know, and don't care, about its values (other than to compare against NULL). We just pass values of that kind on to functions that do understand them, like "fgets", "fgetc" and "fclose".
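To make the point about several open files concrete, here is a minimal sketch in which two "FILE*" values are in use at once. It is not taken from last month's code; the function name "streams_equal" is made up for the illustration, and only "fgetc" and "EOF" come from the standard library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the two open files have identical
   contents from their current positions onwards, otherwise 0.
   Which FILE* we pass to fgetc decides which file is read. */
int streams_equal(FILE* a, FILE* b)
{
    int ca;
    int cb;

    assert(a != NULL && b != NULL);  /* NULL here is a programming error */
    do {
        ca = fgetc(a);               /* next character from the first file */
        cb = fgetc(b);               /* next character from the second file */
    } while (ca == cb && ca != EOF);
    return ca == cb;                 /* 1 only if both hit end of file together */
}
```

Neither file knows about the other; all the state that belongs to a file travels with its "FILE*" value, which is exactly the property we want for word files.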

Just like last month, let's first have a look at the desired semantics of our new word file.

Semantics
WORDFILE* wordfile_open(const char* filename);
 * Failure to open a word file is reported by returning NULL, success by returning a non-NULL value.
 * Passing the NULL pointer as the name is a programming error.
 * If no file with the passed name exists, open should fail.
 * If the return value is non-NULL, the word file is opened.

int wordfile_close(WORDFILE* wordfile);
Like last month, this closes the word file. "wordfile" tells which. If closing was successful, the value returned is 1, otherwise 0.
 * "wordfile" must be a value returned by a previous call to "wordfile_open", for which "wordfile_close" has not been called before.
 * If "wordfile" is NULL, "wordfile_close" does nothing and returns 1.
 * A return value of 1 indicates successful closing of the file, and 0 indicates a failure.

For "wordfile_nextword", "wordfile" determines which word file we want the next word from. "buffer", "buffersize" and the return value have the same meaning as last month.
 * "wordfile" must be a value returned by a call to "wordfile_open", for which "wordfile_close" has not been called.
 * It is a programming error if "wordfile" is the NULL pointer.
 * "buffer" must not be the NULL pointer.
 * "buffersize" must be at least 2, to hold a minimum of one character and the null-termination.
 * Return the length of the word copied into buffer. If 0 is returned, no word was found before end of file. If the number returned equals "buffersize", there was not room for the entire word in "buffer". In this case, the buffer will contain only the first "buffersize-1" characters of the word; the remaining characters will be discarded.
 * If end of file is reached when reading a word, the end of the word is also reached, so the word read is copied into buffer, and the length of it returned. The next call will return 0, indicating that the last word has been read.

A header file for our new improved word file declares the three functions above, preceded by a forward declaration of the data type they operate on. The forward declaration says that there is a struct data type with the name "struct wordfile_struct", which we call "WORDFILE". We don't say anything, however, about what the guts of the struct are, or will be. This incomplete data type cannot be instantiated (i.e., we cannot have variables of type "WORDFILE"). We can, however, have pointers to an incomplete data type. The good thing about this is that for a user, WORDFILE is a secret. It's something they can use, through calls to our functions, but they cannot (easily) tamper with the data used by the word file.
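As a sketch only, a "wordfile.h" along these lines could look as follows. The include guard name and the exact prototype of "wordfile_nextword" (in particular the use of "size_t") are my assumptions, since they are not spelled out above, and "user_side_example" is a made-up function that only shows what a user can and cannot do with the incomplete type.

```c
#include <assert.h>
#include <stddef.h>                      /* size_t, NULL */

#ifndef WORDFILE_H                       /* assumed guard name */
#define WORDFILE_H

/* Forward declaration: the struct exists, but its members are a secret. */
struct wordfile_struct;
typedef struct wordfile_struct WORDFILE;

WORDFILE* wordfile_open(const char* filename);
int wordfile_close(WORDFILE* wordfile);
/* Assumed prototype; the types of buffersize and the return value may differ. */
size_t wordfile_nextword(WORDFILE* wordfile, char* buffer, size_t buffersize);

#endif

/* What user code sees (hypothetical example, not part of the header). */
void user_side_example(void)
{
    WORDFILE* wf = NULL;   /* fine: a pointer to an incomplete type */
    /* WORDFILE w; */      /* would not compile: the size is unknown */
    /* *wf; */             /* would not compile: the contents are unknown */
    assert(wf == NULL);    /* comparing against NULL is always possible */
}
```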

Thoughts on implementation
Before getting down to writing the code, there are a few things that need to be thought out. The data type "WORDFILE", or rather, the "struct wordfile_struct", cannot remain a secret much longer. To the user of our improved word file, it should of course remain a secret, but to us, as implementors, the time has come to define it. What information is needed for every word file? If we look at last month's solution, the only data used was the variable "file" of type "FILE*". Is a "FILE*" enough? Actually, yes, for now. One way of defining our data type is: struct wordfile_struct { FILE* file; }; Since the typedef makes "WORDFILE" an alias for "struct wordfile_struct", we can hereafter refer to the data type as "WORDFILE".

Then comes the next problem: that of returning values of type "WORDFILE*". We saw in part 6 how a pointer to a variable can be obtained with the unary operator "&". That doesn't sound like a very good solution for us now, though. How many variables would we need? Would 2 be enough, or 10? Using an array instead of discrete variables doesn't help either, since we still have the problem of deciding how large the array should be. Fortunately there is a solution available in the ANSI/ISO C library. The solution is a function pair named "malloc" and "free", declared in <stdlib.h>. Their prototypes are: void* malloc(size_t size); void free(void* ptr);

What is "void*"? Earlier I've said that "void" is a pseudo type used to denote "nothing at all", for example as the return type of a function not returning any value, or as the parameter type of a function not requiring any parameters. In the case of "void*", "void" should be interpreted as "anything"; thus "void*" becomes a pointer to anything. Since this type can be used to point to any data, the size of what it points to is unknown, and it is not possible to do any arithmetic on it. It can be compared to the NULL pointer, and it can be cast to other pointer types.
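A minimal sketch of what can and cannot be done with a "void*" follows. The function name "int_from_anything" is made up for the illustration; the caller must know that the pointer really points to an "int", because the compiler cannot check it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the caller promises that "anything" points to an int. */
int int_from_anything(void* anything)
{
    int* pi;

    assert(anything != NULL);  /* a void* can be compared against NULL */
    pi = (int*)anything;       /* ... and cast to another pointer type */
    /* "anything + 1" and "*anything" are not valid standard C, since the
       size and type of what the pointer refers to are unknown. */
    return *pi;
}
```

If the pointer passed in actually refers to something other than an "int", the cast still compiles, and the program silently misbehaves. That is the price of the flexibility.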

Enough about "void*" for now. What is it that "malloc" and "free" actually do? "malloc" allocates a block of memory, as large as its parameter says it should be, and returns the pointer to it. If you remember part 6, on pointers and arrays, you may remember the "sizeof" operator. "sizeof" is very frequently used together with "malloc", since to have a pointer to a type X, "malloc(sizeof(X))" will allocate exactly as large a block of memory as is required for the type X, and return the pointer to the block.

"free" undoes what "malloc" did, that is, it deallocates the block of memory that "malloc" allocated. It is very important to remember to always "free" objects created with "malloc" when they are no longer needed, otherwise you get what is called a memory leak.

So, for our word file, we can create our "WORDFILE" with "malloc" in "wordfile_open," and deallocate it with "free" in "wordfile_close." In both cases, what we use is the pointer to the "WORDFILE" created by "malloc."
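Before applying this to the word file, here is the "malloc"/"free" pairing in isolation, as a minimal sketch. The "struct point" type and the three function names are made up for the illustration. Note that "malloc" returns NULL when it cannot allocate the block, which the sketch checks for.

```c
#include <assert.h>
#include <stdlib.h>

struct point { int x; int y; };    /* hypothetical example type */

/* Creates a point on demand. Returns NULL if no memory was available. */
struct point* point_create(int x, int y)
{
    struct point* p = malloc(sizeof(struct point));

    if (p != NULL) {               /* malloc reports failure with NULL */
        p->x = x;
        p->y = y;
    }
    return p;
}

/* Moves an existing point; passing NULL is a programming error. */
void point_move(struct point* p, int dx, int dy)
{
    assert(p != NULL);
    p->x += dx;
    p->y += dy;
}

/* Gives the memory back; the pointer must not be used afterwards. */
void point_destroy(struct point* p)
{
    free(p);                       /* free(NULL) is allowed and does nothing */
}
```

Every successful "point_create" must eventually be matched by exactly one "point_destroy", otherwise the block is leaked.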

Implementation
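First, the typedef and forward declaration of "WORDFILE" go into "wordfile.h", while the definition of "struct wordfile_struct" goes into "wordfile.c", where users of the word file cannot see it. Other than that, "wordfile.c" needs only small changes: "wordfile_open" and "wordfile_close" create and destroy the "WORDFILE", and the implementation of "wordfile_nextword" is the same as last month, with "file" replaced with "wordfile->file".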

At /* 1 */, in "wordfile_close", note that "fclose" returns 0 on success, so if the closing is successful, "retval" is assigned the value 1, as agreed upon in the interface specification.
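A sketch of a "wordfile.c" that follows the semantics above could look like this. It is my reconstruction under stated assumptions, not the original listing: the typedef is repeated so that the sketch is self-contained, and the body and prototype of "wordfile_nextword" are guesses at last month's code, written only to satisfy the rules listed under Semantics.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* From wordfile.h, repeated here so the sketch stands alone. */
typedef struct wordfile_struct WORDFILE;

/* Only the implementation sees the members. */
struct wordfile_struct { FILE* file; };

WORDFILE* wordfile_open(const char* filename)
{
    WORDFILE* wordfile;

    assert(filename != NULL);              /* NULL name: programming error */
    wordfile = malloc(sizeof(WORDFILE));
    if (wordfile == NULL)
        return NULL;
    wordfile->file = fopen(filename, "r");
    if (wordfile->file == NULL) {
        free(wordfile);                    /* don't leak when fopen fails */
        return NULL;
    }
    return wordfile;
}

int wordfile_close(WORDFILE* wordfile)
{
    int retval;

    if (wordfile == NULL)
        return 1;
    retval = fclose(wordfile->file) == 0;  /* 1 */
    free(wordfile);
    return retval;
}

/* Assumed prototype and body. */
size_t wordfile_nextword(WORDFILE* wordfile, char* buffer, size_t buffersize)
{
    size_t length = 0;
    int c;

    assert(wordfile != NULL && buffer != NULL && buffersize >= 2);
    /* Skip everything that cannot be part of a word. */
    while ((c = fgetc(wordfile->file)) != EOF && !isalnum(c))
        ;
    /* Copy the word; characters that do not fit are read but discarded. */
    while (c != EOF && isalnum(c)) {
        if (length < buffersize - 1)
            buffer[length] = (char)c;
        if (length < buffersize)
            ++length;                      /* never counts past buffersize */
        c = fgetc(wordfile->file);
    }
    buffer[length < buffersize ? length : buffersize - 1] = '\0';
    return length;
}
```

Since each "WORDFILE" carries its own "FILE*", any number of word files can be open at once; the caller just keeps one "WORDFILE*" per file.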

Usage of this word file is slightly different from usage of the one from last month: the test program must now keep the "WORDFILE*" returned by "wordfile_open", and pass it to "wordfile_nextword" and "wordfile_close".

What is a word
Now that we can have many word files open simultaneously, there's still the problem that a user must trust our judgement of what a word is. So far, any sequence of letters of the English alphabet and digits, surrounded by anything else, is a word. In many cases, this is not good enough. Just as an example, suppose I want all identifiers used in the program itself to pass as words. "wordfile_close" will not pass as one word, since "_" fails "isalnum".

What we have to do is let the user define what a word is. An easy way to do it is to let the user pass a string containing all valid characters for a word. That's simple to understand and to implement, so I think it should be done. It's not good enough, though. Suppose I want to split constructions like "ThisWordIsConcatenated" into the word sequence "This", "Word", "Is", "Concatenated". Here both upper case and lower case letters are allowed in words, but a capital letter is always the beginning of a new word. A way to allow this, and much more, without us worrying too much about how to do it, is to let the user tell us which function to use when distinguishing words.

There is a data type available in C, that I have so far not mentioned, that can be used for this. The data type is called a pointer to function. The good thing about pointers to functions is that they're flexible yet type safe. The bad thing is that their syntax is terrible.

When defining a pointer to a function, the things to think of are the return type and the parameter list of the kind of function it is supposed to point to.

The syntax deserves careful study, so please take your time with it. The unary "&" operator before a function name is not needed when you want the pointer to the function; it's purely stylistic. I always use it, to clearly show other readers of my code that I do indeed intend to use the pointer to the function, and not call the function (having accidentally forgotten the parentheses).
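As a minimal sketch of the syntax, consider the following. All the names ("add", "multiply", "apply", "example") are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

int add(int a, int b)      { return a + b; }
int multiply(int a, int b) { return a * b; }

/* "operation" is a pointer to a function that takes two ints
   and returns an int. */
int apply(int (*operation)(int, int), int a, int b)
{
    assert(operation != NULL);
    return operation(a, b);      /* (*operation)(a, b) means the same */
}

int example(void)
{
    int (*pf)(int, int) = &add;  /* with "&": clearly a pointer, not a call */
    int sum = pf(2, 3);          /* calls add through the pointer */

    pf = multiply;               /* without "&": means exactly the same */
    return sum + pf(2, 3);       /* calls multiply through the pointer */
}
```

The parentheses around "*operation" and "*pf" are essential. Without them, "int *pf(int, int)" declares a function returning a pointer to int, which is something else entirely.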

The beauty of pointers to functions is certainly not in their syntax, but in their usefulness. Let's put this in the perspective of our word file. The user of it can specify any function they like, as long as it conforms to an interface (i.e. the return type and parameter list) that we specify, and we can call that function through a pointer. Thus our code is not made much harder than it is today, yet its flexibility for the user has grown tremendously.

So how, then, should the function used for the word file behave and look? If we look again at the example where both upper and lower case letters are allowed in words, but the transition from lower case to upper case denotes the beginning of a new word, it is clear that the function needs the previous letter. There are three ways to allow the function to do this.

The first is to keep the previous letter in our implementation, and send both the previous and the current letter to the function. The problems with that are what to send as the previous character when the first character is read from the file, and what to do if the function requires a longer history than one character.

The second is to make it the user's problem altogether. The problem with that is that no matter how the user implements it, they will be restricted to only one word file with that function at a time, because the history data will be shared.

The third approach is to make it the user's problem, with our help; that is, the user must specify what data is needed as the history, and the user must initialise it to something reasonable, but we can help by passing that data to the function, for every word file. The only problem with that is that we cannot, in our implementation, know what data the user will need. The workaround is to revisit our new friend, the "pointer to anything" type, "void*". The user can instantiate data of whatever kind is needed and pass its address as a "void*", and we can pass that pointer to the function, which in turn casts it back to the type it knows it to be.

A simple illustration of how this can work is a function, here called "function", that does not use previous characters as its history, but instead a counter of how many times it has been called, reached through a pointer "puser", while its caller steps through a string with a pointer "string". At /* 1 */, compare "*string++" in the caller with "(*puser)++" in "function". "*string++" dereferences "string", and then increments the pointer, while "(*puser)++" increments the value that "puser" points to. The operator precedence rules make "*string++" identical to "*(string++)".
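The call counting idea can be sketched as follows. Only the names "function", "puser" and "string" come from the text above; the caller "count_matching", the exact parameter lists and the use of "isalnum" as the test are my assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* The callback: its history is a call counter that lives in the
   unsigned the user data points to. */
int function(char c, void* userdata)
{
    unsigned* puser = (unsigned*)userdata;  /* cast back to the known type */

    (*puser)++;                             /* 1: increments the counter */
    return isalnum((unsigned char)c);
}

/* Hypothetical caller: hands every character of "string" to "func",
   and returns how many of them "func" accepted. */
unsigned count_matching(const char* string,
                        int (*func)(char, void*),
                        void* userdata)
{
    unsigned matches = 0;

    assert(string != NULL && func != NULL);
    while (*string != '\0') {
        if (func(*string++, userdata))      /* 1: read char, then advance */
            ++matches;
    }
    return matches;
}
```

The caller never looks at what "userdata" points to. It only passes the pointer along, so two different counters can be used with the same function at the same time without disturbing each other.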

With a construction like the above, we leave control to the user. How does this fit in with our word file then?

Semantics again
Yet again, it's time to think about how we want the word file to behave. The way I think is preferable, although perhaps unnecessarily constraining, is to allow the user to change word definition only once, and only between opening the word file, and reading the first word. The default behaviour must be exactly the same as in the previous implementation, because only then can a user upgrade without changing any currently written programs.

In other words, two new functions are needed for defining what a word is: one that takes a string of valid characters, and "wordfile_wordfunction", which takes a function and user data.

Of course, our "WORDFILE" datatype now needs to hold more information than just a "FILE*". It must hold either a function to call and userdata, or a character string; the first char of a new word, if "firstInWord" is returned (in order to store it as the first character in the string on the next call to "wordfile_nextword"); and a flag indicating whether reading has begun or not. We can compress that to only a function, userdata, the char and the flag, by having a special string function, looking exactly like the "wordfunction", where the userdata is the string. The default "isalnum" behaviour can also be implemented by a function which just calls "isalnum", just as in the call counting example.

When storing the left over character, if "firstInWord" is used, there is a problem in how to tell that no character was left over. Either a flag can be used, or an illegal value. To use an illegal value for "char", though, the datatype needs to be something else. I've chosen "int" as the datatype, and the constant "EOF" to represent no character left over.
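A sketch of how the extended data type and "wordfile_wordfunction" could fit together is shown below. Only "wordfunction", "firstInWord", "isWord" and "wordfile_wordfunction" are names from this article; the enumeration name "wordstatus", its other values, the remaining member names, "defaultIsWord" and the int return value are all my assumptions. The sketch refuses changes once reading has begun, but does not enforce the "only once" rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Assumed result type; only "firstInWord" is named in the text. */
typedef enum { notInWord, inWord, firstInWord } wordstatus;
typedef wordstatus (*wordfunction)(char c, void* userdata);

struct wordfile_struct {
    FILE* file;
    wordfunction isWord;   /* decides what belongs to a word */
    void* userdata;        /* passed untouched to isWord */
    int leftover;          /* first char of the next word, or EOF if none */
    int readingBegun;      /* non-zero once a word has been read */
};
typedef struct wordfile_struct WORDFILE;

/* Default behaviour: the same "isalnum" test as before. */
wordstatus defaultIsWord(char c, void* userdata)
{
    (void)userdata;        /* the default needs no history */
    return isalnum((unsigned char)c) ? inWord : notInWord;
}

/* Returns 1 if the word definition was changed, 0 if it was too late. */
int wordfile_wordfunction(WORDFILE* wordfile, wordfunction func, void* userdata)
{
    assert(wordfile != NULL && func != NULL);
    if (wordfile->readingBegun)
        return 0;
    wordfile->isWord = func;
    wordfile->userdata = userdata;
    return 1;
}
```

With this layout, "wordfile_open" would initialise "isWord" to "defaultIsWord", "leftover" to "EOF" and "readingBegun" to 0, so that programs written for the previous version behave exactly as before.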

Implementation again
/* 1 */ "strchr" searches for a character in a string. If found, it returns the pointer to the character in the string, otherwise it returns NULL. In this case, we want to see if "c" is in the string, which it is if the return value is not NULL. Before this test, we check if "c" is 0, because every string contains the character (char)0 (the null-termination character), so "strchr" would always find it.
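The string-based word function described at /* 1 */ can be sketched like this. The function name "isInString", the enumeration "wordstatus" and its values other than "firstInWord" are assumptions made for the illustration; the user data is taken to be the string of valid characters.

```c
#include <assert.h>
#include <string.h>

/* Assumed result type; only "firstInWord" is named in the text. */
typedef enum { notInWord, inWord, firstInWord } wordstatus;

/* Word function where the user data is a string of valid characters. */
wordstatus isInString(char c, void* userdata)
{
    const char* validchars = (const char*)userdata;

    assert(validchars != NULL);
    /* Check for 0 first: strchr also finds the terminating 0. */
    if (c != 0 && strchr(validchars, c) != NULL)  /* 1 */
        return inWord;
    return notInWord;
}
```

Because this function has the same shape as any other word function, the implementation needs no special case for strings; it simply stores "isInString" as the function and the string as the user data.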

/* 2 */ Instead of calling "isalnum" as before, we now call the function pointed to by "wordfile->isWord" with the user data supplied, and loop as long as the character is deemed not to be in a word. We must also explicitly check for EOF before calling "wordfile->isWord", since it accepts a character, and not an int, and hence cannot safely react to "EOF".

We can now recompile the previous test program with this implementation. Its behaviour will be identical, as promised. The fun begins when we make use of the extras. Try adding a word function called "toUpperTransition" to the program, one that treats a transition from a lower case letter to an upper case letter as the beginning of a new word. To the variables of "main" add

unsigned char lastchar = 0;

and just prior to the reading loop, add:

wordfile_wordfunction(file, toUpperTransition, (void*)&lastchar);

If you recompile and run against a file containing, for example, "ThisSentenceIsBuiltUpOfManyConcatenatedWordsWithTheirFirstLetterCapitalized", you can probably already guess what it does.
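One way "toUpperTransition" could be written is sketched below. The name and the "unsigned char" history come from the text above; the body, the enumeration name "wordstatus" and its values other than "firstInWord" are my assumptions, as is the choice to count only letters as word characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed result type; only "firstInWord" is named in the text. */
typedef enum { notInWord, inWord, firstInWord } wordstatus;

/* The user data points to the previously seen character. */
wordstatus toUpperTransition(char c, void* userdata)
{
    unsigned char* plast = (unsigned char*)userdata;
    unsigned char current = (unsigned char)c;
    wordstatus status;

    assert(plast != NULL);
    if (!isalpha(current))
        status = notInWord;                 /* only letters form words here */
    else if (isupper(current) && islower(*plast))
        status = firstInWord;               /* lower to upper: a new word */
    else
        status = inWord;
    *plast = current;                       /* remember for the next call */
    return status;
}
```

Since the history lives in the caller's "lastchar" variable, and not inside the function, several word files can use "toUpperTransition" simultaneously, as long as each is given its own variable.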

Not too bad, eh?

I'll try to keep next month's part a bit shorter. Promise!

Recap

 * malloc and free can be used to create/destroy objects as needed. This is more or less essential when using abstract data types.
 * The type "void*" is a generic pointer which can point to anything, but it is impossible to do any arithmetic on it. In general, one should be careful with "void*" pointers, since they can indeed point to anything, and you can cast them to any other pointer type. If you're not careful, it's easy to cast one to the wrong type.
 * Pointers to functions are useful when we want to leave it to the user of an abstract data type to define the behaviour. This usage is generally referred to as a "callback function", since the part we've written calls back to a function supplied by the user.
 * The encapsulation makes it possible for us to make changes to the internals, without affecting a user of the abstract data type.

Coming up
Next month I will show you how to write dynamic data structures of arbitrary size (unlike arrays or structs, for which we need to know the number of elements.) This problem must be solved before writing the word frequency histogram, since it is not known until the entire file is read how many unique words there are in the file. To solve this, I will make much more use of "malloc" and "free", and you will also see a function call construction called "recursion."

Please don't hesitate to e-mail me if you have questions, wishes for details to cover or want me to clarify things.