Introduction to C Programming - Part 7

by Björn Fahller

As promised last month, the knowledge so far gained will be used in writing a real program. The program will create word frequency histograms for text files. This month, I will explain how you pass parameters to a program, some basics of file handling, a programming practice called "encapsulation" and another called "programming by contract." The latter two are not specific to C, but are good software engineering practices.

Parameters to programs
In order for our program to create a word frequency histogram for a text file, we need a way of telling the program which text file to read from. One way is to have the program ask for the file name once it has started, but that isn't very smooth. A better way is to do what most programs do: accept command line parameters. For example, the program can be started like this:

    [D:\]wordhist c:\readme

and then create a word frequency histogram for "c:\readme".

main
Command line parameters are sent to the "main" function. For "main" to get the parameters, though, it must be declared a bit differently, with a parameter list. The version of "main" to use when accepting command line parameters is:

    int main(int argc, char* argv[])

The parameter "argv" may look a little frightening, but it isn't that bad. As I mentioned last month, a string is an array of characters (char stringname[size]), and when an array is passed to a function, it is actually a pointer to its first element that is passed. For strings, it is very common to see "char* stringname" in the parameter list of a function. With this little explanation, we can see that "argv" is an array of strings. Also as mentioned last month, you lose the size information when passing an array to a function. For each string, this is not important, since all strings are null terminated (i.e. the last character is (char)0). The number of strings passed, however, is sent in "argc".

Let's test this with a small program: Call this program "argtest.exe" and run it a couple of times. This is how it looks for me: As you see, there is always at least one argument, "argv[0]", and it is always the name of the program itself. Unfortunately, the contents of "argv[0]" might differ from compiler to compiler and also depend on the shell the program is started from. VisualAge C++ always passes the name of the program as sent by the shell. When using 4OS2 as my shell, I get the above result. When using CMD.EXE, I get the name exactly as I type it.

Since I promised last month that I would explain the so far unexplained parts of C as we go along, I will now reveal why "main" returns an "int".

All programs leave something called a "return code" when they terminate. The norm is to return 0 for successful execution, and non-zero for error reporting. The return code is often used in scripts, and to combine programs on the command line. Let's change "argtest" a little bit, and return "argc-1" instead of 0, just to test it.

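Running it by itself makes no difference from before, but combining it with other programs through "&&" and "||" shows something: We can see here that the return code and the operator ("&&" or "||") together determine whether the second command runs. With "&&", the second command runs only if the first one returned 0. With "||", it runs only if the first one returned a non-zero value.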

On the command line, and in scripts, you can also use the "if errorlevel n" construct, which executes whatever follows if the return code from the previously run program is greater than, or equal to, n. This example shows how it works:

Maybe you noticed that the "printf" call in the example has a new detail in the formatting string. The detail is the number 2 in:

    printf("%2d : %s\n", index, argv[index]);

The interpretation of the above is to print "index" as an integer, just as "%d" usually means, but reserve a width of 2 characters for it. The number 1, for example, will be written as " 1". The number of digits is not limited to 2, though, so it's still possible to print numbers requiring more than 2 digits this way.

Basic file reading
In most programming environments, including C programming, dealing with files resembles real-life dealing with books. You can find out some data about a file by looking at it, but to read from or write to it, it must be opened.

Unlike books, however, you specify your intent when opening a file. You specify that you intend to read it, write in it (or both). In OS/2 and DOS, you must also specify if the file is binary or not. When done, the file must be closed.

All file handling functions and data types are defined in <stdio.h>.

Here is a small example program, opening a file specified in the command line, and printing its contents: Now there is a lot to explain:
 * 1) FILE* is the data type used by the file handling functions declared in <stdio.h>. We do not use a variable of FILE* type for anything but to pass to functions handling files and possibly compare to NULL.
 * 2) A check if a parameter has been passed. Note that since "argv[0]" always is the name of the program, "argc" never holds a value less than 1.
 * 3) Here the file is opened. The first argument to "fopen" is the name, and the second is our intent. "r" means the file should be opened as a text file for reading. Had we wanted to read a binary file, the string would have been "rb". "fopen" returns "NULL" if it fails to open the file. "NULL" is a special pointer value usually assigned to pointers not pointing to anything. It is good programming practice to always give pointers the value "NULL" if they do not point to anything useful. [The value of NULL is defined in <stddef.h>, among other standard headers -- EDM]
 * 4) From the inside and out: "fgets" reads a line from the text file, and stores it in "line." If the read fails for some reason, "fgets" returns NULL. In other words, the while loop prints the lines read, as long as reading is successful. Reading will fail when the end of the file is reached. Note that "fgets" reads up until and including the newline character, or until all of "line" is filled. If there is room for the entire line of text in "line", the newline character is also stored, otherwise the rest of the line, including the newline character, will be retrieved by the next call to "fgets" (again, if there is room for it).
 * 5) When we are done with the file, it must be closed. "fclose" returns 0 if it succeeds in closing the file. In this case, however, we ignore the returned result.

A word file
What we want to do, for the word frequency histogram program, is to read words, and not lines. Unfortunately, there is no function in the ANSI C library that reads words from a file, so we must define our own. For the wordfile to be useful, it must have a number of characteristics. For example it must support:
 * Use on any file we want to.
 * Clean close after use.
 * Retrieving the next word in the file.
 * Checks if we have reached the end of the file.

Preferably, it should also be possible to specify what separates words, since this depends on the context. Here is what the prototypes might look like:

    int wordfile_open(const char* name);
    int wordfile_close(void);
    size_t wordfile_nextword(char* buffer, size_t buffersize);

Two new things, that must be explained, just turned up. What does "const char*", in the parameter list for "wordfile_open", mean? The type "const char*" is a pointer to a constant character, that is, a pointer to a character which may not change. Well, in fact it may change, but not through the pointer. In other words, the character (or in this case, character string) passed does not need to be a constant. Instead "const" is a promise, saying that this pointer will not cause the character to change. An example will explain this better:

First "pa" is set to point to "a". This is what was explained last month. That "pca" can point to "a" is not an error. "a" can be changed, either directly or through "pa", but it will not change due to us doing something with "pca", so the promise holds. "pb" cannot get its value from "pca" however, so this line would lead to a compilation error. "pb" is not const, so it promises nothing, meaning it could break the promise "pca" made. Since "pca" has promised not to change whatever it points to, nothing that can change it can get its value from "pca". The last line in the example results in a compilation error because it is illegal to assign a value to the dereferenced const pointer, since otherwise the promise would be broken.

Returning to our wordfile, const in the parameter list means that "wordfile_open" promises not to alter the string passed as the name. The next new thing is "size_t". This is a type declared in a number of the standard headers, <stddef.h> being one of them. It is an unsigned integer type used to represent sizes (usually of objects, but not necessarily so). In this case, it represents the size of the buffer passed to the function and, when returning, the length of the string copied into the buffer.

Semantics
Before beginning to write the code for the wordfile, it's wise to spend a few minutes thinking about how we want it to behave. What should "wordfile_open" do?

    int wordfile_open(const char* name);

The normal operation is of course to just open the file. How do we tell the user of "wordfile_open" if it was successful in opening the file? What parameters are legal? What do we do if the proposed file does not exist, or cannot be opened for reading? Can several wordfiles be opened at the same time? To keep things simple for now, I propose the following characteristics for "wordfile_open":
 * Only one wordfile may be open at a time. An attempt to open a second wordfile is a programming error.
 * Failure to open a wordfile is reported by returning 0, success by returning a non-zero value.
 * Passing the NULL pointer as the name is a programming error.
 * If no file with the passed name exists, open should fail.
 * If the return value is non-zero, the wordfile is opened.

Now we do the same for the other functions of the wordfile.

    int wordfile_close(void);

What do we do if the wordfile is not open? What if it is open, but for some reason cannot be closed? Proposal:
 * Closing a wordfile that is not open is a programming error.
 * Successful closing is reported by a non-zero return value, and failure by zero.
 * If a non-zero value is returned, the wordfile is closed.

    size_t wordfile_nextword(char* buffer, size_t buffersize);

What values for "buffer" and "buffersize" are legal? What do we do if the wordfile is not open? What do we do if there is not room for the word found in buffer? What do we do if end of file is reached? Proposal:
 * It is a programming error to call "wordfile_nextword" if the wordfile is not open.
 * "buffersize" must be at least 2, to hold a minimum of one character and the null-termination.
 * "buffer" must not be the NULL pointer.
 * Return the length of the word copied into buffer. If 0 is returned, no word was found before end of file. If the number returned equals "buffersize", there was not room for the word in "buffer". In this case, the buffer will contain only the first buffersize-1 characters of the word; the remaining characters will be discarded.
 * If end of file is reached when reading a word, the end of the word is also reached, so the word read is copied into buffer, and the length of it returned. The next call will return 0, indicating that the last word has been read.

Now we can write the header file "wordfile.h", and document all the above.
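A sketch of what such a header could look like, with the proposals above written down as comments. The include guard name "WORDFILE_H" and the exact wording of the comments are assumptions; the three prototypes are the ones given earlier.

```c
#ifndef WORDFILE_H
#define WORDFILE_H

#include <stddef.h> /* size_t */

/* Open the file "name" as the wordfile.
   Preconditions:  name is not the NULL pointer.
                   No wordfile is currently open.
   Returns:        non-zero on success, 0 on failure (for example
                   if no file with the passed name exists).
   Postconditions: if the return value is non-zero, the wordfile is open. */
int wordfile_open(const char* name);

/* Close the wordfile.
   Preconditions:  the wordfile is open.
   Returns:        non-zero on success, 0 on failure.
   Postconditions: if the return value is non-zero, the wordfile is closed. */
int wordfile_close(void);

/* Copy the next word of the wordfile into "buffer".
   Preconditions:  the wordfile is open.
                   buffer is not the NULL pointer.
                   buffersize is at least 2.
   Returns:        the length of the word copied into buffer.
                   0 means no word was found before end of file.
                   buffersize means the word did not fit; buffer then
                   holds its first buffersize-1 characters and the rest
                   of the word has been discarded. */
size_t wordfile_nextword(char* buffer, size_t buffersize);

#endif
```

The header only declares the interface. Nothing in it reveals how the functions do their work.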

Programming by Contract
Without mentioning it, I have now explained part of the "programming by contract" concept. For all the functions above, you see a comment part called "Preconditions:" It lists things that must be true when calling the function. For some functions, you also see a "Postconditions:" listing things that will be true when the function has returned. The idea behind "programming by contract" is to make clear who is responsible for what. The functions with postconditions say "If you promise [Precondition:] I promise [Postcondition:] will be true when I'm done." If the precondition is violated, the caller of the function is guilty of doing something wrong. If the postcondition is violated, the function has failed to do its job. "wordfile_nextword" should have a postcondition, but it's very difficult to state one that can be checked, since it depends so much on the file.

When identifying the pre- and postconditions above, I was careful to make sure they were all possible to check. There is a macro defined in <assert.h> called "assert" that is used for this kind of test. Macros will be explained another month, so for now, just see "assert" as a special kind of function taking one argument. It tests if the value is 0 [representing false in C - EDM], and if so, it prints an error message and aborts execution. We can test it with this little program. When I run this program, I get the following result: Not too bad? It would of course be better if it somehow could point out the call that violated the condition, but it's as close as you can get with ANSI/ISO C.

The problem with this kind of check is that you usually only want it during development, and maybe beta test. You don't normally want the checks in the final product, because the tests aren't supposed to fail, but making them takes time. "assert" handles this by doing nothing at all if the macro "NDEBUG" is defined when compiling. "NDEBUG", unlike "assert", does not behave like a function. Instead its presence causes "assert" to do nothing. Most compilers allow defining macros on the compiler's command line, and oddly, most compilers seem to agree on doing this with the -D flag. An example: Just by providing the "-DNDEBUG" flag when compiling, the test was removed.

Back to our wordfile. Have you noticed, by the way, that I have so far not mentioned a word about how this should be implemented? This is not because I've forgotten, but because until now it has been unimportant. What should be done is the most important thing. The job itself can be done in many different ways, but someone using the wordfile is not interested in that.

Implementation
Now, however, we should start thinking of how to implement it. The skeleton of "wordfile.c" can be written right away, making use of "assert" to check the conditions. Before filling in the blanks there is another C detail that requires an explanation. Near the top, you find a line:

    static FILE* file = NULL; /* 1. Explained after the listing */

In this context, the keyword "static" means that the variable "file" is only accessible from this file. It means that if, in another file, an identifier named "file" is referred to, it will not collide with this one. Used like this, static has two advantages: One is that other parts of the program cannot reach the identifier. The other, very similar, is that the global name space is not polluted. If "static" were not available for use like this, you'd have to find some clever name to avoid clashes with names defined in other parts (that perhaps someone else has written), and you'd still not be sure that no one manipulates it without your knowledge. If the variable was not declared "static", someone making use of an identifier named "file" would manipulate this one!

Now to fill in the blanks. At /** 1 **/ some things should be explained. The two lines say:

    while ((c = fgetc(file)) != EOF && !isalnum(c)) /** 1 **/
      ;/* loop until we find an alphanumeric character or EOF */

"fgetc" reads a character from the passed file. It returns the character as an "int", though. The reason is that in case end of file has been reached, it returns "EOF", and "EOF" must be outside the valid range for characters (otherwise, what would you do if you read the character that equals EOF?). The "isalnum" function, declared in <ctype.h>, tells whether the character passed is alphanumeric or not. In <ctype.h> there are several other similar functions. Now, what these two lines do is to read character after character, as long as end of file is not reached and the character read is not alphanumeric.

The loop body is empty, meaning that nothing should be done in the loop. The odd placement of the semicolon is intentional, since it, together with the comment, shows that I wanted to place it there. A programming mistake I have seen a few times too often is a semicolon added by mistake after the last parenthesis, leaving an empty loop body that was not intended. Placing the semicolon on a line of its own makes it more visible, and more likely to be seen as intentional. When the loop exits, either the first alphanumeric character has been read or end of file has been reached.

/** 2 **/ "isalnum" is first called on the last character read from the previous loop. If end of file was reached there, "isalnum" will return 0 since "isalnum" returns 0 for "EOF", and "c" will have the value "EOF" if the end of the file was reached. So, if the end of the file is reached, the loop will not be entered, and the function will report 0 characters copied into buffer.

/** 3 **/ Execution only reaches here if "isalnum" returns true, which it does for all letters in the English alphabet (upper and lower case letters), and the digits.
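Putting the pieces together, "wordfile.c" could be sketched as below. The markers /** 1 **/ to /** 3 **/ are placed where the explanations above apply. The details of the copying loop, the "truncated" flag and the handling of a failed "fclose" are assumptions chosen to satisfy the contract from the "Semantics" section. In a real project the prototypes would come from including "wordfile.h".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

static FILE* file = NULL; /* NULL means that no wordfile is open */

int wordfile_open(const char* name)
{
  assert(name != NULL);  /* precondition */
  assert(file == NULL);  /* precondition: only one wordfile at a time */
  file = fopen(name, "r");
  return file != NULL;   /* non-zero means the wordfile is open */
}

int wordfile_close(void)
{
  int closed;
  assert(file != NULL);  /* precondition: the wordfile is open */
  closed = (fclose(file) == 0);
  file = NULL;           /* the stream is unusable after fclose, even if it failed */
  return closed;
}

size_t wordfile_nextword(char* buffer, size_t buffersize)
{
  size_t length = 0;
  int truncated = 0;
  int c;

  assert(file != NULL);     /* preconditions */
  assert(buffer != NULL);
  assert(buffersize >= 2);

  while ((c = fgetc(file)) != EOF && !isalnum(c)) /** 1 **/
    ;/* loop until we find an alphanumeric character or EOF */

  while (isalnum(c))                              /** 2 **/
  {
    if (length < buffersize - 1)                  /** 3 **/
    {
      buffer[length] = (char)c;
      ++length;
    }
    else
    {
      truncated = 1; /* no room: discard the rest of the word */
    }
    c = fgetc(file);
  }
  buffer[length] = '\0';
  return truncated ? buffersize : length;
}
```

Note that the character ending a word is read and thrown away. That is harmless here, since it cannot be part of the next word.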

Use our wordfile
Now we can use the word file in a small word-reader program:

1. When writing "for" loops, all three expressions (separated by the semicolons) are optional. The first, if any, is executed once, before looping begins. The second is the condition determining when to continue looping and when to stop. The last, usually with a side effect, is evaluated after the loop body for every iteration. In this case, nothing is initialised, no condition is stated (meaning the for loop will not terminate by itself), and nothing is done after the loop body for every iteration. Using an "infinite" loop for reading words is harmless. Sooner or later "wordfile_nextword" will return 0, and when it does, the program breaks out of the loop.

Now, save and compile together with wordfile.c as explained in part 5.

As you can see in the small test program, it doesn't need to know anything about how wordfile does its work, only about its interface. It is this technique that is called "encapsulation," since all internals of how the wordfile works are encapsulated by the interface. The good thing about it is that we can make any changes to wordfile.c we like, for readability, for correcting bugs, for improving performance, or for whatever reason. As long as we still follow the contract set up and documented in wordfile.h, any program making use of wordfile can take advantage of the changes by a simple recompile. It also helps troubleshooting. Since no data about the wordfile is visible outside wordfile.c, any error with the wordfile is either a violation of a precondition, or a bug in wordfile.c.

Now, this has become rather long, so it's time I stopped here.

Recap

 * Command line parameters can be passed to programs by declaring "main" with "argc" and "argv".
 * "main" returns an "int", which becomes the program's return code, used when combining programs on the command line.
 * Functions for reading (and also for writing) files, like "fopen", "fclose", "feof", "fgets" and "fgetc" can be found in <stdio.h>.
 * "const" can be used on pointers as a promise not to change what they point to.
 * Programming by contract clarifies responsibilities and helps pinpoint errors when enforced with "assert".
 * The "assert" macro helps to enforce pre- and postconditions. Once a program is debugged, the effect of "assert" can be removed by recompiling with the "NDEBUG" macro defined.
 * Encapsulation of implementation details improves maintainability, since changes can be made to implementations without affecting other parts of the program.
 * Encapsulation encourages code reuse, since it is easy to incorporate a complete package.

Coming up
Next month two major restrictions on the wordfile will be removed. The implementation will be changed so it allows the user to define what is in a word and what isn't, and it will be possible to have several wordfiles open at the same time. Please don't hesitate to e-mail me if you have questions, wishes for details to cover or want me to clarify things.