An Introduction to C++ Programming - Part 10/13

Written by Björn Fahller

An improved file array
[Note: the source code for this month is introcpp10.zip here. Ed.]

Last month a file based array template for truly huge amounts of data was introduced. While good, it was nowhere near our goals. Error handling was missing completely, making it dangerous to use in real life. There was no way to say how a user defined data type should be represented on disk, yet they weren't disallowed, which is a dangerous combination. It was also lacking iterators, something that is handy, and is an absolute requirement for generic programming with algorithms that are independent of the source of the data. On top of that, we'd really like the ability to store several different arrays in the same file, and also have an anonymous array which creates a temporary file and removes it when the array is destroyed. All of these will be dealt with this month, yet very little will be new. Instead it's time to make use of all the things learned so far in the course.

The data representation problem
In the file array as implemented last month, data was always stored in a raw binary format, exactly mirroring the bits as they lay in memory. This works fine for integers and such, but can be disastrous in other situations. Imagine a file array of strings (where string is a char*). With the implementation from last month, the pointer value would be stored, not the data pointed to. When reading, a pointer value is read, and when dereferenced, whatever happens to be at the memory location pointed to (if anything) will be used (which is more than likely to result in a rather quick crash.) Anything with pointers is dangerous when stored in a raw binary format, yet we must somehow allow pointers in the array, and preferably so without causing problems for those using the array with built-in arithmetic types. How can this be done?

In part 4, when templates were introduced, a clever little construct called traits classes was shown. I then gave this rather terse description: ''A traits class is never instantiated, and doesn't contain any data. It just tells things about other classes, that is its sole purpose.'' Doesn't that smell like something we can use here? A traits class that tells how the data types should be represented on disk?

What do we need from such a traits class? Obviously, we need to know how much disk space each element will take, so a size member will definitely be necessary, otherwise we cannot know much disk space will be required. We also need to know how to store the data, and how to read it. The easiest way is probably to have member functions writeTo and readFrom in the traits class. Thus we can have something looking like this: The array is then rewritten to use this when dealing with the data. The change is extremely minor. storeElement needs to be rewritten as: The change for readElement is of course analogous. However, as indicated by the last comment, a new error possibility has shown up. What if the writeTo and readFrom members of the traits class are buggy and write or read more data to disk than they're allowed to? Since it's the user of the array that must write the traits class (at least for their own data types) we cannot solve the problem, but we can give the user a chance to discover that something went wrong. Unfortunately for writing, the error is extremely severe; it means that the next entry in the array will have its data destroyed...

In the traits class, by the way, the constant size, used for telling how many bytes in the stream each T will occupy, poses a problem with most C++ compilers today (modern ones mostly makes life so much easier.) The problem is that a static variable, and also a static constant, in a class, needs to reside somewhere in memory, and the class declaration is not enough for that. This problem is two-fold. To begin with, where should it be stored? It's very much up to whoever writes the class, but somewhere in the code, there must be something like: const size_t ArrayFileElementAccess::size = ...; where X is the name of the class dealt with by the particular traits specialisation.

The second problem is that this is totally unnecessary. What we want is a value that can be used by the compiler at compile time, not a memory location to read a value from. As I mentioned, a modern compiler does make this much easier. In standard C++ it is allowed to write: Note that for some reason that I do not know, this construct is only legal if the type is a constant of an integral or enumeration type. size_t is such a type, it's some unsigned integral type, probably unsigned int, but possibly unsigned long. The expression denoted ... must be possible to evaluate at compile time. Unless code is written that explicitly takes the address of size, we need not give the constant any space to reside in. The odd construct template <> is also new C++ syntax, and means that what follows is a specialisation of a previously declared template.

For old compilers, however, there's a work-around for integral values, no larger than the largest int value. We cheat and use an enum instead of a size_t. This makes the declaration: This is a bit ugly, but it is perfectly harmless.

The advantage gained by adding the traits class is flexibility and safety. If someone wants to use a file array for their own class, they're free to do so. However, they must first write a FileArrayElementAccess specialisation. Failure to do so will result in a compilation error. This early error detection is beneficial. The sloppy solution from last month would not yield any error until run-time, which means a (usually long) debugging session.

Several arrays in a file
What is needed in order to host several arrays in the same file? One way or the other, there must be a mechanism for finding out where one array begins and another ends. I think the simplest solution, is to let go of the file names, and instead make the constructors accept an fstream&. We can then require that the put and get pointer of the stream must be where the array can begin, and we can in turn promise that the put and get pointer will be positioned at the byte after the array end. Of course, in addition to having a reference to the fstream in our class, we also need the home position, to seek relative to, when indexing the array. This becomes easy to write for us, it becomes easy to use as well. For someone requiring only one array in a file, there'll be slightly more code, an fstream object must be explicitly initialised somewhere, and passed to the constructor of the array, instead of just giving it a name. I think the functionality increase/code expansion exchange is favourable.

In order to improve the likelihood of finding errors, we can waste a few bytes of disk space by writing a well known header and trailer pattern at the beginning and end of the array (before the first element, and after the last one.) If someone wants to allocate an array using an existing file, we can find out if the get pointer is in place for an array start.

The constructor creating a file should, however, first try to read from the file to see if it exists. If it does, it should be created from the file, just like the constructor accepting a stream only does. If the read fails, however, we can safely assume that the file doesn't exist and should instead be created.

The change in the class definition, and constructor implementation is relatively straight forward, if long: Other than the above, the only change needed for the array is that seeking will be done relative to home rather than the beginning of the file (plus the size of the header entries.) The new versions of storeElement and readElement become:

Temporary file array
Making use of a temporary file to store a file array that's not to be persistent between runs of the application isn't that tricky. The implementation so far makes use of a stream and known data about the beginning of the stream, number of elements and size of the elements. This can be used for the temporary file as well. The only thing we need to do is to create the temporary file first, open it with an fstream object, and tie the stream reference to that object, and remember to delete the file in the destructor.

What's the best way of creating something and making sure we remember to undo it later? Well, of course, creating a new helper class which creates the file in its constructor and removes it in its destructor. Piece of cake. The only problem is that we shouldn't always create a temporary file, and when we do, we can handle it a bit different from what we do with a global file that can be shared. For example, we know that we have exclusive rights to the file, and that it won't be reused, so there's no need for the extra information in the beginning and end. So, how's a temporary file created? The C++ standard doesn't say, and neither is there any support for it in the old de-facto standard. I don't think C does either. There are, however, two functions tmpnam and tempnam defined as commonly supported extensions to C. They can be found in . I have in this implementation chosen to use tempnam as it's more flexible. tempnam works like this: it accepts two string parameters named dir and prefix. It first attempts to create a temporary file in the directory pointed to by the environment variable TMPDIR. If that fails, it attempts to create it in the directory indicated by the dir parameter, unless it's 0, in which case a hard-coded default is attempted. It returns a char* indicating a name to use. The memory area pointed to is allocated with the C function malloc, and thus must be deallocated with free and not delete[].

Over to the implementation details:

We add a class called temporaryfile, which does the above mentioned work. We also add a member variable pfile which is of type ptr . Remember the ptr template from last month? It's a smart pointer that deallocates whatever it points to in its destructor. It's important that the member variable pfile is listed before the stream member, since initialisation is done in the order listed, and the stream member must be initialised from the file object owned by pfile. We also add a constructor with the number of elements as its sole parameter, which makes use of the temporary file. In the above code, tempnam, remove and free are prefixed with ::, to make sure that it's the names in global scope that are meant, just in case someone enhances the class with a few more member functions whose name might clash.

For the sake of syntactical convenience, I have added yet another operator to the ptr class template: It's the operator-> that's new, which allows us to write things like p->x, where p is a ptr, and the type X contains some member named x. The return type for operator-> must be something that operator-> can be applied to. The explanation sounds recursive, but it makes sense if you look at the above code. ptr::operator-> returns an X*. X* is something you can apply the built in operator-> to (which gives you access to the elements.) That's it! The rest of the array works exactly as before. No need to rewrite anything else.

Code reuse
If you're an experienced C programmer, especially experienced with programming embedded systems where memory constraints are tough and you also have a good memory, you might get a feeling that something's wrong here.

What I'm talking about is something I mentioned the first time templates were introduced: ''Templates aren't source code. The source code is generated by the compiler when needed. This means that if we in a program uses FileArray, FileArray , FileArray and FileArray (where X and Y'' are some classes,) there will be code for all four types. Now, have a close look at the member functions and see in what way FileArray ::FileArray(iostream& fs, size_t elements) differs from FileArray ::FileArray(iostream& fs, size_t elements). Please do compare them.

What did you find? The only difference at all is in the handling of the member elem_size, yet the same code is generated several times with that as the only difference. This is what is often referred to as the template code bloat of C++. We don't want code bloat. We want fast, tight, and slick applications.

Since the only thing that differs is the size of the elements, we can move the rest to something that isn't templatised, and use that common base everywhere. I've already shown how code reuse can be done by creating a separate class and have a member variable of that type. In this article I want to show an alternative way of reusing code, and that is through inheritance. Note very carefully that I did not say public inheritance. Public inheritance models is-A relationships only. We don't want an is-A relationship here. All we want is to reuse code to reduce code bloat. This is done through private inheritance. Private inheritance is used far less than it should be. Here's all there is to it. Create a class with the desired implementation to reuse and inherit privately from it. Nothing more, nothing less. To a user of your class, it matters not at all if you chose not to reuse code at all, reuse through encapsulation of a member variable, or reuse through private inheritance. It's not possible to refer to the descendant class through a pointer to the private base class, private inheritance is an implementation detail only, and not an interface issue.

To the point. What can, and what can not be isolated and put in a private base class? Let's first look at the data. The stream reference member can definitely be moved to the base, and so can the pfile member for temporary files. The array_size member can safely be there too and also the home member for marking the beginning of the array on the stream. By doing that alone we have saved just about nothing at all, but if we add as a data member in the base class the size (on disk) for the elements, and we can initialise that member through the FileArrayElementAccess::size traits member, all seeking in the file, including the initial seeking when creating the file array, can be moved to the base class. Now a lot has been gained. Left will be very little. Let's look at the new improved implementation:

Now for the declaration of the base class. The only surprise here should be the nesting of the class temporaryfile. Yes, it's possible to define a class within a class. Since the temporaryfile class is defined in the private section of FileArrayBase, it's inaccessible from anywhere other than the FileArrayBase implementation. It's actually possible to nest classes in class templates as well, but few compilers today support that. When implementing the member functions of the nested class, it looks a bit ugly, since the surrounding scope must be used. The implementation of FileArrayBase is very similar to the FileArray earlier. The only difference is that we use a parameter for the element size, instead of the traits class. To make life a little bit easier, I've assumed two arrays of char named ArrayBegin and ArrayEnd, which hold the patterns to be used for marking the beginning and end of an array on disk. Apart from the tricky questions, it's all pretty straight forward. The really good news, however, is how easy this makes the implementation of the class template FileArray. Now watch this! How much easier can it get? This reduced code bloat, and also makes the source code easier to understand, extend and maintain.

What can go wrong?
Already in the very beginning of this article series, part 1, I introduced exceptions; the C++ error handling mechanism. Of course exceptions should be used to handle the error situations that can occur in our array class. When I introduced exceptions, I didn't tell the whole truth about them. There was one thing I didn't tell, because at that time it wouldn't have made much sense. That one thing is that when exceptions are caught, dynamic binding works, or to use wording slightly more English-like, we can create exception class hierarchies with public inheritance, and we can choose what level to catch. Here's a mini example showing the idea: At **1 above, objects of class B and class B1 are caught if thrown from f. In **2 objects of class C (and descendants of C, if any are declared elsewhere) are caught. At **3 all others from the A hierarchy are caught.

This may seem like a curious detail of purely academic worth, but it's extremely useful. We can use abstraction levels for errors. For example, we can have a root class FileArrayException, from which all other exceptions regarding the file array inherits. We can see that there are clearly two kinds of errors that can occur in the file array; abuse and environmental issues outside the control of the programmer. For abuse I mean things like indexing outside the valid bounds, and with environmental issues I mean faulty or full disks (Since there are several programs running, a check if there's enough disk space is still taking a chance. Even if there was enough free space when the check was made, that space may be occupied when the next statement in the program is executed.)

A reasonable start for the exception hierarchy then becomes: Here FileArrayLogicError are for clear violations of the not too clearly stated preconditions, and FileArrayRuntimeError for things that the programmer may not have a chance to do something about. In a perfectly debugged program, the only exceptions ever thrown from file arrays will be of the FileArrayRuntimeError kind.

We can divide those further into: class FileArrayCreateError : public FileArrayRuntimeError {}; For whenever the creation of the array fails, regardless of why (it's not very easy to find out if it's a faulty disk or lack of disk space, for example.) class FileArrayStreamError : public FileArrayRuntimeError {}; If after creation, something goes wrong with a stream; for example if seeking or reading/writing fails. class FileArrayDataCorruptionError : public FileArrayRuntimeError {}; If an array is created from an old existing file, and we note that the header or trailer doesn't match the expected. class FileArrayBoundsError : public FileArrayLogicError {}; Addressing outside the legal bounds. class FileArrayElementSizeError : public FileArrayLogicError {}; If the read/write members of the element access traits class are faulty and either write too much (thus overwriting the data for the next element) or reads too much (in which case the last few bytes read will be garbage picked from the next element.)

It's of course possible to take this even further. I think this is quite enough, though.

Now we have a reasonably fine level of error reporting, yet an application that wishes a coarse level of error handling can choose to catch the higher levels of the hierarchy only.

As an exercise, I invite you to add the throws to the code. Beware, however; it's not a good idea to add exception specifications to the member functions making use of the T's (since you cannot know which operations on T's that may throw, and what they do throw.) You can increase the code size and eligibility gain from the private inheritance of the implementation in the base by putting quite a lot of the error handling there.

Iterators
An iterator into a file array is something whose behavior is analogous to that of pointers into arrays. We want to be able to create an iterator from the array (in which case the iterator refers to the first element of the array.) We want to access that element by dereferencing the iterator (unary operator *,) and we want iterator arithmetic with integers.

An easy way of getting there is to let an iterator contain a pointer to a file array, and an index. Whenever the iterator is dereferenced, we return (*array)[index]. That way we even have error handling for iterator arithmetic that lead us outside the valid range for the array given for free from the array itself. The iterator arithmetics becomes simple too, since it's just ordinary arithmetics on the index type. The implementation thus seems easy; all that's needed is to define the operations needed for the iterators, and the actions we want. Here's my idea:
 * creation from array yields iterator referring to first element
 * copy construction and assignment are of course well behaved.
 * moving forwards and backwards with operator++ and operator--.
 * addition of array and long int value n yields iterator referring to n:th element of array.
 * iterator+=n (where n is of type long int) adds n to the value of the index in the iterator. This addition is never an error; it's dereferencing the iterator that's an error if the index is out of range. Operator -= is analogous.
 * iterator+n yields a new iterator referring to the iterator.index+n:th element of the array, and analogous for operator-.
 * iterator1-iterator2 yields a long int which is the difference between the indices of the iterators. If iterator1 and iterator2 refer to different arrays, it's an error and we throw an exception.
 * iterator1==iterator2 returns non-zero if the arrays and indices of iterator1 and iterator2 are equal.
 * iterator1!=iterator2 returns !(iterator1==iterator2)
 * *iterator returns whatever (*array)[index] returns, i.e a
 * leArrayProxy. * iterator[n] returns (*array)[index+n].
 * iterator1.
 * iterator1>=iterator2 returns !(iterator1<iterator2). Likewise for operator<=.

I think the above is an exhaustive list. Neither of the above is difficult. It's just a lot of code to write, and thus a good chance of making errors. With a little thought, however, quite a lot of code can be reused over and over, thus reducing the amount to write and also the risk for errors. As an example, a rule of thumb when writing a class for which an object o and some other value v the operations o+=v, o+v and v+o are well defined and behaves like they do for the built in types (which they really ought to, unless you want to give the class users some rather unhealthy surprises) is to define operator+= as a member of the class, and two versions of operator+ that are implemented with operator+=. Here's how it's done in the iterator example: Surely, the code for the two versions of operator+ must be written, but since its behaviour is defined in terms of operator+= it means that if we have an error, there's only one place to correct it.

There's no need to display all the code here in the article, you can study it in the sources. The above shows how it all works, though, and as you can see, it's fairly simple.

Recap
This month the news in short was:
 * You can increase flexibility for your templates without sacrificing ease of use or safety by using traits classes.
 * Enumerations in classes can be used to have class-scope constants of integral type.
 * Modern compilers do not need the above hack. Defining a class-scope static constant of an integral type in the class declaration is cleaner and more type safe.
 * Standard C++ and even C, does not have any support for the notion of temporary files. Fortunately there are commonly supported extensions to the languages that do.
 * Private inheritance can be used for code reuse.
 * Private inheritance is very different from public inheritance. Public inheritance models is-A relationships, while private inheritance models is-implemented-in-terms-of relationships.
 * A user of a class that has privately inherited from something else cannot take advantage of this fact. To a user the private inheritance doesn't make any difference.
 * Private inheritance is in real-life used far less than it should be. In many situations where public inheritance is used, private inheritance should've been used.
 * Exception catching is polymorphic (i.e. dynamic binding works when catching.)
 * The polymorphism of exception catching allows us to create an arbitrarily fine-grained error reporting mechanism while still allowing users who want a coarse error reporting mechanism to use one (they'll just catch classes near the root of the exception class inheritance tree.)
 * Always implement binary operator+, operator-, operator* and operator/ as functions outside the classes, and always implement them in terms of the operator+=, operator-=, operator*= and operator/= members of the classes.

Exercises

 * Alter the file array such that it's possible to instantiate two (or more) kinds of FileArray in the same program, where the alternatives store the data in different formats. (hint, the alternatives will all need different traits class specialisations.)
 * What's the difference between using private inheritance of a base class, and using a member variable of that same class, for reusing code?
 * In which situations is it crucial which alternative you choose?

Coming up
Next month we'll have a look at smart pointers. I'm beginning to dry up on topics now, however, so please write and give me suggestions for future topics to cover.