An Introduction to C++ Programming - Part 9/13

Written by Björn Fahller

File I/O and Binary Streams
In parts 5 and 6, the basics of I/O were introduced, with formatted reading and writing from standard input and output. We'll now have a look at I/O for files. In a sense, it's better to stop using the term I/O here, and instead use streams and streaming, since the ideas expressed here and in parts 5 and 6 can be used for other things than I/O, for example in-memory formatting of data (we'll see that at the very end of this article.)

Files
In what way is writing Hello world on standard output different from writing it to a file? The question is worth some thought, since in many programming languages there is a distinct difference. Is the message different? Is the format (as seen from the program) different? I cannot see any difference in those aspects. The only thing that truly differs is the medium where the formatted message ends up. In the former case, it's on your screen, but for file I/O it's in a file somewhere on your hard disk. In other words, there is very little difference, or at least, there's very much in common.

As we've seen so far, commonality is expressed either through inheritance or templates, depending on what's common and what's not. To refresh your memory, templates are used when we want the same kind of behaviour, independent of data, for example a stack of some data type. Inheritance is used when you want similar, but in some important aspects different, behaviour at runtime for the same kind of data. We saw this for the staff hierarchy and mailing addresses in parts 7 and 8. In this case it's inheritance that's the correct solution, since the data will be the same, but where it will end up (and most notably, how it ends up there) differs. (Incidentally, there's a good case for using templates too, regarding the type of characters used. The C++ standard does indeed have templatized streams, just for distinguishing between character types. Few compilers today support this, however. See the Standards Update towards the end of the article for more information.)

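The inheritance tree for stream types looks like this:

[Figure: inheritance tree of the stream classes ios, istream, ostream, iostream, ifstream, ofstream and fstream]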

The way to read this is that there's a base class named ios, from which the classes istream and ostream inherit. The classes ifstream and ofstream in their turn inherit from istream and ostream respectively. The f in the names implies that they're file streams. Then there are the odd ones: iostream, which inherits from both istream and ostream, and fstream, which inherits from both ifstream and ofstream. Inheriting from two bases is called multiple inheritance, and is by many seen as evil. Many programming languages have banned it, Objective-C, Java and Smalltalk to mention a few, while other programming languages, like Eiffel, go to the other extreme and allow you to inherit the same base several times. Personally I think multiple inheritance is very useful if used right, but it can cause severe problems. Here is a situation where it's used in the right way. Anyway, this means that fstream is a file stream for both reading and writing, while iostream is an abstract stream for both reading and writing. More often than you might think, you don't want to use the iostream or fstream classes.

This inheritance, however, means that all the stream insertion and extraction functions (the operator>> and operator<<) you've written will work unchanged with file streams. Now, wasn't that neat? In other words, the only things you need to learn for file based I/O are the details that are specific to files.
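To make the point concrete, here is a minimal sketch, assuming a compiler with the standard headers (see the Standards update) and a made up Point type that exists only for this illustration. The operator<< is written once against std::ostream and is then used with both a file stream and an in-memory stream.

```cpp
#include <cassert>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical type, used only for this illustration.
struct Point
{
  int x;
  int y;
};

// Written once, against the base class std::ostream.
std::ostream& operator<<(std::ostream& os, const Point& p)
{
  return os << '(' << p.x << ',' << p.y << ')';
}

// Uses the operator with an in-memory stream.
std::string toText(const Point& p)
{
  std::ostringstream os;
  os << p;
  return os.str();
}

// Uses the very same operator with a file stream, then reads the
// line back from the file so that the result can be inspected.
std::string viaFile(const Point& p, const char* name)
{
  assert(name != 0);
  {
    std::ofstream out(name);
    out << p << '\n';
  }                            // the destructor closes the file here
  std::ifstream in(name);
  std::string line;
  std::getline(in, line);
  return line;
}
```

Neither toText nor viaFile needed a new operator<<; the one taking std::ostream& is picked in both cases because both stream types inherit from std::ostream.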

File Streams
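The first thing you need to know before you can use file streams is how to create them. You get access to the classes by #including <fstream.h>. Each of the classes ifstream, ofstream and fstream has an empty constructor, a constructor that accepts a name and a mode, and a member function open that accepts the same parameters. The empty constructors always create a file stream object that is not tied to any file. To tie such an object to a file, a call to open must be made. open and the constructors with parameters behave identically. name is of course the name of the file. Since you normally use either ifstream or ofstream and rarely fstream, this is normally the only parameter you need to supply. Sometimes, however, you need to use the mode parameter. It's a bit field, in which you use bitwise or (operator|) to combine any of the values ios::in, ios::out, ios::ate, ios::app, ios::trunc, and finally ios::binary. Some implementations also provide ios::nocreate and ios::noreplace, but those are extensions. Some implementations do not have ios::binary, while others call it ios::bin. These variations of course make it difficult to write portable C++ today. Fortunately, the six listed first are required by the standard (although they belong to class ios_base, rather than ios.) The meanings of these are:

 * ios::in - open for reading.
 * ios::out - open for writing.
 * ios::ate - position at the end of the file immediately after opening it.
 * ios::app - append, that is, every write goes to the end of the file.
 * ios::trunc - discard any existing contents of the file when opening it.
 * ios::binary - open in binary mode, without translation of line endings.
 * ios::nocreate - (extension) fail if the file does not already exist.
 * ios::noreplace - (extension) fail if the file already exists.

Of course combinations like ios::noreplace | ios::nocreate don't make sense - the failure is guaranteed.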
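As a small sketch of the empty constructor, open and the mode parameter working together, consider the following. It assumes the standard header <fstream> and the std:: names rather than the older ones, and appendLine and countLines are names I made up for the example.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Appends one line to the named file, creating the file if needed.
// Returns false if the file could not be opened.
bool appendLine(const char* name, const std::string& text)
{
  assert(name != 0);
  std::ofstream out;                              // not tied to any file yet
  out.open(name, std::ios::out | std::ios::app);  // now it is
  if (!out) return false;
  out << text << '\n';
  return true;
}

// Counts the lines in the named file, 0 if it cannot be opened.
int countLines(const char* name)
{
  assert(name != 0);
  std::ifstream in(name);   // the constructor with a name behaves like open
  int n = 0;
  std::string line;
  while (std::getline(in, line)) ++n;
  return n;
}
```

Calling appendLine twice on the same file leaves two lines in it, since ios::app makes every write go to the end; opening the file with ios::out | ios::trunc instead would have thrown the first line away.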

On many implementations today there's also a third parameter for the constructors and open; a protection parameter. How this parameter behaves is very operating system dependent.

Simple usage holds no surprises: once the stream object is created, its usage is analogous to that of cout that you're already familiar with. Of course reading with ifstream is done the same way, just use the object as you've used cin earlier.
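A minimal sketch of such usage might look like this, assuming the standard headers and std:: names; the function names and the file contents are my own choices for the illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Writes a greeting and a number, formatted exactly as with std::cout.
void writeGreeting(const char* name, int answer)
{
  assert(name != 0);
  std::ofstream out(name);
  out << "Hello world\n" << answer << '\n';
}   // the destructor closes the file

// Reads the data back, exactly as with std::cin.
// Returns true if both the greeting line and the number could be read.
bool readGreeting(const char* name, std::string& greeting, int& answer)
{
  assert(name != 0);
  std::ifstream in(name);
  std::getline(in, greeting);
  in >> answer;
  return !in.fail();
}
```

If the file cannot be opened, every read fails, so readGreeting returns false without any explicit check of the open itself.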

The file stream classes also have a member function close, that by force closes the file and unties the stream object from it. Few are the situations when you need to call this member function, since the destructors do close the file.

Actually this is all there is that's specific to files.

Binary streaming
So far we've dealt with formatted streaming only, that is, the process of translating raw data into a human readable form, or translating human readable data into the computer's internal representation. Sometimes you want to stream raw data as raw data, for example to save space in a file. If you look at a file produced by, for example, a word processor, it's most likely not in a human readable form. Note that binary streaming does not necessarily mean using the ios::binary mode when opening a file (although that is indeed often the case.) They're two different concepts. Binary streaming is what you use your stream for, raw data that is, while opening a file with the ios::binary mode means turning the brain damaged LF<->CR/LF translation off.

Binary streaming is done through a handful of stream member functions. The writing interface is extremely simple and straightforward, while the reading interface includes a number of small but important differences. Note that these member functions are implemented in classes istream and ostream, so they're not specific to files, although files are where you're most likely to use them. Let's have a look at them, one by one:

ostream& ostream::write(const char* s, streamsize n);
Write n characters to the stream, from the array pointed to by s. streamsize is a signed integral data type. Despite streamsize being signed, you're of course not allowed to pass a negative size here (what would that mean?) Exactly the characters found in s will be written to the stream, no more, no less.

ostream& ostream::put(char c);
Inserts the character into the stream.

ostream& ostream::flush();
Force the data in the stream to be written (file streams are usually buffered.)

istream& istream::read(char* s, streamsize n);
Read n characters into the array pointed to by s. Here you had better make sure that the array is large enough, or unpleasant things will happen. Note that only the characters read from the stream are inserted into the array. It will not be zero terminated, unless the last character read from the stream indeed is '\0'.

int istream::get();
Read one character from the stream, and return it. The value is an int instead of char since the return value might be EOF (which is not uniquely representable as a char.)

istream& istream::get(char& c);
Same as above, but read the character into c instead. Here a char is used instead of an int, since you can check for end of file directly by calling .eof() on the reference returned.

istream& istream::get(char* s, streamsize n, char delim='\n');
This one's similar to read above, but with the difference that it reads at most n-1 characters and always zero terminates the array. It stops early if the delimiter character is found. Note that when the delimiter is found, it is not read from the stream.

istream& istream::getline(char* s, streamsize n, char delim='\n');
The only difference between this one and get above is that this one does read the delimiter from the stream. Note, however, that the delimiter is not stored in the array.

istream& istream::ignore(streamsize n=1, int delim=EOF);
Reads at most n characters from the stream, but doesn't store them anywhere. If the delimiter character is read, it stops there. Of course, if the delimiter is EOF (as is the default) it does not read past EOF, that's physically impossible.
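The differences between get, getline and ignore are easiest to see in use. The following is a sketch under my own assumptions: the Entry type, the key=value line format and the function names are invented for the illustration, and the standard headers are used.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical record for one "key=value" line, with fixed size
// buffers to keep the example short.
struct Entry
{
  char key[16];
  char value[16];
};

// Reads one "key=value\n" line with the unformatted functions.
bool readEntry(std::istream& is, Entry& e)
{
  is.get(e.key, sizeof(e.key), '=');     // stops at '=', leaves it in the stream
  is.ignore(1, '=');                     // so the '=' must be skipped by hand
  is.getline(e.value, sizeof(e.value));  // reads the '\n' but does not store it
  return !is.fail();
}

// Copies everything from is to os, one character at a time.
// Unlike operator>>, get does not skip whitespace.
void copyAll(std::istream& is, std::ostream& os)
{
  char c;
  while (is.get(c)) os.put(c);
}
```

Since these are members of istream and ostream, the functions above work on any stream; a string stream is handy for trying them out without touching the disk.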

Array on file
An example: Say we want to store an array of integers in a file, and we want to do this in raw binary format. Naturally we want to be able to read the array as well. A reasonable way is to first store a size (in elements) followed by the data. Both the size and the data will be in raw format. Writing them takes a lot of ugly type casting, but that's normal for binary streaming. With the number of elements in a variable elems and a pointer p to the first element, what's done is to use brute force to see the address of elems as a const char* (since that's what write expects) and then say that only the sizeof(elems) bytes from that pointer are to be written. What this actually does is to write out the raw memory that elems resides in to the stream. After this, the same kind of thing is done for the array, with the size sizeof(*p) times elems. Note that sizeof(*p) reports the size of the type that p points to. I could as well have written sizeof(int), but that is a dangerous duplication of facts. It's enough that I've said that p is a pointer to int. Repeating int again just means I'll forget to update one of them when I change the type to something else.
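The steps just described can be sketched as below. The function name storeArray and its parameter list are assumptions of mine, and it takes any std::ostream so that it can be tried on an in-memory stream as well as on a file.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>

// Writes the number of elements, followed by the raw bytes of the
// elements themselves.
void storeArray(std::ostream& os, const int* p, std::size_t elems)
{
  assert(p != 0 || elems == 0);
  // See the address of elems as a const char*, and write its raw memory.
  os.write((const char*)&elems, sizeof(elems));
  // The same kind of thing for the array; sizeof(*p) follows the type of p.
  os.write((const char*)p, sizeof(*p) * elems);
}
```

A file written this way is only readable on a system with the same sizes and byte order for size_t and int, which is the price for using raw format.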

Reading such an array into memory requires a little more work, but it's not particularly hard to follow: first read the number of elements, then allocate an array of that size, and read the data into it.
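A sketch of the reading side, under the same assumptions as for writing: the name readArray and the choice to return 0 on failure are mine, and the layout is a size in elements followed by the raw data.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Reads an array stored as an element count followed by raw element bytes.
// Returns an array allocated with new[], which the caller must delete[],
// or 0 if the stream did not hold enough data.
int* readArray(std::istream& is, std::size_t& elems)
{
  elems = 0;
  std::size_t n = 0;
  is.read((char*)&n, sizeof(n));        // first the number of elements
  if (!is) return 0;
  int* p = new int[n];                  // then an array of that size
  is.read((char*)p, sizeof(*p) * n);    // and finally the data
  if (!is)
  {
    delete[] p;
    return 0;
  }
  elems = n;
  return p;
}
```

Note that a corrupt size in the stream makes new[] attempt a huge allocation; a more careful version would want some sanity check on n first.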

Seeking
Until now we have seen streams as just what the name suggests, continuous streams of data. Sometimes, however, there's a need to move around, both backward and forward. Streams like standard input and standard output are truly continuous streams, within which you cannot move around. Files, in contrast, are true random access data stores. Random access streams have something called position pointers. They're not to be confused with pointers in the normal C++ sense; they refer to where in the file you currently are. There's the put pointer, which refers to the next position to write data to, if you attempt to write anything, and the get pointer, which refers to the next position to read data from. An ostream of course only has the put pointer, and an istream only the get pointer. There's a total of 6 new member functions that deal with random access in a stream: tellg and two versions of seekg in istream, and tellp and two versions of seekp in ostream. streampos, which you get from tellg and tellp, is an absolute position in a stream. You cannot use the values for anything other than seekg and seekp. You especially cannot examine a value and hope to find something useful there (that is, you can, but what you find out might hold only for the current release of your specific compiler. Other compilers, or other releases of the same compiler, might show different characteristics for streampos.) Well, there are two other things you can do with streampos values. You can subtract two values, and get a streamoff value, and you can add a streamoff value to a streampos value. streamoff, by the way, is some signed integral type, probably a long.

By using the value returned from tellg or tellp, you have a way of finding your way back, or of doing relative seeks by adding/subtracting streamoff values.

The versions of seekg and seekp that accept a streamoff value and a direction work in a slightly different way. You seek to a position relative to the beginning of the stream, the end of the stream, or the current position. The selection is made through the ios::seek_dir enum, which has the three values ios::beg, ios::end and ios::cur. To make the next write occur on the very first byte of the stream, call os.seekp(0,ios::beg), where os is some random access ostream.
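Both ways of seeking can be sketched as follows, assuming the standard headers. peekWord uses the absolute position from tellg to find its way back, and patchByte seeks relative to the beginning of a file; both names are made up for the example.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Reads the next word without consuming it, by remembering the get
// position and seeking back to it afterwards. Note that it also
// clears any error state on the stream.
std::string peekWord(std::istream& is)
{
  std::streampos here = is.tellg();
  std::string word;
  is >> word;
  is.clear();        // reading the last word may have hit end of file
  is.seekg(here);
  return word;
}

// Overwrites one byte at the given offset from the beginning of an
// existing file. Returns false if the file cannot be opened or written.
bool patchByte(const char* name, std::streamoff offset, char c)
{
  assert(offset >= 0);
  std::fstream f(name, std::ios::in | std::ios::out | std::ios::binary);
  if (!f) return false;
  f.seekp(offset, std::ios::beg);
  f.put(c);
  return !f.fail();
}
```

patchByte opens the file for both reading and writing so that the existing contents are kept; opening with ios::out alone would truncate the file.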

In any reasonable implementation, the seek member functions use lazy evaluation. That is, when you call any of the seek member functions, the only thing that happens is that some member variable in the stream object changes value. It's not until you actually read or write that something truly happens on disk (or wherever the stream data resides.)

A stream array, for really huge amounts of data
Suppose we have a need to access enormous amounts of simple data, say 10 million floating point numbers. It's not a very good idea to just allocate that much memory, at least not on my machine with a measly 64Mb RAM. It'll not just make this application crawl, but probably the whole system due to excessive paging. Instead, let's use a file to access the data. This makes for slow access, for sure, but nothing else will suffer.

Here's the idea. The array must be possible to use with any data type, including user defined classes. Its usage must resemble that of real arrays as much as possible, but extra functionality that arrays do not have, such as asking for the number of elements in it, is OK. There must be a type, resembling pointers to arrays, that can be used for traversing it. We do not want the size of the array to be part of its type (if you've programmed in Pascal, you know why.) In addition to arrays, we want some measures of safety from stupid mistakes, such as addressing beyond the range of the array, and also for errors that arrays cannot have (disk full, cannot create file, disk corruption, etc.) We also want to say that an array is just a part of a file and not necessarily an entire file. This would allow the user to create several arrays within the same file. To prevent this article from growing way too long, quite a few of the above listed features will be left for next month. The things to cover this month are: An array of built-in fundamental types only, which lacks pointers and is limited to one file per array. We'll also skip error handling for now (you can add it as an exercise, I'll raise some interesting questions along the way,) and add that too next month.

First of all, the array must be a template, so it can be used to store arbitrary types. Since we do not want the size to be part of the type signature, the size is not a template parameter, but a parameter for the constructor. Of course, we cannot have the entire array duplicated in memory (then all the benefits will be lost,) instead we will search for the data on file every time it's needed.

The outline for the class holds few surprises. As can be expected, operator[] can be overloaded, which is handy for providing a familiar syntax. However, already here we see a problem. What's the non-const operator[] to return? To see why this is a problem, ask yourself what you want operator[] to do. I want operator[] to do two things, depending on where it's used: when operator[] is on the left hand side of an assignment, as in arr[2]=0, I want to write data to the file, and if it's on the right hand side of an assignment, as in int x=arr[2], I want to read data from the file. Ouch.

Warning: I've often seen it suggested that the solution is to have the const version read and return a value, and the non-const version write a value. As slick as it would be, it's wrong and it won't work. The const version is called for const array objects, the non-const version for non-const array objects.
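The warning is easy to verify with a small experiment. In this sketch, Probe is a made up class whose two versions of operator[] do nothing but report which one of them was called.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical class that only reveals which operator[] was chosen.
struct Probe
{
  int operator[](std::size_t) { return 1; }        // the non-const version
  int operator[](std::size_t) const { return 2; }  // the const version
};

// Reads element 0 through a non-const and a const view of the same object.
void showOverloadChoice()
{
  Probe p;
  const Probe& cp = p;
  assert(p[0] == 1);    // a plain read, yet the non-const version is called
  assert(cp[0] == 2);   // the const version is called only since cp is const
}
```

Both expressions are reads, so if the choice depended on reading versus writing, both would have given the value 2.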

Instead what we have to do is to pull a little trick. The trick is, as so often in computer science, to add another level of indirection. This is done by not taking care of the problem in operator[], but rather letting it return a type which does the job. We create a class template, FileArrayProxy, which holds a reference to the array and an index, and which has an assignment operator for writing and a conversion operator for reading. We have to make sure, of course, that there are member functions in FileArray that can read and write (and of course, those functions are not the operator[], since then we'd have an infinite recursion.) All constructors, except for the copy constructor, are made private to prevent users from creating objects of the class whenever they want to. After all, this class is a helper for the array only, and is not intended to ever even be seen. This, however, poses a problem; with the constructors being private, how can FileArray::operator[] create and return one?

Enter another C++ feature: friends. Friends are a way of breaking encapsulation. What?!?! Yes, what you read is right. Friends break encapsulation, and (this is the real shock) that's a good thing! Friends break encapsulation in a controlled way. We can, in FileArrayProxy, declare FileArray to be a friend. This means that FileArray can access everything in FileArrayProxy, including things that are declared private. Paradoxically, violating encapsulation with friendship strengthens encapsulation when done right. The only alternative here to using friendship is to make the constructors public, but then anyone can create objects of this class, and that's what we wanted to prevent. Friends are useful for strong encapsulation, but it's important to use them only in situations where two (or more) classes are so tightly bound to one another that they're meaningless on their own. This is the case with FileArrayProxy. It's meaningless without FileArray, thus FileArray is declared a friend of FileArrayProxy.

We can now start implementing the array. Some problems still lie ahead, but I'll mention them as we go. The functions for reading and writing, readElement and storeElement, are made private members of the array, since they're not for anyone to use. Again, we need to make use of friendship to grant FileArrayProxy the right to access them.

Defining them the obvious way, with an fstream member variable named stream, we all of a sudden face an unexpected problem: readElement won't compile. The member function is declared const, and as such, all member variables are const, and neither seekg nor read is allowed on constant streams. The problem is one of distinguishing between logical constness and bitwise constness. This member function is logically const, as it does not alter the array in any way. However, it is not bitwise const; the stream member changes. C++ cannot understand logical constness, only bitwise constness.

If you have a modern compiler, the solution is very simple; you declare stream to be mutable fstream stream; in the class definition. I, however, have a very old compiler, so I have to find a different solution. This solution is, yet again, one of adding another level of indirection. I can have a pointer to an fstream. When in a const member function, the pointer is also const, but not what it points to (there's a difference between a constant pointer, and a pointer to a constant.) The only reasonable way to achieve this is to store the stream object on the heap, and in doing this I introduce a possible danger; what if I forget to delete the pointer? Sure, I'll delete it in the destructor, but what if an exception is thrown already in the constructor? Then the destructor will never execute (since no object has been created that must be destroyed.)
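For those with a compiler that does support mutable, here is a minimal sketch of what it buys you. The Reader class is invented for the illustration and uses an in-memory stream instead of a file, but the problem and its solution are the same as for the array.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical class whose const member function must move the get
// pointer of its stream member.
class Reader
{
public:
  explicit Reader(const std::string& text) : stream(text) {}

  // Logically const, since the Reader looks the same afterwards, but
  // not bitwise const, since the state of the stream changes.
  char at(std::streamoff index) const
  {
    assert(index >= 0);
    stream.clear();
    stream.seekg(index, std::ios::beg);
    return static_cast<char>(stream.get());
  }

private:
  mutable std::istringstream stream;  // without mutable, at() won't compile
};
```

The keyword is the only thing needed; the member function itself is written as if constness was no issue at all.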

Do you remember the thing to think of until this month? The clues were destructor, pointer and delete. Thought of anything? What about an extremely simple class template that holds a pointer, gives access to what it points to through operator*, and deletes it in its destructor? This is probably the simplest possible member of the family known as smart pointers. I'll probably devote a whole article exclusively to these some time. Whenever an object of this type is destroyed, whatever it points to is deleted. The only thing we have to keep in mind when using it is to make sure that whatever we feed it is allocated on the heap (and is not an array) so it can be deleted with operator delete.
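Such a class template could be sketched like this. The name ptr and the operator* returning T& come from the text; the explicit constructor and the ban on copying are my own additions, made to keep the sketch safe.

```cpp
#include <cassert>

// About the simplest smart pointer possible: whatever it points to is
// deleted when the smart pointer itself is destroyed.
template <class T>
class ptr
{
public:
  explicit ptr(T* tp) : p(tp) {}
  ~ptr() { delete p; }

  // Returns T&, not const T&, even when the ptr itself is const.
  T& operator*() const
  {
    assert(p != 0);
    return *p;
  }

private:
  ptr(const ptr&);             // copying would cause a double delete, so
  ptr& operator=(const ptr&);  // these are declared but never defined
  T* p;
};
```

A const ptr<int> cp(new int(1)); still allows *cp = 7; since only the pointer is constant, not what it points to.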

This solves our problem nicely. When this thing is a constant, the thing pointed to still isn't a constant (look at the return type for operator*, it's a T&, not a const T&.) So, instead of using an fstream member variable called stream, let's use a ptr<fstream> member named pstream. With this change, readElement must be slightly rewritten:

  template <class T>
  T FileArray<T>::readElement(size_t index) const
  {
    (*pstream).seekg(sizeof(max_size)+index*sizeof(T)); // what if seek fails?
    T t;
    (*pstream).read((char*)&t, sizeof(t)); // what if read fails?
    return t;
  }

I bet the change wasn't too horrifying. The constructors and the access members are written in the same spirit. Well, this wasn't too much work, but then, as can be seen by the comments, there's absolutely no error handling here. I've left out the size member function, since its implementation is trivial.

Next in line is FileArrayProxy. The copy constructor is needed, since the return value must be copied (return from FileArray::operator[],) and it must be public for this to succeed. The one that the compiler generates for us, which just copies all member variables, will do just fine. The compiler doesn't generate a default constructor (one which accepts no parameters,) since we have explicitly defined a constructor. The assignment operator is necessary, however. Sure, the compiler will try to generate one for us if we don't, but it will fail, since references (fa) can't be rebound. Note, however, that if we instead of a reference had used a pointer, it would succeed, but the result would *NOT* be what we want. What it would do is to copy the member variables, but what we want to do is to read data from one array and write it to another.

That's all there is to the implementation. Can you see what happens with the proxy? Let's analyze a small code snippet, in which arr is a FileArray<int>, line 2 is arr[2]=0, line 3 is int x=arr[2] and line 4 is arr[0]=arr[2]. On line 2, arr.operator[](2) is called, which creates a FileArrayProxy<int> from arr with the index 2. The object, which is a temporary and does not have a name, has as its member fa a reference to arr, and as its member index the value 2. On this temporary object, operator=(int) is executed. This operator in turn calls fa.storeElement(index, t), where index is still 2 and the value of t is 0. Thus, arr[2]=0 ends up as arr.storeElement(2,0). On line 3, a similar proxy is created through the call to operator[](2). This time, however, operator int() const is called. This member function in turn calls fa.readElement(2) and returns its value, thus int x=arr[2] translates to int x=arr.readElement(2). On line 4, finally, arr[0]=arr[2] creates two temporary proxies, one referring to index 0, and one to index 2. The assignment operator is called, which in turn calls fa.storeElement(0,p), where p is the temporary proxy referring to element 2. Since storeElement wants an int, p.operator int() const is called, which calls arr.readElement(2). In other words arr[0] = arr[2] generates the code arr.storeElement(0, arr.readElement(2)).
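To see the pieces work together, here is a condensed sketch of the array and its proxy. It is not a complete implementation: it assumes a compiler with mutable and the standard headers (so the smart pointer is not needed), the constructor taking a file name and a size and the offsetOf helper are my own guesses at a reasonable interface, and just as before there is no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>

template <class T> class FileArray;

// Stand-in for one element. Only FileArray can create one.
template <class T>
class FileArrayProxy
{
public:
  FileArrayProxy& operator=(const T& t)               // arr[i] = t
  {
    fa.storeElement(index, t);
    return *this;
  }
  FileArrayProxy& operator=(const FileArrayProxy& p)  // arr[i] = arr[j]
  {
    fa.storeElement(index, p);   // p converts to T by reading its element
    return *this;
  }
  operator T() const { return fa.readElement(index); }
private:
  FileArrayProxy(FileArray<T>& f, std::size_t i) : fa(f), index(i) {}
  FileArray<T>& fa;
  std::size_t index;
  friend class FileArray<T>;
};

template <class T>
class FileArray
{
public:
  // Creates the file, storing the size followed by default valued elements.
  FileArray(const char* name, std::size_t elements)
    : stream(name, std::ios::in | std::ios::out |
                   std::ios::trunc | std::ios::binary),
      max_size(elements)
  {
    stream.write((const char*)&max_size, sizeof(max_size));
    const T t = T();
    for (std::size_t i = 0; i < max_size; ++i)
      stream.write((const char*)&t, sizeof(t));
  }
  std::size_t size() const { return max_size; }
  FileArrayProxy<T> operator[](std::size_t index)
  {
    assert(index < max_size);
    return FileArrayProxy<T>(*this, index);
  }
  T operator[](std::size_t index) const { return readElement(index); }
private:
  std::streamoff offsetOf(std::size_t index) const
  {
    return static_cast<std::streamoff>(sizeof(max_size) + index * sizeof(T));
  }
  T readElement(std::size_t index) const
  {
    stream.seekg(offsetOf(index), std::ios::beg);  // what if seek fails?
    T t;
    stream.read((char*)&t, sizeof(t));             // what if read fails?
    return t;
  }
  void storeElement(std::size_t index, const T& t)
  {
    stream.seekp(offsetOf(index), std::ios::beg);
    stream.write((const char*)&t, sizeof(t));
  }
  mutable std::fstream stream;
  std::size_t max_size;
  friend class FileArrayProxy<T>;
};
```

With this in place, FileArray<int> arr("data.bin", 5); followed by arr[2]=17; int x=arr[2]; and arr[0]=arr[2]; takes exactly the route through the proxy that was analyzed above (the file name is of course just an example.)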

As you can see, the proxies don't add any new functionality, they're just syntactic sugar, albeit very useful. With them we can treat our file arrays very much like any kind of array. There's one thing we cannot do: bind a reference or a pointer to an element, as in int& x=arr[2] or int* p=&arr[3], and then assign through it. With ordinary arrays, that would be legal and have well defined semantics, so that x=2 and *p=5 assign arr[2] the value 2, and arr[3] the value 5. With our file array we cannot do this, but unfortunately the compiler does not prevent it (a decent compiler will warn that we're binding a reference or pointer to a temporary.) We'll mend that hole next month (think about how) and also add iterators, which will allow us to use the file arrays almost exactly like real ones.

In memory data formatting
One often faced problem is that of converting strings representing some data to that data, or vice versa. With the aid of istrstream, ostrstream and strstream, this is easy. For example, say we have a string containing the digits 23542, and want those digits as an integer. The thing to do is to create an istrstream object from the string, and extract an int x from it with operator>>. Afterwards, x will have the value 23542. istrstream isn't much more exciting than that. ostrstream on the other hand is more exciting. There are two alternative uses for ostrstream. One where you have an array you want to store data in, and one where you want the ostrstream to create it for you, as needed (usually because you have no idea what size the buffer must have.) In the former usage, you give the array and its size to the constructor, and then insert data as usual, finishing with ends. After inserting the text x=, the value 23.34 and ends into an ostrstream created from an array named buffer, the variable buffer will contain the string x=23.34. The stream manipulator ends zero terminates the buffer. Zero termination is not done by default, since the stream cannot know where to put it, and besides you might not always want it.

The other variant, where you don't know how large a buffer you will need, is generally more useful (I think.) Here you create the ostrstream without any parameters, insert data as usual, and ask for the result when done. The member function str returns a pointer to the internal buffer (which is then frozen, that is, the stream guarantees that it will not deallocate the buffer, nor overwrite it. Attempts to alter the stream while frozen will fail.) pcount returns the number of characters stored in the buffer. Lastly, freeze can either freeze the buffer, or unfreeze it. The latter is done by giving it a parameter with the value 0. I find this interface to be unfortunate. It's so easy to forget to release the buffer (by simply forgetting to call os.freeze(0)) and that leads to a memory leak.

strstream, finally, is, just like fstream, the combined read/write stream.

The string streams can be found in the header <strstream.h> (or for some compilers <strstrea.h>.)

Standards update
With the C++ standard, a lot of things have changed regarding streams. As I mentioned already last month, the standard headers have no .h suffix (<fstream> rather than <fstream.h>, for example,) and the names are std::istream, std::ostream, etc. The streams are templatized too, which both makes life easier and not. The underlying type for std::ostream is std::basic_ostream<class charT, class traits=std::char_traits<charT> >. charT is the basic type for the stream. For ostream this is char (ostream is actually a typedef.) There's another typedef, std::wostream, where the underlying type is wchar_t, which on most systems probably will be 16-bit Unicode. The class template char_traits is a traits class which holds the type used for EOF, the value of EOF, and some other house keeping things.
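One place where the templatized streams make life easier is when a function should work for both narrow and wide streams. The following is a sketch only, and writeDigits is a name made up for the illustration.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Writes count copies of the digit d to a stream of any character type.
template <class charT, class traits>
void writeDigits(std::basic_ostream<charT, traits>& os, int d, int count)
{
  assert(d >= 0 && d <= 9);
  for (int i = 0; i < count; ++i)
    os << d;   // formatted output of an int exists for every charT
}
```

The same function template accepts a std::ostream (charT is char) as well as a std::wostream (charT is wchar_t), with the template parameters deduced from the stream passed in.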

Why the standard has left out the file stream open modes ios::noreplace and ios::nocreate is beyond me, as they're extremely useful.

Casting is ugly, and it's hard to see in large code blocks. There are four new cast operators, that are highly visible, in the standard. They're (in approximate order of increasing danger,) dynamic_cast<T>, static_cast<T>, const_cast<T> and reinterpret_cast<T>. In the binary streaming seen in this article, reinterpret_cast<T> would be used, as a way of saying, "Yeah, I know I'm violating type safety, but hey, I know what I'm doing, OK?" The good thing about it is that it's so visible that anyone doubting it can easily spot the dangerous lines and have a careful look. The syntax is: os.write(reinterpret_cast<const char*>(&variable), sizeof(variable));
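Wrapped in a pair of function templates, the new cast could be used as in this sketch. writeRaw and readRaw are my own names, and they're only meaningful for simple types without pointers, virtual functions and the like.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Writes the raw memory of t to the stream.
template <class T>
std::ostream& writeRaw(std::ostream& os, const T& t)
{
  return os.write(reinterpret_cast<const char*>(&t), sizeof(t));
}

// Reads raw memory from the stream into t.
template <class T>
std::istream& readRaw(std::istream& is, T& t)
{
  return is.read(reinterpret_cast<char*>(&t), sizeof(t));
}
```

All the type safety violations are now found in two places, both easy to search for.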

Finally, the generally useful strstreams have been replaced by std::istringstream, std::ostringstream and std::stringstream (plus wide variants, std::wistringstream, etc.) defined in the header <sstream>. They do not operate on char*, but on strings (there is a string class, or again, rather a string class template, where the most important template parameter is the underlying character.) std::ostringstream does not suffer from the freeze problem that ostrstream does.
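Here is a minimal sketch of in-memory formatting with the standard string streams, used in place of the strstream classes described earlier. The function names toInt and labelled are made up for the example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Converts text such as "23542" to an int. Returns false if that fails.
bool toInt(const std::string& text, int& value)
{
  std::istringstream is(text);
  is >> value;
  return !is.fail();
}

// Formats a label and a value. The stream grows its buffer as needed.
std::string labelled(const std::string& label, double value)
{
  std::ostringstream os;
  os << label << '=' << value;
  return os.str();   // returns a copy, so there's nothing to freeze or release
}
```

Since str returns a string by value here, there's no buffer to forget about, and no ends is needed to terminate the result.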

Recap
The news this month were:
 * streams dealing with files, or in-memory formatting, are used just the same way as the familiar cout and cin, which saves both learning and coding (the already written operator<< and operator>> can be used for all kinds of streams already.)
 * streams can be used for binary, unformatted I/O too. This normally doesn't make sense for cout and cin or in-memory formatting (as the name implies,) but it's often useful when dealing with files.
 * It is possible to move around in streams, at least file streams and in-memory formatting streams. It's generally not possible to move around in cin and cout.
 * proxy classes can be used to differentiate read and write operations for operator[] (the construction can of course be used elsewhere too, but it's most useful in this case.)
 * friends break encapsulation in a way that, when done right, strengthens encapsulation.
 * there's a difference between logical const and bitwise const, but the C++ compiler doesn't know and always assumes bitwise const.
 * truly simple smart pointers can save some memory management house keeping, and also be used as a work around for compilers lacking mutable (i.e. the way of declaring a variable as non-const for const members, in other words, how to differentiate between logical and bitwise const.)
 * streams can be used also for in-memory formatting of data.

Exercises

 * Improve the file array such that it accepts a stream& instead of a file name, and allows for several arrays in the same file.
 * Improve the proxy such that int& x=arr[2] and int* p=&arr[1] become illegal.
 * Add a constructor to the array that accepts only a size_t describing the size of the array, which creates a temporary file and removes it in its destructor.
 * What happens if we instantiate FileArray with a user defined type? Is it always desirable? If not, what is desirable? If you cannot define what's desirable, how can instantiation with user defined types be banned?
 * How can you, using the stream interface, calculate the size of a file?

Coming up
Next month will be devoted to improving the FileArray. We'll have iterators, allow arbitrary types, add error handling and more. I assume I won't need to tell you that it'll be possible to use the FileArray, just as ordinary arrays with generic programming, i.e. we can have the exact same source code for dealing with both!