Reading binary files in Modern C++

For a general-purpose programming language used to write desktop applications as well as to program embedded systems, C++ makes it surprisingly difficult to simply read a bunch of bytes from a binary file. Compared with other high-level programming languages, it's complicated.

Maybe it's a bit unfair to compare C++ to Java or Python in this respect, but then again, maybe it isn't. Or maybe reading data from a binary file is not such a common use case after all, but somehow I don't buy that. So, let's have a quick look at how it is done in some other languages, and then proceed to find out what it requires in C++.

In this context I'm only interested in reading the complete contents of the file, which will work just fine for small(ish) files. My practical purpose is to read MIDI System Exclusive files, which tend to be just a few kilobytes, or at most a few megabytes in size. In a modern desktop or cloud service context this is peanuts, but if you need to read files that are hundreds of megabytes in size, you will need to resort to streaming, to keep the memory use of your program in check.

Reading a binary file in Python

Python has the convenient bytes object. I made a small helper function to read in a file and return a bytes object, complete with type hints. Because of the f-string in the error message, it needs Python 3.6 or later.

import sys

def read_file_data(filename: str) -> bytes:
    try:
        with open(filename, 'rb') as f:
            return f.read()
    except FileNotFoundError:
        print(f'File not found: {filename}')
        sys.exit(-1)

Using the with statement ensures that the file is closed. For more information about reading binary files, take a look at the articles from Python Morsels, starting with How to read a binary file in Python.

Reading a binary file in Scheme

Here is a curveball for you: reading a binary file in Scheme, or more accurately, using Chez Scheme, which is one of the more established Scheme implementations along with GNU Guile.

The Scheme bytevector is roughly the equivalent of the Python bytes object. You can use a file input port to access the contents of a file, and get the full contents of the file using the get-bytevector-all function.

As with Python, I made a small helper function to read the contents of a file:

(import (chezscheme))

;; call-with-port closes the port once get-bytevector-all has returned.
(define (read-file-data filename)
  (call-with-port (open-file-input-port filename) get-bytevector-all))

For more information on Scheme, refer to The Scheme Programming Language, Fourth Edition by R. Kent Dybvig, the principal developer of Chez Scheme.

Reading a binary file in Rust

Rust is an up-and-coming systems programming language, which has gained mindshare in recent years among programmers who like a more predictable language than C++, with less obvious discontinuities (the technical term for "WTF"). For many of the benefits of Rust (with some of the negatives), see Why Rust?.

My little helper function to read a binary file in Rust looks like this (with the required use declarations):

use std::io::prelude::*;
use std::fs;

fn read_file_data(name: &str) -> Vec<u8> {
    let mut f = fs::File::open(name).expect("no file found");
    let mut buffer = Vec::new();
    f.read_to_end(&mut buffer).expect("unable to read file");
    buffer
}

It returns a Vec<u8>, where Vec is a Rust collection type with a generic type parameter u8. Note that Rust drops values when they go out of scope, which frees their memory and also causes the std::fs::File object to be closed automatically.

For an occasionally updated series on doing stuff with Rust, see the Flecks of Rust newsletter on this site.

Reading a binary file in Modern C++

The solutions for reading a binary file in Python, Scheme and Rust were straightforward enough to use. When I started to figure out how to achieve the same in C++, I soon realised that it would be a little different.

Modern C++ does have the std::vector collection type. It is closest to the Vec type of Rust, also being a template type. Since I want to use the C++ std::byte type for the items in the vector, I know I will be needing a std::vector<std::byte> instance.

For accessing the file, you can use the ifstream class. I haven't found a way to read all the file data with one method call, so the next best thing is to find out the size of the file, and then read exactly that number of bytes.

With the help of the information found in Modern C++ Programming Cookbook, 2nd Ed by Marius Bancila, I came up with the following helper function:

#include <cstddef>   // std::byte
#include <fstream>
#include <string>
#include <vector>

std::vector<std::byte> readFileData(const std::string& name) {
    std::ifstream inputFile(name, std::ios_base::binary);

    // Determine the length of the file by seeking
    // to the end of the file, reading the value of the
    // position indicator, and then seeking back to the beginning.
    inputFile.seekg(0, std::ios_base::end);
    auto length = inputFile.tellg();
    inputFile.seekg(0, std::ios_base::beg);

    // Make a buffer of the exact size of the file and read the data into it.
    std::vector<std::byte> buffer(length);
    inputFile.read(reinterpret_cast<char*>(buffer.data()), length);

    inputFile.close();
    return buffer;
}

Note that this function does not perform error checking when the file is opened, or try to find out if the read succeeded.
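To give an idea of what the missing checks could look like, here is a minimal sketch that throws std::runtime_error when something goes wrong. The name readFileDataChecked and the choice of exceptions are my assumptions for illustration; returning an empty vector or a std::optional would work just as well.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical variant of readFileData that reports failures by throwing.
std::vector<std::byte> readFileDataChecked(const std::string& name) {
    std::ifstream inputFile(name, std::ios_base::binary);
    if (!inputFile) {
        throw std::runtime_error("unable to open file: " + name);
    }

    // tellg() returns -1 if it fails, so use a signed type for the length.
    inputFile.seekg(0, std::ios_base::end);
    const std::streamoff length = inputFile.tellg();
    if (length < 0) {
        throw std::runtime_error("unable to determine size of file: " + name);
    }
    inputFile.seekg(0, std::ios_base::beg);

    std::vector<std::byte> buffer(static_cast<std::size_t>(length));
    inputFile.read(reinterpret_cast<char*>(buffer.data()), length);

    // gcount() tells how many bytes the last read actually delivered.
    if (!inputFile || inputFile.gcount() != length) {
        throw std::runtime_error("unable to read file: " + name);
    }
    return buffer;
}
```

The stream is closed by its destructor when the function returns or throws, so no explicit close() call is needed here.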

What I find weird is that there is no read function for std::byte, which is conceptually wrong because std::byte would be exactly the right type here. Instead, you need to use reinterpret_cast. Of course, it's all bits anyway, but I would like them to be the most obvious and correct bits.
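If the reinterpret_cast bothers you, one alternative is to let the stream hand over plain char values and convert each one to std::byte explicitly. The sketch below does that with std::istreambuf_iterator and std::transform; the function name is made up, and I would expect it to be slower than a single read call for large files, since it works one character at a time.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical variant that avoids reinterpret_cast by converting each char.
std::vector<std::byte> readFileDataWithIterators(const std::string& name) {
    std::ifstream inputFile(name, std::ios_base::binary);
    std::vector<std::byte> buffer;

    // A default-constructed istreambuf_iterator marks the end of the stream.
    std::transform(std::istreambuf_iterator<char>(inputFile),
                   std::istreambuf_iterator<char>(),
                   std::back_inserter(buffer),
                   [](char c) {
                       return static_cast<std::byte>(static_cast<unsigned char>(c));
                   });
    return buffer;
}
```

As a side effect there is no need to know the size of the file in advance, because the vector grows as the bytes arrive.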

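The code above specifies the size of the vector when it is initialized, and here you need to be careful. With an integer element type, the uniform initialization syntax of Modern C++ (curly brackets around the value) selects the initializer-list constructor: given an int variable n, std::vector<int> v{n}; creates a one-element vector with the current value of n as the sole element. std::byte has no implicit conversion from integers, so std::vector<std::byte> buffer{length}; cannot take that route, but what the braces do instead depends on overload resolution and narrowing rules that are easy to get wrong. Parentheses, as in std::vector<std::byte> buffer(length);, always mean "make a vector of this size". Another day, another C++ footgun.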
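The difference between the two syntaxes is easiest to see with int elements. The helper functions below are hypothetical and exist only to put the two spellings side by side, followed by the parenthesized form for a byte buffer.

```cpp
#include <cstddef>
#include <vector>

// Curly brackets: the initializer-list constructor wins,
// so the result has ONE element whose value is n.
std::vector<int> makeWithBraces(int n) {
    return std::vector<int>{n};
}

// Parentheses: the size constructor is used,
// so the result has n elements, each set to zero.
std::vector<int> makeWithParentheses(int n) {
    return std::vector<int>(n);
}

// The form used for the file buffer: n bytes, each set to zero.
std::vector<std::byte> makeByteBuffer(std::size_t n) {
    return std::vector<std::byte>(n);
}
```

Calling makeWithBraces(5) gives a vector of size 1, while makeWithParentheses(5) and makeByteBuffer(5) both give a vector of size 5.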

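The value returned by the tellg method of std::ifstream is a std::streampos, which is a specialization of std::fpos, while the size of the vector is a size_type, which is usually a typedef for std::size_t, which is... oh, never mind. We seem to have descended into another pit of madness in the C++ type system. The position converts implicitly to the signed integer type std::streamoff, and from there to the unsigned size, so it all seems to work, where the definition of "work" is "compiles with clang++ and runs on macOS 12". One thing to keep in mind is that tellg returns -1 on failure, which turns into an enormous size after the conversion.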
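One way to make the conversion less mysterious is to spell it out. This is a small sketch of a helper (the name toBufferSize is my invention) that goes from the stream position to a signed offset, rejects negative values, and only then casts to std::size_t.

```cpp
#include <cstddef>
#include <ios>
#include <optional>

// Hypothetical helper: turn a stream position into a buffer size,
// or std::nullopt if the position signals a failure.
std::optional<std::size_t> toBufferSize(std::streampos position) {
    // std::streampos converts to std::streamoff, a signed integer type.
    const std::streamoff offset = position;
    if (offset < 0) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(offset);
}
```

This assumes that the file fits in a std::size_t, which is a safe bet on a 64-bit desktop but not necessarily on a small embedded target.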

Truth be told, the biggest differences in C++ were the need to find out the size of the file, and to make a buffer to hold that exact number of bytes. You can write a helper function to paper over these differences, but shouldn't that be a standard library function?
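For the first half of that problem, C++17 does offer std::filesystem::file_size, which replaces the seek-and-tell dance. Here is a sketch of how the helper might look with it; the function name is an assumption of mine, and you still have to make the buffer and do the read yourself.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical variant that asks the filesystem for the size of the file.
std::vector<std::byte> readFileDataWithFileSize(const std::string& name) {
    // Throws std::filesystem::filesystem_error if the size cannot be determined.
    const std::uintmax_t length = std::filesystem::file_size(name);

    std::vector<std::byte> buffer(static_cast<std::size_t>(length));
    std::ifstream inputFile(name, std::ios_base::binary);
    inputFile.read(reinterpret_cast<char*>(buffer.data()),
                   static_cast<std::streamsize>(buffer.size()));
    return buffer;
}
```

Like the original helper, this does not check whether the read succeeded, and the file could in principle change size between the two steps.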

So there you have it: reading a binary file in Modern C++. It's not exactly the kind of straightforward solution you would expect for what must be a common task, so if I'm missing something obvious, then please let me know!