Storing images in TensorFlow record files


How to use TFRecord files, a TensorFlow-specific data format for efficient data storage and reading, when dealing with images

Did you know that TensorFlow has a custom format to store data? It's called TensorFlow Records – or TFRecords for short – and builds upon a simple principle:

Store data sequentially (within a file) to access contiguous chunks quickly.

This approach is based on protocol buffers, a cross-platform mechanism for storing structured data. We do not need to dive deeper into the background here; what we need to know is that data is stored in a dictionary-like mapping:

{"string": value}

A single file may hold many such "dictionaries," called Examples in TensorFlow, as depicted in the following graphic:

An overview of the concept behind TensorFlow record files. Image by the author.

Within each Example – or dictionary – the individual data entries are stored. This format is highly flexible: you can store images, text, audio, and any data that can be cast to a byte representation. Further, data types can be mixed, letting us keep, e.g., images and bounding boxes along with a textual description. However, before getting ahead of ourselves, we'll focus on a single modality: images. The remaining modalities, audio and text data, will be covered in upcoming posts.

In my experience, such an advanced topic is best covered with simple examples that showcase the underlying workflow. In this case, we use random (image-shaped) matrices.

Storing images

Creating random data

Consider a dataset of 1000 images, each of size 224 by 224 pixels with three color channels. Each sample of this imaginary dataset is labeled with one of ten classes, 0 to 9. Using only the NumPy library, we can create such a dataset easily:
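The following is a minimal sketch of this step; the variable names images and labels and the unsigned 8-bit pixel values are my assumptions:

```python
import numpy as np

# 1000 random "images" of size 224 x 224 with 3 color channels,
# plus one random label (0 to 9) per image
images = np.random.randint(low=0, high=256, size=(1000, 224, 224, 3), dtype=np.uint8)
labels = np.random.randint(low=0, high=10, size=(1000,))
```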

The result of this code snippet is a dataset (here: NumPy arrays) full of image-like data.

Helper functions

After we have a working dataset, we must convert it into byte data.

To this end, we create four helper functions (see also the official TensorFlow tutorial on TFRecords). The first three helper functions convert certain data types, such as float, to a TFRecord-compatible representation. The last helper function turns an array into a string of binary data:
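A sketch of these helpers, closely following the ones from the TensorFlow documentation; the name serialize_array is my choice:

```python
import tensorflow as tf

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # unwrap EagerTensors into raw bytes
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
    """Turns an array (e.g., an image) into a string of binary data."""
    return tf.io.serialize_tensor(array)
```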

Creating the TFRecord dataset

These functions come into play once we begin creating the TFRecord files. Here, we need a function that creates the layout of a single Example, that is, the layout for the internal representation of the image we want to store. Using our simplified visual representation from before, such an Example has multiple slots for data, called Features:

A conceptual overview of how data is stored within an Example. Image by the author.

For first-time users, creating such a condensed representation can be overwhelming, so let's cover it step by step. First, we need to store information to recover the input's data dimensions. For our image use case, these are the height and width (224 each) and the number of channels (3). Each number is an integer, meaning we can store them as integer data.

Second, we need to store an image's byte representation.

And third, we need to store the label, which, like the data dimensions, is stored as integer data. In code, these three requirements are modeled as follows:
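One way to model this layout in code; the function name parse_single_image and the feature keys (height, width, depth, raw_image, label) are my naming choices:

```python
def parse_single_image(image, label):
    # layout of a single Example: data dimensions, raw bytes, and the label
    features = {
        "height": _int64_feature(image.shape[0]),
        "width": _int64_feature(image.shape[1]),
        "depth": _int64_feature(image.shape[2]),
        "raw_image": _bytes_feature(serialize_array(image)),
        "label": _int64_feature(label),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))
```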

Next, we need a function that takes the dataset, consisting of the random images and their equally random labels, and prepares it for storage. First, we open a writer object that handles writing the data to disk. Afterward, we use a for-loop that goes over the NumPy arrays, creates image-label pairs, and stores them in a TFRecord file using the previously described method. Finally, after we finish iterating over the dataset, we close the writer:
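A sketch of such a writing routine, assuming the parse_single_image() helper from above; the file name random_images.tfrecords is reused later when reading:

```python
def write_images_to_tfr(images, labels, filename: str = "random_images"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)  # handles writing to disk

    for index in range(len(images)):
        # create an image-label pair and store it as one Example
        example = parse_single_image(image=images[index], label=labels[index])
        writer.write(example.SerializeToString())

    writer.close()
```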

That is it! After calling this function, we have a single file that stores our entire dataset!

Retrieving images

Extract the byte data

When, at a later point, we want to work with the TFRecords, we need to retrieve the stored data. Conceptually, we are now inverting the storing process. Here, we prepare the structure but don't fill it with data yet. Be cautious: the placeholders must have the same names and appropriate data types; otherwise, the extraction will fail. Then, for each Example in the TFRecord file, we extract the content and reshape the image:
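A sketch of this extraction routine, mirroring the feature names from the writing step; note that the out_type of tf.io.parse_tensor must match the dtype that was serialized (here: uint8):

```python
def parse_tfr_element(element):
    # placeholder structure: names and data types must match what was stored
    data = {
        "height": tf.io.FixedLenFeature([], tf.int64),
        "width": tf.io.FixedLenFeature([], tf.int64),
        "depth": tf.io.FixedLenFeature([], tf.int64),
        "raw_image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    # turn the byte string back into a tensor and restore the original shape
    feature = tf.io.parse_tensor(content["raw_image"], out_type=tf.uint8)
    feature = tf.reshape(
        feature, shape=[content["height"], content["width"], content["depth"]]
    )
    return feature, content["label"]
```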

Create a dataset

After coding the routine for extracting data, we need a way to apply it to each sample in the TFRecord file. This process, where we parse the data – i.e., bring it into the correct format – is done by mapping the extraction function to each Example. Here, we rely on TensorFlow's tf.data API, which has such functionality on board:
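A minimal sketch of this mapping; the function name get_dataset is my choice:

```python
def get_dataset(filename):
    dataset = tf.data.TFRecordDataset(filename)
    # parse every stored Example with the extraction function from above
    dataset = dataset.map(parse_tfr_element)
    return dataset
```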

Afterward, we point this function to the previously created TFRecord file (here: random_images.tfrecords) and retrieve the data. Then, as a sanity check, we can compare an image's shape and see if it has been recovered correctly:
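For example, the check could look like this:

```python
dataset = get_dataset("random_images.tfrecords")

# sanity check: the first sample should have the original image shape
for image, label in dataset.take(1):
    print(image.shape)  # expected: (224, 224, 3)
    print(label.numpy())
```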

Caveats

What we have covered in this post is how to get image data into a TFRecord file. There are two caveats, or rather, assumptions:

First, we began with images already loaded into memory (our NumPy arrays). And second, in our setup, all examples had the same shape – which is unlikely in real-world applications.

The first point is straightforward to solve: use one of the many excellent libraries for loading images, such as imageio or Pillow. For both, plenty of tutorials exist showing the steps necessary to load data.

The second point is a bit trickier. The challenge is not the creation of the TFRecord files but the data loading in combination with batching. Remember that we stored the raw image data and its shape via the previous functions? When parsing the TFRecord file, this information allows us to restore the image's appropriate shape. However, when combining multiple examples into a batch, we face the possibility of varying data dimensions: image 1 might be 224 by 224 pixels, but the next might be 124 by 356 pixels.

We have a solution for such cases: TensorFlow's padded_batch() method. To get you started, here's the previous dataset creation code (which initially did not use any batching; samples were just returned one by one) but this time with padded batching:
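A sketch of that version, again assuming the parse_tfr_element() function from above; the function name get_padded_dataset is my choice:

```python
def get_padded_dataset(filename, batch_size: int = 32):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parse_tfr_element)
    # pad images to a height of 256, the batch-wide maximum width,
    # and 3 channels; the scalar labels need no padding, hence []
    dataset = dataset.padded_batch(
        batch_size, padded_shapes=([256, None, 3], [])
    )
    return dataset
```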

The interesting part is the call to padded_batch(), which pads each batch in the dataset to a fixed shape specified by the padded_shapes argument. The first element of the tuple is padded to [256, None, 3], meaning that the first dimension of the tensor is fixed to 256, the second dimension is padded to the smallest length that fits all examples in the batch, and the third dimension is fixed to 3. The second element of the batch tuple, the labels, does not require padding, which is why we pass [], meaning that no padding should be applied.


Wrap-up

In this post, we covered storing one data modality – images – in TFRecord files, a TensorFlow-specific data format for efficient data storage and reading. In covering the workflow behind it, we generated a dataset of random "images" and equally random labels. We then used this dataset to show how to prepare the data for storage using our helper functions. Finally, after writing the data to disk using TensorFlow-native methods, we also coded the reverse: extracting the data from the file. Conceptually, this involved inverting the storage process by filling a placeholder dictionary. In the end, we also briefly discussed two caveats and how to solve them.
