HDF5 file format with Pandas


HDF5 is a data format for storing and managing large, complex data in less disk space and with faster retrieval.

When reading and working with large datasets, we often run into out-of-memory errors because of the amount of memory they need. One solution is to store the data in a different format that is fast to access and smaller in size. One such file format is HDF5.


HDF stands for Hierarchical Data Format. The most commonly used version is version 5.

An HDF file can store many kinds of heterogeneous data objects, such as images, arrays, tables, graphs, and documents. It organizes the data hierarchically. Every HDF file starts with a root group ('/') that contains other groups and/or data objects.



An HDF file stores metadata about the data it contains, so that any application can interpret the content and structure of the file.

Pandas provides the HDFStore interface to read, write, append to, and select from an HDF file.


Create HDF file using Pandas

We can create an HDF5 file using the HDFStore class provided by Pandas.

Syntax: HDFStore(path, mode)

Where

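  • Path is the file path.

  • Mode is the mode in which the file is opened. It can be 'r' (read-only; the file must exist), 'w' (write; a new file is created and an existing file with the same name is overwritten), 'a' (append; an existing file is opened for reading and writing, and the file is created if it does not exist), or 'r+' (like 'a', but the file must already exist). Append mode is the default.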

Example: The following code creates and opens an HDF file ('hdf_file.h5') in append mode (the default).

import pandas as pd
from pandas import HDFStore
hdf = HDFStore('hdf_file.h5')
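
The difference between append and write mode can be seen in a minimal sketch. The file name modes_demo.h5, the keys 'first' and 'second', and the temporary directory are assumptions made for this illustration only; they are not part of the example above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes_demo.h5")  # hypothetical file name

    # 'a' creates the file because it does not exist yet
    store = pd.HDFStore(path, mode="a")
    store.put("first", df)
    store.close()

    # 'a' again: the existing content is kept and a new key is added
    store = pd.HDFStore(path, mode="a")
    store.put("second", df)
    keys_after_append = sorted(store.keys())
    store.close()

    # 'w' recreates the file, so the earlier keys are gone
    store = pd.HDFStore(path, mode="w")
    keys_after_write = store.keys()
    store.close()

print(keys_after_append)
print(keys_after_write)
```

Reopening in append mode keeps both keys, while write mode starts from an empty file, so 'w' is best reserved for cases where the old content is no longer needed.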

Store data in the HDF file

Data is stored in an HDF file as key-value pairs (similar to a dictionary). HDFStore has a put method to store data in the HDF file.

Syntax: HDFStore.put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)

Where

  • Key is the identifier for the data object.

  • Value is the DataFrame or Series to store.

  • Format is the format used to store data objects. Values can be 'fixed' (default) or 'table'.

  • Append appends the input data to the existing data. It forces table format.

  • data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.

  • Encoding provides an encoding for strings.

  • track_times governs the recording of times associated with an object. If set to True, time data is recorded.

  • Dropna specifies whether null data values are dropped.

Example: The following code reads the Iris.csv file and stores it in the previously created HDF file under the key 'key1'.

df = pd.read_csv('../input/iris/Iris.csv')  # read the Iris file
hdf.put('key1', df, format='table', data_columns=True)  # put data in the hdf file

Here key1 is the key used to access the DataFrame in the HDF file. The key is useful because one HDF file can store many data objects, and each data object is accessed by its key.
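
The effect of the format argument can be sketched as follows. The small DataFrame is a made-up stand-in for the Iris data, and the file name put_demo.h5 and the keys 'fixed_key' and 'table_key' are assumptions for this illustration.

```python
import os
import tempfile

import pandas as pd

# Small numeric stand-in for the Iris data (hypothetical values)
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.1, 4.9, 6.3, 5.8],
})

with tempfile.TemporaryDirectory() as tmp:
    store = pd.HDFStore(os.path.join(tmp, "put_demo.h5"))

    store.put("fixed_key", df)  # default 'fixed' format
    store.put("table_key", df, format="table", data_columns=True)

    # A table with data columns can be filtered while reading
    subset = store.select("table_key", where="Id > 2")

    # A fixed-format node has to be read in its entirety
    try:
        store.select("fixed_key", where="Id > 2")
        fixed_query_ok = True
    except (TypeError, ValueError):
        fixed_query_ok = False

    store.close()

print(subset)
print(fixed_query_ok)
```

The fixed format is the simpler choice when an object is always read back whole; the table format is the one to pick when the data will be queried or appended to later.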


Create groups and data objects in the same file

We can add more data objects or groups to the same file.

Example: The following code adds the DataFrame df2 to the file that we created, and also adds a group called group1 containing the DataFrame df3.

import numpy as np
df2 = pd.DataFrame(np.random.rand(5, 3))  # DataFrame df2
hdf.put('key2', df2, format='table')  # add df2 in table format so that rows can be appended later
df3 = pd.DataFrame(np.random.rand(10, 2))
hdf.put('/group1/key3', df3)  # add a group containing df3 to the hdf file

Now df2 can be accessed with '/key2' (here '/' represents the root group) and df3 can be accessed with '/group1/key3' (here '/group1' represents a group).


Example: To get the shape of df3 stored in the HDF file:

hdf['/group1/key3'].shape  # shape is an attribute, not a method
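
To see which objects a file contains, the stored paths can be listed. The following is a minimal sketch that rebuilds a similar layout in a temporary file; the file name groups_demo.h5 is an assumption for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "groups_demo.h5")) as store:
        store.put("key2", pd.DataFrame(np.random.rand(5, 3)))
        store.put("/group1/key3", pd.DataFrame(np.random.rand(10, 2)))

        all_keys = sorted(store.keys())      # full paths of the stored objects
        has_key3 = "/group1/key3" in store   # membership test by path
        shape_key3 = store["/group1/key3"].shape

print(all_keys)
print(has_key3, shape_key3)
```

Each key is reported with its full path from the root group, which makes the group hierarchy visible without opening the individual objects.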

Write data to HDF file

Apart from the HDFStore class, Pandas provides the to_hdf method to write data to an HDF file.

Syntax: DataFrame.to_hdf(path_or_buf, key, mode='a', complevel=None, complib=None, append=False, format=None, index=True, min_itemsize=None, nan_rep=None, dropna=None, data_columns=None, errors='strict', encoding='UTF-8')

Where

  • Path_or_buf is the file path or an HDFStore object.

  • Key is the identifier for the group in the store.

  • Mode is the mode in which the file is opened. It can be 'a' (append; the file is created if it does not exist), 'w' (write; an existing file is overwritten), or 'r+' (like 'a', but the file must already exist).

  • Complevel specifies the compression level for the data.

  • Append appends the input data to the existing data. It applies only to table format.

  • Format is the format used to store data objects. Values can be 'fixed' (default) or 'table'.

  • Errors specifies how encoding and decoding errors are handled.

  • Encoding provides an encoding for strings.

  • Min_itemsize maps column names to minimum string sizes for columns.

  • Nan_rep specifies how null values are represented as a string.

  • Dropna specifies whether rows in which all values are null are left out when writing.

  • data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.

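Example: To write a randomly generated DataFrame to the file under the key 'key'. The open store hdf is passed as path_or_buf so that the data already in the file is kept; passing the file path with mode='w' would recreate the file and discard the keys stored earlier.

write_data = pd.DataFrame(np.random.rand(5, 3))
write_data.to_hdf(hdf, key='key')  # writes data to the open hdf file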
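
The Complevel and complib parameters can be tried out with a short round trip. This is a sketch under assumptions: the file name to_hdf_demo.h5, the key 'data', and the choice of zlib at level 9 are picked only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "to_hdf_demo.h5")  # hypothetical file name

    # mode='w' starts a fresh file; zlib at level 9 favors size over speed
    frame.to_hdf(path, key="data", mode="w", format="table",
                 complevel=9, complib="zlib")

    # Compression is transparent when reading the data back
    restored = pd.read_hdf(path, key="data")

print(restored.shape)
```

Compression does not change how the data is read, so it can be switched on or off without touching the reading code.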

Append data to HDF file

HDF5 does not require all the data to be written at once. A dataset can be extended whenever necessary. In Pandas, the node being appended to must be stored in table format.

Syntax: HDFStore.append(key, value, format=None, axes=None, index=True, append=True, complib=None, complevel=None, columns=None, min_itemsize=None, nan_rep=None, chunksize=None, expectedrows=None, dropna=None, data_columns=None, encoding=None, errors='strict')

Where

  • Key is the identifier for the data object.

  • Format is the format used to store data objects. The only valid value is 'table'.

  • Append appends the input data to the existing data.

  • data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.

  • Min_itemsize maps column names to minimum string sizes for columns.

  • Nan_rep specifies how null values are represented as a string.

  • Chunksize specifies the size of the chunks in which the data is written.

  • Expectedrows is the expected total number of rows in the table.

  • Encoding provides an encoding for strings.

  • Dropna specifies whether rows in which all values are null are left out when writing.

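Example:

The following code appends a DataFrame of random numbers to the table stored under 'key2' in the previously created file. This works only if 'key2' was stored in table format (format='table'); appending to a node in fixed format raises an error.

hdf.append('/key2', pd.DataFrame(np.random.rand(5, 3)))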
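
Writing a dataset in several steps might look like the sketch below. The file name append_demo.h5, the key 'batches', and the batch size are assumptions for this illustration; get_storer(...).nrows is used here to read the row count of the table.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def make_batch():
    # Five rows of random numbers with fixed column names
    return pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "append_demo.h5")) as store:
        # Create the node in table format, then extend it twice
        store.put("batches", make_batch(), format="table")
        store.append("batches", make_batch())
        store.append("batches", make_batch())

        total_rows = store.get_storer("batches").nrows  # rows in the table
        combined = store["batches"]

print(total_rows, combined.shape)
```

Each append adds rows to the same table, so data that arrives in pieces never has to be held in memory all at once.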

Read HDF file

The get method of the HDFStore class reads an object from the file. To open the file in read-only mode, pass mode='r' when creating the HDFStore.

Syntax:

HDFStore.get(key)

Where

  • key is the identifier for the data object.

It returns an object of the same type as the object stored in the file.


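Example: The following code opens the previously created HDF file in read mode and reads the object stored under 'key1'. The store opened earlier in append mode is closed first, because a file that is open for writing cannot be reopened in read mode.

hdf.close()  # close the handle opened earlier in append mode
hdf = HDFStore('hdf_file.h5', mode='r')
data = hdf.get('/key1')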

In this example, the returned data is a Pandas DataFrame, because we stored the Iris DataFrame under key1 when creating the file.
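
That get returns the same type as the stored object can be checked with a small sketch. The file name get_demo.h5 and the keys 'a_frame' and 'a_series' are assumptions for this illustration.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "get_demo.h5")  # hypothetical file name

    # Store one DataFrame and one Series in the same file
    with pd.HDFStore(path, mode="w") as store:
        store.put("a_frame", pd.DataFrame({"x": [1.0, 2.0]}))
        store.put("a_series", pd.Series([10, 20, 30]))

    # Reopen read-only and fetch both objects
    with pd.HDFStore(path, mode="r") as store:
        frame_back = store.get("a_frame")
        series_back = store.get("a_series")
        same_as_get = store["a_frame"].equals(frame_back)  # store[key] reads the same object

print(type(frame_back).__name__, type(series_back).__name__)
```

The bracket form store[key] is a shorthand for get, so either can be used when reading a whole object.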


Pandas implements another method to read the file, read_hdf. This method is useful because it can query the data while reading.

Syntax: pandas.read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)

Where

  • Path_or_buf is the file path or an HDFStore object.

  • Key is the identifier for the data object.

  • Mode is the mode in which the file is opened. It can be 'r' (read-only, the default), 'r+' (read and write; the file must already exist), or 'a' (append; the file is created if it does not exist).

  • Errors specifies how encoding and decoding errors are handled.

  • Where is a list of Term (or convertible) objects.

  • Start is the row number at which the selection starts.

  • Stop is the row number at which the selection stops.

  • Columns is a list of column names to be returned.

  • Iterator specifies whether an iterator object is returned.

  • Chunksize specifies the number of rows to include in each iteration when using an iterator.

  • **kwargs are additional keyword arguments passed to HDFStore.

It returns an object of the same type as the object stored in the file.

Example: The following code returns the Species column for the rows where Id is greater than 10.

read_data = pd.read_hdf('hdf_file.h5', '/key1', where=['Id>10'], columns=['Species'])

Because the data returned from hdf_file (read_data) is a Pandas DataFrame (the same type as the stored object), it can be used for data analysis like any other DataFrame.
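
The where, columns, start and stop parameters can be combined as in the following self-contained sketch. The DataFrame is a made-up stand-in for the Iris data, and the file name query_demo.h5 is an assumption; the data is written in table format with data_columns=True so that Id can be used in the query.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the Iris data (hypothetical values)
df = pd.DataFrame({
    "Id": range(1, 21),
    "Species": ["setosa"] * 10 + ["virginica"] * 10,
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "query_demo.h5")  # hypothetical file name
    df.to_hdf(path, key="key1", mode="w", format="table", data_columns=True)

    # Filter rows with where and keep a single column
    filtered = pd.read_hdf(path, key="key1", where="Id > 10", columns=["Species"])

    # Read only the first five rows using start and stop
    first_rows = pd.read_hdf(path, key="key1", start=0, stop=5)

print(filtered.shape, first_rows.shape)
```

Only the selected rows and columns are returned, which keeps memory use low when just a part of a large table is needed.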

Example: The following code shows the first rows, the shape, and summary statistics of the dataset.

read_data.head()      # first five rows
read_data.shape       # (number of rows, number of columns)
read_data.describe()  # summary statistics
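
For tables that are too large to load at once, the Chunksize parameter listed above allows reading in pieces. The sketch below uses HDFStore.select, which accepts the same chunksize argument; the file name chunks_demo.h5, the key 'data', and the chunk size of 4 are assumptions for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks_demo.h5")  # hypothetical file name
    df.to_hdf(path, key="data", mode="w", format="table")

    with pd.HDFStore(path, mode="r") as store:
        # Each chunk is an ordinary DataFrame with at most 4 rows
        chunk_sizes = [len(chunk) for chunk in store.select("data", chunksize=4)]

print(chunk_sizes)
```

Each chunk can be processed and discarded before the next one is read, so the whole table never has to fit in memory.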


Close the HDF file

A file has to be closed after use. The following code closes the HDF file that we created.

hdf.close()
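
As an alternative to calling close by hand, HDFStore can be used in a with statement, which closes the file when the block ends. This is a minimal sketch; the file name close_demo.h5 and the key 'data' are assumptions for this illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "close_demo.h5")  # hypothetical file name

    with pd.HDFStore(path, mode="w") as store:
        store.put("data", df)
        open_inside = store.is_open  # the file is open inside the block

    open_after = store.is_open       # the file is closed once the block ends

print(open_inside, open_after)
```

The with form also closes the file when an error occurs inside the block, which is easy to forget with an explicit close call.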

Summary

HDF5 is built for fast I/O processing and storage. Pandas provides the interface discussed in this blog to access data in this file format.



References:

pandas.read_hdf — pandas 1.2.3 documentation (pydata.org)

https://www.hdfgroup.org/