What Is HDF5?
HDF5 (Hierarchical Data Format version 5) is a versatile, high-performance file format and data model for storing and managing large, complex datasets. The HDF lineage began at the National Center for Supercomputing Applications (NCSA) in the late 1980s; HDF5 itself, a ground-up redesign, was first released in 1998 and is now maintained by The HDF Group. It is used at NASA, CERN, NOAA, ECMWF, and many other scientific computing institutions for storing everything from satellite imagery and climate models to particle physics data and MRI datasets.
HDF5 is not just a file format — it is a hierarchical data model (like a file system within a file) combined with a binary storage layout optimized for large arrays and parallel I/O. A single HDF5 file can store terabytes of data organized into a structured hierarchy of groups, datasets, and metadata attributes, all accessible with random reads without loading the entire file.
The HDF5 Data Model
HDF5 organizes data using three fundamental concepts:
Groups
Groups are like directories in a file system. Every HDF5 file has a root group (/) and can contain any number of nested groups. Groups contain datasets and other groups.
/                                ← root group
├── /climate/                    ← group
│   ├── /climate/temperature     ← dataset: 3D array (time × lat × lon)
│   ├── /climate/precipitation
│   └── /climate/wind_speed
├── /metadata/
│   ├── /metadata/station_info
│   └── /metadata/calibration
└── /satellite/
    ├── /satellite/band_1        ← dataset: 2D array (rows × cols)
    └── /satellite/band_2
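As a minimal sketch of how such a hierarchy behaves in h5py, the snippet below creates nested groups in an in-memory file and navigates them by path. The file name and the intermediate `level1` group are hypothetical, chosen only for illustration.

```python
import h5py

# driver='core' with backing_store=False keeps the file in memory only.
with h5py.File('groups_demo.h5', 'w', driver='core', backing_store=False) as f:
    # Intermediate groups in a path are created automatically.
    band = f.create_group('satellite/level1/band_1')  # 'level1' is hypothetical
    f.create_group('climate')

    top_level = sorted(f.keys())                 # names directly under the root group
    band_path = band.name                        # absolute path of the group
    parent_path = band.parent.name               # path of the enclosing group
    has_band = 'satellite/level1/band_1' in f    # path-style membership test

print(top_level)    # ['climate', 'satellite']
print(band_path)    # /satellite/level1/band_1
print(parent_path)  # /satellite/level1
```

Paths work much like file system paths: `f['satellite/level1']` and `f['satellite']['level1']` refer to the same group.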
Datasets
Datasets are N-dimensional arrays of data — the core data container. A dataset has:
- A shape (dimensions): e.g., (365, 721, 1440) for daily temperature at 0.25° global resolution
- A datatype: int8, int16, int32, int64, float32, float64, complex128, string, etc.
- Attributes: metadata key-value pairs attached to the dataset
- Optional chunking and compression
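The properties listed above can be inspected directly on an h5py dataset object. The following is a small sketch using an in-memory file and a scaled-down shape as a stand-in for the global grid; the file name and sizes are assumptions made for the example.

```python
import h5py
import numpy as np

with h5py.File('dataset_demo.h5', 'w', driver='core', backing_store=False) as f:
    ds = f.create_dataset(
        'temperature',
        shape=(10, 18, 36),     # small stand-in for (365, 721, 1440)
        dtype='float32',
        chunks=(1, 18, 36),     # one time slice per chunk
        compression='gzip',
    )
    # Write one time slice; the other slices stay at the default fill value (0).
    ds[0] = np.full((18, 36), 273.15, dtype='float32')

    info = {
        'shape': ds.shape,
        'dtype': str(ds.dtype),
        'ndim': ds.ndim,
        'chunks': ds.chunks,
        'compression': ds.compression,
    }
    first_value = float(ds[0, 0, 0])
    unwritten_value = float(ds[1, 0, 0])

print(info)
```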
Attributes
Attributes are small metadata values attached to groups or datasets. They store units, coordinate reference systems, creation timestamps, processing history, and any other contextual information.
dataset.attrs['units'] = 'Kelvin'
dataset.attrs['long_name'] = 'Air Temperature at 2 meters'
dataset.attrs['valid_range'] = [180.0, 340.0]
dataset.attrs['_FillValue'] = 9.96921e+36 # missing data sentinel
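For a self-contained version of the idea, here is a sketch that writes attributes to a dataset in an in-memory file and reads them back. The file name and the file-level `title` attribute are hypothetical.

```python
import h5py

with h5py.File('attrs_demo.h5', 'w', driver='core', backing_store=False) as f:
    ds = f.create_dataset('temperature', shape=(4, 4), dtype='float32')
    ds.attrs['units'] = 'Kelvin'
    ds.attrs['valid_range'] = [180.0, 340.0]
    f.attrs['title'] = 'demo'   # groups (here the root group) accept attributes too

    units = ds.attrs['units']                    # strings come back as str
    valid_range = ds.attrs['valid_range']        # lists come back as NumPy arrays
    attr_names = sorted(ds.attrs.keys())
    missing = ds.attrs.get('long_name', 'n/a')   # dict-style default for absent keys

print(units, list(valid_range), attr_names, missing)
```

The `attrs` object behaves like a dictionary, so `keys()`, `items()`, `get()`, and `in` all work as expected.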
HDF5 Key Features
Chunking
By default, HDF5 datasets are stored contiguously. Chunking divides a dataset into regularly-sized chunks stored independently. This enables:
- Reading a subset of data without loading the entire dataset
- Compression applied per-chunk (reading one chunk never requires decompressing the rest of the dataset)
- Efficient writes in any order (each chunk is independent)
import h5py
import numpy as np

large_array = np.zeros((365, 72, 144), dtype='float32')  # placeholder data

with h5py.File('chunking_demo.h5', 'w') as f:  # placeholder file name
    # Without chunking: contiguous layout, which cannot be compressed or resized
    f.create_dataset('data', data=large_array)
    # With chunking: choose the chunk shape to match your access pattern.
    # For a (time × lat × lon) array read one time step at a time,
    # store one full time slice per chunk:
    f.create_dataset('temperature',
                     shape=(365, 721, 1440),
                     dtype='float32',
                     chunks=(1, 721, 1440),  # one time slice per chunk
                     compression='gzip',
                     compression_opts=4)
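To make the link between chunk shape and access pattern concrete, the sketch below counts how many chunks a read has to touch. The helper `chunks_touched` and the alternative chunk shape `(365, 16, 16)` are illustrative assumptions, not part of h5py.

```python
from math import prod

def chunks_touched(selection, chunk_shape):
    """Count chunks overlapped by a hyperslab given as (start, stop) pairs per axis."""
    per_axis = [
        (stop - 1) // size - start // size + 1
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return prod(per_axis)

# Two typical reads on a (365, 721, 1440) array:
one_day_map = [(0, 1), (0, 721), (0, 1440)]         # full grid at one time step
point_series = [(0, 365), (100, 101), (200, 201)]   # full year at one grid cell

results = {}
for chunk_shape in [(1, 721, 1440), (365, 16, 16)]:
    results[chunk_shape] = (
        chunks_touched(one_day_map, chunk_shape),
        chunks_touched(point_series, chunk_shape),
    )
    print(chunk_shape, results[chunk_shape])
# (1, 721, 1440) (1, 365)
# (365, 16, 16) (4140, 1)
```

Each touched chunk must be read and decompressed in full, so a layout that is ideal for daily maps is poor for per-location time series, and vice versa. When both patterns matter, an intermediate chunk shape is a common compromise.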
Compression
HDF5 supports pluggable compression filters applied per-chunk:
| Filter | Description | Ratio | Speed |
|---|---|---|---|
| gzip | DEFLATE compression | Good | Medium |
| lzf | LZF: fast, moderate ratio | Fair | Very fast |
| szip | NASA-developed scientific compression | Good | Fast |
| BLOSC | Meta-compressor (wraps LZ4, Zstd, etc.); third-party plugin | Excellent | Very fast |
| scaleoffset | Integer/float quantization (usually lossy) | Excellent | Very fast |
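import h5py
import numpy as np
arr = np.arange(1_000_000, dtype='int32')  # placeholder data
with h5py.File('compression_demo.h5', 'w') as f:  # placeholder file name
    # Gzip compression level 4 (levels range from 0 = none to 9 = maximum)
    f.create_dataset('gzip_only', data=arr, compression='gzip', compression_opts=4)
    # Shuffle + gzip: shuffle regroups bytes before compression, which often helps numeric arrays
    f.create_dataset('shuffle_gzip', data=arr, compression='gzip', shuffle=True)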
The shuffle filter is one of HDF5's most important optimizations: it reorders bytes of multi-byte integers/floats so that the same byte position of adjacent elements is contiguous. This dramatically improves gzip's ability to find repeated patterns in numeric data.
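The effect can be reproduced outside HDF5 with NumPy and `zlib`. This is only a sketch of the idea, not HDF5's actual filter code: the helpers `byte_shuffle` and `byte_unshuffle` are written for this example, and the test data is a simple integer ramp.

```python
import zlib
import numpy as np

def byte_shuffle(arr):
    """Group byte 0 of every element together, then byte 1, and so on."""
    as_bytes = np.frombuffer(arr.tobytes(), dtype=np.uint8).reshape(-1, arr.itemsize)
    return as_bytes.T.tobytes()

def byte_unshuffle(buf, dtype, count):
    """Invert byte_shuffle for `count` elements of the given dtype."""
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(np.dtype(dtype).itemsize, count)
    return np.frombuffer(planes.T.tobytes(), dtype=dtype)

values = np.arange(100_000, dtype=np.int32)   # slowly varying numeric data

plain_size = len(zlib.compress(values.tobytes(), 4))
shuffled = byte_shuffle(values)
shuffled_size = len(zlib.compress(shuffled, 4))
restored = byte_unshuffle(shuffled, np.int32, values.size)

print(f"raw: {values.nbytes} B, gzip: {plain_size} B, shuffle+gzip: {shuffled_size} B")
```

For this ramp, the high-order bytes are almost constant, so once they sit next to each other DEFLATE finds long runs. The shuffle is fully reversible and costs very little CPU time, which is why it is commonly paired with gzip.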
Parallel I/O
HDF5 supports parallel reading and writing via MPI-IO when compiled with the parallel HDF5 library. This enables multiple processes on a supercomputer cluster to simultaneously read/write different parts of the same HDF5 file — a critical capability for large-scale scientific simulations.
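# Parallel HDF5 with mpi4py (requires h5py built against parallel HDF5)
# This example assumes exactly 10 MPI ranks, each writing 100 of the 1000 rows
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
with h5py.File('parallel_output.h5', 'w', driver='mpio', comm=comm) as f:
    # Dataset creation is collective: every rank must make the same call
    ds = f.create_dataset('data', shape=(1000, 1000))
    # Each rank writes its own portion
    ds[rank*100:(rank+1)*100, :] = np.random.randn(100, 1000)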
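The example above hard-codes 100 rows per rank, which only works when the job runs with exactly 10 processes. One possible way to split rows for any process count is sketched below; `row_range` is a hypothetical helper, not part of h5py or mpi4py, and it runs without MPI.

```python
def row_range(rank, size, n_rows):
    """Return the half-open [start, stop) row range owned by `rank` out of `size` ranks."""
    base, extra = divmod(n_rows, size)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

ranges = [row_range(r, 3, 1000) for r in range(3)]
print(ranges)  # [(0, 334), (334, 667), (667, 1000)]
```

In the parallel script, each rank would call `row_range(rank, comm.Get_size(), 1000)` and write `ds[start:stop, :]`, so the ranges cover every row exactly once with no overlap.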
Working with HDF5 in Python
h5py — The Standard Python Interface
# pip install h5py numpy
import h5py
import numpy as np

# --- Writing ---
with h5py.File('climate_data.h5', 'w') as f:
    # Create group structure
    climate = f.create_group('climate')
    metadata = f.create_group('metadata')

    # Create datasets
    n_times, n_lat, n_lon = 365, 181, 360
    temp = climate.create_dataset(
        'temperature',
        shape=(n_times, n_lat, n_lon),
        dtype='float32',
        chunks=(1, n_lat, n_lon),
        compression='gzip',
        compression_opts=4,
        shuffle=True,
        fillvalue=9.96921e+36
    )

    # Set attributes
    temp.attrs['units'] = 'Kelvin'
    temp.attrs['long_name'] = 'Air Temperature at 2m'
    temp.attrs['valid_min'] = 180.0
    temp.attrs['valid_max'] = 340.0

    # Write data one day (one chunk) at a time
    for day in range(n_times):
        temp[day, :, :] = np.random.uniform(250, 310, (n_lat, n_lon)).astype('float32')

    # Coordinate variables
    lats = climate.create_dataset('latitude', data=np.linspace(-90, 90, n_lat))
    lats.attrs['units'] = 'degrees_north'
    lons = climate.create_dataset('longitude', data=np.linspace(-180, 179, n_lon))
    lons.attrs['units'] = 'degrees_east'

    # File-level attributes
    f.attrs['Conventions'] = 'CF-1.8'
    f.attrs['history'] = f'Created {np.datetime64("today")}'
# --- Reading ---
with h5py.File('climate_data.h5', 'r') as f:
    # Inspect structure
    print("Groups:", list(f.keys()))
    print("Climate datasets:", list(f['climate'].keys()))

    # Read subset (no need to load entire dataset)
    temp = f['climate/temperature']
    print(f"Shape: {temp.shape}, Dtype: {temp.dtype}")

    # Read January (days 0-30), Northern Hemisphere (lat index 91-180)
    jan_nh = temp[0:31, 91:181, :]
    print(f"Jan NH mean: {jan_nh.mean():.2f} K")

    # Read attributes
    print("Units:", temp.attrs['units'])

    # Iterate datasets recursively
    def print_structure(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  Dataset: {name} shape={obj.shape} dtype={obj.dtype}")

    f.visititems(print_structure)
Inspecting HDF5 Files from the Command Line
# HDF5 utilities (from HDF5 library)
h5ls climate_data.h5 # list top-level contents
h5ls -r climate_data.h5 # recursive listing
h5dump -H climate_data.h5 # print full schema/header
h5dump -d /climate/temperature climate_data.h5 # print dataset values
# HDFView — GUI tool from The HDF Group (free download)
# Provides a tree browser, data table viewer, and attribute editor
HDF5 vs NetCDF4
NetCDF4 (Network Common Data Form version 4) is built directly on top of HDF5: a NetCDF4 file is a valid HDF5 file that follows the netCDF data model of named dimensions, variables, and attributes. Most such files also follow the CF Conventions, which standardize variable naming, coordinate systems, and metadata attributes. NetCDF4 is the standard format for atmospheric and oceanographic data; HDF5 is the more general-purpose container.
| Aspect | HDF5 | NetCDF4 |
|---|---|---|
| Built on | Native format (HDF5 library) | HDF5 library |
| Conventions | None built-in | CF Conventions (units, coordinates, dimensions) |
| Domain | General scientific | Climate/Ocean/Atmosphere |
| Tools | h5py, HDFView, h5dump | netCDF4-python, NCO, CDO, xarray |
| Interoperability | Open format, many tools | Same plus CF-aware tools |
In practice, if you work with climate or atmospheric data, use NetCDF4. For general high-performance scientific array storage, use HDF5 directly.
HDF5 in Machine Learning
HDF5 gained widespread adoption in machine learning through Keras's original model saving format (.h5). The HDF5 file stores:
- Model architecture (JSON)
- Layer weights as typed float arrays
- Optimizer state
import h5py
import tensorflow as tf

# Keras: save and load a model (`model` is assumed to be an existing tf.keras model)
model.save('my_model.h5')  # legacy HDF5 format
model = tf.keras.models.load_model('my_model.h5')

# Modern Keras uses the .keras format (a ZIP archive holding JSON config and an HDF5 weights file)
model.save('my_model.keras')

# Direct HDF5 for large training datasets
with h5py.File('training_data.h5', 'r') as f:
    X = f['features'][:]  # load all features into memory
    y = f['labels'][:]
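The snippet above loads every sample into memory with `[:]`. When the dataset is larger than RAM, one option is to slice it batch by batch instead. The sketch below assumes `features` and `labels` datasets with a shared first dimension; `iter_batches` is a hypothetical helper, and the tiny in-memory file stands in for real training data.

```python
import h5py
import numpy as np

def iter_batches(features, labels, batch_size):
    """Yield (X, y) NumPy batches, reading only one slice from the file at a time."""
    n_samples = features.shape[0]
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        yield features[start:stop], labels[start:stop]

# Small in-memory stand-in for a real training file.
with h5py.File('training_demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('features', data=np.arange(50, dtype='float32').reshape(10, 5))
    f.create_dataset('labels', data=np.arange(10, dtype='int64'))

    batch_shapes = [
        (X.shape, y.shape)
        for X, y in iter_batches(f['features'], f['labels'], batch_size=4)
    ]

print(batch_shapes)  # [((4, 5), (4,)), ((4, 5), (4,)), ((2, 5), (2,))]
```

Sequential slices like these read efficiently when the dataset is chunked along the sample axis. Random shuffling of individual samples is much slower, so shuffling at the batch level is a common workaround.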
PyTables is a higher-level Python library on top of HDF5 that adds table semantics, SQL-like queries, and better pandas integration for structured data.
When to Use HDF5
Use HDF5 when:
- Storing large multi-dimensional scientific arrays (climate, astronomy, medical imaging, simulation output)
- You need hierarchical organization within a single file
- Random access to subsets of large arrays is required
- Parallel read/write from multiple processes is needed
- Interoperability with scientific software (MATLAB, R, Julia, Fortran) is required
Consider alternatives when:
- Data is primarily tabular and analytical (use Parquet)
- Real-time streaming is required (use Avro, Kafka)
- Cloud-native object storage is the primary access pattern (consider Zarr, which offers a similar chunked, compressed array model designed for object stores)