
HDF5 Format Guide: Hierarchical Data for Scientific Computing

By Pablo Cirre • Updated May 7, 2026

What Is HDF5?

HDF5 (Hierarchical Data Format version 5) is a versatile, high-performance file format and data model for storing and managing large, complex datasets. It is developed by The HDF Group, which grew out of the National Center for Supercomputing Applications (NCSA). The HDF family has been a cornerstone of scientific computing since the late 1980s, and HDF5 itself was first released in 1998. It is used by NASA, CERN, NOAA, ECMWF, and many other major scientific computing institutions for storing everything from satellite imagery and climate models to particle physics data and MRI datasets.

HDF5 is not just a file format — it is a hierarchical data model (like a file system within a file) combined with a binary storage layout optimized for large arrays and parallel I/O. A single HDF5 file can store terabytes of data organized into a structured hierarchy of groups, datasets, and metadata attributes, all accessible with random reads without loading the entire file.

The HDF5 Data Model

HDF5 organizes data using three fundamental concepts:

Groups

Groups are like directories in a file system. Every HDF5 file has a root group (/) and can contain any number of nested groups. Groups contain datasets and other groups.

/                           ← root group
├── /climate/               ← group
│   ├── /climate/temperature  ← dataset: 3D array (time × lat × lon)
│   ├── /climate/precipitation
│   └── /climate/wind_speed
├── /metadata/
│   ├── /metadata/station_info
│   └── /metadata/calibration
└── /satellite/
    ├── /satellite/band_1   ← dataset: 2D array (rows × cols)
    └── /satellite/band_2
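A minimal h5py sketch of the hierarchy above, assuming h5py and NumPy are installed. The file name, the tiny array shapes, and the in-memory `core` driver are illustrative choices rather than anything the layout requires.

```python
import h5py
import numpy as np

# In-memory file (driver="core", backing_store=False): nothing is written to disk.
# The file name and the tiny shapes are placeholders for illustration.
with h5py.File("hierarchy_sketch.h5", "w", driver="core", backing_store=False) as f:
    climate = f.create_group("climate")
    climate.create_dataset("temperature", data=np.zeros((4, 3, 2), dtype="float32"))
    climate.create_dataset("precipitation", data=np.zeros((4, 3, 2), dtype="float32"))
    climate.create_dataset("wind_speed", data=np.zeros((4, 3, 2), dtype="float32"))

    # Intermediate groups are created automatically from a full path.
    f.create_dataset("satellite/band_1", data=np.zeros((5, 5), dtype="uint16"))
    f.create_dataset("satellite/band_2", data=np.zeros((5, 5), dtype="uint16"))
    f.create_group("metadata")

    # Collect every object path below the root group.
    paths = []
    f.visit(paths.append)
    top_level = sorted(f.keys())

for p in paths:
    print("/" + p)
```

Groups behave much like dictionaries, so `f["climate"]["temperature"]` and `f["climate/temperature"]` refer to the same dataset.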

Datasets

Datasets are N-dimensional arrays of data — the core data container. A dataset has:

A shape (dimensions): e.g., (365, 721, 1440) for daily temperature at 0.25° global resolution
A datatype: int8, int16, int32, int64, float32, float64, complex128, string, etc.
Attributes: metadata key-value pairs attached to the dataset
Optional chunking and compression
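To get a feel for the sizes involved, the uncompressed footprint of a dataset is the product of its shape times the size of its datatype. The sketch below uses only the standard library; the `ITEMSIZE` table and the `dataset_nbytes` helper are hypothetical names introduced for illustration.

```python
from math import prod

# Bytes per element for a few fixed-size datatypes listed above.
ITEMSIZE = {"int8": 1, "int16": 2, "int32": 4, "int64": 8,
            "float32": 4, "float64": 8, "complex128": 16}

def dataset_nbytes(shape, dtype):
    """Uncompressed size in bytes of an N-dimensional array."""
    return prod(shape) * ITEMSIZE[dtype]

shape = (365, 721, 1440)          # time x lat x lon at 0.25 degrees
total = dataset_nbytes(shape, "float32")
one_day = dataset_nbytes(shape[1:], "float32")

print(f"full year: {total:,} bytes ({total / 2**30:.2f} GiB)")
print(f"one day:   {one_day:,} bytes ({one_day / 2**20:.2f} MiB)")
```

A year of daily float32 fields at this resolution is roughly 1.4 GiB before compression, which is why partial reads and chunking matter.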

Attributes

Attributes are small metadata values attached to groups or datasets. They store units, coordinate reference systems, creation timestamps, processing history, and any other contextual information.

dataset.attrs['units'] = 'Kelvin'
dataset.attrs['long_name'] = 'Air Temperature at 2 meters'
dataset.attrs['valid_range'] = [180.0, 340.0]
dataset.attrs['_FillValue'] = 9.96921e+36  # missing data sentinel
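Reading attributes back works the same way. The following is a small sketch, assuming h5py and NumPy are installed; the in-memory file and its name are placeholders used only so that the example runs on its own.

```python
import h5py
import numpy as np

with h5py.File("attrs_sketch.h5", "w", driver="core", backing_store=False) as f:
    dataset = f.create_dataset("temperature", data=np.zeros((2, 3), dtype="float32"))
    dataset.attrs["units"] = "Kelvin"
    dataset.attrs["valid_range"] = [180.0, 340.0]

    # Attributes behave like a dictionary attached to the object.
    units = dataset.attrs["units"]
    valid_min, valid_max = dataset.attrs["valid_range"]
    names = sorted(dataset.attrs.keys())
    description = dataset.attrs.get("long_name", "(not set)")

print(units, valid_min, valid_max, names, description)
```

List-valued attributes come back as NumPy arrays, so they can be unpacked or indexed like any other array.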

HDF5 Key Features

Chunking

By default, HDF5 datasets are stored contiguously. Chunking divides a dataset into regularly-sized chunks stored independently. This enables:

Reading a subset of data without loading the entire dataset
Compression applied per chunk (reading one chunk never requires decompressing the whole dataset)
Efficient writes in any order (each chunk is independent)

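import h5py
import numpy as np

f = h5py.File('example.h5', 'w')   # placeholder file name; call f.close() when finished
large_array = np.zeros((1000, 1000), dtype='float32')  # placeholder data

# Without chunking (bad for large arrays with partial reads)
f.create_dataset('data', data=large_array)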

# With chunking — choose chunk shape to match your access pattern
# For a time series (time × lat × lon), chunk along the time axis:
f.create_dataset('temperature',
    shape=(365, 721, 1440),
    dtype='float32',
    chunks=(1, 721, 1440),     # one time slice per chunk
    compression='gzip',
    compression_opts=4)
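Why the chunk shape should match the access pattern can be seen by counting how many chunks a read has to touch. The sketch below is plain Python arithmetic, not an h5py API; `chunks_touched` and the `tiled_chunks` layout are hypothetical, chosen only for comparison.

```python
from math import prod

def chunks_touched(selection, chunk_shape):
    """Count chunks overlapped by a rectangular selection.

    selection is a list of (start, stop) index pairs, one per axis,
    with stop exclusive, as in NumPy slicing.
    """
    per_axis = []
    for (start, stop), chunk_len in zip(selection, chunk_shape):
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        per_axis.append(last - first + 1)
    return prod(per_axis)

# Two access patterns on a (365, 721, 1440) dataset
one_day_map = [(0, 1), (0, 721), (0, 1440)]            # temperature[0, :, :]
one_point_series = [(0, 365), (360, 361), (720, 721)]  # temperature[:, 360, 720]

time_slice_chunks = (1, 721, 1440)   # the layout used above
tiled_chunks = (73, 103, 180)        # hypothetical alternative layout

for name, chunks in [("time-slice", time_slice_chunks), ("tiled", tiled_chunks)]:
    print(name,
          "map:", chunks_touched(one_day_map, chunks),
          "series:", chunks_touched(one_point_series, chunks))
```

With one time slice per chunk, a daily map touches a single chunk, but a time series at one grid point touches all 365 chunks, so the whole dataset has to be read and decompressed. The tiled layout reverses the trade-off: 5 chunks for the series, 56 for the map.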

Compression

HDF5 supports pluggable compression filters applied per-chunk:

Filter	Description	Ratio	Speed
`gzip`	DEFLATE compression	Good	Medium
`lzf`	LZF — fast, moderate ratio	Fair	Very fast
`szip`	NASA-developed scientific compression	Good	Fast
Blosc	Meta-compressor (wraps LZ4, Zstd, etc.); third-party filter, not bundled with h5py	Excellent	Very fast
`scaleoffset`	Scale-offset quantization (lossless for integers, lossy for floats)	Excellent	Very fast

# Assumes an open, writable h5py.File `f` and a NumPy array `arr`
# Gzip compression level 4 (0 = no compression, 9 = maximum)
ds = f.create_dataset('data_gzip', data=arr, compression='gzip', compression_opts=4)

# Shuffle filter + gzip (shuffle reorders bytes before compression, a major improvement for numeric arrays)
ds = f.create_dataset('data_shuffled', data=arr, compression='gzip', shuffle=True)

The shuffle filter is one of HDF5's most important optimizations: it reorders bytes of multi-byte integers/floats so that the same byte position of adjacent elements is contiguous. This dramatically improves gzip's ability to find repeated patterns in numeric data.
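The byte reordering can be illustrated without HDF5 at all. The sketch below uses only the standard library: `byte_shuffle` and `byte_unshuffle` are hypothetical helpers that mimic the idea, `zlib` stands in for the gzip filter (both use DEFLATE), and the data is synthetic. The real filter works per chunk inside the HDF5 library.

```python
import random
import struct
import zlib

def byte_shuffle(buf, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(buf[k::itemsize] for k in range(itemsize))

def byte_unshuffle(buf, itemsize):
    """Inverse of byte_shuffle."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for k in range(itemsize):
        out[k::itemsize] = buf[k * n:(k + 1) * n]
    return bytes(out)

# Synthetic data: 10,000 small readings stored as 32-bit little-endian ints,
# so three of every four bytes are zero.
rng = random.Random(0)
values = [rng.randrange(200) for _ in range(10_000)]
raw = struct.pack(f"<{len(values)}i", *values)

shuffled = byte_shuffle(raw, 4)
plain_size = len(zlib.compress(raw, 4))
shuffled_size = len(zlib.compress(shuffled, 4))

print("gzip only:     ", plain_size, "bytes")
print("shuffle + gzip:", shuffled_size, "bytes")
```

After shuffling, the zero bytes form one long run that DEFLATE encodes in very little space, instead of being interleaved with the varying low-order bytes.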

Parallel I/O

HDF5 supports parallel reading and writing via MPI-IO when compiled with the parallel HDF5 library. This enables multiple processes on a supercomputer cluster to simultaneously read/write different parts of the same HDF5 file — a critical capability for large-scale scientific simulations.

# Parallel HDF5 with mpi4py (requires h5py built against parallel HDF5)
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

with h5py.File('parallel_output.h5', 'w', driver='mpio', comm=comm) as f:
    ds = f.create_dataset('data', shape=(1000, 1000))
    # Each rank writes its own 100-row block (assumes exactly 10 ranks, e.g. mpiexec -n 10)
    ds[rank*100:(rank+1)*100, :] = np.random.randn(100, 1000)
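The fixed 100-row slices above only cover the dataset when exactly 10 ranks are running. One way to split rows for any number of ranks is sketched below in plain Python; `row_range` is a hypothetical helper, not part of h5py or mpi4py.

```python
def row_range(rank, size, n_rows):
    """Return the (start, stop) rows owned by `rank` out of `size` ranks.

    Rows are split into contiguous blocks; the first n_rows % size ranks
    receive one extra row so that every row is assigned exactly once.
    """
    base, extra = divmod(n_rows, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# Example: 1000 rows over 7 ranks (1000 is not divisible by 7)
blocks = [row_range(r, 7, 1000) for r in range(7)]
for r, (start, stop) in enumerate(blocks):
    print(f"rank {r}: rows {start}..{stop - 1} ({stop - start} rows)")
```

Inside an MPI program, each rank could call `row_range(comm.Get_rank(), comm.Get_size(), ds.shape[0])` and then write `ds[start:stop, :]`.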

Working with HDF5 in Python

h5py — The Standard Python Interface

# pip install h5py numpy

import h5py
import numpy as np

# --- Writing ---
with h5py.File('climate_data.h5', 'w') as f:
    # Create group structure
    climate = f.create_group('climate')
    metadata = f.create_group('metadata')

    # Create datasets
    n_times, n_lat, n_lon = 365, 181, 360
    temp = climate.create_dataset(
        'temperature',
        shape=(n_times, n_lat, n_lon),
        dtype='float32',
        chunks=(1, n_lat, n_lon),
        compression='gzip',
        compression_opts=4,
        shuffle=True,
        fillvalue=9.96921e+36
    )

    # Set attributes
    temp.attrs['units'] = 'Kelvin'
    temp.attrs['long_name'] = 'Air Temperature at 2m'
    temp.attrs['valid_min'] = 180.0
    temp.attrs['valid_max'] = 340.0

    # Write data
    for day in range(n_times):
        temp[day, :, :] = np.random.uniform(250, 310, (n_lat, n_lon)).astype('float32')

    # Coordinate variables
    lats = climate.create_dataset('latitude', data=np.linspace(-90, 90, n_lat))
    lats.attrs['units'] = 'degrees_north'
    lons = climate.create_dataset('longitude', data=np.linspace(-180, 179, n_lon))
    lons.attrs['units'] = 'degrees_east'

    # File-level attributes
    f.attrs['Conventions'] = 'CF-1.8'
    f.attrs['history'] = f'Created {np.datetime64("today")}'

# --- Reading ---
with h5py.File('climate_data.h5', 'r') as f:
    # Inspect structure
    print("Groups:", list(f.keys()))
    print("Climate datasets:", list(f['climate'].keys()))

    # Read subset (no need to load entire dataset)
    temp = f['climate/temperature']
    print(f"Shape: {temp.shape}, Dtype: {temp.dtype}")

    # Read January (days 0-30), Northern Hemisphere (lat index 91-180)
    jan_nh = temp[0:31, 91:181, :]
    print(f"Jan NH mean: {jan_nh.mean():.2f} K")

    # Read attributes
    print("Units:", temp.attrs['units'])

    # Iterate datasets recursively
    def print_structure(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  Dataset: {name} shape={obj.shape} dtype={obj.dtype}")
    f.visititems(print_structure)

Inspecting HDF5 Files from the Command Line

# HDF5 utilities (from HDF5 library)
h5ls climate_data.h5          # list top-level contents
h5ls -r climate_data.h5       # recursive listing
h5dump -H climate_data.h5     # print full schema/header
h5dump -d /climate/temperature climate_data.h5  # print dataset values

# HDFView — GUI tool from The HDF Group (free download)
# Provides a tree browser, data table viewer, and attribute editor

HDF5 vs NetCDF4

NetCDF4 (Network Common Data Form version 4) is built directly on top of HDF5: a NetCDF4 file is a valid HDF5 file that follows the netCDF data model of named dimensions, variables, and attributes. Most climate data additionally follows the CF (Climate and Forecast) Conventions for variable naming, coordinate systems, and units. NetCDF4 is the standard format for atmospheric and oceanographic data; HDF5 is the more general-purpose container.

Aspect	HDF5	NetCDF4
Built on	HDF5 library	HDF5
Conventions	None built-in	netCDF data model; CF Conventions (units, coordinates, dimensions) commonly layered on top
Domain	General scientific	Climate/Ocean/Atmosphere
Tools	h5py, HDFView, h5dump	netCDF4-python, NCO, CDO, xarray
Interoperability	Open format, many tools	Same plus CF-aware tools

In practice, if you work with climate or atmospheric data, use NetCDF4. For general high-performance scientific array storage, use HDF5 directly.

HDF5 in Machine Learning

HDF5 gained widespread adoption in machine learning through Keras's original model saving format (.h5); the newer .keras format is a ZIP archive that still stores weights in an HDF5 file. A .h5 model file stores:

Model architecture (JSON)
Layer weights as typed float arrays
Optimizer state

# Keras: save and load a model (assumes `model` is an already-built Keras model)
import h5py
import tensorflow as tf

model.save('my_model.h5')         # legacy HDF5 format
model = tf.keras.models.load_model('my_model.h5')

# Modern Keras uses the .keras format (a ZIP archive holding JSON config and an HDF5 weights file)
model.save('my_model.keras')

# Direct HDF5 for large training datasets
with h5py.File('training_data.h5', 'r') as f:
    X = f['features'][:]  # [:] loads the whole dataset into memory
    y = f['labels'][:]
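When the training set does not fit in memory, the same file can be read in batches instead of with `[:]`. A minimal sketch, assuming h5py and NumPy are installed; `iter_batches` is a hypothetical helper, and the small in-memory file stands in for `training_data.h5`.

```python
import h5py
import numpy as np

def iter_batches(features, labels, batch_size):
    """Yield (X, y) NumPy batches; each slice reads only those rows."""
    n = features.shape[0]
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        yield features[start:stop], labels[start:stop]

# Small in-memory stand-in for training_data.h5 (nothing is written to disk).
with h5py.File("training_sketch.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("features", data=np.arange(40, dtype="float32").reshape(10, 4),
                     chunks=(4, 4))
    f.create_dataset("labels", data=np.arange(10, dtype="int64"), chunks=(4,))

    batch_shapes = []
    for X, y in iter_batches(f["features"], f["labels"], batch_size=4):
        batch_shapes.append((X.shape, y.shape))

print(batch_shapes)
```

Choosing a batch size that is a multiple of the chunk length along the first axis keeps each read aligned with whole chunks.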

PyTables is a higher-level Python library on top of HDF5 that adds table semantics, SQL-like queries, and better pandas integration for structured data.

When to Use HDF5

Use HDF5 when:

Storing large multi-dimensional scientific arrays (climate, astronomy, medical imaging, simulation output)
You need hierarchical organization within a single file
Random access to subsets of large arrays is required
Parallel read/write from multiple processes is needed
Interoperability with scientific software (MATLAB, R, Julia, Fortran) is required

Consider alternatives when:

Data is primarily tabular and analytical (use Parquet)
Real-time streaming is required (use Avro, Kafka)
Cloud-native object storage is the primary access pattern (Zarr offers a similar chunked, hierarchical array model designed for object stores, though it is a separate format from HDF5)

Frequently Asked Questions

What is HDF5 used for?

HDF5 (Hierarchical Data Format version 5) is used for storing large, complex scientific datasets. Major use cases include climate and weather data (temperature, precipitation, and wind grids at global resolution), satellite imagery (multi-band raster arrays), particle physics simulations and analysis data, medical imaging (fMRI brain scans stored as 4D arrays), machine learning model weights (Keras's legacy format stores models in .h5 files), and any numerical computation that produces large multi-dimensional arrays. Its key advantage is the ability to read small subsets of huge files efficiently without loading everything into memory.

How is HDF5 different from a regular binary file or a NumPy .npy file?

A regular binary file or NumPy .npy file stores a single array with minimal metadata. HDF5 is a hierarchical container (like a file system within a file) that stores many named datasets, organized into groups, with rich metadata attributes on each. A single .h5 file might contain dozens of related arrays (temperature, pressure, humidity, coordinates) with their units, coordinate reference systems, and provenance information all self-described. HDF5 also supports chunked storage (random access to subsets), pluggable compression, and parallel I/O from multiple processes simultaneously.

How do I read HDF5 files in Python?

Install h5py (`pip install h5py numpy`), the standard Python interface for HDF5. Open files with `h5py.File("data.h5", "r")` as a context manager. Navigate the hierarchy like a dictionary: `f["climate"]["temperature"]` returns a dataset object. Read data with NumPy-style slicing: `temp[:, 0:90, :]` reads the first 90 latitudes for all times and longitudes without loading the full dataset. Call `f.visititems(callback)` on the open file to recursively explore its structure. The `h5ls` and `h5dump` command-line tools from the HDF5 library provide quick inspection without writing code.

What is the difference between HDF5 and NetCDF4?

NetCDF4 is built on top of HDF5: a NetCDF4 file is a valid HDF5 file that follows the netCDF data model, and climate datasets usually also follow the CF Conventions (Climate and Forecast Metadata Conventions) for naming coordinates, dimensions, and attributes. You can open a NetCDF4 file with h5py and read its datasets. The difference is that NetCDF4 with CF adds semantic structure: standard names for coordinates (latitude, longitude, time, depth), conventions for units and calendars, and CF-aware tools (xarray, CDO, NCO) that understand these semantics and enable operations like "select all data at 45°N" without writing coordinate arithmetic. For climate science, use NetCDF4. For general scientific arrays, use HDF5 directly.
