Zarr: Cloud-Native Chunked Arrays for Scientific Computing
What Is Zarr?
Zarr is an open-source storage format and Python library for chunked, compressed, N-dimensional arrays. Designed from the ground up for cloud and distributed computing, Zarr stores each chunk of an array as a separate object — in a local directory, an S3 bucket, Google Cloud Storage, Azure Blob Storage, or any other key-value store — enabling efficient parallel reads and writes without the file-locking and serialization bottlenecks of monolithic formats like HDF5.
Zarr is now widely used for large-scale climate data (the Pangeo ecosystem), genomics (AnnData .zarr), Earth observation (the Open Data Cube), and machine learning training datasets where terabyte-scale arrays must be read efficiently from object storage.
The format has two active versions:
- Zarr v2: widely deployed, stable, supported by virtually all scientific Python tooling
- Zarr v3: redesigned metadata, better extension model, native support for sharding (multiple chunks per object) — in active adoption as of 2024–2025
Core Concepts
Arrays and Chunks
A Zarr array is conceptually identical to a NumPy array — an N-dimensional grid of typed data — but its storage is chunked:
my_array/        ← Zarr array directory
    .zarray      ← JSON metadata (shape, dtype, chunk shape, compressor)
    .zattrs      ← JSON attributes (units, coordinates, etc.)
    0.0.0        ← Chunk [0,0,0]
    0.0.1        ← Chunk [0,0,1]
    0.1.0        ← Chunk [0,1,0]
    ...
    9.15.23      ← Chunk [9,15,23]
Each chunk file is an independent compressed binary blob. Reading a subset of the array reads only the relevant chunk files — no sequential scan, no decompression of irrelevant data.
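To make the "only the relevant chunk files" point concrete, here is a minimal sketch that maps a rectangular selection to the chunk keys it overlaps. The helper name `chunk_keys` is hypothetical (it is not part of the Zarr library), and it assumes the dot-separated v2 key naming shown in the listing above.

```python
from itertools import product

def chunk_keys(selection, chunk_shape, sep="."):
    """Return the chunk keys that overlap a rectangular selection.

    selection:   list of (start, stop) index pairs, one per dimension (stop exclusive)
    chunk_shape: chunk length along each dimension
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        per_dim.append(range(first, last + 1))
    return [sep.join(map(str, idx)) for idx in product(*per_dim)]

# 24 time steps and a lat/lon window, with chunks of (24, 90, 180)
keys = chunk_keys([(0, 24), (400, 560), (720, 900)], (24, 90, 180))
print(keys)  # ['0.4.4', '0.5.4', '0.6.4']
```

Only three chunk objects need to be fetched and decompressed for this selection, however large the full array is.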
Groups
Zarr groups are containers (directories or prefix namespaces in object storage) that hold arrays and other groups:
dataset.zarr/
    .zgroup              ← marks this as a Zarr group
    .zattrs              ← group-level attributes
    temperature/         ← sub-array
        .zarray
        0.0
        0.1
        ...
    precipitation/
        .zarray
        ...
    time/
        .zarray
        0
        1
        ...
Stores
Zarr's key design feature is its store abstraction: an array or group can live in any key-value storage backend. Common options in the zarr-python 2.x API:
- zarr.DirectoryStore('path/'): local filesystem
- zarr.ZipStore('archive.zarr.zip'): single ZIP file
- zarr.N5Store('path/'): N5 format compatibility (neuroscience)
- s3fs.S3FileSystem() + zarr.storage.FSStore: AWS S3
- gcsfs.GCSFileSystem() + zarr.storage.FSStore: Google Cloud Storage
- adlfs.AzureBlobFileSystem() + zarr.storage.FSStore: Azure Blob
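The store abstraction is easier to see with a toy example. The sketch below uses only the standard library and a plain `dict` as the key-value store. It writes a 1-D float32 array under Zarr v2 key names and reads back a slice. The helpers `write_array` and `read_slice` are hypothetical, the data is left uncompressed to avoid extra dependencies, and the layout is intended to follow the v2 conventions but has not been validated against zarr-python here.

```python
import json
import struct

def write_array(store, name, values, chunk_len):
    """Store a 1-D float32 sequence under `name` using Zarr v2 key names."""
    meta = {
        "zarr_format": 2,
        "shape": [len(values)],
        "chunks": [chunk_len],
        "dtype": "<f4",
        "compressor": None,   # uncompressed, to keep the sketch stdlib-only
        "fill_value": 0.0,
        "filters": None,
        "order": "C",
    }
    store[f"{name}/.zarray"] = json.dumps(meta).encode()
    for i in range(0, len(values), chunk_len):
        block = list(values[i:i + chunk_len])
        block += [0.0] * (chunk_len - len(block))   # edge chunks are stored full-size
        store[f"{name}/{i // chunk_len}"] = struct.pack(f"<{chunk_len}f", *block)

def read_slice(store, name, start, stop):
    """Read values[start:stop], touching only the chunks that overlap the range."""
    meta = json.loads(store[f"{name}/.zarray"])
    chunk_len = meta["chunks"][0]
    out = []
    for c in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        block = struct.unpack(f"<{chunk_len}f", store[f"{name}/{c}"])
        lo = max(start - c * chunk_len, 0)
        hi = min(stop - c * chunk_len, chunk_len)
        out.extend(block[lo:hi])
    return out

store = {}   # any mapping from string keys to bytes would work the same way
write_array(store, "time", [float(i) for i in range(10)], chunk_len=4)
print(sorted(store))                     # ['time/.zarray', 'time/0', 'time/1', 'time/2']
print(read_slice(store, "time", 3, 6))   # [3.0, 4.0, 5.0]
```

Swapping the `dict` for a directory, a ZIP archive, or an object-store mapping changes where the bytes live, not how the keys are named.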
Working with Zarr in Python
Creating Arrays
import zarr
import numpy as np
# Open/create a local directory store (zarr-python 2.x API; 3.x provides zarr.storage.LocalStore instead)
store = zarr.DirectoryStore('climate_data.zarr')
root = zarr.group(store=store)
# Create a chunked, compressed array
temp = root.create_dataset(
    'temperature',
    shape=(8760, 721, 1440),   # hours × lat × lon (one year, 0.25° global)
    chunks=(24, 90, 180),      # one day × 22.5° lat × 45° lon at 0.25° spacing
    dtype=np.float32,
    compressor=zarr.Blosc(cname='lz4', clevel=5, shuffle=zarr.Blosc.SHUFFLE),
    fill_value=np.nan
)
# Attach metadata attributes
temp.attrs['units'] = 'K'
temp.attrs['long_name'] = '2m air temperature'
temp.attrs['grid_mapping'] = 'crs'
print(temp.info)
# Type : zarr.core.Array
# Data type : float32
# Shape : (8760, 721, 1440)
# Chunk shape : (24, 90, 180)
# Order : C
# Read-only : False
# Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
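It can help to check the chunking arithmetic before writing any data. The following sketch computes the uncompressed chunk size and the number of chunk objects for the shape and chunk shape used above. The helper `chunk_stats` is a hypothetical name for illustration, not a Zarr function.

```python
from math import ceil, prod

def chunk_stats(shape, chunks, itemsize):
    """Return (uncompressed bytes per chunk, chunks per dimension, total chunk count)."""
    chunk_bytes = prod(chunks) * itemsize
    grid = tuple(ceil(s / c) for s, c in zip(shape, chunks))
    return chunk_bytes, grid, prod(grid)

# float32 has an item size of 4 bytes
chunk_bytes, grid, n_chunks = chunk_stats((8760, 721, 1440), (24, 90, 180), itemsize=4)
print(chunk_bytes)   # 1555200 bytes, about 1.5 MiB before compression
print(grid)          # (365, 9, 8)
print(n_chunks)      # 26280
```

Note that 721 latitude rows do not divide evenly by 90, so the ninth row of chunks holds a single row of real data. Whether that matters depends on the access pattern.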
Writing Data
# Write a slice (automatically identifies and compresses affected chunks)
temp[0:24, :, :] = np.random.randn(24, 721, 1440).astype(np.float32) + 280
# Write from a NumPy array
data = np.load('era5_2023.npy') # shape (8760, 721, 1440)
temp[:] = data # writes all chunks
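Loading the full year into memory takes roughly 36 GB for a float32 array of this shape, which may not fit on the writing machine. One common alternative is to write in blocks aligned to chunk boundaries, so that each chunk is compressed and written exactly once. The generator below is a minimal sketch of that idea. `aligned_blocks` is a hypothetical helper, and the commented usage assumes the source file can be memory-mapped.

```python
def aligned_blocks(length, chunk_len):
    """Yield (start, stop) pairs covering range(length) on chunk boundaries."""
    for start in range(0, length, chunk_len):
        yield start, min(start + chunk_len, length)

blocks = list(aligned_blocks(8760, 24))
print(len(blocks), blocks[0], blocks[-1])   # 365 (0, 24) (8736, 8760)

# Hypothetical use with the array above, assuming a memory-mapped source file:
#   data = np.load('era5_2023.npy', mmap_mode='r')
#   for start, stop in aligned_blocks(temp.shape[0], temp.chunks[0]):
#       temp[start:stop] = data[start:stop]
```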
Reading Data: Chunk-Wise Partial Reads
# Open existing store
root = zarr.open('climate_data.zarr', mode='r')
temp = root['temperature']
# Opening the array reads only metadata. Indexing fetches just the chunks that
# overlap the region and returns an in-memory NumPy array.
europe = temp[:, 400:560, 720:900] # shape (8760, 160, 180)
print(europe.shape) # (8760, 160, 180)
# Compute statistics on a subset
jan = temp[0:744, :, :] # January (744 hours), already a NumPy array
print(f"Global Jan mean: {jan.mean():.2f} K")
Cloud Access — S3 Without Downloading
import s3fs
import zarr
# Access a public Zarr dataset on S3 (no download); the bucket path is illustrative, so check the dataset's current layout
s3 = s3fs.S3FileSystem(anon=True)
store = zarr.storage.FSStore('s3://era5-pds/zarr/2023/', fs=s3)
root = zarr.open(store, mode='r')
print(list(root.keys())) # ['air_pressure_at_mean_sea_level', ...]
temp = root['air_temperature_at_2_metres']
print(temp.shape) # (8760, 721, 1440): 2023 is not a leap year, so 365 × 24 hours
# Read only western Europe, January 1 2023
jan1_europe = temp[0:24, 520:580, 720:780]
print(jan1_europe.shape) # (24, 60, 60)
Zarr with Dask: Parallel Computing
Zarr integrates natively with Dask for parallel, out-of-core computation:
import xarray as xr
# Open Zarr as an xarray Dataset (Dask-backed variables, coordinate labels, CF conventions)
ds = xr.open_zarr('s3://era5-pds/zarr/2023/', storage_options={'anon': True})
print(ds)
# Lazy computation — builds a task graph, no data read yet
annual_mean = ds['air_temperature_at_2_metres'].mean(dim='time')
# Execute in parallel (reads chunks concurrently)
result = annual_mean.compute() # returns xarray DataArray
result.to_netcdf('annual_mean_2023.nc')
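The reason chunked storage parallelizes well is that a reduction such as a mean can be computed per chunk and then combined. The sketch below illustrates that pattern with the standard library only. It is not how Dask is implemented, and `partial_sum` and `chunked_mean` are hypothetical names. In a real pipeline each task would also fetch and decompress its chunk.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(block):
    """Reduce one chunk to (sum, count); stands in for fetch + decompress + reduce."""
    return sum(block), len(block)

def chunked_mean(chunks, workers=4):
    """Mean over all chunks, combining independent per-chunk partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Toy data: three chunks of unequal length
chunks = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0]]
print(chunked_mean(chunks))   # 5.5
```

Carrying the count alongside each partial sum keeps the result correct when chunks hold different numbers of values.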
Zarr v3 and Sharding
Zarr v3 introduces sharding — storing multiple chunks inside a single object (shard). This solves the "too many small files" problem on object storage (where each GET request has fixed overhead):
# Zarr v3 with sharding: 10×10 chunks per shard object
# (zarr-python 3.x API; argument names may differ between releases)
import zarr.codecs
arr = zarr.create_array(
    store='sharded.zarr',
    shape=(10000, 10000),
    chunks=(100, 100),      # logical chunk size
    shards=(1000, 1000),    # shard size (10×10 chunks per object)
    dtype='float32',
    compressors=zarr.codecs.BloscCodec(),
    overwrite=True,         # replace any existing array, like mode='w'
)
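The effect of sharding on object counts is simple arithmetic. For the shapes in the example above, this sketch compares the number of logical chunks with the number of stored shard objects. `object_counts` is a hypothetical helper, and it assumes the shard shape is a whole multiple of the chunk shape.

```python
from math import ceil, prod

def object_counts(shape, chunks, shards):
    """Return (number of chunks, chunks per shard, number of shard objects)."""
    n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
    per_shard = prod(sh // c for sh, c in zip(shards, chunks))
    n_shards = prod(ceil(s / sh) for s, sh in zip(shape, shards))
    return n_chunks, per_shard, n_shards

print(object_counts((10000, 10000), (100, 100), (1000, 1000)))   # (10000, 100, 100)
```

Ten thousand small objects become one hundred larger ones, while readers can still address individual 100×100 chunks inside each shard.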
Zarr in Major Ecosystems
- Pangeo: cloud-native geoscience, with ERA5 reanalysis, CMIP6 climate projections, and GOES-16 satellite data published as Zarr on cloud object storage
- OME-Zarr (NGFF): biological microscopy standard; whole-slide images and light-sheet Z-stacks stored as multiscale Zarr pyramids
- AnnData: single-cell genomics, with .h5ad (HDF5) and .zarr backends for scanpy workflows
- TileDB: array database with its own storage format, built on similar chunked-array concepts
- Kerchunk / VirtualiZarr: reference filesystem technique to expose NetCDF/HDF5/GRIB2 files as Zarr without copying data
Comparison: Zarr vs. HDF5
| Feature | Zarr | HDF5 |
|---|---|---|
| Storage model | Key-value (directory/object) | Single binary file |
| Cloud-native | Yes: chunks are independent objects | No: designed around a single file on a local or POSIX filesystem |
| Parallel writes | Yes, when writers touch different chunks (no file locking) | Limited (parallel HDF5 requires MPI) |
| Remote access | One HTTP/S3 GET per chunk object | Needs full file, byte-range reads through a driver such as ROS3 or fsspec, or a special server (HSDS) |
| Format metadata | JSON (.zarray, .zattrs) | Binary (embedded in file) |
| Ecosystem | Python-first (dask, xarray, pangeo) | C-first, broad language support |
| Arbitrary key-value stores | Yes | No |
| Compression codecs | Blosc, LZ4, Zstd, Zlib, GZip, BZ2 | Deflate, SZIP, LZF, Blosc (plugin) |
| SWMR streaming | No | Yes (HDF5 1.10+) |
Practical Tips
- Match chunk shape to access pattern: as with HDF5, chunk shape determines I/O efficiency
- Use Blosc/LZ4 for fast decompression with good ratios; use Blosc/Zstd for better compression when read speed is secondary
- For cloud storage: set chunk size to 1–20 MB to balance object storage GET request overhead vs. data over-read
- Use xarray as your primary interface: xr.open_zarr() adds coordinate labels, CF metadata, and Dask integration in one step
- Consolidate metadata before publishing: zarr.consolidate_metadata('dataset.zarr') writes all .zgroup/.zarray/.zattrs documents into a single .zmetadata JSON, reducing GET requests for large groups
- Zarr is not a database: it has no transactions, and changing part of a chunk rewrites the whole chunk; it works best for write-once or append-only patterns
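As a rough illustration of what consolidation does, the sketch below gathers the per-array and per-group JSON documents from a `dict` store into one `.zmetadata` entry, following the v2 consolidated-metadata layout as far as it is understood here. The `consolidate` helper is hypothetical and the `.zarray` content is abridged. In practice the library call from the tip above should be used.

```python
import json

def consolidate(store):
    """Collect every .zgroup/.zarray/.zattrs document into one .zmetadata entry."""
    meta_names = (".zgroup", ".zarray", ".zattrs")
    collected = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in meta_names
    }
    store[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": collected}
    ).encode()
    return store

store = {
    ".zgroup": b'{"zarr_format": 2}',
    "temperature/.zarray": b'{"zarr_format": 2, "shape": [8760, 721, 1440]}',  # abridged
    "temperature/.zattrs": b'{"units": "K"}',
    "temperature/0.0.0": b"...chunk bytes...",
}
consolidate(store)
print(sorted(json.loads(store[".zmetadata"])["metadata"]))
# ['.zgroup', 'temperature/.zarray', 'temperature/.zattrs']
```

A reader can then fetch one object to learn the whole hierarchy instead of issuing a GET per array and group.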
Zarr's cloud-native chunked storage model makes it the format of choice wherever scientific data must be accessed at scale from object storage — the successor to HDF5 for the cloud-computing era.