Zarr: Cloud-Native Chunked Arrays for Scientific Computing

By Pablo Cirre • Updated May 7, 2026

What Is Zarr?

Zarr is an open-source storage format and Python library for chunked, compressed, N-dimensional arrays. Designed from the ground up for cloud and distributed computing, Zarr stores each chunk of an array as a separate object — in a local directory, an S3 bucket, Google Cloud Storage, Azure Blob Storage, or any other key-value store — enabling efficient parallel reads and writes without the file-locking and serialization bottlenecks of monolithic formats like HDF5.

Zarr is widely used for large-scale climate data (the Pangeo ecosystem), genomics (AnnData's .zarr backend), Earth observation (the Open Data Cube), and machine learning training datasets, where terabyte-scale arrays must be read efficiently from object storage.

The format has two active versions:

Zarr v2: widely deployed, stable, supported by virtually all scientific Python tooling
Zarr v3: redesigned metadata, better extension model, native support for sharding (multiple chunks per object) — in active adoption as of 2024–2025

Core Concepts

Arrays and Chunks

A Zarr array is conceptually identical to a NumPy array — an N-dimensional grid of typed data — but its storage is chunked:

my_array/                   ← Zarr array directory
    .zarray                 ← JSON metadata (shape, dtype, chunk shape, compressor)
    .zattrs                 ← JSON attributes (units, coordinates, etc.)
    0.0.0                   ← Chunk [0,0,0]
    0.0.1                   ← Chunk [0,0,1]
    0.1.0                   ← Chunk [0,1,0]
    ...
    9.15.23                 ← Chunk [9,15,23]

Each chunk file is an independent compressed binary blob. Reading a subset of the array reads only the relevant chunk files — no sequential scan, no decompression of irrelevant data.
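
To make the chunk lookup concrete, the sketch below computes which chunk keys a selection touches, assuming the Zarr v2 layout shown above (dot-separated chunk indices). The function name chunk_keys_for_selection is illustrative and not part of the zarr API, and the chunk shape and index ranges are example values.

```python
from itertools import product


def chunk_keys_for_selection(selection, chunks, sep="."):
    """Return the chunk keys that overlap a selection.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunks:    chunk length per dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunks):
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // size
        last = (stop - 1) // size  # chunk holding the last selected element
        per_dim.append(range(first, last + 1))
    return [sep.join(str(i) for i in idx) for idx in product(*per_dim)]


# Example values (assumed): chunk shape (24, 90, 180) and a selection covering
# the first 24 steps plus an index window on the other two axes.
keys = chunk_keys_for_selection(
    selection=[(0, 24), (400, 560), (720, 900)],
    chunks=(24, 90, 180),
)
print(keys)  # ['0.4.4', '0.5.4', '0.6.4']
```

Only those three objects would need to be fetched and decompressed for this selection; every other chunk stays untouched.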

Groups

Zarr groups are containers (directories or prefix namespaces in object storage) that hold arrays and other groups:

dataset.zarr/
    .zgroup                 ← marks this as a Zarr group
    .zattrs                 ← group-level attributes
    temperature/            ← sub-array
        .zarray
        0.0
        0.1
        ...
    precipitation/
        .zarray
        ...
    time/
        .zarray
        0
        1
        ...

Stores

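Zarr's key design feature is the store abstraction: an array can live in any backend that maps string keys to bytes. The store classes below are the zarr-python 2.x names (zarr-python 3 renames several of them, for example LocalStore and FsspecStore):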

zarr.DirectoryStore('path/') — local filesystem
zarr.ZipStore('archive.zarr.zip') — single ZIP file
zarr.N5Store('path/') — N5 format compatibility (neuroscience)
s3fs.S3FileSystem() + zarr.storage.FSStore — AWS S3
gcsfs.GCSFileSystem() + zarr.storage.FSStore — Google Cloud Storage
adlfs.AzureBlobFileSystem() + zarr.storage.FSStore — Azure Blob
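
The store contract is small enough to imitate with a plain dictionary. The sketch below uses only the standard library and writes Zarr v2 metadata documents by hand to show that a group is nothing more than a set of keys sharing a prefix. The list_members helper is hypothetical, and the tiny array shape is an assumption chosen for illustration.

```python
import json

# A plain dict standing in for a store: string keys -> bytes values.
store = {}

# Group marker and array metadata, written by hand in the Zarr v2 layout.
store[".zgroup"] = json.dumps({"zarr_format": 2}).encode()
store["temperature/.zarray"] = json.dumps({
    "zarr_format": 2,
    "shape": [48, 4],
    "chunks": [24, 4],
    "dtype": "<f4",
    "compressor": None,   # no compression in this sketch
    "fill_value": "NaN",
    "filters": None,
    "order": "C",
}).encode()
# Two uncompressed chunks of 24 * 4 float32 values each (all zero bytes here).
store["temperature/0.0"] = bytes(24 * 4 * 4)
store["temperature/1.0"] = bytes(24 * 4 * 4)


def list_members(store, prefix=""):
    """List the immediate child names under a key prefix (hypothetical helper)."""
    names = set()
    for key in store:
        if key.startswith(prefix):
            head = key[len(prefix):].split("/", 1)[0]
            if not head.startswith(".z"):  # skip metadata documents
                names.add(head)
    return sorted(names)


print(list_members(store))                  # ['temperature']
print(list_members(store, "temperature/"))  # ['0.0', '1.0']
meta = json.loads(store["temperature/.zarray"])
print(meta["chunks"])                       # [24, 4]
```

Swapping the dictionary for a directory, a ZIP file, or an S3 prefix changes where the keys live, not what they mean.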

Working with Zarr in Python

Creating Arrays

import zarr
import numpy as np

# Open/create a local directory store
store = zarr.DirectoryStore('climate_data.zarr')
root = zarr.group(store=store)

# Create a chunked, compressed array
temp = root.create_dataset(
    'temperature',
    shape=(8760, 721, 1440),    # hours × lat × lon (one year, 0.25° global)
    chunks=(24, 90, 180),       # one day × 22.5° lat × 45° lon
    dtype=np.float32,
    compressor=zarr.Blosc(cname='lz4', clevel=5, shuffle=zarr.Blosc.SHUFFLE),
    fill_value=np.nan
)

# Attach metadata attributes
temp.attrs['units'] = 'K'
temp.attrs['long_name'] = '2m air temperature'
temp.attrs['grid_mapping'] = 'crs'

print(temp.info)
# Type                : zarr.core.Array
# Data type           : float32
# Shape               : (8760, 721, 1440)
# Chunk shape         : (24, 90, 180)
# Order               : C
# Read-only           : False
# Compressor          : Blosc(cname='lz4', clevel=5, shuffle=1, blocksize=0)

Writing Data

# Write a slice (automatically identifies and compresses affected chunks)
temp[0:24, :, :] = np.random.randn(24, 721, 1440).astype(np.float32) + 280

# Write from a NumPy array
data = np.load('era5_2023.npy')  # shape (8760, 721, 1440)
temp[:] = data                    # writes all chunks

Reading Data: Partial Reads

# Open existing store
root = zarr.open('climate_data.zarr', mode='r')
temp = root['temperature']            # reads metadata only, no chunk data yet

# Slicing is eager: it reads only the chunks that overlap the region
# and returns an in-memory NumPy array (index ranges are illustrative)
region = temp[:, 400:560, 720:900]    # about 1 GB as float32
print(region.shape)                   # (8760, 160, 180)

# Compute statistics on a subset
jan = temp[0:744, :, :]               # January (744 hours), about 3.1 GB in memory
print(f"Global Jan mean: {jan.mean():.2f} K")

Cloud Access — S3 Without Downloading

import s3fs
import zarr

# Access a public Zarr dataset on S3 (no download); the bucket path is illustrative
s3 = s3fs.S3FileSystem(anon=True)
store = zarr.storage.FSStore('s3://era5-pds/zarr/2023/', fs=s3)
root = zarr.open(store, mode='r')

print(list(root.keys()))             # ['air_pressure_at_mean_sea_level', ...]
temp = root['air_temperature_at_2_metres']
print(temp.shape)                    # (8760, 721, 1440): 2023 has 365 × 24 hourly steps

# Read a small spatial window for January 1 (index ranges are illustrative)
jan1_window = temp[0:24, 520:580, 720:780]
print(jan1_window.shape)             # (24, 60, 60)

Zarr with Dask: Parallel Computing

Zarr pairs naturally with Dask for parallel, out-of-core computation. xarray opens a Zarr store as lazy, Dask-backed arrays when Dask is installed:

import xarray as xr

# Open Zarr as xarray Dataset (adds coordinate labels, CF conventions)
ds = xr.open_zarr('s3://era5-pds/zarr/2023/', storage_options={'anon': True})
print(ds)

# Lazy computation — builds a task graph, no data read yet
annual_mean = ds['air_temperature_at_2_metres'].mean(dim='time')

# Execute in parallel (reads chunks concurrently)
result = annual_mean.compute()   # returns xarray DataArray
result.to_netcdf('annual_mean_2023.nc')
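
The task graph behind a reduction such as a mean can be pictured as one independent partial result per chunk, followed by a combine step. The sketch below is a stand-in that uses only the Python standard library, not Dask's actual scheduler. The names partial_sum_count and chunked_mean are hypothetical, and the in-memory lists stand in for chunks that would be fetched from a store.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(chunk):
    """Reduce one chunk to a (sum, count) pair; chunks never depend on each other."""
    return sum(chunk), len(chunk)


def chunked_mean(chunks, max_workers=4):
    """Compute a mean by reducing chunks in parallel, then combining the partials."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three unequal "chunks" standing in for stored chunk payloads.
chunks = [[280.0, 281.0], [279.0, 282.0, 283.0], [285.0]]
print(chunked_mean(chunks))
```

Combining (sum, count) pairs, rather than averaging per-chunk means, keeps the result correct when chunks hold different numbers of elements, as edge chunks often do.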

Zarr v3 and Sharding

Zarr v3 introduces sharding — storing multiple chunks inside a single object (shard). This solves the "too many small files" problem on object storage (where each GET request has fixed overhead):

# Zarr v3 with sharding: 10×10 chunks per shard object (requires zarr-python 3.x)
import zarr
import zarr.codecs

arr = zarr.create_array(
    store='sharded.zarr',
    shape=(10000, 10000),
    chunks=(100, 100),     # logical chunk size (unit of reading and compression)
    shards=(1000, 1000),   # shard size (10×10 chunks per stored object)
    dtype='float32',
    compressors=zarr.codecs.BloscCodec(),
    overwrite=True,        # replace any existing array at this path
)
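
The effect of sharding on object counts is simple arithmetic. The sketch below works it out for the shapes used above, assuming each shard length is a whole multiple of the chunk length; object_counts is a hypothetical helper, not a zarr function.

```python
from math import ceil, prod


def object_counts(shape, chunks, shards):
    """Return (n_chunks, n_shards, chunks_per_shard) for a sharded layout."""
    for c, s in zip(chunks, shards):
        if s % c != 0:
            raise ValueError("shard shape must be a multiple of chunk shape")
    n_chunks = prod(ceil(n / c) for n, c in zip(shape, chunks))
    n_shards = prod(ceil(n / s) for n, s in zip(shape, shards))
    chunks_per_shard = prod(s // c for c, s in zip(chunks, shards))
    return n_chunks, n_shards, chunks_per_shard


print(object_counts((10000, 10000), (100, 100), (1000, 1000)))
# (10000, 100, 100): 10,000 chunks packed into 100 shard objects
```

Without sharding, the same array would be stored as 10,000 separate objects; with it, the store holds 100 objects while reads can still target individual 100×100 chunks.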

Zarr in Major Ecosystems

Pangeo: cloud-native geoscience, with ERA5 reanalysis, CMIP6 climate projections, and GOES-16 satellite data published as Zarr in cloud object storage
OME-Zarr (NGFF): biological microscopy standard; whole-slide images and light-sheet Z-stacks stored as multiscale Zarr pyramids
AnnData: single-cell genomics, with .h5ad (HDF5) and .zarr backends for scanpy workflows
TileDB: a separate array storage engine with its own format, built on similar chunked (tiled) storage concepts
Kerchunk / VirtualiZarr: reference-based techniques that expose NetCDF/HDF5/GRIB2 files through the Zarr interface without copying data

Comparison: Zarr vs. HDF5

Feature	Zarr	HDF5
Storage model	Key-value (directory/object)	Single binary file
Cloud-native	Yes: chunks are independent objects	Limited: one file, so remote reads need byte-range requests into it or a full download
Parallel writes	Yes, to distinct chunks (no file locking)	Limited (parallel HDF5 requires MPI)
Remote access	One HTTP/S3 GET per chunk	Range reads into the file, a full download, or a server such as HSDS
Format metadata	JSON (.zarray, .zattrs)	Binary (embedded in file)
Ecosystem	Python-first (dask, xarray, pangeo)	C-first, broad language support
Arbitrary key-value stores	Yes	No
Compression codecs	Blosc, LZ4, Zstd, Zlib, GZip, BZ2	Deflate, SZIP, LZF, Blosc (plugin)
SWMR streaming	No	Yes (HDF5 1.10+)

Practical Tips

Match chunk shape to access pattern — exactly the same rule as HDF5: chunk shape determines I/O efficiency
Use Blosc/LZ4 for fast decompression with good ratios; use Blosc/Zstd for better compression when read speed is secondary
For cloud storage: set chunk size to 1–20 MB to balance object storage GET request overhead vs. data over-read
Use xarray as your primary interface — xr.open_zarr() adds coordinate labels, CF metadata, and Dask integration in one step
Consolidate metadata before publishing: zarr.consolidate_metadata('dataset.zarr') writes all .zarray/.zattrs into a single .zmetadata JSON, reducing GET requests for large groups
Zarr is not a database: it has no transactions, and updating part of a chunk means reading, modifying, and rewriting the whole chunk, so it fits write-once or append-only patterns best
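
A quick way to check a chunk shape against the 1–20 MB guideline is to compute its uncompressed size and the resulting chunk grid. The sketch below does this for the temperature array from the earlier example; chunk_report is a hypothetical helper, and the sizes are uncompressed upper bounds, since stored objects shrink after compression.

```python
from math import ceil, prod


def chunk_report(shape, chunks, itemsize):
    """Return (uncompressed MB per full chunk, chunk-grid shape, total chunk count)."""
    nbytes = prod(chunks) * itemsize
    grid = tuple(ceil(n / c) for n, c in zip(shape, chunks))
    return nbytes / 1e6, grid, prod(grid)


# Values from the temperature example: float32 has 4 bytes per element.
mb, grid, total = chunk_report((8760, 721, 1440), (24, 90, 180), itemsize=4)
print(f"{mb:.2f} MB per chunk, grid {grid}, {total} chunks")
# 1.56 MB per chunk, grid (365, 9, 8), 26280 chunks
```

The grid also exposes edge effects: 721 latitude rows do not divide evenly by 90, so the ninth row of chunks holds a single valid latitude row. A chunk length of 103 (721 = 7 × 103) would avoid that, if it suits the access pattern.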

Zarr's cloud-native chunked storage model makes it the format of choice wherever scientific data must be accessed at scale from object storage — the successor to HDF5 for the cloud-computing era.

Frequently Asked Questions

What is the difference between Zarr and HDF5?

Zarr stores each chunk of an N-dimensional array as a separate object (a file or a cloud storage object), which makes it a natural fit for object storage. HDF5 packs everything into a single binary file, so remote access typically means a full download, byte-range reads into the file, or a dedicated server such as HSDS. Zarr supports parallel writes to distinct chunks without file locking (standard HDF5 needs the MPI-based parallel build for that), and any chunk can be read with a simple HTTP GET request from S3 or GCS. HDF5 has broader language support and SWMR streaming. Use Zarr for cloud and distributed workflows; use HDF5 for local or HPC-parallel use cases.

How do I read Zarr data from S3 without downloading the whole dataset?

Use s3fs and zarr together (zarr-python 2.x API): `import s3fs, zarr; s3 = s3fs.S3FileSystem(anon=True); store = zarr.storage.FSStore("s3://bucket/path/", fs=s3); root = zarr.open(store, mode="r")`. Slicing the array sends HTTP GET requests only for the chunks that overlap your selection, so no full download occurs. With xarray, `xr.open_zarr("s3://bucket/path/", storage_options={"anon": True})` adds coordinate labels and lazy Dask computation.

What chunk size should I use for cloud storage?

For cloud object storage (S3, GCS, Azure), target 1–20 MB per chunk. With chunks that are too small (< 100 KB), GET request overhead dominates; with chunks that are too large (> 50 MB), small requests read far more data than they need. Match the chunk shape to your most common access pattern: if you typically read full 2D spatial slices, make chunks cover the full spatial dimensions. Use the `rechunker` library from the Pangeo community to rechunk existing datasets without loading everything into memory.

What is OME-Zarr?

OME-Zarr, the Zarr-based implementation of the OME Next-Generation File Format (NGFF), is a community specification for storing biological microscopy data in Zarr. It defines conventions for multiscale image pyramids (progressively downsampled resolution levels), coordinate transforms, channel metadata, and labels (segmentation masks). A whole-slide image or a 3D light-sheet stack is stored as a Zarr group with multiple resolution levels, which enables efficient viewing at any zoom level without loading the full dataset. Major tools such as napari, Fiji (via BigDataViewer), QuPath, and OMERO support OME-Zarr.
