Apache Arrow Format Guide: Zero-Copy Columnar In-Memory Data

By Pablo Cirre • Updated May 7, 2026

What Is Apache Arrow?

Apache Arrow is an open-source, language-independent columnar memory format for flat and hierarchical data. Co-created by Wes McKinney (the creator of pandas) with developers from several other open-source data projects, it became a top-level Apache Software Foundation project in 2016 and is now a common in-memory interchange format across the modern data stack.

The key innovation of Arrow is not what it stores but how it stores it: in a standardized columnar memory layout that is identical across programming languages and runtimes. Because the Python, Java, C++, Go, and R libraries all agree on the same byte layout, data can be handed from one to another without being re-encoded. Inside a single process, or across processes through shared memory or a memory-mapped file, this can happen with no copying at all; across a socket the bytes still have to be transferred, but they need no parsing on arrival.

The Problem Arrow Solves

Before Arrow, moving data between tools required expensive serialization. Imagine a pipeline:

pandas DataFrame → serialize to CSV → read into Spark → serialize to JSON → read into TensorFlow

Each serialization step copies all the data, converts it to a generic format, and forces the receiver to re-parse it from scratch. For a 10 GB dataset, the intermediate copies can occupy several times the dataset's size in memory and cost minutes of CPU time on format conversion alone.

With Arrow, the same pipeline becomes:

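pandas DataFrame → Arrow Table → Spark DataFrame → TensorFlow tensor
   (Arrow buffers)              (Arrow buffers)      (Arrow buffers)

Each step passes Arrow buffers instead of re-encoded text. Within one process the tools can share pointers to the same memory; across a process boundary (for example Python to the Spark JVM) the buffers are still transferred, but without parsing or format conversion.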

Arrow File Formats

Arrow defines two file/stream formats for persistence and inter-process communication:

Arrow IPC File Format (.arrow)

The Arrow IPC (Inter-Process Communication) File Format is a random-access file format for storing Arrow data on disk. Feather V2 is this same format under another name. Files usually end with .arrow; the .feather extension is also common and is shared with the legacy Feather V1 format, which has a different layout.

Structure:

Magic bytes ("ARROW1" followed by 2 padding bytes)
Schema message (field names, types, metadata)
[DictionaryBatch messages] (for dictionary-encoded columns)
[RecordBatch messages]  (rows of actual data, in columnar layout)
[RecordBatch messages]
...
Footer (schema + offsets to all record batches; enables random access)
Footer size (32-bit little-endian integer)
Magic bytes ("ARROW1")

Arrow IPC Stream Format

The stream format is designed for streaming data transmission (sockets, pipes, message queues). Unlike the file format, it does not have a footer — it can be consumed without buffering the entire stream.

Schema message
[DictionaryBatch messages]
RecordBatch message 1
RecordBatch message 2
...
EOS marker (end-of-stream)

Comparison with Parquet

Feature	Arrow IPC	Parquet
Orientation	Columnar, optimized for read speed	Columnar, optimized for storage efficiency
Compression	Optional (LZ4, ZSTD)	Strong (Snappy, ZSTD, GZIP)
Read speed	Extremely fast (memory-mapped, zero-copy)	Fast
Write speed	Extremely fast	Medium
File size	Larger (less compressed)	Smaller
Random access	Yes (file format)	Yes (row group level)
Use case	In-memory cache, fast interchange	Long-term storage, analytics
Extension	`.arrow`, `.feather`	`.parquet`

Rule of thumb: Use Parquet for storage (smaller files, CDN, S3), use Arrow for computation (fast reads, in-memory DataFrames, cross-language pipelines).

Arrow Data Types

Arrow has a rich type system that covers all common data patterns:

Primitive types:

int8, int16, int32, int64 — signed integers
uint8, uint16, uint32, uint64 — unsigned integers
float16, float32, float64 — IEEE 754 floats
bool — 1-bit packed booleans
null — all-null type

Variable-length types:

utf8, large_utf8 — UTF-8 strings (32-bit vs 64-bit offsets)
binary, large_binary — raw byte arrays

Temporal types:

date32, date64 — days or milliseconds since epoch
timestamp[unit, tz] — timestamp with optional timezone
time32[unit], time64[unit] — time of day
duration[unit] — time span
interval — calendar interval (months, days, nanoseconds)

Nested types:

list<T> — variable-length list of T
struct<fields> — named fields (like a row type or JSON object)
map<K,V> — key-value pairs
union<types> — tagged union (like Rust's enum)

Specialized types:

dictionary<int, T> — dictionary encoding (like Parquet's dictionary encoding)
extension<name, storage, metadata> — user-defined types

Working with Apache Arrow in Python

# pip install pyarrow

import pyarrow as pa
import pyarrow.ipc as ipc
import numpy as np

# Create an Arrow Table from Python objects
table = pa.table({
    'id': pa.array([1, 2, 3, 4, 5], type=pa.int32()),
    'name': pa.array(['Alice', 'Bob', 'Carol', 'Dave', 'Eve']),
    'score': pa.array([95.2, 87.3, 91.8, 78.5, 88.0], type=pa.float64()),
    'active': pa.array([True, True, False, True, False]),
})

print(table.schema)
# id: int32
# name: string
# score: double
# active: bool

print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")
print(f"Memory: {table.nbytes / 1024:.1f} KB")

# Column access: table.column() returns a view over the existing buffers
names = table.column('name')           # pyarrow.ChunkedArray, no copy
scores = table.column('score').to_pylist()  # builds a new Python list (copies)

# Filter rows with a vectorized compute kernel
import pyarrow.compute as pc
filtered = table.filter(pc.greater(table['score'], 88))
print(filtered.to_pydict())

Reading and Writing Arrow Files

import pyarrow as pa
import pyarrow.ipc as ipc

# Write Arrow IPC file
with ipc.new_file('data.arrow', table.schema) as writer:
    writer.write_table(table)

# Read Arrow IPC file
with ipc.open_file('data.arrow') as reader:
    schema = reader.schema
    table2 = reader.read_all()

# Memory-mapped read (truly zero-copy — reads directly from disk mapping)
source = pa.memory_map('data.arrow', 'r')
reader = ipc.open_file(source)
table3 = reader.read_all()
# table3 column buffers point directly into the mmap'd file — no copy!

# Streaming format (for pipes, sockets)
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

reader = ipc.open_stream(buf)
table4 = reader.read_all()

Integration with pandas

import pandas as pd
import pyarrow as pa

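# pandas → Arrow (numeric columns without nulls can avoid a copy; other columns are converted)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
arrow_table = pa.Table.from_pandas(df)

# Arrow → pandas (zero-copy only in limited cases: numeric, no nulls, single chunk;
# split_blocks=True stops pandas from consolidating columns into new blocks)
df2 = arrow_table.to_pandas()

# Arrow preserves pandas metadata (index, dtypes) via schema metadata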

Integration with Polars

import polars as pl
import pyarrow as pa

# Polars uses Arrow natively — LazyFrames are Arrow-backed
lf = pl.scan_ipc('data.arrow')
result = lf.filter(pl.col('score') > 88).collect()

# Convert between Polars and Arrow
arrow_table = result.to_arrow()
back_to_polars = pl.from_arrow(arrow_table)

Arrow Flight: High-Performance Data Transfer

Arrow Flight is a gRPC-based protocol for transferring Arrow data over the network at high speed. It avoids much of the overhead of traditional REST/HTTP APIs by streaming Arrow record batches in their native binary layout: gRPC and Protocol Buffers carry only the message envelope and metadata, while the column data itself is never converted to JSON or re-encoded field by field.

# Flight ships with the standard pyarrow wheels: pip install pyarrow
import pyarrow as pa
import pyarrow.flight as fl

# Flight server (simplified); run this in its own process
class DataFlightServer(fl.FlightServerBase):
    def do_get(self, context, ticket):
        # A real server would use ticket.ticket to choose the dataset
        table = pa.table({'a': [1, 2, 3], 'b': [4.5, 5.5, 6.5]})
        return fl.RecordBatchStream(table)

server = DataFlightServer("grpc://localhost:8815")
server.serve()  # blocks until the server is shut down

# Flight client; run this in a second process
import pyarrow.flight as fl

client = fl.connect("grpc://localhost:8815")
reader = client.do_get(fl.Ticket(b"my_table"))
table = reader.read_all()
print(table)

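Arrow Flight can achieve roughly 10-50× the throughput of JSON-over-REST APIs for large dataset transfers, although the actual gain depends on the data, the network, and the client implementation. This is a significant advantage in data engineering pipelines.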

Arrow in the Ecosystem

Arrow's columnar memory format is used internally by:

pandas 2.0+: optional Arrow-backed dtypes (for example pd.read_parquet(..., dtype_backend='pyarrow')); the NumPy backend remains the default
Polars: entirely Arrow-native; uses Arrow IPC as its interchange format
Apache Spark: Arrow-based columnar transfer (spark.sql.execution.arrow.pyspark.enabled=true) speeds up Python↔Spark conversions such as toPandas() and pandas UDFs
DuckDB: can query Arrow tables in place and return query results as Arrow data through the .arrow() method on its result objects
RAPIDS cuDF (GPU DataFrames): Arrow-compatible memory layout for GPU↔CPU data exchange
R {arrow} package: full Arrow integration for R data frames
ADBC (Arrow Database Connectivity): Arrow-native replacement for ODBC/JDBC

Because Arrow is the shared in-memory format, a Polars DataFrame, a DuckDB query result, an Arrow-backed pandas DataFrame, and data exchanged with Spark can be passed between tools with little or no conversion, and within a single process often without copying the underlying buffers.

Frequently Asked Questions

What is the difference between Apache Arrow and Parquet?

Apache Arrow is a standardized in-memory columnar format designed for fast reads and zero-copy data sharing between processes and languages. Parquet is a columnar file format optimized for storage efficiency (strong compression, smaller files). Think of Arrow as the RAM format and Parquet as the disk format: Parquet stores data compactly on disk, you read it into Arrow memory for fast computation, process it, then write results back to Parquet. Uncompressed Arrow files (.arrow/.feather) typically read faster than Parquet because they need no decoding step and can be memory-mapped, so the operating system pages data in on demand instead of the reader copying it into new buffers.

What does zero-copy mean in Arrow?

Zero-copy means that when you pass Arrow data from one library to another, no data is duplicated in memory; instead, the receiver gets a pointer to the same memory buffer the sender used. For example, when you call `table.to_pandas()` on an Arrow Table whose numeric columns contain no nulls and sit in a single chunk, pandas can wrap the Arrow buffer directly without copying the numbers (strings, nulls, and multi-chunk columns are converted instead). This is possible because Arrow defines a single standardized memory layout that all Arrow-aware libraries understand. The practical impact is largest for big datasets: handing a 10 GB Arrow table to DuckDB in the same process takes negligible time and no additional memory for the hand-off, whereas sharing the same data as a 10 GB CSV would require reading and parsing the entire file.

How do I read and write Arrow files in Python?

Use PyArrow's IPC file format: `import pyarrow as pa` and `import pyarrow.ipc as ipc`. To write: `with ipc.new_file("data.arrow", table.schema) as w: w.write_table(table)`. To read: `with ipc.open_file("data.arrow") as r: table = r.read_all()`. For memory-mapped (truly zero-copy) reading: `source = pa.memory_map("data.arrow", "r"); table = ipc.open_file(source).read_all()`; the column buffers then point directly into the disk-mapped memory with no copying, provided the file is uncompressed. The `.feather` extension is used both by the legacy Feather V1 format and by Feather V2, which is the Arrow IPC file format; `pd.read_feather()` and `DataFrame.to_feather()` read and write it through PyArrow, producing V2 files by default in current versions.

What is Arrow Flight?

Arrow Flight is a high-performance network protocol for transferring Arrow data, built on gRPC. It streams Arrow record batches over the wire in their native binary layout, using Protocol Buffers only for the message envelope and metadata, so the column data is never converted to JSON or re-encoded field by field. For large data transfers this can yield roughly 10-50× better throughput than JSON-based REST APIs, depending on the workload. Flight defines a standard service model: clients can list available datasets (ListFlights), get their schemas (GetSchema), download data (DoGet), and upload data (DoPut). It is used by database systems such as Dremio and InfluxDB for fast data serving without the overhead of traditional data APIs.
