What Is Apache Arrow?
Apache Arrow is an open-source, language-independent columnar memory format for flat and hierarchical data. Created by Wes McKinney (the creator of pandas) and donated to the Apache Software Foundation in 2016, Arrow has become the lingua franca of in-memory data exchange in the modern data stack.
The key innovation of Arrow is not what it stores — it is how it stores it: in a standardized, zero-copy columnar memory layout that is identical across all programming languages and runtime environments. This means a Python process can hand data directly to a Java process, a C++ process, a Go process, or an R process, with no serialization, deserialization, or data copying overhead.
The Problem Arrow Solves
Before Arrow, moving data between tools required expensive serialization. Imagine a pipeline:
pandas DataFrame → serialize to CSV → read into Spark → serialize to JSON → read into TensorFlow
Each serialization step copies all the data, converts it to a generic format, and then the receiver re-parses it from scratch. For a 10 GB dataset, this might require 50+ GB of memory and minutes of CPU time just for the format conversions.
With Arrow, the same pipeline becomes:
pandas DataFrame → Arrow Table → Spark DataFrame → TensorFlow tensor
(zero-copy share) (zero-copy share) (zero-copy share)
Each step shares pointers to the same memory buffers — no copying, no parsing, no format conversion.
Arrow File Formats
Arrow defines two file/stream formats for persistence and inter-process communication:
Arrow IPC File Format (.arrow)
The Arrow IPC (Inter-Process Communication) File Format (also called the "Feather v2" format) is a random-access file format for storing Arrow data on disk. Files end with .arrow (or .feather for the legacy Feather v1 format).
Structure:
Magic bytes ("ARROW1\0\0")
Schema message (field names, types, metadata)
[DictionaryBatch messages] (for dictionary-encoded columns)
[RecordBatch messages] (rows of actual data, in columnar layout)
[RecordBatch messages]
...
Footer (schema + offsets to all record batches — enables random access)
Magic bytes ("ARROW1")
Arrow IPC Stream Format
The stream format is designed for streaming data transmission (sockets, pipes, message queues). Unlike the file format, it does not have a footer — it can be consumed without buffering the entire stream.
Schema message
[DictionaryBatch messages]
RecordBatch message 1
RecordBatch message 2
...
EOS marker (end-of-stream)
Comparison with Parquet
| Feature | Arrow IPC | Parquet |
|---|---|---|
| Orientation | Columnar, optimized for read speed | Columnar, optimized for storage efficiency |
| Compression | Optional (LZ4, ZSTD) | Strong (Snappy, ZSTD, GZIP) |
| Read speed | Extremely fast (memory-mapped, zero-copy) | Fast |
| Write speed | Extremely fast | Medium |
| File size | Larger (less compressed) | Smaller |
| Random access | Yes (file format) | Yes (row group level) |
| Use case | In-memory cache, fast interchange | Long-term storage, analytics |
| Extension | .arrow, .feather |
.parquet |
Rule of thumb: Use Parquet for storage (smaller files, CDN, S3), use Arrow for computation (fast reads, in-memory DataFrames, cross-language pipelines).
Arrow Data Types
Arrow has a rich type system that covers all common data patterns:
Primitive types:
int8,int16,int32,int64— signed integersuint8,uint16,uint32,uint64— unsigned integersfloat16,float32,float64— IEEE 754 floatsbool— 1-bit packed booleansnull— all-null type
Variable-length types:
utf8,large_utf8— UTF-8 strings (32-bit vs 64-bit offsets)binary,large_binary— raw byte arrays
Temporal types:
date32,date64— days or milliseconds since epochtimestamp[unit, tz]— timestamp with optional timezonetime32[unit],time64[unit]— time of dayduration[unit]— time spaninterval— calendar interval (months, days, nanoseconds)
Nested types:
list<T>— variable-length list of Tstruct<fields>— named fields (like a row type or JSON object)map<K,V>— key-value pairsunion<types>— tagged union (like Rust's enum)
Specialized types:
dictionary<int, T>— dictionary encoding (like Parquet's dictionary encoding)extension<name, storage, metadata>— user-defined types
Working with Apache Arrow in Python
# pip install pyarrow
import pyarrow as pa
import pyarrow.ipc as ipc
import numpy as np
# Create an Arrow Table from Python objects
table = pa.table({
'id': pa.array([1, 2, 3, 4, 5], type=pa.int32()),
'name': pa.array(['Alice', 'Bob', 'Carol', 'Dave', 'Eve']),
'score': pa.array([95.2, 87.3, 91.8, 78.5, 88.0], type=pa.float64()),
'active': pa.array([True, True, False, True, False]),
})
print(table.schema)
# id: int32
# name: string
# score: double
# active: bool
print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")
print(f"Memory: {table.nbytes / 1024:.1f} KB")
# Column operations (no copying!)
names = table.column('name') # pyarrow.ChunkedArray
scores = table.column('score').to_pylist() # Python list
# Filter rows
import pyarrow.compute as pc
filtered = table.filter(pc.greater(table['score'], 88))
print(filtered.to_pydict())
Reading and Writing Arrow Files
import pyarrow as pa
import pyarrow.ipc as ipc
# Write Arrow IPC file
with ipc.new_file('data.arrow', table.schema) as writer:
writer.write_table(table)
# Read Arrow IPC file
with ipc.open_file('data.arrow') as reader:
schema = reader.schema_arrow
table2 = reader.read_all()
# Memory-mapped read (truly zero-copy — reads directly from disk mapping)
source = pa.memory_map('data.arrow', 'r')
reader = ipc.open_file(source)
table3 = reader.read_all()
# table3 column buffers point directly into the mmap'd file — no copy!
# Streaming format (for pipes, sockets)
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
writer.write_table(table)
buf = sink.getvalue()
reader = ipc.open_stream(buf)
table4 = reader.read_all()
Integration with pandas
import pandas as pd
import pyarrow as pa
# pandas → Arrow (zero-copy for numeric types)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
arrow_table = pa.Table.from_pandas(df)
# Arrow → pandas (zero-copy for numeric types)
df2 = arrow_table.to_pandas()
# Arrow preserves pandas metadata (index, dtypes) via schema metadata
Integration with Polars
import polars as pl
import pyarrow as pa
# Polars uses Arrow natively — LazyFrames are Arrow-backed
lf = pl.scan_ipc('data.arrow')
result = lf.filter(pl.col('score') > 88).collect()
# Convert between Polars and Arrow
arrow_table = result.to_arrow()
back_to_polars = pl.from_arrow(arrow_table)
Arrow Flight: High-Performance Data Transfer
Arrow Flight is a gRPC-based protocol for transferring Arrow data over the network at extreme speeds. It eliminates the overhead of traditional REST/HTTP APIs by streaming Arrow record batches directly without serialization to JSON or binary formats like Protocol Buffers.
# pip install pyarrow[flight]
import pyarrow.flight as fl
# Flight server (simplified)
class DataFlightServer(fl.FlightServerBase):
def do_get(self, context, ticket):
table = pa.table({'a': [1, 2, 3], 'b': [4.5, 5.5, 6.5]})
return fl.GeneratorStream(table.schema, [table.to_batches()[0]])
server = DataFlightServer("grpc://localhost:8815")
server.serve()
# Flight client
client = fl.connect("grpc://localhost:8815")
reader = client.do_get(fl.Ticket(b"my_table"))
table = reader.read_all()
print(table)
Arrow Flight achieves 10-50× throughput compared to REST APIs for large dataset transfers — a critical advantage in data engineering pipelines.
Arrow in the Ecosystem
Arrow's columnar memory format is used internally by:
- pandas 2.0+: Arrow-backed DataFrames are the new default option (
pd.read_parquet(..., dtype_backend='pyarrow')) - Polars: entirely Arrow-native; uses Arrow IPC as its interchange format
- Apache Spark: ArrowOptimize (
spark.sql.execution.arrow.pyspark.enabled=true) enables zero-copy Python↔Spark transfers - DuckDB: uses Arrow internally;
duckdb.arrow()returns an Arrow Table directly - RAPIDS cuDF (GPU DataFrames): Arrow-compatible memory layout for GPU↔CPU data exchange
- R
{arrow}package: full Arrow integration for R data frames - ADBC (Arrow Database Connectivity): Arrow-native replacement for ODBC/JDBC
The standardization of Arrow as the in-memory format means that a Polars DataFrame, a DuckDB query result, a pandas DataFrame, and a Spark partition can all share the same memory buffers without any copying — the ultimate data interoperability story.
Related conversions
Frequent conversions across the catalogue: