What Is Apache Avro?
Apache Avro is an open-source data serialization framework developed within the Apache Hadoop project and first released in 2009. Unlike Protocol Buffers (which relies on generated code in typical use) or JSON (which is schema-less), Avro takes a distinctive approach: schemas are defined in JSON, container files store the schema alongside the data, and the binary encoding is interpreted from the schema at runtime. Code generation is optional and mainly useful in statically typed languages.
Avro is one of the most widely used serialization formats with Apache Kafka (the dominant event streaming platform), particularly in combination with a schema registry, and it is common across the Hadoop/Spark ecosystem. Kafka itself is format-agnostic and treats message payloads as opaque bytes. Avro's combination of compact binary encoding, embedded schema, and schema evolution rules suits event streaming and data pipeline scenarios where data formats change over time.
How Avro Works
Schema Definition
Avro schemas are written in JSON. A schema describes a data type — typically a record (like a struct or object):
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "address", "type": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "country", "type": "string", "default": "US"}
      ]
    }}
  ]
}
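Because a schema is ordinary JSON, it can be inspected with any JSON parser. The following is a minimal sketch, using only Python's standard library, that loads a shortened version of the User schema and summarizes each field's type. The `describe_type` helper is a hypothetical name introduced here for illustration; Avro libraries parse and validate schemas through their own APIs.

```python
import json

# Shortened version of the User schema (an illustrative subset of its fields).
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
"""


def describe_type(avro_type):
    """Return a short human-readable description of an Avro type node."""
    if isinstance(avro_type, str):       # primitive name or reference to a named type
        return avro_type
    if isinstance(avro_type, list):      # a JSON array denotes a union
        return "union(" + " | ".join(describe_type(t) for t in avro_type) + ")"
    if avro_type["type"] == "array":
        return "array<" + describe_type(avro_type["items"]) + ">"
    if avro_type["type"] == "map":
        return "map<" + describe_type(avro_type["values"]) + ">"
    # record, enum, fixed (which carry a name) or an annotated primitive
    return avro_type.get("name", avro_type["type"])


schema = json.loads(schema_json)
full_name = f'{schema["namespace"]}.{schema["name"]}'
summary = {field["name"]: describe_type(field["type"]) for field in schema["fields"]}

print(full_name)
for name, type_desc in summary.items():
    print(f"  {name}: {type_desc}")
```

The nullable `age` field shows up as a union of `null` and `int`, which is the usual way to declare an optional value in Avro.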
Avro Data Types
Primitive types:
null, boolean, int, long, float, double, bytes, string
Complex types:
- record: named fields (like a struct/class)
- enum: named enumeration
- array: ordered list of one type
- map: string-keyed map of one value type
- union: value of one of multiple types (commonly ["null", "string"] for nullable fields)
- fixed: fixed-length byte array
Logical types (annotations on primitive types):
- date: int, days since epoch
- time-millis / time-micros: int / long, time of day
- timestamp-millis / timestamp-micros: long, UTC timestamp
- local-timestamp-millis / local-timestamp-micros: timestamp without timezone
- duration: fixed(12), months + days + milliseconds
- decimal: bytes or fixed, with precision and scale
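Logical types only change how the underlying primitive is interpreted. As a rough illustration, the sketch below converts Python values into the primitive representations listed above, using the standard library only. The function names are hypothetical, and Avro libraries normally perform these conversions for you.

```python
from datetime import date, datetime, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_avro(d: date) -> int:
    """'date': number of days since 1970-01-01, stored as an int."""
    return (d - EPOCH_DATE).days


def timestamp_millis_to_avro(dt: datetime) -> int:
    """'timestamp-millis': milliseconds since the Unix epoch (UTC), stored as a long."""
    delta = dt.astimezone(timezone.utc) - EPOCH_UTC
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def decimal_to_avro_bytes(value: Decimal, scale: int) -> bytes:
    """'decimal' on bytes: big-endian two's complement of the unscaled integer.

    Digits beyond `scale` are truncated in this sketch.
    """
    unscaled = int(value.scaleb(scale))
    # Enough bytes to hold the magnitude plus a sign bit.
    length = (unscaled.bit_length() + 8) // 8
    return unscaled.to_bytes(length, byteorder="big", signed=True)


days = date_to_avro(date(2025, 4, 25))
millis = timestamp_millis_to_avro(datetime(2025, 4, 25, tzinfo=timezone.utc))
price = decimal_to_avro_bytes(Decimal("12.34"), scale=2)
print(days, millis, price.hex())
```

The scale is not written with each value; it lives in the schema, which is why reader and writer must agree on it.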
The Avro Binary Encoding
Avro's binary encoding is remarkably compact because the schema is known by both writer and reader — field names and types don't need to be repeated in each record.
Avro binary for a User record:
[long: id=42] → zigzag-encoded varint: 0x54
[string: "alice"] → [length: 0x0A] [bytes: 61 6C 69 63 65]
[string: "alice@example.com"] → [length: 0x22] [bytes: ...]
[union index: 1 (int present)] → 0x02
[int: age=28] → zigzag-encoded varint: 0x38
[long: created_at=1745539200000] → zigzag-encoded varint: ...
[array: ["python","kafka"]] → [count: 0x04] [len+bytes] [len+bytes] [block end: 0]
[string: "123 Main St"] → ...
[string: "Austin"] → ...
[string: "US"] → ...
Key encoding decisions:
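- Variable-length integers: ints and longs are ZigZag-encoded and written as varints (the scheme Protocol Buffers uses for its sint types), so values from -64 to 63 take 1 byte
- Strings and bytes: prefixed with their length as a varint
- Nulls: a null value itself takes 0 bytes; inside a union only the branch index is written (index 0 when "null" is listed first)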
- Arrays and maps: blocks of items with block-count prefix; end marker is 0
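The encoding rules above are small enough to sketch directly. The following is a minimal, standard-library-only illustration that encodes a few fields of the User record; the helper names are hypothetical and the code is not a substitute for a real Avro library.

```python
def encode_long(n: int) -> bytes:
    """Encode an int/long as a zigzag varint."""
    z = (n << 1) ^ (n >> 63)            # zigzag: 0→0, -1→1, 1→2, -2→3, ...
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)   # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_nullable_int(value) -> bytes:
    """Encode a ["null", "int"] union: branch index, then the value if present."""
    if value is None:
        return encode_long(0)           # branch 0 = null, no payload follows
    return encode_long(1) + encode_long(value)


def encode_string_array(items) -> bytes:
    """One block holding all items, followed by the zero end marker."""
    if not items:
        return encode_long(0)
    body = b"".join(encode_string(item) for item in items)
    return encode_long(len(items)) + body + encode_long(0)


# id=42, username="alice", age=28, tags=["python", "kafka"]
record = (
    encode_long(42)
    + encode_string("alice")
    + encode_nullable_int(28)
    + encode_string_array(["python", "kafka"])
)
print(record.hex(" "))
```

Note that nothing in the output identifies the fields: the bytes are only meaningful to a reader that knows the writer's schema and therefore the field order.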
Avro Container Files
For persistent storage, Avro data is typically stored in Avro Object Container Files (.avro), which include:
Magic: "Obj\x01" (4 bytes)
File metadata (map):
"avro.schema" → the schema as a JSON string
"avro.codec" → compression codec: "null", "deflate", "snappy", "bzip2", "xz", "zstandard"
Sync marker (16 bytes, random)
Data block 1:
Object count (long)
Serialized byte count (long)
Serialized objects (Avro binary)
Sync marker
Data block 2 ...
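Because the same sync marker follows every data block, a reader can seek to an arbitrary offset, scan forward to the next marker, and start decoding at a block boundary. This is what lets frameworks split large Avro files for parallel processing (Hadoop MapReduce, Spark).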
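To make the container layout concrete, here is a rough sketch that writes and reads a header and one data block in memory, using only the standard library and the "null" codec (no compression). All function names are hypothetical; in practice a library such as fastavro handles this format.

```python
import io
import json
import os

MAGIC = b"Obj\x01"


def write_long(buf, n):
    """Write a zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    while z > 0x7F:
        buf.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    buf.write(bytes([z]))


def read_long(buf):
    """Read a zigzag varint."""
    shift = result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)    # undo zigzag


def write_header(buf, schema: dict, codec: str = "null") -> bytes:
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": codec.encode()}
    buf.write(MAGIC)
    write_long(buf, len(meta))              # a single map block with all entries
    for key, value in meta.items():
        key_bytes = key.encode()
        write_long(buf, len(key_bytes))
        buf.write(key_bytes)
        write_long(buf, len(value))
        buf.write(value)
    write_long(buf, 0)                      # end of the metadata map
    sync = os.urandom(16)                   # random 16-byte sync marker
    buf.write(sync)
    return sync


def read_header(buf):
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while (count := read_long(buf)) != 0:
        if count < 0:                       # negative count: a block byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = buf.read(read_long(buf)).decode()
            meta[key] = buf.read(read_long(buf))
    sync = buf.read(16)
    return json.loads(meta["avro.schema"]), meta.get("avro.codec", b"null").decode(), sync


def write_block(buf, encoded_objects, sync):
    payload = b"".join(encoded_objects)
    write_long(buf, len(encoded_objects))   # object count
    write_long(buf, len(payload))           # serialized byte count
    buf.write(payload)
    buf.write(sync)


def read_block(buf, sync):
    count = read_long(buf)
    payload = buf.read(read_long(buf))
    if buf.read(16) != sync:
        raise ValueError("sync marker mismatch")
    return count, payload


schema = {"type": "record", "name": "Order", "fields": [{"name": "id", "type": "long"}]}
buf = io.BytesIO()
sync = write_header(buf, schema)
write_block(buf, [b"\x02", b"\x04"], sync)  # two Order records: id=1 and id=2

buf.seek(0)
read_schema, codec, read_sync = read_header(buf)
count, payload = read_block(buf, read_sync)
print(read_schema["name"], codec, count, payload.hex())
```

The serialized byte count is what allows a reader to skip a whole block without decoding it, and the trailing marker confirms that the block boundary is where the reader expects it.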
Schema Evolution: The Killer Feature
Like Protocol Buffers, Avro supports safe schema evolution — but with a different mechanism. While protobuf uses field numbers, Avro uses name matching with defaults.
Writer Schema vs Reader Schema
When reading Avro data, you provide both:
- Writer schema: the schema used when the data was written
- Reader schema: the schema your code expects to read
Avro's schema resolution rules handle the differences:
// Writer schema (v1: original)
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "product", "type": "string"},
    {"name": "quantity", "type": "int"}
  ]
}
// Reader schema (v2: added "price" with default, removed "quantity")
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "product", "type": "string"},
    {"name": "price", "type": "double", "default": 0.0}
  ]
}
Resolution rules:
- Added field with default: if the writer didn't write it, use the default → compatible
- Removed field: if the reader doesn't need it, skip it in the writer data → compatible
- Type promotion: int→long, float→double → compatible
- Renamed field: use the aliases array in the reader schema to match old names
- Changed type incompatibly: breaks compatibility
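The field-level rules can be illustrated with a toy resolver. This sketch operates on a record that has already been decoded into a dict, which is a simplification: real Avro implementations apply resolution while decoding the binary data and also handle type promotion, which is omitted here. The `resolve_record` name is hypothetical.

```python
def resolve_record(written: dict, reader_schema: dict) -> dict:
    """Project a record decoded with the writer schema onto the reader schema."""
    result = {}
    for field in reader_schema["fields"]:
        # Candidate names: the reader's field name first, then any aliases.
        names = [field["name"], *field.get("aliases", [])]
        match = next((name for name in names if name in written), None)
        if match is not None:
            result[field["name"]] = written[match]
        elif "default" in field:
            result[field["name"]] = field["default"]    # added field: use the default
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return result                                       # writer-only fields are dropped


written_v1 = {"id": 1, "product": "widget", "quantity": 3}

reader_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "product", "type": "string"},
        {"name": "price", "type": "double", "default": 0.0},
    ],
}

# Hypothetical variant in which "product" was renamed to "item".
reader_renamed = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "item", "type": "string", "aliases": ["product"]},
    ],
}

resolved = resolve_record(written_v1, reader_v2)
renamed = resolve_record(written_v1, reader_renamed)
print(resolved)
print(renamed)
```

A reader field with neither a written value nor a default raises an error, which mirrors why adding a field without a default is not a compatible change for readers of old data.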
Avro with Apache Kafka and Schema Registry
In Kafka deployments, Avro schemas are typically managed by Confluent Schema Registry, a separate service that stores and versions schemas under named subjects:
Schema Registry
┌─────────────────┐
Producer ──schema─→ │ subject: │
──id+data→ │ "orders-value" │
│ v1: {...} │
│ v2: {...} │
└─────────────────┘
Consumer ──id─────→ Schema Registry (fetch schema by ID)
←─schema── (cached locally)
←─decode── Avro binary
In the Confluent wire format, each Kafka Avro message starts with a 5-byte header: a magic byte (\x00) followed by a 4-byte schema ID (big-endian int32). The Avro-binary-encoded payload comes after the header.
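The framing itself is simple enough to sketch with the standard `struct` module. The `frame` and `unframe` helpers below are hypothetical names for illustration; the Schema Registry serializers in Kafka client libraries add and strip this header for you.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")   # 1-byte magic + 4-byte big-endian schema ID = 5 bytes


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message too short for the 5-byte header")
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


message = frame(7, b"\x54")     # schema ID 7, payload = the long 42
schema_id, payload = unframe(message)
print(message.hex(" "), schema_id, payload.hex())
```

Since the message carries only the ID, a consumer needs access to the registry (or a warm local cache) to obtain the writer schema before it can decode the payload.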
# pip install "confluent-kafka[avro]"
# Note: AvroProducer/AvroConsumer are the library's legacy Avro API;
# newer releases also provide AvroSerializer/AvroDeserializer.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer

schema_str = """{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}"""

# Producer: serializes values with the given schema via the registry
producer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092',
     'schema.registry.url': 'http://localhost:8081'},
    default_value_schema=avro.loads(schema_str)
)
producer.produce(
    topic='orders',
    value={'order_id': 'ORD-001', 'amount': 49.99, 'currency': 'USD'}
)
producer.flush()  # block until the message has been delivered

# Consumer
consumer = AvroConsumer(
    {'bootstrap.servers': 'localhost:9092',
     'group.id': 'my-group',
     'schema.registry.url': 'http://localhost:8081'}
)
consumer.subscribe(['orders'])
try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 second for a message
        if msg and not msg.error():
            order = msg.value()  # automatically deserialized dict
            print(f"Order: {order['order_id']} - ${order['amount']}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # leave the consumer group cleanly
Working with Avro in Python (without Kafka)
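# pip install fastavro
import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ]
}
parsed_schema = fastavro.parse_schema(schema)

# Write Avro to file
records = [
    {"name": "Alice", "email": "alice@example.com", "age": 30},
    {"name": "Bob", "email": "bob@example.com", "age": None},
]
with open('users.avro', 'wb') as f:
    # 'deflate' needs no extra packages; 'snappy' requires an optional dependency
    fastavro.writer(f, parsed_schema, records, codec='deflate')

# Read Avro from file
with open('users.avro', 'rb') as f:
    reader = fastavro.reader(f)
    writer_schema = reader.writer_schema  # embedded schema
    for record in reader:
        print(record)  # {'name': 'Alice', 'email': '...', 'age': 30}

# Read with schema evolution (provide reader schema)
# Illustrative reader schema: adds a "country" field with a default
new_parsed_schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
        {"name": "country", "type": "string", "default": "US"},
    ]
})
with open('users.avro', 'rb') as f:
    reader = fastavro.reader(f, reader_schema=new_parsed_schema)
    for record in reader:
        print(record)  # each record now also carries the default "country"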
Avro vs Protocol Buffers vs Parquet
| Feature | Avro | Protocol Buffers | Parquet |
|---|---|---|---|
| Orientation | Row | Row | Columnar |
| Schema in file | Yes (embedded) | No (external .proto) | Yes (footer) |
| Code generation | Optional | Required | Optional |
| Schema evolution | Name-based + aliases | Number-based | Column add/remove |
| Compression | Block-level | None (external) | Column-level |
| Read speed | Fast for full-row reads | Fast for full-row reads | Fastest for column-subset scans (column pruning) |
| Primary use | Kafka, Hadoop streaming | gRPC, microservices | Analytics, data warehouse |
| Language support | Many (Java-native) | Many (official and third-party) | Many |
When to Use Avro
Use Avro when:
- Building Kafka-based event streaming pipelines
- Working with the Hadoop/Spark ecosystem (HDFS storage, Hive tables)
- You need schema evolution with forward/backward compatibility guarantees
- Java is your primary language (Avro has the best Java support)
- Row-oriented streaming writes are the primary access pattern
Use Protocol Buffers instead when:
- Building gRPC microservices
- Strong typing and code generation are priorities
- Multiple languages need generated client code
Use Parquet instead when:
- Analytical query performance is the priority
- Data is primarily read, not streamed
- Column-level operations are common