Apache Avro Format Guide: Schema-Based Serialization for Kafka & Hadoop

By Pablo Cirre • Updated May 7, 2026

What Is Apache Avro?

Apache Avro is an open-source data serialization framework developed within the Apache Hadoop project and released in 2009. Unlike Protocol Buffers (which normally relies on generated code) or JSON (which is schema-less), Avro takes a distinctive approach: schemas are defined in JSON and stored alongside the data in the same file, and the binary encoding is driven by the schema at runtime. Code generation is optional and mainly useful in statically typed languages.

Avro is one of the most widely used serialization formats for Apache Kafka (the dominant event streaming platform), where it is usually paired with a schema registry, and it is common across the Hadoop/Spark ecosystem. Kafka itself is format-agnostic and only transports bytes. Avro's combination of compact binary encoding, embedded schema, and schema evolution makes it well suited to event streaming and data pipelines where data formats change over time.

How Avro Works

Schema Definition

Avro schemas are written in JSON. A schema describes a data type — typically a record (like a struct or object):

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",         "type": "long"},
    {"name": "username",   "type": "string"},
    {"name": "email",      "type": "string"},
    {"name": "age",        "type": ["null", "int"], "default": null},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "tags",       "type": {"type": "array", "items": "string"}},
    {"name": "address",    "type": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "street",  "type": "string"},
        {"name": "city",    "type": "string"},
        {"name": "country", "type": "string", "default": "US"}
      ]
    }}
  ]
}

Avro Data Types

Primitive types:

null, boolean, int, long, float, double, bytes, string

Complex types:

record — named fields (like a struct/class)
enum — named enumeration
array — ordered list of one type
map — string-keyed map of one value type
union — value of one of multiple types (commonly ["null", "string"] for nullable fields)
fixed — fixed-length byte array

Logical types (annotations on primitive types):

date — int, days since epoch
time-millis / time-micros — int/long, time of day
timestamp-millis / timestamp-micros — long, UTC timestamp
local-timestamp-millis / local-timestamp-micros — without timezone
duration — fixed(12): months + days + milliseconds
decimal — bytes or fixed with precision+scale
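
The logical types above are only annotations: what lands on the wire is the underlying primitive. The following is a minimal sketch of three of those conversions using the Python standard library; the function names are illustrative and do not come from any Avro library.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_avro_date(d: date) -> int:
    """date: number of days since 1970-01-01, stored as an int."""
    return (d - EPOCH_DATE).days


def to_avro_timestamp_millis(dt: datetime) -> int:
    """timestamp-millis: milliseconds since the epoch, stored as a long.

    Expects a timezone-aware datetime; a naive one raises TypeError here.
    """
    return (dt - EPOCH_UTC) // timedelta(milliseconds=1)


def to_avro_decimal(value: Decimal, scale: int) -> bytes:
    """decimal: big-endian two's-complement bytes of the unscaled integer.

    Digits beyond `scale` are truncated in this sketch.
    """
    unscaled = int(value.scaleb(scale))
    length = (unscaled.bit_length() + 8) // 8  # always leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


print(to_avro_date(date(2025, 4, 25)))                                   # 20203
print(to_avro_timestamp_millis(datetime(2025, 4, 25, tzinfo=timezone.utc)))  # 1745539200000
print(to_avro_decimal(Decimal("12.34"), scale=2).hex())                  # 04d2
```

Libraries such as fastavro perform these conversions for you when the schema carries the logicalType annotation; the sketch only shows what the stored values look like.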

The Avro Binary Encoding

Avro's binary encoding is remarkably compact because the schema is known by both writer and reader — field names and types don't need to be repeated in each record.

Avro binary for a User record:
[long: id=42]                    → zigzag-encoded varint: 0x54
[string: "alice"]                → [length: 0x0A] [bytes: 61 6C 69 63 65]
[string: "alice@example.com"]    → [length: 0x22] [bytes: ...]
[union index: 1 (int present)]   → 0x02
[int: age=28]                    → zigzag-encoded varint: 0x38
[long: created_at=1745539200000] → zigzag-encoded varint: ...
[array: ["python","kafka"]]      → [count: 0x04] [len+bytes] [len+bytes] [block end: 0]
[string: "123 Main St"]          → ...
[string: "Austin"]               → ...
[string: "US"]                   → ...

Key encoding decisions:

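Variable-length integers: ints and longs are ZigZag-mapped and then written as variable-length integers (as in Protocol Buffers), so values of small magnitude take 1 byte
Strings and bytes: prefixed with their length, itself encoded as a long
Nulls: the null value itself takes 0 bytes; inside a union only the branch index is written (index 0 when "null" is listed first)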
Arrays and maps: blocks of items with block-count prefix; end marker is 0
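
The encoding rules above are small enough to sketch by hand. The following is a minimal, standard-library-only illustration of how a few of the User fields could be encoded; the helper names are made up for this example, and a real project would use a library such as fastavro instead.

```python
def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: ZigZag-map the sign, then write a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # ZigZag: 0→0, -1→1, 1→2, -2→3, ...
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # last byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Strings are UTF-8 bytes prefixed with their byte length as a long."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_nullable_int(value) -> bytes:
    """["null", "int"] union: branch index first, then the value (nothing for null)."""
    if value is None:
        return encode_long(0)
    return encode_long(1) + encode_long(value)


def encode_string_array(items) -> bytes:
    """Arrays are written as one block (count + items) followed by a 0 end marker."""
    if not items:
        return encode_long(0)
    body = b"".join(encode_string(item) for item in items)
    return encode_long(len(items)) + body + encode_long(0)


print(encode_long(42).hex())             # 54
print(encode_string("alice").hex())      # 0a616c696365
print(encode_nullable_int(28).hex())     # 0238
print(encode_nullable_int(None).hex())   # 00
```

Note that nothing in the output identifies a field: the decoder has to walk the same schema in the same order to know where each value starts and ends.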

Avro Container Files

For persistent storage, Avro data is typically stored in Avro Object Container Files (.avro), which include:

Magic: "Obj\x01" (4 bytes)
File metadata (map):
  "avro.schema" → the schema as a JSON string
  "avro.codec"  → compression codec: "null", "deflate", "snappy", "bzip2", "xz", "zstandard"
Sync marker (16 bytes, random)
Data block 1:
  Object count (long)
  Serialized byte count (long)
  Serialized objects (Avro binary)
  Sync marker
Data block 2 ...

The sync marker enables splitting large Avro files for parallel processing (Hadoop MapReduce, Spark).
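
To make the header layout concrete, here is a hedged sketch that builds a container-file header in memory and parses it back using only the standard library. The helper names (`build_header`, `read_header`) are invented for this example, and it stops at the header: data blocks and codecs are not handled.

```python
import io
import json
import os

MAGIC = b"Obj\x01"


def write_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)  # ZigZag, then base-128 varint
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(stream) -> int:
    shift = result = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return (result >> 1) ^ -(result & 1)  # undo ZigZag
        shift += 7


def build_header(schema: dict, codec: str = "null") -> bytes:
    """Magic + metadata map (string keys, bytes values) + 16-byte sync marker."""
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": codec.encode()}
    out = bytearray(MAGIC)
    out += write_long(len(meta))  # one map block holding every entry
    for key, value in meta.items():
        k = key.encode()
        out += write_long(len(k)) + k + write_long(len(value)) + value
    out += write_long(0)          # end of map
    out += os.urandom(16)         # random sync marker
    return bytes(out)


def read_header(stream):
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:             # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = stream.read(read_long(stream)).decode()
            meta[key] = stream.read(read_long(stream))
    return meta, stream.read(16)


header = build_header({"type": "string"})
meta, sync = read_header(io.BytesIO(header))
print(json.loads(meta["avro.schema"]), meta["avro.codec"], len(sync))
```

The same `read_header` should also work on the first bytes of a real .avro file opened in binary mode, since the metadata map is an ordinary Avro map.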

Schema Evolution: The Killer Feature

Like Protocol Buffers, Avro supports safe schema evolution — but with a different mechanism. While protobuf uses field numbers, Avro uses name matching with defaults.

Writer Schema vs Reader Schema

When reading Avro data, you provide both:

Writer schema: the schema used when the data was written
Reader schema: the schema your code expects to read

Avro's schema resolution rules handle the differences:

// Writer schema (v1 — original)
{"type": "record", "name": "Order",
 "fields": [
   {"name": "id",       "type": "long"},
   {"name": "product",  "type": "string"},
   {"name": "quantity", "type": "int"}
 ]}

// Reader schema (v2 — added "price" with default, removed "quantity")
{"type": "record", "name": "Order",
 "fields": [
   {"name": "id",      "type": "long"},
   {"name": "product", "type": "string"},
   {"name": "price",   "type": "double", "default": 0.0}
 ]}

Resolution rules:

Added field with default: if the writer didn't write it, use the default → compatible
Removed field: if the reader doesn't need it, skip it in the writer data → compatible
Type promotion: int → long/float/double, long → float/double, float → double, string ↔ bytes → compatible
Renamed field: use aliases array in the reader schema to match old names
Changed type incompatibly: breaks compatibility
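
The first, second, and fourth rules can be sketched on plain dicts. This is a simplified illustration, not how Avro libraries implement it: real resolution happens while decoding the binary data and also covers type promotion, unions, and nested types. The function name `resolve_record` is an assumption made for this example.

```python
def resolve_record(written: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema (flat records only)."""
    writer_names = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        # Match by name first, then by any alias declared in the reader schema.
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((name for name in candidates if name in writer_names), None)
        if match is not None:
            resolved[field["name"]] = written[match]
        elif "default" in field:
            resolved[field["name"]] = field["default"]  # added field: use the default
        else:
            raise ValueError(f"no value and no default for field {field['name']!r}")
    return resolved  # writer-only fields are simply never copied


order_v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
    {"name": "product", "type": "string"},
    {"name": "quantity", "type": "int"},
]}
order_v2 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
    {"name": "product", "type": "string"},
    {"name": "price", "type": "double", "default": 0.0},
]}

print(resolve_record({"id": 1, "product": "widget", "quantity": 3}, order_v1, order_v2))
# {'id': 1, 'product': 'widget', 'price': 0.0}
```

Removing the default from "price" makes the same call raise, which is the situation that compatibility checks are meant to catch before data is written.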

Avro with Apache Kafka and Schema Registry

In Kafka deployments, Avro schemas are typically managed by a schema registry, most commonly Confluent Schema Registry: a separate service that stores and versions all schemas.

                    Schema Registry
                    ┌─────────────────┐
Producer ──schema─→ │ subject:        │
         ──id+data→ │ "orders-value"  │
                    │ v1: {...}        │
                    │ v2: {...}        │
                    └─────────────────┘
Consumer ──id─────→ Schema Registry (fetch schema by ID)
         ←─schema── (cached locally)
         ←─decode── Avro binary

A Kafka message written by the Confluent Avro serializer starts with a 5-byte header: \x00 (magic byte) + 4-byte schema ID (big-endian int32), followed by the Avro-binary-encoded payload.
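
The framing is simple enough to handle with the standard `struct` module. The sketch below is illustrative only: `frame` and `unframe` are hypothetical helper names, and in practice the Confluent serializers add and strip this header for you.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bi")  # 1-byte magic + big-endian int32 schema ID = 5 bytes


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


message = frame(7, b"\x54")
print(message.hex())      # 000000000754
print(unframe(message))   # (7, b'T')
```

A consumer uses the extracted schema ID to fetch (and cache) the writer schema from the registry before decoding the remaining bytes.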

# pip install "confluent-kafka[avro]"
# Note: confluent_kafka.avro (AvroProducer / AvroConsumer) is the library's legacy API;
# newer releases point to confluent_kafka.schema_registry with AvroSerializer instead.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer

schema_str = """{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}"""

# Producer: registers the schema if needed and prefixes each message with its ID
producer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092',
     'schema.registry.url': 'http://localhost:8081'},
    default_value_schema=avro.loads(schema_str)
)
producer.produce(
    topic='orders',
    value={'order_id': 'ORD-001', 'amount': 49.99, 'currency': 'USD'}
)
producer.flush()  # block until outstanding messages are delivered

# Consumer
consumer = AvroConsumer(
    {'bootstrap.servers': 'localhost:9092',
     'group.id': 'my-group',
     'schema.registry.url': 'http://localhost:8081'}
)
consumer.subscribe(['orders'])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue  # nothing arrived within the timeout
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        order = msg.value()  # automatically deserialized dict
        print(f"Order: {order['order_id']} - ${order['amount']}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # leave the group and commit final offsets

Working with Avro in Python (without Kafka)

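# pip install fastavro
# The 'snappy' codec needs an extra compression package (cramjam or python-snappy,
# depending on the fastavro version); use codec='deflate' to avoid the dependency.

import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name",  "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "age",   "type": ["null", "int"], "default": None},
    ]
}
parsed_schema = fastavro.parse_schema(schema)

# Write Avro to file
records = [
    {"name": "Alice", "email": "alice@example.com", "age": 30},
    {"name": "Bob",   "email": "bob@example.com",   "age": None},
]

with open('users.avro', 'wb') as f:
    fastavro.writer(f, parsed_schema, records, codec='snappy')

# Read Avro from file
with open('users.avro', 'rb') as f:
    reader = fastavro.reader(f)
    writer_schema = reader.writer_schema  # schema embedded in the file header
    for record in reader:
        print(record)  # {'name': 'Alice', 'email': '...', 'age': 30}

# Read with schema evolution (provide reader schema).
# The added "country" field is illustrative; its default fills in for old records.
new_parsed_schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name",    "type": "string"},
        {"name": "email",   "type": "string"},
        {"name": "age",     "type": ["null", "int"], "default": None},
        {"name": "country", "type": "string", "default": "US"},
    ]
})
with open('users.avro', 'rb') as f:
    reader = fastavro.reader(f, reader_schema=new_parsed_schema)
    for record in reader:
        print(record)  # each record now also carries 'country': 'US'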

Avro vs Protocol Buffers vs Parquet

Feature	Avro	Protocol Buffers	Parquet
Orientation	Row	Row	Columnar
Schema in file	Yes (embedded)	No (external .proto)	Yes (footer)
Code generation	Optional	Required	Optional
Schema evolution	Name-based + aliases	Number-based	Column add/remove
Compression	Block-level	None (external)	Column-level
Read speed	Fast (row)	Fast (row)	Fastest (column prune)
Primary use	Kafka, Hadoop streaming	gRPC, microservices	Analytics, data warehouse
Language support	Many (Java-native)	15+ official	Many

When to Use Avro

Use Avro when:

Building Kafka-based event streaming pipelines
Working with the Hadoop/Spark ecosystem (HDFS storage, Hive tables)
You need schema evolution with forward/backward compatibility guarantees
Java is your primary language (Avro has the best Java support)
Row-oriented streaming writes are the primary access pattern

Use Protocol Buffers instead when:

Building gRPC microservices
Strong typing and code generation are priorities
Multiple languages need generated client code

Use Parquet instead when:

Analytical query performance is the priority
Data is primarily read, not streamed
Column-level operations are common

Frequently Asked Questions

Apache Avro is a binary data serialization framework that combines compact binary encoding with schema-aware reading and writing. It is primarily used in Apache Kafka (as the recommended message serialization format with Confluent Schema Registry), the Hadoop ecosystem (for HDFS storage and Hive external tables), and Apache Spark data pipelines. Unlike JSON, Avro stores schemas alongside the data, enabling any reader to decode it without out-of-band schema distribution. Unlike Protocol Buffers, Avro does not require code generation — schemas are read at runtime from the embedded JSON schema definition.

Avro uses name-based schema matching to handle evolution. When reading old Avro data with a new schema: (1) new fields with defaults are filled with their default values if the writer did not include them; (2) fields the reader does not need are skipped; (3) type promotions like int→long are handled automatically; (4) renamed fields use an "aliases" array in the reader schema to match old names. In Kafka with Confluent Schema Registry, the configured compatibility mode (such as BACKWARD, FORWARD, or FULL) is checked when a new schema version is registered: the registry rejects a schema that would break compatibility with the versions already registered for that subject.
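
As a rough illustration of what a BACKWARD check looks for, the sketch below compares two flat record schemas and lists the reasons the new one could not read data written with the old one. It is a simplification under stated assumptions: `backward_problems` is a hypothetical name, only top-level fields and primitive promotions are considered, and Schema Registry's own checker covers far more cases.

```python
# Promotions a reader may apply to a writer's primitive type.
PROMOTIONS = {
    "int": ("long", "float", "double"),
    "long": ("float", "double"),
    "float": ("double",),
    "string": ("bytes",),
    "bytes": ("string",),
}


def readable_as(writer_type, reader_type) -> bool:
    """True if data of writer_type can be read as reader_type (primitives only)."""
    if writer_type == reader_type:
        return True
    return isinstance(writer_type, str) and reader_type in PROMOTIONS.get(writer_type, ())


def backward_problems(old_schema: dict, new_schema: dict) -> list:
    """Reasons why new_schema (reader) cannot read data written with old_schema."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        names = [field["name"], *field.get("aliases", [])]
        old = next((old_fields[n] for n in names if n in old_fields), None)
        if old is None:
            if "default" not in field:
                problems.append(f"new field {field['name']!r} has no default")
        elif not readable_as(old["type"], field["type"]):
            problems.append(
                f"field {field['name']!r}: cannot read {old['type']!r} as {field['type']!r}"
            )
    return problems


v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "int"},
    {"name": "product", "type": "string"},
]}
v2_ok = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},                       # int → long promotion
    {"name": "price", "type": "double", "default": 0.0},  # added with a default
]}
v2_bad = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "price", "type": "double"},
]}

print(backward_problems(v1, v2_ok))   # []
print(backward_problems(v1, v2_bad))
```

An empty list means the new schema can act as a reader for old data under these simplified rules; a FORWARD check would run the same comparison with the two schemas swapped.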

Both Avro and JSON can serialize Kafka messages, but Avro is usually far more compact: an Avro message can be 5-10× smaller than the equivalent JSON, depending on the data, because field names are not repeated in every message (they live in the schema, shared once) and values use compact binary encoding (a 64-bit integer takes 1-10 bytes in Avro versus 1-20 characters as JSON text). Avro serializers also validate against the schema: a producer cannot send a message with wrong field types or missing required fields. JSON requires no schema and is human-readable; Avro requires a schema and is binary. For high-throughput production systems, the size difference translates into lower storage costs and higher throughput.
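
The size difference is easy to measure for a single record. The sketch below hand-encodes one Order record in Avro binary (standard library only, helper names invented for this example) and compares it with compact JSON. The ratio depends heavily on how much of each message is field names rather than values, so treat the numbers as an illustration for this record only.

```python
import json
import struct


def encode_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)  # ZigZag, then base-128 varint
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x: float) -> bytes:
    return struct.pack("<d", x)  # 8 bytes, little-endian IEEE 754


def encode_order(order: dict) -> bytes:
    """Fields are written in schema order: order_id, amount, currency."""
    return (
        encode_string(order["order_id"])
        + encode_double(order["amount"])
        + encode_string(order["currency"])
    )


order = {"order_id": "ORD-001", "amount": 49.99, "currency": "USD"}
as_json = json.dumps(order, separators=(",", ":")).encode("utf-8")
as_avro = encode_order(order)

print(len(as_json))  # 54
print(len(as_avro))  # 20
```

In Kafka the 5-byte schema ID header is added on top of the Avro payload, and records with many short values or long field names tend to show a larger gap than this one.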

Avro does not require code generation. It supports two usage modes. Generic mode: you work with generic record objects and map-like access, reading fields by name at runtime, with no code generation. Specific mode: you run the Avro compiler to generate typed classes from the schema (for example in Java), which gives you compile-time type safety and IDE autocomplete. In Python with fastavro, you always use the generic mode: schemas are parsed from the JSON definition at runtime with `fastavro.parse_schema()`, and records are plain Python dicts. The Java Avro library supports both generic and specific modes.
