MP4: The Universal Video Container — Technical Deep Dive
MP4 is the video format you interact with every day — downloaded from YouTube, shot on your phone, shared on social media, streamed on Netflix. Despite its ubiquity, MP4 is profoundly misunderstood: most people think of it as a video codec, but it is actually a container format that can hold many different codecs. Understanding MP4's internal box structure, codec combinations, and streaming optimizations unlocks the ability to build video pipelines that are both technically correct and performance-optimized.
MP4 Is Not a Codec
This is the single most important thing to understand: MP4 is a container, not a codec. An MP4 file is like a shipping container that can hold different cargo:
- Video codec: H.264 (AVC), H.265 (HEVC), AV1, VP9, MPEG-4 Part 2
- Audio codec: AAC, MP3, Opus, AC-3 (Dolby), ALAC (Apple Lossless), PCM
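- Subtitles: 3GPP Timed Text (tx3g, also known as MPEG-4 Timed Text), WebVTT, TTML (SRT files must be converted to one of these before muxing)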
- Metadata: title, cover art, chapters
The codec determines compression quality and compatibility. The container provides the packaging that lets players find audio and video tracks, synchronize them, and seek to any point efficiently.
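To make the container/codec split concrete, here is a minimal sketch that maps the four-character sample-entry codes stored inside an MP4 to the codec they usually signal. The table is illustrative rather than exhaustive, and the names `SAMPLE_ENTRY_CODECS` and `describe_tracks` are hypothetical helpers introduced for this example.

```python
# Illustrative mapping from sample-entry four-character codes (as found in
# the stsd box) to the codec they usually signal. Not exhaustive.
SAMPLE_ENTRY_CODECS = {
    "avc1": "H.264 (AVC)",
    "avc3": "H.264 (AVC)",
    "hvc1": "H.265 (HEVC)",
    "hev1": "H.265 (HEVC)",
    "av01": "AV1",
    "vp09": "VP9",
    "mp4v": "MPEG-4 Part 2",
    "mp4a": "MPEG-4 audio (usually AAC)",
    "Opus": "Opus",
    "ac-3": "AC-3 (Dolby Digital)",
    "alac": "ALAC (Apple Lossless)",
}


def describe_tracks(sample_entries):
    """Return a readable codec name for each sample-entry code."""
    return [
        SAMPLE_ENTRY_CODECS.get(code, f"unknown ({code})")
        for code in sample_entries
    ]


# Two files with the same .mp4 extension but very different cargo:
print(describe_tracks(["avc1", "mp4a"]))
print(describe_tracks(["av01", "Opus"]))
```

Both lists could come from files named `video.mp4`, which is why a player has to inspect the tracks rather than trust the extension.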
MP4 and ISOBMFF
MP4 (MPEG-4 Part 14, ISO/IEC 14496-14) is a profile of ISOBMFF (ISO Base Media File Format, ISO/IEC 14496-12). ISOBMFF is also the foundation of:
- HEIF/HEIC (images)
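- QuickTime (.mov) (Apple's predecessor to MP4; strictly the ancestor of ISOBMFF rather than a derivative, which is why the two share the same box layout)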
- 3GPP (.3gp) (mobile video)
- CMAF (Common Media Application Format for streaming)
- AVIF (AV1 images)
- Fragmented MP4 (used in DASH and HLS streaming)
Box/Atom Architecture
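An MP4 file is a hierarchy of boxes (also called atoms). Every box starts with an 8-byte header: a 4-byte big-endian size, which counts the header itself, followed by a 4-byte type (four ASCII characters). A size of 1 means a 64-bit size follows the type, and a size of 0 means the box runs to the end of the file: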
MP4 File (series of top-level boxes)
├── ftyp — File Type Box
│   ├── major_brand: 'isom' / 'mp42' / 'M4V ' / 'M4A ' / 'avc1'
│   ├── minor_version: 0x200 (512)
│   └── compatible_brands: ['isom', 'mp41', 'avc1']
│
├── free — Free space (optional, padding)
│
├── moov — Movie Container (all metadata; should come before mdat for progressive playback)
│   ├── mvhd — Movie Header (duration, timescale, rate, volume, creation/modification time)
│   ├── trak — Track Container (one per stream: video, audio, subtitles)
│   │   ├── tkhd — Track Header (track ID, duration, width/height)
│   │   └── mdia — Media Container
│   │       ├── mdhd — Media Header (timescale, duration, language)
│   │       ├── hdlr — Handler Reference ('vide' for video, 'soun' for audio, 'text')
│   │       └── minf — Media Information Container
│   │           ├── vmhd — Video Media Header (graphicsMode, opcolor; audio tracks use smhd)
│   │           ├── dinf — Data Information (where the media data lives, usually this same file)
│   │           └── stbl — Sample Table (the index of all frames)
│   │               ├── stsd — Sample Description (sample entries such as avc1, hvc1, mp4a, Opus, carrying codec configuration like avcC or hvcC)
│   │               ├── stts — Time-to-Sample (sample durations, from which decode timestamps are derived)
│   │               ├── ctts — Composition Time Offset (display vs decode order)
│   │               ├── stss — Sync Sample Table (keyframe/I-frame positions)
│   │               ├── stsc — Sample-to-Chunk mapping
│   │               ├── stsz — Sample Size table (bytes per frame)
│   │               └── stco — Chunk Offset table (absolute file offsets of chunks inside mdat; co64 is the 64-bit variant)
│   └── udta — User Data (metadata: title, artist, cover art, chapters)
│
└── mdat — Media Data (the actual video/audio bytes, referenced by stco)
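The ftyp box at the top of the tree is the simplest one to decode by hand. The sketch below builds a synthetic ftyp box using the example values from the tree and parses it back; `parse_ftyp` is a hypothetical helper written for this illustration, not part of any library.

```python
import struct


def parse_ftyp(payload: bytes) -> dict:
    """Decode the body of an ftyp box (the bytes after the 8-byte header)."""
    if len(payload) < 8:
        raise ValueError("ftyp payload too short")
    major_brand, minor_version = struct.unpack(">4sI", payload[:8])
    # The rest of the payload is a list of 4-byte compatible brands.
    brands = [
        payload[i:i + 4].decode("ascii", "replace")
        for i in range(8, len(payload) - 3, 4)
    ]
    return {
        "major_brand": major_brand.decode("ascii", "replace"),
        "minor_version": minor_version,
        "compatible_brands": brands,
    }


# Synthetic box: header (size + type) followed by the ftyp body.
body = b"isom" + struct.pack(">I", 0x200) + b"isom" + b"mp41" + b"avc1"
box = struct.pack(">I4s", 8 + len(body), b"ftyp") + body

size, box_type = struct.unpack(">I4s", box[:8])
info = parse_ftyp(box[8:size])
print(box_type.decode("ascii"), size, info)
```

The size field counts the 8 header bytes, so the payload is always `box[8:size]` for boxes that use the ordinary 32-bit size.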
The stbl (Sample Table Box) is the heart of MP4 seeking and playback. Together, stts + ctts + stsc + stsz + stco tell the player: "frame N is located at byte offset X in mdat, has size Y bytes, and should be displayed at time T."
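As a rough illustration of that lookup, the sketch below resolves a sample index to a byte offset, a size, and a decode time. It assumes the tables have already been parsed into plain Python lists (the function names and the tiny example tables are made up for this example), and it leaves out ctts, whose offset would be added to the decode time to get the display time.

```python
def sample_location(index, stsc, stsz, stco):
    """Return (byte_offset, size) for the 0-based sample `index`.

    stsc: list of (first_chunk, samples_per_chunk), first_chunk is 1-based
    stsz: list of per-sample sizes in bytes
    stco: list of absolute chunk offsets in the file
    """
    first_sample = 0  # index of the first sample in the current chunk
    for chunk_number in range(1, len(stco) + 1):
        # The applicable stsc entry is the last one starting at or before this chunk.
        per_chunk = [n for first, n in stsc if first <= chunk_number][-1]
        if index < first_sample + per_chunk:
            # Samples in a chunk are contiguous: skip the earlier ones.
            offset = stco[chunk_number - 1] + sum(stsz[first_sample:index])
            return offset, stsz[index]
        first_sample += per_chunk
    raise IndexError("sample index beyond the last chunk")


def sample_decode_time(index, stts, timescale):
    """Return the decode time in seconds of the 0-based sample `index`.

    stts: list of (sample_count, sample_delta) runs, in timescale units
    """
    ticks = 0
    remaining = index
    for count, delta in stts:
        if remaining < count:
            return (ticks + remaining * delta) / timescale
        ticks += count * delta
        remaining -= count
    raise IndexError("sample index beyond the stts table")


# Made-up tables: chunks 1 and 2 hold two samples each, chunk 3 holds one.
stsc = [(1, 2), (3, 1)]
stco = [1000, 5000, 9000]
stsz = [100, 200, 300, 400, 500]
stts = [(5, 1000)]  # five samples, 1000 ticks each

offset, size = sample_location(3, stsc, stsz, stco)
seconds = sample_decode_time(3, stts, timescale=30000)
print(f"sample 3: offset {offset}, size {size}, decode time {seconds:.3f}s")
```

Sample 3 is the second sample of chunk 2, so its offset is the chunk offset (5000) plus the size of the one sample before it in that chunk (300).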
The moov-at-front Problem (Fast Start)
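By default, many encoders write mdat first and moov at the end of the file, because the sample tables are only complete once encoding has finished. A web player that reads the file sequentially then cannot start playback until the whole file has arrived. Players that support HTTP range requests can fetch moov from the end of the file, but only at the cost of extra round trips and a slower start.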
Web-optimized MP4 moves moov to the front. This is called Fast Start or Progressive Download:
# Check moov position: list the top-level boxes in file order
python3 -c "
import struct
with open('input.mp4', 'rb') as f:
    while True:
        start = f.tell()
        header = f.read(8)
        if len(header) < 8:
            break
        size, box_type = struct.unpack('>I4s', header)
        header_len = 8
        if size == 1:  # 64-bit size follows the type field
            size = struct.unpack('>Q', f.read(8))[0]
            header_len = 16
        name = box_type.decode('ascii', 'replace')
        print(f'Box: {name} at offset {start}, size {size}')
        if size < header_len:  # size 0 = runs to end of file; anything else is corrupt
            break
        f.seek(start + size)
"
# Move moov to front (Fast Start)
ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4
# Verify: moov should appear before mdat in output
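The verification step can also be scripted. The sketch below is one possible check, assuming a file counts as fast start when its top-level moov box appears before mdat; `top_level_box_types` and `is_fast_start` are hypothetical helpers, and the demo runs on two tiny synthetic files rather than real video.

```python
import os
import struct
import tempfile


def top_level_box_types(path):
    """Return the top-level box types of an MP4 file, in file order."""
    types = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:  # 64-bit size follows the type field
                ext = f.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack(">Q", ext)[0]
                header_len = 16
            types.append(box_type.decode("ascii", "replace"))
            if size < header_len:  # 0 = runs to end of file, else corrupt
                break
            f.seek(size - header_len, 1)
    return types


def is_fast_start(path):
    """True if the file has both moov and mdat, with moov first."""
    types = top_level_box_types(path)
    if "moov" not in types or "mdat" not in types:
        return False
    return types.index("moov") < types.index("mdat")


def _box(box_type, payload=b""):
    """Build a minimal box for the demo below."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


ftyp = _box(b"ftyp", b"isom" + struct.pack(">I", 0x200))
layouts = {
    "fast start": ftyp + _box(b"moov") + _box(b"mdat", b"\x00" * 32),
    "moov at end": ftyp + _box(b"mdat", b"\x00" * 32) + _box(b"moov"),
}

results = {}
for label, data in layouts.items():
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        tmp.write(data)
    try:
        results[label] = is_fast_start(tmp.name)
    finally:
        os.remove(tmp.name)
    print(f"{label}: {results[label]}")
```

In a real pipeline, `is_fast_start("output.mp4")` could run as a post-encode check before the file is uploaded.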
Fragmented MP4 (fMP4) for Streaming
Regular MP4 is monolithic: a single stbl index describes every sample and assumes one contiguous media file. For adaptive bitrate streaming (DASH, HLS), video is split into segments (typically 2–10 seconds each). Fragmented MP4 uses the movie fragment boxes defined in ISOBMFF so that each fragment carries its own small index:
fMP4 Structure
├── ftyp (with 'iso5', 'avc1' brands)
├── moov (together with ftyp, the initialization segment; no mdat, sample tables empty)
│   └── mvex → trex (Track Extends Box — default fragment settings)
├── moof + mdat — Movie Fragment 1 (segment 1)
│   └── moof (Movie Fragment Box)
│       ├── mfhd (Movie Fragment Header: sequence number)
│       └── traf (Track Fragment)
│           ├── tfhd (Track Fragment Header)
│           ├── tfdt (Track Fragment Decode Time)
│           └── trun (Track Run: per-sample sizes, durations, flags, plus a data offset)
├── moof + mdat — Movie Fragment 2
└── mfra (Movie Fragment Random Access — optional, for seeking)
    └── tfra (Track Fragment Random Access: time → moof offset map)
DASH players (Shaka Player, dash.js) and HLS players using fMP4 segments first request the initialization segment (often named init.mp4), then fetch individual media segments as directed by the manifest or playlist.
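To show what a player reads from each media segment, here is a small sketch that builds a synthetic moof + mdat pair in memory and pulls out the fragment sequence number, track ID, and base decode time. The helpers `box`, `iter_boxes`, and `describe_moof` are made up for this example, only 32-bit box sizes are handled, and the values (sequence 7, track 1, 180000 ticks) are arbitrary.

```python
import struct


def box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box with a 32-bit size header."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def iter_boxes(data: bytes):
    """Yield (type, payload) for each box in `data` (32-bit sizes only)."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        if size < 8 or pos + size > len(data):
            break
        yield box_type.decode("ascii", "replace"), data[pos + 8:pos + size]
        pos += size


def describe_moof(moof_payload: bytes) -> dict:
    """Extract a few fields from the children of a moof box."""
    info = {}
    for box_type, payload in iter_boxes(moof_payload):
        if box_type == "mfhd":
            # 4 bytes of version/flags, then the sequence number
            info["sequence_number"] = struct.unpack_from(">I", payload, 4)[0]
        elif box_type == "traf":
            for child_type, child in iter_boxes(payload):
                if child_type == "tfhd":
                    info["track_id"] = struct.unpack_from(">I", child, 4)[0]
                elif child_type == "tfdt":
                    # Version 1 stores a 64-bit decode time, version 0 a 32-bit one
                    fmt = ">Q" if child[0] == 1 else ">I"
                    info["base_media_decode_time"] = struct.unpack_from(fmt, child, 4)[0]
    return info


# Synthetic fragment: sequence 7, track 1, decode time 180000 ticks
mfhd = box(b"mfhd", struct.pack(">II", 0, 7))
tfhd = box(b"tfhd", struct.pack(">II", 0x020000, 1))  # flag: default-base-is-moof
tfdt = box(b"tfdt", struct.pack(">IQ", 1 << 24, 180000))  # version 1
moof = box(b"moof", mfhd + box(b"traf", tfhd + tfdt))
segment = moof + box(b"mdat", b"\x00" * 16)

fragment_info = None
for box_type, payload in iter_boxes(segment):
    if box_type == "moof":
        fragment_info = describe_moof(payload)
print(fragment_info)
```

With a 90000 Hz track timescale, a base decode time of 180000 would place this fragment two seconds into the stream, which is how a player lines up segments fetched out of order.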
Working with MP4 Using FFmpeg
# Encode to H.264 + AAC in MP4 (universal compatibility)
ffmpeg -i input.mov \
-c:v libx264 -crf 23 -preset medium -profile:v high \
-c:a aac -b:a 192k \
-movflags +faststart \
output.mp4
# Encode to H.265 (HEVC): often 25-50% smaller than H.264 at similar quality
# (-tag:v hvc1 labels the track so Apple players accept it)
ffmpeg -i input.mov \
  -c:v libx265 -crf 28 -preset slow -tag:v hvc1 \
  -c:a aac -b:a 192k \
  -movflags +faststart \
  output_hevc.mp4
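# Encode to AV1 (royalty-free, strong compression, slow to encode)
# Older FFmpeg builds may need -strict experimental for Opus in MP4
ffmpeg -i input.mov \
  -c:v libaom-av1 -crf 32 -b:v 0 -cpu-used 4 \
  -c:a libopus -b:a 128k \
  -movflags +faststart \
  output.mp4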
# Extract video stream without re-encoding (copy)
ffmpeg -i input.mp4 -c:v copy -an video_only.mp4
# Extract audio without re-encoding
ffmpeg -i input.mp4 -c:a copy -vn audio.m4a
# Create web-optimized thumbnail (-ss before -i seeks quickly instead of decoding from the start)
ffmpeg -ss 00:00:05 -i input.mp4 -frames:v 1 -vf "scale=640:-1" thumb.jpg
# Inspect MP4 structure
ffprobe -v quiet -print_format json -show_format -show_streams input.mp4
# Check moov atom position (for fast start verification)
# Trace output prefixes each line with the demuxer name, so match on the type field
ffprobe -v trace -i input.mp4 2>&1 | grep -E "type:'(ftyp|free|moov|mdat)'"
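The JSON that ffprobe prints is easy to consume from a script. The sketch below summarizes such a report; the sample JSON is a hand-written, heavily trimmed stand-in for real ffprobe output, and `summarize_ffprobe` and `run_ffprobe` are hypothetical helper names. `run_ffprobe` simply wraps the inspect command shown above and is defined but not called here, since it needs ffprobe installed.

```python
import json
import subprocess


def run_ffprobe(path):
    """Run the inspect command shown above and return its JSON text."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_ffprobe(json_text):
    """Turn an ffprobe JSON report into a few readable lines."""
    report = json.loads(json_text)
    fmt = report.get("format", {})
    lines = [f"container: {fmt.get('format_name', '?')}, "
             f"duration: {fmt.get('duration', '?')} s"]
    for stream in report.get("streams", []):
        kind = stream.get("codec_type", "?")
        codec = stream.get("codec_name", "?")
        if kind == "video":
            lines.append(f"video: {codec} {stream.get('width')}x{stream.get('height')}")
        elif kind == "audio":
            lines.append(f"audio: {codec} {stream.get('sample_rate')} Hz, "
                         f"{stream.get('channels')} ch")
        else:
            lines.append(f"{kind}: {codec}")
    return lines


# Hand-written sample, trimmed to the fields used above.
SAMPLE_FFPROBE_JSON = """
{
  "streams": [
    {"index": 0, "codec_type": "video", "codec_name": "h264",
     "width": 1920, "height": 1080},
    {"index": 1, "codec_type": "audio", "codec_name": "aac",
     "sample_rate": "48000", "channels": 2}
  ],
  "format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2", "duration": "12.500000"}
}
"""

summary = summarize_ffprobe(SAMPLE_FFPROBE_JSON)
for line in summary:
    print(line)
```

Note that the container is reported as a family of formats while the codecs appear per stream, which mirrors the container/codec split described at the start of this guide.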
Python: Parsing MP4 Box Structure
import struct

def parse_mp4_boxes(filepath):
    """Parse top-level MP4 boxes from a file."""
    boxes = []
    with open(filepath, 'rb') as f:
        file_size = f.seek(0, 2)  # seek() returns the new position, i.e. the file size
        f.seek(0)
        offset = 0
        while offset < file_size:
            # Read box header (8 bytes minimum)
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack('>I4s', header)
            box_type = box_type.decode('ascii', errors='replace')
            header_size = 8
            if size == 1:  # Extended size (64-bit) follows the type
                ext = f.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack('>Q', ext)[0]
                header_size = 16
            elif size == 0:  # Box extends to end of file
                size = file_size - offset
            if size < header_size:  # Corrupt size: stop rather than loop forever
                break
            boxes.append({
                'offset': offset,
                'size': size,
                'type': box_type,
            })
            # Skip box content
            offset += size
            f.seek(offset)
    return boxes

for box in parse_mp4_boxes('video.mp4'):
    print(f"0x{box['offset']:08x}: {box['type']:4s} ({box['size']:,} bytes)")
# Using pymediainfo for higher-level info
try:
    from pymediainfo import MediaInfo
    info = MediaInfo.parse('video.mp4')
    for track in info.tracks:
        # Any field may be None when MediaInfo cannot determine it
        bit_rate = track.bit_rate // 1000 if track.bit_rate else '?'
        if track.track_type == 'Video':
            print(f"Video: {track.codec_id}, {track.width}×{track.height} @ {track.frame_rate} fps")
            print(f"  Bit rate: {bit_rate} kb/s")
            if track.duration:  # reported in milliseconds
                print(f"  Duration: {float(track.duration) / 1000:.1f}s")
        elif track.track_type == 'Audio':
            print(f"Audio: {track.codec_id}, {track.sampling_rate} Hz, {track.channel_s} ch, {bit_rate} kb/s")
except ImportError:
    print("pip install pymediainfo")
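The parser above stops at top-level boxes. A possible extension, sketched below under the assumption that the data fits in memory, recurses into the pure container boxes from the tree shown earlier. `CONTAINER_BOXES` is a deliberately partial list and `walk_boxes` is a hypothetical helper; the demo runs on a small synthetic file rather than real video.

```python
import struct

# Boxes whose payload is just a sequence of child boxes (partial list).
CONTAINER_BOXES = {"moov", "trak", "mdia", "minf", "stbl", "mvex", "moof", "traf", "edts"}


def walk_boxes(data, depth=0, max_depth=8):
    """Yield (depth, type, size) for each box, recursing into containers."""
    pos = 0
    while pos + 8 <= len(data):
        size, raw_type = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if size == 1 and pos + 16 <= len(data):  # 64-bit size follows the type
            size = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
            header = 16
        elif size == 0:  # box runs to the end of the enclosing data
            size = len(data) - pos
        if size < header or pos + size > len(data):
            break  # truncated or corrupt box
        box_type = raw_type.decode("ascii", "replace")
        yield depth, box_type, size
        if box_type in CONTAINER_BOXES and depth < max_depth:
            yield from walk_boxes(data[pos + header:pos + size], depth + 1, max_depth)
        pos += size


def _box(box_type, payload=b""):
    """Build a minimal box for the demo below."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file: ftyp, then moov > trak > mdia > mdhd
mdhd = _box(b"mdhd", b"\x00" * 24)
sample = (
    _box(b"ftyp", b"isom" + struct.pack(">I", 0x200))
    + _box(b"moov", _box(b"trak", _box(b"mdia", mdhd)))
)

tree = list(walk_boxes(sample))
for depth, box_type, size in tree:
    print(f"{'  ' * depth}{box_type} ({size} bytes)")
```

For real files it is usually better to read only the moov box into memory and walk that, since mdat can be many gigabytes.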
Common MP4 Codec Combinations
| Use Case | Video | Audio | Container Note |
|---|---|---|---|
| Universal web | H.264 Baseline/High | AAC-LC | Maximum compatibility |
| High quality web | H.265 / AV1 | AAC or Opus | Use <source> fallback |
| Apple devices | H.264 or H.265 | AAC or ALAC | .mp4 or .m4v |
| YouTube upload | H.264 | AAC at 256 kb/s | YouTube re-encodes anyway |
| Professional edit | ProRes / DNxHD | PCM or AAC | MOV or MXF container |
| Mobile streaming | H.264 Baseline | HE-AAC | 3GP or MP4 |
| Screen recording | H.264 or H.265 | AAC | MP4 with fast start |
Summary
MP4 is a container, not a codec, and keeping the two apart prevents many common misunderstandings. Its ISOBMFF box architecture provides a flexible, extensible framework that supports progressive download, adaptive streaming, multiple tracks, and rich metadata. The moov-before-mdat (fast start) layout is effectively required for any progressively downloaded MP4 served over the web; fragmented MP4 sidesteps the problem by putting moov in the initialization segment. FFmpeg's -movflags +faststart flag and the libx264 + aac codec combination remain the safest default for broad web video compatibility.