PDF: Portable Document Format for Cross-Platform Documents
PDF (Portable Document Format) is the ISO 32000 standard for document interchange, designed to render consistently across devices and operating systems. Originally developed by Adobe and released in 1993, PDF became an open ISO standard in 2008 and is now ubiquitous in business, legal, healthcare, and government. Understanding PDF structure is essential for creating secure, accessible, searchable, and compliant documents.
PDF Document Structure: Objects & Primitives
Every PDF file is a collection of objects—strings, numbers, arrays, dictionaries, and streams:
%PDF-1.4 (header with version)
1 0 obj (object ID 1, generation 0)
<< /Type /Catalog /Pages 2 0 R >> (dictionary with reference to object 2)
endobj
2 0 obj (page tree node)
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj (page object)
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>
endobj
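4 0 obj (content stream)
<< /Length 45 >> (stream dictionary; /Length is the byte count of the stream data, here with LF line endings)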
stream
BT
/F1 12 Tf
100 700 Td
(Hello, World!) Tj
ET
endstream
endobj
xref (cross-reference table)
0 5
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000133 00000 n
0000000229 00000 n
trailer (trailer dictionary)
<< /Size 5 /Root 1 0 R >>
startxref
365
%%EOF
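The byte offsets in the xref table have to match the real position of each object, which is tedious to maintain by hand. The following is a minimal sketch, not a production writer, that assembles the same one-page file and computes the offsets from the bytes it has written. The function name `build_minimal_pdf` is hypothetical, and object 5 (a Helvetica font resource) is an addition not shown in the listing above; it is included so that `/F1` resolves to a font.

```python
def build_minimal_pdf(text: str = "Hello, World!") -> bytes:
    """Assemble a one-page PDF and derive xref offsets from actual byte positions.

    Assumption: `text` contains no parentheses or backslashes, which would
    need escaping inside a PDF literal string.
    """
    content = f"BT\n/F1 12 Tf\n100 700 Td\n({text}) Tj\nET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_minimal_pdf()
```

Writing `pdf_bytes` to a file in binary mode should give a document that common viewers can open, because `startxref` points at the `xref` keyword and every table entry points at its object.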
Key object types:
| Type | Example | Purpose |
|---|---|---|
| Boolean | true, false | Logic |
| Number | 42, 3.14 | Coordinates, sizes, font sizes |
| String | (Hello), <48656C6C6F> | Text, binary data |
| Name | /Type, /Catalog | Dictionary keys |
| Array | [0 0 612 792] | Page dimensions (x0, y0, x1, y1) |
| Dictionary | << /Type /Page >> | Object properties |
| Stream | stream…endstream | Image data, page content |
| Null | null | Empty reference |
| Reference | 2 0 R | Pointer to object 2, generation 0 |
Page Composition: Content Streams & Graphics
A PDF page's visual appearance is defined by a content stream—a sequence of operators:
q (save graphics state)
1 0 0 1 50 50 cm (transformation matrix: translate 50,50)
BT (begin text object; text operators are only valid between BT and ET)
/F1 12 Tf (set font F1, size 12)
100 50 Td (move text position to 100, 50)
(Sample Text) Tj (show text string)
ET (end text object)
Q (restore graphics state)
BT (begin text)
/F1 12 Tf (select font)
100 700 Td (text position)
(Line 1) Tj (show text)
T* (next line; moves down by the leading set with TL, which defaults to 0)
(Line 2) Tj
ET (end text)
Common graphics operators (PDF uses postfix notation, so operands come before the operator):
| Syntax | Effect |
|---|---|
| x y m | MoveTo (start new path at x, y) |
| x y l | LineTo (add a line segment to x, y) |
| x1 y1 x2 y2 x3 y3 c | CurveTo (cubic Bézier) |
| h | ClosePath (connect last point to first) |
| S | Stroke (draw path outline) |
| f | Fill (fill path) |
| r g b rg | Set RGB fill color (components 0.0–1.0) |
| c m y k k | Set CMYK fill color (components 0.0–1.0) |
| q | Save graphics state |
| Q | Restore graphics state |
| a b c d e f cm | Concatenate transformation matrix (scale, rotate, translate) |
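Because content streams are postfix, a generator only needs to write the operands followed by the operator name. The sketch below illustrates this with a filled rectangle; the helper names `op` and `filled_rectangle` are hypothetical and not part of any library.

```python
def op(operator: str, *operands: float) -> str:
    """Format one content-stream instruction: operands first, operator last."""
    # Fixed-point formatting avoids exponent notation, which PDF does not accept.
    parts = [f"{value:.4f}".rstrip("0").rstrip(".") for value in operands]
    return " ".join(parts + [operator])


def filled_rectangle(x: float, y: float, w: float, h: float, rgb: tuple) -> str:
    """Return a content-stream fragment that fills a rectangle with an RGB color."""
    lines = [
        op("q"),                 # save graphics state
        op("rg", *rgb),          # set fill color
        op("m", x, y),           # start path at lower-left corner
        op("l", x + w, y),
        op("l", x + w, y + h),
        op("l", x, y + h),
        op("h"),                 # close the path
        op("f"),                 # fill it
        op("Q"),                 # restore graphics state
    ]
    return "\n".join(lines)


fragment = filled_rectangle(100, 600, 200, 50, (1, 0, 0))
```

Wrapping the drawing in `q` and `Q` keeps the color change from leaking into whatever is drawn afterwards.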
Page Structure: MediaBox, CropBox, Bleed & Trim
Every page must define a MediaBox; the other boxes are optional. Coordinates are in points (1/72 inch):
/MediaBox [0 0 612 792] (full page: 8.5×11 inches at 72 points per inch)
/CropBox [18 18 594 774] (region a viewer displays or prints by default)
/BleedBox [9 9 603 783] (trim area plus bleed for production printing)
/TrimBox [18 18 594 774] (final cut size after trimming)
/ArtBox [50 50 562 742] (extent of the meaningful content, as used by design tools)
Typical workflow:
- MediaBox: physical page size (8.5×11")
- CropBox: visible area in a PDF viewer; defaults to the MediaBox when omitted
- BleedBox: extends beyond the TrimBox for print bleed (0.125" = 9 pt)
- TrimBox: final trimmed size (8×10.5" = 576×756 pt)
- ArtBox: working canvas for designer
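The box coordinates follow from simple arithmetic at 72 points per inch. The sketch below assumes an 8×10.5" trim size centered on an 8.5×11" sheet with 0.125" of bleed; the helper names `centered_box` and `grow` are hypothetical.

```python
POINTS_PER_INCH = 72


def centered_box(media_w: float, media_h: float, box_w: float, box_h: float) -> list:
    """Return [x0, y0, x1, y1] for a box centered inside the media area."""
    x0 = (media_w - box_w) / 2
    y0 = (media_h - box_h) / 2
    return [x0, y0, x0 + box_w, y0 + box_h]


def grow(box: list, amount: float) -> list:
    """Expand a box outward by `amount` points on every side."""
    x0, y0, x1, y1 = box
    return [x0 - amount, y0 - amount, x1 + amount, y1 + amount]


media_w = 8.5 * POINTS_PER_INCH   # 612 pt
media_h = 11 * POINTS_PER_INCH    # 792 pt
trim_box = centered_box(media_w, media_h, 8 * POINTS_PER_INCH, 10.5 * POINTS_PER_INCH)
bleed_box = grow(trim_box, 0.125 * POINTS_PER_INCH)
```

With these inputs the trim box sits 18 pt inside the sheet on every side and the bleed box extends 9 pt beyond it, so the bleed box still fits inside the MediaBox.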
Compression: FlateDecode & Image Streams
PDFs compress content via several algorithms:
FlateDecode (zlib/deflate):
stream
x\x9c+P0T0P04R048W04S040050\r\x0c\x1c+\x10O0\x12\x00\x00\x00\x00\x1f\xff
endstream
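Applies LZ77 + Huffman coding (the deflate algorithm) to text and vector data. The ~40% reduction is a rough figure; the actual saving depends on how repetitive the content is.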
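FlateDecode data is an ordinary zlib stream, so Python's standard `zlib` module can produce and read it. The sketch below compresses a repetitive content stream and wraps it in a stream object; the sample content is invented for illustration.

```python
import zlib

# Illustrative content stream: the same text block repeated 50 times.
content = b"BT\n/F1 12 Tf\n100 700 Td\n(Hello, World!) Tj\nET\n" * 50

compressed = zlib.compress(content)

# /Length counts the compressed bytes, not the original ones.
stream_object = (
    b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream"
)

# A reader reverses the filter to recover the original operators.
restored = zlib.decompress(compressed)
```

Highly repetitive input like this compresses far better than typical page content, so the ratio seen here should not be read as representative.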
Image embedding (DCTDecode for JPEG):
stream
[raw JPEG bytes—FFD8FFE0…FFD9]
endstream
Embeds the JPEG file's bytes unchanged (no re-encoding), so the stream is as small as the JPEG itself, typically several times smaller than raw RGB samples.
CCITT Group 4 (1-bit monochrome scan):
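/Filter /CCITTFaxDecode
/DecodeParms << /K -1 /Columns 2550 >>
Encodes bilevel (1-bit) scans such as faxes and scanned documents. /K -1 selects Group 4 encoding and /Columns is the image width in pixels (2550 = 8.5" at 300 dpi). The ~30:1 ratio is indicative for text pages and varies with content.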
Encryption: RC4 & AES
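PDF encryption distinguishes a user password (needed to open the document) from an owner password (needed to change permissions). Two cipher families are used: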
RC4 (legacy, 40/128-bit):
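- RC4 stream cipher with MD5-based key derivation
- 40-bit: weak and easily cracked (a legacy of export restrictions)
- 128-bit: deprecated; RC4 is no longer considered secure, so use it only for compatibility with old readers
AES (AES-128/256, recommended):
- AES in CBC mode with PKCS#5 padding
- AES-128: 128-bit key derived from the password with MD5 (PDF 1.6, /V 4 /R 4)
- AES-256: 256-bit key with SHA-2-based key derivation; standardized in PDF 2.0 (/V 5 /R 6)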
Encryption object:
/Encrypt << /Filter /Standard /V 4 /R 4 /Length 128 /P -44
/CF << /StdCF << /CFM /AESV2 /AuthEvent /DocOpen /Length 16 >> >>
/StmF /StdCF /StrF /StdCF
/O <hex-owner-password-hash> /U <hex-user-password-hash> >>
Access permissions:
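/P is a 32-bit signed integer of bit flags (bit 3 = print, bit 4 = modify, bit 5 = copy, bit 6 = annotate, bit 9 = fill forms, bit 12 = high-quality print). A set bit grants the permission:
- /P -4: all permissions granted
- /P -44: printing, copying, and form filling allowed; modifying content and adding annotations denied
- /P -3904: all user permissions denied (no printing, copying, modifying, or form filling)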
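A /P value is easiest to read by testing its bits one at a time. The sketch below decodes a value into the permissions it grants, using the bit positions defined for the standard security handler (bits are numbered from 1 at the least significant end). The function name `decode_permissions` and the short labels are hypothetical.

```python
# Bit position (1-based) -> permission granted when the bit is set.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract content",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print at high quality",
}


def decode_permissions(p: int) -> list:
    """Return the permissions granted by a signed 32-bit /P value."""
    bits = p & 0xFFFFFFFF  # view the signed value as unsigned 32-bit
    return [
        name
        for bit, name in PERMISSION_BITS.items()
        if bits & (1 << (bit - 1))
    ]


granted = decode_permissions(-44)
```

For -44 the result contains printing and copying but not modification or annotation, since only bits 4 and 6 are clear among the permission bits.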
Digital Signatures & Certificates
PDFs support embedded X.509 certificates for legal signatures:
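/Sig << /Type /Sig /Filter /Adobe.PPKLite /SubFilter /adbe.pkcs7.detached
/Name (John Doe) /Reason (Approval) /Location (NYC)
/M (D:20240101120000Z) /ByteRange [0 <len1> <off2> <len2>] /Contents <hex-PKCS7-data> >>
Signature workflow:
- Signer hashes the file bytes covered by /ByteRange, which is everything except the /Contents value (SHA-256)
- Signs the hash with a private key (RSA-2048)
- Embeds a PKCS#7 structure with signer certificate + timestamp in /Contents
- Recipient recomputes the hash and verifies the signature with the signer's public certificate
- Any change to the signed bytes invalidates the signature
Timestamp Authority (TSA): a trusted timestamp proves the signature was created while the signer's certificate was valid, so it can still be validated after the certificate expires.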
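The hashing step can be sketched with the standard library alone. A signature dictionary carries a /ByteRange array of offset and length pairs that names the bytes covered by the signature, leaving out the /Contents value where the signature itself is stored. The function name `byte_range_digest` and the toy data below are hypothetical; real verification also needs a PKCS#7 library to check the signed digest and certificate chain.

```python
import hashlib


def byte_range_digest(pdf_bytes: bytes, byte_range: list) -> bytes:
    """SHA-256 over the segments named by /ByteRange [off1 len1 off2 len2 ...]."""
    digest = hashlib.sha256()
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        digest.update(pdf_bytes[offset:offset + length])
    return digest.digest()


# Toy stand-in for a file: 4 signed bytes, a 10-byte /Contents hole, 4 signed bytes.
data = b"AAAA<deadbeef>BBBB"
byte_range = [0, 4, 14, 4]
original = byte_range_digest(data, byte_range)

# Changing the hole leaves the digest unchanged; changing signed bytes does not.
same = byte_range_digest(b"AAAA<00000000>BBBB", byte_range)
tampered = byte_range_digest(b"AAAX<deadbeef>BBBB", byte_range)
```

This is why the signature can be written into the file after hashing: the /Contents placeholder is excluded from the digest, while every other byte is covered.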
Form Fields & AcroForm
Interactive forms embed field definitions:
/AcroForm << /Fields [7 0 R 8 0 R] /SigFlags 3 >>
7 0 obj
<< /Type /Annot /Subtype /Widget /FT /Tx /Rect [50 700 200 720]
/V (Current Value) /DV (Default Value)
/DA (0 0 0 rg /F1 12 Tf) >>
endobj
Field types:
| Type | Subtypes | Use |
|---|---|---|
| /Tx | Text | Single-line text input |
| /Ch | Combo, List | Dropdown selection |
| /Btn | Push, Check, Radio | Buttons |
| /Sig | (none) | Digital signature |
Metadata & Accessibility (Tagged PDF)
Document Info Dictionary:
/Info << /Title (Report Q4 2024) /Author (Jane Smith)
/Subject (Financial Results) /Keywords (revenue, profit)
/CreationDate (D:20240101120000Z) >>
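PDF date strings such as the /CreationDate value above use the form D:YYYYMMDDHHmmSS followed by a time zone marker. The sketch below parses only the UTC form ending in Z that appears in this example; the function name `parse_pdf_date` is hypothetical, and offsets such as +05'30' are deliberately not handled.

```python
from datetime import datetime, timezone


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date of the form D:YYYYMMDDHHmmSSZ (UTC only)."""
    if not (value.startswith("D:") and value.endswith("Z")):
        raise ValueError(f"unsupported PDF date format: {value!r}")
    # Strip the "D:" prefix and the trailing "Z" before parsing the digits.
    parsed = datetime.strptime(value[2:-1], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


created = parse_pdf_date("D:20240101120000Z")
```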
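Tagged PDF (the basis of PDF/UA, Universal Accessibility):
/StructTreeRoot 10 0 R (catalog entry; object numbers here are illustrative)
10 0 obj
<< /Type /StructTreeRoot /K [11 0 R] >>
endobj
11 0 obj
<< /Type /StructElem /S /P /P 10 0 R /Pg 3 0 R >> (/S /P is the standard paragraph type; the second /P is the parent reference)
endobj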
Enables screen readers to navigate document structure (headings, lists, tables, alt text for images).
Common Rendering Engines
| Engine | Purpose | Speed |
|---|---|---|
| Ghostscript | Server-side rasterization (PDF→PNG) | Medium (C-based) |
| PDFium | Google Chrome's PDF viewer | Fast (C++, incremental rendering) |
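| MuPDF | Lightweight renderer, common on iOS/Android | Very fast (memory-efficient) |
| iText 7 | Java/C# PDF creation and manipulation (a library, not a renderer) | Medium (JVM or .NET) |
| pypdf (formerly PyPDF2; PyPDF4 is an unmaintained fork) | Python text extraction and manipulation (a library, not a renderer) | Slow (pure Python) |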
Practical Security & Compliance
PDF/A (archival):
- Embedded fonts, metadata, color profiles
- No encryption, external references, or JavaScript
- Suitable for long-term legal storage (20+ years)
PDF/UA (accessibility):
- Tagged structure, alt text, logical tab order
- Named destinations for bookmarks
- Suitable for government compliance (Section 508, EN 301 549)
PDF/X (print exchange):
- CMYK colors, embedded fonts, spot colors
- No transparency or layers in PDF/X-1a and PDF/X-3 (PDF/X-4 permits both); no JavaScript
- Suitable for print production RIPs
PDF vs. Alternatives
| Format | Encoding | Editability | Encryption | Size |
|---|---|---|---|---|
| PDF | Complex object model | Limited (locked) | ✅ (AES-256) | 200 KB (doc) |
| DOCX | ZIP + XML | ✅ (full) | ✅ (password-based, in Word) | 150 KB |
| PostScript | Stack-based operators | ❌ (source only) | ❌ | 500 KB |
| TIFF | Image raster | ❌ (raster) | ❌ | 5 MB (scanned) |
| HTML | Markup + CSS | ✅ (full) | ❌ | 50 KB |
PDF dominates for distribution due to its rendering consistency, encryption, and signature support. For collaborative editing, DOCX is the better fit. For archival, the PDF/A profile is the appropriate choice.
Creating & Manipulating PDFs Programmatically
Ghostscript (PDF→PNG rasterization):
gs -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=png16m -r300 \
-sOutputFile=page-%d.png input.pdf
Python (pypdf):
from pypdf import PdfReader
reader = PdfReader('input.pdf')
for page in reader.pages:
    # extract_text() returns the page's text as a string
    print(page.extract_text())
PDFtk (merge, split, stamp):
pdftk A=input1.pdf B=input2.pdf cat A B output merged.pdf
pdftk input.pdf burst output page-%d.pdf
iText 7 (Java):
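import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument doc = new PdfDocument(writer);
Document document = new Document(doc);
document.add(new Paragraph("Hello, World!"));
// Closing the layout Document also closes the underlying PdfDocument and writer.
document.close();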
PDF's longevity, security, and interoperability make it indispensable for business, legal, and government workflows.