PDF or DOCX — which format should I send for review?

Send PDF when the document is final and the layout must be preserved exactly (contracts, invoices, certificates). Send DOCX when reviewers need to edit, comment, or track changes. Many teams send both: PDF as the canonical version and DOCX for editable feedback. PDF/A is the right pick for legal archival (ISO 19005).

How do I extract just the text from a PDF without OCR?

If the PDF contains real text (not scanned images), `pdftotext` from poppler-utils or the PDF to TXT converter (/convert/pdf-to-txt) works in seconds. If the PDF is a scanned image, you need OCR; Tesseract is the open-source standard. KaijuConverter's PDF tools auto-detect text-vs-image PDFs and route accordingly.

HTML Format: The Complete Guide to the Web's Document Language

By Pablo Cirre • Updated May 7, 2026

What Is HTML as a File Format?

HTML (HyperText Markup Language) is both the language of the web and a document file format. An .html or .htm file is a plain-text document using XML-like tag syntax that browsers render into visual pages, and that converters parse to extract content, structure, and metadata.

While most people think of HTML as "websites," HTML files are also used as:

Self-contained documents: single-file reports, e-books (EPUB packages XHTML/HTML content files in a ZIP container), email newsletters
Intermediate conversion format: many Word, PDF, and Markdown tools pass through HTML
Documentation formats: Javadoc, Doxygen, Sphinx, and MkDocs all output HTML
Application output: Jupyter Notebook export, Confluence pages, data reports
MHTML (.mht): browser-saved web pages that bundle HTML + CSS + images into one file

HTML Document Structure

A well-formed modern HTML5 document:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="Page description for search engines">
    <meta name="author" content="Jane Smith">
    <title>Document Title</title>
    <link rel="stylesheet" href="styles.css">
    <link rel="icon" type="image/svg+xml" href="/favicon.svg">
</head>
<body>
    <header>
        <nav aria-label="Main navigation">
            <a href="/">Home</a>
        </nav>
    </header>
    <main>
        <article>
            <h1>Main Heading</h1>
            <p>First paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
        </article>
    </main>
    <footer>
        <p>&copy; 2025 Author Name</p>
    </footer>
    <script src="app.js" defer></script>
</body>
</html>

The DOCTYPE Declaration

<!DOCTYPE html> is not an HTML tag. It is a document type declaration that tells browsers to render the page in standards mode rather than quirks mode. HTML5 has only one DOCTYPE. Legacy documents may have verbose DOCTYPEs referencing DTD URLs:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">

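As a rough illustration of how a converter might tell the two DOCTYPE styles apart, the sketch below uses Python's standard `html.parser` module. The names `DoctypeSniffer` and `is_html5_doctype` are hypothetical, chosen for this example, and the check only recognizes the short `<!DOCTYPE html>` form.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Records the first DOCTYPE declaration seen while parsing."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # html.parser passes the declaration without the surrounding "<!" and ">"
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def is_html5_doctype(markup):
    """Return True if the first DOCTYPE found is the short HTML5 form."""
    sniffer = DoctypeSniffer()
    sniffer.feed(markup)
    sniffer.close()
    if sniffer.doctype is None:
        return False
    # Ignore case and extra whitespace: "DOCTYPE html" -> ["doctype", "html"]
    return sniffer.doctype.lower().split() == ["doctype", "html"]


modern = "<!DOCTYPE html><html><head><title>t</title></head></html>"
legacy = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
          '"http://www.w3.org/TR/html4/loose.dtd"><html></html>')

print(is_html5_doctype(modern))
print(is_html5_doctype(legacy))
```

A document with no DOCTYPE at all also returns False here, which matches the case where a browser would fall back to quirks mode.
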
The `<head>` Section

The <head> contains document metadata — information about the page, not displayed content:

Element	Purpose
`<title>`	Browser tab title, bookmark name, search result title
`<meta charset>`	Character encoding (always UTF-8 for modern documents)
`<meta name="description">`	Search engine snippet
`<meta name="viewport">`	Responsive design scaling
`<link rel="stylesheet">`	External CSS reference
`<link rel="canonical">`	Preferred URL (SEO deduplication)
`<meta property="og:*">`	Open Graph metadata (social media previews)
`<script>`	JavaScript (or external script reference)
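
To make the table above concrete, here is a minimal sketch of reading head metadata with Python's standard `html.parser`. The class name `HeadMetadata`, the sample markup, and the `https://example.com/page` URL are assumptions made for this illustration; a production converter would likely use a more forgiving parser.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collects <title>, named <meta> tags, and the canonical <link>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # e.g. {"description": "...", "og:title": "..."}
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard metadata uses name=..., Open Graph uses property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel:
                self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


sample = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="description" content="Page description for search engines">
    <meta property="og:title" content="Document Title">
    <title>Document Title</title>
    <link rel="canonical" href="https://example.com/page">
</head>
<body><p>Body text</p></body>
</html>"""

parser = HeadMetadata()
parser.feed(sample)
parser.close()
print(parser.title)
print(parser.meta)
print(parser.canonical)
```

The `<meta charset>` tag is skipped on purpose because it has neither a `name` nor a `property` attribute.
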

Semantic HTML5 Elements

HTML5 introduced semantic elements that describe content structure:

Element	Meaning
`<header>`	Introductory content for a section or the page
`<nav>`	Navigation links
`<main>`	Primary page content (one per page)
`<article>`	Self-contained, independently distributable content
`<section>`	Thematic grouping of content
`<aside>`	Tangential content (sidebar, callout)
`<footer>`	Closing content for a section or the page
`<figure>`	Self-contained illustration, code block, etc.
`<figcaption>`	Caption for `<figure>`
`<time datetime="2025-03-15">`	Machine-readable date/time
`<address>`	Contact information for the nearest `<article>` or `<body>`

Heading Hierarchy

HTML has six heading levels: <h1> through <h6>. The <h1> should be used once per page (or once per <article>) and establishes the document's primary topic. Headings define an outline that screen readers and converters use to understand document structure — not just visual size.
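
The outline idea can be sketched in a few lines of Python. The class `HeadingOutline` and the helper `skipped_levels` are hypothetical names for this example; the helper flags headings that jump more than one level deeper than the heading before them, which is one simple way to spot a broken hierarchy.

```python
import re
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects (level, text) pairs for every <h1>..<h6> in document order."""

    _HEADING = re.compile(r"h([1-6])$")

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None      # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        match = self._HEADING.match(tag)
        if match:
            self._level = int(match.group(1))
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Join text from nested inline tags and normalize whitespace
            text = " ".join("".join(self._buffer).split())
            self.headings.append((self._level, text))
            self._level = None


def skipped_levels(headings):
    """Return headings that sit more than one level below the previous heading."""
    problems = []
    previous = 0
    for level, text in headings:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems


sample = "<h1>Guide</h1><h2>Setup</h2><h4>Details</h4><h2>Usage <em>notes</em></h2>"
outline = HeadingOutline()
outline.feed(sample)
outline.close()

for level, text in outline.headings:
    print("  " * (level - 1) + text)
print(skipped_levels(outline.headings))
```

Because `previous` starts at 0, a document whose first heading is not an `<h1>` is also reported.
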

HTML Entities and Encoding

Special characters must be escaped in HTML:

Character	Entity	Numeric
`<`	`&lt;`	`&#60;`
`>`	`&gt;`	`&#62;`
`&`	`&amp;`	`&#38;`
`"`	`&quot;`	`&#34;`
`'`	`&apos;`	`&#39;`
(non-breaking space)	`&nbsp;`	`&#160;`
`©`	`&copy;`	`&#169;`
`—` (em dash)	`&mdash;`	`&#8212;`

With UTF-8 charset declared, most Unicode characters can be used directly without entities.
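
For programmatic escaping, Python's standard `html` module covers both directions. The snippet below is a small sketch; the sample strings are made up for illustration.

```python
import html

raw = '<a href="x">Tom & Jerry</a>'

# escape() replaces &, < and >, and by default also both kinds of quotes
escaped = html.escape(raw)
print(escaped)

# quote=False leaves quotes alone, which is enough for text between tags
text_only = html.escape(raw, quote=False)
print(text_only)

# unescape() resolves named, decimal, and hexadecimal character references
decoded = html.unescape("&copy; 2025 &mdash; caf&eacute;&nbsp;&#8212;&#x2014;")
print(decoded)
```

Escaping with quotes enabled is the safer default when the value may end up inside an attribute.
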

Inline vs External Resources

CSS Approaches

<!-- Inline styles (worst practice — not cacheable, not reusable) -->
<p style="color: red; font-size: 16px;">Red text</p>

<!-- Internal stylesheet (single-page documents, email) -->
<style>
  p { color: #333; line-height: 1.6; }
</style>

<!-- External stylesheet (best practice for websites) -->
<link rel="stylesheet" href="styles.css">

JavaScript Loading

<!-- Blocking (legacy — bad for performance) -->
<script src="app.js"></script>

<!-- Defer: execute after HTML parsing (recommended for most scripts) -->
<script src="app.js" defer></script>

<!-- Async: execute as soon as downloaded (for independent scripts) -->
<script src="analytics.js" async></script>

<!-- Module (supports import/export, deferred by default) -->
<script type="module" src="app.mjs"></script>
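
When auditing an existing page, it can help to list which loading mode each script uses. The sketch below does this with Python's standard `html.parser`; the class name `ScriptAudit` and the four mode labels are assumptions for this example, and a module script that also carries `async` is simply reported as `async`.

```python
from html.parser import HTMLParser


class ScriptAudit(HTMLParser):
    """Classifies each external <script> by how the browser will load it."""

    def __init__(self):
        super().__init__()
        self.scripts = []   # list of (src, mode) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attrs = dict(attrs)   # boolean attributes such as defer map to None
        src = attrs.get("src")
        if src is None:
            return            # inline script, no loading mode to report
        if "async" in attrs:
            mode = "async"
        elif (attrs.get("type") or "").lower() == "module":
            mode = "module"
        elif "defer" in attrs:
            mode = "defer"
        else:
            mode = "blocking"
        self.scripts.append((src, mode))


sample = """
<script src="app.js"></script>
<script src="app.js" defer></script>
<script src="analytics.js" async></script>
<script type="module" src="app.mjs"></script>
<script>console.log('inline')</script>
"""

audit = ScriptAudit()
audit.feed(sample)
audit.close()
for src, mode in audit.scripts:
    print(f"{mode:9} {src}")
```
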

Single-File HTML Documents

For self-contained HTML files (documentation, offline reports), resources can be embedded:

<!-- Inline SVG -->
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="blue"/>
</svg>

<!-- Inline image (Base64 data URL) -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg==" alt="1px red dot">

<!-- Inline CSS + JS -->
<style>body{font-family:sans-serif}</style>
<script>console.log('loaded')</script>

MHTML (.mht) takes this further by bundling all external resources (images, CSS, fonts) into a MIME-encoded single file — the format used by "Save Page As" in Internet Explorer and Chrome.
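
The Base64 data URL technique used for the inline image in this section can be generated programmatically. Below is a minimal Python sketch; `to_data_url` and `img_tag` are hypothetical helper names, and the sample bytes are only the PNG file signature, not a complete image.

```python
import base64
import mimetypes
from html import escape


def to_data_url(data, filename):
    """Encode raw bytes as a data: URL, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"   # generic fallback for unknown types
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def img_tag(data, filename, alt):
    """Build an <img> element whose src embeds the image bytes directly."""
    return f'<img src="{to_data_url(data, filename)}" alt="{escape(alt)}">'


# Placeholder bytes: the 8-byte PNG signature, used only to show the encoding
sample_bytes = b"\x89PNG\r\n\x1a\n"
url = to_data_url(sample_bytes, "dot.png")
print(url)
print(img_tag(sample_bytes, "dot.png", 'A "dot"'))
```

Base64 encoding grows the payload by roughly a third, so this approach suits small assets better than large images.
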

Converting HTML Files

HTML to PDF

# wkhtmltopdf: renders HTML with an older Qt WebKit engine, so modern CSS (flexbox, grid) may not render correctly
wkhtmltopdf --page-size A4 --margin-top 20mm input.html output.pdf

# Puppeteer (headless Chrome) — pixel-perfect rendering
node -e "
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('file:///path/to/input.html', {waitUntil: 'networkidle0'});
  await page.pdf({path: 'output.pdf', format: 'A4', printBackground: true});
  await browser.close();
})();
"

# Pandoc (simpler, less CSS support; requires a PDF engine such as LaTeX)
pandoc input.html -o output.pdf

HTML to DOCX

# Pandoc — excellent HTML to DOCX conversion
pandoc input.html -o output.docx

# With custom reference document for styling
pandoc input.html --reference-doc=template.docx -o output.docx
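
When the conversion runs from a script rather than a terminal, the same commands can be assembled in Python. This is a sketch under the assumption that `pandoc` is installed and on `PATH`; `pandoc_command` and `convert` are hypothetical helper names, and only the `-o` and `--reference-doc` options shown above are used.

```python
import shutil
import subprocess


def pandoc_command(source, target, reference_doc=None):
    """Build the argument list for a pandoc conversion without running it."""
    command = ["pandoc", source, "-o", target]
    if reference_doc is not None:
        command.append(f"--reference-doc={reference_doc}")
    return command


def convert(source, target, reference_doc=None):
    """Run pandoc if it is installed; return True on success, False otherwise."""
    if shutil.which("pandoc") is None:
        return False          # pandoc is not on PATH
    result = subprocess.run(pandoc_command(source, target, reference_doc))
    return result.returncode == 0


print(pandoc_command("input.html", "output.docx", "template.docx"))
```

Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces.
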

HTML to Markdown

# Pandoc
pandoc input.html -t markdown -o output.md

# Python: html2text
pip install html2text
python -c "
import html2text
h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks in the Markdown output
with open('input.html', encoding='utf-8') as f:
    print(h.handle(f.read()))
" > output.md

HTML to Plain Text

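from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring script, style, and head content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip = {'script', 'style', 'head'}
        self._current_skip = 0  # depth of skipped elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in self._skip:
            self._current_skip += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags pushing the counter below zero
        if tag in self._skip and self._current_skip > 0:
            self._current_skip -= 1

    def handle_data(self, data):
        if self._current_skip == 0:
            self.text.append(data)

parser = TextExtractor()
with open('input.html', encoding='utf-8') as f:
    parser.feed(f.read())
parser.close()
# Collapse runs of whitespace left over from indentation and line breaks
print(' '.join(' '.join(parser.text).split()))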

BeautifulSoup (Python)

from bs4 import BeautifulSoup

with open('input.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Extract title (soup.title is None when the document has no <title>)
if soup.title:
    print(soup.title.string)

# Extract all paragraphs
for p in soup.find_all('p'):
    print(p.get_text())

# Extract all links
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text())

# Get metadata
desc = soup.find('meta', attrs={'name': 'description'})
if desc:
    print(desc.get('content'))

HTML vs Other Document Formats

Feature	HTML	DOCX	PDF	Markdown
Human-readable source	Yes (with effort)	No (XML in ZIP)	No (binary/compressed)	Yes (clean)
Web browser display	Native	No (needs plugin)	Yes (modern browsers)	No
Version control friendly	Yes (text)	Difficult	Difficult	Excellent
Print fidelity	Variable	Good	Perfect	Variable
Executable code	Yes (JS)	No (macros require .docm)	Limited	No
Search engine indexing	Native	Via extraction	Via extraction	Via extraction
Accessibility	Excellent (ARIA)	Good	Acceptable	N/A
Best for	Web, interactive docs	Office documents	Final print/delivery	Developer docs, notes

HTML is rendered by web browsers, most email clients, EPUB e-readers, and web applications, which makes it the closest thing to a universal document format that computing has produced.

Frequently Asked Questions

Four main approaches convert HTML to PDF: (1) Browser print: open the HTML file in Chrome or Firefox, press Ctrl+P, and select "Save as PDF". Best for simple pages. (2) wkhtmltopdf: a command-line tool that renders HTML with an older WebKit engine, so modern CSS may not render correctly: `wkhtmltopdf input.html output.pdf`. (3) Puppeteer: headless Chrome automation for pixel-perfect PDFs, including pages with dynamic content. (4) Pandoc: `pandoc input.html -o output.pdf` (requires LaTeX, less CSS support). For web pages with complex CSS and JavaScript, browser print or Puppeteer gives the best results.

.html and .htm are the same format: both are plain-text HTML files that browsers and servers process identically. The .htm extension is a legacy of DOS and Windows 3.x, which only supported 8.3 filenames (up to 8 characters plus a 3-character extension). Modern systems use .html. Some Windows servers historically served .htm by default. There is no technical difference; you can rename a .htm file to .html and it works the same way.
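
One quick way to see that both extensions are treated the same is to ask Python's standard `mimetypes` module which media type it associates with each. The filenames below are placeholders for this sketch.

```python
import mimetypes

# Both extensions are expected to map to the same media type
for name in ("page.html", "page.htm"):
    media_type, _encoding = mimetypes.guess_type(name)
    print(name, media_type)
```
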

Converting between similar formats (DOCX ↔ ODT, DOCX → PDF) is generally safe. Round-tripping documents that rely on format-specific features (Word macros, complex tables, footnotes) often loses fidelity. Embedded fonts survive only if both source and target support font embedding (PDF yes, DOCX yes, plain HTML no). Always preview the result before deleting the original.

Several approaches extract plain text from HTML: (1) Python with BeautifulSoup: `soup.get_text()` strips all tags and returns plain text. (2) Python html.parser: build a custom TextExtractor that skips script/style tags. (3) Command line: `lynx -dump input.html` outputs browser-rendered text. (4) Pandoc: `pandoc input.html -t plain -o output.txt`. Always strip `<script>`, `<style>`, and `<head>` content before extracting text, as these contain code/metadata rather than document text.

MHTML (MIME HTML) is a single-file archive that bundles an HTML page with all its resources (images, CSS, fonts, JavaScript) encoded as MIME multipart content. Regular HTML files reference external resources with URLs; MHTML embeds them as Base64-encoded sections within the single file. Chrome, Edge, and IE have offered "Save as Web Page, Single File" which creates MHTML. Disadvantage: MHTML files are not human-readable and can be much larger than the original page. For self-contained HTML archives, a directory with HTML + assets is generally preferable.
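
The multipart structure described here can be illustrated with Python's standard `email` package. This is only a sketch of the MIME layout, not a reimplementation of what browsers write: the markup, the `dot.png` name, and the placeholder bytes are assumptions, and browser-generated MHTML files carry additional headers, so the result is not guaranteed to open in a browser.

```python
from email.message import EmailMessage

html_part = '<html><body><img src="dot.png" alt="dot"></body></html>'
image_bytes = b"\x89PNG\r\n\x1a\n"   # placeholder bytes, not a complete PNG

archive = EmailMessage()
archive["Subject"] = "Saved page"
archive.set_content(html_part, subtype="html")

# Adding a related part turns the message into multipart/related
archive.add_related(image_bytes, maintype="image", subtype="png")
image_part = list(archive.iter_parts())[1]
# Content-Location names the resource so the src in the HTML can refer to it
image_part["Content-Location"] = "dot.png"

print(archive.get_content_type())
for part in archive.iter_parts():
    print(part.get_content_type(), part["Content-Transfer-Encoding"])
```

Printing `archive.as_string()` shows the boundary markers and the Base64-encoded image section in full.
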

Light edits (annotations, signatures, form fields) are fine in any PDF reader. Structural edits (changing paragraphs, replacing images) are awkward — PDF is a presentation format, not an editing format. The robust workflow is: keep the source DOCX/MD/HTML as the master, regenerate the PDF when changes are needed. Tools that "edit PDFs" reverse-engineer the layout and frequently break it.
