What Is HTML as a File Format?
HTML (HyperText Markup Language) is both the language of the web and a document file format. An .html or .htm file is a plain-text document whose tag syntax resembles XML but is parsed by more forgiving rules. Browsers render it into visual pages, and converters parse it to extract content, structure, and metadata.
While most people think of HTML as "websites," HTML files are also used as:
- Self-contained documents: single-file reports, e-books (EPUB packages XHTML files in a ZIP container), email newsletters
- Intermediate conversion format: many Word, PDF, and Markdown tools convert through HTML
- Documentation formats: Javadoc, Doxygen, Sphinx, and MkDocs all output HTML
- Application output: Jupyter Notebook export, Confluence pages, data reports
- MHTML (.mht): browser-saved web pages that bundle HTML + CSS + images into one file
HTML Document Structure
A well-formed modern HTML5 document:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Page description for search engines">
<meta name="author" content="Jane Smith">
<title>Document Title</title>
<link rel="stylesheet" href="styles.css">
<link rel="icon" type="image/svg+xml" href="/favicon.svg">
</head>
<body>
<header>
<nav aria-label="Main navigation">
<a href="/">Home</a>
</nav>
</header>
<main>
<article>
<h1>Main Heading</h1>
<p>First paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
</article>
</main>
<footer>
<p>© 2025 Author Name</p>
</footer>
<script src="app.js" defer></script>
</body>
</html>
The DOCTYPE Declaration
<!DOCTYPE html> is not an HTML tag. It is a document type declaration that tells browsers to render the page in standards mode rather than quirks mode. HTML5 has a single short DOCTYPE. Legacy documents may have verbose DOCTYPEs that reference DTD URLs:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
The <head> Section
The <head> contains document metadata — information about the page, not displayed content:
| Element | Purpose |
|---|---|
| <title> | Browser tab title, bookmark name, search result title |
| <meta charset> | Character encoding (always UTF-8 for modern documents) |
| <meta name="description"> | Search engine snippet |
| <meta name="viewport"> | Responsive design scaling |
| <link rel="stylesheet"> | External CSS reference |
| <link rel="canonical"> | Preferred URL (SEO deduplication) |
| <meta property="og:*"> | Open Graph metadata (social media previews) |
| <script> | JavaScript (or external script reference) |
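The elements in the table above can be read programmatically. The following is a minimal sketch using only Python's standard-library html.parser; the class name HeadMetadata, the og:title property, and the example.com URL are illustrative assumptions, and the rel check is simplified to an exact match on "canonical".

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collects <title>, <meta>, and <link rel="canonical"> values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.meta["title"] = ""
        elif tag == "meta":
            if "charset" in attrs:
                self.meta["charset"] = attrs["charset"]
            # name= covers description/author; property= covers Open Graph
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.meta[key] = attrs["content"]
        elif tag == "link" and attrs.get("rel") == "canonical":
            # Simplification: rel may hold several tokens in real documents
            self.meta["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data


# Hypothetical sample head, loosely based on the document structure shown earlier
sample_head = """<head>
<meta charset="UTF-8">
<meta name="description" content="Page description for search engines">
<meta property="og:title" content="Document Title">
<title>Document Title</title>
<link rel="canonical" href="https://example.com/page">
</head>"""

head_parser = HeadMetadata()
head_parser.feed(sample_head)
head_parser.close()
for key, value in head_parser.meta.items():
    print(f"{key}: {value}")
```

Keeping everything in one dictionary keyed by the name or property attribute is a convenience for this sketch; a real converter would likely keep Open Graph and standard metadata separate.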
Semantic HTML5 Elements
HTML5 introduced semantic elements that describe content structure:
| Element | Meaning |
|---|---|
| <header> | Introductory content for a section or the page |
| <nav> | Navigation links |
| <main> | Primary page content (one per page) |
| <article> | Self-contained, independently distributable content |
| <section> | Thematic grouping of content |
| <aside> | Tangential content (sidebar, callout) |
| <footer> | Closing content for a section or the page |
| <figure> | Self-contained illustration, code block, etc. |
| <figcaption> | Caption for <figure> |
| <time datetime="2025-03-15"> | Machine-readable date/time |
| <address> | Contact information for the nearest <article> or <body> |
Heading Hierarchy
HTML has six heading levels: <h1> through <h6>. The <h1> should be used once per page (or once per <article>) and establishes the document's primary topic. Headings define an outline that screen readers and converters use to understand document structure — not just visual size.
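To make the idea of a heading outline concrete, here is a small sketch of how a converter might recover it with Python's standard-library html.parser. The names OutlineBuilder and h1_count and the sample markup are assumptions made for illustration, not part of any particular tool.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineBuilder(HTMLParser):
    """Collects (level, text) pairs for every heading in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently being read
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Join text split by inline tags and normalize whitespace
            text = " ".join("".join(self._parts).split())
            self.outline.append((self._level, text))
            self._level = None


def h1_count(outline):
    """Number of top-level headings; more than one may deserve a review."""
    return sum(1 for level, _ in outline if level == 1)


# Hypothetical sample markup
sample = ("<h1>Guide</h1><p>Intro</p><h2>Setup</h2>"
          "<h3>Install <em>now</em></h3><h2>Usage</h2>")
builder = OutlineBuilder()
builder.feed(sample)
builder.close()

for level, text in builder.outline:
    print("  " * (level - 1) + text)
print("h1 count:", h1_count(builder.outline))
```

The outline depends only on the heading tags, not on any CSS, which is why visually styled text that is not marked up as a heading never appears in it.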
HTML Entities and Encoding
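Characters with syntactic meaning in HTML (<, >, &, and quotes inside attribute values) must be escaped. Other characters have optional named and numeric references: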
| Character | Entity | Numeric |
|---|---|---|
| < | &lt; | &#60; |
| > | &gt; | &#62; |
| & | &amp; | &#38; |
| " | &quot; | &#34; |
| ' | &apos; | &#39; |
| (non-breaking space) | &nbsp; | &#160; |
| © | &copy; | &#169; |
| — (em dash) | &mdash; | &#8212; |
With UTF-8 charset declared, most Unicode characters can be used directly without entities.
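When generating or parsing HTML from a script, escaping rarely needs to be done by hand. As a brief illustration, Python's standard-library html module provides escape and unescape; the sample strings below are made up for the example.

```python
import html

raw = 'Tom & Jerry <"quoted"> it\'s'

# quote=True (the default) also escapes double and single quotes,
# which matters when the text is placed inside an attribute value
escaped = html.escape(raw)
print(escaped)

# unescape understands named, decimal, and hexadecimal references
print(html.unescape(escaped) == raw)
print(html.unescape("&copy; 2025 &mdash; caf&eacute;"))
```

Note that html.escape emits the numeric reference &#x27; for the apostrophe rather than &apos;, and html.unescape accepts both forms.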
Inline vs External Resources
CSS Approaches
<!-- Inline styles (worst practice — not cacheable, not reusable) -->
<p style="color: red; font-size: 16px;">Red text</p>
<!-- Internal stylesheet (single-page documents, email) -->
<style>
p { color: #333; line-height: 1.6; }
</style>
<!-- External stylesheet (best practice for websites) -->
<link rel="stylesheet" href="styles.css">
JavaScript Loading
<!-- Blocking (legacy — bad for performance) -->
<script src="app.js"></script>
<!-- Defer: execute after HTML parsing (recommended for most scripts) -->
<script src="app.js" defer></script>
<!-- Async: execute as soon as downloaded (for independent scripts) -->
<script src="analytics.js" async></script>
<!-- Module (supports import/export, always deferred) -->
<script type="module" src="app.mjs"></script>
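One way to audit which loading mode each script on a page uses is to scan the markup and classify every script tag. The sketch below uses Python's standard-library html.parser; the ScriptAudit class and loading_mode function are hypothetical names, and the rules are simplified (for instance, non-JavaScript type values such as JSON data blocks are not treated specially).

```python
from html.parser import HTMLParser


def loading_mode(attrs):
    """Classify a <script> tag from its attribute dict."""
    is_module = (attrs.get("type") or "").lower() == "module"
    has_src = "src" in attrs
    # async only affects external classic scripts and module scripts
    if "async" in attrs and (has_src or is_module):
        return "async"
    if is_module:
        return "module"  # deferred by default
    # defer is ignored on inline classic scripts
    if "defer" in attrs and has_src:
        return "defer"
    return "blocking"


class ScriptAudit(HTMLParser):
    """Records (src, mode) for every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            attrs = dict(attrs)  # boolean attributes map to None
            self.scripts.append((attrs.get("src"), loading_mode(attrs)))


page = """
<script src="app.js"></script>
<script src="app.js" defer></script>
<script src="analytics.js" async></script>
<script type="module" src="app.mjs"></script>
<script defer>console.log('inline')</script>
"""

audit = ScriptAudit()
audit.feed(page)
audit.close()
for src, mode in audit.scripts:
    print(f"{src or '(inline)'}: {mode}")
```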
Single-File HTML Documents
For self-contained HTML files (documentation, offline reports), resources can be embedded:
<!-- Inline SVG -->
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
<circle cx="50" cy="50" r="40" fill="blue"/>
</svg>
<!-- Inline image (Base64 data URL) -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg==" alt="1px red dot">
<!-- Inline CSS + JS -->
<style>body{font-family:sans-serif}</style>
<script>console.log('loaded')</script>
MHTML (.mht or .mhtml) takes this further by bundling all external resources (images, CSS, fonts) into a single MIME-encoded file. It is the format behind the single-file option of "Save Page As" in Internet Explorer and Chrome.
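Base64 data URLs like the one in the inline image example are usually generated rather than typed. A minimal sketch of that step in Python follows; the helper names to_data_url and img_tag are assumptions for illustration, and the 8-byte PNG signature stands in for real image bytes that would normally be read from a file.

```python
import base64
import html
import mimetypes


def to_data_url(data, filename):
    """Encode raw bytes as a data: URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def img_tag(data, filename, alt):
    """Build an <img> element with the image embedded in its src attribute."""
    src = to_data_url(data, filename)
    return f'<img src="{src}" alt="{html.escape(alt)}">'


# Placeholder bytes: the PNG file signature, not a complete image
png_signature = b"\x89PNG\r\n\x1a\n"
print(img_tag(png_signature, "dot.png", "placeholder"))
```

Base64 encoding grows the payload by roughly a third, so this approach suits small images and icons better than large photographs.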
Converting HTML Files
HTML to PDF
# wkhtmltopdf: renders HTML with an older Qt WebKit engine, so newer CSS (grid, some flexbox) may not render correctly
wkhtmltopdf --page-size A4 --margin-top 20mm input.html output.pdf
# Puppeteer (headless Chrome) — pixel-perfect rendering
node -e "
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('file:///path/to/input.html', {waitUntil: 'networkidle0'});
await page.pdf({path: 'output.pdf', format: 'A4', printBackground: true});
await browser.close();
})();
"
# Pandoc (simpler, less CSS support; requires a PDF engine such as pdflatex to be installed)
pandoc input.html -o output.pdf
HTML to DOCX
# Pandoc — excellent HTML to DOCX conversion
pandoc input.html -o output.docx
# With custom reference document for styling
pandoc input.html --reference-doc=template.docx -o output.docx
HTML to Markdown
# Pandoc
pandoc input.html -t markdown -o output.md
# Python: html2text
pip install html2text
python -c "
import html2text
with open('input.html', encoding='utf-8') as f:
    h = html2text.HTML2Text()
    h.ignore_links = False  # keep hyperlinks in the Markdown output
    print(h.handle(f.read()))
" > output.md
HTML to Plain Text
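from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script>, <style>, and <head> content."""

    SKIP_TAGS = {'script', 'style', 'head'}

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags driving the counter negative
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Drop whitespace-only chunks so the joined output stays tidy
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())


parser = TextExtractor()
with open('input.html', encoding='utf-8') as f:
    parser.feed(f.read())
parser.close()  # flush any buffered trailing text
print(' '.join(parser.text))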
BeautifulSoup (Python)
from bs4 import BeautifulSoup

with open('input.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Extract title (soup.title is None when the document has no <title>)
if soup.title:
    print(soup.title.string)

# Extract all paragraphs
for p in soup.find_all('p'):
    print(p.get_text())

# Extract all links
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text())

# Get metadata
desc = soup.find('meta', attrs={'name': 'description'})
if desc:
    print(desc.get('content'))
HTML vs Other Document Formats
| Feature | HTML | DOCX | PDF | Markdown |
|---|---|---|---|---|
| Human-readable source | Yes (with effort) | No (XML in ZIP) | No (binary/compressed) | Yes (clean) |
| Web browser display | Native | No (needs plugin) | Yes (modern browsers) | No |
| Version control friendly | Yes (text) | Difficult | Difficult | Excellent |
| Print fidelity | Variable | Good | Perfect | Variable |
| Executable code | Yes (JS) | No (macros require .docm) | Limited | No |
| Search engine indexing | Native | Via extraction | Via extraction | Via extraction |
| Accessibility | Excellent (ARIA) | Good | Acceptable | N/A |
| Best for | Web, interactive docs | Office documents | Final print/delivery | Developer docs, notes |
HTML is rendered by web browsers, most email clients, EPUB e-readers, and many desktop and mobile applications, which makes it the closest thing to a universal document format that computing has produced.