XML Format: The Complete Technical Guide
XML (Extensible Markup Language) is a markup language and text-based data format standardized by the W3C in 1998 (XML 1.0); XML 1.1 followed in 2004 and was revised in 2006. XML defines rules for encoding documents in a format that is both human-readable and machine-readable. It is a meta-language: rather than specifying a vocabulary, it provides the syntax for creating custom markup languages (SVG, XHTML, SOAP, RSS, Atom, GPX, KML, and the XML parts inside DOCX files). Although JSON has largely displaced it for web APIs, XML remains essential for enterprise systems, document formats, configuration, and any domain requiring validation, transformation, or mixed content.
XML vs HTML: Key Differences
XML and HTML look similar but have fundamental differences:
| Aspect | XML | HTML |
|---|---|---|
| Vocabulary | User-defined | Fixed (HTML elements) |
| Case sensitivity | Case-sensitive | Case-insensitive |
| Closing tags | Required for all elements | Optional (HTML5 allows omitting some) |
| Attribute quoting | Required | Optional in HTML5 |
| Well-formedness | Strict (parser aborts on error) | Lenient (browsers forgive errors) |
| Purpose | Data interchange/documents | Presentation in browsers |
XHTML is HTML written according to XML rules (well-formed HTML). SVG and MathML are XML vocabularies that can be embedded directly in HTML5 documents.
XML Document Structure
A well-formed XML document consists of:
XML Declaration (optional but recommended)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
- version: Always "1.0" for XML 1.0 documents (XML 1.1 is rarely used)
- encoding: Character encoding of the document (defaults to UTF-8, or UTF-16 when a byte order mark is present)
- standalone: Whether the document depends on an external DTD ("yes" means no external DTD is needed)
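To illustrate how the encoding pseudo-attribute relates to the bytes on disk, here is a minimal sketch using Python's standard-library ElementTree (Python 3.8 or later is assumed for the `xml_declaration` argument). The element name and sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.Element("name")
root.text = "Café"

# Serialize with an explicit declaration; the bytes are encoded as declared.
latin1_bytes = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)
utf8_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# Parsing from bytes lets the parser honor the declared encoding.
from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text

print(latin1_bytes)
print(utf8_bytes)
print(from_latin1 == from_utf8)
```

The two byte strings differ (one byte for "é" in ISO-8859-1, two in UTF-8), yet both parse back to the same text because the declaration tells the parser how to decode them.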
Processing Instructions
<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<?myapp version="2.0"?>
Processing instructions pass information to the application.
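As a rough sketch of how an application might read these instructions, the following uses `xml.dom.minidom` from the Python standard library. The helper name `list_processing_instructions` is hypothetical, and the sample document simply reuses the two instructions above with an empty root element.

```python
from xml.dom import Node, minidom

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<?myapp version="2.0"?>
<bookstore/>"""


def list_processing_instructions(xml_bytes):
    """Return (target, data) pairs for PIs that are direct children of the document."""
    doc = minidom.parseString(xml_bytes)
    return [
        (node.target, node.data)
        for node in doc.childNodes
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    ]


pis = list_processing_instructions(SAMPLE)
for target, data in pis:
    print(target, "->", data)
```

Note that the XML declaration on the first line is not reported: despite its similar syntax, it is not a processing instruction. The data part is an opaque string, so any attribute-like structure inside it is for the receiving application to interpret.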
Document Type Declaration (DOCTYPE)
<!DOCTYPE bookstore SYSTEM "bookstore.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
References an external DTD (Document Type Definition) or internal DTD subset for validation.
Root Element and Nested Structure
<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns:dc="http://purl.org/dc/elements/1.1/">
<!-- This is a comment -->
<book id="1" language="en" in-stock="true">
<dc:title>The Great Gatsby</dc:title>
<dc:creator>F. Scott Fitzgerald</dc:creator>
<price currency="USD">12.99</price>
<description>
<![CDATA[A novel set in the Jazz Age on Long Island. Contains <special> characters & symbols.]]>
</description>
<tags>
<tag>classic</tag>
<tag>fiction</tag>
</tags>
</book>
</bookstore>
XML Syntax Rules (Well-Formedness)
An XML document is well-formed if it follows these rules:
- One root element containing all other elements
- All elements properly nested (no overlapping tags)
- All elements closed (either <tag></tag> or self-closing <tag/>)
- Attribute values always quoted (single or double)
- Case-sensitive element and attribute names
- No <, & in content (must use entities)
- No -- inside comments (<!-- valid --> <!-- in--valid -->)
- Namespace prefixes must be declared
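A quick way to see several of these rules enforced is to feed small documents to a parser. The sketch below uses Python's standard-library ElementTree; the helper name `check_well_formed` and the sample snippets are illustrative only, and the exact error wording depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


samples = {
    "ok": "<a><b/></a>",
    "overlapping tags": "<a><b></a></b>",
    "unclosed element": "<a><b></a>",
    "unquoted attribute": "<a id=1/>",
    "raw ampersand": "<a>fish & chips</a>",
    "two root elements": "<a/><b/>",
    "case mismatch": "<a></A>",
    "undeclared prefix": "<x:a/>",
}

results = {name: check_well_formed(text) for name, text in samples.items()}
for name, error in results.items():
    print(f"{name}: {'well-formed' if error is None else error}")
```

Only the first sample parses; every other one makes the parser stop with an error instead of guessing at the intended structure.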
Special characters must be escaped as predefined entities:
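| Character | Entity | Notes |
|---|---|---|
| < | &lt; | Required in content and in attribute values |
| > | &gt; | Required only when it would complete the sequence ]]> in content; optional elsewhere |
| & | &amp; | Required in content and in attribute values |
| " | &quot; | Required in double-quoted attributes |
| ' | &apos; | Required in single-quoted attributes |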
CDATA sections (<![CDATA[ ... ]]>) bypass entity encoding for blocks of text containing < and &.
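The escaping rules can be exercised with helpers from the Python standard library. This is a minimal sketch; the sample strings and the `title` and `note` names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

raw = 'Tom & Jerry <"special" edition>'

escaped = escape(raw)          # escapes &, < and > for element content
attr = quoteattr(raw)          # escapes the value and adds suitable surrounding quotes
round_trip = unescape(escaped)

# Build a small element by hand and let a parser decode it again.
element_xml = f"<title note={attr}>{escaped}</title>"
parsed = ET.fromstring(element_xml)

# A CDATA section carries the same characters without entity escaping.
cdata_xml = "<description><![CDATA[Contains <special> characters & symbols.]]></description>"
cdata_text = ET.fromstring(cdata_xml).text

print(escaped)
print(element_xml)
print(cdata_text)
```

After parsing, both the element text and the attribute value equal the original string, and the CDATA content comes back as plain text. Escaping and CDATA are two spellings of the same characters rather than different data.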
XML Namespaces
XML namespaces (W3C 1999) prevent name conflicts between XML vocabularies:
<root
xmlns="http://default-namespace.com/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://default-namespace.com/ schema.xsd">
<title>Default namespace element</title>
<dc:title>Dublin Core title</dc:title>
</root>
- xmlns="URI": Default namespace (applies to unprefixed elements)
- xmlns:prefix="URI": Named namespace prefix
- Namespace URIs are identifiers; they don't need to point to actual documents
- The xml: prefix is reserved (xml:lang, xml:space, xml:base, xml:id)
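The following sketch shows how a namespace-aware parser sees such a document, using Python's standard-library ElementTree. The prefixes `d` and `dcore` in the lookup mapping are arbitrary choices for this example; only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://default-namespace.com/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Default namespace element</title>
  <dc:title>Dublin Core title</dc:title>
</root>"""

# Query-side prefixes are local to this mapping and need not match the document's.
NS = {
    "d": "http://default-namespace.com/",
    "dcore": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(DOC)
default_title = root.find("d:title", NS).text
dc_title = root.find("dcore:title", NS).text
unqualified = root.find("title")  # None: <title> lives in the default namespace

print(root.tag)
print(default_title)
print(dc_title)
print(unqualified)
```

ElementTree stores each name in `{URI}local` form, which is why the unprefixed lookup finds nothing even though the element is written without a prefix in the source.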
Schema Languages
XML validity (conformance to a schema) can be enforced by:
DTD (Document Type Definition)
<!DOCTYPE catalog [
<!ELEMENT catalog (product+)>
<!ELEMENT product (name, price, description?)>
<!ATTLIST product
id ID #REQUIRED
category (books|electronics|clothing) #IMPLIED
in-stock (true|false) "true">
<!ELEMENT name (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT description (#PCDATA)>
]>
DTD is built into XML, but has limitations: no data type enforcement (everything is text), no namespace support.
XML Schema (XSD)
W3C XML Schema (XSD) is the standard schema language, with a rich type system:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="product">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="price" type="xs:decimal"/>
<xs:element name="stock" type="xs:nonNegativeInteger"/>
<xs:element name="releaseDate" type="xs:date" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
</xs:element>
</xs:schema>
XSD supports: string, integer, decimal, date, time, dateTime, boolean, anyURI, base64Binary, hexBinary, plus restrictions (patterns, min/max length, enumeration), complex types, inheritance, and mixed content.
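To give a feel for what typed content means in practice, here is a rough sketch that maps the four child elements of the product schema to Python types using only the standard library. It is not a schema validator: the converters only approximate the XSD lexical rules, and the names `CONVERTERS`, `xs_decimal`, `xs_non_negative_integer`, and `load_product` are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal, InvalidOperation


def xs_decimal(text):
    """Approximate xs:decimal with Decimal (accepts a few forms XSD would reject)."""
    try:
        return Decimal(text.strip())
    except InvalidOperation:
        raise ValueError(f"not a decimal: {text!r}") from None


def xs_non_negative_integer(text):
    """Approximate xs:nonNegativeInteger with int plus a range check."""
    value = int(text)
    if value < 0:
        raise ValueError(f"negative value: {text!r}")
    return value


# Element name -> converter, mirroring the types declared in the schema.
CONVERTERS = {
    "name": str,
    "price": xs_decimal,
    "stock": xs_non_negative_integer,
    "releaseDate": date.fromisoformat,
}


def load_product(xml_text):
    """Convert a <product> element's children to typed Python values."""
    product = ET.fromstring(xml_text)
    typed = {"id": product.get("id")}
    for child in product:
        typed[child.tag] = CONVERTERS[child.tag](child.text or "")
    return typed


product = load_product(
    '<product id="p1"><name>Lamp</name><price>19.90</price>'
    "<stock>4</stock><releaseDate>2024-03-01</releaseDate></product>"
)
print(product)
```

A value such as `<stock>-1</stock>` raises `ValueError` here. A real XSD validator would report that error against the schema and would also check element order and the required `id` attribute, which this sketch does not.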
RELAX NG
A simpler, more readable alternative to XSD. Available in XML and compact syntax:
# Compact syntax
element product {
attribute id { xsd:ID },
element name { text },
element price { xsd:decimal },
element stock { xsd:nonNegativeInteger }
}
XSLT: Transforming XML
XSLT (XSL Transformations, W3C) is a declarative language for transforming XML into other XML, HTML, or text:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:dc="http://purl.org/dc/elements/1.1/" exclude-result-prefixes="dc">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="/">
<html>
<body>
<table>
<xsl:for-each select="bookstore/book">
<tr>
<td><xsl:value-of select="dc:title"/></td>
<td><xsl:value-of select="price"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
# Apply XSLT transformation with xsltproc
xsltproc transform.xsl input.xml > output.html
# With Saxon (Java, supports XSLT 2.0/3.0)
saxon -s:input.xml -xsl:transform.xsl -o:output.html
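For readers more used to procedural code, the stylesheet's for-each loop can be mirrored by hand. The sketch below is not XSLT and does not run the stylesheet; it only rebuilds the same table with Python's standard-library ElementTree. The function name `books_to_html_table` and the second sample book are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
SOURCE = f"""<bookstore xmlns:dc="{DC}">
  <book><dc:title>The Great Gatsby</dc:title><price>12.99</price></book>
  <book><dc:title>Walden</dc:title><price>9.50</price></book>
</bookstore>"""


def books_to_html_table(xml_text):
    """Build one table row per book, like the template's xsl:for-each."""
    bookstore = ET.fromstring(xml_text)
    html = ET.Element("html")
    table = ET.SubElement(ET.SubElement(html, "body"), "table")
    for book in bookstore.findall("book"):  # select="bookstore/book"
        row = ET.SubElement(table, "tr")
        # select="dc:title", written in ElementTree's {URI}local form
        ET.SubElement(row, "td").text = book.findtext(f"{{{DC}}}title", default="")
        # select="price"
        ET.SubElement(row, "td").text = book.findtext("price", default="")
    return ET.tostring(html, encoding="unicode", method="html")


output = books_to_html_table(SOURCE)
print(output)
```

The declarative version describes the output shape and lets the processor walk the input, while this version spells out the traversal. That difference is the main trade-off between XSLT and hand-written transformation code.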
XPath: Querying XML
XPath is a query language for selecting nodes in XML documents:
/bookstore/book — all book elements under bookstore
//book — all book elements anywhere in document
//book[@language='en'] — books with language="en"
//book[price > 10] — books with price > 10
//book/dc:title/text() — text content of dc:title elements
count(//book) — count of book elements
//book[last()] — last book element
//book[position() <= 3] — first 3 books
XPath 2.0 adds type awareness, sequences, and more functions. XPath 3.1 adds maps and arrays (JSON-like structures).
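Python's standard-library ElementTree implements only a subset of XPath 1.0, which makes it a convenient way to try several of the expressions above and to see where a fuller engine is needed. The sample data below is made up; comparisons such as `price > 10` and functions such as `position()` or `count()` are outside that subset, so the sketch falls back to plain Python for them.

```python
import xml.etree.ElementTree as ET

DOC = """<bookstore>
  <book id="1" language="en"><title>A</title><price>12.99</price></book>
  <book id="2" language="fr"><title>B</title><price>8.50</price></book>
  <book id="3" language="en"><title>C</title><price>25.00</price></book>
</bookstore>"""

root = ET.fromstring(DOC)  # root is the <bookstore> element

all_books = root.findall("book")                      # /bookstore/book
english = root.findall(".//book[@language='en']")     # //book[@language='en']
last_book = root.find("book[last()]")                 # //book[last()]

# Not supported by ElementTree's XPath subset: done in Python instead.
first_two = all_books[:2]                             # //book[position() <= 2]
expensive = [b for b in all_books if float(b.findtext("price")) > 10]  # //book[price > 10]
book_count = len(all_books)                           # count(//book)

print([b.get("id") for b in english])
print(last_book.get("id"))
print([b.get("id") for b in expensive])
```

Libraries with full XPath 1.0 support, such as lxml, accept the original expressions directly.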
Parsing XML: Language Reference
# Python: ElementTree (standard library)
import xml.etree.ElementTree as ET
tree = ET.parse("books.xml")
root = tree.getroot()
for book in root.findall("book"):
    title = book.find("{http://purl.org/dc/elements/1.1/}title").text
    price = float(book.find("price").text)
    print(f"{title}: ${price}")
# lxml (faster, XPath 1.0 support)
from lxml import etree
tree = etree.parse("books.xml")
books = tree.xpath("//book[price > 10]/dc:title/text()",
namespaces={"dc": "http://purl.org/dc/elements/1.1/"})
# xmllint: validate and format XML
xmllint --noout --schema schema.xsd document.xml # validate against XSD
xmllint --format document.xml # pretty-print
xmllint --xpath "//book/title" document.xml # XPath query
xmllint --c14n document.xml # canonicalize
# Convert XML to JSON with xq (jq for XML)
xq '.' input.xml
# Convert XML to JSON with python
python3 -c "import xmltodict, json, sys; print(json.dumps(xmltodict.parse(sys.stdin.read()), indent=2))" < input.xml
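When no third-party package is available, a small recursive converter can produce similar JSON. The sketch below uses only the Python standard library. The `@attribute` and `#text` key convention is an assumption chosen for this example and is not guaranteed to match the output of xq or xmltodict; the function name `element_to_dict` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = """<book id="1">
  <title>The Great Gatsby</title>
  <price currency="USD">12.99</price>
  <tags><tag>classic</tag><tag>fiction</tag></tags>
</book>"""


def element_to_dict(elem):
    """Convert an element to nested dicts, lists, and strings."""
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated child names collapse into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element without attributes: just its text
    if text:
        node["#text"] = text
    return node


doc = ET.fromstring(SAMPLE)
result = {doc.tag: element_to_dict(doc)}
print(json.dumps(result, indent=2))
```

This sketch drops text that follows a child element (mixed content) and ignores namespaces, comments, and processing instructions, so it suits data-style documents rather than prose-style ones.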
Common XML Formats
- XHTML: HTML serialized as well-formed XML (MIME type: application/xhtml+xml)
- SVG: Scalable Vector Graphics (vector images)
- SOAP: Simple Object Access Protocol (web services)
- RSS / Atom: Feed syndication formats
- KML: Keyhole Markup Language (geographic data, Google Earth)
- GPX: GPS Exchange Format (GPS tracks)
- DOCX/XLSX/PPTX: Office Open XML (ZIP + XML)
- ODF: OpenDocument Format (LibreOffice)
- Maven POM: Java project configuration
- Spring/J2EE: Enterprise Java configuration
- Apple plist: Property list (macOS/iOS configuration)
- COLLADA: 3D asset exchange format