PDF or DOCX — which format should I send for review?

Send PDF when the document is final and the layout must be preserved exactly (contracts, invoices, certificates). Send DOCX when reviewers need to edit, comment, or track changes. Many teams send both: a PDF as the canonical version and a DOCX for editable feedback. For legal archival, use PDF/A (ISO 19005).

How do I extract just the text from a PDF without OCR?

If the PDF contains real text (not scanned images), `pdftotext` from poppler-utils or PDF to TXT (/convert/pdf-to-txt) works in seconds. If the PDF is a scanned image, you need OCR; Tesseract is the open-source standard. KaijuConverter's PDF tools auto-detect text-vs-image PDFs and route accordingly.

XML Format: The Complete Technical Guide

Q: When should I use XML instead of JSON?

Use XML when: (1) You need mixed content (text mixed with elements, like XHTML or DocBook documents). (2) You need standardized transformation (XSLT) or querying (XPath) tools. (3) You work with document-centric formats (Office Open XML, ODF, SVG). (4) You need DTD or XSD validation as part of your workflow. (5) You integrate with enterprise systems, SOAP services, or legacy systems that expect XML. (6) You need XML-specific features like processing instructions, entity references, or CDATA sections. For new web APIs and configuration files, JSON or YAML are usually simpler and better suited.

Q: What is the difference between DTD and XML Schema (XSD)?

DTD (Document Type Definition) is built into XML, simple to write, but limited: no data types (everything is text), no namespace support, uses a non-XML syntax. XSD (W3C XML Schema) is itself written in XML, supports rich data types (integer, decimal, date, boolean, anyURI, patterns), full namespace support, inheritance and complex type derivation. XSD is verbose but powerful. For simple validation, DTD is adequate. For enterprise applications requiring typed data, use XSD. RELAX NG (ISO) is a simpler alternative to XSD with both XML and compact notations.

By Pablo Cirre • Updated May 7, 2026

XML (Extensible Markup Language) is a markup language and text-based data format standardized by the W3C in 1998 (XML 1.0). XML 1.1 followed in 2004 and received a second edition in 2006, but it is rarely used. XML defines rules for encoding documents in a format that is both human-readable and machine-readable. It is a meta-language: rather than specifying a vocabulary, it provides the syntax for creating custom markup languages (SVG, XHTML, SOAP, RSS, Atom, GPX, KML, and the XML parts inside DOCX files). Despite being largely displaced by JSON for web APIs, XML remains essential for enterprise systems, document formats, configuration, and any domain requiring validation, transformation, or mixed content.

XML vs HTML: Key Differences

XML and HTML look similar but have fundamental differences:

Aspect	XML	HTML
Vocabulary	User-defined	Fixed (HTML elements)
Case sensitivity	Case-sensitive	Case-insensitive
Closing tags	Required for all elements	Optional (HTML5 allows omitting some)
Attribute quoting	Required	Optional in HTML5
Well-formedness	Strict (parser aborts on error)	Lenient (browsers forgive errors)
Purpose	Data interchange/documents	Presentation in browsers

XHTML is HTML written according to XML rules (well-formed HTML). SVG and MathML are XML vocabularies that can be embedded directly in HTML5.

XML Document Structure

A well-formed XML document consists of:

XML Declaration (optional but recommended)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

version: Always "1.0" for XML 1.0 (XML 1.1 is rare)
encoding: Character encoding of the document (if omitted, parsers assume UTF-8, or UTF-16 when a byte order mark is present)
standalone: Whether the document depends on external markup declarations ("yes" = no external DTD needed)
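
As a rough illustration of where the declaration comes from in practice, the sketch below writes one with Python's standard library. It assumes Python 3.8 or later (for the `xml_declaration` argument); the element names are borrowed from the sample document in this guide, and ElementTree does not emit a `standalone` attribute.

```python
import xml.etree.ElementTree as ET

# Build a tiny document; the names mirror this guide's bookstore sample.
root = ET.Element("bookstore")
ET.SubElement(root, "book", id="1").text = "The Great Gatsby"

# xml_declaration=True prepends the <?xml ...?> line to the serialized bytes.
data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(data.decode("utf-8"))

# The declaration is metadata for the parser; it is not part of the tree.
reparsed = ET.fromstring(data)
print(reparsed.tag)
```

The declared encoding and the bytes actually written should agree, which is why the encoding is passed to the serializer rather than typed by hand.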

Processing Instructions

<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<?myapp version="2.0"?>

Processing instructions pass information to the application.

Document Type Declaration (DOCTYPE)

<!DOCTYPE bookstore SYSTEM "bookstore.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

References an external DTD (Document Type Definition) or internal DTD subset for validation.

Root Element and Nested Structure

<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- This is a comment -->
  <book id="1" language="en" in-stock="true">
    <dc:title>The Great Gatsby</dc:title>
    <dc:creator>F. Scott Fitzgerald</dc:creator>
    <price currency="USD">12.99</price>
    <description>
      <![CDATA[A novel set in the Jazz Age on Long Island. Contains <special> characters & symbols.]]>
    </description>
    <tags>
      <tag>classic</tag>
      <tag>fiction</tag>
    </tags>
  </book>
</bookstore>

XML Syntax Rules (Well-Formedness)

An XML document is well-formed if it follows these rules:

One root element containing all other elements
All elements properly nested (no overlapping tags)
All elements closed (either <tag></tag> or self-closing <tag/>)
Attribute values always quoted (single or double)
Case-sensitive element and attribute names
No <, & in content (must use entities)
No -- inside comments (<!-- ... -->)
Namespace prefixes must be declared
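
The rules above can be checked mechanically, because a conforming parser rejects any document that breaks them. The following is a minimal sketch using Python's standard-library parser; `is_well_formed` and the sample strings are hypothetical names and data chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One passing document and one violation per rule.
samples = {
    "ok": "<a><b/></a>",
    "overlapping tags": "<a><b></a></b>",
    "unclosed element": "<a><b></a>",
    "unquoted attribute": "<a id=1/>",
    "case mismatch": "<a></A>",
    "two roots": "<a/><b/>",
    "bare ampersand": "<a>fish & chips</a>",
    "undeclared prefix": "<x:a/>",
}
for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

This only tests well-formedness. Checking a document against a DTD or schema needs a validating tool such as xmllint or lxml.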

Special characters must be escaped as predefined entities:

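Character	Entity	Notes
`<`	`&lt;`	Required in content and attribute values
`>`	`&gt;`	Required only in the sequence `]]>` in content; otherwise optional
`&`	`&amp;`	Required in content and attribute values
`"`	`&quot;`	Required in double-quoted attribute values
`'`	`&apos;`	Required in single-quoted attribute values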

CDATA sections (<![CDATA[ ... ]]>) bypass entity encoding for blocks of text containing < and &.
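
In code, escaping is usually left to a library rather than done by hand. The sketch below uses helpers from Python's standard library; the sample strings are made up for illustration. It also shows that a CDATA section and entity-escaped text parse to the same string.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

text = 'Tom & Jerry <3 "cartoons"'

# escape() replaces &, < and > so the text is safe as element content.
escaped = escape(text)
print(escaped)

# quoteattr() escapes the value and adds the surrounding quotes.
attr = quoteattr(text)
print(f"<show title={attr}/>")

# unescape() reverses escape().
restored = unescape(escaped)

# CDATA is only an input convenience: both forms yield the same text.
from_cdata = ET.fromstring("<d><![CDATA[a < b & c]]></d>").text
from_entities = ET.fromstring("<d>a &lt; b &amp; c</d>").text
print(from_cdata == from_entities)
```

Serializers such as ElementTree apply this escaping automatically when writing, so manual escaping matters mostly when XML is assembled from strings.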

XML Namespaces

XML namespaces (W3C 1999) prevent name conflicts between XML vocabularies:

<root
  xmlns="http://default-namespace.com/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://default-namespace.com/ schema.xsd">
  <title>Default namespace element</title>
  <dc:title>Dublin Core title</dc:title>
</root>

xmlns="URI": Default namespace (applies to unprefixed elements, but not to unprefixed attributes)
xmlns:prefix="URI": Named namespace prefix
Namespace URIs are identifiers — they don't need to point to actual documents
The xml: prefix is reserved (xml:lang, xml:space, xml:base, xml:id)
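
A short sketch of how these rules surface when parsing, using Python's ElementTree on a trimmed copy of the example above. The prefix `d` in the query mapping is an arbitrary choice for this example; only the namespace URIs have to match the document.

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns="http://default-namespace.com/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Default namespace element</title>
  <dc:title>Dublin Core title</dc:title>
</root>"""

root = ET.fromstring(doc)

# ElementTree expands every name to "{namespace-uri}local-name".
print(root.tag)

# Prefixes used in queries are local to this mapping, not to the document.
ns = {
    "d": "http://default-namespace.com/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
default_title = root.find("d:title", ns).text
dc_title = root.find("dc:title", ns).text
print(default_title, "|", dc_title)

# An unqualified name does not match an element in the default namespace.
unqualified = root.find("title")
print(unqualified)
```

A common source of confusion is the last line: once a default namespace is declared, unprefixed elements are no longer in "no namespace", so plain-name lookups return nothing.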

Schema Languages

XML validity (conformance to a schema) can be enforced by:

DTD (Document Type Definition)

<!DOCTYPE catalog [
  <!ELEMENT catalog (product+)>
  <!ELEMENT product (name, price, description?)>
  <!ATTLIST product
    id ID #REQUIRED
    category (books|electronics|clothing) #IMPLIED
    in-stock (true|false) "true">
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>

DTD is built into XML, but has limitations: no data type enforcement (everything is text), no namespace support.

XML Schema (XSD)

W3C XML Schema (XSD) is the standard schema language, with a rich type system:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="price" type="xs:decimal"/>
        <xs:element name="stock" type="xs:nonNegativeInteger"/>
        <xs:element name="releaseDate" type="xs:date" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

XSD supports: string, integer, decimal, date, time, dateTime, boolean, anyURI, base64Binary, hexBinary, plus restrictions (patterns, min/max length, enumeration), complex types, inheritance, and mixed content.

RELAX NG

A simpler, more readable alternative to XSD. Available in XML and compact syntax:

# Compact syntax
element product {
  attribute id { xsd:ID },
  element name { text },
  element price { xsd:decimal },
  element stock { xsd:nonNegativeInteger }
}

XSLT: Transforming XML

XSLT (XSL Transformations, W3C) is a declarative language for transforming XML into other XML, HTML, or text:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="dc">
  <xsl:output method="html" encoding="UTF-8"/>

  <xsl:template match="/">
    <html>
      <body>
        <table>
          <xsl:for-each select="bookstore/book">
            <tr>
              <td><xsl:value-of select="dc:title"/></td>
              <td><xsl:value-of select="price"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

# Apply XSLT transformation with xsltproc
xsltproc transform.xsl input.xml > output.html

# With Saxon (Java, supports XSLT 2.0/3.0); the jar file name depends on the installed version
java -jar saxon-he.jar -s:input.xml -xsl:transform.xsl -o:output.html

XPath: Querying XML

XPath is a query language for selecting nodes in XML documents:

/bookstore/book                  — all book elements under bookstore
//book                           — all book elements anywhere in document
//book[@language='en']           — books with language="en"
//book[price > 10]               — books with price > 10
//book/dc:title/text()           — text content of dc:title elements
count(//book)                    — count of book elements
//book[last()]                   — last book element
//book[position() <= 3]          — first 3 books

XPath 2.0 adds type awareness, sequences, and more functions. XPath 3.1 adds maps and arrays (JSON-like structures).
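
Python's standard-library ElementTree implements only a subset of XPath, which is worth knowing before reaching for a full engine. The sketch below runs a few of the queries above against a small made-up document (no namespaces, three books). Numeric predicates such as `price > 10` are not part of that subset, so the filter is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
doc = """<bookstore>
  <book language="en"><title>A</title><price>12.99</price></book>
  <book language="fr"><title>B</title><price>8.50</price></book>
  <book language="en"><title>C</title><price>25.00</price></book>
</bookstore>"""
root = ET.fromstring(doc)

# //book[@language='en']: attribute predicates are supported.
english = [b.findtext("title") for b in root.findall(".//book[@language='en']")]

# //book[last()]: last() is supported as a position predicate.
last = root.find("./book[last()]").findtext("title")

# //book[price > 10]: not supported by ElementTree, so filter in Python.
expensive = [
    b.findtext("title")
    for b in root.findall(".//book")
    if float(b.findtext("price")) > 10
]

# count(//book): functions are not supported either, so use len().
count = len(root.findall(".//book"))

print(english, last, expensive, count)
```

For full XPath 1.0, including comparisons and functions, lxml's `xpath()` method is the usual choice in Python.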

Parsing XML: Language Reference

# Python: ElementTree (standard library)
import xml.etree.ElementTree as ET

tree = ET.parse("books.xml")
root = tree.getroot()

for book in root.findall("book"):
    title = book.find("{http://purl.org/dc/elements/1.1/}title").text
    price = float(book.find("price").text)
    print(f"{title}: ${price}")

# lxml (faster, XPath 1.0 support)
from lxml import etree
tree = etree.parse("books.xml")
books = tree.xpath("//book[price > 10]/dc:title/text()",
                   namespaces={"dc": "http://purl.org/dc/elements/1.1/"})

# xmllint: validate and format XML
xmllint --noout --schema schema.xsd document.xml   # validate against XSD
xmllint --format document.xml                       # pretty-print
xmllint --xpath "//book/price" document.xml         # XPath query (unprefixed names only match elements in no namespace)
xmllint --c14n document.xml                         # canonicalize

# Convert XML to JSON with xq (jq for XML)
xq '.' input.xml

# Convert XML to JSON with python
python3 -c "import xmltodict, json, sys; print(json.dumps(xmltodict.parse(sys.stdin.read()), indent=2))" < input.xml

Common XML Formats

XHTML: HTML written as XML (strict, valid, MIME: application/xhtml+xml)
SVG: Scalable Vector Graphics (vector images)
SOAP: Simple Object Access Protocol (web services)
RSS / Atom: Feed syndication formats
KML: Keyhole Markup Language (geographic data, Google Earth)
GPX: GPS Exchange Format (GPS tracks)
DOCX/XLSX/PPTX: Office Open XML (ZIP + XML)
ODF: OpenDocument Format (LibreOffice)
Maven POM: Java project configuration
Spring/J2EE: Enterprise Java configuration
Apple plist: Property list (macOS/iOS configuration)
COLLADA: 3D asset exchange format
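
To make the "ZIP + XML" description of Office Open XML concrete, here is a minimal sketch that reads text out of the main document part with the standard library. The archive is an in-memory stand-in built for this example, not a complete DOCX (a real file contains more parts, such as `[Content_Types].xml`); with a real file you would pass its path to `zipfile.ZipFile` instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Stand-in archive holding only the main document part (illustration only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        "<w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r>"
        "</w:p></w:body></w:document>",
    )

# Open the archive and parse the XML part like any other XML document.
with zipfile.ZipFile(buf) as zf:
    root = ET.fromstring(zf.read("word/document.xml"))

# Text runs live in w:t elements.
text = "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
print(text)
```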

Frequently Asked Questions

Well-formed XML follows syntactic rules: one root element, all tags closed, attributes quoted, no bare < or & in content. A parser can parse well-formed XML without any schema. Valid XML is well-formed AND conforms to a schema (DTD, XSD, or RELAX NG) — all required elements are present, data types are correct, no undefined elements appear. An XML document can be well-formed but not valid (if it violates the schema). Most applications only require well-formedness for data interchange; validation is applied when receiving XML from external sources.

Round-tripping between similar formats (DOCX ↔ ODT, DOCX → PDF) is generally safe. Round-tripping with format-specific features (Word macros, complex tables, footnotes) often loses fidelity. Embedded fonts survive only if both source and target support font embedding (PDF yes, DOCX yes, plain HTML no). Always preview the result before deleting the original.

XML-to-JSON conversion is non-trivial because XML and JSON have different structures: XML has attributes, text nodes, and mixed content that JSON doesn't natively represent. Python: `pip install xmltodict` then `json.dumps(xmltodict.parse(xml_string))`. Node.js: `xml2js` or `fast-xml-parser` libraries. Command-line: `xq` tool (`cat input.xml | xq .`). The conversion assumes a consistent structure — attributes become keys prefixed with "@", text content becomes "#text" keys. For complex XML with mixed content or processing instructions, manual mapping code is usually needed.
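
To show what the "@" and "#text" conventions amount to, here is a standard-library sketch of such a mapping. It is not xmltodict's implementation; `element_to_dict` is a hypothetical helper written for this example, and it deliberately ignores namespaces, processing instructions, and the tail text of mixed content.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an Element to plain Python data using "@attr" and "#text" keys."""
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag becomes a list of values.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if text and node:
        node["#text"] = text  # text next to attributes or children
    elif text:
        return text  # a text-only element collapses to a string
    return node or None  # an empty element becomes null


xml_string = (
    '<book id="1"><title>The Great Gatsby</title>'
    '<price currency="USD">12.99</price>'
    "<tag>classic</tag><tag>fiction</tag></book>"
)
root = ET.fromstring(xml_string)
result = {root.tag: element_to_dict(root)}
print(json.dumps(result, indent=2))
```

Note the structural instability this mapping shares with most converters: a single `<tag>` yields a string while two yield a list, so consumers have to handle both shapes or force lists for known repeating elements.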

Light edits (annotations, signatures, form fields) are fine in any PDF reader. Structural edits (changing paragraphs, replacing images) are awkward — PDF is a presentation format, not an editing format. The robust workflow is: keep the source DOCX/MD/HTML as the master, regenerate the PDF when changes are needed. Tools that "edit PDFs" reverse-engineer the layout and frequently break it.