Pandoc is a universal document converter that supports more than 40 formats. Its DOCX → Markdown pipeline is especially useful for technical writing, software documentation, and content for static site generators.
## Installation
```bash
# Ubuntu / Debian
sudo apt install pandoc
# macOS
brew install pandoc
# Windows (winget)
winget install JohnMacFarlane.Pandoc
# Verify
pandoc --version
```
## Basic conversion
```bash
# Standard pandoc Markdown
pandoc document.docx -o document.md
# GitHub Flavored Markdown (native tables, task lists)
pandoc document.docx -t gfm -o document.md
# CommonMark strict
pandoc document.docx -t commonmark -o document.md
# Explicit format flags
pandoc -f docx -t gfm document.docx -o document.md
```
## Handling images
Images embedded in the DOCX are extracted to a folder using `--extract-media`:
```bash
# Extract images alongside the Markdown file
pandoc document.docx -t gfm --extract-media=. -o document.md
# Creates: ./media/image1.png, ./media/image2.jpeg ...
# The resulting Markdown references them by path, e.g. ![](./media/image1.png)
# Custom image directory (files land in assets/imgs/media/)
pandoc document.docx -t gfm --extract-media=assets/imgs -o document.md
```
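After extraction it can be worth confirming that every image the Markdown references actually exists on disk. The following is a minimal sketch in Python, assuming references appear either as `![alt](path)` or as an HTML `<img src="path">` tag (which pandoc may emit for images that carry size attributes). The names `find_image_refs` and `missing_images` are illustrative helpers, not part of pandoc.

```python
import re
from pathlib import Path

# Matches ![alt](path) and ![alt](path "title"); the path is group 1.
MD_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)')
# Matches <img ... src="path" ...>; the path is group 1.
HTML_IMAGE = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"', re.IGNORECASE)


def find_image_refs(markdown: str) -> list[str]:
    """Return image paths referenced in the text, in order of first appearance."""
    hits = [(m.start(), m.group(1)) for m in MD_IMAGE.finditer(markdown)]
    hits += [(m.start(), m.group(1)) for m in HTML_IMAGE.finditer(markdown)]
    seen: list[str] = []
    for _, path in sorted(hits):
        if path not in seen:
            seen.append(path)
    return seen


def missing_images(md_file: Path) -> list[str]:
    """Return referenced local images that do not exist relative to md_file."""
    text = md_file.read_text(encoding="utf-8")
    # Skip remote images such as https://... URLs.
    local = [p for p in find_image_refs(text) if "://" not in p]
    return [p for p in local if not (md_file.parent / p).exists()]
```

Calling `missing_images(Path("document.md"))` returns an empty list when every referenced file is present. The regular expressions are deliberately simple and do not handle paths containing spaces or parentheses.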
## Handling tables
```bash
# Pipe tables (GitHub, Obsidian, Notion compatible)
pandoc document.docx -t gfm -o document.md
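# Grid tables (support multi-line cells). grid_tables is already enabled in
# pandoc's Markdown, so disable the other table syntaxes to force grid output
pandoc document.docx -t markdown-simple_tables-multiline_tables-pipe_tables -o document.md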
# Typical pipe table output:
# | Column A | Column B | Column C |
# |----------|----------|----------|
# | Value 1 | Value 2 | Value 3 |
```
## Footnotes
```bash
# Numbered footnote references
pandoc document.docx -t markdown+footnotes -o document.md
# Generates: text[^1] ... [^1]: Footnote content.
# GFM style
pandoc document.docx -t gfm -o document.md
```
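A quick sanity check after conversion is to confirm that every footnote reference has a matching definition. The sketch below assumes the `[^label]` / `[^label]: text` syntax shown above; `check_footnotes` is a hypothetical helper name, and the regular expressions will also match look-alike text inside code blocks, so treat the result as a hint rather than a verdict.

```python
import re

# A definition starts a line: [^label]: text
FOOTNOTE_DEF = re.compile(r'^\[\^([^\]\s]+)\]:', re.MULTILINE)
# Any [^label] occurrence, definition or reference.
FOOTNOTE_ANY = re.compile(r'\[\^([^\]\s]+)\]')


def check_footnotes(markdown: str) -> tuple[list[str], list[str]]:
    """Return (referenced but undefined, defined but unreferenced) labels."""
    defined = set(FOOTNOTE_DEF.findall(markdown))
    # Strip the definition markers so that only inline references remain.
    body = FOOTNOTE_DEF.sub('', markdown)
    referenced = set(FOOTNOTE_ANY.findall(body))
    return sorted(referenced - defined), sorted(defined - referenced)
```

Both lists being empty suggests the footnotes survived the conversion intact.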
## Recommended formatting options
```bash
# --wrap=none: no automatic line breaks (great for git diffs)
pandoc document.docx -t gfm --wrap=none -o document.md
# ATX headings (# ## ###) instead of Setext (=== ---); ATX is the default since pandoc 2.11.2
pandoc document.docx -t markdown --markdown-headings=atx --wrap=none -o document.md
# Recommended combination for technical documentation
pandoc document.docx \
  -t gfm \
  --wrap=none \
  --extract-media=./images \
  -o document.md
```
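When the same options are reused across scripts, it can help to assemble the command in one place. This is a minimal sketch, assuming the recommended flags above; `build_pandoc_command` and its parameter names are illustrative and not part of any pandoc API.

```python
from pathlib import Path
from typing import Optional


def build_pandoc_command(
    source: str,
    target_format: str = "gfm",
    wrap: str = "none",
    media_dir: Optional[str] = None,
    output: Optional[str] = None,
) -> list[str]:
    """Return the pandoc argument list for a DOCX -> Markdown conversion."""
    src = Path(source)
    # Default output: same name as the source, with a .md suffix.
    out = Path(output) if output else src.with_suffix(".md")
    cmd = ["pandoc", str(src), "-f", "docx", "-t", target_format, f"--wrap={wrap}"]
    if media_dir is not None:
        cmd.append(f"--extract-media={media_dir}")
    cmd += ["-o", str(out)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.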
## Batch conversion in Bash
```bash
#!/bin/bash
# Skip the loop entirely when no .docx files match
shopt -s nullglob
mkdir -p output/images
total=0
for f in *.docx; do
  name="${f%.docx}"
  echo "-> Converting: $f"
  # Count a file only if pandoc succeeds
  if pandoc "$f" \
    -t gfm \
    --wrap=none \
    --extract-media="output/images/${name}" \
    -o "output/${name}.md"; then
    total=$((total + 1))
  else
    echo "!! Failed: $f" >&2
  fi
done
echo "Done: $total files converted"
```
## Python with pypandoc
```python
# pip install pypandoc  (needs a pandoc binary on PATH; pypandoc_binary bundles one)
from pathlib import Path
import pypandoc
# Convert and capture the result as a string
md_content = pypandoc.convert_file('document.docx', 'gfm')
print(md_content[:300])
# Save directly to a file
pypandoc.convert_file(
    'document.docx',
    'gfm',
    outputfile='document.md',
    extra_args=['--wrap=none', '--extract-media=./media'],
)
# Bulk conversion
source_dir = Path('word_documents')
output_dir = Path('markdown_output')
output_dir.mkdir(exist_ok=True)
converted = 0
for docx in sorted(source_dir.glob('*.docx')):
    md_path = output_dir / docx.with_suffix('.md').name
    pypandoc.convert_file(
        str(docx),
        'gfm',
        outputfile=str(md_path),
        extra_args=['--wrap=none'],
    )
    converted += 1
    print(f'OK: {docx.name}')
print(f'Total: {converted} files converted')
```
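For nested folders, one option is to plan the output paths before converting anything. The sketch below mirrors the source tree under the output directory and skips Word's lock files (names starting with `~$`), which match `*.docx` but are not valid documents. `plan_outputs` is a hypothetical helper that only computes paths; it does not call pandoc.

```python
from pathlib import Path


def plan_outputs(source_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each .docx under source_dir with a mirrored .md path in output_dir."""
    pairs = []
    for docx in sorted(source_dir.rglob("*.docx")):
        # Word creates "~$name.docx" lock files while a document is open.
        if docx.name.startswith("~$"):
            continue
        relative = docx.relative_to(source_dir).with_suffix(".md")
        pairs.append((docx, output_dir / relative))
    return pairs
```

Each pair can then be fed to `pypandoc.convert_file(str(src), 'gfm', outputfile=str(dst))` after creating `dst.parent` with `mkdir(parents=True, exist_ok=True)`.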
## Common problems and fixes
**Corrupted characters (accents, special chars):**
```bash
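# Pandoc reads and writes UTF-8, so document.md is already UTF-8.
# Garbled accents usually mean the editor or terminal is reading it as another encoding.
# If a downstream tool needs Latin-1, convert from UTF-8 (not to it):
iconv -f utf-8 -t latin1 document.md > document_latin1.md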
```
**Custom Word styles not converted:**
```bash
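# Styles like "CodeBlock" or "Callout" don't map automatically.
# Read with -f docx+styles so each style survives as a custom-style attribute,
# then use a Lua filter to map those to Markdown blocks:
# pandoc -f docx+styles document.docx -t gfm --lua-filter=map-styles.lua -o doc.md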
```
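As an alternative to a Lua filter, the same mapping can be sketched in Python against pandoc's JSON AST. This assumes the document is read with `-f docx+styles`, so that a styled paragraph arrives as a `Div` carrying a `custom-style` attribute. The style name `CodeBlock` and the function names are illustrative, and `inline_text` handles only the most common inline node types.

```python
import json

# Hypothetical set of Word paragraph styles to turn into code blocks.
CODE_STYLES = {"CodeBlock"}


def inline_text(inlines: list) -> str:
    """Flatten a list of pandoc inline nodes to plain text (common cases only)."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind == "Space":
            parts.append(" ")
        elif kind in ("SoftBreak", "LineBreak"):
            parts.append("\n")
        elif kind in ("Emph", "Strong"):
            parts.append(inline_text(node["c"]))
    return "".join(parts)


def custom_style(attr: list):
    """Return the custom-style value from an [id, classes, key-values] triple."""
    _identifier, _classes, key_values = attr
    return dict(key_values).get("custom-style")


def map_blocks(blocks: list) -> list:
    """Replace Divs whose custom-style is in CODE_STYLES with CodeBlock nodes."""
    mapped = []
    for block in blocks:
        if block["t"] == "Div":
            attr, children = block["c"]
            if custom_style(attr) in CODE_STYLES:
                lines = [inline_text(child["c"]) for child in children
                         if child["t"] in ("Para", "Plain")]
                mapped.append({"t": "CodeBlock", "c": [["", [], []], "\n".join(lines)]})
                continue
            # Recurse so that nested Divs are mapped as well.
            block = {"t": "Div", "c": [attr, map_blocks(children)]}
        mapped.append(block)
    return mapped


def run_filter(stream_in, stream_out) -> None:
    """Read a pandoc JSON document, map its blocks, and write it back out."""
    doc = json.load(stream_in)
    doc["blocks"] = map_blocks(doc["blocks"])
    json.dump(doc, stream_out)
```

To use it as a filter, save it as a script (for example `map_styles.py`, a hypothetical name) that ends with `run_filter(sys.stdin, sys.stdout)`, then chain it: `pandoc -f docx+styles document.docx -t json | python map_styles.py | pandoc -f json -t gfm -o doc.md`. Each styled paragraph becomes its own code block, so consecutive paragraphs may need merging afterwards.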
**Images not extracted:**
```bash
# Test DOCX integrity (it's a ZIP)
unzip -t document.docx
# Manually inspect embedded media
unzip document.docx -d docx_contents
ls docx_contents/word/media/
```
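The same inspection can be done without unpacking the archive. The following is a small sketch using Python's standard `zipfile` module; `list_docx_media` is an illustrative name. An empty result suggests the images are linked rather than embedded, in which case there is nothing inside the DOCX for pandoc to extract.

```python
import zipfile


def list_docx_media(docx_path) -> list[str]:
    """Return the names of files stored under word/media/ in a DOCX archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
```

`list_docx_media("document.docx")` returns entries such as `word/media/image1.png`; a `zipfile.BadZipFile` exception indicates the file is not a valid DOCX archive.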
## Verifying conversion quality
```bash
# Round-trip back to DOCX for visual comparison
pandoc document.md -o verification.docx
# Plain-text diff (structural comparison)
pandoc document.docx -t plain -o original.txt
pandoc document.md -t plain -o converted.txt
diff original.txt converted.txt
```
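For batch jobs, the plain-text comparison can be reduced to a single score. This is a minimal sketch using Python's standard `difflib`, assuming the two `.txt` files produced above. The function names are illustrative, and the score is only a rough signal because whitespace is normalized and blank lines are ignored.

```python
import difflib


def normalize(text: str) -> list[str]:
    """Collapse runs of whitespace and drop blank lines."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def similarity(original: str, converted: str) -> float:
    """Return a 0..1 line-level similarity ratio between the two texts."""
    matcher = difflib.SequenceMatcher(None, normalize(original), normalize(converted))
    return matcher.ratio()


def changed_lines(original: str, converted: str) -> list[str]:
    """Return only the removed (-) and added (+) lines, without context."""
    diff = list(difflib.unified_diff(
        normalize(original), normalize(converted), lineterm="", n=0))
    # The first two entries are the ---/+++ file headers; @@ lines mark hunks.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

Reading `original.txt` and `converted.txt` with `Path.read_text(encoding="utf-8")` and passing both strings to `similarity` gives a value close to 1.0 for a faithful conversion; `changed_lines` shows where the two diverge.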