Word to Markdown with Pandoc: tables, images & batch convert

Pandoc is a universal document converter that supports over 40 formats. Its DOCX → Markdown pipeline is especially useful for technical writers, software documentation, and static site generators.

## Installation

```bash
# Ubuntu / Debian
sudo apt install pandoc

# macOS
brew install pandoc

# Windows (winget)
winget install JohnMacFarlane.Pandoc

# Verify
pandoc --version
```

## Basic conversion

```bash
# Standard pandoc Markdown
pandoc document.docx -o document.md

# GitHub Flavored Markdown (native tables, task lists)
pandoc document.docx -t gfm -o document.md

# CommonMark strict
pandoc document.docx -t commonmark -o document.md

# Explicit format flags
pandoc -f docx -t gfm document.docx -o document.md
```

## Handling images

Images embedded in the DOCX are extracted to a folder using `--extract-media`:

```bash
# Extract images alongside the Markdown file
pandoc document.docx -t gfm --extract-media=. -o document.md
# Creates: ./media/image1.png, ./media/image2.jpeg ...
# The resulting MD references: ![](media/image1.png)

# Custom image directory
pandoc document.docx -t gfm --extract-media=assets/imgs -o document.md
```

## Handling tables

```bash
# Pipe tables (GitHub, Obsidian, Notion compatible)
pandoc document.docx -t gfm -o document.md

# Grid tables (support multi-line cells)
pandoc document.docx -t markdown+grid_tables -o document.md

# Typical pipe table output:
# | Column A | Column B | Column C |
# |----------|----------|----------|
# | Value 1  | Value 2  | Value 3  |
```

## Footnotes

```bash
# Numbered footnote references
pandoc document.docx -t markdown+footnotes -o document.md
# Generates:
#   text[^1] ...
#   [^1]: Footnote content.

# GFM style
pandoc document.docx -t gfm -o document.md
```

## Recommended formatting options

```bash
# --wrap=none: no automatic line breaks (great for git diffs)
pandoc document.docx -t gfm --wrap=none -o document.md

# ATX headings (# ## ###) instead of Setext (=== ---)
# ATX is already the default in recent pandoc releases (2.11.2 and later)
pandoc document.docx -t markdown --markdown-headings=atx --wrap=none -o document.md

# Recommended combination for technical documentation
pandoc document.docx \
  -t gfm \
  --wrap=none \
  --extract-media=./images \
  -o document.md
```

## Batch conversion in Bash

```bash
#!/bin/bash
shopt -s nullglob   # skip the loop entirely when no .docx files are present
mkdir -p output/images
count=0

# Note: image links are written relative to the current directory, not to output/
for f in *.docx; do
  name="${f%.docx}"
  echo "-> Converting: $f"
  pandoc "$f" \
    -t gfm \
    --wrap=none \
    --extract-media="output/images/${name}" \
    -o "output/${name}.md" && count=$((count + 1))
done

echo "Done: $count files converted"
```

## Python with pypandoc

```python
# pip install pypandoc   (needs pandoc on PATH; pypandoc_binary bundles it)
from pathlib import Path

import pypandoc

# Convert and capture result as string
md_content = pypandoc.convert_file('document.docx', 'gfm')
print(md_content[:300])

# Save directly to file
pypandoc.convert_file(
    'document.docx',
    'gfm',
    outputfile='document.md',
    extra_args=['--wrap=none', '--extract-media=./media'],
)

# Bulk conversion
source_dir = Path('word_documents')
output_dir = Path('markdown_output')
output_dir.mkdir(exist_ok=True)

converted = 0
for docx in sorted(source_dir.glob('*.docx')):
    md_path = output_dir / docx.with_suffix('.md').name
    pypandoc.convert_file(
        str(docx),
        'gfm',
        outputfile=str(md_path),
        extra_args=['--wrap=none'],
    )
    converted += 1
    print(f'OK: {docx.name}')

# Count only this run's conversions, not .md files already in the folder
print(f'Total: {converted} files converted')
```

## Common problems and fixes

**Corrupted characters (accents, special chars):**

```bash
# Pandoc always writes UTF-8; check what encoding the file actually has first
file document.md

# Re-encode only if the file really is Latin-1 (e.g. another tool rewrote it)
iconv -f latin1 -t utf-8 document.md > document_utf8.md
```

**Custom Word styles not converted:**

```bash
# Styles like "CodeBlock" or "Callout" don't map automatically
# Use a Lua filter (one you write yourself) to map them to Markdown blocks
# The docx+styles reader extension exposes them via a custom-style attribute
# pandoc -f docx+styles document.docx -t gfm --lua-filter=map-styles.lua -o doc.md
```

**Images not extracted:**

```bash
# Test DOCX integrity (it's a ZIP)
unzip -t document.docx

# Manually inspect embedded media
unzip document.docx -d docx_contents
ls docx_contents/word/media/
```

## Verifying conversion quality

```bash
# Round-trip back to DOCX for visual comparison
pandoc document.md -o verification.docx

# Plain-text diff (structural comparison)
pandoc document.docx -t plain -o original.txt
pandoc document.md -t plain -o converted.txt
diff original.txt converted.txt
```
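
For batch runs, reading a raw `diff` for every file gets tedious. The sketch below shows one possible way to reduce the plain-text comparison to a similarity score plus a short list of changed lines, using only Python's standard `difflib` module. The function names (`normalize`, `similarity`, `changed_lines`) are illustrative and not part of pandoc or pypandoc. The inputs are assumed to be the contents of `original.txt` and `converted.txt` produced by the commands above.

```python
import difflib


def normalize(text: str) -> list[str]:
    """Collapse runs of whitespace and drop blank lines.

    This keeps differences in line wrapping or spacing from being
    reported as content changes.
    """
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def similarity(original: str, converted: str) -> float:
    """Return a line-level similarity ratio between 0.0 and 1.0."""
    matcher = difflib.SequenceMatcher(
        None, normalize(original), normalize(converted), autojunk=False
    )
    return matcher.ratio()


def changed_lines(original: str, converted: str) -> list[str]:
    """List lines present on only one side, prefixed with '- ' or '+ '."""
    a, b = normalize(original), normalize(converted)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes += [f"- {line}" for line in a[i1:i2]]
        if tag in ("insert", "replace"):
            changes += [f"+ {line}" for line in b[j1:j2]]
    return changes
```

A typical call would read both files with `Path('original.txt').read_text(encoding='utf-8')` and flag any document whose `similarity(...)` falls below a threshold you choose. Because the comparison works line by line, a paragraph that pandoc re-wraps across several lines will still show up as changed, so treat a low score as a prompt to inspect the file rather than as proof of data loss.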
