CSV vs XLSX
A detailed comparison of CSV (Comma-Separated Values) and Excel Spreadsheet — file size, quality, compatibility, and which format to choose for your workflow.
Short answer: use CSV when your data is purely tabular and will be processed by code (data pipelines, scripts, databases, ML). Use XLSX when humans need to view, edit, or analyze the data with formulas, charts, and multiple sheets. They solve genuinely different problems despite both being "spreadsheet-like".
CSV is dramatically simpler — a plain text file with comma-separated values, readable by literally any tool that can open text. XLSX is a full spreadsheet format with formulas, formatting, charts, multiple worksheets, and rich metadata. Choosing wrong (CSV when you need formulas, or XLSX when you need automation) creates friction immediately.
CSV vs XLSX at a glance
| Dimension | CSV | XLSX |
|---|---|---|
| Format type | Plain text | Binary (ZIP+XML) |
| Standardized | RFC 4180 (loosely) | ISO/IEC 29500 (strictly) |
| Multiple sheets | ❌ One sheet per file | ✅ Many sheets per workbook |
| Formulas | ❌ No (formulas as text only) | ✅ Native (=SUM, =VLOOKUP, etc.) |
| Cell formatting | ❌ No | ✅ Colors, fonts, borders, conditional |
| Charts | ❌ No | ✅ Native |
| Data types | ⚠️ Strings only (numbers parsed) | ✅ Number, date, currency, etc. |
| File size (10k rows) | ~500 KB | ~80 KB (XML compresses well) |
| Universal compatibility | ✅ Every tool ever | ✅ All modern; older Excel needs Compat Pack |
| Programmatic access | ✅ Trivial (any language) | ⚠️ Requires library (openpyxl, SheetJS) |
| Diff-friendly (Git) | ✅ Line-by-line diff | ❌ Binary diff useless |
| Encoding gotchas | ⚠️ UTF-8 vs Windows-1252 problems | ✅ UTF-8 enforced internally |
When should you use CSV vs XLSX?
Use CSV when…
- Data exports for further processing — Database dumps, log exports, API result snapshots: CSV is the universal interchange.
- ETL pipelines — Apache Airflow, dbt, Spark, pandas all read CSV natively without dependencies.
- Machine learning datasets — Kaggle, Hugging Face, scientific data repositories distribute datasets as CSV (or Parquet, but never XLSX).
- Storing large tabular data: Modern formats like Parquet or Arrow are better, but CSV is the lowest common denominator. Easy to compress (`gzip data.csv` → 20-50% of the original size).
- Version control: Git can diff CSV files line-by-line, allowing meaningful code review of data changes. XLSX shows up as a binary blob.
- Programmatic generation — Generating CSV from Python/PHP/JS is one line of code; generating XLSX requires a library and ~10× more code.
- Cross-language portability: Every programming language has built-in or trivial CSV support. XLSX requires libraries (openpyxl in Python, SimpleXLSX in PHP, SheetJS in JavaScript) with version compatibility issues.
- Simple data interchange between systems — CRM exports to email tool imports, accounting system to BI tool: CSV is the universal pipe.
- Long-term archival of data: A CSV file from 1995 still opens in 2026. An XLSX file from 2007 also opens today, but whether it will still open in 2050 depends on software continuing to support the format.
Use XLSX when…
- Spreadsheets with formulas and calculations — Financial models, budgets, project trackers: anything where cells reference other cells and update automatically.
- Multi-sheet workbooks — Report dashboards with summary + raw data + lookup tables in separate tabs. CSV would require multiple files.
- Anything humans will edit by hand — XLSX shows formulas, formatting, frozen panes, filtering — the affordances people expect from a "spreadsheet".
- Cell formatting matters — Highlighting outliers, color-coding categories, currency symbols, percentage formatting: all native to XLSX.
- Charts and pivot tables — Embedded in XLSX, follow the data when you share. CSV requires re-creating in the recipient's tool.
- Data validation rules — Dropdown lists, number ranges, custom validation: native to XLSX, impossible in CSV.
- Auditable financial records — XLSX preserves cell-level metadata, comments, change history (with track changes). CSV loses all of this.
- Sharing with non-technical colleagues — Most business users open XLSX with double-click. CSV opens as text in some configurations, requiring manual import.
- Output that goes to a printer — XLSX has page setup, headers/footers, print areas. CSV is just text — printing is awkward.
- Cloud collaboration — Google Sheets, Microsoft 365 co-authoring, Excel Online: all built around the XLSX/Sheets paradigm, not CSV.
Best format by use case
Database export
MySQL's `SELECT ... INTO OUTFILE` and PostgreSQL's `psql \copy` can both produce CSV. Universal pipeline format.
Winner: CSV
Financial model
Formulas, named ranges, multi-sheet structure native to XLSX.
Winner: XLSX
Programmatic processing
Trivial parsing in any language. No library dependencies.
Winner: CSV
Dashboard report
Charts, conditional formatting, multiple linked sheets.
Winner: XLSX
Version-controlled data
Git can diff CSV line-by-line. XLSX is opaque binary in version history.
Winner: CSV
Manual editing by team
Familiar Excel UI; people just open and work.
Winner: XLSX
ML dataset distribution
Kaggle, HuggingFace, sklearn all expect CSV (or Parquet).
Winner: CSV
Pivot table analysis
Pivot tables are an XLSX feature. CSV requires importing first.
Winner: XLSX
API data export to client
Smaller, parseable in client systems without Office installed.
Winner: CSV
CSV (Comma-Separated Values)
CSV is a simple text-based format for tabular data where values are separated by commas. It is the universal interchange format for data between spreadsheet applications, databases, and programming languages.
Excel Spreadsheet
XLSX is the modern Microsoft Excel format based on Open XML. It is the industry standard for spreadsheets, supporting formulas, charts, pivot tables, and conditional formatting.
Strengths Comparison
CSV Strengths
- Universally readable — every spreadsheet, database, and programming language.
- Human-readable in any text editor.
- Stream-friendly — can process terabytes with constant memory.
- Git-friendly — clean diffs of row changes.
- Little format overhead compared with columnar formats when the data is small and simple.
XLSX Strengths
- Much smaller than legacy .xls files thanks to ZIP + XML.
- Inspectable structure: the parts inside the ZIP are plain XML, so data can be extracted programmatically.
- Supports macros (as .xlsm variant), charts, pivot tables, conditional formatting.
- Universal support: Excel, LibreOffice, Google Sheets, Numbers, pandas.
- ISO/IEC 29500 standardized.
Limitations
CSV Limitations
- No binding standard: RFC 4180 is informational only, so quoting, escaping, encoding, and separators vary wildly.
- No type information: 0042 might be an integer, a string, or an error.
- Leading zeros and large numbers often get mangled by Excel auto-conversion.
- Not suitable for hierarchical or binary data.
- Breaks when content contains the delimiter and the parser is naive.
XLSX Limitations
- Macros in .xlsm are a common malware vector — disabled by default in Office.
- The 1,048,576-row limit per sheet is easy to hit, because people routinely put more data into Excel than it was designed for.
- Subtle formula differences between Excel, LibreOffice, and Sheets.
- Large files with many formulas recalculate slowly.
Technical Specifications
| Specification | CSV | XLSX |
|---|---|---|
| MIME type | text/csv | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| Specification | RFC 4180 (informational) | ISO/IEC 29500 |
| Typical separator | Comma (,), semicolon, tab, pipe | — |
| Typical encoding | UTF-8, Windows-1252, ISO-8859-1 | — |
| Line endings | LF (Unix), CRLF (Windows) | — |
| Container | — | ZIP (Office Open XML) |
| Max rows | — | 1,048,576 |
| Max columns | — | 16,384 |
| Released in | — | Microsoft Office 2007 |
| Variants | — | .xlsx, .xlsm (macros), .xlsb (binary) |
Typical File Sizes
CSV
- Contact export (1000 rows): 100–300 KB
- Analytics export (100k rows): 10–100 MB
- Large dataset (1M rows): 100 MB – 1 GB
- Full database dump: 1 GB – 100 GB
XLSX
- Small budget spreadsheet: 20–80 KB
- Financial model with charts: 1–10 MB
- Large dataset (100k rows): 10–50 MB
- Enterprise model (1M+ rows): 100–500 MB
Technical deep dive: CSV vs XLSX
CSV: dead simple, surprisingly tricky
CSV (Comma-Separated Values) seems trivial — values separated by commas, rows separated by newlines. The actual specification (RFC 4180, 2005) is more nuanced:
```csv
name,age,city
"Smith, John",30,"New York, NY"
Jane,25,Chicago
```
Key rules:
- Fields containing commas, quotes, or newlines must be quoted with double-quotes (`"`)
- Quotes inside fields are escaped by doubling (`""`)
- Line endings can be CR, LF, or CRLF (RFC says CRLF; in practice all three appear)
- Header row is optional but commonly present
- All fields are technically strings — interpretation of numbers/dates is up to the consumer
The "surprisingly tricky" part comes from real-world CSV that doesn't follow RFC 4180:
- Encoding: UTF-8, Windows-1252, ISO-8859-1, and UTF-16 with BOM are all common and all incompatible. Mojibake (`café` → `cafÃ©`) plagues every data engineer.
- Delimiters: Some "CSV" files use semicolons (European Excel default), tabs (TSV), or pipes (`|`). Sniff before parsing.
- Numbers as text: Phone numbers like `0123456789` get re-formatted to `123456789` if a tool guesses they're integers.
- Dates: `01/02/2026` is January 2 in US, February 1 in EU. CSV doesn't tell you which.
- Trailing commas / inconsistent column counts: parsers vary in tolerance.
Practical advice: when generating CSV, follow RFC 4180 strictly (quote everything that might contain a delimiter, use UTF-8 with BOM if Excel will open it, use ISO 8601 dates YYYY-MM-DD). When consuming CSV, sniff the delimiter and use a robust parser (Python's csv module, PHP's fgetcsv).
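The generation advice above can be sketched with Python's standard `csv` module. This is a minimal illustration, not a definitive exporter: the helper name `write_csv`, the sample rows, and the file name `people.csv` are assumptions made up for the example.

```python
import csv
import io
from datetime import date

def write_csv(rows, header, fileobj):
    """Write rows as RFC 4180-style CSV: CRLF line endings, quoting where needed."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerow(header)
    for row in rows:
        # Dates become ISO 8601 (YYYY-MM-DD); other values are stringified by the writer.
        writer.writerow([v.isoformat() if isinstance(v, date) else v for v in row])

# Hypothetical sample data: one field contains a comma, another contains quotes.
rows = [
    ("Smith, John", 30, date(2026, 2, 1)),
    ('Jane "JJ" Doe', 25, date(2026, 1, 2)),
]

buf = io.StringIO()
write_csv(rows, ["name", "age", "joined"], buf)
csv_text = buf.getvalue()
print(csv_text)

# For a file that Excel will open, "utf-8-sig" prepends the BOM, and newline=""
# stops Python from translating the writer's CRLF line endings:
# with open("people.csv", "w", encoding="utf-8-sig", newline="") as f:
#     write_csv(rows, ["name", "age", "joined"], f)
```

`QUOTE_MINIMAL` quotes only the fields that need it; `csv.QUOTE_ALL` is a stricter alternative if the consumer is known to be fragile.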
XLSX: a full spreadsheet in a ZIP
XLSX is Office Open XML (OOXML), ISO/IEC 29500. The file is a ZIP archive containing XML files. Rename .xlsx → .zip, unzip, and you'll find:
- `xl/workbook.xml`: sheet list and global settings
- `xl/sharedStrings.xml`: deduplicated text values (huge optimization)
- `xl/worksheets/sheet1.xml`: one XML per sheet with cell coordinates and values
- `xl/styles.xml`: fonts, colors, formats
- `xl/calcChain.xml`: formula dependency graph
- `xl/charts/chart1.xml`: embedded charts
The "shared strings" optimization is interesting: text values are stored once and referenced by index. A sheet with 100,000 "USA" entries stores "USA" once, plus 100,000 small index references. This is why XLSX is often dramatically smaller than equivalent CSV — text repetition is essentially free.
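To make the ZIP layout concrete, the sketch below lists the parts of a workbook and reads the shared-strings table using only the standard library. The function names `list_parts` and `read_shared_strings` are hypothetical, and the parsing is deliberately simplified (it concatenates every `<t>` node inside an `<si>` entry, which also picks up phonetic runs if a file has them), so treat it as an inspection aid rather than a full reader.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts such as sharedStrings.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

def list_parts(xlsx):
    """Return the names of all files inside the XLSX container."""
    with zipfile.ZipFile(xlsx) as zf:
        return sorted(zf.namelist())

def read_shared_strings(xlsx):
    """Return the shared-strings table as a list, indexed the way cells reference it."""
    with zipfile.ZipFile(xlsx) as zf:
        if "xl/sharedStrings.xml" not in zf.namelist():
            return []  # workbooks without text cells may omit the part
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    strings = []
    for si in root.findall(f"{{{MAIN_NS}}}si"):
        # Rich text splits one string across several <t> nodes; join them.
        strings.append("".join(t.text or "" for t in si.iter(f"{{{MAIN_NS}}}t")))
    return strings
```

Both functions accept a path or a binary file object, for example `list_parts("report.xlsx")` with a workbook of your own (the file name is only a placeholder).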
File size: counterintuitive results
For 10,000 rows × 30 columns of mixed text+numbers:
| Format | Size | Notes |
|---|---|---|
| CSV (UTF-8) | ~500 KB | Plain text, no compression |
| CSV (gzip) | ~120 KB | Highly compressible |
| XLSX | ~80 KB | XML compressed in ZIP, sharedStrings dedup |
| XLSX (with formulas + charts) | ~150 KB | Depends on formula complexity |
| Parquet (snappy) | ~60 KB | Columnar; fastest to query |
XLSX is often smaller than uncompressed CSV thanks to ZIP compression and string deduplication. Gzipping the CSV closes most of the gap (~120 KB vs ~80 KB in this example), at the cost of an extra compression step.
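If you want to check the compression effect on your own data, a rough measurement takes a few lines of standard-library Python. The rows below are synthetic and highly repetitive (an assumption chosen to keep the example short), so the ratio they produce will be more favorable than real mixed data and is not comparable to the table above.

```python
import csv
import gzip
import io

def csv_sizes(rows):
    """Return (raw_bytes, gzipped_bytes) for the rows serialized as UTF-8 CSV."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    raw = buf.getvalue().encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# Synthetic sample: 10,000 rows with repeated text values.
rows = [(i, "USA", "active", i * 1.5) for i in range(10_000)]
raw_size, gz_size = csv_sizes(rows)
print(f"raw: {raw_size} bytes, gzip: {gz_size} bytes ({gz_size / raw_size:.0%} of original)")
```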
When CSV trips you up
Real-world CSV "horror stories" that catch new data engineers:
- Excel mangling: open a CSV with phone numbers, save → leading zeros disappear (because Excel thinks they're numbers).
- Encoding wars: a file produced on Mac (UTF-8) opened on Windows Excel (Windows-1252) shows garbled accented characters.
- Semicolon CSVs from Europe: a German colleague sends "CSV" but pandas expects commas → all columns merged into one.
- Trailing whitespace: a column header `"name "` (note trailing space) doesn't match `"name"` in code.
- Quoted vs unquoted numbers: `"123"` (quoted, string) vs `123` (unquoted, number); pandas may auto-coerce inconsistently.
- Embedded newlines: a quoted field containing `\n` breaks naive line-by-line parsing.
These are why production data pipelines often convert CSV inputs to a strict format (Parquet, JSON, database) early and process internally without re-touching CSV.
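Several of the pitfalls above (semicolon delimiters, padded headers, leading zeros, embedded newlines) can be handled defensively at read time. The sketch below is one possible approach under simple assumptions: `guess_delimiter` is a crude, hypothetical heuristic that only looks at the header line (the standard library's `csv.Sniffer` is an alternative), and all values are kept as strings so nothing is coerced by accident.

```python
import csv
import io

CANDIDATES = [",", ";", "\t", "|"]

def guess_delimiter(text):
    """Pick the candidate that occurs most often in the first line (crude heuristic)."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    return max(CANDIDATES, key=first_line.count)

def read_rows(text):
    """Parse CSV text into dicts; every value stays a string, so leading zeros survive."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=guess_delimiter(text))
    rows = []
    for row in reader:
        # Strip header whitespace so "name " and "name" refer to the same column.
        rows.append({(key.strip() if key else key): value for key, value in row.items()})
    return rows

# Hypothetical sample: semicolons, a padded header, leading zeros, an embedded newline.
sample = 'name ;phone;note\r\nAnna;0123456789;"line one\nline two"\r\nBen;0987654321;ok\r\n'
for row in read_rows(sample):
    print(row)
```

Because the `csv` reader tracks quoting itself, the embedded newline stays inside the `note` field instead of starting a new record.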
When XLSX trips you up
Real-world XLSX "horror stories":
- Date serial numbers: Excel stores dates as days since an epoch (counted from January 1, 1900, with a known bug that treats 1900 as a leap year). Reading XLSX in Python without parsing → you get integers like `45000` instead of dates.
- Numeric precision: Excel stores all numbers as doubles. Long integers (IDs, account numbers > 15 digits) lose precision silently.
- Formulas that reference removed cells: copy a sheet to a new file → `#REF!` errors everywhere.
- Locale-dependent functions: `=SUMA()` (Spanish locale) → `#NAME?` for English-locale Excel users.
- Macros (XLSM): opening untrusted XLSM with macros enabled = malware risk.
- Hidden sheets / cells: data stored in hidden sheets isn't visible in normal usage but exists in the file. Surprises auditors.
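The date-serial and precision items can be illustrated in a few lines. This sketch assumes the default 1900 date system (not the 1904 system some workbooks use), and the helper names are hypothetical. The precision check uses a Python double as a stand-in; Excel's own limit of 15 significant digits is somewhat stricter.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    """Convert a 1900-system Excel serial day number to a date."""
    days = int(serial)  # any fractional part is the time of day
    if days < 1:
        raise ValueError("serial must be 1 or greater")
    if days == 60:
        raise ValueError("serial 60 is 1900-02-29, a date that never existed")
    # Excel counts the phantom 1900-02-29, so later serials are shifted by one day.
    epoch = date(1899, 12, 31) if days < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=days)

def survives_double(n):
    """True if integer n round-trips through a 64-bit float unchanged."""
    return int(float(n)) == n

print(excel_serial_to_date(45000))         # 2023-03-15
print(survives_double(123456789012345))    # True  (15 digits)
print(survives_double(12345678901234567))  # False (17 digits)
```

Libraries such as openpyxl normally apply the date conversion for cells that carry a date format; the manual version matters mainly when you receive raw serial numbers.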
When to use neither: Parquet and Apache Arrow
For data engineering and analytics, Parquet has largely replaced CSV in modern stacks:
- Columnar storage: query "select age from 1M rows" reads only the age column, not all data
- Compression: typically 5-10× smaller than CSV
- Schema enforced: types preserved, no encoding ambiguity
- Metadata-rich: column statistics, partition info embedded
- Tool support: pandas, PySpark, DuckDB, Snowflake, BigQuery all read Parquet natively
If you control both producer and consumer, prefer Parquet over CSV for any non-trivial dataset. Use CSV when you need universal portability or the consumer is a human in Excel.
Converting between CSV and XLSX
Convert CSV to XLSX handles the typical conversion: detect encoding, sniff delimiter, parse with proper quoting, emit XLSX with appropriate type inference (numbers stay numbers, dates parsed if recognizable). Add formatting and formulas in Excel after the fact.
Convert XLSX to CSV extracts the active sheet (or one specified) as CSV. Multi-sheet workbooks need separate exports per sheet (CSV doesn't support sheets). Formulas are converted to their last-calculated values; formatting is dropped.
Frequently Asked Questions
Why does Excel change my data when I open a CSV?
Excel "helpfully" auto-converts data on import: leading zeros stripped from numbers, scientific notation applied to long IDs, dates parsed in locale-specific formats (US vs EU). This is why phone numbers like 0123456789 become 123456789 and dates like 01/02/2026 mean different things in different countries. To avoid: import CSV via Data → From Text/CSV (which lets you specify column types) instead of double-clicking.
Is CSV faster to read than XLSX?
In a programming context, yes. CSV parsing is little more than splitting lines on a delimiter, while XLSX requires unzipping, parsing XML, and resolving sharedStrings references. For 100k rows, expect roughly ~0.5s to read CSV in pandas versus ~5s for XLSX. For interactive Excel use, XLSX feels "instant" because Excel uses native code optimizations.
Can a CSV file contain multiple sheets?
No. CSV is fundamentally one table per file. The convention for "multiple sheets in CSV" is multiple files (data_2026.csv, summary_2026.csv) often packaged in a ZIP. If you need multiple sheets in one file, use XLSX or Parquet (with partitioning).
Is there a size limit for CSV files?
No technical limit: terabyte CSV files exist (with proper streaming readers). Practical limits: Excel maxes at 1,048,576 rows on import. Most desktop tools struggle past 100 MB. For multi-GB CSVs, use a streaming parser (pandas read_csv with chunksize, or Apache Arrow streaming).
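A streaming read can be sketched with the standard `csv` module alone: the reader yields one row at a time, so memory use stays flat regardless of file size. The function name `sum_column` and the column names are hypothetical, and the in-memory sample stands in for a real file opened with `open(path, newline="", encoding="utf-8")`.

```python
import csv
import io

def sum_column(fileobj, column):
    """Sum one numeric column while holding only a single row in memory."""
    total = 0.0
    count = 0
    for row in csv.DictReader(fileobj):
        value = row[column]
        if value:  # skip empty cells
            total += float(value)
            count += 1
    return total, count

# Small in-memory stand-in for a multi-gigabyte file.
demo = io.StringIO("id,amount\r\n1,10.5\r\n2,\r\n3,4.5\r\n", newline="")
print(sum_column(demo, "amount"))  # (15.0, 2)
```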
Why do finance teams prefer XLSX over CSV?
Three reasons: (1) Formulas: financial spreadsheets are full of `=SUM(B2:B100)` calculations that update when underlying data changes, whereas CSV stores formulas as text strings. (2) Number formatting: currency, percentages, and decimals are all preserved. (3) Cell metadata: comments, named ranges, validation rules, audit trails. CSV loses all of this; XLSX preserves it.
Can Excel open CSV files?
Yes. Double-clicking any .csv on Windows opens it in Excel. The catch: Excel will auto-detect the delimiter, encoding, and column types, sometimes wrongly (especially with European delimiters like semicolons, or numbers with leading zeros). For controlled import, use Excel's Data → From Text/CSV ribbon command, which shows a preview and lets you specify each column's type.
Which encoding should I use for CSV?
UTF-8 with BOM if Excel will open it (the BOM tells Excel "this is UTF-8"; without it, Excel may default to Windows-1252 and mangle accents). UTF-8 without BOM if a programming tool will read it (most modern tools auto-detect; the BOM can confuse some). Stick with UTF-8 always; never use Windows-1252 or ISO-8859-1 for new files.
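The effect of the BOM and of a wrong encoding guess is easy to reproduce. In this small sketch the helper name `mojibake` is hypothetical; `utf-8-sig` and `cp1252` are Python's codec names for UTF-8 with BOM and Windows-1252.

```python
def mojibake(text):
    """Show what UTF-8 text looks like when it is misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")

print(mojibake("café"))  # cafÃ©

# "utf-8-sig" writes the three BOM bytes (EF BB BF) before the text.
with_bom = "café".encode("utf-8-sig")
print(with_bom[:3])

# Reading with "utf-8-sig" strips the BOM; plain "utf-8" leaves U+FEFF at the
# start of the text, which is how a BOM ends up glued to the first column name.
print(repr(with_bom.decode("utf-8-sig")))
print(repr(with_bom.decode("utf-8")))
```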
Should I use CSV or Parquet for data engineering?
Parquet for any non-trivial work: 5-10× smaller, faster to query, schema-enforced, columnar (reads only needed columns). CSV only when sharing with non-technical users or systems that don't support Parquet. Most modern data warehouses (Snowflake, BigQuery, Redshift) accept both. ML datasets on Kaggle/HuggingFace are increasingly Parquet-first.