CSV vs XLSX: When to Use Each [Data Engineer's Guide 2026]

A detailed comparison of CSV (Comma-Separated Values) and XLSX (Excel Spreadsheet): file size, features, compatibility, and which format to choose for your workflow.

Short answer: use CSV when your data is purely tabular and will be processed by code (data pipelines, scripts, databases, ML). Use XLSX when humans need to view, edit, or analyze the data with formulas, charts, and multiple sheets. They solve genuinely different problems despite both being "spreadsheet-like".

CSV is dramatically simpler — a plain text file with comma-separated values, readable by literally any tool that can open text. XLSX is a full spreadsheet format with formulas, formatting, charts, multiple worksheets, and rich metadata. Choosing wrong (CSV when you need formulas, or XLSX when you need automation) creates friction immediately.

CSV vs XLSX at a glance

Dimension	CSV	XLSX
Format type	Plain text	Binary (ZIP+XML)
Standardized	RFC 4180 (loosely)	ISO/IEC 29500 (strictly)
Multiple sheets	❌ One sheet per file	✅ Many sheets per workbook
Formulas	❌ No (formulas as text only)	✅ Native (=SUM, =VLOOKUP, etc.)
Cell formatting	❌ No	✅ Colors, fonts, borders, conditional
Charts	❌ No	✅ Native
Data types	⚠️ Strings only (numbers parsed)	✅ Number, date, currency, etc.
File size (10k rows)	~500 KB	~80 KB (XML compresses well)
Universal compatibility	✅ Every tool ever	✅ All modern; older Excel needs Compat Pack
Programmatic access	✅ Trivial (any language)	⚠️ Requires library (openpyxl, SheetJS)
Diff-friendly (Git)	✅ Line-by-line diff	❌ Binary diff useless
Encoding gotchas	⚠️ UTF-8 vs Windows-1252 problems	✅ UTF-8 enforced internally

When should you use CSV vs XLSX?

Use CSV when…

Data exports for further processing: database dumps, log exports, API result snapshots. CSV is the universal interchange format.
ETL pipelines: Apache Airflow, dbt, Spark, and pandas all read CSV without extra dependencies.
Machine learning datasets: Kaggle, Hugging Face, and scientific data repositories distribute datasets as CSV (or Parquet, but rarely XLSX).
Storing large tabular data: modern formats like Parquet or Arrow are better, but CSV is the lowest common denominator. It also compresses well (gzip data.csv typically shrinks it to 20-50% of the original size).
Version control: Git can diff CSV files line by line, allowing meaningful review of data changes. XLSX shows up as a binary blob.
Programmatic generation: generating CSV from Python/PHP/JS can be one line of code; generating XLSX requires a library and roughly 10× more code.
Cross-language portability: every major programming language has built-in or trivial CSV support. XLSX requires libraries (openpyxl in Python, SimpleXLSX in PHP, SheetJS in JavaScript), each with its own version compatibility issues.
Simple data interchange between systems: CRM export to email tool import, accounting system to BI tool. CSV is the universal pipe.
Long-term archival of data: a CSV file from 1995 still opens in 2026. An XLSX from 2007 also opens today, but plain text is the safer bet for the year 2050.

Use XLSX when…

Spreadsheets with formulas and calculations: financial models, budgets, project trackers, anything where cells reference other cells and update automatically.
Multi-sheet workbooks: report dashboards with summary, raw data, and lookup tables in separate tabs. CSV would require multiple files.
Anything humans will edit by hand: XLSX shows formulas, formatting, frozen panes, and filtering, the affordances people expect from a "spreadsheet".
Cell formatting matters: highlighting outliers, color-coding categories, currency symbols, and percentage formatting are all native to XLSX.
Charts and pivot tables: they are embedded in the XLSX file, so they travel with the data when you share it. With CSV, the recipient has to re-create them in their own tool.
Data validation rules: dropdown lists, number ranges, and custom validation are native to XLSX and impossible in CSV.
Auditable financial records: XLSX preserves cell-level metadata, comments, and change history (with track changes). CSV loses all of this.
Sharing with non-technical colleagues: most business users open XLSX with a double-click. CSV opens as text in some configurations, requiring manual import.
Output that goes to a printer: XLSX has page setup, headers/footers, and print areas. CSV is just text, so printing is awkward.
Cloud collaboration: Google Sheets, Microsoft 365 co-authoring, and Excel Online are all built around the XLSX/Sheets paradigm, not CSV.

Best format by use case

Database export

`SELECT ... INTO OUTFILE` (MySQL) and `psql \copy` (PostgreSQL) can both write CSV. Universal pipeline format.

Winner: CSV

Financial model

Formulas, named ranges, multi-sheet structure native to XLSX.

Winner: XLSX

Programmatic processing

Trivial parsing in any language. No library dependencies.

Winner: CSV

Dashboard report

Charts, conditional formatting, multiple linked sheets.

Winner: XLSX

Version-controlled data

Git can diff CSV line-by-line. XLSX is opaque binary in version history.

Winner: CSV

Manual editing by team

Familiar Excel UI; people just open and work.

Winner: XLSX

ML dataset distribution

Kaggle and Hugging Face distribute datasets as CSV (or Parquet), and the usual Python ML tooling loads both directly.

Winner: CSV

Pivot table analysis

Pivot tables are an XLSX feature. CSV requires importing first.

Winner: XLSX

API data export to client

Plain text, parseable in client systems without Office installed.

Winner: CSV

CSV (Comma-Separated Values)

CSV is a simple text-based format for tabular data where values are separated by commas. It is the universal interchange format for data between spreadsheet applications, databases, and programming languages.

XLSX (Excel Spreadsheet)

XLSX is the modern Microsoft Excel format based on Open XML. It is the industry standard for spreadsheets, supporting formulas, charts, pivot tables, and conditional formatting.

Strengths Comparison

CSV Strengths

Universally readable — every spreadsheet, database, and programming language.
Human-readable in any text editor.
Stream-friendly — can process terabytes with constant memory.
Git-friendly — clean diffs of row changes.
Minimal overhead for small, simple datasets: no schema, container, or metadata to manage.

XLSX Strengths

Much smaller than legacy .xls files thanks to ZIP + XML.
Documented, inspectable structure (XML files inside a ZIP), so data can be extracted programmatically.
Supports macros (as .xlsm variant), charts, pivot tables, conditional formatting.
Universal support: Excel, LibreOffice, Google Sheets, Numbers, pandas.
ISO/IEC 29500 standardized.

Limitations

CSV Limitations

No binding standard: RFC 4180 is only informational, and quoting, escaping, encoding, and separators vary widely in practice.
No type information: 0042 might be an integer, a string, or an error.
Leading zeros and large numbers often get mangled by Excel auto-conversion.
Not suitable for hierarchical or binary data.
Breaks when content contains the delimiter and the parser is naive.

XLSX Limitations

Macros in .xlsm are a common malware vector — disabled by default in Office.
The 1,048,576-row limit per sheet is routinely hit, because people put more data in Excel than it was designed for.
Subtle formula differences between Excel, LibreOffice, and Sheets.
Large files with many formulas recalculate slowly.

Technical Specifications

Specification	CSV	XLSX
MIME type	text/csv	application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Standard	RFC 4180 (informational)	ISO/IEC 29500 (Office Open XML)
Typical separator	Comma (,), semicolon (;), tab, pipe (|)	n/a
Typical encoding	UTF-8, Windows-1252, ISO-8859-1	n/a
Line endings	LF (Unix), CRLF (Windows)	n/a
Container	n/a	ZIP (Office Open XML)
Max rows	n/a	1,048,576
Max columns	n/a	16,384
Released in	n/a	Microsoft Office 2007
Variants	n/a	.xlsx, .xlsm (macros), .xlsb (binary)

Typical File Sizes

CSV

Example	Typical size
Contact export (1000 rows)	100–300 KB
Analytics export (100k rows)	10–100 MB
Large dataset (1M rows)	100 MB – 1 GB
Full database dump	1 GB – 100 GB

XLSX

Example	Typical size
Small budget spreadsheet	20–80 KB
Financial model with charts	1–10 MB
Large dataset (100k rows)	10–50 MB
Enterprise model (1M+ rows)	100–500 MB

Technical deep dive: CSV vs XLSX

CSV: dead simple, surprisingly tricky

CSV (Comma-Separated Values) seems trivial — values separated by commas, rows separated by newlines. The actual specification (RFC 4180, 2005) is more nuanced:

name,age,city
"Smith, John",30,"New York, NY"
Jane,25,Chicago

Key rules:

Fields containing commas, quotes, or newlines must be quoted with double-quotes (")
Quotes inside fields are escaped by doubling ("")
Line endings can be CR, LF, or CRLF (RFC says CRLF; in practice all three appear)
Header row is optional but commonly present
All fields are technically strings — interpretation of numbers/dates is up to the consumer
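
The quoting rules above can be seen in action with Python's standard csv module. This is a minimal sketch with made-up sample rows; it only shows how the default dialect quotes and escapes fields, not a complete export routine.

```python
import csv
import io

rows = [
    ["name", "note"],
    ["Smith, John", 'said "hi"'],       # contains a comma and double quotes
    ["Jane", "line one\nline two"],     # contains an embedded newline
]

buffer = io.StringIO()
# The default dialect quotes only the fields that need it and ends rows with CRLF.
csv.writer(buffer).writerows(rows)
encoded = buffer.getvalue()
print(encoded)

# Reading the text back through csv.reader restores the original values.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

The second data row spans two physical lines in the output, which is exactly the case that simple line-based tools get wrong.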

The "surprisingly tricky" part comes from real-world CSV that doesn't follow RFC 4180:

Encoding: UTF-8, Windows-1252, ISO-8859-1, and UTF-16 with BOM are all common and mutually incompatible. Mojibake (café → cafÃ©) plagues every data engineer.
Delimiters: Some "CSV" files use semicolons (European Excel default), tabs (TSV), or pipes (|). Sniff before parsing.
Numbers as text: Phone numbers like 0123456789 get re-formatted to 123456789 if a tool guesses they're integers.
Dates: 01/02/2026 is January 2 in US, February 1 in EU. CSV doesn't tell you which.
Trailing commas / inconsistent column counts: parsers vary in tolerance.
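
To make the encoding problem concrete, the short sketch below reproduces mojibake on purpose: the same UTF-8 bytes are decoded once with the wrong code page and once with the right one. The sample word is arbitrary.

```python
original = "café"
utf8_bytes = original.encode("utf-8")      # the "é" becomes two bytes: C3 A9

# Wrong: a reader that assumes Windows-1252 turns those two bytes into two characters.
garbled = utf8_bytes.decode("cp1252")

# Right: decode with the encoding the file was actually written in.
restored = utf8_bytes.decode("utf-8")

print(garbled, restored)
```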

Practical advice: when generating CSV, follow RFC 4180 strictly (quote everything that might contain a delimiter, use UTF-8 with BOM if Excel will open it, use ISO 8601 dates YYYY-MM-DD). When consuming CSV, sniff the delimiter and use a robust parser (Python's csv module, PHP's fgetcsv).
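
On the consuming side, delimiter sniffing could look roughly like the sketch below. The helper name read_rows and the sample data are illustrative assumptions; csv.Sniffer is a heuristic and can guess wrong on small or irregular samples, so production code should fall back to an explicit delimiter.

```python
import csv
import io

def read_rows(text):
    """Guess the delimiter from a sample, then parse with the csv module."""
    # Restrict the guess to the separators that commonly appear in "CSV" files.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|")
    rows = list(csv.reader(io.StringIO(text, newline=""), dialect))
    return dialect.delimiter, rows

# A semicolon-separated file, as exported by many European Excel installations.
semicolon_csv = "name;age;city\nAnna;30;Berlin\nJonas;25;Hamburg\n"
delimiter, rows = read_rows(semicolon_csv)
print(delimiter, rows)
```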

XLSX: a full spreadsheet in a ZIP

XLSX is Office Open XML (OOXML), ISO/IEC 29500. The file is a ZIP archive containing XML files. Rename .xlsx → .zip, unzip, and you'll find:

xl/workbook.xml: sheet list and global settings
xl/sharedStrings.xml: deduplicated text values (huge optimization)
xl/worksheets/sheet1.xml: one XML file per sheet with cell coordinates and values
xl/styles.xml: fonts, colors, formats
xl/calcChain.xml: the calculation chain (the order in which formula cells are calculated)
xl/charts/chart1.xml: embedded charts

The "shared strings" optimization is interesting: text values are stored once and referenced by index. A sheet with 100,000 "USA" entries stores "USA" once, plus 100,000 small index references. This is why XLSX is often dramatically smaller than equivalent CSV — text repetition is essentially free.
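
The shared-string lookup can be illustrated with the standard library alone. The sketch below builds a tiny ZIP in memory that contains only the two relevant parts (it is not a complete, openable workbook) and then resolves each cell's index against the string table. It assumes plain-text entries; rich-text strings, which use nested run elements, are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

SHARED_STRINGS = (
    '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<si><t>country</t></si><si><t>USA</t></si>"
    "</sst>"
)
SHEET = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<sheetData>"
    '<row r="1"><c r="A1" t="s"><v>0</v></c></row>'
    '<row r="2"><c r="A2" t="s"><v>1</v></c></row>'
    '<row r="3"><c r="A3" t="s"><v>1</v></c></row>'
    "</sheetData></worksheet>"
)

# Pack just the two parts of interest; a real workbook contains several more.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/sharedStrings.xml", SHARED_STRINGS)
    zf.writestr("xl/worksheets/sheet1.xml", SHEET)

with zipfile.ZipFile(archive) as zf:
    table = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    strings = [si.findtext("m:t", namespaces=NS) for si in table.findall("m:si", NS)]
    sheet = ET.fromstring(zf.read("xl/worksheets/sheet1.xml"))
    # t="s" marks a cell whose <v> holds an index into the shared string table.
    values = [
        strings[int(cell.findtext("m:v", namespaces=NS))]
        for cell in sheet.iterfind(".//m:c", NS)
        if cell.get("t") == "s"
    ]

print(values)
```

"USA" is stored once in the string table, while both cells that display it carry only the index 1.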

File size: counterintuitive results

For 10,000 rows × 30 columns of mixed text+numbers:

Format	Size	Notes
CSV (UTF-8)	~500 KB	Plain text, no compression
CSV (gzip)	~120 KB	Highly compressible
XLSX	~80 KB	XML compressed in ZIP, sharedStrings dedup
XLSX (with formulas + charts)	~150 KB	Depends on formula complexity
Parquet (snappy)	~60 KB	Columnar; fastest to query

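XLSX is often smaller than uncompressed CSV thanks to ZIP compression and string deduplication. Gzipping the CSV closes most of that gap (~120 KB vs ~80 KB here) at the cost of an extra compression step. Treat these numbers as rough: actual sizes depend on how repetitive the data is.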
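
Sizes like these are easy to check on your own data. The sketch below generates a synthetic, deliberately repetitive CSV in memory and compares raw and gzipped sizes; the column names and values are invented, and the ratio it prints will not match the table above.

```python
import csv
import gzip
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "country", "amount"])
for i in range(10_000):
    # Repetitive text compresses well, which flatters the ratio.
    writer.writerow([i, "USA" if i % 3 else "Germany", f"{i * 1.5:.2f}"])

raw = buffer.getvalue().encode("utf-8")
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.0%}")
```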

When CSV trips you up

Real-world CSV "horror stories" that catch new data engineers:

Excel mangling: open a CSV with phone numbers, save → leading zeros disappear (because Excel thinks they're numbers).
Encoding wars: a file produced on Mac (UTF-8) opened on Windows Excel (Windows-1252) shows garbled accented characters.
Semicolon CSVs from Europe: a German colleague sends "CSV" but pandas expects commas → all columns merged into one.
Trailing whitespace: a column header "name " (note trailing space) doesn't match "name" in code.
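Quoted vs unquoted numbers: "123" (quoted) and 123 (unquoted) are the same value as far as CSV is concerned. pandas infers a number in both cases by default, while other tools treat the quoted form as a string, so results differ between consumers.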
Embedded newlines: a quoted field containing \n breaks naive line-by-line parsing.

These are why production data pipelines often convert CSV inputs to a strict format (Parquet, JSON, database) early and process internally without re-touching CSV.
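
Several of these pitfalls fit into one small sketch: an embedded newline, a header with trailing whitespace, and a phone number with a leading zero. The sample text is invented; the point is the contrast between naive splitting and the csv module.

```python
import csv
import io

text = 'name ,phone,comment\r\nAda,0123456789,"first line\nsecond line"\r\n'

# Naive approach: split on newlines, then on commas.
# The newline inside the quoted field splits one record in two.
naive_rows = [line.split(",") for line in text.splitlines()]

# csv.DictReader keeps the quoted field intact and leaves every value as a string,
# so the leading zero in the phone number survives.
reader = csv.DictReader(io.StringIO(text, newline=""))
reader.fieldnames = [name.strip() for name in reader.fieldnames]  # "name " becomes "name"
records = list(reader)

print(len(naive_rows), len(records))
print(records[0])
```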

When XLSX trips you up

Real-world XLSX "horror stories":

Date serial numbers: Excel stores dates as serial day counts (day 1 is January 1, 1900, and Excel wrongly treats 1900 as a leap year). Reading the raw cell value in Python without applying the cell's date format gives you a number like 45000 instead of a date.
Numeric precision: Excel stores all numbers as doubles. Long integers (IDs, account numbers > 15 digits) lose precision silently.
Formulas that reference removed cells: copy a sheet to a new file → #REF! errors everywhere.
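Locale-dependent functions: function names are localized in the Excel UI, so =SUMA() works when typed into a Spanish-locale Excel but gives #NAME? when typed or pasted as text into an English-locale one.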
Macros (XLSM): opening untrusted XLSM with macros enabled = malware risk.
Hidden sheets / cells: data stored in hidden sheets isn't visible in normal usage but exists in the file. Surprises auditors.
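
The first two items can be demonstrated without opening Excel. The sketch below converts a serial number using the common 1899-12-30 epoch convention for the 1900 date system (valid for dates from March 1, 1900 onward, because it absorbs Excel's fictitious February 29, 1900) and shows a 17-digit ID failing to survive a round trip through a double. The function name and the sample ID are hypothetical.

```python
from datetime import datetime, timedelta

# Counting from 1899-12-30 compensates for Excel treating 1900 as a leap year.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert a 1900-date-system serial number (>= 61) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

as_date = excel_serial_to_datetime(45000)
print(as_date.date())

account_id = 12345678901234567      # 17 digits, more than a double can represent exactly
as_double = float(account_id)       # what a numeric cell can hold
round_trip = int(as_double)
print(account_id, round_trip)
```

Storing such IDs as text cells, or as strings in CSV, avoids the precision loss.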

When to use neither: Parquet and Apache Arrow

For data engineering and analytics, Parquet has largely replaced CSV in modern stacks:

Columnar storage: query "select age from 1M rows" reads only the age column, not all data
Compression: typically 5-10× smaller than CSV
Schema enforced: types preserved, no encoding ambiguity
Metadata-rich: column statistics, partition info embedded
Tool support: pandas, PySpark, DuckDB, Snowflake, BigQuery all read Parquet natively

If you control both producer and consumer, prefer Parquet over CSV for any non-trivial dataset. Use CSV when you need universal portability or the consumer is a human in Excel.

Converting between CSV and XLSX

Convert CSV to XLSX handles the typical conversion: detect encoding, sniff delimiter, parse with proper quoting, emit XLSX with appropriate type inference (numbers stay numbers, dates parsed if recognizable). Add formatting and formulas in Excel after the fact.

Convert XLSX to CSV extracts the active sheet (or one specified) as CSV. Multi-sheet workbooks need separate exports per sheet (CSV doesn't support sheets). Formulas are converted to their last-calculated values; formatting is dropped.

Frequently Asked Questions

Q: Why does Excel sometimes corrupt my CSV files?

Excel "helpfully" auto-converts data on import: leading zeros are stripped from numeric-looking values, long IDs are shown in scientific notation, and dates are parsed according to the local format (US vs EU). This is why a phone number like 0123456789 becomes 123456789 and a date like 01/02/2026 means different things in different countries. To avoid this, import the CSV via Data → From Text/CSV (which lets you specify column types) instead of double-clicking the file.

Q: Is CSV faster to read than XLSX?

In a programming context, yes. Parsing CSV is a single pass over text: split on the delimiter while honoring quotes. Reading XLSX means unzipping, parsing XML, and resolving sharedStrings references. For 100k rows, pandas reads CSV in roughly 0.5 s and XLSX in roughly 5 s. For interactive use in Excel, XLSX feels instant because Excel reads it with optimized native code.

Q: Can a CSV file contain multiple sheets?

No. CSV is fundamentally one table per file. The convention for "multiple sheets in CSV" is multiple files (data_2026.csv, summary_2026.csv), often packaged in a ZIP. If you need multiple sheets in one file, use XLSX.

Q: Is there a size limit for CSV files?

The format itself has no technical limit: terabyte-scale CSV files exist and can be processed with streaming readers. The practical limits come from tools: Excel stops at 1,048,576 rows on import, and most desktop tools struggle past 100 MB. For multi-GB CSVs, use a streaming parser (pandas read_csv with chunksize, or Apache Arrow streaming).
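
Streaming needs nothing beyond the standard library. In the sketch below, csv.DictReader pulls one row at a time from the file object, so memory use stays flat regardless of file size. The function name, the column name, and the in-memory stand-in for a real file are illustrative assumptions.

```python
import csv
import io

def sum_column(lines, column):
    """Add up one numeric column while holding only the current row in memory."""
    total = 0.0
    for row in csv.DictReader(lines):
        total += float(row[column])
    return total

# Stand-in for open("big.csv", newline="", encoding="utf-8"); the file name is hypothetical.
sample = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,0.5\n", newline="")
total = sum_column(sample, "amount")
print(total)
```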

Q: Why do accountants and finance teams prefer XLSX over CSV?

Three reasons: (1) Formulas: financial spreadsheets are full of `=SUM(B2:B100)` calculations that update when the underlying data changes, whereas CSV can only store a formula as a text string. (2) Number formatting: currency, percentages, and decimal places are all preserved. (3) Cell metadata: comments, named ranges, validation rules, and audit trails. CSV loses all of this; XLSX preserves it.

Q: Can Excel open CSV files?

Yes. Double-clicking a .csv file on Windows opens it in Excel, provided Excel is the registered application for that extension. The catch: Excel auto-detects the delimiter, encoding, and column types, and sometimes gets them wrong (especially with semicolon-delimited European files or numbers with leading zeros). For a controlled import, use Excel's Data → From Text/CSV ribbon command, which shows a preview and lets you specify each column's type.

Q: What encoding should my CSV use?

UTF-8 with BOM if Excel will open it: the BOM tells Excel "this is UTF-8", and without it Excel may default to Windows-1252 and mangle accents. UTF-8 without BOM if a programming tool will read it (most modern tools handle UTF-8 by default, and a BOM can confuse some of them). Always stick with UTF-8; never use Windows-1252 or ISO-8859-1 for new files.
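
In Python the two variants differ only in the codec name. The sketch below writes a small CSV with a BOM via "utf-8-sig" and shows what happens when it is read back with and without BOM handling; the sample names are made up.

```python
import csv
import io

header = ["name", "city"]
row = ["Zoë", "Málaga"]

text = io.StringIO()
csv.writer(text).writerows([header, row])

# "utf-8-sig" prepends the BOM (bytes EF BB BF) that Excel uses to recognise UTF-8.
for_excel = text.getvalue().encode("utf-8-sig")

# Decoding with plain "utf-8" keeps the BOM as U+FEFF glued to the first header.
naive_header = next(csv.reader(io.StringIO(for_excel.decode("utf-8"), newline="")))

# Decoding with "utf-8-sig" strips the BOM, and also accepts input that has none.
clean_header = next(csv.reader(io.StringIO(for_excel.decode("utf-8-sig"), newline="")))

print(repr(naive_header[0]), repr(clean_header[0]))
```

This stray U+FEFF is the usual reason a lookup for the first column name fails in code even though the header looks correct in an editor.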

Q: Should data scientists use CSV or Parquet?

Parquet for any non-trivial work: it is typically 5-10× smaller, faster to query, schema-enforced, and columnar (it reads only the columns a query needs). Use CSV only when sharing with non-technical users or with systems that don't support Parquet. Most modern data warehouses (Snowflake, BigQuery, Redshift) accept both. ML datasets on Kaggle and Hugging Face are increasingly Parquet-first.