What Is CSV?
CSV — Comma-Separated Values — is one of the oldest and most universally understood data interchange formats. Despite having no official governing standard for nearly four decades, it became the lingua franca for tabular data transfer between spreadsheets, databases, and data pipelines. RFC 4180 (2005) codified common practice, and most modern tools follow it as a baseline.
A CSV file is plain text where each line represents a data record, and fields within a record are separated by a delimiter — traditionally a comma, but sometimes a semicolon (common in European locales where commas serve as decimal separators), a tab character (TSV), a pipe |, or any other agreed-upon delimiter.
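When the delimiter is unknown, Python's csv.Sniffer can guess it from a sample of the file — a minimal sketch (the file name and the candidate delimiter set are assumptions):

```python
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    sample = f.read(4096)                 # inspect only the head of the file
    dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
    f.seek(0)                             # rewind and parse with the detected dialect
    for row in csv.reader(f, dialect):
        print(row)
```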
RFC 4180 — The De Facto Standard
RFC 4180 defines the following grammar rules:
- Each record is on its own line, terminated by CRLF (`\r\n`).
- The last record may or may not have an ending CRLF.
- An optional header record appears as the first line.
- Fields may be enclosed in double quotes.
- Fields containing commas, double quotes, or line breaks must be enclosed in double quotes.
- A double quote within a field is escaped by doubling it: `""`.
- Spaces are considered part of a field and should not be ignored.
"id","name","price","description"
1,"Laptop","999.99","High-performance 15"" display"
2,"Mouse","29.99","Ergonomic, wireless"
3,"Keyboard","79.99","Mechanical
clicky"
Note the escaped double quote in `15""` on line 2 and the quoted field containing a comma on line 3. Line 4 contains an embedded newline — valid per RFC 4180, but problematic for many line-by-line readers.
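To see the embedded-newline issue concretely, here is a minimal sketch parsing the sample above with Python's csv module (assuming it is saved as products.csv, a hypothetical file name):

```python
import csv

with open('products.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

# The parser respects quoting: 4 records (header + 3), even though the
# file has 5 physical lines, because record 4 spans a line break.
print(len(rows))         # 4
print(repr(rows[3][3]))  # e.g. 'Mechanical\nclicky' — the embedded break is preserved
```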
TSV — Tab-Separated Values
TSV uses a tab character (\t, U+0009) as the delimiter. Tabs are rare in text data, so TSV avoids most quoting requirements:
```
id	name	price	description
1	Laptop	999.99	High-performance 15" display
2	Mouse	29.99	Ergonomic wireless
```
TSV is natively produced by copying cells from Excel or Google Sheets and pasting into a text editor. It is the convention for bioinformatics formats (BED, GFF, VCF) and for Unix command pipelines where `cut -f` operates on tab-delimited fields.
Trade-off: TSV breaks silently if any data value contains a tab character. CSV with proper quoting handles tabs fine.
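Python's csv module reads TSV by switching the delimiter — a minimal sketch, assuming an unquoted file named data.tsv:

```python
import csv

with open('data.tsv', newline='', encoding='utf-8') as f:
    # Same parser, different delimiter; csv.QUOTE_NONE mirrors the common
    # convention that TSV files carry no quoting at all.
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)
```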
Encoding Pitfalls
CSV files have no header to declare encoding. The same byte sequence 0xE9 means é in ISO-8859-1 but is part of a multi-byte sequence in UTF-8. Common problems:
| Issue | Cause | Fix |
|---|---|---|
| Garbled accented characters | Encoding mismatch (UTF-8 vs Latin-1) | Always save and read as UTF-8 |
| BOM prefix `` in first field | UTF-8 BOM (EF BB BF) displayed as text | Strip BOM or use UTF-8 without BOM |
| Missing BOM in Excel | Excel expects UTF-8 BOM to recognize the encoding | Add BOM or use UTF-16 LE for Excel |
| Wrong line endings | Unix LF vs Windows CRLF | Normalize with dos2unix / unix2dos |
The UTF-8 BOM (\xEF\xBB\xBF) at the start of a file tells Windows applications (especially Excel) that the file is UTF-8. Many Unix tools strip it or include it in the first field value.
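In Python, the utf-8-sig codec covers both directions: it strips a leading BOM when reading (and tolerates its absence) and prepends one when writing. A short sketch with hypothetical file names:

```python
# Reading: 'utf-8-sig' strips a leading BOM if present, and behaves
# like plain UTF-8 if the file has none.
with open('data.csv', newline='', encoding='utf-8-sig') as f:
    first_line = f.readline()   # no stray BOM bytes in the first field

# Writing: 'utf-8-sig' prepends EF BB BF so Excel detects UTF-8.
with open('for_excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    f.write('id,name\n1,Café\n')
```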
Parsing CSV Correctly
Never split on commas with a simple split(',') — this breaks on quoted fields. Use a proper CSV library:
Python:

```python
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['price'])
```
JavaScript (Node.js):

```javascript
const { parse } = require('csv-parse/sync');
const records = parse(csvText, { columns: true, skip_empty_lines: true });
```
PHP:

```php
if (($fh = fopen('data.csv', 'r')) !== false) {
    $header = fgetcsv($fh);
    while (($row = fgetcsv($fh)) !== false) {
        $record = array_combine($header, $row);
        echo $record['name'];
    }
}
```
pandas (Python):

```python
import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8', dtype={'id': int, 'price': float})
print(df.describe())
```
Schema and Type Inference
CSV has no built-in types — every value is a string. Type inference is done at read time, which causes notorious surprises:
| Raw value | Excel interprets as | pandas interprets as |
|---|---|---|
| `1/2` | January 2nd (date) | string "1/2" |
| `0001` | integer 1 (drops leading zero) | string "0001" (with dtype=str) |
| `TRUE` | boolean True | boolean True |
| `1E5` | 100000 (number) | float 100000.0 |
| `NA` | string "NA" | float NaN |
Solutions:
- Declare types explicitly when reading (`dtype={'zip': str}` in pandas) — see the sketch after this list.
- Quote all values in the CSV — this prevents auto-interpretation in some parsers, though Excel still coerces quoted values.
- Use CSV Schema (IETF draft) or Frictionless Data Specs to document field types alongside the CSV.
- Use JSON Table Schema embedded in a `datapackage.json` descriptor.
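A sketch of the first two fixes in pandas, using surprise values from the table above (the column names are invented for illustration):

```python
import io
import pandas as pd

raw = 'zip,ratio,flag\n0001,1/2,NA\n'

# Default inference: zip becomes integer 1, flag becomes NaN.
print(pd.read_csv(io.StringIO(raw)))

# Explicit string dtypes plus keep_default_na=False preserve the raw values.
df = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)
print(df)   # zip='0001', ratio='1/2', flag='NA'
```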
CSV Dialects
Many tools produce non-standard CSV variants:
| Dialect | Separator | Quote | Notes |
|---|---|---|---|
| RFC 4180 | `,` | `"` | Official baseline |
| Excel CSV | `,` | `"` | Adds UTF-8 BOM, CRLF |
| Excel CSV European | `;` | `"` | Semicolon for locales using comma as decimal |
| MySQL CSV | `,` | `"` | NULL represented as `\N` |
| PostgreSQL COPY | `\t` | `"` | TSV-like, `\N` for NULL |
| Python csv.writer | `,` | `"` | CRLF by default (`lineterminator='\r\n'`) |
| LibreOffice Calc | `;` | `"` | Default for many EU locales |
Always document the dialect — or use a schema-carrying format like Parquet or Arrow IPC for production pipelines.
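When the dialect is known, pass its parameters explicitly instead of trusting defaults — a sketch for a European-style semicolon export (the file name is hypothetical):

```python
import csv

# European Excel dialect: semicolon-delimited, double-quote quoting,
# read with 'utf-8-sig' in case Excel added a BOM.
with open('export_eu.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.reader(f, delimiter=';', quotechar='"'):
        print(row)
```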
Large CSV Files
For files beyond ~1 GB:
- Stream row by row — never load the whole file into memory at once (see the chunked-read sketch after this list).
- Parallel processing: split on newlines (careful with embedded newlines) or use Apache Spark's `spark.read.csv()` with automatic partition detection.
- Convert to Parquet: 5–10× smaller, columnar storage, predicate push-down, typed schema. A one-time conversion pays dividends for repeated queries.
- DuckDB: SQL directly on CSV files with zero setup: `SELECT name, AVG(price) FROM read_csv_auto('data.csv') GROUP BY name;`
- csvkit: command-line toolkit (`csvstat`, `csvcut`, `csvsql`, `in2csv`) for Unix pipelines.
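As referenced in the first bullet, a sketch of chunked streaming with pandas; memory stays bounded regardless of file size (the file name, column name, and chunk size are placeholders):

```python
import pandas as pd

total = 0.0
# Process ~100k rows at a time instead of materializing the whole file.
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    total += chunk['price'].sum()
print(total)
```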
CSV in Data Pipelines
CSV remains ubiquitous at pipeline boundaries because it requires no schema negotiation:
```
[Source DB] → mysqldump CSV → [SFTP] → ETL tool → [Data Warehouse]
[API JSON] → pandas json_normalize → to_csv → [Partner]
[Sensor] → logger writes CSV → [Batch upload] → S3 → Athena
```
For internal pipeline hops, prefer binary formats (Parquet, Arrow, Avro) that carry schema. Use CSV only at human-facing boundaries or where the receiving system mandates it.
Geospatial and Specialized Variants
- GeoCSV: lat/lon columns for point data, imported directly by QGIS, ArcGIS.
- CSVW (CSV on the Web): W3C recommendation associating metadata JSON with a CSV file to declare column types, foreign keys, and provenance.
- iCalendar vCard MIME: some vCard exporters produce pseudo-CSV (comma-separated fields on one line) — not RFC 4180 compliant.
- FASTQ-adjacent: bioinformatics tools often use TSV with fixed column semantics (GFF3, BED, VCF, SAM).
Converting CSV
- CSV → JSON: csvkit's `csvjson`, or pandas `df.to_json(orient='records')`.
- CSV → Parquet: pandas + pyarrow (sketched after this list), or DuckDB `COPY (SELECT * FROM read_csv_auto('in.csv')) TO 'out.parquet' (FORMAT PARQUET)`.
- CSV → SQL: `csvsql --db sqlite:///data.db --insert data.csv`.
- CSV → Excel: pandas `to_excel()`, or `openpyxl` directly.
- JSON → CSV: pandas `json_normalize()` + `to_csv()`, or `jq` with `@csv`.
- XML → CSV: XSLT, or Python `xmltodict` + pandas.
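The pandas + pyarrow route sketched, assuming pyarrow is installed and hypothetical file names:

```python
import pandas as pd

# pandas uses pyarrow as its Parquet engine when it is installed.
df = pd.read_csv('in.csv')
df.to_parquet('out.parquet', index=False)
```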
Best Practices
- Always specify encoding — UTF-8 without BOM for Unix pipelines; UTF-8 with BOM if the recipient is Excel on Windows.
- Quote all string fields that could contain the delimiter, quotes, or newlines (see the writer sketch after this list).
- Include a header row — column names make the file self-describing.
- Document the dialect in accompanying metadata or a README.
- Avoid embedding newlines in field values if the downstream tool does not handle them.
- Use a consistent NULL representation — empty string, `\N`, `NULL`, or `NA` — and document which.
- Validate with a schema (Frictionless Data, CSVW) before ingestion.
- Convert to Parquet for any dataset > 100 MB that will be queried repeatedly.
- Use TSV for bioinformatics data where tabs never appear in values.
- Never parse CSV with regex or `split` — use a proper CSV library.
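A minimal writer sketch pulling several of these practices together — UTF-8 with BOM for an Excel recipient, a header row, and quoting every field (names and data are illustrative):

```python
import csv

rows = [{'id': '1', 'name': 'Café', 'note': 'has, comma'}]

with open('out.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name', 'note'],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()     # self-describing header row
    writer.writerows(rows)   # every field quoted, so delimiters in values are safe
```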