What Is CSV?
CSV — Comma-Separated Values — is one of the oldest and most universally understood data interchange formats. Despite having no official governing standard for nearly four decades, it became the lingua franca for tabular data transfer between spreadsheets, databases, and data pipelines. RFC 4180 (2005) codified common practice, and most modern tools follow it as a baseline.
A CSV file is plain text where each line represents a data record, and fields within a record are separated by a delimiter — traditionally a comma, but sometimes a semicolon (common in European locales where commas serve as decimal separators), a tab character (TSV), a pipe |, or any other agreed-upon delimiter.
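When the delimiter is unknown, Python's csv.Sniffer can guess it from a sample of the file — a minimal sketch (the file name and the candidate delimiter set are assumptions):

```python
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    sample = f.read(4096)                 # inspect only the head of the file
    dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
    f.seek(0)                             # rewind and parse with the detected dialect
    for row in csv.reader(f, dialect):
        print(row)
```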
RFC 4180 — The De Facto Standard
RFC 4180 defines the following grammar rules:
- Each record is on its own line, terminated by CRLF (`\r\n`).
- The last record may or may not have an ending CRLF.
- An optional header record appears as the first line.
- Fields may be enclosed in double quotes.
- Fields containing commas, double quotes, or line breaks must be enclosed in double quotes.
- A double quote within a field is escaped by doubling it: `""`.
- Spaces are considered part of a field and should not be ignored.
"id","name","price","description"
1,"Laptop","999.99","High-performance 15"" display"
2,"Mouse","29.99","Ergonomic, wireless"
3,"Keyboard","79.99","Mechanical
clicky"
Note the escaped double quote in `15""` on line 2 and the quoted field containing a comma on line 3. Line 4 contains an embedded newline — valid per RFC 4180, but problematic for many line-by-line readers.
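To see the embedded-newline issue concretely, here is a minimal sketch parsing the sample above with Python's csv module (assuming it is saved as products.csv, a hypothetical file name):

```python
import csv

with open('products.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

# The parser respects quoting: 4 records (header + 3), even though the
# file has 5 physical lines, because record 4 spans a line break.
print(len(rows))         # 4
print(repr(rows[3][3]))  # e.g. 'Mechanical\nclicky' — the embedded break is preserved
```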
TSV — Tab-Separated Values
TSV uses a tab character (\t, U+0009) as the delimiter. Tabs are rare in text data, so TSV avoids most quoting requirements:
```
id	name	price	description
1	Laptop	999.99	High-performance 15" display
2	Mouse	29.99	Ergonomic wireless
```
TSV is natively produced by copying cells from Excel or Google Sheets and pasting into a text editor. It is the convention for bioinformatics formats (BED, GFF, VCF) and for Unix command pipelines where `cut -f` operates on tab-delimited fields.
Trade-off: TSV breaks silently if any data value contains a tab character. CSV with proper quoting handles tabs fine.
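Python's csv module reads TSV by switching the delimiter — a minimal sketch, assuming an unquoted file named data.tsv:

```python
import csv

with open('data.tsv', newline='', encoding='utf-8') as f:
    # Same parser, different delimiter; csv.QUOTE_NONE mirrors the common
    # convention that TSV files carry no quoting at all.
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)
```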
Encoding Pitfalls
CSV files have no header to declare encoding. The same byte sequence 0xE9 means é in ISO-8859-1 but is part of a multi-byte sequence in UTF-8. Common problems:
| Issue | Cause | Fix |
|---|---|---|
| Garbled accented characters | Encoding mismatch (UTF-8 vs Latin-1) | Always save and read as UTF-8 |
| BOM prefix `` in first field | UTF-8 BOM (EF BB BF) displayed as text | Strip BOM or use UTF-8 without BOM |
| Missing BOM in Excel | Excel expects UTF-8 BOM to recognize the encoding | Add BOM or use UTF-16 LE for Excel |
| Wrong line endings | Unix LF vs Windows CRLF | Normalize with dos2unix / unix2dos |
The UTF-8 BOM (\xEF\xBB\xBF) at the start of a file tells Windows applications (especially Excel) that the file is UTF-8. Many Unix tools strip it or include it in the first field value.
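In Python, the utf-8-sig codec covers both directions: it strips a leading BOM when reading (and tolerates its absence) and prepends one when writing. A short sketch with hypothetical file names:

```python
# Reading: 'utf-8-sig' strips a leading BOM if present, and behaves
# like plain UTF-8 if the file has none.
with open('data.csv', newline='', encoding='utf-8-sig') as f:
    first_line = f.readline()   # no stray BOM bytes in the first field

# Writing: 'utf-8-sig' prepends EF BB BF so Excel detects UTF-8.
with open('for_excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    f.write('id,name\n1,Café\n')
```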
Parsing CSV Correctly
Never split on commas with a simple split(',') — this breaks on quoted fields. Use a proper CSV library:
Python:

```python
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['price'])
```
JavaScript (Node.js):

```javascript
const { parse } = require('csv-parse/sync');
const records = parse(csvText, { columns: true, skip_empty_lines: true });
```
PHP:

```php
if (($fh = fopen('data.csv', 'r')) !== false) {
    $header = fgetcsv($fh);
    while (($row = fgetcsv($fh)) !== false) {
        $record = array_combine($header, $row);
        echo $record['name'];
    }
}
```
pandas (Python):

```python
import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8', dtype={'id': int, 'price': float})
print(df.describe())
```
Schema and Type Inference
CSV has no built-in types — every value is a string. Type inference is done at read time, which causes notorious surprises:
| Raw value | Excel interprets as | pandas interprets as |
|---|---|---|
| `1/2` | January 2nd (date) | string "1/2" |
| `0001` | integer 1 (drops leading zero) | string "0001" (with dtype=str) |
| `TRUE` | boolean True | boolean True |
| `1E5` | 100000 (number) | float 100000.0 |
| `NA` | string "NA" | float NaN |
Solutions:
- Declare types explicitly when reading (`dtype={'zip': str}` in pandas) — see the sketch after this list.
- Quote all values in the CSV — this prevents auto-interpretation in some parsers, though Excel still coerces quoted values.
- Use CSV Schema (IETF draft) or Frictionless Data Specs to document field types alongside the CSV.
- Use JSON Table Schema embedded in a `datapackage.json` descriptor.
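A sketch of the first two fixes in pandas, using surprise values from the table above (the column names are invented for illustration):

```python
import io
import pandas as pd

raw = 'zip,ratio,flag\n0001,1/2,NA\n'

# Default inference: zip becomes integer 1, flag becomes NaN.
print(pd.read_csv(io.StringIO(raw)))

# Explicit string dtypes plus keep_default_na=False preserve the raw values.
df = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)
print(df)   # zip='0001', ratio='1/2', flag='NA'
```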
CSV Dialects
Many tools produce non-standard CSV variants:
| Dialect | Separator | Quote | Notes |
|---|---|---|---|
| RFC 4180 | `,` | `"` | Official baseline |
| Excel CSV | `,` | `"` | Adds UTF-8 BOM, CRLF |
| Excel CSV European | `;` | `"` | Semicolon for locales using comma as decimal |
| MySQL CSV | `,` | `"` | NULL represented as `\N` |
| PostgreSQL COPY | `\t` | `"` | TSV-like, `\N` for NULL |
| Python csv.writer | `,` | `"` | CRLF by default (`lineterminator='\r\n'`) |
| LibreOffice Calc | `;` | `"` | Default for many EU locales |
Always document the dialect — or use a schema-carrying format like Parquet or Arrow IPC for production pipelines.
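When the dialect is known, pass its parameters explicitly instead of trusting defaults — a sketch for a European-style semicolon export (the file name is hypothetical):

```python
import csv

# European Excel dialect: semicolon-delimited, double-quote quoting,
# read with 'utf-8-sig' in case Excel added a BOM.
with open('export_eu.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.reader(f, delimiter=';', quotechar='"'):
        print(row)
```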
Large CSV Files
For files beyond ~1 GB:
- Stream row by row — never load the whole file into memory at once (see the chunked-read sketch after this list).
- Parallel processing: split on newlines (careful with embedded newlines) or use Apache Spark's `spark.read.csv()` with automatic partition detection.
- Convert to Parquet: 5–10× smaller, columnar storage, predicate push-down, typed schema. A one-time conversion pays dividends for repeated queries.
- DuckDB: SQL directly on CSV files with zero setup: `SELECT name, AVG(price) FROM read_csv_auto('data.csv') GROUP BY name;`
- csvkit: command-line toolkit (`csvstat`, `csvcut`, `csvsql`, `in2csv`) for Unix pipelines.
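As referenced in the first bullet, a sketch of chunked streaming with pandas; memory stays bounded regardless of file size (the file name, column name, and chunk size are placeholders):

```python
import pandas as pd

total = 0.0
# Process ~100k rows at a time instead of materializing the whole file.
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    total += chunk['price'].sum()
print(total)
```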
CSV in Data Pipelines
CSV remains ubiquitous at pipeline boundaries because it requires no schema negotiation:
```
[Source DB] → mysqldump CSV → [SFTP] → ETL tool → [Data Warehouse]
[API JSON] → pandas json_normalize → to_csv → [Partner]
[Sensor] → logger writes CSV → [Batch upload] → S3 → Athena
```
For internal pipeline hops, prefer binary formats (Parquet, Arrow, Avro) that carry schema. Use CSV only at human-facing boundaries or where the receiving system mandates it.
Geospatial and Specialized Variants
- GeoCSV: lat/lon columns for point data, imported directly by QGIS, ArcGIS.
- CSVW (CSV on the Web): W3C recommendation associating metadata JSON with a CSV file to declare column types, foreign keys, and provenance.
- iCalendar vCard MIME: some vCard exporters produce pseudo-CSV (comma-separated fields on one line) — not RFC 4180 compliant.
- FASTQ-adjacent: bioinformatics tools often use TSV with fixed column semantics (GFF3, BED, VCF, SAM).
Converting CSV
- CSV → JSON: csvkit's `csvjson`, or pandas `df.to_json(orient='records')`.
- CSV → Parquet: pandas + pyarrow (sketched after this list), or DuckDB `COPY (SELECT * FROM read_csv_auto('in.csv')) TO 'out.parquet' (FORMAT PARQUET)`.
- CSV → SQL: `csvsql --db sqlite:///data.db --insert data.csv`.
- CSV → Excel: pandas `to_excel()`, or `openpyxl` directly.
- JSON → CSV: pandas `json_normalize()` + `to_csv()`, or `jq` with `@csv`.
- XML → CSV: XSLT, or Python `xmltodict` + pandas.
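The pandas + pyarrow route sketched, assuming pyarrow is installed and hypothetical file names:

```python
import pandas as pd

# pandas uses pyarrow as its Parquet engine when it is installed.
df = pd.read_csv('in.csv')
df.to_parquet('out.parquet', index=False)
```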
Best Practices
- Always specify encoding — UTF-8 without BOM for Unix pipelines; UTF-8 with BOM if the recipient is Excel on Windows.
- Quote all string fields that could contain the delimiter, quotes, or newlines (see the writer sketch after this list).
- Include a header row — column names make the file self-describing.
- Document the dialect in accompanying metadata or a README.
- Avoid embedding newlines in field values if the downstream tool does not handle them.
- Use a consistent NULL representation — empty string, `\N`, `NULL`, or `NA` — and document which.
- Validate with a schema (Frictionless Data, CSVW) before ingestion.
- Convert to Parquet for any dataset > 100 MB that will be queried repeatedly.
- Use TSV for bioinformatics data where tabs never appear in values.
- Never parse CSV with regex or `split` — use a proper CSV library.
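A minimal writer sketch pulling several of these practices together — UTF-8 with BOM for an Excel recipient, a header row, and quoting every field (names and data are illustrative):

```python
import csv

rows = [{'id': '1', 'name': 'Café', 'note': 'has, comma'}]

with open('out.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name', 'note'],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()     # self-describing header row
    writer.writerows(rows)   # every field quoted, so delimiters in values are safe
```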