JSONL FAQ

Frequently asked questions about JSON Lines format

General Questions

What is JSONL?

JSONL (JSON Lines) is a text format for structured data in which each line is a valid JSON object. It is also known as newline-delimited JSON (NDJSON). Each line represents a separate, self-contained JSON record.

{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}
{"name": "Charlie", "age": 35}

What is the difference between JSONL, NDJSON, and JSON Lines?

These terms all refer to essentially the same format; only the naming differs:

  • JSON Lines - The official format name from jsonlines.org
  • JSONL - Common file extension (.jsonl) and abbreviated name
  • NDJSON - Newline-Delimited JSON, emphasizes the line delimiter
  • LDJSON - Line-Delimited JSON, another variant name

All follow the same core principle: one JSON object per line. Use whichever name is most common in your ecosystem.

When should I use JSONL instead of regular JSON?

Use JSONL when:

  • Streaming data - Processing records as they arrive without loading everything into memory
  • Appending data - Adding new records without rewriting the entire file (see the sketch below)
  • Log files - Each log entry is independent and can be written atomically
  • Large datasets - Files too big to fit in memory comfortably
  • ML training data - Standard format for AI/ML pipelines (OpenAI, Hugging Face, etc.)
  • Analytics events - Event tracking where each event is independent

Use regular JSON when you need nested arrays/objects as the top-level structure or when the entire dataset represents a single cohesive object.
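
As a quick illustration of the append advantage, here is a minimal Python sketch (the file name and record fields are hypothetical): opening in append mode writes one new line without touching existing records.

# Append one record; no rewrite of the existing file
import json

with open('events.jsonl', 'a', encoding='utf-8') as f:
    f.write(json.dumps({"event": "signup", "user": "alice"}) + '\n')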

What file extension should I use?

The most common file extensions are:

  • .jsonl - Most widely used and recommended
  • .ndjson - Alternative, used in some ecosystems
  • .jsonlines - Less common but valid
  • .ldjson - Rarely used

For consistency and maximum compatibility, use .jsonl. If working with legacy systems that expect .ndjson, use that instead.

Technical Questions

What line endings should I use?

The JSON Lines specification recommends LF (\n) line endings, but most parsers accept:

  • \n (LF) - Unix/Linux/macOS line ending (recommended)
  • \r\n (CRLF) - Windows line ending (widely supported)
  • \r (CR) - Old Mac line ending (avoid, rarely supported)

Most modern JSONL parsers handle both LF and CRLF transparently. For maximum compatibility, use LF (\n).
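
As a belt-and-braces measure, a reader can strip any trailing carriage return before parsing, making it tolerant of both LF and CRLF input. A minimal Python sketch:

# Tolerate both LF and CRLF line endings
import json

with open('data.jsonl', 'r', newline='') as f:  # newline='' disables newline translation
    for line in f:
        record = json.loads(line.rstrip('\r\n'))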

How do I handle errors or corrupt lines?

JSONL's line-based format makes error handling straightforward. Best practices:

  • Parse line-by-line - Wrap each parse call (JSON.parse, json.loads, etc.) in error handling
  • Log errors - Record line number and raw content for debugging
  • Skip or fail - Choose whether to continue processing or halt on error
  • Validate schema - Use JSON Schema validation after parsing
# Python example with error handling
import json

with open('data.jsonl', 'r') as f:
    for line_num, line in enumerate(f, 1):
        try:
            data = json.loads(line)
            process(data)  # your record handler
        except json.JSONDecodeError as e:
            print(f"Error on line {line_num}: {e}")
            # Continue or break depending on requirements

Can JSON objects span multiple lines?

No. Each line must contain exactly one complete, valid JSON object. The JSON object itself must not contain literal newlines - all newlines within strings must be escaped as \n.

Invalid

{
  "name": "Alice",
  "age": 30
}

Valid

{"name": "Alice", "age": 30}

If you need pretty-printed JSON for debugging, use regular JSON format. JSONL is designed for machine processing, not human reading.
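
In practice, serializers handle this automatically: a string containing a newline is written with an escaped \n, so each record still occupies one physical line. A quick Python illustration:

# Newlines inside strings are escaped, keeping the record on one line
import json

print(json.dumps({"note": "first line\nsecond line"}))
# -> {"note": "first line\nsecond line"}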

Can I have empty lines or comments?

According to the official specification:

  • Empty lines - Not allowed in strict JSONL (some parsers tolerate them)
  • Comments - JSON does not support comments, so JSONL does not either
  • Trailing newline - Optional but recommended

For maximum compatibility, avoid empty lines and comments. If you need metadata, include it as JSON objects with a special type field:

{"type": "metadata", "version": "1.0", "created": "2025-11-11"}
{"type": "record", "id": 1, "name": "Alice"}

What character encoding should I use?

Always use UTF-8 encoding. This is the standard for JSON and JSONL.

  • UTF-8 - Universal standard, supports all Unicode characters
  • ASCII - A strict subset of UTF-8; cannot represent international characters directly (only via \uXXXX escapes)
  • UTF-16/UTF-32 - Disallowed for interchange by RFC 8259; not accepted by most JSONL tooling

UTF-8 is backward compatible with ASCII and handles emoji, international characters, and special symbols correctly.
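
When reading or writing files, it is safest to name the encoding explicitly rather than rely on platform defaults. A minimal Python sketch; ensure_ascii=False keeps non-ASCII characters as raw UTF-8 instead of \uXXXX escapes:

# Write UTF-8 JSONL explicitly
import json

with open('data.jsonl', 'w', encoding='utf-8') as f:
    f.write(json.dumps({"city": "Zürich", "mood": "😀"}, ensure_ascii=False) + '\n')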

How do I validate a JSONL file?

Validation involves checking both format and content:

  • Format validation - Ensure each line is valid JSON
  • Schema validation - Verify each object matches expected structure (see the jsonschema sketch below)
  • Line count - Check file has expected number of records
# Command-line validation with jq
cat data.jsonl | jq -c . > /dev/null && echo "Valid JSONL" || echo "Invalid"

# Python validation
import json

def validate_jsonl(filepath):
    line_count = 0
    with open(filepath, 'r') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:  # Skip empty lines if tolerant
                continue
            try:
                json.loads(line)
                line_count += 1
            except json.JSONDecodeError as e:
                print(f"Invalid JSON on line {line_num}: {e}")
                return False
    print(f"Valid JSONL: {line_count} records")
    return True
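
For the schema-validation step, one option is Python's jsonschema package (pip install jsonschema); a hedged sketch with an illustrative schema:

# Schema validation per record with jsonschema
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}

with open('data.jsonl', 'r') as f:
    for line_num, line in enumerate(f, 1):
        try:
            validate(instance=json.loads(line), schema=schema)
        except (json.JSONDecodeError, ValidationError) as e:
            print(f"Line {line_num}: {e}")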

Performance & Compression

Is JSONL faster than regular JSON?

It depends on your use case:

  • Streaming scenarios - JSONL is much faster (can process records immediately)
  • Memory usage - JSONL uses constant memory, JSON array uses O(n) memory
  • Parsing speed - Similar parse times per record, but JSONL starts processing sooner
  • Append operations - JSONL is instant, JSON arrays require rewriting entire file

For large datasets (>10MB) or streaming use cases, JSONL significantly outperforms JSON arrays. For small, static datasets, the difference is negligible.
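
The memory point is easiest to see side by side: parsing a JSON array materializes every record at once, while iterating a JSONL file holds only the current record. A minimal Python comparison:

import json

# JSON array: whole file parsed into memory at once (O(n))
with open('data.json', 'r') as f:
    records = json.load(f)

# JSONL: one record in memory at a time (O(1))
with open('data.jsonl', 'r') as f:
    for line in f:
        record = json.loads(line)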

Should I compress JSONL files?

Yes, JSONL compresses extremely well. Common strategies:

  • gzip (.jsonl.gz) - Most common, 70-90% size reduction, streamable
  • bzip2 (.jsonl.bz2) - Better compression, slower, streamable
  • xz (.jsonl.xz) - Best compression, slowest, streamable
  • zstd (.jsonl.zst) - Modern, fast, excellent compression

Gzip is recommended for most use cases. The standard libraries of Python, Go, and Node.js can stream-decompress .gz files, maintaining JSONL's memory efficiency.

# Compress JSONL with gzip
gzip data.jsonl  # Creates data.jsonl.gz

# Read compressed JSONL in Python
import gzip
import json

with gzip.open('data.jsonl.gz', 'rt') as f:
    for line in f:
        data = json.loads(line)

How do I handle very large JSONL files?

Large file strategies:

  • Stream processing - Never load entire file into memory
  • Split files - Divide into chunks (e.g., data-001.jsonl, data-002.jsonl)
  • Parallel processing - Process multiple chunks simultaneously
  • Compress - Use gzip to reduce disk I/O
  • Index - Create line offset index for random access (see the sketch below)
# Split large JSONL file into 10,000 line chunks
split -l 10000 --additional-suffix=.jsonl huge.jsonl chunk-

# Process in parallel with GNU Parallel
parallel "cat {} | jq '.field' > {}.output" ::: chunk-*.jsonl
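
For the indexing strategy mentioned above, a hedged sketch: one pass records each line's byte offset, after which any record can be read with a seek instead of a scan.

# Build a byte-offset index for random access
import json

offsets = []
with open('huge.jsonl', 'rb') as f:  # binary mode so tell()/seek() are byte-exact
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

# Jump straight to record 500
with open('huge.jsonl', 'rb') as f:
    f.seek(offsets[500])
    record = json.loads(f.readline())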

Can I use JSONL over HTTP/network?

Yes! JSONL is excellent for streaming APIs and network protocols:

  • HTTP streaming - Use Transfer-Encoding: chunked
  • WebSockets - Send one JSON object per message
  • Server-Sent Events - Each event is a JSON line
  • Message queues - Kafka, RabbitMQ often use JSONL

Set Content-Type to application/x-ndjson or application/jsonl. Clients can process records as they arrive.

// Node.js HTTP streaming example
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/x-ndjson' });

  // Stream records one at a time
  for (let i = 0; i < 1000; i++) {
    res.write(JSON.stringify({ id: i, data: 'value' }) + '\n');
  }
  res.end();
}).listen(8080);
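
On the client side, records can be consumed as they arrive. A hedged Python sketch using the requests library, assuming the server above is listening on localhost:8080:

# Stream-consume NDJSON over HTTP
import json
import requests

with requests.get('http://localhost:8080', stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            record = json.loads(line)
            print(record['id'])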

Use Cases & Ecosystem

Which companies/tools use JSONL?

JSONL is widely adopted across the industry:

AI/ML

  • OpenAI (GPT fine-tuning)
  • Hugging Face (datasets)
  • Google Vertex AI
  • Amazon SageMaker

Big Data

  • Apache Spark
  • Apache Hadoop
  • Snowflake
  • BigQuery

Analytics

  • Elasticsearch
  • Splunk
  • Datadog
  • New Relic

Data Engineering

  • Apache Kafka
  • Airflow
  • dbt (data build tool)
  • Pandas

Can I convert between JSONL and other formats?

Yes! JSONL easily converts to/from many formats:

CSV

# JSONL to CSV with jq
cat data.jsonl | jq -r '[.name, .age, .email] | @csv' > data.csv

# CSV to JSONL with Python pandas
import pandas as pd
df = pd.read_csv('data.csv')
df.to_json('data.jsonl', orient='records', lines=True)

JSON Array

# JSONL to JSON array with jq
jq -s '.' data.jsonl > data.json

# JSON array to JSONL with jq
jq -c '.[]' data.json > data.jsonl

SQL Database

# Import JSONL into PostgreSQL (psql)
CREATE TABLE data (doc jsonb);
-- Plain text-mode COPY interprets backslash escapes; the CSV trick below loads lines verbatim
COPY data FROM '/path/to/data.jsonl' WITH (FORMAT csv, QUOTE E'\x01', DELIMITER E'\x02');

# Export from SQLite to JSONL (.mode json emits a JSON array, so flatten it with jq)
sqlite3 db.sqlite ".mode json" "SELECT * FROM mytable" | jq -c '.[]' > data.jsonl

How do I sort or deduplicate JSONL files?

Common data operations on JSONL:

Sorting

# Sort by field with jq (-c and [] keep the output as JSONL rather than a JSON array)
jq -s -c 'sort_by(.age)[]' data.jsonl > sorted.jsonl

# Sort by multiple fields
jq -s -c 'sort_by(.lastName, .firstName)[]' data.jsonl > sorted.jsonl

Deduplication

# Remove duplicates by ID with jq (note: unique_by also sorts by .id)
jq -s -c 'unique_by(.id)[]' data.jsonl > deduped.jsonl

# Remove exact duplicates (entire line; note: output is sorted)
sort data.jsonl | uniq > deduped.jsonl

# Deduplicate in Python (streams the file; only IDs are held in memory)
import json

seen_ids = set()
with open('data.jsonl', 'r') as fin, open('deduped.jsonl', 'w') as fout:
    for line in fin:
        obj = json.loads(line)
        if obj['id'] not in seen_ids:
            seen_ids.add(obj['id'])
            fout.write(line)

When should I NOT use JSONL?

JSONL is not ideal when:

  • Deeply nested structures - The data is a single deeply nested tree, not a flat list of records
  • Human readability - Needs to be manually edited (use pretty-printed JSON)
  • Small datasets - If file is tiny (<1MB), JSON array is simpler
  • Complex relationships - Heavy cross-references between objects (use relational DB)
  • Config files - Application configs are better as single JSON/YAML/TOML
  • APIs returning single objects - REST endpoints typically return JSON, not JSONL

Use the right tool for the job. JSONL excels at large-scale, record-oriented data, but it is not a universal replacement for JSON.

How do I pretty-print or debug JSONL?

Debugging and inspection tools:

# Pretty-print first 10 records with jq
head -10 data.jsonl | jq '.'

# Count records
wc -l data.jsonl

# Show unique keys across all records
cat data.jsonl | jq -r 'keys[]' | sort | uniq

# Sample random records
shuf -n 5 data.jsonl | jq '.'

# Check for parse errors
cat data.jsonl | jq -c . > /dev/null

# Convert to pretty JSON array for viewing (careful with large files!)
jq -s '.' data.jsonl > pretty.json

Is there a MIME type for JSONL?

While not officially registered with IANA, common MIME types used in practice:

  • application/x-ndjson - Most widely used
  • application/jsonl - Alternative
  • application/jsonlines - Less common
  • text/x-ndjson - Text-based variant

For HTTP APIs, application/x-ndjson is recommended. Always specify charset:

Content-Type: application/x-ndjson; charset=utf-8