JSONL for Machine Learning

Why JSONL has become the standard format for training data, embeddings, and model outputs in modern AI development

Why JSONL Dominates ML Workflows

Dataset Management

Machine learning datasets can contain millions or billions of examples. JSONL allows you to stream training data without loading entire datasets into memory, making it possible to train models on commodity hardware.

  • Process one example at a time
  • Easily shuffle and batch data
  • Resume training from any point
  • Append new training examples
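
A minimal sketch of this streaming pattern in Python (the file name is illustrative): each line is parsed on its own, so memory usage stays flat no matter how large the file is.

import json

def stream_examples(path):
    """Yield one parsed example at a time instead of loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Iterate lazily; works the same for a thousand lines or a billion
for example in stream_examples('train.jsonl'):
    pass  # hand `example` to your batching / training code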

Industry Adoption

Major ML platforms have standardized on JSONL for training data and model outputs:

  • OpenAI: Fine-tuning format for GPT models
  • Google Vertex AI: AutoML training data
  • Hugging Face: Datasets library native format
  • PyTorch/TensorFlow: Streaming data pipelines


Training Data Format

Supervised Learning

Each line contains input features and corresponding labels. Perfect for classification and regression tasks.

{"text": "This movie was fantastic!", "label": "positive", "confidence": 0.95}
{"text": "Terrible experience, would not recommend", "label": "negative", "confidence": 0.89}
{"text": "Average quality, nothing special", "label": "neutral", "confidence": 0.72}
{"text": "Best purchase I've ever made!", "label": "positive", "confidence": 0.98}

Pro Tip: Include metadata like confidence scores or data source to help analyze model performance later.

Multi-Label Classification

When examples can have multiple labels simultaneously, use arrays for labels.

{"document": "Python tutorial for beginners...", "labels": ["programming", "python", "tutorial", "beginner"]}
{"document": "Machine learning with PyTorch...", "labels": ["programming", "python", "machine-learning", "deep-learning"]}
{"document": "Web scraping best practices...", "labels": ["programming", "python", "web-scraping", "automation"]}

Question Answering

Format for QA systems, chatbots, and conversational AI models.

{"question": "What is the capital of France?", "context": "France is a country in Europe. Paris is its capital and largest city.", "answer": "Paris", "answer_start": 47}
{"question": "How many states are in the US?", "context": "The United States consists of 50 states, a federal district, and several territories.", "answer": "50", "answer_start": 35}
{"question": "Who wrote Romeo and Juliet?", "context": "Romeo and Juliet is a tragedy written by William Shakespeare early in his career.", "answer": "William Shakespeare", "answer_start": 49}

Image Classification

Store image paths or base64 encoded images with labels. Include bounding boxes for object detection.

{"image_path": "/data/images/cat_001.jpg", "label": "cat", "breed": "siamese", "age": "adult"}
{"image_path": "/data/images/dog_042.jpg", "label": "dog", "breed": "labrador", "age": "puppy"}
{"image_path": "/data/images/car_199.jpg", "label": "vehicle", "type": "sedan", "color": "blue", "bbox": [100, 150, 300, 400]}

Note: For large images, store file paths rather than base64 data to keep JSONL files manageable.

Time Series Data

Format temporal sequences for forecasting and anomaly detection.

{"timestamp": "2025-01-01T00:00:00Z", "sensor_id": "temp_01", "value": 72.5, "unit": "fahrenheit", "location": "room_a"}
{"timestamp": "2025-01-01T00:05:00Z", "sensor_id": "temp_01", "value": 72.8, "unit": "fahrenheit", "location": "room_a"}
{"timestamp": "2025-01-01T00:10:00Z", "sensor_id": "temp_01", "value": 73.1, "unit": "fahrenheit", "location": "room_a"}

Fine-Tuning Language Models

OpenAI Fine-Tuning Format

OpenAI requires JSONL format for fine-tuning GPT models. Each line represents a conversation or completion example.

Chat Format (GPT-3.5/GPT-4)

{"messages": [{"role": "system", "content": "You are a helpful customer service assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password, go to Settings > Account > Reset Password. You'll receive an email with instructions."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer service assistant."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "I can help you track your order. Please provide your order number, and I'll look that up for you."}]}

Completion Format (Legacy)

{"prompt": "Translate English to French: Hello", "completion": "Bonjour"}
{"prompt": "Translate English to French: Goodbye", "completion": "Au revoir"}
{"prompt": "Translate English to French: Thank you", "completion": "Merci"}

Validation: the legacy OpenAI CLI (openai-python releases prior to v1.0) includes a data-preparation tool that validates and reformats your JSONL before uploading:

openai tools fine_tunes.prepare_data -f training_data.jsonl
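
If that tool is not available in your installed version of the openai package, the core checks are easy to reproduce yourself: every line must parse as JSON, and every example needs a messages list with valid roles and string content. A minimal sketch:

import json

VALID_ROLES = {'system', 'user', 'assistant'}

with open('training_data.jsonl', encoding='utf-8') as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)  # raises if the line is not valid JSON
        messages = record.get('messages')
        assert isinstance(messages, list) and messages, f'line {line_number}: missing messages'
        for message in messages:
            assert message.get('role') in VALID_ROLES, f'line {line_number}: bad role'
            assert isinstance(message.get('content'), str), f'line {line_number}: bad content'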

Instruction Tuning

Format for training models to follow instructions (like Alpaca, Vicuna, LLaMA fine-tuning).

{"instruction": "Write a Python function to calculate factorial", "input": "", "output": "def factorial(n):\n    if n == 0 or n == 1:\n        return 1\n    return n * factorial(n - 1)"}
{"instruction": "Summarize this article", "input": "Artificial intelligence is transforming industries...", "output": "AI is revolutionizing multiple sectors through automation and enhanced decision-making capabilities."}
{"instruction": "Classify sentiment", "input": "I love this product!", "output": "Positive"}

RLHF (Reinforcement Learning from Human Feedback)

Format for preference-based training with ranked outputs.

{"prompt": "Explain quantum computing", "chosen": "Quantum computing uses quantum bits (qubits) that can exist in superposition...", "rejected": "Quantum computing is just really fast computers.", "ranking": [1, 2]}
{"prompt": "Write a haiku about nature", "chosen": "Cherry blossoms fall\nGentle breeze through autumn leaves\nNature's symphony", "rejected": "Trees are green and nice\nFlowers bloom in the spring time\nI like nature lots", "ranking": [1, 2]}

Multi-Turn Conversations

Train models to maintain context across multiple dialogue turns.

{"conversation_id": "conv_001", "messages": [{"speaker": "user", "text": "What's the weather like?"}, {"speaker": "bot", "text": "I'd be happy to check. What's your location?"}, {"speaker": "user", "text": "San Francisco"}, {"speaker": "bot", "text": "In San Francisco, it's currently 65F with partly cloudy skies."}]}
{"conversation_id": "conv_002", "messages": [{"speaker": "user", "text": "I need help with my order"}, {"speaker": "bot", "text": "I can help with that. What's your order number?"}, {"speaker": "user", "text": "ORDER-12345"}, {"speaker": "bot", "text": "I found your order. It's currently in transit and should arrive tomorrow."}]}

Embeddings Storage

Text Embeddings

Store vector embeddings alongside their source text for semantic search and similarity matching.

{"id": "doc_001", "text": "Machine learning is a subset of AI", "embedding": [0.123, -0.456, 0.789, ...], "model": "text-embedding-ada-002", "dimensions": 1536}
{"id": "doc_002", "text": "Neural networks are inspired by the brain", "embedding": [-0.234, 0.567, -0.123, ...], "model": "text-embedding-ada-002", "dimensions": 1536}

Document Embeddings with Metadata

Include rich metadata for filtering and retrieval in RAG (Retrieval Augmented Generation) systems.

{"doc_id": "article_123", "title": "Introduction to Python", "content": "Python is a versatile programming language...", "embedding": [0.23, -0.45, 0.67, ...], "author": "Jane Doe", "published": "2025-01-15", "category": "programming", "tags": ["python", "tutorial", "beginner"], "word_count": 1250}
{"doc_id": "article_124", "title": "Advanced JavaScript Techniques", "content": "Modern JavaScript offers powerful features...", "embedding": [-0.12, 0.89, -0.34, ...], "author": "John Smith", "published": "2025-01-16", "category": "programming", "tags": ["javascript", "advanced", "es6"], "word_count": 2100}

Image Embeddings

Store image vectors from models like CLIP for multimodal search.

{"image_id": "img_5001", "path": "/images/cat_001.jpg", "embedding": [0.45, -0.23, 0.78, ...], "model": "clip-vit-base-patch32", "labels": ["cat", "pet", "animal"], "dimensions": 512}
{"image_id": "img_5002", "path": "/images/landscape_042.jpg", "embedding": [-0.67, 0.34, -0.12, ...], "model": "clip-vit-base-patch32", "labels": ["nature", "mountain", "outdoor"], "dimensions": 512}

Processing Embeddings with Python

import json
import numpy as np
from openai import OpenAI

client = OpenAI()

# Generate embeddings
texts = ["Sample text 1", "Sample text 2"]
embeddings = []

for text in texts:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    embeddings.append(response.data[0].embedding)

# Save to JSONL
with open('embeddings.jsonl', 'w') as f:
    for text, embedding in zip(texts, embeddings):
        record = {
            "text": text,
            "embedding": embedding,
            "dimensions": len(embedding)
        }
        f.write(json.dumps(record) + '\n')

# Load and compute similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

with open('embeddings.jsonl', 'r') as f:
    docs = [json.loads(line) for line in f]

# Find similar documents
query_embedding = docs[0]['embedding']
for doc in docs[1:]:
    similarity = cosine_similarity(query_embedding, doc['embedding'])
    print(f"Similarity: {similarity:.4f}")

Model Outputs & Predictions

Batch Predictions

Store model predictions with confidence scores and metadata for analysis.

{"input": "This product exceeded my expectations!", "prediction": "positive", "confidence": 0.96, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:00Z"}
{"input": "Shipping took too long", "prediction": "negative", "confidence": 0.84, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:01Z"}
{"input": "It's okay, nothing special", "prediction": "neutral", "confidence": 0.71, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:02Z"}

Multi-Class Predictions

Include probability distributions across all classes for detailed analysis.

{"input": "Apple announces new iPhone", "predicted_class": "technology", "probabilities": {"technology": 0.89, "business": 0.07, "entertainment": 0.03, "sports": 0.01}, "top_3": ["technology", "business", "entertainment"]}
{"input": "Lakers win championship game", "predicted_class": "sports", "probabilities": {"sports": 0.94, "entertainment": 0.04, "technology": 0.01, "business": 0.01}, "top_3": ["sports", "entertainment", "technology"]}

LLM Generation Outputs

Log language model generations with usage statistics for monitoring costs and quality.

{"prompt": "Explain photosynthesis simply", "completion": "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide...", "model": "gpt-4", "tokens_prompt": 5, "tokens_completion": 28, "total_tokens": 33, "cost": 0.00099, "finish_reason": "stop", "timestamp": "2025-01-15T14:22:00Z"}
{"prompt": "Write a haiku about coding", "completion": "Lines of code take shape\nDebugging through the long night\nBug-free at sunrise", "model": "gpt-4", "tokens_prompt": 6, "tokens_completion": 18, "total_tokens": 24, "cost": 0.00072, "finish_reason": "stop", "timestamp": "2025-01-15T14:22:15Z"}

A/B Testing Results

Compare multiple model versions for performance evaluation.

{"test_id": "ab_test_001", "input": "customer query...", "model_a": {"version": "v1.0", "prediction": "class_a", "confidence": 0.82, "latency_ms": 45}, "model_b": {"version": "v2.0", "prediction": "class_a", "confidence": 0.91, "latency_ms": 38}, "ground_truth": "class_a", "winner": "model_b"}
{"test_id": "ab_test_002", "input": "another query...", "model_a": {"version": "v1.0", "prediction": "class_b", "confidence": 0.76, "latency_ms": 42}, "model_b": {"version": "v2.0", "prediction": "class_c", "confidence": 0.88, "latency_ms": 41}, "ground_truth": "class_c", "winner": "model_b"}

PyTorch with JSONL

Custom Dataset Class

import json
import torch
from torch.utils.data import Dataset, DataLoader

class JSONLDataset(Dataset):
    """Custom PyTorch Dataset for JSONL files"""

    def __init__(self, jsonl_path, tokenizer, max_length=512):
        self.data = []
        self.tokenizer = tokenizer
        self.max_length = max_length

        # Load JSONL file
        with open(jsonl_path, 'r', encoding='utf-8') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Tokenize text
        encoding = self.tokenizer(
            item['text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(item['label'], dtype=torch.long)  # assumes 'label' is an integer class id
        }

# Usage
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = JSONLDataset('train.jsonl', tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Training loop
for batch in dataloader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['label']
    # ... training code ...

Streaming Large JSONL Files

Process massive datasets without loading everything into memory using IterableDataset.

import json
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingJSONLDataset(IterableDataset):
    """Memory-efficient streaming dataset for large JSONL files"""

    def __init__(self, jsonl_path, tokenizer, max_length=512):
        self.jsonl_path = jsonl_path
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        with open(self.jsonl_path, 'r', encoding='utf-8') as f:
            for line in f:
                item = json.loads(line)

                encoding = self.tokenizer(
                    item['text'],
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                yield {
                    'input_ids': encoding['input_ids'].squeeze(),
                    'attention_mask': encoding['attention_mask'].squeeze(),
                    'label': torch.tensor(item['label'], dtype=torch.long)
                }

# Process terabytes of data with constant memory usage
streaming_dataset = StreamingJSONLDataset('huge_dataset.jsonl', tokenizer)
dataloader = DataLoader(streaming_dataset, batch_size=32)

for batch in dataloader:
    # Process batch without loading entire dataset
    pass

Multi-File Training

import glob
import json
import random

import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class MultiFileJSONLDataset(IterableDataset):
    """Dataset that streams from multiple JSONL files"""

    def __init__(self, pattern, tokenizer, max_length=512):
        self.files = glob.glob(pattern)
        self.tokenizer = tokenizer
        self.max_length = max_length
        random.shuffle(self.files)  # Randomize file order

    def __iter__(self):
        # Shard files across DataLoader workers so each example is read exactly once
        worker_info = get_worker_info()
        files = self.files
        if worker_info is not None:
            files = files[worker_info.id::worker_info.num_workers]
        for filepath in files:
            with open(filepath, 'r', encoding='utf-8') as f:
                for line in f:
                    item = json.loads(line)

                    encoding = self.tokenizer(
                        item['text'],
                        max_length=self.max_length,
                        padding='max_length',
                        truncation=True,
                        return_tensors='pt'
                    )

                    yield {
                        'input_ids': encoding['input_ids'].squeeze(),
                        'attention_mask': encoding['attention_mask'].squeeze(),
                        'label': torch.tensor(item['label'], dtype=torch.long)
                    }

# Train on thousands of JSONL files
dataset = MultiFileJSONLDataset('./data/*.jsonl', tokenizer)
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

TensorFlow with JSONL

TF Dataset from JSONL

import json
import tensorflow as tf

def jsonl_generator(filepath):
    """Generator function to yield examples from JSONL"""
    with open(filepath, 'r') as f:
        for line in f:
            item = json.loads(line)
            yield item['text'], item['label']

def create_dataset(jsonl_path, batch_size=32):
    """Create TensorFlow Dataset from JSONL file"""

    # Create dataset from generator
    dataset = tf.data.Dataset.from_generator(
        lambda: jsonl_generator(jsonl_path),
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(), dtype=tf.int32)
        )
    )

    # Tokenization and preprocessing
    # Fit the tokenizer first: texts_to_sequences returns empty sequences otherwise
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
    tokenizer.fit_on_texts(text for text, _ in jsonl_generator(jsonl_path))

    def preprocess(text, label):
        # Tokenize and pad sequences
        tokens = tokenizer.texts_to_sequences([text.numpy().decode('utf-8')])
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            tokens, maxlen=512, padding='post'
        )
        return padded[0], label

    dataset = dataset.map(
        lambda text, label: tf.py_function(
            preprocess, [text, label], [tf.int32, tf.int32]
        )
    )

    # tf.py_function discards static shape information; restore it so Keras
    # can infer input shapes during model.fit
    def set_shapes(tokens, label):
        tokens.set_shape([512])
        label.set_shape([])
        return tokens, label

    dataset = dataset.map(set_shapes)

    # Batch and prefetch
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset

# Usage
train_dataset = create_dataset('train.jsonl', batch_size=32)
model.fit(train_dataset, epochs=10)

High-Performance Streaming

import tensorflow as tf
import json

def create_optimized_dataset(jsonl_files, batch_size=32):
    """High-performance dataset with parallel processing"""

    def parse_line(line):
        item = json.loads(line.numpy().decode('utf-8'))
        return item['text'], item['label']

    # Read multiple JSONL files in parallel
    dataset = tf.data.Dataset.list_files(jsonl_files, shuffle=True)

    # Interleave files for better shuffling
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath),
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE
    )

    # Parse JSON in parallel
    dataset = dataset.map(
        lambda line: tf.py_function(
            parse_line, [line], [tf.string, tf.int32]
        ),
        num_parallel_calls=tf.data.AUTOTUNE
    )

    # Shuffle with large buffer
    dataset = dataset.shuffle(buffer_size=10000)

    # Batch and prefetch
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset

# Train with multiple JSONL files
# (this pipeline yields raw strings, so `model` is expected to begin with a
# text-preprocessing layer such as tf.keras.layers.TextVectorization)
file_pattern = './data/*.jsonl'
dataset = create_optimized_dataset(file_pattern)
model.fit(dataset, epochs=10)

Hugging Face Datasets

Load JSONL into Datasets

Hugging Face Datasets library has native JSONL support with powerful features.

from datasets import load_dataset

# Load single JSONL file
dataset = load_dataset('json', data_files='train.jsonl')

# Load train/validation/test splits
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'validation': 'val.jsonl',
    'test': 'test.jsonl'
})

# Load from multiple files with wildcards
dataset = load_dataset('json', data_files='data/*.jsonl')

# Stream large datasets without downloading entirely
dataset = load_dataset('json', data_files='huge_file.jsonl', streaming=True)

# Access data
print(dataset['train'][0])
print(f"Training examples: {len(dataset['train'])}")

Preprocessing and Tokenization

from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset('json', data_files='train.jsonl')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Apply tokenization to entire dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4  # Parallel processing
)

# Remove unnecessary columns
tokenized_dataset = tokenized_dataset.remove_columns(['text'])
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')
tokenized_dataset.set_format('torch')

# Ready for training
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train']
)

trainer.train()

Streaming Mode for Massive Datasets

from datasets import load_dataset

# Stream JSONL without loading into memory
dataset = load_dataset('json', data_files='huge_dataset.jsonl', streaming=True)

# Iterate through examples
for example in dataset['train']:
    print(example)
    # Process one example at a time

# Apply transformations to streaming dataset
def preprocess(example):
    example['text'] = example['text'].lower()
    return example

dataset = dataset.map(preprocess)

# Take first N examples
dataset_sample = dataset['train'].take(1000)

# Shuffle with buffer
dataset_shuffled = dataset['train'].shuffle(buffer_size=10000)

# Filter examples
dataset_filtered = dataset['train'].filter(lambda x: len(x['text']) > 50)

# Perfect for training on multi-terabyte datasets
for epoch in range(3):
    for batch in dataset['train'].batch(32):
        # Train on batch
        pass

Save and Share Datasets

from datasets import Dataset, load_dataset

# Create dataset from Python objects
data = [
    {"text": "Example 1", "label": 0},
    {"text": "Example 2", "label": 1}
]
dataset = Dataset.from_list(data)

# Save as JSONL
dataset.to_json('output.jsonl')

# Load it back
reloaded = load_dataset('json', data_files='output.jsonl')

# Push to Hugging Face Hub
dataset.push_to_hub('username/my-dataset')

# Load from Hub
hub_dataset = load_dataset('username/my-dataset')

# Split and export each split separately
splits = dataset.train_test_split(test_size=0.2)
splits['train'].to_json('train.jsonl')
splits['test'].to_json('test.jsonl')

Best Practices

Data Quality

  • Validate JSON on each line before training (validation and deduplication are sketched after this list)
  • Include data versioning metadata (timestamp, version, source)
  • Remove duplicates and corrupted examples
  • Balance class distributions for classification tasks
  • Document your schema and field meanings
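
A minimal sketch of the validation and deduplication steps; the required fields and file names are illustrative and should match your own schema:

import json

REQUIRED_FIELDS = {'text', 'label'}  # adjust to your schema
seen = set()
clean = []

with open('raw.jsonl', encoding='utf-8') as f:
    for line_number, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            print(f'line {line_number}: invalid JSON, skipped')
            continue
        if not REQUIRED_FIELDS.issubset(record):
            print(f'line {line_number}: missing required fields, skipped')
            continue
        key = json.dumps(record, sort_keys=True)  # canonical form for exact-duplicate detection
        if key not in seen:
            seen.add(key)
            clean.append(record)

with open('cleaned.jsonl', 'w', encoding='utf-8') as f:
    for record in clean:
        f.write(json.dumps(record) + '\n')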

Performance Optimization

  • Use streaming for datasets that don't fit in memory
  • Compress JSONL files with gzip (text-heavy data often shrinks by 80-90%); compressed files can still be streamed line by line, as sketched after this list
  • Split large files into smaller chunks for parallel processing
  • Use multiprocessing for data loading (num_workers > 0)
  • Enable dataset caching to avoid re-processing
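
Gzipped JSONL can still be streamed line by line: gzip.open decompresses on the fly, so the memory profile matches reading the plain file. A minimal sketch (file names are illustrative):

import gzip
import json

# Write compressed JSONL
with gzip.open('train.jsonl.gz', 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'text': 'Example', 'label': 1}) + '\n')

# Stream it back without ever decompressing to disk
with gzip.open('train.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)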

File Organization

  • Separate train/validation/test into different files
  • Use descriptive filenames: train_v2_cleaned.jsonl
  • Store embeddings separate from text when possible
  • Keep raw and processed datasets in different directories
  • Version control your JSONL schemas (not the data)

Data Privacy & Security

  • Anonymize PII (personally identifiable information)
  • Encrypt sensitive training data at rest and in transit
  • Audit data sources and license compliance
  • Implement access controls for production datasets
  • Log data access for compliance tracking

Testing & Validation

  • Validate JSONL syntax before uploading to training platforms
  • Check for required fields in all examples
  • Test on small sample before full training run
  • Monitor for data drift over time
  • Keep holdout test set separate and unmodified
