Why JSONL has become the standard format for training data, embeddings, and model outputs in modern AI development
Machine learning datasets can contain millions or billions of examples. JSONL allows you to stream training data without loading entire datasets into memory, making it possible to train models on commodity hardware.
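As a minimal sketch of that streaming pattern (assuming a local file named train.jsonl), the loop below touches one record at a time, so memory use stays flat regardless of file size:

import json

# Stream a JSONL file one record at a time; memory use stays constant
# no matter how large the file is. 'train.jsonl' is a placeholder path.
count = 0
with open('train.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        count += 1
print(f"Processed {count} examples without loading the file into memory")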
Major ML platforms have standardized on JSONL for training data and model outputs; OpenAI requires it for fine-tuning, and the Hugging Face Datasets library supports it natively. The sections below cover the most common record formats, followed by loading patterns for PyTorch, TensorFlow, and Hugging Face Datasets.
Each line contains input features and corresponding labels. Perfect for classification and regression tasks.
{"text": "This movie was fantastic!", "label": "positive", "confidence": 0.95}
{"text": "Terrible experience, would not recommend", "label": "negative", "confidence": 0.89}
{"text": "Average quality, nothing special", "label": "neutral", "confidence": 0.72}
{"text": "Best purchase I've ever made!", "label": "positive", "confidence": 0.98}
Pro Tip: Include metadata like confidence scores or data source to help analyze model performance later.
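To illustrate why that metadata is useful, here is a small sketch (the sentiment.jsonl filename is a placeholder) that summarizes example counts and average confidence per label:

import json
from collections import defaultdict

# Summarize per-label example counts and mean confidence from a
# classification JSONL file like the one above.
totals = defaultdict(lambda: [0.0, 0])
with open('sentiment.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        totals[record['label']][0] += record.get('confidence', 0.0)
        totals[record['label']][1] += 1

for label, (conf_sum, n) in totals.items():
    print(f"{label}: {n} examples, mean confidence {conf_sum / n:.2f}")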
When an example can carry multiple labels at once, store the labels as a JSON array.
{"document": "Python tutorial for beginners...", "labels": ["programming", "python", "tutorial", "beginner"]}
{"document": "Machine learning with PyTorch...", "labels": ["programming", "python", "machine-learning", "deep-learning"]}
{"document": "Web scraping best practices...", "labels": ["programming", "python", "web-scraping", "automation"]}
Format for QA systems, chatbots, and conversational AI models.
{"question": "What is the capital of France?", "context": "France is a country in Europe. Paris is its capital and largest city.", "answer": "Paris", "answer_start": 47}
{"question": "How many states are in the US?", "context": "The United States consists of 50 states, a federal district, and several territories.", "answer": "50", "answer_start": 35}
{"question": "Who wrote Romeo and Juliet?", "context": "Romeo and Juliet is a tragedy written by William Shakespeare early in his career.", "answer": "William Shakespeare", "answer_start": 49}
Store image paths or base64-encoded images with labels. Include bounding boxes for object detection.
{"image_path": "/data/images/cat_001.jpg", "label": "cat", "breed": "siamese", "age": "adult"}
{"image_path": "/data/images/dog_042.jpg", "label": "dog", "breed": "labrador", "age": "puppy"}
{"image_path": "/data/images/car_199.jpg", "label": "vehicle", "type": "sedan", "color": "blue", "bbox": [100, 150, 300, 400]}
Note: For large images, store file paths rather than base64 data to keep JSONL files manageable.
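A short sanity check along these lines (images.jsonl is a placeholder path) catches missing files and malformed bounding boxes before they break a training run:

import json
import os

# Verify that every referenced image exists and that any bounding box has
# exactly four coordinates.
with open('images.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        if not os.path.exists(record['image_path']):
            print(f"Line {i}: missing file {record['image_path']}")
        if 'bbox' in record and len(record['bbox']) != 4:
            print(f"Line {i}: bbox should have 4 values")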
Format temporal sequences for forecasting and anomaly detection.
{"timestamp": "2025-01-01T00:00:00Z", "sensor_id": "temp_01", "value": 72.5, "unit": "fahrenheit", "location": "room_a"}
{"timestamp": "2025-01-01T00:05:00Z", "sensor_id": "temp_01", "value": 72.8, "unit": "fahrenheit", "location": "room_a"}
{"timestamp": "2025-01-01T00:10:00Z", "sensor_id": "temp_01", "value": 73.1, "unit": "fahrenheit", "location": "room_a"}
OpenAI requires JSONL format for fine-tuning GPT models. Each line represents a conversation or completion example.
{"messages": [{"role": "system", "content": "You are a helpful customer service assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password, go to Settings > Account > Reset Password. You'll receive an email with instructions."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer service assistant."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "I can help you track your order. Please provide your order number, and I'll look that up for you."}]}
{"prompt": "Translate English to French: Hello", "completion": "Bonjour"}
{"prompt": "Translate English to French: Goodbye", "completion": "Au revoir"}
{"prompt": "Translate English to French: Thank you", "completion": "Merci"}
Validation: The legacy OpenAI CLI (openai-python versions prior to 1.0) includes a data-preparation tool you can run against your JSONL before uploading:
openai tools fine_tunes.prepare_data -f training_data.jsonl
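If you would rather not depend on the legacy CLI, a structural check in plain Python catches most formatting problems. The sketch below validates the chat-style records shown above (training_data.jsonl matches the command above; adjust the required keys if you use the prompt-completion format):

import json

# Minimal structural check for chat-format fine-tuning data: every line must
# parse as JSON and contain a non-empty "messages" list with role/content keys.
with open('training_data.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i}: invalid JSON ({e})")
            continue
        messages = record.get('messages')
        if not isinstance(messages, list) or not messages:
            print(f"Line {i}: missing or empty 'messages' list")
            continue
        for m in messages:
            if 'role' not in m or 'content' not in m:
                print(f"Line {i}: message missing 'role' or 'content'")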
Format for training models to follow instructions (like Alpaca, Vicuna, LLaMA fine-tuning).
{"instruction": "Write a Python function to calculate factorial", "input": "", "output": "def factorial(n):\n if n == 0 or n == 1:\n return 1\n return n * factorial(n - 1)"}
{"instruction": "Summarize this article", "input": "Artificial intelligence is transforming industries...", "output": "AI is revolutionizing multiple sectors through automation and enhanced decision-making capabilities."}
{"instruction": "Classify sentiment", "input": "I love this product!", "output": "Positive"}
Format for preference-based training with ranked outputs.
{"prompt": "Explain quantum computing", "chosen": "Quantum computing uses quantum bits (qubits) that can exist in superposition...", "rejected": "Quantum computing is just really fast computers.", "ranking": [1, 2]}
{"prompt": "Write a haiku about nature", "chosen": "Cherry blossoms fall\nGentle breeze through autumn leaves\nNature's symphony", "rejected": "Trees are green and nice\nFlowers bloom in the spring time\nI like nature lots", "ranking": [1, 2]}
Train models to maintain context across multiple dialogue turns.
{"conversation_id": "conv_001", "messages": [{"speaker": "user", "text": "What's the weather like?"}, {"speaker": "bot", "text": "I'd be happy to check. What's your location?"}, {"speaker": "user", "text": "San Francisco"}, {"speaker": "bot", "text": "In San Francisco, it's currently 65F with partly cloudy skies."}]}
{"conversation_id": "conv_002", "messages": [{"speaker": "user", "text": "I need help with my order"}, {"speaker": "bot", "text": "I can help with that. What's your order number?"}, {"speaker": "user", "text": "ORDER-12345"}, {"speaker": "bot", "text": "I found your order. It's currently in transit and should arrive tomorrow."}]}
Store vector embeddings alongside their source text for semantic search and similarity matching.
{"id": "doc_001", "text": "Machine learning is a subset of AI", "embedding": [0.123, -0.456, 0.789, ...], "model": "text-embedding-ada-002", "dimensions": 1536}
{"id": "doc_002", "text": "Neural networks are inspired by the brain", "embedding": [-0.234, 0.567, -0.123, ...], "model": "text-embedding-ada-002", "dimensions": 1536}
Include rich metadata for filtering and retrieval in RAG (Retrieval Augmented Generation) systems.
{"doc_id": "article_123", "title": "Introduction to Python", "content": "Python is a versatile programming language...", "embedding": [0.23, -0.45, 0.67, ...], "author": "Jane Doe", "published": "2025-01-15", "category": "programming", "tags": ["python", "tutorial", "beginner"], "word_count": 1250}
{"doc_id": "article_124", "title": "Advanced JavaScript Techniques", "content": "Modern JavaScript offers powerful features...", "embedding": [-0.12, 0.89, -0.34, ...], "author": "John Smith", "published": "2025-01-16", "category": "programming", "tags": ["javascript", "advanced", "es6"], "word_count": 2100}
Store image vectors from models like CLIP for multimodal search.
{"image_id": "img_5001", "path": "/images/cat_001.jpg", "embedding": [0.45, -0.23, 0.78, ...], "model": "clip-vit-base-patch32", "labels": ["cat", "pet", "animal"], "dimensions": 512}
{"image_id": "img_5002", "path": "/images/landscape_042.jpg", "embedding": [-0.67, 0.34, -0.12, ...], "model": "clip-vit-base-patch32", "labels": ["nature", "mountain", "outdoor"], "dimensions": 512}
# Generate text embeddings with the OpenAI API, store them in JSONL, and
# compare stored vectors with cosine similarity.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

# Generate embeddings
texts = ["Sample text 1", "Sample text 2"]
embeddings = []
for text in texts:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    embeddings.append(response.data[0].embedding)

# Save to JSONL
with open('embeddings.jsonl', 'w') as f:
    for text, embedding in zip(texts, embeddings):
        record = {
            "text": text,
            "embedding": embedding,
            "dimensions": len(embedding)
        }
        f.write(json.dumps(record) + '\n')

# Load and compute similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

with open('embeddings.jsonl', 'r') as f:
    docs = [json.loads(line) for line in f]

# Find similar documents
query_embedding = docs[0]['embedding']
for doc in docs[1:]:
    similarity = cosine_similarity(query_embedding, doc['embedding'])
    print(f"Similarity: {similarity:.4f}")
Store model predictions with confidence scores and metadata for analysis.
{"input": "This product exceeded my expectations!", "prediction": "positive", "confidence": 0.96, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:00Z"}
{"input": "Shipping took too long", "prediction": "negative", "confidence": 0.84, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:01Z"}
{"input": "It's okay, nothing special", "prediction": "neutral", "confidence": 0.71, "model": "sentiment-v2.1", "timestamp": "2025-01-15T10:30:02Z"}
Include probability distributions across all classes for detailed analysis.
{"input": "Apple announces new iPhone", "predicted_class": "technology", "probabilities": {"technology": 0.89, "business": 0.07, "entertainment": 0.03, "sports": 0.01}, "top_3": ["technology", "business", "entertainment"]}
{"input": "Lakers win championship game", "predicted_class": "sports", "probabilities": {"sports": 0.94, "entertainment": 0.04, "technology": 0.01, "business": 0.01}, "top_3": ["sports", "entertainment", "technology"]}
Log language model generations with usage statistics for monitoring costs and quality.
{"prompt": "Explain photosynthesis simply", "completion": "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide...", "model": "gpt-4", "tokens_prompt": 5, "tokens_completion": 28, "total_tokens": 33, "cost": 0.00099, "finish_reason": "stop", "timestamp": "2025-01-15T14:22:00Z"}
{"prompt": "Write a haiku about coding", "completion": "Lines of code take shape\nDebugging through the long night\nBug-free at sunrise", "model": "gpt-4", "tokens_prompt": 6, "tokens_completion": 18, "total_tokens": 24, "cost": 0.00072, "finish_reason": "stop", "timestamp": "2025-01-15T14:22:15Z"}
Compare multiple model versions for performance evaluation.
{"test_id": "ab_test_001", "input": "customer query...", "model_a": {"version": "v1.0", "prediction": "class_a", "confidence": 0.82, "latency_ms": 45}, "model_b": {"version": "v2.0", "prediction": "class_a", "confidence": 0.91, "latency_ms": 38}, "ground_truth": "class_a", "winner": "model_b"}
{"test_id": "ab_test_002", "input": "another query...", "model_a": {"version": "v1.0", "prediction": "class_b", "confidence": 0.76, "latency_ms": 42}, "model_b": {"version": "v2.0", "prediction": "class_c", "confidence": 0.88, "latency_ms": 41}, "ground_truth": "class_c", "winner": "model_b"}
import json
import torch
from torch.utils.data import Dataset, DataLoader

class JSONLDataset(Dataset):
    """Custom PyTorch Dataset for JSONL files"""

    def __init__(self, jsonl_path, tokenizer, max_length=512):
        self.data = []
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Load JSONL file
        with open(jsonl_path, 'r', encoding='utf-8') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Tokenize text
        encoding = self.tokenizer(
            item['text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            # Assumes 'label' is stored as an integer class id
            'label': torch.tensor(item['label'], dtype=torch.long)
        }

# Usage
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = JSONLDataset('train.jsonl', tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Training loop
for batch in dataloader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['label']
    # ... training code ...
Process massive datasets without loading everything into memory using IterableDataset.
import json
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingJSONLDataset(IterableDataset):
    """Memory-efficient streaming dataset for large JSONL files"""

    def __init__(self, jsonl_path, tokenizer, max_length=512):
        self.jsonl_path = jsonl_path
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        with open(self.jsonl_path, 'r', encoding='utf-8') as f:
            for line in f:
                item = json.loads(line)
                encoding = self.tokenizer(
                    item['text'],
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )
                yield {
                    'input_ids': encoding['input_ids'].squeeze(),
                    'attention_mask': encoding['attention_mask'].squeeze(),
                    'label': torch.tensor(item['label'], dtype=torch.long)
                }

# Process terabytes of data with constant memory usage
streaming_dataset = StreamingJSONLDataset('huge_dataset.jsonl', tokenizer)
dataloader = DataLoader(streaming_dataset, batch_size=32)
for batch in dataloader:
    # Process batch without loading entire dataset
    pass
import json
import glob
import random
import torch
from torch.utils.data import IterableDataset, DataLoader

class MultiFileJSONLDataset(IterableDataset):
    """Dataset that streams from multiple JSONL files"""

    def __init__(self, pattern, tokenizer, max_length=512):
        self.files = glob.glob(pattern)
        self.tokenizer = tokenizer
        self.max_length = max_length
        random.shuffle(self.files)  # Randomize file order

    def __iter__(self):
        for filepath in self.files:
            with open(filepath, 'r', encoding='utf-8') as f:
                for line in f:
                    item = json.loads(line)
                    encoding = self.tokenizer(
                        item['text'],
                        max_length=self.max_length,
                        padding='max_length',
                        truncation=True,
                        return_tensors='pt'
                    )
                    yield {
                        'input_ids': encoding['input_ids'].squeeze(),
                        'attention_mask': encoding['attention_mask'].squeeze(),
                        'label': torch.tensor(item['label'], dtype=torch.long)
                    }

# Train on thousands of JSONL files
dataset = MultiFileJSONLDataset('./data/*.jsonl', tokenizer)
# Note: with num_workers > 0, shard self.files across workers (see
# torch.utils.data.get_worker_info()) or each worker will yield every example.
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
import json
import tensorflow as tf

def jsonl_generator(filepath):
    """Generator function to yield examples from JSONL"""
    with open(filepath, 'r') as f:
        for line in f:
            item = json.loads(line)
            yield item['text'], item['label']

def create_dataset(jsonl_path, batch_size=32):
    """Create TensorFlow Dataset from JSONL file"""
    # Create dataset from generator
    dataset = tf.data.Dataset.from_generator(
        lambda: jsonl_generator(jsonl_path),
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(), dtype=tf.int32)
        )
    )

    # Tokenization and preprocessing
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
    # The tokenizer must be fit on the corpus before texts_to_sequences
    # can return non-empty sequences
    tokenizer.fit_on_texts(text for text, _ in jsonl_generator(jsonl_path))

    def preprocess(text, label):
        # Tokenize and pad sequences
        tokens = tokenizer.texts_to_sequences([text.numpy().decode('utf-8')])
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            tokens, maxlen=512, padding='post'
        )
        return padded[0], label

    dataset = dataset.map(
        lambda text, label: tf.py_function(
            preprocess, [text, label], [tf.int32, tf.int32]
        )
    )

    # Batch and prefetch
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset

# Usage (assumes a compiled Keras model)
train_dataset = create_dataset('train.jsonl', batch_size=32)
model.fit(train_dataset, epochs=10)
import tensorflow as tf
import json

def create_optimized_dataset(jsonl_files, batch_size=32):
    """High-performance dataset with parallel processing"""

    def parse_line(line):
        item = json.loads(line.numpy().decode('utf-8'))
        return item['text'], item['label']

    # Read multiple JSONL files in parallel
    dataset = tf.data.Dataset.list_files(jsonl_files, shuffle=True)

    # Interleave files for better shuffling
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath),
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE
    )

    # Parse JSON in parallel
    dataset = dataset.map(
        lambda line: tf.py_function(
            parse_line, [line], [tf.string, tf.int32]
        ),
        num_parallel_calls=tf.data.AUTOTUNE
    )

    # Shuffle with large buffer
    dataset = dataset.shuffle(buffer_size=10000)

    # Batch and prefetch
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset

# Train with multiple JSONL files
file_pattern = './data/*.jsonl'
dataset = create_optimized_dataset(file_pattern)
model.fit(dataset, epochs=10)
The Hugging Face Datasets library has native JSONL support, including streaming, parallel preprocessing, and direct integration with the Hugging Face Hub.
from datasets import load_dataset

# Load single JSONL file
dataset = load_dataset('json', data_files='train.jsonl')

# Load train/validation/test splits
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'validation': 'val.jsonl',
    'test': 'test.jsonl'
})

# Load from multiple files with wildcards
dataset = load_dataset('json', data_files='data/*.jsonl')

# Stream large datasets without downloading entirely
streaming_dataset = load_dataset('json', data_files='huge_file.jsonl', streaming=True)

# Access data (non-streaming datasets support indexing and len())
print(dataset['train'][0])
print(f"Training examples: {len(dataset['train'])}")
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load dataset
dataset = load_dataset('json', data_files='train.jsonl')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Apply tokenization to entire dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4  # Parallel processing
)

# Remove unnecessary columns
tokenized_dataset = tokenized_dataset.remove_columns(['text'])
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')
tokenized_dataset.set_format('torch')

# Ready for training
from transformers import Trainer, TrainingArguments

# Load a model to fine-tune (num_labels should match your dataset)
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train']
)

trainer.train()
from datasets import load_dataset

# Stream JSONL without loading into memory
dataset = load_dataset('json', data_files='huge_dataset.jsonl', streaming=True)

# Iterate through examples
for example in dataset['train']:
    print(example)
    # Process one example at a time

# Apply transformations to streaming dataset
def preprocess(example):
    example['text'] = example['text'].lower()
    return example

dataset = dataset.map(preprocess)

# Take first N examples
dataset_sample = dataset['train'].take(1000)

# Shuffle with buffer
dataset_shuffled = dataset['train'].shuffle(buffer_size=10000)

# Filter examples
dataset_filtered = dataset['train'].filter(lambda x: len(x['text']) > 50)

# Perfect for training on multi-terabyte datasets
for epoch in range(3):
    for batch in dataset['train'].batch(32):
        # Train on batch
        pass
from datasets import Dataset, load_dataset

# Create dataset from Python objects
data = [
    {"text": "Example 1", "label": 0},
    {"text": "Example 2", "label": 1}
]
dataset = Dataset.from_list(data)

# Save as JSONL
dataset.to_json('output.jsonl')

# Load it back
reloaded = load_dataset('json', data_files='output.jsonl')

# Push to Hugging Face Hub (requires an authenticated session, e.g. `huggingface-cli login`)
dataset.push_to_hub('username/my-dataset')

# Load from Hub
hub_dataset = load_dataset('username/my-dataset')

# Export the splits of a DatasetDict separately (assumes 'train' and 'test' splits exist)
hub_dataset['train'].to_json('train.jsonl')
hub_dataset['test'].to_json('test.jsonl')