16 Appendix 1: LLM Data Schemas
Here, we describe many variants of LLM data. This will be relevant for when we extend the flywheel to include more types of data, and especially shift towards promoting the sharing (via opt-in flywheels, but also via new market mechanisms) of richer “content data”.
16.1 Open Web / Crawls
WARC/WAT/WET
- WARC (container for HTTP request/response records) — spec & overview: (International Internet Preservation Consortium 2017; Library of Congress, n.d.)
- WAT (JSON metadata extracted from WARC) and WET (plain text extracted from HTML) — Common Crawl guides (Common Crawl, n.d.a, n.d.b)
C4 (Colossal Clean Crawled Corpus) — TFDS catalog & generator code (TensorFlow Datasets, n.d.a, n.d.b)
The Pile (22-source, mixed corpus) — paper (Gao et al. 2021)
16.2 Encyclopedic / Books
Wikipedia XML dumps (page/revision XML; SQL tables for links) — (Wikimedia Meta-Wiki, n.d.; Wikipedia, n.d.)
Project Gutenberg
- Books: plain text/HTML master formats; ePub/MOBI derived (Project Gutenberg, n.d.a)
- Catalog schema: daily RDF/XML (also CSV) for metadata; offline catalogs (Project Gutenberg, n.d.b)
16.3 Scientific / Legal
arXiv (Atom/OAI-PMH metadata; bulk & API) — (arXiv.org, n.d.c, n.d.a, n.d.b)
JATS XML (journal article tag suite) — (NISO 2024; NLM, n.d.)
16.4 Code
- BigCode — The Stack / The Stack v2 (source files + license/provenance metadata; dedup variants) — (BigCode Project, n.d.b, n.d.c, n.d.a; stack_paper?)
16.6 Instruction / Conversations (Post-training SFT)
OpenAI-style chat schema (role-tagged:
system|user|assistant
, plus tool calls) — (OpenAI, n.d.c)Alpaca (JSON prompts/instructions/outputs) — (Stanford CRFM 2023; Tatsu Lab, n.d.; gururise, n.d.)
Databricks Dolly-15k (human-written instruction/response pairs) — (Databricks, n.d.)
OpenAssistant OASST1 (message-tree conversations with roles) — (OpenAssistant, n.d.)
16.7 Preference / Feedback (RLHF & DPO)
HH-RLHF (Anthropic helpful/harmless, JSONL pairs:
chosen
vsrejected
) — (Anthropic, n.d.)DPO format (prompt + preferred vs dispreferred response) — (Rafailov et al. 2024)
16.8 Multimodal (for VLMs/ASR)
LAION-5B / Re-LAION-5B (image–text pairs with CLIP scores; links) — (LAION 2022a, 2022b)
Whisper (weakly-supervised ASR; audio → text pairs) — (Radford et al. 2022; OpenAI 2022)
HowTo100M (YouTube instructional video clips + narrations) — (École Normale Supérieure, n.d.; Miech et al. 2019)
16.9 Math Reasoning (often for post-training/eval)
GSM8K (grade-school word problems; JSON) — (OpenAI, n.d.a, n.d.b)
MATH (competition problems with step-by-step solutions) — (Hendrycks et al. 2021; Hendrycks, n.d.)
16.10 Common Storage Containers
JSON Lines / NDJSON — (jsonlines.org, n.d.; ndjson, n.d.)
TFRecord — (TensorFlow, n.d.)
Apache Parquet — (Apache Software Foundation, n.d.)