21 Appendix 1: LLM Data Schemas

Status: first draft complete.

Here, we describe many variants of LLM data. This will be relevant for when we extend the flywheel to include more types of data, and especially shift towards promoting the sharing (via opt-in flywheels, but also via new market mechanisms) of richer “content data”.

21.1 Open Web / Crawls

WARC/WAT/WET
- WARC (container for HTTP request/response records) — spec & overview: (International Internet Preservation Consortium 2017; Library of Congress, n.d.)
- WAT (JSON metadata extracted from WARC) and WET (plain text extracted from HTML) — Common Crawl guides (Common Crawl, n.d.a, n.d.b)
C4 (Colossal Clean Crawled Corpus) — TFDS catalog & generator code (TensorFlow Datasets, n.d.a, n.d.b)
The Pile (22-source, mixed corpus) — paper (Gao et al. 2021)

21.2 Encyclopedic / Books

Wikipedia XML dumps (page/revision XML; SQL tables for links) — (Wikimedia Meta-Wiki, n.d.; Wikipedia, n.d.)
Project Gutenberg
- Books: plain text/HTML master formats; ePub/MOBI derived (Project Gutenberg, n.d.a)
- Catalog schema: daily RDF/XML (also CSV) for metadata; offline catalogs (Project Gutenberg, n.d.b)

21.3 Scientific / Legal

arXiv (Atom/OAI-PMH metadata; bulk & API) — (arXiv.org, n.d.c, n.d.a, n.d.b)
JATS XML (journal article tag suite) — (NISO 2024; NLM, n.d.)

21.4 Code

BigCode — The Stack / The Stack v2 (source files + license/provenance metadata; dedup variants) — (BigCode Project, n.d.b, n.d.c, n.d.a, 2022)

21.6 Instruction / Conversations (Post-training SFT)

OpenAI-style chat schema (role-tagged: system|user|assistant, plus tool calls) — (OpenAI, n.d.c)
Alpaca (JSON prompts/instructions/outputs) — (Stanford CRFM 2023; Tatsu Lab, n.d.; gururise, n.d.)
Databricks Dolly-15k (human-written instruction/response pairs) — (Databricks, n.d.)
OpenAssistant OASST1 (message-tree conversations with roles) — (OpenAssistant, n.d.)

21.7 Preference / Feedback (RLHF & DPO)

HH-RLHF (Anthropic helpful/harmless, JSONL pairs: chosen vs rejected) — (Anthropic, n.d.)
DPO format (prompt + preferred vs dispreferred response) — (Rafailov et al. 2023)

21.8 Multimodal (for VLMs/ASR)

LAION-5B / Re-LAION-5B (image–text pairs with CLIP scores; links) — (LAION 2022, 2024)
Whisper (weakly-supervised ASR; audio → text pairs) — (Radford et al. 2022; OpenAI 2022)
HowTo100M (YouTube instructional video clips + narrations) — (École Normale Supérieure, n.d.; Miech et al. 2019)

21.9 Math Reasoning (often for post-training/eval)

GSM8K (grade-school word problems; JSON) — (OpenAI, n.d.a, n.d.b)
MATH (competition problems with step-by-step solutions) — (Hendrycks et al. 2021; Hendrycks, n.d.)

21.10 Common Storage Containers

JSON Lines / NDJSON — (jsonlines.org, n.d.; ndjson, n.d.)
TFRecord — (TensorFlow, n.d.)
Apache Parquet — (Apache Software Foundation, n.d.)