9 Appendix 1: LLM Data Schemas
Here, we describe many variants of LLM data. This will be relevant for when we extend the flywheel to include more types of data, and especially shift towards promoting the sharing (via opt-in flywheels, but also via new market mechanisms) of richer “content data”.
Open Web / Crawls
WARC/WAT/WET
- WARC (container for HTTP request/response records) — spec & overview: IIPC WARC 1.1; Library of Congress format note. (IIPC Community Resources, The Library of Congress)
- WAT (JSON metadata extracted from WARC) and WET (plain text extracted from HTML) — Common Crawl guides. (Common Crawl, Common Crawl)
C4 (Colossal Clean Crawled Corpus) — TFDS catalog & generator code. Fields are essentially clean text segments with basic metadata. (TensorFlow, GitHub)
The Pile (22-source, mixed corpus) — paper & HTML view. (arXiv, ar5iv)
Encyclopedic / Books
Wikipedia XML dumps (page/revision XML; SQL tables for links) — Meta-Wiki dump format; Wikipedia database download. (Meta, Wikipedia)
Project Gutenberg
- Books: plain text/HTML master formats; ePub/MOBI derived. (Project Gutenberg)
- Catalog schema: daily RDF/XML (also CSV) for metadata; offline catalogs. (Project Gutenberg)
Scientific / Legal
- arXiv (Atom/OAI-PMH metadata; bulk & API) — OAI-PMH + API docs; bulk metadata page. (info.arxiv.org, info.arxiv.org, info.arxiv.org)
- JATS XML (journal article tag suite) — NISO standards; NLM JATS site. (niso.org, jats.nlm.nih.gov)
Code
- BigCode — The Stack / The Stack v2 (source files + license/provenance metadata; dedup variants) — HF datasets, project docs, arXiv overview. (Hugging Face, Hugging Face, BigCode, arXiv)
Forums / Q&A / Social
Stack Exchange dumps (XML: Posts, Users, Comments, Votes, etc.) — SE Meta/docs & Data Explorer. (Meta Stack Exchange, data.stackexchange.com)
Reddit
- API JSON schema — official API docs & help. (Reddit, Reddit Help)
- Pushshift (historical dumps; research dataset) — site & paper. (pushshift.io, arXiv)
Instruction / Conversations (Post-training SFT)
- OpenAI-style chat schema (role-tagged:
system|user|assistant
, plus tool calls) — API reference. (OpenAI Platform) - Alpaca (JSON prompts/instructions/outputs) — Stanford post & repo; cleaned community set. (crfm.stanford.edu, GitHub, GitHub)
- Databricks Dolly-15k (human-written instruction/response pairs) — repo. (GitHub)
- OpenAssistant OASST1 (message-tree conversations with roles) — HF dataset card. (Hugging Face)
- OpenAI-style chat schema (role-tagged:
Preference / Feedback (RLHF & DPO)
Multimodal (for VLMs/ASR)
Math-reasoning (often for post-training/eval)
- GSM8K (grade-school word problems; JSON) — repo & HF dataset card. (GitHub, Hugging Face)
- MATH (competition problems with step-by-step solutions) — paper & HF. (arXiv, Hugging Face)
Common storage containers
- JSON Lines / NDJSON — jsonlines.org; ndjson spec. (jsonlines.org, GitHub)
- TFRecord — TensorFlow tutorial. (TensorFlow)
- Apache Parquet — project site. (Apache Parquet)
#todo check all refs