16  Appendix 1: LLM Data Schemas

This appendix catalogs common formats of LLM data. These schemas become relevant when we extend the flywheel to include more types of data, and especially when we shift towards promoting the sharing (via opt-in flywheels, but also via new market mechanisms) of richer “content data”.


16.1 Open Web / Crawls


16.2 Encyclopedic / Books


16.4 Code


16.5 Forums / Q&A / Social


16.6 Instruction / Conversations (Post-training SFT)


16.7 Preference / Feedback (RLHF & DPO)

  • HH-RLHF: Anthropic helpful/harmless data, stored as JSONL records that pair a chosen and a rejected response (Anthropic, n.d.)

  • DPO format: a prompt plus a preferred and a dispreferred response (Rafailov et al. 2024)
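
To make the relationship between the two formats above concrete, the following is a minimal sketch of converting one HH-RLHF-style JSONL record into a prompt/chosen/rejected triple. It rests on several assumptions not stated in this appendix: that each record holds two full transcripts under the keys `chosen` and `rejected`, that turns are marked with `\n\nHuman:` and `\n\nAssistant:`, and that the two transcripts differ only in the final assistant turn. The output field names (`prompt`, `chosen`, `rejected`) follow a convention used by some training libraries and are not prescribed by the DPO paper. The helper names and the sample record are hypothetical.

```python
import json

# Assumed turn marker for HH-RLHF-style transcripts.
ASSISTANT_TAG = "\n\nAssistant:"


def split_transcript(transcript: str) -> tuple[str, str]:
    """Split a transcript into (prompt, final assistant response)."""
    head, sep, tail = transcript.rpartition(ASSISTANT_TAG)
    if not sep:
        raise ValueError("no assistant turn found in transcript")
    # Keep the marker on the prompt so the response is a pure continuation.
    return head + sep, tail.strip()


def to_preference_triple(record: dict) -> dict:
    """Convert a {chosen, rejected} record into {prompt, chosen, rejected}."""
    prompt_chosen, chosen = split_transcript(record["chosen"])
    prompt_rejected, rejected = split_transcript(record["rejected"])
    if prompt_chosen != prompt_rejected:
        raise ValueError("chosen and rejected transcripts do not share a prompt")
    return {"prompt": prompt_chosen, "chosen": chosen, "rejected": rejected}


# Hypothetical JSONL line, invented for illustration only.
line = json.dumps({
    "chosen": "\n\nHuman: What is 2 + 2?\n\nAssistant: 4",
    "rejected": "\n\nHuman: What is 2 + 2?\n\nAssistant: 5",
})

triple = to_preference_triple(json.loads(line))
print(json.dumps(triple))
```

Splitting on the last assistant marker keeps any earlier turns of a multi-turn conversation inside the prompt. Records whose two transcripts diverge before the final turn are rejected rather than silently truncated.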


16.8 Multimodal (for VLMs/ASR)


16.9 Math Reasoning (often for post-training/eval)


16.10 Common Storage Containers