21  Appendix 1: LLM Data Schemas

Status: first draft complete.

This appendix describes common variants of LLM data and the schemas used to store them. These schemas become relevant when we extend the flywheel to include more types of data, and especially when we shift towards promoting the sharing of richer “content data” (via opt-in flywheels, but also via new market mechanisms).


21.1 Open Web / Crawls


21.2 Encyclopedic / Books


21.4 Code


21.5 Forums / Q&A / Social


21.6 Instruction / Conversations (Post-training SFT)


21.7 Preference / Feedback (RLHF & DPO)

  • HH-RLHF (Anthropic helpful/harmless, JSONL pairs: chosen vs rejected) — (Anthropic, n.d.)

  • DPO format (prompt + preferred vs dispreferred response) — (Rafailov et al. 2023)
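
To make the two record shapes above concrete, here is a minimal sketch in Python. It assumes the HH-RLHF layout in which each JSONL line holds a "chosen" and a "rejected" transcript sharing the same dialogue prefix, with turns marked by "\n\nHuman:" and "\n\nAssistant:". The output keys (prompt, chosen, rejected) follow a common convention for DPO training data rather than anything mandated by the DPO paper. The helper names `split_transcript` and `pair_to_triple` and the sample record are hypothetical and for illustration only.

```python
import json

# Turn marker assumed for HH-RLHF-style transcripts.
ASSISTANT_MARKER = "\n\nAssistant:"


def split_transcript(transcript: str) -> tuple[str, str]:
    """Split a transcript into (prompt, final assistant response)."""
    idx = transcript.rfind(ASSISTANT_MARKER)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(ASSISTANT_MARKER)
    # The prompt keeps everything up to and including the last marker.
    return transcript[:cut], transcript[cut:].strip()


def pair_to_triple(record: dict) -> dict:
    """Convert a chosen/rejected pair into a prompt/chosen/rejected triple."""
    prompt_chosen, chosen = split_transcript(record["chosen"])
    prompt_rejected, rejected = split_transcript(record["rejected"])
    if prompt_chosen != prompt_rejected:
        raise ValueError("chosen and rejected do not share a prompt")
    return {"prompt": prompt_chosen, "chosen": chosen, "rejected": rejected}


# A made-up record, serialized the way one JSONL line would be.
jsonl_line = json.dumps({
    "chosen": "\n\nHuman: What is 2 + 2?\n\nAssistant: 2 + 2 equals 4.",
    "rejected": "\n\nHuman: What is 2 + 2?\n\nAssistant: I am not sure.",
})

triple = pair_to_triple(json.loads(jsonl_line))
print(json.dumps(triple, indent=2))
```

The pair form stores the shared dialogue twice, once per transcript, while the triple form factors it out into a single prompt field. Splitting at the last assistant marker is a simplification: it only works when the two transcripts differ solely in their final assistant turn, which is why the sketch raises an error otherwise.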


21.8 Multimodal (for VLMs/ASR)


21.9 Math Reasoning (often for post-training/eval)


21.10 Common Storage Containers