20 References
Status: first draft complete - auto-generated references list.
Anthropic. n.d. “HH-RLHF Dataset.” https://huggingface.co/datasets/Anthropic/hh-rlhf.
Apache Software Foundation. n.d. “Apache Parquet Project.”
https://parquet.apache.org/.
Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael
Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers,
and Gregor Weber. 2019. “Common Voice: A Massively-Multilingual
Speech Corpus.” arXiv Preprint arXiv:1912.06670. https://arxiv.org/abs/1912.06670.
Arnold, Eckhart. 2014. “What’s Wrong with Social
Simulations?” The Monist 97: 359–77. https://api.semanticscholar.org/CorpusID:67844223.
arXiv.org. n.d.a. “arXiv API User’s Manual.” https://info.arxiv.org/help/api/user-manual.html.
———. n.d.b. “arXiv Bulk Data Access.” https://info.arxiv.org/help/bulk_data.html.
———. n.d.c. “arXiv OAI-PMH Interface.” https://info.arxiv.org/help/oa/index.html.
Aryabumi, Viraat, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang,
Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024.
“To Code, or Not to Code? Exploring Impact of Code in
Pre-Training.” arXiv Preprint arXiv:2408.10914. https://arxiv.org/abs/2408.10914.
Barocas, Solon, and Andrew D. Selbst. 2016. “Big Data’s Disparate
Impact.” California Law Review 104 (3): 671–732. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899.
Batty, Michael, and Paul M. Torrens. 2001. “Modeling Complexity:
The Limits to Prediction.” Cybergeo: European Journal of
Geography. https://api.semanticscholar.org/CorpusID:102344300.
Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and
Jeremy Blackburn. 2020. “The Pushshift Reddit Dataset.”
Proceedings of the International AAAI Conference on Web and Social
Media 14: 830–39. https://doi.org/10.1609/icwsm.v14i1.7347.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019.
“Reconciling Modern Machine-Learning Practice and the Classical
Bias–Variance Trade-Off.” Proceedings of the National Academy
of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret
Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can
Language Models Be Too Big?” In Proceedings of the ACM
Conference on Fairness, Accountability, and Transparency (FAccT),
610–23. https://doi.org/10.1145/3442188.3445922.
BigCode Project. 2022. “The Stack: A Permissively Licensed Source
Code Dataset.” Dataset documentation. https://www.bigcode-project.org/dataset/the-stack.
———. n.d.a. “BigCode Project Documentation.” https://www.bigcode-project.org/docs/about/the-stack/.
———. n.d.b. “The Stack Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack.
———. n.d.c. “The Stack V2 Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack-v2.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020.
“Language (Technology) Is Power: A Critical Survey of ‘Bias’ in
NLP.” In Proceedings of ACL, 5454–76. https://doi.org/10.18653/v1/2020.acl-main.485.
Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades:
Intersectional Accuracy Disparities in Commercial Gender
Classification.” In Proceedings of the Conference on
Fairness, Accountability and Transparency (FAT*), 77–91. https://proceedings.mlr.press/v81/buolamwini18a.html.
Carlini, Nicholas, Matthew Jagielski, Christopher A. Choquette-Choo,
Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas,
and Florian Tramèr. 2024. “Poisoning Web-Scale Training Datasets
Is Practical.” https://arxiv.org/abs/2302.10149.
Carlini, Nicholas, Florian Tramèr, Eric Wallace, Matthew Jagielski,
Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. 2021.
“Extracting Training Data from Large Language Models.” In
Proceedings of USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.
Common Crawl. n.d.a. “Common Crawl – Get Started.” https://commoncrawl.org/get-started.
———. n.d.b. “Web Archiving File Formats Explained.” https://commoncrawl.org/blog/web-archiving-file-formats-explained.
Crawford, Kate, and Trevor Paglen. 2019. “Excavating AI: The
Politics of Images in Machine Learning Training Sets.” https://excavating.ai/.
Creative Commons. 2023. “Understanding CC Licenses and Generative
AI.” https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/.
Databricks. n.d. “Databricks Dolly Repository.” https://github.com/databrickslabs/dolly.
Deckelmann, Selena. 2023. “Wikipedia’s Value in the Age of
Generative AI.” Wikimedia Foundation. https://wikimediafoundation.org/news/2023/07/12/wikipedias-value-in-the-age-of-generative-ai/.
École Normale Supérieure. n.d. “HowTo100M Project.” https://www.di.ens.fr/willow/research/howto100m/.
European Union. 2016. “General Data Protection Regulation (EU)
2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.
———. 2024. “Artificial Intelligence Act.” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689.
Federal Trade Commission. 2013. “Children’s Online Privacy
Protection Rule (COPPA) — 16 CFR Part 312.” https://www.ftc.gov/legal-library/browse/rules/childrens-online-privacy-protection-rule-coppa.
Fernandez, Raul Castro. 2023. “Data-Sharing Markets: Model,
Protocol, and Algorithms to Incentivize the Formation of Data-Sharing
Consortia.” Proceedings of the ACM on Management of Data
1: 1–25. https://api.semanticscholar.org/CorpusID:259213174.
Fiesler, Casey, Cliff Lampe, and Amy S. Bruckman. 2016. “Reality
and Perception of Copyright Terms of Service for Online Content
Creation.” In Proceedings of the 19th ACM Conference on
Computer-Supported Cooperative Work & Social Computing. https://doi.org/10.1145/2818048.2819931.
Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe,
Charles Foster, Jason Phang, et al. 2021. “The Pile: An 800GB
Dataset of Diverse Text for Language Modeling.” CoRR
abs/2101.00027. https://arxiv.org/abs/2101.00027.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman
Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018.
“Datasheets for Datasets.” arXiv Preprint arXiv:1803.09010. https://arxiv.org/abs/1803.09010.
Ghorbani, Amirata, and James Zou. 2019. “Data Shapley: Equitable
Valuation of Data for Machine Learning.” In International
Conference on Machine Learning, 2242–51. PMLR. https://proceedings.mlr.press/v97/ghorbani19c.html.
Goldblum, Micah, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi
Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein.
2022. “Dataset Security for Machine Learning: Data Poisoning,
Backdoor Attacks, and Defenses.” IEEE Transactions on Pattern
Analysis and Machine Intelligence. https://arxiv.org/abs/2012.10544.
Gray, Colin M., Yubo Kou, Bryan Battles, Joseph Hoggatt, and Austin L.
Toombs. 2018. “The Dark (Patterns) Side of UX Design.” In
Proceedings of the 2018 CHI Conference on Human Factors in Computing
Systems. https://doi.org/10.1145/3173574.3174108.
Grother, Patrick, Mei Ngan, and Kayee Hanaoka. 2019. “Face
Recognition Vendor Test (FRVT) Part 3: Demographic Effects.”
NISTIR 8280. NIST. https://doi.org/10.6028/NIST.IR.8280.
gururise. n.d. “Alpaca Data Cleaned Repository.” https://github.com/gururise/AlpacaDataCleaned.
Hendrycks, Dan. n.d. “Competition Math Dataset on Hugging
Face.” https://huggingface.co/datasets/hendrycks/competition_math.
Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven
Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021.
“Measuring Mathematical Problem Solving with the MATH
Dataset.” https://arxiv.org/abs/2103.03874.
Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo
Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi
Zhou. 2017. “Deep Learning Scaling Is Predictable,
Empirically.” arXiv Preprint arXiv:1712.00409. https://arxiv.org/abs/1712.00409.
Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia
Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to
Drive Higher Data Quality Standards.” https://arxiv.org/abs/1805.03677.
Hubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte
MacDiarmid, Tamera Lanham, et al. 2024. “Sleeper Agents: Training
Deceptive LLMs That Persist Through Safety Training.” https://arxiv.org/abs/2401.05566.
Hwang, Sohyeon, Priyanka Nanayakkara, and Yan Shvartzshnaider. 2025.
“Trust and Friction: Negotiating How Information Flows Through
Decentralized Social Media.” Proceedings of the ACM on
Human-Computer Interaction (CSCW). https://doi.org/10.1145/3757516.
Illinois General Assembly. 2008. “Biometric Information Privacy
Act (BIPA), 740 ILCS 14.” https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004.
International Internet Preservation Consortium. 2017. “The WARC
Format 1.1.” https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/.
ISO/IEC. 2023. ISO/IEC 23894:2023 Information Technology—Artificial
Intelligence—Risk Management. ISO/IEC. https://www.iso.org/standard/77304.html.
Jackson, Brandon, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein,
Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. “Public
AI: Infrastructure for the Common
Good.” Public AI Network. https://doi.org/10.5281/zenodo.13914560.
Jo, Eun Seo, and Timnit Gebru. 2020. “Lessons from Archives:
Strategies for Collecting Sociocultural Data in Machine
Learning.” In Proceedings of FAccT, 306–16. https://doi.org/10.1145/3351095.3372829.
Johnson, Isaac, Lucie-Aimée Kaffee, and Miriam Redi. 2024.
“Wikimedia Data for AI: A Review of Wikimedia Datasets for NLP
Tasks and AI-Assisted Editing.” arXiv Preprint
arXiv:2410.08918. https://arxiv.org/abs/2410.08918.
jsonlines.org. n.d. “JSON Lines Specification.” https://jsonlines.org/.
Koh, Pang Wei, and Percy Liang. 2017. “Understanding Black-Box
Predictions via Influence Functions.” In Proceedings of the
34th International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v70/koh17a.html.
LAION. 2022. “LAION-5B: A New Era of Open Large-Scale Multi-Modal
Datasets.” https://laion.ai/blog/laion-5b/.
———. 2024. “Releasing Re-LAION-5B: Transparent Iteration on
LAION-5B with Additional Safety Fixes.” https://laion.ai/blog/relaion-5b/.
Library of Congress. n.d. “WARC, Web ARChive File Format.”
https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
Liu, Jason. 2024. “Data Flywheel Go Brrr: Using Your Users to
Build Better Products.” https://jxnl.co/writing/2024/03/28/data-flywheel/.
Liu, Jiacheng, Thomas N. Blanton, Sewon Min, Arnavi Chheda-Kothary, Huy
Tran, Eric Marsh, Cassidy Trier, John T. James, Jon Borchardt, and Evie
Yu-Yen Cheng. 2025. “OLMoTrace: Tracing Language Model Outputs
Back to Trillions of Training Tokens.” arXiv Preprint
arXiv:2504.07096. https://arxiv.org/abs/2504.07096.
Marda, Nik, Jasmine Sun, and Mark Surman. 2024. “Public AI: Making
AI Work for Everyone, by Everyone.” Mozilla Foundation. https://assets.mofoprod.net/network/documents/Public_AI_Mozilla.pdf.
Marwell, Gerald, and Pamela Oliver. 1993. The Critical Mass in
Collective Action. Cambridge University Press. https://doi.org/10.1017/CBO9780511663765.
McCallister, Erika, Tim Grance, and Karen Scarfone. 2010. “Guide
to Protecting the Confidentiality of Personally Identifiable Information
(PII).” SP 800-122. NIST. https://csrc.nist.gov/pubs/sp/800/122/final.
McDonald, Aleecia M., and Lorrie Faith Cranor. 2008. “The Cost of
Reading Privacy Policies.” I/S: A Journal of Law and Policy
for the Information Society 4 (3): 543–68. https://kb.osu.edu/handle/1811/72839.
McDonald, Nora, Benjamin Mako Hill, Rachel Greenstadt, and Andrea Forte.
2019. “Privacy, Anonymity, and Perceived Risk in Open
Collaboration: A Study of Service Providers.” In Proceedings
of the 2019 CHI Conference on Human Factors in Computing Systems,
1–12. https://doi.org/10.1145/3290605.3300901.
Meta Stack Exchange. n.d. “Why Is the Stack Exchange Data Dump
Only Available in XML?” https://meta.stackexchange.com/questions/267329/why-is-the-stack-exchange-data-dump-only-available-in-xml-file-format.
Miech, Antoine, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi,
Ivan Laptev, and Josef Sivic. 2019. “HowTo100M: Learning a
Text-Video Embedding by Watching Hundred Million Narrated Video
Clips.” CoRR abs/1906.03327. http://arxiv.org/abs/1906.03327.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy
Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and
Timnit Gebru. 2019. “Model Cards for Model Reporting.” In
Proceedings of the ACM Conference on Fairness, Accountability, and
Transparency (FAccT), 220–29. https://doi.org/10.1145/3287560.3287596.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An
Introduction. MIT Press. https://probml.github.io/pml-book/book1.html.
Narayanan, Arvind, and Vitaly Shmatikov. 2008. “Robust
de-Anonymization of Large Sparse Datasets.” In Proceedings of
the IEEE Symposium on Security and Privacy, 111–25. https://doi.org/10.1109/SP.2008.33.
ndjson. n.d. “NDJSON Specification.” https://github.com/ndjson/ndjson-spec.
NISO. 2024. “ANSI/NISO Z39.96-2024, JATS: Journal Article Tag
Suite.” https://www.niso.org/publications/z3996-2024-jats.
Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.”
Washington Law Review 79 (1): 119–57. https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10/.
NIST. 2023. “Artificial Intelligence Risk Management Framework (AI
RMF 1.0).” NIST AI 100-1. National Institute of Standards and
Technology. https://doi.org/10.6028/NIST.AI.100-1.
NLM. n.d. “Journal Article Tag Suite.” https://jats.nlm.nih.gov/.
Obar, Jonathan A., and Anne Oeldorf-Hirsch. 2020. “The Biggest Lie
on the Internet: Ignoring the Privacy Policies and Terms of Service
Policies of Social Networking Services.” Information,
Communication & Society 23 (1): 128–47. https://doi.org/10.1080/1369118X.2018.1486870.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil
Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used
to Manage the Health of Populations.” Science 366
(6464): 447–53. https://doi.org/10.1126/science.aax2342.
OpenAI. 2022. “Introducing Whisper.” https://openai.com/index/whisper/.
———. n.d.a. “Grade-School Math (GSM8K) Repository.” https://github.com/openai/grade-school-math.
———. n.d.b. “GSM8K Hugging Face Dataset Card.” https://huggingface.co/datasets/openai/gsm8k.
———. n.d.c. “OpenAI API Reference – Chat Completions.” https://platform.openai.com/docs/api-reference/chat.
OpenAssistant. n.d. “OpenAssistant OASST1 Dataset Card.” https://huggingface.co/datasets/OpenAssistant/oasst1.
OWASP. 2023. “OWASP Top 10 for Large Language Model
Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/.
Project Gutenberg. n.d.a. “Project Gutenberg File Formats.”
https://www.gutenberg.org/help/file_formats.html.
———. n.d.b. “Project Gutenberg Offline Catalogs and Feeds.”
https://www.gutenberg.org/ebooks/offline_catalogs.html.
Pushshift. n.d. “Pushshift.io.” https://pushshift.io/.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey,
and Ilya Sutskever. 2022. “Robust Speech Recognition via
Large-Scale Weak Supervision.” https://arxiv.org/abs/2212.04356.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon,
Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference
Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Raji, Inioluwa Deborah, Indra Elizabeth Kumar, Aaron Horowitz, and
Andrew D. Selbst. 2022. “The Fallacy of AI Functionality.”
Proceedings of the 2022 ACM Conference on Fairness, Accountability,
and Transparency. https://api.semanticscholar.org/CorpusID:249872658.
Rakova, Bogdana, Renee Shelby, and Megan Ma. 2023.
“Terms-We-Serve-with: Five Dimensions for Anticipating and
Repairing Algorithmic Harm.” Big Data & Society 10
(2): 20539517231211553. https://doi.org/10.1177/20539517231211553.
Reddit. n.d. “Reddit API Documentation.” https://www.reddit.com/dev/api/.
Reddit Help. n.d. “Reddit Data API Wiki.” https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki.
Roche, Adam, and Yali Sassoon. 2024. “What Is a Data Flywheel? A
Guide to Sustainable Business Growth.” Snowplow Blog. https://snowplow.io/blog/what-is-a-data-flywheel.
Selbst, Andrew D., danah boyd, Sorelle A. Friedler, Suresh
Venkatasubramanian, and Janet Vertesi. 2019. “Fairness and
Abstraction in Sociotechnical Systems.” In Proceedings of the
ACM Conference on Fairness, Accountability, and Transparency
(FAccT), 59–68. https://doi.org/10.1145/3287560.3287598.
Shankar, Shreya. 2024. “Data Flywheels for LLM
Applications.” Shreya Shankar’s blog. https://www.sh-reya.com/blog/ai-engineering-flywheel/.
Shelby, Renee, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar
Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, et al. 2023.
“Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy
for Harm Reduction.” In Proceedings of the 2023 AAAI/ACM
Conference on AI, Ethics, and Society, 723–41. Association for
Computing Machinery. https://doi.org/10.1145/3600211.3604673.
Shen, Judy Hanwen, Inioluwa Deborah Raji, and Irene Y Chen. 2024.
“The Data Addition Dilemma.” MLHC 2024. https://proceedings.mlr.press/v252/shen24a.html.
Silver, David, and Richard S. Sutton. 2025. “Welcome to the Era of
Experience.” Google DeepMind. https://storage.googleapis.com/deepmind-media/Era-of-Experience/The%20Era%20of%20Experience%20Paper.pdf.
Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari
Morcos. 2022. “Beyond Neural Scaling Laws: Beating Power Law
Scaling via Data Pruning.” Advances in Neural Information
Processing Systems 35: 19523–36. https://arxiv.org/abs/2206.14486.
Stack Exchange. n.d. “Stack Exchange Data Explorer Help.”
https://data.stackexchange.com/help.
Stanford CRFM. 2023. “Alpaca: A Strong, Replicable
Instruction-Following Model.” https://crfm.stanford.edu/2023/03/13/alpaca.html.
Supreme Court of Illinois. 2019. “Rosenbach v. Six Flags
Entertainment Corp.” 2019 IL 123186, Supreme Court of Illinois.
https://www.illinoiscourts.gov/Resources/f71510f1-fb2a-43d8-ba14-292c8009dfd9/123186.pdf.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People
Uniquely.” Carnegie Mellon University, Data Privacy Working
Paper. https://dataprivacylab.org/projects/identifiability/paper1.pdf.
Tan, Joshua, Nicholas Vincent, Katherine Elkins, and Magnus Sahlgren.
2025. “If Open Source Is to Win, It Must Go Public.”
arXiv Preprint arXiv:2507.09296. https://arxiv.org/abs/2507.09296.
Tatsu Lab. n.d. “Stanford Alpaca GitHub Repository.” https://github.com/tatsu-lab/stanford_alpaca.
TensorFlow. n.d. “TFRecord and tf.train.Example Tutorial.”
https://www.tensorflow.org/tutorials/load_data/tfrecord.
TensorFlow Datasets. n.d.a. “C4 Dataset in TensorFlow
Datasets.” https://www.tensorflow.org/datasets/catalog/c4.
———. n.d.b. “C4 Generator Code.” https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py.
Tran, Chau, Kaylea Champion, Andrea Forte, Benjamin Mako Hill, and
Rachel Greenstadt. 2020. “Are Anonymity-Seekers Just Like
Everybody Else? An Analysis of Contributions to Wikipedia from
Tor.” In 2020 IEEE Symposium on Security and Privacy
(SP), 186–202. IEEE. https://doi.org/10.1109/SP40000.2020.00053.
U.S. Copyright Office. 2024. “Copyright and Artificial
Intelligence: Policy Studies and Guidance.” https://copyright.gov/ai/.
U.S. Department of Education. 1974. “Family Educational Rights and
Privacy Act (FERPA).” https://studentprivacy.ed.gov/ferpa.
U.S. Department of Health and Human Services. 2000. “HIPAA Privacy
Rule — 45 CFR Parts 160 and 164.” https://www.hhs.gov/hipaa/for-professionals/privacy/index.html.
Vincent, Nicholas, David Bau, Sarah Schwettmann, and Joshua Tan. 2023.
“An Alternative to Regulation: The Case for Public AI.”
arXiv Preprint arXiv:2311.11350. https://arxiv.org/abs/2311.11350.
Vincent, Nicholas, Mark Surman, and Jake Hirsch-Allen. 2025.
“Canada as a Champion for Public AI: Data, Compute and Open Source
Infrastructure for Economic Growth and Inclusive Innovation.”
Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan
Uesato, Po-Sen Huang, Myra Cheng, et al. 2021. “Ethical and Social
Risks of Harm from Language Models.” arXiv Preprint
arXiv:2112.04359. https://arxiv.org/abs/2112.04359.
Wikimedia Meta-Wiki. n.d. “Wikipedia Data Dumps – Dump
Format.” https://meta.wikimedia.org/wiki/Data_dumps/Dump_format.
Wikipedia. n.d. “Wikipedia Database Download.” https://en.wikipedia.org/wiki/Wikipedia:Database_download.
Wolpert, David H, and William G Macready. 1997. “No Free Lunch
Theorems for Optimization.” IEEE Transactions on Evolutionary
Computation 1 (1): 67–82. https://doi.org/10.1109/4235.585893.