References

Acemoglu, Daron, and Simon Johnson. 2025. “Power and Progress: Our Thousand-Year Struggle over Technology and Prosperity.” Perspectives on Science and Christian Faith. https://api.semanticscholar.org/CorpusID:265119352.
Anthropic. n.d. “HH-RLHF Dataset.” https://github.com/anthropics/hh-rlhf.
Apache Software Foundation. n.d. “Apache Parquet Project.” https://parquet.apache.org/.
Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. “Common Voice: A Massively-Multilingual Speech Corpus.” arXiv Preprint arXiv:1912.06670.
Arnold, Eckhart. 2014. “What’s Wrong with Social Simulations?” The Monist 97: 359–77. https://api.semanticscholar.org/CorpusID:67844223.
arXiv.org. n.d.a. “arXiv API User’s Manual.” https://info.arxiv.org/help/api/user-manual.html.
———. n.d.b. “arXiv Bulk Data Access.” https://info.arxiv.org/help/bulk_data.html.
———. n.d.c. “arXiv OAI-PMH Interface.” https://info.arxiv.org/help/oa/index.html.
Aryabumi, Viraat, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. “To Code, or Not to Code? Exploring Impact of Code in Pre-Training.” arXiv Preprint arXiv:2408.10914.
Barocas, Solon, and Andrew D. Selbst. 2016. “Big Data’s Disparate Impact.” California Law Review 104 (3): 671–732.
Batty, Michael, and Paul M. Torrens. 2001. “Modeling Complexity: The Limits to Prediction.” Cybergeo: European Journal of Geography. https://api.semanticscholar.org/CorpusID:102344300.
Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. “The Pushshift Reddit Dataset.” Proceedings of the International AAAI Conference on Web and Social Media 14: 830–39.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610–23.
BigCode Project. n.d.a. “BigCode Project Documentation.” https://www.bigcode-project.org/docs/about/the-stack/.
———. n.d.b. “The Stack Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack/tree/main.
———. n.d.c. “The Stack V2 Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack-v2.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. “Language (Technology) Is Power: A Critical Survey of ‘Bias’ in NLP.” In Proceedings of ACL, 5454–76.
Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Proceedings of the Conference on Fairness, Accountability and Transparency (FAT*), 77–91.
Carlini, Nicholas, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. 2024. “Poisoning Web-Scale Training Datasets Is Practical.” https://arxiv.org/abs/2302.10149.
Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. “The Secret Sharer: Measuring Unintended Memorization in Neural Networks.” In Proceedings of USENIX Security Symposium.
Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. 2021. “Extracting Training Data from Large Language Models.” In Proceedings of USENIX Security Symposium.
Common Crawl. n.d.a. “Common Crawl – Get Started.” https://commoncrawl.org/get-started.
———. n.d.b. “Web Archiving File Formats Explained.” https://commoncrawl.org/blog/web-archiving-file-formats-explained.
Crawford, Kate, and Trevor Paglen. 2019. “Excavating AI: The Politics of Images in Machine Learning Training Sets.” https://www.excavating.ai/.
Creative Commons. 2023. “Understanding CC Licenses and Generative AI.” https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/.
“Cybernetics.” 2025. Wikipedia. https://en.wikipedia.org/w/index.php?title=Cybernetics&oldid=1300921342.
Databricks. n.d. “Databricks Dolly Repository.” https://github.com/databrickslabs/dolly.
Deckelmann, Selena. 2023. “Wikipedia’s Value in the Age of Generative AI.” Wikimedia Foundation. https://wikimediafoundation.org/news/2023/07/12/wikipedias-value-in-the-age-of-generative-ai/.
École Normale Supérieure. n.d. “HowTo100M Project.” https://www.di.ens.fr/willow/research/howto100m/.
European Union. 2016. “General Data Protection Regulation (EU) 2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.
———. 2024. “Artificial Intelligence Act.” https://eur-lex.europa.eu/.
Federal Trade Commission. 2013. “Children’s Online Privacy Protection Rule (COPPA) — 16 CFR Part 312.” https://www.ftc.gov/legal-library/browse/rules/childrens-online-privacy-protection-rule-coppa.
Fernandez, Raul Castro. 2023. “Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia.” Proceedings of the ACM on Management of Data 1: 1–25. https://api.semanticscholar.org/CorpusID:259213174.
Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2021. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” CoRR abs/2101.00027. https://arxiv.org/abs/2101.00027.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” arXiv Preprint arXiv:1803.09010.
Grother, Patrick, Mei Ngan, and Kayee Hanaoka. 2019. “Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects.” NISTIR 8280. NIST. https://doi.org/10.6028/NIST.IR.8280.
gururise. n.d. “Alpaca Data Cleaned Repository.” https://github.com/gururise/AlpacaDataCleaned.
Hendrycks, Dan. n.d. “Competition Math Dataset on Hugging Face.” https://huggingface.co/datasets/hendrycks/competition_math.
Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” https://arxiv.org/abs/2103.03874.
Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. 2017. “Deep Learning Scaling Is Predictable, Empirically.” arXiv Preprint arXiv:1712.00409.
Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards.” https://arxiv.org/abs/1805.03677.
Hubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, et al. 2024. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training.” https://arxiv.org/abs/2401.05566.
Hwang, Sohyeon, Priyanka Nanayakkara, and Yan Shvartzshnaider. 2025. “Trust and Friction: Negotiating How Information Flows Through Decentralized Social Media.” arXiv Preprint arXiv:2503.02150.
Illinois General Assembly. 2008. “Biometric Information Privacy Act (BIPA), 740 ILCS 14.” https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004.
International Internet Preservation Consortium. 2017. “The WARC Format 1.1.” https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/.
ISO/IEC 23894:2023 Information Technology—Artificial Intelligence—Risk Management. 2023. ISO/IEC.
Jackson, Brandon, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. “Public AI: Infrastructure for the Common Good.” Public AI Network. https://doi.org/10.5281/zenodo.13914560.
Jo, Eun Seo, and Timnit Gebru. 2020. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Proceedings of FAccT, 306–16.
Johnson, Isaac, Lucie-Aimée Kaffee, and Miriam Redi. 2024. “Wikimedia Data for AI: A Review of Wikimedia Datasets for NLP Tasks and AI-Assisted Editing.” arXiv Preprint arXiv:2410.08918.
jsonlines.org. n.d. “JSON Lines Specification.” https://jsonlines.org/.
Kollock, Peter. 1998. “Social Dilemmas: The Anatomy of Cooperation.” Annual Review of Sociology 24 (1): 183–214. https://doi.org/10.1146/annurev.soc.24.1.183.
LAION. 2022a. “LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets.” https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/.
———. 2022b. “Releasing Re-LAION-5B.” https://laion.ai/blog/relaion-5b/.
Library of Congress. n.d. “WARC, Web ARChive File Format.” https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
Liu, Jason. 2024. “Data Flywheel Go Brrr: Using Your Users to Build Better Products - Jason Liu.” https://jxnl.co/writing/2024/03/28/data-flywheel/.
Liu, Jiacheng, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, et al. 2025. “OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens.” arXiv Preprint arXiv:2504.07096.
Marda, Nik, Jasmine Sun, and Mark Surman. 2024. “Public AI: Making AI Work for Everyone, by Everyone.” Mozilla. https://assets.mofoprod.net/network/documents/Public_AI_Mozilla.pdf.
Marwell, Gerald, and Pamela Oliver. 1993. The Critical Mass in Collective Action. Cambridge University Press.
McCallister, Erika, Tim Grance, and Karen Scarfone. 2010. “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII).” SP 800-122. NIST.
McDonald, Nora, Benjamin Mako Hill, Rachel Greenstadt, and Andrea Forte. 2019. “Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Service Providers.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12.
Meta Stack Exchange. n.d. “Why Is the Stack Exchange Data Dump Only Available in XML?” https://meta.stackexchange.com/questions/267329/why-is-the-stack-exchange-data-dump-only-available-in-xml-file-format.
Miech, Antoine, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips.” CoRR abs/1906.03327. http://arxiv.org/abs/1906.03327.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 220–29.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. http://probml.github.io/book1.
Narayanan, Arvind, and Vitaly Shmatikov. 2008. “Robust de-Anonymization of Large Sparse Datasets.” In Proceedings of the IEEE Symposium on Security and Privacy, 111–25.
ndjson. n.d. “NDJSON Specification.” https://github.com/ndjson/ndjson-spec.
NISO. 2024. “ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite.” https://www.niso.org/publications/z3996-2024-jats.
Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 119–57.
NIST. 2023. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” NIST AI 100-1. National Institute of Standards and Technology. https://www.nist.gov/ai.
NLM. n.d. “Journal Article Tag Suite.” https://jats.nlm.nih.gov/.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53.
OpenAI. 2022. “Introducing Whisper.” https://openai.com/index/whisper/.
———. n.d.a. “Grade-School Math (GSM8K) Repository.” https://github.com/openai/grade-school-math.
———. n.d.b. “GSM8K Hugging Face Dataset Card.” https://huggingface.co/datasets/openai/gsm8k.
———. n.d.c. “OpenAI API Reference – Chat.” https://platform.openai.com/docs/api-reference/chat.
OpenAssistant. n.d. “OpenAssistant OASST1 Dataset Card.” https://huggingface.co/datasets/OpenAssistant/oasst1.
OWASP. 2023. “OWASP Top 10 for Large Language Model Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/.
Project Gutenberg. n.d.a. “Project Gutenberg File Formats.” https://www.gutenberg.org/help/file_formats.html.
———. n.d.b. “Project Gutenberg Offline Catalogs and Feeds.” https://www.gutenberg.org/ebooks/offline_catalogs.html.
Pushshift. n.d. “Pushshift.io.” https://pushshift.io/.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” https://arxiv.org/abs/2212.04356.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Raji, Inioluwa Deborah, Indra Elizabeth Kumar, Aaron Horowitz, and Andrew D. Selbst. 2022. “The Fallacy of AI Functionality.” Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. https://api.semanticscholar.org/CorpusID:249872658.
Rakova, Bogdana, Renee Shelby, and Megan Ma. 2023. “Terms-We-Serve-with: Five Dimensions for Anticipating and Repairing Algorithmic Harm.” Big Data & Society 10 (2): 20539517231211553.
Reddit. n.d. “Reddit API Documentation.” https://www.reddit.com/dev/api/.
Reddit Help. n.d. “Reddit Data API Wiki.” https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki.
Roche, Adam, and Yali Sassoon. 2024. “What Is a Data Flywheel? A Guide to Sustainable Business Growth.” Snowplow Blog. https://snowplow.io/blog/what-is-a-data-flywheel.
“Rosenbach v. Six Flags Entertainment Corp.” 2019. 2019 IL 123186, Supreme Court of Illinois.
Selbst, Andrew D., Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. “Fairness and Abstraction in Sociotechnical Systems.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 59–68.
Shankar, Shreya. 2024. “Data Flywheels for LLM Applications.” Shreya Shankar’s Blog. https://www.sh-reya.com/blog/ai-engineering-flywheel/.
Shelby, Renee, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, et al. 2023. “Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction.” In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–41. AIES ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3600211.3604673.
Shen, Judy Hanwen, Inioluwa Deborah Raji, and Irene Y Chen. 2024. “The Data Addition Dilemma.” arXiv Preprint arXiv:2408.04154. https://arxiv.org/abs/2408.04154.
Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. 2022. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.” Advances in Neural Information Processing Systems 35: 19523–36.
Stack Exchange. n.d. “Stack Exchange Data Explorer Help.” https://data.stackexchange.com/help.
Stanford CRFM. 2023. “Alpaca: A Strong, Replicable Instruction-Following Model.” https://crfm.stanford.edu/2023/03/13/alpaca.html.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Carnegie Mellon University, Data Privacy Working Paper.
Tan, Joshua, Nicholas Vincent, Katherine Elkins, and Magnus Sahlgren. 2025. “If Open Source Is to Win, It Must Go Public.” arXiv Preprint arXiv:2507.09296.
Tatsu Lab. n.d. “Stanford Alpaca GitHub Repository.” https://github.com/tatsu-lab/stanford_alpaca.
TensorFlow. n.d. “TFRecord and tf.train.Example Tutorial.” https://www.tensorflow.org/tutorials/load_data/tfrecord.
TensorFlow Datasets. n.d.a. “C4 Dataset in TensorFlow Datasets.” https://www.tensorflow.org/datasets/catalog/c4.
———. n.d.b. “C4 Generator Code.” https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py.
Tran, Chau, Kaylea Champion, Andrea Forte, Benjamin Mako Hill, and Rachel Greenstadt. 2020. “Are Anonymity-Seekers Just Like Everybody Else? An Analysis of Contributions to Wikipedia from Tor.” In 2020 IEEE Symposium on Security and Privacy (SP), 186–202. IEEE.
U.S. Copyright Office. 2024. “Copyright and Artificial Intelligence: Policy Studies and Guidance.” https://copyright.gov/ai/.
U.S. Department of Education. 1974. “Family Educational Rights and Privacy Act (FERPA).” https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html.
U.S. Department of Health and Human Services. 2000. “HIPAA Privacy Rule — 45 CFR Parts 160 and 164.” https://www.hhs.gov/hipaa/for-professionals/privacy/index.html.
Vincent, Nicholas, David Bau, Sarah Schwettmann, and Joshua Tan. 2023. “An Alternative to Regulation: The Case for Public AI.” arXiv Preprint arXiv:2311.11350.
Vincent, Nicholas, Mark Surman, and Jake Hirsch-Allen. 2025. “Canada as a Champion for Public AI: Data, Compute and Open Source Infrastructure for Economic Growth and Inclusive Innovation.”
Weidinger, Laura, John Mellor, et al. 2021. “Ethical and Social Risks of Harm from Language Models.” arXiv Preprint arXiv:2112.04359.
Wikimedia Meta-Wiki. n.d. “Wikipedia Data Dumps – Dump Format.” https://meta.wikimedia.org/wiki/Data_dumps/Dump_format.
Wikipedia. n.d. “Wikipedia Database Download.” https://en.wikipedia.org/wiki/Wikipedia:Database_download.
Wolpert, David H, and William G Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1 (1): 67–82.