References

Acemoglu, Daron, and Simon Johnson. 2023. Power and Progress: Our Thousand-Year Struggle over Technology and Prosperity. PublicAffairs.
Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. “Common Voice: A Massively-Multilingual Speech Corpus.” arXiv Preprint arXiv:1912.06670.
Arnold, Eckhart. 2014. “What’s Wrong with Social Simulations?” The Monist 97: 359–77. https://api.semanticscholar.org/CorpusID:67844223.
Aryabumi, Viraat, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. “To Code, or Not to Code? Exploring Impact of Code in Pre-Training.” arXiv Preprint arXiv:2408.10914.
Barocas, Solon, and Andrew D. Selbst. 2016. “Big Data’s Disparate Impact.” California Law Review 104 (3): 671–732.
Batty, Michael, and Paul M. Torrens. 2001. “Modeling Complexity: The Limits to Prediction.” Cybergeo: European Journal of Geography. https://api.semanticscholar.org/CorpusID:102344300.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610–23.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. “Language (Technology) Is Power: A Critical Survey of ‘Bias’ in NLP.” In Proceedings of ACL, 5454–76.
Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Proceedings of the Conference on Fairness, Accountability and Transparency (FAT*), 77–91.
Carlini, Nicholas, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. 2024. “Poisoning Web-Scale Training Datasets Is Practical.” https://arxiv.org/abs/2302.10149.
Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. “The Secret Sharer: Measuring Unintended Memorization in Neural Networks.” In Proceedings of USENIX Security Symposium.
Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. 2021. “Extracting Training Data from Large Language Models.” In Proceedings of USENIX Security Symposium.
Crawford, Kate, and Trevor Paglen. 2019. “Excavating AI: The Politics of Images in Machine Learning Training Sets.” https://www.excavating.ai/.
Creative Commons. 2023. “Understanding CC Licenses and Generative AI.” https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/.
“Cybernetics.” 2025. Wikipedia. https://en.wikipedia.org/w/index.php?title=Cybernetics&oldid=1300921342.
Deckelmann, Selena. 2023. “Wikipedia’s Value in the Age of Generative AI.” Wikimedia Foundation. https://wikimediafoundation.org/news/2023/07/12/wikipedias-value-in-the-age-of-generative-ai/.
European Union. 2016. “General Data Protection Regulation (EU) 2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.
———. 2024. “Artificial Intelligence Act (Regulation (EU) 2024/1689).” https://eur-lex.europa.eu/eli/reg/2024/1689/oj.
Federal Trade Commission. 2013. “Children’s Online Privacy Protection Rule (COPPA) — 16 CFR Part 312.” https://www.ftc.gov/legal-library/browse/rules/childrens-online-privacy-protection-rule-coppa.
Fernandez, Raul Castro. 2023. “Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia.” Proceedings of the ACM on Management of Data 1: 1–25. https://api.semanticscholar.org/CorpusID:259213174.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” arXiv Preprint arXiv:1803.09010.
Grother, Patrick, Mei Ngan, and Kayee Hanaoka. 2019. “Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects.” NISTIR 8280. NIST. https://doi.org/10.6028/NIST.IR.8280.
Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. “Deep Learning Scaling Is Predictable, Empirically.” arXiv Preprint arXiv:1712.00409.
Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards.” https://arxiv.org/abs/1805.03677.
Hubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, et al. 2024. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training.” https://arxiv.org/abs/2401.05566.
Hwang, Sohyeon, Priyanka Nanayakkara, and Yan Shvartzshnaider. 2025. “Trust and Friction: Negotiating How Information Flows Through Decentralized Social Media.” arXiv Preprint arXiv:2503.02150.
Illinois General Assembly. 2008. “Biometric Information Privacy Act (BIPA), 740 ILCS 14.” https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004.
ISO/IEC. 2023. ISO/IEC 23894:2023 Information Technology—Artificial Intelligence—Guidance on Risk Management. ISO/IEC.
Jackson, Brandon, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. “Public AI: Infrastructure for the Common Good.” Public AI Network. https://doi.org/10.5281/zenodo.13914560.
Jo, Eun Seo, and Timnit Gebru. 2020. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Proceedings of FAccT, 306–16.
Johnson, Isaac, Lucie-Aimée Kaffee, and Miriam Redi. 2024. “Wikimedia Data for AI: A Review of Wikimedia Datasets for NLP Tasks and AI-Assisted Editing.” arXiv Preprint arXiv:2410.08918.
Kollock, Peter. 1998. “Social Dilemmas: The Anatomy of Cooperation.” Annual Review of Sociology 24 (1): 183–214. https://doi.org/10.1146/annurev.soc.24.1.183.
Liu, Jason. 2024. “Data Flywheel Go Brrr: Using Your Users to Build Better Products.” https://jxnl.co/writing/2024/03/28/data-flywheel/.
Liu, Jiacheng, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, et al. 2025. “OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens.” arXiv Preprint arXiv:2504.07096.
Marda, Nik, Jasmine Sun, and Mark Surman. 2024. “Public AI: Making AI Work for Everyone, by Everyone.” Mozilla. https://assets.mofoprod.net/network/documents/Public_AI_Mozilla.pdf.
Marwell, Gerald, and Pamela Oliver. 1993. The Critical Mass in Collective Action. Cambridge University Press.
McCallister, Erika, Tim Grance, and Karen Scarfone. 2010. “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII).” SP 800-122. NIST.
McDonald, Nora, Benjamin Mako Hill, Rachel Greenstadt, and Andrea Forte. 2019. “Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Service Providers.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 220–29.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. http://probml.github.io/book1.
Narayanan, Arvind, and Vitaly Shmatikov. 2008. “Robust de-Anonymization of Large Sparse Datasets.” In Proceedings of the IEEE Symposium on Security and Privacy, 111–25.
Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 119–57.
NIST. 2023. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” NIST AI 100-1. National Institute of Standards and Technology. https://www.nist.gov/ai.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53.
OWASP. 2023. “OWASP Top 10 for Large Language Model Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/.
Raji, Inioluwa Deborah, Indra Elizabeth Kumar, Aaron Horowitz, and Andrew D. Selbst. 2022. “The Fallacy of AI Functionality.” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://api.semanticscholar.org/CorpusID:249872658.
Rakova, Bogdana, Renee Shelby, and Megan Ma. 2023. “Terms-We-Serve-with: Five Dimensions for Anticipating and Repairing Algorithmic Harm.” Big Data & Society 10 (2): 20539517231211553.
Roche, Adam, and Yali Sassoon. 2024. “What Is a Data Flywheel? A Guide to Sustainable Business Growth.” Snowplow Blog. https://snowplow.io/blog/what-is-a-data-flywheel.
“Rosenbach v. Six Flags Entertainment Corp.” 2019. 2019 IL 123186, Supreme Court of Illinois.
Selbst, Andrew D., Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. “Fairness and Abstraction in Sociotechnical Systems.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 59–68.
Shankar, Shreya. 2024. “Data Flywheels for LLM Applications.” Shreya Shankar’s Blog. https://www.sh-reya.com/blog/ai-engineering-flywheel/.
Shelby, Renee, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, et al. 2023. “Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction.” In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–41. AIES ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3600211.3604673.
Shen, Judy Hanwen, Inioluwa Deborah Raji, and Irene Y Chen. 2024. “The Data Addition Dilemma.” arXiv Preprint arXiv:2408.04154. https://arxiv.org/abs/2408.04154.
Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. 2022. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.” Advances in Neural Information Processing Systems 35: 19523–36.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Carnegie Mellon University, Data Privacy Working Paper.
Tan, Joshua, Nicholas Vincent, Katherine Elkins, and Magnus Sahlgren. 2025. “If Open Source Is to Win, It Must Go Public.” arXiv Preprint arXiv:2507.09296.
U.S. Copyright Office. 2024. “Copyright and Artificial Intelligence: Policy Studies and Guidance.” https://copyright.gov/ai/.
U.S. Department of Education. 1974. “Family Educational Rights and Privacy Act (FERPA).” https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html.
U.S. Department of Health and Human Services. 2000. “HIPAA Privacy Rule — 45 CFR Parts 160 and 164.” https://www.hhs.gov/hipaa/for-professionals/privacy/index.html.
Vincent, Nicholas, David Bau, Sarah Schwettmann, and Joshua Tan. 2023. “An Alternative to Regulation: The Case for Public AI.” arXiv Preprint arXiv:2311.11350.
Vincent, Nicholas, Mark Surman, and Jake Hirsch-Allen. 2025. “Canada as a Champion for Public AI: Data, Compute and Open Source Infrastructure for Economic Growth and Inclusive Innovation.”
Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, et al. 2021. “Ethical and Social Risks of Harm from Language Models.” arXiv Preprint arXiv:2112.04359.
Wolpert, David H, and William G Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1 (1): 67–82.