20 References

Status: first draft complete - auto-generated references list.

Anthropic. n.d. “HH-RLHF Dataset.” https://huggingface.co/datasets/Anthropic/hh-rlhf.

Apache Software Foundation. n.d. “Apache Parquet Project.” https://parquet.apache.org/.

Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. “Common Voice: A Massively-Multilingual Speech Corpus.” arXiv Preprint arXiv:1912.06670. https://arxiv.org/abs/1912.06670.

Arnold, Eckhart. 2014. “What’s Wrong with Social Simulations?” The Monist 97: 359–77. https://api.semanticscholar.org/CorpusID:67844223.

arXiv.org. n.d.a. “arXiv API User’s Manual.” https://info.arxiv.org/help/api/user-manual.html.

———. n.d.b. “arXiv Bulk Data Access.” https://info.arxiv.org/help/bulk_data.html.

———. n.d.c. “arXiv OAI-PMH Interface.” https://info.arxiv.org/help/oa/index.html.

Aryabumi, Viraat, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. “To Code, or Not to Code? Exploring Impact of Code in Pre-Training.” arXiv Preprint arXiv:2408.10914. https://arxiv.org/abs/2408.10914.

Barocas, Solon, and Andrew D. Selbst. 2016. “Big Data’s Disparate Impact.” California Law Review 104 (3): 671–732. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899.

Batty, Michael, and Paul M. Torrens. 2001. “Modeling Complexity: The Limits to Prediction.” Cybergeo: European Journal of Geography. https://api.semanticscholar.org/CorpusID:102344300.

Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. “The Pushshift Reddit Dataset.” Proceedings of the International AAAI Conference on Web and Social Media 14: 830–39. https://doi.org/10.1609/icwsm.v14i1.7347.

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610–23. https://doi.org/10.1145/3442188.3445922.

BigCode Project. 2022. “The Stack: A Permissively Licensed Source Code Dataset.” Dataset documentation. https://www.bigcode-project.org/dataset/the-stack.

———. n.d.a. “BigCode Project Documentation.” https://www.bigcode-project.org/docs/about/the-stack/.

———. n.d.b. “The Stack Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack.

———. n.d.c. “The Stack V2 Dataset on Hugging Face.” https://huggingface.co/datasets/bigcode/the-stack-v2.

Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. “Language (Technology) Is Power: A Critical Survey of ‘Bias’ in NLP.” In Proceedings of ACL, 5454–76. https://doi.org/10.18653/v1/2020.acl-main.485.

Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Proceedings of the Conference on Fairness, Accountability and Transparency (FAT*), 77–91. https://proceedings.mlr.press/v81/buolamwini18a.html.

Carlini, Nicholas, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. 2024. “Poisoning Web-Scale Training Datasets Is Practical.” https://arxiv.org/abs/2302.10149.

Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. “The Secret Sharer: Measuring Unintended Memorization in Neural Networks.” In Proceedings of USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity19/presentation/carlini.

Carlini, Nicholas, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. 2021. “Extracting Training Data from Large Language Models.” In Proceedings of USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.

Common Crawl. n.d.a. “Common Crawl – Get Started.” https://commoncrawl.org/get-started.

———. n.d.b. “Web Archiving File Formats Explained.” https://commoncrawl.org/blog/web-archiving-file-formats-explained.

Crawford, Kate, and Trevor Paglen. 2019. “Excavating AI: The Politics of Images in Machine Learning Training Sets.” https://excavating.ai/.

Creative Commons. 2023. “Understanding CC Licenses and Generative AI.” https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/.

“Cybernetics.” 2025. Wikipedia. https://en.wikipedia.org/w/index.php?title=Cybernetics&oldid=1300921342.

Databricks. n.d. “Databricks Dolly Repository.” https://github.com/databrickslabs/dolly.

Deckelmann, Selena. 2023. “Wikipedia’s Value in the Age of Generative AI.” Wikimedia Foundation. https://wikimediafoundation.org/news/2023/07/12/wikipedias-value-in-the-age-of-generative-ai/.

École Normale Supérieure. n.d. “HowTo100M Project.” https://www.di.ens.fr/willow/research/howto100m/.

European Union. 2016. “General Data Protection Regulation (EU) 2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.

———. 2024. “Artificial Intelligence Act.” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689.

Federal Trade Commission. 2013. “Children’s Online Privacy Protection Rule (COPPA) — 16 CFR Part 312.” https://www.ftc.gov/legal-library/browse/rules/childrens-online-privacy-protection-rule-coppa.

Fernandez, Raul Castro. 2023. “Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia.” Proceedings of the ACM on Management of Data 1: 1–25. https://api.semanticscholar.org/CorpusID:259213174.

Fiesler, Casey, Cliff Lampe, and Amy S. Bruckman. 2016. “Reality and Perception of Copyright Terms of Service for Online Content Creation.” In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. https://doi.org/10.1145/2818048.2819931.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2021. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” CoRR abs/2101.00027. https://arxiv.org/abs/2101.00027.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” arXiv Preprint arXiv:1803.09010. https://arxiv.org/abs/1803.09010.

Ghorbani, Amirata, and James Zou. 2019. “Data Shapley: Equitable Valuation of Data for Machine Learning.” In International Conference on Machine Learning, 2242–51. PMLR. https://proceedings.mlr.press/v97/ghorbani19c.html.

Goldblum, Micah, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. 2022. “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses.” IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2012.10544.

Gray, Colin M., Yubo Kou, Bryan Battles, Joseph Hoggatt, and Austin L. Toombs. 2018. “The Dark (Patterns) Side of UX Design.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3173574.3174108.

Grother, Patrick, Mei Ngan, and Kayee Hanaoka. 2019. “Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects.” NISTIR 8280. NIST. https://doi.org/10.6028/NIST.IR.8280.

gururise. n.d. “Alpaca Data Cleaned Repository.” https://github.com/gururise/AlpacaDataCleaned.

Hendrycks, Dan. n.d. “Competition Math Dataset on Hugging Face.” https://huggingface.co/datasets/hendrycks/competition_math.

Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” https://arxiv.org/abs/2103.03874.

Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. “Deep Learning Scaling Is Predictable, Empirically.” arXiv Preprint arXiv:1712.00409. https://arxiv.org/abs/1712.00409.

Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards.” https://arxiv.org/abs/1805.03677.

Hubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, et al. 2024. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training.” https://arxiv.org/abs/2401.05566.

Hwang, Sohyeon, Priyanka Nanayakkara, and Yan Shvartzshnaider. 2025. “Trust and Friction: Negotiating How Information Flows Through Decentralized Social Media.” Proceedings of the ACM on Human-Computer Interaction (CSCW). https://doi.org/10.1145/3757516.

Illinois General Assembly. 2008. “Biometric Information Privacy Act (BIPA), 740 ILCS 14.” https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004.

International Internet Preservation Consortium. 2017. “The WARC Format 1.1.” https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/.

ISO/IEC. 2023. ISO/IEC 23894:2023 Information Technology—Artificial Intelligence—Risk Management. ISO/IEC. https://www.iso.org/standard/77304.html.

Jackson, Brandon, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. “Public AI: Infrastructure for the Common Good.” Public AI Network. https://doi.org/10.5281/zenodo.13914560.

Jo, Eun Seo, and Timnit Gebru. 2020. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Proceedings of FAccT, 306–16. https://doi.org/10.1145/3351095.3372829.

Johnson, Isaac, Lucie-Aimée Kaffee, and Miriam Redi. 2024. “Wikimedia Data for AI: A Review of Wikimedia Datasets for NLP Tasks and AI-Assisted Editing.” arXiv Preprint arXiv:2410.08918. https://arxiv.org/abs/2410.08918.

jsonlines.org. n.d. “JSON Lines Specification.” https://jsonlines.org/.

Koh, Pang Wei, and Percy Liang. 2017. “Understanding Black-Box Predictions via Influence Functions.” In Proceedings of the 34th International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v70/koh17a.html.

Kollock, Peter. 1998. “Social Dilemmas: The Anatomy of Cooperation.” Annual Review of Sociology 24 (1): 183–214. https://doi.org/10.1146/annurev.soc.24.1.183.

LAION. 2022. “LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets.” https://laion.ai/blog/laion-5b/.

———. 2024. “Releasing Re-LAION-5B: Transparent Iteration on LAION-5B with Additional Safety Fixes.” https://laion.ai/blog/relaion-5b/.

Library of Congress. n.d. “WARC, Web ARChive File Format.” https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.

Liu, Jason. 2024. “Data Flywheel Go Brrr: Using Your Users to Build Better Products.” https://jxnl.co/writing/2024/03/28/data-flywheel/.

Liu, Jiacheng, Thomas N. Blanton, Sewon Min, Arnavi Chheda-Kothary, Huy Tran, Eric Marsh, Cassidy Trier, John T. James, Jon Borchardt, and Evie Yu-Yen Cheng. 2025. “OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens.” arXiv Preprint arXiv:2504.07096. https://arxiv.org/abs/2504.07096.

Marda, Nik, Jasmine Sun, and Mark Surman. 2024. “Public AI: Making AI Work for Everyone, by Everyone.” Mozilla Foundation. https://assets.mofoprod.net/network/documents/Public_AI_Mozilla.pdf.

Marwell, Gerald, and Pamela Oliver. 1993. The Critical Mass in Collective Action. Cambridge University Press. https://doi.org/10.1017/CBO9780511663765.

McCallister, Erika, Tim Grance, and Karen Scarfone. 2010. “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII).” SP 800-122. NIST. https://csrc.nist.gov/pubs/sp/800/122/final.

McDonald, Aleecia M., and Lorrie Faith Cranor. 2008. “The Cost of Reading Privacy Policies.” I/S: A Journal of Law and Policy for the Information Society 4 (3): 543–68. https://kb.osu.edu/handle/1811/72839.

McDonald, Nora, Benjamin Mako Hill, Rachel Greenstadt, and Andrea Forte. 2019. “Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Service Providers.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3290605.3300901.

Meta Stack Exchange. n.d. “Why Is the Stack Exchange Data Dump Only Available in XML?” https://meta.stackexchange.com/questions/267329/why-is-the-stack-exchange-data-dump-only-available-in-xml-file-format.

Miech, Antoine, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips.” CoRR abs/1906.03327. https://arxiv.org/abs/1906.03327.

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 220–29. https://doi.org/10.1145/3287560.3287596.

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html.

Narayanan, Arvind, and Vitaly Shmatikov. 2008. “Robust De-Anonymization of Large Sparse Datasets.” In Proceedings of the IEEE Symposium on Security and Privacy, 111–25. https://doi.org/10.1109/SP.2008.33.

ndjson. n.d. “NDJSON Specification.” https://github.com/ndjson/ndjson-spec.

NISO. 2024. “ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite.” https://www.niso.org/publications/z3996-2024-jats.

Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 119–57. https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10/.

NIST. 2023. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” NIST AI 100-1. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-1.

NLM. n.d. “Journal Article Tag Suite.” https://jats.nlm.nih.gov/.

Obar, Jonathan A., and Anne Oeldorf-Hirsch. 2020. “The Biggest Lie on the Internet: Ignoring the Privacy Policies and Terms of Service Policies of Social Networking Services.” Information, Communication & Society 23 (1): 128–47. https://doi.org/10.1080/1369118X.2018.1486870.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.

OpenAI. 2022. “Introducing Whisper.” https://openai.com/index/whisper/.

———. n.d.a. “Grade-School Math (GSM8K) Repository.” https://github.com/openai/grade-school-math.

———. n.d.b. “GSM8K Hugging Face Dataset Card.” https://huggingface.co/datasets/openai/gsm8k.

———. n.d.c. “OpenAI API Reference – Chat Completions.” https://platform.openai.com/docs/api-reference/chat.

OpenAssistant. n.d. “OpenAssistant OASST1 Dataset Card.” https://huggingface.co/datasets/OpenAssistant/oasst1.

OWASP. 2023. “OWASP Top 10 for Large Language Model Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/.

Project Gutenberg. n.d.a. “Project Gutenberg File Formats.” https://www.gutenberg.org/help/file_formats.html.

———. n.d.b. “Project Gutenberg Offline Catalogs and Feeds.” https://www.gutenberg.org/ebooks/offline_catalogs.html.

Pushshift. n.d. “Pushshift.io.” https://pushshift.io/.

Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” https://arxiv.org/abs/2212.04356.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.

Raji, Inioluwa Deborah, Indra Elizabeth Kumar, Aaron Horowitz, and Andrew D. Selbst. 2022. “The Fallacy of AI Functionality.” Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. https://api.semanticscholar.org/CorpusID:249872658.

Rakova, Bogdana, Renee Shelby, and Megan Ma. 2023. “Terms-We-Serve-with: Five Dimensions for Anticipating and Repairing Algorithmic Harm.” Big Data & Society 10 (2): 20539517231211553. https://doi.org/10.1177/20539517231211553.

Reddit. n.d. “Reddit API Documentation.” https://www.reddit.com/dev/api/.

Reddit Help. n.d. “Reddit Data API Wiki.” https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki.

Roche, Adam, and Yali Sassoon. 2024. “What Is a Data Flywheel? A Guide to Sustainable Business Growth.” Snowplow Blog. https://snowplow.io/blog/what-is-a-data-flywheel.

Selbst, Andrew D., Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. “Fairness and Abstraction in Sociotechnical Systems.” In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 59–68. https://doi.org/10.1145/3287560.3287598.

Shankar, Shreya. 2024. “Data Flywheels for LLM Applications.” Shreya Shankar’s blog. https://www.sh-reya.com/blog/ai-engineering-flywheel/.

Shelby, Renee, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, et al. 2023. “Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction.” In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–41. Association for Computing Machinery. https://doi.org/10.1145/3600211.3604673.

Shen, Judy Hanwen, Inioluwa Deborah Raji, and Irene Y Chen. 2024. “The Data Addition Dilemma.” MLHC 2024. https://proceedings.mlr.press/v252/shen24a.html.

Silver, David, and Richard S. Sutton. 2025. “Welcome to the Era of Experience.” Google DeepMind. https://storage.googleapis.com/deepmind-media/Era-of-Experience/The%20Era%20of%20Experience%20Paper.pdf.

Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. 2022. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.” Advances in Neural Information Processing Systems 35: 19523–36. https://arxiv.org/abs/2206.14486.

Stack Exchange. n.d. “Stack Exchange Data Explorer Help.” https://data.stackexchange.com/help.

Stanford CRFM. 2023. “Alpaca: A Strong, Replicable Instruction-Following Model.” https://crfm.stanford.edu/2023/03/13/alpaca.html.

Supreme Court of Illinois. 2019. “Rosenbach v. Six Flags Entertainment Corp.” 2019 IL 123186, Supreme Court of Illinois. https://www.illinoiscourts.gov/Resources/f71510f1-fb2a-43d8-ba14-292c8009dfd9/123186.pdf.

Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Carnegie Mellon University, Data Privacy Working Paper. https://dataprivacylab.org/projects/identifiability/paper1.pdf.

Tan, Joshua, Nicholas Vincent, Katherine Elkins, and Magnus Sahlgren. 2025. “If Open Source Is to Win, It Must Go Public.” arXiv Preprint arXiv:2507.09296. https://arxiv.org/abs/2507.09296.

Tatsu Lab. n.d. “Stanford Alpaca GitHub Repository.” https://github.com/tatsu-lab/stanford_alpaca.

TensorFlow. n.d. “TFRecord and tf.train.Example Tutorial.” https://www.tensorflow.org/tutorials/load_data/tfrecord.

TensorFlow Datasets. n.d.a. “C4 Dataset in TensorFlow Datasets.” https://www.tensorflow.org/datasets/catalog/c4.

———. n.d.b. “C4 Generator Code.” https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py.

Tran, Chau, Kaylea Champion, Andrea Forte, Benjamin Mako Hill, and Rachel Greenstadt. 2020. “Are Anonymity-Seekers Just Like Everybody Else? An Analysis of Contributions to Wikipedia from Tor.” In 2020 IEEE Symposium on Security and Privacy (SP), 186–202. IEEE. https://doi.org/10.1109/SP40000.2020.00053.

U.S. Copyright Office. 2024. “Copyright and Artificial Intelligence: Policy Studies and Guidance.” https://copyright.gov/ai/.

U.S. Department of Education. 1974. “Family Educational Rights and Privacy Act (FERPA).” https://studentprivacy.ed.gov/ferpa.

U.S. Department of Health and Human Services. 2000. “HIPAA Privacy Rule — 45 CFR Parts 160 and 164.” https://www.hhs.gov/hipaa/for-professionals/privacy/index.html.

Vincent, Nicholas, David Bau, Sarah Schwettmann, and Joshua Tan. 2023. “An Alternative to Regulation: The Case for Public AI.” arXiv Preprint arXiv:2311.11350. https://arxiv.org/abs/2311.11350.

Vincent, Nicholas, Mark Surman, and Jake Hirsch-Allen. 2025. “Canada as a Champion for Public AI: Data, Compute and Open Source Infrastructure for Economic Growth and Inclusive Innovation.”

Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, et al. 2021. “Ethical and Social Risks of Harm from Language Models.” arXiv Preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359.

Wikimedia Meta-Wiki. n.d. “Wikipedia Data Dumps – Dump Format.” https://meta.wikimedia.org/wiki/Data_dumps/Dump_format.

Wikipedia. n.d. “Wikipedia Database Download.” https://en.wikipedia.org/wiki/Wikipedia:Database_download.

Wolpert, David H, and William G Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1 (1): 67–82. https://doi.org/10.1109/4235.585893.