Home
Syllabus: CMPT 419 D200, Nicholas Vincent, Spring 2025
Lectures and Office Hours
Classes are on Tuesday/Thursday. See go.sfu.ca for exact location and time.
We will have office hours for an hour after class starting Week 2. Location TBA.
We will have additional office hours by appointment and/or popular demand.
General structure of our “lecture” time:
- Each Tuesday (1 hr sessions), we’ll briefly discuss the previous week’s readings, I’ll introduce any readings and assignments for the week, and I’ll start the “lecture content” for the week.
- I’ll aim to hold at least 5-10 min every Tuesday to walk through assignments together and take questions. You’re welcome to use this time to start working and see if questions arise.
- On Thursday (2 hr sessions), we’ll finish lecture content and have a discussion about the lecture/readings for the first hour, and then typically use the second hour for some kind of activity or “lab time”. We may use some of this time to work on assignments and projects and to take quizzes or practice quizzes.
- I’ll always take questions at the beginning and end of each lecture session. You’re always welcome to email me, but I may take 2-3 business days to respond to emails. Asking questions in class will provide a quicker response and your classmates may benefit from your questions as well!
This course is designed to have a particularly heavy reading and discussion component. Please be prepared to read quite a bit of material, and to talk about it.
About course assignments:
Each week has a set of assigned readings:
- There will be a set of mandatory readings.
- There will also be some optional readings. You are encouraged to read the abstracts and/or Introduction sections of the optional readings to see if they align with what you hope to get out of the class. I’ll do my best to organize these by theme, and will add more based on the interests you express.
- Each week, you’ll submit some relatively brief “reading responses” via Coursys. These will be very lightly graded (there really aren’t wrong answers). However, you should be prepared to defend your reading responses live in class (I may cold call students, and you should be able to speak to your reading response in a way that suggests you did indeed read the required material; you need not agree with all the arguments presented or understand all the material).
- For reading responses, I strongly recommend against AI assistance. I personally prefer that you submit plain bullet points rather than feed those bullet points to an LLM and submit the flowery text it outputs.
Reading schedule:
- Assigned readings for Week X are considered “finalized” on Tuesday of the preceding week (Week X-1), and should be completed by Tuesday of Week X. Each reading response is due immediately before class begins.
- For example: During class on Tuesday of Week 1, I’ll post and tell you all the required readings for Week 2, which you should finish over the next 7 days.
- I’ll try to provide a solid “look ahead” of course material, but it may be subject to change based on your feedback, course progress, and even current events – so you should check the readings each Tuesday after class. For instance, in the past, I have extended time to complete readings that students found particularly dense.
About course organization
The course will be organized roughly in terms of 4 “modules”:
- Module 1: Administration and Introduction to Different Frameworks for doing “Human-Centered” or “Data-Centered” Work (Weeks 1-4)
- Module 2: Technical work in Data Valuation and Scaling. (Weeks 5-7, 3 in total)
- Module 3: Online platforms and Content Ecosystems. (Weeks 8-10, 3 in total).
- Module 4: Frontiers in Data Governance: Voting, Markets, and More (Weeks 11-13, 3 in total).
We will have one assignment per module (coding / data analysis).
Grading
- 10% reading responses
- 10 total reading responses, each worth 1%
- 20% coding assignments
- 30% quizzes
- 40% final project
Course FAQs
Q: Is attendance mandatory?
A: While I won’t give you direct marks for attendance, you are highly encouraged to attend class whenever you are able to. I do expect all students to participate in class discussion at some point (i.e., I do want everybody to speak up at least once). I will try to facilitate this “softly” via some cold-calling to discuss reading responses, but this will not be strictly enforced (e.g., if circumstances arise, we can meet in office hours to discuss your progress in the course). If a very “loose approach” to soliciting participation isn’t working at the mid-point of the class, we’ll discuss (as a class) alternatives.
I am very supportive of students staying home when sick, and understand a variety of personal situations may arise that prevent you from going to class. You do not need to email me to miss class, but are welcome to ask follow up questions (I may just point you to the class notes and encourage you to talk to your classmates). To earn a high mark in this class, I encourage you to plan to attend all lectures you are able to.
Q: Will this class involve coding?
A: Yes, there will be some coding assignments in the class that are designed to give hands-on experience with certain course concepts. You are free to use a variety of programming languages and tools for these assignments, though you will be encouraged to use some “standard” solutions (primarily Python for ML and data science related components, and JavaScript and web programming for some design components). For coding assignments, LLM assistance will be allowed (with some caveats). I expect available LLM tooling to change quite a bit during our semester, so we’ll play with tools together as part of the course.
Q: How many assignments will we have?
A: You will complete 4 assignments (involving coding and data analysis) and 1 project.
Q: Can I work in a group?
A: There will be opportunities to do group work, but you must write a contribution statement for everything. You must review all your team’s code and writing! Individual assignments that allow group work will have specific details for how this will work.
Q: Are there quizzes, a midterm, and/or a final exam?
A: There will be in-class quizzes, but no “midterm” or “final”. There will be one quiz for each module (4 total). They will be announced in advance and some kind of make-up option will be available for sick students. Any “testable” material will be drawn only from in class lecture materials and mandatory readings. The goal of the quizzes is to provide additional incentives to engage with material each week.
Q: What materials do I need?
A: Reading materials will be provided digitally by the instructor. There will be no single textbook – rather, we will read an assortment of research papers, book chapters, etc. You will be asked to spend some time installing software tools on your own. You will have some flexibility in which tools you choose – there will always be a free option available.
Q: Can I use ChatGPT (etc.)?
A: You may use generative AI tools to assist with your coursework, but you must provide complete logs for any outputs you use directly, and any artifacts you submit should indicate the provenance of any generative AI outputs.
e.g.
- “This slide was produced by model XYZ”
- “This summary paragraph or code snippet was produced entirely by ChatGPT”
- “This code was generated with the help of ChatGPT, but heavily edited”
Individual assignments may have specific requirements you should pay attention to.
Course Readings
Week 2
The goal of the week 2 readings is to begin getting some exposure to what different researchers mean when they refer to human- and data-centered ML/AI. We want to start developing some intuition for when human-centered practices or data-centered thinking might materially change how we design a system, come up with a research question, or deploy a model.
Reading 1: Chancellor 2023.
- Citation: Chancellor, S., 2023. Toward practices for human-centered machine learning. Communications of the ACM, 66(3), pp.78-85.
- About: First, we’ll read “Toward Practices for Human-Centered Machine Learning” by Stevie Chancellor, published in the Communications of the ACM. CACM is a venue in which experts in various fields of computing write broad pieces for the entire computing community.
- How to access: Visit https://cacm.acm.org/magazines/2023/3/270209-toward-practices-for-human-centered-machine-learning/fulltext
Reading 2: Mazumder et al. 2023.
- Citation: Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Gaviria Rojas, W., Diamos, S., Diamos, G., He, L., Parrish, A., Kirk, H.R. and Quaye, J., 2023. Dataperf: Benchmarks for data-centric ai development. Advances in Neural Information Processing Systems, 36, pp.5320-5347.
- About: Second, we’ll read the Introduction of the DataPerf paper, published in the NeurIPS 2023 Datasets and Benchmarks Track.
- How to access: Visit https://arxiv.org/abs/2207.10062
- Notes: You only need to read the Introduction this week.
Response Instructions:
1) Please write one to two paragraphs describing why you’d like to work on, or with, ML/AI systems. You can imagine these paragraphs as text you might include in a cover letter.
2) Please list 1-3 “domains of interest” (e.g., social media, content recommendation, law, health care, mental health, the environment, economics). They can be at any level of granularity (e.g. “AI for health” is OK, as is “AI for oncology”). Similarly to part 1, the purpose of this is to help me identify trends in your interests so I can suggest optional readings that are of interest to you and your classmates!
If you submit any reasonably formatted submission for this reading response, you’ll receive full credit. In future response instructions, you might see something along the lines of, “you must quote one of the readings directly to support your point”.
For this reading response, you’ll submit via CourSys.
Optional reading
- If interested in data-centric approach to large language models, check out this blog: https://sebastianraschka.com/blog/2023/optimizing-LLMs-dataset-perspective.html
Week 3
The goal of the week 3 readings is to gain further exposure to various frameworks put forward around focusing on humans (Shneiderman reading) and/or data (Sambasivan et al. reading and Zha et al. reading).
Note this week is a bit longer than Week 2. We’ll check in on how it’s going, workflow-wise, to complete these readings, and focus on challenges that may come up for those who haven’t had many reading-heavy computing classes previously.
Reading 1: Shneiderman 2020.
- Citation: Shneiderman, B., 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction, 36(6), pp.495-504.
- About: This is a paper published in the International Journal of Human–Computer Interaction.
- How to access: visit https://www.tandfonline.com/doi/full/10.1080/10447318.2020.1741118 on campus or https://arxiv.org/abs/2002.04087 off campus
Reading 2: Sambasivan et al 2021.
- Citation: Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. and Aroyo, L.M., 2021, May. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
- About: This is a paper published in ACM CHI, the main venue for human-computer interaction research.
- How to access: visit https://research.google/pubs/everyone-wants-to-do-the-model-work-not-the-data-work-data-cascades-in-high-stakes-ai/
Reading 3: Zha et al 2023.
- Citation: Zha, D., Bhat, Z.P., Lai, K.H., Yang, F. and Hu, X., 2023. Data-centric ai: Perspectives and challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM) (pp. 945-948). Society for Industrial and Applied Mathematics.
- About: This is a short perspective paper in a data mining conference.
- How to access: visit https://epubs.siam.org/doi/abs/10.1137/1.9781611977653.ch106
- Notes: you should read the short perspective paper. You may optionally also check out the longer survey paper and repo linked here: https://github.com/daochenzha/data-centric-AI
Optional (mainly to give some more examples of academics using human- or data-centric framing):
- https://dl.acm.org/doi/abs/10.1145/3517337
- https://arxiv.org/abs/2311.06703v2
- https://dl.acm.org/doi/10.1145/3544549.3585752
Response Instructions
Imagine you are a manager at a large tech company tasked with developing a new AI product. You can pick one of the following three options based on your interests, or suggest your own product:
- A large language model that will read physician notes and make suggestions about how to treat patients
- A recommender system for a video-based content app
- A facial recognition system that will be sold via API credits
Q1: Thought experiment: Please write 1-2 paragraphs describing how adopting any of the suggestions from any of this week’s readings might change your product features (first define the product). Please directly reference (e.g. directly quote) one or more of the readings.
Q2: Please list three examples of “harms” that might occur from a failure to do “data work” as defined in the Sambasivan reading. You can use the same AI product you picked for Q1, or discuss one or more different AI products. You don’t need to quote the reading directly for this part.
Q3: Quick retrieval question: According to Zha et al., what category would each of the following techniques fall into: feature selection, creating images with randomly occluded patches, and using Mechanical Turk to label documents?
Q4: Please let me know roughly how long the readings and responses took so we can calibrate!
Week 4
For this week, there will be just two readings. The goal this week is still to gain exposure to all the different frameworks for thinking that motivate “human-centered AI” and “data-centered AI”. Last week, we saw several more frameworks, and in particular learned more about specific “data-centric” task formulations.
In our first reading, Chancellor highlighted that human-centered ML is often tied deeply to specific goals around fairness, justice, and values. This week, we’ll dive into this with a reading from a textbook.
This week we’ll just read two pieces: one is a longer introduction to a fairness in ML textbook, and the other is the Introduction to another research paper.
Please read the Introduction of FairML: https://fairmlbook.org/introduction.html
- While our course material will differ in some ways from a Special Topics course that’s entirely focused on fair ML, there’s quite a bit of conceptual overlap between being human-centered and trying to achieve some notion of fairness.
- For our purposes, the concept of the “machine learning loop”, and especially measurements and going “from data to models” will be highly salient to almost all the topics we discuss, so try to read this one closely! We’ll discuss this quite a bit together in class as well.
Please read the Introduction of “Value-Sensitive Algorithm Design: Method, Case Study, and Lessons” by Zhu et al, published in CSCW: https://dl.acm.org/doi/10.1145/3274463
The goal of this reading is to see another example of how a research project might concretely seek to incorporate values into design. You don’t need to read the full paper, though if you’re particularly interested in working on algorithm design you might want to!
Response Instructions:
Q1: Please summarize in your own words the idea of the “machine learning loop”. Do your best to capture the key concepts from the FairML intro.
Q2: How does the discussion of feedback loops in the FairML Introduction compare to the discussion of feedback in Shneiderman’s HCAI? You can just write 2-3 sentences describing major differences or similarities you see. There’s not a correct answer here.
Q3: Quick retrieval: What online platform do Zhu et al. use to study value-sensitive design in a real-world setting?
Q4: Please let me know roughly how long the readings and responses took so I can continue to calibrate!
Week 5
This week, we are going to start reading a long piece that surveys training data influence:
- Citation: Hammoudeh, Z. and Lowd, D., 2024. Training data influence analysis and estimation: A survey. Machine Learning, 113(5), pp.2351-2403.
- How to access: https://arxiv.org/abs/2212.04612
This piece will represent a large jump from reading about high-level frameworks that consider social factors, incentives, etc. to a much more mathematical framework for thinking about data-centricity. Accordingly, we’re going to work through this piece (and some excerpts from the key citations) fairly slowly. For this week, you should just read pages 1-10 (on the arxiv version – up to Section 4).
For this week’s reading responses, you do not need to answer any questions. Instead, please use the reading response as a chance to record any questions that come up (if you want to just ask them in lecture, that’s great too!)
Week 6
This week we will continue reading the Hammoudeh and Lowd survey.
Please read pages 10-21 (up to Section 5.1.2, “Representer Point Methods”).
For your response, please answer the following 3 questions:
Q1) Please describe the difference between a leave one out influence value and a Shapley value, in the context of training data influence.
Q2) What is the main issue with calculating retraining-based data values, as described in our reading?
Q3) If you were asked to run a new data market that makes use of influence estimates, which approach from the reading would you use and why? There is no correct answer to this question, but you should aim to think through some of the trade-offs.
Week 7 (Reading Week)
This week, please read:
- Citation: Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M.M.A., Yang, Y. and Zhou, Y., 2017. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
- How to access: https://arxiv.org/pdf/1712.00409.pdf
For your response, please answer the following 3 questions:
Q1) Please describe the consistent finding across all ML domains in this study.
Q2) What are the three “learning regions” that the authors identify?
Q3) About how long did this reading take?
The reading response for this week will be due Week 8 (i.e., two responses due Week 8!)
Week 8
This week, we’re going to start talking about online platforms and their role as a key AI training data source. We’ll orient much of our discussion around recent advances in Large Language Models, but with the caveat that the core ideas are equally relevant to search, recommendation, and classification systems in many applied domains of interest to our class (e.g. medicine, analytics for sports and games).
First, please read these two short blog posts from 2020 and 2022.
- https://dataleverage.substack.com/p/dont-give-openai-all-the-credit-for
- https://dataleverage.substack.com/p/chatgpt-is-awesome-and-scary-you-deserve-credit
Next, please read Sections 1 and 2 of this pre-print paper:
- https://arxiv.org/abs/2101.00027
For this week, please list three specific online platforms that are useful for AI training.
Week 9
Quiz 2 this week + finish project proposal. No required reading. You’re encouraged to find readings that support your project proposal.
Week 10
This week, we’ll do something a bit different. For your “reading time” (i.e. 1-2 hours, hopefully), you should watch this YouTube video:
https://www.youtube.com/watch?v=zjkBMFhNj_g
If you are absolutely sure you’re not interested in Large Language Models, you can use this time to instead find a video or blog post covering your domain of interest.
For your response, please either:
1) Describe one thing from the video that surprised you, or
2) Provide a link to the non-LLM resource you found and describe what you learned from it.
Week 11
For this week, please read:
First 10 pages of https://arxiv.org/abs/2402.00159
Section 2 of https://dl.acm.org/doi/abs/10.1145/3531146.3534637
Skim this webpage: https://weborganizer.allen.ai/ and look at linked sample data on HuggingFace.
For your response, please:
Describe at a high-level three key components in preparing a high quality pre-training dataset.
Week 12
Please read the Abstract and Introduction of the following papers. The goal of this set of readings is to get some exposure to different arguments and research directions in the space of data-sharing markets. One reading is from Nature Medicine, one from a data-focused CS conference, and one from an economics journal.
Prainsack, B. and Forgó, N. 2022. Why paying individual people for their health data is a bad idea. Nature medicine. 28, 10 (Oct. 2022), 1989–1991.
https://www.nature.com/articles/s41591-022-01955-4
Fernandez, R.C. 2023. Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia. Proceedings of the ACM SIGMOD International Conference on Management of Data (2023).
http://raulcastrofernandez.com/papers/data-sharing-consortia-escrow.pdf
Acemoglu, D. et al. 2022. Too Much Data: Prices and Inefficiencies in Data Markets. American Economic Journal: Microeconomics. 14, 4 (Nov. 2022), 218–256.
https://www.aeaweb.org/articles?id=10.1257/mic.20200200
For your response, describe your planned project and the relevance, if any, of each of these three framings.
Week 13
The goal of this set of readings is to get some exposure to additional perspectives on data governance.
Read: Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., … & Hudson, M. (2020). The CARE principles for indigenous data governance. Data Science Journal, 19, 43-43.
Available as HTML at: https://www.adalovelaceinstitute.org/blog/care-principles-operationalising-indigenous-data-governance/
https://data-feminism.mitpress.mit.edu/pub/vi8obxh7/release/4
For your response, describe your planned project and the relevance, if any, of each of these framings.
Optional Classics
This file contains the opposite of late-breaking links: these are links to writing that we can consider “classics” in AI, HCI, and other fields.
(These are super opinionated, and very much incomplete.)
Early computing and AI:
- As We May Think. 1945.
- Cybernetics: Or Control and Communication in the Animal and the Machine. 1948.
- The Human Use of Human Beings. 1950.
- The Sciences of the Artificial. 1969.
- Foundational concepts: Bounded rationality, Satisficing
- “Plans and Situated Actions” (1987) - Influential critique of AI planning
Philosophy:
- A Theory of Justice, 1971.
- This line of thinking sometimes shows up in fairness-in-AI discussions
Particularly influential economists who wrote about economics of information and knowledge:
- Economics and Knowledge. 1937.
- The Use of Knowledge in Society. 1945.
- See also the Wikipedia article on the “Socialist Calculation Debate” and discussion in the context of modern AI here
This Wikipedia article has many more works from economics that may be of interest and/or relevant to a problem you’ll face in the future.
Optional Latebreaking
This is a file where I’ll record “late breaking links” – things that we (the class!) find via news, social media, our friends, etc.
Week 1
An excellent post about programming with LLMs: https://crawshaw.io/blog/programming-with-llms
Week 2
An academic position paper on data-centric AI: https://aclanthology.org/2024.findings-emnlp.695/. Most relevant to Module 1.
BibTex
@inproceedings{xu-etal-2024-position, title = "Position Paper: Data-Centric {AI} in the Age of Large Language Models", author = "Xu, Xinyi and Wu, Zhaoxuan and Qiao, Rui and Verma, Arun and Shu, Yao and Wang, Jingtan and Niu, Xinyuan and He, Zhenfeng and Chen, Jiangwei and Zhou, Zijian and Lau, Gregory Kang Ruey and Dao, Hieu and Agussurja, Lucas and Sim, Rachael Hwee Ling and Lin, Xiaoqiang and Hu, Wenyang and Dai, Zhongxiang and Koh, Pang Wei and Low, Bryan Kian Hsiang", editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.findings-emnlp.695/", doi = "10.18653/v1/2024.findings-emnlp.695", pages = "11895--11913", abstract = "This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making a key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and advocate that data-centric research should receive more attention from the community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research." }
Zhang et al. 2024, New survey on data markets: https://arxiv.org/abs/2411.07267. Most relevant to Module 4.
BibTex
@article{zhang2024survey, title={A Survey on Data Markets}, author={Zhang, Jiayao and Bi, Yuran and Cheng, Mengye and Liu, Jinfei and Ren, Kui and Sun, Qiheng and Wu, Yihang and Cao, Yang and Fernandez, Raul Castro and Xu, Haifeng and others}, journal={arXiv preprint arXiv:2411.07267}, year={2024} }
Henderson and Lemley 2024, on AI Terms of Use: https://arxiv.org/abs/2412.07066. Most relevant to Module 4.
BibTex
@misc{henderson2024mirageartificialintelligenceterms, title={The Mirage of Artificial Intelligence Terms of Use Restrictions}, author={Peter Henderson and Mark A. Lemley}, year={2024}, eprint={2412.07066}, archivePrefix={arXiv}, primaryClass={cs.CY}, url={https://arxiv.org/abs/2412.07066}, }
The “People’s Capitalism Project”: https://www.peoplescapitalism.org/. Most relevant to Module 4.
Documents from Kadrey vs. Meta (tons of interesting data-centric insights into llama training): https://www.courtlistener.com/docket/67569326/kadrey-v-meta-platforms-inc/?page=3
Some recent policy-related docs: OpenAI’s economic blueprint: https://openai.com/global-affairs/openais-economic-blueprint/ and UK AI Opportunities Plan: https://www.gov.uk/government/publications/ai-opportunities-action-plan/ai-opportunities-action-plan
Week 3
Blog post from Zargham, Moore, and Stephenson: https://s.mirror.xyz/djByMntM2rQF4tqUISYS2MAO3oCfSWoOZSOpZjsYwaw
Week 4
Kulveit et al. on AI disempowerment: https://gradual-disempowerment.ai/
Week 5
https://ieeexplore.ieee.org/document/5197422
https://arxiv.org/abs/2501.18887
https://github.com/google-research/tuning_playbook
Week 9
BRAND new paper on scaling laws: https://arxiv.org/pdf/2502.18969
Nice post on analyzing common crawl, with practical tips: https://trufflesecurity.com/blog/research-finds-12-000-live-api-keys-and-passwords-in-deepseek-s-training-data
Week 10
Data selection for fine-tuning: https://github.com/hamishivi/automated-instruction-selection
Readings Tldr
This file contains the bare minimum info about the readings: links and special instructions (e.g. just read the first x pages). You’ll eventually need to read the longer doc to get the response instructions.
Week 2
- https://cacm.acm.org/magazines/2023/3/270209-toward-practices-for-human-centered-machine-learning/fulltext
- https://arxiv.org/abs/2207.10062, just read the intro
Week 3
- https://arxiv.org/abs/2002.04087
- https://research.google/pubs/everyone-wants-to-do-the-model-work-not-the-data-work-data-cascades-in-high-stakes-ai/
- https://epubs.siam.org/doi/abs/10.1137/1.9781611977653.ch106
Week 4
- https://fairmlbook.org/introduction.html
- https://dl.acm.org/doi/10.1145/3274463, just read the intro
Week 5
- https://arxiv.org/abs/2212.04612, p1-10
Week 6
- https://arxiv.org/abs/2212.04612, p10-21
Week 7
- https://arxiv.org/pdf/1712.00409.pdf
Week 8
- https://dataleverage.substack.com/p/dont-give-openai-all-the-credit-for
- https://dataleverage.substack.com/p/chatgpt-is-awesome-and-scary-you-deserve-credit
- https://arxiv.org/abs/2101.00027, Sections 1 and 2.
Week 9
None
Week 10
- https://www.youtube.com/watch?v=zjkBMFhNj_g
Week 11
- First 10 pages of https://arxiv.org/abs/2402.00159
- Section 2 of https://dl.acm.org/doi/abs/10.1145/3531146.3534637
- Skim https://weborganizer.allen.ai/ and HuggingFace page
Week 12
Abstract and intro of:
- https://www.nature.com/articles/s41591-022-01955-4
- http://raulcastrofernandez.com/papers/data-sharing-consortia-escrow.pdf
- https://www.aeaweb.org/articles?id=10.1257/mic.20200200
Week 13
- https://www.adalovelaceinstitute.org/blog/care-principles-operationalising-indigenous-data-governance/
- https://data-feminism.mitpress.mit.edu/pub/vi8obxh7/release/4
Assignment 1 Tools
You will submit a short report on your tool-related exploration as your first “coding assignment”.
You don’t need to submit any particular code, but after you’ve completed this report you’re encouraged to try out a “practice run” with the tools you’ve selected. For instance, you might try spending 30-60 minutes on a quick “side project”, and make sure you’re able to complete it, produce an output file, etc.
After you’ve spent some time exploring tools, please submit via Canvas a report which describes your answers to the following questions. You can submit as a PDF file or in plaintext as a .md file. Please organize your answers in terms of question numbers, as indicated below. It’s perfectly OK if some of your answers are very short!
Your choices are not binding – this is primarily to get you thinking about these choices early on and to encourage you to explore some of the available options before additional assignments are due.
Q1: Which tools do you plan to use for writing code (IDE, AI assistance, version control)? Example answers might include: VS Code, Sublime Text, ChatGPT, Copilot, GitHub.
Q2: What open questions or concerns do you have about code writing tools?
Q3: Which ML libraries / frameworks / tools do you have familiarity with already?
Q4: If given the choice, which ML libraries do you prefer to use for any assignments that involve training and evaluating a ML model?
Q5: Which ML libraries / frameworks / tools do you hope to learn more about (“I’m not sure, that’s why I’m taking this class” is an OK answer!)
Q6: Which tools do you plan to use to read and take notes on papers, if any? (Pen + paper or a PDF reader + notes app is a perfectly fine answer!)
Q7: Which tools, if any, do you plan to use for project management?
Assignment 2 Influence
Assignment 2
DEADLINE (Spring 2025): Mar 4, 23:59
Our Module 2 content is focused on understanding the broad question: Which groups of observations – or groups of people – are “responsible” for a given model output or “capability”?
In this assignment, we’ll get some hands-on experience with the concept of training data influence.
There are four parts to the assignment. You’ll need to write code to train a ML model and produce influence values for some of the training data in the model. Below, the requirements for each part are described.
Each part will have a coding component and a report component. You will turn in one file (or multiple) with code (e.g., a `.py` file or `.ipynb` file) and one report PDF. If using computational notebooks like a Jupyter notebook, you may combine these two into a single file (e.g. a notebook exported to a PDF with code visible).
In your code, you can use comments to designate which parts of your code correspond to each part.
Note 1: you may work on this assignment in groups of 1-3.
Note 2: you may use generative AI on this assignment, and must report your use. FYI – the instructor has tried out several models, and they’re definitely useful, but you’ll need to be careful about explaining your choices. In fact, I’ll even provide you some example outputs of what you get from directly copy-pasting the assignment into several strong models!
Note 3: Finally, as an additional incentive to avoid literally just copy-pasting the assignment in your favorite consumer AI product, I may randomly select some students to explain their solutions in class.
Part 1: Preliminaries
First, you should select a dataset to work with, define a specific classification task (it must be classification for this assignment), and establish a baseline model.
If you’re looking for inspiration, you might consider selecting something from https://archive.ics.uci.edu/
You will not be graded based on your dataset choice, task choice, or achieving a certain level of performance.
Rather, you will be graded based on your ability to describe, in a scientifically complete fashion, the choices you’ve made.
You are recommended to select a dataset from a domain of your interest and then take a small random sample of that dataset (e.g., 10000 rows – though you can lower this if using high-dimensional data, want to use deep learning, etc. – ask us if you’re unsure) to ensure that you can complete this assignment quickly, without being burdened by excessive computational costs. What constitutes “excessive” here will depend on your access to computing resources (you may wish to explore using an online tool with some degree of free compute like Google Colab).
If you select a dataset you are interested in, you may be able to reuse some of your code you write for this assignment for your project.
Suggested approach: I recommend first training several models on the “full dataset” (e.g. logistic regression, basic random forest, KNN, XGBoost). See how long this takes. Then, try subsampling 10% or 1% of your data and see if the training time falls low enough that you think you can reasonably retrain a model at least 50 total times.
Specifically, you should write code to do the following:
- Load a dataset into memory. Describe the dataset in your report. (2 marks)
- Process into features and labels. Describe the features and labels in your report. (2 marks)
- Split into train and test sets. Describe your specific approach (e.g. random 80/20 split, time-based split, etc.) (2 marks)
- Train some classifier. It does not need to be the “best” possible performance for your chosen dataset, though you may want to try a few options if feasible to do so. (2 marks)
- Report performance of your baseline classifier: accuracy, confusion matrix. You are encouraged to include a precision-recall curve or TPR vs. FPR curve (i.e. AUROC curve), though if you think it isn’t helpful you can just mention why not. You must choose a “primary metric” that you will use for your data value estimates, and you should briefly justify this choice. (2 marks)
10 marks total for part 1
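For reference, here is a minimal sketch of what this baseline pipeline might look like with scikit-learn. The file name `my_dataset.csv` and the `label` column are placeholders for whatever dataset and task you choose, and accuracy is just one possible primary metric:

```python
# Minimal baseline sketch (assumes a tabular CSV with a categorical "label" column).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Load a (possibly subsampled) dataset into memory.
df = pd.read_csv("my_dataset.csv")
df = df.sample(n=min(len(df), 10_000), random_state=0)

# Process into features and labels (here: all numeric columns except the label).
X = df.drop(columns=["label"]).select_dtypes("number")
y = df["label"]

# Random 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a simple baseline classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Report baseline performance; accuracy is used as the example "primary metric".
preds = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
```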
Part 2: Brute force LOO influence
Next, you should select (manually or randomly) 10 training data points (i.e., observations) and compute the exact leave-one-out (LOO) influence of these examples on your chosen primary metric.
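As a sketch of one brute-force approach (assuming the scikit-learn style baseline from Part 1, with `clf`, `X_train`, etc. as defined there, and accuracy as the primary metric), you can simply retrain the model once per held-out point:

```python
# Brute-force leave-one-out (LOO) influence for 10 training points (sketch).
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def primary_metric(model, X_eval, y_eval):
    return accuracy_score(y_eval, model.predict(X_eval))

full_model = clone(clf).fit(X_train, y_train)
full_score = primary_metric(full_model, X_test, y_test)

# influence(i) = metric(all training data) - metric(training data without point i)
rng = np.random.default_rng(0)
chosen = rng.choice(len(X_train), size=10, replace=False)

loo_influence = {}
for i in chosen:
    X_minus_i = X_train.drop(index=X_train.index[i])
    y_minus_i = y_train.drop(index=y_train.index[i])
    retrained = clone(clf).fit(X_minus_i, y_minus_i)
    loo_influence[int(i)] = full_score - primary_metric(retrained, X_test, y_test)

print(loo_influence)
```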
You can earn up to 4 marks for clean and correct code.
Report the influence score for each of your observations. You may do this in a table or plot. (2 marks).
Please briefly comment on any trends you observe with your influence scores. Are any points with high influence unusual in any way? It’s OK if they’re not, but you should demonstrate that you looked. (2 marks)
8 marks total for part 2.
Part 3: Group-level influence
Next, you should select (manually or randomly) 10 different groups of data points of different sizes. For instance, you might randomly select 10%, 20%, 30%, etc. of the training data. You should compute the exact leave-entire-group-out influence for each group.
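The group version is the same idea, just dropping an entire sampled group before retraining. A sketch (reusing `primary_metric`, `clf`, and `full_score` from the Part 2 sketch; the fractions below are arbitrary examples, and very large groups may leave too little data to fit a model on some datasets):

```python
# Leave-entire-group-out influence for 10 randomly sampled groups of increasing size (sketch).
import matplotlib.pyplot as plt
from sklearn.base import clone

group_fractions = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
group_influence = {}
for frac in group_fractions:
    group_index = X_train.sample(frac=frac, random_state=0).index
    retrained = clone(clf).fit(X_train.drop(index=group_index), y_train.drop(index=group_index))
    group_influence[frac] = full_score - primary_metric(retrained, X_test, y_test)

# Required plot: group size vs. influence.
plt.plot(list(group_influence), list(group_influence.values()), marker="o")
plt.xlabel("fraction of training data removed")
plt.ylabel("influence (drop in primary metric)")
plt.show()
```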
You can earn up to 4 marks for clean and correct code.
Report the influence score for each of your groups. (2 marks)
For part 3, you must also include a plot that shows group size compared with influence. (2 marks)
8 marks total for part 3.
Part 4: Shapley values
Finally, we will roughly estimate Shapley values for our training data.
For each observation and each group, you should compute the Shapley value using Truncated Monte Carlo Shapley Value Estimation (described briefly in our survey reading and in more detail here: http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf).
This will involve a coding challenge: implementing this particular Shapley value estimation algorithm.
In the Ghorbani and Zou paper, the authors suggest using a truncation cut-off: if performance for a given point / time step is very close to full performance V(D), we don’t need to retrain again.
We will go a step further and use the following rule to ensure our code doesn’t take too long to run: we should take our best guess at the Shapley value for each training data point after only 10 total permutations have been examined. In other words, your code should just re-shuffle the training data 10 times, compute the marginal impact of each training point, and then average these across the 10 permutations.
Furthermore, you may further subsample your training data (E.g. if you started with 100k rows and have only been using 10k so far, and need to drop down to 1k… you can) for this part if needed to complete the assignment in time.
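A rough sketch of this simplified procedure is below (10 permutations, a hard-coded truncation tolerance, and the Part 1/Part 2 setup assumed throughout; this is one possible implementation, not the only acceptable one):

```python
# Truncated Monte Carlo Shapley estimation, simplified to 10 permutations (sketch).
import numpy as np
from sklearn.base import clone

n = len(X_train)
n_permutations = 10
shapley = np.zeros(n)

# Score of a model trained on no data: fall back to majority-class accuracy.
empty_score = y_test.value_counts(normalize=True).max()

for seed in range(n_permutations):
    perm = np.random.default_rng(seed).permutation(n)
    prev_score = empty_score
    for k in range(1, n + 1):
        subset = perm[:k]
        if y_train.iloc[subset].nunique() < 2:
            score = prev_score  # too few classes present to fit a classifier yet
        else:
            model = clone(clf).fit(X_train.iloc[subset], y_train.iloc[subset])
            score = primary_metric(model, X_test, y_test)
        shapley[perm[k - 1]] += score - prev_score  # marginal contribution of the k-th point
        prev_score = score
        # Truncation: once we are essentially at full-data performance, the remaining
        # points in this permutation get zero marginal contribution (0.01 is an arbitrary tolerance).
        if abs(full_score - score) < 0.01:
            break

shapley /= n_permutations
```

Plotting the resulting `shapley` array as a histogram then gives the required distribution plot.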
You can earn up to 4 marks for clean and correct code.
Here, you need only to plot the distribution of all Shapley values. (2 marks)
If you have extra time, you are encouraged to compute more accurate Shapley value estimates by using more permutations and compare the Shapley values to LOO influence from part 2, but this is optional.
6 marks total for part 4.
Grading
This assignment will be graded based on both code correctness and an accompanying report. You can earn marks for each of these separately (i.e. if you have errors in your influence calculations, you can still earn the marks for reporting and visualizing the potentially erroneous data values).
To recap, there are:
- 10 marks available in part 1
- 8 in part 2
- 8 in part 3
- 6 in part 4
- for a total of 32 marks.
Part 4 will likely be the most difficult, but offers the least marks, so you should consider completing the earlier sections first.
If you submitted with a group, your report must include a ‘contribution statement’ that describes how each member contributed.
Assignment 3 Data
Assignment 3
DEADLINE (Spring 2025): Mar 20, 23:59
Our Module 3 content will focus on understanding datasets from online platforms and elsewhere. This will be helpful in both understanding LLM pre-training data, and should also be helpful in making progress on your project.
In this assignment, we’ll inspect two datasets: a large text dataset and a second dataset from any domain you choose (either because of personal interest or because it helps you make progress on your project).
Some learning goals:
- Understand the process by which you might gather a very large web-scale dataset (we will not actually download any full datasets, however!)
- Get experience with dataset documentation practices
- Get experience with the “just look at your data!” hack
You may want to use:
- https://huggingface.co/docs/datasets/en/index
- https://github.com/allenai/wimbd
Part 1: Getting some data (4 marks)
First, you should gain access to a small sample of LLM training data. You may use Dolma (https://allenai.github.io/dolma/), RefinedWeb (https://huggingface.co/datasets/tiiuae/falcon-refinedweb), or any other source you’ve come across.
The main challenge of part 1 is acquiring a good sample.
Your goal is to acquire roughly 300k tokens (a tiny fraction, about 0.00001%, of the roughly 3 trillion tokens in Dolma).
For part 1, write a short ‘methods’ section and key code you used to get a random sample of LLM pre-training data onto your machine (3 marks).
Second, write a short ‘methods’ section that describes the dataset you chose based on your project/interests (1 mark).
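For the LLM pre-training sample, one hedged sketch: the HuggingFace `datasets` library supports streaming, so you can pull a small shuffled sample without downloading a full corpus. The dataset name and the `content` text field below are specific to RefinedWeb; other corpora use different field names and may require accepting a license first, and the whitespace token count is only a crude stand-in for a real tokenizer:

```python
# Stream a small, shuffled sample of web-scale pre-training text (sketch).
from datasets import load_dataset

TARGET_TOKENS = 300_000
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
ds = ds.shuffle(seed=0, buffer_size=10_000)  # approximate random sample of the stream

sample, total_tokens = [], 0
for record in ds:
    text = record["content"]            # text field name varies by dataset
    total_tokens += len(text.split())   # crude whitespace "token" count
    sample.append(text)
    if total_tokens >= TARGET_TOKENS:
        break

print(f"collected {len(sample)} documents, ~{total_tokens} tokens")
```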
Part 2: Datasheets (8 marks)
Next, you should visit https://arxiv.org/pdf/1803.09010 and answer all the questions in Section 3.2 for both datasets. (4 marks for each dataset, 8 marks total)
Part 3: Data Assessment (8 marks)
Next, you should prepare a random sample of 10 “observations” from each dataset. We will be manually assessing their quality!
You should produce a table that shows each observation and some kind of “assessment column” of your choosing. For instance, you might manually assess the “usefulness” to a certain task. You might consider the toxicity of the content (in the text domain).
To create this “assessment column”, you will likely need to make some subjective choices. You could create a quantitative assessment as well (e.g., the number of times a key word appears in text data).
Please describe and briefly justify your chosen metric. For your project dataset, you’re encouraged to select something relevant to your project. The point of this assignment is to provide a forcing function to “look at your data”, which is a common adage and suggestion for all kinds of AI projects! (4 marks)
Section 3a of your report will consist of two tables, each with 10 rows and at least 2 columns.
Example row:
“this is a sentence in my LLM training data from a blog”, Toxicity score: 0
“this is a really angry mean sentence in my LLM training data”, Toxicity score: 10
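A sketch of how you might assemble such a table programmatically, assuming `sample` is a list of document strings (as in the Part 1 sketch) and using a simple keyword count as a stand-in for whatever assessment column you actually choose:

```python
# Build a small assessment table from 10 randomly sampled observations (sketch).
import pandas as pd

def keyword_count(text, keywords=("angry", "mean")):
    # Example quantitative assessment: how many times chosen keywords appear.
    return sum(text.lower().count(k) for k in keywords)

table = pd.DataFrame({"observation": sample}).sample(n=10, random_state=0)
table["assessment"] = table["observation"].apply(keyword_count)
print(table.to_string(index=False))
```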
In your report, you should write a paragraph summarizing what you found. Perhaps you were surprised by the text, or perhaps everything was just as expected. (4 marks).
Submission Instructions
You will turn in:
a single notebook-style report as a PDF file that fulfills all above criteria. You may use a Jupyter notebook, or a Word/Google doc with key code pasted in.
Assignment 4 Datanapkinmath
Assignment 4
DEADLINE (Spring 2025): April 3, 23:59
Assignment 4 will be short and open-ended.
For this assignment, you should visit https://nickmvincent.github.io/data_napkin_math/
You will produce a short report (1 page OK, more also OK) that describes some data napkin math estimation about your project data.
Please assess, to the best of your ability:
- How many hours of human labour were required to create the dataset you are using for your project?
- How much money would it cost to “commission” a fresh copy of this dataset (hint: use your hours estimate and make a reasonable guess about hourly costs)
- How much money could this dataset generate (hint: make a reasonable guess about how this data could be used to make inferences, predictions, detections, etc., and what the business value or other value is. The answer might be: not very much!)
You will need to write down a lot of assumptions. You will be marked based on your completeness in listing and justifying the assumptions, not the empirical validity of your estimate (i.e., it is better to make wild guesses than to have unexplained details).
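To illustrate the expected level of formality, here is a tiny sketch of the napkin-math arithmetic; every number below is an illustrative assumption you would replace and justify for your own dataset:

```python
# Data napkin math sketch -- all inputs are made-up assumptions, not real figures.
num_documents = 1_000_000        # assumed number of documents in the dataset
minutes_per_document = 5         # assumed human effort to create one document
hourly_wage = 25.0               # assumed cost of one hour of labour (USD)
value_per_user_per_year = 1.0    # assumed value enabled per user per year (USD)
num_users = 100_000              # assumed number of users of a product built on this data

labour_hours = num_documents * minutes_per_document / 60
commission_cost = labour_hours * hourly_wage
potential_value = value_per_user_per_year * num_users

print(f"~{labour_hours:,.0f} hours of human labour")
print(f"~${commission_cost:,.0f} to commission a fresh copy")
print(f"~${potential_value:,.0f} of value generated per year (very rough)")
```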
If you are in a project group, you may submit this with your group (will be a “Group Assignment” in CourSys).
Project Proposal
Project Proposal
DEADLINE (Spring 2025): Feb 28 11:59pm.
We’re going to start thinking about our projects relatively early in the term! To scaffold the project ideation, you’ll be asked to turn in an initial project proposal on Feb 28.
You can submit a 1-2 page PDF, text, or Markdown file. Exact length is not critical here: as long as it contains the key ideas, you’re good to go.
This proposal is not binding, though you will earn some marks for turning it in. You can change your project topic, track, or group after the proposal is submitted (though you’re encouraged to stick relatively close to your proposal, just for the sake of your own time).
For the project, you can select from three tracks, described below.
Well before you turn your project in, you will be provided with a much more detailed rubric describing how your project will be graded. For the initial proposal, however, you should just focus on selecting a project that:
- fits your personal interests in the course (including your career goals)
- will give you an opportunity to explore and demonstrate understanding of the key concepts from our readings and lectures.
The two heuristic questions I recommend you ask while brainstorming project ideas:
- Does this project meet the unique individual incentives of all group members (e.g., a chance to work with a particular ML library, a chance to work on a task of interest, a chance to produce a high quality report or prototype to include in my portfolio)?
- Does this project offer an opportunity to demonstrate understanding of key concepts from the course? For instance, does it fit into any of the frameworks for human-centered ML and AI that we’ve seen, or does it relate to any of the calls for data-centric work we’ve seen?
Track 1: Tools and interfaces for human/data-centered AI
Track 1 will be a good fit for front-end focused projects. For this track, you can propose and develop some kind of tool or interface for data-centric AI. This interface might be a web application, mobile application, or even a user-focused CLI prototype.
To fit the project criteria, this tool should help users accomplish some kind of data-related action or some kind of data exploration task. In other words, it should either be targeted at users who want to control the flow of their data, or at data scientists who want to explore data in some way.
Please note that if you’re very uncomfortable doing prototyping and frontend development, you may not want to select this track. While I’m happy to support you if you want to learn these topics on the fly, we probably won’t have much time to cover core design, frontend, or software engineering concepts in this course, so this project is best suited to students who already have some of those skills and specifically want to use their project work time to advance in this area.
Examples:
- A new interface for interacting with large language models that allows users to save or export conversation data (you might consider forking and contributing to something like https://github.com/ollama-webui/ollama-webui)
- A browser extension that helps users collect and use data generated by their own browsing (e.g. export my YouTube watch history and train a local personalization / recommender system)
- A browser extension that blocks data collection and informs the user how data that’s collected might impact AI systems
- A web interface for visually exploring aspects of a dataset, aimed at ML developers
Track 2: ML Project with Data Exploration Component
Track 2 will be the closest to what you might do in a typical project-focused ML course. For this project, you should select a machine learning task of interest and produce a thorough report describing how you might tackle the relevant ML challenges. What will set your project apart from a pure ML focused course is that you will also be asked to conduct a data-centric exploration of the task. This might involve using data valuation techniques we learned in the course, exploring different dataset selection choices, etc.
The DataPerf reading will be particularly useful to projects on this track.
Examples:
- You might select a medical imaging dataset from a research lab or research challenge and show how selecting or deselecting certain training observations impacts performance on a carefully chosen held-out test set
- You might fine-tune an open language model with a variety of different fine-tuning sets and explore the impact on benchmark performance or quality as perceived by humans
Track 3: Dataset Documentation and AI Auditing
Later in the course, we will discuss some research on dataset documentation and AI auditing. To summarize, this work involves carefully scrutinizing existing datasets and/or the outputs of AI systems to check for potential biases, performance gaps, unusual behavior, etc.
As your project, you might pick a famous dataset or AI system and conduct a systematic documentation effort or “audit”.
Examples:
- You might select a popular dataset that’s been used to train LLMs like ChatGPT and use a mix of manual inspection and ML-powered investigation to try and understand the demographics of dataset contributors, or biases in the underlying content.
- A fun example of this might involve a question like, “How much do various fandom communities discussing their favorite movie, book, anime, etc. contribute to the success of ChatGPT?”
If you wish to pursue this option, please consult with the instructor first to discuss properly scoping this kind of project (obviously, investigating every single piece of training data underlying ChatGPT will not be possible with the time we have).
References:
- BookCorpus datasheet: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf
- Mozilla’s Common Crawl data investigation: https://foundation.mozilla.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/
Mixing the tracks
If you have an idea for a project that involves mixing multiple tracks, that is totally great! Please let us know via the initial proposal draft.
In particular, mixing tracks might make sense if you have a larger group of students who want to work on multiple parts of a particular problem. For instance, if you want to build a prototype system that hooks up with a ML model and reports the results of a dataset documentation effort, you can definitely do so.
Project Rubric
Project Rubric
Here, you will find detailed instructions for the class project.
This document assumes you’ve read the details in the Project Proposal.
When is the project due?
The official project deadline is April 3 23:59.
You are encouraged to finish a draft of your project report before the end of the semester. You will have an opportunity to present your project. This is voluntary but operates under the “it can only help you” rule: if your presentation helps to clarify the contributions or challenges of your project, this may positively affect your grade. In particular, presenting your project can help you increase your “relevance to class themes” score, as I will ask questions and give you a chance to further demonstrate engagement with our key themes.
Furthermore, if you give a presentation, you can opt out of Quiz 4. If you give a presentation that does not meet the quality bar, I will inform you afterwards and you can take Quiz 4 (so you will not “risk” anything by giving a presentation).
If you believe you can make major improvements with some extra time, you can write an “extension” plan at least a week in advance, detailing how your group will use the additional time (between 1 and 6 days, with April 9, 23:59 being the absolute, final, no-exceptions deadline). This should be structured much like an email or presentation you might give your boss explaining why a feature needs an extra week of dev time (something that may happen in your career!)
Even if you ask for an extension, you must submit a draft of your report on April 3rd.
How is the class project graded?
Group-based scaling: I’ve mentioned in class that larger groups will have higher overall expectations for what the project accomplishes.
In practice, this will be implemented based on an assessment of “overall contribution” and based on your report’s “contribution statements”. You should describe what everyone did and provide evidence that everyone did something. When I assess your output artifact, I will write a “contribution summary” myself with my understanding of what everyone did. If your artifact lacks the details for me to do this, you will lose marks!
More concretely, we will use the following rubric/process to assign grades to the projects. Remember that a goal of the highly flexible project “tracks” is so that you can make something that will look good on your portfolio based on your own career or personal interests. This will be a motivating theme in grading -- could the project impress a potential employer? Will I be excited to share your output artifacts with my colleagues?
You are expected to do some degree of self-directed or collaborative learning as part of the project. You may need to try out libraries we didn’t use directly during class time or read work that wasn’t assigned. If you are only engaging with strictly required course materials, that is enough for our quizzes, but probably not enough for the project.
Report Rubric
You will submit a report, which will be structured much like an academic paper.
This includes the following components
Category | Points (%) |
---|---|
Abstract | 2 points (5%) |
Visual Abstract | 2 points (5%) |
Introduction | 4 points (10%) |
Related Work | 4 points (10%) |
Methods | 4 points (10%) |
Results | 4 points (10%) |
Discussion | 8 points (20%) |
Connection to class themes | 4 points (10%) |
Overall artifact quality | 8 points (20%) |
Total: 40 points
The idea behind having each section be worth some multiple of 2 points is that the “marking scheme” for each section will roughly follow some kind of 0 = bad, 1 = reasonable, 2 = good breakdown. More details follow:
Abstract: 2 points
Summarize the key contribution of your project. It should be understandable to all of your classmates, and at least partially understandable to your peers from across a variety of disciplines.
Marking scheme: 2/2 for a clear, concise abstract that explains what your project does and why you were motivated to do it. 1/2 for an abstract that is difficult to read, is vague, or doesn’t motivate the work. 0/2 for an abstract that is very vague.
Visual Abstract: 2 points
Create a figure or diagram that summarizes the key contribution of your project.
You can imagine this as a slide in your slide deck with key themes from your abstract. You might show some kind of feedback loop or a basic architectural diagram.
Same marking scheme as abstract.
Introduction: 4 points
State the problem your project solves, which might mean providing some value to users (Track 1), answering a research question (Track 2), or answering a dataset documentation question (Track 3). You should cite some motivating work and situate your project disciplinarily. Who is your audience? You should look at the Introductions of the research papers we read in class to get a sense of the appropriate style and tone. If in doubt, you can explicitly cite your “exemplar papers”.
Your introduction should also concisely state what your main contributions are. What did you do – did you perform experiments, or conduct a literature review, or something else entirely? (Note that this refers to the contributions of your overall project, not a “contribution statement” that describes literally what each team member did. That will come later!)
Marking scheme: 2 points for clear problem statement. 2 points for a clear high-level description of your main contribution (from the Introduction, I should have a general sense of what you did, and I’ll get the details in Methods).
Related work: 4 points
You should conduct a reasonable literature review of related work. You may want to make use of tools like Google Scholar and Semantic Scholar. This does not need to be restricted to peer-reviewed academic works or class readings. You can, and should, cite anything that helped you work on the project or serves as a point of comparison, including software libraries on GitHub, pre-prints on arXiv, blog posts from ML researchers, etc. You should be able to find 3-4 references that you engage with closely, at least one of which should be an academic work that itself points you to upstream / “classic” work in the subfield of your choice.
Marking scheme: 2 points for selecting relevant references. 2 points for clear description of their relevance (so, you can get 2 points if you literally just list references, and the remaining two come from your prose).
Methods / What you did: 4 points
You should describe what you did. You are encouraged to find and cite an exemplar paper in order to help you structure your Methods section.
This section will vary heavily amongst different project types.
If you are unable to find an exemplar paper, please let me know!
Marking scheme:
- 2 points for a clear description of what you did and justification for it.
- e.g. just saying “I used sklearn” will not earn marks – you should specify the model you used, how you selected hyperparameters, etc.
- e.g. just saying “I select this model because it’s popular” will not earn marks – you should specify the criteria you considered for selecting a model
- 2 points for appropriateness of methods choice
- e.g., if you said your goal is to explore recommender systems but you only did experiments with image classification, you will not earn marks
Results / What you produced: 4 points
This section will also vary quite a bit based on your project. You are encouraged to find and cite an exemplar paper here, as well.
Marking scheme:
- “Results” will vary heavily, so this rubric will lean most heavily on the “simulated reviewer” approach. Your results section should aim to “fulfill the promise” of your methods: if you said you would use a particular evaluation approach, you should provide the relevant data here and describe the insights from that data.
- For this section in particular, you may send in an early draft if you’re unclear about whether you need to get “more” results.
Discussion: 8 points
You can earn up to 8 points for discussing the implications of your project and especially potential lines of future work that incorporate a data-centric or human-centric lens. This section has a large point total because it is your chance to show your engagement with course materials and concepts. You do not need to only discuss course readings or quote from course materials, but to earn a high score you should demonstrate engagement with course themes.
Here, you are welcome to disagree with course materials. Perhaps your project results have tension with claims made in our readings. If so, you can describe them here (this is one aspect of your project report that you may wish to edit before sharing, as this “disagreement” may make less sense outside of the course project).
Marking scheme:
- 8 points for extremely in-depth engagement – your writing here suggests you have deeply understood and considered key ideas from our readings and discussions, and were able to apply them effectively to a concrete context
- E.g. you highlight benefits (or challenges!) that arise from applying abstract concepts in our readings to a real application area. Your discussion section would be a great starting point for a research publication or for a blog post about your software.
- 6 points for good level of engagement
- E.g., It is obvious you have done the course readings, but some discussion points may be unconvincing or shoehorned in.
- 4 points for reasonable engagement
- E.g., you cite HCML and an online platforms-related reading and mainly summarize some points from those articles
- 2 points for a last-ditch effort
- E.g., You throw together a paragraph or two that mention human and data-centric AI, but I am not convinced you can describe key concepts.
Connection to class themes: 4 points
Between 0 and 4 points. Outside of your discussion, does your project show engagement with some of the frameworks for thinking we’ve discussed in the class? If I saw this project in your portfolio, would it give me confidence in your ability to work on “human-centered” or “data-centered” projects?
Marking scheme:
- If you have discussed this with me before, you should earn 4 marks for this category.
- If not, I will mark on a scale of “very connected” to “very much not connected”.
Overall artifact quality: 8 points
You will also earn points for producing high quality artifacts -- this might be code, UI design mock-ups, an actual interface, or a very well formatted report.
You should include at the end of your report a contribution statement describing specifically how each group member contributed to each section of the project and each artifact.
You will also turn in (via zip file or web link) your project “components” (code, other artifacts, etc.). You are encouraged to create a public GitHub repository and just put the link in your report PDF. This will make it easier to share with others as well.
Marking scheme:
- 8/8: a stunning artifact that I am thrilled to share with my colleagues
- 4/8: decent effort. Not my first choice to show off to my colleagues.
- 0/8: extremely low effort.
Note that I don’t expect everyone to produce an 8/8-quality artifact (nor do you need to: if you get 36/40 on your report, 90% on all quizzes, and 90% on all assignments, you can still easily get a great grade).
What do I turn in?
Required: a report as PDF.
Optional: include a web link to your project materials (e.g. a GitHub link).
Optional: include a zip file with project materials.
Optional: Present your project during the final week of class.
Accessing Text Data FAQ
For Assignment 3 and your project, you’ll need to access LLM data. The instructions for both are relatively open-ended. Here are some snippets and tips that might be useful.
Approach 1: just download files from urls
If you visit the files section of the Dolma dataset (here: https://huggingface.co/datasets/allenai/dolma/tree/main/urls), you’ll see a urls directory containing a bunch of text files. Each text file holds a list of URLs, and (we can assume) those URLs point to data.
(ex: in https://huggingface.co/datasets/allenai/dolma/blob/main/urls/v1_6-sample.txt, we find a link to https://olmo-data.org/dolma-v1_6-8B-sample/v1_5r2_sample-0000.json.gz)
So one option is to randomly sample a set of these URLs (perhaps downloading v1_6-sample.txt, loading it into a Python list, Pandas DataFrame, or another structure of your choice, and using something like random.choice(), df.sample(), etc.), then using your favorite approach to making HTTP requests (the requests library, curl from your CLI, your web browser, etc.) to download those files. You’ll need to inspect the contents (unzip, then perhaps use head or a short Python script to look at the first few lines) and can then finally load them into your preferred data science environment (REPL, notebook, etc.) to proceed.
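Here is a minimal sketch of Approach 1. Two assumptions baked in: the /raw/ URL pattern for the HF file shown above, and that the downloaded files are gzipped JSON-lines (adjust once you inspect a file yourself).
# Minimal Approach 1 sketch (assumes /raw/ URL pattern and gzipped JSON-lines files)
import gzip
import json
import random

import requests

URL_LIST = "https://huggingface.co/datasets/allenai/dolma/raw/main/urls/v1_6-sample.txt"

# 1. Download the list of data-file URLs and sample a couple of them.
resp = requests.get(URL_LIST, timeout=30)
urls = [u for u in resp.text.splitlines() if u.strip()]
sample = random.sample(urls, k=2)  # keep k small; individual files can be large

# 2. Download one sampled file and peek at its first few records.
data = requests.get(sample[0], timeout=120)
with open("sample.json.gz", "wb") as f:
    f.write(data.content)

with gzip.open("sample.json.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)  # assumes one JSON object per line
        print(record.get("text", "")[:200])
        if i >= 2:
            break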
Approach 2: download files directly from HF
If you visit the files section for RefinedWeb (here: https://huggingface.co/datasets/tiiuae/falcon-refinedweb/tree/main), you’ll find directories with the data files directly included. You can go ahead and just start downloading from your web browser if you want! Again, a similar approach here might be to select a sample of file names and grab them all.
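If you prefer to script Approach 2 rather than clicking in the browser, here is a rough sketch using the huggingface_hub client. One assumption: that the RefinedWeb data files are parquet files (check the repo listing yourself).
# Rough Approach 2 sketch with huggingface_hub (pip install huggingface_hub)
import random

from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "tiiuae/falcon-refinedweb"

# List every file in the dataset repo; keep only the data files (assumed .parquet here).
files = list_repo_files(repo_id, repo_type="dataset")
data_files = [f for f in files if f.endswith(".parquet")]

# Sample one file name and download it to the local HF cache.
chosen = random.choice(data_files)
local_path = hf_hub_download(repo_id, chosen, repo_type="dataset")
print("Downloaded:", local_path)

# From here you could load it with pandas (requires pyarrow):
# import pandas as pd; df = pd.read_parquet(local_path)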
Approach 3: Dataset library
The HuggingFace datasets library supports loading data via a single function call:
from datasets import load_dataset
dataset = load_dataset("your_large_dataset", streaming=True)  # "your_large_dataset" is a placeholder
If the dataset is large, getting a true random sample will be hard. But you can explore “reservoir sampling” (https://en.wikipedia.org/wiki/Reservoir_sampling), or just shuffle within a buffer; that’s reasonable for this assignment.
This option is likely fastest if you’ve already used/installed the datasets library, and it might be a good idea to play around with if you plan to use it for your project.
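Here is a sketch of both sampling options (the buffer shuffle built into the datasets library, and a hand-rolled reservoir sampler). "your_large_dataset" and the "train" split are placeholders; swap in whatever dataset you actually choose.
# Approach 3 sketch: stream a large HF dataset and take an approximate sample
import random
from itertools import islice

from datasets import load_dataset

stream = load_dataset("your_large_dataset", split="train", streaming=True)  # placeholder name/split

# Option A: buffer shuffle (built into datasets), then take the first N examples.
shuffled = stream.shuffle(buffer_size=10_000, seed=42)
buffer_sample = list(islice(shuffled, 1_000))

# Option B: classic reservoir sampling over the stream (uniform over items seen).
def reservoir_sample(iterable, k, seed=42):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Note: Option B walks the whole stream, which may be very slow for huge datasets.
# reservoir = reservoir_sample(stream, k=1_000)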
Other FAQS
Do I need to convert words to tokens, or how should I count the tokens?
- First of all, you may simply state the assumption that 1 word = 1 token. You may also make a rough estimate based on common ratios (see e.g. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them and play around yourself here: https://platform.openai.com/tokenizer). There’s a short counting sketch after these FAQs.
Can I be sure I’m getting a true random sample for some target distribution?
- In short, not really (for assignment purposes, any reasonable attempt with accompanying assumptions is ok). We’ll discuss when we “debrief” on the assignment.
I’m not sure my chosen text dataset is acceptable!
- If it’s a large text dataset with at least 300,000 words or tokens, you’re good to go.
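A minimal counting sketch, using the 1 word ≈ 1 token simplification plus the rough words-to-tokens ratio from the OpenAI help article above. "my_text_sample.txt" is a hypothetical file name, and the tiktoken step is optional.
# Minimal token-count sketch for the FAQ above
text = open("my_text_sample.txt", encoding="utf-8").read()  # hypothetical file

# Rough estimate: whitespace-split words, scaled by the ~0.75 words-per-token rule of thumb.
word_count = len(text.split())
rough_tokens = int(word_count / 0.75)

# More precise (optional): pip install tiktoken
# import tiktoken
# enc = tiktoken.get_encoding("cl100k_base")
# exact_tokens = len(enc.encode(text))

print(word_count, rough_tokens)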
Key Dates
Week 6 Onwards
This document contains a summary of all remaining key dates for the rest of the term.
I’ll also try to clarify here the relevant week number and day of the week.
For our purposes, Mon Feb 10 begins “Week 6”. Week 7 begins Feb 17, but we have no class – it is “consumed” by Reading Week.
Key Dates
Quizzes
- Quiz 2: Feb 27th 12:30pm (Thursday of Week 8. Assignment 2 should help you study for this, so don’t leave it until the last minute).
- On paper, in class, first 30 mins. No notes, but some standardized “cheat sheet” will be provided (with e.g. table of time complexities for influence estimators).
- Make-up details TBA, please try not to miss it.
- Covers only materials from “module 2” (data valuation, influence, scaling)
- Quiz 3: Mar 20, 12:30pm.
- On CourSys, take home with 24 hrs to complete. Primarily writing-focused. Open notes, open anything but you must submit original work.
- Quiz 4: April 3, 12:30pm.
- On CourSys, take home with 24 hrs to complete. Primarily writing-focused and cumulative. Optionally, if you give a full project presentation with Q&A, you may use that as your “Quiz 4 submission”.
Assignment and Projects
- Project proposal: Feb 28th 11:59pm (Friday of Week 8)
- CourSys.
- Assignment 2: Mar 4 11:59pm (Tuesday of Week 9. Class time to work on this on Feb 13)
- Assignment 3: Mar 18, 11:59pm.
- CourSys.
- Assignment 4: April 3, 11:59pm.
- CourSys.
- Project Report: April 3, 12:30pm.
- CourSys.
- Project presentations round 1: April 3, 12:30pm.
- In class.
- Project presentations round 2: April 8, 1:30pm.
- In class.
Lecture Erratum
“An erratum or corrigendum (pl.: errata, corrigenda) (comes from Latin: errata corrige) is a correction of a published text” (source: Wikipedia).
In this doc, I’ll record any meaningful typos in materials, or answers to questions, that may have caused confusion.
In Spring 2025, I started this doc on Feb 12.
Week 6
- Feb 11
- Minor: Several typos in slides, see this commit if curious
- Substantive: in lecture, my explanation of the denominator used when calculating Shapley values was a bit confusing. I completely reworked these slides to provide a clearer example where we have 10 rows (n=10) and we consider all the “coalitions” of size 9, then of size 8, then of size 7… (see the formula note after this list).
- Substantive: in lecture, I went very quickly when describing the desirable economic properties of the Shapley value. If curious (and especially if you might be taking or will take econ courses that use Shapley values), see e.g. here. I’ll also revisit this with some examples on Thursday.
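Formula note (for reference; this is the standard textbook definition written in LaTeX notation, not a reproduction of the reworked slides). The Shapley value of data point i among n points N, with value function v, is

\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ v(S \cup \{i\}) - v(S) \right]

Equivalently, grouping coalitions by size (this is the “size 9, then size 8, …” view for n = 10):

\phi_i = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \subseteq N \setminus \{i\},\, |S| = k} \left[ v(S \cup \{i\}) - v(S) \right]

The denominator is what normalizes each marginal contribution by how many orderings (or, in the second form, how many coalitions of each size) produce it.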
Optional Deep Dives
Deep Dives on Influence
- https://openreview.net/forum?id=hzbguA9zMJ: revisits influence functions in neural networks. It offers additional insight into how the influence function derivation aligns (or doesn’t) with leave-one-out retraining.
- https://arxiv.org/abs/2112.08297: examines the accuracy of influence function approximations for deep networks and discusses the role of the Hessian inversion—helpful for checking the assumptions behind the derivations.
Optional Experimental Explainer Docs
This semester, I’m experimenting with the use of generative AI to create interactive “explainer docs”.
I provide these as examples and encourage you to try this out on your own as well. Part of the value in these comes from reviewing the outputted code, so they’re not meant to be “lecture slides”.
Also, these are not required content (you don’t need to calculate influence estimations by hand, but it can be useful to try!)
Key papers to double check formulas and derivations:
- KL17: https://arxiv.org/abs/1703.04730
- blog walkthrough on logistic regression: https://medium.com/towards-data-science/logistic-regression-from-scratch-870f0163bfc9 (log likelihood, Newton’s method)
- pyDVL open-source implementation of many estimators: https://pydvl.org/stable/influence/ (see e.g. https://github.com/aai-institute/pyDVL/blob/develop/src/pydvl/influence/influence_calculator.py)
- Hessian of logistic function on SE: https://stats.stackexchange.com/questions/68391/hessian-of-logistic-function
- Derive influence function on SE: https://datascience.stackexchange.com/questions/121608/influence-functions-on-neural-networks-help-with-understanding-of-result-and-de
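If you want to sanity-check the formulas yourself, here is a rough numpy sketch (not a vetted implementation; compare against pyDVL and the references above before trusting it) of the KL17 influence approximation for L2-regularized logistic regression. The dataset, hyperparameters, and choice of “test” point are all arbitrary illustrations.
# Rough sketch: KL17 influence for L2-regularized logistic regression,
# I(z_train, z_test) = -grad_test^T H^{-1} grad_train.
# Sign and scaling conventions (sum vs. mean Hessian) differ across references.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
C = 1.0
clf = LogisticRegression(C=C, fit_intercept=False).fit(X, y)  # no intercept, to keep H over features only
theta = clf.coef_.ravel()
lam = 1.0 / C  # sklearn's L2 penalty corresponds to lambda = 1/C on the summed log loss

def grad(x, y_, theta):
    # Gradient of the per-example log loss, with y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return (p - y_) * x

# Hessian of the regularized objective: sum_i p_i (1 - p_i) x_i x_i^T + lambda * I
p = 1.0 / (1.0 + np.exp(-X @ theta))
H = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])

# Influence of training point 0 on the loss at a "test" point
# (here just another training point, purely for illustration).
g_test = grad(X[1], y[1], theta)
g_train = grad(X[0], y[0], theta)
influence = -g_test @ np.linalg.solve(H, g_train)
print(influence)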
Prerequisites Doc
There are no hard prerequisites for this course. The course outline says this:
Students may benefit from having taken a course in AI, ML, or data science (or have equivalent experience from e.g. an internship, a research project, a personal project).
Example SFU courses:
- CMPT 310 - Intro Artificial Intelligence
- CMPT 353 - Computational Data Science
- CMPT 414 - Computer Vision
Having taken an HCI course or relevant social science course (e.g., sociology, economics) is a plus, but students without this experience who want to explore interdisciplinary CS work that is “human-centered” are welcome.
That being said, here are some materials that you can use to assess your readiness for the course. My suggestion is to take a look at the code first and see if it seems completely unfamiliar. If you’ve taken CMPT 353, this will be very familiar. If this content is unfamiliar, you may want to glance through the sklearn tutorial (https://scikit-learn.org/1.4/tutorial/basic/tutorial.html), or, for much more detail, the CMPT 353 lecture notes (https://ggbaker.ca/data-science/).
On the ML theory side of things, you may want to watch the 3Blue1Brown Deep Learning series and see if it seems completely unfamiliar or at least partially familiar: https://www.youtube.com/watch?v=aircAruvnKk
If at least one of these feels decently familiar and the other feels somewhat familiar, you’re probably in good shape. We won’t have any quiz questions in the course that are directly testing computational data science or deep learning material, but the concepts come up and you’ll want to have some basis in one of these areas to develop a good project idea.
In terms of HCI or social science materials, we’ll cover background in the course. Any and all perspectives are welcome!
You may also want to check out the ChatGPT-generated exercises below to see some examples of common machine learning operations.
# CRASH COURSE: Basic Model Training, Testing, and Data Analysis
# ===========================
# 1. IMPORT LIBRARIES
# ===========================
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_iris
# To display plots inline (this magic and the display() calls below assume a Jupyter/IPython notebook)
%matplotlib inline
# ===========================
# 2. LOAD AND EXPLORE DATA
# ===========================
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
print("First five rows of data:")
display(X.head())
print("Data shape:", X.shape)
print("Target shape:", y.shape)
print("\nClass distribution:")
print(y.value_counts())
# ===========================
# 3. BASIC DATA ANALYSIS
# ===========================
# Statistical summary
print("\nStatistical summary of features:")
display(X.describe())
# Pairplot for a quick visual
sns.pairplot(pd.concat([X, y], axis=1), hue='target', diag_kind='kde')
plt.show()
# ===========================
# 4. TRAIN-TEST SPLIT
# ===========================
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print("\nTraining set size:", X_train.shape)
print("Test set size:", X_test.shape)
# ===========================
# 5. MODEL TRAINING
# ===========================
model = LogisticRegression(max_iter=200) # Increase max_iter to ensure convergence
model.fit(X_train, y_train)
# ===========================
# 6. MODEL TESTING & EVALUATION
# ===========================
y_pred = model.predict(X_test)
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
# ===========================
# 7. EXERCISES FOR STUDENTS
# ===========================
print("""
Exercises:
1. Try a different classifier (e.g., RandomForestClassifier) and compare results.
2. Experiment with different test sizes (e.g., test_size=0.3).
3. Visualize the coefficient importances or feature importances for your chosen model.
4. Use other performance metrics (e.g., accuracy_score, precision_score) for evaluation.
5. Analyze how class imbalance might affect results (if you artificially modify 'y').
""")
# DEEP LEARNING CRASH COURSE
# ===========================
# 1. IMPORTS & SETUP
# ===========================
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
%matplotlib inline
# ===========================
# 2. LOAD & PREPARE MNIST
# ===========================
# The MNIST dataset has 60,000 training images, 10,000 test images
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Scale images to [0, 1]
x_train = x_train.astype("float32") / 255.
x_test = x_test.astype("float32") / 255.
# Flatten 28x28 images to 784-dimensional vectors for the MLP
x_train_flat = x_train.reshape((x_train.shape[0], 28 * 28))
x_test_flat = x_test.reshape((x_test.shape[0], 28 * 28))
# ===========================
# 3. BASIC DENSE MODEL
# ===========================
mlp_model = keras.Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])
mlp_model.compile(
loss='sparse_categorical_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
# ===========================
# 4. TRAIN & EVALUATE (MLP)
# ===========================
history_mlp = mlp_model.fit(
x_train_flat, y_train,
validation_split=0.1,
epochs=5,
batch_size=64,
verbose=1
)
test_loss, test_acc = mlp_model.evaluate(x_test_flat, y_test)
print(f"\nMLP Test accuracy: {test_acc:.4f}")
# Optional: Plot training curves
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(history_mlp.history['loss'], label='Train Loss')
plt.plot(history_mlp.history['val_loss'], label='Val Loss')
plt.legend()
plt.title("MLP Loss")
plt.subplot(1,2,2)
plt.plot(history_mlp.history['accuracy'], label='Train Acc')
plt.plot(history_mlp.history['val_accuracy'], label='Val Acc')
plt.legend()
plt.title("MLP Accuracy")
plt.show()
# ===========================
# 5. ADVANCED SECTION: BASIC CNN
# ===========================
# Reshape data back to (28, 28, 1)
x_train_cnn = x_train.reshape((x_train.shape[0], 28, 28, 1))
x_test_cnn = x_test.reshape((x_test.shape[0], 28, 28, 1))
cnn_model = keras.Sequential([
layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu',
input_shape=(28, 28, 1)),
layers.MaxPooling2D(pool_size=(2,2)),
layers.Conv2D(filters=64, kernel_size=(3,3), activation='relu'),
layers.MaxPooling2D(pool_size=(2,2)),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dropout(0.2), # regularize
layers.Dense(10, activation='softmax')
])
cnn_model.compile(
loss='sparse_categorical_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
history_cnn = cnn_model.fit(
x_train_cnn, y_train,
validation_split=0.1,
epochs=5,
batch_size=64,
verbose=1
)
test_loss_cnn, test_acc_cnn = cnn_model.evaluate(x_test_cnn, y_test)
print(f"\nCNN Test accuracy: {test_acc_cnn:.4f}")
# ===========================
# 6. EXERCISES FOR STUDENTS
# ===========================
print("""
Exercises:
1. Increase the number of epochs or change batch_size and observe results.
2. Modify the architecture (add more layers/neurons) and see if it improves accuracy.
3. Try different optimizers (RMSprop, SGD) and compare training dynamics.
4. Add Batch Normalization layers to see if training stabilizes.
5. Explore other datasets (CIFAR-10, Fashion-MNIST) for a broader challenge.
""")