5  Project Proposal

DEADLINE (Fall 2025): TBA.

We’re going to start thinking about our projects relatively early in the term! To scaffold project ideation, you’ll be asked to turn in and present an initial project proposal early in the semester.

You can view some short descriptions of previous projects here.

You can submit a 1-2 page PDF, text, or Markdown file. Exact length is not critical here: as long as it contains the key ideas, you’re good to go.

This proposal is not binding, though you will earn some marks for turning it in and presenting it. You can change your project topic, track, or group after the proposal is submitted (though you’re encouraged to stick relatively close to your proposal, just for the sake of your own time).

For the project, you can select from three tracks, described below.

Well before you turn your project in, you will be provided with a much more detailed rubric describing how your project will be graded. For the initial proposal, however, you should just focus on selecting a project that:

The two heuristic questions I recommend you ask while brainstorming project ideas:

5.0.1 Track 1: Tools and interfaces for human/data-centered AI

Track 1 will be a good fit for front-end focused projects. For this track, you can propose and develop some kind of tool or interface for data-centric AI. This interface might be a web application, mobile application, or even a user-focused CLI prototype.

To fit the project criteria, this tool should help users accomplish some kind of data-related action or some kind of data exploration task. In other words, it should either be targeted at users who want to control the flow of their data, or at data scientists who want to explore data in some way.

Please note that if you’re very uncomfortable doing prototyping and frontend development, you may not want to select this track. While I’m happy to support you if you want to learn these topics on the fly, we probably won’t have much time to cover core design, frontend, or software engineering concepts in this course, so this project is best suited to students who already have some of those skills and specifically want to use their project work time to advance in this area.

Examples:

  • A new interface for interacting with large language models that allows users to save or export conversation data (you might consider forking and contributing to something like https://github.com/ollama-webui/ollama-webui)
  • A browser extension that helps users collect and use data generated by their own browsing (e.g. export my YouTube watch history and train a local personalization / recommender system; see the sketch after this list)
  • A browser extension that blocks data collection and informs the user how data that’s collected might impact AI systems
  • A web interface for visually exploring aspects of a dataset, aimed at ML developers
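
To make the browsing-data bullet above a bit more concrete, here is a minimal, hedged sketch of the data-handling step such an extension might hand off to: it parses a Google Takeout-style watch-history.json export and tallies the most-watched channels as a trivial local “recommender” baseline. The file name and JSON field names are assumptions about the export format rather than requirements of the project.

```python
# Minimal sketch (Track 1, personalization bullet): tally most-watched channels
# from an exported YouTube watch history as a trivial recommender baseline.
# Assumes a Google Takeout-style watch-history.json; the field names below are
# placeholders and may differ in your own export.
import json
from collections import Counter

def channel_counts(path: str) -> Counter:
    with open(path, encoding="utf-8") as f:
        history = json.load(f)  # Takeout exports a list of watch entries
    counts = Counter()
    for entry in history:
        # "subtitles" typically holds the channel name in Takeout exports.
        for sub in entry.get("subtitles", []):
            counts[sub.get("name", "unknown")] += 1
    return counts

if __name__ == "__main__":
    for channel, n in channel_counts("watch-history.json").most_common(10):
        print(f"{channel}: {n} watches")
```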

5.0.2 Track 2: ML Project with Data Exploration Component

Track 2 will be the closest to what you might do in a typical project-focused ML course. For this project, you should select a machine learning task of interest and produce a thorough report describing how you might tackle the relevant ML challenges. What will set your project apart from a purely ML-focused course is that you will also be asked to conduct a data-centric exploration of the task. This might involve using data valuation techniques we learned in the course, exploring different dataset selection choices, and so on.

The DataPerf reading will be particularly useful to projects on this track.

Examples:

  • You might select a medical imaging dataset from a research lab or research challenge and show how selecting or deselecting certain training observations impacts performance on a carefully chosen held-out test set (a toy version of this ablation is sketched after this list)
  • You might fine-tune an open language model with a variety of fine-tuning sets and explore the impact on benchmark performance or on quality as perceived by humans
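
As a rough illustration of the first example, the sketch below trains the same model with and without a slice of the training data and compares held-out accuracy. The synthetic dataset, logistic regression model, and random 20% ablation are placeholders: a real project would swap in its own data and model, and choose the ablated slice using a data valuation method or a meaningful metadata attribute (site, scanner, year, etc.).

```python
# Minimal sketch (Track 2): measure how dropping a slice of training data
# changes held-out performance. Synthetic data and logistic regression are
# stand-ins for your actual task and model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def heldout_accuracy(keep_mask: np.ndarray) -> float:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[keep_mask], y_train[keep_mask])
    return accuracy_score(y_test, model.predict(X_test))

full = heldout_accuracy(np.ones(len(y_train), dtype=bool))

# Ablate a random ~20% slice; in a real project the slice would be chosen by
# a data valuation score or a metadata attribute rather than at random.
rng = np.random.default_rng(0)
mask = rng.random(len(y_train)) > 0.2
ablated = heldout_accuracy(mask)

print(f"full training set: {full:.3f}, ~80% subset: {ablated:.3f}")
```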

5.0.3 Track 3: Dataset Documentation and AI Auditing

Later in the course, we will discuss some research on dataset documentation and AI auditing. To summarize, this work involves carefully scrutinizing existing datasets and/or the outputs of AI systems to check for potential biases, performance gaps, unusual behavior, etc.

As your project, you might pick a famous dataset or AI system and conduct a systematic documentation effort or “audit”.

Examples:

  • You might select a popular dataset that’s been used to train LLMs like ChatGPT and use a mix of manual inspection and ML-powered investigation to try to understand the demographics of dataset contributors, or biases in the underlying content.
  • A fun example of this might involve a question like, “How much do various fandom communities discussing their favorite movie, book, anime, etc., contribute to the success of ChatGPT?” (a first-pass keyword count along these lines is sketched below)
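
As a hedged sketch of how such an investigation might begin, the snippet below streams a small sample from a public web corpus and counts keyword hits as a crude first signal of fandom-related content. The choice of allenai/c4, the keyword list, and the sample size are illustrative assumptions, and keyword matching is only a starting point before the manual and ML-powered inspection described above.

```python
# Minimal sketch (Track 3): estimate how often fandom-style content appears in
# a text corpus by keyword matching over a small streamed sample.
# "allenai/c4" is used here purely as an illustrative public corpus; swap in
# whatever dataset you are actually auditing.
from collections import Counter
from datasets import load_dataset

KEYWORDS = ["fanfic", "fandom", "anime", "cosplay", "shipping"]
N_DOCS = 5_000  # small sample; scale up carefully

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

hits = Counter()
for doc in stream.take(N_DOCS):
    text = doc["text"].lower()
    for kw in KEYWORDS:
        if kw in text:
            hits[kw] += 1

for kw, n in hits.most_common():
    print(f"{kw}: {n}/{N_DOCS} sampled documents")
```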

If you wish to pursue this option, please consult with the instructor first to discuss properly scoping this kind of project (obviously, investigating every single piece of training data underlying ChatGPT will not be possible with the time we have).


5.0.4 Mixing the tracks

If you have an idea for a project that involves mixing multiple tracks, that is totally great! Please let us know via the initial proposal draft.

In particular, mixing tracks might make sense if you have a larger group of students who want to work on multiple parts of a particular problem. For instance, if you want to build a prototype system that hooks up to an ML model and reports the results of a dataset documentation effort, you can definitely do so.