This post captures some fresh thoughts on how Bernie Sanders's recent AI dividend proposal connects with suggestions from our 2021 data dividends report. I'll also describe how my recent interests in attestation across the AI supply chain, the AI evaluation crisis, and Clear Data Rules relate to the topic of data dividends.

More specifically, I have a concrete proposal for how a data-dependence-tax-based data dividend can be updated to the 2026 context by implementing it as a "presumptive commons-rent tax that can be avoided by providing capabilities-to-data attribution." Instead of trying to come up with some proxy for data dependence to rank companies by how much data they use (e.g., by counting their users or auditing the volume of data within their organizational databases), we start from the assumption that more capable AI systems draw more heavily on the data commons (broadly construed) that humanity has built. AI companies can lower their tax burden by showing evidence that they used licensed or governed data to achieve certain capabilities, and the ecosystem of international AI auditing organizations would work together to verify these claims.

We either end up in a world where companies profiting from highly capable data-dependent systems (which need not be the labs themselves) pay large sums into various national funds (or ideally a single internationally governed fund), or in a world where the vast majority of upstream data flows through healthy data markets in which data creators have the collective leverage necessary to get paid through a mixture of upfront and royalty payments.

Of course, there are a lot of details to be worked out. The main goal of this post is to paint a picture of this policy direction with broad strokes while starting to get specific about the research program needed to implement something like this.

To expand on the idea, let's begin with some context.

Brief history of data dividends research in 2021

In the 2021 report, we analyzed a variety of possible fundraising and disbursement mechanisms for a “data dividend” (which was being discussed by Governor Newsom of California at the time). While we were not aiming to pick a single answer, our "likely good first step" suggestion was a data dependence tax to fund public goods. To make "data dependence" operational, we suggested using user count as a proxy: firms with lots of users are probably getting lots of value from aggregated user/public data.

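Some other notable works on data dividends from around that period, cited in full at the end of this post, include Bax (2019), Wadhwa (2020), Vincent et al. (2019), and Vincent and Hecht (2023).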

Another key idea from the report was that, in the context of retroactive dividends (as opposed to forward-looking markets), it is probably best to avoid “fine-grained valuation” (e.g., just trying to write individual-specific checks). Of course, in the context of data markets, it could still make sense in some cases to price both individual data points and collective bundles.

In short: for a dividend, we should tax dependence on collective data and disburse it coarsely while we figure out better valuation and interpretability methods. I think the reasoning from that report holds up pretty well in light of AI progress. I also think it’s notable that the motivation described in the Sanders proposal matches the arguments in our original report pretty closely.

However, I also think we should be sensitive to concerns from economists about the impacts of compute taxes, automation taxes, capital taxes, etc., such as those discussed in a recent NBER working paper on AI-related taxes and sovereign wealth funds. After the Sanders op-ed went out, the idea quickly drew a cross-ideological mix of interest, skepticism, and pushback: AP covered the Sanders/Altman/Trump public-ownership conversation, while the Washington Post editorial board, Reason, Cato, and Fortune's coverage of David Sacks captured pro-market, libertarian, and tech-policy critiques of government equity stakes in AI companies. The economics concern, summarized: depending on design, an AI dividend tax could have negative effects on growth, investment, diffusion, etc.

Where possible, we might want to avoid explicitly targeting “AI,” “compute,” or “automation.” Instead, the thing we actually want to target is private value extraction from the commons.

Something that’s complicated with foundation models is that they depend on several categories of data that are common-ish. This includes literal, governed digital commons like Wikipedia; public-domain works; commons-y public-web resources like Common Crawl; open-source code; and more implicit commons like click data, trace data, public posts, interaction data, and social graphs.

I think there’s actually an easy fix to target extraction from the commons: take a "rebuttable presumption of data-dependence" approach to the existence of powerful AI. Currently (and barring a major paradigm shift in AI) all of humanity’s approaches to building powerful AI are data-dependent. Even approaches that use reinforcement learning or synthetic data still have massive data dependencies in the overall training and evaluation pipeline needed to build an AI system.

Instead of using user count or something else as a primary proxy for data dependence, we might consider using capability itself. We would basically be working on a default assumption that if an AI system is very capable and monetized, a meaningful chunk of that value came from commons data. I think this assumption is currently very justified and will remain so in the near term.

There are several reasonable ways we can coarsely estimate the fraction of value attributable to data versus the value attributable to compute, non-data technical progress, interface progress, and other factors. We just need to pick some number. 50%, which happens to appear in the Sanders proposal, might be a reasonable starting placeholder. The more capable a system is, the more burden a company should face in explaining how it got so good.

Thus, we could iterate on various data dividends proposals to design what we might call a “presumptive commons-rent tax.” When a firm makes money from a powerful AI system, we presume some share of the rent came from public/user/commons data. AI operators can lower their presumptive commons-rent tax by showing that capabilities came from data that was acquired under non-commons conditions, e.g., licensed data purchased via a healthy data market.

As a toy example, suppose a highly capable AI system earns $10B in annual rents and the presumptive commons-rent share is set at 50%. If the operator can substantiate that half of its capability-relevant data contribution came from licensed, governed, or reciprocal sources, the taxable commons-rent base might fall from $5B to $2.5B.
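The arithmetic in this toy example can be written out as a short Python sketch. The function name and parameters are hypothetical, and the simple proportional reduction is an assumption made for illustration, not a settled formula.

```python
def taxable_commons_rent_base(annual_rents: float,
                              presumptive_share: float,
                              explained_fraction: float) -> float:
    """Rents presumed to come from the commons, net of substantiated data.

    Assumption (illustration only): the base shrinks in direct proportion
    to the fraction of capability-relevant data the operator can explain.
    """
    if not 0.0 <= presumptive_share <= 1.0:
        raise ValueError("presumptive_share must be between 0 and 1")
    if not 0.0 <= explained_fraction <= 1.0:
        raise ValueError("explained_fraction must be between 0 and 1")
    presumed_base = annual_rents * presumptive_share
    return presumed_base * (1.0 - explained_fraction)


# Toy numbers from the text: $10B in rents, a 50% presumptive share,
# and half of the capability-relevant data contribution substantiated.
base = taxable_commons_rent_base(10e9, 0.5, 0.5)
print(f"${base / 1e9:.1f}B")  # $2.5B
```

Note that this computes only the taxable base. The rate applied to that base is a separate design choice.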

We would need a clean accounting scheme here, with a possible unit being explained data. Explained data would estimate the share of a model’s effective, capability-relevant data contribution that the company can actually account for. To count as tax-reducing explained data, a data source would need to be documented, have provenance and proof of fair acquisition (e.g., because it was licensed, bought under contract, or similar), and be plausibly relevant to the capabilities being taxed. Critically, the tax rate would depend on the capability level achieved by a model, so more capable models would require more explained data, in accordance with our scientific understanding of scaling laws and training data attribution.

Preparing such evidence would look something like this:

  • first, a company profiting from AI models (an important qualifier, discussed further below, though working through the details goes beyond this "long blog post" format) prepares a datasheet
  • second, each entry in the datasheet would be labeled with an acquisition/governance status (licensed, internally generated, public-domain, governed by a data union/trust, etc.)
  • third, provenance evidence would be collected to support the acquisition/governance classifications
  • fourth, usage evidence shows how much each data component was actually used (could include details about mixture fractions, sampling rates, repetition, deduplication, training stage, post-training role, eval role, upstream sources for synthetic or RL data -- ultimately this evidence will be reviewed by AI auditor organizations, so it does not have to be 100% standardized and there can be some flexibility for different model types)
  • fifth, just as datasheet entries would be linked to provenance evidence, usage entries would be linked to ablations/data mixture experiments to show how those data components actually mattered for the relevant capabilities
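To make the five steps above more concrete, here is a minimal Python sketch of what one datasheet entry and a coarse explained-data share might look like. Every name here (DatasheetEntry, QUALIFYING_STATUSES, explained_share, and the evidence identifiers) is hypothetical. Two simplifications are assumptions rather than part of the proposal: usage weight (e.g., a mixture fraction) stands in for capability-relevant contribution, and the set of statuses that count as tax-reducing is a placeholder policy choice.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder policy choice: statuses assumed to count as non-commons.
QUALIFYING_STATUSES = {"licensed", "internally_generated", "data_union_or_trust"}


@dataclass
class DatasheetEntry:
    name: str            # step 1: the datasheet entry itself
    status: str          # step 2: acquisition/governance label
    usage_weight: float  # step 4: e.g., mixture fraction (an assumption)
    provenance_evidence: list[str] = field(default_factory=list)  # step 3
    ablation_evidence: list[str] = field(default_factory=list)    # step 5

    def is_explained(self) -> bool:
        """Qualifying status plus both kinds of supporting evidence."""
        return (self.status in QUALIFYING_STATUSES
                and bool(self.provenance_evidence)
                and bool(self.ablation_evidence))


def explained_share(entries: list[DatasheetEntry]) -> float:
    """Usage-weighted share of the data mix that counts as explained."""
    total = sum(e.usage_weight for e in entries)
    if total == 0:
        return 0.0
    explained = sum(e.usage_weight for e in entries if e.is_explained())
    return explained / total


entries = [
    DatasheetEntry("licensed_news_archive", "licensed", 0.25,
                   ["contract-001"], ["ablation-A"]),
    DatasheetEntry("public_web_crawl", "public_web", 0.75),
]
print(explained_share(entries))  # 0.25
```

In practice auditors would presumably weigh evidence quality rather than apply an all-or-nothing check. The binary is_explained test is only the simplest possible starting point.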

Model capability comes from a production process involving compute, model size, data quantity, data quality, interface and tool access, etc. Capability measurement would be used to set a default presumed tax rate. Companies could present data details to reduce their tax burden, and an auditor (or a network of auditing organizations) would convert the evidence into "accepted explained-data points" to determine a final rate.

For a first implementation, we need not rely on a single precise definition of “accepted explained-data points.” A more ready-to-use version could just use evidence tiers as determined by the adjudicating entities.
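As a rough illustration of how capability measurement, evidence tiers, and accepted explained-data points might fit together, here is a minimal Python sketch. All names and numbers are hypothetical placeholders: the tier weights, the 50% ceiling, the linear rate schedule, and the 100-point offset threshold are assumptions chosen only to make the mechanics visible, and capability is assumed to be a normalized score between 0 and 1.

```python
# Placeholder weights: how much credit each evidence tier earns.
TIER_WEIGHTS = {
    "datasheet_only": 0.25,
    "with_provenance": 0.5,
    "with_ablations": 1.0,
}

MAX_RATE = 0.5  # placeholder ceiling on the presumed rate


def _clamp(capability: float) -> float:
    return min(max(capability, 0.0), 1.0)


def default_rate(capability: float) -> float:
    """Default presumed rate: rises with capability (linear placeholder)."""
    return MAX_RATE * _clamp(capability)


def accepted_points(claims: list[tuple[float, str]]) -> float:
    """Discount each claimed amount of explained data by its evidence tier."""
    return sum(points * TIER_WEIGHTS[tier] for points, tier in claims)


def final_rate(capability: float,
               claims: list[tuple[float, str]],
               points_for_full_offset: float = 100.0) -> float:
    """More capable systems need more accepted points to offset the default."""
    c = _clamp(capability)
    required = points_for_full_offset * c
    if required == 0:
        return 0.0
    coverage = min(accepted_points(claims) / required, 1.0)
    return default_rate(c) * (1.0 - coverage)


claims = [(40.0, "with_ablations"), (40.0, "datasheet_only")]
print(accepted_points(claims))  # 50.0
print(final_rate(1.0, claims))  # 0.25
```

The same 80 claimed points earn only 50 accepted points here because half of them rest on weaker evidence, which is the kind of incentive the tiering is meant to create.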

This could create a good set of incentives:

  • if companies want lower taxes, they should build datasheets and provenance systems from the start
  • if they want larger reductions, they need to run and share data-centric scientific experiments
  • if they rely heavily on commons data, they can still do that, but they should pay something back or give something back
  • the tax is fully avoidable!

Importantly, everything described above would amount to preparing a report that looks a lot like what existing or proposed data transparency rules already call for, such as the EU's General-Purpose AI Code of Practice. This is close to something AI companies might need to do anyway!

But wait a second -- if the whole concern around taxing compute or automation is that "we shouldn't tax stuff that we want more of," isn't this potentially even worse than those other taxes, if we interpret this proposal as a tax on intelligence itself or capability itself? Critically, this is not a tax on intelligence or capability, but rather a tax on unexplained or mysterious capability. If an AI operator trains on 100% licensed/accounted-for data, and can show that this data actually drove the relevant capabilities, their commons tax is near zero.

If you train on Wikipedia, Common Crawl, public code, user traces, etc., you would end up paying some reasonable tax back to the commons (and the tax might also be reduced if you show evidence of, e.g., contributing to something like Wikimedia Enterprise, or making in-kind contributions of data, model weights, gold standard generated code, etc.). Ideally, during any kind of transition period, there would be a way to transfer existing reciprocity programs into tax credits as well. And perhaps reciprocity programs could just be integrated into the program in the long term.

How would this be enforced? This is where the recent momentum around auditing and safety comes in. An ecosystem of independent auditing institutions, along the lines of the frontier AI auditing ecosystem, could handle capability measurement, review the ablations, and assess whether AI operators can give a plausible high-level account of how data choices drive capabilities.

The ecosystem of auditing orgs would become part of the infrastructure for data governance: measuring capabilities, reviewing provenance, looking at ablations, etc. This would also get companies to contribute to advancing and sharing science about where model capabilities come from, in turn helping the auditing organizations.

Critically, by looping in auditing and safety organizations, this proposal could also take advantage of the fact that AI safety is one area in which there is a plausible path to international cooperation. In fact, I think this may be one of the few plausible ways to at least lay out a path toward a global wealth fund rather than various national funds and sovereign-focused economic interventions.

A single global wealth fund is morally attractive, because the data commons is transnational, but the more realistic path may be federated: national or regional AI commons funds collect revenue, while treaty or club arrangements allocate some share to global public goods and commons institutions.

Of course, we likely would not want independent AI auditors to be burdened with global taxation responsibility (nor would they likely want a bunch of extra work). Public tax authorities would still set the rules and then accredited auditors (with proportionate support to hire staff to do all this) could review evidence while a public technical board maintains standards. Courts or administrative tribunals would handle disputes.

Compared to compute and automation taxes, I believe this kind of approach would avoid some of the concerns raised by economists, and instead target a specific harm: companies turning collective human activity and public knowledge into private rents without proportionate return.

In the current world, this would mean that AI companies would pay a bunch of taxes, which then might fuel, e.g., a national wealth fund, or ideally a global wealth fund. But it also creates a path to lower the burden: license data, work with data unions/trusts, document provenance, run ablations, or give value back to the commons.

Of course, things that seem simple are often laundering complexity. To highlight a few critical questions and considerations:

  • Who classifies exactly what systems count as frontier AI, and exactly which organizations are subject to this tax?
  • The very challenging task of accurate capabilities measurement becomes pivotal -- but we need to figure this out anyway!
  • Power dynamics within the ecosystem of evaluation institutions will be critical to success. If they get captured, the whole thing fails. But if those institutions are going to get power anyway, this seems like one of the better uses of that power!
  • Anti-avoidance rules would be essential: synthetic data should inherit provenance obligations from upstream models and datasets; related-party data licenses should face transfer-pricing-style scrutiny; covered revenue should attach to deployment and monetization jurisdiction, not only training location; and open-source or research releases should not automatically exempt closed downstream monetization.
  • Attribution will always be imperfect. Capability can come from many things (data, compute, algorithmic progress, post-training tricks, scaffolding, inference-time compute, distribution, interface design, and more).
  • Eventually, the base assumption about capability as a proxy for data dependence might break.
  • The policy would need careful safe-harbor handling for small, open, nonprofit, and public-interest uses, as well as existing reciprocal commons arrangements.
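One of the anti-avoidance rules in the list above, that synthetic data should inherit provenance obligations from upstream models and datasets, lends itself to a small sketch. The Python below assumes a hypothetical lineage map in which each artifact lists what it was derived from. All artifact names and status labels are made up for illustration, and treating undocumented root sources as "unknown" is an assumption.

```python
# Hypothetical lineage: each artifact maps to the artifacts it was derived from.
UPSTREAM = {
    "synthetic_math_set": ["teacher_model"],
    "teacher_model": ["public_web_crawl", "licensed_textbooks"],
    "public_web_crawl": [],
    "licensed_textbooks": [],
}

# Hypothetical status labels, recorded for root sources only.
STATUS = {
    "public_web_crawl": "commons",
    "licensed_textbooks": "licensed",
}


def inherited_statuses(artifact, upstream, status):
    """Collect the statuses of every root source an artifact descends from."""
    found = set()
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against cycles and repeated ancestors
        seen.add(node)
        parents = upstream.get(node, [])
        if not parents:
            # Root source: undocumented roots are treated as "unknown".
            found.add(status.get(node, "unknown"))
        stack.extend(parents)
    return found


print(sorted(inherited_statuses("synthetic_math_set", UPSTREAM, STATUS)))
# ['commons', 'licensed']
```

Under this sketch, a synthetic dataset generated by a model trained partly on commons data carries the commons label forward, so routing data through a teacher model would not by itself reduce the presumed tax.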

But the overall direction here seems very promising to me. Please let me know what you think. As time permits (and conditional on your feedback -- is this too crazy, too strong of an assumption, redundant with existing proposals, etc.), I hope to put together a more whitepaper-looking version of this blog post to socialize the idea.

Bax, Eric. 2019. “Computing a Data Dividend.” ACM Economics & Computation 2019 (EC ’19), poster presentation. arXiv:1905.01805. https://doi.org/10.48550/arXiv.1905.01805

Wadhwa, Tarun. 2020. “Economic Impact and Feasibility of Data Dividends.” Data Catalyst Institute white paper.

Vincent, Nicholas, Yichun Li, Renee Zha, and Brent Hecht. 2019. “Mapping the Potential and Pitfalls of ‘Data Dividends’ as a Means of Sharing the Profits of Artificial Intelligence.” arXiv:1912.00757 [cs.CY]. https://doi.org/10.48550/arXiv.1912.00757

Vincent, Nicholas, and Brent Hecht. 2023. “Sharing the Winnings of AI with Data Dividends: Challenges with ‘Meritocratic’ Data Valuation.” EAAMO ’23 Poster Track, Boston, MA, USA, October 30-November 1, 2023. Non-archival.