A proposal for updating data dividend ideas around capability measurement, data provenance, and AI auditing.
This post will capture some fresh thoughts on how Bernie Sanders's
recent AI
dividend proposal connects with suggestions from our 2021 data
dividends report. I'll describe how my recent interest in attestation
across the AI supply chain, the
AI evaluation crisis, and Clear
Data Rules connects to the topic of data dividends.
More specifically, I have a concrete proposal for how a
data-dependence-tax-based data dividend can be updated to the 2026
context by implementing it as a "presumptive commons-rent tax that can
be avoided by providing capabilities-to-data attribution." Instead of
trying to come up with some proxy for data dependence to rank different
companies in terms of how much data they use (e.g. by counting their
users, auditing the volume of data within their organizational
databases, etc.), we instead start from the assumption that more capable
AI systems draw more heavily on the data commons (broadly construed)
that humanity has built. AI companies can show evidence that they used
licensed or governed data to achieve certain capabilities, and the
ecosystem of international AI auditing organizations would work together
to verify these claims and lower their tax burden.
The result is one of two worlds. In the first, companies profiting from
highly capable data-dependent systems (which need not be the labs
themselves) pay substantial sums into various national funds (or ideally
a single internationally governed fund). In the second, the vast
majority of upstream data flows through healthy data markets where data
creators have the collective leverage necessary to get paid through a
mixture of upfront and royalty payments.
Of course, there are a lot of details to be worked out. The main goal
of this post is to paint a picture of this policy direction with broad
strokes while starting to get specific about the research program needed
to implement something like this.
To expand on the idea, let's begin with some context.
Brief history on data dividends research in 2021
In the 2021 report, we analyzed a variety of possible fundraising and
disbursement mechanisms for a “data dividend” (which was being discussed
by Governor Newsom of California at the time). While we were not aiming
to pick a single answer, our "likely good first step" suggestion was a
data dependence tax to fund public goods. To make "data dependence"
operational, we suggested using user count as a proxy: firms with lots
of users are probably getting lots of value from aggregated user/public
data.
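Some other notable works on data dividends around that time
include:
- Bax's computational treatment of individual and grouped data
dividends using Shapley and Owen values
- Wadhwa's Data Catalyst review of the economic impact and feasibility
of data dividends
- this early mapping paper I led on design choices and simulated data
dividend outcomes
- a later EAAMO poster paper on the challenges of “meritocratic” data
valuation for dividends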
Another key idea from the report was that, in the context of
retroactive dividends (as opposed to forward-looking markets), it is
probably best to avoid “fine-grained valuation” (e.g., just trying to
write individual-specific checks). Of course, in the context of data
markets, it could still make sense in some cases to price both
individual data points and collective bundles.
In short: for a dividend, we should tax dependence on collective data
and disburse it coarsely while we figure out better valuation and
interpretability methods. I think the reasoning from that report holds
up pretty well in light of AI progress. I also think it’s notable that
the motivation described in the Sanders proposal matches the arguments
in our original report pretty closely.
However, I also think we should be sensitive to concerns from
economists about the impacts of compute taxes, automation taxes, capital
taxes, etc., such as those discussed in a recent NBER working paper on
AI-related taxes and sovereign wealth funds. After the Sanders op-ed
went out, the idea quickly drew a cross-ideological mix of interest,
skepticism, and pushback: AP
covered the Sanders/Altman/Trump public-ownership conversation,
while the Washington
Post editorial board, Reason,
Cato,
and Fortune's
coverage of David Sacks captured pro-market, libertarian, and
tech-policy critiques of government equity stakes in AI companies. The
economics concern, summarized: depending on design, an AI dividend tax
could have negative effects on growth, investment, diffusion, etc.
If we can avoid it, we might want to avoid explicitly targeting “AI,”
“compute,” or “automation.” Instead, the thing we actually want to
target is private value extraction from the commons.
Something that’s complicated with foundation models is that they
depend on several categories of data that are common-ish. This includes
literal, governed digital commons like Wikipedia; public-domain works;
commons-y public-web resources like Common Crawl; open-source code; and
more implicit commons like click data, trace data, public posts,
interaction data, and social graphs.
I think there’s actually an easy fix to target extraction from the
commons: take a "rebuttable presumption of data-dependence" approach to
the existence of powerful AI. Currently (and barring a major paradigm
shift in AI) all of humanity’s approaches to building powerful AI are
data-dependent. Even approaches that use reinforcement learning or
synthetic data still have massive data dependencies in the overall
training and evaluation pipeline needed to build an AI system.
Instead of using user count or some other measure as the primary proxy
for data dependence, we might consider using capability itself. The
default assumption would be that if an AI system is very capable and
monetized, a meaningful chunk of that value came from commons data. I
think this assumption is currently well justified and will remain so in
the near term.
There are several reasonable ways we can coarsely estimate the
fraction of value attributable to data versus the value attributable to
compute, non-data technical progress, interface progress, and other
factors. We just need to pick some number. 50%, which happens to appear
in the Sanders proposal, might be a reasonable starting placeholder. The
more capable a system is, the more burden a company should face in
explaining how it got so good.
Thus, we could iterate on various data dividends proposals to design
what we might call a “presumptive commons-rent tax.” When a firm makes
money from a powerful AI system, we presume some share of the rent came
from public/user/commons data. AI operators can lower their presumptive
commons-rent tax by showing that capabilities came from data that was
acquired under non-commons conditions, e.g., licensed data purchased via
a healthy data market.
As a toy example, suppose a highly capable AI system earns $10B in
annual rents and the presumptive commons-rent share is set at 50%. If
the operator can substantiate that half of its capability-relevant data
contribution came from licensed, governed, or reciprocal sources, the
taxable commons-rent base might fall from $5B to $2.5B.
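The arithmetic in this toy example can be written out as a short sketch. The function name and its parameters are illustrative assumptions, as is the choice to let explained data reduce the base linearly. The only inputs taken from the example are the $10B in rents, the 50% presumptive share, and the half-explained data contribution.

```python
def taxable_commons_rent_base(annual_rents, presumptive_share, explained_fraction):
    """Return the rent base still presumed to come from commons data.

    Assumption (illustration only): explained data reduces the presumptive
    base linearly. A real scheme could use a different schedule.
    """
    if not 0.0 <= presumptive_share <= 1.0:
        raise ValueError("presumptive_share must be between 0 and 1")
    if not 0.0 <= explained_fraction <= 1.0:
        raise ValueError("explained_fraction must be between 0 and 1")
    presumptive_base = annual_rents * presumptive_share
    return presumptive_base * (1.0 - explained_fraction)


# Toy numbers from the text: $10B in rents, 50% presumptive share.
base_before = taxable_commons_rent_base(10e9, 0.5, 0.0)  # nothing explained
base_after = taxable_commons_rent_base(10e9, 0.5, 0.5)   # half explained

print(f"${base_before / 1e9:.1f}B -> ${base_after / 1e9:.1f}B")
```

Note that this computes the taxable base, not the tax owed. The rate applied to that base is a separate design choice.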
We would need a clean accounting scheme here, with a possible unit
being explained data. Explained data would estimate the
share of a model’s effective, capability-relevant data contribution that
the company can actually account for. To count as tax-reducing explained
data, a data source would need to be documented, come with provenance and
proof of fair acquisition (e.g., because it was licensed, bought under
contract, or similar), and be plausibly relevant to the capabilities being
taxed. Critically, the tax rate would depend on the capability level
achieved by a model, so more capable models would require more explained
data, in accordance with our scientific understanding of scaling laws and training
data attribution.
Preparing such evidence would look something like this:
- first, a company profiting from AI models (an important qualifier --
more below, though working through details here will extend beyond this
"long blog post" format) would prepare a datasheet
- second, each entry in the datasheet would be labeled with an
acquisition/governance status (licensed, internally generated,
public-domain, governed by a data union/trust, etc.)
- third, provenance evidence would be collected to support the
acquisition/governance classifications
- fourth, usage evidence would show how much each data component was
actually used. This could include details about mixture fractions,
sampling rates, repetition, deduplication, training stage, post-training
role, eval role, and upstream sources for synthetic or RL data. Because
this evidence would ultimately be reviewed by AI auditor organizations,
it does not have to be 100% standardized, and there can be some
flexibility for different model types
- fifth, just as datasheet entries would be linked to provenance
evidence, usage entries would be linked to ablations/data mixture
experiments to show how those data components actually mattered for the
relevant capabilities
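To make the five steps above more concrete, here is a minimal sketch of what a single datasheet entry might look like as a data structure. Every name here (`GovernanceStatus`, `DatasheetEntry`, `missing_evidence`, the example file names) is hypothetical, and the `UNDOCUMENTED` status is my own addition for illustration. An actual reporting format would be set by regulators and auditors.

```python
from dataclasses import dataclass, field
from enum import Enum


class GovernanceStatus(Enum):
    # Statuses named in the text, plus one hypothetical catch-all.
    LICENSED = "licensed"
    INTERNALLY_GENERATED = "internally generated"
    PUBLIC_DOMAIN = "public-domain"
    DATA_UNION_OR_TRUST = "governed by a data union/trust"
    UNDOCUMENTED = "undocumented"  # assumption: not in the original list


@dataclass
class DatasheetEntry:
    name: str                                   # step 1: listed in the datasheet
    status: GovernanceStatus                    # step 2: acquisition/governance label
    provenance_evidence: list[str] = field(default_factory=list)  # step 3
    mixture_fraction: float = 0.0               # step 4: one possible usage measure
    ablation_reports: list[str] = field(default_factory=list)     # step 5

    def missing_evidence(self) -> list[str]:
        """List which kinds of supporting evidence this entry still lacks."""
        gaps = []
        if not self.provenance_evidence:
            gaps.append("provenance")
        if self.mixture_fraction <= 0.0:
            gaps.append("usage")
        if not self.ablation_reports:
            gaps.append("ablation")
        return gaps


entries = [
    DatasheetEntry(
        "licensed-news-archive",
        GovernanceStatus.LICENSED,
        provenance_evidence=["contract-2025-001.pdf"],
        mixture_fraction=0.2,
        ablation_reports=["ablation-news.md"],
    ),
    DatasheetEntry(
        "web-crawl-snapshot",
        GovernanceStatus.UNDOCUMENTED,
        mixture_fraction=0.8,
    ),
]

gaps = {entry.name: entry.missing_evidence() for entry in entries}
print(gaps)
```

A single mixture fraction is a deliberately crude stand-in for the richer usage evidence described in the fourth step.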
For a first implementation, we need not rely on a single precise
definition of “accepted explained-data points.” A more ready-to-use
version could just use evidence tiers as determined by the adjudicating
entities.
Model capability comes from a production process involving compute,
model size, data quantity, data quality, interface and tool access, etc.
Capability measurement would be used to set a default presumed tax rate.
Companies could present data details to reduce their tax burden, and an
auditor (or a network of auditing organizations) would convert the
evidence into "accepted explained-data points" to determine a final
rate.
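As a rough sketch of how these pieces might fit together, the code below maps a capability score to a default presumed rate and then discounts it using evidence tiers. Everything specific here is an assumption made for illustration: the tier names, the credit weights, the linear capability-to-rate mapping, the 0-to-1 capability score, and the 50% ceiling (borrowed from the placeholder share discussed earlier).

```python
# Hypothetical credit weights per evidence tier. Real weights would be
# set by the adjudicating entities, not hard-coded like this.
TIER_CREDIT = {
    "documented": 0.25,          # listed in a datasheet only
    "provenance_verified": 0.5,  # acquisition evidence accepted by an auditor
    "ablation_backed": 1.0,      # experiments tie the data to capabilities
}


def presumed_rate(capability_score, max_rate=0.5):
    """Map a capability score in [0, 1] to a default presumed rate.

    Assumption: a simple linear mapping, clamped to the valid range.
    """
    clamped = min(max(capability_score, 0.0), 1.0)
    return max_rate * clamped


def accepted_explained_share(claims):
    """Convert (data_share, tier) claims into a credited share in [0, 1]."""
    total = sum(share * TIER_CREDIT[tier] for share, tier in claims)
    return min(total, 1.0)


def final_rate(capability_score, claims):
    """Discount the presumed rate by the accepted explained-data share."""
    return presumed_rate(capability_score) * (1.0 - accepted_explained_share(claims))


claims = [
    (0.5, "ablation_backed"),       # half the data contribution, fully credited
    (0.25, "provenance_verified"),  # a quarter, credited at half weight
]
default = presumed_rate(0.5)
reduced = final_rate(0.5, claims)
print(f"default rate {default:.2%}, final rate {reduced:.2%}")
```

In this example a system with a capability score of 0.5 starts at a 25% presumed rate, and the two claims bring it down to roughly 9.4%. The point is only that better-evidenced claims earn more credit. None of the numbers are proposals.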
This could create a good set of incentives:
- if companies want lower taxes, they should build datasheets and provenance
systems from the start
- if they want larger reductions, they need to run and share
data-centric scientific experiments
- if they rely heavily on commons data, they can still do that, but
they should pay something back or give something back
- the tax is fully avoidable!
Importantly, everything described above would basically involve
preparing a report that would look a lot like something required by
existing or proposed data transparency laws, such as the EU
AI Code of Practice. This is close to something AI companies might
need to do anyway!
But wait a second -- if the whole concern around taxing compute or
automation is that "we shouldn't tax stuff that we want more of," isn't
this potentially even worse than those other taxes, if we interpret this
proposal as a tax on intelligence itself or capability itself?
Critically, this is not a tax on intelligence or capability, but rather
a tax on unexplained or mysterious capability. If an AI operator trains
on 100% licensed/accounted-for data, and can show that this data
actually drove the relevant capabilities, their commons tax is near
zero.
If you train on Wikipedia, Common Crawl, public code, user traces,
etc., you would end up paying some reasonable tax back to the commons
(and the tax might also be reduced if you show evidence of, e.g.,
contributing to something like Wikimedia
Enterprise, or making in-kind contributions of data, model weights,
gold standard generated code, etc.). Ideally, during any kind of
transition period, there would be a way to transfer existing reciprocity
programs into tax credits as well. And perhaps reciprocity programs
could just be integrated into the program in the long term.
How would this be enforced? This is where the recent momentum around
auditing and safety comes in. Capability measurement -- and assessment
of the ablations and whether AI operators are able to provide plausible
accounts of, at a high level, how data choices drive capabilities --
could be handled by an ecosystem of independent auditing institutions,
along the lines of the frontier AI
auditing ecosystem.
The ecosystem of auditing orgs would become part of the
infrastructure for data governance: measuring capabilities, reviewing
provenance, looking at ablations, etc. This would also get companies to
contribute to advancing and sharing science about where model
capabilities come from, in turn helping the auditing organizations.
Critically, by looping in auditing and safety organizations, this
proposal could also take advantage of the fact that AI safety is one
area in which there is a plausible path to international cooperation. In
fact, I think this may be one of the few plausible ways to at least lay
out a path toward a global wealth fund rather than various national
funds and sovereign-focused economic interventions.
A single global wealth fund is morally attractive, because the data
commons is transnational, but the more realistic path may be federated:
national or regional AI commons funds collect revenue, while treaty or
club arrangements allocate some share to global public goods and commons
institutions.
Of course, we likely would not want independent AI auditors to be
burdened with global taxation responsibility (nor would they likely want
a bunch of extra work). Public tax authorities would still set the rules
and then accredited auditors (with proportionate support to hire staff
to do all this) could review evidence while a public technical board
maintains standards. Courts or administrative tribunals would handle
disputes.
Compared to compute and automation taxes, I believe this kind of
approach would avoid some of the concerns raised by economists, and
instead target a specific harm: companies turning collective human
activity and public knowledge into private rents without proportionate
return.
In the current world, this would mean that AI companies would pay a
bunch of taxes, which then might fuel, e.g., a national wealth fund, or
ideally a global wealth fund. But it also creates a path to lower the
burden: license data, work with data unions/trusts, document provenance,
run ablations, or give value back to the commons.
Of course, things that seem simple are often laundering complexity.
To highlight a few critical questions and considerations:
- Who classifies exactly what systems count as frontier AI, and
exactly which organizations are subject to this tax?
- The very challenging task of accurate capabilities measurement
becomes pivotal -- but we need to figure this out anyway!
- Power dynamics within the ecosystem of evaluation institutions will
be critical to success. If they get captured, the whole thing fails. But
if those institutions are going to get power anyway, this seems like one
of the better uses of that power!
- Anti-avoidance rules would be essential: synthetic data should
inherit provenance obligations from upstream models and datasets;
related-party data licenses should face transfer-pricing-style scrutiny;
covered revenue should attach to deployment and monetization
jurisdiction, not only training location; and open-source or research
releases should not automatically exempt closed downstream
monetization.
- Attribution will always be imperfect. Capability can come from many
things (data, compute, algorithmic progress, post-training tricks,
scaffolding, inference-time compute, distribution, interface design, and
more).
- Eventually, the base assumption about capability as a proxy for data
dependence might break.
- The policy would need careful safe-harbor handling for small, open,
nonprofit, and public-interest uses, as well as existing reciprocal
commons arrangements.
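To illustrate one of the anti-avoidance rules above, that synthetic data should inherit provenance obligations from upstream models and datasets, here is a minimal sketch. It treats a source as cleanly explained only if every root source it derives from is explained. The function names, the dictionary-based lineage format, and the example source names are all hypothetical.

```python
def root_sources(name, upstream):
    """Return the root sources (those with no recorded upstream) behind `name`.

    `upstream` maps a source name to the list of sources it was derived from.
    """
    roots, stack, seen = set(), [name], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated ancestors
        seen.add(current)
        parents = upstream.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)
    return roots


def inherits_clean_provenance(name, upstream, explained):
    """True only if every root source behind `name` is in the explained set."""
    return root_sources(name, upstream) <= explained


# Hypothetical lineage: a synthetic dataset generated by a teacher model
# that was itself trained on one licensed and one unexplained source.
upstream = {
    "synthetic-qa-set": ["teacher-model"],
    "teacher-model": ["licensed-textbooks", "web-crawl-snapshot"],
}

roots = root_sources("synthetic-qa-set", upstream)
clean_now = inherits_clean_provenance(
    "synthetic-qa-set", upstream, explained={"licensed-textbooks"}
)
clean_if_all_explained = inherits_clean_provenance(
    "synthetic-qa-set", upstream,
    explained={"licensed-textbooks", "web-crawl-snapshot"},
)
print(sorted(roots), clean_now, clean_if_all_explained)
```

The all-or-nothing rule is the strictest possible choice. A real scheme might instead pass through a proportional share of the upstream obligation.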
But the overall direction here seems very promising to me. Please let
me know what you think. As time permits (and conditional on your
feedback -- is this too crazy, too strong of an assumption, redundant
with existing proposals, etc.), I hope to put together a more
whitepaper-looking version of this blog post to socialize the idea.
Notes and Links
- Sanders's
recent AI dividend proposal - the motivating op-ed calling for a
public ownership stake in large AI companies.
- Our
2021 data dividends report - the earlier report this post updates,
especially its data-dependence tax and coarse disbursement framing.
- Attestation
across the AI supply chain - background on why provenance,
disclosure, and institutional verification matter for AI
governance.
- The
AI evaluation crisis is an opportunity - related argument for
treating evaluation infrastructure as a governance opportunity.
- Clear
Data Rules - related argument that clearer rules can benefit both
data creators and AI companies.
- NBER working paper on
AI-related taxes and sovereign wealth funds - economics background
on AI-related tax instruments, growth, investment, diffusion, and
wealth-fund design.
- AP
overview of Sanders, Altman, and Trump discussing public ownership in
AI - mainstream coverage showing how quickly the public-ownership
idea became a cross-ideological political conversation.
- Washington
Post editorial board critique of government stakes in AI companies -
mainstream editorial skepticism about public equity stakes and
government ownership.
- Reason
critique of the proposed AI wealth fund - libertarian/pro-market
critique of the stock-tax and voting-share design.
- Cato
critique of Sanders's proposal and Trump-era government-ownership
precedents - pro-market critique that links the Sanders proposal to
broader concerns about state ownership and corporate control.
- Fortune
coverage of David Sacks's warning about AI nationalization -
tech-policy/industry-facing coverage of criticism from David Sacks.
- EU
AI Code of Practice - example of existing or proposed
data-transparency obligations that could overlap with the reporting
burden discussed here.
- Frontier AI
auditing ecosystem - institutional model for independent auditing
capacity that could support capability measurement and provenance
review.
- Wikimedia
Enterprise - useful example of high-volume commercial reuse
supporting a commons institution.
- Data
Provenance Initiative paper - useful on data provenance tracking,
dataset licensing, attribution, and why “show receipts” is not totally
imaginary.
- Kaplan et al. scaling
laws - useful background for the idea that performance can be
modeled as a function of data, model size, and compute.
- Data mixing work -
useful background for thinking about data mixture experiments and how
data composition affects model performance.
Bax, Eric. 2019. “Computing a Data Dividend.”
ACM Economics & Computation 2019 (EC ’19), poster
presentation. arXiv:1905.01805. https://doi.org/10.48550/arXiv.1905.01805
Wadhwa, Tarun. 2020. “Economic
Impact and Feasibility of Data Dividends.” Data Catalyst Institute
white paper.
Vincent, Nicholas, Yichun Li, Renee Zha, and Brent Hecht. 2019. “Mapping the Potential and
Pitfalls of ‘Data Dividends’ as a Means of Sharing the Profits of
Artificial Intelligence.” arXiv:1912.00757 [cs.CY]. https://doi.org/10.48550/arXiv.1912.00757
Vincent, Nicholas, and Brent Hecht. 2023. “Sharing
the Winnings of AI with Data Dividends: Challenges with ‘Meritocratic’
Data Valuation.” EAAMO ’23 Poster Track, Boston, MA, USA,
October 30-November 1, 2023. Non-archival.
ATProto local JSON preview
{
"note": "Local ATProto-shaped preview. Run `make garden-refresh-atproto` to cache exact public records where available.",
"sourcePath": "01_posts/2026-06-07-presumptive-commons-rent-tax-ai-dividends.md",
"uri": "at://did:plc:doxvahqvyhyqf32v7wz7p5xk/site.standard.document/3mnqonrblbhwi",
"value": {
"$type": "site.standard.document",
"title": "AI Dividends Without Taxing Compute, Automation, or Equity: A Presumptive Commons-Rent Tax Based on Capabilities and Data Dependence",
"description": "A proposal for updating data dividend ideas around capability measurement, data provenance, and AI auditing.",
"publishedAt": "2026-06-07",
"site": "at://did:plc:doxvahqvyhyqf32v7wz7p5xk/site.standard.publication/3lzrsw2kvwc2m",
"content": {
"$type": "at.markpub.markdown",
"text": "This post will capture some fresh thoughts on how Bernie Sanders's recent [AI dividend proposal][sanders-ai-dividend] connects with suggestions from our 2021 [data dividends report][data-dividend-report]. I'll describe how my recent interest in [attestation across the AI supply chain][attestation], [the AI evaluation crisis][evaluation-crisis], and [Clear Data Rules][clear-data-rules] connects to the topic of data dividends.\n\nMore specifically, I have a concrete proposal for how a data-dependence-tax-based data dividend can be updated to the 2026 context by implementing it as a \"presumptive commons-rent tax that can be avoided by providing capabilities-to-data attribution.\" Instead of trying to come up with some proxy for data dependence to rank different companies in terms of how much data they use (e.g. by counting their users, auditing the volume of data within their organizational databases, etc.), we instead start from the assumption that more capable AI systems draw more heavily on the data commons (broadly construed) that humanity has built. AI companies can show evidence that they used licensed or governed data to achieve certain capabilities, and the ecosystem of international AI auditing organizations would work together to verify these claims and lower their tax burden.\n\nWe either have a world where companies profiting from highly capable data-dependent systems (which need not be the labs themselves) pay a large amount of aggregate funds into various national funds (or ideally a single internationally governed fund), or a world where the vast majority of upstream data flows through healthy data markets where data creators have the collective leverage necessary to get paid through a mixture of upfront and royalty payments.\n\nOf course, there are a lot of details to be worked out. 
The main goal of this post is to paint a picture of this policy direction with broad strokes while starting to get specific about the research program needed to implement something like this.\n\nTo expand on the idea, let's begin with some context.\n\n## Brief history on data dividends research in 2021\n\nIn the 2021 report, we analyzed a variety of possible fundraising and disbursement mechanisms for a “data dividend” (which was being discussed by Governor Newsom of California at the time). While we were not aiming to pick a single answer, our \"likely good first step\" suggestion was a data dependence tax to fund public goods. To make \"data dependence\" operational, we suggested using user count as a proxy: firms with lots of users are probably getting lots of value from aggregated user/public data.\n\nSome other notable works on data dividends around that time include:\n- [Bax's computational treatment](https://arxiv.org/pdf/1905.01805) of individual and grouped data dividends using Shapley and Owen values\n- [Wadhwa's Data Catalyst review](https://datacatalyst.org/wp-content/uploads/2020/06/Economic-Impact-and-Feasibility-of-Data-Dividends-1.pdf) of the economic impact and feasibility of data dividends\n- this early [mapping paper](https://arxiv.org/abs/1912.00757) I led on design choices and simulated data dividend outcomes\n- a later [EAAMO poster paper](http://www.nickmvincent.com/static/eaamo_data_dividends.pdf) on the challenges of “meritocratic” data valuation for dividends\n\nAnother key idea from the report was that, in the context of retroactive dividends (as opposed to forward-looking markets), it is probably best to avoid “fine-grained valuation” (e.g., just trying to write individual-specific checks). 
Of course, in the context of data markets, it could still make sense in some cases to price both individual data points and collective bundles.\n\nIn short: for a dividend, we should tax dependence on collective data and disburse it coarsely while we figure out better valuation and interpretability methods. I think the reasoning from that report holds up pretty well in light of AI progress. I also think it’s notable that the motivation described in the Sanders proposal matches the arguments in our original report pretty closely.\n\nHowever, I also think we should be sensitive to concerns from economists about the impacts of compute taxes, automation taxes, capital taxes, etc., such as those discussed in a recent [NBER working paper on AI-related taxes and sovereign wealth funds][nber-ai-taxes]. After the Sanders op-ed went out, the idea quickly drew a cross-ideological mix of interest, skepticism, and pushback: [AP covered][ap-ai-public-ownership] the Sanders/Altman/Trump public-ownership conversation, while the [Washington Post editorial board][washpost-sanders-ai-stake], [Reason][reason-sanders-ai-wealth], [Cato][cato-trump-sanders-swf], and [Fortune's coverage of David Sacks][fortune-sacks-ai-equity] captured pro-market, libertarian, and tech-policy critiques of government equity stakes in AI companies. The economics concern, summarized: depending on design, an AI dividend tax could have negative effects on growth, investment, diffusion, etc.\n\nIf we can avoid it, we might want to avoid explicitly targeting “AI,” “compute,” or “automation.” _Instead, the thing we actually want to target is private value extraction from the commons._\n\nSomething that’s complicated with foundation models is that they depend on several categories of data that are common-ish. 
This includes literal, governed digital commons like Wikipedia; public-domain works; commons-y public-web resources like Common Crawl; open-source code; and more implicit commons like click data, trace data, public posts, interaction data, and social graphs.\n\nI think there’s actually an easy fix to target extraction from the commons: take a \"rebuttable presumption of data-dependence\" approach to the existence of powerful AI. Currently (and barring a major paradigm shift in AI) all of humanity’s approaches to building powerful AI are data-dependent. Even approaches that use reinforcement learning or synthetic data still have massive data dependencies in the overall training and evaluation pipeline needed to build an AI system.\n\nInstead of using user count or something else as a primary proxy for data dependence, we might consider using capability itself. We would basically be working on a default assumption that if an AI system is very capable and monetized, a meaningful chunk of that value came from commons data. I think this assumption is currently very justified and will remain so in the near term.\n\nThere are several reasonable ways we can coarsely estimate the fraction of value attributable to data versus the value attributable to compute, non-data technical progress, interface progress, and other factors. We just need to pick some number. 50%, which happens to appear in the Sanders proposal, might be a reasonable starting placeholder. The more capable a system is, the more burden a company should face in explaining how it got so good.\n\nThus, we could iterate on various data dividends proposals to design what we might call a “presumptive commons-rent tax.” When a firm makes money from a powerful AI system, we presume some share of the rent came from public/user/commons data. 
AI operators can lower their presumptive commons-rent tax by showing that capabilities came from data that was acquired under non-commons conditions, e.g., licensed data purchased via a healthy data market.\n\nAs a toy example, suppose a highly capable AI system earns $10B in annual rents and the presumptive commons-rent share is set at 50%. If the operator can substantiate that half of its capability-relevant data contribution came from licensed, governed, or reciprocal sources, the taxable commons-rent base might fall from $5B to $2.5B.\n\nWe would need a clean accounting scheme here, with a possible unit being **explained data**. Explained data would estimate the share of a model’s effective, capability-relevant data contribution that the company can actually account for. To count as tax-reducing explained data, a data source would need to be documented, have provenance and proof of fair acquisition (e.g., because it was licensed, bought under contract, or similar), and plausibly relevant to the capabilities being taxed. 
Critically, the tax rate would depend on the capability level achieved by a model, so more capable models would require more explained data, in accordance with our scientific understanding of [scaling laws][kaplan-scaling-laws] and training data attribution.\n\nPreparing such evidence would look something like this:\n\n- first, a company profiting from AI models (an important qualifier -- more below, though working through details here will extend beyond this \"long blog post\" format) prepares a datasheet\n- second, each entry in the datasheet would be labeled with an acquisition/governance status (licensed, internally generated, public-domain, governed by a data union/trust, etc.)\n- third, provenance evidence would be collected to support the acquisition/governance classifications\n- fourth, usage evidence shows how much each data component was actually used (could include details about mixture fractions, sampling rates, repetition, deduplication, training stage, post-training role, eval role, upstream sources for synthetic or RL data -- ultimately this evidence will be reviewed by AI auditor organizations, so it does not have to be 100% standardized and there can be some flexibility for different model types)\n- fifth, just as datasheet entries would be linked to provenance evidence, usage entries would be linked to ablations/[data mixture experiments][data-mixing-work] to show how those data components actually mattered for the relevant capabilities\n\nFor a first implementation, we need not rely on a single precise definition of “accepted explained-data points.” A more ready-to-use version could just use evidence tiers as determined by the adjudicating entities.\n\nModel capability comes from a production process involving compute, model size, data quantity, data quality, interface and tool access, etc. Capability measurement would be used to set a default presumed tax rate. 
Companies could present data details to reduce their tax burden, and an auditor (or a network of auditing organizations) would convert the evidence into \"accepted explained-data points\" to determine a final rate.\n\nThis could create a good set of incentives:\n\n- if companies want lower taxes, they should build datasheets and [provenance systems][data-provenance-initiative] from the start\n- if they want larger reductions, they need to run and share data-centric scientific experiments\n- if they rely heavily on commons data, they can still do that, but they should pay something back or give something back\n- the tax is fully avoidable!\n\nImportantly, everything described above would basically involve preparing a report that would look a lot like something required by existing or proposed data transparency laws, such as the [EU AI Code of Practice][eu-ai-code]. This is close to something AI companies might need to do anyway!\n\nBut wait a second -- if the whole concern around taxing compute or automation is that \"we shouldn't tax stuff that we want more of,\" isn't this potentially even worse than those other taxes, if we interpret this proposal as a tax on intelligence itself or capability itself? Critically, this is not a tax on intelligence or capability, but rather a tax on unexplained or mysterious capability. If an AI operator trains on 100% licensed/accounted-for data, and can show that this data actually drove the relevant capabilities, their commons tax is near zero.\n\nIf you train on Wikipedia, Common Crawl, public code, user traces, etc., you would end up paying some reasonable tax back to the commons (and the tax might also be reduced if you show evidence of, e.g., contributing to something like [Wikimedia Enterprise][wikimedia-enterprise], or making in-kind contributions of data, model weights, gold standard generated code, etc.). 
Ideally, during any kind of transition period, there would be a way to transfer existing reciprocity programs into tax credits as well. And perhaps reciprocity programs could just be integrated into the program in the long term.\n\nHow would this be enforced? This is where the recent momentum around auditing and safety comes in. Capability measurement -- and assessment of the ablations and whether AI operators are able to provide plausible accounts of, at a high level, how data choices drive capabilities -- could be handled by an ecosystem of independent auditing institutions, along the lines of the [frontier AI auditing ecosystem][frontier-ai-auditing].\n\nThe ecosystem of auditing orgs would become part of the infrastructure for data governance: measuring capabilities, reviewing provenance, looking at ablations, etc. This would also get companies to contribute to advancing and sharing science about where model capabilities come from, in turn helping the auditing organizations.\n\nCritically, by looping in auditing and safety organizations, this proposal could also take advantage of the fact that AI safety is one area in which there is a plausible path to international cooperation. In fact, I think this might be one of the only ways that it might be plausible to at least lay out a path toward a global wealth fund rather than various national funds and sovereign-focused economic interventions.\n\nA single global wealth fund is morally attractive, because the data commons is transnational, but the more realistic path may be federated: national or regional AI commons funds collect revenue, while treaty or club arrangements allocate some share to global public goods and commons institutions.\n\nOf course, we likely would not want independent AI auditors to be burdened with global taxation responsibility (nor would they likely want a bunch of extra work). 
Public tax authorities would still set the rules and then accredited auditors (with proportionate support to hire staff to do all this) could review evidence while a public technical board maintains standards. Courts or administrative tribunals would handle disputes.\n\nCompared to compute and automation taxes, I believe this kind of approach would avoid some of the concerns raised by economists, and instead target a specific harm: companies turning collective human activity and public knowledge into private rents without proportionate return.\n\nIn the current world, this would mean that AI companies would pay a bunch of taxes, which then might fuel, e.g., a national wealth fund, or ideally a global wealth fund. But it also creates a path to lower the burden: license data, work with data unions/trusts, document provenance, run ablations, or give value back to the commons.\n\nOf course, things that seem simple are often laundering complexity. To highlight a few critical questions and considerations:\n\n- Who classifies exactly what systems count as frontier AI, and exactly which organizations are subject to this tax?\n- The very challenging task of accurate capabilities measurement becomes pivotal -- but we need to figure this out anyway!\n- Power dynamics within the ecosystem of evaluation institutions will be critical to success. If they get captured, the whole thing fails. But if those institutions are going to get power anyway, this seems like one of the better uses of that power!\n- Anti-avoidance rules would be essential: synthetic data should inherit provenance obligations from upstream models and datasets; related-party data licenses should face transfer-pricing-style scrutiny; covered revenue should attach to deployment and monetization jurisdiction, not only training location; and open-source or research releases should not automatically exempt closed downstream monetization.\n- Attribution will always be imperfect. 
Capability can come from many things (data, compute, algorithmic progress, post-training tricks, scaffolding, inference-time compute, distribution, interface design, and more).\n- Eventually, the base assumption about capability as a proxy for data dependence might break.\n- The policy would need careful safe-harbor handling for small, open, nonprofit, and public-interest uses, as well as existing reciprocal commons arrangements.\n\nBut the overall direction here seems very promising to me. Please let me know what you think. As time permits (and conditional on your feedback -- is this too crazy, too strong an assumption, redundant with existing proposals, etc.), I hope to put together a more whitepaper-looking version of this blog post to socialize the idea.\n\n## Notes and Links\n\n- [Sanders's recent AI dividend proposal][sanders-ai-dividend] - the motivating op-ed calling for a public ownership stake in large AI companies.\n- [Our 2021 data dividends report][data-dividend-report] - the earlier report this post updates, especially its data-dependence tax and coarse disbursement framing.\n- [Attestation across the AI supply chain][attestation] - background on why provenance, disclosure, and institutional verification matter for AI governance.\n- [The AI evaluation crisis is an opportunity][evaluation-crisis] - related argument for treating evaluation infrastructure as a governance opportunity.\n- [Clear Data Rules][clear-data-rules] - related argument that clearer rules can benefit both data creators and AI companies.\n- [NBER working paper on AI-related taxes and sovereign wealth funds][nber-ai-taxes] - economics background on AI-related tax instruments, growth, investment, diffusion, and wealth-fund design.\n- [AP overview of Sanders, Altman, and Trump discussing public ownership in AI][ap-ai-public-ownership] - mainstream coverage showing how quickly the public-ownership idea became a cross-ideological political conversation.\n- [Washington Post editorial 
board critique of government stakes in AI companies][washpost-sanders-ai-stake] - mainstream editorial skepticism about public equity stakes and government ownership.\n- [Reason critique of the proposed AI wealth fund][reason-sanders-ai-wealth] - libertarian/pro-market critique of the stock-tax and voting-share design.\n- [Cato critique of Sanders's proposal and Trump-era government-ownership precedents][cato-trump-sanders-swf] - pro-market critique that links the Sanders proposal to broader concerns about state ownership and corporate control.\n- [Fortune coverage of David Sacks's warning about AI nationalization][fortune-sacks-ai-equity] - tech-policy/industry-facing coverage of criticism from David Sacks.\n- [EU AI Code of Practice][eu-ai-code] - example of existing or proposed data-transparency obligations that could overlap with the reporting burden discussed here.\n- [Frontier AI auditing ecosystem][frontier-ai-auditing] - institutional model for independent auditing capacity that could support capability measurement and provenance review.\n- [Wikimedia Enterprise][wikimedia-enterprise] - useful example of high-volume commercial reuse supporting a commons institution.\n- [Data Provenance Initiative paper][data-provenance-initiative] - useful on data provenance tracking, dataset licensing, attribution, and why “show receipts” is not totally imaginary.\n- [Kaplan et al. scaling laws][kaplan-scaling-laws] - useful background for the idea that performance can be modeled as a function of data, model size, and compute.\n- [Data mixing work][data-mixing-work] - useful background for thinking about data mixture experiments and how data composition affects model performance.\n\n## Related Data Dividend References\n\nBax, Eric. 2019. “[Computing a Data Dividend](https://arxiv.org/pdf/1905.01805).” *ACM Economics & Computation 2019 (EC ’19)*, poster presentation. arXiv:1905.01805. https://doi.org/10.48550/arXiv.1905.01805\n\nWadhwa, Tarun. 2020. 
“[Economic Impact and Feasibility of Data Dividends](https://datacatalyst.org/wp-content/uploads/2020/06/Economic-Impact-and-Feasibility-of-Data-Dividends-1.pdf).” Data Catalyst Institute white paper.\n\nVincent, Nicholas, Yichun Li, Renee Zha, and Brent Hecht. 2019. “[Mapping the Potential and Pitfalls of ‘Data Dividends’ as a Means of Sharing the Profits of Artificial Intelligence](https://arxiv.org/abs/1912.00757).” arXiv:1912.00757 [cs.CY]. https://doi.org/10.48550/arXiv.1912.00757\n\nVincent, Nicholas, and Brent Hecht. 2023. “[Sharing the Winnings of AI with Data Dividends: Challenges with ‘Meritocratic’ Data Valuation](http://www.nickmvincent.com/static/eaamo_data_dividends.pdf).” *EAAMO ’23 Poster Track*, Boston, MA, USA, October 30-November 1, 2023. Non-archival.\n\n[sanders-ai-dividend]: https://www.sanders.senate.gov/op-eds/the-public-should-own-half-of-the-big-a-i-companies/\n[data-dividend-report]: https://www.nickmvincent.com/static/Data-Dividend_final.pdf\n[attestation]: https://dataleverage.substack.com/p/attestation-across-the-ai-supply\n[evaluation-crisis]: https://dataleverage.substack.com/p/the-ai-evaluation-crisis-is-an-opportunity\n[clear-data-rules]: https://dataleverage.substack.com/p/almost-everybody-including-both-data\n[eu-ai-code]: https://digital-strategy.ec.europa.eu/en/policies/ai-code-practice\n[frontier-ai-auditing]: https://www.averi.org/ourwork/frontier-ai-auditing\n[nber-ai-taxes]: https://www.nber.org/papers/w34873\n[ap-ai-public-ownership]: https://apnews.com/article/sam-altman-ai-bernie-sanders-trump-public-ownership-772224f9cd138eb79d3ef3336858a5d5\n[washpost-sanders-ai-stake]: https://www.washingtonpost.com/opinions/2026/06/03/bernie-sanders-wants-government-stake-ai-companies/\n[reason-sanders-ai-wealth]: https://reason.com/2026/06/02/bernie-sanders-ai-wealth-fund-bill-shows-that-he-doesnt-understand-ai-or-wealth/\n[cato-trump-sanders-swf]: 
https://www.cato.org/blog/trump-opened-door-sanderss-sovereign-wealth-fund\n[fortune-sacks-ai-equity]: https://fortune.com/2026/06/06/former-ai-czar-david-sacks-bernie-sanders-bill-government-equity-stupidity-tax-nationalization-trump-public-stakes/\n[wikimedia-enterprise]: https://wikimediafoundation.org/news/2021/10/25/wikimedia-foundation-launches-wikimedia-enterprise-the-new-opt-in-product-for-companies-and-organizations-to-easily-reuse-content-from-wikipedia-and-wikimedia-projects/\n[data-provenance-initiative]: https://www.nature.com/articles/s42256-024-00878-8\n[kaplan-scaling-laws]: https://arxiv.org/abs/2001.08361\n[data-mixing-work]: https://arxiv.org/abs/2403.16952\n"
}
}
}