Photo by USGS on Unsplash. A beautiful geological photo
that looks a bit like a plot showing something trending upwards.
Once again, we’ve had an eventful few weeks in the space of
data-dependent computing! In this short post, I want to highlight a
series of three news events that suggest we might be seeing a major
shift towards data dignity, though there’s a lot of work to be done to
ensure these developments lead to benefits that are broadly shared.
Major user-generated content platforms charging for their user-generated content
First, we saw that, in very quick succession, Reddit and Stack Overflow
announced plans to change the way platform data flows to large-scale
data users, especially AI operators like OpenAI. Reddit and
StackOverflow are very important data sources for academic research.
Thus, it made a good deal of sense when AI operating firms started
using Reddit data for early language model training.
The leadership of both firms has, nearly in sync, alluded to changing
this status quo. Reddit and StackExchange want AI operators to start
paying for data that has thus far followed something of a free-for-all
paradigm.
(As a side note: shortly before posting this, I came across coverage
from Christopher Mims of the Wall Street Journal on these topics. I was
thrilled to see this message in the WSJ:
“If you’ve ever published a blog, or posted something to Reddit, or
shared content anywhere else on the open web, it’s very likely you have
played a part in creating the latest generation of artificial
intelligence.”
Definitely check it out!)
Quotes from the CEOs sound a lot like data dignity talking points
The statements from the chief executives are quite promising. I’m
going to include several quotes in full. In his interview with Mike
Isaac of the New York Times, Steve Huffman of Reddit said:
“The Reddit corpus of data is really valuable… but we don’t need to
give all of that value to some of the largest companies in the world for
free.”
He further highlighted the unique value Reddit users provide:
“There’s a lot of stuff on the site that you’d only ever say in
therapy, or A.A., or never at all… Crawling Reddit, generating value and
not returning any of that value to our users is something we have a
problem with.”
In his blog post, Stack Overflow CEO Prashanth Chandrasekar stated:
“If AI models are powerful because they were trained on open source
or publicly available code, we want to craft models that reward the
users who contribute and keep the knowledge base we all rely on open and
growing, ensuring we remain the top destination for knowledge on new
technologies in the future.”
Several quotes from this post (which was controversial, primarily
because of allusions to using generative AI on the platform) really
resonated with the thinking I’ve been advocating for.
“AI systems are, at their core, built upon the vast wealth of human
knowledge and experiences… Allowing AI models to train on the data
developers have created over the years, but not sharing the data and
learnings from those models with the public in return, would lead to a
tragedy of the commons”.
Finally, there’s a nice image in the original post with the caption,
“AI is built on our collective knowledge, and we must all participate in
building its future”.
Firms with Data Leaning On Firms Who Want Data
These announcements are examples of data leverage by firms. These
firms have something AI operators want — data — so they have leverage to
demand payments or concessions. It’s not (yet) the hopeful vision of
grassroots data leverage I’ve laid out in some of my work, in
which groups of Reddit and StackOverflow contributors threaten to
withhold, alter, or redirect data unless they get paid or stipulations
are placed on how resulting AI systems are used.
However, the above quotes really do suggest that firm leadership
wants to reward users down the line. If you’re skeptical that this is
driven purely by a sense of corporate responsibility, there’s a very
rational explanation here as well: the more money platforms make from
data, the more they need to keep their core data contributors happy.
If Reddit and SO are relying on a continuous supply of high-quality data
to get money from OpenAI, Microsoft, Google, and company, this means
that overnight, users of these platforms gain a sort of second-order
leverage.
Users Who Create Data Leaning on Firms Who Sell that Data
I think there’s a lot of value in noting the similarities between
contributions to these platforms and labor (specifically, cartographic
labor). Extending the metaphor, the value of Reddit and SO depends a
lot on this data labor (but also on the paid labor of engineers, ad
sales, etc.). The more that these platforms make off any data deals, the
more leverage users have over firm leadership, in the same way that if I
run a business that suddenly starts making a bunch of products that use
steel, the steelworkers’ union gains a lot more leverage over my
business overnight.
Yes, licensing might mitigate this leverage; there’s a “Gotcha!” with
Reddit’s not-so-friendly terms:
“When Your Content is created with or submitted to the Services, you
grant us a worldwide, royalty-free, perpetual, irrevocable,
non-exclusive, transferable, and sublicensable license to use, copy,
modify, adapt, prepare derivative works of, distribute, store, perform,
and display Your Content and any name, username, voice, or likeness
provided in connection with Your Content in all media formats and
channels now known or later developed anywhere in the world.”
(TLDR: Reddit can do anything they want with your Reddit posts and
comments).
On the other hand, StackOverflow has quite open terms
(it’s all Creative Commons). But both platforms have already been scraped
and crystallized as weights. The main value proposition that the
platforms can offer data-hungry AI operators is this: if you want to
quickly adapt to news, trends, gossip, cultural shifts, and new
programming languages, you’re going to need the data our users have yet
to produce. This means users always have a lever here: stop or alter
their data-contributing behavior going forward.
Europe’s Proposed Legislation
On April 27th, Sam Schechner of the Wall Street Journal reported
that European Union legislators have drafted legislation requiring that
AI operators disclose any copyrighted materials used in training (see
more coverage of this concern here).
If passed, this legislation could open the door for even more
organizations — beyond Reddit and StackOverflow — to try seeking payment
or credit for data. The article notes that this legislation won’t
actually provide the final word on whether the current mode of scraping
is legal (arguably in Europe, it’s not).
Importantly, while this legislation is still just a draft, news
coverage suggests it seems fairly likely to pass. This could mean some
AI operators simply stop offering services in the EU, but it will
probably lead to a substantial increase in generative AI dataset
documentation compared to the current GPT-4 status quo (i.e., not so
much).
Will a Small Group of Higher-ups Reap all the Benefits of these Changes? Data Coalitions Might Help.
On one hand, these developments clearly favor data creators over data
users. However, one serious concern is that all these developments will
primarily empower large organizations and not individual data creators.
For instance, we can imagine one future in which a small group of
higher-ups at Reddit, StackOverflow, and large media conglomerates are
the primary beneficiaries of data payments.
This is not a foregone conclusion, however. As noted above, if
Reddit, StackOverflow, or a record company suddenly leans on OpenAI and
sees an influx of generative AI profits, there is an immediate avenue
for the true upstream source of data — individual users and creators —
to lean on the platform. The people at the top of the river will always
have the final say in this kind of arrangement: the models and
platforms truly are downstream of users.
Ideally, the bargaining positions of users and creators might be
solidified by increased prominence and formal recognition of data
cooperatives. In the case of media, this arguably already exists in the
form of organizations like the Screen Actors Guild (if production
companies get some profits from AI companies that use frames from
movies, SAG could in turn push to get some of these profits from the
production companies). Put simply, some creative industries already have
organizations that support the labor of content creators.
However, unless this kind of organizing is extended to more domains
of “creation”, in many contexts it may be the case that a small group of
higher-ups do reap the majority of the benefits of these changes. This
means that researchers and activists still have our work cut out for
us: we need to keep developing tools for coordination and working to
shape policy that allows for creators themselves to share in the
winnings of AI.
Published 2023-05-01 as “Reddit, StackOverflow, and Europe: All Trending Towards Data Dignity”.