Focusing on feedback loops -- connecting modern AI to early cybernetics-style thinking -- could help solve looming challenges and support democratic inputs to AI.

Fig 1. Left: Charon, an early steersman (Carole Raddato from
FRANKFURT, Germany, CC BY-SA 2.0,
via Wikimedia Commons). Right: A summary diagram showing the five-stage
Data Pipeworks model.
A preface
This post is going to be a bit different from some of my previous
posts. Rather than commenting on data leverage-related news or recent
research (which I intend to keep doing – there’s been a lot of exciting
movement in the space of data leverage tools, and datalevers.org needs
an update), I'm going to share some in-progress writing. Specifically,
I’ll share an early draft of a post describing the “Data Pipeworks”
underlying modern AI, and how the pipeworks model can support efforts to
promote democratic inputs to AI.
This is something of an introductory “chapter”, part of a series of
posts that will introduce this pipeworks model and a variety of research
questions that I believe the pipeworks can help answer. The posts will
also argue for bringing together many disciplinary lenses to tackle
looming challenges in AI (especially thorny governance challenges and
the looming threat of major social dilemmas related to data), adding to
a chorus of calls along these lines.
Key points in this post:
- We should consider all the stages involved in creating data for
  AI: Our knowledge and values (Stage 1) are translated into records
  (Stage 2), aggregated into datasets (Stage 3), and used to fit models
  (Stage 4). These models are deployed in systems (Stage 5) that impact
  the world.
- A “Data Pipeworks” model is especially compatible with systems /
  control / cybernetics thinking, which emphasizes sources of
  feedback. One potential source of feedback in the Data
  Pipeworks is the manipulation of data flow by contributors (“data
  leverage”). We should work to more precisely specify such cybernetic
  systems and their feedback sources, as these determine the data with
  which AI developers work.
- It follows that AI is highly dependent on social dynamics and
  human factors that shape the “landscape of records”. As we rely more on
  large-scale data-dependent systems like generative AI, a human factors
  perspective will become more important.
- Tracing the flow of data to specific individuals and groups can
  allow us to quantify the relative weight of an individual or group, and
  foster pluralistic governance of AI and another notion of alignment.
- As data contributors gain agency to affect data flow, building AI
  systems will require resolving social dilemmas.
- At the same time, new levels of data agency and a human-centered
  perspective may unlock new ways to think about data scaling (e.g. in
  terms of “units of knowledge”) and new ways to accelerate progress in
  computing.
At the highest level, this series of posts is really an extended
argument for using a broader lens when discussing AI governance. In
part, this argument stems from my long-term interest in data labor and
leverage as a governance mechanism (these concepts emerge when we zoom
out from a machine learning problem definition to consider contexts in
which data is created).
Fig 2. A visual depiction of zooming out from a specific ML
problem to consider the context that determines what data records are
created.
However, this broader lens has useful applications beyond
advancing the data leverage research agenda – it can be
generally helpful to think about upstream factors that shape AI
development, and to take a feedback-focused approach to thinking
about downstream impacts.
In terms of specific problems that motivated this post, I especially
wanted to write down a longer argument for why I think we’re inching
towards a future in which basically all “AI” work will have a dominant
human factors component (some might argue, “always has been!”, and that
HCI will “reap
the AI harvest”).
I believe we’re going to see a coming wave of data-related social
dilemmas (arguably, it’s already started in the realm of art),
and pipeworks-thinking can help to solve them in a way that effectively
balances technological progress with inevitable ethical challenges that
arise. I often find myself a bit defensive about this claim — while the
pipeworks will highlight ways that data creators might exert leverage by
lowering AI capabilities in the short term, I see this leverage as a
part of a “pro-computing” research agenda.
I began my exploration of this concept by thinking about the kinds of
questions often posed and solved in human-computer interaction and
machine learning venues (the venues that I engage with most), and then
asking, how might we start to bring in more control engineering,
information theory, communication, economics, and sociology?
After many iterations of writing, I stumbled on a recurring theme:
that AI is downstream of human factors, including complex systems with
feedback that is ecological, economic, and otherwise social in nature.
I realized that perhaps what I’ve really been
grappling with is an attempt to describe the data labor and data
leverage concepts in the language of cybernetics (i.e. Wiener's
cybernetics).
This gave me a reason to be concerned I might be retreading old ground –
after all, AI as a discipline arguably came to exist by splintering off
from the field of cybernetics! I thought to myself — “Did I just write a
lengthy argument that basically just says that the AI field needs to
care more about work from… the AI field?” (The circularity of this is
fitting, given cybernetic thinking’s emphasis on causal loops).
Indeed, Wiener’s writings from the 1950s and 1960s
resonate with ongoing discussions about generative AI’s economic
impacts. And certainly work on control theory has continued to advance
(and work explicitly taking a cybernetics lens continues as well, see
e.g. Zargham
and Shorish 2023).
Even if the Data Pipeworks really is just an argument for “Cybernetic
AI” (which some might say is just “AI AI”), I think there are some novel
ideas here that will be constructive, and I think the old ideas are
worth dredging up again.
None of the posts in this series are (yet) structured in a way to
answer a specific research question, though they relate to many of my
ongoing research projects. If you happen to be reading this and are
excited by a particular question or aspect, please let me know!
The following “chapters” of this series are all partially complete,
but still being refined. If you’d like to see a full draft of this
document, please send me a note!
Thus ends my preface. Onwards to describing the Data Pipeworks.
An Extended Abstract and More Context
This series of posts will present a five-stage model – the “Data
Pipeworks” – that aims to describe the process by which human knowledge
and values flow to deployed data-dependent systems (i.e., “AI”). Our
knowledge and values (Stage 1) are translated into records (Stage 2),
aggregated into datasets (Stage 3), and used to fit models (Stage 4).
These models are deployed in systems (Stage 5) that impact the world. We
hypothesize that considering the full scope of the Data Pipeworks will be
helpful in designing solutions that make AI more prosocial, in
particular by enabling realistic assessment of where democratic inputs
(including data bargaining) can be implemented. By describing the
pipeworks, we also hope to highlight the value of a broad lens that
considers human-computer interaction and AI as part of a shared
problem-solving framework. A full accounting of the pipeworks will
involve negotiating formalisms and problem definitions from across
machine learning, human-computer interaction, signal processing, control
theory, economics, sociology, and more. Put another way – because
“AI” is dependent on data from a large population, returning to a
cybernetic frame that emphasizes steering, feedback, and communication
can naturally support democratic AI.
I am hopeful that the zoomed-out view of the Data Pipeworks can help
to reveal (and solve) several related challenges, including: social
dilemmas that arise when people make individual decisions about data
flow, support for data coalitions, the design of data markets,
characterizing models and datasets (including “synthetic” datasets) in
terms of the set of contributors, and more.
Context for the current draft
This current draft is meant to provide one perspective on how many
lenses might be brought (back) together to describe and answer questions
about the zoomed-out Data Pipeworks. My core expertise is in
interdisciplinary work in human-computer interaction, machine learning,
and responsible AI. I especially welcome feedback on any
formalism or disciplinary perspective you feel is missing or
could be better represented.
Another important note is that this current document has two goals.
On the descriptive side, I hope to provide a very thorough description
of what we’re doing when we feed human-generated data into generative AI,
recommender systems, search engines, classifiers, etc. In a certain
sense, simply describing the data pipeworks can be a useful conceptual
contribution that might help guide empirical work related to
data-centric AI (for instance, answering research questions about how
data-related social movements, new data markets, and new data-related
regulations might impact AI capabilities).
I argue here that the social dynamics that determine what data
records are created and how data flows from people to AI systems are
dominant in determining AI capabilities. This means that the design of
interfaces that facilitate record creation, sensors that passively
create records, and incentives for record creation should all be a top
R&D priority for the AI field. I will also discuss the role of
social dilemmas and how they could be resolved by carefully designed
incentive systems at each stage of the AI pipeline. This comprehensive
view can advance discussions about how to produce a more democratic,
fair, and robust ecosystem.
However, this document is also influenced by a specific set of views
about what kind of governance paradigm for AI we should work
towards. Specifically, I am particularly interested in
realistic paths towards large-scale democratic participation in
steering AI development. It is certainly possible some readers may agree
with descriptive claims laid out here but disagree with what I see as
takeaways for building prosocial AI.
With that in mind, please send me any thoughts you may have!
The first draft of this was written by Nick Vincent, but I shall
shift to using the term “we” from here onwards with the assumption that
this will inevitably become a collaborative piece of writing. All errors
and missing references can be attributed to the first author.
The Stages of the Data Pipeworks
Stage 1: Knowledge and Values. First, we posit
that human activity in the physical and digital worlds generates a
"Reality Signal," consisting of facts and preferences, which acts as the
raw material for AI systems. We do not intend to make any claims about
the actual computability or measurability of reality – rather, we mean
to highlight that there exists a theoretical massive set of signals that
most “AI” models of human-generated data are sampling from.
Stage 2: Creating Records. Second, we discuss
the "Sampling Step," where sensor networks and forms (user interfaces)
collect these signals and transform them into structured data
records.
Stage 3: Datasets. Third, we describe the
"Filtering Stage," wherein records are aggregated into datasets by
various organizations subject to social, economic, and legal
constraints.
Stage 4: Models. Fourth, we describe the machine
learning modeling process as a form of "compression," turning aggregated
datasets into useful input-output mappings. We emphasize that the
choices made here are highly influenced by the preceding
stages.
Stage 5: Deployed Systems. Fifth, we explore how
these models lead to actuation in real-world scenarios, capturing
economic value and affecting human behavior and the physical
world.
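To make the five stages a bit more concrete, here is a minimal Python sketch of the pipeworks as a chain of functions. Everything in it is an assumption made for illustration: the toy `REALITY_SIGNAL`, the function names, and the "most common preference" stand-in for a model are hypothetical and not part of any real system. The point is only to show how choices at Stages 2 and 3 determine what a Stage 4 model can learn.

```python
import random
from collections import Counter

# Stage 1 (hypothetical toy data): a tiny "Reality Signal" of preferences.
REALITY_SIGNAL = [
    {"source": "alice", "topic": "lunch", "preference": "soup"},
    {"source": "bob", "topic": "lunch", "preference": "salad"},
    {"source": "carol", "topic": "lunch", "preference": "salad"},
]


def create_records(signal, sample_rate, rng):
    """Stage 2: sampling. Only some events in the signal become records."""
    return [event for event in signal if rng.random() < sample_rate]


def aggregate_dataset(records, allowed_sources):
    """Stage 3: filtering. Keep only records the aggregator may use."""
    return [r for r in records if r["source"] in allowed_sources]


def fit_model(dataset):
    """Stage 4: 'compression' into an input-output mapping.

    Assumption: the 'model' is just the most common preference per topic.
    """
    counts = {}
    for record in dataset:
        counts.setdefault(record["topic"], Counter())[record["preference"]] += 1
    return {topic: c.most_common(1)[0][0] for topic, c in counts.items()}


def deploy(model, topic, default="no recommendation"):
    """Stage 5: the deployed system answers queries using the model."""
    return model.get(topic, default)


def run_pipeworks(signal, allowed_sources, topic, sample_rate=1.0, seed=0):
    """Run Stages 2 through 5 on a given Stage 1 signal."""
    rng = random.Random(seed)
    records = create_records(signal, sample_rate, rng)
    dataset = aggregate_dataset(records, allowed_sources)
    model = fit_model(dataset)
    return deploy(model, topic)


everyone = {"alice", "bob", "carol"}
print(run_pipeworks(REALITY_SIGNAL, everyone, "lunch"))
print(run_pipeworks(REALITY_SIGNAL, {"alice"}, "lunch"))
```

In this toy setup, the deployed system's answer flips when two of the three contributors withhold their records, which is a very small instance of the data leverage idea discussed in this series.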
Introducing the Data Pipeworks
Overview
In this series, we will lay out a five-stage model of the "Data
Pipeworks". This model is meant to describe (with considerable use of
approximation) how human knowledge and values (stage 1) emerge
from the physical world, lead to the creation of records (stage
2) that are aggregated into datasets (stage 3), which are
compressed into input-output mapping models (stage 4) embedded
into deployed systems (stage 5) that can be connected to
actuators that do things in the world. Typically, deployed systems aim
to create value for whoever built the system, but also create
externalities, and will also create feedback loops by changing the world
itself.
There are many ongoing discussions about how we might build a world
in which the benefits of deployed AI systems are shared more broadly,
perhaps by decentralizing power over monolithic AI systems (e.g., we
cast regular votes as part of a giant civic body that governs
ChatGPT) or by pushing for a world with many competing AI
models (e.g., we all select from a vibrant “pool” of models in a
commons). These discussions incorporate aspects of AI governance, AI
safety, the economics of automation
and prediction, alongside general questions about aspirational
arrangements of our political economy.
Typically, “technical” AI research contributions operate within a
well-defined problem space. For instance, when describing a new modeling
technique, authors will provide formal descriptions of the data being
modeled and define a clear objective. While academic researchers can
rightfully complain that ML and AI work often does not feel as
scientific as some other “hard sciences” (see e.g. discussions of “troubling trends” in ML),
individual ML/AI problems are typically characterized quite formally
(see e.g. here for an
overview of many well-defined problems in probabilistic machine
learning).
However, questions about sharing the benefits of, and control over,
AI exist outside the scope of a single nicely defined optimization
problem or other task. They could be explored with some kind of
political and/or economic agent-based model in which tokens representing
power and resources are allocated over agents. But this requires
contending with how AI systems impact the world and how the records
being fed into modeling pipelines came to be. Agent-based models are
often something we fall back on when a clean optimization problem cannot
be written down.
Indeed, many AI ethics concerns can be framed in terms of some actor
failing to consider either downstream model actuation or upstream data
collection factors. That is to say, we could draw out a causal graph
describing how an AI system impacts the world and take a complex systems
approach to modeling our described closed system, but this is not
typically what we do in most ML scholarship (with reasonable
justification: trying to make predictions about complex systems is
extremely challenging).
And without a doubt, the current approach of zooming in on specific ML
tasks has led to creation of many genuinely useful ML models (though in
some cases, we may fall for the Fallacy of AI
Functionality).
Building off many conversations along these lines, a key hypothesis
motivating this document is that describing the Data Pipeworks
in an end-to-end fashion can strengthen efforts to build alternate AI
paradigms. In other words, this is a continued effort of
“design space mapping”: describing how things are so we can identify
levers, knobs, and fertile space where we might insert, or “grow”, a new
lever.
Taking a control / cybernetics-inspired approach is especially
fruitful because these approaches are hyper-focused on identifying all
the levers and knobs that could be a source of feedback for a
system.
The goal is not necessarily to try to build some kind of complete
agent-based model of the world (again, a daunting
task), but rather to capture how human activities “flow” to deployed AI
systems with enough fidelity so that when considering a particular
intervention, we can speak more specifically about:
- what processes are upstream or downstream relative to that
  intervention
- what power structures control what happens upstream of that
  intervention – this can help us be realistic about what
  interventions might achieve.
- where feedback loops are likely to be of concern.
Eventually, writing out the specification for a formal
cybernetics-style system (perhaps drawing directly on ecological or
biological models) could be feasible. And this system could be explored
by a carefully designed (wrong but useful) agent-based model.
In describing the full flow, we can do a better job reasoning about
how distinct interventions – e.g. tools that empower data creators to
engage in collective action, new
laws that affect how AI systems are deployed, new licensing norms that determine how
model weights are shared, new ML work that
explains model capabilities in terms of data – fit together and
complement, or depend on, each other.
Specifically, we’ll describe our pipeworks in terms of a signal
generated by the collective activity of all humanity that is passed
through three filters – the creation of records, the aggregation of
datasets, and algorithmic model training – and connected to some set of
actuators. Each actor who ultimately builds/deploys an AI system is
subject to their own unique set of these three filters (based on the
records they have access to, how they aggregate data, and choices they
make in implementing their learning algorithm).
In describing the Data Pipeworks, we’ll naturally have to start
defining different classes of agents, which will become useful when we
do want to use computational modeling to explore certain parts of the
pipeline (accompanied by the knowledge these models will be wrong in
some ways but useful).
Fig 3: A diagram describing all five stages of the Data
Pipeworks.
Key Claims
The Primacy of Social Dynamics and Human Factors
In describing the Data Pipeworks, we are making an argument that social
dynamics and human factors are dominant factors in determining AI
capabilities (because things that are more upstream in the pipeworks
dominate). This is not too wildly controversial – after all, if AI aims
to model human behavior or the production of human outputs, then of
course any “AI”-related activity is circularly dependent on how humans
behave. The Data Pipeworks framing can help us describe how the
specific complex systems and emergent behaviors that underlie
record creation (and perhaps even dataset aggregation and model
training) shape the landscape upon which we can do technical work. To
the extent that “technical” AI work aims to be a sort of civil
engineering (building sound structures within our “space” of knowledge),
social processes determine our topology, which may at times block off
entire realms of exploration. In geographies with no rivers, there’s no
need to build bridges. In geographies where the soil is too loose, we
cannot guarantee certain measures of stability, etc.
We can think of record creation, dataset aggregation, and model
training as three filtering stages that heavily influence the data
distributions that many computer scientists and mathematicians doing
technical ML work explore and work with mathematically. This in turn can
drastically change training and evaluation of AI models.
Interfaces, Sensors, and Incentives are “Core AI”
Further, the Data Pipeworks makes the case that the design of
interfaces and sensors -- and the incentive mechanisms that guide
interactions with interfaces and sensors -- should be a top-tier R&D
priority for the AI community. HCI and technical ML research goals are
deeply intertwined. Any widely adopted interface designs will have major
downstream effects on choices relating to dataset management, model
training, and model deployment. Again: if a core group of “technical” ML
and AI researchers primarily wants to do math and engineering, the social
factors shape the landscape on which that technical work will be built.
Describing data flow in stages can help us design interventions that
can improve AI capabilities or give groups of people more power.
Ideally, we can do both at once by reducing inequalities in
power to healthy levels while improving capabilities of AI systems that
are broadly available.
Social Dilemmas in AI
We’ll also make a conjecture about social dilemmas and AI that I
believe the Data Pipeworks model can help resolve:
The record aggregation tasks at the heart of many AI technologies
enable social dilemmas. AI systems are (almost always) shared,
quasi-public goods. Their creation can be well-described as a classic collective
action problem: a group of individuals have (heterogeneous) access to
"building block resources" they can throw into a shared pot to build
some good with shared benefits, and varying levels of interest in the
good. In other words, the process of going around house to house to
collect records to train an AI system looks a lot like going around
house to house to collect donations to renovate the park in the town
square.
In some dataset-building contexts, there is great potential to free
ride (some people skate by without contributing data because a system is
already good enough). In others, there is potential to fail to achieve
critical mass (people choose not to contribute data, even if large-scale
data contribution would lead to a broadly beneficial system). In
practice now, these social dilemmas are often solved by a "dictator
solution": tech companies leverage information asymmetry to just collect
records describing the activities of people who may not realize they're
playing a resource-sharing game over a shared good.
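The two failure modes above (free riding and failing to reach critical mass) can be illustrated with a toy threshold public goods game. This is a rough sketch under strong assumptions, not a model of any real dataset: the step-function `system_quality`, the parameter names, and the numbers used are all hypothetical.

```python
def system_quality(n_contributors, critical_mass):
    """Assumed step function: the system only works with enough contributors."""
    return 1.0 if n_contributors >= critical_mass else 0.0


def payoff(contributes, n_contributors, critical_mass, benefit, cost):
    """Everyone gets the benefit if the system works; only contributors pay."""
    shared_value = benefit * system_quality(n_contributors, critical_mass)
    return shared_value - (cost if contributes else 0.0)


def prefers_to_withhold(n_others, critical_mass, benefit, cost):
    """True if one person does strictly better by withholding their data,
    given that n_others people already contribute."""
    pay_if_contribute = payoff(True, n_others + 1, critical_mass, benefit, cost)
    pay_if_withhold = payoff(False, n_others, critical_mass, benefit, cost)
    return pay_if_withhold > pay_if_contribute


# Hypothetical parameters: 3 contributors needed, benefit 5, cost 1.
for n_others in (0, 2, 5):
    print(n_others, prefers_to_withhold(n_others, critical_mass=3, benefit=5.0, cost=1.0))
```

With these made-up numbers, a person prefers to withhold both when the system is already good enough without them (free riding, 5 others contributing) and when too few others contribute for their data to matter (critical mass failure, 0 others). They only prefer to contribute when they are pivotal (2 others). Carefully designed incentives would aim to change these payoffs.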
In a world with increased data-agency-per-person, a dictator solution
doesn’t work. To get models with similar predictive quality, we need
carefully designed incentive systems at each stage of the AI
pipeline.
One solution might simply involve placing more types of record
creation in the domain of formal markets (i.e., more crowdwork
marketplaces or more data escrow-type systems like DataStation).
Better yet, we might solve the social dilemmas by supporting markets
in which the primary actors are data coalitions acting on behalf of
individuals.
In the section below describing the five-stage model, we mostly avoid
formalism. However, a key value add of this framework is the ability to
translate between different lenses. In the Appendix, we will begin to
describe the Pipeworks using three different disciplinary perspectives
and their norms around formalism: machine learning, information theory,
and signal processing.
The end of the Introduction
This ends my introductory post to the Data Pipeworks. Please do let
me know if you found this helpful. Or alternatively, let me know if you
found this too indulgent or meandering, and just want to see more
discussions of recent news and recent papers. If you made it this far,
I certainly want to hear what you think!
I hope to share the following “chapters” (expanding on the individual
stages of the pipeworks, and then summarizing specific research
questions and computational modeling approaches that emerge from this
thinking) on a semi-regular cadence.
ATProto raw JSON
{
"uri": "at://did:plc:doxvahqvyhyqf32v7wz7p5xk/site.standard.document/3mizeeq6ryxi3",
"cid": "bafyreigberaewugtwe6tqefqynfckiotjllpl52kdwumugdhlhspkiy77q",
"value": {
"path": "/3mizeeq6ryxi3",
"site": "at://did:plc:doxvahqvyhyqf32v7wz7p5xk/site.standard.publication/3lzrsw2kvwc2m",
"$type": "site.standard.document",
"title": "Building a Data Pipeworks for Democratic AI: From Human Knowledge to Records to AI Systems",
"content": {
"$type": "pub.leaflet.content",
"pages": [
{
"$type": "pub.leaflet.pages.linearDocument",
"blocks": [
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.image",
"image": {
"$type": "blob",
"ref": {
"$link": "bafkreiehaxha2fbj7tgpilw77kmwal5jielslbhh4ziivsuuhj57tnekca"
},
"mimeType": "image/png",
"size": 365099
},
"aspectRatio": {
"$type": "pub.leaflet.blocks.image#aspectRatio",
"width": 994,
"height": 540
}
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 188,
"byteStart": 93
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 93,
"byteStart": 81
},
"features": [
{
"uri": "https://creativecommons.org/licenses/by-sa/2.0",
"$type": "pub.leaflet.richtext.facet#link"
},
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 81,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "Fig 1. Left: Charon, an early steersman (Carole Raddato from FRANKFURT, Germany, CC BY-SA 2.0, via Wikimedia Commons). Right: A summary diagram showing the five stage Data Pipeworks model."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 1,
"facets": [],
"plaintext": "A preface"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "This post is going to be a bit different from some of my previous posts. Rather than commenting on data leverage-related news or recent research (which I intend to keep doing – there’s been a lot of exciting movement in the space of data leverage tools, and datalevers.org needs an update), I'm going to share some in-progress writing. Specifically, I’ll share an early draft of a post describing the “Data Pipeworks” underlying modern AI, and how the pipeworks model can support efforts to promote democratic inputs to AI."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "This is something of an introductory “chapter”, part of a series of posts that will introduce this pipeworks model and a variety of research questions that I believe the pipeworks can help answer. The posts will also argue for bringing together many disciplinary lenses to tackle looming challenges in AI (especially thorny governance challenges and the looming threat of major social dilemmas related to data), adding to a chorus of calls along these lines."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Key points in this post:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "We should consider all the stages involved in creating data for AI: Our knowledge and values (Stage 1) are translated into records (Stage 2), aggregated into datasets (Stage 3), and used to fit models (Stage 4). These models are deployed in systems (Stage 5) that impact the world."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 136,
"byteStart": 128
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "A “Data Pipeworks” model is especially compatible with systems / control/ cybernetics thinking, which emphasizes sources of feedback. One potential source of feedback in the Data Pipeworks is the manipulation of data flow by contributors (“data leverage”). We should work to more precisely specify such cybernetic systems and their feedback sources, as these determine the data with which AI developers work."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 237,
"byteStart": 233
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "It follows that AI is highly dependent on social dynamics and human factors that shape the “landscape of records”. As we rely more on large-scale data-dependent systems like generative AI, a human factors perspective will become more important."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 203,
"byteStart": 194
},
"features": [
{
"uri": "https://dataleverage.substack.com/p/measuring-relative-ai-alignment-in-terms-of-data-pipelines",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 154,
"byteStart": 143
},
"features": [
{
"uri": "https://www.radicalxchange.org/media/blog/why-i-am-a-pluralist/",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "Tracing the flow of data to specific individuals and groups can allow us to quantify the relative weight of an individual or group, and foster pluralistic governance of AI and another notion of alignment."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 112,
"byteStart": 97
},
"features": [
{
"uri": "https://www.annualreviews.org/doi/abs/10.1146/annurev.soc.24.1.183",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "As data contributors gain agency to affect data flow, building AI systems will require resolving social dilemmas."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "At the same time, new levels of data agency and a human-centered perspective may unlock new ways to think about data scaling (e.g. in terms of “units of knowledge”) and new ways to accelerate progress in computing."
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "At the highest level, this series of posts is really an extended argument for using a broader lens when discussing AI governance. In part, this argument stems from my long term interest in data labor and leverage as a governance mechanism (these concepts emerge when we zoom out from a machine learning problem definition to consider contexts in which data is created)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 138,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "Fig 2. A visual depiction of zooming out from a specific ML problem to consider the context that determines what data records are created."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "However, this broader lens has useful applications beyond specifically advancing the data leverage research agenda – it can be generally helpful to think about upstream factors that shape AI development, as well as taking a feedback-focused approach to thinking about downstream impacts."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 322,
"byteStart": 303
},
"features": [
{
"uri": "https://jeffreybigham.com/blog/2019/the-coming-ai-autumnn.html",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "In terms of specific problems that motivated this post, I especially wanted to write down a longer argument for why I think we’re inching towards a future in which basically all “AI” work will have a dominant human factors component (some might argue, “always has been!”, and that HCI will “reap the AI harvest”)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 130,
"byteStart": 127
},
"features": [
{
"uri": "https://www.technologyreview.com/2023/10/23/1082189/data-poisoning-artists-fight-generative-ai/",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "I believe we’re going to see a coming wave of data-related social dilemmas (arguably, it’s already started in the realm of art), and pipeworks-thinking can help to solve them in a way that effectively balances technological progress with inevitable ethical challenges that arise. I often find myself a bit defensive about this claim — while the pipeworks will highlight ways that data creators might exert leverage by lowering AI capabilities in the short term, I see this leverage as a part of a “pro-computing” research agenda."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "I began my exploration of this concept by thinking about the kinds of questions often posed and solved in human-computer interaction and machine learning venues (the venues that I engage with most), and then asking, how might we start to bring in more control engineering, information theory, communication, economics, and sociology?"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 415,
"byteStart": 404
},
"features": [
{
"uri": "https://en.wikipedia.org/wiki/Cybernetics",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 388,
"byteStart": 377
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "After many iterations of writing, I stumbled on a recurring theme: that AI is downstream of human factors, including complex systems with feedback, with the feedback mixing an ecological, economic, and otherwise social nature. I realized that perhaps what I’ve really been grappling with is an attempt to describe the data labor and data leverage concepts in the language of cybernetics (i.e. Wiener's cybernetics). This gave me a reason to be concerned I might be retreading old ground – after all, AI as a discipline arguably came to exist by splintering off from the field of cybernetics! I thought to myself — “Did I just write a lengthy argument that basically just says that the AI field needs to care more about work from… the AI field?” (The circularity of this is fitting, given cybernetic thinking’s emphasis on causal loops)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 289,
"byteStart": 265
},
"features": [
{
"uri": "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4569037",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 51,
"byteStart": 46
},
"features": [
{
"uri": "https://www.jstor.org/stable/1705998?seq=2",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 41,
"byteStart": 36
},
"features": [
{
"uri": "https://en.wikipedia.org/wiki/The_Human_Use_of_Human_Beings",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "Indeed, Wiener’s writing from the 1950s and 1960s resonate with ongoing discussions about generative AI’s economic impacts. And certainly work on control theory has continued to advance (and work explicitly taking a cybernetics lens continues as well, see e.g. Zargham and Shorish 2023)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Even if the Data Pipeworks really is just an argument for “Cybernetic AI” (which some might say is just “AI AI”), I think there are some novel ideas here that will be constructive, and I think the old ideas are worth dredging up again."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "None of the posts in this series are (yet) structured in a way to answer a specific research question, though they relate to many of my ongoing research projects. If you happen to be reading this and are excited by a particular question or aspect, please let me know!"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "The following “chapters” of this series are all partially complete, but still being refined. If you’d like to see a full draft of this document, please send me a note!"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Thus ends my preface. Onwards to describing the Data Pipeworks."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 1,
"facets": [],
"plaintext": "An Extended Abstract and More Context"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "This series of posts will present a five-stage model – the “Data Pipeworks\" – that aims to describe the process by which human knowledge and values flow to deployed data-dependent systems (i.e., “AI”). Our knowledge and values (Stage 1) are translated into records (Stage 2), aggregated into datasets (Stage 3), and used to fit models (Stage 4). These models are deployed in systems (Stage 5) that impact the world. We hypothesize that considering the full scope of the Data Pipework will be helpful in designing solutions that make AI more prosocial, in particular by enabling realistic assessment of where democratic inputs (including data bargaining) can be implemented. By describing the pipeworks, we also hope to highlight the value of a broad lens that considers human-computer interaction and AI as part of a shared problem-solving framework. A full accounting of the pipeworks will involve negotiating formalisms and problem definitions from across machine learning, human-computer interaction, signal processing, control theory, economics, and sociology, and more. Put another way – because “AI” is dependent on data from a large population, returning to a cybernetic frame that emphasizes steering, feedback, and communication can naturally create democratic AI."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "I am hopeful that the zoomed out view of the Data Pipeworks can help to reveal (and solve) several related challenges, including: social dilemmas that arise when people make individual decisions about data flow, support for data coalitions, the design of data markets, characterizing models and datasets (including “synthetic” datasets) in terms of the set of contributors, and more."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"facets": [],
"plaintext": "Context for the current draft"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "This current draft is meant to provide one perspective on how many lenses might be brought (back) together to describe and answer questions about the zoomed out Data Pipeworks. My core expertise is in interdisciplinary work in human-computer interaction, machine learning, and responsible AI. I especially welcome feedback on claims about a particular formalism or disciplinary perspective you feel are missing or could be better represented."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Another important note is that this current document has two goals. On the descriptive side, I hope to provide a very thorough description of what we’re doing when we feed human-generated data in generative AI, recommender systems, search engines, classifiers, etc. In a certain sense, simply describing the data pipeworks can be a useful conceptual contribution that might help guide empirical work related to data-centric AI (for instance, answering research questions about how data-related social movements, new data markets, and new data-related regulations might impact AI capabilities)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "I argue here that the social dynamics that determine what data records are created and how data flows from people to AI systems are dominant in determining AI capabilities. This means that the design of interfaces that facilitate record creation, sensors that passively create records, and incentives for record creation should all be a top R&D priority for the AI field. I will also discuss the role of social dilemmas and how they could be resolved by carefully designed incentive systems at each stage of the AI pipeline. This comprehensive view can advance discussions about how to produce a more democratic, fair, and robust ecosystem."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 194,
"byteStart": 185
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 124,
"byteStart": 118
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "However, this document is also influenced by a specific set of views about what kind of governance paradigm for AI we should work towards. Specifically, I am particularly interested in realistic paths towards large-scale democratic participation in steering AI development. It is certainly possible some readers may agree with descriptive claims laid out here but disagree with what I see as takeaways for building prosocial AI."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "With that in mind, please send me any thoughts you may have!"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "The first draft of this was written by Nick Vincent, but I shall shift to using the term “we” from here onwards with the assumption that this will inevitably become a collaborative piece of writing. All errors and missing references can be attributed to the first author."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"facets": [],
"plaintext": "The Stages of the Data Pipeworks"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 29,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Stage 1: Knowledge and Values. First, we posit that human activity in the physical and digital worlds generates a \"Reality Signal,\" consisting of facts and preferences, which acts as the raw material for AI systems. We do not intend to make any claims about the actual computability or measurability of reality – rather, we mean to highlight that there exists a theoretical massive set of signals that most “AI” models of human-generated data are sampling from."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 25,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Stage 2: Creating Records. Second, we discuss the \"Sampling Step,\" where sensor networks and forms (user interfaces) collect these signals and transform them into structured data records."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 18,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Stage 3: Datasets. Third, we describe the \"Filtering Stage,\" wherein records are aggregated into datasets by various organizations subject to social, economic, and legal constraints."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 16,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Stage 4: Models. Fourth, we describe the machine learning modeling process as a form of \"compression,\" turning aggregated datasets into useful input-output mappings. We emphasize that the choices made here are highly influenced by the preceding stages."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 25,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Stage 5: Deployed Systems. Fifth, we explore how these models lead to actuation in real-world scenarios, capturing economic value and affecting human behavior and the physical world."
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 1,
"facets": [],
"plaintext": "Introducing the Data Pipeworks"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"facets": [],
"plaintext": "Overview"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 403,
"byteStart": 387
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 362,
"byteStart": 356
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 297,
"byteStart": 289
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 253,
"byteStart": 246
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 179,
"byteStart": 159
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "In this series, we will lay out a five stage model of the \"Data Pipeworks\". This model is meant to describe (with considerable use of approximation) how human knowledge and values (stage 1) emerge from the physical world, lead to the creation of records (stage 2) that are aggregated into datasets (stage 3), which are compressed into input-output mapping models (stage 4) embedded into deployed systems (stage 5) that can be connected to actuators that do things in the world. Typically, deployed systems aim to create value for whoever built the system, but also create externalities, and will also create feedback loops by changing the world itself."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 500,
"byteStart": 490
},
"features": [
{
"uri": "https://www.vox.com/future-perfect/23787024/power-progress-book-ai-history-future-economy-daron-acemoglu-simon-johnson",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 333,
"byteStart": 309
},
"features": [
{
"uri": "https://github.com/manymodels/manymodels",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 268,
"byteStart": 261
},
"features": [
{
"uri": "https://openai.com/blog/democratic-inputs-to-ai",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "There are many ongoing discussions about how we might build a world in which the benefits of deployed AI systems are shared more broadly, perhaps by decentralizing power over monolithic AI systems (e.g., we cast regular votes as part of a giant civic body that governs ChatGPT) or by pushing for a world with many competing AI models (e.g., we all select from a vibrant “pool” of models in a commons). These discussions incorporate aspects of AI governance, AI safety, the economics of automation and prediction, alongside general questions about aspirational arrangements of our political economy"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 529,
"byteStart": 525
},
"features": [
{
"uri": "https://probml.github.io/pml-book/book1.html",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 434,
"byteStart": 418
},
"features": [
{
"uri": "https://arxiv.org/abs/1807.03341",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "Typically, “technical” AI research contributions operate within a well-defined problem space. For instance, when describing a new modeling technique, authors will provide formal descriptions of the data being modeled and define a clear objective. While academic researchers can rightfully complain that ML and AI work often does not feel as scientific as some other “hard sciences” (see e.g. discussions of “troubling trends” in ML), individual ML/AI problems are typically characterized quite formally (see e.g. here for an overview of many well-defined problems in probabilistic machine learning)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "However, questions about sharing the benefits of, and control over, AI exist outside the scope of a single nicely defined optimization problem or other task. They could be implemented as some kind of political and/or economic agent-based model in which tokens representing power and resources are allocated over agents. But this requires contending with how AI systems impact the world and how the records being fed into modeling pipelines came to be. Agent-based models are often something we fall back on when a clean optimization problem cannot be written down."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 702,
"byteStart": 675
},
"features": [
{
"uri": "https://dl.acm.org/doi/abs/10.1145/3531146.3533158",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 497,
"byteStart": 486
},
"features": [
{
"uri": "https://journals.openedition.org/cybergeo/1035?lang=en",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "Indeed, many AI ethics concerns can be framed in terms of some actor failing to consider either downstream model actuation or upstream data collection factors. That is to say, we could draw out a causal graph describing how an AI system impacts the world and take a complex systems approach to modeling our described closed system, but this is not typically what we do in most ML scholarship (with reasonable justification: trying to make predictions about complex systems is extremely challenging). And without a doubt, the current approach of zooming in on specific ML tasks has led to creation of many genuinely useful ML models (though in some cases, we may fall for the Fallacy of AI Functionality)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 211,
"byteStart": 101
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Building off many conversations along these lines, a key hypothesis motivating this document is that describing the Data Pipeworks in an end-to-end fashion can strengthen efforts to build alternate AI paradigms. In other words, this is a continued effort of “design space mapping”: describing how things are so we can identify levers, knobs, and fertile space where we might insert, or “grow”, a new lever."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Taking a control / cybernetics-inspired approach is especially fruitful because these approaches are hyper-focused on identifying all the levers and knobs that could be a source of feedback for a system."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 115,
"byteStart": 107
},
"features": [
{
"uri": "https://eckhartarnold.de/papers/2014_Social_Simulations/Whats_wrong_with_social_simulations.html",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "The goal is not necessarily to try to build some kind of complete agent-based model of the world (again, a daunting task), but rather to capture how human activities “flow” to deployed AI systems with enough fidelity so that when considering a particular intervention, we can speak more specifically about:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "what processes are upstream or downstream relative to that intervention"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 106,
"byteStart": 97
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "what power structures control what happens upstream of that intervention – this can help us be realistic about what interventions might achieve."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "where feedback loops are likely to be of concern."
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Eventually, writing out the specification for a formal cybernetics-style system (perhaps drawing directly on ecological or biological models) could be feasible. And this system could be explored by a carefully designed (wrong but useful) agent-based model."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 291,
"byteStart": 287
},
"features": [
{
"uri": "https://www.anthropic.com/index/influence-functions",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 228,
"byteStart": 219
},
"features": [
{
"uri": "https://www.licenses.ai/",
"$type": "pub.leaflet.richtext.facet#link"
}
]
},
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 167,
"byteStart": 161
},
"features": [
{
"uri": "https://dl.acm.org/doi/10.1145/3442188.3445885",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "In describing the full flow, we can do a better job reasoning about how distinct interventions – e.g. tools that empower data creators to engage in collective action, new laws that affect AI systems are deployed, new licensing norms that determine how model weights are shared, new ML work that explains model capabilities in terms of data – fit together and complement, or depend on, each other."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Specifically, we’ll describe our pipeworks in terms of a signal generated by the collective activity of all humanity that is passed through three filters – the creation of records, the aggregation of datasets, and algorithmic model training – and connected to some set of actuators. Each actor who ultimately builds/deploys an AI system is subject to their own unique set of these three filters (based on the records they have access to, how they aggregate data, and choices they make in implementing their learning algorithm)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "In describing the Data Pipeworks, we’ll naturally have to start defining different classes of agents, which will become useful when we do want to use computational modeling to explore certain parts of the pipeline (accompanied by the knowledge these models will be wrong in some ways but useful)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 66,
"byteStart": 0
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "Fig 3: A diagram describing all five stages of the Data Pipeworks."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"facets": [],
"plaintext": "Key Claims"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 3,
"facets": [],
"plaintext": "The Primacy of Social Dynamics and Human Factors"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 493,
"byteStart": 485
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
]
}
],
"plaintext": "In describing the AI Pipeworks, we are making an argument that social dynamics and human factors are dominant factors in determining AI capabilities (because things that are more upstream in the pipeworks dominate). This is not too wildly controversial – after all, if AI aims to model human behavior or the production of human outputs, then of course any “AI'' related activity is circularly dependent on how humans behave. The Data Pipeworks framing can help us describe how the specific complex systems and emergent behaviors that underlie record creation (and perhaps even dataset aggregation and model training) shape the landscape upon which we can do technical work. To the extent that “technical” AI work aims to be a sort of civil engineering (building sound structures within our “space” of knowledge), social processes determine our topology, which may at times block off entire realms of exploration. In geographies with no rivers, there’s no need to build bridges. In geographies where the soil is too loose, we cannot guarantee certain measures of stability, etc."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "We can think of record creation, dataset aggregation, and model training as three filtering stages that heavily influence the data distributions that many computer scientists and mathematicians doing technical ML work explore and work with mathematically. This in turn can drastically change training and evaluation of AI models."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 3,
"facets": [],
"plaintext": "Interfaces, Sensors, and Incentives are “Core AI”"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Further, the AI Pipeworks makes the case that the design of interfaces and sensors -- and the incentives mechanisms that guide interactions with interfaces and sensors -- should be a top tier R&D priority for the AI community. HCI and technical ML research goals are deeply intertwined. Any widely adopted interface designs will have major downstream effects on choices relating to dataset management, model training, and model deployment. Again: if a core group of “technical” ML and AI primarily want to do math and engineering, the social factors shape the landscape on which that technical work will be built."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 287,
"byteStart": 169
},
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
]
}
],
"plaintext": "Describing data flow in stages can help us design interventions that can improve AI capabilities or give groups of people more power. Ideally, we can do both at once by reducing inequalities in power to healthy levels while improving capabilities of AI systems that are broadly available."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 3,
"facets": [],
"plaintext": "Social Dilemmas in AI"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "We’ll also make a conjecture about social dilemmas and AI that I believe the Data Pipeworks model can help resolve:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 216,
"byteStart": 199
},
"features": [
{
"uri": "https://users.ssc.wisc.edu/~oliver/PROTESTS/ArticleCopies/OliverMarwellCritMassI.pdf",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "The record aggregation tasks at the heart of many AI technologies enable social dilemmas. AI systems are (almost always) shared, quasi-public goods. Their creation can be well-described as a classic collective action problem: a group of individuals have (heterogeneous) access to \"building block resources\" they can throw into a shared pot to build some good with shared benefits, and varying levels of interest in the good. In other words, the process of going around house to house to collect records to train an AI system looks a lot like going around house to house to collect donations to renovate the park in the town square."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "In some dataset-building contexts, there is great potential to free ride (some people skate by without contributing data because a system is already good enough). In others, there is potential to fail to achieve critical mass (people choose not to contribute data, even if large-scale data contribution would lead to a broadly beneficial system). In practice today, these social dilemmas are often solved by a \"dictator solution\": tech companies leverage information asymmetry to just collect records describing the activities of people who may not realize they're playing a shared-goods, resource-sharing game."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "In a world with increased data-agency-per-person, a dictator solution doesn’t work. To get models with similar predictive quality, we need carefully designed incentive systems at each stage of the AI pipeline."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"$type": "pub.leaflet.richtext.facet",
"index": {
"$type": "pub.leaflet.richtext.facet#byteSlice",
"byteEnd": 188,
"byteStart": 177
},
"features": [
{
"uri": "https://raulcastrofernandez.com/papers/data_station_paper-11.pdf",
"$type": "pub.leaflet.richtext.facet#link"
}
]
}
],
"plaintext": "One solution might simply involve placing more types of record creation in the domain of formal markets (i.e., more crowdwork marketplaces or more data escrow-type systems like DataStation)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Better yet, we might solve the social dilemmas by supporting markets in which the primary actors are data coalitions acting on behalf of individuals."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 3,
"facets": [],
"plaintext": "Connecting with formalisms from machine learning, information theory, signal processing, and more"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "In the section below describing the five-stage model, we mostly avoid formalism. However, a key value-add of this framework is the ability to translate between different lenses. In the Appendix, we will begin to describe the Pipeworks using three different disciplinary perspectives and their norms around formalism: machine learning, information theory, and signal processing."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 3,
"facets": [],
"plaintext": "The end of the Introduction"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "This ends my introductory post to the Data Pipeworks. Please do let me know if you found this helpful. Or alternatively, let me know if you found this too indulgent or meandering, and just want to see more discussions of recent news and recent papers. If you made it this far, I certainly want to hear what you think!"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "I hope to share the following “chapters” (expanding on the individual stages of the pipeworks, and then summarizing specific research questions and computational modeling approaches that emerge from this thinking) on a semi-regular cadence."
}
}
]
}
]
},
"description": "Focusing on feedback loops -- connecting modern AI to early cybernetics-style thinking -- could help solve looming challenges and support democratic inputs to AI.",
"publishedAt": "2023-11-13T00:00:00.000Z"
}
}