Data is the New Social Security Number
If your data is the bottleneck to automating you, think twice before giving it away.
“The way you’re taught to build a software company is, ‘your customer thinks they’re getting laid, but they’re getting f-ed’” – Alex Karp
If you live in America, you know that you shouldn’t give away your social security number.
Unlike most countries, the US doesn’t have a national ID system. Instead, we have a nine-digit number issued to you at birth to track your lifetime wages so that you can receive a sort-of government pension later in life. It isn’t secure, and it’s hard to replace if it gets stolen.
Despite this, companies and government entities use social security numbers to verify who you are. It’s essentially a master key to your financial life. You use it to open credit cards, take out loans, file a tax return, or prove your right to work. That means someone else with it can do the exact same thing—just in your name instead of theirs.
You don’t want strangers spending money in your name, so Americans are all taught not to give their social security number away. Most of us remember not to do this.
In the age of automation, giving away your personal data will be even more destructive. Your social security number is a master key to your earnings; data will be the master key to your ability to earn at all.
Data privacy doesn’t buy you a lot today
For the median person, caring about data privacy today means you see worse ads.
There are real exceptions. Being careless with a password might get you hacked. Authoritarian governments sometimes crack down on dissidents who use unencrypted messaging apps. Even non-authoritarian governments sometimes use privacy carelessness to suppress groups of people or movements they don’t like. In those circumstances, a focus on privacy provides real user value.
We also lose as a society from a lack of privacy. Algorithms trained on copious amounts of data are pretty good at changing your perception of the world and the values that make you who you are. Data brokers and other centralized hubs of user data are prime targets for attacks that, when successful, expose nearly everyone’s private information. But users don’t feel this when they make the choice to share, and they usually don’t incur a personal cost for doing so.
So for most people in most cases, you—at the individual level—just get worse ads. Your society may change and you might lose in the edge cases, but you aren’t going to feel that pain every day. That’s just not enough for a consumer to care when they see a cookie banner or get asked to share their email.1
If you’re really privacy conscious, you also incur costs:
You forgo social media—and miss out on seeing updates from your distant family and friends.
You only use encrypted messaging apps—making it difficult for people to contact you.
You don’t use professional networking tools—so it’s harder for you to find a job.
But in the near-future, you’re going to be asked to give away a lot more data. Doing so could have permanent consequences.
The stakes of privacy are higher in the automation age
Some companies would like to comprehensively automate all work. For tasks with correct answers that are easily checkable, this will be (relatively) trivial over the next few years.
For example, we can easily construct a training environment to produce good code because it’s easy to test whether a given program works—it either does what it’s supposed to or it doesn’t. If your job is mostly coding what other people tell you to build, automating you isn’t going to be all that hard in the limit. This extends to lots of other tasks: if your judgment and personal preferences aren’t important for your job, you’ll be on the chopping block shortly.
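To make “easily checkable” concrete, here’s a minimal sketch of the kind of automated grader such a training environment relies on (in Python, with a hypothetical toy “write an add function” task and test suite of my own invention): run the model’s candidate code against known tests and return a pass/fail reward, no human judgment required.

```python
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical unit tests for a toy "write an add function" task; a real
# training environment would draw these from a large curated task suite.
TEST_SUITE = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")

def grade_candidate(candidate_code: str, timeout_s: int = 5) -> float:
    """Return 1.0 if the candidate passes every test, else 0.0.

    The reward is computed by simply running the code, which is why
    objectively checkable tasks are so much easier to automate.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + TEST_SUITE)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

print(grade_candidate("def add(a, b):\n    return a + b"))  # 1.0
print(grade_candidate("def add(a, b):\n    return a - b"))  # 0.0
```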
For more subjective tasks, however, you are a bottleneck to automation. Models still struggle with tasks that require taste, judgment, and long-horizon planning, but you don’t.
So how will companies automate these fuzzier tasks?
If a company aims to automate everything, they could start by observing how you complete the tasks they’re struggling to automate. Instead of training a model to be good at a task in the abstract, they could train it to do the task exactly the way you do.
Learning your specific workflow will make it easier. To do this, they’ll need lots of data, like your:
Writings and notes
Browser activity
On-device activity (via screenshots, keystroke logging, or screen recording)
Texts, emails, and meeting notes
Then, they could train a model on this. Do it once, and you’ve got a digital twin of someone that boosts their productivity. Do it a million times, and you can automate a profession.
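Here’s a minimal sketch of what that training data might look like, assuming hypothetical logs that pair the context someone saw with the action they took: the collected writings, messages, and on-screen activity get flattened into prompt/completion pairs, the standard shape for supervised fine-tuning.

```python
import json

# Hypothetical logged workflow events: each pairs the context a person saw
# with the action they took (an email reply, an edit, a decision).
logged_events = [
    {"context": "Client email: 'Can you review chapter 3 by Friday?'",
     "action": "Reply: 'Yes, send the latest draft and I'll have notes by Thursday.'"},
    {"context": "Draft paragraph written entirely in passive voice",
     "action": "Rewrite in active voice, trim adverbs, flag unclear pronouns"},
]

# Flatten the logs into prompt/completion pairs. Done for one person, this
# approximates their style; done for a million people, it approximates the job.
with open("personal_finetune.jsonl", "w") as f:
    for event in logged_events:
        record = {
            "prompt": "Respond the way this user would:\n" + event["context"],
            "completion": event["action"],
        }
        f.write(json.dumps(record) + "\n")
```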
We’re seeing early efforts to do this through personalization. These efforts will make models that help you work faster. For a while, this will strictly benefit you.
But do you own that model and the underlying data? Can the company that made it train on this data to automate you? Because if not, you might be the product.
Bottom-up augmentation, or top-down automation
If the day comes where an AI company asks you for lots of data to personalize your experience, you should ask yourself two questions:
Could the data I’m giving this company be used to train a model that automates me away?
If so, have they guaranteed they won’t train on it or use the underlying model without my explicit permission?
If the answer to the first question is no, you’re safe.
But if it’s yes, there are two possibilities:
This company has taken extreme measures to ensure they can’t see individual user data or access a user’s personal model. This means their solution might be more expensive, but it will preserve your ability to participate in the economy by speeding you up without building your replacement. You will pay more today, but you will earn more tomorrow.
The company hasn’t. They are either going to train on this data, or they will have strong incentives to do so in the future. You will pay less today, but you’ll be out of a job sooner. They could have entirely benign goals, but you can’t be sure.
Under this paradigm, paying for data privacy today could preserve and expand your future earning potential. Failing to do so might eliminate it, because a model trained to behave like you can do your job without you.
Let’s use an example to drive this home.
Jane2 is a 30-year-old novel editor in Brooklyn. People pay her because they like her taste and style. They could use ChatGPT (and some of her clients have turned to it), but it doesn’t impart her unique touch onto their work. So despite the existence of AI tools, her business is doing just fine.
Today, she’s bottlenecked by how many manuscripts she can get through. Plenty of people send her initial drafts that waste her time—she’s not going to be able to edit trash into treasure. But she’s got to read those drafts to know she’s going to pass on them, and that slows her down.
A model trained on her data could do the first read for her, letting her know whether she could help before she spends time on it. It could even do the first pass of edits while leaving the deeper work that she enjoys to her.
Jane decides she wants to integrate AI into her workflow to automate the busywork away. She has to choose between two companies. Both want her to give them everything—they need every email she’s sent and every manuscript she’s edited. They even want to record her screen for a few days to watch how she completes her tasks.
In a bottom-up augmentation scenario, Jane pays Company A $50 per month for her personalized model. The company verifiably guarantees that they can’t see the underlying data or use her personal model. Because of this, she has control over how automated her workflow becomes, and no one else can get a model that edits exactly like she does. Her productivity increases and her role in the economy expands as she turns out more work. Plus, other editors make the same choice, and people still prefer their output over generic models that don’t match their personal style.
In a top-down automation scenario, Jane pays Company B $10 per month for the model. They train on this data alongside data from thousands of other novel editors, and produce JaneBot-High-September-2027-New-Final. It’s a state-of-the-art novel editor with dozens of styles based on the editors they trained on, and it costs every user $10 per month. Jane saves a bit of money today, but she gives Company B the data they need to automate her. Her role in the economy (at least as a novel editor) shrinks.3
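How could Company A verifiably guarantee that it can’t see Jane’s data? One possible mechanism (among others, such as on-device training or confidential computing) is client-side encryption, where only the user holds the key. Here’s a minimal sketch using Python’s cryptography library, with hypothetical data standing in for Jane’s:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# The key is generated and stored on Jane's device; the provider never sees it.
user_key = Fernet.generate_key()
cipher = Fernet(user_key)

manuscript_notes = b"Chapter 3: tighten the pacing, cut the flashback, keep the dialect."

# Only ciphertext leaves Jane's machine, so the provider can store it
# but cannot read it or train on it.
payload_sent_to_provider = cipher.encrypt(manuscript_notes)

# Only software holding Jane's key, running on her terms, can recover the data.
assert cipher.decrypt(payload_sent_to_provider) == manuscript_notes
```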
The market for privacy-preserving AI tools is going to grow dramatically
Like a social security number, people are going to learn that giving away their data could have negative consequences.
There’s already a market for privacy-preserving AI for governments and large corporations that are wary of handing over this data. I think that market will expand, but I also think the number of consumers who care about privacy will increase.
Data collection is going to feel different this time. Collecting enough data to deeply personalize a model is going to be intrusive to the end user. This won’t be ambient data collection: when a company asks to record your screen and read all your texts, it’s going to feel like a clear escalation. Most people will notice.
Moreover, when fields begin to face credible threats of mass automation, people are going to catch on. Some are going to choose not to use AI entirely. But most are going to realize that, like it or not, they’ll need AI-powered productivity gains to remain competitive.
This will create a massive market opportunity. I think consumers will be looking for companies that credibly guarantee that only the user benefits from their personal AI. Moreover, small businesses, micro-businesses, and startups that traditionally ignore privacy concerns will think twice before handing over their proprietary knowledge to a system designed to automate away their role in the economy.
As automation knocks most people out of their existing jobs and pushes them to rely on taste and long-horizon planning in new roles they devise, they’ll be forced to make a choice:
Save money today, and train a model that automates you away tomorrow.
Pay a bigger premium and try to keep your role in the economy.
I’d bet that lots of people are going to choose the latter.
Thank you to Rudolf Laine, Alex Komoroske, Oscar Moxon, Xavi Costafreda-Fu, Soren Larson, Jonathan Mortensen, and Chad Fowler for reviewing drafts of this post.
This doesn’t mean that you shouldn’t care about privacy. Rather, it means that, when faced with the existing tradeoff, most people do not see a benefit that justifies the inconvenience or cost.
Jane isn’t real, but she’s loosely based on a user interview I conducted.