Modern AI models need access to internet-scale public data, which in turn requires a liberal copyright regime to train under. Current debates around copyright reform often centre on the perceived costs to artists and writers: they provide the key inputs to the models, but receive no compensation in the process.
We model how much they could get paid. Even if AI companies were forced to spend as much on data as all other expenses combined, artists would receive only $0.09 per 10,000-word essay and $0.06 per image, while raising AI companies' costs by 65%. Factor in the large amount of anonymous data available, the fact that data varies in quality, and cheap synthetic augmentation, and the median artist's payment would be even lower.
How did we calculate this number?
There are many ways you could assign property rights over data. But under any of them, how much could AI companies realistically pay? Training a frontier model already requires billions of dollars of compute and engineering, so whatever legal regime emerges, labs cannot afford to spend more on data than on everything else combined. We therefore model outcomes under the conservative assumption that this entire current spend forms the absolute ceiling, while staying agnostic about the mechanism enforcing it.1 Competition with geopolitical rivals that care less about respecting copyright also puts a hard upper limit on the tax that AI companies will in fact pay to copyright-holders.
So, let's explore what creators might earn if they were paid when AI companies use their work. The best publicly available data is from 2024, together with Meta's Llama figures.
When Llama 4 Behemoth was trained, it used about 30 trillion tokens of text from roughly 1.4 petabytes of data. The training cost approximately $800 million. If creators were compensated based on this total cost, they’d receive ~$0.00003/token.
But this initial calculation doesn't tell the whole story. Those 30 trillion tokens were filtered down from roughly 120 trillion tokens of raw text; spreading the same payment over all the raw text drops the figure to about $0.000007 per token. To put this in perspective, a 10,000-word essay (about 13,000 tokens) would earn only about 9 cents, and Shakespeare's complete works about $4.74.
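If you want to check the arithmetic yourself, here is a minimal sketch; the cost and token counts are simply the figures quoted above, nothing else is assumed.

```python
# A quick sanity check of the per-token figures above. The cost and token
# counts are the ones quoted in the text.
training_cost  = 800e6    # $, approximate cost of the training run
curated_tokens = 30e12    # tokens actually trained on
raw_tokens     = 120e12   # raw tokens the curated set was filtered from

per_token_curated = training_cost / curated_tokens   # ~$0.00003
per_token_raw     = training_cost / raw_tokens       # ~$0.000007

essay_tokens = 13_000                                # ~10,000-word essay
print(f"per curated token: ${per_token_curated:.6f}")
print(f"per raw token:     ${per_token_raw:.7f}")
print(f"10,000-word essay: ${essay_tokens * per_token_raw:.2f}")   # ~$0.09
```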
Something wicked this way comes
We need to make several adjustments to this estimate. First, multiple AI companies run multiple training runs, which increases the total somewhat, but not dramatically: at least 16 training runs larger than 1e25 FLOP took place in 2025, and at least 100 larger than 1e23. Assuming an exponential distribution over training run size, the total compute across all runs in a given year comes to about 3.59x that of the largest single run, since the largest runs account for most of the compute. Second, compute and data trade off against each other, so companies can shift towards using less data and more compute. Under Chinchilla-optimal scaling, firms would respond to a fixed cost per token by giving their models more parameters; but to keep models as capable as before they would then need more compute, since their previous choices already represented the optimal combination of parameters and data. In our model this produces a 43% reduction in data usage and an 8% rise in compute used, leaving tax revenue at 57% of expected levels, with the 8% a pure loss to society: AI developers' costs rise without any of that money reaching creators.2
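To make the compute-data substitution concrete, here is a minimal sketch of the re-optimisation. It is not the exact model linked in footnote 2 (which is where the 43% and 8% figures come from); it assumes the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β with the exponents reported by Hoffmann et al. (2022), and a per-token levy sized so that, at the old optimum, data spend would equal everything else. It illustrates the mechanism: the levy pushes firms towards bigger models trained on less data, at the price of somewhat more compute.

```python
# A minimal sketch of the compute-data substitution described above, NOT the
# exact model of footnote 2. It assumes the Chinchilla parametric loss
#   L(N, D) = E + A / N**alpha + B / D**beta
# (exponents as reported by Hoffmann et al., 2022) and a per-token levy sized
# so that, at the old optimum, data spend would equal all other spend.
import numpy as np

alpha, beta = 0.34, 0.28   # Hoffmann et al. approach-3 exponents

# Work with s = the share of excess loss coming from the A/N^alpha term.
# At the untaxed Chinchilla optimum s0 = beta / (alpha + beta); usefully, the
# re-optimised split depends only on alpha and beta (A, B, E and scale cancel).
s0 = beta / (alpha + beta)

s = np.linspace(1e-3, 1 - 1e-3, 200_000)
n_rel = s ** (-1 / alpha)         # N, up to a constant factor, at fixed loss
d_rel = (1 - s) ** (-1 / beta)    # D, up to a constant factor, at fixed loss
n0 = s0 ** (-1 / alpha)           # baseline N on the same relative scale

# Total cost of hitting the SAME loss: compute (~N*D) plus levy (t*D with
# t ~ 6*p*N0); common constant factors are dropped since they don't move the argmin.
cost = (n_rel + n0) * d_rel
s_star = s[np.argmin(cost)]

d_ratio = ((1 - s_star) / (1 - s0)) ** (-1 / beta)   # new D / old D
n_ratio = (s_star / s0) ** (-1 / alpha)              # new N / old N
print(f"data used:        {d_ratio:.0%} of baseline")
print(f"parameters:       {n_ratio:.0%} of baseline")
print(f"compute (6*N*D):  {n_ratio * d_ratio:.0%} of baseline")
```

With these assumed exponents the sketch gives a data reduction close to the 43% above and a somewhat larger compute rise than 8%; the exact figures in the text come from the linked model.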
Additionally, much data has no traceable author: anonymous forum comments, Wikipedia articles, random code. If firms had to pay, they could simply shift towards such anonymous data. The effect will be magnified by synthetic data (model-generated stand-ins), since current techniques cost relatively little extra GPU time. Once a sufficiently comprehensive seed of human works is secured, labs can pad the rest with these cheap tokens, driving royalties ever closer to the bare compute cost. We don't model either effect here, but both will make discussions of creator compensation increasingly irrelevant over time.
Quality and variety considerations don't substantially raise these values either. Most creative works aren't uniquely valuable to AI systems, because similar information exists in many other sources.3 There may be only one ground truth for Romeo and Juliet, but the AIs can probably guess the contents of your 7,000-word Harry Potter fanfic.4 Insofar as content differs in value, the payment to the median content-producer, holding the mean constant, necessarily falls. AI companies would also likely develop more synthetic data rather than pay high prices for human-created content.
After all these adjustments, a 10,000-word essay might earn its author about 9 cents per year - hardly a Substack income.
The situation for images is similar. Models like DALL-E 3 were trained on about 1 billion images with approximately $30 million in compute costs. This translates to about 6 cents per image. Picasso might earn $831 for his life's work, while the less prolific, such as the little-known Leonardo da Vinci, might receive less than 50 cents.
For sale: human art, never automated.
Implementing such a system would increase AI development costs by about 65% while providing minimal benefit to creators, measured at best in single-digit dollars per year. Without worldwide coordination, companies would simply relocate to countries with fewer restrictions.
In the end, these calculations suggest that data payment schemes would significantly hinder AI development while offering little meaningful compensation to the creators whose work is used in training.
Conclusion
At present, UK copyright law's text-and-data-mining carve-out (s. 29A CDPA) is restricted to non-commercial research, meaning a frontier lab must obtain licences from every right-holder before scraping even a single copyrighted sentence. At the trillion-token scale the transaction burden becomes unworkable, effectively shutting commercial model training out of UK jurisdiction.
We'd like AGI development to be safer than its current trajectory, but if advanced models are coming anyway, a country's tight copyright rules won't protect its artists; their work will simply vanish from the corpus. That vacuum hands influence to looser jurisdictions, so ensuring local data is trainable is strategically vital: it imprints future systems with local norms, economic aims, and security interests.
Acknowledgements: Rudolf Laine, Jack Wiseman, Julia Willemyns, GPT-4o, millions of uncompensated artists
Importantly, we also assume zero transaction costs (i.e. that the Coase theorem applies). In practice, with non-zero transaction costs the allocation of property rights does matter for outcomes, so under some property-rights regimes the cost of individually seeking permission from every copyright-holder could prevent training runs from happening even when the value they generate greatly outweighs the aggregate disutility to data producers.
See here for the exact model. Alternatively, firms could fall off the isoquant, slowing overall development and thus the realisation of AI's benefits to society. Given the extremely large externalities from innovation generally, this would (ignoring potential existential-risk considerations) greatly raise the cost of such taxes. To keep our estimates conservative (and avoid weighing net externalities) we do not model this here.
Regarding the practicality of per-token attribution: Choe et al. (2024) record 4,696 tokens/s when LoGra logs rank-16 projected gradients for Llama-3-8B on a single A100-80GB GPU (Table 1). NVIDIA's vanilla 8B-class pre-training on the same GPU type runs at 40,345 tokens/s across 8 GPUs, i.e. roughly 5,043 tokens/s per GPU (NeMo GPT performance table). 5,043 / 4,696 - 1 = 7.4%.
Apply that to the $800M, 30T-token Meta run above: 0.074 × $800M ≈ $59M. LoGra also writes about 3.5 TB per trillion tokens; for 30T tokens that is 105 TB, rented at $0.03/GB-month ⇒ roughly $3k/month, negligible next to compute. Thus per-document attribution adds ~$60M (≈ 7%) to a modern frontier-scale training budget - enough to make doing so a substantial hindrance with present techniques.
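For reference, the throughput arithmetic in code (the figures are the ones quoted in this footnote):

```python
# Back-of-the-envelope for the attribution overhead quoted in this footnote.
logra_tokens_per_s   = 4696        # Choe et al. (2024): LoGra on one A100-80GB
vanilla_tokens_per_s = 40345 / 8   # NeMo 8B pre-training, per GPU, same hardware
overhead = vanilla_tokens_per_s / logra_tokens_per_s - 1
print(f"throughput overhead:        {overhead:.1%}")                  # ~7.4%
print(f"added cost on an $800M run: ${overhead * 800e6 / 1e6:.0f}M")  # ~$59M
```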
HPMOR aside (albeit >>7k words).