Don McLean has always had to share American Pie.
But these days, McLean's leading imitators aren't even human.
You're able to interrogate the culprits for yourself.

Invariably, the tool will spit out lyrics or themes from American Pie, and sometimes the same chorus.
It's further evidence that ChatGPT can't create anything truly original.
Instead, the system is closer to a remix algorithm.

The real creativity is in its training data, which is scraped from the web without consent.
Dr Max Little, an AI expert at the University of Birmingham, describes the tool as an "infringement machine".
He scoffs at any suggestion that large language models (LLMs) are independently creative.

It's an approach that's ubiquitous in generative AI.
Just last week, a report found that 60% of OpenAI's GPT-3.5 outputs contained plagiarism.

Nor does the issue solely apply to text generators.
Their mimicry poses an existential threat to creative industries.
It also poses a threat to the GenAI industry.
Artists say that GenAI's relentless march is trampling over their copyright.
Unsurprisingly, tech companies disagree.
Their defences typically invoke the fair use doctrine.
Rather than merely copying or reproducing training data, they argue, their models add something new and significant.
At least, that's what the GenAI leaders are contending in court.
OpenAI rebuffed the claim.
A judge also dismissed the allegation that every ChatGPT output is derivative.
But when the outputs are identical to their training data, the legal waters start to muddy.
Reproduction is a dubious basis for transformation.
It's also a common phenomenon.
They've also copied newspapers, which may lead to a tipping point.
"Transformative nature," my eye, @OpenAI.
Legal experts describe the suit as the best case yet alleging that generative AI is copyright infringement.
Lawyers for the NYT highlighted the substantial similarity between the outlet's content and ChatGPT outputs.
To substantiate the claim, they provided 100 examples of the bot reproducing the newspaper's reporting.
At the same time, the company diverts traffic away from the newspapers website.
The tool can summarise product recommendations made by NYT reviewers.
By offering users this information, the lawyers said, OpenAI removes their incentive to visit the original article.
This also means they don't click the product links that generate revenue for the publisher.
Naturally, the GenAI giants disagree.
OpenAI responded to the lawsuit in a reproachful blog post.
The company suspects that the NYT either instructed the model to regurgitate or cherry-picked its examples from many attempts.
Little points to ChatGPT's reproduction of American Pie.
"Sometimes direct verbatim copyright infringement… is detected by the algorithm and a warning is presented," he says.
Rare as it may be in ChatGPT, regurgitation is widespread in GenAI tools.
Outputs that are verbatim copies or derivatives of their training data pose another potential copyright infringement, he warns.
Either the system or the end user could be liable for damages.
That's not the only evidence of worry at OpenAI.
Last month, the GenAI flagbearer told the British Parliament that it's impossible to create AI tools like ChatGPT without copyrighted material.
Searching for legal protection, the company requested a special exemption for the practice.
The request elevated the fears around regurgitated training data.
As a result, they risk destroying the creative industries which depend upon copyright to even exist.
GenAI's regurgitation isn't necessarily terminal.
Analysts have prescribed numerous treatments for the awkward affliction.
One was created by Ed Newton-Rex, the former vice president of audio at Stability AI.
During his stint at the startup, Newton-Rex developed Stable Audio, a music generator trained on licensed content.
The 36-year-old wants other companies to follow his lead.
"But in the process, frankly, you would save the creative industries.
I think there's an existential threat here."
Artists who face this threat have applied a more extreme antidote: poison.
The most popular delivery method is a tool called Nightshade.
This software poisons training data by applying invisible changes to images.
When companies scrape the altered creations without consent, the invisible changes can disrupt their AI models' outputs.
The method has proven popular.
Within five days of going live, Nightshade surpassed 250,000 downloads.
Nonetheless, Little expects AI to continue regurgitating American Pies.
He doubts that tools trained on scraped creative content can ever escape the plagiarism problem.
Because, by design, he says, they are just algorithms that remix their training data.
Story by Thomas Macaulay
Thomas is the managing editor of TNW.
He leads our coverage of European tech and oversees our talented team of writers.
Away from work, he enjoys playing chess (badly) and the guitar (even worse).