ChatGPT and the Non-Endless Supply of Training Data
In August 2024, OpenAI announced yet another coup – a massive deal with the publisher Condé Nast. If you aren’t aware of Condé Nast, it’s a huge publishing conglomerate whose titles include The New Yorker, Wired, Vogue, and Vanity Fair. The deal is one of several OpenAI has struck, following earlier agreements with TIME, the Financial Times, and News Corp.

The deal benefits both sides. OpenAI gets a vast amount of training data stretching back decades, though it should be noted that the data leans toward cultural coverage. Vogue and The New Yorker might not be everyone’s cup of tea, but they are custodians of the arts. In return, Condé Nast gets a pile of cash and the chance to have ChatGPT – and soon, SearchGPT – serve as a discovery point for its publications.

The race is on to secure data

It’s no secret that AI chatbots require quality training data. That data was not initially held by OpenAI, Anthropic, Google (to an extent), and their peers. It’s usually public – a social media post, for example – or semi-public/proprietary, e.g., something behind a New York Times paywall. The latter is intellectual property, and The New York Times’s belief that AI models were trained on its data without permission is the reason it is suing OpenAI and Microsoft for billions.

Effectively, you’ve got a situation where some publications are embracing AI and getting paid for it, while others remain highly skeptical, including The New York Times, the BBC, Reuters, and The Guardian. The data held by those entities – reporting stretching back more than a century – is vast, and any AI model not trained on it will likely lack important chunks of human knowledge. Of course, that’s not to say these institutions won’t change their tune if the money is right.

Wide range of human knowledge on open web

Yet, we are in a situation where there could be significant gaps in the knowledge of competing AI bots. That runs contrary to the ideas of the open web. With a search engine, I can access diverse knowledge, from the imagery of the Bolshoi Ballet to fun facts about slot machines, from the history of aviation to live sports scores. If one AI knows some of those things and others don’t, it is no longer open discovery. And if there are limits to the knowledge of all AI bots, then we have a problem.

This concern is not new. There have been plenty of articles about the increased blocking of web crawlers by websites. Some sites are blocking AI access on ethical grounds. For instance, there is a significant movement among artists to stop AI models from training on their work, seeing the technology as a clear and present danger to their livelihoods. Others, you suspect, are holding out for financial reward. And some are simply angry that they were never asked for permission.

Antitrust issues may come to the fore

Yet, there is another issue, one that is arguably not being discussed enough in tech circles – antitrust. By that, we mean the following. Big AI companies – OpenAI, Google, xAI – already have access to vast amounts of good data. And they have the financial clout to keep signing deals with data providers, including the real-time pulses of humanity like YouTube, X, Reddit, Twitch, and so on. But what about AI startups without those resources?

Arguably, these issues simply aren’t being talked about enough, despite the actions of the AI companies themselves. Some have espoused the theory that we will reach “peak data” within a few years, meaning AI progress will stall as models run out of the trillions of lines of human-written text to train on. Companies might then need to tap into private data (emails, text messages) or rely on synthetic data, an approach researchers have red-flagged for degrading model quality. Regardless, you can expect these issues to come to a head as lawsuits, deals, and protests against AI training multiply.
