
A recent investigation by Proof News, along with Wired, found that over 170,000 YouTube videos were used to train AI systems for major tech companies. The companies involved include Apple, Nvidia, Salesforce, and Anthropic.
They used a dataset called “YouTube Subtitles,” which was compiled from YouTube without permission. The dataset contains subtitles from videos on more than 48,000 channels. It does not contain any images from the videos.
The dataset includes videos from well-known creators such as MrBeast and Marques Brownlee. It also covers videos from major news outlets like ABC News, the BBC, and The New York Times. Over 100 videos from The Verge are in the dataset, along with many others from Vox, according to The Verge.
Marques Brownlee, also known as MKBHD, said in a post on X that Apple used data from several companies to train its AI. One of those companies scraped a large amount of data and transcripts from YouTube videos, including his own. He stated, “This is going to be an evolving problem for a long time.”
New lookup tool to check data used in AI training
As part of its investigation, Proof News has introduced a tool that lets users check whether their own content, or that of their favorite YouTubers, is included in the dataset.
This subtitles dataset is just one part of a larger collection called The Pile, created by the nonprofit EleutherAI. The Pile includes various materials such as books and Wikipedia articles.
Last year, an analysis of another dataset in The Pile, Books3, showed which authors’ works were used for AI training. This has led to lawsuits from authors against the companies that used their work, as reported by The Verge.
AI companies are often not open about the data they use for their AI systems. Recently, there have been many questions about how YouTube content is being used.
OpenAI’s Sora may have been trained on YouTube videos
In February, OpenAI introduced a video generation tool called Sora. When asked in March whether Sora was trained on YouTube videos, OpenAI’s CTO, Mira Murati, avoided answering directly.
Murati told The Wall Street Journal, “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data.” When asked specifically about using YouTube content, she responded that she “wasn’t sure about that.”
In past interviews, YouTube CEO Neal Mohan stated that using video content, including transcripts, to train AI would violate YouTube’s terms.
In May, during an episode of Decoder, Google CEO Sundar Pichai agreed with Mohan. Pichai said that if OpenAI had used YouTube content to train Sora, it would have breached YouTube’s terms.