Over 21 Million Music Tracks Used Without Consent to Train AI Models, Investigation Finds

A recent investigation by The Atlantic found that over 21 million music tracks have been used without authorization to train generative AI audio models. The total includes songs by prominent artists such as Flume, Tame Impala, and Sia. The research sheds light on a significant gap in current AI development practices, where copyrighted material is ingested without compensating rights holders or obtaining their consent.

The project, known as AI Watchdog and led by investigative journalist Alex Reisner, originally focused on identifying unauthorized data use in books, academic literature, and video content. Its expansion into music revealed previously hidden training datasets that circulate widely across the AI development ecosystem. To increase transparency, AI Watchdog turned these often opaque collections into a publicly searchable tool available on The Atlantic’s website. Artists, labels, and legal experts can use it to check whether specific works appear in AI training datasets.

The investigation centered on four main datasets distributed among AI developers. The largest, LAION-DISCO-12M, contains about 12.6 million tracks and was created by the German nonprofit LAION. It was compiled through automated searches that matched hundreds of thousands of seed artists to YouTube Music URLs, and it was released under an Apache 2.0 license with academic research as its stated purpose. It has nonetheless been widely adopted by commercial AI developers.

The second largest is Sleeping-DISCO-9M, which comprises roughly 9 million commercially popular tracks scraped from the web by the Sleeping AI Research Collective. Hosted on the Hugging Face platform, it has been widely used by generative modeling companies. The collective also maintains a restricted subset, Sleeping-DISCO-Private, that includes full song lyrics and annotations from Genius and is accessible only to verified research institutions.

Two smaller but notable datasets round out the set. One is the Free Music Archive, a collection of around 100,000 Creative Commons-licensed tracks originating from the WFMU radio station, which has been used by major organizations such as Google and Stability AI. The other, similar in size, circulates within private developer forums as a pointer system that links to active Spotify and YouTube files.

A key technical finding is that three of these four datasets function less as audio repositories than as structured collections of pointers and metadata. They store URLs that lead to YouTube or Spotify tracks rather than hosting the audio themselves. AI developers then use automated tools to download the tracks, bypassing platform restrictions such as login requirements, ads, and the monetization mechanisms designed to compensate creators. This undermines the argument that AI systems train only on freely available content, because obtaining the audio involves intentionally circumventing licensing and platform protections.
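
To make the distinction between a pointer dataset and an audio repository concrete, the sketch below shows what a single row of such a dataset might look like. It is an illustration only: the `TrackPointer` class, its field names, and the placeholder URLs are hypothetical and are not taken from any of the datasets described here.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class TrackPointer:
    """One hypothetical dataset row: metadata plus a link, no audio bytes."""
    artist: str
    title: str
    url: str


def count_hosts(pointers):
    """Tally which platforms the stored links lead to."""
    return Counter(urlparse(p.url).netloc for p in pointers)


# Placeholder rows; the names and IDs are invented for illustration.
rows = [
    TrackPointer("Example Artist", "Example Song",
                 "https://music.youtube.com/watch?v=EXAMPLE_ID_1"),
    TrackPointer("Example Artist", "Another Song",
                 "https://music.youtube.com/watch?v=EXAMPLE_ID_2"),
    TrackPointer("Other Artist", "Third Song",
                 "https://open.spotify.com/track/EXAMPLE_ID_3"),
]

host_counts = count_hosts(rows)
print(host_counts)
```

Each row is a few dozen bytes of text. The audio itself stays on the platform, which is why a dataset of this kind can list millions of tracks while remaining small enough to share freely.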

The scale is staggering. At an average track length of four minutes, the roughly 12.6 million tracks in LAION-DISCO-12M alone amount to about 840,000 hours of audio, or some 96 years of continuous playback, all of it processed without authorization. The findings raise pressing questions about copyright enforcement in AI development and highlight the growing tension between technological innovation and intellectual property rights.
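
For a rough sense of that scale, the two figures quoted for LAION-DISCO-12M (about 12.6 million tracks, four minutes each on average) can be multiplied out. The short calculation below is a back-of-the-envelope sketch that treats both numbers as the approximations they are; the variable names are illustrative.

```python
# Approximate figures quoted in the article for LAION-DISCO-12M.
TRACKS = 12_600_000
AVG_MINUTES = 4

total_minutes = TRACKS * AVG_MINUTES
total_hours = total_minutes / 60
total_days = total_hours / 24
total_years = total_days / 365

print(f"{total_hours:,.0f} hours, about {total_years:.0f} years of continuous audio")
# prints "840,000 hours, about 96 years of continuous audio"
```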