Atlantic’s Alex Reisner makes 4 music AI datasets searchable, including 12M and 9M tracks
A public index of the songs used for AI training turns “opaque data sourcing” into something boards and regulators can audit.

Atlantic reporter Alex Reisner uncovered four datasets of music used to train AI models and made them searchable for the public. The consequence for decision-makers is that training-data sourcing is becoming measurable, not merely claimed.
Two of the sets are enormous, with 12 million and 9 million tracks. The other two are smaller, but still huge in absolute terms, at over 100,000 songs each.
This matters because the datasets can no longer be waved away with a vague “trust us” from research circles. According to Reisner, the sets have been downloaded thousands of times. And while it is impossible to know exactly who has used them, Google and Stability have both confirmed they have used the datasets in research papers. In other words, major AI players are not only in the conversation. They are in the dataset lineage.
So what exactly did the Atlantic do that changes the game? Reisner took four music datasets and built a searchable public view of what is in them. That is not just a journalist flex. It shifts the question from “Did your model train on copyrighted material?” to “Which specific catalog entries are included, and how can they be located?” When a dataset is searchable, due diligence becomes faster. Boards can ask better questions. Legal teams can compare what was claimed against what is actually indexed.
The dataset sizes also tell you something about the stakes for product and compliance. When two datasets hit 12 million and 9 million tracks, you are not talking about a pilot training set. You are talking about massive coverage that can influence what an AI learns about melody, style, structure, and even performance patterns. Even without knowing who has downloaded and used the files, the sheer scale creates risk that spreads across multiple downstream systems, from research prototypes to commercial offerings. Larger training sets can be more valuable for model performance, but they also make provenance problems harder to ignore.
Reisner’s reporting also highlights a key nuance: not all sources are the same kind of “free” in the public imagination. Some of the sources, like the Free Music Archive dataset, are free to stream for personal use. That detail matters because training is not the same thing as streaming, and terms that allow personal listening may not extend to training a model. It also affects how stakeholders interpret the dataset’s characteristics, and it helps explain why regulators and courts often focus on the specific rights and terms attached to particular works. The public indexing makes those distinctions easier to investigate because the dataset contents can be examined rather than just referenced.
Another second-order implication: dataset transparency can pressure the AI ecosystem into stricter attribution and documentation norms. Google and Stability confirmed they have used these datasets in research papers. That is an important anchor fact, because research papers are typically where methodology and training corpora get described, at least at a high level. When the underlying corpora become searchable, the gap between “we trained on music datasets” and “here is what those datasets include” becomes harder to sustain. It also raises the bar for how training-data disclosures are written, reviewed, and audited.
Regulators are already grappling with questions that sound abstract until you can point to an actual dataset. Music training data sits at the intersection of copyright, platform policy, and model governance. A public searchable database turns that intersection into something that can be audited by more than the people who already had access. Even if it is impossible to know exactly who used the datasets, the fact that downloads number in the thousands and that major labs acknowledge usage in papers suggests these are not fringe curiosities. They are working tools.
For executives and boards, the strategic stake is simple: training-data choices are becoming operational risk. If datasets can be indexed and checked publicly, then “we did what the industry does” is a weaker shield. The safest posture becomes documentation you can stand behind, governance that can answer specific dataset questions quickly, and contracts or policies that reflect the real sources used. Today it is four music datasets made searchable. Tomorrow it can be the same scrutiny applied to other training corpora, at a scale that turns compliance from a department issue into a board issue.