We can’t have nice things… because of AI scrapers

In the past few months the MetaBrainz team has been fighting a battle against unscrupulous AI companies ignoring common courtesies (such as robots.txt) and scraping the Internet in order to build up their AI models. Rather than fetching our dataset as a single bulk download, they insist on loading all of MusicBrainz one page at a time. This would of course take hundreds of years to complete and is utterly pointless. In doing so, they are overloading our servers and preventing legitimate users from accessing our site.

Now the AI scrapers have found ListenBrainz and are hitting a number of our API endpoints for their nefarious data gathering purposes. In order to protect our services from becoming overloaded, we’ve made the following changes:

  • The /metadata/lookup API endpoints (GET and POST versions) now require the caller to send an Authorization token.
  • The ListenBrainz Labs API endpoints for mbid-mapping, mbid-mapping-release and mbid-mapping-explain have been removed. Those were always intended for debugging purposes, and they will soon be replaced with new endpoints for our upcoming improved mapper.
  • LB Radio will now require users to be logged in to use it (and API endpoint users will need to send the Authorization header). The error message for logged-out users is a bit clunky at the moment; we’ll fix this once we’ve finished the work for this year’s Year in Music.
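For API users updating their scripts, a minimal sketch of an authorized lookup call looks like the following. The endpoint path, query parameter names, and the `Token`-style Authorization header are assumptions based on the public ListenBrainz API documentation; check the current docs before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.listenbrainz.org/1"  # assumed API base URL


def build_lookup_request(artist_name: str, recording_name: str, token: str) -> Request:
    """Build (but not send) an authorized GET request for /metadata/lookup."""
    query = urlencode({"artist_name": artist_name, "recording_name": recording_name})
    return Request(
        f"{API_ROOT}/metadata/lookup/?{query}",
        # The user token from the ListenBrainz profile page, now required
        headers={"Authorization": f"Token {token}"},
    )


req = build_lookup_request("Portishead", "Glory Box", "your-user-token")
# Send with: urllib.request.urlopen(req)
```

Unauthenticated callers should expect a 401 response from these endpoints going forward.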

Sorry for these hassles and no-notice changes, but they were required in order to keep our services functioning at an acceptable level.

19 thoughts on “We can’t have nice things… because of AI scrapers”

    1. The incompetence of AI bros and their workers cannot be overestimated. Also, if you were to steal some apples from an orchard, would you do it gently, or like a caveman? I think they’re going with the latter approach.

    2. They do it that way because they are built as automated web spiders. They follow links and download whatever they find at the URL, because it’s just an automated effort to get as many web pages as possible. They don’t have oversight, so there’s nobody to notice what it’s scraping and go “Oh, this service has a free database dump we can use!”

  1. Have you considered adding some poison links the bots will hit early-on and putting their IPs on a blacklist? I did that on my own server and it’s worked pretty well.

  2. No. People for some reason are bothered by their stupid websites being “scanned.” WHO CARES? We’ve always done that on the internet. How do you think the WAYBACK MACHINE EXISTS? Tell these website assholes to stop caring about whether it’s a human visiting the site or not, the controlling fucks.

    1. When a bot opens one page at a time, it isn’t doing so slowly and waiting between requests like a person; it requests the next page as fast as it can. This is called a denial of service attack, and when it comes from many servers at once, a distributed denial of service attack (DDoS). It’s antisocial behaviour that brings down websites by overwhelming them. The Wayback Machine respects robots.txt for what it can scrape and does so far less invasively than the AI crawlers do. This isn’t anti-AI; it’s anti-antisocial-behaviour.

  3. I run a TTRPG design project, and our forum has been knocked down several times by AI scrapers that not only want to download the entire site page by page, but also do it repeatedly, over and over. I had to implement CloudFlare protection, and still the bots get through every now and then and overload the (admittedly puny) server. It’s maddening.

  4. I can bet the reason is that they use a pattern or workflow that will work on any website. Having a bespoke step requires modifying the code just for that website. Then that bespoke functionality will need to be maintained and documented, so it’s easier and cheaper to just enumerate and crawl each page.

  5. We’re experiencing similar challenges. Our website has been overwhelmed by persistent bot traffic exhausting server resources. These bots appear to be targeting URLs they perceive as valuable, in our case any path containing /podcast/ (no idea why!).
    The crawling is relentless and resource-intensive. We’re currently evaluating our mitigation options.
