GSoC 2020: Spam detection with online learning

Introduction

Hello Everyone!!

I am Rohit Dandamudi, more commonly known as diru1100 in IRC and all other sites. I am currently doing my final year in Computer Science and Engineering at Chaitanya Bharathi Institute of Technology, Hyderabad. This summer, I had the wonderful opportunity to work with MetaBrainz Foundation and it’s my first time participating in GSoC. I worked on the SpamBrainz project under the guidance of yvanzo to make a step forward on eliminating spam in MusicBrainz.

How it started

I started looking for cool projects to apply to for GSoC. After going through some on the web development side, I finally got to know about the MetaBrainz Foundation. It was already pretty late by then (around 2½ weeks before the proposal deadline), and most of my fellow GSoCers were already in good rapport with the community. After looking through the project ideas, I wanted to do my project on CritiqueBrainz, but later I found out that it was not being considered this year. In the end, I liked the concept of SpamBrainz and how it involves a good combination of technologies (Deep Learning and Web Development). After browsing through the project, I understood what I could, made some changes to the codebase, and was able to run the model and add some documentation. Finally, I submitted the proposal, which got accepted.

The proposal

My proposal was focused on extending the work done by Leo as part of GSoC 2018. It mainly involved the following:

  • Do the research and implement online learning to:
    • Update the model dynamically as new variations of editor spam accounts appear.
    • Make the model self-sufficient without depending on a particular file or a batch of data.
    • Explore different types of learning that are applicable to enhance LodBrok and give better performance in production.
  • Complete SpamBrainz API to:
    • Use and update the model with API calls.
    • Connect LodBrok with MusicBrainz Server.
  • Write detailed documentation to make the project more public and involve more contributors.
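
To make the idea of online learning a bit more concrete, here is a minimal sketch in plain Python. It is not LodBrok (which is a Keras model) and not the code from this project; it is a tiny logistic regression that takes one gradient step per incoming example, so it never needs a stored file or batch of data. The class name, the two toy features, and the generated data are all hypothetical and only there for illustration.

```python
import math
import random


class OnlineLogisticRegression:
    """Tiny binary classifier updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        """Return the estimated probability that the example is spam."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        # Clamp z so math.exp cannot overflow on extreme inputs.
        z = max(-30.0, min(30.0, z))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, features, label):
        """Take one gradient step on a single (features, label) pair."""
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def make_example(rng):
    """Hypothetical toy data: [has_url_in_bio, bio_length_scaled]."""
    is_spam = rng.random() < 0.5
    has_url = 1.0 if rng.random() < (0.9 if is_spam else 0.1) else 0.0
    bio_length = rng.random() * (1.0 if is_spam else 0.3)
    return [has_url, bio_length], 1 if is_spam else 0


rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)

# Examples arrive one by one; the model never sees a stored batch.
for _ in range(2000):
    features, label = make_example(rng)
    model.partial_fit(features, label)

spam_like = model.predict_proba([1.0, 0.9])
ham_like = model.predict_proba([0.0, 0.1])
print(f"spam-like: {spam_like:.2f}, ham-like: {ham_like:.2f}")
```

The point of the sketch is the `partial_fit` step: each new labelled account nudges the existing weights, so new variations of spam can be absorbed without retraining from scratch.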

Achievements

LodBrok model improvements

Research for model live update

SpamBrainz API

  • Incorporated the above research in SpamBrainz API, which consists of 2 endpoints, namely:
    • /predict to return classification results by LodBrok for the editor accounts
    • /train to retrain the model on the incorrect results that were sent to SpamNinja
  • After discussing with Leo, I decided to implement the API using a combination of Flask and Redis. Choosing Redis over RabbitMQ is feasible because the API is pretty lightweight and has at most 2 events.
  • Documented the entire API, with internal working, steps to replicate, and images to understand the results obtained.
  • Completed dockerization of SpamBrainz_API for easier integration and testing with MusicBrainz docker.
  • This diagram explains the current workflow of the implemented API: [diagram of the current workflow of the implemented API]
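
As a rough illustration of the two-event workflow described above, here is a small in-memory sketch. It is an assumption-laden stand-in, not the actual SpamBrainz API: a `deque` plays the role of the Redis queue, plain functions play the role of the Flask handlers, `toy_model` is a hypothetical placeholder for LodBrok, and the field names (`id`, `bio`, `verdict`) are made up for the example.

```python
import json
from collections import deque

# Stand-in for a Redis list; a real deployment would push to and pop from Redis.
queue = deque()


def enqueue(event, editor, verdict=None):
    """Serialize a request the way an HTTP handler might before queuing it."""
    if event not in ("predict", "train"):
        raise ValueError(f"unknown event: {event}")
    payload = {"event": event, "editor": editor}
    if event == "train":
        # verdict is the human-corrected label for a misclassified editor.
        payload["verdict"] = verdict
    queue.append(json.dumps(payload))


def toy_model(editor):
    """Hypothetical placeholder for the classifier, not LodBrok."""
    return "spam" if "http" in editor.get("bio", "") else "ham"


def process_next(corrections):
    """Pop one queued message and handle it according to its event type."""
    message = json.loads(queue.popleft())
    editor = message["editor"]
    if message["event"] == "predict":
        return {"editor_id": editor["id"], "result": toy_model(editor)}
    # "train": record the correction so the model can be updated with it.
    corrections.append((editor, message["verdict"]))
    return {"editor_id": editor["id"], "result": "queued for retraining"}


corrections = []
enqueue("predict", {"id": 1, "bio": "Buy followers at http://example.com"})
enqueue("train", {"id": 2, "bio": "I collect jazz records"}, verdict="ham")

first = process_next(corrections)
second = process_next(corrections)
print(first, second, corrections)
```

With only two message types to route, a simple list-style queue like this is enough, which is the same reasoning behind preferring a lightweight queue over a full message broker.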

Challenges ahead and future of SpamBrainz

  • The API has to be integrated with MusicBrainz and should undergo more testing with real live data. Currently, my focus is on this part.
    • Note: All the work done so far on the model used dummy data generated by scripts that replicate real accounts as closely as they can, taking into account the inputs from Freso, yvanzo, and the analysis done by Leo, without compromising the data privacy policy.
  • To extend online learning to other use cases in MetaBrainz through Transfer Learning and Online Transfer Learning.
  • I am also looking forward to writing a research paper about the work done and eventually publishing it in IEEE Transactions, as I plan on using SpamBrainz as my final year major project.

Special thanks to…

  • My mentor, YvanZo, for being incredibly patient with me, helping me create quality commits, and overall making me a better programmer. I have always learned something new in every interaction with him.
  • LeoVerto, for helping me out whenever stuck and getting me up to date with the project.
  • MetaBrainz Foundation, for creating an open, inclusive, and productive environment to build some amazing stuff.

GSoC 2018: SpamBrainz – Fighting spam in MusicBrainz using machine learning

Hi, I’m Leo and I spent my summer building and training SpamBrainz, our new solution to fighting spam in MusicBrainz. If you haven’t heard of SpamBrainz before it’s probably because it did not exist before this year’s Summer of Code.

For quite a while now, spam in MusicBrainz has been growing into a serious problem. Often this means editors are automatically created with descriptions that look not unlike the spam emails most of us get every day, promoting other websites and services.

During last year’s MetaBrainz Summit we discussed possible solutions to this and came up with the Spam Ninja system. Essentially this means that Soon™ there will be a group of editors that receive spam reports and have the ability to delete editors and entities that are nothing but spam.

Now with MusicBrainz having almost two million registered editors, could we really expect the Spam Ninjas to manually check every single one of them in addition to all the new registrations? Obviously not, and this is where SpamBrainz comes in.

SpamBrainz is a machine learning system that looks at all editors and decides whether or not it thinks they are spammers. If it thinks they are, it automatically notifies the spam ninjas who then decide whether or not SpamBrainz was correct.

What’s great about this system is that a human is guaranteed to look at any report and at no point does a computer decide that you’re a spammer and should be banned, because no one wants machines to run the world, right?

Building SpamBrainz

While most GSoC projects involve adding features to existing systems, SpamBrainz is something entirely new and I had not built anything on this scale before so I started out by doing tons of research.

When building a machine learning project you should always start by doing some good old statistics first and trying to figure out what matters about your data and how the system could use it. I wrote a couple of Jupyter notebooks (which are great for working with data) to do this.

As I was not working for MetaBrainz at the time and had to respect our privacy policy, I wrote a script to collect the most common values of a couple different editors, anonymize them and save them to a report. Using that data I could compare all spam and non-spam editors and decide upon a set of datapoints that would be useful for my machine learning model. Yvanzo then ran these on the live database and I could happily do my data analysis without compromising user privacy.

Next I built a pretty boring Flask-based API that would allow MusicBrainz to queue up editor analysis and training. Quite a few different MetaBrainz projects use Python and need to access the MusicBrainz database so a long time ago someone wise decided to move commonly used code into a repository called brainzutils-python. All I had to do was to add some code for accessing editor data through it.

In a surprise move by ruaok I was then hired by MetaBrainz as a contractor with a yearly salary of 100g of chocolate. I probably should have negotiated what kind of chocolate but what mattered most was that I could now work with user data without breaching our privacy policy.

But before I could build my Keras model I had to decide on a final set of input features and write code for preprocessing the data. Only then could I finally get started building and testing models.

The current state-of-the-art SpamBrainz model is Lodbrok, which actually turned out to work really well, reaching 99% accuracy in detecting spam while only misclassifying 0.2% of real users as spammers. Obviously the latter won’t be a problem because, after all, a Spam Ninja will still check these reports.
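
For readers who want to see how two figures like these relate to each other, here is a small sketch that computes a detection rate and a false positive rate from confusion-matrix counts. The counts are hypothetical, picked only so the arithmetic lands on 99% and 0.2%; they are not the real evaluation data. The sketch also assumes that "accuracy in detecting spam" means the share of spam accounts that get flagged.

```python
def detection_rate(true_positives, false_negatives):
    """Share of actual spam accounts that the model flags."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives, true_negatives):
    """Share of real users that the model wrongly flags as spam."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical evaluation set: 1000 spam accounts and 5000 real users.
true_positives, false_negatives = 990, 10
false_positives, true_negatives = 10, 4990

spam_detected = detection_rate(true_positives, false_negatives)
real_users_flagged = false_positive_rate(false_positives, true_negatives)

print(f"detection rate: {spam_detected:.1%}")
print(f"false positive rate: {real_users_flagged:.1%}")
```

Looking at both numbers matters because they are measured on different groups: one over spam accounts, the other over real users.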

Future outlook

Now that GSoC is over I could just disappear with all the money and leave SpamBrainz in its current state but obviously that’s not what I am planning to do.

I would like to work with zas on getting it deployed along with the Spam Ninja system, improve the code documentation, and try to tackle the remaining problem of online learning (which, as it turns out, isn’t as easy as I had thought).

With spam always evolving and spammers already moving to more sophisticated methods than just using editor biographies, I’d also look into building separate models for other entities.

After all SpamBrainz is just getting started and I’m very much looking forward to continuing our journey towards reducing the spam we all have to endure on MusicBrainz and other MetaBrainz projects.