GSoC 2023: Reviving the BookBrainz importer project

Hi, I am David Kellner (aka kellnerd), an electrical engineering student from Germany, who has finally found the time to participate in Summer of Code after four years of contributing data, bug reports and a bit of code to MetaBrainz projects (mostly to MusicBrainz and related tools such as userscripts).

Although I had mostly worked with MusicBrainz until then, I decided to apply for the BookBrainz importer project as I was already familiar with the underlying JavaScript technology and saw the huge potential of the idea to transform and import external datasets into the cleverly designed BookBrainz database schema. My proposed project was accepted by the MetaBrainz team and I have been working on it for the last six months under the mentorship of monkey.

This post gives an overview of my GSoC project and the challenges I encountered during this summer.

What is the purpose of this project?

BookBrainz still has a relatively small community and contains fewer entities than comparable databases. Therefore we want to provide a way to import existing collections of library records into the database while still ensuring that they meet BookBrainz’ high data quality standards.

Thanks to a previous GSoC project in 2018, the database schema already contains additional tables for this purpose, where pending imports await a user’s approval before becoming fully accepted entities in the database.

The project will require processing very large data dumps (e.g. MARC records or JSON files) in a robust way and transforming entities from one database schema to the BookBrainz schema. Additionally the whole process should be repeatable without creating duplicate entries.

Overview of the existing infrastructure

Before I tell you more about the import process, let us agree on common terms for the different states of the imported data:

External Entity: An entity which has been extracted from a certain external data source (that is a database dump or an API response) in its original format.

Parsed Entity: A JSON representation of the external entity which is compatible with the BookBrainz ORM. The JSON may also contain additional data which can not be represented within the current BookBrainz schema.

Pending Entity: The entity has been imported into the BookBrainz database schema, but has not yet been approved (or discarded) by a user. The additional data of the parsed entity is kept in a freeform JSON column.

Accepted Entity: The imported entity has been accepted by a user and now has the same status as a regularly added entity.
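
To make these terms more concrete, here is a rough sketch of what a parsed entity could look like as a TypeScript type. The property names are only illustrative and do not match the importer's actual message format exactly.

```typescript
// Illustrative sketch only; the real message format of the importer differs in detail.
interface ParsedEntity {
	// Which BookBrainz entity type the record maps to.
	entityType: 'Author' | 'Work' | 'EditionGroup' | 'Edition' | 'Publisher';
	// Identifies the external data source and the record within it.
	source: string; // e.g. 'OpenLibrary'
	externalId: string; // e.g. 'OL26320A'
	// Data which already maps onto the regular BookBrainz schema.
	names: Array<{ name: string; languageCode?: string; primary?: boolean }>;
	// Anything that has no place in the current schema yet is kept as freeform JSON.
	additionalData?: Record<string, unknown>;
}
```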

The software architecture consists of two separate types of services, connected by a RabbitMQ message queue which acts as a processing queue:

  1. Producer: Extracts external entities from a database dump (or an API) and emits parsed entities which are inserted into the queue. Each external data source has its own specific producer implementation, although parts of the code can be shared. Since the parsed entities are queued for insertion into the database, multiple producer instances can run in parallel.

  2. Consumer: Takes parsed entities from the queue and validates them before they are passed to the ORM which inserts them into the database as a pending entity.

The processing queue acts as a buffer to make the fast parsing logic independent of the slow database insertion process. It also allows the consumer to be stopped or interrupted without losing data, so the import process can be resumed at any time.
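
To illustrate the producer side of this architecture, here is a minimal sketch which publishes one parsed entity to a RabbitMQ queue using the amqplib package. The queue name, connection URL and message shape are placeholders rather than the importer's actual configuration.

```typescript
import amqp from 'amqplib';

// Minimal producer sketch: queue one parsed entity for the consumer.
// 'bookbrainz-import' and the connection URL are placeholders.
async function queueParsedEntity(parsedEntity: object) {
	const connection = await amqp.connect('amqp://localhost');
	const channel = await connection.createChannel();
	await channel.assertQueue('bookbrainz-import', { durable: true });

	// Persistent messages survive a broker restart, so no parsed entities are lost
	// if the consumer is stopped or interrupted before it processes them.
	channel.sendToQueue(
		'bookbrainz-import',
		Buffer.from(JSON.stringify(parsedEntity)),
		{ persistent: true }
	);

	await channel.close();
	await connection.close();
}
```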

Import process of an entity

Project goals

Almost five years have passed since the initial implementation of the importer project for GSoC 2018. Since BookBrainz has evolved in the meantime, the importer is no longer compatible with the latest database schema. For example, entity type names were changed in 2019 from Creator to Author and from Publication to Edition Group, so the remaining occurrences in the importer code have to be renamed as well.

In my proposal I had identified the shortcomings of the implementation from 2018 and outlined the following steps towards a fully featured importer:

  1. Update existing importer infrastructure from 2018 to be compatible with the latest BookBrainz database schema
  2. Test the infrastructure by finishing the OpenLibrary producer (also from GSoC 2018)
  3. Update database schema to support relationships between pending entities (and author credits!)
  4. Resolve an entity’s external identifiers (for example an OpenLibrary ID) to a BBID in order to create relationships between pending entities
  5. Add a queue consumer option to update already imported entities when an import process is repeated (with updated source data or an improved producer)

Back at the beginning of this summer, I had imagined the remainder of this blog post showcasing and praising the new features of the importer project, with lots of graphics. But the reality of working on a project which was discontinued five years ago proved to be different: the more weeks of coding passed, the less confident I became about reaching all of these goals. Here comes the story of a project which turned out completely different from what I expected.

Dusting off the old project: No sign of life

Since the importer project was developed with the Node.js runtime, my Summer of Code project started with the uninspiring task of updating npm package dependencies which had last been touched back in 2018. Unfortunately that was not enough to get the existing OpenLibrary producer script to run and fill the RabbitMQ import queue which I had set up.

So I jumped through various other hoops to get the old code to run: a bit of conversion from Flow type definitions to TypeScript (to which BookBrainz had migrated in the meantime), changing the Babel transpilation setup to output modern ESM instead of Node.js’ old CJS modules (because newer versions of a dependency required that) and migrating to the latest, error-free version of a logging library. All of that only to realize that the producer script just logged a few lines before hanging indefinitely (instead of at least throwing an error at me).

To ease debugging of that mess, I wanted to skip the separate transpilation step by using a command like ts-node which executes TypeScript directly, only to find out that it no longer works with the current version (v20) of Node.js. (Neither does the alternative babel-node.) After that disappointing experience I finally gave up on that approach, and with a few minor changes to the code I managed to get the project running in Deno, which supports TypeScript out of the box. The producer still did not work as expected, but development iterations became a lot smoother.

After having wasted multiple days debugging this (by commenting out various combinations of lines where I suspected the bug to hide), I found the real culprit: the import statement for the logging library, which caused side effects severe enough to make the RabbitMQ library hang. As a temporary solution I replaced the logger module’s export with a simple alias for the standard console object (whose methods have the same names info, error etc.), which works as a drop-in replacement. But I kept the module with the logger setup to make it easier to switch back to a proper logging library later. More about that later in this post. Now the producer script was no longer hanging, but there were still some bugs to fix.
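
For the curious, the temporary workaround boiled down to something like this (simplified; the real module also kept the original logger setup around so it could be restored later):

```typescript
// logger.ts: simplified sketch of the temporary workaround.
// The standard console object already provides debug(), info(), warn() and error(),
// so it can serve as a drop-in replacement until the real logging library works again.
const log = console;

export default log;
```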

BBIQ: Management CLI for the BookBrainz import queue

After these experiences, I no longer tried to fix things with only minimal changes to the existing code base. Since the old import queue implementation still was not working and contained a lot of boilerplate which made the code hard to understand, I decided to rewrite it from scratch. As a result, I finally got the OpenLibrary producer running and, encouraged by this, I also rewrote the top-level module of the consumer application.

The new consumer command line interface (CLI) is called BBIQ (BookBrainz Import Queue) and also has an option to put invalid or otherwise problematic messages into a separate failure queue. Making the default names of these two queues configurable allows us to easily run a new consumer instance which deals with the remaining messages, without having to drop them as in the initial implementation (negative acknowledgment of messages would cause an infinite loop). Combined with a CLI option to purge the import queue, we now have a flexible tool for the most important queue management tasks, one which the term “consumer” no longer properly describes.
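
The sketch below shows the general idea behind the failure queue: problematic messages are copied to a second queue and acknowledged, instead of being negatively acknowledged (which would requeue them forever). Queue names and the validation step are placeholders, not BBIQ's actual implementation.

```typescript
import amqp from 'amqplib';

// Placeholder for validation and insertion of a parsed entity via the ORM.
async function validateAndImport(parsedEntity: unknown): Promise<void> {
	// ... validate against the BookBrainz schema and insert as a pending entity ...
}

// Consume parsed entities; move invalid ones to a failure queue instead of dropping them.
async function consumeImportQueue(
	importQueue = 'bookbrainz-import',
	failureQueue = 'bookbrainz-import-failures'
) {
	const connection = await amqp.connect('amqp://localhost');
	const channel = await connection.createChannel();
	await channel.assertQueue(importQueue, { durable: true });
	await channel.assertQueue(failureQueue, { durable: true });

	await channel.consume(importQueue, async (message) => {
		if (!message) return;
		try {
			await validateAndImport(JSON.parse(message.content.toString()));
		} catch (error) {
			// Keep the problematic message for a later run instead of nack-ing it,
			// which would put it right back into the import queue in an endless loop.
			channel.sendToQueue(failureQueue, message.content, { persistent: true });
		}
		// Acknowledge in both cases so the message leaves the import queue.
		channel.ack(message);
	});
}
```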

Integrated usage instructions for the BBIQ CLI, as shown by bbiq --help

After having completed that task, I documented how to use the refreshed OpenLibrary producer and the new BBIQ CLI. Although some queued entities still caused the consumer to skip them with SQL transaction errors, it finally felt like the right time to create my first “feature-complete” pull request: bookbrainz-utils#40

TypeScript-ification of the project (and the ORM, while we are at it)

Since the code still passes lots of data objects through deep function hierarchies, and it is hard to know (or even remember) which properties the passed parameters are expected to have at each stage, I gradually started to introduce more TypeScript type annotations. Creating type definitions for queued entities alone already revealed quite a few small bugs in the OpenLibrary producer.

As many of the functions which deal with the database are located in the separate bookbrainz-data ORM package, I also had to improve the type definitions over there. Once I had started with the conversion to TS, I did a full first pass over the whole ORM, trying to achieve more consistency. (Since BookBrainz was not originally a TS project, most types were introduced later and had been defined as needed, close to where they were used.) The next pass would involve heavier changes to the way our Bookshelf ORM models are defined, so I was not going to do that during GSoC.

TypeScript migration pull requests: bookbrainz-data#309 (ORM), bookbrainz-site#1027 (website), bookbrainz-utils#41 (importer, also deletes lots of old boilerplate code which is now unused)

Eureka: The first imports made their way into the database

After having added all of these type definitions, I had a much more thorough understanding of which data properties are passed into which function and which are expected. In the end, one of the major bugs of the entity consumer implementation boiled down to a wrongly expected type: the code expected a numeric ID, but received an object which contained the ID (for example 42 versus {id: 42}). Of course it was hiding close to the low-level Knex.js SQL query builder code and not in the higher-level ORM functions where I had previously been adding all of these type annotations.
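
In heavily simplified form, the mismatch looked roughly like this (all names are made up, only the shape of the bug is real), and it is exactly the kind of mistake that type annotations turn into a compile-time error:

```typescript
// Made-up names for illustration; only the shape of the bug is real.
interface EntityRow { id: number; }

// Expects a plain numeric ID which ends up in a Knex.js query.
function addIdentifierFor(entityId: number) { /* ... */ }

const pendingEntity: EntityRow = { id: 42 };

// addIdentifierFor(pendingEntity);  // type error: 'EntityRow' is not assignable to 'number'
addIdentifierFor(pendingEntity.id);  // correct: passes 42 instead of {id: 42}
```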

With that fixed, the BBIQ import queue setup and the basics of the import process were finally working, and the first few test authors from OpenLibrary were successfully imported. The new errors which popped up about missing languages in the DB looked much more reasonable and pointed to a problem in how the OpenLibrary producer assigns guessed languages to alias names. Additionally, there were still a few undiscovered errors hiding in the OpenLibrary producer and the entity validation code which I had to fix.

At this point I also decided that better logging output was crucial going forward, because it was hard to find the cause of logged errors: the logs were uncomfortable to read, with too much irrelevant or redundant information, while important details like the file and line number where an error occurred were missing. So I revisited the issue with the logging library which had caused the entire script to hang a few months earlier. It turned out that I only had to remove a single line of config (which was incompatible with the new version of the library) to be able to use it again instead of my temporary console drop-in replacement…

Now it was easy to customize the console output with timestamps and highlight colors for the different log levels, and to filter the output by log level if necessary. After making the formatting of the messages and the usage of the different log levels more consistent, I also introduced stack traces for errors, which make it much easier to trace bugs like the one mentioned above.
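
Purely as an illustration of this kind of customization (the importer configures an actual logging library rather than rolling its own), a hand-rolled equivalent could look like this:

```typescript
// Generic illustration only; the importer uses a proper logging library instead.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

const levelOrder: LogLevel[] = ['debug', 'info', 'warn', 'error'];
const minimumLevel: LogLevel = 'info'; // filter output by log level

function log(level: LogLevel, message: string, error?: Error) {
	if (levelOrder.indexOf(level) < levelOrder.indexOf(minimumLevel)) return;
	const timestamp = new Date().toISOString();
	console[level](`[${timestamp}] ${level.toUpperCase()}: ${message}`);
	// Stack traces make it much easier to locate where an error was thrown.
	if (error?.stack) {
		console[level](error.stack);
	}
}

log('info', 'Author imported as pending entity');
log('error', 'Failed to import entity', new Error('Unknown language code'));
```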

Pull request: bookbrainz-utils#42

Making imports repeatable

Now that the BBIQ import queue setup and the basics of the import process are finally working, I wanted to dive into improving the quality of the processed data. But wait, the time is almost up… and nearly all of the originally proposed new features of the importer are still missing. While the OpenLibrary producer and the import queue consumer are working, the data quality still sucks and will not greatly improve without support for relationships. What is the point of importing lots of Work entities which cannot be associated with their Author?

So the one crucial feature which at least has to be finished before the GSoC submission deadline is the ability to repeat imports without creating tons of duplicates. This allows us to run a full test import regardless of the current shortcomings and simply repeat it once the producer or the import schema have received significant updates, for example relationship support.

In order to do this, we have to find existing imports by their source (for example OpenLibrary) and the external ID of the entity (its OpenLibrary ID). Then we can assign the freshly parsed entity data to this existing import instead of creating a new one (or causing an SQL conflict). Adding support for this use case wraps up the reduced goals of my GSoC project.
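
A minimal sketch of that lookup, using Knex.js with hypothetical table and column names (the real implementation lives in the bookbrainz-data ORM and works with the actual import schema):

```typescript
import type { Knex } from 'knex';

// Hypothetical table and column names; the real import schema differs.
async function upsertPendingImport(
	db: Knex,
	source: string,
	externalId: string,
	entityData: object
): Promise<number> {
	// Find an existing import by its data source and the entity's external ID.
	const existing = await db('import_origin')
		.where({ source, external_id: externalId })
		.first();

	if (existing) {
		// Repeated run: attach the freshly parsed data to the existing pending import.
		await db('pending_entity')
			.where({ import_id: existing.import_id })
			.update({ data: JSON.stringify(entityData) });
		return existing.import_id;
	}

	// First run: create a new pending import and remember where it came from.
	const [inserted] = await db('pending_entity')
		.insert({ data: JSON.stringify(entityData) })
		.returning('import_id');
	await db('import_origin').insert({
		source,
		external_id: externalId,
		import_id: inserted.import_id,
	});
	return inserted.import_id;
}
```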

Pull request: bookbrainz-data#314

Achievements and outlook

After six months of work, I have managed to migrate the importer project to TypeScript, adapt it to work with the latest BookBrainz database schema, fix various bugs and bring the code into better shape in general. The scope of the TypeScript migration and related refactoring was not limited to the importer project; the BookBrainz ORM library has also benefited.

With the BBIQ CLI we now have a robust import queue management tool which gracefully handles errors and puts failed imports into a separate queue. Together with the ability to specify the names of both the input and the failure queue, this makes the infrastructure very flexible and allows us to select a pair of queues for each import process.

Combined with the improved OpenLibrary producer, it is possible to fill the import queue with a full dump of all OpenLibrary authors and have them imported as pending entities. (Currently only the parsing of authors is sufficiently complete and tested, while works are untested and editions are not yet supported.)

The import process of a dump can also be repeated without creating duplicates of entities which have already been imported previously. This is useful when updated data dumps are published or when the data mapping of the producer implementation has been improved. Existing imports are detected by the consumer and pending entities can be updated with the new data (while already accepted entities will be skipped – at least for now).

Importing an OpenLibrary sample dump: The first author already existed and gets updated, the second is new

Speaking of accepting entities, the user interface for this feature is one of the missing pieces which still has to be migrated to work with the 2023 BookBrainz website. Most of the previously created interface for it will probably become obsolete once support for relationships between (pending and accepted!) entities is implemented. Relationships will make the pages for pending entities and regular (accepted) entities much more similar, as both types of entities will then have to deal with relationships to entities of the other type.

Since relationships are the core feature of BookBrainz, this will open the door to many other improvements of the OpenLibrary to BookBrainz data mapping, for example support for author credits and publishers of editions. That will also be the time to implement more producers for other data sources such as Bookogs or MARC records from national libraries…

Later versions of the consumer should also leave fewer orphaned database rows behind when they update pending entities. If we implement an algorithm to determine the differences between parsed entities and pending or accepted entities for this, we can also create pending updates for accepted entities (which are already revisioned and might have been changed since their import).

Final words

Summer of Code and this project in particular have been a fantastic opportunity. Although I already had advanced coding knowledge, I became more confident in dealing with larger code bases and the tools necessary to manage them (for example Docker images). The GSoC stipend also gave me an excuse to invest a lot of time into an open source software project, which would otherwise have been hard to justify to my family and friends. So thank you MetaBrainz and Google for making this possible!

One of the most challenging parts of this project was dealing with unforeseen problems in existing code, trying to stay on track with my schedule, and managing my own expectations once that proved to be impossible. Although I could not complete all the tasks from my initial timeline in time due to problems with the old code from 2018, there was no pressure. Most of the time I was able to work autonomously and there were no questions regarding my progress outside the weekly meetings, but whenever I had doubts or needed help, my mentor was there to answer my questions and assure me that I was still on a good path. Therefore a huge shout-out goes to monkey; thank you also for reviewing and testing my code. I am definitely looking forward to continuing to work on the project, the best parts are yet to come!

All my pull requests for MetaBrainz repositories in 2023: Before GSoC and during the GSoC coding period

