
Pipeline Dreams: SMILES All Around as MLOps Boosts Drug Discovery

January 2024, Melbourne

 

Introduction


Six out of the top ten causes of death in low-income countries in the Global South relate to infectious diseases, yet only 10% of drugs in development target these diseases (Ersilia, 2023). That's largely because these drugs are not profitable enough for big pharmaceutical companies to develop.


Who is Ersilia?


But there is a light in the dark. Ersilia is on a mission to make science accessible to all. Their aim is to accelerate novel drug discovery in low-resourced countries by providing scientists in these areas with cutting-edge data science tools.



Avid readers might feel a flicker of recognition at the name. Ersilia is one of the many imagined cities described in Italo Calvino's Invisible Cities. In the narrative, Ersilia's inhabitants establish relationships by weaving a web of strings between the city's buildings. When the web becomes too dense, the city is abandoned and rebuilt elsewhere, with new relationships forming in a new place. The Ersilia organisation embraces a similar philosophy, but with a twist. Their goal is to weave a complex network of information, connecting data points to unravel the complexities of a disease. By doing so, they aim to leave the disease behind in a bind, moving forward to explore new horizons in scientific discovery.


How do they do this? It all starts with their extensive collection of machine learning models. They presently have over 100 models, each tailored to a unique aspect of drug discovery. Let's focus on one specific model, eos92sw, to better understand Ersilia's impact and our contributions to their mission. eos92sw evaluates SMILES (Simplified Molecular Input Line Entry System) strings, the text shorthand chemists use to describe chemical structures, which could represent potential drug ingredients. Its role is crucial: calculating the likelihood that these chemicals are toxic to humans. This early-stage analysis is invaluable, giving scientists a preliminary gauge of a compound's suitability for further development.
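To give a feel for what these inputs look like, here is a minimal illustrative sketch, not the eos92sw model itself: it validates a couple of well-known SMILES strings with the open-source RDKit library and hands them to a placeholder scoring function that stands in for a toxicity model.

```python
# Illustrative sketch only: score_toxicity() is a dummy stand-in, not the
# real eos92sw model, and the SMILES below are just well-known molecules.
from rdkit import Chem

candidate_smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",      # aspirin
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",  # caffeine
]

def score_toxicity(smiles: str) -> float:
    """Dummy placeholder for a toxicity model such as eos92sw."""
    return 0.5  # a real model would return a learned likelihood here

for smiles in candidate_smiles:
    mol = Chem.MolFromSmiles(smiles)  # returns None if the string is not valid SMILES
    if mol is None:
        print(f"{smiles}: invalid SMILES, skipping")
    else:
        print(f"{smiles}: toxicity likelihood = {score_toxicity(smiles):.2f}")
```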


With such a diverse array of models, you might wonder how scientists navigate this complexity. Indeed, the variety presents a challenge. To streamline this process, Ersilia has developed their eponymous command-line interface. This tool simplifies the interaction with their vast library of models, making it more user-friendly for scientists. The interface acts as a bridge between complex machine learning algorithms and the researchers who use them, ensuring that the focus remains on innovation and discovery, rather than on navigating technical hurdles. You can see an example of how this works in the video below.


GDI Ersilia: CLI


The Problem


Ersilia's system is not just functional; it's award-winning. Their innovative approach even caught the attention of GitHub, earning them the prestigious GitHub for Good Award in 2023. But behind this success lies a significant challenge.


Consider this: What happens when a scientist lacks the necessary computational power to run these complex models? This issue becomes even more daunting when dealing with a vast number of molecules. At the time of writing, Ersilia’s chemical reference library has over 2 million potential model inputs—a colossal data trove.


Ersilia envisioned a solution to circumvent this bottleneck. Their idea was to establish a comprehensive database, a reservoir of precalculated inferences from all their models, applied across their entire reference library. Such a resource would allow their command-line tool to first search this database for existing predictions. This approach would save scientists invaluable time and resources by providing them with ready-made calculations, only turning to live computation when necessary.
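In pseudocode terms, this is a classic lookup-before-compute pattern. Below is a minimal sketch of the idea, with an in-memory dictionary standing in for the precalculated store and a placeholder function standing in for a live model run; none of this is Ersilia's actual code, and the example value is illustrative only.

```python
# Minimal sketch of the lookup-before-compute idea. The dictionary stands in
# for the precalculated store and compute_prediction() for a live model run.
from typing import Dict, Tuple

precalculated: Dict[Tuple[str, str], float] = {
    ("eos92sw", "CC(=O)OC1=CC=CC=C1C(=O)O"): 0.12,  # illustrative value
}

def compute_prediction(model_id: str, smiles: str) -> float:
    """Placeholder for running the model live (slow and compute-hungry)."""
    raise NotImplementedError("stand-in for a live inference call")

def get_prediction(model_id: str, smiles: str) -> float:
    key = (model_id, smiles)
    if key in precalculated:
        return precalculated[key]                  # fast path: answer already exists
    result = compute_prediction(model_id, smiles)  # slow path: compute live
    precalculated[key] = result                    # remember it for next time
    return result

print(get_prediction("eos92sw", "CC(=O)OC1=CC=CC=C1C(=O)O"))  # hits the fast path
```

In the real system, a key-value database plays the role of the dictionary, so the fast path is available to every scientist rather than a single process.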


Engagement With GDI


This pivotal challenge is where our journey with Ersilia began. Recognizing the need for expertise in bringing their vision to life, Ersilia reached out to us at GDI. Leading the charge was Raul Bermejo, an experienced Data Engineer who took on the role of project manager. Raul gathered a team of data specialists from GDI, including the astute mind of Kartikey Vyas, an ML Engineer, and me, David Cole, the author of this article.


We embarked on a mission to design and implement the system Ersilia envisioned. The task was not just about technical prowess; it was about crafting a tool that could change the landscape of drug discovery in resource-limited settings.


GDI Ersilia Project Team

Our Methodology


From the outset, we were certain that a cloud-based solution was the way forward. The system needed to be robust and agile, and it had to deliver model predictions efficiently to Ersilia's global network of scientists.


However, before diving into development, we recognized the importance of thoroughly understanding Ersilia’s objectives, their existing codebase, and the nuances of drug discovery science—a field where our data expertise needed to be complemented with specific domain knowledge.


Through a series of interactive workshops, we delved deep into Ersilia’s aspirations. They needed a system that was not just fast and reliable but one that could scale gracefully with the ever-expanding pool of models, inputs, and users. We also uncovered their constraints: the system had to be low-maintenance, intuitive for future contributors, and, critically, cost-effective to operate.


Solution


It was the cost constraint that drove us to design something a little… funky. A batch inference system straddling GitHub Actions and AWS.


Ersilia benefits from unlimited complimentary GitHub Actions compute ('standard runner') minutes as part of their arrangement with GitHub. That's very handy, because running inference over the full 2 million inputs was taking 16+ hours for even the most 'basic' of models, so running this system the way we'd otherwise propose on a cloud platform would have been cost-prohibitive. As an aside, we did find some breadcrumbs as to how performance could be improved for the batch-inference use case of the Ersilia CLI (which was originally designed for only a few inputs at a time), but it was agreed to leave this for another project.

So, how else to scale? After some investigation, including a check with the GitHub support team that this wouldn't contravene their terms of service, we wrangled a way to partition the input library and have 50 GitHub standard runners compute the model inference and land it to S3… for free. The results are then written to a key-value database. That part is not free, but it was still the best option, and the estimated costs were tolerable.
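To make the partitioning idea concrete, here is a rough sketch of what each runner's job might look like. The file paths, bucket name, environment variable, and predict() stand-in are all assumptions for illustration, not Ersilia's actual implementation.

```python
# Rough sketch of one runner's job: take a 1/50th slice of the reference
# library, run inference on it, and land the results in S3. File paths, the
# bucket name, the environment variable, and predict() are all placeholders.
import os

import boto3
import pandas as pd

NUM_RUNNERS = 50  # one slice per GitHub Actions standard runner

def predict(model_id: str, smiles: str) -> float:
    """Placeholder for the actual model inference call."""
    raise NotImplementedError

def run_partition(model_id: str, runner_index: int) -> None:
    # Keep every NUM_RUNNERS-th row, offset by this runner's index.
    library = pd.read_csv("reference_library.csv")  # placeholder path
    chunk = library.iloc[runner_index::NUM_RUNNERS].copy()

    chunk["prediction"] = [predict(model_id, s) for s in chunk["smiles"]]

    # Land the partial result in S3 for the serving workflow to pick up.
    out_file = f"{model_id}_part_{runner_index:02d}.csv"
    chunk.to_csv(out_file, index=False)
    boto3.client("s3").upload_file(out_file, "ersilia-inference-landing", out_file)

if __name__ == "__main__":
    # In the real workflow the index would come from the Actions job matrix.
    run_partition("eos92sw", int(os.environ.get("RUNNER_INDEX", "0")))
```

Slicing by taking every fiftieth row keeps each runner's chunk roughly the same size, so no single machine becomes the straggler that holds up the whole batch.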


This approach leveraged existing Ersilia AWS resources, minimised cost, and allowed for scaling, easy execution, and straightforward maintenance.


GDI’s Model Inference Calculation & Serving Solution Design for Ersilia

Reading from the top left of the architecture diagram above:


1) The pipeline can be executed either through the GitHub online portal or via an API call (a sketch of the API trigger follows below). We also wrote a cron job script so that Ersilia can run a list of models over a set duration.


2) The predict-parallel.yml GitHub workflow is the top level orchestrator for the inference pipeline. It sets up 50 machines via a nested workflow (predict.yml) to receive the input library, calculate the inference, and land it to AWS S3. The orchestrating pipeline’s final job is to hand over to the ‘serving’ subcomponent.


GDI Ersilia: Model Inference Pipeline
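To illustrate step 1's API trigger, here is a minimal sketch using GitHub's workflow_dispatch REST endpoint. The owner/repo, the workflow input name, and the token handling are placeholders rather than Ersilia's exact setup.

```python
# Sketch of triggering the pipeline by API call (step 1), using GitHub's
# workflow_dispatch REST endpoint. The owner/repo, workflow input name, and
# token handling are placeholders, not Ersilia's exact setup.
import os

import requests

def trigger_pipeline(model_id: str) -> None:
    url = (
        "https://api.github.com/repos/example-org/example-repo"  # placeholder repo
        "/actions/workflows/predict-parallel.yml/dispatches"
    )
    response = requests.post(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        json={"ref": "main", "inputs": {"model-id": model_id}},  # 'model-id' input is assumed
    )
    response.raise_for_status()  # GitHub returns 204 No Content on success

if __name__ == "__main__":
    trigger_pipeline("eos92sw")
```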


3) The serve-parallel.yml workflow sets up another 50 machines, takes the predictions from the S3 landing area, and writes them to DynamoDB in AWS (a sketch of this write step follows step 4 below). Why not do it all in one workflow? Because we were constrained by other GitHub limitations, such as the 6-hour workflow run limit and the maximum of one level of workflow nesting.


4) Finally, an API Gateway and a Lambda function were set up in AWS so that scientists can query DynamoDB for the precalculated predictions (a sketch of the query path follows below).
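Fleshing out step 3, the write to DynamoDB might look roughly like the sketch below, using boto3's batch writer. The table name, key schema, and input file layout are assumptions for illustration.

```python
# Rough sketch of step 3's write to DynamoDB with boto3's batch writer.
# The table name, key schema, and input file layout are assumptions.
import boto3
import pandas as pd

def write_partition_to_dynamo(partition_file: str, model_id: str) -> None:
    table = boto3.resource("dynamodb").Table("ersilia-precalculations")  # assumed name
    results = pd.read_csv(partition_file)

    # batch_writer() buffers put_item calls and sends them in batches of up
    # to 25 items, which keeps a multi-million-row load manageable.
    with table.batch_writer() as batch:
        for row in results.itertuples():
            batch.put_item(
                Item={
                    "model_id": model_id,               # assumed partition key
                    "smiles": row.smiles,               # assumed sort key
                    "prediction": str(row.prediction),  # stored as a string
                }
            )

# Example: write_partition_to_dynamo("eos92sw_part_00.csv", "eos92sw")
```

Storing the prediction as a string sidesteps DynamoDB's requirement that numeric values be passed as Decimals rather than Python floats; the querying client can cast it back as needed.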


Querying the Inference Database
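And on the read side (step 4), the Lambda function behind the API Gateway could look something like this sketch. The table name, key attributes, and query-string contract are again assumptions, not the deployed implementation.

```python
# Sketch of a Lambda handler behind API Gateway for looking up a prediction.
# Table name, key attributes, and the query-string contract are assumptions.
import json

import boto3

TABLE = boto3.resource("dynamodb").Table("ersilia-precalculations")  # assumed name

def handler(event, context):
    # With an API Gateway proxy integration, query parameters arrive here.
    params = event.get("queryStringParameters") or {}
    key = {"model_id": params.get("model_id"), "smiles": params.get("smiles")}

    item = TABLE.get_item(Key=key).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"message": "not precalculated"})}
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```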


And voilà, a cost-light solution to calculate, store, and serve model predictions to scientists for novel drug discovery!


One aspect we didn’t tackle in this project was modifying the Ersilia CLI to query the database directly instead of computing locally. This decision was made to avoid altering a larger, more complex codebase. However, this leaves room for future collaboration where we can assist Ersilia in integrating this functionality and other enhancements we’ve proposed.


Outcome


The culmination of our journey with Ersilia was the handover of our meticulously crafted solution. We delivered our code via a dedicated GitHub repository, where it now resides among Ersilia’s impressive collection of over 216 other repositories, a testament to their extensive open-source endeavours.


But our contribution extended beyond just code. We also provided a comprehensive package including Swagger documentation, detailed requirements analysis, solution design blueprints, and a record of all our meetings. Furthermore, we offered insights and code recommendations not just for the model inference pipeline, but for other vital components of Ersilia's software ecosystem.


Working alongside Ersilia has been an enriching experience. We embarked on this project with a sense of admiration for their mission and concluded it with a deeper respect, humbled by the opportunity to contribute to such a noble cause. Our hope is that the system we developed will significantly aid scientists in their pursuit of groundbreaking drug research. Looking ahead, we eagerly anticipate future collaborations with Ersilia, continuing to support their mission of making scientific advancements accessible to all.


Author:

Dave Cole (GDI Fellow)






About GDI:

The Good Data Institute (established 2019) is a registered not-for-profit organisation (ABN: 6664087941) that aims to give not-for-profits access to data analytics (D&A) support & tools. Our mission is to be the bridge between the not-for-profit world and the world of data analytics practitioners wishing to do social good. Using D&A, we identify, share, and help implement the most effective means for growing NFP people, organisations, and their impact.



