Predict New Medicine

Rank: 189th out of 1952

Predicting small molecule-protein interactions for drug development

About the competition

Small molecule drugs (ligands) can interact with proteins in the human body. Such an interaction can change the structure of the protein, and that structural change can initiate a reaction cascade. The classic way to test binding affinity (how well a ligand binds to a protein) is to physically combine the two in the lab and measure the result. To revolutionize this process, Leash Biosciences created a dataset, the Big Encoded Library for Chemical Assessment (BELKA), and hosts a competition that asks engineers to use machine learning to predict protein-ligand binding.

Leash Biosciences

Leash Biosciences is a biotechnology company that is building a large dataset of protein-molecule interactions for machine learning purposes. Its catchphrase is “Unleashing machine learning to solve medicinal chemistry”.

Relevance

Cancer is one of the most common diseases globally. Chemotherapy is often applied as treatment, but it lacks target selectivity and comes with several severe side effects. What is needed is a drug delivery system in which drugs are delivered and taken up at specific target sites, improving the patient's quality of life and decreasing exposure to toxicity. An effective approach for tumor-selective drug delivery is the use of functional ligands: small molecules that interact with specific proteins, e.g. receptor proteins that are overexpressed in malignant cancer cells. Each candidate ligand has to be tested for its binding affinity with the protein. Classically this is done in the lab, which is labor-intensive and time-consuming.

To revolutionize this process, Leash Biosciences has tested 133 million ligands for binding affinity with three target proteins. The resulting dataset, BELKA, is intended to encourage AI engineers on the competition platform Kaggle to build predictive models that estimate how well previously untested chemical compounds bind to the protein targets.

Technical Details

The competition’s main challenge is predicting whether a ligand (described by a SMILES string, a text encoding of its chemical structure) binds to each of three specific proteins (sEH, HSA, BRD4). Although Leash Biosciences provides a vast dataset, it includes only ligands with a certain structure, complicating predictions for ligands with different structures.

The two key challenges are developing a model that identifies the ligand characteristics essential for protein binding, and removing training set bias so that the model generalizes beyond the training set. While model choice (CNN, GNN, transformer) impacts generalizability, experts emphasize that molecule representation is crucial. The ligand strings can be encoded into fingerprints, embeddings, atom graphs, or pharmacophore graphs, each presenting unique challenges and opportunities for model learning and generalization.
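To make the idea of a fingerprint concrete, here is a minimal sketch of how a ligand string can be turned into a fixed-length bit vector. It is a toy stand-in and not the representation used in this competition entry: real pipelines typically compute structure-aware fingerprints (for example Morgan fingerprints) with a cheminformatics toolkit such as RDKit, whereas this version only hashes character n-grams of the string. The function names, the 2048-bit length, and the example molecules are all assumptions chosen for illustration.

```python
import hashlib


def hashed_ngram_fingerprint(smiles, n_bits=2048, max_n=3):
    """Fold character n-grams of a SMILES string into a fixed-length bit vector.

    Toy illustration only: it looks at the text, not at the chemistry.
    """
    bits = [0] * n_bits
    for n in range(1, max_n + 1):
        for i in range(len(smiles) - n + 1):
            fragment = smiles[i:i + n]
            # Use a stable hash so the same fragment always sets the same bit.
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity: shared set bits divided by bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Hypothetical example molecules written as SMILES strings.
ethanol = hashed_ngram_fingerprint("CCO")
propanol = hashed_ngram_fingerprint("CCCO")
benzene = hashed_ngram_fingerprint("c1ccccc1")

print(tanimoto(ethanol, propanol))  # similar strings share most of their bits
print(tanimoto(ethanol, benzene))   # dissimilar strings share few or none
```

Every molecule ends up as a vector of the same length, which is what lets a standard classifier consume it. The trade-off is that hashing discards information, which is one reason the choice of representation matters so much for generalization.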

UN Sustainable Development Goals

The competition was chosen based on the UN Sustainable Development Goals. It links to goal #3: Good health and well-being. Using machine learning to search for the right ligand among the roughly 10^60 chemicals in the drug-like space could change the way medicine is developed and speed up the process of curing diseases.

Link

Competition Page

Code
