About The Project
Project Overview
This project presents a machine learning tool that predicts the solubility of a protein from its amino acid sequence or UniProt ID. Given either input, the model analyzes key features such as amino acid composition and sequence length to predict whether the protein will be soluble or insoluble when expressed in an *E. coli* host system. The output provides a clear prediction, a probability score, detailed visualizations of amino acid composition, and an interactive 3D model of the protein's structure.
Our Motivation & Guidance
The inspiration for this project originated from an academic challenge to explore the implications of DeepMind's revolutionary AlphaFold. Guided by a professor, we were encouraged to apply similar predictive principles to another critical problem in biotechnology: protein solubility. Accurately predicting whether a protein will be soluble can save countless hours and resources in the lab, accelerating research and the development of new protein-based therapeutics. This tool represents our step towards solving that challenge.
Frequently Asked Questions
Why was Random Forest chosen as the best model?
We trained and evaluated several models, including Logistic Regression, Support Vector Machines (SVM), and Random Forest. While all models performed well, Random Forest was selected as the final model because it consistently achieved the best balance of performance metrics, particularly the F1-score. The F1-score matters for this type of biological problem because it combines precision (when the model predicts "soluble", how often it is right) and recall (how many of the truly soluble proteins it finds) into a single number. Random Forest is also an "ensemble" model: it combines the predictions of many individual decision trees, which makes it robust and less prone to overfitting the training data than a single decision tree.
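To make the F1-score concrete, here is a minimal sketch of how precision, recall, and F1 relate for the "soluble" class. The function name `precision_recall_f1` and the toy labels are invented for illustration; they are not taken from the project's training code or results.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="soluble"):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: soluble proteins correctly predicted as soluble
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: insoluble proteins wrongly predicted as soluble
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: soluble proteins the model missed
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, made up purely for this example
truth = ["soluble", "soluble", "soluble", "insoluble", "insoluble", "soluble"]
preds = ["soluble", "soluble", "insoluble", "soluble", "insoluble", "soluble"]

precision, recall, f1 = precision_recall_f1(truth, preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, a model cannot score well by maximizing only one of the two quantities, which is why it is a reasonable single metric for comparing classifiers.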
What does the solubility percentage (e.g., 86% soluble) actually mean?
The percentage represents the model's confidence score, not a literal physical measurement. It's the probability calculated by the Random Forest model that the given protein belongs to the "soluble" class.
- An 86% soluble prediction means the model assigns an 86% probability to the "soluble" class: it considers the protein likely to be produced in a soluble, functional form when expressed in an *E. coli* system.
- A 40% soluble prediction (which would be classified as "insoluble") means the model assigns a 60% probability to the "insoluble" class: it leans towards the protein misfolding and forming non-functional inclusion bodies, which suggests a lower yield of active protein. This is a weaker call than the 86% case.

This percentage is a useful guide for scientists deciding which proteins to take forward for experimental validation.
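The mapping from a probability to a label can be sketched as follows. This is an illustration only: the function name `classify` and the 0.5 cut-off are assumptions (the cut-off is consistent with the 40% example being labelled "insoluble", but the actual value used by the application is not stated here).

```python
def classify(prob_soluble, threshold=0.5):
    """Turn a soluble-class probability into (label, confidence in that label).

    `threshold` is an assumed cut-off, not a documented setting of the tool.
    """
    if not 0.0 <= prob_soluble <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if prob_soluble >= threshold:
        return "soluble", prob_soluble
    # Confidence in the insoluble class is the complement of the soluble probability
    return "insoluble", 1.0 - prob_soluble


for p in (0.86, 0.40):
    label, confidence = classify(p)
    print(f"{p:.0%} soluble -> {label} ({confidence:.0%} confidence)")
```

Reporting the confidence alongside the label keeps borderline cases (probabilities near the cut-off) visible instead of hiding them behind a binary answer.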
How is the composition graph created from a UniProt ID?
The application does not query an external database such as AlphaFold to get the sequence for the graph. Instead, it uses a faster local lookup:
1. The Python server (`app.py`) starts by loading our local datasets (`Ecoli.csv` and `Human.csv`) into its memory.
2. It creates an internal, high-speed lookup table (a dictionary) that maps every UniProt ID to its corresponding amino acid sequence.
3. When you enter a UniProt ID, the server instantly finds the sequence in its memory, calculates the amino acid composition from that sequence, and then sends the data to the webpage to draw the bar graph.
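The steps above can be sketched roughly as follows. This is not the code in `app.py`: the column names `uniprot_id` and `sequence`, the helper names, and the toy ID `X00001` are all assumptions made for illustration, and an in-memory string stands in for the real CSV files.

```python
import csv
import io
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_lookup(csv_files, id_column="uniprot_id", seq_column="sequence"):
    """Map each ID to its sequence. Column names are assumed, not confirmed."""
    lookup = {}
    for handle in csv_files:
        for row in csv.DictReader(handle):
            lookup[row[id_column].strip()] = row[seq_column].strip().upper()
    return lookup


def composition(sequence):
    """Return the percentage of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy data standing in for Ecoli.csv / Human.csv; with real files you would
# pass open file handles instead, e.g. open("Ecoli.csv", newline="").
toy_csv = io.StringIO("uniprot_id,sequence\nX00001,MKKLLA\n")
lookup = build_lookup([toy_csv])

comp = composition(lookup["X00001"])
print({aa: round(pct, 1) for aa, pct in comp.items() if pct > 0})
```

Building the dictionary once at startup means each request costs a single dictionary lookup plus one pass over the sequence, with no network call.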
What is the significance of protein solubility within E. coli?
The solubility of a protein inside *E. coli* largely determines whether it is functional. The concern is not the protein's effect on the bacterium's health, but whether the protein itself can be useful.
- A soluble protein is one that has folded correctly into its proper 3D shape. In this state, it is functional and can perform its intended job (e.g., act as an enzyme). In biotechnology, producing soluble proteins is the primary goal.
- An insoluble protein is typically misfolded and non-functional. These proteins clump together inside the bacterium to form dense aggregates called "inclusion bodies." While the *E. coli* cell survives, the protein produced is useless for research or therapeutic purposes.
Why focus specifically on E. coli?
*E. coli* is one of the most widely used host organisms in biotechnology and molecular biology for producing recombinant proteins. It is well understood, grows rapidly, and is very cost-effective. Because so many new proteins are first produced in *E. coli*, a tool that predicts solubility in this specific environment is valuable to the scientific community.
What are the practical applications of this tool?
This tool can significantly streamline the protein production pipeline. Researchers can screen hundreds of potential protein candidates virtually before starting expensive and time-consuming lab experiments. This helps in:
- Drug Development: Quickly identifying therapeutic protein candidates that are more likely to be stable and manufacturable.
- Academic Research: Reducing trial and error when expressing new proteins for study.
- Industrial Biotechnology: Optimizing the production of enzymes and other proteins used in various industries.
Acknowledgements
Prof. Krishna Kumar Balaraman
We would like to express our sincere gratitude to Prof. Krishna Kumar Balaraman, from IIT Jodhpur, for giving us this project. The main purpose behind this assignment was to help us learn and explore new concepts, and it truly enhanced our understanding and practical knowledge. We are thankful for his guidance and for the opportunity to work on this meaningful learning experience.