Data & Methods - Protein Production

Dataset and Preprocessing

This project is built on a dataset of E. coli proteins. The data for model training came from two files, ecoli_train and ecoli_test, which contain protein sequences and their solubility labels. Separately, we used two reference libraries, Human.csv and Ecoli.csv, which list proteins found in humans and E. coli, respectively. Note that the 3D structure viewer displays a model only if the input UniProt ID appears in one of these two reference files, because they are used to look up the corresponding amino acid sequence. Before training, the data was cleaned and validated, which included removing duplicate entries and checking sequence integrity.
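The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the project's actual pipeline: the column names "sequence" and "solubility", the function name, and the rule that a valid sequence contains only the 20 standard residues are all assumptions made for illustration.

```python
import pandas as pd

# The 20 standard amino acids as one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_solubility_table(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise, validate, and de-duplicate a table of labelled sequences.

    Assumes hypothetical columns "sequence" (str) and "solubility" (0 or 1).
    """
    # Rows missing either the sequence or the label cannot be used.
    out = df.dropna(subset=["sequence", "solubility"]).copy()
    # Normalise case and strip stray whitespace before comparing sequences.
    out["sequence"] = out["sequence"].astype(str).str.strip().str.upper()
    # Keep only non-empty sequences made entirely of standard residues.
    is_valid = out["sequence"].map(lambda s: len(s) > 0 and set(s) <= STANDARD_AA)
    out = out[is_valid]
    # Drop repeated sequences, keeping the first occurrence.
    out = out.drop_duplicates(subset="sequence", keep="first")
    return out.reset_index(drop=True)
```

Normalising before de-duplicating matters here: two entries that differ only in case or trailing whitespace would otherwise survive as separate rows.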

Feature Engineering

A raw amino acid sequence cannot be fed directly into a machine learning model, so we converted each sequence into a numerical representation that captures its basic properties. The approach we settled on uses two simple features:

Amino Acid Composition (AAC): This is the core feature of our model. For each protein, we calculated the frequency of each of the 20 standard amino acids. This creates a 20-dimensional vector that represents the protein's overall makeup.
Sequence Length: The total number of amino acids in the protein was included as an additional, crucial feature, as longer chains can have a higher tendency to misfold.

Together, these 21 features form a simple numerical "fingerprint" for each protein, from which the model learns the patterns that distinguish soluble proteins from insoluble ones.
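As an illustration, the 21-dimensional vector (20 amino acid frequencies plus sequence length) can be computed with the standard library alone. The function name, the alphabetical residue order, and the handling of non-standard residues are assumptions of this sketch rather than details taken from the project code.

```python
from collections import Counter

# Fixed residue order so every protein maps to the same 20 AAC columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(sequence: str) -> list[float]:
    """Return 20 amino acid frequencies followed by the sequence length."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)
    # Frequency of each standard residue; residues outside the 20 are
    # counted in the length but get no column of their own.
    aac = [counts[aa] / n for aa in AMINO_ACIDS]
    return aac + [float(n)]
```

For example, featurize("AACD") yields 0.5 for A, 0.25 each for C and D, zeros for the other 17 residues, and 4.0 as the final length entry.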

Model Training and Selection

We explored several machine learning algorithms: Logistic Regression, Support Vector Machine (SVM), and Random Forest. Each model was trained on the training set and evaluated on the independent test set. The Random Forest model performed best, consistently outperforming the others on key evaluation metrics such as F1-score. Its ability to capture non-linear relationships and its relative robustness to overfitting made it our choice for the final predictive engine.
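A comparison of this kind can be sketched with scikit-learn as follows. The hyperparameters, the feature scaling step for the linear and kernel models, and the function name are assumptions for illustration; the settings actually used in the project are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(X_train, y_train, X_test, y_test, seed=0):
    """Fit each candidate on the training split and return test-set F1 scores."""
    candidates = {
        # Scaling is an assumption of this sketch: sequence length is on a
        # much larger scale than the amino acid frequencies.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "svm": make_pipeline(StandardScaler(), SVC()),
        # Tree ensembles do not need scaled inputs.
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test))
    return scores
```

The function returns a dictionary mapping each model name to its F1-score on the held-out split, so the candidates can be ranked on the same data.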

3D Structure Retrieval

To provide an interactive visualization, the application retrieves a protein's 3D structure when a valid UniProt ID is provided. Our Python server uses a two-step approach:

Primary Search (RCSB PDB): The server first queries the official Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank. This is the primary repository for experimentally determined structures.
Fallback Search (AlphaFold): If no structure is found in the RCSB PDB, the server automatically falls back to querying the AlphaFold Protein Structure Database, which contains highly accurate predicted structures.

The server fetches the raw PDB file content from the first successful source and sends it directly to the webpage, where the 3Dmol.js library renders it for viewing.
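The primary-then-fallback logic can be sketched independently of any HTTP details. In the sketch below, each source is passed in as a plain function that takes a UniProt ID and returns PDB text or None; the function names, the source labels, and the choice to treat a failed lookup as a miss are all assumptions, and the real RCSB and AlphaFold requests are deliberately left out.

```python
from typing import Callable, Optional

# A fetcher takes a UniProt ID and returns PDB-format text,
# or None if that source has no structure for the ID.
Fetcher = Callable[[str], Optional[str]]


def fetch_structure(
    uniprot_id: str, sources: list[tuple[str, Fetcher]]
) -> Optional[tuple[str, str]]:
    """Try each source in order; return (source_name, pdb_text) from the first hit."""
    for name, fetch in sources:
        try:
            pdb_text = fetch(uniprot_id)
        except Exception:
            # Assumption: a failed lookup (e.g. a network error) counts as
            # a miss, so the next source is still tried.
            pdb_text = None
        if pdb_text:
            return name, pdb_text
    # No source had a structure for this ID.
    return None
```

In the server, the list would hold an RCSB PDB fetcher first and an AlphaFold fetcher second. Returning the source name alongside the file content lets the webpage indicate whether the rendered structure is experimental or predicted.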

Tools and Technologies

This project relies on the following open-source libraries and public databases:

Scikit-learn: For implementing, training, and evaluating machine learning models.
Pandas & NumPy: For data manipulation, cleaning, and numerical operations.
Flask: For building the Python backend server that connects our model to the website.
RCSB PDB & AlphaFold Database: For retrieving the 3D structures of proteins.
3Dmol.js & Chart.js: For rendering the interactive 3D viewer and data charts on the webpage.
