
Novozymes Enzyme Stability Prediction

Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another.

(Image source: https://byjus.com/chemistry/protein-structure-and-levels-of-protein/)

Enzymes are proteins that act as catalysts in the chemical reactions of living organisms. The goal of this case study is to predict the thermostability of enzyme variants. The experimentally measured thermostability (melting temperature) data includes natural sequences, as well as engineered sequences with single or multiple mutations upon the natural sequences.
Understanding and accurately predicting protein stability is a fundamental problem in biotechnology. Its applications include enzyme engineering to address the world's challenges in sustainability, carbon neutrality and more. Improvements to enzyme stability could lower costs and increase the speed at which scientists iterate on concepts.

Contents

  1. Problem statement
  2. Source of data
  3. ML Problem
  4. EDA
  5. Featurization
  6. Model explanation
  7. Result evaluation
  8. Conclusion
  9. Future work
  10. References

1. Problem statement

The goal of this case study is to predict the thermostability of enzyme variants.
The attributes are the protein sequence and the pH value of a protein. The output to be predicted is the melting temperature (tm).

2. Source of data

The data is available on Kaggle from the Novozymes Enzyme Stability Prediction challenge.
The dataset consists of the following features:

  • seq_id — id of the protein sequence
  • protein_sequence — sequence of the protein
  • pH — pH value of the protein

3. ML Problem

As the target variable is a real-valued number, this is a regression problem.
The metric used for evaluating the model is the Spearman correlation coefficient between the ground truth and the predictions.
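Because the Spearman correlation is rank-based, any perfectly monotonic relationship scores 1.0 even when it is non-linear. A quick illustration with scipy.stats.spearmanr (the toy arrays below are made up for demonstration):

```python
from scipy.stats import spearmanr

# Hypothetical ground-truth melting temperatures and predictions.
# The predictions are off in absolute value but preserve the ordering.
y_true = [40.0, 45.0, 50.0, 55.0, 60.0]
y_pred = [41.0, 44.0, 52.0, 58.0, 70.0]

rho, p_value = spearmanr(y_true, y_pred)
print(rho)  # ranks agree exactly, so rho == 1.0
```

This is why Spearman suits the task: the leaderboard rewards getting the relative ordering of variant stabilities right, not the exact temperatures.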

4. EDA

EDA is the most important part of any type of statistical or mathematical modelling. As melting temperature is the only numerical variable in this case, its distribution is observed.

Handling missing values

df_train.isna().sum()          # count missing values per column
df_train.dropna(inplace=True)  # drop rows with nulls from train
df_test.dropna(inplace=True)   # and from test

The rows containing null values are simply dropped.

Melting temperature

sns.distplot(df_train['tm'])

It is clear from the above PDF plot that the melting temperature follows an approximately normal distribution, with a mean melting temperature of about 50.

5. Featurization

Protein sequence length

df_train["protein_sequence_len"] = df_train["protein_sequence"].apply(lambda x: len(x))
df_test["protein_sequence_len"] = df_test["protein_sequence"].apply(lambda x: len(x))

A protein sequence length feature is computed as the length of the protein sequence, since the length can be useful information for predicting the melting temperature.

sns.distplot(df_train['protein_sequence_len'])

Most of the protein sequences have a length of ≤ 221. Hence, the dataset is filtered to sequences of length ≤ 221:

df_train = df_train[df_train["protein_sequence_len"] <= 221]

Protein sequences

As a protein sequence is a sequence of amino-acid letters, a simple way to encode it is with a label encoder.

sequences = [list(string) for string in df_train["protein_sequence"].values.tolist()]
train_sequences = pd.DataFrame(sequences)

aminos = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', None]

le = LabelEncoder()
le.fit(aminos)

train_sequences_tr = train_sequences.apply(le.transform)
train_sequences_tr["tm"] = df_train["tm"]

The label encoder converts each protein sequence into an array of numbers: each amino-acid letter in a sequence is mapped to a unique integer, and in this way the sequences are featurized.
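As a minimal, self-contained illustration of this encoding (using the 20 standard amino-acid letters; the short sequence "MKV" is made up for the example):

```python
from sklearn.preprocessing import LabelEncoder

# The 20 standard amino-acid letters, in alphabetical order.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")

le = LabelEncoder()
le.fit(amino_acids)

# Each residue letter maps to a fixed integer code
# (classes are sorted, so 'A' -> 0, 'C' -> 1, ..., 'Y' -> 19).
encoded = le.transform(list("MKV"))
print(encoded)  # → [10  8 17]
```

Note that this mapping is ordinal: the model sees 'C' (1) as "closer" to 'A' (0) than 'Y' (19) is, which is a modelling simplification rather than a biochemical fact.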

6. Model explanation

I decided to use an extreme gradient boosting (XGBoost) regressor for this problem, with hyperparameter tuning for max_depth and n_estimators.

X = train_sequences_tr.drop(['tm'], axis=1)
y = train_sequences_tr['tm'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgbreg = XGBRegressor()
parameters = {'max_depth':[1, 5, 10, 50], 'n_estimators':[5, 10, 100, 500]}
clf = GridSearchCV(xgbreg, parameters, cv=3, return_train_score=True)
clf.fit(X_train, y_train)

After hyperparameter tuning, the best parameters obtained are max_depth=5 and n_estimators=100.

Fitting the model with these hyperparameters:

model = XGBRegressor(n_estimators=100, max_depth=5)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

7. Result evaluation

As per the evaluation metric, the ground truth and predicted values are compared using the Spearman correlation coefficient.



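The evaluation step can be sketched with scipy.stats.spearmanr; the arrays below are synthetic stand-ins for the y_test and y_pred produced by the model above, so that the snippet runs on its own:

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-ins for y_test / y_pred from the fitted model,
# used only so this snippet is self-contained.
rng = np.random.default_rng(42)
y_test = rng.uniform(30, 70, size=200)            # "ground-truth" tm values
y_pred = y_test + rng.normal(0, 8, size=200)      # noisy "predictions"

rho, _ = spearmanr(y_test, y_pred)
print(round(rho, 3))
```

In the actual notebook, y_test and y_pred would come from the train/test split and model.predict above.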

The Spearman correlation coefficient obtained between the ground truth and the predicted values is 0.436.

8. Conclusion

The Spearman correlation coefficient obtained between the ground truth and the predicted values is 0.436, which can be considered a reasonable value for the initial stage of the model.
To achieve higher model accuracy, the model would need to be trained with PDB (Protein Data Bank) files, which increases the data size to a great extent and is beyond the scope of a Google Colab notebook.

9. Future work

To achieve higher model accuracy, the model can be trained with PDB (Protein Data Bank) data. As this greatly increases the size of the data, the training should be run on a cloud platform such as GCP or AWS.