Clinical Trial Outcome Prediction

Part 2: Predicting clinical trial outcomes using XGBoost

In the first part of this series, I focused on embedding multi-modal real-world data derived from ClinicalTrials.gov. In this article, I will implement a basic XGBoost model, train it on the embeddings we created in Part 1, and compare its performance to that of the HINT model (a hierarchical graph neural network), which inspired this project.

Workflow schematic (image by author)

These are the steps I will follow in this article:

1. Load training, validation, and test datasets
2. Embed drug molecule(s), inclusion/exclusion criteria, disease indication(s), trial sponsor, and number of participants
3. Define evaluation metrics
4. Train XGBoost model and briefly compare with HINT model performance

Focus for Part 2 of this series: Predicting clinical trial outcomes based on feature embeddings created in Part 1 (image by author)

You can follow all the steps in this Jupyter notebook: Clinical trial embedding tutorial.

Load training, validation, and test datasets

import os
import pandas as pd
import numpy as np
import pickle

# Import toy dataset
toy_df = pd.read_pickle('data/toy_df_full.pkl')

train_df = toy_df[toy_df['split'] == 'train']
val_df = toy_df[toy_df['split'] == 'valid']
test_df = toy_df[toy_df['split'] == 'test']

y_train = train_df['label']
y_val = val_df['label']
y_test = test_df['label']

print(train_df.shape, val_df.shape, test_df.shape)
print(y_train.shape, y_val.shape, y_test.shape)

### Output:
# (1028, 14) (146, 14) (295, 14)
# (1028,) (146,) (295,)

Embed drug molecule, protocol, indication, and trial sponsor

In this section we load the dictionaries we created in Part 1 of this series and use them to map the values in the train, validation, and test sets into their respective embeddings.
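To make the mapping step concrete, here is a minimal sketch of the lookup pattern on toy data. The trial IDs, the 3-dimensional vectors, and the names `toy_trials` and `toy_embedding_dict` are made up for illustration; the real dictionaries come from Part 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows referring to two trials.
toy_trials = pd.DataFrame({"nctid": ["NCT00000001", "NCT00000002", "NCT00000001"]})

# Hypothetical embedding dictionary: NCT ID -> fixed-length vector.
toy_embedding_dict = {
    "NCT00000001": np.array([0.1, 0.2, 0.3]),
    "NCT00000002": np.array([0.4, 0.5, 0.6]),
}

# IDs missing from the dictionary would map to NaN and make np.stack fail,
# so it is worth checking coverage before stacking.
missing_ids = set(toy_trials["nctid"]) - set(toy_embedding_dict)

# Series.map looks up one vector per row; np.stack combines the vectors
# into a single (n_rows, embedding_dim) matrix.
toy_matrix = np.stack(toy_trials["nctid"].map(toy_embedding_dict).to_list())

print("missing IDs:", missing_ids)
print("embedding matrix shape:", toy_matrix.shape)
```

Each row of the resulting matrix lines up with one row of the input DataFrame, which is what allows the per-feature matrices to be concatenated column-wise later.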

# The load_*_dict() helpers are not defined in this article;
# they load the dictionaries created in Part 1.
def embed_all(df):
    print('input shape: ', df.shape)

    # Drug molecules, protocols, and disease indications are keyed by NCT ID
    print('embedding drug molecules..')
    nctid2molecule_embedding_dict = load_nctid2molecule_embedding_dict()
    h_m = np.stack(df['nctid'].map(nctid2molecule_embedding_dict))
    print(f"drug molecules successfully embedded into {h_m.shape} dimensions")

    print('embedding protocols..')
    nctid2protocol_embedding_dict = load_nctid2protocol_embedding_dict()
    h_p = np.stack(df['nctid'].map(nctid2protocol_embedding_dict))
    print(f"protocols successfully embedded into {h_p.shape} dimensions")

    print('embedding disease indications..')
    nctid2disease_embedding_dict = load_nctid2disease_embedding_dict()
    h_d = np.stack(df['nctid'].map(nctid2disease_embedding_dict))
    print(f"disease indications successfully embedded into {h_d.shape} dimensions")

    # Sponsors are keyed by the lead sponsor name
    print('embedding sponsors..')
    sponsor2embedding_dict = load_sponsor2embedding_dict()
    h_s = np.stack(df['lead_sponsor'].map(sponsor2embedding_dict))
    print(f"sponsors successfully embedded into {h_s.shape} dimensions")

    # Enrollment: coerce to numeric, fill gaps with the median, then z-score
    print('normalizing enrollment numbers..')
    enrollment = pd.to_numeric(df['enrollment'], errors='coerce')
    if enrollment.isna().sum() != 0:
        print(f"filling {enrollment.isna().sum()} NaNs with median value")
        enrollment = enrollment.fillna(int(enrollment.median()))
        print(f"successfully filled NaNs with median value: {enrollment.isna().sum()} NaNs left")
    enrollment = enrollment.astype(int)
    h_e = np.array((enrollment - enrollment.mean()) / enrollment.std()).reshape(len(df), -1)
    print(f"enrollment successfully embedded into {h_e.shape} dimensions")

    # Concatenate all feature blocks column-wise into one feature matrix
    embedded_df = pd.DataFrame(data=np.column_stack((h_m, h_p, h_d, h_s, h_e)))
    print('output shape: ', embedded_df.shape)
    return embedded_df

# Embed data
X_train = embed_all(train_df)
X_val = embed_all(val_df)
X_test = embed_all(test_df)
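Note that embed_all computes the enrollment median, mean, and standard deviation separately for whichever split it is called on. A common alternative is to estimate these statistics on the training split only and reuse them for the validation and test splits, so all three share one scale. The sketch below illustrates that alternative; the helper names `fit_enrollment_stats` and `normalize_enrollment` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def fit_enrollment_stats(enrollment_raw):
    """Estimate median, mean, and std on the training split (hypothetical helper)."""
    values = pd.to_numeric(enrollment_raw, errors="coerce")
    median = values.median()
    filled = values.fillna(median)
    return {"median": median, "mean": filled.mean(), "std": filled.std()}

def normalize_enrollment(enrollment_raw, stats):
    """Z-score any split using training-split statistics (hypothetical helper)."""
    values = pd.to_numeric(enrollment_raw, errors="coerce").fillna(stats["median"])
    z_scores = (values - stats["mean"]) / stats["std"]
    return z_scores.to_numpy().reshape(-1, 1)

# Made-up enrollment values, including entries that cannot be parsed.
toy_train_enrollment = pd.Series(["100", "200", "n/a", "300"])
toy_test_enrollment = pd.Series(["200", None])

stats = fit_enrollment_stats(toy_train_enrollment)
h_e_train = normalize_enrollment(toy_train_enrollment, stats)
h_e_test = normalize_enrollment(toy_test_enrollment, stats)
print(h_e_train.shape, h_e_test.shape)
```

Whether this changes the results on the toy dataset is not something the article measures; it is simply a way to keep test-set statistics out of the preprocessing.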

Define evaluation metrics

We will use the same evaluation metrics as the ones proposed in the HINT article: ROC AUC, F1, PR-AUC, Precision, Recall, and Accuracy.
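The print_results helper used in the next section is not shown in this article. Below is a minimal sketch of what such a helper could look like with scikit-learn, assuming the argument order (predictions, labels) seen in the calls below and hard 0/1 predictions. Using average precision as the PR-AUC estimate and returning the values as a dictionary are assumptions made for this sketch.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def print_results(predictions, labels):
    """Print the six metrics plus the predicted and true positive ratios."""
    results = {
        "ROC AUC": roc_auc_score(labels, predictions),
        "F1": f1_score(labels, predictions),
        # Assumption: average precision is used as the PR-AUC estimate
        "PR-AUC": average_precision_score(labels, predictions),
        "Precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "accuracy": accuracy_score(labels, predictions),
        "predict 1 ratio": sum(predictions) / len(predictions),
        "label 1 ratio": sum(labels) / len(labels),
    }
    for name, value in results.items():
        print(f"{name}: {value:.3f}")
    return results

# Toy check with made-up predictions and labels
toy_results = print_results([1, 0, 0, 1, 1, 1], [1, 0, 1, 1, 0, 1])
```

The two ratio lines are a quick sanity check that the model is not predicting one class far more often than it occurs in the labels.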

Train XGBoost model and predict train, validation, and test labels

import xgboost as xgb

# Create an XGBoost classifier with specified hyperparameters
xgb_classifier = xgb.XGBClassifier(
    objective='binary:logistic',  # for binary classification
)

# Train the XGBoost model
xgb_classifier.fit(X_train, y_train)

# Make predictions (hard 0/1 labels)
y_train_pred = xgb_classifier.predict(X_train)
y_val_pred = xgb_classifier.predict(X_val)
y_test_pred = xgb_classifier.predict(X_test)

# print_results(predictions, labels) reports the metrics listed above;
# it is not defined in this article
print('-----------Results on training data:-----------')
print_results(y_train_pred, y_train)
print('-----------Results on validation data:-----------')
print_results(y_val_pred, y_val)
print('-----------Results on test data:-----------')
print_results(y_test_pred, y_test)

### Output:
# -----------Results on training data:-----------
# ROC AUC: 1.0
# F1: 1.0
# PR-AUC: 1.0
# Precision: 1.0
# recall: 1.0
# accuracy: 1.0
# predict 1 ratio: 0.661
# label 1 ratio: 0.661
# -----------Results on validation data:-----------
# ROC AUC: 0.765
# F1: 0.817
# PR-AUC: 0.799
# Precision: 0.840
# recall: 0.795
# accuracy: 0.773
# predict 1 ratio: 0.602
# label 1 ratio: 0.636
# -----------Results on test data:-----------
# ROC AUC: 0.742
# F1: 0.805
# PR-AUC: 0.757
# Precision: 0.790
# recall: 0.821
# accuracy: 0.759
# predict 1 ratio: 0.630
# label 1 ratio: 0.606
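One thing worth checking when reading these numbers: the calls above pass the hard 0/1 output of predict to every metric. ROC AUC and PR-AUC are ranking metrics and are usually computed from continuous scores, such as the positive-class column of XGBClassifier.predict_proba. The toy sketch below uses made-up scores (not results from this project) to show that the two choices can give different values.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

toy_labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Made-up positive-class probabilities, standing in for
# xgb_classifier.predict_proba(X)[:, 1]
toy_scores = np.array([0.9, 0.6, 0.8, 0.4, 0.3, 0.7, 0.2, 0.55])

# Thresholding at 0.5 discards the ranking information within each class
toy_hard = (toy_scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(toy_labels, toy_scores)
auc_from_hard = roc_auc_score(toy_labels, toy_hard)
ap_from_scores = average_precision_score(toy_labels, toy_scores)

print(f"ROC AUC from scores: {auc_from_scores:.3f}")
print(f"ROC AUC from hard labels: {auc_from_hard:.3f}")
print(f"Average precision from scores: {ap_from_scores:.3f}")
```

Threshold-dependent metrics such as F1, precision, recall, and accuracy still need hard labels, so both outputs are useful.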

Compare performance with HINT model

This simple XGBoost model was trained on feature embeddings for drug molecule(s), inclusion/exclusion criteria, disease indication(s), trial sponsor, and number of participants, while the HINT authors did not use the last two features: trial sponsor and number of participants. We used several large language model embedding tools such as BioBERT and SBERT, and used Morgan encoding for drug representations, while the HINT authors used a variety of neural networks for all their embeddings.

We can see from the figure below that a simple XGBoost model trained on our feature embeddings performs well compared to the more sophisticated HINT model. On this dataset, our project has better precision and accuracy but worse recall.

Comparison of this project’s performance to that of the HINT project (image by author)


Next steps could include an analysis of the extent to which the added features, trial sponsor and number of participants, contribute to the improved performance (on some metrics) compared to other factors such as model choice and embedding techniques. Intuitively, it does seem like these features could improve predictive performance, as certain sponsors have historically performed better than others, and one might expect a relationship between trial size and outcome as well.

Now you may wonder: “What is the usefulness of such a predictive model? We can’t possibly rely on such a model and forgo running the trial?” And you are correct (although some companies are creating digital twins of patients with the goal of running trials virtually). A model like the one presented in this series could, for example, be used to improve clinical trial power analysis, a related statistical practice. A power analysis is used to determine the optimal number of participants to enroll in a specific trial, and a strong assumption about the treatment effect has to be made to perform such an analysis. A predictive model that utilizes trial information such as drug molecule structure, disease indication, and trial eligibility criteria, like the model we implemented here, can potentially help with creating a more accurate power analysis.
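To illustrate why the assumed treatment effect matters so much, here is a minimal sketch of a textbook power calculation. It assumes a two-arm trial comparing means with equal allocation, a two-sided test, and the normal approximation; the function name `participants_per_arm` and the effect sizes are hypothetical and not taken from this project.

```python
import math
from statistics import NormalDist

def participants_per_arm(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided, two-sample comparison.

    effect_size is the standardized mean difference (difference in means
    divided by the common standard deviation). Normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# The required enrollment grows quickly as the assumed effect shrinks
for assumed_effect in (0.2, 0.5, 0.8):
    n = participants_per_arm(assumed_effect)
    print(f"assumed effect size {assumed_effect}: {n} participants per arm")
```

Because the sample size scales with the inverse square of the assumed effect, even a modest improvement in how well that effect is anticipated could change the enrollment target considerably.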


Fu, Tianfan, et al. “Hint: Hierarchical interaction network for clinical-trial-outcome predictions.” Patterns 3.4 (2022).

Clinical Trial Outcome Prediction was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
