Part 1: Multi-modal health data embedding
I recently came across the article "HINT: Hierarchical interaction network for clinical-trial-outcome predictions" by Fu et al. It's an interesting application of real-world data science, and it inspired me to create my own project in which I attempt to predict clinical trial outcomes based on publicly available information from ClinicalTrials.gov.
The aim of the project is to predict the outcome of clinical trials (a binary outcome: fail vs. success) without actually having to run them. We will use publicly available clinical trial information from ClinicalTrials.gov, such as Drug Molecule, Disease Indication, Trial Protocol, Sponsor, and Number of Participants, and embed this information (transform it into vector representations) using tools such as BioBERT, SBERT, and DeepPurpose.
Workflow schematic (image by author)
In the first part of this series I focus on embedding the multi-modal clinical trial data. In the second part I use an XGBoost model to predict trial outcomes (a binary prediction: fail vs. success) and briefly compare my simple XGBoost model’s performance to the HINT model’s performance from the article that inspired this project.
Focus for Part 1 of this series: Embedding multi-modal clinical trial data into vectors (image by author)
Here are the steps I will follow in this article:
- Collect all clinical trial records from ClinicalTrials.gov
- Read and parse the obtained XML files
- Embed disease indications using tiny-biobert by Rohanian et al., a compact version of BioBERT
- Embed clinical trial inclusion/exclusion criteria using tiny-biobert
- Embed sponsor information using all-MiniLM-L6-v2, a powerful pre-trained sentence encoder from SentenceBERT
- Convert drug names to their SMILES representation and then to their Morgan fingerprint using DeepPurpose by Huang et al.
You can follow all the steps in this Jupyter notebook: clinical trial embedding tutorial.
Collect clinical trial records from ClinicalTrials.gov
I suggest running the whole process in the command line since it is time- and space-consuming. In case you don’t have wget installed on your system, have a look here for how to install wget. Open up a command line/terminal and type in the following commands:
# 0. Clone repository
# Navigate to the directory where you want to clone the repository and type:
git clone https://github.com/lenlan/clinical-trial-prediction.git
cd clinical-trial-prediction
# 1. Download data
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip # This will take 10-20 minutes to download
# 2. Unzip the ZIP file.
# The unzipped file occupies approximately 11 GB. Please make sure you have enough space.
unzip AllPublicXML.zip # This might take over an hour to run, depending on your system
cd ../
# 3. Collect and sort all the XML files and put output in all_xml
mkdir -p data
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
head -3 data/all_xml
### Output:
# raw_data/NCT0000xxxx/NCT00000102.xml
# raw_data/NCT0000xxxx/NCT00000104.xml
# raw_data/NCT0000xxxx/NCT00000105.xml
# NCTID is the identifier of a clinical trial. `NCT00000102`, `NCT00000104`, `NCT00000105` are all NCTIDs.
# 4. Remove ZIP file to recover some disk space
rm raw_data/AllPublicXML.zip
Read and parse the obtained XML files
Now that you have the clinical trials as individual files on your hard drive, we’re going to extract the information we need from them by parsing the XML files.
import pandas as pd
from xml.etree import ElementTree as ET

# function adapted from https://github.com/futianfan/clinical-trial-outcome-prediction
def xmlfile2results(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    nctid = root.find('id_info').find('nct_id').text  ### nctid: 'NCT00000102'
    print("nctid is", nctid)
    study_type = root.find('study_type').text
    print("study type is", study_type)
    interventions = [i for i in root.findall('intervention')]
    drug_interventions = [i.find('intervention_name').text for i in interventions
                          if i.find('intervention_type').text == 'Drug']
    print("drug intervention:", drug_interventions)
    ### remove 'biologics', non-interventions
    if len(drug_interventions) == 0:
        return (None,)
    try:
        status = root.find('overall_status').text
        print("status:", status)
    except AttributeError:
        status = ''
    try:
        why_stop = root.find('why_stopped').text
        print("why stop:", why_stop)
    except AttributeError:
        why_stop = ''
    try:
        phase = root.find('phase').text
        print("phase:", phase)
    except AttributeError:
        phase = ''
    conditions = [i.text for i in root.findall('condition')]  ### disease
    print("disease", conditions)
    try:
        criteria = root.find('eligibility').find('criteria').find('textblock').text
        print('found criteria')
    except AttributeError:
        criteria = ''
    try:
        enrollment = root.find('enrollment').text
        print("enrollment:", enrollment)
    except AttributeError:
        enrollment = ''
    try:
        lead_sponsor = root.find('sponsors').find('lead_sponsor').find('agency').text
        print("lead_sponsor:", lead_sponsor)
    except AttributeError:
        lead_sponsor = ''
    data = {'nctid': nctid,
            'study_type': study_type,
            'drug_interventions': [drug_interventions],
            'overall_status': status,
            'why_stopped': why_stop,
            'phase': phase,
            'indications': [conditions],
            'criteria': criteria,
            'enrollment': enrollment,
            'lead_sponsor': lead_sponsor}
    return pd.DataFrame(data)
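# Example call on a single record that produces the output below
# (hypothetical path, following the folder layout from the download step):
xmlfile2results('raw_data/NCT0004xxxx/NCT00040014.xml')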
### Output:
# nctid is NCT00040014
# study type is Interventional
# drug intervention: [‘exemestane’]
# status: Terminated
# phase: Phase 2
# disease [‘Breast Neoplasms’]
# found criteria
# enrollment: 100
# lead_sponsor: Pfizer
Using sentence-transformers to embed information — Example
First we need to install the sentence-transformers library.
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# all-MiniLM-L6-v2 encodes each sentence into a 384-dimensional vector
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings.shape)
### Output:
# (2, 384)
We have successfully transformed two sentences into 384-dimensional vector representations.
Embed disease indications using tiny-biobert
We first create a dictionary that maps each indication to its 312-dimensional tiny-biobert embedding. Then we create a second dictionary that maps each trial identifier directly to its disease embedding. When a trial lists multiple indications, we take the mean of their embeddings as the trial's vector representation.
import pickle
from functools import reduce

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

def create_indication2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all indications and encode each one into a 312-dimensional vector
    all_indications = sorted(set(reduce(lambda x, y: x + y, toy_df['indications'].tolist())))
    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')
    embeddings = model.encode(all_indications, show_progress_bar=True)

    # Create dictionary mapping indications to embeddings
    indication2embedding_dict = dict(zip(all_indications, embeddings))
    pickle.dump(indication2embedding_dict, open('data/indication2embedding_dict.pkl', 'wb'))

    # Map each trial (NCTID) to the mean embedding of its indications
    embedding = []
    for indication_lst in tqdm(toy_df['indications'].tolist()):
        vec = [indication2embedding_dict[indication] for indication in indication_lst]
        # print(np.array(vec).shape)  # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape)  # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2disease_embedding_dict = dict(zip(toy_df['nctid'], np.array(embedding)))
    pickle.dump(nctid2disease_embedding_dict, open('data/nctid2disease_embedding_dict.pkl', 'wb'))

create_indication2embedding_dict()
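To sanity-check the result, we can reload the saved dictionary and look up the disease embedding for a single trial (assuming the example trial parsed earlier is part of the toy dataset):
import pickle

nctid2disease = pickle.load(open('data/nctid2disease_embedding_dict.pkl', 'rb'))
print(nctid2disease['NCT00040014'].shape)  # (312,), the tiny-biobert embedding size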
Embed clinical trial inclusion/exclusion criteria using tiny-biobert
In a very similar way, we encode the clinical trial inclusion/exclusion criteria. Some additional data cleaning is needed to get the text into the right format: a split_protocol helper (sketched below) splits each criteria text block into inclusion and exclusion sentences. We encode inclusion and exclusion criteria separately, and each criteria embedding is the mean of the embeddings of the sentences it consists of.
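The full split_protocol implementation is in the linked notebook; a minimal sketch of its behavior, assuming the text block contains "Inclusion Criteria"/"Exclusion Criteria" headers, looks like this:
# Minimal sketch of split_protocol (full version in the linked notebook).
# Assumes the criteria text contains "inclusion"/"exclusion" section headers.
def split_protocol(protocol):
    sentences = [s.strip() for s in protocol.split('\n') if s.strip()]
    incl = [i for i, s in enumerate(sentences) if 'inclusion' in s.lower()]
    excl = [i for i, s in enumerate(sentences) if 'exclusion' in s.lower()]
    if incl and excl and incl[0] < excl[0]:
        return (sentences[incl[0]:excl[0]], sentences[excl[0]:])
    return (sentences,)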
def create_nctid2protocol_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')

    def criteria2vec(criteria):
        embeddings = model.encode(criteria)
        # print(embeddings.shape)  # DEBUG
        embeddings_avg = np.mean(embeddings, axis=0)
        # print(embeddings_avg.shape)  # DEBUG
        return embeddings_avg

    nctid_2_protocol_embedding = dict()
    print(f"Embedding {len(toy_df)*2} inclusion/exclusion criteria..")
    for nctid, protocol in tqdm(zip(toy_df['nctid'].tolist(), toy_df['criteria'].tolist())):
        split = split_protocol(protocol)
        if len(split) == 2:
            embedding = np.concatenate((criteria2vec(split[0]), criteria2vec(split[1])))
        else:
            # no exclusion criteria found: pad with zeros (tiny-biobert embeddings are 312-dimensional)
            embedding = np.concatenate((criteria2vec(split[0]), np.zeros(312)))
        nctid_2_protocol_embedding[nctid] = embedding
    pickle.dump(nctid_2_protocol_embedding, open('data/nctid_2_protocol_embedding_dict.pkl', 'wb'))

create_nctid2protocol_embedding_dict()
Embed sponsor information using all-MiniLM-L6-v2, a powerful pre-trained sentence encoder from SentenceBERT
I chose to encode trial sponsors with sentence embeddings (SBERT) as well. Simpler methods such as label or one-hot encoding could work too, but I wanted to capture similarities between sponsor names in case of typos or multiple spellings of the same sponsor. I use the pre-trained all-MiniLM-L6-v2 model, which achieves high speed and strong performance on benchmark datasets. It converts each sponsor institution into a 384-dimensional vector.
def create_sponsor2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all sponsors and encode each one into a 384-dimensional vector
    all_sponsors = sorted(set(toy_df['lead_sponsor'].tolist()))
    # Using 'all-MiniLM-L6-v2', a pre-trained model with excellent performance and speed
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(all_sponsors, show_progress_bar=True)
    print(embeddings.shape)

    # Create dictionary mapping sponsors to embeddings
    sponsor2embedding_dict = dict(zip(all_sponsors, embeddings))
    pickle.dump(sponsor2embedding_dict, open('data/sponsor2embedding_dict.pkl', 'wb'))

create_sponsor2embedding_dict()
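As a quick illustration of why embeddings help here, two spelling variants of the same sponsor land close together in embedding space (the variants below are made up for illustration):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(['Pfizer', 'Pfizer Inc.'])  # hypothetical spelling variants
print(util.cos_sim(emb[0], emb[1]))  # close to 1; one-hot encoding would treat them as unrelated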
Convert drug names to their SMILES representation and then to their Morgan fingerprint using DeepPurpose
Molecules can be represented as SMILES strings, a line notation for encoding molecular structure. We extract drug names from ClinicalTrials.gov and link them to their molecular structures (SMILES strings) using the CACTUS Chemical Identifier Resolver.
import requests

def get_smiles(drug_name):
    # URL for the CACTUS Chemical Identifier Resolver (CIR) API
    base_url = "https://cactus.nci.nih.gov/chemical/structure"
    url = f"{base_url}/{drug_name}/smiles"
    try:
        # Send a GET request to retrieve the SMILES representation
        response = requests.get(url)
        if response.status_code == 200:
            smiles = response.text.strip()  # Get the SMILES string
            print(f"Drug Name: {drug_name}")
            print(f"SMILES: {smiles}")
        else:
            print(f"Failed to retrieve SMILES for {drug_name}. Status code: {response.status_code}")
            smiles = ''
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        smiles = ''  # avoid returning an unbound variable on request errors
    return smiles

# Define the drug name you want to convert
drug_name = "aspirin"  # Replace with the drug name of your choice
get_smiles(drug_name)
### Output:
# Drug Name: aspirin
# SMILES: CC(=O)Oc1ccccc1C(O)=O
DeepPurpose can be used to encode molecular compounds. It currently supports 15 different encodings. We will use Morgan encoding, which encodes the atom groups of a chemical into a binary vector with length and radius as its two parameters. First we need to install the DeepPurpose library.
pip install DeepPurpose
Overview of DeepPurpose encoders (image by Huang et al., CC license)
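To get a feel for what the Morgan encoding produces, here is a minimal sketch using RDKit directly, which DeepPurpose builds on; the radius and nBits values below are illustrative assumptions:
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(O)=O')  # aspirin, from the example above
# radius and nBits are the two Morgan parameters; these values are assumptions
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
print(len(fp), fp.GetNumOnBits())  # 1024-bit binary vector with a handful of bits set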
We create a dictionary that maps SMILES to Morgan representation and a dictionary that maps clinical trial identifiers (NCTIDs) directly to their Morgan representation.
def create_smiles2morgan_dict():
    from DeepPurpose.utils import smiles2morgan

    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')

    # txt_to_lst (defined in the linked notebook) parses the stringified lists in the CSV
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    unique_smiles = list(set(reduce(lambda x, y: x + y, smiles_lst)))

    morgan = pd.Series(unique_smiles).apply(smiles2morgan)
    smiles2morgan_dict = dict(zip(unique_smiles, morgan))
    pickle.dump(smiles2morgan_dict, open('data/smiles2morgan_dict.pkl', 'wb'))

def create_nctid2molecule_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    smiles2morgan_dict = load_smiles2morgan_dict()  # loads the pickle created above

    # Map each trial (NCTID) to the mean Morgan fingerprint of its drugs
    embedding = []
    for drugs in tqdm(smiles_lst):
        vec = [smiles2morgan_dict[drug] for drug in drugs]
        # print(np.array(vec).shape)  # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape)  # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2molecule_embedding_dict = dict(zip(toy_df['nctid'], np.array(embedding)))
    pickle.dump(nctid2molecule_embedding_dict, open('data/nctid2molecule_embedding_dict.pkl', 'wb'))

create_nctid2molecule_embedding_dict()
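With all dictionaries saved, a trial's feature vector can be assembled by concatenation. A minimal sketch, assuming the file names from the steps above and that the trial appears in every dictionary (the sponsor embedding would be looked up via the trial's lead_sponsor name):
import pickle
import numpy as np

def load(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

disease = load('data/nctid2disease_embedding_dict.pkl')
protocol = load('data/nctid_2_protocol_embedding_dict.pkl')
molecule = load('data/nctid2molecule_embedding_dict.pkl')

nctid = 'NCT00040014'  # hypothetical example trial
features = np.concatenate([disease[nctid], protocol[nctid], molecule[nctid]])
print(features.shape)  # 312 + 624 + fingerprint length, given the settings above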
Conclusion
Using publicly available clinical trial information we can create useful inputs for machine learning models by using feature embedding. To summarize, we:
- Embedded disease indications using tiny-biobert by Rohanian et al., a compact version of BioBERT
- Embedded clinical trial inclusion/exclusion criteria using tiny-biobert
- Embedded sponsor information using all-MiniLM-L6-v2, a powerful pre-trained sentence encoder from SentenceBERT
- Converted drug names to their SMILES representation and then to their Morgan fingerprint using DeepPurpose by Huang et al.
In the second part of this series I will run a simple XGBoost model to predict the clinical trial outcome based on the embedded vector representations we created here. I will compare its performance to that of the HINT model.
References
Fu, Tianfan, et al. "HINT: Hierarchical interaction network for clinical-trial-outcome predictions." Patterns 3.4 (2022).
Huang, Kexin, et al. "DeepPurpose: a deep learning library for drug–target interaction prediction." Bioinformatics 36.22–23 (2020): 5545–5547.