Biomedical Relation Extraction from PubMed abstracts using Supervised Learning

transformernlp
4 min read · May 26, 2021

Recent years have seen a rapid increase in the number of publications, scientific articles, reports, etc. that are available and readily accessible in electronic format, especially in the biomedical field.

The PubMed database contains more than 32 million citations and abstracts of biomedical literature. With so many articles being published, it is difficult to read the abstracts and obtain a complete picture of a topic. This has led to an increased need for text mining in the biomedical field.

The amount of biomedical information that can be extracted from such a database is enormous. Here, we intend to extract ‘treated by’ and ‘caused by’ relations from biomedical abstracts using supervised learning. This would allow us to query for a particular disease or entity and get the appropriate cause and treatment.

Biomedical Relations Repository

A working example of this pipeline can be found here:

Biomedical Relations (transformernlp.github.io)

We will take you through the end-to-end pipeline, starting from extracting sentences from PubMed abstracts to creating (entity 1 — relation — entity 2) triplets.

End-to-End Pipeline

1. Downloading PubMed Abstracts and Preprocessing

The PubMed abstracts can be downloaded from here. The link contains over 1000 XML files, each containing multiple article abstracts. We extract the abstracts, preprocess, and store all the abstracts in a CSV.
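
As a rough illustration, the sketch below parses one PubMed baseline file and writes its abstracts to a CSV. The file name and XML tag paths follow the public PubMed baseline format, but treat this as an assumed sketch rather than the exact preprocessing script used here.

```python
# Hypothetical sketch: extract abstracts from one PubMed baseline XML file into a CSV.
import csv
import gzip
import xml.etree.ElementTree as ET

def extract_abstracts(xml_gz_path):
    """Yield (pmid, abstract_text) pairs from a PubMed baseline file."""
    with gzip.open(xml_gz_path, "rb") as f:
        tree = ET.parse(f)
    for article in tree.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        # Abstracts may be split into several AbstractText sections (Background, Methods, ...).
        parts = ["".join(el.itertext()) for el in article.findall(".//Abstract/AbstractText")]
        abstract = " ".join(p.strip() for p in parts if p.strip())
        if abstract:
            yield pmid, abstract

# Assumed file name; the baseline contains over 1000 such files.
with open("abstracts.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["pmid", "abstract"])
    for pmid, abstract in extract_abstracts("pubmed21n0001.xml.gz"):
        writer.writerow([pmid, abstract])
```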

2. Extracting Biomedical Entities

We rely on scispaCy to extract biomedical entities from the extracted sentences, using the models en_ner_bionlp13cg_md and en_ner_bc5cdr_md. The extracted entities are then converted to a triplet form (entity 1 — sentence — entity 2). When a sentence contains more than two entities, every possible triplet combination is considered.

Entity 1: corticosteroids | Sentence: Corticosteroids are used in the treatment of arthritis | Entity 2: arthritis
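
A minimal sketch of this step, assuming the scispaCy models are installed (only en_ner_bc5cdr_md is loaded below for brevity); the pairing logic mirrors the description above but the helper function is illustrative.

```python
# Sketch of entity extraction with scispaCy and pairwise triplet generation.
import itertools
import spacy

nlp = spacy.load("en_ner_bc5cdr_md")  # en_ner_bionlp13cg_md would be run the same way

def sentence_triplets(sentence):
    """Return (entity 1, sentence, entity 2) triplets for every entity pair in a sentence."""
    doc = nlp(sentence)
    entities = [ent.text for ent in doc.ents]
    # With more than two entities, emit every pairwise combination.
    return [(e1, sentence, e2) for e1, e2 in itertools.combinations(entities, 2)]

print(sentence_triplets("Corticosteroids are used in the treatment of arthritis"))
# e.g. [('Corticosteroids', 'Corticosteroids are used in the treatment of arthritis', 'arthritis')]
```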

3. Parsing the Snorkel CrowdTruth Dataset to NLI form to create our training dataset

We now parse the Snorkel CrowdTruth dataset into an NLI form by considering both ‘caused by’ and ‘treated by’ as possible hypotheses. Only samples with expert=1.0 are considered while creating the dataset, as these are discrete labels based on an expert’s judgment as to whether the baseline label is correct. Now that we have the true-label samples, we invert the relations (caused → treated; treated → caused) between the entities so that our training data has both positive and negative labels.
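
A hypothetical sketch of this conversion is shown below; the column names (sentence, term1, term2, relation, expert) are assumptions about how the CrowdTruth export might look, not its documented schema.

```python
# Hypothetical sketch: turn expert-verified relation triplets into NLI training pairs.
import pandas as pd

df = pd.read_csv("crowdtruth_relations.csv")   # assumed export of the CrowdTruth data
df = df[df["expert"] == 1.0]                   # keep only expert-verified samples

rows = []
for _, r in df.iterrows():
    true_rel = r["relation"]                   # 'caused by' or 'treated by'
    flipped = "treated by" if true_rel == "caused by" else "caused by"
    premise = r["sentence"]
    # Positive pair: the annotated relation is entailed by the premise.
    rows.append({"premise": premise,
                 "hypothesis": f"{r['term1']} {true_rel} {r['term2']}",
                 "label": "entailment"})
    # Negative pair: inverting the relation yields a non-entailed hypothesis.
    rows.append({"premise": premise,
                 "hypothesis": f"{r['term1']} {flipped} {r['term2']}",
                 "label": "not_entailment"})

pd.DataFrame(rows).to_csv("nli_train.csv", index=False)
```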

4. Parsing the PubMed triplets to NLI form to create our inference dataset

We now parse the PubMed triplets into an NLI form by considering both ‘caused by’ and ‘treated by’ as possible hypotheses. The label (True or False) can then be predicted by feeding this data to the fine-tuned model from Step 5.

Premise: Sentence | Hypothesis: Entity1/Entity2 caused by/treated by Entity2/Entity1.

For instance, the triplet above can be converted into four possible premise-hypothesis forms:

  1. Premise: Corticosteroids are used in the treatment of arthritis | Hypothesis: corticosteroids caused by arthritis
  2. Premise: Corticosteroids are used in the treatment of arthritis | Hypothesis: arthritis caused by corticosteroids
  3. Premise: Corticosteroids are used in the treatment of arthritis | Hypothesis: corticosteroids treated by arthritis
  4. Premise: Corticosteroids are used in the treatment of arthritis | Hypothesis: arthritis treated by corticosteroids
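
A minimal sketch of this expansion, assuming the triplet fields from Step 2 (the helper name is illustrative):

```python
# Expand one (entity 1, sentence, entity 2) triplet into the four
# premise-hypothesis pairs listed above.
def nli_pairs(entity1, sentence, entity2):
    pairs = []
    for relation in ("caused by", "treated by"):
        for a, b in ((entity1, entity2), (entity2, entity1)):
            pairs.append({"premise": sentence, "hypothesis": f"{a} {relation} {b}"})
    return pairs

for pair in nli_pairs("corticosteroids",
                      "Corticosteroids are used in the treatment of arthritis",
                      "arthritis"):
    print(pair["hypothesis"])
```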

5. Training RoBERTa model on modified CrowdTruth dataset from Step 3

We fine-tune a RoBERTa-Large model using fast.ai on the training dataset. The model converges in less than 10 epochs and has a validation accuracy of 97%. We now save the trained model for predicting the label for our inference dataset (modified PubMed dataset from Step 4).
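
The model here was fine-tuned with fast.ai; as an illustrative alternative for the same premise-hypothesis classification setup, below is a hedged sketch using the Hugging Face Trainer. File names, label mapping, and hyperparameters are assumptions, not the exact training configuration.

```python
# Hedged sketch: fine-tune roberta-large on the NLI-style pairs from Step 3.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

dataset = load_dataset("csv", data_files={"train": "nli_train.csv"})["train"]
label2id = {"entailment": 0, "not_entailment": 1}

def tokenize(batch):
    enc = tokenizer(batch["premise"], batch["hypothesis"],
                    truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(output_dir="roberta-nli-relations",
                         num_train_epochs=10,           # the article reports convergence in <10 epochs
                         per_device_train_batch_size=8,
                         learning_rate=1e-5)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("roberta-nli-relations")
tokenizer.save_pretrained("roberta-nli-relations")
```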

6. Relation Prediction using the fine-tuned model

The RoBERTa-Large model fine-tuned on the NLI task is now used to predict whether each biomedical premise entails its hypothesis, so every output is either entailment or not_entailment. From these results, we keep only the entailment predictions and filter the final relations based on the model’s probability scores. This allows us to create triplets (entity 1 — relation — entity 2) from the premise-hypothesis pairs predicted as entailment. This makes our approach a supervised way of extracting relations. If you want to explore supervised and few-shot learning, this might be a great place to start.
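
A hedged inference sketch, assuming the model and tokenizer saved in Step 5 and an illustrative probability threshold:

```python
# Score each premise-hypothesis pair and keep only confident entailment predictions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-nli-relations")
model = AutoModelForSequenceClassification.from_pretrained("roberta-nli-relations")
model.eval()

def predict(premise, hypothesis, threshold=0.9):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Index 0 = entailment, 1 = not_entailment (matches the assumed label mapping in Step 5).
    if probs[0] > threshold:
        return "entailment", probs[0].item()
    return "not_entailment", probs[1].item()

print(predict("Corticosteroids are used in the treatment of arthritis",
              "arthritis treated by corticosteroids"))
```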

The resulting database is available at Biomedical Relations (transformernlp.github.io)

In case you have any queries, please reach out to us here!

This article was authored by Saichandra Pandraju and Sakthi Ganesh
