Sentence classification of online drug reviews using machine learning techniques
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2016 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/69363 |
Institution: | Nanyang Technological University |
Summary: | Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites with voluntary reporting and unrestricted data access offer a competitive alternative for mining drug side effects, in comparison to traditional health records and medical case reports.
Objectives: This study aims to develop machine learning text classification models that categorize sentences in online drug reviews into three categories: sentences with side effect information, sentences expressing a negative attitude towards the drug, and sentences indicating a positive effect or drug efficacy. The secondary objective is to bootstrap a training corpus from unlabelled review sentences and train classifier models on it.
Methods: Three undergraduate coders annotated 1000 randomly selected reviews from a WebMD-based online drug review corpus at the sentence level. Features were derived from the sentences and reviews based on length, position, sentiment, matches with a side-effect dictionary, and cues indicating a side effect. Linguistic features such as unigrams, bigrams, and trigrams were also extracted. Seventy percent of the labelled corpus formed the training set, and multiple classifier models based on logistic regression and support vector machines were tested for predicting the three target categories. Bootstrapping started from a rule-based seed labelling system and was used to expand labels over the unlabelled data. Models trained on the bootstrapped data were then used to classify the labelled test corpus.
Results: Logistic regression produced the best model for the sentence category ‘Side Effect’, with an F-measure of 0.63. Side-effect dictionary terms and sentiment values were among the most significant predictors. Support vector machine (SVM) classifiers produced the best models for the ‘Negative Sentiment/Side Effect’ (F-measure: 0.60) and ‘Effective, Positive’ (F-measure: 0.64) categories. A random forest classifier produced the best model for predicting ‘Side Effect’ from bootstrapped labels (F-measure: 0.45). Apart from dictionary-based features, sentence position, length, and sentiment score played important roles in all the models. |
---|
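
The sentence-level features and the F-measure named in the summary can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the dictionary `SIDE_EFFECT_TERMS`, the tokenizer, and the function names `sentence_features` and `f_measure` are assumptions made for illustration, and sentiment and cue features are left out because the record does not say how they were computed.

```python
import re

# Hypothetical mini side-effect dictionary (assumption); the dictionary
# used in the thesis is not listed in this record.
SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "insomnia"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_features(sentence, position, review_length):
    """Build length, position, dictionary and n-gram features for one sentence.

    position is the 0-based index of the sentence within its review and
    review_length is the number of sentences in that review.
    """
    tokens = tokenize(sentence)
    return {
        "length": len(tokens),
        # 0.0 for the first sentence, 1.0 for the last one
        "relative_position": position / max(review_length - 1, 1),
        "dictionary_matches": sum(t in SIDE_EFFECT_TERMS for t in tokens),
        "unigrams": ngrams(tokens, 1),
        "bigrams": ngrams(tokens, 2),
        "trigrams": ngrams(tokens, 3),
    }


def f_measure(y_true, y_pred, positive=1):
    """F-measure: harmonic mean of precision and recall for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like the one described, each category would be scored separately by passing its gold labels and predictions to `f_measure`, with the feature dictionaries converted to numeric vectors before training a classifier.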