Sentence classification of online drug reviews using machine learning techniques

Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites with voluntary reporting and unrestricted data access offer a competitive alternative for mining drug side effects, in comparison to traditional health records and medical ca...

Bibliographic Details
Main Author: Sukumar Warrier, Vinay
Other Authors: Khoo Soo Guan, Christopher
Format: Theses and Dissertations
Language: English
Published: 2016
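Subjects: DRNTU::Library and information science; DRNTU::Engineering::Computer science and engineering::Information systems; DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics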
Online Access: http://hdl.handle.net/10356/69363
Institution: Nanyang Technological University
id sg-ntu-dr.10356-69363
record_format dspace
spelling sg-ntu-dr.10356-69363 2019-12-10T13:06:54Z Sentence classification of online drug reviews using machine learning techniques Sukumar Warrier, Vinay Khoo Soo Guan, Christopher Wee Kim Wee School of Communication and Information DRNTU::Library and information science DRNTU::Engineering::Computer science and engineering::Information systems DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics Master of Science (Information Studies) 2016-12-20T01:48:15Z 2016-12-20T01:48:15Z 2016 Thesis http://hdl.handle.net/10356/69363 en Nanyang Technological University 50 p. application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic DRNTU::Library and information science
DRNTU::Engineering::Computer science and engineering::Information systems
DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics
description Background: Monitoring adverse drug effects is a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites, with voluntary reporting and unrestricted data access, offer a competitive alternative to traditional health records and medical case reports for mining drug side effects.
Objectives: This study aims to develop text classification models using machine learning to categorize sentences in online drug reviews into three categories: sentences with side-effect information, sentences expressing a negative attitude towards the drug, and sentences indicating a positive effect or drug efficacy. The secondary objective is to bootstrap a training corpus from unlabelled review sentences and train classifier models on it.
Methods: Three undergraduate coders were tasked with annotating 1000 randomly selected reviews from a WebMD-based online drug review corpus at the sentence level. Features were developed from the sentences and reviews based on length, position, sentiment, matches with a side-effect dictionary, and cues indicating a side effect. Linguistic features such as unigrams, bigrams, and trigrams were also extracted. 70% of the labelled corpus formed the training set, and multiple classifier models based on logistic regression and support vector machines were tested for predicting the three target categories. Bootstrapping started from a rule-based seed labelling system and was expanded over the unlabelled data. The bootstrapped training data was then used to classify the labelled test corpus.
Results: Logistic regression produced the best model for the sentence category ‘Side Effect’, with an F-measure of 0.63. Side-effect dictionary terms and sentiment values were among the top significant predictors. Support vector machine (SVM) classifiers produced the best models for the ‘Negative Sentiment/Side Effect’ (F-measure: 0.60) and ‘Effective, Positive’ (F-measure: 0.64) categories, respectively. A random forest classifier produced the best model for predicting ‘Side Effect’ using bootstrapped labels (F-measure: 0.45). Apart from dictionary-based features, sentence position, length, and sentiment score played important roles in all the models.
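The feature extraction, rule-based seed labelling, and F-measure evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the dictionary SIDE_EFFECT_TERMS, the cue list CUE_PHRASES, the seed rule, and the example review are all hypothetical, since the record does not give the actual dictionary, cues, or rules.

```python
import re

# Hypothetical side-effect dictionary and cue phrases (assumptions for
# illustration; the record does not list the ones used in the thesis).
SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "rash", "insomnia"}
CUE_PHRASES = ("side effect", "caused me", "made me feel")


def sentence_features(sentence, position, total):
    """Compute length, position, dictionary-match, and cue features."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    return {
        "length": len(tokens),
        # 0.0 for the first sentence of a review, 1.0 for the last.
        "relative_position": position / max(total - 1, 1),
        "dictionary_matches": sum(t in SIDE_EFFECT_TERMS for t in tokens),
        "has_cue": any(cue in lowered for cue in CUE_PHRASES),
    }


def seed_label(features):
    """Assumed seed rule: 1 ('Side Effect') on any dictionary or cue match."""
    return int(features["dictionary_matches"] > 0 or features["has_cue"])


def f_measure(gold, predicted):
    """Balanced F-measure (F1) for the positive class of binary labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == 1 and p == 1 for g, p in pairs)
    fp = sum(g == 0 and p == 1 for g, p in pairs)
    fn = sum(g == 1 and p == 0 for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A made-up three-sentence review with hand-assigned gold labels.
review = [
    "I took this drug for two weeks.",
    "It gave me a terrible headache and nausea.",
    "My blood pressure is finally under control.",
]
gold = [0, 1, 0]

features = [sentence_features(s, i, len(review)) for i, s in enumerate(review)]
predicted = [seed_label(f) for f in features]
score = f_measure(gold, predicted)
```

In a bootstrapping setup of the kind the abstract describes, seed labels like these would serve as training targets for a classifier over the unlabelled sentences, and the F-measure would be computed against the human-annotated test set.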
author2 Khoo Soo Guan, Christopher
author_facet Khoo Soo Guan, Christopher
Sukumar Warrier, Vinay
format Theses and Dissertations
author Sukumar Warrier, Vinay
author_sort Sukumar Warrier, Vinay
title Sentence classification of online drug reviews using machine learning techniques
publishDate 2016
url http://hdl.handle.net/10356/69363
_version_ 1681036400399482880