Sentence classification of online drug reviews using machine learning techniques
Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites with voluntary reporting and unrestricted data access offer a competitive alternative for mining drug side effects, in comparison to traditional health records and medical ca...
Main Author: | Sukumar Warrier, Vinay |
---|---|
Other Authors: | Khoo Soo Guan, Christopher |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2016 |
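Subjects: | DRNTU::Library and information science; DRNTU::Engineering::Computer science and engineering::Information systems; DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics |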
Online Access: | http://hdl.handle.net/10356/69363 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-69363 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-69363; 2019-12-10T13:06:54Z; Sukumar Warrier, Vinay; Khoo Soo Guan, Christopher; Wee Kim Wee School of Communication and Information; Master of Science (Information Studies); 2016-12-20T01:48:15Z; 2016; Thesis; en; Nanyang Technological University; 50 p.; application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
country |
Singapore |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Library and information science; DRNTU::Engineering::Computer science and engineering::Information systems; DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics |
description |
Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites, with their voluntary reporting and unrestricted data access, offer a competitive alternative to traditional health records and medical case reports for mining drug side effects.
Objectives: This study aims to develop machine learning text classification models that categorize sentences in online drug reviews into three categories: sentences with side-effect information, sentences expressing a negative attitude towards the drug, and sentences indicating a positive effect or drug efficacy. The secondary objective is to bootstrap a training corpus from unlabelled review sentences and train classifier models on it.
Methods: Three undergraduate coders annotated 1000 randomly selected reviews from a WebMD-based online drug review corpus at the sentence level. Features were developed from the sentences and reviews based on length, position, sentiment, matches with a side-effect dictionary, and cues indicating a side effect. Linguistic features such as unigrams, bigrams, and trigrams were also extracted. Seventy percent of the labelled corpus formed the training set, and multiple classifier models based on logistic regression and support vector machines were tested on predicting the three target categories. Bootstrapping was carried out using a rule-based seed labelling system, which was then used to expand labelling to the unlabelled data. Classifiers trained on the bootstrapped data were then used to classify the labelled test corpus.
Results: Logistic regression produced the best model for the sentence category ‘Side Effect’, with an F-measure of 0.63. Side-effect dictionary terms and sentiment values were among the most significant predictors. Support vector machine (SVM) classifiers produced the best models for the ‘Negative Sentiment/Side Effect’ (F-measure: 0.60) and ‘Effective, Positive’ (F-measure: 0.64) categories. A random forest classifier produced the best model for predicting ‘Side Effect’ using bootstrapped labels (F-measure: 0.45). Apart from dictionary-based features, sentence position, length, and sentiment score played important roles in all the models. |
author2 |
Khoo Soo Guan, Christopher |
format |
Theses and Dissertations |
author |
Sukumar Warrier, Vinay |
title |
Sentence classification of online drug reviews using machine learning techniques |
publishDate |
2016 |
url |
http://hdl.handle.net/10356/69363 |
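The abstract describes sentence-level features (length, position, side-effect dictionary matches, and unigrams, bigrams, and trigrams) and reports results as F-measures. The following is only a minimal sketch of how such features and the balanced F-measure might be computed. It is not the thesis's implementation: the function names, the tiny `SIDE_EFFECT_TERMS` set, the tokenizer, and the relative-position formula are all assumptions made for illustration, and the sentiment feature is omitted because the record does not say how it was scored.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary of side-effect terms (illustrative only;
# the dictionary used in the thesis is not given in this record).
SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "rash"}


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_features(sentence, position, review_length):
    """Build a feature dict for one sentence of a review.

    position is the 0-based index of the sentence in its review and
    review_length is the number of sentences in that review.
    """
    tokens = tokenize(sentence)
    features = Counter()
    # Unigram, bigram, and trigram counts.
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            features[f"{n}gram={gram}"] += 1
    features["length"] = len(tokens)
    # Assumed encoding: relative position, 0.0 for the first sentence.
    features["position"] = position / max(review_length - 1, 1)
    # Number of tokens that match the side-effect dictionary.
    features["dict_matches"] = sum(t in SIDE_EFFECT_TERMS for t in tokens)
    return dict(features)


def f_measure(gold, predicted, positive=1):
    """Balanced F-measure (F1) for one target category."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A feature dict from `sentence_features("I had nausea and a headache", 0, 3)` could be passed to any classifier that accepts sparse feature mappings, and `f_measure` applied per category to the held-out portion of the labelled corpus.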