Sentence classification of online drug reviews using machine learning techniques

Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites with voluntary reporting and unrestricted data access offer a competitive alternative for mining drug side effects, in comparison to traditional health records and medical ca...

Bibliographic Details
Main Author:	Sukumar Warrier, Vinay
Other Authors:	Khoo Soo Guan, Christopher
Format:	Theses and Dissertations
Language:	English
Published:	2016
Subjects:	DRNTU::Library and information science; DRNTU::Engineering::Computer science and engineering::Information systems; DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics
Online Access:	http://hdl.handle.net/10356/69363
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-69363
record_format	dspace
spelling	sg-ntu-dr.10356-69363; 2019-12-10T13:06:54Z; Sentence classification of online drug reviews using machine learning techniques; Sukumar Warrier, Vinay; Khoo Soo Guan, Christopher; Wee Kim Wee School of Communication and Information; Master of Science (Information Studies); 2016-12-20T01:48:15Z; 2016-12-20T01:48:15Z; 2016; Thesis; http://hdl.handle.net/10356/69363; en; Nanyang Technological University; 50 p.; application/pdf
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	DRNTU::Library and information science; DRNTU::Engineering::Computer science and engineering::Information systems; DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics
description	Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites, with voluntary reporting and unrestricted data access, offer a competitive alternative to traditional health records and medical case reports for mining drug side effects.
Objectives: This study aims to develop text classification models using machine learning to categorize sentences in online drug reviews into three categories: sentences with side-effect information, sentences with a negative attitude towards the drug, and sentences indicating a positive effect or drug efficacy. The secondary objective is to bootstrap a training corpus from unlabelled review sentences and train classifier models with it.
Methods: Three undergraduate coders were tasked with annotating 1000 randomly selected reviews from a WebMD-based online drug review corpus at the sentence level. Features are developed from the sentences and reviews based on length, position, sentiment, matches with a side-effect dictionary, and cues indicating a side effect. Linguistic features such as unigrams, bigrams, and trigrams are also extracted. 70% of the labelled corpus forms the training set, and multiple classifier models based on logistic regression and support vector machines are tested to predict the three target categories. Bootstrapping is carried out using a rule-based seed labelling system and used to expand on the unlabelled data. The bootstrapped training data is then used to classify the labelled test corpus.
Results: Logistic regression produced the best model for the sentence category ‘Side Effect’, with an F-measure of 0.63. Side-effect dictionary terms and sentiment values were among the top significant predictors. Support vector machine (SVM) classifiers produced the best models for the ‘Negative Sentiment/Side Effect’ (F-measure: 0.60) and ‘Effective, Positive’ (F-measure: 0.64) categories. A random forest classifier produced the best model for predicting ‘Side Effect’ using bootstrapped labels (F-measure: 0.45). Apart from dictionary-based features, sentence position, length, and sentiment score played important roles in all the models.
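The feature types and the F-measure named in the abstract can be illustrated with a small sketch. This is not the thesis's code: the two word lists, the function names, and the example review are hypothetical, the sentiment feature is left out because the record does not say how it was computed, and the F-measure is assumed to be the balanced F1 score.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons: the thesis's actual side-effect dictionary
# and cue list are not given in this record.
SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "insomnia"}
SIDE_EFFECT_CUES = {"caused", "experienced", "developed"}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_features(sentence, index, total):
    """Build a sparse feature dict for one sentence of a review."""
    tokens = tokenize(sentence)
    features = Counter()
    features["length"] = len(tokens)
    # Relative position in the review: 0.0 = first sentence, 1.0 = last.
    features["position"] = index / max(total - 1, 1)
    features["dict_matches"] = sum(t in SIDE_EFFECT_TERMS for t in tokens)
    features["cue_matches"] = sum(t in SIDE_EFFECT_CUES for t in tokens)
    # Unigram, bigram, and trigram counts.
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            features[f"{n}gram={gram}"] += 1
    return dict(features)


def f_measure(gold, predicted, positive=1):
    """Balanced F-measure (F1) for one target category (assumed definition)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical three-sentence review.
review = [
    "I took this for two weeks.",
    "It caused nausea and a bad headache.",
    "My pain is gone.",
]
feats = [sentence_features(s, i, len(review)) for i, s in enumerate(review)]
```

Feature dicts of this shape could be passed to any classifier that accepts sparse features, such as a logistic regression or SVM implementation; the score would then be computed once per target category on the held-out 30% of labelled sentences.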
author2	Khoo Soo Guan, Christopher
author_facet	Khoo Soo Guan, Christopher; Sukumar Warrier, Vinay
format	Theses and Dissertations
author	Sukumar Warrier, Vinay
author_sort	Sukumar Warrier, Vinay
title	Sentence classification of online drug reviews using machine learning techniques
publishDate	2016
url	http://hdl.handle.net/10356/69363
_version_	1681036400399482880
