Term recognition from electronic medical records of Singaporean hospital (1/2)

Bibliographic Details
Main Author: Muhammad Hafiz Mohamed Hassan
Other Authors: School of Computer Engineering
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access:http://hdl.handle.net/10356/59051
Institution: Nanyang Technological University
Description
Summary: Doctors at hospitals write daily reports on patients' statuses and keep them for future use. However, with the increasing use of computers to store information and analyse data, it has proved difficult to let doctors express their diagnoses freely while still having the system analyse the content of what is written. A further challenge lies in tying the medical terms used by experienced doctors, especially in Singapore, to the standard terms used by other doctors in different parts of the world. A need was therefore identified to bridge the manually written records of local doctors and readily available medical systems that conform to international standards; this project aims to develop a system that serves as this bridge. The main objective of the project is to develop a term recognition system that can identify the content of what a doctor has written.

The overall scope was to develop a text categorisation system that classifies the words used in electronic medical records (EMRs) through a machine learning approach. A hierarchical trie structure was used to analyse the records, and manual segmentation of the records was carried out using a sentence disambiguation technique. An n-gram approach to feature extraction was chosen, and feature generation was developed through thorough analysis of the records and rigorous testing across different designs, each of which tested a different feature or set of features. A program was created, with algorithms to classify the words according to each design specification in ARFF, LIBSVM and SVM-light formats. The classifier results were compiled and feature weighting was performed. Three main machine learning tools were used: WEKA, LIBSVM and SVM-light. Three different classifiers were used in WEKA: SMO, Naïve Bayes and the J48 decision tree.

The feature performance results showed that moving from a tri-gram output to a hex-gram output, and introducing other features, especially shape features, improves performance. Reducing the shape feature attributes, whether by combining attributes or by omitting one from the design, has little impact on the results. Adding shape feature attributes for the previous and next words improves tri-gram performance, but the performance level does not increase when the same is done for the hex-gram; hence, adding more attributes as the n-value increases does not improve the performance level. Version 5b was identified as the best feature design. This design was used to re-evaluate the test data, and automatic segmentation was performed on the original data using the outputs of the different classifiers. The segmentation results of the different classifiers were compared against a segmentation done by the author. The comparison shows that the best-performing classifier does not produce the most accurate segmentation: Naïve Bayes gives the most similar result despite having the lowest performance value. The results also show that there is much room for improvement in the feature design; a rule-based method could be considered in future work for comparison with the machine learning approach, more machine learning classifiers could be used, and the segmentation could be analysed in greater depth.
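
To illustrate the kind of structure the abstract describes, the following is a minimal Python sketch of a word-level trie for medical term lookup. The thesis does not give its trie implementation; the class name MedicalTermTrie, the method names and the sample terms are all illustrative assumptions.

    class MedicalTermTrie:
        """Word-level trie: each node maps the next word to a child node."""
        def __init__(self):
            self.children = {}
            self.is_term = False

        def insert(self, term):
            node = self
            for word in term.lower().split():
                node = node.children.setdefault(word, MedicalTermTrie())
            node.is_term = True

        def longest_match(self, tokens, start):
            # Length (in words) of the longest stored term beginning at tokens[start].
            node, best = self, 0
            for i in range(start, len(tokens)):
                node = node.children.get(tokens[i].lower())
                if node is None:
                    break
                if node.is_term:
                    best = i - start + 1
            return best

    trie = MedicalTermTrie()
    trie.insert("chest pain")              # illustrative entries only
    trie.insert("myocardial infarction")
    tokens = "patient complains of chest pain since morning".split()
    print(trie.longest_match(tokens, 3))   # -> 2, i.e. "chest pain"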
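
The n-gram features with shape attributes for the previous and next word could look roughly like the sketch below. The exact feature definitions (and the tri-gram versus hex-gram designs) are the thesis's own; this version, including the BOS/EOS boundary markers, is an assumption for illustration.

    def word_shape(token):
        # Collapse a token to a coarse shape: 'Ab1' -> 'Xxd'.
        return ''.join('X' if c.isupper() else
                       'x' if c.islower() else
                       'd' if c.isdigit() else c for c in token)

    def char_ngrams(token, n):
        # Character n-grams; empty when the token is shorter than n.
        return [token[i:i + n] for i in range(max(0, len(token) - n + 1))]

    def token_features(tokens, i, n=3):    # n=3 (tri-gram) up to n=6 (hex-gram)
        feats = {'word': tokens[i].lower(), 'shape': word_shape(tokens[i])}
        for j, gram in enumerate(char_ngrams(tokens[i].lower(), n)):
            feats['{}gram_{}'.format(n, j)] = gram
        feats['prev_shape'] = word_shape(tokens[i - 1]) if i > 0 else 'BOS'
        feats['next_shape'] = word_shape(tokens[i + 1]) if i + 1 < len(tokens) else 'EOS'
        return feats

    tokens = "Patient given 5mg Panadol".split()
    print(token_features(tokens, 2))
    # {'word': '5mg', 'shape': 'dxx', '3gram_0': '5mg',
    #  'prev_shape': 'xxxxx', 'next_shape': 'Xxxxxxx'}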
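
ARFF, LIBSVM and SVM-light are the three output formats named in the abstract. LIBSVM and SVM-light share the same sparse line format, '<label> <index>:<value> ...', with ascending 1-based indices and zero values omitted; a minimal encoder is sketched below. How the thesis mapped its string features to numeric indices is not stated. An ARFF file instead declares each attribute in a header (@relation, @attribute, @data) before listing the instances.

    def to_libsvm_line(label, feature_vector):
        # feature_vector: {index: value}; indices must appear in ascending order.
        parts = [str(label)]
        for idx in sorted(feature_vector):
            if feature_vector[idx] != 0:           # sparse format omits zeros
                parts.append('{}:{}'.format(idx, feature_vector[idx]))
        return ' '.join(parts)

    print(to_libsvm_line(1, {3: 1, 1: 0.5, 7: 2}))   # -> "1 1:0.5 3:1 7:2"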
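
The abstract compares each classifier's automatic segmentation with the author's manual one but does not name the comparison metric. One simple possibility is per-gap agreement on boundary placement, as in this hypothetical sketch.

    def boundary_agreement(predicted, manual, n_gaps):
        # Fraction of inter-token gaps where both segmentations agree on
        # placing, or not placing, a boundary. predicted/manual are sets of
        # gap indices; n_gaps is the number of candidate positions.
        agree = sum((i in predicted) == (i in manual) for i in range(n_gaps))
        return agree / n_gaps

    print(boundary_agreement({2, 5, 9}, {2, 6, 9}, 12))   # -> 0.833...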