Named entity recognition in the medical domain

This report presents a project that aims to develop Named Entity Recognition (NER) models for two datasets in the medical domain: emergency hotline data and N2C2 clinical notes. The project objectives include reviewing existing models and architectures, training on general datasets to find the best...

Bibliographic Details
Main Author: Kusalavan, Kirubhaharini
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Data
Online Access: https://hdl.handle.net/10356/165225
Institution: Nanyang Technological University
id sg-ntu-dr.10356-165225
record_format dspace
spelling sg-ntu-dr.10356-165225 2023-03-24T15:40:57Z
Named entity recognition in the medical domain
Kusalavan, Kirubhaharini
Chng Eng Siong
School of Computer Science and Engineering
ASESChng@ntu.edu.sg
Engineering::Computer science and engineering::Data
Bachelor of Science in Data Science and Artificial Intelligence
2023-03-20T23:53:48Z
2023
Final Year Project (FYP)
Kusalavan, K. (2023). Named entity recognition in the medical domain. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165225
en
SCSE22-0086
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Data
description This report presents a project to develop Named Entity Recognition (NER) models for two datasets in the medical domain: emergency hotline data and N2C2 clinical notes. The project objectives are to review existing models and architectures, to train on general datasets to find the best model and architecture, and to create a pipeline for training and deploying NER models for different domains. The NER models will be used to auto-fill forms during emergencies by detecting entities in call transcripts, and to identify essential entities in clinical notes. The report also describes the sampling method used to derive subsets from the datasets and the backend and frontend of the NER Flask application, and it presents and discusses the results. On the GMB dataset, RoBERTa outperformed BERT by 0.19% and DistilBERT by 1.58%. On the CoNLL-2003 dataset, RoBERTa and BERT performed similarly: BERT scored 0.02% higher than RoBERTa and 0.85% higher than DistilBERT. MedBERT was the best model on the N2C2 dataset, performing 1.71% better than BERT. However, the augmentation techniques applied to the GMB and N2C2 datasets did not significantly improve the results of the NER models. Lastly, the models trained on the emergency hotline dataset showed similar results, with BioClinical BERT scoring the highest. These models can be deployed using the Flask application introduced in this report.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Kusalavan, Kirubhaharini
format Final Year Project
author Kusalavan, Kirubhaharini
author_sort Kusalavan, Kirubhaharini
title Named entity recognition in the medical domain
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/165225