Named entity recognition in the medical domain
This report presents a project that aims to develop Named Entity Recognition (NER) models for two datasets in the medical domain: emergency hotline data and N2C2 clinical notes. The project objectives include reviewing existing models and architectures, training on general datasets to find the best...
Main Author: | Kusalavan, Kirubhaharini |
---|---|
Other Authors: | Chng Eng Siong |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Data |
Online Access: | https://hdl.handle.net/10356/165225 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-165225 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-165225 2023-03-24T15:40:57Z Named entity recognition in the medical domain Kusalavan, Kirubhaharini Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering::Data This report presents a project that aims to develop Named Entity Recognition (NER) models for two datasets in the medical domain: emergency hotline data and N2C2 clinical notes. The project objectives include reviewing existing models and architectures, training on general datasets to find the best model and architecture, and creating a pipeline to train and deploy NER models for different domains. The NER models will be used to auto-fill forms during emergencies by detecting entities in call transcripts and to identify essential entities from clinical notes. The report also discusses the sampling method used to derive subsets from the datasets, describes the backend and frontend of the NER Flask application, and presents the results and discussion. For the GMB dataset, RoBERTa outperformed BERT by 0.19% and DistilBERT by 1.58%. RoBERTa and BERT showed similar results for the CoNLL-2003 dataset, with BERT scoring 0.02% higher than RoBERTa and 0.85% higher than DistilBERT. MedBERT was the best model for the N2C2 dataset, performing 1.71% better than BERT. However, the augmentation techniques applied to the GMB and N2C2 datasets did not yield significant improvements in the results of the NER models. Lastly, the emergency hotline dataset models showed similar results, with BioClinical BERT scoring the highest. These models can be deployed using the Flask application introduced in this report to receive useful outputs. Bachelor of Science in Data Science and Artificial Intelligence 2023-03-20T23:53:48Z 2023-03-20T23:53:48Z 2023 Final Year Project (FYP) Kusalavan, K. (2023). Named entity recognition in the medical domain. Final Year Project (FYP), Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/165225 en SCSE22-0086 application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Data |
spellingShingle |
Engineering::Computer science and engineering::Data Kusalavan, Kirubhaharini Named entity recognition in the medical domain |
description |
This report presents a project that aims to develop Named Entity Recognition (NER) models for two datasets in the medical domain: emergency hotline data and N2C2 clinical notes. The project objectives include reviewing existing models and architectures, training on general datasets to find the best model and architecture, and creating a pipeline to train and deploy NER models for different domains. The NER models will be used to auto-fill forms during emergencies by detecting entities in call transcripts and to identify essential entities from clinical notes. The report also discusses the sampling method used to derive subsets from the datasets, describes the backend and frontend of the NER Flask application, and presents the results and discussion. For the GMB dataset, RoBERTa outperformed BERT by 0.19% and DistilBERT by 1.58%. RoBERTa and BERT showed similar results for the CoNLL-2003 dataset, with BERT scoring 0.02% higher than RoBERTa and 0.85% higher than DistilBERT. MedBERT was the best model for the N2C2 dataset, performing 1.71% better than BERT. However, the augmentation techniques applied to the GMB and N2C2 datasets did not yield significant improvements in the results of the NER models. Lastly, the emergency hotline dataset models showed similar results, with BioClinical BERT scoring the highest. These models can be deployed using the Flask application introduced in this report to receive useful outputs. |
author2 |
Chng Eng Siong |
author_facet |
Chng Eng Siong Kusalavan, Kirubhaharini |
format |
Final Year Project |
author |
Kusalavan, Kirubhaharini |
author_sort |
Kusalavan, Kirubhaharini |
title |
Named entity recognition in the medical domain |
title_short |
Named entity recognition in the medical domain |
title_full |
Named entity recognition in the medical domain |
title_fullStr |
Named entity recognition in the medical domain |
title_full_unstemmed |
Named entity recognition in the medical domain |
title_sort |
named entity recognition in the medical domain |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/165225 |
_version_ |
1761781182056366080 |