An evaluation of tokenizers on domain specific text

The healthcare industry is fast realizing the value of data, collecting information from electronic health record systems (EHRs), sensors, and other sources. However, the problem of understanding the collected data has existed for years. According to big data analytics in healthcare, up to 80% of healthcare documentation is unstructured and hence largely unutilized, because mining and extracting this data is challenging and resource-intensive. This is where Natural Language Processing (NLP) comes in: NLP technologies have the potential to extract meaningful insights and concepts from data previously considered buried in text form. In NLP studies, text preprocessing is traditionally the first step in building a machine learning model, and within text preprocessing, the very first and usually most important step is tokenization. Many open-source tokenization tools are available, each splitting text according to different rules, but few studies have examined tokenizer performance on domain-specific text, e.g., in the healthcare domain. This project therefore aims, first, to evaluate the performance of different open-source tokenizers on medical text and select the best-performing one, and then to build a wrapper around that tokenizer to further improve its performance on medical text. In this way, more accurate tokenization of medical text can be achieved, and the results can feed into downstream NLP processing to generate more meaningful insights. With NLP technology, physicians can enhance patient care, research efforts, and disease diagnosis methods.
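The project's tokenizer evaluation and wrapper are not reproduced in this record, but the motivation can be illustrated with a minimal sketch: a naive whitespace split mangles clinical tokens such as dosages and dotted abbreviations, while even a small rule-based tokenizer handles them better. The regex rules and the sample sentence below are invented for illustration and are not taken from the project.

```python
import re

# Invented sample sentence with typical clinical tokens: a dosage
# ("0.5 mg/kg"), a dotted abbreviation ("i.v."), a frequency code
# ("q6h"), and a lab value ("7.2%").
text = "Pt. given 0.5 mg/kg i.v. q6h; HbA1c was 7.2%."

# Naive whitespace tokenizer: punctuation stays glued to tokens,
# so "q6h;" and "7.2%." come out as single malformed tokens.
ws_tokens = text.split()

# Minimal rule-based tokenizer (illustrative rules only), alternatives
# tried left to right:
#   1. numbers, decimals, percentages          -> "0.5", "7.2%"
#   2. dotted abbreviations                    -> "i.v."
#   3. alphanumeric words, incl. "/"-joined units -> "q6h", "mg/kg", "HbA1c"
#   4. any other single non-space character    -> ";", "."
pattern = re.compile(
    r"\d+(?:\.\d+)?%?"
    r"|[A-Za-z]+(?:\.[A-Za-z]\.?)+"
    r"|[A-Za-z0-9]+(?:/[A-Za-z0-9]+)?"
    r"|[^\sA-Za-z0-9]"
)
rule_tokens = pattern.findall(text)

print(ws_tokens)
print(rule_tokens)
```

Because the alternatives are tried in order, the number and abbreviation rules get a chance to consume "0.5" and "i.v." before the catch-all word rule does; a real evaluation would score such rule sets against a gold-standard tokenization of medical text, which is what the project's comparison of open-source tokenizers amounts to.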


Bibliographic Details
Main Author: Tao, Yuan
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/156461
Degree: Bachelor of Engineering (Computer Science)
Citation: Tao, Y. (2022). An evaluation of tokenizers on domain specific text. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156461