An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval

Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this w...

Bibliographic Details
Main Authors:	JIANG, Jing; ZHAI, ChengXiang
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2007
Subjects:	Databases and Information Systems; Numerical Analysis and Scientific Computing
Online Access:	https://ink.library.smu.edu.sg/sis_research/23 http://dx.doi.org/10.1007/s10791-007-9027-7
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-1022
record_format	dspace
spelling	sg-smu-ink.sis_research-1022 2010-09-22T14:00:36Z 2007-10-01T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/23 info:doi/10.1007/s10791-007-9027-7 http://dx.doi.org/10.1007/s10791-007-9027-7 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems Numerical Analysis and Scientific Computing
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Databases and Information Systems Numerical Analysis and Scientific Computing
description	Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
format	text
author	JIANG, Jing ZHAI, ChengXiang
author_sort	JIANG, Jing
title	An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval
title_sort	empirical study of tokenization strategies for biomedical information retrieval
publisher	Institutional Knowledge at Singapore Management University
publishDate	2007
url	https://ink.library.smu.edu.sg/sis_research/23 http://dx.doi.org/10.1007/s10791-007-9027-7
_version_	1770568801397506048
