An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval

Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text.

Bibliographic Details
Main Authors: JIANG, Jing, ZHAI, ChengXiang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2007
Subjects: Databases and Information Systems; Numerical Analysis and Scientific Computing
Online Access: https://ink.library.smu.edu.sg/sis_research/23
http://dx.doi.org/10.1007/s10791-007-9027-7
Institution: Singapore Management University
id sg-smu-ink.sis_research-1022
record_format dspace
spelling sg-smu-ink.sis_research-1022 2010-09-22T14:00:36Z An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval JIANG, Jing ZHAI, ChengXiang 2007-10-01T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/23 info:doi/10.1007/s10791-007-9027-7 http://dx.doi.org/10.1007/s10791-007-9027-7 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems Numerical Analysis and Scientific Computing
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
Numerical Analysis and Scientific Computing
description Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
format text
author JIANG, Jing
ZHAI, ChengXiang
title An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval
publisher Institutional Knowledge at Singapore Management University
publishDate 2007
url https://ink.library.smu.edu.sg/sis_research/23
http://dx.doi.org/10.1007/s10791-007-9027-7