Accelerating Sequence Searching: Dimensionality Reduction Method

Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching...

Full description

Saved in:

Bibliographic Details
Main Authors:	SONG, Guojie, CUI, Bin, ZHENG, Baihua, XIE, Kunqing, YANG, Dongqing
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2009
Subjects:	Sequence similarity search Sequence embedding Index Dimension reduction Computer Sciences Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/750 https://ink.library.smu.edu.sg/context/sis_research/article/1749/viewcontent/sem.pdf
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Singapore Management University
Language:	English

id	sg-smu-ink.sis_research-1749
record_format	dspace
spelling	sg-smu-ink.sis_research-17492015-12-25T12:57:53Z Accelerating Sequence Searching: Dimensionality Reduction Method SONG, Guojie CUI, Bin ZHENG, Baihua XIE, Kunqing YANG, Dongqing Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods. 2009-09-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/750 info:doi/10.1007/s10115-008-0180-0 https://ink.library.smu.edu.sg/context/sis_research/article/1749/viewcontent/sem.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Sequence similarity search Sequence embedding Index Dimension reduction Computer Sciences Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Sequence similarity search Sequence embedding Index Dimension reduction Computer Sciences Databases and Information Systems
spellingShingle	Sequence similarity search Sequence embedding Index Dimension reduction Computer Sciences Databases and Information Systems SONG, Guojie CUI, Bin ZHENG, Baihua XIE, Kunqing YANG, Dongqing Accelerating Sequence Searching: Dimensionality Reduction Method
description	Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.
format	text
author	SONG, Guojie CUI, Bin ZHENG, Baihua XIE, Kunqing YANG, Dongqing
author_facet	SONG, Guojie CUI, Bin ZHENG, Baihua XIE, Kunqing YANG, Dongqing
author_sort	SONG, Guojie
title	Accelerating Sequence Searching: Dimensionality Reduction Method
title_short	Accelerating Sequence Searching: Dimensionality Reduction Method
title_full	Accelerating Sequence Searching: Dimensionality Reduction Method
title_fullStr	Accelerating Sequence Searching: Dimensionality Reduction Method
title_full_unstemmed	Accelerating Sequence Searching: Dimensionality Reduction Method
title_sort	accelerating sequence searching: dimensionality reduction method
publisher	Institutional Knowledge at Singapore Management University
publishDate	2009
url	https://ink.library.smu.edu.sg/sis_research/750 https://ink.library.smu.edu.sg/context/sis_research/article/1749/viewcontent/sem.pdf
_version_	1770570699686019072

Accelerating Sequence Searching: Dimensionality Reduction Method

Similar Items