Squeezing Long Sequence Data for Efficient Similarity Search

Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications. This paper proposes a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), to speed up search over long sequences. The SEM-tree is a multi-level s...

Bibliographic Details
Main Authors: SONG, Guojie; Cui, Bin; ZHENG, Baihua; Xie, Kunqing; YANG, Dongqing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2008
Subjects: Computer Sciences
Online Access: https://ink.library.smu.edu.sg/sis_research/405
Institution: Singapore Management University
id sg-smu-ink.sis_research-1404
record_format dspace
spelling sg-smu-ink.sis_research-1404 2010-09-24T06:36:22Z Squeezing Long Sequence Data for Efficient Similarity Search SONG, Guojie; Cui, Bin; ZHENG, Baihua; Xie, Kunqing; YANG, Dongqing 2008-03-01T08:00:00Z text https://ink.library.smu.edu.sg/sis_research/405 info:doi/10.1007/978-3-540-78849-2_44 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Computer Sciences
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Computer Sciences
description Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications. This paper proposes a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), to speed up search over long sequences. The SEM-tree is a multi-level structure in which each level represents the sequence data as multisets at a different level of compression; the multiset length increases toward the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they need not keep the order of characters in the sequence, which allows a shorter representation, yet they preserve most of the distance information between sequences. Each level of the tree prunes the search space further, because the predictability of the multisets allows the search to finish early, which greatly reduces the computational cost. A comprehensive set of experiments is conducted to evaluate the performance of the SEM-tree, and the results show that the proposed method is much more efficient than existing representative methods.
format text
author SONG, Guojie
Cui, Bin
ZHENG, Baihua
Xie, Kunqing
YANG, Dongqing
title Squeezing Long Sequence Data for Efficient Similarity Search
publisher Institutional Knowledge at Singapore Management University
publishDate 2008
url https://ink.library.smu.edu.sg/sis_research/405
_version_ 1770570413040992256
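
Illustrative note: the description above says that order-free multisets can prune the search space before full sequences are compared. The Python sketch below shows that general filter-and-refine idea only. It is not the SEM-tree or the paper's embedding algorithm: it assumes edit distance as the similarity measure and swaps in a simple character-count multiset, whose count difference is a lower bound on edit distance. All function names are hypothetical.

```python
from collections import Counter


def multiset_lower_bound(a: str, b: str) -> int:
    """Lower bound on edit distance from character multisets (order ignored).

    Assumption for illustration: one edit changes the surplus and the
    deficit by at most 1 each, so max(surplus, deficit) <= edit distance.
    """
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())  # characters a has in excess of b
    deficit = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus, deficit)


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def range_search(query: str, sequences: list[str], radius: int):
    """Return sequences within `radius` edits of `query`.

    The cheap multiset bound filters candidates; only survivors pay for
    the full edit-distance computation. Also returns how many were verified.
    """
    results, verified = [], 0
    for s in sequences:
        if multiset_lower_bound(query, s) > radius:
            continue  # pruned without looking at character order
        verified += 1
        if edit_distance(query, s) <= radius:
            results.append(s)
    return results, verified


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGT"]
    print(range_search("ACGTACGT", data, radius=1))
```

Because the bound never exceeds the true distance, the filter discards no correct answer; it only reduces how many full comparisons are run.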