Fast Filter-and-Refine Algorithms for Subsequence Selection

Large sequence databases, such as protein, DNA and gene sequences in biology, are becoming increasingly common. An important operation on a sequence database is approximate subsequence matching, where all subsequences that are within some distance from a given query string are retrieved. This paper...

Full description

Saved in:
Bibliographic Details
Main Authors: OOI, Beng-Chin, PANG, Hwee Hwa, WANG, Hao, WONG, Limsoon, YU, Cui
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2002
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/1144
https://ink.library.smu.edu.sg/context/sis_research/article/2143/viewcontent/FastFilterandRefine_edited_.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Singapore Management University
Language: English
Description
Summary:Large sequence databases, such as protein, DNA and gene sequences in biology, are becoming increasingly common. An important operation on a sequence database is approximate subsequence matching, where all subsequences that are within some distance from a given query string are retrieved. This paper proposes a filter-and-refine algorithm that enables efficient approximate subsequence matching in large DNA sequence databases. It employs a bitmap indexing structure to condense and encode each data sequence into a shorter index sequence. During query processing, the bitmap index is used to filter out most of the irrelevant subsequences, and false positives are removed in the final refinement step. Analytical and experimental studies show that the proposed strategy is capable of reducing response time substantially while incurring only a small space overhead.