Negative Factor: Improving Regular-Expression Matching in Strings

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques be...

Full description

Saved in:
Bibliographic Details
Main Authors: YANG, Xiaochun, QIU, Tao, WANG, Bin, ZHENG, Baihua, WANG, Yaoshu, LI, Chen
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2016
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/3157
https://ink.library.smu.edu.sg/context/sis_research/article/4157/viewcontent/P_ID_51559_NegativeFactor_2015_afv.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-4157
record_format dspace
spelling sg-smu-ink.sis_research-41572020-01-15T15:06:42Z Negative Factor: Improving Regular-Expression Matching in Strings YANG, Xiaochun QIU, Tao WANG, Bin ZHENG, Baihua WANG, Yaoshu LI, Chen The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents. 2016-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3157 info:doi/10.1145/2847525 https://ink.library.smu.edu.sg/context/sis_research/article/4157/viewcontent/P_ID_51559_NegativeFactor_2015_afv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University long sequence performance regular expression algorithms automata patterns search Databases and Information Systems Theory and Algorithms
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic long sequence
performance
regular expression
algorithms
automata
patterns
search
Databases and Information Systems
Theory and Algorithms
spellingShingle long sequence
performance
regular expression
algorithms
automata
patterns
search
Databases and Information Systems
Theory and Algorithms
YANG, Xiaochun
QIU, Tao
WANG, Bin
ZHENG, Baihua
WANG, Yaoshu
LI, Chen
Negative Factor: Improving Regular-Expression Matching in Strings
description The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.
format text
author YANG, Xiaochun
QIU, Tao
WANG, Bin
ZHENG, Baihua
WANG, Yaoshu
LI, Chen
author_facet YANG, Xiaochun
QIU, Tao
WANG, Bin
ZHENG, Baihua
WANG, Yaoshu
LI, Chen
author_sort YANG, Xiaochun
title Negative Factor: Improving Regular-Expression Matching in Strings
title_short Negative Factor: Improving Regular-Expression Matching in Strings
title_full Negative Factor: Improving Regular-Expression Matching in Strings
title_fullStr Negative Factor: Improving Regular-Expression Matching in Strings
title_full_unstemmed Negative Factor: Improving Regular-Expression Matching in Strings
title_sort negative factor: improving regular-expression matching in strings
publisher Institutional Knowledge at Singapore Management University
publishDate 2016
url https://ink.library.smu.edu.sg/sis_research/3157
https://ink.library.smu.edu.sg/context/sis_research/article/4157/viewcontent/P_ID_51559_NegativeFactor_2015_afv.pdf
_version_ 1770572869267357696