Robust spoken term detection using partial search and re-scoring hypothesized detections techniques

Bibliographic Details
Main Author: Pham, Van Tung
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects:
Online Access: https://hdl.handle.net/10356/82987
http://hdl.handle.net/10220/47558
Institution: Nanyang Technological University
Description
Summary: This research focuses on Spoken Term Detection (STD), which aims to detect a textual keyword in a speech corpus. A typical STD system relies on an Automatic Speech Recognition (ASR) system to transform the speech corpus into intermediate textual representations, such as 1-best transcriptions or word and subword lattices, for indexing and retrieval. However, the imperfect modelling of the ASR system results in two types of error: misses and false alarms. This thesis aims to address both types of error.

In STD, the subword approach is attractive because it is able to address the Out-of-Vocabulary (OOV) problem. A standard subword-based STD system, referred to as the full search technique, first converts the keyword into a subword sequence and then searches for this sequence in the subword lattices. However, due to the high error rate of subword ASR, detecting the entire subword sequence in the lattices is difficult, which results in a high miss rate. This thesis proposes a partial search approach to address this problem. The proposed approach transforms the keyword’s subword sequence into overlapping sub-sequences and then searches for these sub-sequences in the index. It reduces the miss rate by accepting hypothesized detections that contain only some of the keyword’s sub-sequences.

STD systems rank and make “accept/reject” decisions on hypothesized detections using confidence scores estimated from the decoding lattices generated by the ASR system. Such scores may be inaccurate due to the imperfect modelling of speech and noise, so using lattice-based posterior probabilities as detection scores might degrade STD performance. Firstly, the posterior probabilities are not comparable across keywords, which makes it difficult to make “accept/reject” decisions using a single threshold for all keywords. Secondly, a correct detection might have a smaller posterior probability than a false alarm.

Two techniques to re-score and re-rank hypothesized detections are therefore proposed. These techniques exploit additional information that is not captured by the detection scores and hence improve STD performance. The first technique re-scores hypothesized detections using keyword exemplars. A keyword exemplar is a true instance of the keyword obtained from a labelled speech corpus. The main idea is that if a hypothesized detection is acoustically similar to the keyword exemplars, it is more likely to be a true detection and its score should be boosted. Experimental results show that this technique consistently outperforms previous re-ranking methods that do not make use of keyword exemplars. The second technique re-scores hypothesized detections by exploiting features derived from competing hypotheses, i.e. the alternative hypotheses that occupy a similar time span in the corresponding lattice. Several novel features are derived from the competing hypotheses; they reflect the relative confidence of the hypothesized detection with respect to its competitors and can be used to re-score detections. Experimental results show that using these features results in improved STD performance.
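The partial-search idea described in the abstract can be illustrated with a minimal Python sketch: the keyword's subword sequence is split into overlapping sub-sequences, and a hypothesized detection is accepted when a sub-sequence is found in an index of subword n-grams. The function names, the toy index, the window length and the example phone sequence below are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of the partial-search idea: break the keyword's subword
# sequence into overlapping sub-sequences and accept a hypothesized
# detection when a sub-sequence is found in a (toy) index of subword
# n-grams extracted from the lattices.
from typing import List, Set, Tuple


def overlapping_subsequences(subwords: List[str], length: int) -> List[Tuple[str, ...]]:
    """All contiguous sub-sequences of the given length (sliding window)."""
    if length >= len(subwords):
        return [tuple(subwords)]
    return [tuple(subwords[i:i + length]) for i in range(len(subwords) - length + 1)]


def partial_search(keyword_subwords: List[str],
                   index: Set[Tuple[str, ...]],
                   window: int = 3) -> List[Tuple[str, ...]]:
    """Return the keyword sub-sequences that occur in the index."""
    return [sub for sub in overlapping_subsequences(keyword_subwords, window)
            if sub in index]


# Toy usage: the keyword "minimum" as a phone sequence.
keyword = ["m", "ih", "n", "ih", "m", "ah", "m"]
toy_index = {("m", "ih", "n"), ("ih", "m", "ah")}   # pretend these came from lattices
print(partial_search(keyword, toy_index))           # -> [('m', 'ih', 'n'), ('ih', 'm', 'ah')]
```

The exemplar-based re-scoring can likewise be sketched as a weighted combination of the lattice score and an acoustic similarity to the closest keyword exemplar. Here a plain dynamic-time-warping distance stands in for whichever similarity measure the thesis actually uses; the fixed `weight` and the `exp(-distance)` mapping are purely illustrative simplifications.

```python
# Sketch of exemplar-based re-scoring: a detection whose acoustic features
# are close (small DTW distance) to a keyword exemplar gets its score boosted.
from typing import List

import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW distance between two (frames x dims) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])     # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m] / (n + m))


def rescore(lattice_score: float,
            detection_feats: np.ndarray,
            exemplar_feats: List[np.ndarray],
            weight: float = 0.5) -> float:
    """Combine the lattice score with similarity to the closest exemplar."""
    best = min(dtw_distance(detection_feats, ex) for ex in exemplar_feats)
    similarity = np.exp(-best)                          # map distance into (0, 1]
    return (1.0 - weight) * lattice_score + weight * float(similarity)
```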