AUTOMATIC SPEAKER RECOGNITION FOR FORENSIC APPLICATIONS IN INDONESIA BASED ON I-VECTOR MODELING

Bibliographic Details
Main Author: Hartanto, Jocelyn
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/50299
Institution: Institut Teknologi Bandung
Description
Summary: Speaker recognition is a technology that identifies a speaker from a recording of their speech, and it can support forensic applications. In Indonesia, speaker recognition is used by the Komisi Pemberantasan Korupsi (KPK), the police, and the judiciary to help verify legal evidence in court. The system currently in use is text-dependent, which demands more time and human intervention. A system that shortens the analysis time while keeping the error small is therefore desirable for the verification process. The system constructed in this work is an automatic speaker recognition system based on the Identity Vector (I-Vector) model. It is trained and tested on a speech database in Bahasa Indonesia, with recordings made in the semi-anechoic chamber of the Adhiwijogo Acoustic Laboratory, Institut Teknologi Bandung. Features are extracted as 19+1-dimensional Mel Frequency Cepstral Coefficients (MFCC). In addition to the MFCC coefficients, 20 dimensions of delta MFCC and delta-delta MFCC are used to capture speech dynamics in more detail and to achieve higher accuracy. The extracted features are modeled with an I-Vector model using 32 Gaussian components and 100-dimensional I-Vectors. Trials are scored with cosine distance scoring to obtain target and nontarget scores. Zero Normalization (Z-norm), Test Normalization (T-norm), or Zero-Test Normalization (ZT-norm) is then applied to further reduce the system’s error. The system is tested on 46 male and 52 female speech data entries and trained on the first 20 entries of each gender. The lowest Equal Error Rate (EER) achieved is 3.50%, obtained with T-normed and ZT-normed scores in the female interview scenario; for male speakers, the lowest EER is 3.56%, obtained with T-normed scores in the conversation scenario. These low EERs indicate that the system performs better than the previous speaker recognition system based on the GMM-UBM model.
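The scoring and evaluation steps named in the summary (cosine distance scoring, score normalization, and the Equal Error Rate) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the threshold sweep, and the use of NumPy are assumptions made for illustration, and the I-Vectors are taken as already extracted. Z-norm is shown here; T-norm applies the same standardization, but with statistics gathered by scoring the test utterance against a cohort of other speaker models.

```python
import numpy as np

def cosine_score(enrolled, test):
    """Cosine similarity between an enrolled i-vector and a test i-vector."""
    enrolled = np.asarray(enrolled, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(np.dot(enrolled, test) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def z_norm(raw_score, impostor_scores):
    """Z-norm: standardize a raw score with the mean and standard deviation
    of the enrolled model's scores against impostor utterances."""
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    return float((raw_score - impostor_scores.mean()) / impostor_scores.std())

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores
    and taking the point where false acceptance and false rejection are closest."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([target, nontarget])):
        far = np.mean(nontarget >= threshold)  # impostors wrongly accepted
        frr = np.mean(target < threshold)      # true speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

With a small score set this sweep is only a coarse approximation; evaluation toolkits usually interpolate between operating points to locate the EER more precisely.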