INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT

Bibliographic Details
Main Author: Naufal Abdjul, Rifqi
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/82407
Institution: Institut Teknologi Bandung
Description
Summary: The advancement of technology and the growing use of digital data threaten individual privacy, particularly in speech content containing Personally Identifiable Information (PII). A system capable of de-identifying speech content is therefore needed, especially for low-resource transcripts, which are difficult to process. This research focuses on the development and evaluation of an efficient speech content de-identification system for low-resource languages such as Indonesian, which has not been extensively explored before. The methods used in this study are the construction of an Indonesian speech dataset containing privacy-sensitive information and the development of three main components: the speech processing component, the information extraction component, and the masking component. Training methods include the use of transcription data, data augmentation, and weakly supervised learning to improve system performance. In the experiments, existing de-identification approaches yield results that depend on the percentage of labels in the dataset. The use of combined methods, including audio transcription domain data, dataset augmentation, and semi-supervised learning, significantly improves performance, achieving a recall of 75.2%, a precision of 75.6%, and an F1 score of 75.3% on perfect data.
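
The masking component named in the summary can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes the information extraction component has already produced character-offset spans over the transcript, and the names PiiSpan and mask_transcript, the label set (NAME, PHONE), and the sample sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiSpan:
    """Hypothetical output of an information extraction step."""
    start: int  # inclusive character offset into the transcript
    end: int    # exclusive character offset into the transcript
    label: str  # assumed label set, e.g. "NAME" or "PHONE"


def mask_transcript(text: str, spans: list[PiiSpan]) -> str:
    """Replace each PII span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            # Overlap handling is out of scope for this sketch.
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:span.start])  # keep non-sensitive text
        pieces.append(f"[{span.label}]")        # mask the sensitive span
        cursor = span.end
    pieces.append(text[cursor:])                # keep the remaining tail
    return "".join(pieces)


def span_of(text: str, surface: str, label: str) -> PiiSpan:
    """Build a span by locating a surface string (illustration only)."""
    start = text.index(surface)
    return PiiSpan(start, start + len(surface), label)


# Made-up Indonesian transcript: "my name is Budi and my number is ..."
transcript = "nama saya Budi dan nomor saya 08123456789"
spans = [
    span_of(transcript, "Budi", "NAME"),
    span_of(transcript, "08123456789", "PHONE"),
]
masked = mask_transcript(transcript, spans)
print(masked)
```

In a full system the spans would come from a trained sequence labeller running on the speech recognizer's output rather than from string lookup, and the same offsets could be mapped back to audio timestamps if the audio itself had to be masked.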