INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT

The advancement of technology and the use of digital data threaten individual privacy, particularly in speech content containing Personally Identifiable Information (PII). Therefore, a system capable of de-identifying data in speech content is needed, especially in low-resource transcripts that a...

Bibliographic Details
Main Author: Naufal Abdjul, Rifqi
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/82407
Institution: Institut Teknologi Bandung
id id-itb.:82407
timestamp 2024-07-08T10:51:36Z
keywords privacy, de-identification, low-resource
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Advances in technology and the growing use of digital data threaten individual privacy, particularly when speech content contains Personally Identifiable Information (PII). A system that can de-identify speech content is therefore needed, especially for low-resource transcripts, which are difficult to process. This research develops and evaluates an efficient speech content de-identification system for low-resource languages such as Indonesian, an area that has not been extensively explored before. The study constructs an Indonesian speech dataset containing privacy-sensitive information and develops three main components: a speech processing component, an information extraction component, and a masking component. Training methods include the use of transcription data, data augmentation, and weakly-supervised learning to improve system performance. In the experiments, de-identification with existing approaches yields results that vary with the percentage of labels in the dataset. Combining audio transcription domain data, dataset augmentation, and semi-supervised learning significantly improves performance, achieving a recall of 75.2%, a precision of 75.6%, and an F1 score of 75.3% on perfect data.
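
The masking component named in the abstract can be illustrated with a minimal sketch. It assumes the information extraction component has already produced character-offset spans over the transcript. The `Span` type, the `mask_transcript` function, the label names, and the sample sentence are all hypothetical and are not taken from the thesis, whose actual masking strategy is not described in this record.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected PII span (hypothetical structure, not from the thesis)."""
    start: int  # inclusive character offset into the transcript
    end: int    # exclusive character offset into the transcript
    label: str  # illustrative PII category, e.g. "NAME" or "PHONE"


def mask_transcript(text: str, spans: list[Span]) -> str:
    """Replace every detected span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            # Skip a span that overlaps one already masked.
            continue
        pieces.append(text[cursor:span.start])  # text before the span
        pieces.append(f"[{span.label}]")        # placeholder for the PII
        cursor = span.end
    pieces.append(text[cursor:])                # remaining tail
    return "".join(pieces)


# Assumed example: an Indonesian transcript with a name and a phone number.
transcript = "nama saya Budi dan nomor saya 08123456789"
spans = [Span(10, 14, "NAME"), Span(30, 41, "PHONE")]
masked = mask_transcript(transcript, spans)
print(masked)
```

Replacing spans with category placeholders is only one possible design; a system could equally remove the text or substitute surrogate values, and the record does not say which the author chose.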