INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT

The advancement of technology and the use of digital data threaten individual privacy, particularly in speech content containing Personally Identifiable Information (PII). Therefore, a system capable of de-identifying data in speech content is needed, especially in low-resource transcripts that a...

Bibliographic Details
Main Author:	Naufal Abdjul, Rifqi
Format:	Final Project
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/82407
Institution:	Institut Teknologi Bandung

id	id-itb.:82407
spelling	id-itb.:82407 2024-07-08T10:51:36Z INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT Naufal Abdjul, Rifqi Indonesia Final Project privacy, de-identification, low-resource INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/82407 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Advances in technology and the growing use of digital data threaten individual privacy, particularly when speech content contains Personally Identifiable Information (PII). A system that can de-identify speech content is therefore needed, especially for low-resource transcripts, which are difficult to process. This research develops and evaluates an efficient speech content de-identification system for low-resource languages such as Indonesian, a setting that has not been extensively explored. The study constructs an Indonesian speech dataset containing privacy-sensitive information and develops three main components: a speech processing component, an information extraction component, and a masking component. Training methods include the use of transcription data, data augmentation, and weakly-supervised learning to improve system performance. In the experiments, the results of de-identification with existing approaches depend on the percentage of labels in the dataset. Combining audio transcription domain data, dataset augmentation, and semi-supervised learning significantly improves performance, achieving a recall of 75.2%, a precision of 75.6%, and an F1 score of 75.3% on perfect data.
format	Final Project
author	Naufal Abdjul, Rifqi
spellingShingle	Naufal Abdjul, Rifqi INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
author_facet	Naufal Abdjul, Rifqi
author_sort	Naufal Abdjul, Rifqi
title	INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
title_short	INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
title_full	INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
title_fullStr	INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
title_full_unstemmed	INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
title_sort	indonesian speech content de-identification on low resource transcript
url	https://digilib.itb.ac.id/gdl/view/82407
_version_	1822997686706503680

Similar Items