INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
Main Author: | Naufal Abdjul, Rifqi |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/82407 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:82407 |
---|---|
spelling |
id-itb.:82407 2024-07-08T10:51:36Z INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT Naufal Abdjul, Rifqi Indonesia Final Project privacy, de-identification, low-resource INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/82407 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesian |
description |
The advancement of technology and the use of digital data threaten individual
privacy, particularly in speech content containing Personally Identifiable
Information (PII). A system capable of de-identifying speech content is therefore
needed, especially for low-resource transcripts, which are difficult to process.
This research focuses on the development and evaluation of an efficient speech
content de-identification system for low-resource languages such as Indonesian,
an area that has not been extensively explored.
The study constructs an Indonesian speech dataset containing privacy-sensitive
information and develops three main components: a speech processing component,
an information extraction component, and a masking component. Training methods
include the use of transcription data, data augmentation, and weakly-supervised
learning to improve system performance.
The experimental results show that de-identification with existing approaches
yields results that vary with the percentage of labels in the dataset. Combining
audio transcription domain data, dataset augmentation, and semi-supervised
learning significantly improves performance, achieving a recall of 75.2%, a
precision of 75.6%, and an F1 score of 75.3% on perfect data. |
format |
Final Project |
author |
Naufal Abdjul, Rifqi |
title |
INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT |
url |
https://digilib.itb.ac.id/gdl/view/82407 |
_version_ |
1822997686706503680 |
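
The description above names three components: speech processing, information extraction, and masking. As a rough illustration of how the last two might fit together on an already transcribed utterance, here is a minimal Python sketch. It is not the author's implementation: the regular expressions stand in for a trained information extraction model, and the names `Span`, `PATTERNS`, `extract_pii`, `mask`, and `deidentify` are hypothetical. The speech processing step (audio to transcript) is assumed to have run beforehand.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected PII span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


# Assumption: simple patterns used only as a stand-in for a learned
# information extraction model; real transcripts often spell digits out.
PATTERNS = {
    "PHONE": re.compile(r"\b08\d{8,11}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_pii(transcript: str) -> list[Span]:
    """Information extraction step: return PII spans sorted by position."""
    spans = [
        Span(m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(transcript)
    ]
    return sorted(spans, key=lambda s: s.start)


def mask(transcript: str, spans: list[Span]) -> str:
    """Masking step: replace each span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for span in spans:
        if span.start < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(transcript[cursor:span.start])
        pieces.append(f"[{span.label}]")
        cursor = span.end
    pieces.append(transcript[cursor:])
    return "".join(pieces)


def deidentify(transcript: str) -> str:
    """Run extraction followed by masking on a single transcript."""
    return mask(transcript, extract_pii(transcript))
```

Calling `deidentify("nomor saya 081234567890 dan email budi@example.com")` replaces the phone number and email address with `[PHONE]` and `[EMAIL]` placeholders. Keeping extraction and masking as separate functions mirrors the component split in the description and would let a trained tagger replace `extract_pii` without touching the masking logic.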