INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/82407 |
Institution: | Institut Teknologi Bandung |
Summary: | The advancement of technology and the use of digital data threaten individual
privacy, particularly in speech content containing Personally Identifiable
Information (PII). Therefore, a system capable of de-identifying data in speech
content is needed, especially in low-resource transcripts that are difficult to process.
This research develops and evaluates an efficient speech content
de-identification system for low-resource languages such as Indonesian,
a setting that has not been extensively explored.
The study constructs an Indonesian speech dataset containing privacy-sensitive
information and develops three main components: the speech processing
component, the information extraction component, and the masking component.
Training uses transcription data, data augmentation, and weakly-supervised
learning to improve system performance.
Experimental results show that existing de-identification approaches yield
results that depend on the percentage of labels in the dataset. Combining
audio transcription domain data, dataset augmentation, and semi-supervised
learning significantly improves performance, achieving a recall of 75.2%, a
precision of 75.6%, and an F1 score of 75.3% on perfect data. |