INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/82407 |
Institution: | Institut Teknologi Bandung |
Abstract: | The advancement of technology and the use of digital data threaten individual
privacy, particularly in speech content containing Personally Identifiable
Information (PII). Therefore, a system capable of de-identifying data in speech
content is needed, especially in low-resource transcripts that are difficult to process.
This research focuses on the development and evaluation of an efficient speech
content de-identification system for low-resource languages, such as Indonesian,
which has not been extensively explored before.
The methods used in this study involve the construction of a speech dataset in
Indonesian containing privacy-sensitive information and the development of three
main components: the speech processing component, the information extraction
component, and the masking component. Training methods include using
transcription data, data augmentation, and weakly-supervised learning to improve
system performance.
From the experimental results, the de-identification method using existing
approaches provides results based on the percentage of labels in the dataset.
Combining audio transcription domain data, dataset augmentation, and
semi-supervised learning significantly improves performance, achieving a
recall of 75.2%, a precision of 75.6%, and an F1 score of 75.3% on perfect
data. |
---|
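
The abstract names a masking component that follows information extraction but does not describe how it works. As a rough illustration only, the sketch below shows one way such a step could operate on a transcript once PII spans are known. The names `mask_transcript` and `toy_extract`, the labels `NAME` and `LOCATION`, and the dictionary lookup that stands in for the information extraction component are all assumptions made for this example and are not taken from the thesis.

```python
import re
from typing import Dict, List, Tuple

# (start, end, label) with an exclusive end offset.
# The label names used below are hypothetical; the record does not list
# the PII categories used in the thesis.
Span = Tuple[int, int, str]


def toy_extract(text: str, gazetteer: Dict[str, str]) -> List[Span]:
    """Stand-in for an information extraction model: exact dictionary lookup."""
    spans: List[Span] = []
    for term, label in gazetteer.items():
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end(), label))
    return spans


def mask_transcript(text: str, spans: List[Span]) -> str:
    """Replace each labelled span with a placeholder such as [NAME]."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already masked.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    transcript = "nama saya Budi dan saya tinggal di Bandung"
    found = toy_extract(transcript, {"Budi": "NAME", "Bandung": "LOCATION"})
    print(mask_transcript(transcript, found))
    # prints: nama saya [NAME] dan saya tinggal di [LOCATION]
```

Keeping extraction and masking as separate functions mirrors the component split described in the abstract, so a trained tagger could replace `toy_extract` without changing the masking step.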