INDONESIAN SPEECH CONTENT DE-IDENTIFICATION ON LOW RESOURCE TRANSCRIPT

Bibliographic Details
Main Author:	Naufal Abdjul, Rifqi
Format:	Final Project
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/82407
Institution:	Institut Teknologi Bandung

Description
Summary:	Advances in technology and the growing use of digital data threaten individual privacy, particularly when speech content contains Personally Identifiable Information (PII). A system that can de-identify speech content is therefore needed, especially for low-resource transcripts, which are difficult to process. This research develops and evaluates an efficient speech content de-identification system for a low-resource language, Indonesian, a setting that has not been extensively explored. The study constructs an Indonesian speech dataset containing privacy-sensitive information and develops three main components: a speech processing component, an information extraction component, and a masking component. Training methods include the use of transcription data, data augmentation, and weakly-supervised learning to improve system performance. In the experiments, de-identification with existing approaches yields results that depend on the percentage of labels in the dataset. Combining audio transcription domain data, dataset augmentation, and semi-supervised learning significantly improves performance, achieving a recall of 75.2%, a precision of 75.6%, and an F1 score of 75.3% on perfect data.
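
As a rough illustration of the masking component and of how recall, precision, and F1 are commonly computed for this kind of task, the following is a minimal sketch and not the author's implementation. It assumes the information extraction step returns character-level spans with a label; the label names (`NAME`, `LOCATION`), the function names, the example sentence, and the exact-match scoring rule are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def mask_transcript(text: str, spans: List[Span]) -> str:
    """Replace each labeled span in the transcript with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labeled spans (an assumed scoring rule)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example transcript and annotations.
transcript = "nama saya Budi dari Bandung"
gold_spans = [(10, 14, "NAME"), (20, 27, "LOCATION")]
pred_spans = [(10, 14, "NAME")]

masked = mask_transcript(transcript, gold_spans)
precision, recall, f1 = span_prf(gold_spans, pred_spans)
```

Here `masked` holds the transcript with both entities replaced by placeholders, and the scores reflect a prediction that found one of the two annotated spans.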
