QUANTIZATION IMPLEMENTATION OF INDONESIAN BERT LANGUAGE MODEL

Bibliographic Details
Main Author: Ayyub Abdurrahman, Muhammad
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/69111
Institution: Institut Teknologi Bandung
Description
Summary: In recent years, pre-trained models have dominated computational research in many fields, including natural language processing. One prominent pre-trained model is the Bidirectional Encoder Representations from Transformers (BERT). BERT has achieved state-of-the-art results and has been adapted to many languages, including the Indonesian-language BERT, IndoBERT. Like BERT, IndoBERT is large, which raises issues of latency and efficiency. To alleviate these efficiency issues, this study explores quantization as a way to compress IndoBERT. Quantization is a technique for computing and storing tensors at lower bit precision. Its advantage is that it only changes the bit width of the model weights, so the architecture does not need to be modified and no effort is required to design a smaller model. Furthermore, quantization typically causes little to no drop in performance. Popular quantization methods are post-training quantization and quantization-aware training. Post-training quantization reduces the bit precision of the weights after fine-tuning. Quantization-aware training inserts quantization operations into the model during training/fine-tuning so that the model adapts to the quantized weights and activations. Experiments were carried out on 7 downstream tasks, and the results show that the quantized models perform well compared to the full-precision model. Performance does drop in extreme cases, such as 4-bit quantization, and sequence labeling tasks show higher sensitivity to quantization. The results also show that the performance drop can be minimized by using quantization-aware training.
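
The abstract does not state which framework or tooling was used; purely as an illustration of post-training quantization applied to an Indonesian BERT model, the sketch below uses PyTorch's dynamic quantization on a publicly available IndoBERT checkpoint. The checkpoint name indobenchmark/indobert-base-p1, the two-label classification head, and the size_on_disk_mb helper are assumptions made for this example, not details taken from the thesis.

# Illustrative sketch only: post-training (dynamic) quantization of an
# IndoBERT checkpoint with PyTorch and Hugging Face Transformers.
# The checkpoint name and label count below are assumptions, not details
# from the thesis.
import os
import tempfile

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed IndoBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Post-training dynamic quantization: weights of all Linear layers are
# stored in int8 and dequantized on the fly, so the model architecture
# itself does not change.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_on_disk_mb(m: torch.nn.Module) -> float:
    # Serialize the state dict to a temporary file and report its size in MB.
    path = os.path.join(tempfile.mkdtemp(), "model.pt")
    torch.save(m.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

print(f"FP32 model: {size_on_disk_mb(model):.1f} MB")
print(f"INT8 model: {size_on_disk_mb(quantized_model):.1f} MB")

# Inference with the quantized model works the same way as with the
# full-precision model.
inputs = tokenizer("Contoh kalimat dalam bahasa Indonesia.", return_tensors="pt")
with torch.no_grad():
    logits = quantized_model(**inputs).logits
print(logits)

A quantization-aware training variant would instead insert fake-quantization observers into the model before fine-tuning (for example with the torch.ao.quantization utilities), letting the weights adapt to the reduced precision as described in the abstract; that setup is omitted here for brevity.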