QUANTIZATION IMPLEMENTATION OF INDONESIAN BERT LANGUAGE MODEL
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/69111
Institution: Institut Teknologi Bandung
Summary: In recent years, the use of pre-trained models has dominated computational research in various fields, including natural language processing. One prominent pre-trained model is Bidirectional Encoder Representations from Transformers (BERT). BERT has become state-of-the-art among comparable models and has been adapted to many languages, including the Indonesian-language BERT, IndoBERT. Like the original BERT model, IndoBERT is large, which raises issues of latency and efficiency. To alleviate these efficiency issues, in this study we explore the use of quantization to compress IndoBERT.
Quantization is a technique for computing and storing tensors at a lower bit precision. Its advantage is that it only changes the bit width of the model weights, so the model architecture does not need to be modified and no effort is required to design a smaller model. Furthermore, quantization typically incurs only a very small performance drop, or none at all.
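To illustrate what "storing tensors at a lower bit precision" means in practice, the following minimal sketch (a hypothetical PyTorch example, not taken from the study) applies per-tensor affine quantization to map a float32 tensor to int8 and back; the scheme, function names, and tensor shapes are illustrative assumptions.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Quantize a float32 tensor to int8; return (q, scale, zero_point)."""
    qmin, qmax = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    # Affine (scale/zero-point) mapping of the observed value range onto int8.
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Map int8 values back to approximate float32 values."""
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(4, 4)          # a float32 weight tensor (hypothetical)
q, s, zp = quantize_int8(w)    # stored at 8-bit precision
w_hat = dequantize(q, s, zp)   # reconstruction with a small error
print((w - w_hat).abs().max()) # error is on the order of the scale
```

The reconstruction error is bounded by the quantization step size, which is why moderate bit widths such as 8 bits usually preserve model quality, while very aggressive settings (e.g., 4 bits) can hurt it.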
Popular quantization methods are post-training quantization and quantization-aware training. Post-training quantization reduces the bit precision of the weights after the model has been fine-tuned. Quantization-aware training inserts quantization operations into the model during training/fine-tuning so that the model adapts to the quantized weights and activations.
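As a hedged sketch of post-training quantization, the snippet below applies PyTorch's dynamic quantization API to a fine-tuned IndoBERT classifier. The checkpoint name and task head are assumptions for illustration; the study's exact tooling and configuration may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load a fine-tuned IndoBERT model (checkpoint name is an assumed example).
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-base-p1"
)
model.eval()

# Post-training (dynamic) quantization: the weights of all nn.Linear layers
# are converted to int8 after fine-tuning, without changing the architecture.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is used like the original at inference time,
# but its Linear weights are stored at 8-bit precision.
```

Quantization-aware training would instead insert fake-quantization operations into the model before fine-tuning (for example via torch.quantization.prepare_qat), so that training already accounts for the quantization error; the exact recipe used in the study is not reproduced here.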
Experiments were carried out on 7 downstream tasks, and the results show that the quantized model performs well compared to the full-precision model. Performance does drop in extreme cases, such as 4-bit quantization. The experiments also show that sequence labeling downstream tasks are more sensitive to quantization, and that the performance drop can be minimized by using the quantization-aware training method.