Advancements in green AI: a pathway to sustainability
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/181771
Institution: Nanyang Technological University
Summary: In this paper, a survey of model compression and optimization techniques is evaluated on benchmarks of energy efficiency, memory footprint, and accuracy for a task key to online safety: phishing detection. The three primary categories of compression explored are (1) Quantization, (2) Distillation, and (3) Pruning. The quantization techniques explored are QLoRA and LLM.int8(), both designed for compressing LLMs, as well as Quantization-Aware Training with asymmetric quantization at inference. The distillation techniques explored are (1) Knowledge Distillation, (2) Hint Distillation for FitNets, and (3) Relational Knowledge Distillation, all of which are used to train transformer architectures smaller than the base BERT transformer. For Pruning, L1 and L2 Magnitude Pruning and Head Pruning are evaluated. The results showed that major gains in both carbon footprint and memory footprint come from applying QLoRA with FP4 quantization and an FP16 compute type, with near-zero accuracy degradation. The compressed model showed great promise, with an accuracy of 98.60%, a carbon footprint of 0.0016 kg of CO2 over 20,000 samples, and a time per inference of 0.0059 seconds, making it fast, efficient, and of high quality, especially when compared to the baseline's 98.58% accuracy, 0.0095 kg of CO2 over 20,000 samples, and 0.016 seconds per inference. The most optimal model is therefore nearly 3 times faster (0.016 s vs. 0.0059 s per inference) and emits almost 6 times less CO2 over 20,000 samples.
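For readers who want to see what the headline configuration looks like in practice, the QLoRA setup described above (4-bit FP4 weight storage with an FP16 compute type) maps onto the Hugging Face stack roughly as follows. This is a minimal sketch assuming the transformers, peft, and bitsandbytes libraries; the bert-base-uncased checkpoint and the LoRA hyperparameters are illustrative assumptions, not values taken from the project.

```python
# Minimal QLoRA-style setup: 4-bit FP4 weight storage with FP16 compute,
# plus LoRA adapters so the frozen quantized base can still be fine-tuned.
# Model name and LoRA hyperparameters are placeholders, not project values.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="fp4",             # FP4 data type, as in the abstract
    bnb_4bit_compute_dtype=torch.float16,  # FP16 compute type, as in the abstract
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # placeholder; the project compresses a BERT baseline
    num_labels=2,          # phishing vs. legitimate
    quantization_config=bnb_config,
)

lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         task_type="SEQ_CLS")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters train
```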
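The knowledge distillation objective referenced in the abstract is commonly implemented as a temperature-scaled blend of soft teacher targets and hard labels. Below is a minimal PyTorch sketch of that standard formulation; the temperature T and mixing weight alpha are illustrative defaults, not values reported in the project.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: temperature-scaled KL divergence between the teacher's
    # and student's output distributions, rescaled by T^2 as is conventional.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # alpha balances imitation of the teacher against fitting the labels.
    return alpha * soft + (1 - alpha) * hard
```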
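L1 magnitude pruning, one of the pruning variants evaluated, is available directly in PyTorch's pruning utilities. The sketch below uses a stand-in two-layer classifier head and an illustrative 30% sparsity level; neither the module sizes nor the sparsity level come from the project.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in classifier head; in the project this would be a BERT-based model.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest L1 magnitude.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Bake the mask into the weights and drop the reparametrization.
        prune.remove(module, "weight")
```

An L2-based structured variant follows the same pattern via prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0), which removes whole rows by their L2 norm rather than individual weights.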