MAE-VQA: an efficient and accurate end-to-end video quality assessment method for user generated content videos

Bibliographic Details
Main Author: Wang, Chuhan
Other Authors: Lin Weisi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access: https://hdl.handle.net/10356/178566
Institution: Nanyang Technological University
Description
Summary: In the digital age, the proliferation of user-generated content (UGC) videos presents unique challenges in maintaining video quality across diverse platforms. In this project, we propose a Masked Auto-Encoder (MAE) model for the no-reference video quality assessment (NR-VQA) problem. To the best of our knowledge, we are the first to apply the MAE to NR-VQA, and we propose the MAE-VQA model. Specifically, the MAE-VQA model is designed to evaluate the quality of UGC videos without the need for reference footage, which is often unavailable in real-world scenarios. It is composed of three modules: a patch masking module, an auto-encoder module, and a quality regression module, which respectively handle the sampling strategy, capture spatiotemporal representations, and map the representations to a video quality score. This approach is specifically designed to capture and analyze the complex spatiotemporal features and diverse distortions typical of UGC. The Vision Transformer's (ViT) self-attention mechanism allows for detailed observation of different parts of a video and facilitates the understanding of their correlations, so the Transformer can extract features and texture information from the distorted video. Given that video content is highly redundant, appropriately extracted features can speed up the model without decreasing accuracy. By masking the majority of the input video, MAE-VQA can use the ViT to learn robust spatiotemporal representations from videos. We conduct thorough assessments on benchmark datasets to contrast our methodology with cutting-edge techniques. Our approach achieves state-of-the-art performance across the majority of VQA datasets and secures a close second on the remainder, while significantly reducing computational overhead.
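
To make the described pipeline concrete, the listing below is a minimal PyTorch sketch of the three-module structure outlined in the abstract: patch masking, a ViT-style encoder, and quality regression. All class names, dimensions, and the 75% masking ratio are illustrative assumptions for this sketch, not the authors' implementation.

# Hypothetical sketch of the three-module MAE-VQA pipeline described above
# (patch masking -> ViT-style encoder -> quality regression). Module names,
# dimensions, and the masking ratio are assumptions, not the project's code.
import torch
import torch.nn as nn


class PatchMasking(nn.Module):
    """Split a video into spatiotemporal patch tokens and keep a random subset."""

    def __init__(self, in_chans=3, patch_size=16, tubelet=2, embed_dim=384, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # A 3D convolution turns a (C, T, H, W) clip into patch tokens.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=(tubelet, patch_size, patch_size),
                              stride=(tubelet, patch_size, patch_size))

    def forward(self, video):                                   # video: (B, C, T, H, W)
        tokens = self.proj(video).flatten(2).transpose(1, 2)    # (B, N, D)
        B, N, D = tokens.shape
        n_keep = max(1, int(N * (1.0 - self.mask_ratio)))
        # Randomly select which tokens remain visible to the encoder.
        keep_idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
        visible = torch.gather(tokens, 1,
                               keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return visible                                          # (B, n_keep, D)


class ViTEncoder(nn.Module):
    """Plain Transformer encoder over the visible patch tokens."""

    def __init__(self, embed_dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        return self.blocks(x)                                   # (B, n_keep, D)


class QualityRegressor(nn.Module):
    """Pool the encoded tokens and map them to a scalar quality score."""

    def __init__(self, embed_dim=384):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, 1))

    def forward(self, x):
        return self.head(x.mean(dim=1)).squeeze(-1)             # (B,)


class MAEVQASketch(nn.Module):
    """End-to-end sketch: mask patches, encode the visible ones, regress a score."""

    def __init__(self):
        super().__init__()
        self.masking = PatchMasking()
        self.encoder = ViTEncoder()
        self.regressor = QualityRegressor()

    def forward(self, video):
        visible = self.masking(video)
        features = self.encoder(visible)
        return self.regressor(features)


if __name__ == "__main__":
    clips = torch.randn(2, 3, 16, 224, 224)       # two 16-frame RGB clips
    print(MAEVQASketch()(clips).shape)            # torch.Size([2]): one score per clip

Because only the unmasked minority of tokens is passed to the encoder, the attention cost drops sharply, which reflects the abstract's point that masking most of the highly redundant video content reduces computation without sacrificing accuracy.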