Visual food recognition using artificial intelligence

Bibliographic Details
Main Author: Zhao, Heng
Other Authors: Yap Kim Hui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/161182
Institution: Nanyang Technological University
Description
Summary: Food-related research is gaining increasing attention because of its importance in people's daily lives. A proper understanding of daily food intake benefits not only individuals' personal health but also the collective good of society. Visual food classification is the task of recognizing different food dishes from pictures. Early approaches to visual food classification focus on traditional classification methods based on hand-crafted image features. More recent work uses popular deep architectures such as convolutional neural networks (CNNs) and performs the classification directly on visual image features extracted from the trained network. In this thesis, we explore different issues in visual food classification using deep architectures. Our first work, in Chapter 3, aims to develop a compact network for mobile visual food recognition. We propose a joint-learning distilled network (JDNet) that aims to achieve high food classification accuracy with a compact student network that learns from a large teacher network. Both networks are trained simultaneously while knowledge is transferred between them using the proposed knowledge distillation (KD) techniques. With the joint model, we achieve strong performance compared to state-of-the-art large models. Chapters 4 and 5 address the issue of data scarcity that is often encountered in visual food classification. Existing frameworks and approaches rely heavily on many-shot training of a deep network on existing large-scale food datasets. However, for many food categories it is difficult to collect a large number of images for training.
In view of this situation, we study the task of visual food classification under low-shot learning scenarios: 1) few-shot learning (FSL), which performs the classification using only a few labeled samples per category; 2) zero-shot learning (ZSL), which aims to classify new categories that are unseen during network training. Our second work, in Chapter 4, aims to integrate few-shot and many-shot learning. Traditional few-shot learning is unable to properly address the problem of visual food classification because of the complex characteristics and large variations of food images. In addition, most few-shot frameworks cannot perform classification for many-shot and few-shot categories at the same time. Hence, we propose a fusion learning method that unifies many-shot and few-shot learning under a single framework. It leverages image features and text embeddings, and adopts a graph convolutional network (GCN) to capture inter-class correlations between different food categories. Our method achieves state-of-the-art few-shot and fusion classification performance on several food benchmark datasets. Our third work, in Chapter 5, focuses on zero-shot learning, where food images are not available for new categories during network training and hence semantic information plays an important role. We propose a bi-directional visual-semantic autoencoder network (VSAN) that is dedicated to exploring the rich visual-semantic interactions and generating discriminative representations in both the visual and semantic spaces. VSAN aims to generate discriminative visual features that incorporate semantic information using a proposed attribute autoencoder network, and to generate new semantic attribute and class label embeddings that preserve visual relations across different classes using a proposed visual hierarchy. Comprehensive experiments on four benchmark datasets demonstrate the superior performance of VSAN over prior zero-shot learning works.
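The general idea behind zero-shot classification, matching an image representation against semantic embeddings of classes that had no training images, can be illustrated with a minimal sketch. This is not VSAN or any method from the thesis: the function names, the class names, and the 3-dimensional embeddings are all hypothetical, and the image embedding is assumed to have already been projected into the same space as the class embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_predict(image_embedding, class_embeddings):
    """Return the class whose semantic embedding is most similar to the image.

    class_embeddings maps a class name to its semantic vector; none of these
    classes needs to have been seen as images during training.
    """
    return max(
        class_embeddings,
        key=lambda name: cosine(image_embedding, class_embeddings[name]),
    )


# Hypothetical semantic embeddings for two unseen food categories.
unseen_classes = {
    "laksa": [0.9, 0.1, 0.3],
    "satay": [0.1, 0.8, 0.2],
}

# Hypothetical image feature, assumed already mapped into the semantic space.
image_embedding = [0.8, 0.2, 0.25]

prediction = zero_shot_predict(image_embedding, unseen_classes)
print(prediction)
```

In practice the projection from visual features into the semantic space is the part that has to be learned; the nearest-embedding lookup shown here is only the final decision step.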
In summary, this thesis addresses different aspects and tasks of visual food classification using different deep learning techniques. We propose the following methods and frameworks: 1) a joint-learning distilled network (JDNet) for compact network design; 2) a fusion learning framework for the integration of few-shot and many-shot classification; and 3) a bi-directional visual-semantic autoencoder network (VSAN) for zero-shot learning.
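For readers unfamiliar with knowledge distillation, the sketch below shows a generic temperature-scaled distillation loss in which a student is trained on both the true label and the teacher's softened predictions. It is an assumption-laden illustration of the standard technique, not the JDNet loss from the thesis: the function names, the temperature of 4.0, and the weighting `alpha` are hypothetical choices, and the thesis's joint training of both networks is not modelled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    temperature and alpha are illustrative defaults, not values from the thesis.
    """
    # Cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # KL divergence between softened teacher and student distributions.
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft


# Hypothetical logits over three food classes; the true class is index 0.
teacher = [5.0, 1.0, 0.5]
student = [2.0, 1.5, 0.5]
loss = distillation_loss(student, teacher, label=0)
print(loss)
```

When the student's logits match the teacher's, the soft term vanishes and only the hard-label cross-entropy remains.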