Enhancing recipe retrieval with foundation models: A data augmentation perspective

Learning recipe and food image representations in a common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective on this problem by utilizing foundation models for data augmentation. Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart modality. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of the food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce the Data Augmented Retrieval (DAR) framework to enhance recipe and image representation learning for cross-modal retrieval. We first inject adapter layers into the pre-trained CLIP model to reduce computation cost, rather than fully fine-tuning all the parameters. In addition, a multi-level circle loss is proposed to align the original and augmented data pairs, which assigns different penalties to positive and negative pairs. On the Recipe1M dataset, our DAR outperforms all existing methods by a large margin. Extensive ablation studies validate the effectiveness of each component of DAR. Code is available at https://github.com/Noah888/DAR.
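
As a concrete illustration of the adapter design mentioned in the abstract, below is a minimal PyTorch sketch of injecting residual bottleneck adapters on top of a frozen pre-trained CLIP encoder, so that only the small adapter layers receive gradients. The class names (`Adapter`, `AdaptedCLIP`), the bottleneck width, and the embedding size are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen CLIP features intact at init.
        return x + self.up(self.act(self.down(x)))

class AdaptedCLIP(nn.Module):
    """Frozen CLIP backbone with trainable adapters on both modalities.

    `clip_model` is assumed to expose `encode_image` / `encode_text`,
    as in the openai/CLIP reference implementation.
    """
    def __init__(self, clip_model: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.clip = clip_model
        for p in self.clip.parameters():
            p.requires_grad = False  # freeze all CLIP weights
        self.image_adapter = Adapter(embed_dim)  # only adapters are trained
        self.text_adapter = Adapter(embed_dim)

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        # .float() guards against fp16 outputs from the CLIP backbone.
        return self.image_adapter(self.clip.encode_image(images).float())

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.text_adapter(self.clip.encode_text(tokens).float())
```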

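The multi-level circle loss builds on the standard circle loss (Sun et al., CVPR 2020), which assigns each pair an adaptive weight based on how far its similarity is from its optimum, so positive and negative pairs are penalized differently. The sketch below implements only that standard single-level form; the paper's multi-level weighting across original and augmented pairs is not reproduced here.

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                m: float = 0.25, gamma: float = 80.0) -> torch.Tensor:
    """Standard circle loss.

    sp: (N, P) cosine similarities of positive pairs.
    sn: (N, K) cosine similarities of negative pairs.
    m:  relaxation margin; gamma: scale factor.
    """
    # Adaptive, non-negative weights: positives far below the optimum
    # O_p = 1 + m and negatives above O_n = -m are penalized more heavily.
    ap = torch.clamp_min(1 + m - sp.detach(), 0.0)
    an = torch.clamp_min(sn.detach() + m, 0.0)
    # Decision margins: Delta_p = 1 - m, Delta_n = m.
    logit_p = -gamma * ap * (sp - (1 - m))
    logit_n = gamma * an * (sn - m)
    # loss = log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    return F.softplus(
        torch.logsumexp(logit_n, dim=-1) + torch.logsumexp(logit_p, dim=-1)
    ).mean()
```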

Bibliographic Details
Main Authors: SONG, Fangzhou; ZHU, Bin; HAO, Yanbin; WANG, Shuo
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Recipe retrieval; Data augmentation; Foundation models; Databases and Information Systems; Graphics and Human Computer Interfaces
Collection: Research Collection School Of Computing and Information Systems
DOI: 10.1007/978-3-031-72983-6_7
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Online Access: https://ink.library.smu.edu.sg/sis_research/9726
https://ink.library.smu.edu.sg/context/sis_research/article/10726/viewcontent/06751.pdf