Enhancing recipe retrieval with foundation models: A data augmentation perspective
Learning recipe and food image representations in a common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective on this problem by utilizing foundation models for data augmentation. Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart modality. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of the food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce the Data Augmented Retrieval (DAR) framework to enhance recipe and image representation learning for cross-modal retrieval. Rather than fully fine-tuning all the parameters, we inject adapter layers into the pre-trained CLIP model to reduce computation cost. In addition, a multi-level circle loss is proposed to align the original and augmented data pairs, assigning different penalties to positive and negative pairs. On the Recipe1M dataset, our DAR outperforms all existing methods by a large margin. Extensive ablation studies validate the effectiveness of each component of DAR. Code is available at https://github.com/Noah888/DAR.
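The abstract mentions injecting adapter layers into the pre-trained CLIP model rather than fully fine-tuning all parameters. The paper's exact adapter design is not reproduced in this record, so the following is a minimal sketch assuming a standard bottleneck adapter with a residual connection wrapped around frozen CLIP blocks; the class names `Adapter` and `AdaptedBlock` are hypothetical.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project,
    plus a residual connection. Only these weights are trained."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the frozen model's behaviour as the starting point.
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block (e.g., from CLIP) with a trainable adapter."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))
```

Only the adapter parameters receive gradients, which is what keeps the computation cost low compared with full fine-tuning.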
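The abstract also proposes a multi-level circle loss that assigns different penalties to positive and negative pairs. The multi-level extension is specific to the paper, but it builds on the standard circle loss (Sun et al., 2020); a minimal sketch of that base loss follows, where `sp`/`sn` are cosine similarities of positive/negative pairs (variable names are illustrative).

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                m: float = 0.25, gamma: float = 32.0) -> torch.Tensor:
    """Standard circle loss over similarity scores.
    Each pair gets a self-paced weight: positives far from the optimum
    and negatives close to it are penalized more heavily."""
    ap = torch.clamp_min(1 + m - sp.detach(), 0.0)  # weights for positive pairs
    an = torch.clamp_min(sn.detach() + m, 0.0)      # weights for negative pairs
    delta_p, delta_n = 1 - m, m                     # decision margins
    logit_p = -gamma * ap * (sp - delta_p)
    logit_n = gamma * an * (sn - delta_n)
    # Equivalent to log(1 + sum(exp(logit_n)) * sum(exp(logit_p)))
    return F.softplus(torch.logsumexp(logit_n, dim=0)
                      + torch.logsumexp(logit_p, dim=0))
```

A multi-level variant would presumably apply such a loss across both the original and the augmented recipe-image pairs, as the abstract describes, but the exact formulation is detailed only in the paper.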
Main Authors: SONG, Fangzhou; ZHU, Bin; HAO, Yanbin; WANG, Shuo
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Recipe retrieval; Data augmentation; Foundation models; Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/9726 ; https://ink.library.smu.edu.sg/context/sis_research/article/10726/viewcontent/06751.pdf
DOI: 10.1007/978-3-031-72983-6_7
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Institution: Singapore Management University
Collection: InK@SMU, Research Collection School Of Computing and Information Systems