Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification

Bibliographic Details
Main Authors: YU, Jianfei, JIANG, Jing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Online Access: https://ink.library.smu.edu.sg/sis_research/5504
https://ink.library.smu.edu.sg/context/sis_research/article/6507/viewcontent/TASLP.2019.2957872.pdf
Institution: Singapore Management University
Description
Summary: Entity-level (aka target-dependent) sentiment analysis of social media posts has recently attracted increasing attention. Its goal is to predict the sentiment orientation toward each target entity mentioned in a user's post. Most existing approaches to this task rely primarily on textual content and fail to consider other important data sources (e.g., images, videos, and user profiles) that could enhance these text-based approaches. Motivated by this observation, we study entity-level multimodal sentiment classification in this article, and aim to explore the usefulness of images for entity-level sentiment detection in social media posts. Specifically, we propose an Entity-Sensitive Attention and Fusion Network (ESAFN) for this task. First, to capture the intra-modality dynamics, ESAFN leverages an effective attention mechanism to generate entity-sensitive textual representations, and then aggregates them with a textual fusion layer. Next, ESAFN learns the entity-sensitive visual representation with an entity-oriented visual attention mechanism, followed by a gated mechanism to eliminate the noisy visual context. Moreover, to capture the inter-modality dynamics, ESAFN further fuses the textual and visual representations with a bilinear interaction layer. To evaluate the effectiveness of ESAFN, we manually annotate the sentiment orientation over each given entity based on two recently released multimodal NER datasets, and show that ESAFN can significantly outperform several highly competitive unimodal and multimodal methods.
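
The pipeline described in the summary (entity-sensitive attention over text and image regions, a gate on the visual vector, then a bilinear interaction) can be illustrated with a small NumPy sketch. This is not the authors' implementation: all function names, dimensions, and the randomly initialized weights are assumptions made for illustration, and the textual side is simplified to a single attention pass rather than the paper's separate representations and textual fusion layer.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entity_attention(features, entity, W):
    """Weight each feature row by its (bilinear) relevance to the entity vector."""
    scores = features @ W @ entity       # shape (n,)
    weights = softmax(scores)            # shape (n,), sums to 1
    return weights @ features, weights   # pooled vector (d,), weights (n,)

def gate(visual, text, Wv, Wt):
    """Sigmoid gate that scales down visual dimensions judged to be noisy."""
    g = 1.0 / (1.0 + np.exp(-(Wv @ visual + Wt @ text)))  # each entry in (0, 1)
    return g * visual

def bilinear_fusion(text, visual, T):
    """Bilinear interaction: out[k] = text^T T[k] visual."""
    return np.einsum("i,kij,j->k", text, T, visual)

# Hypothetical sizes and random stand-ins for learned parameters and inputs.
rng = np.random.default_rng(0)
d, n_words, n_regions, n_classes = 8, 5, 4, 3
words = rng.normal(size=(n_words, d))      # word representations of the post
regions = rng.normal(size=(n_regions, d))  # image region features
entity = rng.normal(size=d)                # target entity representation

text_vec, text_w = entity_attention(words, entity, rng.normal(size=(d, d)))
vis_vec, vis_w = entity_attention(regions, entity, rng.normal(size=(d, d)))
vis_vec = gate(vis_vec, text_vec, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
logits = bilinear_fusion(text_vec, vis_vec, rng.normal(size=(n_classes, d, d)))
probs = softmax(logits)                    # distribution over sentiment classes
```

With trained weights, `probs` would be read as the predicted sentiment distribution for the given entity; here the weights are random, so only the shapes and the data flow are meaningful.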