City Research Online

MULTICAUSENET temporal attention for multimodal emotion cause pair extraction

Junchi, M., Chaudhry, H. N., Kulsoom, F., Guihua, Y., Khan, S. U., Biswas, S. ORCID: 0000-0002-6770-9845, Khan, Z. U. & Khan, F. (2025). MULTICAUSENET temporal attention for multimodal emotion cause pair extraction. Scientific Reports, 15(1), article number 19372. doi: 10.1038/s41598-025-01221-w

Abstract

In the realm of emotion recognition, understanding the intricate relationships between emotions and their underlying causes remains a significant challenge. This paper presents MultiCauseNet, a novel framework designed to extract emotion-cause pairs effectively by leveraging multimodal data, including text, audio, and video. The proposed approach integrates advanced multimodal feature extraction techniques with attention mechanisms to enhance the understanding of emotional contexts. Text, audio, and video features are extracted using BERT, Wav2Vec, and Vision Transformers (ViTs), respectively, and are then used to construct a comprehensive multimodal graph. The graph encodes the relationships between emotions and potential causes, and Graph Attention Networks (GATs) weigh and prioritize relevant features across the modalities. To further improve performance, Transformers model intra-modal and inter-modal dependencies through self-attention and cross-attention mechanisms, enabling more robust multimodal information fusion that captures the global context of emotional interactions. This dynamic attention mechanism allows MultiCauseNet to capture complex interactions between emotional triggers and causes, improving extraction accuracy. Experiments on the emotion benchmark datasets IEMOCAP and MELD achieved weighted F1 (WF1) scores of 73.02 and 53.67, respectively. Emotion-cause pair analysis is evaluated on ECF and ConvECPE, yielding cause recognition F1 scores of 65.12 and 84.51 and pair extraction F1 scores of 55.12 and 51.34, respectively.
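The pipeline described in the abstract (modality-specific encoders, graph attention over utterance nodes, and self-/cross-attention fusion) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical outline, not the authors' implementation: the feature dimensions (768 for BERT and ViT outputs, 512 for Wav2Vec), the single-head graph attention layer, and the pair-scoring head are illustrative assumptions chosen to show how GAT-style scoring and cross-attention fusion could be combined for emotion-cause pair scoring.

```python
# Hypothetical sketch of the fusion idea in the abstract. Per-utterance features
# from three modalities are projected to a shared space, fused with self-/cross-
# attention, passed through a GAT-style layer over an utterance graph, and every
# ordered (emotion, cause) utterance pair is scored. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over utterance nodes (GAT-style scoring)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) 0/1 adjacency mask
        h = self.proj(x)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))    # (N, N) edge scores
        scores = scores.masked_fill(adj == 0, float("-inf"))   # keep graph edges only
        alpha = torch.softmax(scores, dim=-1)                  # attention weights
        return torch.relu(alpha @ h)                           # aggregated node states


class MultimodalFusion(nn.Module):
    """Cross-attention fusion of text / audio / video utterance features."""

    def __init__(self, d_text=768, d_audio=512, d_video=768, dim=256, heads=4):
        super().__init__()
        self.t = nn.Linear(d_text, dim)     # projects BERT-like features
        self.a = nn.Linear(d_audio, dim)    # projects Wav2Vec-like features
        self.v = nn.Linear(d_video, dim)    # projects ViT-like features
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gat = GraphAttentionLayer(dim)
        self.pair_head = nn.Linear(2 * dim, 1)  # hypothetical (emotion, cause) scorer

    def forward(self, text, audio, video, adj):
        # text/audio/video: (num_utterances, d_modality); adj: utterance graph mask
        t, a, v = self.t(text), self.a(audio), self.v(video)
        # Intra-modal context: self-attention over text utterances.
        t_ctx, _ = self.self_attn(t.unsqueeze(0), t.unsqueeze(0), t.unsqueeze(0))
        # Inter-modal context: text queries attend to stacked audio + video keys.
        av = torch.cat([a, v], dim=0).unsqueeze(0)
        fused, _ = self.cross_attn(t_ctx, av, av)
        # Graph attention relates candidate emotion and cause utterances.
        nodes = self.gat(fused.squeeze(0), adj)
        # Score every ordered (emotion utterance, cause utterance) pair.
        n = nodes.size(0)
        grid = torch.cat(
            [nodes.unsqueeze(1).expand(n, n, -1), nodes.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )
        return torch.sigmoid(self.pair_head(grid)).squeeze(-1)  # (N, N) pair scores


# Toy usage with random features standing in for BERT / Wav2Vec / ViT outputs.
n_utt = 6
model = MultimodalFusion()
scores = model(
    torch.randn(n_utt, 768), torch.randn(n_utt, 512), torch.randn(n_utt, 768),
    torch.ones(n_utt, n_utt),
)
print(scores.shape)  # torch.Size([6, 6])
```

In this sketch the text stream supplies the cross-attention queries while audio and video features are concatenated as keys and values; the actual query/key arrangement, graph construction, and pairing head in MultiCauseNet may differ from this illustration.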

Publication Type: Article
Additional Information: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. © Crown 2025
Publisher Keywords: Emotion–cause pair extraction, Multimodal emotion recognition, Graph attention networks (GATs), Vision transformers (ViTs), Transformers and attention mechanisms, Feature fusion, Multimodal graphs, Self and cross attention, Emotion triggers
Subjects: B Philosophy. Psychology. Religion > BF Psychology
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
T Technology > T Technology (General)
Departments: School of Science & Technology
School of Science & Technology > Computer Science