MultiCauseNet temporal attention for multimodal emotion cause pair extraction
Junchi, M., Chaudhry, H. N., Kulsoom, F., Guihua, Y., Khan, S. U., Biswas, S. ORCID: 0000-0002-6770-9845, Khan, Z. U. & Khan, F. (2025).
MultiCauseNet temporal attention for multimodal emotion cause pair extraction. Scientific Reports, 15(1), article number 19372. doi: 10.1038/s41598-025-01221-w
Abstract
In the realm of emotion recognition, understanding the intricate relationships between emotions and their underlying causes remains a significant challenge. This paper presents MultiCauseNet, a novel framework designed to extract emotion-cause pairs by leveraging multimodal data, including text, audio, and video. The proposed approach integrates advanced multimodal feature extraction with attention mechanisms to enhance the understanding of emotional contexts. Text, audio, and video features are extracted with BERT, Wav2Vec, and Vision Transformers (ViTs), and are then used to construct a comprehensive multimodal graph. The graph encodes the relationships between emotions and potential causes, and Graph Attention Networks (GATs) weigh and prioritize relevant features across the modalities. To further improve performance, Transformers model intra-modal and inter-modal dependencies through self-attention and cross-attention mechanisms, enabling more robust multimodal information fusion that captures the global context of emotional interactions. This dynamic attention mechanism allows MultiCauseNet to capture complex interactions between emotional triggers and causes, improving extraction accuracy. Experiments on benchmark emotion datasets, including IEMOCAP and MELD, achieved WF1 scores of 73.02 and 53.67, respectively. Cause-pair analysis on ECF and ConvECPE yielded cause recognition F1 scores of 65.12 and 84.51 and pair extraction F1 scores of 55.12 and 51.34, respectively.
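The abstract outlines a three-stage pipeline: modality-specific encoders (BERT, Wav2Vec, ViT) produce utterance features, a graph attention layer weighs relations between utterance nodes, and Transformer self- and cross-attention fuse the modalities before emotion-cause pairs are scored. The PyTorch sketch below is a minimal illustration of that flow under stated assumptions; the layer sizes, the summed-feature graph input, the single-head GAT layer, and the pair scorer are hypothetical stand-ins, not the authors' released implementation.

```python
# Minimal sketch of a MultiCauseNet-style fusion pipeline, as described in the abstract.
# All module names, dimensions, and the toy inputs are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGATLayer(nn.Module):
    """Single-head graph attention layer over utterance nodes."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes) binary adjacency mask
        h = self.proj(x)                                        # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )                                                       # (N, N, 2*out_dim)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))     # (N, N) raw attention
        scores = scores.masked_fill(adj == 0, float("-inf"))    # restrict to graph edges
        alpha = torch.softmax(scores, dim=-1)                   # normalize over neighbours
        return F.elu(alpha @ h)                                 # attention-weighted node features


class MultimodalCausePairSketch(nn.Module):
    """GAT over fused utterance features + self/cross attention + pair scoring."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gat = SimpleGATLayer(dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pair_scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, text, audio, video, adj):
        # text/audio/video: (num_utterances, dim) pre-extracted features
        # (e.g. from BERT, Wav2Vec, and a ViT); summing them is a simplifying assumption.
        fused = self.gat(text + audio + video, adj)             # graph-weighted features
        fused = fused.unsqueeze(0)                              # add batch dimension
        ctx, _ = self.self_attn(fused, fused, fused)            # intra-modal dependencies
        av = torch.stack([audio, video]).mean(0).unsqueeze(0)   # pooled audio-visual stream
        ctx, _ = self.cross_attn(ctx, av, av)                   # inter-modal dependencies
        ctx = ctx.squeeze(0)                                    # (N, dim)
        # Score every (emotion utterance, candidate cause utterance) pair.
        n = ctx.size(0)
        pairs = torch.cat(
            [ctx.unsqueeze(1).expand(n, n, -1), ctx.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        return self.pair_scorer(pairs).squeeze(-1)              # (N, N) pair logits


if __name__ == "__main__":
    n, dim = 6, 256
    adj = torch.ones(n, n)                                      # fully connected toy dialogue
    model = MultimodalCausePairSketch(dim)
    logits = model(torch.randn(n, dim), torch.randn(n, dim), torch.randn(n, dim), adj)
    print(logits.shape)                                         # torch.Size([6, 6])
```

In practice, pre-extracted BERT/Wav2Vec/ViT utterance features would replace the random tensors in the usage stub, and the (N, N) logit matrix would be thresholded to obtain the predicted emotion-cause pairs.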
Publication Type: | Article |
---|---|
Additional Information: | This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. © Crown 2025 |
Publisher Keywords: | Emotion–cause pair extraction, Multimodal emotion recognition, Graph attention networks (GATs), Vision transformers (ViTs), Transformers and attention mechanisms, Feature fusion, Multimodal graphs, Self and cross attention, Emotion triggers |
Subjects: | B Philosophy. Psychology. Religion > BF Psychology; Q Science > QA Mathematics > QA75 Electronic computers. Computer science; T Technology > T Technology (General) |
Departments: | School of Science & Technology; School of Science & Technology > Computer Science |