Learning Speech Emotion Representations in the Quaternion Domain

Guizzo, E.; Weyde, T.; Scardapane, S.; Comminiello, D.

Guizzo, E., Weyde, T. ORCID: 0000-0001-8028-9905, Scardapane, S. & Comminiello, D. (2023). Learning Speech Emotion Representations in the Quaternion Domain. IEEE/ACM Transactions on Audio Speech and Language Processing, 31, pp. 1200-1212. doi: 10.1109/taslp.2023.3250840

Abstract

The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we perform additional experiments and ablation studies that confirm the effectiveness of our approach.
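
As an illustrative aside, not taken from the paper's code, the following minimal Python sketch shows two standard properties of quaternion-valued layers that are consistent with the abstract's claims: the Hamilton product mixes all four components of a quaternion, which is why the four latent axes need to be correlated, and a quaternion dense layer needs a quarter of the weights of a real-valued layer of the same size. The function names and the layer sizes are assumptions chosen for illustration, and the order in which the four emotion-related axes map to the components (r, i, j, k) is not specified here.

```python
from typing import Tuple

# A quaternion as (r, i, j, k). Which emotion-related axis maps to which
# component is NOT specified by the abstract; this is only a container.
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


def real_dense_params(n_in: int, n_out: int) -> int:
    """Weight count of a real-valued dense layer (bias ignored)."""
    return n_in * n_out


def quaternion_dense_params(n_in: int, n_out: int) -> int:
    """Weight count of a quaternion dense layer with the same number of
    real-valued inputs and outputs (bias ignored). Each of the
    (n_in / 4) * (n_out / 4) quaternion weights holds 4 real numbers."""
    if n_in % 4 or n_out % 4:
        raise ValueError("sizes must be multiples of 4")
    return 4 * (n_in // 4) * (n_out // 4)


if __name__ == "__main__":
    i = (0, 1, 0, 0)
    j = (0, 0, 1, 0)
    print(hamilton_product(i, j))  # i * j = k
    print(hamilton_product(j, i))  # j * i = -k
    # Hypothetical layer size, chosen only to show the 4x reduction.
    print(real_dense_params(1024, 1024), quaternion_dense_params(1024, 1024))
```

Because every output component of the product depends on all four input components, a quaternion layer treats its four-channel input as a single entity rather than as four independent channels. The parameter counts cover a single dense layer only and say nothing about the savings reported for the full models in the paper.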

Publication Type:	Article
Additional Information:	This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Publisher Keywords:	Quaternions, Task analysis, Feature extraction, Speech recognition, Emotion recognition, Speech processing, Data models
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Departments:	School of Science & Technology > Department of Computer Science
SWORD Depositor:	Symplectic Administrator

Text - Published Version
Available under License Creative Commons Attribution.

Official URL: https://doi.org/10.1109/taslp.2023.3250840

Creators:	Guizzo, E.; Weyde, T. (ORCID: 0000-0001-8028-9905); Scardapane, S.; Comminiello, D.
Status:	Published
Refereed:	Yes
Journal or Publication Title:	IEEE/ACM Transactions on Audio Speech and Language Processing
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
ISSN:	2329-9290
e-ISSN:	2329-9304
URI:	https://openaccess.city.ac.uk/id/eprint/30187
Date available in CRO:	04 Apr 2023 09:22
Date deposited:	3 April 2023
Dates:	1 March 2023 (Published); 1 March 2023 (Published Online); 9 February 2023 (Accepted)