Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition

Ruoyu Zhao, Xiantao Jiang, Richard Yu, Victor C.M. Leung, Tao Wang, Shaohu Zhang
Research output: Contribution to journal › Article › peer-review

Abstract

Speech Emotion Recognition (SER) is important for improving human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in the linguistic and acoustic features of different languages. In this study, we propose a novel approach, HuMP-CAT, which combines HuBERT (Hidden Unit BERT), MFCC (Mel-Frequency Cepstral Coefficients), and prosodic features. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to transfer knowledge from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (i.e., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German) and 79.48% on EMOVO (Italian). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
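The cross-attention fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the core scaled dot-product cross-attention operation in which one feature stream (e.g., HuBERT frame embeddings) attends to another (e.g., MFCC and prosodic features projected to the same dimensionality); all shapes and the single-head form are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head scaled dot-product cross-attention: `query` attends to `context`.

    query:   (T_q, d) array, e.g. HuBERT frame embeddings (hypothetical shapes)
    context: (T_c, d) array, e.g. MFCC + prosodic features projected to d dims
    Returns fused features of shape (T_q, d).
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (T_q, T_c) attention logits
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ context                  # context aggregated per query frame

rng = np.random.default_rng(0)
hubert_feats = rng.standard_normal((50, 64))    # 50 frames, 64-dim (illustrative sizes)
mfcc_prosody = rng.standard_normal((50, 64))    # second stream, same embedding dim
fused = cross_attention(hubert_feats, mfcc_prosody)   # shape (50, 64)
```

In a full transformer block the query, key, and value would each pass through learned linear projections and multiple heads; the sketch omits those to keep the fusion mechanism itself visible.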
Original language: English
Journal: IEEE Internet of Things Journal
State: Accepted/In press - Jan 1 2025

Keywords

  • Cross-attention transformer
  • Cross-linguistic speech emotion recognition
  • Multi-feature fusion

