Abstract
Speech Emotion Recognition (SER) plays an important role in improving human-computer interaction. Cross-Linguistic SER (CLSER) remains a challenging research problem because linguistic and acoustic features vary significantly across languages. In this study, we propose a novel approach, HuMP-CAT, which combines HuBERT (Hidden Unit BERT), MFCC (Mel-Frequency Cepstral Coefficients), and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to transfer knowledge from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German) and 79.48% on EMOVO (Italian). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
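The cross-attention fusion mentioned in the abstract can be illustrated with a minimal single-head sketch. This is not the HuMP-CAT architecture, whose details are not given here. It only shows the general idea of one feature stream (queries) attending over another (keys and values). The feature shapes, the random projection matrices, and all names such as `hubert_like`, `mfcc_like`, and `cross_attention` are assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Queries come from one feature stream; keys and values come from another.
    Returns the fused sequence and the attention weights.
    """
    q = matmul(query_feats, w_q)    # (T_q, d)
    k = matmul(context_feats, w_k)  # (T_c, d)
    v = matmul(context_feats, w_v)  # (T_c, d)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]  # each row sums to 1 over context frames
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for real features or learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Hypothetical shapes: 6 frames of 8-dim features and 10 frames of 5-dim features.
hubert_like = random_matrix(6, 8, rng)
mfcc_like = random_matrix(10, 5, rng)
d_model = 4  # assumed shared attention dimension
w_q = random_matrix(8, d_model, rng)
w_k = random_matrix(5, d_model, rng)
w_v = random_matrix(5, d_model, rng)

fused, attn = cross_attention(hubert_like, mfcc_like, w_q, w_k, w_v)
```

Here `fused` has one row per query frame, so the two streams may have different lengths and feature sizes. A trained model would learn the projection matrices rather than sample them at random.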
| Original language | English |
|---|---|
| Journal | IEEE Internet of Things Journal |
| State | Accepted/In press - Jan 1 2025 |
Keywords
- Cross-attention transformer
- Cross-linguistic speech emotion recognition
- Multi-feature fusion
Fingerprint
Dive into the research topics of 'Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition'. Together they form a unique fingerprint.