TY - JOUR
T1 - Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
AU - Zhao, Ruoyu
AU - Jiang, Xiantao
AU - Yu, F. Richard
AU - Leung, Victor C.M.
AU - Wang, Tao
AU - Zhang, Shaohu
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Speech Emotion Recognition (SER) plays an important role in improving human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in the linguistic and acoustic features of different languages. In this study, we propose a novel approach, HuMP-CAT, which combines HuBERT (Hidden Unit BERT), MFCC (Mel-Frequency Cepstral Coefficients), and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to transfer knowledge from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (i.e., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German) and 79.48% on EMOVO (Italian). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
KW - Cross-attention transformer
KW - Cross-linguistic speech emotion recognition
KW - Multi-feature fusion
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105017250834&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=105017250834&origin=inward
U2 - 10.1109/JIOT.2025.3613687
DO - 10.1109/JIOT.2025.3613687
M3 - Article
SN - 2327-4662
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
ER -