Ross Maddox, Jiebo Luo
Data scarcity is a long-standing challenge for speech emotion recognition (SER). The problem is more severe for dimensional emotion estimation (e.g., arousal and valence) because dimensional values are harder to annotate. This study proposes a semi-supervised method for obtaining arousal-valence annotations of a speech corpus when only discrete emotion category labels are available. The method computes a weighted sum of the intermediate outputs of the large-scale pre-trained speech model WavLM as an utterance-level speech embedding and combines it with a linear MLP to extract speech emotion features. These high-dimensional features are then mapped to the arousal-valence space using a modified version of the dimensionality reduction algorithm UMATO, guided by each utterance’s coarse emotion category label. Results on the IEMOCAP dataset show performance comparable to supervised regression models, and further experiments on other datasets suggest that the method applies more broadly. The proposed method can reduce the labor-intensive task of dimensional emotion labeling and can be useful in scenarios where dimensional values are required.
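The layer-weighting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax normalization of the layer weights, the mean pooling over time, the shapes (13 layers, 50 frames, 768 dimensions), and the function name `utterance_embedding` are all assumptions made for illustration, and random arrays stand in for actual WavLM hidden states.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def utterance_embedding(layer_outputs, layer_logits):
    """Fuse per-layer frame features into one utterance-level vector.

    layer_outputs: array of shape (num_layers, num_frames, hidden_dim)
    layer_logits:  array of shape (num_layers,), learnable in practice
    """
    # Assumption: layer weights are normalized with a softmax.
    weights = softmax(layer_logits)
    # Weighted sum over the layer axis -> (num_frames, hidden_dim).
    fused = np.tensordot(weights, layer_outputs, axes=(0, 0))
    # Assumption: mean pooling over time -> (hidden_dim,).
    return fused.mean(axis=0)

# Stand-in for hidden states of a pre-trained speech model (hypothetical shapes).
rng = np.random.default_rng(0)
layer_outputs = rng.standard_normal((13, 50, 768))
layer_logits = np.zeros(13)  # uniform weights before any training

embedding = utterance_embedding(layer_outputs, layer_logits)
```

In a trained system the layer logits would be learned jointly with the downstream MLP, so that layers carrying more emotion-relevant information receive larger weights.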