Abstract: To address the limited ability of traditional hand-crafted speech features to capture dynamic information, the Wav2vec 2.0 model is introduced to extract long-range dependencies from speech signals, yielding sufficient emotional feature representations through feature fusion. The most representative MFCC features are extracted from the speech signal, and Wav2vec 2.0 is adopted to compensate for MFCC's weakness in capturing dynamic information, producing richer and more representative speech emotion features. A cross-attention mechanism then integrates the acoustic features with contextual information to obtain a more comprehensive and accurate feature representation, and a Transformer network performs the final prediction of emotional states. Experiments on the MELD and EEIDB datasets show that the proposed method achieves weighted F1-scores of 44.32% and 65.50%, respectively, verifying its effectiveness and superior performance.
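The following is a minimal, illustrative sketch of the fusion pipeline summarized above: an MFCC branch, a Wav2vec 2.0 branch, cross-attention from MFCC frames onto the Wav2vec 2.0 context, and a Transformer encoder for emotion classification. All module sizes, the pretrained checkpoint name, and the pooling/classification head are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of the MFCC + Wav2vec 2.0 fusion classifier (hyperparameters are assumed).
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2Model

class FusionEmotionClassifier(nn.Module):
    def __init__(self, n_mfcc=40, d_model=256, n_heads=4, n_classes=7, sample_rate=16000):
        super().__init__()
        # Hand-crafted branch: MFCCs capture the short-term spectral envelope.
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_mfcc)
        self.mfcc_proj = nn.Linear(n_mfcc, d_model)
        # Self-supervised branch: Wav2vec 2.0 supplies long-range, dynamic context.
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.w2v_proj = nn.Linear(self.wav2vec.config.hidden_size, d_model)
        # Cross-attention: MFCC frames (queries) attend to the Wav2vec 2.0 sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer encoder over the fused sequence, then emotion prediction.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, waveform):                       # waveform: (batch, samples)
        mfcc = self.mfcc(waveform).transpose(1, 2)     # (batch, frames, n_mfcc)
        q = self.mfcc_proj(mfcc)
        ctx = self.w2v_proj(self.wav2vec(waveform).last_hidden_state)
        fused, _ = self.cross_attn(q, ctx, ctx)        # fuse acoustic and contextual features
        h = self.encoder(fused)
        return self.classifier(h.mean(dim=1))          # utterance-level emotion logits

# Example: two one-second dummy clips at 16 kHz.
logits = FusionEmotionClassifier()(torch.randn(2, 16000))
```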