wav2vec2-large-robust-12-ft-emotion-msp-dim is a dimensional speech emotion recognition model based on wav2vec 2.0. It takes a raw audio signal as input and predicts three emotion dimensions (arousal, dominance, and valence), each in a range of 0 to 1. The model was created by fine-tuning Wav2Vec2-Large-Robust on the MSP-Podcast dataset and is pruned to 12 transformer layers. In addition to the three predictions, it provides the pooled states of the last transformer layer. It is intended for emotion recognition in speech applications.
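
To make the two outputs concrete, here is a minimal pure-Python sketch of how per-frame hidden states might be pooled into one vector and mapped to three named scores. It does not load the released model: the frame values and weights are toy numbers, the helper names (`mean_pool`, `linear_head`, `label_scores`) are hypothetical, and both the use of mean pooling over time and the output order (arousal, dominance, valence) are assumptions based on the description above.

```python
from typing import Dict, List, Sequence

# Assumed output order, following the order used in the description.
DIMENSIONS = ("arousal", "dominance", "valence")


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame hidden states over the time axis (assumed pooling)."""
    if not frames:
        raise ValueError("need at least one frame")
    width = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(width)]


def linear_head(
    pooled: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Apply one linear layer: one weight row and one bias per output."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]


def label_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Attach dimension names to a three-element score vector."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError("expected exactly three scores")
    return dict(zip(DIMENSIONS, scores))


# Toy stand-in for last-layer hidden states: 2 time frames, hidden size 2.
frames = [[0.25, 0.5], [0.75, 1.0]]
# Toy regression weights (3 outputs x hidden size 2), not the trained ones.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

pooled = mean_pool(frames)                      # utterance-level vector
scores = label_scores(linear_head(pooled, weights, bias))
print(pooled)   # [0.5, 0.75]
print(scores)   # {'arousal': 0.5, 'dominance': 0.75, 'valence': 0.625}
```

With the real model, `frames` would be the last transformer layer's output for an utterance, and `pooled` would correspond to the pooled states mentioned above, which can also be reused as a fixed-size feature vector for downstream tasks.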

