Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0
=====================================================================

The model expects a raw audio signal as input and outputs predictions for arousal, dominance and valence in a range of approximately 0 to 1. It also provides the pooled states of the last transformer layer. The model was created by fine-tuning [Wav2Vec2-Large-Robust](https://huggingface.co/facebook/wav2vec2-large-robust) on [MSP-Podcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) (v1.7). It was pruned from 24 to 12 transformer layers before fine-tuning. An [ONNX](https://onnx.ai/) export of the model is available from [doi:10.5281/zenodo.6221127](https://zenodo.org/record/6221127). Further details are given in the associated [paper](https://arxiv.org/abs/2203.07378) and [tutorial](https://github.com/audeering/w2v2-how-to).

Usage
=====

    import numpy as np
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Processor
    from transformers.models.wav2vec2.modeling_wav2vec2 import (
        Wav2Vec2Model,
        Wav2Vec2PreTrainedModel,
    )
    
    
    class RegressionHead(nn.Module):
        r"""Regression head."""
    
        def __init__(self, config):
    
            super().__init__()
    
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.dropout = nn.Dropout(config.final_dropout)
            self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
    
        def forward(self, features, **kwargs):
    
            x = features
            x = self.dropout(x)
            x = self.dense(x)
            x = torch.tanh(x)
            x = self.dropout(x)
            x = self.out_proj(x)
    
            return x
    
    
    class EmotionModel(Wav2Vec2PreTrainedModel):
        r"""Speech emotion classifier."""
    
        def __init__(self, config):
    
            super().__init__(config)
    
            self.config = config
            self.wav2vec2 = Wav2Vec2Model(config)
            self.classifier = RegressionHead(config)
            self.init_weights()
    
        def forward(
                self,
                input_values,
        ):
    
            outputs = self.wav2vec2(input_values)
            hidden_states = outputs[0]
            hidden_states = torch.mean(hidden_states, dim=1)
            logits = self.classifier(hidden_states)
    
            return hidden_states, logits
    
    
    
    # load model from hub
    device = 'cpu'
    model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = EmotionModel.from_pretrained(model_name).to(device)
    
    # dummy signal
    sampling_rate = 16000
    signal = np.zeros((1, sampling_rate), dtype=np.float32)
    
    
    def process_func(
        x: np.ndarray,
        sampling_rate: int,
        embeddings: bool = False,
    ) -> np.ndarray:
        r"""Predict emotions or extract embeddings from raw audio signal."""
    
        # run through processor to normalize signal
        # always returns a batch, so we just get the first entry
        # then we put it on the device
        y = processor(x, sampling_rate=sampling_rate)
        y = y['input_values'][0]
        y = y.reshape(1, -1)
        y = torch.from_numpy(y).to(device)
    
        # run through model
        with torch.no_grad():
            y = model(y)[0 if embeddings else 1]
    
        # convert to numpy
        y = y.detach().cpu().numpy()
    
        return y
    
    
    print(process_func(signal, sampling_rate))
    #   arousal    dominance  valence
    # [[0.5460754  0.6062266  0.40431657]]
    
    print(process_func(signal, sampling_rate, embeddings=True))
    # Pooled hidden states of last transformer layer
    # [[-0.00752167  0.0065819  -0.00746342 ...  0.00663632  0.00848748
    #    0.00599211]]

## Model overview

The `wav2vec2-large-robust-12-ft-emotion-msp-dim` model is a fine-tuned version of the [Wav2Vec2-Large-Robust](https://huggingface.co/facebook/wav2vec2-large-robust) model for dimensional speech emotion recognition. This model was created by [audeering](https://aimodels.fyi/creators/huggingFace/audeering) and fine-tuned on the [MSP-Podcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) dataset. It expects a raw audio signal as input and outputs predictions for arousal, dominance and valence in a range of approximately 0 to 1. The model also provides the pooled states of the last transformer layer.

Similar models include the [wav2vec2-large-960h-lv60-self](https://aimodels.fyi/models/huggingFace/wav2vec2-large-960h-lv60-self-facebook) model from Facebook, which was fine-tuned on 960 hours of LibriSpeech audio for speech recognition, and the [wav2vec2-lg-xlsr-en-speech-emotion-recognition](https://aimodels.fyi/models/huggingFace/wav2vec2-lg-xlsr-en-speech-emotion-recognition-ehcalabres) model, which was fine-tuned for speech emotion recognition.

## Model inputs and outputs

### Inputs
- **Raw audio signal**: The model expects a raw audio waveform as input. The usage example passes a `float32` NumPy array of shape `(1, num_samples)` sampled at 16,000 Hz.
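
Audio read from a file often arrives as integer PCM or with more than one channel. The following is a minimal sketch of one way to bring such data into the shape used in the usage example. The helper name `prepare_signal` is hypothetical, the channel layout `(num_channels, num_samples)` is an assumption, and the sketch rejects other sampling rates instead of resampling them.

```python
import numpy as np

# Sampling rate used in the usage example above.
EXPECTED_RATE = 16000


def prepare_signal(samples: np.ndarray, sampling_rate: int) -> np.ndarray:
    """Return a float32 array of shape (1, num_samples).

    Hypothetical helper: `samples` is assumed to be shaped
    (num_samples,) or (num_channels, num_samples).
    """
    if sampling_rate != EXPECTED_RATE:
        raise ValueError(
            f"expected {EXPECTED_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample the signal first"
        )

    x = np.asarray(samples)
    if np.issubdtype(x.dtype, np.signedinteger):
        # Scale signed integer PCM (e.g. int16) to the range [-1, 1).
        x = x.astype(np.float32) / float(np.iinfo(x.dtype).max + 1)
    else:
        x = x.astype(np.float32)

    if x.ndim == 2:
        # Downmix to mono by averaging the channels.
        x = x.mean(axis=0)

    return x.reshape(1, -1)
```

The result can then be passed to `process_func` from the usage example in place of the dummy signal.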

### Outputs
- **Arousal, dominance and valence predictions**: Three continuous values per input, each in a range of approximately 0 to 1, returned in the order arousal, dominance, valence.
- **Pooled states**: The hidden states of the last transformer layer, averaged over time into a single vector per input.
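
Because the predictions come back as a bare array, it can help to attach the dimension names. This is a small sketch that assumes the column order shown in the output comment of the usage example (arousal, dominance, valence); `label_predictions` is a hypothetical helper, not part of the model's API.

```python
import numpy as np

# Column order taken from the output comment in the usage example.
DIMENSIONS = ("arousal", "dominance", "valence")


def label_predictions(logits) -> list:
    """Map an array of shape (batch, 3) to one dict per batch entry."""
    logits = np.asarray(logits)
    if logits.ndim != 2 or logits.shape[1] != len(DIMENSIONS):
        raise ValueError(f"expected shape (batch, 3), got {logits.shape}")
    return [
        {name: float(value) for name, value in zip(DIMENSIONS, row)}
        for row in logits
    ]
```

For example, `label_predictions(process_func(signal, sampling_rate))` would return a one-element list with the keys `arousal`, `dominance` and `valence`.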

## Capabilities

The `wav2vec2-large-robust-12-ft-emotion-msp-dim` model performs dimensional speech emotion recognition: instead of assigning a discrete emotion label, it predicts continuous arousal, dominance, and valence values for a given speech sample. This is useful for applications that need an estimate of a speaker's emotional state.

## What can I use it for?

You can use this model for various applications that require dimensional speech emotion recognition, such as:

- **Customer service**: Analyze customer calls to better understand their emotional state and provide more personalized and empathetic support.
- **Mental health monitoring**: Track the emotional state of patients or study participants over time to detect changes that may indicate mental health issues.
- **Human-robot interaction**: Enable robots to better understand and respond to the emotional state of humans they interact with, leading to more natural and engaging interactions.

## Things to try

One interesting thing to try with this model is to explore how the arousal, dominance, and valence predictions change across different speech samples or scenarios. You could, for example, analyze how the emotional state of a speaker varies during a conversation, or compare the emotional responses of different speakers to the same stimulus. Additionally, you could experiment with using the pooled states from the model as features in other machine learning tasks, such as speaker identification or acoustic event detection.
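
To follow how the predictions vary over a recording, one option is to run the model on overlapping windows. The sketch below is one way to do that under some assumptions: the function name `predict_over_time` is hypothetical, the 2-second window and 1-second hop are arbitrary defaults rather than recommended values, and `predict` is any callable with the same signature as `process_func` from the usage example.

```python
from typing import Callable, Tuple

import numpy as np


def predict_over_time(
    signal: np.ndarray,
    sampling_rate: int,
    predict: Callable[[np.ndarray, int], np.ndarray],
    window_s: float = 2.0,  # assumed default, not a tuned value
    hop_s: float = 1.0,     # assumed default, not a tuned value
) -> Tuple[np.ndarray, np.ndarray]:
    """Apply `predict` to sliding windows of a (1, num_samples) signal.

    Returns the window start times in seconds and an array with one
    row of predictions per window.
    """
    window = int(round(window_s * sampling_rate))
    hop = int(round(hop_s * sampling_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window_s and hop_s must be positive")

    num_samples = signal.shape[1]
    times, rows = [], []
    # A signal shorter than one window yields a single, shorter chunk.
    for start in range(0, max(num_samples - window, 0) + 1, hop):
        chunk = signal[:, start:start + window]
        rows.append(np.asarray(predict(chunk, sampling_rate)).reshape(-1))
        times.append(start / sampling_rate)

    return np.array(times), np.stack(rows)
```

With the usage example loaded, `times, values = predict_over_time(signal, sampling_rate, process_func)` would give one row of arousal, dominance and valence per window, which can then be plotted against `times`.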