Abstract
This chapter introduces an emotion recognition system based on audio and
video cues. For audio-based emotion recognition, we explore various aspects of
feature extraction and classification strategy and find wavelet analysis to be well
suited. We present comparative results on the discriminating capability of various
feature combinations using Fisher Discriminant Analysis (FDA). Finally, we
combine the audio and video features using a feature-level fusion approach. All
experiments are performed on the eNTERFACE and RML databases. Among the
multiple classifiers applied, the SVM shows significantly better performance for
both a single modality and fusion. The fusion results outperform those based on
audio or video alone. We conclude that fusion approaches are preferable because
they exploit complementary information from multiple modalities.
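Feature-level fusion, as described above, concatenates the per-sample feature vectors of both modalities before classification. The following is a minimal sketch of that idea with an SVM classifier; the feature dimensions, sample counts, and randomly generated data are hypothetical stand-ins, not values from the chapter's experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins for extracted features: 200 samples,
# 40 audio descriptors (e.g. wavelet-based) and 60 video descriptors each.
audio_feats = rng.normal(size=(200, 40))
video_feats = rng.normal(size=(200, 60))
labels = rng.integers(0, 6, size=200)  # six basic emotion classes (illustrative)

# Feature-level fusion: concatenate the feature vectors so the
# classifier sees both modalities in a single joint representation.
fused = np.concatenate([audio_feats, video_feats], axis=1)
print(fused.shape)  # (200, 100)

# Train and evaluate an SVM on the fused representation.
X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

With real, correlated audio and video features, the fused representation lets the classifier exploit complementary cues from both modalities, which is the advantage the abstract attributes to fusion.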