Audio-Visual Emotion Recognition Using Multilevel Fusion


  • Muhammad Shoaib Department of Electronics, University of Peshawar, Pakistan
  • Sana Ul Haq Department of Electronics, University of Peshawar, Pakistan
  • Muhammad Saeed Shah Department of Electronics, University of Peshawar, Pakistan
  • Imtiaz Rasool University of Peshawar
  • Mohammad Omer Farooq Department of Electronics, University of Peshawar, Pakistan


Emotion Recognition, Decision-Level Fusion, Sum Rule, Product Rule, Classification


In the affective computing domain, many researchers have worked on automatic human emotion recognition in recent years. Most emotion recognition research has used unimodal techniques, i.e., audio, visual, or physiological signals. The literature indicates that no single modality is uniformly superior: some emotions are classified more accurately from one modality, while others are more easily separated in another. In the proposed research, emotion recognition is performed using both unimodal and bimodal techniques. Experiments were performed on six emotions from the interactive emotional dyadic motion capture (IEMOCAP) audio-visual database. Classification was performed using three feature selection methods and seven classification techniques. A recognition accuracy of 64.54% was obtained for the audio modality and 96.77% for the visual modality using the rotation forest classifier. For the bimodal approach, the best accuracy of 96.04% was obtained with feature-level fusion using the rotation forest classifier. Decision-level fusion performed best overall, achieving 97.60% with the product rule and 97.51% with the sum rule. The bimodal approach thus outperformed the unimodal approaches, and decision-level fusion outperformed feature-level fusion.
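The sum and product rules mentioned above combine the per-class posterior probabilities produced by the audio and visual classifiers. A minimal sketch of that combination step is shown below; the posterior values, the function names, and the six-class layout are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of decision-level fusion via the sum and product rules.
# Each classifier outputs a posterior probability per emotion class; the
# fused score is the (normalized) elementwise sum or product.

def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def sum_rule(p_audio, p_visual):
    """Fuse two posterior vectors by elementwise addition."""
    return normalize([a + v for a, v in zip(p_audio, p_visual)])

def product_rule(p_audio, p_visual):
    """Fuse two posterior vectors by elementwise multiplication."""
    return normalize([a * v for a, v in zip(p_audio, p_visual)])

# Illustrative posteriors over six emotion classes (values are made up).
p_audio  = [0.10, 0.05, 0.40, 0.20, 0.15, 0.10]
p_visual = [0.05, 0.10, 0.60, 0.10, 0.10, 0.05]

fused_sum  = sum_rule(p_audio, p_visual)
fused_prod = product_rule(p_audio, p_visual)

# The predicted class is the argmax of the fused scores.
predicted = max(range(len(fused_prod)), key=lambda i: fused_prod[i])
```

In this toy case both rules pick class index 2, but in general the product rule penalizes disagreement between modalities more sharply than the sum rule, which is consistent with the small accuracy gap reported above.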