Multimodal Emotion Recognition using Deep Neural Network - A Survey
DOI: https://doi.org/10.26438/ijcse/v6si6.9598

Keywords: DCNN, DBN, LSTM, SVR

Abstract
Emotion recognition is the process of identifying human emotional states. Most present methods use visual and audio information together, and recent advances in deep neural networks have yielded several methodologies for identifying emotional states. One method detects emotional states with a multimodal deep convolutional neural network (DCNN) that combines audio and visual cues in a single deep model. Another uses a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) over multimodal features to capture emotions. A more efficient approach extracts features from speech with a convolutional neural network (CNN) and from the visual modality with a 50-layer deep residual network (ResNet-50); a long short-term memory (LSTM) network placed above these two models captures contextual information. Deep belief networks (DBNs) address multimodal emotion recognition by first learning audio and video features separately and then concatenating the two. Since visual features hold more importance in emotion recognition, a ResNet trained with support vector regression (SVR) can predict emotional states effectively.
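To make the fusion pipeline described above concrete, the following is a minimal PyTorch sketch, not any surveyed paper's implementation: all layer sizes, kernel widths, sequence lengths, and the two-dimensional output head are illustrative assumptions. It mirrors the CNN + ResNet-50 + LSTM approach the abstract outlines: a 1-D CNN extracts features from raw speech, a 50-layer residual network (ResNet-50, via torchvision) extracts per-frame visual features, and an LSTM over the concatenated features captures context.

```python
# Hypothetical sketch of the audio CNN + visual ResNet-50 + LSTM fusion
# architecture described in the abstract. Layer sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class AudioCNN(nn.Module):
    """1-D CNN over raw speech samples; output dimension is an assumption."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over time to one vector
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):                     # x: (batch, 1, samples)
        return self.fc(self.net(x).squeeze(-1))


class MultimodalEmotionNet(nn.Module):
    """Concatenate audio and visual features, then model context with an LSTM."""
    def __init__(self, n_outputs=2):          # e.g. arousal/valence (assumed)
        super().__init__()
        self.audio = AudioCNN(out_dim=128)
        self.visual = resnet50(weights=None)  # 50-layer residual network
        self.visual.fc = nn.Linear(self.visual.fc.in_features, 128)
        self.lstm = nn.LSTM(input_size=256, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_outputs)

    def forward(self, audio_seq, frame_seq):
        # audio_seq: (batch, T, 1, samples); frame_seq: (batch, T, 3, 224, 224)
        b, t = audio_seq.shape[:2]
        a = self.audio(audio_seq.flatten(0, 1)).view(b, t, -1)
        v = self.visual(frame_seq.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(torch.cat([a, v], dim=-1))  # fuse modalities per step
        return self.head(out)                 # per-time-step emotion prediction


# Example forward pass on dummy data (2 clips, 4 time steps each)
model = MultimodalEmotionNet()
preds = model(torch.randn(2, 4, 1, 16000), torch.randn(2, 4, 3, 224, 224))
print(preds.shape)                            # torch.Size([2, 4, 2])
```

Under the same assumptions, the DBN approach corresponds to learning the two feature extractors separately before the concatenation step, and the ResNet + SVR approach corresponds to dropping the LSTM head and training a support vector regressor on the pooled visual features instead.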