Proceedings of the 2014 International Workshop on Audio/Visual Emotion Challenge

Fullname: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge
Editors: Michel Valstar; Björn Schuller; Jarek Krajewski; Roddy Cowie; Maja Pantic
Location: Orlando, Florida
Standard No: ISBN 978-1-4503-3119-7; ACM DL: Table of Contents; hcibib: AVEC14
Links: Workshop Website | Conference Website
  1. Keynote Address
  2. Introduction
  3. Affect
  4. Depression
  5. Poster Session

Keynote Address

Automatic Assessment of Depression from Speech and Behavioural Signals BIBAFull-Text 1
  Julien Epps
Research into automatic recognition and prediction of depression from behavioural signals like speech and facial video represents an exciting mix of opportunity and challenge. The opportunity comes from the huge prevalence of depression worldwide and the fact that clinicians already explicitly or implicitly account for observable behaviour in their assessments. The challenge comes from the multi-factorial nature of depression, and the complexity of behavioural signals, which convey several other important types of information as well as depression. Investigations in our group to date have revealed some interesting perspectives on how to deal with confounding effects (e.g. due to speaker identity) and the role of depression-related signal variability. This presentation will focus on how depression is manifested in the speech signal, how to model depression in speech, methods for mitigating unwanted variability in speech, how depression assessment is different from more mainstream affective computing, what is needed from depression databases, and different possible system designs and applications. A range of fertile areas for future research will be suggested.


Introduction

AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge BIBAFull-Text 3-10
  Michel Valstar; Björn Schuller; Kirsty Smith; Timur Almaev; Florian Eyben; Jarek Krajewski; Roddy Cowie; Maja Pantic
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists draw on observations of facial expressions and vocal cues when evaluating a patient's level of depression. It is in this context that we present the fourth Audio-Visual Emotion recognition Challenge (AVEC 2014). This edition of the challenge uses a subset of the tasks from a previous challenge, allowing for more focussed studies. In addition, labels for a third dimension (Dominance) have been added, and the number of annotators per clip has been increased to a minimum of three, with most clips annotated by five. The challenge has two goals, logically organised as sub-challenges: the first is to predict the continuous values of the affective dimensions valence, arousal and dominance at each moment in time; the second is to predict the value of a single self-reported severity-of-depression indicator for each recording in the dataset. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
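The two sub-challenges are scored with different metrics: Pearson's correlation coefficient for the continuous affect predictions and root mean square error for the per-recording depression scores. As a minimal illustration (the function names are our own, not part of the challenge toolkit), both metrics can be computed as:

```python
import math

def pearson_corr(pred, truth):
    """Pearson's correlation coefficient, the affect sub-challenge score."""
    n = len(pred)
    mp = sum(pred) / n
    mt = sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

def rmse(pred, truth):
    """Root mean square error, the depression sub-challenge score."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))
```

In the challenge protocol the correlation is computed per clip and dimension and then averaged, while the RMSE is taken over all recordings in the test partition.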


Affect

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video BIBAFull-Text 11-18
  Linlin Chao; Jianhua Tao; Minghao Yang; Ya Li; Zhengqi Wen
Understanding nonverbal behaviour in human-machine interaction is a complex and challenging task. One of its key aspects is recognizing human emotional states accurately. This paper presents our contribution to the Audio/Visual Emotion Challenge (AVEC'14), whose goal is to predict the continuous values of the emotion dimensions arousal, valence and dominance at each moment in time. The proposed method uses deep belief network based models to recognize emotional states from the audio and visual modalities. Firstly, we employ temporal pooling functions in the deep neural network to encode dynamic information in the features, which achieves temporal modeling at the first time scale. Secondly, we combine the predicted results from the different modalities with emotion temporal context information simultaneously. The proposed multimodal-temporal fusion achieves temporal modeling of the emotion states at the second time scale. Experimental results show the effectiveness of each key component of the proposed method, and competitive results are obtained.

Ensemble CCA for Continuous Emotion Prediction BIBAFull-Text 19-26
  Heysem Kaya; Fazilet Çilli; Albert Ali Salah
This paper presents our work on the ACM MM Audio/Visual Emotion Challenge 2014 (AVEC 2014) corpus using the baseline features, in accordance with the challenge protocol. For prediction, we use Canonical Correlation Analysis (CCA) in the affect sub-challenge (ASC) and the Moore-Penrose generalized inverse (MPGI) in the depression sub-challenge (DSC). The video baseline provides histograms of Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) features. Based on our preliminary experiments on the AVEC 2013 challenge data, we focus on the inner facial regions that correspond to the eyes and mouth. We obtain an ensemble of regional linear regressors via CCA and MPGI. We also enrich the 2014 baseline set with Local Phase Quantization (LPQ) features extracted from faces detected and tracked with the IntraFace toolkit. Combining both representations in a CCA ensemble approach, we reach an average Pearson's Correlation Coefficient (PCC) of 0.3932 on the challenge test set, outperforming the ASC test set baseline PCC of 0.1966. On the DSC, combining modality-specific MPGI-based ensemble systems, we reach a Root Mean Square Error (RMSE) of 9.61.

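The MPGI regressor mentioned above amounts to linear least squares solved with the Moore-Penrose pseudoinverse. A minimal sketch (shapes and function names are our own; the paper builds an ensemble of such regional regressors rather than a single one):

```python
import numpy as np

def mpgi_fit(X, y):
    """Linear regressor via the Moore-Penrose pseudoinverse.

    X: (n_samples, n_features) feature matrix, y: (n_samples,) targets.
    Returns minimum-norm least-squares weights, bias term last.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    return np.linalg.pinv(Xb) @ y

def mpgi_predict(w, X):
    """Apply the learned weights to new feature rows."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w
```

Unlike an explicit normal-equations solve, `pinv` remains well defined when the feature matrix is rank-deficient, which is convenient for high-dimensional video descriptors.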
Building a Database of Political Speech: Does Culture Matter in Charisma Annotations? BIBAFull-Text 27-31
  Ailbhe Cullen; Andrew Hines; Naomi Harte
For both individual politicians and political parties, the internet has become a vital tool for self-promotion and the distribution of ideas. The rise of streaming has enabled political debates and speeches to reach global audiences. In this paper, we explore the nature of charisma in political speech, with a view to automatic detection. To this end, we have collected a new database of political speech from YouTube and other online resources. Annotation is performed both by native listeners and by Amazon Mechanical Turk (AMT) workers. Detailed analysis shows that both label sets are equally reliable. The results support the use of crowd-sourced labels for speaker traits such as charisma in political speech, even where cultural subtleties are present. The impact of these different annotations on charisma prediction from political speech is also investigated.

Depression

Multimodal Prediction of Affective Dimensions and Depression in Human-Computer Interactions BIBAFull-Text 33-40
  Rahul Gupta; Nikolaos Malandrakis; Bo Xiao; Tanaya Guha; Maarten Van Segbroeck; Matthew Black; Alexandros Potamianos; Shrikanth Narayanan
Depression is one of the most common mood disorders. Technology has the potential to assist in screening and treating people with depression by robustly modeling and tracking the complex behavioral cues associated with the disorder (e.g., speech, language, facial expressions, head movement, body language). Similarly, robust affect recognition is another challenge which stands to benefit from modeling such cues. The Audio/Visual Emotion Challenge (AVEC) aims toward understanding the two phenomena and modeling their correlation with observable cues across several modalities. In this paper, we use multimodal signal processing methodologies to address the two problems using data from human-computer interactions. We develop separate systems for predicting depression levels and affective dimensions, experimenting with several methods for combining the multimodal information. The proposed depression prediction system uses a feature selection approach based on audio, visual, and linguistic cues to predict depression scores for each session. Similarly, we use multiple systems trained on audio and visual cues to predict the affective dimensions in continuous-time. Our affect recognition system accounts for context during the frame-wise inference and performs a linear fusion of outcomes from the audio-visual systems. For both problems, our proposed systems outperform the video-feature based baseline systems. As part of this work, we analyze the role played by each modality in predicting the target variable and provide analytical insights.
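Linear fusion of per-modality outputs, as described for the affect system above, can be as simple as a weighted average of the frame-wise predictions. A sketch under our own naming and interface assumptions (in practice the weights would be tuned on a development set, not fixed by hand):

```python
def linear_fusion(audio_pred, video_pred, w_audio=0.5, w_video=0.5):
    """Frame-wise linear fusion of two modality-specific prediction streams.

    Each argument is a list of continuous affect predictions, one per frame;
    the weights are normalized so the fused output stays on the same scale.
    """
    assert len(audio_pred) == len(video_pred)
    total = w_audio + w_video
    return [(w_audio * a + w_video * v) / total
            for a, v in zip(audio_pred, video_pred)]
```

More elaborate schemes learn the fusion weights by regressing the ground truth on the stacked per-modality outputs, but the weighted average already captures the idea of combining complementary audio and video evidence.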
Inferring Depression and Affect from Application Dependent Meta Knowledge BIBAFull-Text 41-48
  Markus Kächele; Martin Schels; Friedhelm Schwenker
This paper outlines our contribution to the 2014 edition of the AVEC competition. It comprises classification results and considerations for both the continuous affect recognition sub-challenge and the depression recognition sub-challenge. Rather than relying on statistical features extracted from the raw audio-visual data, we propose an approach based on abstract meta information about individual subjects, together with prototypical task- and label-dependent templates, to infer the respective emotional states. The results submitted to both parts of the challenge significantly outperformed the baseline approaches. Further, we elaborate on several issues concerning the labeling of affective corpora and the choice of appropriate performance measures.


Fusing Affective Dimensions and Audio-Visual Features from Segmented Video for Depression Recognition: INAOE-BUAP's Participation at AVEC'14 Challenge BIBAFull-Text 49-55
  Humberto Pérez Espinosa; Hugo Jair Escalante; Luis Villaseñor-Pineda; Manuel Montes-y-Gómez; David Pinto-Avendaño; Verónica Reyes-Meza
Depression is a disease that affects a considerable portion of the world population. Severe cases of depression interfere with patients' daily lives; for these patients, strict monitoring is necessary to control the progress of the disease and to prevent undesired side effects. One way to keep track of patients with depression is online monitoring via human-computer interaction. The AVEC'14 challenge aims at developing technology towards the online monitoring of depression patients. This paper describes an approach to depression recognition from audiovisual information in the context of the AVEC'14 challenge. The proposed method relies on an effective voice segmentation procedure, followed by segment-level feature extraction and aggregation. Finally, a meta-model is trained to fuse mono-modal information. The main novel features of our proposal are that (1) we use affective dimensions to build depression recognition models; (2) we extract visual information from voice and silence segments separately; and (3) we consolidate features and use a meta-model for fusion. The proposed methodology is evaluated experimentally, and the results show that the method is competitive.

Model Fusion for Multimodal Depression Classification and Level Detection BIBAFull-Text 57-63
  Mohammed Senoussaoui; Milton Sarria-Paja; João F. Santos; Tiago H. Falk
Audio-visual emotion and mood disorder cues have recently been explored to develop tools that assist psychologists and psychiatrists in evaluating a patient's level of depression. In this paper, we present a number of different multimodal depression level predictors using a model fusion approach, in the context of the AVEC 2014 challenge. We show that an i-vector based representation of short-term audio features contains useful information for depression classification and prediction. We also employ a classification step prior to regression, allowing different regression models depending on the presence or absence of depression. Our experiments show that a combination of our audio-based model and two other models based on the LGBP-TOP video features leads to an improvement of 4% over the baseline model proposed by the challenge organizers.

Vocal and Facial Biomarkers of Depression based on Motor Incoordination and Timing BIBAFull-Text 65-72
  James R. Williamson; Thomas F. Quatieri; Brian S. Helfer; Gregory Ciccarelli; Daryush D. Mehta
In individuals with major depressive disorder, neurophysiological changes often alter motor control and thus affect the mechanisms controlling speech production and facial expression. These changes are typically associated with psychomotor retardation, a condition marked by slowed neuromotor output that is behaviorally manifested as altered coordination and timing across multiple motor-based properties. Changes in motor outputs can be inferred from vocal acoustics and facial movements as individuals speak. We derive novel multi-scale correlation structure and timing feature sets from audio-based vocal features and video-based facial action units from recordings provided by the 4th International Audio/Video Emotion Challenge (AVEC). The feature sets enable detection of changes in coordination, movement, and timing of vocal and facial gestures that are potentially symptomatic of depression. Combining complementary features in Gaussian mixture model and extreme learning machine classifiers, our multivariate regression scheme predicts Beck depression inventory ratings on the AVEC test set with a root-mean-square error of 8.12 and mean absolute error of 6.31. Future work calls for continued study into detection of neurological disorders based on altered coordination and timing across audio and video modalities.
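The correlation-structure features described above are derived from eigenvalue spectra of channel-delay correlation matrices. As a simplified single-channel sketch (parameter values and function names are our own; the paper works across multiple channels and delay scales), one can time-delay embed a signal and examine the eigenvalues of its correlation matrix:

```python
import numpy as np

def delay_embed(x, order, delay):
    """Stack `order` delayed copies of a 1-D signal into an (order, n) matrix."""
    n = len(x) - (order - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(order)])

def correlation_eigenspectrum(x, order=4, delay=2):
    """Eigenvalues of the channel-delay correlation matrix, sorted descending.

    Intuitively, slowed and less coordinated motor output concentrates power
    in fewer eigenvalues (a simplified reading of the paper's feature set).
    """
    embedded = delay_embed(np.asarray(x, dtype=float), order, delay)
    corr = np.corrcoef(embedded)               # (order, order) correlation matrix
    return np.sort(np.linalg.eigvalsh(corr))[::-1]
```

The eigenvalue spectrum (or functions of it, such as its entropy) then serves as a fixed-length feature vector for the downstream classifier or regressor.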

Poster Session

Automatic Depression Scale Prediction using Facial Expression Dynamics and Regression BIBAFull-Text 73-80
  Asim Jan; Hongying Meng; Yona Falinie A. Gaus; Fan Zhang; Saeed Turabzadeh
Depression is a state of low mood and aversion to activity that can affect a person's thoughts, behaviour, feelings and sense of well-being. In such a low mood, both the facial expression and the voice differ from those in a normal state. In this paper, an automatic system is proposed to predict Beck Depression Inventory scores from the naturalistic facial expressions of patients with depression. Firstly, features are extracted from the corresponding video and audio signals to represent the characteristics of facial and vocal expression under depression. Secondly, a dynamic feature generation method is proposed in the extracted video feature space, based on the idea of the Motion History Histogram (MHH) for 2-D video motion extraction. Thirdly, Partial Least Squares (PLS) and linear regression are applied to learn the relationship between the dynamic features and depression scales on the training data, and then to predict the depression scale for unseen samples. Finally, decision-level fusion is performed to combine predictions from the video and audio modalities. The proposed approach is evaluated on the AVEC2014 dataset and the experimental results demonstrate its effectiveness.

Emotion Recognition and Depression Diagnosis by Acoustic and Visual Features: A Multimodal Approach BIBAFull-Text 81-86
  Maxim Sidorov; Wolfgang Minker
A system capable of recognizing human emotions has an enormous number of potential applications, e.g., improving Spoken Dialogue Systems (SDSs) or monitoring agents in call centers. Depression is another aspect of human beings that is closely related to emotion. A system that can automatically assess a patient's depression can help physicians by supporting their decisions and avoiding critical mistakes. The Affect and Depression Recognition Sub-Challenges (ASC and DSC, respectively) of the second combined open Audio/Visual Emotion and Depression recognition Challenge (AVEC 2014) are therefore focused on estimating emotion and depression. This study presents the results of multimodal affect and depression recognition based on four different segmentation methods, using support vector regression. Furthermore, a speaker identification procedure is introduced in order to build speaker-specific emotion/depression recognition systems.

Depression Estimation Using Audiovisual Features and Fisher Vector Encoding BIBAFull-Text 87-91
  Varun Jain; James L. Crowley; Anind K. Dey; Augustin Lux
We investigate the use of two visual descriptors: Local Binary Patterns-Three Orthogonal Planes (LBP-TOP) and Dense Trajectories for depression assessment on the AVEC 2014 challenge dataset. We encode the visual information generated by the two descriptors using Fisher Vector encoding which has been shown to be one of the best performing methods to encode visual data for image classification. We also incorporate audio features in the final system to introduce multiple input modalities. The results produced using Linear Support Vector regression outperform the baseline method.
The SRI AVEC-2014 Evaluation System BIBAFull-Text 93-101
  Vikramjit Mitra; Elizabeth Shriberg; Mitchell McLaren; Andreas Kathol; Colleen Richey; Dimitra Vergyri; Martin Graciarena
Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.