Proceedings of the 2014 International Conference on Multimodal Interaction

Fullname:Proceedings of the 16th ACM International Conference on Multimodal Interaction
Editors:Albert Ali Salah; Jeffrey Cohn; Björn Schuller; Oya Aran; Louis-Philippe Morency; Philip R. Cohen
Location:Istanbul, Turkey
Dates:2014-Nov-12 to 2014-Nov-16
Standard No:ISBN: 978-1-4503-2885-2; hcibib: ICMI14
  1. Keynote Address
  2. Oral Session 1: Dialogue and Social Interaction
  3. Oral Session 2: Multimodal Fusion
  4. Demo Session 1
  5. Poster Session 1
  6. Keynote Address 2
  7. Oral Session 3: Affect and Cognitive Modeling
  8. Oral Session 4: Nonverbal Behaviors
  9. Doctoral Spotlight Session
  10. Keynote Address 3
  11. Oral Session 5: Mobile and Urban Interaction
  12. Oral Session 6: Healthcare and Assistive Technologies
  13. Keynote Address 4
  14. The Second Emotion Recognition In The Wild Challenge
  15. Workshop Overviews

Keynote Address

Bursting our Digital Bubbles: Life Beyond the App 1
  Yvonne Rogers
30 years ago, a common caricature of computing was of a frustrated user sat staring at a PC, hands hovering over a keyboard and mouse. Nowadays, the picture is very different. The PC has been largely overtaken by the laptop, the smartphone and the tablet; more and more people are using them extensively and everywhere as they go about their working and everyday lives. Instead, the caricature has become one of people increasingly living in their own digital bubbles -- heads-down, glued to a mobile device, pecking and swiping at digital content with one finger. How can designers and researchers break out of this app mindset to exploit the new generation of affordable multimodal technologies, in the form of physical computing, internet of things, and sensor toolkits, to begin creating more diverse heads-up, hands-on, arms-out user experiences? In my talk I will argue for a radical rethink of our relationship with future technologies: one that inspires us, through shared devices, tools and data, to be more creative, playful and thoughtful of each other and our surrounding environments.

Oral Session 1: Dialogue and Social Interaction

Managing Human-Robot Engagement with Forecasts and... um... Hesitations 2-9
  Dan Bohus; Eric Horvitz
We explore methods for managing conversational engagement in open-world, physically situated dialog systems. We investigate a self-supervised methodology for constructing forecasting models that aim to anticipate when participants are about to terminate their interactions with a situated system. We study how these models can be leveraged to guide a disengagement policy that uses linguistic hesitation actions, such as filled and non-filled pauses, when uncertainty about the continuation of engagement arises. The hesitations allow for additional time for sensing and inference, and convey the system's uncertainty. We report results from a study of the proposed approach with a directions-giving robot deployed in the wild.
Written Activity, Representations and Fluency as Predictors of Domain Expertise in Mathematics 10-17
  Sharon Oviatt; Adrienne Cohen
The emerging field of multimodal learning analytics evaluates natural communication modalities (digital pen, speech, images) to identify domain expertise, learning, and learning-oriented precursors. Using the Math Data Corpus, this research investigated students' digital pen input as small groups collaborated on solving math problems. Findings indicated that, compared with non-experts, domain experts show an opposite pattern of accelerating total written activity as problem difficulty increases, have a lower written and spoken disfluency rate, and express different content -- including a higher ratio of nonlinguistic symbolic representations and structured diagrams to elemental marks. Implications are discussed for developing reliable multimodal learning analytics systems that incorporate digital pen input to automatically track the consolidation of domain expertise. This includes prediction based on a combination of activity patterns, fluency, and content analysis. New MMLA systems are expected to have special utility on cell phones, which already have multimodal interfaces and are the dominant educational platform worldwide.
Analysis of Respiration for Prediction of "Who Will Be Next Speaker and When?" in Multi-Party Meetings 18-25
  Ryo Ishii; Kazuhiro Otsuka; Shiro Kumano; Junji Yamato
To build a model for predicting the next speaker and the start time of the next utterance in multi-party meetings, we performed a fundamental study of how respiration could be effective for the prediction model. The results of the analysis reveal that a speaker inhales more rapidly and quickly right after the end of a unit of utterance in turn-keeping. The next speaker takes a bigger breath toward speaking in turn-changing than listeners who will not become the next speaker. Based on the results of the analysis, we constructed the prediction models to evaluate how effective the parameters are. The results of the evaluation suggest that the speaker's inhalation right after a unit of utterance, characterized by features such as its start time relative to the end of the unit of utterance and the slope and duration of the inhalation phase, is effective for predicting whether turn-keeping or turn-changing happens, about 350 ms before the start time of the next utterance on average. A listener's inhalation before the next utterance, characterized by features such as the maximal inspiration and amplitude of the inhalation phase, is effective for predicting the next speaker in turn-changing, about 900 ms before the start time of the next utterance on average.
A Multimodal In-Car Dialogue System That Tracks The Driver's Attention 26-33
  Spyros Kousidis; Casey Kennington; Timo Baumann; Hendrik Buschmeier; Stefan Kopp; David Schlangen
When a passenger speaks to a driver, he or she is co-located with the driver, is generally aware of the situation, and can stop speaking to allow the driver to focus on the driving task. In-car dialogue systems ignore these important aspects, making them more distracting than even cell-phone conversations. We developed and tested a "situationally-aware" dialogue system that can interrupt its speech when a situation which requires more attention from the driver is detected, and can resume when driving conditions return to normal. Furthermore, our system allows driver-controlled resumption of interrupted speech via verbal or visual cues (head nods). Over two experiments, we found that the situationally-aware spoken dialogue system improves driving performance and attention to the speech content, while driver-controlled speech resumption does not hinder performance in either of these two tasks.

Oral Session 2: Multimodal Fusion

Deep Multimodal Fusion: Combining Discrete Events and Continuous Signals 34-41
  Héctor P. Martínez; Georgios N. Yannakakis
Multimodal datasets often feature a combination of continuous signals and a series of discrete events. For instance, when studying human behaviour it is common to annotate actions performed by the participant over several other modalities such as video recordings of the face or physiological signals. These events are nominal, infrequent and not sampled at a continuous rate, while signals are numeric and often sampled at short fixed intervals. This fundamentally different nature complicates the analysis of the relation among these modalities, which is often studied after each modality has been summarised or reduced. This paper investigates a novel approach to model the relation between such modality types, bypassing the need for summarising each modality independently of each other. For that purpose, we introduce a deep learning model based on convolutional neural networks that is adapted to process multiple modalities at different time resolutions, which we name deep multimodal fusion. Furthermore, we introduce and compare three alternative methods (convolution, training and pooling fusion) to integrate sequences of events with continuous signals within this model. We evaluate deep multimodal fusion using a game user dataset where player physiological signals are recorded in parallel with game events. Results suggest that the proposed architecture can appropriately capture multimodal information as it yields higher prediction accuracies compared to single-modality models. In addition, it appears that pooling fusion, based on a novel filter-pooling method, provides the most effective fusion approach for the investigated types of data.
The Additive Value of Multimodal Features for Predicting Engagement, Frustration, and Learning during Tutoring 42-49
  Joseph F. Grafsgaard; Joseph B. Wiggins; Alexandria Katarina Vail; Kristy Elizabeth Boyer; Eric N. Wiebe; James C. Lester
Detecting learning-centered affective states is difficult, yet crucial for adapting most effectively to users. Within tutoring in particular, the combined context of student task actions and tutorial dialogue shapes the student's affective experience. As we move toward detecting affect, we may also supplement the task and dialogue streams with rich sensor data. In a study of introductory computer programming tutoring, human tutors communicated with students through a text-based interface. Automated approaches were leveraged to annotate dialogue, task actions, facial movements, postural positions, and hand-to-face gestures. These dialogue, nonverbal behavior, and task action input streams were then used to predict retrospective student self-reports of engagement and frustration, as well as pretest/posttest learning gains. The results show that the combined set of multimodal features is most predictive, indicating an additive effect. Additionally, the findings demonstrate that the role of nonverbal behavior may depend on the dialogue and task context in which it occurs. This line of research identifies contextual and behavioral cues that may be leveraged in future adaptive multimodal systems.
Computational Analysis of Persuasiveness in Social Multimedia: A Novel Dataset and Multimodal Prediction Approach 50-57
  Sunghyun Park; Han Suk Shim; Moitreya Chatterjee; Kenji Sagae; Louis-Philippe Morency
Our lives are heavily influenced by persuasive communication, and it is essential in almost any type of social interaction, from business negotiation to conversation with our friends and family. With the rapid growth of social multimedia websites, it is becoming ever more important and useful to understand persuasiveness in the context of social multimedia content online. In this paper, we introduce our newly created multimedia corpus of 1,000 movie review videos obtained from a social multimedia website called ExpoTV.com, which will be made freely available to the research community. Our research results presented here revolve around the following 3 main research hypotheses. Firstly, we show that computational descriptors derived from verbal and nonverbal behavior can be predictive of persuasiveness. We further show that combining descriptors from multiple communication modalities (audio, text and visual) improves the prediction performance compared to using those from a single modality alone. Secondly, we investigate whether having prior knowledge of a speaker expressing a positive or negative opinion helps better predict the speaker's persuasiveness. Lastly, we show that it is possible to make comparable predictions of persuasiveness by only looking at thin slices (shorter time windows) of a speaker's behavior.
Deception detection using a multimodal approach 58-65
  Mohamed Abouelenien; Veronica Pérez-Rosas; Rada Mihalcea; Mihai Burzo
In this paper we address the automatic identification of deceit by using a multimodal approach. We collect deceptive and truthful responses using a multimodal setting where we acquire data using a microphone, a thermal camera, and physiological sensors. Among all available modalities, we focus on three, namely language use, physiological response, and thermal sensing. To our knowledge, this is the first work to integrate these specific modalities to detect deceit. Several experiments are carried out in which we first select representative features for each modality, and then we analyze joint models that integrate several modalities. The experimental results show that the combination of features from different modalities significantly improves the detection of deceptive behaviors as compared to the use of one modality at a time. Moreover, the use of non-contact modalities proved to be comparable with and sometimes better than existing contact-based methods. The proposed method increases the efficiency of detecting deceit by avoiding human involvement in an attempt to move towards a completely automated non-invasive deception detection process.

Demo Session 1

Multimodal Interaction for Future Control Centers: An Interactive Demonstrator 66-67
  Ferdinand Fuhrmann; Rene Kaiser
This interactive demo exhibits a visionary multimodal interaction concept designed to support operators in future control centers. The applied multi-layered hardware and software architecture directly supports the operators in performing their lengthy monitoring and urgent alarm handling tasks. Operators are presented with visual information on three completely configurable levels of screen displays. Gesture interaction via skeleton and finger tracking acts as the main control interaction principle. We particularly developed a special sensor-equipped chair as well as an audio interface allowing for speaking and listening in isolation without any wearable device.
Emotional Charades 68-69
  Stefano Piana; Alessandra Staglianò; Francesca Odone; Antonio Camurri
This is a short description of the Emotional Charades serious game demo. Our goal is to focus on emotion expression through body gestures, making the players aware of the amount of affective information their bodies convey. The whole framework aims at helping children with autism to understand and express emotions. We also want to compare the performance of our automatic recognition system with that achieved by humans.
Glass Shooter: Exploring First-Person Shooter Game Control with Google Glass 70-71
  Chun-Yen Hsu; Ying-Chao Tung; Han-Yu Wang; Silvia Chyou; Jer-Wei Lin; Mike Y. Chen
Smart Glasses offer the opportunity to use head-mounted sensors, such as gyroscopes and accelerometers, to enable new types of game interaction. To better understand game play experience on Smart Glasses, we recruited 24 participants to play four current games on Google Glass that use different interaction methods, including gyroscope, voice, touchpad, and in-air gesture. Study results showed that participants were concerned with comfort and social acceptance. Also, their favorite input method was the gyroscope, and their favorite game type was the First-Person Shooter (FPS) game. Hence, we implemented an FPS game on Google Glass using the gyroscope for changing the viewport, and divided FPS controls into four categories: (a) Viewport Control, (b) Aim Control, (c) Fire Control, (d) Move Control. We implemented multiple control methods in each category to evaluate and explore glass game control design.
Orchestration for Group Videoconferencing: An Interactive Demonstrator 72-73
  Wolfgang Weiss; Rene Kaiser; Manolis Falelakis
In this demonstration we invite visitors to join a live videoconferencing session with remote participants across Europe. We demonstrate the behavior of an automatic decision making component in the realm of social video communication. Our approach takes into account several aspects such as the current conversational situation, conversational metrics of the past, and device capabilities, to make decisions on the visual representation of available video streams. The combination of these cues and the application of automatic decision making rules results in commands for how to mix and compose the available video streams for each conversation node's screen. The demo's features are another step towards optimally supporting users in communication within various communication contexts and adapting the user interface to the users' needs.
Integrating Remote PPG in Facial Expression Analysis Framework 74-75
  H. Emrah Tasli; Amogh Gudi; Marten Den Uyl
This demonstration paper presents the FaceReader framework, in which human face images and skin color variations are analyzed to observe facial expressions; vital signs, including but not limited to average heart rate (HR) and heart rate variability (HRV); and the stress and confidence levels of the person. Remote monitoring of the facial and vital signs could be useful for a wide range of applications. FaceReader uses active appearance modeling for facial analysis and novel signal processing techniques for heart rate and variability estimation. The performance has been objectively evaluated and psychological guidelines for stress measurements are incorporated in the framework for analysis.
Context-Aware Multimodal Robotic Health Assistant 76-77
  Vidyavisal Mangipudi; Raj Tumuluri
Reduced adherence to medical regimens has led to poorer health and more frequent hospitalization, and costs the American economy over $290 billion annually. EasyHealth Assistant (EHA) is a context-aware, interactive robot that helps patients receive their medication in the prescribed dosage at the right time. Additionally, EHA features multimodal elements such as Face Recognition, Speech Recognition + TTS, Motion Sensing and MindWave (EEG) interactions that were developed using the W3C MMI Architecture and Markup Languages. EHA improves Caregiver/Doctor -- Patient collaboration with tools like remote control and video conferencing. It also provides caregivers with real-time statistics and allows easy monitoring of medical adherence and health vitals, which should result in improved outcomes for the patient.
WebSanyog: A Portable Assistive Web Browser for People with Cerebral Palsy 78-79
  Tirthankar Dasgupta; Manjira Sinha; Gagan Kandra; Anupam Basu
The paper presents the design and development of WebSanyog, an Android-based web browser that helps people with a severe form of spastic cerebral palsy and highly restricted motor movement skills to access web content. The target user group has acted as our design advisors through constant interaction during the whole process. Features such as an auto-scanning mechanism, a predictive keyboard and an intelligent link parser make the system suitable for our target users. The browser is primarily developed for mobile and tablet-based devices, keeping portability in mind.
The hybrid Agent MARCO 80-81
  Nicolas Riesterer; Christian Becker Asano; Julien Hué; Christian Dornhege; Bernhard Nebel
We present MARCO, a hybrid, chess playing agent equipped with a custom-built robotic arm and a virtual agent's face displaying emotions. MARCO was built to investigate the hypothesis that hybrid agents capable of displaying emotions make playing chess more personal and enjoyable. In addition, we aim to explore means of achieving emotional contagion between man and machine.
Towards Supporting Non-linear Navigation in Educational Videos 82-83
  Kuldeep Yadav; Kundan Shrivastava; Om Deshmukh
MOOC participants spend most of their time in a course watching videos, and recent studies have found that there is a need for a non-linear navigation system for educational videos. We propose a system that provides efficient, non-linear navigation in a given video using multi-modal dimensions (i.e. customized word cloud, image cloud) derived from video content. The end-to-end system is implemented, and we demonstrate the capabilities of the proposed system using any given YouTube or MOOC video.

Poster Session 1

Detecting conversing groups with a single worn accelerometer 84-91
  Hayley Hung; Gwenn Englebienne; Laura Cabrera Quiros
In this paper we propose the novel task of detecting groups of conversing people using only a single body-worn accelerometer per person. Our approach estimates each individual's social actions and uses the co-ordination of these social actions between pairs to identify group membership. The aim of such an approach is to be deployed in dense crowded environments. Our work differs significantly from previous approaches, which have tended to rely on audio and/or proximity sensing, often in much less crowded scenarios, for estimating whether people are talking together or who is speaking. Ultimately, we are interested in detecting who is speaking, who is conversing with whom, and from that, to infer socially relevant information about the interaction such as whether people are enjoying themselves, or the quality of their relationship in these extremely dense crowded scenarios. Striving towards this long-term goal, this paper presents a systematic study to understand how to detect groups of people who are conversing together in this setting, where we achieve a 64% classification accuracy using a fully automated system.
Identification of the Driver's Interest Point using a Head Pose Trajectory for Situated Dialog Systems 92-95
  Young-Ho Kim; Teruhisa Misu
This paper addresses issues in situated language understanding in a moving car. In particular, we propose a method for understanding user queries regarding specific target buildings in their surroundings based on the driver's head pose and speech information. To identify a meaningful head pose motion related to the user query among spontaneous motions while driving, we construct a model describing the relationship between sequences of a driver's head pose and the relative direction to an interest point using Gaussian process regression. We also consider a time-varying interest point using kernel density estimation. We collected situated queries from subject drivers by using our research system embedded in a real car. The proposed method achieves an improvement in the target identification rate of 14% in the user-independent training condition and 27% in the user-dependent training condition over the method that uses the head motion at the start-of-speech timing.
An Explorative Study on Crossmodal Congruence Between Visual and Tactile Icons Based on Emotional Responses 96-103
  Taekbeom Yoo; Yongjae Yoo; Seungmoon Choi
Tactile icons, brief tactile stimuli conveying abstract information, have found their use in various applications, and their use with visual elements is increasing on touchscreen user interfaces. However, effective design guidelines of tactile icons for crossmodal use have not been established. This paper addresses this problem by investigating the congruence between visual and tactile icons based on the hypothesis that emotional agreement between the icons improves congruence. The validity of this hypothesis was examined in three experiments. In Exp. I, we selected common visual icons and estimated their emotional responses using the circumplex model of affect. Tactile icons to be used as a pair were designed in Exp. II by varying their amplitude, frequency, and envelope (rhythm). Their emotional responses were also evaluated. In Exp. III, the congruence of 192 crossmodal icons made by combining the visual icons (8) and the tactile icons (24) was evaluated, and these congruence scores were compared with the valence and arousal scores of the two unimodal icons obtained in Exp. I and II. Experimental results suggested that the congruence of a crossmodal icon highly depends on the agreement in the emotional responses between its visual and tactile icons. This finding supports the feasibility of developing general design guidelines and heuristics for crossmodal icons that rely on the relationship between the emotional responses from the individual modalities. Our approach is expected to advance the current practice that associates the physical parameters between the different senses with better intuitiveness and simplicity.
Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News 104-111
  Joseph G. Ellis; Brendan Jou; Shih-Fu Chang
We present a multimodal sentiment study performed on a novel collection of videos mined from broadcast and cable television news programs. To the best of our knowledge, this is the first dataset released for studying sentiment in the domain of broadcast video news. We describe our algorithm for the processing and creation of person-specific segments from news video, yielding 929 sentence-length videos that are annotated via Amazon Mechanical Turk. The spoken transcript and the video content itself are each annotated for their expression of positive, negative or neutral sentiment. Based on these gathered user annotations, we demonstrate for news video the importance of taking into account multimodal information for sentiment prediction, in particular challenging previous text-based approaches that rely solely on available transcripts. We show that as much as 21.54% of the sentiment annotations for transcripts differ from their respective sentiment annotations when the video clip itself is presented. We present audio and visual classification baselines over a three-way sentiment prediction of positive, negative and neutral, as well as the influence of person-dependent versus person-independent classification on performance. Finally, we release the News Rover Sentiment dataset to the greater research community.
Dyadic Behavior Analysis in Depression Severity Assessment Interviews 112-119
  Stefan Scherer; Zakia Hammal; Ying Yang; Louis-Philippe Morency; Jeffrey F. Cohn
Previous literature suggests that depression impacts vocal timing of both participants and clinical interviewers but is mixed with respect to acoustic features. To investigate further, 57 middle-aged adults (men and women) with Major Depression Disorder and their clinical interviewers (all women) were studied. Participants were interviewed for depression severity on up to four occasions over a 21-week period using the Hamilton Rating Scale for Depression (HRSD), which is a criterion measure for depression severity in clinical trials. Acoustic features were extracted for both participants and interviewers using the COVAREP Toolbox. Missing data occurred due to missed appointments, technical problems, or insufficient vocal samples. Data from 36 participants and their interviewers met criteria and were included for analysis to compare between high and low depression severity. Acoustic features for participants varied between men and women as expected, but failed to vary with depression severity. For interviewers, acoustic characteristics strongly varied with severity of the interviewee's depression. Accommodation -- the tendency of interactants to adapt their communicative behavior to each other -- between interviewers and interviewees was inversely related to depression severity. These findings suggest that interviewers modify their acoustic features in response to depression severity, and depression severity strongly impacts interpersonal accommodation.
Touching the Void -- Introducing CoST: Corpus of Social Touch 120-127
  Merel M. Jung; Ronald Poppe; Mannes Poel; Dirk K. J. Heylen
Touch behavior is of great importance during social interaction. To transfer the tactile modality from interpersonal interaction to other areas such as Human-Robot Interaction (HRI) and remote communication, automatic recognition of social touch is necessary. This paper introduces CoST: Corpus of Social Touch, a collection containing 7805 instances of 14 different social touch gestures. The gestures were performed in three variations: gentle, normal and rough, on a sensor grid wrapped around a mannequin arm. Recognition of the rough variations of these 14 gesture classes using Bayesian classifiers and Support Vector Machines (SVMs) resulted in an overall accuracy of 54% and 53%, respectively. Furthermore, this paper provides more insight into the challenges of automatic recognition of social touch gestures, including which gestures can be recognized more easily and which are more difficult to recognize.
Unsupervised Domain Adaptation for Personalized Facial Emotion Recognition 128-135
  Gloria Zen; Enver Sangineto; Elisa Ricci; Nicu Sebe
The way in which human beings express emotions depends on their specific personality and cultural background. As a consequence, person-independent facial expression classifiers usually fail to accurately recognize emotions which vary between different individuals. On the other hand, training a person-specific classifier for each new user is a time-consuming activity which involves collecting hundreds of labeled samples. In this paper we present a personalization approach in which only unlabeled target-specific data are required. The method is based on our previous paper [20] in which a regression framework is proposed to learn the relation between the user's specific sample distribution and the parameters of her/his classifier. Once this relation is learned, a target classifier can be constructed using only the new user's sample distribution to transfer the personalized parameters. The novelty of this paper with respect to [20] is the introduction of a new method to represent the source sample distribution based on using only the Support Vectors of the source classifiers. Moreover, we present here a simplified regression framework which achieves the same or even slightly superior experimental results with respect to [20] while being much easier to reproduce.
Predicting Influential Statements in Group Discussions using Speech and Head Motion Information 136-143
  Fumio Nihei; Yukiko I. Nakano; Yuki Hayashi; Hung-Hsuan Hung; Shogo Okada
Group discussions are used widely when generating new ideas and forming decisions as a group. Therefore, it is assumed that exerting social influence on other members by facilitating the discussion is an important part of discussion skill. This study focuses on influential statements that affect discussion flow and are highly related to facilitation, and aims to establish a model that predicts influential statements in group discussions. First, we collected a multimodal corpus using different group discussion tasks: in-basket and case-study. Based on schemes for analyzing arguments, each utterance was annotated as being influential or not. Then, we created classification models for predicting influential utterances using prosodic features as well as attention and head motion information from the speaker and other members of the group. In our model evaluation, we discovered that the assessment of each participant in terms of discussion facilitation skills by experienced observers correlated highly with the number of influential utterances by a given participant. This suggests that the proposed model can predict influential statements with considerable accuracy, and the prediction results can be a good predictor of facilitators in group discussions.
The Relation of Eye Gaze and Face Pose: Potential Impact on Speech Recognition 144-147
  Malcolm Slaney; Andreas Stolcke; Dilek Hakkani-Tür
We are interested in using context to improve speech recognition and speech understanding. Knowing what the user is attending to visually helps us predict their utterances and thus makes speech recognition easier. Eye gaze is one way to access this signal, but is often unavailable (or expensive to gather) at longer distances. In this paper we look at joint eye-gaze and facial-pose information while users perform a speech reading task. We hypothesize, and verify experimentally, that the eyes lead, and then the face follows. Face pose might not be as fast, or as accurate a signal of visual attention as eye gaze, but based on experiments correlating eye gaze with speech recognition, we conclude that face pose provides useful information to bias a recognizer toward higher accuracy.
Speech-Driven Animation Constrained by Appropriate Discourse Functions BIBAFull-Text 148-155
  Najmeh Sadoughi; Yang Liu; Carlos Busso
Conversational agents provide powerful opportunities to interact and engage with the users. The challenge is how to create naturalistic behaviors that replicate the complex gestures observed during human interactions. Previous studies have used rule-based frameworks or data-driven models to generate appropriate gestures, which are properly synchronized with the underlying discourse functions. Among these methods, speech-driven approaches are especially appealing given the rich information conveyed in speech. It captures emotional cues and prosodic patterns that are important to synthesize behaviors (i.e., modeling the variability and complexity of the timings of the behaviors). The main limitation of these models is that they fail to capture the underlying semantic and discourse functions of the message (e.g., nodding). This study proposes a speech-driven framework that explicitly models discourse functions, bridging the gap between speech-driven and rule-based models. The approach is based on a dynamic Bayesian network (DBN), where an additional node is introduced to constrain the models by specific discourse functions. We implement the approach by synthesizing head and eyebrow motion. We conduct perceptual evaluations to compare the animations generated using the constrained and unconstrained models.
Many Fingers Make Light Work: Non-Visual Capacitive Surface Exploration BIBAFull-Text 156-163
  Martin Halvey; Andy Crossan
In this paper we investigate how we can change interactions with mobile devices so we can better support subtle low effort intermittent interaction. In particular we conducted an evaluation with varying interaction techniques which looked at non-visual touch based exploration of information on a capacitive surface. The results of this evaluation indicate that there is very little difference in terms of selection accuracy between the interaction techniques that we implemented and a slight but significant time reduction when using multiple fingers to search, over one finger. Users found locating information and relating information to physical landmarks easier than relating virtual locations to each other. In addition it was found that search strategy and interaction varied between tasks and also at different points in the task.
Multimodal Interaction History and its use in Error Detection and Recovery BIBAFull-Text 164-171
  Felix Schüssel; Frank Honold; Miriam Schmidt; Nikola Bubalo; Anke Huckauf; Michael Weber
Multimodal systems still tend to ignore the individual input behavior of users, and at the same time, suffer from erroneous sensor inputs. Although many researchers have described user behavior in specific settings and tasks, little to nothing is known about the applicability of such information when it comes to increasing the robustness of a system for multimodal inputs. We conducted a gamified experimental study to investigate individual user behavior and error types found in an actually running system. It is shown that previous ways of describing input behavior by a simple classification scheme (like simultaneous and sequential) are not suited to build up an individual interaction history. Instead, we propose to use temporal distributions of different metrics derived from multimodal event timings. We identify the major errors that can occur in multimodal interactions and finally show how such an interaction history can practically be applied for error detection and recovery. Applying the proposed approach to the experimental data, the initial error rate is reduced from 4.9% to a minimum of 1.2%.
Gesture Heatmaps: Understanding Gesture Performance with Colorful Visualizations BIBAFull-Text 172-179
  Radu-Daniel Vatavu; Lisa Anthony; Jacob O. Wobbrock
We introduce gesture heatmaps, a novel gesture analysis technique that employs color maps to visualize the variation of local features along the gesture path. Beyond current gesture analysis practices that characterize gesture articulations with single-value descriptors, e.g., size, path length, or speed, gesture heatmaps are able to show with colorful visualizations how the values of any such descriptors vary along the gesture path. We evaluate gesture heatmaps on three public datasets comprising 15,840 gesture samples of 70 gesture types from 45 participants, on which we demonstrate heatmaps' capabilities to (1) explain causes for recognition errors, (2) characterize users' gesture articulation patterns under various conditions, e.g., finger versus pen gestures, and (3) help understand users' subjective perceptions of gesture commands, such as why some gestures are perceived as easier to execute than others. We also introduce chromatic confusion matrices that employ gesture heatmaps to extend the expressiveness of standard confusion matrices to better understand gesture classification performance. We believe that gesture heatmaps will prove useful to researchers and practitioners doing gesture analysis, and consequently, they will inform the design of better gesture sets and development of more accurate recognizers.
Personal Aesthetics for Soft Biometrics: A Generative Multi-resolution Approach BIBAFull-Text 180-187
  Cristina Segalin; Alessandro Perina; Marco Cristani
Are we recognizable by our image preferences? This paper answers the question affirmatively, presenting a soft biometric approach where the preferred images of an individual are used as his personal signature in identification tasks. The approach builds a multi-resolution latent space, formed by multiple Counting Grids, where similar images are mapped nearby. On this space, a set of preferred images of a user produces an ensemble of intensity maps, highlighting in an intuitive way his personal aesthetic preferences. These maps are then used for learning a battery of discriminative classifiers (one for each resolution), which characterizes the user and serves to perform identification. Results are promising: on a dataset of 200 users and 40K images, using 20 preferred images as biometric template gives a 66% probability of guessing the correct user. This makes "personal aesthetics" a very hot topic for soft biometrics, while its usage in standard biometric applications seems to be far from being effective, as we show in a simple user study.
Synchronising Physiological and Behavioural Sensors in a Driving Simulator BIBAFull-Text 188-195
  Ronnie Taib; Benjamin Itzstein; Kun Yu
Accurate and noise robust multimodal activity and mental state monitoring can be achieved by combining physiological, behavioural and environmental signals. This is especially promising in assistive driving technologies, because vehicles now ship with sensors ranging from wheel and pedal activity, to voice and eye tracking. In practice, however, multimodal user studies are confronted with challenging data collection and synchronisation issues, due to the diversity of sensing, acquisition and storage systems. Referencing current research on cognitive load measurement in a driving simulator, this paper describes the steps we take to consistently collect and synchronise signals, using the Orbit Measurement Library (OML) framework, combined with a multimodal version of a cinema clapperboard. The resulting data is automatically stored in a networked database, in a structured format, including metadata about the data and experiment. Moreover, fine-grained synchronisation between all signals is provided without additional hardware, and clock drift can be corrected post-hoc.
Data-Driven Model of Nonverbal Behavior for Socially Assistive Human-Robot Interactions BIBAFull-Text 196-199
  Henny Admoni; Brian Scassellati
Socially assistive robotics (SAR) aims to develop robots that help people through interactions that are inherently social, such as tutoring and coaching. For these interactions to be effective, socially assistive robots must be able to recognize and use nonverbal social cues like eye gaze and gesture. In this paper, we present a preliminary model for nonverbal robot behavior in a tutoring application. Using empirical data from teachers and students in human-human tutoring interactions, the model can be both predictive (recognizing the context of new nonverbal behaviors) and generative (creating new robot nonverbal behaviors based on a desired context) using the same underlying data representation.
Towards Automated Assessment of Public Speaking Skills Using Multimodal Cues BIBAFull-Text 200-203
  Lei Chen; Gary Feng; Jilliam Joe; Chee Wee Leong; Christopher Kitchen; Chong Min Lee
Traditional assessments of public speaking skills rely on human scoring. We report an initial study on the development of an automated scoring model for public speaking performances using multimodal technologies. Task design, rubric development, and human rating were conducted according to standards in educational assessment. An initial corpus of 17 speakers with 4 speaking tasks was collected using audio, video, and 3D motion capturing devices. A scoring model based on basic features in the speech content, speech delivery, and hand, body, and head movements significantly predicts human rating, suggesting the feasibility of using multimodal technologies in the assessment of public speaking skills.
Increasing Customers' Attention using Implicit and Explicit Interaction in Urban Advertisement BIBAFull-Text 204-207
  Matthias Wölfel; Luigi Bucchino
Online advertising campaigns are gaining customers' attention in comparison to advertisement campaigns in the urban space. This publication investigates how to bring users' attention back to advertisement in the urban space by using implicit and explicit interactions of the user. We have used age, gender and position estimates to automatically customize the advertisement campaign and 3D gestures to allow the customer to interact with the shown content. To evaluate the overall acceptability and particular aspects of this kind of targeted and interactive advertisement, we have developed a prototypical implementation and placed it in a crowded shopping mall. In total, 98 random visitors of the mall have experienced the system and answered a questionnaire afterwards.
System for Presenting and Creating Smell Effects to Video BIBAFull-Text 208-215
  Risa Suzuki; Shutaro Homma; Eri Matsuura; Ken-ichi Okada
Olfaction has recently been gaining attention in information and communication technology, as shown by attempts in theaters to screen videos while emitting scents. However, because there is no current infrastructure to communicate and synchronize odor information with visual information, people cannot enjoy this experience at home. Therefore, we have constructed a system of smell videos which could be applied to television (TV), allowing viewers to experience scents while watching their videos. To solve the abovementioned technical problems, we propose using the existing system for broadcasting closed captions. Our system's implementation is mindful of both video viewers and producers, allowing the system on the viewer end to disperse odorants in synchronization with videos, and allowing producers to add odor information to videos. We finally verify the system's feasibility. We expect that this study will make smell videos common, and that people will enjoy them in daily life in the near future.
CrossMotion: Fusing Device and Image Motion for User Identification, Tracking and Device Association BIBAFull-Text 216-223
  Andrew D. Wilson; Hrvoje Benko
Identifying and tracking people and mobile devices indoors has many applications, but is still a challenging problem. We introduce a cross-modal sensor fusion approach to track mobile devices and the users carrying them. The CrossMotion technique matches the acceleration of a mobile device, as measured by an onboard inertial measurement unit, to similar acceleration observed in the infrared and depth images of a Microsoft Kinect v2 camera. This matching process is conceptually simple and avoids many of the difficulties typical of more common appearance-based approaches. In particular, CrossMotion does not require a model of the appearance of either the user or the device, nor in many cases a direct line of sight to the device. We demonstrate a real time implementation that can be applied to many ubiquitous computing scenarios. In our experiments, CrossMotion found the person's body 99% of the time, on average within 7cm of a reference device position.
Statistical Analysis of Personality and Identity in Chats Using a Keylogging Platform BIBAFull-Text 224-231
  Giorgio Roffo; Cinzia Giorgetta; Roberta Ferrario; Walter Riviera; Marco Cristani
Interacting via text chats can be considered a hybrid type of communication, in which textual information delivery follows turn-taking dynamics, resembling spoken interactions. An interesting research question is whether personality can be observed in chats, as happens in face-to-face exchanges. After an encouraging preliminary work on Skype, in this study we have set up our own chat service in which key-logging functionalities have been activated, so that the timing of each key press can be measured. Using this framework, we organized semi-structured chats between 50 subjects, whose personality traits have been analyzed through psychometric tests, and a single operator, for a total of 16 hours of conversation. On this data, we have observed that some personality traits are linked with the way we chat (measured by stylometric cues), by means of statistically significant correlations and regression studies. Finally, we have assessed that some of the stylometric cues are very discriminative for the recognition of a user in an identification scenario. These facts taken together could suggest that some personality traits drive us to chat in a particular fashion, which turns out to be very recognizable.
Understanding Users' Perceived Difficulty of Multi-Touch Gesture Articulation BIBAFull-Text 232-239
  Yosra Rekik; Radu-Daniel Vatavu; Laurent Grisoni
We show that users are consistent in their assessments of the articulation difficulty of multi-touch gestures, even under the many degrees of freedom afforded by multi-touch input, such as (1) varying numbers of fingers touching the surface, (2) varying numbers of strokes that structure the gesture shape, and (3) single-handed and bimanual input. To understand more about perceived difficulty, we characterize gesture articulations captured under these conditions with geometric and kinematic descriptors computed on a dataset of 7,200 samples of 30 distinct gesture types collected from 18 participants. We correlate the values of the objective descriptors with users' subjective assessments of articulation difficulty and report path length, production time, and gesture size as the highest correlators (max Pearson's r=.95). We also report new findings about multi-touch gesture input, e.g., gestures produced with more fingers are larger in size and take more time to produce than single-touch gestures; bimanual articulations are not only faster than single-handed input, but they are also longer in path length, present more strokes, and result in gesture shapes that are deformed horizontally by 35% on average. We use our findings to outline 14 guidelines to assist multi-touch gesture set design, recognizer development, and inform gesture-to-function mappings through the prism of the user-perceived difficulty of gesture articulation.
A Multimodal Context-based Approach for Distress Assessment BIBAFull-Text 240-246
  Sayan Ghosh; Moitreya Chatterjee; Louis-Philippe Morency
The increasing prevalence of psychological distress disorders, such as depression and post-traumatic stress, necessitates a serious effort to create new tools and technologies to help with their diagnosis and treatment. In recent years, new computational approaches were proposed to objectively analyze patient non-verbal behaviors over the duration of the entire interaction between the patient and the clinician. In this paper, we go beyond non-verbal behaviors and propose a tri-modal approach which integrates verbal behaviors with acoustic and visual behaviors to analyze psychological distress during the course of the dyadic semi-structured interviews. Our approach exploits the advantages of the dyadic nature of these interactions to contextualize the participant responses based on the affective components (intimacy and polarity levels) of the questions. We validate our approach using one of the largest corpora of semi-structured interviews for distress assessment, which consists of 154 multimodal dyadic interactions. Our results show significant improvement on distress prediction performance when integrating verbal behaviors with acoustic and visual behaviors. In addition, our analysis shows that contextualizing the responses improves the prediction performance, most significantly with positive and intimate questions.
Exploring a Model of Gaze for Grounding in Multimodal HRI BIBAFull-Text 247-254
  Gregor Mehlmann; Markus Häring; Kathrin Janowski; Tobias Baur; Patrick Gebhard; Elisabeth André
Grounding is an important process that underlies all human interaction. Hence, it is crucial for building social robots that are expected to collaborate effectively with humans. Gaze behavior plays versatile roles in establishing, maintaining and repairing the common ground. Integrating all these roles in a computational dialog model is a complex task since gaze is generally combined with multiple parallel information modalities and involved in multiple processes for the generation and recognition of behavior. Going beyond related work, we present a modeling approach focusing on these multi-modal, parallel and bi-directional aspects of gaze that need to be considered for grounding and their interleaving with the dialog and task management. We illustrate and discuss the different roles of gaze as well as advantages and drawbacks of our modeling approach based on a first user study with a technically sophisticated shared workspace application with a social humanoid robot.
Predicting Learning and Engagement in Tutorial Dialogue: A Personality-Based Model BIBAFull-Text 255-262
  Alexandria Katarina Vail; Joseph F. Grafsgaard; Joseph B. Wiggins; James C. Lester; Kristy Elizabeth Boyer
A variety of studies have established that users with different personality profiles exhibit different patterns of behavior when interacting with a system. Although patterns of behavior have been successfully used to predict cognitive and affective outcomes of an interaction, little work has been done to identify the variations in these patterns based on user personality profile. In this paper, we model sequences of facial expressions, postural shifts, hand-to-face gestures, system interaction events, and textual dialogue messages of a user interacting with a human tutor in a computer-mediated tutorial session. We use these models to predict the user's learning gain, frustration, and engagement at the end of the session. In particular, we examine the behavior of users based on their Extraversion trait score of a Big Five Factor personality survey. The analysis reveals a variety of personality-specific sequences of behavior that are significantly indicative of cognitive and affective outcomes. These results could impact user experience design of future interactive systems.
Eye Gaze for Spoken Language Understanding in Multi-modal Conversational Interactions BIBAFull-Text 263-266
  Dilek Hakkani-Tür; Malcolm Slaney; Asli Celikyilmaz; Larry Heck
When humans converse with each other, they naturally amalgamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user's browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring expression ambiguity commonly observed in dialog systems with a 10% increase in F-measure.
SoundFLEX: Designing Audio to Guide Interactions with Shape-Retaining Deformable Interfaces BIBAFull-Text 267-274
  Koray Tahiroglu; Thomas Svedström; Valtteri Wikström; Simon Overstall; Johan Kildal; Teemu Ahmaniemi
Shape-retaining freely-deformable interfaces can take innumerable distinct shapes, and creating specific target configurations can be a challenge. In this paper, we investigate how audio can guide a user in this process, through the use of either musical or metaphoric sounds. In a formative user study, we found that sound encouraged action possibilities and made the affordances of the interface perceivable. We also found that adding audio as a modality along with vision and touch made a positive contribution to guiding users' interactions with the interface.
Investigating Intrusiveness of Workload Adaptation BIBAFull-Text 275-281
  Felix Putze; Tanja Schultz
In this paper, we investigate how an automatic task assistant which can detect and react to a user's workload level is able to support the user in a complex, dynamic task. In a user study, we design a dispatcher scenario with low and high workload conditions and compare the effect of four support strategies with different levels of intrusiveness using objective and subjective metrics. We see that a more intrusive strategy results in higher efficiency and effectiveness, but is also less accepted by the participants. We also show that the benefit of supportive behavior depends on the user's workload level, i.e. adaptation to its changes is necessary. We describe and evaluate a Brain Computer Interface that is able to provide the necessary user state detection.

Keynote Address 2

Smart Multimodal Interaction through Big Data BIBAFull-Text 282
  Cafer Tosun
Smart phones and mobile technologies have changed software usage dramatically. Ease of use and simplicity have made software accessible to a huge number of users. In addition, technological advancements in multimodal interaction are opening new frontiers in software. Users are interacting with software systems through multiple channels such as gestures and speech. Touch screens, cameras, sensors, and wearable devices are enablers of this interaction. The user expectation is that the interaction with business software also becomes as simple as the interaction with consumer software. In particular, through the usage of mobile devices, consumer and business software is coming closer together. Next generation software systems and applications will have to enable smart, seamless and contextual multimodal interaction capabilities. New tools, technologies and solutions will be required to increase the ease of use and to build the user experience of the future.

Oral Session 3: Affect and Cognitive Modeling

Natural Communication about Uncertainties in Situated Interaction BIBAFull-Text 283-290
  Tomislav Pejsa; Dan Bohus; Michael F. Cohen; Chit W. Saw; James Mahoney; Eric Horvitz
Physically situated, multimodal interactive systems must often grapple with uncertainties about properties of the world, people, and their intentions and actions. We present methods for estimating and communicating about different uncertainties in situated interaction, leveraging the affordances of an embodied conversational agent. The approach harnesses a representation that captures both the magnitude and the sources of uncertainty, and a set of policies that select and coordinate the production of nonverbal and verbal behaviors to communicate the system's uncertainties to conversational participants. The methods are designed to enlist participants' help in a natural manner to resolve uncertainties arising during interactions. We report on a preliminary implementation of the proposed methods in a deployed system and illustrate the functionality with a trace from a sample interaction.
The SWELL Knowledge Work Dataset for Stress and User Modeling Research BIBAFull-Text 291-298
  Saskia Koldijk; Maya Sappelli; Suzan Verberne; Mark A. Neerincx; Wessel Kraaij
This paper describes the new multimodal SWELL knowledge work (SWELL-KW) dataset for research on stress and user modeling. The dataset was collected in an experiment, in which 25 people performed typical knowledge work (writing reports, making presentations, reading e-mail, searching for information). We manipulated their working conditions with the stressors: email interruptions and time pressure. A varied set of data was recorded: computer logging, facial expression from camera recordings, body postures from a Kinect 3D sensor and heart rate (variability) and skin conductance from body sensors. The dataset made available not only contains raw data, but also preprocessed data and extracted features. The participants' subjective experience on task load, mental effort, emotion and perceived stress was assessed with validated questionnaires as a ground truth. The resulting dataset on working behavior and affect is a valuable contribution to several research fields, such as work psychology, user modeling and context aware systems.
Rhythmic Body Movements of Laughter BIBAFull-Text 299-306
  Radoslaw Niewiadomski; Maurizio Mancini; Yu Ding; Catherine Pelachaud; Gualtiero Volpe
In this paper we focus on three aspects of multimodal expressions of laughter. First, we propose a procedural method to synthesize rhythmic body movements of laughter based on spectral analysis of laughter episodes. For this purpose, we analyze laughter body motions from motion capture data and we reconstruct them with appropriate harmonics. Then we reduce the parameter space to two dimensions. These are the inputs of the actual model to generate a continuum of rhythmic body movements of laughter.
   In the paper, we also propose a method to integrate rhythmic body movements generated by our model with other synthesized expressive cues of laughter such as facial expressions and additional body movements. Finally, we present a real-time human-virtual character interaction scenario where a virtual character applies our model to respond to a human's laugh in real time.
Automatic Blinking Detection towards Stress Discovery BIBAFull-Text 307-310
  Alvaro Marcos-Ramiro; Daniel Pizarro-Perez; Marta Marron-Romera; Daniel Gatica-Perez
We present a robust method to automatically detect blinks in video sequences of conversations, aimed at discovering stress. Psychological studies have shown a relationship between blink frequency and dopamine levels, which in turn are affected by stress. Task performance correlates through an inverted U shape to both dopamine and stress levels. This shows the importance of automatic blink detection as a way of reducing human coding burden. We use an off-the-shelf face tracker in order to extract the eye region. Then, we perform per-pixel classification of the extracted eye images to later identify blinks through their dynamics. We evaluate the performance of our system with a job interview database with annotations of psychological variables, and show statistically significant correlation between perceived stress resistance and the automatically detected blink patterns.

Oral Session 4: Nonverbal Behaviors

Mid-air Authentication Gestures: An Exploration of Authentication Based on Palm and Finger Motions BIBAFull-Text 311-318
  Ilhan Aslan; Andreas Uhl; Alexander Meschtscherjakov; Manfred Tscheligi
Authentication based on touch-less mid-air gestures would benefit a multitude of ubicomp applications, which are used in clean environments (e.g., medical environments or clean rooms). In order to explore the potential of mid-air gestures for novel authentication approaches, we performed a series of studies and design experiments. First, we collected data from more than 200 users during a three-day science event organised within a shopping mall. This data was used to investigate capabilities of the Leap Motion sensor and to formulate an initial design problem. The design problem, as well as the design of mid-air gestures for authentication purposes, were iterated in subsequent design activities. In a final study with 13 participants, we evaluated two mid-air gestures for authentication purposes in different situations, including different body positions. Our results highlight a need for different mid-air gestures for differing situations and carefully chosen constraints for mid-air gestures.
Automatic Detection of Naturalistic Hand-over-Face Gesture Descriptors BIBAFull-Text 319-326
  Marwa M. Mahmoud; Tadas Baltrušaitis; Peter Robinson
One of the main factors that limit the accuracy of facial analysis systems is hand occlusion. As the face becomes occluded, facial features are either lost, corrupted or erroneously detected. Hand-over-face occlusions are considered not only very common but also very challenging to handle. Moreover, there is empirical evidence that some of these hand-over-face gestures serve as cues for recognition of cognitive mental states. In this paper, we detect hand-over-face occlusions and classify hand-over-face gesture descriptors in videos of natural expressions using multi-modal fusion of different state-of-the-art spatial and spatio-temporal features. We show experimentally that we can successfully detect face occlusions with an accuracy of 83%. We also demonstrate that we can classify gesture descriptors (hand shape, hand action and facial region occluded) significantly better than a naive baseline. To our knowledge, this work is the first attempt to automatically detect and classify hand-over-face gestures in natural expressions.
Capturing Upper Body Motion in Conversation: An Appearance Quasi-Invariant Approach BIBAFull-Text 327-334
  Alvaro Marcos-Ramiro; Daniel Pizarro-Perez; Marta Marron-Romera; Daniel Gatica-Perez
We address the problem of body communication retrieval and measuring in seated conversations by means of markerless motion capture. In psychological studies, the use of automatic methods is key to reduce the subjectivity present in manual behavioral coding used to extract these cues. These studies usually involve hundreds of subjects with different clothing, non-acted poses, or different distances to the camera in uncalibrated, RGB-only video. However, range cameras are not yet common in psychology research, especially in existing recordings. Therefore, it becomes highly relevant to develop a fast method that is able to work in these conditions. Given the known relationship between depth and motion estimates, we propose to robustly integrate highly appearance-invariant image motion features in a machine learning approach, complemented with an effective tracking scheme. We evaluate the method's performance with existing databases and a database of upper body poses displayed in job interviews that we make public, showing that in our scenario it is comparable to that of Kinect without using a range camera, and state-of-the-art w.r.t. the HumanEva and ChaLearn 2011 evaluation datasets.
User Independent Gaze Estimation by Exploiting Similarity Measures in the Eye Pair Appearance Eigenspace BIBAFull-Text 335-338
  Nanxiang Li; Carlos Busso
The design of gaze-based computer interfaces has been an active research area for over 40 years. One challenge of using gaze detectors is the repetitive calibration process required to adjust the parameters of the systems, and the constrained conditions imposed on the user for robust gaze estimation. We envision user-independent gaze detectors that do not require calibration, or any cooperation from the user. Toward this goal, we investigate an appearance-based approach, where we estimate the eigenspace for the gaze using principal component analysis (PCA). The projections are used as features of regression models that estimate the screen's coordinates. As expected, the performance of the approach decreases when the models are trained without data from the target user (i.e., user-independent condition). This study proposes an appealing training approach to bridge the gap in performance between user-dependent and user-independent conditions. Using the projections onto the eigenspace, the scheme identifies samples in the training set that are similar to the testing images. We build the sample covariance matrix and the regression models only with these samples. We consider either similar frames or data from subjects with similar eye appearance. The promising results suggest that the proposed training approach is a feasible and convenient scheme for gaze-based multimodal interfaces.

Doctoral Spotlight Session

Exploring multimodality for translator-computer interaction BIBAFull-Text 339-343
  Julián Zapata
Multimodal interaction has the potential to become one of the most efficient, cost-effective and ergonomic working strategies for translation professionals, in the near future. This paper describes our doctoral research project on multimodality in translator-computer interaction (TCI). The specific objective of this project is to observe and analyze the translator experience (TX) with off-the-shelf voice-and-touch-enabled multimodal interfaces, as compared to the interaction with traditional keyboard-and-mouse graphical user interfaces. Overall, this project aims to collect user-centered data and provide recommendations for better-grounded translation tool design. Thus, the study will make an original contribution to the advancement of knowledge in the fields of TCI and human-computer interaction in general.
Towards Social Touch Intelligence: Developing a Robust System for Automatic Touch Recognition BIBAFull-Text 344-348
  Merel M. Jung
Touch behavior is of great importance during social interaction. Automatic recognition of social touch is necessary to transfer the touch modality from interpersonal interaction to other areas such as Human-Robot Interaction (HRI). This paper describes a PhD research program on the automatic detection, classification and interpretation of touch in social interaction between humans and artifacts. Progress thus far includes the recording of a Corpus of Social Touch (CoST) consisting of pressure sensor data of 14 different touch gestures and first classification results. Classification of these 14 gestures resulted in an overall accuracy of 53% using Bayesian classifiers. Further work includes the enhancement of the gesture recognition, building an embodied system for real-time classification and testing this system in a possible application scenario.
Facial Expression Analysis for Estimating Pain in Clinical Settings BIBAFull-Text 349-353
  Karan Sikka
Pain assessment is vital for effective pain management in clinical settings. It is generally obtained via patient's self-report or observer's assessment. Both of these approaches suffer from several drawbacks such as unavailability of self-report, idiosyncratic use and observer bias. This work aims at developing automated machine learning based approaches for estimating pain in clinical settings. We propose to use facial expression information to accomplish current goals since previous studies have demonstrated consistency between facial behavior and experienced pain. Moreover, with recent advances in computer vision it is possible to design algorithms for identifying spontaneous expressions such as pain in more naturalistic conditions.
   Our focus is on designing robust computer vision models for estimating pain in videos containing patients' facial behavior. In this regard we discuss different research problems, technical approaches and challenges that need to be addressed. In this work we particularly highlight the problem of predicting self-report measures of pain intensity, since this problem is not only more challenging but has also received less attention. We also discuss our efforts towards collecting an in-situ pediatric pain dataset for validating these approaches. We conclude the paper by presenting some results on both the UNBC-McMaster Pain dataset and the pediatric pain dataset.
Realizing Robust Human-Robot Interaction under Real Environments with Noises BIBAFull-Text 354-358
  Takaaki Sugiyama
A human speaker considers her interlocutor's situation when she determines to begin speaking in human-human interaction. We assume this tendency is also applicable to human-robot interaction when a human treats a humanoid robot as a social being and behaves as a cooperative user. As a part of this social norm, we have built a model of predicting when a user is likely to begin speaking to a humanoid robot. This proposed model can be used to prevent a robot from generating erroneous reactions by ignoring input noises. In my Ph.D. thesis, we will realize robust human-robot interaction under real environments with noises. To achieve this, we began constructing a robot dialogue system using multiple modalities, such as audio and visual, and the robot's posture information. We plan to: 1) construct a robot dialogue system, 2) develop systems using social norms, such as an input sound classifier, controlling user's untimely utterances, and estimating user's degree of urgency, and 3) extend it from a one-to-one dialogue system to a multi-party one.
Speaker- and Corpus-Independent Methods for Affect Classification in Computational Paralinguistics BIBAFull-Text 359-363
  Heysem Kaya
The analysis of spoken emotions is of increasing interest in human computer interaction, in order to make machine communication more humane. It has manifold applications ranging from intelligent tutoring systems to affect sensitive robots, from smart call centers to patient telemonitoring. In general, the study of computational paralinguistics, which covers the analysis of speaker states and traits, faces the real-life challenges of inter-speaker and inter-corpus variability. In this paper, a brief summary is given of the progress and future directions of my PhD study, titled Adaptive Mixture Models for Speech Emotion Recognition, which targets these challenges. An automatic mixture model selection method for Mixture of Factor Analyzers is proposed for modeling high dimensional data. To provide this statistical method with a compact set of potent features, novel feature selection methods based on Canonical Correlation Analysis are introduced.
The Impact of Changing Communication Practices BIBAFull-Text 364-368
  Ailbhe N. Finnerty
Due to advancements in communication technologies, how we interact with each other has changed significantly. An advantage is being able to keep in touch (family, friends) and collaborate (colleagues) with others over large distances. However, these technologies can remove behavioural cues, such as changes in tone, gesturing and posture, which can add depth and meaning to an interaction. In this paper two studies are presented which investigate changing communication practices in 1) the workplace and in 2) a loosely connected social group. The interactions of the participants were analysed by comparing synchronous (occurring in real time; e.g. face to face) and asynchronous (delayed; email, sms) patterns of communication. The findings showed a prevalence of asynchronous methods of communication in Study 1, which had an impact on affective states (positive and negative) and on self reported measures of productivity and creativity, while in Study 2 synchronous communication patterns affected stress.
Multi-Resident Human Behaviour Identification in Ambient Assisted Living Environments BIBAFull-Text 369-373
  Hande Alemdar
Multimodal interactions in ambient assisted living environments require human behaviour to be recognized and monitored automatically. The complex nature of human behaviour makes it extremely difficult to infer and adapt to, especially in multi-resident environments. This proposed research aims to contribute to the multimodal interaction community by (i) providing publicly available, naturalistic, rich and annotated datasets for human behaviour modeling, (ii) introducing evaluation methods of several inference methods from a behaviour monitoring perspective, (iii) developing novel methods for recognizing individual behaviour in multi-resident smart environments without assuming any person identification, (iv) proposing methods for mitigating the scalability issues by using transfer, active, and semi-supervised learning techniques. The proposed studies will address both practical and methodological aspects of human behaviour recognition in smart interactive environments.
Gaze-Based Proactive User Interface for Pen-Based Systems BIBAFull-Text 374-378
  Çagla Çig
In typical human-computer interaction, users convey their intentions through traditional input devices (e.g. keyboards, mice, joysticks) coupled with standard graphical user interface elements. Recently, pen-based interaction has emerged as a more intuitive alternative to these traditional means. However, existing pen-based systems are limited by the fact that they rely heavily on auxiliary mode switching mechanisms during interaction (e.g. hard or soft modifier keys, buttons, menus). In this paper, I describe the roadmap for my PhD research which aims at using eye gaze movements that naturally occur during pen-based interaction to reduce dependency on explicit mode selection mechanisms in pen-based systems.
Appearance based user-independent gaze estimation BIBAFull-Text 379-383
  Nanxiang Li
An ideal gaze user interface should be able to accurately estimate the user's gaze direction in a non-intrusive setting. Most studies on gaze estimation focus on the accuracy of the estimation results, imposing important constraints on the user such as no head movement, intrusive head mount setting and repetitive calibration process. Due to these limitations, most graphic user interfaces (GUIs) are reluctant to include gaze as an input modality. We envision user-independent gaze detectors for user computer interaction that do not impose any constraints on the users. We believe the appearance of the eye pairs, which implicitly reveals head pose, provides conclusive information on the gaze direction. More importantly, the relative appearance changes in the eye pairs due to the different gaze direction should be consistent among different human subjects. We collected a multimodal corpus (MSP-GAZE) to study and evaluate user independent, appearance based gaze estimation approaches. This corpus considers important factors that affect the appearance based gaze estimation: the individual difference, the head movement, and the distance between the user and the interface's screen. Using this database, our initial study focused on the eye pair appearance eigenspace approach, where the projections into the eye appearance eigenspace basis are used to build regression models to estimate the gaze position. We compare the results between user dependent (training and testing on the same subject) and user independent (testing subject is not included in the training data) models. As expected, due to the individual differences between subjects, the performance decreases when the models are trained without data from the target user. The study aims to reduce the gap between user dependent and user independent conditions.
Affective Analysis of Abstract Paintings Using Statistical Analysis and Art Theory BIBAFull-Text 384-388
  Andreza Sartori
A novel approach to the emotion classification of abstract paintings is proposed. Based on a user study, we employ computer vision techniques to understand what makes an abstract artwork emotional. Our aim is to identify and quantify the emotional regions of abstract paintings, as well as the role of each feature (colour, shapes and texture) in the human emotional response. In addition, we investigate the link between the detected emotional content and the way people look at abstract paintings by using eye-tracking recordings. A bottom-up saliency model was applied to compare with eye-tracking in order to predict the emotionally salient regions of abstract paintings. In future, we aim to extract metadata associated with the paintings (e.g., title, keywords, textual description, etc.) in order to correlate it with the emotional responses to the paintings. This research opens an opportunity to understand why a specific painting is perceived as emotional on global and local scales.
The Secret Language of Our Body: Affect and Personality Recognition Using Physiological Signals BIBAFull-Text 389-393
  Julia Wache
We present a novel framework for decoding individuals' emotional state and personality traits based on physiological responses to affective movie clips. While participants watched 36 video clips, we used measures of Electrocardiogram (ECG), Galvanic Skin Response (GSR), facial-Electroencephalogram (EEG) and facial emotional responses to decode i) the emotional state of participants and ii) their Big Five personality traits, extending previous work that had connected either explicit (user ratings) with some implicit (physiological) affective responses or one of them with selected personality traits.
   We make the first dataset comprising both affective and personality information publicly available for further research and we further explore different methods and implementations for automated emotion and personality detection for future applications.
Perceptions of Interpersonal Behavior are Influenced by Gender, Facial Expression Intensity, and Head Pose BIBAFull-Text 394-398
  Jeffrey M. Girard
Across multiple channels, nonverbal behavior communicates information about affective states and interpersonal intentions. Researchers interested in understanding how these nonverbal messages are transmitted and interpreted have examined the relationship between behavior and ratings of interpersonal motives using dimensions such as agency and communion. However, previous work has focused on images of posed behavior and it is unclear how well these results will generalize to more dynamic representations of real-world behavior. The current study proposes to extend the current literature by examining how gender, facial expression intensity, and head pose influence interpersonal ratings in videos of spontaneous nonverbal behavior.
Authoring Communicative Behaviors for Situated, Embodied Characters BIBAFull-Text 399-403
  Tomislav Pejsa
Embodied conversational agents hold great potential as multimodal interfaces due to their ability to communicate naturally using speech and nonverbal cues. The goal of my research is to enable animators and designers to endow ECAs with interactive behaviors that are controllable, communicatively effective, as well as natural and aesthetically appealing; I focus in particular on spatially situated, communicative nonverbal behaviors such as gaze and deictic gestures. This goal requires addressing challenges in the space of animation authoring and editing, parametric control, behavior coordination and planning, and retargeting to different embodiment designs. My research will aim to provide animators and designers with techniques and tools needed to author natural, expressive, and controllable gaze and gesture movements that leverage empirical or learned models of human behavior, to apply such behaviors to characters with different designs and communicative styles, and to develop techniques and models for planning of coordinated behaviors that economically and correctly convey the range of diverse cues required for multimodal, user-machine interaction.
Multimodal Analysis and Modeling of Nonverbal Behaviors during Tutoring BIBAFull-Text 404-408
  Joseph F. Grafsgaard
Detecting learning-centered affective states is difficult, yet crucial for adapting most effectively to users. Within tutoring in particular, the combined context of student task actions and tutorial dialogue shapes the student's affective experience. As we move toward detecting affect, we may also supplement the task and dialogue streams with rich sensor data. In studies of introductory computer programming tutoring, human tutors communicated with students through text-based interfaces. Manual and automated approaches were leveraged to annotate dialogue, task actions, facial movements, postural positions, and hand-to-face gestures. Prior investigations in this line of doctoral research identified associations between nonverbal behavior and learning-centered affect, such as engagement and frustration. Additionally, preliminary work used hidden Markov models to analyze sequences of affective tutorial interaction. Further work will address the sequential nature of the multimodal data. This line of research is expected to improve automated understanding of learning-centered affect, with particular insights into how affect unfolds from moment to moment during tutoring. This may result in systems that treat student affect not as transient states, but instead as interconnected links in a student's path toward learning.

Keynote Address 3

Computation of Emotions BIBAFull-Text 409-410
  Peter Robinson
When people talk to each other, they express their feelings through facial expressions, tone of voice, body postures and gestures. They even do this when they are interacting with machines. These hidden signals are an important part of human communication, but most computer systems ignore them. Emotions need to be considered as an important mode of communication between people and interactive systems. Affective computing has enjoyed considerable success over the past 20 years, but many challenges remain.

Oral Session 5: Mobile and Urban Interaction

Non-Visual Navigation Using Combined Audio Music and Haptic Cues BIBAFull-Text 411-418
  Emily Fujimoto; Matthew Turk
While a great deal of work has been done exploring non-visual navigation interfaces using audio and haptic cues, little is known about the combination of the two. We investigate combining different state-of-the-art interfaces for communicating direction and distance information using vibrotactile and audio music cues, limiting ourselves to interfaces that are possible with current off-the-shelf smartphones. We use experimental logs, subjective task load questionnaires, and user comments to see how users' perceived performance, objective performance, and acceptance of the system varied for different combinations. Users' perceived performance did not differ much between the unimodal and multimodal interfaces, but a few users commented that the multimodal interfaces added some cognitive load. Objective performance showed that some multimodal combinations resulted in significantly less direction or distance error than some of the unimodal ones, especially the purely haptic interface. Based on these findings we propose a few design considerations for multimodal haptic/audio navigation interfaces.
Tactile Feedback for Above-Device Gesture Interfaces: Adding Touch to Touchless Interactions BIBAFull-Text 419-426
  Euan Freeman; Stephen Brewster; Vuokko Lantz
Above-device gesture interfaces let people interact in the space above mobile devices using hand and finger movements. For example, users could gesture over a mobile phone or wearable without having to use the touchscreen. We look at how above-device interfaces can also give feedback in the space over the device. Recent haptic and wearable technologies give new ways to provide tactile feedback while gesturing, letting touchless gesture interfaces give touch feedback. In this paper we take a first detailed look at how tactile feedback can be given during above-device interaction. We compare approaches for giving feedback (ultrasound haptics, wearables and direct feedback) and also look at feedback design. Our findings show that tactile feedback can enhance above-device gesture interfaces.
Once Upon a Crime: Towards Crime Prediction from Demographics and Mobile Data BIBAFull-Text 427-434
  Andrey Bogomolov; Bruno Lepri; Jacopo Staiano; Nuria Oliver; Fabio Pianesi; Alex Pentland
In this paper, we present a novel approach to predict crime in a geographic space from multiple data sources, in particular mobile phone and demographic data. The main contribution of the proposed approach lies in using aggregated and anonymized human behavioral data derived from mobile network activity to tackle the crime prediction problem. While previous research efforts have used either background historical knowledge or offenders' profiling, our findings support the hypothesis that aggregated human behavioral data captured from the mobile network infrastructure, in combination with basic demographic information, can be used to predict crime. In our experimental results with real crime data from London we obtain an accuracy of almost 70% when predicting whether a specific area in the city will be a crime hotspot or not. Moreover, we provide a discussion of the implications of our findings for data-driven crime analysis.
Impact of Coordinate Systems on 3D Manipulations in Mobile Augmented Reality BIBAFull-Text 435-438
  Philipp Tiefenbacher; Steven Wichert; Daniel Merget; Gerhard Rigoll
Mobile touch PCs allow interactions with virtual objects in augmented reality scenes. Manipulating 3D objects is a common form of such interaction, and it can be performed in three different coordinate systems: the camera, object and world coordinate systems. The camera coordinate system changes continuously in augmented reality as it depends on the mobile device's pose. The axis orientations of the world coordinate system are steady, whereas the axes of the object coordinate system are based on previous manipulations. The selection of a coordinate system therefore influences the 3D transformation's orientation independently of the manipulation type used.
   In this paper, we evaluate the impact of the three possible coordinate systems on rotation and on translation of a 3D item in an augmented reality scenario. A study with 36 participants determines the best coordinates for translation and rotation.

Oral Session 6: Healthcare and Assistive Technologies

Digital Reading Support for The Blind by Multimodal Interaction BIBAFull-Text 439-446
  Yasmine N. El-Glaly; Francis Quek
Slate-type devices allow Individuals with Blindness or Severe Visual Impairment (IBSVI) to read in place with the touch of their fingertip by audio-rendering the words they touch. Such technologies are helpful for spatial cognition while reading. However, users have to move their fingers slowly or they may lose their place on the screen. Also, IBSVI may wander between lines without realizing it. In this paper, we address these two interaction problems by introducing a dynamic speech-touch interaction model and an intelligent reading support system. With this model, the speed of the speech changes dynamically to keep up with the user's finger speed. The proposed model is composed of: 1- Audio Dynamics Model, and 2- Off-line Speech Synthesis Technique. The intelligent reading support system predicts the direction of reading, corrects the reading word if the user drifts, and notifies the user using a sonic gutter to keep her from straying off the reading line. We tested the new audio dynamics model, the sonic gutter, and the reading support model in two user studies. The participants' feedback helped us fine-tune the parameters of the two models. Finally, we ran an evaluation study where the reading support system is compared to other VoiceOver technologies. The results favored the reading support system with its audio dynamics and intelligent reading support components.
Measuring Child Visual Attention using Markerless Head Tracking from Color and Depth Sensing Cameras BIBAFull-Text 447-454
  Jonathan Bidwell; Irfan A. Essa; Agata Rozga; Gregory D. Abowd
A child's failure to respond to his or her name being called is an early warning sign for autism, and response to name is currently assessed as a part of standard autism screening and diagnostic tools. In this paper, we explore markerless child head tracking as an unobtrusive approach for automatically predicting child response to name. Head turns are used as a proxy for visual attention. We analyzed 50 recorded response to name sessions with the goal of predicting if children, ages 15 to 30 months, responded to name calls by turning to look at an examiner within a defined time interval. The child's head turn angles and hand annotated child name call intervals were extracted from each session. Human assisted tracking was employed using an overhead Kinect camera, and automated tracking was later employed using an additional forward facing camera as a proof-of-concept. We explore two distinct analytical approaches for predicting child responses, one relying on a rule-based approach and another on random forest classification. In addition, we derive child response latency as a new measurement that could provide researchers and clinicians with finer grain quantitative information currently unavailable in the field due to human limitations. Finally we reflect on steps for adapting our system to work in less constrained natural settings.
Bi-Modal Detection of Painful Reaching for Chronic Pain Rehabilitation Systems BIBAFull-Text 455-458
  Temitayo A. Olugbade; M. S. Hane Aung; Nadia Bianchi-Berthouze; Nicolai Marquardt; Amanda C. Williams
Physical activity is essential in chronic pain rehabilitation. However, anxiety due to pain or a perceived exacerbation of pain causes people to guard against beneficial exercise. Interactive rehabilitation technology sensitive to such behaviour could provide feedback to overcome such psychological barriers. To this end, we developed a Support Vector Machine framework with the feature level fusion of body motion and muscle activity descriptors to discriminate three levels of pain (none, low and high). All subjects underwent a forward reaching exercise which is typically feared among people with chronic back pain. The levels of pain were categorized from control subjects (no pain) and thresholded self reported levels from people with chronic pain. Salient features were identified using a backward feature selection process. Using feature sets from each modality separately led to high pain classification F1 scores of 0.63 and 0.69 for movement and muscle activity respectively. However, using a combined bimodal feature set, this increased to F1 = 0.8.

Keynote Address 4

A World without Barriers: Connecting the World across Languages, Distances and Media BIBAFull-Text 459-460
  Alexander Waibel
As our world becomes increasingly interdependent and globalization brings people together more than ever, we quickly discover that it is no longer the absence of connectivity (the "digital divide") that separates us, but that new and different forms of alienation still keep us apart, including language, culture, distance and interfaces. Can technology provide solutions to bring us closer to our fellow humans?
   In this talk, I will present multilingual and multimodal interface technology solutions that offer the best of both worlds: maintaining our cultural diversity and locale while providing for better communication, greater integration and collaboration.
   We explore: smartphone-based speech translators for everyday travelers and humanitarian missions; simultaneous translation systems and services to translate academic lectures and political speeches in real time (at universities, the European Parliament and broadcasting services); and multimodal language-transparent interfaces and smartrooms to improve joint and distributed communication and interaction.
   We will first discuss the difficulties of language processing; review how the technology works today and what levels of performance are now possible. Key to today's systems is effective machine learning, without which scaling multilingual and multimodal systems to unlimited domains, modalities, accents, and more than 6,000 languages would be hopeless. Equally important are effective human-computer interfaces, so that language differences fade naturally into the background and communication and interaction become natural and engaging. I will present recent research results as well as examples from our field trials and deployments in educational, commercial, humanitarian and government settings.

The Second Emotion Recognition In The Wild Challenge

Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol BIBAFull-Text 461-466
  Abhinav Dhall; Roland Goecke; Jyoti Joshi; Karan Sikka; Tom Gedeon
The Second Emotion Recognition In The Wild Challenge (EmotiW) 2014 consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on data captured in constrained, lab-controlled environments. While this data was a good starting point, such lab-controlled data poorly represents the environment and conditions faced in real-world situations. With the exponential increase in the number of video clips being uploaded online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. The goal of this Grand Challenge is to carry forward the common platform defined during EmotiW 2013, for evaluation of emotion recognition methods in real-world conditions. The database in the 2014 challenge is the Acted Facial Expressions in the Wild (AFEW) 4.0, which has been collected from movies showing close-to-real-world conditions. The paper describes the data partitions, the baseline method and the experimental protocol.
Neural Networks for Emotion Recognition in the Wild BIBAFull-Text 467-472
  Michal Grosicki
In this paper we present a neural-network-based method for emotion recognition. The proposed model was developed as part of the 2014 Emotion Recognition in the Wild Challenge. It is composed of modality-specific neural networks, which were trained separately on audio and video data extracted from short video clips taken from various movies. Each network was trained on frame-level data, and its predictions were later aggregated by simple averaging of the predicted class distributions for each clip. In the next stage, various techniques for combining modalities were investigated, with the best being a support vector machine with an RBF kernel. Our method achieved an accuracy of 37.84%, which is better than the 33.7% obtained by the best baseline model provided by the organisers.
Emotion Recognition in the Wild: Incorporating Voice and Lip Activity in Multimodal Decision-Level Fusion BIBAFull-Text 473-480
  Fabien Ringeval; Shahin Amiriparian; Florian Eyben; Klaus Scherer; Björn Schuller
In this paper, we investigate the relevance of using voice and lip activity to improve performance of audiovisual emotion recognition in unconstrained settings, as part of the 2014 Emotion Recognition in the Wild Challenge (EmotiW14). Indeed, the dataset provided by the organisers contains movie excerpts with highly challenging variability in terms of audiovisual content; e.g., speech and/or face of the subject expressing the emotion can be absent in the data. We therefore propose to tackle this issue by incorporating both voice and lip activity as additional features in a decision-level fusion. Results obtained on the blind test set show that the decision-level fusion can improve the best mono-modal approach, and that the addition of both voice and lip activity in the feature set leads to the best performance (UAR=35.27%), with an absolute improvement of 5.36% over the baseline.
Combining Multimodal Features with Hierarchical Classifier Fusion for Emotion Recognition in the Wild BIBAFull-Text 481-486
  Bo Sun; Liandong Li; Tian Zuo; Ying Chen; Guoyan Zhou; Xuewen Wu
Emotion recognition in the wild is a very challenging task. In this paper, we investigate a variety of different multimodal features from video and audio to evaluate their discriminative ability for human emotion analysis. For each clip, we extract SIFT, LBP-TOP, PHOG, LPQ-TOP and audio features. We train different classifiers for each kind of feature on the dataset from the EmotiW 2014 Challenge, and we propose a novel hierarchical classifier fusion method for all the extracted features. The final recognition rate we achieved on the test set is 47.17%, which is much better than the best baseline recognition rate of 33.7%.
Combining Modality-Specific Extreme Learning Machines for Emotion Recognition in the Wild BIBAFull-Text 487-493
  Heysem Kaya; Albert Ali Salah
This paper presents our contribution to the ACM ICMI 2014 Emotion Recognition in the Wild Challenge and Workshop. The proposed system utilizes Extreme Learning Machines (ELM) for modeling modality-specific features and combines the scores for final prediction. The state-of-the-art results in acoustic and visual emotion recognition are obtained either using deep Neural Networks (DNN) or Support Vector Machines (SVM). The ELM paradigm is proposed as a fast and accurate alternative to these two popular machine learning methods. Benefiting from the fast learning advantage of ELM, we carry out extensive tests on the data using moderate computational resources. In the video modality, we test combinations of regional visual features obtained from the inner face. In the audio modality, we carry out tests to enhance training via other emotional corpora. We further investigate the suitability of several recently proposed feature selection approaches to prune the acoustic features. In our study, the best results for both modalities are obtained with Kernel ELM compared to basic ELM. On the challenge test set, we obtain 37.84%, 39.07% and 44.23% classification accuracies for audio, video and multimodal fusion, respectively.
Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild BIBAFull-Text 494-501
  Mengyi Liu; Ruiping Wang; Shaoxin Li; Shiguang Shan; Zhiwu Huang; Xilin Chen
In this paper, we present the method for our submission to the Emotion Recognition in the Wild Challenge (EmotiW 2014). The challenge is to automatically classify the emotions acted by human subjects in video clips in real-world environments. In our method, each video clip is represented by three types of image set models (i.e. linear subspace, covariance matrix, and Gaussian distribution), which can all be viewed as points residing on Riemannian manifolds. Different Riemannian kernels are then employed on the corresponding set models for similarity/distance measurement. For classification, three types of classifiers, i.e. kernel SVM, logistic regression, and partial least squares, are investigated for comparison. Finally, an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level to further boost performance. We perform an extensive evaluation on the challenge data (including the validation set and the blind test set), and evaluate the effects of different strategies in our pipeline. The final recognition accuracy reaches 50.4% on the test set, a significant gain of 16.7 percentage points over the challenge baseline of 33.7%.
Enhanced Autocorrelation in Real World Emotion Recognition BIBAFull-Text 502-507
  Sascha Meudt; Friedhelm Schwenker
Multimodal emotion recognition in real world environments is still a challenging task of affective computing research. Recognizing the affective or physiological state of an individual is difficult for humans as well as for computer systems, and thus finding suitable discriminative features is the most promising approach in multimodal emotion recognition. In the literature, numerous features have been developed or adapted from related signal processing tasks. But still, classifying emotional states in real world scenarios is difficult and the performance of automatic classifiers is rather limited. This is mainly due to the fact that emotional states cannot be distinguished by a well defined set of discriminating features. In this work we present an enhanced autocorrelation feature as a multi pitch detection feature and compare its performance to well-known, state-of-the-art features in signal and speech processing. Results of the evaluation show that the enhanced autocorrelation outperforms other state-of-the-art features on the challenge data set. The complexity of this benchmark data set lies in between real world data sets showing naturalistic emotional utterances, and the widely applied and well-understood acted emotional data sets.
Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning BIBAFull-Text 508-513
  JunKai Chen; Zenghai Chen; Zheru Chi; Hong Fu
This paper presents our proposed approach for the second Emotion Recognition in The Wild Challenge. We propose a new feature descriptor called Histogram of Oriented Gradients from Three Orthogonal Planes (HOG_TOP) to represent facial expressions. We also explore the properties of visual features and audio features, and adopt Multiple Kernel Learning (MKL) to find an optimal feature fusion. An SVM with multiple kernels is trained for the facial expression classification. Experimental results demonstrate that our method achieves a promising performance. The overall classification accuracies on the validation set and test set are 40.21% and 45.21%, respectively.
Improved Spatiotemporal Local Monogenic Binary Pattern for Emotion Recognition in The Wild BIBAFull-Text 514-520
  Xiaohua Huang; Qiuhai He; Xiaopeng Hong; Guoying Zhao; Matti Pietikainen
Local binary pattern from three orthogonal planes (LBP-TOP) has been widely used in emotion recognition in the wild. However, it suffers from illumination and pose changes. This paper mainly focuses on the robustness of LBP-TOP to unconstrained environments. A recently proposed method, spatiotemporal local monogenic binary pattern (STLMBP), was shown to work well under different illumination conditions. Thus this paper proposes an improved spatiotemporal feature descriptor based on STLMBP. The improved descriptor uses not only magnitude and orientation, but also the phase information, which provides complementary information. In detail, the magnitude, orientation and phase images are obtained by using an effective monogenic filter, and multiple feature vectors are finally fused by multiple kernel learning. STLMBP and the proposed method are evaluated on Acted Facial Expression in the Wild as part of the 2014 Emotion Recognition in the Wild Challenge. They achieve competitive results, with accuracy gains of 6.35% and 7.65%, respectively, above the challenge baseline (LBP-TOP) for video.
Emotion Recognition in Real-world Conditions with Acoustic and Visual Features BIBAFull-Text 521-524
  Maxim Sidorov; Wolfgang Minker
A system capable of recognizing human emotions has an enormous number of potential applications, e.g., improving Spoken Dialogue Systems (SDSs) or monitoring agents in call centers. Therefore, the Emotion Recognition In The Wild Challenge 2014 (EmotiW 2014) is focused on estimating emotions in real-world situations. This study presents the results of multimodal emotion recognition based on a support vector classifier. The described approach achieves an overall classification accuracy of 41.77% in the multimodal case. The obtained result is more than 17% higher than the baseline result for the multimodal approach.

Workshop Overviews

ERM4HCI 2014: The 2nd Workshop on Emotion Representation and Modelling in Human-Computer-Interaction-Systems BIBAFull-Text 525-526
  Kim Hartmann; Björn Schuller; Ronald Böck
In this paper the organisers present a brief overview of the second workshop on Emotion Representation and Modelling in Human-Computer-Interaction-Systems. The ERM4HCI 2014 workshop is again held in conjunction with the 16th ACM International Conference on Multimodal Interaction (ICMI 2014) taking place in Istanbul, Turkey. This year's ERM4HCI is focussed on the characteristics which are used to describe and, further, to identify emotions. Moreover, the corresponding relations to personality and user state models are of interest. In particular, options for a minimal set of characteristics will be discussed in the context of multimodal affective Human-Computer Interaction.
Gaze-in 2014: the 7th Workshop on Eye Gaze in Intelligent Human Machine Interaction BIBAFull-Text 527-528
  Hung-Hsuan Huang; Roman Bednarik; Kristiina Jokinen; Yukiko I. Nakano
This paper presents a summary of the seventh workshop on Eye Gaze in Intelligent Human Machine Interaction. The Gaze-in 2014 workshop is part of a series of workshops on topics related to gaze and multimodal interaction. The workshop website can be found at http://hhhuang.homelinux.com/gaze_in/.
MAPTRAITS 2014 -- The First Audio/Visual Mapping Personality Traits Challenge -- An Introduction: Perceived Personality and Social Dimensions BIBAFull-Text 529-530
  Oya Celiktutan; Florian Eyben; Evangelos Sariyanidi; Hatice Gunes; Björn Schuller
The Audio/Visual Mapping Personality Challenge and Workshop (MAPTRAITS) is a competition event that is organised to facilitate the development of signal processing and machine learning techniques for the automatic analysis of personality traits and social dimensions. MAPTRAITS includes two sub-challenges, the continuous space-time sub-challenge and the quantised space-time sub-challenge. The continuous sub-challenge evaluated how systems predict the variation of perceived personality traits and social dimensions in time, whereas the quantised challenge evaluated the ability of systems to predict the overall perceived traits and dimensions in shorter video clips. To analyse the effect of audio and visual modalities on personality perception, we compared systems under three different settings: visual-only, audio-only and audio-visual. With MAPTRAITS we aimed at improving the knowledge on the automatic analysis of personality traits and social dimensions by producing a benchmarking protocol and encouraging the participation of various research groups from different backgrounds.
MLA'14: Third Multimodal Learning Analytics Workshop and Grand Challenges BIBAFull-Text 531-532
  Xavier Ochoa; Marcelo Worsley; Katherine Chiluiza; Saturnino Luz
This paper summarizes the third Multimodal Learning Analytics Workshop and Grand Challenges (MLA'14). This subfield of Learning Analytics focuses on the interpretation of the multimodal interactions that occur in learning environments, both digital and physical. This is a hybrid event that includes presentations about methods and techniques to analyze and merge the different signals captured from these environments (workshop session) and more concrete results from the application of Multimodal Learning Analytics techniques to predict the performance of students while solving math problems or presenting in the classroom (challenges sessions). A total of eight articles will be presented in this event. The main conclusion from this event is that Multimodal Learning Analytics is a desirable research endeavour that could produce results that can be currently applied to improve the learning process.
ICMI 2014 Workshop on Multimodal, Multi-Party, Real-World Human-Robot Interaction BIBAFull-Text 533-534
  Mary Ellen Foster; Manuel Giuliani; Ronald Petrick
The Workshop on Multimodal, Multi-Party, Real-World Human-Robot Interaction will be held in Istanbul on 16 November 2014, co-located with the 16th International Conference on Multimodal Interaction (ICMI 2014). The workshop objective is to address the challenges that robots face when interacting with humans in real-world scenarios. The workshop brings together researchers from intention and activity recognition, person tracking, robust speech recognition and language processing, multimodal fusion, planning and decision making under uncertainty, and service robot design. The programme consists of two invited talks, three long paper talks, and seven late-breaking abstracts. Information on the workshop and pointers to workshop papers and slides can be found at http://www.macs.hw.ac.uk/~mef3/icmi-2014-workshop-hri/.
An Outline of Opportunities for Multimodal Research BIBAFull-Text 535-536
  Dirk Heylen; Alessandro Vinciarelli
This paper summarizes the contributions to the Workshop "Roadmapping the Future of Multimodal Interaction Research including Business Opportunities and Challenges". We present major challenges and ideas for making progress in the field of social signal processing and related fields as presented by the contributors of the workshop.
UM3I 2014: International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions BIBAFull-Text 537-538
  Samer Al Moubayed; Dan Bohus; Anna Esposito; Dirk Heylen; Maria Koutsombogera; Harris Papageorgiou; Gabriel Skantze
In this paper, we present a brief summary of the international workshop on Understanding and Modeling Multiparty, Multimodal Interactions. The UM3I 2014 workshop is held in conjunction with the ICMI 2014 conference. The workshop will highlight recent developments and adopted methodologies in the analysis and modeling of multiparty and multimodal interactions, the design and implementation principles of related human-machine interfaces, as well as the identification of potential limitations and ways of overcoming them.