HCI Bibliography Home | HCI Conferences | ICMI Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
ICMI Tables of Contents: 0203040506070809101112131415

Proceedings of the 2015 International Conference on Multimodal Interaction

Fullname:Proceedings of the 17th ACM International Conference on Multimodal Interaction
Editors:Zhengyou Zhang; Phil Cohen; Dan Bohus; Radu Horaud; Helen Meng
Location:Seattle, Washington
Dates:2015-Nov-09 to 2015-Nov-13
Standard No:ISBN: 978-1-4503-3912-4; ACM DL: Table of Contents; hcibib: ICMI15
Links:Conference Website
  1. Keynote Address 1
  2. Keynote Address 2
  3. Keynote Address 3 (Sustained Accomplishment Award Talk)
  4. Oral Session 1: Machine Learning in Multimodal Systems
  5. Oral Session 2: Audio-Visual, Multimodal Inference
  6. Oral Session 3: Language, Speech and Dialog
  7. Oral Session 4: Communication Dynamics
  8. Oral Session 5: Interaction Techniques
  9. Oral Session 6: Mobile and Wearable
  10. Poster Session
  11. Demonstrations
  12. Grand Challenge 1: Recognition of Social Touch Gestures Challenge 2015
  13. Grand Challenge 2: Emotion Recognition in the Wild Challenge 2015
  14. Grand Challenge 3: Multimodal Learning and Analytics Grand Challenge 2015
  15. Doctoral Consortium

Keynote Address 1

Sharing Representations for Long Tail Computer Vision Problems BIBAFull-Text 1
  Samy Bengio
The long tail phenomena appears when a small number of objects/words/classes are very frequent and thus easy to model, while many many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image captioning, which can be seen as an ultimate rare class problem since each image is attributed to a novel, yet structured, class in the form of a meaningful descriptive sentence.

Keynote Address 2

Interaction Studies with Social Robots BIBAFull-Text 3
  Kerstin Dautenhahn
Over the past 10 years we have seen worldwide an immense growth of research and development into companion robots. Those are robots that fulfil particular tasks, but do so in a socially acceptable manner. The companionship aspect reflects the repeated and long-term nature of such interactions, and the potential of people to form relationships with such robots, e.g. as friendly assistants. A number of companion and assistant robots have been entering the market, two of the latest examples are Aldebaran's Pepper robot, or Jibo (Cynthia Breazeal). Companion robots are more and more targeting particular application areas, e.g. as home assistants or therapeutic tools. Research into companion robots needs to address many fundamental research problems concerning perception, cognition, action and learning, but regardless how sophisticated our robotic systems may be, the potential users need to be taken into account from the early stages of development. The talk will emphasize the need for a highly user-centred approach towards design, development and evaluation of companion robots. An important challenge is to evaluate robots in realistic and long-term scenarios, in order to capture as closely as possible those key aspects that will play a role when using such robots in the real world. In order to illustrate these points, my talk will give examples of interaction studies that my research team has been involved in. This includes studies into how people perceive robots' non-verbal cues, creating and evaluating realistic scenarios for home companion robots using narrative framing, and verbal and tactile interaction of children with the therapeutic and social robot Kaspar. The talk will highlight the issues we encountered when we proceeded from laboratory-based experiments and prototypes to real-world applications.

Keynote Address 3 (Sustained Accomplishment Award Talk)

Connections: 2015 ICMI Sustained Accomplishment Award Lecture BIBAFull-Text 5
  Eric Horvitz
Our community has long pursued principles and methods for enabling fluid and effortless collaborations between people and computing systems. Forging deep connections between people and machines has come into focus over the last 25 years as a grand challenge at the intersection of artificial intelligence, human-computer interaction, and cognitive psychology. I will review experiences and directions with leveraging advances in perception, learning, and reasoning in pursuit of our shared dreams.

Oral Session 1: Machine Learning in Multimodal Systems

Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits BIBAFull-Text 7-14
  Moitreya Chatterjee; Sunghyun Park; Louis-Philippe Morency; Stefan Scherer
Human communication involves conveying messages both through verbal and non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless, the task of learning these patterns for a computer by combining cues from multiple modalities is challenging because it requires effective representation of the signals and also taking into consideration the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) Modeling the intermodal variations and dependencies; b) Representing the data using an apt number of features, such that the necessary patterns are captured but at the same time allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition, in the context of recognizing two essential speaker traits, namely passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data. Each of these perspectives attempts to independently address the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other one, we explicitly capture the correlation between the modalities but in a space of few dimensions and explore a novel clustering based kernel similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures semantic similarity of verbal content and preserves word-ordering. The experimental results on a recent public dataset shows significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility/passionateness.
Personality Trait Classification via Co-Occurrent Multiparty Multimodal Event Discovery BIBAFull-Text 15-22
  Shogo Okada; Oya Aran; Daniel Gatica-Perez
This paper proposes a novel feature extraction framework from multi-party multimodal conversation for inference of personality traits and emergent leadership. The proposed framework represents multi modal features as the combination of each participant's nonverbal activity and group activity. This feature representation enables to compare the nonverbal patterns extracted from the participants of different groups in a metric space. It captures how the target member outputs nonverbal behavior observed in a group (e.g. the member speaks while all members move their body), and can be available for any kind of multiparty conversation task. Frequent co-occurrent events are discovered using graph clustering from multimodal sequences. The proposed framework is applied for the ELEA corpus which is an audio visual dataset collected from group meetings. We evaluate the framework for binary classification task of 10 personality traits. Experimental results show that the model trained with co-occurrence features obtained higher accuracy than previously related work in 8 out of 10 traits. In addition, the co-occurrence features improve the accuracy from 2% up to 17%.
Evaluating Speech, Face, Emotion and Body Movement Time-series Features for Automated Multimodal Presentation Scoring BIBAFull-Text 23-30
  Vikram Ramanarayanan; Chee Wee Leong; Lei Chen; Gary Feng; David Suendermann-Oeft
We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series based features from these data streams -- the former being statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of cooccurrences, capture how different prototypical body posture or facial configurations co-occur within different time-lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech stream features in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
Gender Representation in Cinematic Content: A Multimodal Approach BIBAFull-Text 31-34
  Tanaya Guha; Che-Wei Huang; Naveen Kumar; Yan Zhu; Shrikanth S. Narayanan
The goal of this paper is to enable an objective understanding of gender portrayals in popular films and media through multimodal content analysis. An automated system for analyzing gender representation in terms of screen presence and speaking time is developed. First, we perform independent processing of the video and the audio content to estimate gender distribution of screen presence at shot level, and of speech at utterance level. A measure of the movie's excitement or intensity is computed using audiovisual features for every scene. This measure is used as a weighting function to combine the gender-based screen/speaking time information at shot/utterance level to compute gender representation for the entire movie. Detailed results and analyses are presented on seventeen full length Hollywood movies.

Oral Session 2: Audio-Visual, Multimodal Inference

Effects of Good Speaking Techniques on Audience Engagement BIBAFull-Text 35-42
  Keith Curtis; Gareth J. F. Jones; Nick Campbell
Understanding audience engagement levels for presentations has the potential to enable richer and more focused interaction with audio-visual recordings. We describe an investigation into automated analysis of multimodal recordings of scientific talks where the use of modalities most typically associated with engagement such as eye-gaze is not feasible. We first study visual and acoustic features to identify those most commonly associated with good speaking techniques. To understand audience interpretation of good speaking techniques, we engaged human annotators to rate the qualities of the speaker for a series of 30-second video segments taken from a corpus of 9 hours of presentations from an academic conference. Our annotators also watched corresponding video recordings of the audience to presentations to estimate the level of audience engagement for each talk. We then explored the effectiveness of multimodal features extracted from the presentation video against Likert-scale ratings of each speaker as assigned by the annotators. and on manually labelled audience engagement levels. These features were used to build a classifier to rate the qualities of a new speaker. This was able classify a rating for a presenter over an 8-class range with an accuracy of 52%. By combining these classes to a 4-class range accuracy increases to 73%. We analyse linear correlations with individual speaker-based modalities and actual audience engagement levels to understand the corresponding effect on audience engagement. A further classifier was then built to predict the level of audience engagement to a presentation by analysing the speaker's use of acoustic and visual cues. Using these speaker based modalities pre-fused with speaker ratings only, we are able to predict actual audience engagement levels with an accuracy of 68%. By combining with basic visual features from the audience as whole, we are able to improve this to an accuracy of 70%.
Multimodal Public Speaking Performance Assessment BIBAFull-Text 43-50
  Torsten Wörtwein; Mathieu Chollet; Boris Schauerte; Louis-Philippe Morency; Rainer Stiefelhagen; Stefan Scherer
The ability to speak proficiently in public is essential for many professions and in everyday life. Public speaking skills are difficult to master and require extensive training. Recent developments in technology enable new approaches for public speaking training that allow users to practice in engaging and interactive environments. Here, we focus on the automatic assessment of nonverbal behavior and multimodal modeling of public speaking behavior. We automatically identify audiovisual nonverbal behaviors that are correlated to expert judges' opinions of key performance aspects. These automatic assessments enable a virtual audience to provide feedback that is essential for training during a public speaking performance. We utilize multimodal ensemble tree learners to automatically approximate expert judges' evaluations to provide post-hoc performance assessments to the speakers. Our automatic performance evaluation is highly correlated with the experts' opinions with r = 0.745 for the overall performance assessments. We compare multimodal approaches with single modalities and find that the multimodal ensembles consistently outperform single modalities.
I Would Hire You in a Minute: Thin Slices of Nonverbal Behavior in Job Interviews BIBAFull-Text 51-58
  Laurent Son Nguyen; Daniel Gatica-Perez
In everyday life, judgments people make about others are based on brief excerpts of interactions, known as thin slices. Inferences stemming from such minimal information can be quite accurate, and nonverbal behavior plays an important role in the impression formation. Because protagonists are strangers, employment interviews are a case where both nonverbal behavior and thin slices can be predictive of outcomes. In this work, we analyze the predictive validity of thin slices of real job interviews, where slices are defined by the sequence of questions in a structured interview format. We approach this problem from an audio-visual, dyadic, and nonverbal perspective, where sensing, cue extraction, and inference are automated. Our study shows that although nonverbal behavioral cues extracted from thin slices were not as predictive as when extracted from the full interaction, they were still predictive of hirability impressions with R² values up to 0.34, which was comparable to the predictive validity of human observers on thin slices. Applicant audio cues were found to yield the most accurate results.
Deception Detection using Real-life Trial Data BIBAFull-Text 59-66
  Verónica Pérez-Rosas; Mohamed Abouelenien; Rada Mihalcea; Mihai Burzo
Hearings of witnesses and defendants play a crucial role when reaching court trial decisions. Given the high-stake nature of trial outcomes, implementing accurate and effective computational methods to evaluate the honesty of court testimonies can offer valuable support during the decision making process. In this paper, we address the identification of deception in real-life trial data. We introduce a novel dataset consisting of videos collected from public court trials. We explore the use of verbal and non-verbal modalities to build a multimodal deception detection system that aims to discriminate between truthful and deceptive statements provided by defendants and witnesses. We achieve classification accuracies in the range of 60-75% when using a model that extracts and fuses features from the linguistic and gesture modalities. In addition, we present a human deception detection study where we evaluate the human capability of detecting deception in trial hearings. The results show that our system outperforms the human capability of identifying deceit.

Oral Session 3: Language, Speech and Dialog

Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects BIBAFull-Text 67-74
  Gabriel Skantze; Martin Johansson; Jonas Beskow
In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss how multi-modal redundancy (from speech, card movements and head pose) is exploited to maintain meaningful discussions, given that the system has to process conversational speech from both children and adults in a noisy environment. Finally, we present an analysis of 373 interactions, where we investigate the robustness of the system, to what extent the system's attention can shape the users' turn-taking behaviour, and how the system can produce multi-modal turn-taking signals (filled pauses, facial gestures, breath and gaze) to deal with processing delays in the system.
Visual Saliency and Crowdsourcing-based Priors for an In-car Situated Dialog System BIBAFull-Text 75-82
  Teruhisa Misu
This paper addresses issues in situated language understanding in a moving car. We propose a reference resolution method to identify user queries about specific target objects in their surroundings. We investigate methods of predicting which target object is likely to be queried given a visual scene and what kind of linguistic cues users naturally provide to describe a given target object in a situated environment. We propose methods to incorporate the visual saliency of the visual scene as a prior. Crowdsourced statistics of how people describe an object are also used as a prior. We have collected situated utterances from drivers using our research system, which was embedded in a real vehicle. We demonstrate that the proposed algorithms improve target identification rate by 15.1%.
Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding BIBAFull-Text 83-86
  Yun-Nung Chen; Ming Sun; Alexander I. Rudnicky; Anatole Gershman
Spoken language interfaces are appearing in various smart devices (e.g. smart-phones, smart-TV, in-car navigating systems) and serve as intelligent assistants (IAs). However, most of them do not consider individual users' behavioral profiles and contexts when modeling user intents. Such behavioral patterns are user-specific and provide useful cues to improve spoken language understanding (SLU). This paper focuses on leveraging the app behavior history to improve spoken dialog systems performance. We developed a matrix factorization approach that models speech and app usage patterns to predict user intents (e.g. launching a specific app). We collected multi-turn interactions in a WoZ scenario; users were asked to reproduce the multi-app tasks that they had performed earlier on their smart-phones. By modeling latent semantics behind lexical and behavioral patterns, the proposed multi-model system achieves about 52% of turn accuracy for intent prediction on ASR transcripts.
Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video BIBAFull-Text 87-90
  Punarjay Chakravarty; Sayeh Mirzaei; Tinne Tuytelaars; Hugo Van hamme
Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates the same using spatio-temporal features that aim to capture other cues: movement of the head, upper body and hands of active speakers. Speaker directional information, obtained using sound source localization from a microphone array is used to supervise the training of these video features.

Oral Session 4: Communication Dynamics

Predicting Participation Styles using Co-occurrence Patterns of Nonverbal Behaviors in Collaborative Learning BIBAFull-Text 91-98
  Yukiko I. Nakano; Sakiko Nihonyanagi; Yutaka Takase; Yuki Hayashi; Shogo Okada
With the goal of assessing participant attitudes and group activities in collaborative learning, this study presents models of participation styles based on co-occurrence patterns of nonverbal behaviors between conversational participants. First, we collected conversations among groups of three people in a collaborative learning situation, wherein each participant had a digital pen and wore a glasses-type eye tracker. We then divided the collected multimodal data into 0.1-second intervals. The discretized data were applied to an unsupervised method to find co-occurrence behavioral patterns. As a result, we discovered 122 multimodal behavioral motifs from more than 3,000 possible combinations of behaviors by three participants. Using the multimodal behavioral motifs as predictor variables, we created regression models for assessing participation styles. The multiple correlation coefficients ranged from 0.74 to 0.84, indicating a good fit between the models and the data. A correlation analysis also enabled us to identify a smaller set of behavioral motifs (fewer than 30) that are statistically significant as predictors of participation styles. These results show that automatically discovered combinations of multiple kinds of nonverbal information with high co-occurrence frequencies observed between multiple participants as well as for a single participant are useful in characterizing the participant's attitudes towards the conversation.
Multimodal Fusion using Respiration and Gaze for Predicting Next Speaker in Multi-Party Meetings BIBAFull-Text 99-106
  Ryo Ishii; Shiro Kumano; Kazuhiro Otsuka
Techniques that use nonverbal behaviors to predict turn-taking situations, such as who will be the next speaker and the next utterance timing in multi-party meetings are receiving a lot of attention recently. It has long been known that gaze is a physical behavior that plays an important role in transferring the speaking turn between humans. Recently, a line of research has focused on the relationship between turn-taking and respiration, a biological signal that conveys information about the intention or preliminary action to start to speak. It has been demonstrated that respiration and gaze behavior separately have the potential to allow predicting the next speaker and the next utterance timing in multi-party meetings. As a multimodal fusion to create models for predicting the next speaker in multi-party meetings, we integrated respiration and gaze behavior, which were extracted from different modalities and are completely different in quality, and implemented a model uses information about them to predict the next speaker at the end of an utterance. The model has a two-step processing. The first is to predict whether turn-keeping or turn-taking happens; the second is to predict the next speaker in turn-taking. We constructed prediction models with either respiration or gaze behavior and with both respiration and gaze behaviors as features and compared their performance. The results suggest that the model with both respiration and gaze behaviors performs better than the one using only respiration or gaze behavior. It is revealed that multimodal fusion using respiration and gaze behavior is effective for predicting the next speaker in multi-party meetings. It was found that gaze behavior is more useful for predicting turn-keeping/turn-taking than respiration and that respiration is more useful for predicting the next speaker in turn-taking.
Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions BIBAFull-Text 107-114
  Catharine Oertel; Kenneth A. Funes Mora; Joakim Gustafson; Jean-Marc Odobez
Estimating a silent participant's degree of engagement and his role within a group discussion can be challenging, as there are no speech related cues available at the given time. Having this information available, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test where subjects were asked to assess listener roles and engagement levels in 15-second video-clips taken from a corpus of group interviews. Results show that humans are usually able to assess silent participant roles. Using the annotation to identify from a set of multimodal low-level features, such as past speaking activity, backchannels (both visual and verbal), as well as gaze patterns, we could identify the features which are able to distinguish between different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions, also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.
Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors BIBAFull-Text 115-122
  Najmeh Sadoughi; Carlos Busso
Creating believable behaviors for conversational agents (CAs) is a challenging task, given the complex relationship between speech and various nonverbal behaviors. The two main approaches are rule-based systems, which tend to produce behaviors with limited variations compared to natural interactions, and data-driven systems, which tend to ignore the underlying semantic meaning of the message (e.g., gestures without meaning). We envision a hybrid system, acting as the behavior realization layer in rule-based systems, while exploiting the rich variation in natural interactions. Constrained on a given target gesture (e.g., head nod) and speech signal, the system will generate novel realizations learned from the data, capturing the timely relationship between speech and gestures. An important task in this research is identifying multiple examples of the target gestures in the corpus. This paper proposes a data mining framework for detecting gestures of interest in a motion capture database. First, we train One-class support vector machines (SVMs) to detect candidate segments conveying the target gesture. Second, we use dynamic time alignment kernel (DTAK) to compare the similarity between the examples (i.e., target gesture) and the given segments. We evaluate the approach for five prototypical hand and head gestures showing reasonable performance. These retrieved gestures are then used to train a speech-driven framework based on dynamic Bayesian networks (DBNs) to synthesize these target behaviors.

Oral Session 5: Interaction Techniques

Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input BIBAFull-Text 123-130
  Konstantin Klamka; Andreas Siegel; Stefan Vogt; Fabian Göbel; Sophie Stellmach; Raimund Dachselt
For a desktop computer, we investigate how to enhance conventional mouse and keyboard interaction by combining the input modalities gaze and foot. This multimodal approach offers the potential for fluently performing both manual input (e.g., for precise object selection) and gaze-supported foot input (for pan and zoom) in zoomable information spaces in quick succession or even in parallel. For this, we take advantage of fast gaze input to implicitly indicate where to navigate to and additional explicit foot input for speed control while leaving the hands free for further manual input. This allows for taking advantage of gaze input in a subtle and unobtrusive way. We have carefully elaborated and investigated three variants of foot controls incorporating one-, two- and multidirectional foot pedals in combination with gaze. These were evaluated and compared to mouse-only input in a user study using Google Earth as a geographic information system. The results suggest that gaze-supported foot input is feasible for convenient, user-friendly navigation and comparable to mouse input and encourage further investigations of gaze-supported foot controls.
Gaze+Gesture: Expressive, Precise and Targeted Free-Space Interactions BIBAFull-Text 131-138
  Ishan Chatterjee; Robert Xiao; Chris Harrison
Humans rely on eye gaze and hand manipulations extensively in their everyday activities. Most often, users gaze at an object to perceive it and then use their hands to manipulate it. We propose applying a multimodal, gaze plus free-space gesture approach to enable rapid, precise and expressive touch-free interactions. We show the input methods are highly complementary, mitigating issues of imprecision and limited expressivity in gaze-alone systems, and issues of targeting speed in gesture-alone systems. We extend an existing interaction taxonomy that naturally divides the gaze+gesture interaction space, which we then populate with a series of example interaction techniques to illustrate the character and utility of each method. We contextualize these interaction techniques in three example scenarios. In our user study, we pit our approach against five contemporary approaches; results show that gaze+gesture can outperform systems using gaze or gesture alone, and in general, approach the performance of "gold standard" input systems, such as the mouse and trackpad.
Digital Flavor: Towards Digitally Simulating Virtual Flavors BIBAFull-Text 139-146
  Nimesha Ranasinghe; Gajan Suthokumar; Kuan-Yi Lee; Ellen Yi-Luen Do
Flavor is often a pleasurable sensory perception we experience daily while eating and drinking. However, the sensation of flavor is rarely considered in the age of digital communication mainly due to the unavailability of flavors as a digitally controllable media. This paper introduces a digital instrument (Digital Flavor Synthesizing device), which actuates taste (electrical and thermal stimulation) and smell sensations (controlled scent emitting) together to simulate different flavors digitally. A preliminary user experiment is conducted to study the effectiveness of this method with predefined five different flavor stimuli. Experimental results show that the users were effectively able to identify different flavors such as minty, spicy, and lemony. Moreover, we outline several challenges ahead along with future possibilities of this technology. In summary, our work demonstrates a novel controllable instrument for flavor simulation, which will be valuable in multimodal interactive systems for rendering virtual flavors digitally.
Different Strokes and Different Folks: Economical Dynamic Surface Sensing and Affect-Related Touch Recognition BIBAFull-Text 147-154
  Xi Laura Cang; Paul Bucci; Andrew Strang; Jeff Allen; Karon MacLean; H. Y. Sean Liu
Social touch is an essential non-verbal channel whose great interactive potential can be realized by the ability to recognize gestures performed on inviting surfaces. To assess impact on recognition performance of sensor motion, substrate and coverings, we collected gesture data from a low-cost multitouch fabric pressure-location sensor while varying these factors. For six gestures most relevant in a haptic social robot context plus a no-touch control, we conducted two studies, with the sensor (1) stationary, varying substrate and cover (n=10); and (2) attached to a robot under a fur covering, flexing or stationary (n=16). For a stationary sensor, a random forest model achieved 90.0% recognition accuracy (chance 14.2%) when trained on all data, but as high as 94.6% (mean 89.1%) when trained on the same individual. A curved, flexing surface achieved 79.4% overall but averaged 85.7% when trained and tested on the same individual. These results suggest that under realistic conditions, recognition with this type of flexible sensor is sufficient for many applications of interactive social touch. We further found evidence that users exhibit an idiosyncratic 'touch signature', with potential to identify the toucher. Both findings enable varied contexts of affective or functional touch communication, from physically interactive robots to any touch-sensitive object.

Oral Session 6: Mobile and Wearable

MPHA: A Personal Hearing Doctor Based on Mobile Devices BIBAFull-Text 155-162
  Yuhao Wu; Jia Jia; WaiKim Leung; Yejun Liu; Lianhong Cai
As more and more people inquire to know their hearing level condition, audiometry is becoming increasingly important. However, traditional audiometric method requires the involvement of audiometers, which are very expensive and time consuming. In this paper, we present mobile personal hearing assessment (MPHA), a novel interactive mode for testing hearing level based on mobile devices. MPHA, 1) provides a general method to calibrate sound intensity for mobile devices to guarantee the reliability and validity of the audiometry system; 2) designs an audiometric correction algorithm for the real noisy audiometric environment. The experimental results show that MPHA is reliable and valid compared with conventional audiometric assessment.
Towards Attentive, Bi-directional MOOC Learning on Mobile Devices BIBAFull-Text 163-170
  Xiang Xiao; Jingtao Wang
AttentiveLearner is a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner converts the built-in camera of mobile devices into both a tangible video control channel and an implicit heart rate sensing channel by analyzing the learner's fingertip transparency changes in real time. In this paper, we report disciplined research efforts in making AttentiveLearner truly practical in real-world use. Through two 18-participant user studies and follow-up analyses, we found that 1) the tangible video control interface is intuitive to use and efficient to operate; 2) heart rate signals implicitly captured by AttentiveLearner can be used to infer both the learner's interests and perceived confusion levels towards the corresponding learning topics; 3) AttentiveLearner can achieve significantly higher accuracy by predicting extreme personal learning events and aggregated learning events.
An Experiment on the Feasibility of Spatial Acquisition using a Moving Auditory Cue for Pedestrian Navigation BIBAFull-Text 171-174
  Yeseul Park; Kyle Koh; Heonjin Park; Jinwook Seo
We conducted a feasibility study on the use of a moving auditory cue for spatial acquisition for pedestrian navigation by comparing its performance with a static auditory cue, the use of which has been investigated in previous studies. To investigate the performance of human sound azimuthal localization, we designed and conducted a controlled experiment with 15 participants and found that performance was statistically significantly more accurate with an auditory source moving from the opposite direction over users' heads to the target direction than with a static sound. Based on this finding, we designed a bimodal pedestrian navigation system using both visual and auditory feedback. We evaluated the system by conducting a field study with four users and received overall positive feedback.
A Wearable Multimodal Interface for Exploring Urban Points of Interest BIBAFull-Text 175-182
  Antti Jylhä; Yi-Ta Hsieh; Valeria Orso; Salvatore Andolina; Luciano Gamberini; Giulio Jacucci
Locating points of interest (POIs) in cities is typically facilitated by visual aids such as paper maps, brochures, and mobile applications. However, these techniques require visual attention, which ideally should be on the surroundings. Non-visual techniques for navigating towards specific POIs typically lack support for free exploration of the city or more detailed guidance. To overcome these issues, we propose a multimodal, wearable system for alerting the user of nearby recommended POIs. The system, built around a tactile glove, provides audio-tactile cues when a new POI is in the vicinity, and more detailed information and guidance if the user expresses interest in this POI. We evaluated the system in a field study, comparing it to a visual baseline application. The encouraging results show that the glove-based system helps keep the attention on the surroundings and that its performance is on the same level as that of the baseline.

Poster Session

ECA Control using a Single Affective User Dimension BIBAFull-Text 183-190
  Fred Charles; Florian Pecune; Gabor Aranyi; Catherine Pelachaud; Marc Cavazza
User interaction with Embodied Conversational Agents (ECA) should involve a significant affective component to achieve realism in communication. This aspect has been studied through different frameworks describing the relationship between user and ECA, for instance alignment, rapport and empathy. We conducted an experiment to explore how an ECA's non-verbal expression can be controlled to respond to a single affective dimension generated by users as input. Our system is based on the mapping of a high-level affective dimension, approach/avoidance, onto a new ECA control mechanism in which Action Units (AU) are activated through a neural network. Since 'approach' has been associated to prefrontal cortex activation, we use a measure of prefrontal cortex left-asymmetry through fNIRS as a single input signal representing the user's attitude towards the ECA. We carried out the experiment with 10 subjects, who have been instructed to express a positive mental attitude towards the ECA. In return, the ECA facial expression would reflect the perceived attitude under a neurofeedback paradigm. Our results suggest that users are able to successfully interact with the ECA and perceive its response as consistent and realistic, both in terms of ECA responsiveness and in terms of relevance of facial expressions. From a system perspective, the empirical calibration of the network supports a progressive recruitment of various AUs, which provides a principled description of the ECA response and its intensity. Our findings suggest that complex ECA facial expressions can be successfully aligned with one high-level affective dimension. Furthermore, this use of a single dimension as input could support experiments in the fine-tuning of AU activation or their personalization to user preferred modalities.
Multimodal Interaction with a Bifocal View on Mobile Devices BIBAFull-Text 191-198
  Sebastien Pelurson; Laurence Nigay
On a mobile device, the intuitive Focus+Context layout of a detailed view (focus) and perspective/distorted panels on either side (context) is particularly suitable for maximizing the utilization of the limited available display area. Interacting with such a bifocal view requires both fast access to data in the context view and high precision interaction with data in the detailed focus view. We introduce combined modalities that solve this problem by combining the well-known flick-drag gesture-based precise modality with modalities for fast access to data in the context view. The modalities for fast access to data in the context view include direct touch in the context view as well as navigation based on drag gestures, on tilting the device, on side-pressure inputs or by spatially moving the device (dynamic peephole). Results of a comparison experiment of the combined modalities show that the performance can be analyzed according to a 3-phase model of the task: a focus-targeting phase, a transition phase (modality switch) and a cursor-pointing phase. Moreover modalities of the focus-targeting phase based on a discrete mode of navigation control (direct access, pressure sensors as discrete navigation controller) require a long transition phase: this is mainly due to disorientation induced by the loss of control in movements. This effect is significantly more pronounced than the articulatory time for changing the position of the fingers between the two modalities ("homing" time).
NaLMC: A Database on Non-acted and Acted Emotional Sequences in HCI BIBAFull-Text 199-202
  Kim Hartmann; Julia Krüger; Jörg Frommer; Andreas Wendemuth
We report on the investigation on acted and non-acted emotional speech and the resulting Non-/acted LAST MINUTE corpus (NaLMC) database. The database consists of newly recorded acted emotional speech samples which were designed to allow the direct comparison of acted and non-acted emotional speech. The non-acted samples are taken from the LAST MINUTE corpus (LMC) [1]. Furthermore, emotional labels were added to selected passages of the LMC and a self-rating of the LMC recordings was performed. Although the main objective of the NaLMC database is to allow the comparative analysis of acted and non-acted emotional speech, both audio and video signals were recorded to allow multimodal investigations.
Exploiting Multimodal Affect and Semantics to Identify Politically Persuasive Web Videos BIBAFull-Text 203-210
  Behjat Siddiquie; Dave Chisholm; Ajay Divakaran
We introduce the task of automatically classifying politically persuasive web videos and propose a highly effective multi-modal approach for this task. We extract audio, visual, and textual features that attempt to capture affect and semantics in the audio-visual content and sentiment in the viewers' comments. We demonstrate that each of the feature modalities can be used to classify politically persuasive content, and that fusing them leads to the best performance. We also perform experiments to examine human accuracy and inter-coder reliability for this task and show that our best automatic classifier slightly outperforms average human performance. Finally we show that politically persuasive videos generate more strongly negative viewer comments than non-persuasive videos and analyze how affective content can be used to predict viewer reactions.
Toward Better Understanding of Engagement in Multiparty Spoken Interaction with Children BIBAFull-Text 211-218
  Samer Al Moubayed; Jill Lehman
A system's ability to understand and model a human's engagement during an interactive task is important for both adapting its behavior to the moment and achieving a coherent interaction over time. Standard practice for creating such a capability requires uncovering and modeling the multimodal cues that predict engagement in a given task environment. The first step in this methodology is to have human coders produce "gold standard" judgments of sample behavior. In this paper we report results from applying this first step to the complex and varied behavior of children playing a fast-paced, speech-controlled, side-scrolling game called Mole Madness. We introduce a concrete metric for engagement-willingness to continue the interaction -- that leads to better inter-coder judgments for children playing in pairs, explore how coders perceive the relative contribution of audio and visual cues, and describe engagement trends and patterns in our population. We also examine how the measures change when the same children play Mole Madness with a robot instead of a peer. We conclude by discussing the implications of the differences within and across play conditions for the automatic estimation of engagement and the extension of our autonomous robot player into a "buddy" that can individualize interaction for each player and game.
Gestimator: Shape and Stroke Similarity Based Gesture Recognition BIBAFull-Text 219-226
  Yina Ye; Petteri Nurmi
Template-based approaches are currently the most popular gesture recognition solution for interactive systems as they provide accurate and runtime efficient performance in a wide range of applications. The basic idea in these approaches is to measure similarity between a user gesture and a set of pre-recorded templates, and to determine the appropriate gesture type using a nearest neighbor classifier. While simple and elegant, this approach performs well only when the gestures are relatively simple and unambiguous. In increasingly many scenarios, such as authentication, interactive learning, and health care applications, the gestures of interest are complex, consist of multiple sub-strokes, and closely resemble other gestures. Merely considering the shape of the gesture is not sufficient for these scenarios, and robust identification of the constituent sequence of sub-strokes is also required. The present paper contributes by introducing Gestimator, a novel gesture recognizer that combines shape and stroke-based similarity into a sequential classification framework for robust gesture recognition. Experiments carried out using three datasets demonstrate significant performance gains compared to current state-of-the-art techniques. The performance improvements are highest for complex gestures, but consistent improvements are achieved even for simple and widely studied gesture types.
Classification of Children's Social Dominance in Group Interactions with Robots BIBAFull-Text 227-234
  Sarah Strohkorb; Iolanda Leite; Natalie Warren; Brian Scassellati
As social robots become more widespread in educational environments, their ability to understand group dynamics and engage multiple children in social interactions is crucial. Social dominance is a highly influential factor in social interactions, expressed through both verbal and nonverbal behaviors. In this paper, we present a method for determining whether a participant is high or low in social dominance in a group interaction with children and robots. We investigated the correlation between many verbal and nonverbal behavioral features with social dominance levels collected through teacher surveys. We additionally implemented Logistic Regression and Support Vector Machines models with classification accuracies of 81% and 89%, respectively, showing that using a small subset of nonverbal behavioral features, these models can successfully classify children's social dominance level. Our approach for classifying social dominance is novel not only for its application to children, but also for achieving high classification accuracies using a reduced set of nonverbal features that, in future work, can be automatically extracted with current sensing technology.
Spectators' Synchronization Detection based on Manifold Representation of Physiological Signals: Application to Movie Highlights Detection BIBAFull-Text 235-238
  Michal Muszynski; Theodoros Kostoulas; Guillaume Chanel; Patrizia Lombardo; Thierry Pun
Detection of highlights in movies is a challenge for the affective understanding and implicit tagging of films. Under the hypothesis that synchronization of the reaction of spectators indicates such highlights, we define a synchronization measure between spectators that is capable of extracting movie highlights. The intuitive idea of our approach is to define (a) a parameterization of one spectator's physiological data on a manifold; (b) the synchronization measure between spectators as the Kolmogorov-Smirnov distance between local shape distributions of the underlying manifolds. We evaluate our approach using data collected in an experiment where the electro-dermal activity of spectators was recorded during the entire projection of a movie in a cinema. We compare our methodology with baseline synchronization measures, such as correlation, Spearman's rank correlation, mutual information, Kolmogorov-Smirnov distance. Results indicate that the proposed approach allows to accurately distinguish highlight from non-highlight scenes.
Implicit User-centric Personality Recognition Based on Physiological Responses to Emotional Videos BIBAFull-Text 239-246
  Julia Wache; Ramanathan Subramanian; Mojtaba Khomami Abadi; Radu-Laurentiu Vieriu; Nicu Sebe; Stefan Winkler
We present a novel framework for recognizing personality traits based on users' physiological responses to affective movie clips. Extending studies that have correlated explicit/implicit affective user responses with Extraversion and Neuroticism traits, we perform single-trial recognition of the big-five traits from Electrocardiogram (ECG), Galvanic Skin Response (GSR), Electroencephalogram (EEG) and facial emotional responses compiled from 36 users using off-the-shelf sensors. Firstly, we examine relationships among personality scales and (explicit) affective user ratings acquired in the context of prior observations. Secondly, we isolate physiological correlates of personality traits. Finally, unimodal and multimodal personality recognition results are presented. Personality differences are better revealed while analyzing responses to emotionally homogeneous (e.g., high valence, high arousal) clips, and significantly above-chance recognition is achieved for all five traits.
Detecting Mastication: A Wearable Approach BIBAFull-Text 247-250
  Abdelkareem Bedri; Apoorva Verlekar; Edison Thomaz; Valerie Avva; Thad Starner
We explore using the Outer Ear Interface (OEI) to recognize eating activities. OEI contains a 3D gyroscope and a set of proximity sensors encapsulated in an off-the-shelf earpiece to monitor jaw movement by measuring ear canal deformation. In a laboratory setting with 20 participants, OEI could distinguish eating from other activities, such as walking, talking, and silently reading, with over 90% accuracy (user independent). In a second study, six subjects wore the system for 6 hours each while performing their normal daily activities. OEI correctly classified five minute segments of time as eating or non-eating with 93% accuracy (user dependent).
Exploring Behavior Representation for Learning Analytics BIBAFull-Text 251-258
  Marcelo Worsley; Stefan Scherer; Louis-Philippe Morency; Paulo Blikstein
Multimodal analysis has long been an integral part of studying learning. Historically multimodal analyses of learning have been extremely laborious and time intensive. However, researchers have recently been exploring ways to use multimodal computational analysis in the service of studying how people learn in complex learning environments. In an effort to advance this research agenda, we present a comparative analysis of four different data segmentation techniques. In particular, we propose affect- and pose-based data segmentation, as alternatives to human-based segmentation, and fixed-window segmentation. In a study of ten dyads working on an open-ended engineering design task, we find that affect- and pose-based segmentation are more effective, than traditional approaches, for drawing correlations between learning-relevant constructs, and multimodal behaviors. We also find that pose-based segmentation outperforms the two more traditional segmentation strategies for predicting student success on the hands-on task. In this paper we discuss the algorithms used, our results, and the implications that this work may have in non-education-related contexts.
Multimodal Human Activity Recognition for Industrial Manufacturing Processes in Robotic Workcells BIBAFull-Text 259-266
  Alina Roitberg; Nikhil Somani; Alexander Perzylo; Markus Rickert; Alois Knoll
We present an approach for monitoring and interpreting human activities based on a novel multimodal vision-based interface, aiming at improving the efficiency of human-robot interaction (HRI) in industrial environments. Multi-modality is an important concept in this design, where we combine inputs from several state-of-the-art sensors to provide a variety of information, e.g. skeleton and fingertip poses. Based on typical industrial workflows, we derived multiple levels of human activity labels, including large-scale activities (e.g. assembly) and simpler sub-activities (e.g. hand gestures), creating a duration- and complexity-based hierarchy. We train supervised generative classifiers for each activity level and combine the output of this stage with a trained Hierarchical Hidden Markov Model (HHMM), which models not only the temporal aspects between the activities on the same level, but also the hierarchical relationships between the levels.
Accuracy vs. Availability Heuristic in Multimodal Affect Detection in the Wild BIBAFull-Text 267-274
  Nigel Bosch; Huili Chen; Sidney D'Mello; Ryan Baker; Valerie Shute
This paper discusses multimodal affect detection from a fusion of facial expressions and interaction features derived from students' interactions with an educational game in the noisy real-world context of a computer-enabled classroom. Log data of students' interactions with the game and face videos from 133 students were recorded in a computer-enabled classroom over a two day period. Human observers live annotated learning-centered affective states such as engagement, confusion, and frustration. The face-only detectors were more accurate than interaction-only detectors. Multimodal affect detectors did not show any substantial improvement in accuracy over the face-only detectors. However, the face-only detectors were only applicable to 65% of the cases due to face registration errors caused by excessive movement, occlusion, poor lighting, and other factors. Multimodal fusion techniques were able to improve the applicability of detectors to 98% of cases without sacrificing classification accuracy. Balancing the accuracy vs. applicability tradeoff appears to be an important feature of multimodal affect detection.
Dynamic Active Learning Based on Agreement and Applied to Emotion Recognition in Spoken Interactions BIBAFull-Text 275-278
  Yue Zhang; Eduardo Coutinho; Zixing Zhang; Caijiao Quan; Bjoern Schuller
In this contribution, we propose a novel method for Active Learning (AL) -- Dynamic Active Learning (DAL) -- which targets the reduction of the costly human labelling work necessary for modelling subjective tasks such as emotion recognition in spoken interactions. The method implements an adaptive query strategy that minimises the amount of human labelling work by deciding for each instance whether it should automatically be labelled by machine or manually by human, as well as how many human annotators are required. Extensive experiments on standardised test-beds show that DAL significantly improves the efficiency of conventional AL. In particular, DAL achieves the same classification accuracy obtained with AL with up to 79.17% less human annotation effort.
Sharing Touch Interfaces: Proximity-Sensitive Touch Targets for Tablet-Mediated Collaboration BIBAFull-Text 279-286
  Ilhan Aslan; Thomas Meneweger; Verena Fuchsberger; Manfred Tscheligi
During conversational practices, such as a tablet-mediated sales conversation between a salesperson and a customer, tablets are often used by two users who prefer specific bodily formations in order to easily face each other and the surface of the touchscreen. In a series of studies, we investigated bodily formations that are preferred during tablet-mediated sales conversations, and explored the effect of these formations on performance in acquiring touch targets (e.g., buttons) on a tablet device. We found that bodily formations cause decreased viewing angles to the shared screen, which results in a decreased performance in target acquisition. In order to address this issue, a multi-modal design consideration is presented, which combines mid-air finger movement and touch into a unified input modality, allowing the design of proximity sensitive touch targets. We conclude that the proposed embodied interaction design not only has potential to improve targeting performance, but also adapts the 'agency' of touch targets for multi-user settings.
Analyzing Multimodality of Video for User Engagement Assessment BIBAFull-Text 287-290
  Fahim A. Salim; Fasih Haider; Owen Conlan; Saturnino Luz; Nick Campbell
These days, several hours of new video content is uploaded to the internet every second. It is simply impossible for anyone to see every piece of video which could be engaging or even useful to them. Therefore it is desirable to identify videos that might be regarded as engaging automatically, for a variety of applications such as recommendation and personalized video segmentation etc. This paper explores how multimodal characteristics of video, such as prosodic, visual and paralinguistic features, can help in assessing user engagement with videos. The approach proposed in this paper achieved good accuracy (maximum F score of 96.93%) through a novel combination of features extracted directly from video recordings, demonstrating the potential of this method in identifying engaging content.
Adjacent Vehicle Collision Warning System using Image Sensor and Inertial Measurement Unit BIBAFull-Text 291-298
  Asif Iqbal; Carlos Busso; Nicholas R. Gans
Advanced driver assistance systems are the newest addition to vehicular technology. Such systems use a wide array of sensors to provide a superior driving experience. Vehicle safety and driver alert are important parts of these system. This paper proposes a driver alert system to prevent and mitigate adjacent vehicle collisions by proving warning information of on-road vehicles and possible collisions. A dynamic Bayesian network (DBN) is utilized to fuse multiple sensors to provide driver awareness. It detects oncoming adjacent vehicles and gathers ego vehicle motion characteristics using an on-board camera and inertial measurement unit (IMU). A histogram of oriented gradient feature based classifier is used to detect any adjacent vehicles. Vehicles front-rear end and side faces were considered in training the classifier. Ego vehicles heading, speed and acceleration are captured from the IMU and feed into the DBN. The network parameters were learned from data via expectation maximization (EM) algorithm. The DBN is designed to provide two type of warning to the driver, a cautionary warning and a brake alert for possible collision with other vehicles. Experiments were completed on multiple public databases, demonstrating successful warnings and brake alerts in most situations.
Automatic Detection of Mind Wandering During Reading Using Gaze and Physiology BIBAFull-Text 299-306
  Robert Bixler; Nathaniel Blanchard; Luke Garrison; Sidney D'Mello
Mind wandering (MW) entails an involuntary shift in attention from task-related thoughts to task-unrelated thoughts, and has been shown to have detrimental effects on performance in a number of contexts. This paper proposes an automated multimodal detector of MW using eye gaze and physiology (skin conductance and skin temperature) and aspects of the context (e.g., time on task, task difficulty). Data in the form of eye gaze and physiological signals were collected as 178 participants read four instructional texts from a computer interface. Participants periodically provided self-reports of MW in response to pseudorandom auditory probes during reading. Supervised machine learning models trained on features extracted from participants' gaze fixations, physiological signals, and contextual cues were used to detect pages where participants provided positive responses of MW to the auditory probes. Two methods of combining gaze and physiology features were explored. Feature level fusion entailed building a single model by combining feature vectors from individual modalities. Decision level fusion entailed building individual models for each modality and adjudicating amongst individual decisions. Feature level fusion resulted in an 11% improvement in classification accuracy over the best unimodal model, but there was no comparable improvement for decision level fusion. This was reflected by a small improvement in both precision and recall. An analysis of the features indicated that MW was associated with fewer and longer fixations and saccades, and a higher more deterministic skin temperature. Possible applications of the detector are discussed.
Multimodal Detection of Depression in Clinical Interviews BIBAFull-Text 307-310
  Hamdi Dibeklioglu; Zakia Hammal; Ying Yang; Jeffrey F. Cohn
Current methods for depression assessment depend almost entirely on clinical interview or self-report ratings. Such measures lack systematic and efficient ways of incorporating behavioral observations that are strong indicators of psychological disorder. We compared a clinical interview of depression severity with automatic measurement in 48 participants undergoing treatment for depression. Interviews were obtained at 7-week intervals on up to four occasions. Following standard cut-offs, participants at each session were classified as remitted, intermediate, or depressed. Logistic regression classifiers using leave-one-out validation were compared for facial movement dynamics, head movement dynamics, and vocal prosody individually and in combination. Accuracy (remitted versus depressed) for facial movement dynamics was higher than that for head movement dynamics; and each was substantially higher than that for vocal prosody. Accuracy for all three modalities together reached 88.93%, exceeding that for any single modality or pair of modalities. These findings suggest that automatic detection of depression from behavioral indicators is feasible and that multimodal measures afford most powerful detection.
Spoken Interruptions Signal Productive Problem Solving and Domain Expertise in Mathematics BIBAFull-Text 311-318
  Sharon Oviatt; Kevin Hang; Jianlong Zhou; Fang Chen
Prevailing social norms prohibit interrupting another person when they are speaking. In this research, simultaneous speech was investigated in groups of students as they jointly solved math problems and peer tutored one another. Analyses were based on the Math Data Corpus, which includes ground-truth performance coding and speech transcriptions. Simultaneous speech was elevated 120-143% during the most productive phase of problem solving, compared with matched intervals. It also was elevated 18-37% in students who were domain experts, compared with non-experts. Qualitative analyses revealed that experts differed from non-experts in the function of their interruptions. Analysis of these functional asymmetries produced nine key behaviors that were used to identify the dominant math expert in a group with 95-100% accuracy in three minutes. This research demonstrates that overlapped speech is a marker of group problem-solving progress and domain expertise. It provides valuable information for the emerging field of learning analytics.
Active Haptic Feedback for Touch Enabled TV Remote BIBAFull-Text 319-322
  Anton Treskunov; Mike Darnell; Rongrong Wang
Recently a number of TV manufacturers introduced TV remotes with a touchpad which is used for indirect control of TV UI. Users can navigate the UI by moving a finger across the touch pad. However, due to the latency in visual feedback, there is a disconnection between the finger movement on the touchpad and the visual perception in the TV UI, which often causes overshooting. In this paper, we investigate how haptic feedback affects the user experience of the touchpad-based TV remote. We described two haptic prototypes built on the smartphone and Samsung 2013 TV remote respectively. We conducted two user studies with two prototypes to evaluate how the user preference and the user performance been affected. The results show that there is overwhelming support of haptic feedback in terms of subjective user preference, though we didn't find significant difference in performance between with and without haptic feedback conditions.
A Visual Analytics Approach to Finding Factors Improving Automatic Speaker Identifications BIBAFull-Text 323-326
  Pierrick Bruneau; Mickaël Stefas; Hervé Bredin; Johann Poignant; Thomas Tamisier; Claude Barras
Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Specifically, comparisons are carried out mostly via mean values estimated on a set of media. Whilst this approach is relevant to assess improvement w.r.t. the state-of-the-art, or ranking participants in the context of an automatic annotation challenge, it gives little insight to system designers in terms of cues for improving algorithms, hypothesis formulation, and evidence display. This paper presents a design study of a visual and interactive approach to analyze errors made by automatic annotation algorithms. A timeline-based tool emerged from prior steps of this study. A critical review, driven by user interviews, exposes caveats and refines user objectives. The next step of the study is then initiated by sketching designs combining elements of the current prototype to principles newly identified as relevant.
The Influence of Visual Cues on Passive Tactile Sensations in a Multimodal Immersive Virtual Environment BIBAFull-Text 327-334
  Nina Rosa; Wolfgang Hürst; Wouter Vos; Peter Werkhoven
Haptic feedback, such as the sensation of 'being touched', is an essential part of how we experience our environment. Yet, it is often disregarded in current virtual reality (VR) systems. In addition to the technical challenge of creating such tactile experiences there are also human aspects that are not fully understood, especially with respect to how humans integrate multimodal stimuli. In this research, we proved that the visual stimuli in a VR setting can influence how vibrotactile stimuli are perceived. In particular, we identified how visual cues that are associated with the characteristic of weight influence tactile perception, whereas a similar effect could not be achieved for a temperature-related visual cue. Our results have technical implications -- for example, suggesting that a rather simple vibration motor may be sufficient to create a complex tactile experience such as perceiving weight -- and relevance for practical implementations -- for example, indicating that vibration intensities need to be 'exaggerated' to achieve certain effects.
Detection of Deception in the Mafia Party Game BIBAFull-Text 335-342
  Sergey Demyanov; James Bailey; Kotagiri Ramamohanarao; Christopher Leckie
The problem of deception detection is very challenging. Only trained people with specialist knowledge are able to demonstrate an accuracy that is sufficiently higher than random predictions. We present a multi-stage automatic system for extracting features from facial cues and evaluate it on the Mafia game database which we have collected. It is a large database of truthful and deceptive people, recorded in conditions more variable and realistic than many other databases of similar kind. We demonstrate that using the extracted features we are able to correctly classify instances with an average AUC (area under the ROC curve) equal to 0.61, significantly better than random predictions.
Individuality-Preserving Voice Reconstruction for Articulation Disorders Using Text-to-Speech Synthesis BIBAFull-Text 343-346
  Reina Ueda; Tetsuya Takiguchi; Yasuo Ariki
This paper presents a speech synthesis method for people with articulation disorders. Because the movements of such speakers are limited by their athetoid symptoms, their prosody is often unstable and their speech rate differs from that of a physically unimpaired person, which causes their speech to be less intelligible and, consequently, makes communication with physically unimpaired persons difficult. In order to deal with these problems, this paper describes a Hidden Markov Model (HMM)-based text-to-speech synthesis approach that preserves the individuality of a person with an articulation disorder and aids them in their communication. In our method, a duration model of a physically unimpaired person is used for the HMM synthesis system and an F0 model in the system is trained using the F0 patterns of the physically unimpaired person, with the average F0 being converted to the target F0 in advance. In order to preserve the target speaker's individuality, a spectral model is built from target spectra. Through experimental evaluations, we have confirmed that the proposed method successfully synthesizes intelligible speech while maintaining the target speaker's individuality.
Behavioral and Emotional Spoken Cues Related to Mental States in Human-Robot Social Interaction BIBAFull-Text 347-350
  Lucile Bechade; Guillaume Dubuisson Duplessis; Mohamed Sehili; Laurence Devillers
Understanding human behavioral and emotional cues occurring in interaction has become a major research interest due to the emergence of numerous applications such as in social robotics. While there is agreement across different theories that some behavioral signals are involved in communicating information, there is a lack of consensus regarding their specificity, their universality, and whether they convey emotions, affective, cognitive, mental states or all of those. Our goal in this study is to explore the relationship between behavioral and emotional cues extracted from speech (e.g., laughter, speech duration, negative emotions) with different communicative information about the human participant. This study is based on a corpus of audio/video data of humorous interactions between the Nao robot and 37 human participants. Participants filled three questionnaires about their personality, sense of humor and mental states regarding the interaction. This work reveals the existence of many links between behavioral and emotional cues and the mental states reported by human participants through self-report questionnaires. However, we have not found a clear connection between reported mental states and participants profiles.
Viewpoint Integration for Hand-Based Recognition of Social Interactions from a First-Person View BIBAFull-Text 351-354
  Sven Bambach; David J. Crandall; Chen Yu
Wearable devices are becoming part of everyday life, from first-person cameras (GoPro, Google Glass), to smart watches (Apple Watch), to activity trackers (FitBit). These devices are often equipped with advanced sensors that gather data about the wearer and the environment. These sensors enable new ways of recognizing and analyzing the wearer's everyday personal activities, which could be used for intelligent human-computer interfaces and other applications. We explore one possible application by investigating how egocentric video data collected from head-mounted cameras can be used to recognize social activities between two interacting partners (e.g. playing chess or cards). In particular, we demonstrate that just the positions and poses of hands within the first-person view are highly informative for activity recognition, and present a computer vision approach that detects hands to automatically estimate activities. While hand pose detection is imperfect, we show that combining evidence across first-person views from the two social partners significantly improves activity recognition accuracy. This result highlights how integrating weak but complimentary sources of evidence from social partners engaged in the same task can help to recognize the nature of their interaction.
A Multimodal System for Real-Time Action Instruction in Motor Skill Learning BIBAFull-Text 355-362
  Iwan de Kok; Julian Hough; Felix Hülsmann; Mario Botsch; David Schlangen; Stefan Kopp
We present a multimodal coaching system that supports online motor skill learning. In this domain, closed-loop interaction between the movements of the user and the action instructions by the system is an essential requirement. To achieve this, the actions of the user need to be measured and evaluated and the system must be able to give corrective instructions on the ongoing performance. Timely delivery of these instructions, particularly during execution of the motor skill by the user, is thus of the highest importance. Based on the results of an empirical study on motor skill coaching, we analyze the requirements for an interactive coaching system and present an architecture that combines motion analysis, dialogue management, and virtual human animation in a motion tracking and 3D virtual reality hardware setup. In a preliminary study we demonstrate that the current system is capable of delivering the closed-loop interaction that is required in the motor skill learning domain.


The Application of Word Processor UI paradigms to Audio and Animation Editing BIBAFull-Text 363-364
  Andre D. Milota
This demonstration showcases Quixotic, an audio editor, and Quintessence, an animation editor. Both appropriate many of the interaction techniques found in word processors, and allow users to more quickly create time-variant media. Our different approach to the interface aims to make recorded speech and simple animation into media that can be efficiently used for one-to-one asynchronous communications, quick note taking and documentation, as well as for idea refinement.
CuddleBits: Friendly, Low-cost Furballs that Respond to Touch BIBAFull-Text 365-366
  Laura Cang; Paul Bucci; Karon E. MacLean
We present a real-time touch gesture recognition system using a low-cost fabric pressure sensor mounted on a small zoomorphic robot, affectionately called the 'CuddleBit'. We explore the relationship between gesture recognition and affect through the lens of human-robot interaction. We demonstrate our real-time gesture recognition system, including both software and hardware, and a haptic display that brings the CuddleBit to life.
Public Speaking Training with a Multimodal Interactive Virtual Audience Framework BIBAFull-Text 367-368
  Mathieu Chollet; Kalin Stefanov; Helmut Prendinger; Stefan Scherer
We have developed an interactive virtual audience platform for public speaking training. Users' public speaking behavior is automatically analyzed using multimodal sensors, and ultimodal feedback is produced by virtual characters and generic visual widgets depending on the user's behavior. The flexibility of our system allows to compare different interaction mediums (e.g. virtual reality vs normal interaction), social situations (e.g. one-on-one meetings vs large audiences) and trained behaviors (e.g. general public speaking performance vs specific behaviors).
A Multimodal System for Public Speaking with Real Time Feedback BIBAFull-Text 369-370
  Fiona Dermody; Alistair Sutherland
We have developed a multimodal prototype for public speaking with real time feedback using the Microsoft Kinect. Effective speaking involves use of gesture, facial expression, posture, voice as well as the spoken word. These modalities combine to give the appearance of self-confidence in the speaker. This initial prototype detects body pose, facial expressions and voice. Visual and text feedback is displayed in real time to the user using a video panel, icon panel and text feedback panel. The user can also set and view elapsed time during their speaking performance. Real time feedback is displayed on gaze direction, body pose and gesture, vocal tonality, vocal dysfluencies and speaking rate.
Model of Personality-Based, Nonverbal Behavior in Affective Virtual Humanoid Character BIBAFull-Text 371-372
  Maryam Saberi; Ulysses Bernardet; Steve DiPaola
In this demonstration a human user interacts with a virtual humanoid character in real-time. Our goal is to create a character that is perceived as imbued with a distinct personality while responding dynamically to inputs from the environment [4] [1]. A hybrid model that comprises continuous and discrete components, firstly, drives the logical behavior of the virtual character moving through states of the interaction, and secondly, continuously updates of the emotional expressions of the virtual character depending on feedback from interactions with the environment. A Rock-Paper-Scissors game scenario is used as framework for the interaction scenario and provides an easy-to-learn and engaging demo environment with minimum conversation.
AttentiveLearner: Adaptive Mobile MOOC Learning via Implicit Cognitive States Inference BIBAFull-Text 373-374
  Xiang Xiao; Phuong Pham; Jingtao Wang
This demo presents AttentiveLearner, a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner uses on-lens finger gestures for video control and captures learners' physiological states through implicit heart rate tracking on unmodified mobile phones. Through three user studies to date, we found AttentiveLearner easy to learn, and intuitive to use. The heart beat waveforms captured by AttentiveLearner can be used to infer learners' cognitive states and attention. AttentiveLearner may serve as a promising supplemental feedback channel orthogonal to today's learning analytics technologies.
Interactive Web-based Image Sonification for the Blind BIBAFull-Text 375-376
  Torsten Wörtwein; Boris Schauerte; Karin E. Müller; Rainer Stiefelhagen
In this demonstration, we show a web-based sonification platform that allows blind users to interactively experience various information using two nowadays widespread technologies: modern web browsers that implement high-level JavaScript APIs and touch-sensitive displays. This way, blind users can easily access information such as, for example, maps or graphs. Our current prototype provides various sonifications that can be switched depending on the image type and user preference. The prototype runs in Chrome and Firefox on PCs, smart phones, and tablets.
Nakama: A Companion for Non-verbal Affective Communication BIBAFull-Text 377-378
  Christian J. A. M. Willemse; Gerald M. Munters; Jan B. F. van Erp; Dirk Heylen
We present "Nakama": A communication device that supports affective communication between a child and its -- geographically separated -- parent. Nakama consists of a control unit at the parent's end and an actuated teddy bear for the child. The bear contains several communication channels, including social touch, temperature, and vibrotactile heartbeats; all aimed at increasing the sense of presence. The current version of Nakama is suitable for user evaluations in lab settings, with which we aim to gain a more thorough understanding of the opportunities and limitations of these less traditional communication channels.
Wir im Kiez: Multimodal App for Mutual Help Among Elderly Neighbours BIBAFull-Text 379-380
  Sven Schmeier; Aaron Ruß; Norbert Reithinger
Elderly people often need support in everyday situations -- e.g. common daily life activities like taking care of house and garden, or caring for an animal are often not possible without a larger support circle. However, especially in larger western cities, local social networks may not be very tight, friends may have moved away or died, and the traditional support structures found in so-called multi-generational families do not exist anymore. As a result, the quality of life for elderly people suffers crucially. On the other hand, people from the broader neighborhood would often gladly help and respond quickly. With the project Wir im Kiez we developed and tested a multimodal social network app equipped with a conversational interface that addresses these issues. In the demonstration, we especially focus on the needs and restrictions of seniors, both in their physical and psychological limitations.
Interact: Tightly-coupling Multimodal Dialog with an Interactive Virtual Assistant BIBAFull-Text 381-382
  Ethan Selfridge; Michael Johnston
Interact is a mobile virtual assistant that uses multimodal dialog to enable an interactive concierge experience over multiple application domains including hotel, restaurants, events, and TV search. Interact demonstrates how multimodal interaction combined with conversational dialog enables a richer and more natural user experience. This demonstration will highlight incremental recognition and understanding, multimodal speech and gesture input, context tracking over multiple simultaneous domains, and the use of multimodal interface techniques to enable disambiguation of errors and online personalization.
The UTEP AGENT System BIBAFull-Text 383-384
  David Novick; Iván Gris Sepulveda; Diego A. Rivera; Adriana Camacho; Alex Rayon; Mario Gutierrez
This paper describes a system for embodied conversational agents (ECAs) developed at the University of Texas at El Paso by the Advanced aGent ENgagement Team (AGENT) and one of the applications -- Survival on Jungle Island -- built with this system. In the Jungle application, the ECA and a human interact with speech and gesture for approximately 40 -- 60 minutes in a game composed of 23 scenes (to maintain the demonstration feasible, participants will interact only with select scenes that showcase the capabilities of our system). Each scene comprises a collection of speech input, speech output, gesture input, gesture output, scenery, triggers, and decision points.
A Distributed Architecture for Interacting with NAO BIBAFull-Text 385-386
  Fabien Badeig; Quentin Pelorson; Soraya Arias; Vincent Drouard; Israel Gebru; Xiaofei Li; Georgios Evangelidis; Radu Horaud
One of the main applications of the humanoid robot NAO -- a small robot companion -- is human-robot interaction (HRI). NAO is particularly well suited for HRI applications because of its design, hardware specifications, programming capabilities, and affordable cost. Indeed, NAO can stand up, walk, wander, dance, play soccer, sit down, recognize and grasp simple objects, detect and identify people, localize sounds, understand some spoken words, engage itself in simple and goal-directed dialogs, and synthesize speech. This is made possible due to the robot's 24 degree-of-freedom articulated structure (body, legs, feet, arms, hands, head, etc.), motors, cameras, microphones, etc., as well as to its on-board computing hardware and embedded software, e.g., robot motion control. Nevertheless, the current NAO configuration has two drawbacks that restrict the complexity of interactive behaviors that could potentially be implemented. Firstly, the on-board computing resources are inherently limited, which implies that it is difficult to implement sophisticated computer vision and audio signal analysis algorithms required by advanced interactive tasks. Secondly, programming new robot functionalities currently implies the development of embedded software, which is a difficult task in its own right necessitating specialized knowledge. The vast majority of HRI practitioners may not have this kind of expertise and hence they cannot easily and quickly implement their ideas, carry out thorough experimental validations, and design proof-of-concept demonstrators. We have developed a distributed software architecture that attempts to overcome these two limitations. Broadly speaking, NAO's on-board computing resources are augmented with external computing resources. The latter is a computer platform with its CPUs, GPUs, memory, operating system, libraries, software packages, internet access, etc. This configuration enables easy and fast development in Matlab, C, C++, or Python. Moreover, it allows the user to combine on-board libraries (motion control, face detection, etc.) with external toolboxes, e.g., OpenCv.

Grand Challenge 1: Recognition of Social Touch Gestures Challenge 2015

Touch Challenge '15: Recognizing Social Touch Gestures BIBAFull-Text 387-390
  Merel M. Jung; Xi Laura Cang; Mannes Poel; Karon E. MacLean
Advances in the field of touch recognition could open up applications for touch-based interaction in areas such as Human-Robot Interaction (HRI). We extended this challenge to the research community working on multimodal interaction with the goal of sparking interest in the touch modality and to promote exploration of the use of data processing techniques from other more mature modalities for touch recognition. Two data sets were made available containing labeled pressure sensor data of social touch gestures that were performed by touching a touch-sensitive surface with the hand. Each set was collected from similar sensor grids, but under conditions reflecting different application orientations: CoST: Corpus of Social Touch and HAART: The Human-Animal Affective Robot Touch gesture set. In this paper we describe the challenge protocol and summarize the results from the touch challenge hosted in conjunction with the 2015 ACM International Conference on Multimodal Interaction (ICMI). The most important outcomes of the challenges were: (1) transferring techniques from other modalities, such as image processing, speech, and human action recognition provided valuable feature sets; (2) gesture classification confusions were similar despite the various data processing methods used.
The Grenoble System for the Social Touch Challenge at ICMI 2015 BIBAFull-Text 391-398
  Viet-Cuong Ta; Wafa Johal; Maxime Portaz; Eric Castelli; Dominique Vaufreydaz
New technologies and especially robotics is going towards more natural user interfaces. Works have been done in different modality of interaction such as sight (visual computing), and audio (speech and audio recognition) but some other modalities are still less researched. The touch modality is one of the less studied in HRI but could be valuable for naturalistic interaction. However touch signals can vary in semantics. It is therefore necessary to be able to recognize touch gestures in order to make human-robot interaction even more natural. We propose a method to recognize touch gestures. This method was developed on the CoST corpus and then directly applied on the HAART dataset as a participation of the Social Touch Challenge at ICMI 2015. Our touch gesture recognition process is detailed in this article to make it reproducible by other research teams. Besides features set description, we manually filtered the training corpus to produce 2 datasets. For the challenge, we submitted 6 different systems. A Support Vector Machine and a Random Forest classifiers for the HAART dataset. For the CoST dataset, the same classifiers are tested in two conditions: using all or filtered training datasets. As reported by organizers, our systems have the best correct rate in this year's challenge (70.91% on HAART, 61.34% on CoST). Our performances are slightly better that other participants but stay under previous reported state-of-the-art results.
Social Touch Gesture Recognition using Random Forest and Boosting on Distinct Feature Sets BIBAFull-Text 399-406
  Yona Falinie A. Gaus; Temitayo Olugbade; Asim Jan; Rui Qin; Jingxin Liu; Fan Zhang; Hongying Meng; Nadia Bianchi-Berthouze
Touch is a primary nonverbal communication channel used to communicate emotions or other social messages. Despite its importance, this channel is still very little explored in the affective computing field, as much more focus has been placed on visual and aural channels. In this paper, we investigate the possibility to automatically discriminate between different social touch types. We propose five distinct feature sets for describing touch behaviours captured by a grid of pressure sensors. These features are then combined together by using the Random Forest and Boosting methods for categorizing the touch gesture type. The proposed methods were evaluated on both the HAART (7 gesture types over different surfaces) and the CoST (14 gesture types over the same surface) datasets made available by the Social Touch Gesture Challenge 2015. Well above chance level performances were achieved with a 67% accuracy for the HAART and 59% for the CoST testing datasets respectively.
Recognizing Touch Gestures for Social Human-Robot Interaction BIBAFull-Text 407-413
  Tugce Balli Altuglu; Kerem Altun
In this study, we performed touch gesture recognition on two sets of data provided by "Recognition of Social Touch Gestures Challenge 2015". For the first dataset, dubbed Corpus of Social Touch (CoST), touch is performed on a mannequin arm, whereas for the second dataset (Human-Animal Affective Robot Touch -- HAART) touch is performed in a human-pet interaction setting. CoST includes 14 gestures and HAART includes 7 gestures. We used the pressure data, image features, Hurst exponent, Hjorth parameters and autoregressive model coefficients as features, and performed feature selection using sequential forward floating search. We obtained classification results around 60%-70% for the HAART dataset. For the CoST dataset, the results range from 26% to 95% depending on the selection of the training/test sets.
Detecting and Identifying Tactile Gestures using Deep Autoencoders, Geometric Moments and Gesture Level Features BIBAFull-Text 415-422
  Dana Hughes; Nicholas Farrow; Halley Profita; Nikolaus Correll
While several sensing modalities and transduction approaches have been developed for tactile sensing in robotic skins, there has been much less work towards extracting features for or identifying high-level gestures performed on the skin. In this paper, we investigate using deep neural networks with hidden Markov models (DNN-HMMs), geometric moments and gesture level features to identify a set of gestures performed on robotic skins. We demonstrate that these features are useful for identifying gestures, and predict a set of gestures from a 14-class dataset with 56% accuracy, and a 7-class dataset with 71% accuracy.

Grand Challenge 2: Emotion Recognition in the Wild Challenge 2015

Video and Image based Emotion Recognition Challenges in the Wild: EmotiW 2015 BIBAFull-Text 423-426
  Abhinav Dhall; O. V. Ramana Murthy; Roland Goecke; Jyoti Joshi; Tom Gedeon
The third Emotion Recognition in the Wild (EmotiW) challenge 2015 consists of an audio-video based emotion and static image based facial expression classification sub-challenges, which mimics real-world conditions. The two sub-challenges are based on the Acted Facial Expression in the Wild (AFEW) 5.0 and the Static Facial Expression in the Wild (SFEW) 2.0 databases, respectively. The paper describes the data, baseline method, challenge protocol and the challenge results. A total of 12 and 17 teams participated in the video based emotion and image based expression sub-challenges, respectively.
Hierarchical Committee of Deep CNNs with Exponentially-Weighted Decision Fusion for Static Facial Expression Recognition BIBAFull-Text 427-434
  Bo-Kyeong Kim; Hwaran Lee; Jihyeon Roh; Soo-Young Lee
We present a pattern recognition framework to improve committee machines of deep convolutional neural networks (deep CNNs) and its application to static facial expression recognition in the wild (SFEW). In order to generate enough diversity of decisions, we trained multiple deep CNNs by varying network architectures, input normalization, and weight initialization as well as by adopting several learning strategies to use large external databases. Moreover, with these deep models, we formed hierarchical committees using the validation-accuracy-based exponentially-weighted average (VA-Expo-WA) rule. Through extensive experiments, the great strengths of our committee machines were demonstrated in both structural and decisional ways. On the SFEW2.0 dataset released for the 3rd Emotion Recognition in the Wild (EmotiW) sub-challenge, a test accuracy of 57.3% was obtained from the best single deep CNN, while the single-level committees yielded 58.3% and 60.5% with the simple average rule and with the VA-Expo-WA rule, respectively. Our final submission based on the 3-level hierarchy using the VA-Expo-WA achieved 61.6%, significantly higher than the SFEW baseline of 39.1%.
Image based Static Facial Expression Recognition with Multiple Deep Network Learning BIBAFull-Text 435-442
  Zhiding Yu; Cha Zhang
We report our image based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNN). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: by minimizing the log likelihood loss, and by minimizing the hinge loss. Our proposed method generates state-of-the-art result on the FER dataset. It also achieves 55.96% and 61.29% respectively on the validation and test set of SFEW 2.0, surpassing the challenge baseline of 35.96% and 39.13% with significant gains.
Deep Learning for Emotion Recognition on Small Datasets using Transfer Learning BIBAFull-Text 443-449
  Hong-Wei Ng; Viet Dung Nguyen; Vassilios Vonikakis; Stefan Winkler
This paper presents the techniques employed in our team's submissions to the 2015 Emotion Recognition in the Wild contest, for the sub-challenge of Static Facial Expression Recognition in the Wild. The objective of this sub-challenge is to classify the emotions expressed by the primary human subject in static images extracted from movies. We follow a transfer learning approach for deep Convolutional Neural Network (CNN) architectures. Starting from a network pre-trained on the generic ImageNet dataset, we perform supervised fine-tuning on the network in a two-stage process, first on datasets relevant to facial expressions, followed by the contest's dataset. Experimental results show that this cascading fine-tuning approach achieves better results, compared to a single stage fine-tuning with the combined datasets. Our best submission exhibited an overall accuracy of 48.5% in the validation set and 55.6% in the test set, which compares favorably to the respective 35.96% and 39.13% of the challenge baseline.
Capturing AU-Aware Facial Features and Their Latent Relations for Emotion Recognition in the Wild BIBAFull-Text 451-458
  Anbang Yao; Junchao Shao; Ningning Ma; Yurong Chen
The Emotion Recognition in the Wild (EmotiW) Challenge has been held for three years. Previous winner teams primarily focus on designing specific deep neural networks or fusing diverse hand-crafted and deep convolutional features. They all neglect to explore the significance of the latent relations among changing features resulted from facial muscle motions. In this paper, we study this recognition challenge from the perspective of analyzing the relations among expression-specific facial features in an explicit manner. Our method has three key components. First, we propose a pair-wise learning strategy to automatically seek a set of facial image patches which are important for discriminating two particular emotion categories. We found these learnt local patches are in part consistent with the locations of expression-specific Action Units (AUs), thus the features extracted from such kind of facial patches are named AU-aware facial features. Second, in each pair-wise task, we use an undirected graph structure, which takes learnt facial patches as individual vertices, to encode feature relations between any two learnt facial patches. Finally, a robust emotion representation is constructed by concatenating all task-specific graph-structured facial feature relations sequentially. Extensive experiments on the EmotiW 2015 Challenge testify the efficacy of the proposed approach. Without using additional data, our final submissions achieved competitive results on both sub-challenges including the image based static facial expression recognition (we got 55.38% recognition accuracy outperforming the baseline 39.13% with a margin of 16.25%) and the audio-video based emotion recognition (we got 53.80% recognition accuracy outperforming the baseline 39.33% and the 2014 winner team's final result 50.37% with the margins of 14.47% and 3.43%, respectively).
Contrasting and Combining Least Squares Based Learners for Emotion Recognition in the Wild BIBAFull-Text 459-466
  Heysem Kaya; Furkan Gürpinar; Sadaf Afshar; Albert Ali Salah
This paper presents our contribution to ACM ICMI 2015 Emotion Recognition in the Wild Challenge (EmotiW 2015). We participate in both static facial expression (SFEW) and audio-visual emotion recognition challenges. In both challenges, we use a set of visual descriptors and their early and late fusion schemes. For AFEW, we also exploit a set of popularly used spatio-temporal modeling alternatives and carry out multi-modal fusion. For classification, we employ two least squares regression based learners that are shown to be fast and accurate on former EmotiW Challenge corpora. Specifically, we use Partial Least Squares Regression (PLS) and Kernel Extreme Learning Machines (ELM), which is closely related to Kernel Regularized Least Squares. We use a General Procrustes Analysis (GPA) based alignment for face registration. By employing different alignments, descriptor types, video modeling strategies and classifiers, we diversify learners to improve the final fusion performance. Test set accuracies reached in both challenges are relatively 25% above the respective baselines.
Recurrent Neural Networks for Emotion Recognition in Video BIBAFull-Text 467-474
  Samira Ebrahimi Kahou; Vincent Michalski; Kishore Konda; Roland Memisevic; Christopher Pal
Deep learning based approaches to facial analysis and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition and activity recognition. In the case of video, information often must be aggregated across a variable length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
Multiple Models Fusion for Emotion Recognition in the Wild BIBAFull-Text 475-481
  Jianlong Wu; Zhouchen Lin; Hongbin Zha
Emotion recognition in the wild is a very challenging task. In this paper, we propose a multiple models fusion method to automatically recognize the expression in the video clip as part of the third Emotion Recognition in the Wild Challenge (EmotiW 2015). In our method, we first extract dense SIFT, LBP-TOP and audio features from each video clip. For dense SIFT features, we use the bag of features (BoF) model with two different encoding methods (locality-constrained linear coding and group saliency based coding) to further represent it. During the classification process, we use partial least square regression to calculate the regression value of each model. By learning the optimal weight of each model based on the regression value, we fuse these models together. We conduct experiments on the given validation and test datasets, and achieve superior performance. The best recognition accuracy of our fusion method is 52.50% on the test dataset, which is 13.17% higher than the challenge baseline accuracy of 39.33%.
A Deep Feature based Multi-kernel Learning Approach for Video Emotion Recognition BIBAFull-Text 483-490
  Wei Li; Farnaz Abtahi; Zhigang Zhu
In this paper, we describe our proposed approach for participating in the Third Emotion Recognition in the Wild Challenge (EmotiW 2015). We focus on the sub-challenge of Audio-Video Based Emotion Recognition using the AFEW dataset. The AFEW dataset consists of 7 emotion groups corresponding to the 7 basic emotions. Each group includes multiple videos from movie clips with people acting a certain emotion. In our approach, we extract LBP-TOP-based video features, openEAR energy/spectral-based audio features, and CNN (convolutional neural network) based deep image features by fine-tuning a pre-trained model with extra emotion images from the web. For each type of features, we run a SVM grid search to find the best RBF kernel. Then multi-kernel learning is employed to combine the RBF kernels to accomplish the feature fusion and generate a fused RBF kernel. Running multi-class SVM classification, we achieve a 45.23% test accuracy on the AFEW dataset. We then apply a decision optimization method to adjust the label distribution closer to the ground truth, by setting offsets for some of the classifiers' prediction confidence score. By applying this modification, the test accuracy increases to 50.46%, which is a significant improvement comparing to the baseline accuracy 39.33%.
Transductive Transfer LDA with Riesz-based Volume LBP for Emotion Recognition in The Wild BIBAFull-Text 491-496
  Yuan Zong; Wenming Zheng; Xiaohua Huang; Jingwei Yan; Tong Zhang
In this paper, we propose the method using Transductive Transfer Linear Discriminant Analysis (TTLDA) and Riesz-based Volume Local Binary Patterns (RVLBP) for image based static facial expression recognition challenge of the Emotion Recognition in the Wild Challenge (EmotiW 2015). The task of this challenge is to assign facial expression labels to frames of some movies containing a face under the real word environment. In our method, we firstly employ a multi-scale image partition scheme to divide each face image into some image blocks and use RVLBP features extracted from each block to describe each facial image. Then, we adopt the TTLDA approach based on RVLBP to cope with the expression recognition task. The experiments on the testing data of SFEW 2.0 database, which is used for image based static facial expression challenge, demonstrate that our method achieves the accuracy of 50%. This result has a 10.87% improvement over the baseline provided by this challenge organizer.
Combining Multimodal Features within a Fusion Network for Emotion Recognition in the Wild BIBAFull-Text 497-502
  Bo Sun; Liandong Li; Guoyan Zhou; Xuewen Wu; Jun He; Lejun Yu; Dongxue Li; Qinglan Wei
In this paper, we describe our work in the third Emotion Recognition in the Wild (EmotiW 2015) Challenge. For each video clip, we extract MSDF, LBP-TOP, HOG, LPQ-TOP and acoustic features to recognize the emotions of film characters. For the static facial expression recognition based on video frame, we extract MSDF, DCNN and RCNN features. We train linear SVM classifiers for these kinds of features on the AFEW and SFEW dataset, and we propose a novel fusion network to combine all the extracted features at decision level. The final achievement we gained is 51.02% on the AFEW testing set and 51.08% on the SFEW testing set, which are much better than the baseline recognition rate of 39.33% and 39.13%.
Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns BIBAFull-Text 503-510
  Gil Levi; Tal Hassner
We present a novel method for classifying emotions from static facial images. Our approach leverages on the recent success of Convolutional Neural Networks (CNN) on face recognition problems. Unlike the settings often assumed there, far less labeled data is typically available for training emotion classification systems. Our method is therefore designed with the goal of simplifying the problem domain by removing confounding factors from the input images, with an emphasis on image illumination variations. This, in an effort to reduce the amount of data required to effectively train deep CNN models. To this end, we propose novel transformations of image intensities to 3D spaces, designed to be invariant to monotonic photometric transformations. These are applied to CASIA Webface images which are then used to train an ensemble of multiple architecture CNNs on multiple representations. Each model is then fine-tuned with limited emotion labeled training data to obtain final classification models. Our method was tested on the Emotion Recognition in the Wild Challenge (EmotiW 2015), Static Facial Expression Recognition sub-challenge (SFEW) and shown to provide a substantial, 15.36% improvement over baseline results (40% gain in performance).
Quantification of Cinematography Semiotics for Video-based Facial Emotion Recognition in the EmotiW 2015 Grand Challenge BIBAFull-Text 511-518
  Albert C. Cruz
The Emotion Recognition in the Wild challenge poses significant problems to state of the art auditory and visual affect quantification systems. To overcome the challenges, we investigate supplementary meta features based on film semiotics. Movie scenes are often presented and arranged in such a way as to amplify the emotion interpreted by the viewing audience. This technique is referred to as mise en scene in the film industry and involves strict and intentional control of color palette, light source color, and arrangement of actors and objects in the scene. To this end, two algorithms for extracting mise en scene information are proposed. Rule of thirds based motion history histograms detect motion along rule of thirds guidelines. Rule of thirds color layout descriptors compactly describe a scene at rule of thirds intersections. A comprehensive system is proposed that measures expression, emotion, vocalics, syntax, semantics, and film-based meta information. The proposed mise en scene features have a higher classification rate and ROC area than LBP-TOP features on the validation set of the EmotiW 2015 challenge. The complete system improves classification performance over the baseline algorithm by 3.17% on the testing set.
Affect Recognition using Key Frame Selection based on Minimum Sparse Reconstruction BIBAFull-Text 519-524
  Mehmet Kayaoglu; Cigdem Eroglu Erdem
In this paper, we present the methods used for Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) and also audio which contains the speech of the person in the clip as well as other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video in the best possible way. We extract the LPQ features of the key frames and average them to determine a single feature vector that will represent the video component of the clip. In order to represent the temporal variations of the facial expression, we also use the LBP-TOP features extracted from the whole video. The audio features are extracted using OpenSMILE or RASTA-PLP methods. Video and audio features are classified using SVM classifiers and fused at the score level. We tested eight different combinations of audio and visual features on the AFEW 5.0 (Acted Facial Expressions in the Wild) database provided by the challenge organizers. The best visual and audio-visual accuracies obtained on the test set are 45.1% and 49.9% respectively, whereas the video-based baseline for the challenge is given as 39.3%.

Grand Challenge 3: Multimodal Learning and Analytics Grand Challenge 2015

2015 Multimodal Learning and Analytics Grand Challenge BIBAFull-Text 525-529
  Marcelo Worsley; Katherine Chiluiza; Joseph F. Grafsgaard; Xavier Ochoa
Multimodality is an integral part of teaching and learning. Over the past few decades researchers have been designing, creating and analyzing novel environments that enable students to experience and demonstrate learning through a variety of modalities. The recent availability of low cost multimodal sensors, advances in artificial intelligence and improved techniques for large scale data analysis have enabled researchers and practitioners to push the boundaries on multimodal learning and multimodal learning analytics. In an effort to continue these developments, the 2015 Multimodal Learning and Analytics Grand Challenge includes a combined focus on new techniques to capture multimodal learning data, as well as the development of rich, multimodal learning applications.
Providing Real-time Feedback for Student Teachers in a Virtual Rehearsal Environment BIBAFull-Text 531-537
  Roghayeh Barmaki; Charles E. Hughes
Research in learning analytics and educational data mining has recently become prominent in the fields of computer science and education. Most scholars in the field emphasize student learning and student data analytics; however, it is also important to focus on teaching analytics and teacher preparation because of their key roles in student learning, especially in K-12 learning environments. Nonverbal communication strategies play an important role in successful interpersonal communication of teachers with their students. In order to assist novice or practicing teachers with exhibiting open and affirmative nonverbal cues in their classrooms, we have designed a multimodal teaching platform with provisions for online feedback. We used an interactive teaching rehearsal software, TeachLivE, as our basic research environment. TeachLivE employs a digital puppetry paradigm as its core technology. Individuals walk into this virtual environment and interact with virtual students displayed on a large screen. They can practice classroom management, pedagogy and content delivery skills with a teaching plan in the TeachLivE environment. We have designed an experiment to evaluate the impact of an online nonverbal feedback application. In this experiment, different types of multimodal data have been collected during two experimental settings. These data include talk-time and nonverbal behaviors of the virtual students, captured in log files; talk time and full body tracking data of the participant; and video recording of the virtual classroom with the participant. 34 student teachers participated in this 30-minute experiment. In each of the settings, the participants were provided with teaching plans from which they taught. All the participants took part in both of the experimental settings. In order to have a balanced experiment design, half of the participants received nonverbal online feedback in their first session and the other half received this feedback in the second session. A visual indication was used for feedback each time the participant exhibited a closed, defensive posture. Based on recorded full-body tracking data, we observed that only those who received feedback in their first session demonstrated a significant number of open postures in the session containing no feedback. However, the post-questionnaire information indicated that all participants were more mindful of their body postures while teaching after they had participated in the study.
Presentation Trainer, your Public Speaking Multimodal Coach BIBAFull-Text 539-546
  Jan Schneider; Dirk Börner; Peter van Rosmalen; Marcus Specht
The Presentation Trainer is a multimodal tool designed to support the practice of public speaking skills, by giving the user real-time feedback about different aspects of her nonverbal communication. It tracks the user's voice and body to interpret her current performance. Based on this performance the Presentation Trainer selects the type of intervention that will be presented as feedback to the user. This feedback mechanism has been designed taking in consideration the results from previous studies that show how difficult it is for learners to perceive and correctly interpret real-time feedback while practicing their speeches. In this paper we present the user experience evaluation of participants who used the Presentation Trainer to practice for an elevator pitch, showing that the feedback provided by the Presentation Trainer has a significant influence on learning.
Utilizing Depth Sensors for Analyzing Multimodal Presentations: Hardware, Software and Toolkits BIBAFull-Text 547-556
  Chee Wee Leong; Lei Chen; Gary Feng; Chong Min Lee; Matthew Mulholland
Body language plays an important role in learning processes and communication. For example, communication research produced evidence that mathematical knowledge can be embodied in gestures made by teachers and students. Likewise, body postures and gestures are also utilized by speakers in oral presentations to convey ideas and important messages. Consequently, capturing and analyzing non-verbal behaviors is an important aspect in multimodal learning analytics (MLA) research. With regard to sensing capabilities, the introduction of depth sensors such as the Microsoft Kinect has greatly facilitated research and development in this area. However, the rapid advancement in hardware and software capabilities is not always in sync with the expanding set of features reported in the literature. For example, though Anvil is a widely used state-of-the-art annotation and visualization toolkit for motion traces, its motion recording component based on OpenNI is outdated. As part of our research in developing multimodal educational assessments, we began an effort to develop and standardize algorithms for purposes of multimodal feature extraction and creating automated scoring models. This paper provides an overview of relevant work in multimodal research on educational tasks, and proceeds to summarize our work using multimodal sensors in developing assessments of communication skills, with attention on the use of depth sensors. Specifically, we focus on the task of public speaking assessment using Microsoft Kinect. Additionally, we introduce an open-source Python package for computing expressive body language features from Kinect motion data, which we hope will benefit the MLA research community.
Multimodal Capture of Teacher-Student Interactions for Automated Dialogic Analysis in Live Classrooms BIBAFull-Text 557-566
  Sidney K. D'Mello; Andrew M. Olney; Nathan Blanchard; Borhan Samei; Xiaoyi Sun; Brooke Ward; Sean Kelly
We focus on data collection designs for the automated analysis of teacher-student interactions in live classrooms with the goal of identifying instructional activities (e.g., lecturing, discussion) and assessing the quality of dialogic instruction (e.g., analysis of questions). Our designs were motivated by multiple technical requirements and constraints. Most importantly, teachers could be individually micfied but their audio needed to be of excellent quality for automatic speech recognition (ASR) and spoken utterance segmentation. Individual students could not be micfied but classroom audio quality only needed to be sufficient to detect student spoken utterances. Visual information could only be recorded if students could not be identified. Design 1 used an omnidirectional laptop microphone to record both teacher and classroom audio and was quickly deemed unsuitable. In Designs 2 and 3, teachers wore a wireless Samson AirLine 77 vocal headset system, which is a unidirectional microphone with a cardioid pickup pattern. In Design 2, classroom audio was recorded with dual first-generation Microsoft Kinects placed at the front corners of the class. Design 3 used a Crown PZM-30D pressure zone microphone mounted on the blackboard to record classroom audio. Designs 2 and 3 were tested by recording audio in 38 live middle school classrooms from six U.S. schools while trained human coders simultaneously performed live coding of classroom discourse. Qualitative and quantitative analyses revealed that Design 3 was suitable for three of our core tasks: (1) ASR on teacher speech (word recognition rate of 66% and word overlap rate of 69% using Google Speech ASR engine); (2) teacher utterance segmentation (F-measure of 97%); and (3) student utterance segmentation (F-measure of 66%). Ideas to incorporate video and skeletal tracking with dual second-generation Kinects to produce Design 4 are discussed.
Multimodal Selfies: Designing a Multimodal Recording Device for Students in Traditional Classrooms BIBAFull-Text 567-574
  Federico Domínguez; Katherine Chiluiza; Vanessa Echeverria; Xavier Ochoa
The traditional recording of student interaction in classrooms has raised privacy concerns in both students and academics. However, the same students are happy to share their daily lives through social media. Perception of data ownership is the key factor in this paradox. This article proposes the design of a personal Multimodal Recording Device (MRD) that could capture the actions of its owner during lectures. The MRD would be able to capture close-range video, audio, writing, and other environmental signals. Differently from traditional centralized recording systems, students would have control over their own recorded data. They could decide to share their information in exchange of access to the recordings of the instructor, notes form their classmates, and analysis of, for example, their attention performance. By sharing their data, students participate in the co-creation of enhanced and synchronized course notes that will benefit all the participating students. This work presents details about how such a device could be build from available components. This work also discusses and evaluates the design of such device, including its foreseeable costs, scalability, flexibility, intrusiveness and recording quality.

Doctoral Consortium

Temporal Association Rules for Modelling Multimodal Social Signals BIBAFull-Text 575-579
  Thomas Janssoone
In this paper, we present the first step of a methodology dedicated to deduce automatically sequences of signals expressed by humans during an interaction. The aim is to link interpersonal stances with arrangements of social signals such as modulations of Action Units and prosody during a face-to-face exchange. The long-term goal is to infer association rules of signals. We plan to use them as an input to the animation of an Embodied Conversational Agent (ECA). In this paper, we illustrate the proposed methodology to the SEMAINE-DB corpus from which we automatically extracted Action Units (AUs), head positions, turn-taking and prosody information. We have applied the data mining algorithm that is used to find the sequences of social signals featuring different social stances. We finally discuss our primary results focusing on given AUs (smiles and eyebrows) and the perspectives of this method.
Detecting and Synthesizing Synchronous Joint Action in Human-Robot Teams BIBAFull-Text 581-585
  Tariq Iqbal; Laurel D. Riek
To become capable teammates to people, robots need the ability to interpret human activities and appropriately adjust their actions in real time. The goal of our research is to build robots that can work fluently and contingently with human teams. To this end, we have designed novel nonlinear dynamical methods to automatically model and detect synchronous joint action (SJA) in human teams. We also have extended this work to enable robots to move jointly with human teammates in real time. In this paper, we describe our work to date, and discuss our future research plans to further explore this research space. The results of this work are expected to benefit researchers in social signal processing, human-machine interaction, and robotics.
Micro-opinion Sentiment Intensity Analysis and Summarization in Online Videos BIBAFull-Text 587-591
  Amir Zadeh
There has been substantial progress in the field of text based sentiment analysis but little effort has been made to incorporate other modalities. Previous work in sentiment analysis has shown that using multimodal data yields to more accurate models of sentiment. Efforts have been made towards expressing sentiment as a spectrum of intensity rather than just positive or negative. Such models are useful not only for detection of positivity or negativity, but also giving out a score of how positive or negative a statement is. Based on the state of the art studies in sentiment analysis, prediction in terms of sentiment score is still far from accurate, even in large datasets [27]. Another challenge in sentiment analysis is dealing with small segments or micro opinions as they carry less context than large segments thus making analysis of the sentiment harder. This paper presents a Ph.D. thesis shaped towards comprehensive studies in multimodal micro-opinion sentiment intensity analysis.
Attention and Engagement Aware Multimodal Conversational Systems BIBAFull-Text 593-597
  Zhou Yu
Despite their ability to complete certain tasks, dialog systems still suffer from poor adaptation to users' engagement and attention. We observe human behaviors in different conversational settings to understand human communication dynamics and then transfer the knowledge to multimodal dialog system design. To focus solely on maintaining engaging conversations, we design and implement a non-task oriented multimodal dialog system, which serves as a framework for controlled multimodal conversation analysis. We design computational methods to model user engagement and attention in real time by leveraging automatically harvested multimodal human behaviors, such as smiles and speech volume. We aim to design and implement a multimodal dialog system to coordinate with users' engagement and attention on the fly via techniques such as adaptive conversational strategies and incremental speech production.
Implicit Human-computer Interaction: Two Complementary Approaches BIBAFull-Text 599-603
  Julia Wache
One of the main goals in Human Computer Interaction (HCI) is improving the interface between users and computers: Interfacing should be intuitive, effortless and easy to learn. We approach the goal from two opposite but complementary directions: On the one hand, computer-user interaction can be enhanced if the computer can assess users differences in an automated manner. Therefore we collected physiological and psychological data from people exposed to emotional stimuli and created a database for the community to use for further research in the context of automated learning to detect the differences in the inner states of users. We employed the data both to not only predict the emotional state of users but also their personality traits. On the other hand, users need information dispatched by a computer to be easily, intuitively accessible. To minimize the cognitive effort of assimilating information we use a tactile device in form of a belt and test how it can be best used to replace or augment the information received from other senses (e.g., visual and auditory) in a navigation task. We investigate how both approaches can be combined to improve specific applications.
Instantaneous and Robust Eye-Activity Based Task Analysis BIBAFull-Text 605-609
  Hoe Kin Wong
Task analysis using eye-activity has previously been used for estimating cognitive load on a per-task basis. However, since pupil size is a continuous physiological signal, eye-based classification accuracy of cognitive load can be improved by considering cognitive load at a higher temporal resolution and incorporating models of the interactions between the task-evoked pupillary response (TEPR) and other pupillary responses such as the Pupillary Light Reflex into the classification model. In this work, methods of using eye-activity as a measure of continuous mental load will be investigated. Subsequently pupil light reflex models will be incorporated into task analysis to investigate the possibility of enhancing the reliability of cognitive load estimation in varied lighting conditions. This will culminate in the development and evaluation of a classification system which measures rapidly changing cognitive load. Task analysis of this calibre will enable interfaces in wearable optical devices to be constantly aware of the user's mental state and control information flow to prevent information overload and interruptions.
Challenges in Deep Learning for Multimodal Applications BIBAFull-Text 611-615
  Sayan Ghosh
This consortium paper outlines a research plan for investigating deep learning techniques as applied to multimodal multi-task learning and multimodal fusion. We discuss our prior research results in this area, and how these results motivate us to explore more in this direction. We also define concrete steps of enquiry we wish to undertake as a short-term goal, and further outline some other challenges of multimodal learning using deep neural networks, such as inter and intra-modality synchronization, robustness to noise in modality data acquisition, and data insufficiency.
Exploring Intent-driven Multimodal Interface for Geographical Information System BIBAFull-Text 617-621
  Feng Sun
Geographic Information Systems (GIS) offers a large amount of functions for performing spatial analysis and geospatial information retrieval. However, off-the-shelf GIS remains difficult to use for occasional GIS experts. The major problem lies in that its interface organizes spatial analysis tools and functions according to spatial data structures and corresponding algorithms, which is conceptually confusing and cognitively complex. Prior work identified the usability problem of conventional GIS interface and developed alternatives based on speech or gesture to narrow the gap between the high-functionality provided by GIS and its usability. This paper outlined my doctoral research goal in understanding human-GIS interaction activity, especially how interaction modalities assist to capture spatial analysis intention and influence collaborative spatial problem solving. We proposed a framework for enabling multimodal human-GIS interaction driven by intention. We also implemented a prototype GeoEASI (Geo-dialogue Environment for Assisted Spatial Inquiry) to demonstrate the effectiveness of our framework. GeoEASI understands commonly known spatial analysis intentions through multimodal techniques and is able to assist users to perform spatial analysis with proper strategies. Further work will evaluate the effectiveness of our framework, improve the reliability and flexibility of the system, extend the GIS interface for supporting multiple users, and integrate the system into GeoDeliberation. We will concentrate on how multimodality technology can be adopted in these circumstances and explore the potentials of it. The study aims to demonstrate the feasibility of building a GIS to be both useful and usable by introducing an intent-driven multimodal interface, forming the key to building a better theory of spatial thinking for GIS.
Software Techniques for Multimodal Input Processing in Realtime Interactive Systems BIBAFull-Text 623-627
  Martin Fischbach
Multimodal interaction frameworks are an efficient means of utilizing many existing processing and fusion techniques in a wide variety of application areas, even by non-experts. However, the application of these frameworks to highly interactive application areas like VR, AR, MR, and computer games in a reusable, modifiable, and modular manner is not straightforward. It currently lacks some software technical solutions that (1) preserve the general decoupling principle of platforms and at the same time (2) provide the required close temporal as well as semantic coupling of involved software modules and multimodal processing steps. This thesis approaches current challenges and aims at providing the research community with a framework that fosters repeatability of scientific achievements and the ability to built on previous results.
Gait and Postural Sway Analysis, A Multi-Modal System BIBAFull-Text 629-633
  Hafsa Ismail
Detecting a fall before it actually happens will positively affect lives of the elderly. While the main causes of falling are related to postural sway and walking, determining abnormalities in one of these activities or both of them would be informative to predicting the fall probability. A need exists for a portable gait and postural sway analysis system that can provide individuals with real-time information about changes and quality of gait in the real world, not just in a laboratory. In this research project I aim to build a multi-modal system that finds the correlation between vision extracted features and accelerometer and force plate data to determine a general gait and body sway pattern. Then this information is used to assess a difference to normative age and gender relevant patterns as well as any changes over time. This could provide a core indicator of broader health and function in ageing and disease.
A Computational Model of Culture-Specific Emotion Detection for Artificial Agents in the Learning Domain BIBAFull-Text 635-639
  Ganapreeta R. Naidu
Nowadays, intelligent agents are expected to be affect-sensitive as agents are becoming essential entities that supports computer-mediated tasks, especially in teaching and training. These agents use common natural modalities-such as facial expressions, gestures and eye gaze in order to recognize a user's affective state and respond accordingly. However, these nonverbal cues may not be universal as emotion recognition and expression differ from culture to culture. It is important that intelligent interfaces are equipped with the abilities to meet the challenge of cultural diversity to facilitate human-machine interaction particularly in Asia. Asians are known to be more passive and possess certain traits such as indirectness and non-confrontationalism, which lead to emotions such as (culture-specific form of) shyness and timidity. Therefore, a model based on other culture may not be applicable in an Asian setting, overruling a one-size-fits-all approach. This study is initiated to identify the discriminative markers of culture-specific emotions based on the multimodal interactions.
Record, Transform & Reproduce Social Encounters in Immersive VR: An Iterative Approach BIBAFull-Text 641-644
  Jan Kolkmeier
Immersive Virtual Reality Environments that can be accessed through multimodal natural interfaces will bring new affordances to mediated interaction with virtual embodied agents and avatars. Such interfaces will measure, amongst others, users' poses and motion which can be copied to an embodied avatar representation of the user that is situated in a virtual or augmented reality space shared with autonomous virtual agents and human controlled or semi-autonomous avatars. Designers of such environments will be challenged to facilitate believable social interactions by creating agents or semi-autonomous avatars that can respond meaningfully to users' natural behaviors, as captured by these interfaces. In our future research, we aim to realize such interactions to create rich social encounters in immersive Virtual Reality. In this current work, we present the approach we envisage to analyze and learn agent behavior from human-agent interaction in an iterative fashion. We specifically look at small-scale, 'regulative' nonverbal behaviors. Agents inform their behavior on previous observations, observing responses that these behaviors elicit in new users, thus iteratively generating corpora of short, situated human-agent interaction sequences that are to be analyzed, annotated and processed to generate socially intelligent agent behavior. Some choices and challenges of this approach are discussed.
Multimodal Affect Detection in the Wild: Accuracy, Availability, and Generalizability BIBAFull-Text 645-649
  Nigel Bosch
Affect detection is an important component of computerized learning environments that adapt the interface and materials to students' affect. This paper proposes a plan for developing and testing multimodal affect detectors that generalize across differences in data that are likely to occur in practical applications (e.g., time, demographic variables). Facial features and interaction log features are considered as modalities for affect detection in this scenario, each with their own advantages. Results are presented for completed work evaluating the accuracy of individual modality face- and interaction- based detectors, accuracy and availability of a multimodal combination of these modalities, and initial steps toward generalization of face-based detectors. Additional data collection needed for cross-culture generalization testing is also completed. Challenges and possible solutions for proposed cross-cultural generalization testing of multimodal detectors are also discussed.
Multimodal Assessment of Teaching Behavior in Immersive Rehearsal Environment-TeachLivE BIBAFull-Text 651-655
  Roghayeh Barmaki
Nonverbal behaviors such as facial expressions, eye contact, gestures, and body movements in general have strong impacts on the process of communicative interactions. Gestures play an important role in interpersonal communication in the classroom between student and teacher. To assist teachers with exhibiting open and positive nonverbal signals in their actual classroom, we have designed a multimodal teaching application with provisions for real-time feedback in coordination with our TeachLivE test-bed environment and its reflective application; ReflectLivE. Individuals walk into this virtual environment and interact with five virtual students shown on a large screen display. The recent research study is designed to have two settings (7-minute long each). In each of the settings, the participants are provided lesson plans from which they teach. All the participants are asked to take part in both settings, with half receiving automated real-time feedback about their body poses in the first session (group 1) and the other half receiving such feedback in the second session (group 2). Feedback is in the form of a visual indication each time the participant exhibits a closed stance. To create this automated feedback application, a closed posture corpus was collected and trained based on the existing TeachLivE teaching records. After each session, the participants take a post-questionnaire about their experience. We hypothesize that visual feedback improves positive body gestures for both groups during the feedback session, and that, for group 2, this persists into their second unaided session but, for group 1, improvements occur only during the second session.