
Proceedings of the 2012 International Conference on Multimodal Interaction

Fullname: Proceedings of the 14th ACM International Conference on Multimodal Interaction
Editors: Louis-Philippe Morency; Dan Bohus; Hamid Aghajan; Justine Cassell; Anton Nijholt; Julien Epps
Location: Santa Monica, California
Dates: 2012-Oct-22 to 2012-Oct-26
Standard No: ISBN: 978-1-4503-1467-1; ACM DL: Table of Contents; hcibib: ICMI12
Links: Conference Website
Summary:Welcome to Santa Monica and to the 14th edition of the International Conference on Multimodal Interaction, ICMI 2012. ICMI is the premier international forum for multidisciplinary research on multimodal human-human and human-computer interaction, interfaces, and system development.
     We had a record number of submissions this year: 147 (74 long papers, 49 short papers, 5 special session papers and 19 demo papers). From these submissions, we accepted 15 papers for long oral presentation (20.3% acceptance rate), 10 papers for short oral presentation (20.4% acceptance rate) and 19 papers presented as posters. We have a total acceptance rate of 35.8% for all short and long papers. 12 of the 19 demo papers were accepted. All 5 special session papers were directly invited by the organizers and the papers were all accepted. In addition, the program includes three invited Keynote talks.
    One of the two novelties introduced at ICMI this year is the Multimodal Grand Challenges. Developing systems that can robustly understand human-human communication or respond to human input requires identifying the best algorithms and their failure modes. In fields such as computer vision, speech recognition, and computational linguistics, the availability of datasets and common tasks has led to great progress. This year, we accepted four challenge workshops: the Audio-Visual Emotion Challenge (AVEC), the Haptic Voice Recognition challenge, the D-META challenge, and the Brain-Computer Interface challenge. Stefanie Tellex and Daniel Gatica-Perez are co-chairing the grand challenges this year. All four Grand Challenges will be presented on Monday, October 22nd, and a summary session will take place on the afternoon of Wednesday, October 24th, during the main conference.
    The second novelty at ICMI this year is the Doctoral Consortium, a separate one-day event to take place on Monday, October 22nd, co-chaired by Bilge Mutlu and Carlos Busso. The goal of the Doctoral Consortium is to provide Ph.D. students with an opportunity to present their work to a group of mentors and peers from a diverse set of academic and industrial backgrounds and institutions, to receive feedback on their doctoral research plan and progress, and to build a cohort of young researchers interested in designing multimodal interfaces. All accepted students receive a travel grant to attend the conference. From among 25 applications, 14 students were accepted for participation and travel funding. The organizers thank the National Science Foundation (award IIS-1249319) and the conference sponsors for financial support.
    The review process was organized using the PCS submission and review system, which ICMI has used in the past. The quality of the review process was high thanks to 26 Area Chairs (ACs) who helped the Program Chairs define the Program Committee and find excellent reviewers for each paper. Once reviews were submitted, the ACs provided meta-reviews for all papers, which, along with the reviews, were sent to the authors for a rebuttal -- a step which allows authors to clarify their intentions and results in higher quality final papers. In a final step, all papers and their reviews were discussed by the Program Chairs in a two-day remote meeting in order to decide on the list of accepted submissions.
    The program was formed by grouping papers into main topics of interest for this year's conference. Following the trend in previous ICMI events and many other academic meetings, we decided to distribute the conference proceedings on USB flash drives to minimize paper consumption. The program chairs pre-selected the top-ranked paper submissions based on reviewers' and area chairs' comments. The Best Paper Award committee reviewed these top-ranked papers carefully, identified award finalists (marked in the technical program), and selected the recipients of the Best Student Paper Award and the Best Paper Award. The final award decisions will be announced at the conference banquet.
  1. Keynote 1
  2. Nonverbal / behaviour
  3. Affect
  4. Demo session 1
  5. Poster session
  6. Vision
  7. Keynote 2
  8. Special session: child-computer interaction
  9. Gestures
  10. Demo session 2
  11. Doctoral spotlight session
  12. Grand challenge overview
  13. Keynote 3
  14. Touch / taste
  15. Multimodal interaction
  16. Challenge 1: 2nd international audio/visual emotion challenge and workshop -- AVEC 2012
  17. Challenge 2: haptic voice recognition grand challenge
  18. Challenge 3: BCI grand challenge: brain-computer interfaces as intelligent sensors for enhancing human-computer interaction
  19. Workshop overview

Keynote 1

The co-operative, transformative organization of human action and knowledge BIBAFull-Text 1-2
  Charles Goodwin
Human action is built by actively and simultaneously combining materials with intrinsically different properties into situated contextual configurations where they can mutually elaborate each other to create a whole that is both different from, and greater than, any of its constitutive parts. These resources include many different kinds of lexical and syntactic structures, prosody, gesture, embodied participation frameworks, sequential organization, and different kinds of materials in the environment, including tools created by others that structure local perception. The simultaneous use of different kinds of resources to build single actions has a number of consequences. First, different actors can contribute different kinds of materials that are implicated in the construction of a single action. For example, embodied visual displays by hearers operate simultaneously on emerging talk by a speaker so that both the utterance and the turn have intrinsic organization that is both multi-party and multimodal. Someone with aphasia who is unable to produce lexical and syntactic structure can nonetheless contribute crucial prosodic and sequential materials to a local action, while appropriating the lexical contributions of others, and thus become a powerful speaker in conversation, despite catastrophically impoverished language. One effect of this simultaneous, distributed heterogeneity is that frequently the organization of action cannot be easily equated with the activities of single individuals, such as the person speaking at the moment, or with phenomena within a single medium such as talk. Second, subsequent action is frequently built through systematic transformations of the different kinds of materials provided by a prior action. In this process some elements of the prior contextual configuration, such as the encompassing participation framework, may remain unchanged, while others undergo significant modification.
A punctual perspective on action, in which separate actions discretely follow one another, thus becomes more complex when action is seen to emerge within an unfolding mosaic of disparate materials and time frames which make possible not only systematic change, but also more enduring frameworks that provide crucial continuity. Third, the distributed, compositional structure of action provides a framework for developing the skills of newcomers within structured collaborative action. Fourth, human tools differ from the tools of other animals in that, like actions in talk, they are built by combining unlike materials into a whole not found in any of the individual parts (for example using a stone, a piece of wood and leather thongs to make an ax). This same combinatorial heterogeneity sits at the heart of human action in interaction, including language use. It creates within the unfolding organization of situated activity itself the distinctive forms of transformative collaborative action in the world, including socially organized perceptual and cognitive structures and the mutual alignment of bodies to each other, which constitutes us as humans.

Nonverbal / behaviour

Two people walk into a bar: dynamic multi-party social interaction with a robot agent BIBAFull-Text 3-10
  Mary Ellen Foster; Andre Gaschler; Manuel Giuliani; Amy Isard; Maria Pateraki; Ronald P. A. Petrick
We introduce a humanoid robot bartender that is capable of dealing with multiple customers in a dynamic, multi-party social setting. The robot system incorporates state-of-the-art components for computer vision, linguistic processing, state management, high-level reasoning, and robot control. In a user evaluation, 31 participants interacted with the bartender in a range of social situations. Most customers successfully obtained a drink from the bartender in all scenarios, and the factors that had the greatest impact on subjective satisfaction were task success and dialogue efficiency.
Changes in verbal and nonverbal conversational behavior in long-term interaction BIBAFull-Text 11-18
  Daniel Schulman; Timothy Bickmore
We present an empirical investigation of conversational behavior in dyadic interaction spanning multiple conversations, in the context of a developing interpersonal relationship between a health counselor and her clients. Using a longitudinal video corpus of behavior change counseling conversations, we show systematic changes in verbal and nonverbal behavior during greetings (within the first minute of conversations). Both the number of prior conversations and self-reported assessments of the strength of the interpersonal relationship are predictive of changes in verbal and nonverbal behavior.
   We present a model and implementation of nonverbal behavior generation for conversational agents that incorporates these findings, and discuss how the results can be applied to multimodal recognition of conversational behavior over time.
I already know your answer: using nonverbal behaviors to predict immediate outcomes in a dyadic negotiation BIBAFull-Text 19-22
  Sunghyun Park; Jonathan Gratch; Louis-Philippe Morency
Be it in our workplace or with our family or friends, negotiation is a fundamental fabric of our everyday life, and it is apparent that a system that can automatically predict negotiation outcomes would have substantial implications. In this paper, we focus on finding nonverbal behaviors that are predictive of immediate outcomes (acceptances or rejections of proposals) in a dyadic negotiation. Looking at the nonverbal behaviors of the respondent alone would be inadequate, since ample predictive information could also reside in the behaviors of the proposer, as well as in the past history between the two parties. With this intuition in mind, we show that a more accurate prediction can be achieved by considering all three (multimodal) sources of information together. We evaluate our approach on a face-to-face negotiation dataset consisting of 42 dyadic interactions and show that integrating all three sources of information outperforms each individual predictor.
Modeling dominance effects on nonverbal behaviors using granger causality BIBAFull-Text 23-26
  Kyriaki Kalimeri; Bruno Lepri; Oya Aran; Dinesh Babu Jayagopi; Daniel Gatica-Perez; Fabio Pianesi
In this paper we modeled the effects that dominant people may induce on the nonverbal behavior (speech energy and body motion) of other meeting participants using the Granger causality technique. Our initial hypothesis that more dominant people have a generalized higher influence was not validated when using the DOME-AMI corpus as data source. However, some interesting patterns emerged from the correlational analysis: contradicting our initial hypothesis, dominant individuals do not account for the majority of the causal flow in a social interaction. They do, however, seem to have more intense causal effects, as their causal density was significantly higher. Finally, dominant individuals tend to respond to causal effects more often with complementarity than with mimicry.
Multimodal human behavior analysis: learning correlation and interaction across modalities BIBAFull-Text 27-30
  Yale Song; Louis-Philippe Morency; Randall Davis
Multimodal human behavior analysis is a challenging task due to the presence of complex nonlinear correlations and interactions across modalities. We present a novel approach to this problem based on Kernel Canonical Correlation Analysis (KCCA) and Multi-view Hidden Conditional Random Fields (MV-HCRF). Our approach uses a nonlinear kernel to map multimodal data to a high-dimensional feature space and finds a new projection of the data that maximizes the correlation across modalities. We use a multi-chain structured graphical model with disjoint sets of latent variables, one set per modality, to jointly learn both view-shared and view-specific sub-structures of the projected data, capturing interaction across modalities explicitly. We evaluate our approach on a task of agreement and disagreement recognition from nonverbal audio-visual cues using the Canal 9 dataset. Experimental results show that KCCA makes capturing nonlinear hidden dynamics easier and that MV-HCRF helps learn interactions across modalities.

Affect

Consistent but modest: a meta-analysis on unimodal and multimodal affect detection accuracies from 30 studies BIBAFull-Text 31-38
  Sidney D'Mello; Jacqueline Kory
The recent influx of multimodal affect classifiers raises the important question of whether these classifiers yield accuracy rates that exceed their unimodal counterparts. This question was addressed by performing a meta-analysis on 30 published studies that reported both multimodal and unimodal affect detection accuracies. The results indicated that multimodal accuracies were consistently better than unimodal accuracies and yielded an average 8.12% improvement over the best unimodal classifiers. However, performance improvements were three times lower when classifiers were trained on natural or seminatural data (4.39% improvement) compared to acted data (12.1% improvement). Importantly, performance of the best unimodal classifier explained an impressive 80.6% (cross-validated) of the variance in multimodal accuracy. The results also indicated that multimodal accuracies were substantially higher than accuracies of the second-best unimodal classifiers (an average improvement of 29.4%) irrespective of the naturalness of the training data. Theoretical and applied implications of the findings are discussed.
Multimodal recognition of personality traits in human-computer collaborative tasks BIBAFull-Text 39-46
  Ligia Batrinca; Bruno Lepri; Nadia Mana; Fabio Pianesi
The user's personality plays an important role in the overall success of Human-Computer Interaction (HCI). The present study focuses on automatically recognizing the Big Five personality traits from 2-5 min long videos, in which the computer interacts using different levels of collaboration in order to elicit the manifestation of these personality traits. Emotional Stability and Extraversion are the easiest traits to detect automatically under the different collaborative settings: all settings for Emotional Stability, and the intermediate and fully non-collaborative settings for Extraversion. Interestingly, Agreeableness and Conscientiousness can be detected only under a moderately non-collaborative setting. Finally, our task does not seem to activate the full range of dispositions for Creativity.
Automatic detection of pain intensity BIBAFull-Text 47-52
  Zakia Hammal; Jeffrey F. Cohn
Previous efforts suggest that the occurrence of pain can be detected from the face. Can the intensity of pain be detected as well? The Prkachin and Solomon Pain Intensity (PSPI) metric was used to classify four levels of pain intensity (none, trace, weak, and strong) in 25 participants with previous shoulder injury (McMaster-UNBC Pain Archive). Participants were recorded while they completed a series of movements of their affected and unaffected shoulders. From the video recordings, the canonical normalized appearance of the face (CAPP) was extracted using active appearance modeling. To control for variation in face size, all CAPP were rescaled to 96x96 pixels. CAPP was then passed through a set of Log-Normal filters consisting of 7 frequencies and 15 orientations to extract 9216 features. To detect pain level, 4 support vector machines (SVMs) were separately trained for the automatic measurement of pain intensity on a frame-by-frame level, using both 5-fold cross-validation and leave-one-subject-out cross-validation. F1 for each level of pain intensity ranged from 91% to 96% and from 40% to 67% for 5-fold and leave-one-subject-out cross-validation, respectively. Intra-class correlation, which assesses the consistency of continuous pain intensity between manual and automatic PSPI, was 0.85 and 0.55 for 5-fold and leave-one-subject-out cross-validation, respectively, which suggests moderate to high consistency. These findings show that pain intensity can be reliably measured from facial expression in participants with orthopedic injury.
FaceTube: predicting personality from facial expressions of emotion in online conversational video BIBAFull-Text 53-56
  Joan-Isaac Biel; Lucía Teijeiro-Mosquera; Daniel Gatica-Perez
Advances in automatic facial expression recognition make it possible to mine and characterize large amounts of data, opening a wide research domain in behavioral understanding. In this paper, we leverage a state-of-the-art facial expression recognition technology to characterize users of a popular type of online social video: conversational vlogs. First, we propose the use of several activity cues to characterize vloggers based on frame-by-frame estimates of facial expressions of emotion. Then, we present results for the task of automatically predicting vloggers' personality impressions using facial expressions and the Big-Five traits. Our results are promising, especially for the Extraversion impression, and our work also poses interesting questions regarding the representation of the multiple natural facial expressions that occur in conversational video.

Demo session 1

The blue one to the left: enabling expressive user interaction in a multimodal interface for object selection in virtual 3d environments BIBAFull-Text 57-58
  Pulkit Budhiraja; Sriganesh Madhvanath
Interaction with virtual 3D environments comes with a host of challenges. For instance, because 3D objects tend to occlude one another, performing object selection by pointing gestures is problematic, and more so when there are many objects in the scene. In the real world we tend to use speech to clarify our intent, by referring to distinctive attributes of the object and/or its absolute or relative location in space. Multimodal interactive systems involving speech and gesture have generally relied on speech for commands and deictic gestures for indicating the target object. In this paper, we present a system which allows object references to be made using gestures and speech, and supports a variety of expressions inspired by real-world usage.
Pixene: creating memories while sharing photos BIBAFull-Text 59-60
  Ramadevi Vennelakanti; Sriganesh Madhvanath; Anbumani Subramanian; Ajith Sowndararajan; Arun David; Prasenjit Dey
In this paper we describe Pixene, a photo sharing system that focuses on the capture and subsequent visualization and consumption of interactions around shared photos, where the sharing may be with physically co-present friends and family, or online with one's social network. In the former scenario, the interactions may be richly multimodal and involve pointing and spoken comments. Remote interaction is primarily in the form of 'like's and text comments on social networking sites. Pixene thus acts as a common repository for interactions over photos and brings interactions from co-located and online photo sharing into a single platform. Pixene also provides a rich photo browsing experience that allows users to view not only the photographs but also the interaction history around them, e.g. who saw it, who they saw it with, what they said in association with different regions of interest, comments, and 'like's. In this paper, we describe the features and system design of Pixene.
Designing multiuser multimodal gestural interactions for the living room BIBAFull-Text 61-62
  Sriganesh Madhvanath; Ramadevi Vennelakanti; Anbumani Subramanian; Ankit Shekhawat; Prasenjit Dey; Amit Rajan
Most work in the space of multimodal and gestural interaction has focused on single user productivity tasks. The design of multimodal, freehand gestural interaction for multiuser lean-back scenarios is a relatively nascent area that has come into focus because of the availability of commodity depth cameras. In this paper, we describe our approach to designing multimodal gestural interaction for multiuser photo browsing in the living room, typically a shared experience with friends and family. We believe that our learnings from this process will add value to the efforts of other researchers and designers interested in this design space.
Using explanations for runtime dialogue adaptation BIBAFull-Text 63-64
  Florian Nothdurft; Frank Honold; Peter Kurzok
In this demo paper we present a system that is capable of adapting the dialogue between a human and a so-called companion system in real-time. Companion systems are continually available, co-operative, and reliable assistants which adapt to a user's capabilities, preferences, requirements, and current needs. Typically, state-of-the-art human-computer interfaces adapt the interaction only to pre-defined levels of expertise. In contrast, the presented system adapts the structure and content of the interaction to each user by including explanations that prepare the user for the upcoming tasks to be solved together with the companion system.
NeuroDialog: an EEG-enabled spoken dialog interface BIBAFull-Text 65-66
  Seshadri Sridharan; Yun-Nung Chen; Kai-Min Chang; Alexander I. Rudnicky
Understanding user intent is a difficult problem in Dialog Systems, as they often need to make decisions under uncertainty. Using an inexpensive, consumer grade EEG sensor and a Wizard-of-Oz dialog system, we show that it is possible to detect system misunderstanding even before the user reacts vocally. We also present the design and implementation details of NeuroDialog, a proof-of-concept dialog system that uses an EEG based predictive model to detect system misrecognitions during live interaction.
Companion technology for multimodal interaction BIBAFull-Text 67-68
  Frank Honold; Felix Schüssel; Florian Nothdurft; Peter Kurzok
We present a context-adaptive approach to multimodal interaction for use in cognitive technical systems, so-called Companion Systems. Such systems yield the properties of multimodality, individuality, adaptability, availability, cooperativeness, and trustworthiness. These characteristics represent a new type of interactive system that is not only practical and efficient to operate, but also agreeable, hence the term "companion". Companion technology has to consider the entire situation of the user, machine, and environment. The presented prototype depicts a system that offers assistance in the task of wiring the components of a home cinema system. The user interface for this task is not predefined, but built on the fly by dedicated fission and fusion components, thereby adapting the system's multimodal output and input capabilities to the user and the environment.

Poster session

IrisTK: a statechart-based toolkit for multi-party face-to-face interaction BIBAFull-Text 69-76
  Gabriel Skantze; Samer Al Moubayed
In this paper, we present IrisTK -- a toolkit for rapid development of real-time systems for multi-party face-to-face interaction. The toolkit consists of a message passing system, a set of modules for multi-modal input and output, and a dialog authoring language based on the notion of statecharts. The toolkit has been applied to a large scale study in a public museum setting, where the back-projected robot head Furhat interacted with the visitors in multi-party dialog.
Estimating conversational dominance in multiparty interaction BIBAFull-Text 77-84
  Yukiko Nakano; Yuki Fukuhara
It is important for conversational agents that manage multiparty conversations to recognize the group dynamics existing among the users. This paper proposes a method for estimating the conversational dominance of participants in group interactions. First, we conducted a Wizard-of-Oz experiment to collect conversational speech and motion data. Then, we analyzed various paralinguistic speech and gaze behaviors to elucidate the factors that predict conversational dominance. Finally, by exploiting the speech and gaze data as estimation parameters, we created a regression model to estimate conversational dominance; the multiple correlation coefficient of this model was 0.85.
Learning relevance from natural eye movements in pervasive interfaces BIBAFull-Text 85-92
  Melih Kandemir; Samuel Kaski
We study the feasibility of the following idea: Could a system learn to use the user's natural eye movements to infer relevance of real-world objects, if the user produced a set of learning data by clicking a "relevance" button during a learning session? If the answer is yes, the combination of eye tracking and machine learning would give a basis of "natural" interaction with the system by normally looking around, which would be very useful in mobile proactive setups. We measured the eye movements of the users while they were exploring an artificial art gallery. They labeled the relevant paintings by clicking a button while looking at them. The results show that a Gaussian process classifier accompanied by a time series kernel on the eye movements within an object predicts whether that object is relevant with better accuracy than dwell-time thresholding and random guessing.
Fishing or a Z?: investigating the effects of error on mimetic and alphabet device-based gesture interaction BIBAFull-Text 93-100
  Abdallah El Ali; Johan Kildal; Vuokko Lantz
While gesture taxonomies provide a classification of device-based gestures in terms of communicative intent, little work has addressed the usability differences in manually performing these gestures. In this primarily qualitative study, we investigate how two sets of iconic gestures that vary in familiarity, mimetic and alphabetic, are affected under varying failed-recognition error rates (0-20%, 20-40%, 40-60%). Drawing on experiment logs, video observations, subjects' feedback, and a subjective workload assessment questionnaire, results revealed two main findings: a) mimetic gestures tend to evolve into diverse variations (within the activities they mimic) under high error rates, while alphabet gestures tend to become more rigid and structured; and b) mimetic gestures were tolerated under recognition error rates of up to 40%, while alphabet gestures incurred significant overall workload at error rates of only 20%. Thus, while alphabet gestures are more robust to recognition errors in keeping their signature, mimetic gestures are more robust to recognition errors from a usability and user experience standpoint, and are thus better suited for inclusion in mainstream device-based gesture interaction with mobile phones.
Structural and temporal inference search (STIS): pattern identification in multimodal data BIBAFull-Text 101-108
  Chreston Miller; Louis-Philippe Morency; Francis Quek
A multitude of annotated behavior corpora (manual and automatic annotations) are available as research expands in the multimodal analysis of human behavior. Despite the rich representations within these datasets, search strategies are limited with respect to the advanced representations and complex structures describing human interaction sequences. The relationships among human interactions are structural in nature. Hence, we present Structural and Temporal Inference Search (STIS) to support the search for relevant patterns within a multimodal corpus based on the structural and temporal nature of human interactions. The user defines the structure of a behavior of interest, driving a search focused on the characteristics of that structure. Occurrences of the structure are returned. We compare against two pattern mining algorithms designed for pattern identification among sequences of symbolic data (e.g., sequences of events such as behavior interactions). The results are promising, as STIS performs well on several datasets.
Integrating word acquisition and referential grounding towards physical world interaction BIBAFull-Text 109-116
  Rui Fang; Changsong Liu; Joyce Yue Chai
In language-based interaction between a human and an artificial agent (e.g., robot) in a physical world, because the human and the agent have different knowledge and capabilities in perceiving the shared environment, referential grounding is very difficult. To facilitate such interaction, it is important for the agent to continuously learn and acquire knowledge about the environment through interactions with humans and incorporate the learned knowledge in grounding references from human utterances. To address this issue, this paper presents a graph-based approach for referential grounding and examines how referential grounding and word acquisition influence each other in physical world interaction. Our empirical results have shown that for most words, automated word acquisition through interaction improves referential grounding performance. However, this is not the case for words describing object types, where human supervision is important. Nevertheless, better referential grounding enables more accurate acquisition of word meanings, which in turn further improves grounding performance for references in subsequent utterances.
Effects of modality on virtual button motion and performance BIBAFull-Text 117-124
  Adam Faeth; Chris Harding
The simple action of pressing a button is a multimodal interaction with an interesting depth of complexity. As the development of computer interfaces supporting 3D tasks progresses, there is a need to understand how users will interact with virtual buttons that generate multimodal feedback. Using a phone number dialing task on a virtual keypad, this study examined the effects of visual, auditory, and haptic feedback combinations on task performance and on the motion of individual button presses. The results suggest that the resistance of haptic feedback alone was not enough to prevent participants from pressing the button farther than necessary. Reinforcing haptic feedback with visual or auditory feedback shortened the depth of the presses significantly. However, the shallower presses that occurred with trimodal feedback may have led participants to release some buttons too early, which may explain an unexpected increase in mistakes where the participant missed digits from the phone number.
Modeling multimodal integration with event logic charts BIBAFull-Text 125-132
  Gregor Ulrich Mehlmann; Elisabeth André
In this paper we present a novel approach to the combined modeling of multimodal fusion and interaction management. The approach is based on a declarative multimodal event logic that allows the integration of inputs distributed over multiple modalities in accordance with spatial, temporal, and semantic constraints. In conjunction with a visual state chart language, our approach supports the incremental parsing and fusion of inputs and a tight coupling with interaction management. The incremental and parallel parsing approach allows us to cope with concurrent continuous and discrete interactions and with fusion on different levels of abstraction. The high-level visual and declarative modeling methods support rapid prototyping and iterative development of multimodal systems.
Multimodal motion guidance: techniques for adaptive and dynamic feedback BIBAFull-Text 133-140
  Christian Schönauer; Kenichiro Fukushi; Alex Olwal; Hannes Kaufmann; Ramesh Raskar
The ability to guide human motion through automatically generated feedback has significant potential for applications in areas such as motor learning, human-computer interaction, telepresence, and augmented reality.
   This paper focuses on the design and development of such systems from a human cognition and perception perspective. We analyze the dimensions of the design space for motion guidance systems, spanned by technologies and human information processing, and identify opportunities for new feedback techniques.
   We present a novel motion guidance system that was implemented based on these insights to enable feedback for position, direction, and continuous velocities. It uses motion capture to track a user in space and guides the user with visual, vibrotactile and pneumatic actuation. Our system also introduces motion retargeting through time warping, motion dynamics and prediction, to allow more flexibility and adaptability to user performance.
Multimodal detection of salient behaviors of approach-avoidance in dyadic interactions BIBAFull-Text 141-144
  Bo Xiao; Panayiotis Georgiou; Brian Baucom; Shrikanth Narayanan
Approach-Avoidance (AA) coding is a measure of involvement and immediacy in human dyadic interactions. We focus on analyzing the salient events in interactions that trigger change points in the AA code over time, as perceived by domain experts. We employ coarse-level visual cues associated with body parts, as well as vocal energy features. Motion vector extraction and body pose estimation techniques are used for extracting the visual cues. Functionals of these cues are used as features for SVM-based machine learning experiments. We found that the coder's judgments on salient events are related to the short time interval preceding the labeling. We also show that visual cues are the main information source for decision making on salient AA events, and that considering the information from a subset of body parts provides the same information as considering the full set. The mean of absolute value and the standard deviation of the motion streams are the most effective functionals as features. We achieve an F-score of 0.55 in detecting salient events using cross-validation with a one-subject-out approach.
Multimodal analysis of the implicit affective channel in computer-mediated textual communication BIBAFull-Text 145-152
  Joseph F. Grafsgaard; Robert M. Fulton; Kristy Elizabeth Boyer; Eric N. Wiebe; James C. Lester
Computer-mediated textual communication has become ubiquitous in recent years. Compared to face-to-face interactions, there is decreased bandwidth in affective information, yet studies show that interactions in this medium still produce rich and fulfilling affective outcomes. While overt communication (e.g., emoticons or explicit discussion of emotion) can explain some aspects of affect conveyed through textual dialogue, there may also be an underlying implicit affective channel through which participants perceive additional emotional information. To investigate this phenomenon, computer-mediated tutoring sessions were recorded with Kinect video and depth images and processed with novel tracking techniques for posture and hand-to-face gestures. Analyses demonstrated that tutors implicitly perceived students' focused attention, physical demand, and frustration. Additionally, bodily expressions of posture and gesture correlated with student cognitive-affective states that were perceived by tutors through the implicit affective channel. Finally, posture and gesture complement each other in multimodal predictive models of student cognitive-affective states, explaining greater variance than either modality alone. This approach of empirically studying the implicit affective channel may identify details of human behavior that can inform the design of future textual dialogue systems modeled on naturalistic interaction.
Towards sensing the influence of visual narratives on human affect BIBAFull-Text 153-160
  Mihai Burzo; Daniel McDuff; Rada Mihalcea; Louis-Philippe Morency; Alexis Narvaez; Veronica Perez-Rosas
In this paper, we explore a multimodal approach to sensing affective state during exposure to visual narratives. Using four different modalities, consisting of visual facial behaviors, thermal imaging, heart rate measurements, and verbal descriptions, we show that we can effectively predict changes in human affect. Our experiments show that these modalities complement each other, and illustrate the role played by each of the four modalities in detecting human affect.
Integrating video and accelerometer signals for nocturnal epileptic seizure detection BIBAFull-Text 161-164
  Kris Cuppens; Chih-Wei Chen; Kevin Bing-Yung Wong; Anouk Van de Vel; Lieven Lagae; Berten Ceulemans; Tinne Tuytelaars; Sabine Van Huffel; Bart Vanrumste; Hamid Aghajan
Epileptic seizure detection is traditionally done using video/electroencephalogram (EEG) monitoring, which is not applicable in a home situation. In recent years, attempts have been made to detect seizures using other modalities. In this paper we investigate whether a combined usage of accelerometers attached to the limbs and video data would increase performance compared to a single-modality approach. To this end, we used two existing approaches for seizure detection in accelerometer and video data and combined them using a linear discriminant analysis (LDA) classifier. The combined detection achieved a positive predictive value (PPV) of 95.00%, better than either single-modality detection, and reached a sensitivity of 83.33%.
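The late-fusion scheme the abstract describes, combining per-modality detector scores with a two-class LDA, can be sketched as follows. The feature layout and data are illustrative, not from the study; a production system would use a tested library implementation.

```python
# Minimal two-class LDA for fusing per-modality detector scores (sketch).
import numpy as np
from numpy.linalg import inv

def lda_fuse_train(X, y):
    """Fit a two-class LDA on detector scores.
    X: (n_samples, 2) array of [accelerometer_score, video_score]
    y: (n_samples,) binary labels (1 = seizure, 0 = normal)."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared within-class covariance (summing the two class covariances
    # only rescales the projection, which leaves the decision boundary
    # unchanged in the two-class case).
    cov = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    w = inv(cov) @ (mu1 - mu0)      # projection direction
    b = -0.5 * w @ (mu1 + mu0)      # threshold at the class midpoint
    return w, b

def lda_fuse_predict(X, w, b):
    """Label 1 where the projected score exceeds the midpoint threshold."""
    return (X @ w + b > 0).astype(int)
```

In practice one would trade the threshold `b` against class priors to tune sensitivity versus PPV, as the paper's figures suggest.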
GeoGazemarks: providing gaze history for the orientation on small display maps BIBAFull-Text 165-172
  Ioannis Giannopoulos; Peter Kiefer; Martin Raubal
Orientation on small display maps is often difficult because the visible spatial context is restricted. This paper proposes to provide the history of a user's visual attention on a map as a visual clue to facilitate orientation. Visual attention on the map is recorded with eye tracking, clustered geo-spatially, and visualized when the user zooms out. This implicit gaze-interaction concept, called GeoGazemarks, has been evaluated in an experiment with 40 participants. The study demonstrates a significant increase in efficiency and an increase in effectiveness for a map search task, compared to standard panning and zooming.
Lost in navigation: evaluating a mobile map app for a fair BIBAFull-Text 173-180
  Anders Bouwer; Frank Nack; Abdallah El Ali
This paper describes a field study evaluating a mobile map application for the Paris Air Show. The aim of the study was to investigate how well users can navigate (to static and moving targets) and orient themselves in a fair (an unknown environment posing realistic challenges for wayfinding) with a mobile map system. The study involved 14 fair visitors who carried out three navigation tasks, which required them to switch between map navigation and deciding upon their orientation in the physical environment. Our results indicate that navigation and orientation are not as tightly coupled as described in the traditional wayfinding literature and may require different modality approaches to optimally support users. Based on this, we draw design implications on how to balance supporting the user in navigation and orientation with mobile systems without diminishing users' awareness of their surroundings.
An evaluation of game controllers and tablets as controllers for interactive TV applications BIBAFull-Text 181-188
  Dale Cox; Justin Wolford; Carlos Jensen; Dedrie Beardsley
There is a growing interest in bringing online and streaming content to the television. Gaming platforms such as the PS3, Xbox 360 and Wii are at the center of this digital convergence, serving as platforms for accessing new media services. This presents a number of interface challenges, as controllers designed for gaming have to be adapted to accessing online content. This paper presents a user study examining the limitations and affordances of novel game controllers in an interactive TV (iTV) context and compares them to "second display" approaches using tablets. We look at task completion times, accuracy, and user satisfaction across a number of tasks and find that the Wiimote was the most liked and performed best in almost all tasks. Participants found the Kinect difficult to use, which led to slow performance and high error rates. We discuss challenges and opportunities for the future convergence of game consoles and iTV.
Towards multimodal deception detection -- step 1: building a collection of deceptive videos BIBAFull-Text 189-192
  Rada Mihalcea; Mihai Burzo
In this paper, we introduce a novel crowdsourced dataset of deceptive videos. We describe the collection process and the characteristics of the dataset, and we validate it through initial experiments in the recognition of deceptive language. The collection, consisting of 140 truthful and deceptive videos, will enable future experiments in multimodal deception detection.
A portable audio/video recorder for longitudinal study of child development BIBAFull-Text 193-200
  Soroush Vosoughi; Matthew S. Goodwin; Bill Washabaugh; Deb Roy
Collection and analysis of ultra-dense, longitudinal observational data of child behavior in natural, ecologically valid, non-laboratory settings holds significant promise for advancing the understanding of child development and developmental disorders such as autism. To this end, we created the Speechome Recorder -- a portable version of the embedded audio/video recording technology originally developed for the Human Speechome Project -- to facilitate swift, cost-effective deployment in home environments. Recording child behavior daily in these settings will enable detailed study of developmental trajectories in children from infancy through early childhood, as well as typical and atypical dynamics of communication and social interaction as they evolve over time. Its portability makes possible potentially large-scale comparative study of developmental milestones in both neurotypical and developmentally delayed children. In brief, the Speechome Recorder was designed to reduce cost, complexity, invasiveness and privacy issues associated with naturalistic, longitudinal recordings of child development.
Integrating PAMOCAT in the research cycle: linking motion capturing and conversation analysis BIBAFull-Text 201-208
  Bernhard Brüning; Christian Schnier; Karola Pitsch; Sven Wachsmuth
In order to understand and model the non-verbal communicative conduct of humans, it seems fruitful to combine qualitative (Conversation Analysis [6] [10] [11]) and quantitative (motion capturing) analytical methods. Tools for data visualization and annotation are important as they constitute a central interface between different research approaches and methodologies. With this aim we have developed the pre-annotation tool "PAMOCAT -- Pre Annotation Motion Capture Analysis Tool", which detects different phenomena in two categories: single-person and person-overlapping phenomena. Included are functions for the analysis of head focus on objects, hand activity, single degree-of-freedom (DOF) activity, posture detection, and intrusions into the co-participant's space. The detected phenomena are displayed in an overview linked to the corresponding frames, and can be combined to search for specific constellations of phenomena. A sophisticated user interface allows the annotating person to easily find correlations between different joints and phenomena, to analyze the corresponding 3D pose in a reconstructed virtual environment, and to export combined qualitative and quantitative annotations to standard annotation tools. Using this technique we are able to examine complex setups with three participants engaged in conversation. In this paper we propose how PAMOCAT can be integrated in the research cycle by showing a concrete PAMOCAT-based micro-analysis of a multimodal phenomenon, which deals with kinetic procedures to claim the floor.

3 Vision

Motion retrieval based on kinetic features in large motion database BIBAFull-Text 209-216
  Tianyu Huang; Haiying Liu; Gangyi Ding
Considering the increasing collections of motion capture data, motion retrieval in large motion databases is gaining in importance. In this paper, we introduce kinetic interval features describing the movement trend of motions. In our approach, motion files are decomposed into kinetic intervals. For each joint in a kinetic interval, we define the kinetic interval features as the parameters of parametric arc equations computed by fitting joint trajectories. By extracting these features, we are able to lower the dimensionality and reconstruct the motions. A multilayer index tree is used to accelerate the search process, and a candidate list of motion data is generated for matching. To find both logically and numerically similar motions, we propose a two-level similarity matching based on kinetic interval features, which also speeds up the matching process. Experiments performed on several variants of the HDM05 and CMU motion databases show that the approach achieves accurate and fast motion retrieval in large motion databases.
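The idea of representing a kinetic interval by the parameters of a fitted curve, and reconstructing the motion from those parameters, can be sketched with low-degree polynomials as a stand-in for the paper's parametric arc equations. Function names and the choice of polynomial basis are illustrative assumptions, not the authors' formulation.

```python
# Sketch: compact per-interval features from curve-fitted joint trajectories.
import numpy as np

def kinetic_interval_features(traj, degree=2):
    """Fit each coordinate of a joint trajectory over one kinetic interval
    with a low-degree polynomial; the coefficients serve as a compact,
    reconstructable feature vector.
    traj: (n_frames, 3) array of joint positions."""
    t = np.linspace(0.0, 1.0, len(traj))
    coeffs = [np.polyfit(t, traj[:, k], degree) for k in range(traj.shape[1])]
    return np.concatenate(coeffs)  # length = 3 * (degree + 1)

def reconstruct(features, n_frames, degree=2, dims=3):
    """Invert the feature extraction: evaluate the fitted polynomials."""
    t = np.linspace(0.0, 1.0, n_frames)
    per = degree + 1
    cols = [np.polyval(features[k * per:(k + 1) * per], t) for k in range(dims)]
    return np.stack(cols, axis=1)
```

Because a whole interval collapses to a handful of coefficients per joint, distances between such feature vectors are cheap, which is what makes index-tree candidate filtering over a large database practical.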
Vision-based handwriting recognition for unrestricted text input in mid-air BIBAFull-Text 217-220
  Alexander Schick; Daniel Morlock; Christoph Amma; Tanja Schultz; Rainer Stiefelhagen
We propose a vision-based system that recognizes handwriting in mid-air. The system does not depend on sensors or markers attached to the users and allows unrestricted character and word input from any position. It is the result of combining handwriting recognition based on Hidden Markov Models with multi-camera 3D hand tracking. We evaluated the system for both quantitative and qualitative aspects. The system achieves recognition rates of 86.15% for character and 97.54% for small-vocabulary isolated word recognition. Limitations are due to slow and low-resolution cameras or physical strain. Overall, the proposed handwriting recognition system provides an easy-to-use and accurate text input modality without placing restrictions on the users.
Investigating the midline effect for visual focus of attention recognition BIBAFull-Text 221-224
  Samira Sheikhi; Jean-Marc Odobez
This paper addresses the recognition of people's visual focus of attention (VFOA), the discrete version of gaze indicating who is looking at whom or what. In the absence of high definition images, we rely on people's head pose to recognize the VFOA. Contrary to most previous works, which assumed a fixed mapping between head pose directions and gaze target directions, we investigate novel gaze models documented in psychovision that produce a dynamic (temporal) mapping between them. This mapping accounts for two important factors affecting the head and gaze relationship: the shoulder orientation defining the gaze midline of a person varies over time; and gaze shifts from frontal to the side involve different head rotations than the reverse. Evaluated on a public dataset and on data recorded with the humanoid robot Nao, the method exhibits better adaptivity, often producing better performance than the state-of-the-art approach.
Let's have dinner together: evaluate the mediated co-dining experience BIBAFull-Text 225-228
  Jun Wei; Adrian David Cheok; Ryohei Nakatsu
Having dinner together is undoubtedly a pleasurable experience that involves various channels for mutual interaction: not only audio, vision and touch, but also smell and taste. With the aim of extending this rich experience to remote situations, we developed the Co-dining system to support a range of mealtime interactions that enhance the feeling of social togetherness. This paper describes a preliminary study with this interactive multisensory system. It aims to investigate the actual effectiveness of the working prototype in enhancing social presence and engagement during telepresent dining, and to gain a comprehensive understanding of users' perception. The evaluation focused on three main aspects: the overall Co-dining feeling, cultural awareness, and engagement. The study results revealed that the system achieved a sense of "being together" among users through interactive activities touching upon tableware, tablecloth and real edible food, and that each interaction module contributed differently to the overall experience. In this paper, we report the evaluation process, present and interpret the data, and discuss initial insights on enhancing the sense of co-presence through multi-channel interactions.

Keynote 2

Infusing the physical world into user interfaces BIBAFull-Text 229-230
  Ivan Poupyrev
Advances in new materials and manufacturing techniques are rapidly blending the computational and physical worlds. With every new turn in technology development -- e.g., discovering a novel "smart" material, inventing a more efficient manufacturing process or designing a faster microprocessor -- there are new and exciting ways to take user interfaces away from the screen and blend them into our living spaces and everyday objects, making them more responsive, intelligent and adaptive. As the world around us becomes increasingly infused with technology, the user interfaces and computers themselves will disappear into the background, blending into the physical world around us. Thus, the old tried-and-true paradigms for designing interaction and interfaces must be re-evaluated, re-designed and, in some cases, even discarded to take advantage of the new possibilities that these cutting-edge technologies provide. While the challenges and opportunities are distinct, the fundamental goal remains the same: to provide for the effortless and effective consumption, control and transmission of information at any time and in any place, while delivering a unique experience that is only possible with these emerging technologies.
   In this talk I will present work produced by myself and the research group that I have been directing at Disney Research Pittsburgh, where we are addressing these exciting challenges. The talk will cover projects investigating tactile and haptic interfaces, deformable computing devices, augmented reality interfaces and novel touch sensing techniques, as well as biologically-inspired interfaces, among others. The presentation will cover both projects conducted while at Sony Corporation and more recent research efforts in the Interaction Group at Walt Disney Research, Pittsburgh.

Special session: child-computer interaction

Child-computer interaction: ICMI 2012 special session BIBAFull-Text 231-232
  Anton Nijholt
This is a short introduction to the special session on child-computer interaction at the International Conference on Multimodal Interaction 2012 (ICMI 2012). In human-computer interaction, users have become participants in the design process. This is no different for child-computer interaction applications. However, technological advances have also led to developments where children not only have the role of future consumers of an application (a game, perhaps an educational game), but also design and create the application, where designing and creating are both fun and serve educational purposes. In this special session the different aspects of child-computer interaction (design, usability, learning, fun, creating, collaboration) are investigated and illustrated. In addition, we pay attention to the efforts to create a child-computer interaction research community.
Knowledge gaps in hands-on tangible interaction research BIBAFull-Text 233-240
  Alissa N. Antle
Multimodal interfaces including tablets, touch tables, and tangibles are beginning to receive much attention in the child-computer interaction community. Such interfaces enable interaction through actions, gestures, touch, and other modalities not tapped into by traditional desktop computing. Researchers have suggested that multimodal interfaces, such as tangibles, have great potential to support children's learning and problem solving in spatial domains due to the hands-on physical and spatial properties of this interaction style. Despite a long history of hands-on learning with physical and computational materials, there is little theoretical or empirical work that identifies specific causes for many of the claimed benefits. Neither is there empirically validated design guidance as to what design choices might be expected to have significant impacts. In this paper I suggest several avenues of investigation, based on my own research interests, which would address this knowledge gap. I provide summaries of theoretical mechanisms that may explain claimed benefits, outline how the specific features of tangible interfaces might support or enhance these mechanisms, and describe current and future investigations that address current gaps of knowledge.
Evaluating artefacts with children: age and technology effects in the reporting of expected and experienced fun BIBAFull-Text 241-248
  Janet C. Read
In interaction design, there are several metrics used to gather user experience data. A common approach is to use surveys, with the usual method being to ask users, after they have experienced a product, about their opinion and satisfaction. This paper describes the use of the Smileyometer (a product from the Fun Toolkit) to test for user experience with children by asking for opinions in relation to expected as well as experienced fun.
   Two studies looked at the ratings that children, from two different age groups and in two different contexts, gave to a set of varied age-appropriate interactive technology installations. The ratings given before use (expectations) are compared with ratings given after use (experience) across the age groups and across installations.
   The studies show that different ratings were given for the different installations and that there were age-related differences in the use of the Smileyometer to rate user experience. These findings show, firstly, that children can, and do, discriminate between different experiences and, secondly, that children do reflect on user experience after using technologies. In most cases, across both age groups, children expected a lot from the technologies, and their after-use (experienced) ratings confirmed that this was what they had got.
   The paper concludes by considering the implications of the collective findings for the design and evaluation of technologies with children.
Measuring enjoyment of an interactive museum experience BIBAFull-Text 249-256
  Elisabeth M. A. G. van Dijk; Andreas Lingnau; Hub Kockelkorn
Museums are increasingly being equipped with interactive technology. The main goal of using technology is to improve the museum-going experience of visitors. In this paper, we present the results of a study with an electronic quest through a museum aimed at children aged 10-12. We wanted to find out whether personalization of the quest affects enjoyment. For this purpose we involved an interactive multi-touch table in the experiment, which also offered the opportunity to add the element of collaboration. We compared a group that did the original, non-personalized quest with a group that did the personalized quest. The latter group interacted with the multi-touch table to personalize the quest before starting it. No significant differences were found between the experimental groups. We did, however, find many differences between children aged 10-11 and those aged 11-12, on almost all measurements. In this respect we present some methodological results on measuring enjoyment and intrinsic motivation with children of 10-12 years old.
Bifocal modeling: a study on the learning outcomes of comparing physical and computational models linked in real time BIBAFull-Text 257-264
  Paulo Blikstein
Computer modeling, and in particular agent-based modeling, has been successfully used in many scientific fields, transforming scientists' practice. Educational researchers have come to realize its potential for learning, and studies have suggested that students are able to understand concepts above their expected grade level after interacting with curricula that employ modeling and simulation. However, most simulations are 'on-screen', without connection to the physical world, making real-time model validation challenging with extant modeling platforms. I have designed a technological and pedagogical framework that enables students to connect computer models and sensors in real time, so as to validate, compare, and refine their models using real-world data. In this paper, I focus on both technical and pedagogical aspects, describing pilot studies that suggest a real-to-virtual reciprocity which catalyzes further inquiry toward deeper understanding of scientific phenomena.
Connecting play: understanding multimodal participation in virtual worlds BIBAFull-Text 265-272
  Yasmin Kafai; Deborah Fields
In this paper we propose a multimodal approach to log file data analysis to develop a better understanding of player participation and practices in virtual worlds. To deal with the massive amounts of data collected via log files researchers traditionally have employed quantitative reduction techniques for revealing trends and patterns. We contend that certain qualitative analysis techniques can reveal particular play practices across online and offline spaces and aspects of individual players' participation invisible through other methods. We present examples from our research in the tween virtual world Whyville.net that illustrate the uses of these new techniques. In the discussion we address the benefits and limitations of our approach.


Gestures as point clouds: a $P recognizer for user interface prototypes BIBAFull-Text 273-280
  Radu-Daniel Vatavu; Lisa Anthony; Jacob O. Wobbrock
Rapid prototyping of gesture interaction for emerging touch platforms requires that developers have access to fast, simple, and accurate gesture recognition approaches. The $-family of recognizers ($1, $N) addresses this need, but the current most advanced of these, $N-Protractor, has significant memory and execution costs due to its combinatoric gesture representation approach. We present $P, a new member of the $-family, that remedies this limitation by considering gestures as clouds of points. $P performs similarly to $1 on unistrokes and is superior to $N on multistrokes. Specifically, $P delivers >99% accuracy in user-dependent testing with 5+ training samples per gesture type and stays above 99% for user-independent tests when using data from 10 participants. We provide a pseudocode listing of $P to assist developers in porting it to their specific platform and a "cheat sheet" to aid developers in selecting the best member of the $-family for their specific application needs.
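The core of $P is a greedy point-to-point cloud matching with order-dependent weights, tried from several starting points. A minimal sketch of that matching step is below (the paper's resampling, scaling, and translation normalization are omitted, and function names are our own):

```python
# Sketch of $P's greedy point-cloud matching (normalization steps omitted).
import math

def cloud_distance(points, template, start):
    """One greedy matching pass: each gesture point is matched to its
    closest still-unmatched template point, starting from index `start`.
    Both clouds are lists of (x, y) tuples of equal length n."""
    n = len(points)
    matched = [False] * n
    total = 0.0
    i = start
    for step in range(n):
        best, best_j = float('inf'), -1
        for j in range(n):
            if not matched[j]:
                d = math.dist(points[i], template[j])
                if d < best:
                    best, best_j = d, j
        matched[best_j] = True
        # Earlier matches carry more weight, as in the $P paper.
        total += (1 - step / n) * best
        i = (i + 1) % n
    return total

def greedy_cloud_match(points, template):
    """Try a subset of starting points and keep the minimum distance."""
    n = len(points)
    eps = 0.5                            # $P's suggested trade-off value
    step = max(1, int(n ** (1 - eps)))
    return min(cloud_distance(points, template, s) for s in range(0, n, step))
```

Because the clouds are unordered sets of points, stroke count and stroke direction no longer matter, which is why $P avoids $N-Protractor's combinatoric enumeration of stroke permutations.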
Influencing gestural representation of eventualities: insights from ontology BIBAFull-Text 281-288
  Magdalena Lis
In the present paper we report the results of a pilot study on the verbal and gestural representation of eventualities. We investigate the relationship between eventuality characteristics, as reflected in linguistic categorizations of the verbs denoting the eventuality referred to, and the physical form of co-speech gesture. Most importantly, we look at whether the referent's ontological type, information about which we derive from plWordNet 1.5, correlates with the viewpoint adopted in the co-occurring gesture. Our results indicate a strong correlation between eventuality type and gestural viewpoint and are a first step towards a model of gestural behaviour planning driven by referents' types as reflected in speech and categorized in linguistic ontologies.
Using self-context for multimodal detection of head nods in face-to-face interactions BIBAFull-Text 289-292
  Laurent Nguyen; Jean-Marc Odobez; Daniel Gatica-Perez
Head nods occur in virtually every face-to-face discussion. As part of the backchannel domain, they are not only used to express a 'yes', but also to display interest or enhance communicative attention. Detecting head nods in natural interactions is a challenging task as head nods can be subtle, both in amplitude and duration. In this study, we make use of findings in psychology establishing that the dynamics of head gestures are conditioned on the person's speaking status. We develop a multimodal method using audio-based self-context to detect head nods in natural settings. We demonstrate that our multimodal approach using the speaking status of the person under analysis significantly improved the detection rate over a visual-only approach.

Demo session 2

Multimodal multiparty social interaction with the furhat head BIBAFull-Text 293-294
  Samer Al Moubayed; Gabriel Skantze; Jonas Beskow; Kalin Stefanov; Joakim Gustafson
We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.
An avatar-based help system for a grid computing web portal BIBAFull-Text 295-296
  Helmut Lang; Florian Nothdurft
We developed a help system that employs an avatar to cover various aspects of user assistance. In addition to assisting users with the completion of specific goals by means of dynamically generated step-by-step guides, the system is also able to provide information concerning elements of the user interface. Furthermore it answers follow-up questions entered by the user. Compared to other systems, our system is able to present help directly on the page the user is currently visiting. It also interacts with the content to help the user in finding pages and interface elements relevant to complete given tasks. According to the user's preference, help is either provided in a textual form or by speech output.
GamEMO: how physiological signals show your emotions and enhance your game experience BIBAFull-Text 297-298
  Guillaume Chanel; Konstantina Kalogianni; Thierry Pun
The proposed demonstration is an automatic emotion assessment installation used for dynamic difficulty adjustment in a game. The goal of the system is to maintain the player in a state of entertainment and engagement in which his/her skills match the difficulty level of the game. The player's physiological signals are recorded while playing a Tetris game; signal processing, feature extraction and classification techniques are applied to the signals in order to detect when the player is anxious or bored. The level of the Tetris game is then adjusted according to the player's detected emotional state. The demonstration will also serve as an experimental protocol to test the players' experience through their interaction with the proposed platform.
Multimodal collaboration for crime scene investigation in mediated reality BIBAFull-Text 299-300
  Dragos Datcu; Thomas Swart; Stephan Lukosch; Zoltan Rusak
In this paper, we present an innovative mediated reality-oriented, real-time software system designed to support multimodal collaboration between remote CSI experts and forensic investigators at the crime scene. Our prototype integrates state-of-the-art technologies for stereo navigation and 3D digital mapping with an adaptive hand gesture-based user interface for natural interaction. The multimodal interface accepts mouse input, audio, and hand gestures, possibly while interacting with a purposely designed physical object. The evaluation by a panel of international CSI practitioners [5], using an adapted method of Burkhardt et al. [1], shows that our prototype system genuinely fits into the common practice of forensic investigation while clearly boosting the quality of collaboration. At the moment, the system is under consideration for routine adoption as part of special procedures by The Forensic Institute in The Hague, The Netherlands [6].
PAMOCAT: linking motion capturing and conversation analysis BIBAFull-Text 301-302
  Bernhard Andreas Brüning; Christian Schnier
In order to model the non-verbal communicative conduct of humans, it needs to be understood in detail. It is therefore fruitful to combine qualitative (Conversation Analysis) and quantitative (Motion Capturing) analytical methods. Tools for data visualization and annotation are important since they constitute a central interface between different research approaches and methodologies. With this aim we have developed the pre-annotation tool "PAMOCAT -- Pre Annotation Motion Capture Analysis Tool" [2]. It has a sophisticated user interface that allows the annotator to easily find correlations between different joints and/or phenomena, to analyze the corresponding 3D pose in a reconstructed virtual environment, and to export combined qualitative and quantitative annotations to standard annotation tools. The tool makes it easy to search for distinct constellations of different phenomena and to display the frames in which these are found in an annotation overview.
Multimodal dialogue in mobile local search BIBAFull-Text 303-304
  Patrick Ehlen; Michael Johnston
Speak4itSM is a multimodal, mobile search application that provides information about local businesses. Users can combine speech and touch input simultaneously to make search queries or commands to the application. For example, a user might say, "gas stations", while simultaneously tracing a route on a touchscreen. In this demonstration, we describe the extension of our multimodal semantic processing architecture and application from a one-shot query system to a multimodal dialogue system that tracks dialogue state over multiple turns. We illustrate the capabilities and limitations of an information-state-based approach to multimodal interpretation. We provide interactive demonstrations of Speak4it on a tablet and a smartphone, and explain the challenges of supporting true multimodal interaction in a deployed mobile service.

Doctoral spotlight session

Toward an argumentation-based dialogue framework for human-robot collaboration BIBAFull-Text 305-308
  Mohammad Q. Azhar
Research on human-robot dialogue to support fluent human-robot interaction is still in its early stages. Current issues in the human-robot dialogue domain can be divided into two major categories, described in this proposal as the "what to say" problem and the "how to say it" problem. The "what to say" problem addresses ways to determine the content of plausible dialogue during human-robot interaction, whereas the "how to say it" problem addresses the best ways for a robot to deliver that content (e.g., using text, gestures, speech or different modalities). Dialogue within the robotics domain also needs to address the "when to say it" problem, which considers the timing of dialogue delivery (e.g., turn taking).
   Human-robot collaboration may fail for many reasons. This research focuses on conflicts and robot errors during human-robot collaboration. Conflicting beliefs occur when participating collaborators have two different views of the physical world due to inaccessible or non-static information. Intuitively, collaborators need to persuade each other to agree on the same beliefs about the state of the world in order to complete collaborative tasks successfully. Robot errors may happen due to miscommunication or simply a lack of communication. Moreover, the lack of dialogue support for a human to query a robot makes error recovery even more difficult. This may lead to failure of a collaborative plan or shared goal. Dialogue is the natural way to resolve errors due to miscommunication. This research explores the notion of an argumentation-based dialogue method for human-robot interaction (HRI). The proposed research aims to design and implement a logic-based dialogue framework grounded in argumentation theory to address the "what to say" problem of human-robot communication during a collaborative task.
Timing multimodal turn-taking for human-robot cooperation BIBAFull-Text 309-312
  Crystal Chao
In human cooperation, the concurrent usage of multiple social modalities such as speech, gesture, and gaze results in robust and efficient communicative acts. Such multimodality in combination with reciprocal intentions supports fluent turn-taking. I hypothesize that human-robot turn-taking can be made more fluent through appropriate timing of multimodal actions. Managing timing includes understanding the impact that timing can have on interactions as well as having a control system that supports the manipulation of such timing. To this end, I propose to develop a computational turn-taking model of the timing and information flow of reciprocal interactions. I also propose to develop an architecture based on the timed Petri net (TPN) for the generation of coordinated multimodal behavior, inside of which the turn-taking model will regulate turn timing and action initiation and interruption in order to seize and yield control. Through user studies in multiple domains, I intend to demonstrate the system's generality and evaluate the system on balance of control, fluency, and task effectiveness.
My automated conversation helper (MACH): helping people improve social skills BIBAFull-Text 313-316
  Mohammed E. Hoque
Ever been in a situation where you didn't get that job despite being the deserving candidate? What went wrong? The psychology literature suggests that the most important skill for making an impression during interviews is your interpersonal/social skill. Is it possible for people to improve their social skills (e.g., vary voice intonation and pauses appropriately, use social smiles when appropriate, and maintain eye contact) through a computerized intervention? In this thesis, I propose to develop an autonomous Automated Conversation Helper (a 3D virtual character) that can play the role of the interviewer, allowing participants to practice their social skills in the context of job interviews. The Automated Conversation Helper is being developed with the ability to "see" (facial expression processing), "hear" (speech recognition and prosody analysis) and "respond" (speech and behavior synthesis) in real-time, and to provide live feedback on participants' non-verbal behavior.
A touch of affect: mediated social touch and affect BIBAFull-Text 317-320
  Gijs Huisman
This position paper outlines the first stages in an ongoing PhD project on mediated social touch, and the effects mediated touch can have on someone's affective state. It is argued that touch is a profound communication channel for humans, and that communication through touch can, to some extent, occur through mediation. Furthermore, touch can be used to communicate emotions, as well as have immediate affective consequences. The design of an input device, consisting of twelve force-sensitive resistors, to study the communication of emotions through mediated touch is presented. A pilot study indicated that participants used duration of touch and force applied as ways to distinguish between different emotions. This paper will conclude by discussing possible improvements for the input device, how the pilot study fits with the overall PhD project, as well as future directions for the PhD project in general.
Depression analysis: a multimodal approach BIBAFull-Text 321-324
  Jyoti Joshi
Depression is a severe mental health disorder causing high societal costs. Current clinical practice depends almost exclusively on self report and clinical opinion, risking a range of subjective biases. It is therefore useful to design a diagnostic aid to assist clinicians. This project aims at developing a novel multimodal framework for depression analysis. In this PhD work, it is hypothesized that a multimodal affective sensing system can better capture what characterises a person's affective state than single-modality systems. The project will explore facial dynamics, head movements, upper body gestures, EEG measures and speech characteristics related to affect, in subjects with major depressive disorders. By integrating the individual sensing modalities, a multimodal approach that shows improved performance characteristics over single-modality approaches will be developed.
Design space for finger gestures with hand-held tablets BIBAFull-Text 325-328
  Katrin Wolf
This paper presents research on how a finger-gesture design space for interacting with hand-held tablets may be defined. The parameters that limit or extend this space, such as anatomy-dependent gesture feasibility, grasp requirements, gesture occlusion and complexity, are discussed based on initial explorative expert interviews and subsequent user studies. The goal of this research is to define the parameters that must be taken into account when developing a finger-gesture UI model for hand-held tablets. Although this model follows a strongly user-centric design approach, rather than being technology driven, technical solutions for detecting finger gestures are also considered. A model design is presented and research questions for investigating this model in greater detail are outlined.
Multi-modal interfaces for control of assistive robotic devices BIBAFull-Text 329-332
  Christopher Dale McMurrough
This paper presents an outline of dissertation research activities which aim to advance the use of non-traditional, multimodal interfaces in assistive robotic devices. The data modalities which are of particular interest in the work are perception of the environment using 3D scanning and computer vision, estimation of the user point of gaze, and perception of user intent during interaction with objects of interest. The main goal of this research is to explore the hypothesis that the combination of these data modalities can be used to provide intuitive and effective means of control over existing robotic platforms, such as wheelchairs and manipulators, to users with severe physical impairments.
Space, speech, and gesture in human-robot interaction BIBAFull-Text 333-336
  Ross Mead
To enable natural and productive situated human-robot interaction, a robot must both understand and control proxemics, the social use of space, in order to employ communication mechanisms analogous to those used by humans: social speech and gesture production and recognition. My research focuses on answering these questions: How do social (auditory and visual) and environmental (noisy and occluding) stimuli influence spatially situated communication between humans and robots, and how should a robot dynamically adjust its communication mechanisms to maximize human perceptions of its social signals in the presence of extrinsic and intrinsic sensory interference?
Machine analysis and recognition of social contexts BIBAFull-Text 337-340
  Maria O'Connor
As computers move into the social spaces traditionally reserved for humans via mobile and ubiquitous computing, they must develop into socially intelligent machines. Social context information will allow current content-based approaches in the fields of social signal processing and affective computing to be placed in the appropriate setting. The research area discussed here is the analysis of multimodal data for social context recognition and the application of this information to other problems in the field of social signal processing. This paper explains the motivation and background for this research, provides details on some preliminary work, and outlines future plans. It concludes with some expected contributions to the field of social signal processing.
Task-learning policies for collaborative task solving in human-robot interaction BIBAFull-Text 341-344
  Hae Won Park
The objective of this doctoral research is to design multimodal task-learning policies for a robotic system that targets the exchange of task rules between humans and robots. This objective is achieved through a collaborative task application during human-robot interaction in which the two partners learn a task from each other and accomplish a shared goal. As a first step, a method to model human-action primitives using a pattern-recognition technique is presented. Next, algorithms are developed to generate turn-taking strategies in response to human task behaviors. The contribution of this work is in engaging robots with humans in a collaborative play task by modeling statistical patterns of play behaviors and reusing previously learned knowledge to simplify the decision process. Here, results of previous work are presented, and remaining work, including deploying a physically embodied agent and developing an evaluation platform, is outlined.
Simulating real danger?: validation of driving simulator test and psychological factors in brake response time to danger BIBAFull-Text 345-348
  Daniele Ruscio
The aim of the present research is to study the role of human factors in the use of a driving simulator for a specific driving ability: brake response time. In particular, I intend to a) study the influence of interaction with a virtual simulation on the driving response to hazards, and b) validate a virtual driving test against external data from real-life driving. I conducted a study in a real car, measuring the response time for this specific driving task, and I want to compare the responses under the same conditions on a driving simulator, in order to better understand the psychological factors that can influence brake response time. Furthermore, the data will provide specific information for developing validated virtual simulations that can be used for training and infrastructure design.
Virtual patients to teach cultural competency BIBAFull-Text 349-352
  Raghavi Sakpal
Cultural competence has emerged as an important strategy in addressing issues of racial and ethnic disparities in a diversely populated country like the United States. The challenge for health care providers is to learn how cultural factors influence patients' responses to medical issues such as healing and suffering, as well as the doctor-patient relationship. My research investigates a new approach for teaching cultural competency to health care providers -- using Embodied Conversational Agents (ECAs) that represent different cultures.
   Objective: The goal of this project is to develop and test a pool of culturally diverse virtual patients (VPs) as a training tool to teach nursing students effective communication techniques (displaying cultural sensitivity) while interacting with patients from different cultures.
   Methods: 1) Develop VPs belonging to diverse cultures, ages and genders. 2) Develop an agent architecture for cultural adaptation of the virtual patients. 3) Implement the VPs in a clinical setting as a tool to teach cultural sensitivity and communication skills to medical/nursing students.
Multimodal learning analytics: enabling the future of learning through multimodal data analysis and interfaces BIBAFull-Text 353-356
  Marcelo Worsley
Project-based learning has found its way into a range of formal and informal learning environments. However, systematically assessing these environments remains a significant challenge. Traditional assessments, which focus on learning outcomes, seem incongruent with the process-oriented goals of project-based learning. Multimodal interfaces and multimodal learning analytics hold significant promise for assessing learning in open-ended learning environments. With its rich integration of a multitude of data streams and naturalistic interfaces, this area of research may help usher in a new wave of education reform by supporting alternative modes of learning.
A hierarchical approach to continuous gesture analysis for natural multi-modal interaction BIBAFull-Text 357-360
  Ying Yin
I propose a systematic hierarchical approach to continuous gesture analysis using a unifying framework based on abstract hidden Markov models (AHMMs). With this framework, I will develop a gesture-based interactive interface that allows users to perform both manipulative and communicative gestures without artificial restrictions, hence enabling natural interaction.

Grand challenge overview

AVEC 2012: the continuous audio/visual emotion challenge -- an introduction BIBAFull-Text 361-362
  Björn Schuller; Michel Valstar; Roddy Cowie; Maja Pantic
The second international Audio/Visual Emotion Challenge and Workshop (AVEC 2012) is briefly introduced. 34 teams from 12 countries signed up for the challenge. The SEMAINE database serves for the prediction of four-dimensional continuous affect in audio and video. For the eligible participants, final scores, measured as the correlation coefficient between gold standard and prediction, ranged from 0.174 to 0.456 for the Fully-Continuous Sub-Challenge and from 0.113 to 0.280 for the Word-Level Sub-Challenge.
ICMI'12 grand challenge: haptic voice recognition BIBAFull-Text 363-370
  Khe Chai Sim; Shengdong Zhao; Kai Yu; Hank Liao
This paper describes the Haptic Voice Recognition (HVR) Grand Challenge 2012 and its datasets. The HVR Grand Challenge 2012 is a research-oriented competition designed to bring together researchers across multiple disciplines to work on novel multimodal text entry methods involving speech and touch inputs. Annotated datasets were collected and released for this grand challenge as well as for future research purposes. A simple recipe for building an HVR system using the Hidden Markov Model Toolkit (HTK) was also provided. In this paper, detailed analyses of the datasets will be given. Experimental results obtained using these data will also be presented.
Audio-visual robot command recognition: D-META'12 grand challenge BIBAFull-Text 371-378
  Jordi Sanchez-Riera; Xavier Alameda-Pineda; Radu Horaud
This paper addresses the problem of audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. In order to set a proper baseline, the methods are tested on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. The classification-level audio-visual fusion strategy compensates for the errors of the unimodal (audio or vision) classifiers. The obtained results (an average audio-visual recognition rate of almost 80%) encourage us to investigate how to further develop and improve the methodology described in this paper.
Brain computer interfaces as intelligent sensors for enhancing human-computer interaction BIBAFull-Text 379-382
  Mannes Poel; Femke Nijboer; Egon L. van den Broek; Stephen Fairclough; Anton Nijholt
BCIs are traditionally conceived as a way to control apparatus: an interface that allows you to "act on" external devices as a form of input control. We propose an alternative use of BCIs: monitoring users as an additional intelligent sensor to enrich traditional means of interaction. This vision is what we consider to be a grand challenge in the field of multimodal interaction. In this article, this challenge is introduced, related to existing work, and illustrated with some best practices and the contributions it has received.

Keynote 3

Using psychophysical techniques to design and evaluate multimodal interfaces: psychophysics and interface design BIBAFull-Text 383-384
  Roberta L. Klatzky
"Psychophysics" is an approach to evaluating human perception and action capabilities that emphasizes control over the stimulus environment. Virtual environments provide an ideal setting for psychophysical research, as they facilitate not only stimulus control but precise measurement of performance. In my research I have used the psychophysical approach to inform the design and evaluation of multi-modal interfaces that enable action in remote or virtual worlds or that compensate for sensory-motor impairment in the physical environment of the user. This talk will describe such projects, emphasizing the value of behavioral science to interface engineering.

Touch / taste

Reproducing materials of virtual elements on touchscreens using supplemental thermal feedback BIBAFull-Text 385-392
  Hendrik Richter; Doris Hausen; Sven Osterwald; Andreas Butz
In our everyday life, the perception of thermal cues plays a crucial role in the identification and discrimination of materials. When touching an object, the change of temperature in the skin of our fingertips is characteristic of the touched material and can help to discriminate objects with the same texture or hardness. However, this useful perceptual channel is disregarded for interactive elements on standard touchscreens. In this paper, we present a study in which we compared the rate of object discrimination for stand-alone thermal stimuli as well as supplemental thermal stimuli characterizing virtual materials on a touchscreen. Our results show that five materials could be discriminated at a stable rate using either stand-alone or supplemental thermal stimuli. They suggest that thermal cues can enable material discrimination on touch surfaces, which opens the way for an expanded use of thermal stimuli on interactive surfaces.
Feeling it: the roles of stiffness, deformation range and feedback in the control of deformable UI BIBAFull-Text 393-400
  Johan Kildal; Graham Wilson
There has been little discussion on how the materials used to create deformable devices, and the subsequent interactions, might influence user performance and preference. In this paper we evaluated how the stiffness and required deformation extent (bending up and down bimanually) of mobile phone-shaped deformable devices influenced how precisely participants were able to move to and maintain target extents of deformation (bend). Given the inherent haptic feedback available from deforming devices (over rigid devices), we also compared performance with, and without, external visual feedback. User perception and preference regarding the different devices were also elicited. Results show that, while device stiffness did not significantly affect task performance, user comfort and preferences were strongly in favour of softer materials (0.45 N*m/rad) and moderate amounts of deformation. Removing external visual feedback led to less precise user input, but inaccuracy remained low enough to suggest non-visual interaction with deformable devices is feasible.
Audible rendering of text documents controlled by multi-touch interaction BIBAFull-Text 401-408
  Yasmine El-Glaly; Francis Quek; Tonya Smith-Jackson; Gurjot Dhillon
In this paper, we introduce a novel interaction model for reading text documents based on situated touch. This interaction modality targets Individuals with Blindness or Severe Visual Impairment (IBSVI). We aim to provide IBSVI with an effective reading tool that enables them to use their spatial abilities while reading. We used an iPad and augmented it with a static tactile overlay on the displayed text, to serve as a kind of spatial landmark space for the IBSVI. The text is rendered audibly in response to the user's touch. Two user studies with IBSVI participants were conducted to test the system: the first a laboratory-controlled study, the second a longitudinal study. These studies showed that, while the approach is new to users, it is a promising direction for enabling self-paced spatial reading for IBSVI.
Taste/IP: the sensation of taste for digital communication BIBAFull-Text 409-416
  Nimesha Ranasinghe; Adrian David Cheok; Ryohei Nakatsu
In this paper, we present a new methodology for integrating the sense of taste with the existing digital communication domain. First, we discuss existing problems and limitations of integrating the sense of taste as a digital communication medium. Then, to address this gap, we present a solution with three core modules: the transmitter, the form of communication, and the receiver. The transmitter is a mobile application in which the sender formulates a taste message to send. For communication, we present a new extensible markup language (XML) format, TasteXML (TXML), to specify the format of taste messages. As the receiver (actuator), we introduce the Digital Taste Stimulator, a novel method for stimulating taste sensations in humans. Initial user experiments and qualitative feedback are discussed, focusing mainly on the Digital Taste Stimulator. We conclude with a brief overview of future aspects of this technology and possibilities in other application domains.

Multimodal interaction

Learning speaker, addressee and overlap detection models from multimodal streams BIBAFull-Text 417-424
  Oriol Vinyals; Dan Bohus; Rich Caruana
A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the location of various people, their focus of attention, body pose, the sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees which allows them to incorporate and model temporal signals. We contrast these methods with more traditional approaches where a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance as designing features that are task dependent is time consuming and not always possible.
Analysis of the correlation between the regularity of work behavior and stress indices based on longitudinal behavioral data BIBAFull-Text 425-432
  Shogo Okada; Yusaku Sato; Yuki Kamiya; Keiji Yamada; Katsumi Nitta
Increasingly, longitudinal behavioral data captured by various sensors are being analyzed to improve workplace performance. In this paper, we analyze the correlation between the regularity of workers' behavior and their levels of stress. We used a 23-month behavioral dataset for 18 workers that recorded their use of PCs and their locations in the office. We found that the principal eigen-behaviors extracted from the dataset with PCA represented typical work behaviors such as overwork using a PC and routine times for meetings. We found that more than 80% of each of the 18 workers' individual behaviors could be reconstructed using nine principal eigen-behaviors. In addition, the deviation ranges for the reconstruction accuracies were significantly different for workers in different positions. We conducted a correlation analysis between the work behaviors of the workers and their stress levels. Our results show a significant negative correlation (r > 0.69, p < 0.01) between the accuracy of reconstructed work behaviors and physical stress levels, and a significant positive correlation between the accuracy of reconstructed behavior and stress dissolution abilities. Our results suggest that a correlation between workers' stress levels and the regularity of their work behavior exists. This correlation will be useful for occupational healthcare.
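The eigen-behavior analysis described in the abstract above can be sketched with standard PCA machinery: project each worker's behavior vectors onto the top principal components and measure how much of the behavior they reconstruct. This is a minimal illustration under assumed data shapes, not the authors' implementation; all names and the synthetic data are hypothetical.

```python
import numpy as np

def reconstruction_accuracy(X, n_components=9):
    """Project behavior vectors onto the top principal eigen-behaviors
    and return the fraction of variance the reconstruction retains."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_components]                        # principal eigen-behaviors
    X_rec = Xc @ top.T @ top                       # reconstruction from top components
    return 1.0 - np.sum((Xc - X_rec) ** 2) / np.sum(Xc ** 2)

# Hypothetical data: 100 days of behavior, each a 24-dimensional vector
# (e.g., hourly PC-usage/location features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 24))
acc = reconstruction_accuracy(X, n_components=9)
print(round(acc, 3))
```

With real, highly regular behavior data the top nine components would retain far more variance than they do for this random example.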
Linking speaking and looking behavior patterns with group composition, perception, and performance BIBAFull-Text 433-440
  Dineshbabu Jayagopi; Dairazalia Sanchez-Cortes; Kazuhiro Otsuka; Junji Yamato; Daniel Gatica-Perez
This paper addresses the task of mining typical behavioral patterns from small-group face-to-face interactions and linking them to social-psychological group variables. Towards this goal, we define group speaking and looking cues by aggregating automatically extracted cues at the individual and dyadic levels. Then, we define a bag of nonverbal patterns (Bag-of-NVPs) to discretize the group cues. The topics learnt using the Latent Dirichlet Allocation (LDA) topic model are then interpreted by studying the correlations with group variables such as group composition, group interpersonal perception, and group performance. Our results show that both group behavior cues and topics have significant correlations with (and predictive information for) all the above variables. For our study, we use interactions among unacquainted members, i.e., newly formed groups.
Semi-automatic generation of multimodal user interfaces for dialogue-based interactive systems BIBAFull-Text 441-444
  Dominik Ertl; Hermann Kaindl
Automation in the course of user-interface (UI) development has the potential to save resources and time. For graphical user interfaces, considerable research has been performed on automated generation. While the results are still not in widespread use, at least the problems are by now well understood. In contrast, the automated generation of multimodal UIs is still in its infancy. We address this problem by proposing a tool-supported process for generating multimodal UIs for dialogue-based interactive systems. For its concrete enactment, we provide tool support for generating a runtime configuration and glue code, respectively. In a nutshell, our approach generates multimodal dialogue-based UIs semi-automatically.
Designing multimodal reminders for the home: pairing content with presentation BIBAFull-Text 445-448
  Julie R. Williamson; Marilyn McGee-Lennon; Stephen Brewster
Reminder systems are a specific range of technologies for care at home that can deliver notifications or reminders (such as 'take your medication') to assist with daily living. How to best deliver these reminders is an interesting research challenge. Today's technologies have the potential to deliver the notifications in a range of output modalities. The delivery methods can be selected both by the system (depending on what devices are available in the home) and on the users' needs, capabilities and preferences. This paper describes a user-centred approach to the design of multimodal output for reminders in the home. In a focus group study (N=15), this paper explores how six output modalities could be used to present reminders in a home setting. The results demonstrate user requirements that can be incorporated into the early phases of the design of a multimodal reminder system.

Challenge 1: 2nd international audio/visual emotion challenge and workshop -- AVEC 2012

AVEC 2012: the continuous audio/visual emotion challenge BIBAFull-Text 449-456
  Björn Schuller; Michel Valstar; Florian Eyben; Roddy Cowie; Maja Pantic
We present the second Audio-Visual Emotion recognition Challenge and workshop (AVEC 2012), which aims to bring together researchers from the audio and video analysis communities around the topic of emotion recognition. The goal of the challenge is to recognise four continuously valued affective dimensions: arousal, expectancy, power, and valence. There are two sub-challenges: in the Fully Continuous Sub-Challenge participants have to predict the values of the four dimensions at every moment during the recordings, while for the Word-Level Sub-Challenge a single prediction has to be given per word uttered by the user. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Facial emotion recognition with expression energy BIBAFull-Text 457-464
  Albert C. Cruz; Bir Bhanu; Ninad Thakoor
Facial emotion recognition, the inference of an emotion from apparent facial expressions, in unconstrained settings is a typical case where algorithms perform poorly. A property of the AVEC 2012 data set is that individuals in the testing data are not encountered in the training data. In these situations, conventional approaches suffer because models developed from training data cannot properly discriminate unforeseen testing samples. Additional information beyond the feature vectors is required for successful detection of emotions. We propose two similarity metrics that address the problems of a conventional approach: neutral similarity, measuring the intensity of an expression; and temporal similarity, measuring changes in an expression over time. These similarities are taken to be the energy of facial expressions, measured with a SIFT-based warping process. Our method improves correlation by 35.5% over the baseline approach on the frame-level sub-challenge.
Multiple classifier combination using reject options and Markov fusion networks BIBAFull-Text 465-472
  Michael Glodek; Martin Schels; Günther Palm; Friedhelm Schwenker
The audio/visual emotion challenge (AVEC) provides a benchmark data collection for evaluating and developing techniques for the recognition of affective states. In our work, we present a Markov fusion network (MFN) for the combination of different individual classifiers, derived from the well-known Markov random field (MRF). It is capable of restoring missing values from a sequence of decisions and can integrate multiple channels, weighting them dynamically using confidences. The approach shows promising challenge results compared to the baseline.
Audio-visual emotion challenge 2012: a simple approach BIBAFull-Text 473-476
  Laurens van der Maaten
The paper presents a small empirical study into emotion and affect recognition based on auditory and visual features, which was performed in the context of the Audio-Visual Emotion Challenge (AVEC) 2012. The goal of this competition is to predict continuous-valued affect ratings based on the provided auditory and visual features, e.g., local binary pattern (LBP) features extracted from aligned face images, and spectral audio features.
   Empirically, we found that there are only very weak (linear) relations between the features and the continuous-valued ratings: our best linear regressors employ the offset feature to exploit the fact that the ratings have a dominant direction (more increasing than decreasing). Much to our surprise, exploiting this bias alone already leads to results that improve over the baseline system presented in [10]. The best performance we obtained on the AVEC 2012 test set (averaged over the test set and over four affective dimensions) is a correlation between predicted and ground-truth ratings of 0.2255 when making continuous predictions, and 0.1920 when making word-level predictions.
Step-wise emotion recognition using concatenated-HMM BIBAFull-Text 477-484
  Derya Ozkan; Stefan Scherer; Louis-Philippe Morency
Human emotion is an important part of human-human communication, since the emotional state of an individual often affects the way that he/she reacts to others. In this paper, we present a method based on concatenated Hidden Markov Models (co-HMM) to infer dimensional and continuous emotion labels from audio-visual cues. Our method is based on the assumption that continuous emotion levels can be modeled by a set of discrete values. Based on this, we represent each emotional dimension by step-wise label classes, and learn the intrinsic and extrinsic dynamics using our co-HMM model. We evaluate our approach on the Audio-Visual Emotion Challenge (AVEC 2012) dataset. Our results show considerable improvement over the baseline regression model provided with the AVEC 2012 challenge.
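The core assumption above, that continuous emotion levels can be modeled by a set of discrete values, amounts to a quantization of the affect signal. A minimal sketch of such step-wise labeling is shown below; the number of levels and the value range are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def to_stepwise_labels(values, n_levels=5, lo=-1.0, hi=1.0):
    """Quantize continuous affect ratings into discrete step-wise classes."""
    edges = np.linspace(lo, hi, n_levels + 1)[1:-1]  # inner bin boundaries
    return np.digitize(values, edges)

def from_stepwise_labels(labels, n_levels=5, lo=-1.0, hi=1.0):
    """Map class indices back to the centers of their value bins."""
    centers = lo + (np.arange(n_levels) + 0.5) * (hi - lo) / n_levels
    return centers[labels]
```

A sequence model such as the co-HMM can then be trained on the discrete classes, and its decoded state sequence mapped back to continuous values for evaluation.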
Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering BIBAFull-Text 485-492
  Arman Savran; Houwei Cao; Miraj Shah; Ani Nenkova; Ragini Verma
We present experiments on fusing facial video, audio and lexical indicators for affect estimation during dyadic conversations. We use temporal statistics of texture descriptors extracted from facial video, a combination of various acoustic features, and lexical features to create regression based affect estimators for each modality. The single modality regressors are then combined using particle filtering, by treating these independent regression outputs as measurements of the affect states in a Bayesian filtering framework, where previous observations provide prediction about the current state by means of learned affect dynamics. Tested on the Audio-visual Emotion Recognition Challenge dataset, our single modality estimators achieve substantially higher scores than the official baseline method for every dimension of affect. Our filtering-based multi-modality fusion achieves correlation performance of 0.344 (baseline: 0.136) and 0.280 (baseline: 0.096) for the fully continuous and word level sub challenges, respectively.
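The fusion step described, treating independent per-modality regression outputs as noisy measurements of a latent affect state, can be sketched with a basic bootstrap particle filter. The random-walk dynamics and the noise parameters below are assumptions made for illustration; the paper instead uses learned affect dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_fusion(measurements, n_particles=500,
                           trans_std=0.05, meas_std=0.3):
    """Fuse per-modality affect estimates (columns of `measurements`,
    shape T x M) as noisy observations of a latent affect state."""
    T, M = measurements.shape
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = np.empty(T)
    for t in range(T):
        # Prediction: random-walk affect dynamics (an assumption here).
        particles = particles + rng.normal(0.0, trans_std, n_particles)
        # Update: weight particles by the likelihood of each modality's output.
        log_w = np.zeros(n_particles)
        for m in range(M):
            log_w += -0.5 * ((measurements[t, m] - particles) / meas_std) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates[t] = np.sum(w * particles)   # posterior mean estimate
        # Resample to avoid weight degeneracy.
        particles = particles[rng.choice(n_particles, n_particles, p=w)]
    return estimates
```

Each modality's regressor contributes one measurement column, so weak modalities are down-weighted automatically by their disagreement with the filtered state.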
A multimodal fuzzy inference system using a continuous facial expression representation for emotion detection BIBAFull-Text 493-500
  Catherine Soladié; Hanan Salam; Catherine Pelachaud; Nicolas Stoiber; Renaud Séguier
This paper presents a multimodal fuzzy inference system for emotion detection. The system extracts and merges visual, acoustic and context-relevant features. The experiments have been performed as part of the AVEC 2012 challenge. Facial expressions play an important role in emotion detection. However, having an automatic system detect facial emotional expressions on unknown subjects is still a challenging problem. Here, we propose a method that adapts to the morphology of the subject and that is based on an invariant representation of facial expressions. Our method relies on 8 key emotional expressions of the subject. In our system, each image of a video sequence is defined by its relative position to these 8 expressions. These 8 expressions are synthesized for each subject from plausible distortions learnt on other subjects and transferred onto the neutral face of the subject. Expression recognition in a video sequence is performed in this space with a basic intensity-area detector. The emotion is described in four dimensions: valence, arousal, power and expectancy. The results show that the duration of a high-intensity smile is meaningful for continuous valence detection and can also be used to improve arousal detection. The main variations in power and expectancy are given by context data.
Robust continuous prediction of human emotions using multiscale dynamic cues BIBAFull-Text 501-508
  Jérémie Nicolle; Vincent Rapp; Kévin Bailly; Lionel Prevost; Mohamed Chetouani
Designing systems able to interact with humans in a natural manner is a complex and far from solved problem. A key aspect of natural interaction is the ability to understand and appropriately respond to human emotions. This paper details our response to the Audio/Visual Emotion Challenge (AVEC'12) whose goal is to continuously predict four affective signals describing human emotions (namely valence, arousal, expectancy and power). The proposed method uses log-magnitude Fourier spectra to extract multiscale dynamic descriptions of signals characterizing global and local face appearance as well as head movements and voice. We perform a kernel regression with very few representative samples selected via a supervised weighted-distance-based clustering, that leads to a high generalization power. For selecting features, we introduce a new correlation-based measure that takes into account a possible delay between the labels and the data and significantly increases robustness. We also propose a particularly fast regressor-level fusion framework to merge systems based on different modalities. Experiments have proven the efficiency of each key point of the proposed method and we obtain very promising results.
Elastic net for paralinguistic speech recognition BIBAFull-Text 509-516
  Pouria Fewzee; Fakhri Karray
Given that the feature vectors used for paralinguistic speech recognition now commonly exceed several thousand dimensions, a sparse model representation becomes important: sparse models offer greater interpretability, higher generalization capability, and better numerical efficiency. In this work, as an endeavor to find a sparse representation of the speech features used for paralinguistic modeling, we make use of the elastic net. As benchmarks, we use the frameworks of the second audio/visual emotion challenge and the Interspeech 2012 speaker trait challenge. We also propose the use of part-of-speech tags as syntactic features of speech for emotional speech recognition. The results show that despite the relatively small number of features used for the modeling tasks, the generalization capability of the suggested models is comparable to that of models using thousands of features and more elaborate learning algorithms.
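The elastic net referred to above combines L1 and L2 penalties to produce sparse yet stable regression weights. A minimal sketch using proximal gradient descent follows; the solver and the penalty strengths are illustrative assumptions, not necessarily what the authors used.

```python
import numpy as np

def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=500):
    """Minimize 0.5/n * ||y - Xw||^2 + alpha * (l1_ratio * ||w||_1
    + 0.5 * (1 - l1_ratio) * ||w||^2) by proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    # Lipschitz constant of the smooth part (squared loss + L2 term).
    L = np.linalg.norm(X, 2) ** 2 / n + alpha * (1 - l1_ratio)
    step = 1.0 / L
    thr = step * alpha * l1_ratio
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n + alpha * (1 - l1_ratio) * w
        w = w - step * grad
        # Soft-thresholding: the proximal operator of the L1 penalty,
        # which zeroes out small coefficients and yields sparsity.
        w = np.sign(w) * np.maximum(np.abs(w) - thr, 0.0)
    return w
```

On high-dimensional paralinguistic feature sets, the L1 term selects a small subset of informative features while the L2 term keeps correlated features grouped rather than arbitrarily dropped.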
Improving generalisation and robustness of acoustic affect recognition BIBAFull-Text 517-522
  Florian Eyben; Björn Schuller; Gerhard Rigoll
Emotion recognition in real-life conditions faces several challenging factors which most studies on emotion recognition do not consider, for example background noise, varying recording levels, and the acoustic properties of the environment. This paper presents a systematic evaluation of the influence of background noise of various types and SNRs, as well as recording-level variations, on the performance of automatic emotion recognition from speech. Both natural and spontaneous as well as acted/prototypical emotions are considered. Besides the well-known influence of additive noise, a significant influence of the recording level on recognition performance is observed. Multi-condition learning with various noise types and recording levels is proposed as a way to increase the robustness of methods based on standard acoustic feature sets and commonly used classifiers. It is compared to matched-conditions learning and is found to be almost on par with it for many settings.
Preserving actual dynamic trend of emotion in dimensional speech emotion recognition BIBAFull-Text 523-528
  Wenjing Han; Haifeng Li; Florian Eyben; Lin Ma; Jiayin Sun; Björn Schuller
In this paper, we use the concept of the dynamic trend of emotion to describe how a human's emotion changes over time, which is believed to be important for understanding one's stance toward the current topic in interactions. To the best of our knowledge, however, this concept has not received enough attention in the field of speech emotion recognition (SER). This paper therefore aims to draw researchers' attention to the concept and makes a first effort toward predicting the correct dynamic trend of emotion in SER. Specifically, we propose a novel algorithm named Order Preserving Network (OPNet). First, as the key issue in OPNet construction, we employ a probabilistic method to define an emotion-trend-sensitive loss function. Then, a nonlinear neural network is trained with gradient descent to minimize the constructed loss. We validated the prediction performance of OPNet on the VAM corpus, using mean linear error as well as a rank correlation coefficient γ as measures. Compared to k-Nearest Neighbor and support vector regression, the proposed OPNet better preserves the actual dynamic trend of emotion.
Negative sentiment in scenarios elicit pupil dilation response: an auditory study BIBAFull-Text 529-532
  Serdar Baltaci; Didem Gokcay
Earlier investigations of pupil size variation during auditory stimulation involved delivery of simple auditory tones. In this study, we investigated pupil size variation in response to auditory stimuli consisting of verbal stories. Twenty-four participants listened to scenarios through headphones while sitting in front of an eye tracker and rated the stories on two bipolar dimensions: emotional valence and arousal. We analyzed the pupil data after participants finished the subjective ratings of each scenario. The results showed that pupil size was significantly larger during the story carrying strongly negative sentiment than during the story with neutral content. To sum up, our results showed that systematically designed scenarios with highly arousing negative sentiment significantly affected the subjects' physiological pupil reaction as well as their subjective experiences. In the future, it might be possible to use pupil size variation as a predictor of the sentimental content of scenarios. Auditory emotion-related cues might also be utilized to modulate users' emotional reactions in affective computing.

Challenge 2: haptic voice recognition grand challenge

Design and implementation of the note-taking style haptic voice recognition for mobile devices BIBAFull-Text 533-538
  Seungwhan Moon; Khe Chai Sim
This research proposes the "note-taking style" Haptic Voice Recognition (HVR) technology which incorporates speech and touch sensory inputs in a note-like form to enhance the performance of speech recognition. A note is taken from a user via two different haptic input methods -- handwriting and a keyboard. A note consists of some of the keywords in the given utterance, either partially spelled or fully spelled. In order to facilitate fast input, the interface allows a shorthand writing system such as Gregg Shorthand. Using this haptic note sequence as an additional knowledge source, the algorithm re-ranks the n-best list generated by a speech engine. The simulation and experimental results show that the proposed HVR method improves the Word Error Rate (WER) and Keyword Error Rate (KER) performance in comparison to an Automatic Speech Recognition (ASR) system. Although it introduces an inevitable increase in speech duration due to disfluency and occasional mistakes in haptic input, this overhead is shown to be smaller than that of conventional HVR methods. As such, this new note-taking style HVR interaction has the potential to be both natural and effective in increasing the recognition performance by choosing the most likely utterance among multiple hypotheses. This paper discusses the algorithm for the proposed system, the results from the simulation and the experiments, and the possible applications of this new technology such as aiding spoken document retrieval with haptic notes.
Development of the 2012 SJTU HVR system BIBAFull-Text 539-544
  Hainan Xu; Yuchen Fan; Kai Yu
Haptic voice recognition (HVR) is a multi-modal text entry method for smart mobile devices. It employs haptic events generated by speakers during speaking to achieve better efficiency and robustness for automatic speech recognition. This paper describes the detailed design of the 2012 SJTU submission for the HVR Grand Challenge. During the design, a new perplexity metric using conditional entropy is proposed to evaluate the potential search space reduction of a haptic event without speech input. A number of new haptic events are evaluated both theoretically and experimentally in detail. The final submission system uses the haptic event of initial letter plus final letter and reduces word error rate by 76% compared to the baseline initial letter event.
Improving mandarin predictive text input by augmenting pinyin initials with speech and tonal information BIBAFull-Text 545-550
  Guangsen Wang; Bo Li; Shilin Liu; Xuancong Wang; Xiaoxuan Wang; Khe Chai Sim
Recently, a new technology called Haptic Voice Recognition (HVR) was proposed to enhance speech recognition efficiency and accuracy for modern mobile devices, and it has been successfully applied to robust English voice recognition. As both Pinyin and handwriting input methods are quite slow on mobile devices because of typing errors and ambiguity, it is interesting to apply this technology to assist Mandarin predictive text input. However, this is not straightforward, because the characteristics of Mandarin differ significantly from those of alphabetic Western languages. In this paper, we investigated which haptic inputs are important and how this information can be incorporated to improve Mandarin text input. Various experiments were conducted, and the results show that with the help of acoustic and tonal information, the ambiguity of Pinyin-initial-based Mandarin predictive text input is largely reduced, and an oracle character error rate of 3.8% can be achieved for the top 4 candidates, which is typically the number of word candidates displayed on mobile devices. Our Mandarin HVR system has also shown its robustness in noisy environments.
LUI: lip in multimodal mobile GUI interaction BIBAFull-Text 551-554
  Maryam Azh; Shengdong Zhao
Gesture-based interactions are commonly used in mobile and ubiquitous environments. Multimodal interaction techniques use lip gestures to enhance speech recognition or to control mouse movement on the screen. In this paper we extend previous work to explore LUI: lip gestures as an alternative input technique for controlling user interface elements in a ubiquitous environment. In addition to using lip gestures to control cursor movement, we use them to control music players and activate menus. A LUI Motion-Action library is also provided to guide future interaction design using lip gestures.
Speak-as-you-swipe (SAYS): a multimodal interface combining speech and gesture keyboard synchronously for continuous mobile text entry BIBAFull-Text 555-560
  Khe Chai Sim
Modern mobile devices, such as smartphones and tablets, are becoming increasingly popular amongst users of all ages. Text entry is one of the most important modes of interaction between humans and their mobile devices. Although typing on a touchscreen display using a soft keyboard remains the most common text input method for many users, the process can be frustratingly slow, especially on smartphones with a much smaller screen. Voice input offers an attractive alternative that completely eliminates the need for typing. However, voice input relies on automatic speech recognition technology, whose performance degrades significantly in noisy environments or for non-native users. This paper presents Speak-As-You-Swipe (SAYS), a novel multimodal interface that enables efficient continuous text entry on mobile devices. SAYS integrates a gesture keyboard with speech recognition to improve the efficiency and accuracy of text entry. The swipe gesture and voice inputs provide complementary information that can be very effective in disambiguating confusions in word predictions. The word prediction hypotheses from a gesture keyboard are directly incorporated into the speech recognition process so that the SAYS interface can handle continuous input. Experimental results show that for a 20k vocabulary, the proposed SAYS interface can achieve a prediction accuracy of 96.4% in clean conditions and about 94.0% in noisy environments, compared to 92.2% using a gesture keyboard alone.

Challenge 3: BCI grand challenge: brain-computer interfaces as intelligent sensors for enhancing human-computer interaction

Interpersonal biocybernetics: connecting through social psychophysiology BIBAFull-Text 561-566
  Alan T. Pope; Chad L. Stephens
One embodiment of biocybernetic adaptation is a human-computer interaction system designed such that physiological signals modulate the effect that control of a task by other means, usually manual control, has on performance of the task. Such a modulation system enables a variety of human-human interactions based upon physiological self-regulation performance. These interpersonal interactions may be mixes of competition and cooperation for simulation training and/or videogame entertainment.
Adaptive EEG artifact rejection for cognitive games BIBAFull-Text 567-570
  Olexiy Kyrgyzov; Antoine Souloumiac
The separation of the informative part from an observed dataset is a significant step for dimension reduction and feature extraction. In this paper, we present an approach for adaptive artifact rejection from the electroencephalogram (EEG). The main aim of our work is to increase the performance of classification algorithms that operate on EEG and are used in cognitive games. We provide a method to separate the EEG into informative and noisy parts, select the informative one, and rank its dimensions. The proposed approach is based on the theoretical relation between classification accuracy, mutual information and the normalized graph cut (NC) value. The presented algorithm requires an a priori class-labeled EEG dataset, which is utilized for the calibration phase of the brain-computer interface (BCI). Experimental results on datasets from BCI competitions show its applicability to cognitive games.
Construction of the biocybernetic loop: a case study BIBAFull-Text 571-578
  Stephen Fairclough; Kiel Gilleade
The biocybernetic loop describes the data processing protocol at the heart of all physiological computing systems. The loop also encompasses the goals of the system design with respect to the anticipated impact of the adaptation on user behaviour. There are numerous challenges facing the designer of a biocybernetic loop in terms of measurement, data processing and adaptive design. These challenges are multidisciplinary in nature spanning psychology and computer science. This paper is concerned with the design process of the biocybernetic loop. A number of criteria for an effective loop are described followed by a six-stage design cycle. The challenges faced by the designer at each stage of the design process are exemplified with reference to a case study where EEG data were used to adapt a computer game.
An interactive control strategy is more robust to non-optimal classification boundaries BIBAFull-Text 579-586
  Virginia R. de Sa
We consider a new paradigm for EEG-based brain computer interface (BCI) cursor control involving signaling satisfaction or dissatisfaction with the current motion direction instead of the usual direct control of signaling rightward or leftward desired motion. We start by assuming that the same underlying EEG signals are used to either signal directly the intent for right and left motion or to signal satisfaction and dissatisfaction with the current motion. We model the paradigm as an absorbing Markov chain and show that while both the standard system and the new interactive system have equal information transfer rate (ITR) when the Bayes optimal classification boundary (between the underlying EEG feature distributions used for the two classes) is exactly known and non-changing, the interactive system is much more robust to using a suboptimal classification boundary. Due to non-stationarity of EEG recordings, in real systems the classification boundary will often be suboptimal for the current EEG signals. We note that a variable step size gives a higher ITR for both systems (but the same robustness improvement of the interactive system remains). Finally, we present a way to probabilistically combine classifiers of natural signals of satisfaction and dissatisfaction with classifiers using standard left/right controls.
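For context, the information transfer rate (ITR) used to compare the two control schemes is commonly computed from per-selection classification accuracy. The generic Wolpaw-style formula below is a standard definition, not the paper's absorbing-Markov-chain analysis:

```python
import math

def itr_bits_per_trial(p, n_classes=2):
    """Wolpaw-style ITR in bits per selection for accuracy p over
    n_classes equally likely targets, with errors spread uniformly
    over the remaining classes."""
    if p <= 1.0 / n_classes:
        return 0.0                      # chance level carries no information
    if p >= 1.0:
        return math.log2(n_classes)     # perfect classification
    return (math.log2(n_classes)
            + p * math.log2(p)
            + (1 - p) * math.log2((1 - p) / (n_classes - 1)))
```

Under this measure, a drop in effective accuracy caused by a suboptimal classification boundary translates directly into lost bits per trial, which is the degradation the interactive paradigm is shown to resist.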
Improving BCI performance after classification BIBAFull-Text 587-594
  Danny Plass-Oude Bos; Hayrettin Gürkök; Boris Reuderink; Mannes Poel
Brain-computer interfaces offer a valuable input modality, which unfortunately also comes with a high degree of uncertainty. There are simple methods to improve detection accuracy after the incoming brain activity has already been classified, which can be divided into (1) gathering additional evidence from other sources of information, and (2) transforming the unstable classification results to be easier to control. The methods described are easy to implement, but it is essential to apply them in the right way. This paper provides an overview of the different techniques, showing where to apply them and comparing their effects. Detection accuracy is important, but there are trade-offs to consider. Future research should investigate the effectiveness of these methods in their context of use, as well as the optimal settings to obtain the right balance between functionality and meeting the user's expectations for maximum acceptance.
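The second family of methods mentioned, transforming unstable classification results into something easier to control, can be sketched as exponential smoothing followed by hysteresis on a per-frame probability stream. The parameter values here are illustrative assumptions, not recommendations from the paper:

```python
def stabilize(probabilities, alpha=0.2, on=0.7, off=0.3):
    """Smooth a noisy stream of per-frame class probabilities with an
    exponential moving average, then apply hysteresis thresholds so the
    binary control state does not flicker around a single cutoff."""
    smoothed, state = 0.5, False
    states = []
    for p in probabilities:
        smoothed = (1 - alpha) * smoothed + alpha * p
        if state and smoothed < off:
            state = False     # sustained low evidence switches off
        elif not state and smoothed > on:
            state = True      # sustained high evidence switches on
        states.append(state)
    return states
```

The gap between the `on` and `off` thresholds trades responsiveness for stability: a wider gap suppresses more classifier noise but delays intentional state changes.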
Electroencephalographic detection of visual saliency of motion towards a practical brain-computer interface for video analysis BIBAFull-Text 601-606
  Matthew Weiden; Deepak Khosla; Matthew Keegan
Though Electroencephalography (EEG)-based brain-computer interfaces (BCI) have come to outperform pure computer vision algorithms on difficult image triage tasks, none of these BCIs have leveraged the effects of motion on the human visual attention system. Here we consider the advantages of leveraging the effects of motion by testing a new method for EEG-based target detection using Rapid Serial Visual Presentation (RSVP) of short moving image clips. Comparatively, canonical methods present the operator with still images only. Our experiments show that presenting moving images in RSVP instead of still images significantly increases performance (p < 0.0005). In addition, we investigate the effect of increasing target load in RSVP of moving image clips and find some support for the hypothesis that detection will decrease as target load is increased.

Workshop overview

Workshop on speech and gesture production in virtually and physically embodied conversational agents BIBAFull-Text 607-608
  Ross Mead; Maha Salem
This full-day workshop aims to bring together researchers from the embodied conversational agent (ECA) and sociable robotics communities to spark discussion and collaboration between the related fields. The focus of the workshop is on co-verbal behavior production -- specifically, synchronized speech and gesture -- for either virtually or physically embodied platforms. It examines the planning and realization of multimodal behavior production, and the topics discussed highlight common as well as distinguishing factors of the implementations in each respective field.
1st international workshop on multimodal learning analytics: extended abstract BIBAFull-Text 609-610
  Stefan Scherer; Marcelo Worsley; Louis-Philippe Morency
This summary describes the 1st International Workshop on Multimodal Learning Analytics. This area of study brings together the technologies of multimodal analysis with the learning sciences. The intersection of these domains should enable researchers to foster an improved understanding of student learning, lead to the creation of more natural and enriching learning interfaces, and motivate the development of novel techniques for tackling challenges that are specific to education.
4th workshop on eye gaze in intelligent human machine interaction: eye gaze and multimodality BIBAFull-Text 611-612
  Yukiko I. Nakano; Kristiina Jokinen; Hung-Hsuan Huang
This is the fourth workshop in a series of workshops on Eye Gaze in Intelligent Human Machine Interaction, in which we have discussed a wide range of issues for eye gaze: technologies for sensing human attentional behaviors, the roles of attentional behaviors as social gaze in human-human and human-humanoid interaction, attentional behaviors in problem solving and task performing, gaze-based intelligent user interfaces, and the evaluation of gaze-based user interfaces. In addition to these topics, this year's workshop focuses on eye gaze in multimodal interpretation and generation. Since eye gaze is one of the facial communication modalities, gaze information can be combined with other modalities or bodily motions to contribute to the meaning of an utterance and to serve as a communication signal.
The 3rd international workshop on social behaviour in music: SBM2012 BIBAFull-Text 613-614
  Antonio Camurri; Donald Glowinski; Maurizio Mancini; Giovanna Varni; Gualtiero Volpe
Since its first edition in 2009, the International Workshop on Social Behaviour in Music (SBM) has been an occasion for researchers and practitioners to discuss recent advances in automated analysis of social behaviour, with music as the selected test-bed and application scenario. The first edition of SBM was held in Vancouver, Canada, in the framework of the 2009 IEEE International Conference on Social Computing (SocialCom 2009). The second one was held in Genova, Italy, in the framework of the 4th International ICST Conference on Intelligent Technologies for Interactive Entertainment (Intetain 2011). SBM is now at its third edition, which takes place in the framework of the 14th International Conference on Multimodal Interaction (ICMI 2012), Santa Monica, California, USA. Again, SBM aims at providing a picture of current research breakthroughs and issues, while giving directions for future work and collaborations.
Smart material interfaces: a material step to the future BIBAFull-Text 615-616
  Anton Nijholt; Leonardo Giusti; Andrea Minuto; Patrizia Marti
Over the past years, the technology push and the creation of new technological materials have brought many new smart materials to the market. Smart Material Interfaces (SMIs) aim to take advantage of these materials to overcome traditional patterns of interaction, leaving behind the "digital feeling" in favor of a more continuous space of interaction that tightly couples the digital and physical worlds by means of the smart materials' properties. With this workshop on SMIs, we want to draw attention to this emerging field, stimulating research and development in interfaces that make use of smart materials and encouraging new and different modalities of interaction.