
Proceedings of the 2006 International Conference on Multimodal Interfaces

Fullname: ICMI'06 Proceedings of the 8th International Conference on Multimodal Interfaces
Editors: Francis Quek; Jie Yang; Dominic Massaro; Abeer Alwan; Timothy J. Hazen
Location: Banff, Alberta, Canada
Dates: 2006-Nov-02 to 2006-Nov-04
Standard No: ISBN 1-59593-541-X; ACM DL: Table of Contents; hcibib: ICMI06
  1. Keynote
  2. Poster Session 1
  3. Oral session 1: speech and gesture integration
  4. Oral session 2: perception and feedback
  5. Demonstration session
  6. Special poster session on human computing
  7. Oral session 3: language understanding and content analysis
  8. Oral session 4: collaborative systems and environments
  9. Special oral session: special session on human computing
  10. Poster session 2
  11. Oral session 5: speech and dialogue systems
  12. Oral session 6: interfaces and usability
  13. Panel


Weight, weight, don't tell me BIBAFull-Text 1
  Ted Warburton
Remember the "Internet's firstborn," Ron Lussier's dancing baby from 1996? Other than a vague sense of repeated gyrations, no one can recall any of the movements in particular. Why is that? While that animation was ground-breaking in many respects, to paraphrase a great writer, there was no there there. The dancing baby lacked personality because the movements themselves lacked "weight." Each human being has a unique perceivable movement style composed of repeated recognizable elements that in combination and phrasing capture the liveliness of movement. The use of weight, or "effort quality," is a key element in movement style, defining a dynamic expressive range. In computer representation of human movement, however, weight is often an aspect of life-ness that gets diminished or lost in the process, contributing to a lack of groundedness, personality, and verisimilitude. In this talk, I unpack the idea of effort quality and describe current work with motion capture and telematics that puts the weight back on interface design.
Movement and music: designing gestural interfaces for computer-based musical instruments BIBAFull-Text 2
  Sile O'Modhrain
The concept of body-mediated or embodied interaction, of the coupling of interface and actor, has become increasingly relevant within the domain of HCI. With the reduced size and cost of a wide variety of sensor technologies and the ease with which they can be wirelessly deployed, on the body, in devices we carry with us and in the environment, comes the opportunity to use a wide range of human motion as an integral part of our interaction with many applications. While movement is potentially a rich, multidimensional source of information upon which interface designers can draw, its very richness poses many challenges in developing robust motion capture and gesture recognition systems. In this talk, I will suggest that lessons learned by designers of computer-based musical instruments whose task is to translate expressive movement into nuanced control of sound may now help to inform the design of movement-based interfaces for a much wider range of applications.
Mixing virtual and actual BIBAFull-Text 3
  Herbert H. Clark
People often communicate with a mixture of virtual and actual elements. On the telephone, my sister and I and what we say are actual, even though our voices are virtual. In the London Underground, the warning expressed in the recording "Stand clear of the doors" is actual, even though the person making it is virtual. In the theater, Shakespeare, the actors, and I are actual, even though Romeo and Juliet and what they say are virtual. Mixtures like these cannot be accounted for in standard models of communication, for a variety of reasons. In this talk I introduce the notion of displaced actions (as on the telephone, in the London Underground, and in the theater) and characterize how they are used and interpreted in communication with a range of modern-day technologies.

Poster Session 1

Collaborative multimodal photo annotation over digital paper BIBAKFull-Text 4-11
  Paulo Barthelmess; Edward Kaiser; Xiao Huang; David McGee; Philip Cohen
The availability of metadata annotations over media content such as photos is known to enhance retrieval and organization, particularly for large data sets. The greatest challenge for obtaining annotations remains getting users to perform the large amount of tedious manual work that is required.
   In this paper we introduce an approach for semi-automated labeling based on extraction of metadata from naturally occurring conversations of groups of people discussing pictures among themselves.
   As the burden for structuring and extracting metadata is shifted from users to the system, new recognition challenges arise. We explore how multimodal language can help in 1) detecting a concise set of meaningful labels to be associated with each photo, 2) achieving robust recognition of these key semantic terms, and 3) facilitating label propagation via multimodal shortcuts. Analysis of data from a preliminary pilot collection suggests that handwritten labels may be highly indicative of the semantics of each photo, as indicated by the correlation of handwritten terms with high-frequency spoken ones. We point to initial directions exploring a multimodal fusion technique to recover robust spelling and pronunciation of these high-value terms from redundant speech and handwriting.
Keywords: automatic label extraction, collaborative interaction, intelligent interfaces, multimodal processing, photo annotation
MyConnector: analysis of context cues to predict human availability for communication BIBAKFull-Text 12-19
  Maria Danninger; Tobias Kluge; Rainer Stiefelhagen
In this thriving world of mobile communications, the difficulty of communication is no longer contacting someone, but rather contacting people in a socially appropriate manner. Ideally, senders should have some understanding of a receiver's availability in order to make contact at the right time, in the right contexts, and with the optimal communication medium.
   We describe the design and implementation of MyConnector, an adaptive and context-aware service designed to facilitate efficient and appropriate communication, based on each party's availability. One of the chief design questions of such a service is to produce technologies with sufficient contextual awareness to decide upon a person's availability for communication. We present results from a pilot study comparing a number of context cues and their predictive power for gauging one's availability.
Keywords: availability, computer-mediated communication, context-aware communication, interruptibility, user models
Human perception of intended addressee during computer-assisted meetings BIBAKFull-Text 20-27
  Rebecca Lunsford; Sharon Oviatt
Recent research aims to develop new open-microphone engagement techniques capable of identifying when a speaker is addressing a computer versus human partner, including during computer-assisted group interactions. The present research explores: (1) how accurately people can judge whether an intended interlocutor is a human versus computer, (2) which linguistic, acoustic-prosodic, and visual information sources they use to make these judgments, and (3) what type of systematic errors are present in their judgments. Sixteen participants were asked to determine a speaker's intended addressee based on actual videotaped utterances matched on illocutionary force, which were played back as: (1) lexical transcriptions only, (2) audio-only, (3) visual-only, and (4) audio-visual information. Perhaps surprisingly, people's accuracy in judging human versus computer addressees did not exceed chance levels with lexical-only content (46%). As predicted, accuracy improved significantly with audio (58%), visual (57%), and especially audio-visual information (63%). Overall, accuracy in detecting human interlocutors was significantly worse than judging computer ones, and specifically worse when only visual information was present because speakers often looked at the computer when addressing peers. In contrast, accuracy in judging computer interlocutors was significantly better whenever visual information was present than with audio alone, and it yielded the highest accuracy levels observed (86%). Questionnaire data also revealed that speakers' gaze, peers' gaze, and tone of voice were considered the most valuable information sources. These results reveal that people rely on cues appropriate for interpersonal interactions in determining computer- versus human-directed speech during mixed human-computer interactions, even though this degrades their accuracy. 
   Future systems that process actual rather than expected communication patterns could potentially be designed to perform better than humans.
Keywords: acoustic-prosodic cues, dialogue style, gaze, human-computer teamwork, intended addressee, multiparty interaction, open-microphone engagement
Automatic detection of group functional roles in face to face interactions BIBAKFull-Text 28-34
  Massimo Zancanaro; Bruno Lepri; Fabio Pianesi
In this paper, we discuss a machine learning approach to automatically detecting the functional roles played by participants in a face to face interaction. We briefly introduce the coding scheme we used to classify the roles of the group members, and the corpus we collected to assess the coding scheme's reliability as well as to train statistical systems for automatic recognition of roles. We then discuss a machine learning approach based on multi-class SVMs that detects such roles from simple features of the visual and acoustic scene. The classification outperforms the chosen baselines, and although the results are not yet good enough for a real application, they demonstrate the feasibility of detecting group functional roles in face to face interactions.
Keywords: group interaction, intelligent environments, support vector machines
Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech BIBAKFull-Text 35-38
  Hari Krishna Maganti; Daniel Gatica-Perez
Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive studies on the degradation of automatic speech recognition (ASR) as a function of speaker location accuracy in a multi-party scenario exist. In this paper, we describe a framework for evaluation of the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos in different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, lapel microphones, and speaker location based on audio-only and audio-visual information approaches.
Keywords: audio-visual speaker tracking, microphone array ASR
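The triangulation step described in the abstract above can be sketched in two dimensions: each annotated camera view yields a bearing toward the speaker, and the location is the intersection of the bearing rays. This is an illustrative sketch, not the paper's implementation; the camera positions and angles are invented.

```python
import math

def triangulate_2d(cam1, bearing1, cam2, bearing2):
    """Intersect two bearing rays from known camera positions.

    Bearings are angles in radians measured from the +x axis.
    Returns the (x, y) intersection point of the two rays.
    """
    x1, y1 = cam1
    x2, y2 = cam2
    d1 = (math.cos(bearing1), math.sin(bearing1))
    d2 = (math.cos(bearing2), math.sin(bearing2))
    # Solve cam1 + t*d1 = cam2 + s*d2 for t using a 2x2 cross product.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("bearings are parallel; no unique intersection")
    t = ((x2 - x1) * d2[1] - (y2 - y1) * d2[0]) / denom
    return (x1 + t * d1[0], y1 + t * d1[1])
```

With more than two camera views, a real system would instead minimize the reprojection error across all rays in 3-D, since annotated bearings are noisy and the rays rarely intersect exactly.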
Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't BIBAKFull-Text 39-42
  Cosmin Munteanu; Gerald Penn; Ron Baecker; Yuecheng Zhang
The increased availability of broadband connections has recently led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. One challenge to skimming and browsing through such archives is the lack of text transcripts of the webcast's audio channel. This paper describes a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts at any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a user study showing that transcripts with WERs under 25% are acceptable for use in webcast archives. As current ASR systems can only deliver WERs of around 45% in realistic conditions, we also describe a solution for reducing the WER of such transcripts by engaging users to collaborate in a "wiki" fashion on editing the imperfect transcripts obtained through ASR.
Keywords: automatic speech recognition, collaboration, webcasts
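Word Error Rate, the metric this study calibrates against, is conventionally computed as the word-level Levenshtein distance between reference and hypothesis transcripts, normalized by the number of reference words. A minimal sketch (not the authors' code):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) at
    unit cost, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why thresholds like the 25% figure above are stated relative to reference length.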
Cross-modal coordination of expressive strength between voice and gesture for personified media BIBAKFull-Text 43-50
  Tomoko Yonezawa; Noriko Suzuki; Shinji Abe; Kenji Mase; Kiyoshi Kogure
The aim of this paper is to clarify the relationship between the expressive strengths of gestures and voice for embodied and personified interfaces. We conduct perceptual tests using a puppet interface, while controlling singing-voice expressions, to empirically determine the naturalness and strength of various combinations of gesture and voice. The results show that (1) the strength of cross-modal perception is affected more by gestural expression than by the expressions of a singing voice, and (2) the appropriateness of cross-modal perception is affected by expressive combinations between singing voice and gestures in personified expressions. As a promising solution, we propose balancing a singing voice and gestural expressions by expanding and correcting the width and shape of the curve of expressive strength in the singing voice.
Keywords: cross-modality, perceptual experiment, personified puppet-interface, vocal-gestural expression
VirtualHuman: dialogic and affective interaction with virtual characters BIBAKFull-Text 51-58
  Norbert Reithinger; Patrick Gebhard; Markus Löckelt; Alassane Ndiaye; Norbert Pfleger; Martin Klesen
Natural multimodal interaction with realistic virtual characters provides rich opportunities for entertainment and education. In this paper we present the current VirtualHuman demonstrator system. It provides a knowledge-based framework to create interactive applications in a multi-user, multi-agent setting. The behavior of the virtual humans and objects in the 3D environment is controlled by interacting affective conversational dialogue engines. An elaborate model of affective behavior adds natural emotional reactions and presence of the virtual humans. Actions are defined in a XML-based markup language that supports the incremental specification of synchronized multimodal output. The system was successfully demonstrated during CeBIT 2006.
Keywords: AI techniques & adaptive multimodal interfaces, mobile, tangible & virtual/augmented multimodal interfaces, multimodal input and output interfaces, speech and conversational interfaces
From vocal to multimodal dialogue management BIBAKFull-Text 59-67
  Miroslav Melichar; Pavel Cenek
Multimodal, speech-enabled systems pose different research problems when compared to unimodal, voice-only dialogue systems. One of the important issues is the question of what a multimodal interface should look like in order to make multimodal interaction natural and smooth, while keeping it manageable from the system perspective. Another central issue concerns algorithms for multimodal dialogue management. This paper presents a solution that relies on adapting an existing unimodal, vocal dialogue management framework to make it able to cope with multimodality. An experimental multimodal system, Archivus, is described together with a discussion of the required changes to the unimodal dialogue management algorithms. Results of pilot Wizard of Oz experiments with Archivus, focusing on system efficiency and user behaviour, are presented.
Keywords: Wizard of Oz, dialogue management, dialogue systems, graphical user interface (GUI), human computer interaction (HCI), multimodal systems, rapid dialogue prototyping
Human-Robot dialogue for joint construction tasks BIBAKFull-Text 68-71
  Mary Ellen Foster; Tomas By; Markus Rickert; Alois Knoll
We describe a human-robot dialogue system that allows a human to collaborate with a robot agent on assembling construction toys. The human and the robot are fully equal peers in the interaction, rather than simply partners. Joint action is supported at all stages of the interaction: the participants agree on a construction task, jointly decide how to proceed with the task, and also implement the selected plans jointly. The symmetry provides novel challenges for a dialogue system, and also makes it possible for findings from human-human joint-action dialogues to be easily implemented and tested.
Keywords: human-robot interaction, multimodal dialogue
roBlocks: a robotic construction kit for mathematics and science education BIBAKFull-Text 72-75
  Eric Schweikardt; Mark D. Gross
We describe work in progress on roBlocks, a computational construction kit that encourages users to experiment and play with a collection of sensor, logic and actuator blocks, exposing them to a variety of advanced concepts including kinematics, feedback and distributed control. Its interface presents novice users with a simple, tangible set of robotic blocks, whereas advanced users work with software tools to analyze and rewrite the programs embedded in each block. Early results suggest that roBlocks may be an effective vehicle to expose young people to complex ideas in science, technology, engineering and mathematics.
Keywords: construction kit, robotics education, tangible interface

Oral session 1: speech and gesture integration

GSI demo: multiuser gesture/speech interaction over digital tables by wrapping single user applications BIBAKFull-Text 76-83
  Edward Tse; Saul Greenberg; Chia Shen
Most commercial software applications are designed for a single user using a keyboard/mouse over an upright monitor. Our interest is exploiting these systems so they work over a digital table. Mirroring what people do when working over traditional tables, we want to allow multiple people to interact naturally with the tabletop application and with each other via rich speech and hand gestures. In previous papers, we illustrated multi-user gesture and speech interaction on a digital table for geospatial applications -- Google Earth, Warcraft III and The Sims. In this paper, we describe our underlying architecture: GSI Demo. First, GSI Demo creates a run-time wrapper around existing single user applications: it accepts and translates speech and gestures from multiple people into a single stream of keyboard and mouse inputs recognized by the application. Second, it lets people use multimodal demonstration -- instead of programming -- to quickly map their own speech and gestures to these keyboard/mouse inputs. For example, continuous gestures are trained by saying "Computer, when I do [one finger gesture], you do [mouse drag]". Similarly, discrete speech commands can be trained by saying "Computer, when I say [layer bars], you do [keyboard and mouse macro]". The end result is that end users can rapidly transform single user commercial applications into a multi-user, multimodal digital tabletop system.
Keywords: digital tables, multimodal input, programming by demonstration
Co-Adaptation of audio-visual speech and gesture classifiers BIBAKFull-Text 84-91
  C. Mario Christoudias; Kate Saenko; Louis-Philippe Morency; Trevor Darrell
The construction of robust multimodal interfaces often requires large amounts of labeled training data to account for cross-user differences and variation in the environment. In this work, we investigate whether unlabeled training data can be leveraged to build more reliable audio-visual classifiers through co-training, a multi-view learning algorithm. Multimodal tasks are good candidates for multi-view learning, since each modality provides a potentially redundant view to the learning algorithm. We apply co-training to two problems: audio-visual speech unit classification, and user agreement recognition using spoken utterances and head gestures. We demonstrate that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities. We also propose a co-adaptation algorithm, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.
Keywords: adaptation, audio-visual speech and gesture, co-training, human-computer interfaces, semi-supervised learning
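Co-training, the multi-view learning algorithm the paper above builds on, can be sketched as a loop in which each view's classifier labels the unlabeled examples it is confident about and hands them to the other view's training set. The nearest-centroid classifiers and one-dimensional features below are invented simplifications for illustration; the paper's classifiers operate on real audio and visual features.

```python
def train_centroids(examples):
    """examples: list of (feature, label); returns per-class mean feature."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Label of the nearest centroid, plus a confidence margin."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

def co_train(labeled_a, labeled_b, unlabeled, rounds=3, threshold=0.5):
    """labeled_a / labeled_b: labeled (feature, label) pairs for each view.
    unlabeled: list of (feature_view_a, feature_view_b) pairs.
    Each round, a confident view labels an example for the other view."""
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        ca, cb = train_centroids(labeled_a), train_centroids(labeled_b)
        remaining = []
        for xa, xb in unlabeled:
            ya, ma = predict(ca, xa)
            yb, mb = predict(cb, xb)
            if ma >= threshold:      # view A confident -> teach view B
                labeled_b.append((xb, ya))
            elif mb >= threshold:    # view B confident -> teach view A
                labeled_a.append((xa, yb))
            else:
                remaining.append((xa, xb))
        unlabeled = remaining
    return train_centroids(labeled_a), train_centroids(labeled_b)
```

The multimodal fit noted in the abstract comes from the two modalities acting as the two (ideally conditionally independent) views required by co-training.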
Towards the integration of shape-related information in 3-D gestures and speech BIBAKFull-Text 92-99
  Timo Sowa
This paper presents a model for the unified semantic representation of shape conveyed by speech and coverbal 3-D gestures. The representation is tailored to capture the semantic contributions of both modalities during free descriptions of objects. It is shown how the semantic content of shape-related adjectives, nouns, and iconic gestures can be modeled and combined when they occur together in multimodal utterances like "a longish bar" + iconic gesture. The model has been applied for the development of a prototype system for gesture recognition and integration with speech.
Keywords: gesture, multimodal integration, shape, speech

Oral session 2: perception and feedback

Which one is better?: information navigation techniques for spatially aware handheld displays BIBAKFull-Text 100-107
  Michael Rohs; Georg Essl
Information navigation techniques for handheld devices support interacting with large virtual spaces on small displays, for example finding targets on a large-scale map. Since only a small part of the virtual space can be shown on the screen at once, typical interfaces allow for scrolling and panning to reach off-screen content. Spatially aware handheld displays sense their position and orientation in physical space in order to provide a corresponding view in virtual space. We implemented various one-handed navigation techniques for camera-tracked spatially aware displays. The techniques are compared in a series of abstract selection tasks that require the investigation of different levels of detail. The tasks are relevant for interfaces that enable navigating large scale maps and finding contextual information on them. The results show that halo is significantly faster than other techniques. In complex situations zoom and halo show comparable performance. Surprisingly, the combination of halo and zooming is detrimental to user performance.
Keywords: camera phones, handheld devices, information navigation, navigation aids, small displays, spatial cognition, spatial interaction, spatially aware displays
Comparing the effects of visual-auditory and visual-tactile feedback on user performance: a meta-analysis BIBAKFull-Text 108-117
  Jennifer L. Burke; Matthew S. Prewett; Ashley A. Gray; Liuquin Yang; Frederick R. B. Stilson; Michael D. Coovert; Linda R. Elliot; Elizabeth Redden
In a meta-analysis of 43 studies, we examined the effects of multimodal feedback on user performance, comparing visual-auditory and visual-tactile feedback to visual feedback alone. Results indicate that adding an additional modality to visual feedback improves performance overall. Both visual-auditory feedback and visual-tactile feedback provided advantages in reducing reaction times and improving performance scores, but were not effective in reducing error rates. Effects are moderated by task type, workload, and number of tasks. Visual-auditory feedback is most effective when a single task is being performed (g = .87), and under normal workload conditions (g = .71). Visual-tactile feedback is more effective when multiple tasks are being performed (g = .77) and workload conditions are high (g = .84). Both types of multimodal feedback are effective for target acquisition tasks, but vary in effectiveness for other task types. Implications for practice and research are discussed.
Keywords: meta-analysis, multimodal interface, visual-auditory feedback, visual-tactile feedback
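The effect sizes reported above (g = .87 and so on) are Hedges' g, a standardized mean difference with a small-sample bias correction applied to Cohen's d. A sketch of the standard formula (the numbers in the test are invented, not taken from the meta-analysis):

```python
import math

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Hedges' g: Cohen's d scaled by the small-sample correction
    J = 1 - 3 / (4*df - 1), with df = n1 + n2 - 2."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (mean1 - mean2) / pooled_sd   # Cohen's d
    j = 1 - 3 / (4 * df - 1)          # bias correction factor
    return j * d
```

The correction matters most for the small per-condition samples typical of HCI experiments; for large n, g converges to d.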
Multimodal estimation of user interruptibility for smart mobile telephones BIBAKFull-Text 118-125
  Robert Malkin; Datong Chen; Jie Yang; Alex Waibel
Context-aware computer systems are characterized by the ability to consider user state information in their decision logic. One example application of context-aware computing is the smart mobile telephone. Ideally, a smart mobile telephone should be able to consider both social factors (i.e., known relationships between contactor and contactee) and environmental factors (i.e., the contactee's current locale and activity) when deciding how to handle an incoming request for communication.
   Toward providing this kind of user state information and improving the ability of the mobile phone to handle calls intelligently, we present work on inferring environmental factors from sensory data and using this information to predict user interruptibility. Specifically, we learn the structure and parameters of a user state model from continuous ambient audio and visual information from periodic still images, and attempt to associate the learned states with user-reported interruptibility levels. We report experimental results using this technique on real data, and show how such an approach can allow for adaptation to specific user preferences.
Keywords: HMMs, context awareness, hierarchical HMMs, scene learning, smart mobile telephones, user interruptibility
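Decoding the user-state sequence in an HMM-based model of this kind typically rests on the Viterbi algorithm. Below is a generic sketch with invented "busy"/"free" interruptibility states and invented probabilities; the paper's learned hierarchical HMMs over audio-visual features are considerably richer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence,
    given start, transition, and emission probabilities as dicts."""
    # v[state] = (best probability of any path ending in state, that path)
    v = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        v = {s: max(((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                     for prev, (p, path) in v.items()),
                    key=lambda t: t[0])
             for s in states}
    prob, path = max(v.values(), key=lambda t: t[0])
    return path, prob
```

A production implementation would work in log-space to avoid underflow on long ambient-audio sequences.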

Demonstration session

Short message dictation on Symbian series 60 mobile phones BIBAKFull-Text 126-127
  E. Karpov; I. Kiss; J. Leppänen; J. Olsen; D. Oria; S. Sivadas; J. Tian
Dictation of natural language text on embedded mobile devices is a challenging task. First, it involves memory and CPU-efficient implementation of robust speech recognition algorithms that are generally resource demanding. Secondly, the acoustic and language models employed in the recognizer require the availability of suitable text and speech language resources, typically for a wide set of languages. Thirdly, a proper design of the UI is also essential. The UI has to provide intuitive and easy means for dictation and error correction, and must be suitable for a mobile usage scenario. In this demonstrator, an embedded speech recognition system for short message (SMS) dictation in US English is presented. The system is running on Nokia Series 60 mobile phones (e.g., N70, E60). The system's vocabulary is 23 thousand words. Its Flash and RAM memory footprints are small, 2 and 2.5 megabytes, respectively. After a short enrollment session, most native speakers can achieve a word accuracy of over 90% when dictating short messages in quiet or moderately noisy environments.
Keywords: embedded dictation, low complexity, low footprint, mobile dictation UI, speech recognition
The NIST smart data flow system II multimodal data transport infrastructure BIBAKFull-Text 128
  Antoine Fillinger; Stéphane Degré; Imad Hamchi; Vincent Stanford
Multimodal interfaces require numerous computing devices, sensors, and dynamic networking, to acquire, transport, and process the sensor streams necessary to sense human activities and respond to them. The NIST Smart Data Flow System Version II embodies many improvements requested by the research community including multiple operating systems, simplified data transport protocols, additional language bindings, an extensible object oriented architecture, and improved fault tolerance.
Keywords: data streams, distributed computing, multimodal data transport infrastructure, smart data flow, smart spaces
A contextual multimodal integrator BIBAKFull-Text 129-130
  Péter Pál Boda
Multimodal integration addresses the problem of combining various user inputs into a single semantic representation that can be used in deciding the system's next action(s). The method presented in this paper uses a statistical framework to implement the integration mechanism and includes contextual information in addition to the actual user input. The underlying assumption is that the more information sources are taken into account, the better the picture that can be drawn of the user's actual intention in the given context of the interaction. The paper presents the latest results with a Maximum Entropy classifier, with special emphasis on the use of contextual information (type of gesture movements and type of objects selected). Instead of explaining the design and implementation process in detail (a longer paper to be published later will do that), only a short description is provided here of the demonstration implementation, which produces above 91% accuracy for the 1st-best result and higher than 96% for the accumulated five N-best results.
Keywords: context, data fusion, machine learning, maximum entropy, multimodal database, multimodal integration, virtual modality
Collaborative multimodal photo annotation over digital paper BIBAKFull-Text 131-132
  Paulo Barthelmess; Edward Kaiser; Xiao Huang; David McGee; Philip Cohen
The availability of metadata annotations over media content such as photos is known to enhance retrieval and organization, particularly for large data sets. The greatest challenge for obtaining annotations remains getting users to perform the large amount of tedious manual work that is required. In this demo we show a system for semi-automated labeling based on extraction of metadata from naturally occurring conversations of groups of people discussing pictures among themselves. The system supports a variety of collaborative label elicitation scenarios mixing co-located and distributed participants, operating primarily via speech, handwriting and sketching over tangible digital paper photo printouts. We demonstrate the real-time capabilities of the system by providing hands-on annotation experience for conference participants. Demo annotations are performed over public domain pictures portraying mainstream themes (e.g. from famous movies).
Keywords: collaborative interaction, demo, intelligent interfaces, multimodal processing, photo annotation
CarDialer: multi-modal in-vehicle cellphone control application BIBAKFull-Text 133-134
  Vladimír Bergl; Martin Čmejrek; Martin Fanta; Martin Labský; Ladislav Seredi; Jan Sedivý; Lubos Ures
This demo presents CarDialer -- an in-car cellphone control application. Its multi-modal user interface blends state-of-the-art speech recognition technology (including text-to-speech synthesis) with the existing well-proven elements of a vehicle information system GUI (buttons mounted on a steering wheel and an LCD equipped with touch-screen). This conversational system provides access to name dialing, unconstrained dictation of numbers, adding new names, operations with lists of calls and messages, notification of presence, etc. The application is fully functional from the first start; no prerequisite steps (such as configuration or speech recognition enrollment) are required. The presentation of the proposed multi-modal architecture goes beyond the specific application and presents a modular platform for integrating application logic with various incarnations of UI modalities.
Keywords: automated speech recognition, multi-modal, name dialer, vehicle information system
Gender and age estimation system robust to pose variations BIBAKFull-Text 135-136
  Erina Takikawa; Koichi Kinoshita; Shihong Lao; Masato Kawade
For applications based on facial image processing, pose variation is a difficult problem. In this paper, we propose a gender and age estimation system that is robust against pose variations. The acceptable facial pose range is a yaw (left-right) from -30 degrees to +30 degrees and a pitch (up-down) from -20 degrees to +20 degrees. According to our experiments on several large databases collected under real environments, the gender estimation accuracy is 84.8% and the age estimation accuracy is 80.9% (subjects are divided into 5 classes). The average processing time is about 70 ms/frame for gender estimation and 95 ms/frame for age estimation (Pentium4 3.2 GHz). The system can be used to automatically analyze shopping customers and pedestrians using surveillance cameras.
Keywords: age estimation, facial image, gender estimation
A fast and robust 3D head pose and gaze estimation system BIBAKFull-Text 137-138
  Koichi Kinoshita; Yong Ma; Shihong Lao; Masato Kawade
We developed a fast and robust head pose and gaze estimation system. This system can detect facial points and estimate 3D pose angles and gaze direction under various conditions, including facial expression changes and partial occlusion. We need only one face image as input and do not need special devices such as blinking LEDs or stereo cameras. Moreover, no calibration is needed. The system shows a 95% head pose estimation accuracy and 81% gaze estimation accuracy (when the error margin is 15 degrees). The processing time is about 15 ms/frame (Pentium 4, 3.2 GHz). The acceptable range of facial pose is within a yaw (left-right) of ±60 degrees and within a pitch (up-down) of ±30 degrees.
Keywords: facial image, gaze estimation, pose estimation

Special poster session on human computing

Audio-visual emotion recognition in adult attachment interview BIBAKFull-Text 139-145
  Zhihong Zeng; Yuxiao Hu; Yun Fu; Thomas S. Huang; Glenn I. Roisman; Zhen Wen
Automatic multimodal recognition of spontaneous affective expressions is a largely unexplored and challenging problem. In this paper, we explore audio-visual emotion recognition in a realistic human conversation setting -- the Adult Attachment Interview (AAI). Based on the assumption that facial expression and vocal expression reflect the same coarse affective states, positive and negative emotion sequences are labeled according to Facial Action Coding System Emotion Codes. Facial texture in the visual channel and prosody in the audio channel are integrated in the framework of an Adaboost multi-stream hidden Markov model (AMHMM), in which an Adaboost learning scheme is used to fuse the component HMMs. Our approach is evaluated in preliminary AAI spontaneous emotion recognition experiments.
Keywords: affect recognition, affective computing, emotion recognition, multimodal human-computer interaction
Modeling naturalistic affective states via facial and vocal expressions recognition BIBAKFull-Text 146-154
  George Caridakis; Lori Malatesta; Loic Kessous; Noam Amir; Amaryllis Raouzaiou; Kostas Karpouzis
Affective and human-centered computing are two areas related to HCI that have attracted attention in recent years. One reason for this is the plethora of devices able to record and process multimodal input from users and to adapt their functionality to users' preferences or individual habits, thus enhancing usability and becoming attractive to users less accustomed to conventional interfaces. In the quest to receive feedback from users in an unobtrusive manner, the visual and auditory modalities allow us to infer a user's emotional state by combining information from facial expression recognition and speech prosody feature extraction. In this paper, we describe a multi-cue, dynamic approach to naturalistic video sequences. In contrast to strictly controlled recording conditions for audiovisual material, the current research focuses on sequences taken from nearly real-world situations. Recognition is performed via a 'Simple Recurrent Network', which lends itself well to modeling dynamic events in both the user's facial expressions and speech. Moreover, this approach differs from existing work in that it models user expressivity using a dimensional representation of activation and valence, instead of detecting the usual 'universal emotions', which are scarce in everyday human-machine interaction. The algorithm is deployed on an audiovisual database which was recorded simulating human-human discourse and, therefore, contains less extreme expressivity and subtle variations of a number of emotion labels.
Keywords: affective interaction, facial expression recognition, image processing, multimodal analysis, naturalistic data, prosodic feature extraction, user modeling
A 'need to know' system for group classification BIBAKFull-Text 155-161
  Wen Dong; Jonathan Gips; Alex (Sandy) Pentland
This paper outlines the design of a distributed sensor classification system with abnormality detection intended for groups of people who are participating in coordinated activities. The system comprises an implementation of a distributed Dynamic Bayesian Network (DBN) model called the Influence Model (IM) that relies heavily on an inter-process communication architecture called Enchantment to establish the pathways of information that the model requires. We use three examples to illustrate how the "need to know" system effectively recognizes the group structure by simulating the work of cooperating individuals.
Keywords: complex dynamic systems, hidden Markov model, influence model, social network, state-space model
Spontaneous vs. posed facial behavior: automatic analysis of brow actions BIBAKFull-Text 162-170
  Michel F. Valstar; Maja Pantic; Zara Ambadar; Jeffrey F. Cohn
Past research on automatic facial expression analysis has focused mostly on the recognition of prototypic expressions of discrete emotions rather than on the analysis of dynamic changes over time, although the importance of temporal dynamics of facial expressions for interpretation of the observed facial behavior has been acknowledged for over 20 years. For instance, it has been shown that the temporal dynamics of spontaneous and volitional smiles are fundamentally different from each other. In this work, we argue that the same holds for the temporal dynamics of brow actions and show that velocity, duration, and order of occurrence of brow actions are highly relevant parameters for distinguishing posed from spontaneous brow actions. The proposed system for discrimination between volitional and spontaneous brow actions is based on automatic detection of Action Units (AUs) and their temporal segments (onset, apex, offset) produced by movements of the eyebrows. For each temporal segment of an activated AU, we compute a number of mid-level feature parameters including the maximal intensity, duration, and order of occurrence. We use Gentle Boost to select the most important of these parameters. The selected parameters are used further to train Relevance Vector Machines to determine per temporal segment of an activated AU whether the action was displayed spontaneously or volitionally. Finally, a probabilistic decision function determines the class (spontaneous or posed) for the entire brow action. When tested on 189 samples taken from three different sets of spontaneous and volitional facial data, we attain a 90.7% correct recognition rate.
Keywords: automatic facial expression analysis, temporal dynamics
Gaze-X: adaptive affective multimodal interface for single-user office scenarios BIBAKFull-Text 171-178
  Ludo Maat; Maja Pantic
This paper describes an intelligent system that we developed to support affective multimodal human-computer interaction (AMM-HCI) where the user's actions and emotions are modeled and then used to adapt the HCI and support the user in his or her activity. The proposed system, which we named Gaze-X, is based on sensing and interpretation of the human part of the computer's context, known as W5+ (who, where, what, when, why, how). It integrates a number of natural human communicative modalities including speech, eye gaze direction, face and facial expression, and a number of standard HCI modalities like keystrokes, mouse movements, and active software identification, which, in turn, are fed into processes that provide decision making and adapt the HCI to support the user in his or her activity according to his or her preferences. To attain a system that can be educated, that can improve its knowledge and decision making through experience, we use case-based reasoning as the inference engine of Gaze-X. The utilized case base is a dynamic, incrementally self-organizing event-content-addressable memory that allows fact retrieval and evaluation of encountered events based upon the user preferences and the generalizations formed from prior input. To support concepts of concurrency, modularity/scalability, persistency, and mobility, Gaze-X has been built as an agent-based system where different agents are responsible for different parts of the processing. A usability study conducted in an office scenario with a number of users indicates that Gaze-X is perceived as effective, easy to use, useful, and affectively qualitative.
Keywords: affective computing, facial expressions, multimodal interfaces
Human computing, virtual humans and artificial imperfection BIBAKFull-Text 179-184
  Z. M. Ruttkay; D. Reidsma; A. Nijholt
In this paper we raise the issue whether imperfections, characteristic of human-human communication, should be taken into account when developing virtual humans. We argue that endowing virtual humans with the imperfections of humans can help making them more 'comfortable' to interact with. That is, the natural communication of a virtual human should not be restricted to multimodal utterances that are always perfect, both in the sense of form and of content. We illustrate our views with examples from two own applications that we have worked on: the Virtual Dancer, and the Virtual Trainer. In both applications imperfectness helps in keeping the interaction engaging and entertaining.
Keywords: embodied conversational agents, human computing, imperfections, virtual humans

Oral session 3: language understanding and content analysis

Using maximum entropy (ME) model to incorporate gesture cues for SU detection BIBAKFull-Text 185-192
  Lei Chen; Mary Harper; Zhongqiang Huang
Accurate identification of sentence units (SUs) in spontaneous speech has been found to improve the accuracy of speech recognition, as well as downstream applications such as parsing. In recent multimodal investigations, gestural features were utilized, in addition to lexical and prosodic cues from the speech channel, for detecting SUs in conversational interactions using a hidden Markov model (HMM) approach. Although this approach is computationally efficient and provides a convenient way to modularize the knowledge sources, it has two drawbacks for our SU task. First, standard HMM training methods maximize the joint probability of observations and hidden events, as opposed to the posterior probability of a hidden event given observations, a criterion more closely related to SU classification error. A second challenge for integrating gestural features is that their absence sanctions neither SU events nor non-events; it is only the co-timing of gestures with the speech channel that should impact our model. To address these problems, a Maximum Entropy (ME) model is used to combine multimodal cues for SU estimation. Experiments carried out on VACE multi-party meetings confirm that the ME modeling approach provides a solid framework for multimodal integration.
Keywords: gesture, language models, meetings, multimodal fusion, prosody, sentence boundary detection
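As a rough sketch of the modeling idea (not the paper's actual feature set, toolkit, or training procedure): with binary features and two classes, a maximum entropy model reduces to logistic regression trained to maximize the conditional probability of the SU event given the observations -- the posterior criterion contrasted above with joint HMM training. The cue names and data below are invented for illustration.

```python
import math

def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Binary maximum-entropy (logistic-regression) classifier trained by
    stochastic gradient ascent on the conditional log-likelihood."""
    dim = len(samples[0])
    w = [0.0] * (dim + 1)                      # feature weights + bias
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))         # P(SU | observations)
            g = y - p                          # gradient of log-likelihood
            for i, xi in enumerate(x):
                w[i] += lr * g * xi
            w[-1] += lr * g
    return w

def predict(w, x):
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Invented binary cues per word boundary:
# [pause_after_word, pitch_reset, gesture_hold] -> SU boundary (1) or not (0)
X = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0, 1, 0]
w = train_maxent(X, y)
```

In the real task the feature vector would combine lexical, prosodic, and gestural cues, and an ME model would normally be trained with a regularized convex optimizer rather than plain SGD.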
Salience modeling based on non-verbal modalities for spoken language understanding BIBAKFull-Text 193-200
  Shaolin Qu; Joyce Y. Chai
Previous studies have shown that, in multimodal conversational systems, fusing information from multiple modalities together can improve the overall input interpretation through mutual disambiguation. Inspired by these findings, this paper investigates non-verbal modalities, in particular deictic gesture, in spoken language processing. Our assumption is that during multimodal conversation, user's deictic gestures on the graphic display can signal the underlying domain model that is salient at that particular point of interaction. This salient domain model can be used to constrain hypotheses for spoken language processing. Based on this assumption, this paper examines different configurations of salience driven language models (e.g., n-gram and probabilistic context free grammar) for spoken language processing across different stages. Our empirical results have shown the potential of integrating salience models based on non-verbal modalities in spoken language understanding.
Keywords: language modeling, multimodal interfaces, salience modeling, spoken language understanding
EM detection of common origin of multi-modal cues BIBAKFull-Text 201-208
  A. K. Noulas; B. J. A. Kröse
Content analysis of clips containing people speaking involves processing informative cues coming from different modalities. These cues are usually the words extracted from the audio modality, and the identity of the persons appearing in the video modality of the clip. To achieve efficient assignment of these cues to the person that created them, we propose a Bayesian network model that utilizes the extracted feature characteristics, their relations and their temporal patterns. We use the EM algorithm in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables -- that is the identity of the speakers and the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results, exhibiting exceptional robustness when dealing with low quality data.
Keywords: audio-visual synchrony, content extraction, multi-modal, multi-modal cue assignment, speaker detection
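The E-step/M-step alternation described above can be illustrated on a much simpler model than the paper's Bayesian network: a two-component 1-D Gaussian mixture, where each point's hidden component plays the role of the hidden speaker/person identity. Everything below (the data, fixed unit variances, equal priors) is a simplifying assumption for the sketch.

```python
import math
import random

def em_two_gaussians(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture (equal priors, fixed unit
    variance) with EM: the E-step computes posterior responsibilities for the
    hidden component of each point; the M-step re-estimates the means that
    maximize the expected complete-data log-likelihood."""
    mu = [min(data), max(data)]                # crude initialisation
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        resp = []
        for x in data:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            resp.append(p0 / (p0 + p1))
        # M-step: responsibility-weighted means
        w0 = sum(resp)
        w1 = len(data) - w0
        mu = [sum(r * x for r, x in zip(resp, data)) / w0,
              sum((1 - r) * x for r, x in zip(resp, data)) / w1]
    return mu

random.seed(0)
data = ([random.gauss(-2.0, 1.0) for _ in range(300)] +
        [random.gauss(3.0, 1.0) for _ in range(300)])
mu = em_two_gaussians(data)                    # means recovered near -2 and 3
```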

Oral session 4: collaborative systems and environments

Prototyping novel collaborative multimodal systems: simulation, data collection and analysis tools for the next decade BIBAKFull-Text 209-216
  Alexander M. Arthur; Rebecca Lunsford; Matt Wesson; Sharon Oviatt
To support research and development of next-generation multimodal interfaces for complex collaborative tasks, a comprehensive new infrastructure has been created for collecting and analyzing time-synchronized audio, video, and pen-based data during multi-party meetings. This infrastructure needs to be unobtrusive and to collect rich data involving multiple information sources of high temporal fidelity to allow the collection and annotation of simulation-driven studies of natural human-human-computer interactions. Furthermore, it must be flexibly extensible to facilitate exploratory research. This paper describes both the infrastructure put in place to record, encode, playback and annotate the meeting-related media data, and also the simulation environment used to prototype novel system concepts.
Keywords: annotation tools, data collection infrastructure, meeting, multi-party, multimodal interfaces, simulation studies, synchronized media
Combining audio and video to predict helpers' focus of attention in multiparty remote collaboration on physical tasks BIBAKFull-Text 217-224
  Jiazhi Ou; Yanxin Shi; Jeffrey Wong; Susan R. Fussell; Jie Yang
The increasing interest in supporting multiparty remote collaboration has created both opportunities and challenges for the research community. The research reported here aims to develop tools to support multiparty remote collaborations and to study human behaviors using these tools. In this paper we first introduce an experimental multimedia (video and audio) system with which an expert can collaborate with several novices. We then use this system to study helpers' focus of attention (FOA) during a collaborative circuit assembly task. We investigate the relationship between FOA and language, as well as activities, using multimodal (audio and video) data, and use learning methods to predict helpers' FOA. We process the different modalities separately and fuse the results to make a final decision. We employ a sliding-window-based delayed labeling method to automatically predict changes in FOA in real time using only the dialogue among the helper and workers. We apply an adaptive background subtraction method and a support vector machine to recognize the workers' activities from the video. To predict the helper's FOA, we make decisions using information about joint project boundaries and the workers' recent activities. The overall prediction accuracies are 79.52% using audio only and 81.79% using audio and video combined.
Keywords: computer-supported cooperative work, focus of attention, multimodal integration, remote collaborative physical tasks
The role of psychological ownership and ownership markers in collaborative working environment BIBAKFull-Text 225-232
  QianYing Wang; Alberto Battocchi; Ilenia Graziola; Fabio Pianesi; Daniel Tomasini; Massimo Zancanaro; Clifford Nass
In this paper, we present a study concerning psychological ownership of digital entities in the context of collaborative working environments. In the first part of the paper we present a conceptual framework of ownership: various issues such as its definition, effects, target factors, and behavioral manifestations are explicated. We then focus on ownership marking, a behavioral manifestation that is closely tied to psychological ownership. We designed an experiment using a DiamondTouch table to investigate the effect of two of the most widely used ownership markers on users' attitudes and performance. Both performance and attitudinal differences were found, suggesting the significant role of ownership and ownership markers in the design of groupware and interactive workspaces.
Keywords: collaborative multimodal environment, communicative marker, defensive marker, digital ownership, marking behavior

Special oral session: special session on human computing

Foundations of human computing: facial expression and emotion BIBAKFull-Text 233-238
  Jeffrey F. Cohn
Many people believe that emotions and subjective feelings are one and the same and that a goal of human-centered computing is emotion recognition. The first belief is outdated; the second mistaken. For human-centered computing to succeed, a different way of thinking is needed.
   Emotions are species-typical patterns that evolved because of their value in addressing fundamental life tasks [19]. Emotions consist of multiple components that may include intentions, action tendencies, appraisals, other cognitions, central and peripheral changes in physiology, and subjective feelings. Emotions are not directly observable, but are inferred from expressive behavior, self-report, physiological indicators, and context. I focus on expressive behavior because of its coherence with other indicators and the depth of research on the facial expression of emotion in behavioral and computer science. The topics covered in this paper include approaches to measurement, timing or dynamics, individual differences, dyadic interaction, and inference. I propose that the design and implementation of perceptual user interfaces may be better informed by considering the complexity of emotion, its various indicators, measurement, individual differences, dyadic interaction, and problems of inference.
Keywords: automatic facial image analysis, emotion, facial expression, human-computer interaction, temporal dynamics
Human computing and machine understanding of human behavior: a survey BIBAKFull-Text 239-248
  Maja Pantic; Alex Pentland; Anton Nijholt; Thomas Huang
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next generation computing, which we will call human computing, should be about anticipatory user interfaces that should be human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far are we from enabling computers to understand human behavior.
Keywords: affective computing, anticipatory user interfaces, multimodal user interfaces, socially-aware computing
Computing human faces for human viewers: automated animation in photographs and paintings BIBAFull-Text 249-256
  Volker Blanz
This paper describes a system for animating and modifying faces in images. It combines an algorithm for 3D face reconstruction from single images with a learning-based approach for 3D animation and face modification. Modifications include changes of facial attributes, such as body weight, masculine or feminine look, or overall head shape, as well as cut-and-paste exchange of faces. Unlike traditional photo retouching, this technique can be applied across changes in pose and lighting. Bridging the gap between photorealistic image processing and 3D graphics, the system provides tools for interacting with existing image material, such as photographs or paintings. The core of the approach is a statistical analysis of a dataset of 3D faces, and an analysis-by-synthesis loop that simulates the process of image formation for high-level image processing.

Poster session 2

Detection and application of influence rankings in small group meetings BIBAKFull-Text 257-264
  Rutger Rienks; Dong Zhang; Daniel Gatica-Perez; Wilfried Post
We address the problem of automatically detecting participants' influence levels in meetings. The impact and social-psychological background are discussed. The more influential a participant is, the more he or she influences the outcome of a meeting. Experiments on 40 meetings show that applying statistical (both dynamic and static) models to easily obtainable features yields a best prediction performance of 70.59% when using a static model, a balanced training set, and three discrete classes: high, normal, and low. Applications of the detected levels are shown in various ways, e.g., in a virtual meeting environment as well as in a meeting browser system.
Keywords: dominance detection, influence detection, machine learning, small group research
Tracking the multi person wandering visual focus of attention BIBAKFull-Text 265-272
  Kevin Smith; Sileye O. Ba; Daniel Gatica-Perez; Jean-Marc Odobez
Estimating the wandering visual focus of attention (WVFOA) of multiple people is an important problem with many applications in human behavior understanding. One such application, addressed in this paper, is monitoring the attention of passers-by to outdoor advertisements. To solve the WVFOA problem, we propose a multi-person tracking approach based on a hybrid Dynamic Bayesian Network that simultaneously infers the number of people in the scene, their body and head locations, and their head pose, in a joint state-space formulation that is amenable to person-interaction modeling. The model exploits both global measurements and individual observations for the VFOA. For inference in the resulting high-dimensional state-space, we propose a trans-dimensional Markov Chain Monte Carlo (MCMC) sampling scheme, which not only handles a varying number of people, but also efficiently searches the state-space by allowing person-part state updates. Our model was rigorously evaluated for tracking and for its ability to recognize when people look at an outdoor advertisement, using a realistic data set.
Keywords: MCMC, head pose tracking, multi-person tracking
Toward open-microphone engagement for multiparty interactions BIBAKFull-Text 273-280
  Rebecca Lunsford; Sharon Oviatt; Alexander M. Arthur
There currently is considerable interest in developing new open-microphone engagement techniques for speech and multimodal interfaces that perform robustly in complex mobile and multiparty field environments. State-of-the-art audio-visual open-microphone engagement systems aim to eliminate the need for explicit user engagement by processing more implicit cues that a user is addressing the system, which results in lower cognitive load for the user. This is an especially important consideration for mobile and educational interfaces, due to the higher load required by explicit system engagement. In the present research, longitudinal data were collected with six triads of high-school students who engaged in peer tutoring on math problems with the aid of a simulated computer assistant. Results revealed that amplitude was 3.25 dB higher when users addressed the computer rather than a human peer when no lexical marker of the intended interlocutor was present, and 2.4 dB higher across all data. These basic results were replicated for both matched and adjacent utterances to computer versus human partners. With respect to dialogue style, speakers did not direct a higher ratio of commands to the computer, although such dialogue differences have been assumed in prior work. The results of this research reveal that amplitude is a powerful cue marking a speaker's intended addressee, which should be leveraged to design more effective microphone engagement during computer-assisted multiparty interactions.
Keywords: collaborative peer tutoring, computer-supported collaborative work, dialogue style, intended addressee, multimodal interaction, open-microphone engagement, spoken amplitude, user communication modeling
Tracking head pose and focus of attention with multiple far-field cameras BIBAKFull-Text 281-286
  Michael Voit; Rainer Stiefelhagen
In this work we present our recent approach to estimating the head orientations and foci of attention of multiple people in a smart room that is equipped with several cameras to monitor the room. In our approach, we estimate each person's head orientation with respect to the room coordinate system by using all camera views. We implemented a neural network to estimate head pose on every single camera view; a Bayes filter is then applied to integrate the individual estimates into one final, joint hypothesis. Using this scheme, we can track people's horizontal head orientations over a full 360° range at almost all positions within the room. The tracked head orientations are then used to determine who is looking at whom, i.e., people's focus of attention. We report experimental results on one meeting video that was recorded in the smart room.
Keywords: Bayesian filter, focus of attention, gaze, head orientation, head pose, neural networks
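The fusion step can be sketched as a discrete Bayes update over orientation bins: each camera contributes a per-bin likelihood (standing in here for the neural network's output), the prior is multiplied by each likelihood, and the product is renormalised into the joint hypothesis. The bin count and scores below are invented for illustration.

```python
def fuse_camera_estimates(prior, likelihoods):
    """Discrete Bayes fusion: multiply a prior over head-orientation bins
    by each camera's per-bin likelihood, then renormalise to a posterior."""
    post = list(prior)
    for lik in likelihoods:
        post = [p * l for p, l in zip(post, lik)]
    z = sum(post)
    return [p / z for p in post]

# 8 orientation bins of 45 degrees each; two cameras produce soft per-bin
# scores (hypothetical values -- a real system would use the NN outputs).
prior = [1 / 8] * 8
cam_a = [0.05, 0.10, 0.40, 0.25, 0.05, 0.05, 0.05, 0.05]
cam_b = [0.05, 0.05, 0.35, 0.35, 0.05, 0.05, 0.05, 0.05]
post = fuse_camera_estimates(prior, [cam_a, cam_b])
best_bin = post.index(max(post))   # joint hypothesis: most likely orientation
```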
Recognizing gaze aversion gestures in embodied conversational discourse BIBAKFull-Text 287-294
  Louis-Philippe Morency; C. Mario Christoudias; Trevor Darrell
Eye gaze offers several key cues regarding conversational discourse during face-to-face interaction between people. While a large body of research results exist to document the use of gaze in human-to-human interaction, and in animating realistic embodied avatars, recognition of conversational eye gestures -- distinct eye movement patterns relevant to discourse -- has received less attention. We analyze eye gestures during interaction with an animated embodied agent and propose a non-intrusive vision-based approach to estimate eye gaze and recognize eye gestures. In our user study, human participants avert their gaze (i.e. with "look-away" or "thinking" gestures) during periods of cognitive load. Using our approach, an agent can visually differentiate whether a user is thinking about a response or is waiting for the agent or robot to take its turn.
Keywords: aversion gestures, embodied conversational agent, eye gaze tracking, eye gestures, human-computer interaction, turn-taking
Explorations in sound for tilting-based interfaces BIBAKFull-Text 295-301
  Matthias Rath; Michael Rohs
Everyday experience, as well as recent studies, suggests that the information contained in ecological sonic feedback can improve human control of, and interaction with, a system. This notion is particularly worthwhile to consider in the context of mobile, tilting-based interfaces, which have been proposed, developed, and studied extensively. Two interfaces are used for this purpose: the Ballancer, based on the metaphor of balancing a rolling ball on a track, and a more concretely application-oriented setup of a mobile phone with tilting-based input. First pilot studies have been conducted.
Keywords: acoustic feedback, auditory feedback, auditory information, control, tilting-based input
Haptic phonemes: basic building blocks of haptic communication BIBAKFull-Text 302-309
  Mario Enriquez; Karon MacLean; Christian Chita
A haptic phoneme represents the smallest unit of a constructed haptic signal to which a meaning can be assigned. These haptic phonemes can be combined serially or in parallel to form haptic words, or haptic icons, which can hold more elaborate meanings for their users. Here, we use phonemes which consist of brief (<2 seconds) haptic stimuli composed of a simple waveform at a constant frequency and amplitude. Building on previous results showing that a set of 12 such haptic stimuli can be perceptually distinguished, here we test learnability and recall of associations for arbitrarily chosen stimulus-meaning pairs. We found that users could consistently recall an arbitrary association between a haptic stimulus and its assigned arbitrary meaning in a 9-phoneme set, during a 45 minute test period following a reinforced learning stage.
Keywords: haptic icons, haptic interfaces, tactile language, touch
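A phoneme of the kind described -- a brief (under 2 seconds) stimulus with a simple waveform at constant frequency and amplitude -- can be sketched as plain sample generation; the particular frequency, amplitude, and sample rate are arbitrary illustrative choices, not values from the study.

```python
import math

def haptic_phoneme(freq_hz, amplitude, duration_s, sample_rate=8000):
    """Generate a constant-frequency, constant-amplitude sine burst, in the
    spirit of the haptic phonemes described above (parameters hypothetical)."""
    assert duration_s < 2.0, "phonemes are brief stimuli (< 2 seconds)"
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# A half-second, 250 Hz burst at 80% of full drive amplitude.
wave = haptic_phoneme(freq_hz=250, amplitude=0.8, duration_s=0.5)
```

Serial or parallel combination of such bursts would then form the "haptic words" (icons) the abstract mentions.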
Toward haptic rendering for a virtual dissection BIBAKFull-Text 310-317
  Nasim Melony Vafai; Shahram Payandeh; John Dill
In this paper we present a novel data structure combined with geometrically efficient techniques to simulate a "tissue peeling" method for deformable bodies. This is done to preserve the basic shape of a body in conjunction with soft-tissue deformation of multiple deformable bodies in a geometry-based model. We demonstrate our approach through haptic rendering of a virtual anatomical model for a dissection simulator that consists of surface skin along with multiple internal organs. The simulator uses multimodal cues in the form of haptic feedback to provide guidance and performance feedback to the user. The realism of the simulation is enhanced by computation of interaction forces using extrapolation techniques to send these forces back to the user via a haptic device.
Keywords: collision detection, force feedback, haptics, soft tissue deformation, virtual reality
Embrace system for remote counseling BIBAKFull-Text 318-325
  Osamu Morikawa; Sayuri Hashimoto; Tsunetsugu Munakata; Junzo Okunaka
In counseling, non-verbal communication such as physical contact is an effective role-playing skill. In remote counseling via videophones, spacing and physical contact cannot be used, and communication must rely on expressions and words alone. This paper describes an embrace system for remote counseling, which consists of HyperMirror and vibrators and can provide effects similar to those of physical contact in face-to-face counseling.
Keywords: HyperMirror, remote counseling, remote embrace, tactile and vibration
Enabling multimodal communications for enhancing the ability of learning for the visually impaired BIBAKFull-Text 326-332
  Francis Quek; David McNeill; Francisco Oliveira
Students who are blind are typically one to three years behind their sighted counterparts in mathematics and science. We posit that a key reason for this resides in the inability of such students to access the multimodal embodied communicative behavior of mathematics instructors. This impedes the ability of blind students and their teachers to maintain situated communication. In this paper, we set forth the relevant phenomenological analyses to support this claim. We show that mathematical communication and instruction are inherently embodied and that the blind are able to conceptualize visuo-spatial information, and argue that uptake of embodied behavior is critical to receiving relevant mathematical information. Based on this analysis, we advance an approach to provide students who are blind with awareness of their teachers' deictic gestural activity via a set of haptic output devices. We lay out a set of open research questions that researchers in multimodal interfaces may address.
Keywords: awareness, catchment, embodied awareness, embodied deictic activity, embodiment, gestures, growth point, mediating technology, multimodal, multimodal interfaces, situated discourse, spatio-temporal cues
The benefits of multimodal information: a meta-analysis comparing visual and visual-tactile feedback BIBAKFull-Text 333-338
  Matthew S. Prewett; Liuquin Yang; Frederick R. B. Stilson; Ashley A. Gray; Michael D. Coovert; Jennifer Burke; Elizabeth Redden; Linda R. Elliot
Information display systems have become increasingly complex and more difficult for human cognition to process effectively. According to Wickens' Multiple Resource Theory (MRT), information delivered through multiple modalities (i.e., visual and tactile) could be more effective than the same information communicated through a single modality. The purpose of this meta-analysis is to compare user effectiveness when using visual-tactile task feedback (multimodal) to using only visual task feedback (unimodal). Results indicate that visual-tactile feedback enhances task effectiveness more than visual feedback alone (g = .38). When assessing different criteria, visual-tactile feedback is particularly effective at reducing reaction time (g = .631) and increasing performance (g = .618). Follow-up moderator analyses indicate that visual-tactile feedback is more effective when workload is high (g = .844) and when multiple tasks are being performed (g = .767). Implications of these results are discussed in the paper.
Keywords: meta-analysis, multimodal, visual feedback, visual-tactile feedback
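The effect sizes reported above are Hedges' g values. For readers unfamiliar with the measure, here is a minimal sketch of how g is computed for two independent groups; the group statistics below are illustrative, not taken from the paper:

```python
import math

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Hedges' g: standardized mean difference with small-sample bias correction."""
    # Pooled standard deviation across the two groups
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd           # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)           # small-sample correction factor J
    return j * d

# Hypothetical comparison: visual-tactile vs. visual-only task scores, 20 users each.
g = hedges_g(5.0, 1.0, 20, 4.5, 1.0, 20)
```

The correction factor J shrinks Cohen's d slightly, which matters for the small samples typical of HCI experiments that meta-analyses like this one aggregate.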

Oral session 5: speech and dialogue systems

Word graph based speech recognition error correction by handwriting input BIBAKFull-Text 339-346
  Peng Liu; Frank K. Soong
We propose a convenient handwriting user interface for correcting speech recognition errors efficiently. Via the proposed hand-marked correction on the displayed recognition result, substitution, deletion and insertion errors can be corrected efficiently by rescoring the word graph generated in the recognition pass. A new path in the graph that matches the user's feedback in the maximum likelihood sense is found.
   With the aid of the language model and the hand-corrected portion of the best decoded path, rescoring the word graph can correct more errors than the user explicitly indicates, and all recognition errors can be corrected after a finite number of corrections. Experimental results show that by indicating one word error in user feedback, 33.8% of the erroneous sentences can be corrected, while by indicating one character error, 12.9% of the erroneous sentences can be corrected.
Keywords: handwriting recognition, interactive error correction, multimodal interface, speech recognition, word graph
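The rescoring step can be pictured as a constrained best-path search over the word graph: once the user marks a correction, competing hypotheses on that span are pruned and the highest-likelihood path through the remaining lattice is recomputed. The sketch below illustrates the idea on a toy lattice; the graph format, the scores, and the `best_path` helper are illustrative assumptions, not the authors' implementation:

```python
# Toy word graph: edges are (src_node, dst_node, word, log_prob),
# with nodes numbered in topological order (0 = start, 2 = end).
EDGES = [
    (0, 1, "recognize", -1.0), (0, 1, "wreck a nice", -1.5),
    (1, 2, "beach", -0.4),     (1, 2, "speech", -0.5),
]

def best_path(edges, required=None):
    """Viterbi-style dynamic program over a topologically ordered word graph.
    `required` = (src, dst, word): an edge the path must use, standing in
    for the user's handwritten correction on that span."""
    best = {0: (0.0, [])}                      # node -> (score, word sequence)
    for src, dst, word, lp in sorted(edges):   # sorting visits sources in order
        if src not in best:
            continue
        # Prune hypotheses that compete with the user's correction on its span.
        if required and (src, dst) == (required[0], required[1]) and word != required[2]:
            continue
        score, words = best[src]
        cand = (score + lp, words + [word])
        if dst not in best or cand[0] > best[dst][0]:
            best[dst] = cand
    return best[max(best)]                     # best hypothesis at the end node

score, words = best_path(EDGES)                                  # unconstrained 1-best
score_c, words_c = best_path(EDGES, required=(1, 2, "speech"))   # after correction
```

Without the constraint the decoder prefers the cheaper "beach"; forcing the corrected word re-ranks the whole path. In the full system, language model scores on the lattice edges are what allow a single forced correction to repair neighboring errors as well.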
Using redundant speech and handwriting for learning new vocabulary and understanding abbreviations BIBAKFull-Text 347-356
  Edward C. Kaiser
New language constantly emerges from complex, collaborative human-human interactions like meetings -- such as, for instance, when a presenter handwrites a new term on a whiteboard while saying it. Fixed vocabulary recognizers fail on such new terms, which often are critical to dialogue understanding. We present a proof-of-concept multimodal system that combines information from handwriting and speech recognition to learn the spelling, pronunciation and semantics of out-of-vocabulary terms from single instances of redundant multimodal presentation (e.g. saying a term while handwriting it). For the task of recognizing the spelling and semantics of abbreviated Gantt chart labels across a held-out test series of five scheduling meetings we show a significant relative error rate reduction of 37% when our learning methods are used and allowed to persist across the meeting series, as opposed to when they are not used.
Keywords: handwriting, multimodal, speech
Multimodal fusion: a new hybrid strategy for dialogue systems BIBAKFull-Text 357-363
  Pilar Manchón Portillo; Guillermo Pérez García; Gabriel Amores Carredano
This paper presents a new hybrid fusion strategy based primarily on the integration of two earlier, differentiated approaches to multimodal fusion [11] in multimodal dialogue systems. Both approaches, their predecessors, and their respective advantages and disadvantages are described in order to illustrate how the new strategy merges them into a more solid and coherent solution. The first strategy was largely based on Johnston's approach [5] and involves the inclusion of multimodal grammar entries and temporal constraints. The second approach fused information coming from different channels at the dialogue level. The hybrid strategy described here requires the multimodal grammar entries and temporal constraints of the first strategy plus the additional dialogue-level information utilized in the second. In this new approach, therefore, the fusion process is initiated at the grammar level and completed at the dialogue level.
Keywords: NLP, dialogue systems, multimodal fusion

Oral session 6: interfaces and usability

Evaluating usability based on multimodal information: an empirical study BIBAKFull-Text 364-371
  Tao Lin; Atsumi Imamiya
New technologies are making it possible to provide an enriched view of interaction for researchers using multimodal information. This preliminary study explores the use of multiple information streams in usability evaluation. In the study, easy, medium and difficult versions of a game task were used to vary the levels of mental effort. Multimodal data streams during the three versions were analyzed, including eye tracking, pupil size, hand movement, heart rate variability (HRV) and subjectively reported data. Four findings indicate the potential value of usability evaluations based on multimodal information: First, subjective and physiological measures showed significant sensitivity to task difficulty. Second, different mental workload levels appeared to correlate with eye movement patterns, especially with a combined eye-hand movement measure. Third, HRV showed correlations with saccade speed. Finally, we present a new method using the ratio of eye fixations over mouse clicks to evaluate performance in more detail. These results warrant further investigations and take an initial step toward establishing usability evaluation methods based on multimodal information.
Keywords: eye tracking, multimodal, physiological measures, usability
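The fixation-over-click ratio proposed above can be read as "how much visual search each action costs." A minimal sketch of the metric, with hypothetical per-task counts (not the study's data):

```python
def fixations_per_click(n_fixations, n_clicks):
    """Eye fixations per mouse click: higher values suggest more visual
    search (and plausibly more mental effort) per completed action."""
    if n_clicks == 0:
        raise ValueError("no clicks recorded for this task")
    return n_fixations / n_clicks

# Hypothetical counts for the easy / medium / difficult task versions.
tasks = {"easy": (120, 60), "medium": (200, 50), "difficult": (360, 40)}
ratios = {name: fixations_per_click(f, c) for name, (f, c) in tasks.items()}
```

A monotone increase in the ratio across difficulty levels would corroborate the paper's finding that combined eye-hand measures track mental workload.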
A new approach to haptic augmentation of the GUI BIBAKFull-Text 372-379
  Thomas N. Smyth; Arthur E. Kirkpatrick
Most users do not experience the same level of fluency in their interactions with computers that they do with physical objects in their daily life. We believe that much of this results from the limitations of unimodal interaction. Previous efforts in the haptics literature to remedy those limitations have been creative and numerous, but have failed to produce substantial improvements in human performance. This paper presents a new approach, whereby haptic interaction techniques are designed from scratch, in explicit consideration of the strengths and weaknesses of the haptic and motor systems. We introduce a haptic alternative to the tool palette, called Pokespace, which follows this approach. Two studies (6 and 12 participants) conducted with Pokespace found no performance improvement over a traditional interface, but showed that participants learned to use the interface proficiently after about 10 minutes, and could do so without visual attention. The studies also suggested several improvements to our design.
Keywords: 3D interaction, haptic feedback, haptic interface, multimodal interface, rehearsal, tool palette, visual attention
HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads BIBAKFull-Text 380-387
  Nadia Mana; Fabio Pianesi
One of the research goals in the human-computer interaction community is to build believable Embodied Conversational Agents (ECAs), that is, agents able to communicate complex information with human-like expressiveness and naturalness. Since emotions play a crucial role in human communication and most of them are expressed through the face, making ECAs more believable means giving them the ability to display emotional facial expressions.
   This paper presents a system based on Hidden Markov Models (HMMs) for the synthesis of emotional facial expressions during speech. The HMMs were trained on a set of emotion examples in which a professional actor uttered Italian non-sense words, acting various emotional facial expressions with different intensities.
   The evaluation of the experimental results, performed by comparing the "synthetic examples" (generated by the system) with a reference "natural example" (one of the actor's examples) in three different ways, shows that HMMs for emotional facial expression synthesis have some limitations but can make a synthetic Talking Head more expressive and realistic.
Keywords: MPEG4 facial animation, emotional facial expression modeling, face synthesis, hidden Markov models, talking heads

Panel

Embodiment and multimodality BIBAKFull-Text 388-390
  Francis Quek
Students who are blind are typically one to three years behind their sighted counterparts in mathematics and science. We posit that a key reason for this is the inability of such students to access the multimodal embodied communicative behavior of mathematics instructors, which impedes the ability of blind students and their teachers to maintain situated communication. In this paper, we set forth the relevant phenomenological analyses to support this claim. We show that mathematical communication and instruction are inherently embodied; that the blind are able to conceptualize visuo-spatial information; and argue that uptake of embodied behavior is critical to receiving relevant mathematical information. Based on this analysis, we advance an approach that provides students who are blind with awareness of their teachers' deictic gestural activity via a set of haptic output devices. We lay out a set of open research questions that researchers in multimodal interfaces may address.
Keywords: awareness, embodiment, gestures, multimodal, theory