
Proceedings of the 2007 International Conference on Multimodal Interfaces

Fullname: ICMI'07 Proceedings of the 9th International Conference on Multimodal Interfaces
Editors: Kenji Mase; Dominic Massaro; Kazuya Takeda; Deb Roy; Alexandros Potamianos
Location: Nagoya, Aichi, Japan
Dates: 2007-Nov-12 to 2007-Nov-15
Standard No: ISBN: 1-59593-817-6, 978-1-59593-817-6; ACM DL: Table of Contents; hcibib: ICMI07
Links: Conference Home Page
  1. Keynote
  2. Oral session 1: spontaneous behavior 1
  3. Oral session 2: spontaneous behavior 2
  4. Poster session 1
  5. Oral session 3: cross-modality
  6. Poster session 2
  7. Oral session 4: meeting applications
  8. Poster session 3
  9. Oral session 5: interactive systems 1
  10. Oral session 6: interactive systems 2
  11. Workshops


Interfacing life: a year in the life of a research lab BIBAKFull-Text 1
  Yuri Ivanov
Humans perceive life around them through a variety of sensory inputs. Some, such as vision and audition, have high information content, while others, such as touch and smell, do not. Humans and other animals use this gradation of senses to decide how to attend to what is important.
   In contrast, it is widely accepted that in monitoring living spaces, the modalities with high information content hold the key to decoding the behavior and intentions of the occupants. In surveillance, video cameras record everything they can possibly see in the hope that if something happens, it can later be found in the recorded data. Unfortunately, the latter has proven harder than it sounds.
   In our work we challenge this idea and introduce a monitoring system built as a combination of channels with varying information content. The system has been deployed for over a year in our lab space and consists of a large motion sensor network combined with several video cameras. While the sensors give a general context of events across the entire 3000 square meters of the space, the cameras attend only to selected office activities. The system performs several monitoring tasks that are all but impossible in a traditional camera-only setting.
   In the talk we share our experiences, challenges and solutions in building and maintaining the system. We show results from the data collected over more than a year and introduce other successful and novel applications of the system.
Keywords: heterogeneous sensor networks, human behavior and analysis
The great challenge of multimodal interfaces towards symbiosis of human and robots BIBAKFull-Text 2
  Norihiro Hagita
This paper discusses the possibilities of symbiosis between humans and communication robots from the viewpoint of multi-modal interfaces. The current communication abilities of robots, such as speech recognition, are insufficient for practical use and need to be improved. Network robot systems, which integrate ubiquitous networking and robot technologies, have been introduced in Japan, Korea and EU countries to improve these abilities. Recent field experiments on communication robots based on such systems were conducted in a science museum, a train station and a shopping mall in Japan. The results suggest that network robot systems may serve as next-generation communication media. Improved communication ability raises privacy concerns, since the history of human-robot interaction often includes personal information. For example, if a robot I have never met greets me with "Hi, Nori. I know you," how should I respond? Access control methods based on multi-modal interfaces are therefore required, and are discussed here. Android science is introduced as an ultimate human interface; this research aims to clarify the difference between "existence" for robot-like robots and "presence" for human-like robots. Once the appearance of robots becomes more similar to that of humans, how should we respond to them? The development of communication robots at our lab, including privacy policy and android science, is outlined.
Keywords: communication robot, humanoid robot
Just in time learning: implementing principles of multimodal processing and learning for education BIBAKFull-Text 3-8
  Dominic W. Massaro
Baldi, a 3-D computer-animated tutor, has been developed to teach speech and language. I review this technology and pedagogy and describe evaluation experiments that have substantiated the effectiveness of our language-training program, Timo Vocabulary, in teaching vocabulary and grammar. With a new Lesson Creator, teachers, parents, and even students can build original lessons that allow concepts, vocabulary, animations, and pictures to be easily integrated. The Lesson Creator application facilitates the specialization and individualization of lessons by allowing teachers to create customized vocabulary lists Just in Time as they are needed. The Lesson Creator allows the coach to give descriptions of the concepts as well as corrective feedback, which permits errorless learning and encourages children to think as they are learning. I describe the Lesson Creator, illustrate it, and speculate on how its evaluation can be accomplished.
Keywords: education, language learning, multisensory integration, speech, vocabulary

Oral session 1: spontaneous behavior 1

The painful face: pain expression recognition using active appearance models BIBAKFull-Text 9-14
  Ahmed Bilal Ashraf; Simon Lucey; Jeffrey F. Cohn; Tsuhan Chen; Zara Ambadar; Ken Prkachin; Patty Solomon; Barry J. Theobald
Pain is typically assessed by patient self-report. Self-reported pain, however, is difficult to interpret and may be impaired or not even possible, as in young children or the severely ill. Behavioral scientists have identified reliable and valid facial indicators of pain, but until now these have required manual measurement by highly skilled observers. We developed an approach that automatically recognizes acute pain. Adult patients with rotator cuff injury were video-recorded while a physiotherapist manipulated their affected and unaffected shoulders. Skilled observers rated pain expression from the video on a 5-point Likert-type scale. From these ratings, sequences were categorized as no-pain (rating of 0), pain (rating of 3, 4, or 5), and indeterminate (rating of 1 or 2). We explored machine learning approaches for pain versus no-pain classification. Active Appearance Models (AAM) were used to decouple shape and appearance parameters from the digitized face images. Support vector machines (SVM) were used with several representations from the AAM. Using a leave-one-out procedure, we achieved an equal error rate of 19% (hit rate = 81%) using canonical appearance and shape features. These findings suggest the feasibility of automatic pain detection from video.
Keywords: active appearance models, automatic facial image analysis, facial expression, pain, support vector machines
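The evaluation protocol above can be sketched in a few lines; this is a minimal illustration with synthetic features, and a nearest-centroid classifier stands in for the paper's SVM (the leave-one-subject-out protocol, not the classifier, is the point of the sketch):

```python
import numpy as np

def leave_one_subject_out(features, labels, subjects):
    """Hold out every frame of one subject, fit on the remaining
    subjects, score the held-out frames; return overall accuracy.
    A nearest-centroid classifier stands in for the paper's SVM."""
    correct = 0
    for s in np.unique(subjects):
        test = subjects == s
        train = ~test
        # per-class mean feature vectors from the training subjects only
        centroids = {c: features[train & (labels == c)].mean(axis=0)
                     for c in np.unique(labels[train])}
        for x, y in zip(features[test], labels[test]):
            pred = min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
            correct += int(pred == y)
    return correct / len(labels)

# synthetic "AAM-like" features: pain frames offset from no-pain frames
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(3.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)     # 0 = no-pain, 1 = pain
subj = np.tile(np.arange(10), 6)      # 10 subjects, 6 frames each
acc = leave_one_subject_out(X, y, subj)
```

Holding out whole subjects, rather than random frames, is what makes the reported rates subject-independent: frames of the same person never appear in both the training and test folds.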
Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain BIBAKFull-Text 15-21
  Gwen C. Littlewort; Marian Stewart Bartlett; Kang Lee
We present initial results from the application of an automated facial expression recognition system to spontaneous facial expressions of pain. In this study, 26 participants were videotaped under three experimental conditions: baseline, posed pain, and real pain. In the real pain condition, subjects experienced cold pressor pain by submerging their arm in ice water. Our goal was to automatically determine which experimental condition was shown in a 60-second clip from a previously unseen subject. We chose a machine learning approach, previously used successfully to categorize basic emotional facial expressions in posed datasets as well as to detect individual facial actions of the Facial Action Coding System (FACS) (Littlewort et al., 2006; Bartlett et al., 2006). For this study, we trained 20 Action Unit (AU) classifiers on over 5000 images selected from a combination of posed and spontaneous facial expressions. The output of the system was a real-valued number indicating the distance to the separating hyperplane for each classifier. Applying this system to the pain video data produced a 20-channel output stream, consisting of one real value for each learned AU, for each frame of the video. This data was passed to a second layer of classifiers to predict the difference between baseline and pained faces, and the difference between expressions of real pain and fake pain. Naïve human subjects tested on the same videos were at chance for differentiating faked from real pain, obtaining only 52% accuracy. The automated system was successfully able to differentiate faked from real pain. In an analysis of 26 subjects, the system obtained 72% correct for subject-independent discrimination of real versus fake pain on a 2-alternative forced choice. Moreover, the most discriminative facial action in the automated system output was AU 4 (brow lower), which was consistent with findings using human expert FACS codes.
Keywords: FACS, computer vision, deception, facial action coding system, facial expression recognition, machine learning, pain, spontaneous behavior
Visual inference of human emotion and behaviour BIBAKFull-Text 22-29
  Shaogang Gong; Caifeng Shan; Tao Xiang
We address the problem of automatic interpretation of non-exaggerated human facial and body behaviours captured in video. We illustrate our approach with three examples. (1) We introduce Canonical Correlation Analysis (CCA) and Matrix Canonical Correlation Analysis (MCCA) for capturing and analyzing spatial correlations among non-adjacent facial parts for facial behaviour analysis. (2) We extend Canonical Correlation Analysis to multimodal correlation for behaviour inference using both facial and body gestures. (3) We model temporal correlation among human movement patterns in a wider space using a mixture of Multi-Observation Hidden Markov Models for human behaviour profiling and behavioural anomaly detection.
Keywords: anomaly detection, behaviour profiling, body language recognition, human emotion recognition, intention inference
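As a rough illustration of the CCA machinery used in (1) and (2), the first canonical correlation between two feature sets can be computed with plain numpy: whiten each set and take the largest singular value of the cross-covariance in whitened coordinates. The "face" and "body" matrices below are synthetic stand-ins driven by one shared latent signal, not data from the paper:

```python
import numpy as np

def first_canonical_corr(X, Y):
    """First canonical correlation between two feature sets
    (e.g. facial features X and body-gesture features Y)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def inv_sqrt(S):
        # inverse matrix square root via eigen-decomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(X.T @ X) @ (X.T @ Y) @ inv_sqrt(Y.T @ Y)
    return float(np.linalg.svd(M, compute_uv=False)[0])

# synthetic data: both "modalities" observe the same latent signal
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
face = np.hstack([z + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])
body = np.hstack([z + 0.1 * rng.normal(size=(200, 1)) for _ in range(2)])
rho = first_canonical_corr(face, body)
```

A canonical correlation near 1 indicates a direction in each feature space along which the two modalities co-vary almost perfectly, which is the kind of cross-part or cross-modality coupling the abstract exploits.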

Oral session 2: spontaneous behavior 2

Audiovisual recognition of spontaneous interest within conversations BIBAKFull-Text 30-37
  Björn Schuller; Ronald Müller; Benedikt Hörnler; Anja Höthker; Hitoshi Konosu; Gerhard Rigoll
In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. For the most robust estimate, information from four sources is combined by a synergistic and individually failure-tolerant fusion. Firstly, speech is analyzed with respect to acoustic properties based on a high-dimensional prosodic, articulatory, and voice quality feature space, plus linguistic analysis of the spoken content by LVCSR and bag-of-words vector space modeling, including non-verbals. Secondly, visual analysis provides patterns of facial expression by AAMs, and of movement activity by eye tracking. Experiments are based on a database of 10.5 hours of spontaneous human-to-human conversation with 20 subjects balanced in gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and an early-level fusion in detail. Our experiments aim at a person-independent system for real-life usage and show the high potential of such a multimodal approach. Benchmark results based on transcription versus automatic processing are also provided.
Keywords: affective computing, audiovisual, emotion, interest
How to distinguish posed from spontaneous smiles using geometric features BIBAKFull-Text 38-45
  Michel F. Valstar; Hatice Gunes; Maja Pantic
Automatic distinction between posed and spontaneous expressions is an unsolved problem. Previous cognitive science studies indicated that automatic separation of posed from spontaneous expressions is possible using the face modality alone. However, little is known about the information contained in head and shoulder motion. In this work, we propose to (i) distinguish between posed and spontaneous smiles by fusing the head, face, and shoulder modalities, (ii) investigate which modalities carry important information and how the information of the modalities relates to each other, and (iii) determine to what extent the temporal dynamics of these signals contribute to solving the problem. We use a cylindrical head tracker to track the head movements and two particle filtering techniques to track the facial and shoulder movements. Classification is performed by kernel methods combined with ensemble learning techniques. We investigated two aspects of multimodal fusion: the level of abstraction (i.e., early, mid-level, and late fusion) and the fusion rule used (i.e., sum, product and weight criteria). Experimental results from 100 videos displaying posed smiles and 102 videos displaying spontaneous smiles are presented. Best results were obtained with late fusion of all modalities, when 94.0% of the videos were classified correctly.
Keywords: deception detection, human information processing, multimodal video processing
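The fusion rules compared above (sum, product, and weight criteria) can be sketched for late fusion of per-modality class probabilities; the probability values in the test below are invented for illustration, not taken from the paper:

```python
import numpy as np

def late_fusion(probs, rule="sum", weights=None):
    """Combine per-modality class-probability vectors (one row per
    modality, e.g. head / face / shoulder) with the sum, product, or
    weighted-sum rule, then return the winning class index."""
    P = np.asarray(probs, dtype=float)
    if rule == "sum":
        fused = P.sum(axis=0)
    elif rule == "product":
        fused = P.prod(axis=0)
    elif rule == "weight":
        fused = np.average(P, axis=0, weights=weights)
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return int(np.argmax(fused))
```

The sum rule tends to be robust when one modality's estimate is noisy, while the product rule lets a single confident modality veto the others; the weight rule interpolates by trusting some modalities more.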
Eliciting, capturing and tagging spontaneous facial affect in autism spectrum disorder BIBAKFull-Text 46-53
  Rana el Kaliouby; Alea Teeters
The emergence of novel affective technologies, such as wearable interventions for individuals who have difficulties with social-emotional communication, requires reliable, real-time processing of spontaneous expressions. This paper describes a novel wearable camera and a systematic methodology to elicit, capture and tag natural, yet experimentally controlled face videos in dyadic conversations. The MIT-Groden-Autism corpus is the first corpus of naturally-evoked facial expressions of individuals with and without Autism Spectrum Disorders (ASD), a growing population who have difficulties with social-emotional communication. It is also the largest in the number and duration of its videos, and represents affective-cognitive states that extend beyond the basic emotions. We highlight the machine vision challenges inherent in processing such a corpus, including pose changes and pathological affective displays.
Keywords: affective computing, autism spectrum disorder, facial expressions, spontaneous video corpus

Poster session 1

Statistical segmentation and recognition of fingertip trajectories for a gesture interface BIBAKFull-Text 54-57
  Kazuhiro Morimoto; Chiyomi Miyajima; Norihide Kitaoka; Katunobu Itou; Kazuya Takeda
This paper presents a virtual push button interface created by drawing a shape or line in the air with a fingertip. As an example of such a gesture-based interface, we developed a four-button interface for entering multi-digit numbers by pushing gestures within an invisible 2x2 button matrix inside a square drawn by the user. Trajectories of fingertip movements entering randomly chosen multi-digit numbers are captured with a 3D position sensor mounted on the forefinger's tip. We propose a statistical segmentation method for the trajectory of movements and a normalization method that is associated with the direction and size of gestures. The performance of the proposed method is evaluated in HMM-based gesture recognition. The recognition rate of 60.0% was improved to 91.3% after applying the normalization method.
Keywords: 3D position sensor, affine transformation, gesture interface, hidden Markov model, principal component analysis
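The normalization for direction and size described above might look roughly like the following numpy sketch (translate to the centroid, scale to unit RMS radius, and rotate the principal PCA axis onto the x-axis); the paper's exact procedure may differ, and the 2-D case here stands in for the sensor's 3-D output:

```python
import numpy as np

def normalize_trajectory(points):
    """Normalize a fingertip trajectory for position, size and direction:
    subtract the centroid, scale to unit RMS radius, and rotate so the
    first principal component aligns with the x-axis."""
    P = points - points.mean(axis=0)                  # remove position
    P = P / np.sqrt((P ** 2).sum(axis=1).mean())      # remove size
    # principal axis via eigen-decomposition of the 2x2 covariance
    w, V = np.linalg.eigh(np.cov(P.T))
    axis = V[:, np.argmax(w)]
    theta = np.arctan2(axis[1], axis[0])
    R = np.array([[np.cos(theta), np.sin(theta)],     # rotation by -theta
                  [-np.sin(theta), np.cos(theta)]])
    return P @ R.T                                    # remove direction
```

After this step, gestures drawn large or small, and at different angles, map to comparable trajectories, which is what lets the HMM recognizer generalize across users and accounts for the reported jump in recognition rate.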
A tactile language for intuitive human-robot communication BIBAKFull-Text 58-65
  Andreas J. Schmid; Martin Hoffmann; Heinz Woern
This paper presents a tactile language for controlling a robot through its artificial skin. This language greatly improves multimodal human-robot communication by adding both redundant and inherently new ways of controlling the robot through the tactile mode. We defined an interface for arbitrary tactile sensors, implemented symbol recognition for multi-finger contacts, and integrated these together with freely available character recognition software into an easy-to-extend system for tactile language processing that can also incorporate and process data from non-tactile interfaces. The recognized tactile symbols allow both direct control of the robot's tool center point and abstract commands like "stop" or "grasp object x with grasp type y". In addition to this versatility, the symbols are also extremely expressive, since multiple parameters like direction, distance, and speed can be decoded from a single human finger stroke. Furthermore, our efficient symbol recognition implementation achieves real-time performance while being platform-independent. We have successfully used both a multi-touch finger pad and our artificial robot skin as tactile interfaces. We evaluated our tactile language system by measuring its symbol and angle recognition performance, and the results are promising.
Keywords: human-robot cooperation, robot control, tactile interface, tactile language
Simultaneous prediction of dialog acts and address types in three-party conversations BIBAKFull-Text 66-73
  Yosuke Matsusaka; Mika Enomoto; Yasuharu Den
This paper reports on automatic prediction of dialog acts and address types in three-party conversations. In multi-party interaction, the dialog structure becomes more complex than in the one-to-one case, because an utterance can have more than one hearer. To cope with this problem, our framework predicts dialog acts and address types simultaneously. Prediction accuracy for dialog act labels reached 68.5% when both context and address types were considered. CART decision tree analysis was also applied to identify features useful for predicting these labels.
Keywords: dialog act, gaze, multi-party interaction, prosody, recognition
Developing and analyzing intuitive modes for interactive object modeling BIBAKFull-Text 74-81
  Alexander Kasper; Regine Becher; Peter Steinhaus; Rüdiger Dillmann
In this paper we present two approaches for intuitive interactive modeling of special object attributes using specific sensor hardware. After a brief overview of the state of the art in interactive, intuitive object modeling, we motivate the modeling task by deriving the different object attributes to be modeled from an analysis of important interactions with objects. As an example domain, we chose the setting of a service robot in a kitchen. Tasks from this domain were used to derive important basic actions, from which in turn the necessary object attributes were inferred.
   In the main section of the paper, two of the derived attributes are presented, each with an intuitive interactive modeling method. The object attributes to be modeled are stable object positions and movement restrictions for objects. Both of the intuitive interaction methods were evaluated with a group of test persons and the results are discussed. The paper ends with conclusions on the discussed results and a preview of future work in this area, in particular of potential applications.
Keywords: interactive object modeling, user interface
Extraction of important interactions in medical interviews using nonverbal information BIBAKFull-Text 82-85
  Yuichi Sawamoto; Yuichi Koyama; Yasushi Hirano; Shoji Kajita; Kenji Mase; Kimiko Katsuyama; Kazunobu Yamauchi
We propose a method of extracting important interaction patterns in medical interviews. Because the interview is a major step where doctor-patient communication takes place, improving the skill and quality of the medical interview will lead to better medical care. A pattern mining method for multimodal interaction logs, such as gestures and speech, is applied to medical interviews in order to extract characteristic doctor-patient interactions. We show that several interesting patterns can be extracted, and we examine their interpretations. The extracted patterns are considered to be ones that doctors should acquire through training and practice for the medical interview.
Keywords: medical interview, multimodal interaction patterns
Towards smart meeting: enabling technologies and a real-world application BIBAKFull-Text 86-93
  Zhiwen Yu; Motoyuki Ozeki; Yohsuke Fujii; Yuichi Nakamura
In this paper, we describe the enabling technologies for a smart meeting system based on a three-layered generic model. From the physical level to the semantic level, it consists of meeting capturing, meeting recognition, and semantic processing. Based on an overview of the underlying technologies and existing work, we propose a novel real-world smart meeting application, called MeetingAssistant. It differs from previous systems in two aspects. First, it provides real-time browsing that allows a participant to instantly view the status of the current meeting. This feature is helpful in activating discussion and facilitating human communication during a meeting. Second, context-aware browsing adaptively selects and displays meeting information according to the user's situational context, e.g., user purpose, which makes meeting viewing more efficient.
Keywords: context-aware, meeting browser, real-time, smart meeting
Multimodal cues for addressee-hood in triadic communication with a human information retrieval agent BIBAKFull-Text 94-101
  Jacques Terken; Irene Joris; Linda De Valk
Over the last few years, a number of studies have dealt with the question of how the addressee of an utterance can be determined from observable behavioural features in the context of mixed human-human and human-computer interaction (e.g. someone talking alternately to a robot and another person). Often in these cases, the behaviour is strongly influenced by the difference in communicative ability between the robot and the other person, and the "salience" of the robot or system, turning it into a situational distractor. In the current paper, we study triadic human-human communication, where one of the participants plays the role of an information retrieval agent (such as in a travel agency, where two customers who want to book a vacation engage in a dialogue with the travel agent to specify constraints on preferable options). Through a perception experiment we investigate the role of audio and visual cues as markers of the addressee-hood of utterances by customers. The outcomes show that both audio and visual cues provide specific types of information, and that combined audio-visual cues give the best performance. In addition, we conduct a detailed analysis of the eye gaze behaviour of the information retrieval agent both when listening and speaking, providing input for modelling the behaviour of an embodied conversational agent.
Keywords: addresseehood, conversational agents, eye gaze, multimodal interaction, perceptual user interfaces
The effect of input mode on inactivity and interaction times of multimodal systems BIBAKFull-Text 102-109
  Manolis Perakakis; Alexandros Potamianos
In this paper, the efficiency and usage patterns of input modes in multimodal dialogue systems are investigated for both desktop and personal digital assistant (PDA) working environments. For this purpose, a form-filling travel reservation application that combines the speech and visual modalities is evaluated; three multimodal modes of interaction are implemented, namely "Click-To-Talk", "Open-Mike" and "Modality-Selection". The three multimodal systems are evaluated and compared with the "GUI-Only" and "Speech-Only" unimodal systems. Mode and duration statistics are computed for each system, for each turn and for each attribute in the form. Turn time is decomposed into interaction and inactivity time, and the statistics for each input mode are computed. Results show that multimodal and adaptive interfaces are superior in terms of interaction time, but not always in terms of inactivity time. Users also tend to use the most efficient input mode, although our experiments show a bias towards the speech modality.
Keywords: input modality selection, mobile multimodal interfaces
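The decomposition of turn time into inactivity and interaction time, aggregated per input mode, can be sketched as follows; the timestamps and mode names in the test are hypothetical:

```python
def mode_statistics(turns):
    """Aggregate per-input-mode timing over a list of turns.
    Each turn is (mode, prompt_end, input_start, input_end):
    inactivity  = time from end of system prompt to first user input,
    interaction = duration of the user input itself,
    turn time   = inactivity + interaction."""
    totals = {}
    for mode, prompt_end, input_start, input_end in turns:
        s = totals.setdefault(mode, {"inactivity": 0.0, "interaction": 0.0, "n": 0})
        s["inactivity"] += input_start - prompt_end
        s["interaction"] += input_end - input_start
        s["n"] += 1
    # mean inactivity and interaction time per mode
    return {m: {"inactivity": s["inactivity"] / s["n"],
                "interaction": s["interaction"] / s["n"]}
            for m, s in totals.items()}
```

Separating the two components matters because a mode can look fast overall while actually shifting time from the input itself to the user's pause before it, which is the pattern the paper reports.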
Positional mapping: keyboard mapping based on characters writing positions for mobile devices BIBAKFull-Text 110-117
  Ye Kyaw Thu; Yoshiyori Urano
Keyboard or keypad layout is one of the important factors for increasing user text input speed, especially on limited keypads such as mobile phones. This paper introduces a novel key mapping method, "Positional Mapping" (PM), for phonetic scripts such as the Myanmar language, based on its characters' writing positions. Our approach makes key mapping for the Myanmar language very simple and easy to memorize. We developed positional-mapping text input prototypes for a mobile phone keypad, a PDA, the customizable DX1 keyboard input system, and a dual-joystick game pad, and conducted user studies for each prototype. Evaluation was based on users' actual typing speed with our four PM prototypes, and showed that first-time users can type at reasonable average speeds (3 min 47 sec with the DX1, 4 min 42 sec with the mobile phone prototype, 4 min 26 sec with the PDA, and 5 min 30 sec with the dual-joystick game pad) to finish a short Myanmar SMS message of six sentences. Positional Mapping can be extended to other phonetic scripts, which we demonstrate with a Bangla mobile phone prototype.
Keywords: Bangla language, Myanmar language, keypad layout, mobile phone, pen-based, soft keyboard, stylus input, text entry
Five-key text input using rhythmic mappings BIBAKFull-Text 118-121
  Christine Szentgyorgyi; Edward Lank
Novel key mappings, including chording, character prediction, and multi-tap, allow text entry with fewer keys than a conventional keyboard provides. In this paper, we explore a text input method that makes use of rhythmic mappings of five keys. The keying technique averages 1.5 keystrokes per character for typical English text. In initial testing, the technique shows performance similar to chording and other multi-tap techniques, and our subjects had few problems with basic text entry. Five-key entry techniques may have benefits for text entry on multi-point touch devices, as they eliminate targeting by providing a unique mapping for each finger.
Keywords: multi-tap, one-handed text entry, rhythmic tapping, touch
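The 1.5-keystrokes-per-character figure is an expectation over letter frequencies: each character's code length is weighted by how often the character occurs. The computation can be sketched as follows, with a hypothetical three-letter mapping rather than the paper's actual five-key rhythmic codes:

```python
def keystrokes_per_char(mapping, letter_freq):
    """Expected keystrokes per character (KSPC) for a key mapping:
    sum over letters of relative frequency * code length.
    mapping: letter -> key sequence; letter_freq: letter -> frequency."""
    return sum(letter_freq[ch] * len(mapping[ch]) for ch in letter_freq)
```

Assigning the shortest key sequences to the most frequent letters is what pushes the expected cost below the worst-case code length.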
Toward content-aware multimodal tagging of personal photo collections BIBAKFull-Text 122-125
  Paulo Barthelmess; Edward Kaiser; David R. McGee
A growing number of tools that make use of existing tags to help organize and retrieve photos is becoming available, facilitating the management and use of photo sets. The tagging on which these techniques rely remains a time-consuming, labor-intensive task that discourages many users. To address this problem, we aim to leverage the multimodal content of naturally occurring photo discussions among friends and families to automatically extract tags from a combination of conversational speech, handwriting, and photo content analysis. While naturally occurring discussions are rich sources of information about photos, methods need to be developed to reliably extract a set of discriminative tags from this noisy, unconstrained group discourse. To this end, this paper contributes an analysis of pilot data identifying robust multimodal features, examining the interplay between photo content and other modalities such as speech and handwriting. Our analysis is motivated by a search for design implications leading to the effective incorporation of automated location and person identification (e.g. based on GPS and facial recognition technologies) into a system able to extract tags from natural multimodal conversations.
Keywords: automatic label extraction, collaborative interaction, intelligent interfaces, multimodal processing, photo annotation, tagging
A survey of affect recognition methods: audio, visual and spontaneous expressions BIBAKFull-Text 126-133
  Zhihong Zeng; Maja Pantic; Glenn I. Roisman; Thomas S. Huang
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. Promising approaches have been reported, including automatic methods for facial and vocal affect recognition. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in visual and audio expressions from spontaneously occurring behavior. Recently, efforts to develop algorithms that can process naturally occurring human affective behavior have emerged. This paper surveys these efforts. We first discuss human emotion perception from a psychological perspective. Next, we examine the available approaches to solving the problem of machine understanding of human affective behavior occurring in real-world settings. We finally outline some scientific and engineering challenges for advancing human affect sensing technology.
Keywords: affect recognition, affective computing, emotion recognition, human computing, multimodal human computer interaction, multimodal user interfaces
Real-time expression cloning using appearance models BIBAKFull-Text 134-139
  Barry-John Theobald; Iain A. Matthews; Jeffrey F. Cohn; Steven M. Boker
Active Appearance Models (AAMs) are generative parametric models commonly used to track, recognise and synthesise faces in images and video sequences. In this paper we describe a method for transferring dynamic facial gestures between subjects in real-time. The main advantages of our approach are that: 1) the mapping is computed automatically and does not require high-level semantic information describing facial expressions or visual speech gestures; 2) the mapping is simple and intuitive, allowing expressions to be transferred and rendered in real-time; 3) the mapped expression can be constrained to have the appearance of the target producing the expression, rather than the source expression imposed onto the target face; 4) near-videorealistic talking faces for new subjects can be created without the cost of recording and processing a complete training corpus for each. Our system enables face-to-face interaction with an avatar driven by an AAM of an actual person in real-time, and we show examples of arbitrary expressive speech frames cloned across different subjects.
Keywords: active appearance models, expression cloning, facial animation
Gaze-communicative behavior of stuffed-toy robot with joint attention and eye contact based on ambient gaze-tracking BIBAKFull-Text 140-145
  Tomoko Yonezawa; Hirotake Yamazoe; Akira Utsumi; Shinji Abe
This paper proposes a gaze-communicative stuffed-toy robot system with joint attention and eye-contact reactions based on ambient gaze-tracking. For free and natural interaction, we adopted our remote gaze-tracking method. Corresponding to the user's gaze, the gaze-reactive stuffed-toy robot is designed to gradually establish 1) joint attention using the direction of the robot's head and 2) eye-contact reactions from several sets of motion. From both subjective evaluations and observations of the user's gaze in the demonstration experiments, we found that i) joint attention draws the user's interest along with the user-guessed interest of the robot, ii) "eye contact" brings the user a favorable feeling for the robot, and iii) this feeling is enhanced when "eye contact" is used in combination with "joint attention." These results support the approach of our embodied gaze-communication model.
Keywords: eye contact, gaze communication, joint attention, stuffed-toy robot
Map navigation with mobile devices: virtual versus physical movement with and without visual context BIBAKFull-Text 146-153
  Michael Rohs; Johannes Schöning; Martin Raubal; Georg Essl; Antonio Krüger
A user study was conducted to compare the performance of three methods for map navigation with mobile devices. These methods are joystick navigation, the dynamic peephole method without visual context, and the magic lens paradigm using external visual context. The joystick method is the familiar scrolling and panning of a virtual map, keeping the device itself static. In the dynamic peephole method the device is moved and the map is fixed with respect to an external frame of reference, but no visual information is present outside the device's display. The magic lens method augments external content with graphical overlays, hence providing visual context outside the device display. Here, too, motion of the device serves to steer navigation. We compare these methods in a study measuring user performance, motion patterns, and subjective preference via questionnaires. The study demonstrates the advantage of dynamic peephole and magic lens interaction over joystick interaction in terms of search time and degree of exploration of the search space.
Keywords: augmented reality, camera phones, camera-based interaction, handheld displays, interaction techniques, maps, mobile devices, navigation, spatially aware displays

Oral session 3: cross-modality

Can you talk or only touch-talk: A VoIP-based phone feature for quick, quiet, and private communication BIBAKFull-Text 154-161
  Maria Danninger; Leila Takayama; Qianying Wang; Courtney Schultz; Jörg Beringer; Paul Hofmann; Frankie James; Clifford Nass
Advances in mobile communication technologies have allowed people in more places to reach each other more conveniently than ever before. However, many mobile phone communications occur in inappropriate contexts, disturbing others in close proximity, invading personal and corporate privacy, and more broadly breaking social norms. This paper presents a telephony system that allows users to answer calls quietly and privately without speaking. The paper discusses the iterative process of design, implementation and system evaluation. The resulting system is a VoIP-based telephony system that can be immediately deployed from any phone capable of sending DTMF signals. Observations and results from inserting and evaluating this technology in real-world business contexts through two design cycles of the Touch-Talk feature are reported.
Keywords: VoIP, business context, computer-mediated communication, mobile phones, telephony, touch-talk
Designing audio and tactile crossmodal icons for mobile devices BIBAKFull-Text 162-169
  Eve Hoggan; Stephen Brewster
This paper reports an experiment into the design of crossmodal icons which can provide an alternative form of output for mobile devices using audio and tactile modalities to communicate information. A complete set of crossmodal icons was created by encoding three dimensions of information in three crossmodal auditory/tactile parameters. Earcons were used for the audio and Tactons for the tactile crossmodal icons. The experiment investigated absolute identification of audio and tactile crossmodal icons when a user is trained in one modality and tested in the other (and given no training in the other modality) to see if knowledge could be transferred between modalities. We also compared performance when users were static and mobile to see what effects mobility might have on recognition of the cues. The results showed that if participants were trained in sound with Earcons and then tested with the same messages presented via Tactons they could recognize 85% of messages when stationary and 76% when mobile. When trained with Tactons and tested with Earcons participants could accurately recognize 76.5% of messages when stationary and 71% of messages when mobile. These results suggest that participants can recognize and understand a message in a different modality very effectively. These results will aid designers of mobile displays in creating effective crossmodal cues that require minimal training for users and can provide alternative presentation modalities through which information may be presented if the context requires.
Keywords: crossmodal interaction, earcons, mobile interaction, multimodal interaction, tactons (tactile icons)
A study on the scalability of non-preferred hand mode manipulation BIBAKFull-Text 170-177
  Jaime Ruiz; Edward Lank
In pen-tablet input devices, modes allow overloading of the electronic stylus. In the case of two modes, switching modes with the non-preferred hand is most effective [12]. Further, allowing temporal overlap of mode switch and pen action boosts speed [11]. We examine the effect of increasing the number of interface modes accessible via non-preferred-hand mode switching on task performance in pen-tablet interfaces. We demonstrate that the temporal benefit of overlapping mode selection and pen action for the two-mode case is preserved as the number of modes increases. This benefit is the result of both concurrent action of the hands and reduced planning time for the overall task. Finally, while allowing bimanual overlap is still faster, it takes longer to switch modes as the number of modes increases. Improved understanding of the temporal costs presented assists in the design of pen-tablet interfaces with larger sets of interface modes.
Keywords: bimanual interaction, concurrent mode switching, interaction technique, mode, pen interfaces, stylus

Poster session 2

VoicePen: augmenting pen input with simultaneous non-linguistic vocalization BIBAKFull-Text 178-185
  Susumu Harada; T. Scott Saponas; James A. Landay
This paper explores using non-linguistic vocalization as an additional modality to augment digital pen input on a tablet computer. We investigated this through a set of novel interaction techniques and a feasibility study. Typically, digital pen users control one or two parameters using stylus position and sometimes pen pressure. However, in many scenarios the user can benefit from the ability to continuously vary additional parameters. Non-linguistic vocalizations, such as vowel sounds, variation of pitch, or control of loudness have the potential to provide fluid continuous input concurrently with pen interaction. We present a set of interaction techniques that leverage the combination of voice and pen input when performing both creative drawing and object manipulation tasks. Our feasibility evaluation suggests that with little training people can use non-linguistic vocalization to productively augment digital pen interaction.
Keywords: multimodal input, pen-based interface, voice-based interface
A large-scale behavior corpus including multi-angle video data for observing infants' long-term developmental processes BIBAKFull-Text 186-192
  Shinya Kiriyama; Goh Yamamoto; Naofumi Otani; Shogo Ishikawa; Yoichi Takebayashi
We have developed a method for multimodal observation of infant development. In order to analyze the development of problem-solving skills by observing scenes of task achievement or communication with others, we have introduced a method for extracting detailed behavioral features expressed by gestures or eyes. We have realized an environment for recording the behavior of the same infants continuously as multi-angle video. The environment has evolved into a practical infrastructure through the following four steps: (1) establish an infant school and study the camera arrangement; (2) obtain participants in the school who agree with the project purpose and start to hold regular classes; (3) begin to construct a multimodal infant behavior corpus while considering observation methods; (4) practice development-process analyses using the corpus. We have constructed a support tool for observing the huge amount of video data, which increases with age. The system has contributed to enriching the corpus with annotations from multimodal viewpoints about infant development. With a focus on the demonstrative expression as a fundamental human behavior, we have extracted 240 scenes from the video over 10 months and observed them. The analysis results have revealed interesting findings about the developmental changes in infants' gestures and eyes, and indicated the effectiveness of the proposed observation method.
Keywords: behavior observation, infant development, multi-angle video, multimodal behavior corpus
The micole architecture: multimodal support for inclusion of visually impaired children BIBAKFull-Text 193-200
  Thomas Pietrzak; Benoît Martin; Isabelle Pecci; Rami Saarinen; Roope Raisamo; Janne Järvi
Modern information technology allows us to seek out new ways to support the computer use and communication of disabled people. With the aid of new interaction technologies and techniques, visually impaired and sighted users can collaborate, for example, in classroom situations. The main goal of the MICOLE project was to create a software architecture that makes it easier for developers to create multimodal multi-user applications. The framework is based on interconnected software agents. The hardware used in this study includes the VTPlayer Mouse, which has two built-in Braille displays, and several haptic devices such as the PHANToM Omni, PHANToM Desktop and PHANToM Premium. We also used the SpaceMouse and various audio setups in the applications. In this paper we present a software architecture, a set of software agents, and an example of using the architecture. The example application shown is an electric circuit application that follows the single-user-with-many-devices scenario. The application uses a PHANToM and a VTPlayer Mouse together with visual and audio feedback to make electric circuits understandable through touch.
Keywords: distributed/collaborative multimodal interfaces, haptic interfaces, multimodal input and output interfaces, universal access interfaces
Interfaces for musical activities and interfaces for musicians are not the same: the case for codes, a web-based environment for cooperative music prototyping BIBAKFull-Text 201-207
  Evandro Manara Miletto; Luciano Vargas Flores; Marcelo Soares Pimenta; Jérôme Rutily; Leonardo Santagada
In this paper, some requirements of user interfaces for musical activities are investigated and discussed, particularly focusing on the necessary distinction between interfaces for musical activities and interfaces for musicians. We also discuss the interactive and cooperative aspects of music creation activities in CODES, a Web-based environment for cooperative music prototyping, designed mainly for novices in music. Aspects related to interaction flexibility and usability are presented, as well as features to support manipulation of complex musical information, cooperative activities and group awareness, which allow users to understand the actions and decisions of all group members cooperating and sharing a music prototype.
Keywords: computer music, cooperative music prototyping, human-computer interaction, interfaces for novices, world wide web
TotalRecall: visualization and semi-automatic annotation of very large audio-visual corpora BIBAKFull-Text 208-215
  Rony Kubat; Philip DeCamp; Brandon Roy
We introduce a system for visualizing, annotating, and analyzing very large collections of longitudinal audio and video recordings. The system, TotalRecall, is designed to address the requirements of projects like the Human Speechome Project, for which more than 100,000 hours of multitrack audio and video have been collected over a twenty-two-month period. Our goal in this project is to transcribe speech in over 10,000 hours of audio recordings, and to annotate the position and head orientation of multiple people in the 10,000 hours of corresponding video. Higher-level behavioral analysis of the corpus will be based on these and other annotations. To efficiently cope with this huge corpus, we are developing semi-automatic data coding methods that are integrated into TotalRecall. Ultimately, this system and the underlying methodology may enable new forms of multimodal behavioral analysis grounded in ultradense longitudinal data.
Keywords: multimedia corpora, semi-automation, speech transcription, video annotation, visualization
Extensible middleware framework for multimodal interfaces in distributed environments BIBAKFull-Text 216-219
  Vitor Fernandes; Tiago Guerreiro; Bruno Araújo; Joaquim Jorge; João Pereira
We present a framework to manage multimodal applications and interfaces in a reusable and extensible manner. We achieve this by focusing the architecture both on applications' needs and devices' capabilities. One particular domain we want to approach is collaborative environments where several modalities and applications make it necessary to provide for an extensible system combining diverse components across heterogeneous platforms on-the-fly. This paper describes the proposed framework and its main contributions in the context of an architectural application scenario. We demonstrate how to connect different non-conventional applications and input modalities around an immersive environment (tiled display wall).
Keywords: capability, collaborative, extensible, framework, multimodal interfaces, reusable
Temporal filtering of visual speech for audio-visual speech recognition in acoustically and visually challenging environments BIBAKFull-Text 220-227
  Jong-Seok Lee; Cheol Hoon Park
The use of visual information of speech has been shown to be effective for compensating for the performance degradation of acoustic speech recognition in noisy environments. However, visual noise is usually ignored in most audio-visual speech recognition systems, even though it can be introduced into visual speech signals during their acquisition or transmission. In this paper, we present a new temporal filtering technique for the extraction of noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel value sequences of lip region images in order to remove unwanted temporal variations due to visual noise, illumination conditions or speakers' appearances. We demonstrate that the method can improve not only visual speech recognition performance for clean and noisy images but also audio-visual speech recognition performance in both acoustically and visually noisy conditions.
Keywords: audio-visual speech recognition, feature extraction, hidden Markov model, late integration, neural network, noise-robustness, temporal filtering
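The band-pass idea described in the abstract can be sketched compactly. The paper's actual filter design is not given here, so the following hypothetical Python sketch approximates a temporal band-pass filter as the difference of two moving averages applied to a single pixel's intensity sequence; the function names and window sizes are invented for illustration only.

```python
def moving_average(x, w):
    """Simple moving average with window w (edges use a shrinking window)."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - w // 2)
        hi = min(len(x), i + w // 2 + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def band_pass(x, short_w=3, long_w=15):
    """Difference of a short and a long moving average: the short window
    passes mid-rate variations (e.g. lip articulation), while subtracting
    the long-window average removes slow drift such as gradual
    illumination change."""
    fast = moving_average(x, short_w)
    slow = moving_average(x, long_w)
    return [f - s for f, s in zip(fast, slow)]
```

A constant pixel sequence (no articulation, no drift) maps to all zeros, while an alternating sequence survives the filter, which is the qualitative behavior a temporal band-pass should show.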
Reciprocal attentive communication in remote meeting with a humanoid robot BIBAKFull-Text 228-235
  Tomoyuki Morita; Kenji Mase; Yasushi Hirano; Shoji Kajita
In this paper, we investigate the reciprocal attention modality in remote communication. A remote meeting system with a humanoid robot avatar is proposed to overcome the invisible wall of a video conferencing system. Our experimental result shows that a tangible robot avatar provides more effective reciprocal attention than video communication does. The subjects in the experiment are asked to determine whether or not a remote participant with the avatar is actively listening to the local presenter's talk. In this system, the head motion of a remote participant is transferred and expressed by the head motion of a humanoid robot. While the presenter has difficulty in determining the extent of a remote participant's attention with a video conferencing system, she/he has a better sense of remote attentive states with the robot. Based on the evaluation result, we propose a vision system for the remote user that integrates omni-directional camera and robot-eye camera images to provide a wide view with a delay-compensation feature.
Keywords: gaze, gesture, humanoid robot, remote communication, robot teleconferencing
Password management using doodles BIBAKFull-Text 236-239
  Naveen Sundar Govindarajulu; Sriganesh Madhvanath
The average computer user needs to remember a large number of text username and password combinations for different applications, which places a large cognitive load on the user. Consequently users tend to write down passwords, use easy-to-remember (and easy-to-guess) passwords, or use the same password for multiple applications, leading to security risks. This paper describes the use of personalized hand-drawn "doodles" for recall and management of password information. Since doodles can be easier to remember than text passwords, the cognitive load on the user is reduced. Our method involves recognizing doodles by matching them against stored prototypes using handwritten shape matching techniques. We have built a system which manages passwords for web applications through a web browser. In this system, the user logs into a web application by drawing a doodle using a touchpad or digitizing tablet attached to the computer. The user is automatically logged into the web application if the doodle matches the doodle drawn during enrollment. We also report accuracy results for our doodle recognition system, and conclude with a summary of next steps.
Keywords: doodles, password management
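The prototype-matching step described in the abstract can be illustrated with a toy sketch. The paper's actual shape-matching algorithm is not detailed here, so this hypothetical Python example stands in a much simpler scheme: resample each stroke to a fixed number of points and accept the nearest stored prototype only if its mean point-to-point distance is under a threshold. All names, the index-based resampling, and the threshold value are assumptions made for illustration.

```python
import math

def resample(points, n=8):
    """Pick n evenly spaced points along the stroke (index-based; a crude
    stand-in for the arc-length resampling a real shape matcher would use)."""
    step = (len(points) - 1) / (n - 1)
    return [points[round(i * step)] for i in range(n)]

def doodle_distance(a, b, n=8):
    """Mean Euclidean distance between corresponding resampled points."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(math.dist(p, q) for p, q in zip(ra, rb)) / n

def authenticate(drawn, prototypes, threshold=5.0):
    """Return the user whose enrolled prototype best matches the drawn
    doodle, or None if even the best match exceeds the threshold."""
    user, proto = min(prototypes.items(),
                      key=lambda kv: doodle_distance(drawn, kv[1]))
    return user if doodle_distance(drawn, proto) <= threshold else None
```

A doodle close to an enrolled prototype authenticates as that user; an unrelated scribble falls outside the threshold and is rejected.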
A computational model for spatial expression resolution BIBAKFull-Text 240-246
  Andrea Corradini
This paper presents a computational model for the interpretation of linguistic spatial propositions in the restricted realm of a 2D puzzle game. Based on an experiment aimed at analyzing human judgment of spatial expressions, we establish a set of criteria that explain human preference for certain interpretations over others. For each of these criteria, we define a metric that combines the semantic and pragmatic contextual information regarding the game as well as the utterance being resolved. Each metric gives rise to a potential field that characterizes the degree of likelihood for carrying out the instruction at a specific hypothesized location. We resort to machine learning techniques to determine a model of spatial relationships from the data collected during the experiment. Sentence interpretation occurs by matching the potential field of each of its possible interpretations to the model at hand. The system's explanation capabilities lead to the correct assessment of ambiguous situated utterances for a large percentage of the collected expressions.
Keywords: machine learning, psycholinguistic study, spatial expressions
Disambiguating speech commands using physical context BIBAKFull-Text 247-254
  Katherine M. Everitt; Susumu Harada; Jeff Bilmes; James A. Landay
Speech has great potential as an input mechanism for ubiquitous computing. However, the current requirements necessary for accurate speech recognition, such as a quiet environment and a well-positioned and high-quality microphone, are unreasonable to expect in a realistic setting. In a physical environment, there is often contextual information which can be sensed and used to augment the speech signal. We investigated improving speech recognition rates for an electronic personal trainer using knowledge about what equipment was in use as context. We performed an experiment with participants speaking in an instrumented apartment environment and compared the recognition rates of a larger grammar with those of a smaller grammar that is determined by the context.
Keywords: context, exercise, fitness, speech recognition

Oral session 4: meeting applications

Automatic inference of cross-modal nonverbal interactions in multiparty conversations: "who responds to whom, when, and how?" from gaze, head gestures, and utterances BIBAKFull-Text 255-262
  Kazuhiro Otsuka; Hiroshi Sawada; Junji Yamato
A novel probabilistic framework is proposed for analyzing cross-modal nonverbal interactions in multiparty face-to-face conversations. The goal is to determine "who responds to whom, when, and how" from multimodal cues including gaze, head gestures, and utterances. We formulate this problem as the probabilistic inference of the causal relationship among participants' behaviors involving head gestures and utterances. To solve this problem, this paper proposes a hierarchical probabilistic model; the structures of interactions are probabilistically determined from high-level conversation regimes (such as monologue or dialogue) and gaze directions. Based on the model, the interaction structures, gaze, and conversation regimes are simultaneously inferred from observed head motion and utterances, using a Markov chain Monte Carlo method. The head gestures, including nodding, shaking and tilting, are recognized with a novel wavelet-based technique from magnetic sensor signals. The utterances are detected using data captured by lapel microphones. Experiments on four-person conversations confirm the effectiveness of the framework in discovering interactions such as question-and-answer and addressing behavior followed by back-channel responses.
Keywords: Bayesian network, eye gaze, face-to-face multiparty conversation, Gibbs sampler, head gestures, Markov chain Monte Carlo, nonverbal behaviors, semi-Markov process
Influencing social dynamics in meetings through a peripheral display BIBAKFull-Text 263-270
  Janienke Sturm; Olga Houben-van Herwijnen; Anke Eyck; Jacques Terken
We present a service providing real-time feedback to participants of small group meetings on the social dynamics of the meeting. The service measures and visualizes properties of participants' behaviour that are relevant to the social dynamics of the meeting: speaking time and gaze behaviour. The dynamic visualization is offered to meeting participants during the meeting through a peripheral display. Whereas an initial version was evaluated using wizards to obtain the required information about gazing behaviour and speaking activity instead of perceptual systems, in the current paper we employ a system including automated perceptual components. We describe the system properties and the perceptual components. The service was evaluated in a within-subjects experiment, where groups of participants discussed topics of general interest, with a total of 82 participants. It was found that the presence of feedback about speaking time influenced the behaviour of the participants in such a way that over-participators behaved less dominantly and under-participators became more active. Feedback on eye-gaze behaviour did not affect participants' gazing behaviour (for either listeners or speakers) during the meeting.
Keywords: head orientation detection, meetings, peripheral display, social dynamics, speech activity detection
Using the influence model to recognize functional roles in meetings BIBAKFull-Text 271-278
  Wen Dong; Bruno Lepri; Alessandro Cappelletti; Alex Sandy Pentland; Fabio Pianesi; Massimo Zancanaro
In this paper, an influence model is used to recognize the functional roles played during meetings. Previous work on the same corpus demonstrated high recognition accuracy using SVMs with RBF kernels. In this paper, we discuss the problems of that approach, mainly over-fitting, the curse of dimensionality and the inability to generalize to different group configurations. We present results obtained with an influence-modeling method that avoids these problems and ensures both greater robustness and generalization capability.
Keywords: group interaction, intelligent environments, support vector machines

Poster session 3

User impressions of a stuffed doll robot's facing direction in animation systems BIBAKFull-Text 279-284
  Hiroko Tochigi; Kazuhiko Shinozawa; Norihiro Hagita
This paper investigates the effect on user impressions of the body direction of a stuffed doll robot in an animation system. Many systems that combine a computer display with a robot have been developed, and one of their applications is entertainment, for example, an animation system. In these systems, the robot, as a 3D agent, can be more effective than a 2D agent in helping the user enjoy the animation experience by using spatial characteristics, such as body direction, as a means of expression. The direction in which the robot faces, i.e., towards the human or towards the display, is investigated here.
   User impressions from 25 subjects were examined. The experimental results show that the robot facing the display together with the user is effective for eliciting good feelings from the user, regardless of the user's personality characteristics. The results also suggest that extroverted subjects tend to have a better feeling towards a robot facing the user than introverted subjects do.
Keywords: animation system, impression evaluation, stuffed doll robot
Speech-driven embodied entrainment character system with hand motion input in mobile environment BIBAKFull-Text 285-290
  Kouzi Osaki; Tomio Watanabe; Michiya Yamamoto
InterActor is a speech-input-driven CG-embodied interaction character that can generate communicative movements and actions for entrained interactions. InterPuppet, on the other hand, is an embodied interaction character that is driven both by speech input, as InterActor is, and by hand motion input, like a puppet. Humans can therefore use InterPuppet to communicate effectively using both deliberate body movements and natural communicative movements and actions. In this paper, an advanced InterPuppet system that uses a cellular-phone-type device is developed, which can be used in a mobile environment. The effectiveness of the system is demonstrated by performing a sensory evaluation experiment in an actual remote communication scenario.
Keywords: cellular phone, embodied communication, embodied interaction, human communication, human interaction
Natural multimodal dialogue systems: a configurable dialogue and presentation strategies component BIBAKFull-Text 291-298
  Meriam Horchani; Benjamin Caron; Laurence Nigay; Franck Panaget
In the context of natural multimodal dialogue systems, we address the challenging issue of defining cooperative answers in an appropriate multimodal form. Highlighting the intertwined relation of multimodal outputs with content, we focus on the Dialogic strategy component, a component that selects, from the set of possible contents answering a user's request, the content to be presented to the user and its multimodal presentation. The content selection and presentation allocation managed by the Dialogic strategy component are based on various constraints, such as the availability of a modality and the user's preferences. Considering three generic types of dialogue strategies and their corresponding handled types of information, as well as three generic types of presentation tasks, we present a first rule-based implementation of the Dialogic strategy component. By providing a graphical interface for configuring the component by editing the rules, we show how the component can easily be modified by ergonomists at design time to explore different solutions. In further work we envision letting the user modify the component at runtime.
Keywords: development tool, dialogue and presentation strategies, multimodal output
Modeling human interaction resources to support the design of wearable multimodal systems BIBAKFull-Text 299-306
  Tobias Klug; Max Mühlhäuser
Designing wearable application interfaces that integrate well into real-world processes like aircraft maintenance or medical examinations is challenging. One of the main success criteria is how well the multimodal interaction with the computer system fits an already existing real-world task. Therefore, the interface design needs to take the real-world task flow into account from the beginning.
   We propose a model of interaction devices and human interaction capabilities that helps evaluate how well different interaction devices/techniques integrate with specific real-world scenarios. The model was developed based on a survey of the wearable interaction research literature. By combining this model with descriptions of observed real-world tasks, possible conflicts between task performance and device requirements can be visualized, helping the interface designer find a suitable solution.
Keywords: interaction devices, interaction resource model, multimodal interaction, wearable computing
Speech-filtered bubble ray: improving target acquisition on display walls BIBAKFull-Text 307-314
  Edward Tse; Mark Hancock; Saul Greenberg
The rapid development of large interactive wall displays has been accompanied by research on methods that allow people to interact with the display at a distance. The basic method for target acquisition is by ray casting a cursor from one's pointing finger or hand position; the problem is that selection is slow and error-prone with small targets. A better method is the bubble cursor that resizes the cursor's activation area to effectively enlarge the target size. The catch is that this technique's effectiveness depends on the proximity of surrounding targets: while beneficial in sparse spaces, it is less so when targets are densely packed together. Our method is the speech-filtered bubble ray that uses speech to transform a dense target space into a sparse one. Our strategy builds on what people already do: people pointing to distant objects in a physical workspace typically disambiguate their choice through speech. For example, a person could point to a stack of books and say "the green one". Gesture indicates the approximate location for the search, and speech 'filters' unrelated books from the search. Our technique works the same way: a person specifies a property of the desired object, and only objects matching that property trigger the bubble's resizing. In a controlled evaluation, people were faster and preferred using the speech-filtered bubble ray over the standard bubble ray and ray casting approach.
Keywords: freehand interaction, gestures, large display walls, multimodal, pointing, speech, speech filtering
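The filtering strategy in the abstract reduces to "restrict the candidate set by the spoken property, then take the nearest target". A minimal Python sketch follows; the target records, the `color` attribute, and the coordinates are invented for illustration and do not come from the paper.

```python
import math

def select_target(ray_point, targets, spoken_property=None):
    """Return the target closest to the ray's landing point.
    If a spoken property is given, only matching targets are candidates,
    turning a dense target space into a sparse one."""
    candidates = [t for t in targets
                  if spoken_property is None or t["color"] == spoken_property]
    if not candidates:
        return None
    return min(candidates, key=lambda t: math.dist(ray_point, t["pos"]))

# Hypothetical scene: a red novel sits between two green books.
books = [
    {"name": "atlas",  "color": "green", "pos": (10.0, 10.0)},
    {"name": "novel",  "color": "red",   "pos": (10.5, 10.2)},
    {"name": "manual", "color": "green", "pos": (40.0, 5.0)},
]
```

Without speech, the nearest target wins even if it is the wrong one; saying "green" removes the red novel from the candidate set, so the bubble snaps to the intended green atlas instead.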
Using pen input features as indices of cognitive load BIBAKFull-Text 315-318
  Natalie Ruiz; Ronnie Taib; Yu (David) Shi; Eric Choi; Fang Chen
Multimodal interfaces are known to be useful in map-based applications and in complex, time-pressured tasks. Cognitive load variations in such tasks have been found to impact multimodal behaviour. For example, users become more multimodal and tend towards semantic complementarity as cognitive load increases. The richness of multimodal data means that systems could monitor particular input features to detect experienced load variations. In this paper, we present our attempt to induce controlled levels of load and solicit natural speech and pen-gesture inputs. In particular, we analyse these features in the pen-gesture modality. Our experimental design relies on a map-based Wizard of Oz study using a tablet PC. This paper details the analysis of pen-gesture interaction across subjects, and presents suggestive trends of increased degeneration of pen gestures in some subjects, and possible trends in gesture kinematics, as cognitive load increases.
Keywords: cognitive load, multimodal, pen gesture, speech
Automated generation of non-verbal behavior for virtual embodied characters BIBAKFull-Text 319-322
  Werner Breitfuss; Helmut Prendinger; Mitsuru Ishizuka
In this paper we introduce a system that automatically adds different types of non-verbal behavior to a given dialogue script between two virtual embodied agents. It allows us to transform a dialogue in text format into an agent behavior script enriched by eye gaze and conversational gesture behavior. The agents' gaze behavior is informed by theories of human face-to-face gaze behavior. Gestures are generated based on the analysis of linguistic and contextual information of the input text. The resulting annotated dialogue script is then transformed into the Multimodal Presentation Markup Language for 3D agents (MPML3D), which controls the multi-modal behavior of animated life-like agents, including facial and body animation and synthetic speech. Using our system makes it very easy to add appropriate non-verbal behavior to a given dialogue text, a task that would otherwise be very cumbersome and time consuming.
Keywords: animation agent systems, multi-modal presentation, multimodal input and output interfaces, processing of language and action patterns
Detecting communication errors from visual cues during the system's conversational turn BIBAKFull-Text 323-326
  Sy Bor Wang; David Demirdjian; Trevor Darrell
Automatic detection of communication errors in conversational systems has been explored extensively in the speech community. However, most previous studies have used only acoustic cues. Visual information has also been used by the speech community to improve speech recognition in dialogue systems, but this visual information is only used when the speaker is communicating vocally. A recent perceptual study indicated that human observers can detect communication problems when they see the visual footage of the speaker during the system's reply. In this paper, we present work in progress towards the development of a communication error detector that exploits this visual cue. In datasets we collected or acquired, facial motion features and head poses were estimated while users were listening to the system response and passed to a classifier for detecting a communication error. Preliminary experiments have demonstrated that the speaker's visual information during the system's reply is potentially useful and accuracy of automatic detection is close to human performance.
Keywords: conversational systems, system error detection, visual feedback
Multimodal interaction analysis in a smart house BIBAKFull-Text 327-334
  Pilar Manchón; Carmen del Solar; Gabriel Amores; Guillermo Pérez
This paper is a substantial extension of a previous paper presented at LREC 2006 [6]. It describes the motivation, collection and format of the MIMUS corpus, as well as an in-depth, issue-focused analysis of the data. MIMUS [8] is the result of multimodal WoZ experiments conducted at the University of Seville as part of the TALK project. The main objective of the MIMUS corpus was to gather information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in a Spanish-language Smart Home scenario. The focus group is composed of wheelchair-bound users, chosen because of their special motivation to use this kind of technology, along with their specific needs. Throughout this article, the WoZ platform, experiments, methodology, annotation schemes and tools, and all relevant data will be discussed, as well as the results of the in-depth analysis of these data. The corpus comprises a set of three related experiments. Due to the limited scope of this article, only some results related to the first two experiments (1A and 1B) will be discussed. This article focuses on subjects' preferences, multimodal behavioural patterns and willingness to use this kind of technology.
Keywords: HCI, mixed-modality events, multimodal corpus, multimodal entries, multimodal experiments, multimodal interaction
A multi-modal mobile device for learning Japanese kanji characters through mnemonic stories BIBAKFull-Text 335-338
  Norman Lin; Shoji Kajita; Kenji Mase
We describe the design of a novel multi-modal, mobile computer system to support foreign students in learning Japanese kanji characters through the creation of mnemonic stories. Our system treats complicated kanji shapes as hierarchical compositions of smaller shapes (following Heisig, 1986) and allows hyperlink navigation to quickly follow whole-part relationships. Visual display of kanji shape and meaning is augmented with user-supplied mnemonic stories in audio form, thereby dividing the learning information multi-modally into visual and audio modalities. A device-naming scheme and color-coding allow for asynchronous sharing of audio mnemonic stories among different users' devices. We describe the design decisions for our mobile multi-modal interface and present initial usability results based on feedback from beginning kanji learners. Our combination of mnemonic stories, audio and visual modalities, and a mobile device provides a new and effective system for computer-assisted kanji learning.
Keywords: Chinese characters, JSL, Japanese as a second language, kanji, language education, mobile computing
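The hierarchical whole-part decomposition this abstract describes can be pictured as a small tree structure. The sketch below is illustrative only (not the authors' implementation); the class name and fields are assumptions, and the sample decomposition of 明 into 日 and 月 follows Heisig's conventional analysis.

```python
# Hypothetical sketch of hierarchical kanji decomposition: a character
# is a composition of smaller named shapes, navigable via whole-part
# links (as in the system's hyperlink navigation).

class KanjiNode:
    def __init__(self, glyph, keyword, parts=None):
        self.glyph = glyph
        self.keyword = keyword      # mnemonic keyword for this shape
        self.parts = parts or []    # sub-shapes (whole-part links)

    def walk(self):
        """Yield this shape and all its sub-shapes, depth-first."""
        yield self
        for part in self.parts:
            yield from part.walk()

# Example: "bright" (明) composed of "sun" (日) and "moon" (月).
sun = KanjiNode("日", "sun")
moon = KanjiNode("月", "moon")
bright = KanjiNode("明", "bright", [sun, moon])
print([n.keyword for n in bright.walk()])
```

Following a whole-part hyperlink in the interface would correspond to descending from a node to one of its `parts`.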

Oral session 5: interactive systems 1

3d augmented mirror: a multimodal interface for string instrument learning and teaching with gesture support BIBAKFull-Text 339-345
  Kia C. Ng; Tillman Weyde; Oliver Larkin; Kerstin Neubarth; Thijs Koerselman; Bee Ong
Multimodal interfaces can open up new possibilities for music education, where the traditional model of teaching is based predominantly on verbal feedback. This paper explores the development and use of multimodal interfaces in novel tools to support music practice training. The design of multimodal interfaces for music education presents a challenge in several respects. One is the integration of multimodal technology into the music learning process. The other is the technological development, where we present a solution that aims to support string practice training with visual and auditory feedback. Building on the traditional function of a physical mirror as a teaching aid, we describe the concept and development of an "augmented mirror" using 3D motion capture technology.
Keywords: 3d, education, feedback, gesture, interface, motion capture, multimodal, music, sonification, visualisation, visualization
Interest estimation based on dynamic bayesian networks for visual attentive presentation agents BIBAKFull-Text 346-349
  Boris Brandherm; Helmut Prendinger; Mitsuru Ishizuka
In this paper, we describe an interface consisting of a virtual showroom where a team of two highly realistic 3D agents presents product items in an entertaining and attractive way. The presentation flow adapts to users' attentiveness, or lack thereof, and may thus provide a more personalized and engaging presentation experience. In order to infer users' attention and visual interest regarding interface objects, our system analyzes eye movements in real-time. Interest detection algorithms used in previous research determine an object of interest based on the time that eye gaze dwells on that object. However, this kind of algorithm is not well suited for dynamic presentations where the goal is to assess the user's focus of attention regarding a dynamically changing presentation. Here, the current context of the object of attention has to be considered, i.e., whether the visual object is part of (or contributes to) the current presentation content or not. Therefore, we propose a new approach that estimates the interest (or non-interest) of a user by means of dynamic Bayesian networks. Each of a predefined set of visual objects has a dynamic Bayesian network assigned to it, which calculates the current interest of the user in this object. The estimation takes into account (1) each new gaze point, (2) the current context of the object, and (3) preceding estimations of the object itself. Based on these estimations, the presentation agents can provide timely and appropriate responses.
Keywords: dynamic Bayesian network, eye tracking, interest recognition, multi-modal presentation
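The per-object recursive estimate described in the abstract — combining (1) each new gaze point, (2) the object's presentation context, and (3) the preceding estimate — can be sketched as a simple Bayesian filter. This is not the authors' model; all priors, likelihoods, and the decay constant below are illustrative assumptions.

```python
# Minimal sketch of a per-object interest estimator: a new gaze
# observation updates P(interest), with the previous estimate decayed
# toward an uninformative prior to give temporal smoothing.

class InterestEstimator:
    def __init__(self, p_interest=0.5, decay=0.9):
        self.p = p_interest    # P(user is interested in this object)
        self.decay = decay     # pull toward the prior between updates

    def update(self, gaze_on_object, object_in_context):
        # Likelihood of the gaze observation under each hypothesis;
        # an in-context object makes an on-object gaze more expected.
        p_gaze_if_interest = 0.8 if object_in_context else 0.6
        p_gaze_if_no_interest = 0.2
        if not gaze_on_object:
            p_gaze_if_interest = 1.0 - p_gaze_if_interest
            p_gaze_if_no_interest = 1.0 - p_gaze_if_no_interest
        # Temporal smoothing: previous estimate decays toward 0.5.
        prior = self.decay * self.p + (1.0 - self.decay) * 0.5
        num = p_gaze_if_interest * prior
        den = num + p_gaze_if_no_interest * (1.0 - prior)
        self.p = num / den
        return self.p

est = InterestEstimator()
for _ in range(5):
    p = est.update(gaze_on_object=True, object_in_context=True)
print(round(p, 3))
```

In a full system, one such estimator per predefined visual object would run in parallel, and the agents would react when an object's estimate crosses a threshold.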
On-line multi-modal speaker diarization BIBAKFull-Text 350-357
  Athanasios Noulas; Ben J. A. Krose
This paper presents a novel framework that utilizes multi-modal information to achieve speaker diarization. We use dynamic Bayesian networks to achieve on-line results. We progress from a simple observation model to a complex multi-modal one as more data becomes available. We present an efficient way to guide the learning procedure of the complex model using the early results achieved with the simple model. We present the results achieved in various real-world situations, including videos from webcams, human-computer interaction, and video conferences.
Keywords: audio-visual, multi-modal, speaker detection, speaker diarization

Oral session 6: interactive systems 2

Presentation sensei: a presentation training system using speech and image processing BIBAKFull-Text 358-365
  Kazutaka Kurihara; Masataka Goto; Jun Ogata; Yosuke Matsusaka; Takeo Igarashi
In this paper we present a presentation training system that observes a presentation rehearsal and provides the speaker with recommendations for improving the delivery of the presentation, such as to speak more slowly and to look at the audience. Our system "Presentation Sensei" is equipped with a microphone and camera to analyze a presentation by combining speech and image processing techniques. Based on the results of the analysis, the system gives the speaker instant feedback with respect to the speaking rate, eye contact with the audience, and timing. It also alerts the speaker when some of these indices exceed predefined warning thresholds. After the presentation, the system generates visual summaries of the analysis results for the speaker's self-examination. Our goal is not to improve the content on a semantic level, but to improve its delivery by reducing inappropriate basic behavior patterns. We asked a few test users to try the system, and they found it very useful for improving their presentations. We also compared the system's output with the observations of a human evaluator. The results show that the system successfully detected some inappropriate behavior. The contribution of this work is to introduce a practical recognition-based human training system and to show its feasibility despite the limitations of state-of-the-art speech and video recognition technologies.
Keywords: image processing, presentation, sensei, speech processing, training
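The threshold-based alerting the abstract describes can be sketched as a check of delivery indices against warning limits. The function name, feature inputs, and threshold values below are assumptions for illustration, not values from the paper.

```python
# Illustrative sketch of threshold-based delivery feedback: given
# measurements for one rehearsal window, return alerts for any
# delivery index outside its safe range.

def delivery_alerts(words_spoken, speech_seconds, audience_gaze_seconds,
                    total_seconds, max_wpm=160.0, min_eye_contact=0.3):
    """Return a list of feedback messages for one rehearsal window."""
    alerts = []
    wpm = words_spoken / (speech_seconds / 60.0)  # speaking rate
    if wpm > max_wpm:
        alerts.append("Speak more slowly (%.0f wpm)" % wpm)
    eye_ratio = audience_gaze_seconds / total_seconds  # eye-contact share
    if eye_ratio < min_eye_contact:
        alerts.append("Look at the audience more (%.0f%% eye contact)"
                      % (100.0 * eye_ratio))
    return alerts
```

For example, a minute of rushed speech with good eye contact would produce only the speaking-rate alert, while slow speech delivered while reading slides would produce only the eye-contact alert.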
The world of mushrooms: human-computer interaction prototype systems for ambient intelligence BIBAKFull-Text 366-373
  Yasuhiro Minami; Minako Sawaki; Kohji Dohsaka; Ryuichiro Higashinaka; Kentaro Ishizuka; Hideki Isozaki; Tatsushi Matsubayashi; Masato Miyoshi; Atsushi Nakamura; Takanobu Oba; Hiroshi Sawada; Takeshi Yamada; Eisaku Maeda
Our new research project called "ambient intelligence" concentrates on the creation of new lifestyles through research on communication science and intelligence integration. It is premised on the creation of such virtual communication partners as fairies and goblins that can be constantly at our side. We call these virtual communication partners mushrooms.
   To show the essence of ambient intelligence, we developed two multimodal prototype systems: mushrooms that watch, listen, and answer questions and a Quizmaster Mushroom. These two systems work in real time using speech, sound, dialogue, and vision technologies.
   We performed preliminary experiments with the Quizmaster Mushroom. The results showed that the system can transmit knowledge to users while they are playing the quizzes.
   Furthermore, through the two mushrooms, we derived design policies for multimodal interfaces and their integration.
Keywords: dialog, multimodal interfaces, visual-auditory feedback
Evaluation of haptically augmented touchscreen gui elements under cognitive load BIBAKFull-Text 374-381
  Rock Leung; Karon MacLean; Martin Bue Bertelsen; Mayukh Saubhasik
Adding expressive haptic feedback to mobile devices has great potential to improve their usability, particularly in multitasking situations where one's visual attention is required. Piezoelectric actuators are emerging as one suitable technology for rendering expressive haptic feedback on mobile devices. We describe the design of redundant piezoelectric haptic augmentations of touchscreen GUI buttons, progress bars, and scroll bars, and their evaluation under varying cognitive load. Our haptically augmented progress bars and scroll bars led to significantly faster task completion, and favourable subjective reactions. We further discuss resulting insights into designing useful haptic feedback for touchscreens and highlight challenges, including means of enhancing usability, types of interactions where value is maximized, difficulty in disambiguating background from foreground signals, tradeoffs in haptic strength vs. resolution, and subtleties in evaluating these types of interactions.
Keywords: GUI elements, haptic feedback, mobile device, multimodal, multitasking, piezoelectric actuators, touchscreen, usability

Workshops

Multimodal interfaces in semantic interaction BIBAKFull-Text 382
  Naoto Iwahashi; Mikio Nakano
This workshop addresses the approaches, methods, standardization, and theories for multimodal interfaces in which machines need to interact with humans adaptively according to context, such as the situation in the real world and each human's individual characteristics. To realize such interaction, which we call semantic interaction, it is necessary to extract and use the valuable context information needed for understanding interaction from the obtained real-world information. In addition, it is important for the user and the machine to share knowledge and an understanding of a given situation naturally through speech, images, graphics, manipulators, and so on. Submitted papers address these topics from diverse fields, such as human-robot interaction, machine learning, and game design.
Keywords: context, human-robot interaction, multimodal interface, semantic interaction, situatedness
Workshop on tagging, mining and retrieval of human related activity information BIBAKFull-Text 383-384
  Paulo Barthelmess; Edward Kaiser
Inexpensive and user-friendly cameras, microphones, and other devices such as digital pens are making it increasingly easy to capture, store and process large amounts of data across a variety of media. Even though the barriers to data acquisition have been lowered, making use of these data remains challenging. The focus of the present workshop is on issues related to theory, methods and techniques for facilitating the organization, retrieval and reuse of multimodal information. The emphasis is on organization and retrieval of information related to human activity, i.e., information that is generated and consumed by individuals and groups as they go about their work, learning and leisure.
Keywords: browsing, mining, multimedia, multimodal, retrieval, tagging
Workshop on massive datasets BIBKFull-Text 385
  Christopher R. Wren; Yuri A. Ivanov
Keywords: architecture, data mining, evaluation, motion, sensor networks, tracking, visualization