
Proceedings of the 2004 International Conference on Multimodal Interfaces

Fullname: ICMI'04 Proceedings of the 6th International Conference on Multimodal Interfaces
Editors: Rajeev Sharma; Trevor Darrell; Mary Harper; Gianni Lazzari; Matthew Turk
Location: State College, Pennsylvania, USA
Dates: 2004-Oct-13 to 2004-Oct-15
Standard No: ISBN 1-58113-995-0
  1. Gaze
  2. Multimodal conversational agents
  3. Architecture
  4. Multimodal applications
  5. Multimodal communication
  6. Multimodal interaction
  7. Poster session 1
  8. Poster session 2
  9. Demo session 1
  10. Demo session 2
  11. Doctoral spotlight session


Gaze

Two-way eye contact between humans and robots, pp. 1-8
  Yoshinori Kuno; Arihiro Sakurai; Dai Miyauchi; Akio Nakamura
Eye contact is an effective means of controlling human communication, such as in starting communication. It seems that we can make eye contact if we simply look at each other. However, this alone does not establish eye contact. Both parties also need to be aware of being watched by the other. We propose a method of two-way eye contact for human-robot communication. When a human wants to start communication with a robot, he/she watches the robot. If it finds a human looking at it, the robot turns to him/her, changing its facial expressions to let him/her know its awareness of his/her gaze. When the robot wants to initiate communication with a particular person, it moves its body and face toward him/her and changes its facial expressions to make the person notice its gaze. We show several experimental results to prove the effectiveness of this method. Moreover, we present a robot that can recognize hand gestures after making eye contact with the human to show the usefulness of eye contact as a means of controlling communication.
Keywords: embodied agent, eye contact, gaze, gesture recognition, human-robot interface, nonverbal behavior
Another person's eye gaze as a cue in solving programming problems, pp. 9-15
  Randy Stein; Susan E. Brennan
Expertise in computer programming can often be difficult to transfer verbally. Moreover, technical training and communication occur more and more between people who are located at a distance. We tested the hypothesis that seeing one person's visual focus of attention (represented as an eyegaze cursor) while debugging software (displayed as text on a screen) can be helpful to another person doing the same task. In an experiment, a group of professional programmers searched for bugs in small Java programs while wearing an unobtrusive head-mounted eye tracker. Later, a second set of programmers searched for bugs in the same programs. For half of the bugs, the second set of programmers first viewed a recording of an eyegaze cursor from one of the first programmers displayed over the (indistinct) screen of code, and for the other half they did not. The second set of programmers found the bugs more quickly after viewing the eye gaze of the first programmers, suggesting that another person's eye gaze, produced instrumentally (as opposed to intentionally, like pointing with a mouse), can be a useful cue in problem solving. This finding supports the potential of eye gaze as a valuable cue for collaborative interaction in a visuo-spatial task conducted at a distance.
Keywords: debugging, eye tracking, gaze-based & attentional interfaces, mediated communication, programming, visual co-presence
EyePrint: support of document browsing with eye gaze trace, pp. 16-23
  Takehiko Ohno
Current digital documents provide few traces to help users browse. This makes document browsing difficult, and we sometimes find it hard to keep track of all of the information. To overcome this problem, this paper proposes a method of creating traces on digital documents. The method, called EyePrint, generates a trace from the user's eye gaze in order to support the browsing of digital documents. Traces are presented as highlighted areas on a document, which become visual cues for accessing previously visited documents. Traces also become document attributes that can be used to access and search the document. A prototype system that works with a gaze tracking system was developed, and the results of a user study confirm the usefulness of the traces in digital document browsing.
Keywords: document browsing, eyePrint, gaze-based interaction, information retrieval, readwear, reusability problem
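As a rough illustration of the EyePrint idea (not the paper's algorithm; the grid size and dwell threshold below are invented), gaze samples can be aggregated into per-region dwell times and thresholded into highlightable traces:

```python
from collections import Counter

def gaze_trace(fixations, cell=50, min_ms=300):
    """Accumulate per-region fixation time from (x, y, duration_ms) samples
    and keep the regions dwelt on long enough to count as a 'trace'.
    Returns the set of grid cells to highlight on the document."""
    dwell = Counter()
    for x, y, ms in fixations:
        dwell[(x // cell, y // cell)] += ms
    return {region for region, ms in dwell.items() if ms >= min_ms}

samples = [(10, 10, 200), (20, 15, 250), (400, 300, 100)]
print(gaze_trace(samples))  # only the upper-left cell exceeds 300 ms
```

Thresholding on accumulated dwell time, rather than raw samples, is what separates deliberate reading from fleeting glances in such a scheme.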

Multimodal conversational agents

A framework for evaluating multimodal integration by humans and a role for embodied conversational agents, pp. 24-31
  Dominic W. Massaro
One of the implicit assumptions of multimodal interfaces is that human-computer interaction is significantly facilitated by providing multiple input and output modalities. Surprisingly, however, there is very little theoretical and empirical research testing this assumption in terms of the presentation of multimodal displays to the user. The goal of this paper is to provide both a theoretical and empirical framework for addressing this important issue. Two contrasting models of human information processing are formulated and tested experimentally. According to integration models, multiple sensory influences are continuously combined during categorization, leading to perceptual experience and action. The Fuzzy Logical Model of Perception (FLMP) assumes that processing occurs in three successive but overlapping stages: evaluation, integration, and decision (Massaro, 1998). According to nonintegration models, any perceptual experience and action results from only a single sensory influence. These models are tested in expanded factorial designs in which two input modalities are varied independently of one another and each modality is also presented alone. Results from a variety of experiments on speech, emotion, and gesture support the predictions of the FLMP. Baldi, an embodied conversational agent, is described, and implications for applications of multimodal interfaces are discussed.
Keywords: emotion, gesture, multisensory integration, speech
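The FLMP's integration stage has a simple algebraic form for two-alternative tasks; as a numerical sketch (variable names are ours), with a and v the auditory and visual truth values supporting one alternative:

```python
def flmp(a: float, v: float) -> float:
    """FLMP prediction for a two-alternative task: the support for one
    alternative is the product of the per-modality truth values,
    normalized by the summed support for both alternatives."""
    return (a * v) / (a * v + (1 - a) * (1 - v))

# A neutral auditory cue (a = 0.5) leaves the visual evidence unchanged,
# while two agreeing cues reinforce each other.
print(flmp(0.5, 0.8))  # ≈ 0.8
print(flmp(0.8, 0.8))  # 0.64 / (0.64 + 0.04) ≈ 0.941
```

The multiplicative combination is what distinguishes integration models from nonintegration models, in which the prediction would track a single modality's truth value alone.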
From conversational tooltips to grounded discourse: head pose tracking in interactive dialog systems, pp. 32-37
  Louis-Philippe Morency; Trevor Darrell
Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. While the machine interpretation of these cues has previously been limited to output modalities, recent advances in face-pose tracking allow for systems which are robust and accurate enough to sense natural grounding gestures. We present the design of a module that detects these cues and show examples of its integration in three different conversational agents with varying degrees of discourse model complexity. Using a scripted discourse model and off-the-shelf animation and speech-recognition components, we demonstrate the use of this module in a novel "conversational tooltip" task, where additional information is spontaneously provided by an animated character when users attend to various physical objects or characters in the environment. We further describe the integration of our module in two systems where animated and robotic characters interact with users based on rich discourse and semantic models.
Keywords: conversational tooltips, grounding, head gesture recognition, head pose tracking, human-computer interaction, interactive dialog system
Evaluation of spoken multimodal conversation, pp. 38-45
  Niels Ole Bernsen; Laila Dybkjær
Spoken multimodal dialogue systems in which users address face-only or embodied interface agents have been gaining ground in research for some time. Although most systems are still strictly task-oriented, the field is now moving towards domain-oriented systems and real conversational systems which are no longer defined in terms of the task(s) they support. This paper describes the first running prototype of such a system which enables spoken and gesture interaction with life-like fairytale author Hans Christian Andersen about his fairytales, life, study, etc., focusing on multimodal conversation. We then present recent user test evaluation results on multimodal conversation.
Keywords: evaluation, natural interaction, spoken conversation
Multimodal transformed social interaction, pp. 46-52
  Matthew Turk; Jeremy Bailenson; Andrew Beall; Jim Blascovich; Rosanna Guadagno
Understanding human-human interaction is fundamental to the long-term pursuit of powerful and natural multimodal interfaces. Nonverbal communication, including body posture, gesture, facial expression, and eye gaze, is an important aspect of human-human interaction. We introduce a paradigm for studying multimodal and nonverbal communication in collaborative virtual environments (CVEs) called Transformed Social Interaction (TSI), in which a user's visual representation is rendered in a way that strategically filters selected communication behaviors in order to change the nature of a social interaction. To achieve this, TSI must employ technology to detect, recognize, and manipulate behaviors of interest, such as facial expressions, gestures, and eye gaze. In [13] we presented a TSI experiment called non-zero-sum gaze (NZSG) to determine the effect of manipulated eye gaze on persuasion in a small group setting. Eye gaze was manipulated so that each participant in a three-person CVE received eye gaze from a presenter that was normal, less than normal, or greater than normal. We review this experiment and discuss the implications of TSI for multimodal interfaces.
Keywords: computer-mediated communication, multimodal processing, transformed social interaction


Architecture

Multimodal interaction in an augmented reality scenario, pp. 53-60
  Gunther Heidemann; Ingo Bax; Holger Bekel
We describe an augmented reality system designed for online acquisition of visual knowledge and retrieval of memorized objects. The system relies on a head mounted camera and display, which allow the user to view the environment together with overlaid augmentations by the system. In this setup, communication by hand gestures and speech is mandatory as common input devices like mouse and keyboard are not available. Using gesture and speech, basically three types of tasks must be handled: (i) Communication with the system about the environment, in particular, directing attention towards objects and commanding the memorization of sample views; (ii) control of system operation, e.g. switching between display modes; and (iii) re-adaptation of the interface itself in case communication becomes unreliable due to changes in external factors, such as illumination conditions. We present an architecture to manage these tasks and describe and evaluate several of its key elements, including modules for pointing gesture recognition, menu control based on gesture and speech, and control strategies to cope with situations when vision becomes unreliable and has to be re-adapted by speech.
Keywords: augmented reality, human-machine-interaction, image retrieval, interfaces, memory, mobile systems, object recognition
The ThreadMill architecture for stream-oriented human communication analysis applications, pp. 61-68
  Paulo Barthelmess; Clarence A. Ellis
This work introduces a new component software architecture -- ThreadMill -- whose main purpose is to facilitate the development of applications in domains where high volumes of streamed data need to be efficiently analyzed. It focuses particularly on applications that target the analysis of human communication, e.g., speech and gesture recognition. Applications in this domain usually employ costly signal processing techniques, but in many cases offer ample opportunities for concurrent execution in different phases. ThreadMill's abstractions facilitate the development of applications that take advantage of this potential concurrency by hiding the complexity of parallel and distributed programming. As a result, ThreadMill applications can run unchanged on a wide variety of execution environments, ranging from a single-processor machine to a cluster of multi-processor nodes. The architecture is illustrated by an implementation of a tracker for the hands and face of American Sign Language signers that uses a parallel and concurrent version of the Joint Likelihood Filter method.
Keywords: human-communication analysis applications, software evolution
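ThreadMill itself is not reproduced here, but the stream-oriented, stage-per-thread style it abstracts can be sketched with plain queues (a hypothetical two-stage pipeline; the stage functions are placeholders):

```python
import queue, threading

def stage(fn, inbox, outbox):
    """One pipeline stage: pull items off a stream, process, push downstream.
    A None sentinel shuts the stage down and is forwarded to the next stage."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

frames, features, results = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda f: f * 2, frames, features)).start()
threading.Thread(target=stage, args=(lambda f: f + 1, features, results)).start()

for f in [1, 2, 3]:
    frames.put(f)
frames.put(None)  # end of stream

out = []
while (r := results.get()) is not None:
    out.append(r)
print(out)  # [3, 5, 7]
```

Because stages communicate only through queues, the same stage code can in principle run on one machine or be distributed across nodes, which is the portability property the abstract claims for ThreadMill.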
TouchLight: an imaging touch screen and display for gesture-based interaction, pp. 69-76
  Andrew D. Wilson
A novel touch screen technology is presented. TouchLight uses simple image processing techniques to combine the output of two video cameras placed behind a semi-transparent plane in front of the user. The resulting image shows objects that are on the plane. This technique is well suited for application with a commercially available projection screen material (DNP HoloScreen) which permits projection onto a transparent sheet of acrylic plastic in normal indoor lighting conditions. The resulting touch screen display system transforms an otherwise normal sheet of acrylic plastic into a high bandwidth input/output surface suitable for gesture-based interaction. Image processing techniques are detailed, and several novel capabilities of the system are outlined.
Keywords: computer human interaction, computer vision, displays, gesture recognition, videoconferencing
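The paper's image processing is not given here, but the core intuition -- after both camera views are rectified to the screen plane, only objects on the plane coincide pixel-for-pixel -- can be illustrated with a toy pixelwise-minimum fusion (tiny hand-made "images"):

```python
def fuse(view_a, view_b):
    """Pixelwise minimum of two images rectified to the touch surface:
    an object ON the plane projects to the same pixel in both cameras
    and survives; off-plane objects misalign and are suppressed."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(view_a, view_b)]

# A fingertip on the screen (bright in both views at the same spot) vs.
# a background object (bright in only one view).
cam_a = [[0, 9, 0],
         [0, 0, 7]]
cam_b = [[0, 9, 0],
         [6, 0, 0]]
print(fuse(cam_a, cam_b))  # [[0, 9, 0], [0, 0, 0]]
```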

Multimodal applications

Walking-pad: a step-in-place locomotion interface for virtual environments, pp. 77-81
  Laroussi Bouguila; Florian Evequoz; Michele Courant; Beat Hirsbrunner
This paper presents a new locomotion interface that provides users with the ability to engage in a life-like walking experience by stepping in place. Stepping actions are performed on top of a flat platform with an embedded grid of switch sensors that detect footfall pressure. Based on the data received from the sensors, the system can compute variables that represent the user's walking behavior, such as walking direction, walking speed, standstill, and jumping. The overall platform status is scanned at a rate of 100 Hz, which allows real-time visual feedback in reaction to user actions. The proposed system is portable and easy to integrate into major virtual environments with large projection features, such as CAVE and DOME systems. The overall weight of the Walking-Pad is less than 5 kg, and it can be connected to any computer via a USB port, making it controllable even from a portable computer.
Keywords: locomotion, sensors, step-in-place, virtual environment, walking-pad
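One way such a switch grid could yield a walking direction (a hypothetical sketch, not the paper's algorithm) is from the shift of the active-sensor centroid between successive 100 Hz scans:

```python
import math

def centroid(active):
    """Mean (x, y) of the switch sensors currently under pressure."""
    xs = [x for x, _ in active]
    ys = [y for _, y in active]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def heading(prev, curr):
    """Walking direction (degrees) inferred from the centroid shift
    between two consecutive scans of the sensor grid."""
    (x0, y0), (x1, y1) = centroid(prev), centroid(curr)
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

# Both feet shift one row "up" the pad between scans.
print(heading([(4, 4), (6, 4)], [(4, 5), (6, 5)]))  # ≈ 90 degrees
```

Speed would fall out of the same data as centroid displacement per scan interval; a zero displacement over many scans would signal standstill.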
Multimodal detection of human interaction events in a nursing home environment, pp. 82-89
  Datong Chen; Robert Malkin; Jie Yang
In this paper, we propose a multimodal system for detecting human activity and interaction patterns in a nursing home. Activities of groups of people are firstly treated as interaction patterns between any pair of partners and are then further broken into individual activities and behavior events using a multi-level context hierarchy graph. The graph is implemented using a dynamic Bayesian network to statistically model the multi-level concepts. We have developed a coarse-to-fine prototype system to illustrate the proposed concept. Experimental results have demonstrated the feasibility of the proposed approaches. The objective of this research is to automatically create concise and comprehensive reports of activities and behaviors of patients to support physicians and caregivers in a nursing facility.
Keywords: group activity, human interaction, medical care, multimodal, stochastic modeling
Elvis: situated speech and gesture understanding for a robotic chandelier, pp. 90-96
  Joshua Juster; Deb Roy
We describe a home lighting robot that uses directional spotlights to create complex lighting scenes. The robot senses its visual environment using a panoramic camera and attempts to maintain its target goal state by adjusting the positions and intensities of its lights. Users can communicate desired changes in the lighting environment through speech and gesture (e.g., "Make it brighter over there"). Information obtained from these two modalities is combined to form a goal: a desired change in the lighting of the scene. This goal is then incorporated into the system's target goal state. When the target goal state and the world are out of alignment, the system formulates a sensorimotor plan that acts on the world to return the system to homeostasis.
Keywords: gesture, grounded, input methods, lighting, multimodal, natural interaction, situated, speech

Multimodal communication

Towards integrated microplanning of language and iconic gesture for multimodal output, pp. 97-104
  Stefan Kopp; Paul Tepper; Justine Cassell
When talking about spatial domains, humans frequently accompany their explanations with iconic gestures to depict what they are referring to. For example, when giving directions, it is common to see people making gestures that indicate the shape of buildings, or outline a route to be taken by the listener, and these gestures are essential to the understanding of the directions. Based on results from an ongoing study on language and gesture in direction-giving, we propose a framework to analyze such gestural images into semantic units (image description features), and to link these units to morphological features (hand shape, trajectory, etc.). This feature-based framework allows us to generate novel iconic gestures for embodied conversational agents, without drawing on a lexicon of canned gestures. We present an integrated microplanner that derives the form of both coordinated natural language and iconic gesture directly from given communicative goals, and serves as input to the speech and gesture realization engine in our NUMACK project.
Keywords: embodied conversational agents, generation, gesture, language, multimodal output
Exploiting prosodic structuring of coverbal gesticulation, pp. 105-112
  Sanshzar Kettebekov
Although gesture recognition has been studied extensively, the communicative, affective, and biometric "utility" of natural gesticulation remains relatively unexplored. One of the main reasons for this is the complexity of modeling spontaneous gestures. While lexical information in speech provides additional cues for disambiguating gestures, it does not cover the rich paralinguistic domain. This paper offers initial findings, from a large corpus of natural monologues, on the prosodic structuring between frequent beat-like strokes and concurrent speech. Using a set of audio-visual features in an HMM-based formulation, we are able to improve the discrimination between visually similar gestures; these types of articulatory strokes represent different communicative functions. The analysis is based on the temporal alignment of detected vocal perturbations and the concurrent hand movement. As a supplementary result, we show that recognized articulatory strokes may be used for quantifying gesturing behavior.
Keywords: gesture, multimodal, prosody, speech
Visual and linguistic information in gesture classification, pp. 113-120
  Jacob Eisenstein; Randall Davis
Classification of natural hand gestures is usually approached by applying pattern recognition to the movements of the hand. However, the gesture categories most frequently cited in the psychology literature are fundamentally multimodal; the definitions make reference to the surrounding linguistic context. We address the question of whether gestures are naturally multimodal, or whether they can be classified from hand-movement data alone. First, we describe an empirical study showing that the removal of auditory information significantly impairs the ability of human raters to classify gestures. Then we present an automatic gesture classification system based solely on an n-gram model of linguistic context; the system is intended to supplement a visual classifier, but achieves 66% accuracy on a three-class classification problem on its own. This represents higher accuracy than human raters achieve when presented with the same information.
Keywords: gesture recognition, gesture taxonomies, multimodal disambiguation, validity
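A toy version of a purely linguistic gesture classifier (the cue words and the three-class inventory are illustrative inventions, not the paper's n-gram model) scores each gesture class by word cues in the co-occurring speech:

```python
from collections import Counter

# Hypothetical per-class lexical cues with weights.
CUES = {
    "deictic": Counter({"this": 2, "that": 2, "here": 1, "there": 1}),
    "iconic":  Counter({"shaped": 2, "round": 1, "big": 1, "like": 1}),
    "beat":    Counter({"and": 1, "then": 1, "so": 1}),
}

def classify(utterance):
    """Pick the gesture class whose lexical cues best match the
    speech surrounding the gesture (no visual features at all)."""
    words = utterance.lower().split()
    scores = {cls: sum(cues[w] for w in words if w in cues)
              for cls, cues in CUES.items()}
    return max(scores, key=scores.get)

print(classify("put that one over there"))   # deictic
print(classify("it was shaped like a ball")) # iconic
```

The paper's system is trained rather than hand-weighted, but the same principle applies: linguistic context alone carries substantial information about gesture category.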
Multimodal model integration for sentence unit detection, pp. 121-128
  Mary P. Harper; Elizabeth Shriberg
In this paper, we adopt a direct modeling approach to utilize conversational gesture cues in detecting sentence unit (SU) boundaries in videotaped conversations. We treat the detection of SUs as a classification task such that for each inter-word boundary, the classifier decides whether there is an SU boundary or not. In addition to gesture cues, we also utilize prosodic and lexical knowledge sources. In this first investigation, we find that gesture features complement the prosodic and lexical knowledge sources for this task. By using all of the knowledge sources, the model is able to achieve the lowest overall SU detection error rate.
Keywords: dialog, gesture, language models, multimodal fusion, prosody, sentence boundary detection
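A hand-weighted stand-in for such a boundary classifier (the weights, cue words, and threshold below are invented for illustration; the paper trains its models from data) shows how prosodic, lexical, and gesture evidence can be combined per inter-word boundary:

```python
def su_boundary(pause_s, word_before, hands_at_rest):
    """Toy stand-in for a trained SU classifier: score one inter-word
    boundary from a prosodic cue (pause length), a lexical cue
    (typical turn-final word), and a gesture cue (hand retraction)."""
    score = 0.0
    score += min(pause_s / 0.5, 1.0) * 0.5                      # prosody
    score += 0.3 if word_before in {"okay", "right", "yeah"} else 0.0  # lexical
    score += 0.2 if hands_at_rest else 0.0                      # gesture
    return score >= 0.5

print(su_boundary(0.6, "okay", True))   # True: all three cues agree
print(su_boundary(0.05, "the", False))  # False: mid-sentence boundary
```

The gesture term models the paper's finding that gestural features complement prosody and lexical context: a boundary that prosody alone leaves ambiguous can be tipped by a hand retraction.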

Multimodal interaction

When do we interact multimodally?: cognitive load and multimodal communication patterns, pp. 129-136
  Sharon Oviatt; Rachel Coulston; Rebecca Lunsford
Mobile usage patterns often entail high and fluctuating levels of difficulty as well as dual tasking. One major theme explored in this research is whether a flexible multimodal interface supports users in managing cognitive load. Findings from this study reveal that multimodal interface users spontaneously respond to dynamic changes in their own cognitive load by shifting to multimodal communication as load increases with task difficulty and communicative complexity. Given a flexible multimodal interface, users' ratio of multimodal (versus unimodal) interaction increased substantially from 18.6% when referring to established dialogue context to 77.1% when required to establish a new context, a +315% relative increase. Likewise, the ratio of users' multimodal interaction increased significantly as the tasks became more difficult, from 59.2% during low-difficulty tasks to 65.5% at moderate, 68.2% at high, and 75.0% at very high difficulty, an overall relative increase of +27%. Users' task-critical errors and response latencies also increased systematically and significantly with task difficulty, corroborating the manipulation of cognitive processing load. The adaptations seen in this study reflect users' efforts to self-manage limitations on working memory when task complexity increases. This is accomplished by distributing communicative information across multiple modalities, which is compatible with a cognitive load theory of multimodal interaction. The long-term goal of this research is the development of an empirical foundation for proactively guiding flexible and adaptive multimodal system design.
Keywords: cognitive load, dialogue context, human performance, individual differences, multimodal integration, multimodal interaction, speech and pen input, system adaptation, task difficulty, unimodal interaction
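The reported relative increases can be checked directly from the percentages given in the abstract:

```python
def rel_increase(before, after):
    """Relative increase, in percent, between two percentage values."""
    return (after - before) / before * 100

print(round(rel_increase(18.6, 77.1)))  # 315: new vs. established context
print(round(rel_increase(59.2, 75.0)))  # 27: very high vs. low difficulty
```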
Bimodal HCI-related affect recognition, pp. 137-143
  Zhihong Zeng; Jilin Tu; Ming Liu; Tong Zhang; Nicholas Rizzolo; Zhenqiu Zhang; Thomas S. Huang; Dan Roth; Stephen Levinson
Perhaps the most fundamental application of affective computing will be Human-Computer Interaction (HCI), in which the computer should have the ability to detect and track the user's affective states and provide corresponding feedback. The human multi-sensor affect system sets the expectation for a multimodal affect analyzer. In this paper, we present our efforts toward audio-visual HCI-related affect recognition. With HCI applications in mind, we take into account some special affective states that indicate users' cognitive/motivational states. Because a facial expression is influenced by both the affective state and speech content, we apply a smoothing method to extract information about the affective state from facial features. In the fusion stage, a voting method is applied to combine the audio and visual modalities, greatly improving the final affect recognition accuracy. We test our bimodal affect recognition approach on 38 subjects with 11 HCI-related affect states. Extensive experimental results show that the average person-dependent affect recognition accuracy is almost 90% with bimodal fusion.
Keywords: affect recognition, affective computing, emotion recognition, multimodal human-computer interaction
Identifying the addressee in human-human-robot interactions based on head pose and speech, pp. 144-151
  Michael Katzenmaier; Rainer Stiefelhagen; Tanja Schultz
In this work we investigate the power of acoustic and visual cues, and their combination, to identify the addressee in a human-human-robot interaction. Based on eighteen audio-visual recordings of two human beings and a (simulated) robot, we discriminate the interaction of the two humans from the interaction of one human with the robot. The paper compares the results of three approaches. The first approach uses purely acoustic cues to identify the addressee; low-level, feature-based cues as well as higher-level cues are examined. In the second approach we test whether the human's head pose is a suitable cue. Our results show that visually estimated head pose is the more reliable cue for identifying the addressee in human-human-robot interaction. In the third approach we combine the acoustic and visual cues, which results in significant improvements.
Keywords: attentive interfaces, focus of attention, head pose estimation, human-robot interaction, multimodal interfaces, speech recognition
Articulatory features for robust visual speech recognition, pp. 152-158
  Kate Saenko; Trevor Darrell; James R. Glass
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
Keywords: articulatory features, audio-visual speech recognition, multimodal interfaces, speechreading, visual feature extraction
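The parallel-classifier idea can be sketched as per-attribute decisions jointly indexing a higher-level unit (the attribute and viseme inventory below is illustrative, not the paper's feature set):

```python
# Each parallel classifier outputs one articulatory attribute; their
# joint decision indexes a higher-level unit such as a viseme class.
VISEMES = {
    ("closed", "unrounded"): "p/b/m",
    ("open",   "rounded"):   "o/w",
    ("open",   "unrounded"): "a/e",
}

def to_viseme(lip_opening, lip_rounding):
    """Combine independent attribute decisions into a viseme label."""
    return VISEMES.get((lip_opening, lip_rounding), "unknown")

print(to_viseme("closed", "unrounded"))  # p/b/m
print(to_viseme("open", "rounded"))      # o/w
```

The appeal under image noise is that each attribute classifier can degrade independently; a single corrupted attribute need not invalidate the others' evidence, unlike a monolithic viseme classifier.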

Poster session 1

M/ORIS: a medical/operating room interaction system, pp. 159-166
  Sébastien Grange; Terrence Fong; Charles Baur
We propose an architecture for a real-time multimodal system, which provides non-contact, adaptive user interfacing for Computer-Assisted Surgery (CAS). The system, called M/ORIS (for Medical/Operating Room Interaction System), combines gesture interpretation as an explicit interaction modality with continuous, real-time monitoring of the surgical activity in order to automatically address the surgeon's needs. Such a system will help reduce a surgeon's workload and operation time. This paper focuses on the proposed activity monitoring aspect of M/ORIS. We analyze the issues of human-computer interaction in an OR based on real-world case studies. We then describe how we intend to address these issues by combining a surgical procedure description with parameters gathered from vision-based surgeon tracking and other OR sensors (e.g. tool trackers). We call this approach Scenario-based Activity Monitoring (SAM). Finally, we present preliminary results, including a non-contact mouse interface for surgical navigation systems.
Keywords: CAS, HCI, medical user interfaces, multimodal interaction
Modality fusion for graphic design applications, pp. 167-174
  André D. Milota
Users must enter a complex mix of spatial and abstract information when operating a graphic design application. Speech/language provides a fluid and natural method for specifying abstract information, while a spatial input device is often most intuitive for the entry of spatial information. Thus, the combined speech/gesture interface is ideally suited to this application domain. While some research has been conducted on multimodal graphic design applications, advanced research on modality fusion has typically focused on map-related applications. This paper considers the particular demands of graphic design applications and what impact these demands will have on the general strategies employed when combining the speech and gesture channels. We also describe initial work on our own multimodal graphic design application (DPD) which uses these strategies.
Keywords: graphic design, modality fusion, multimodal interface, pen interface, speech input
Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures, pp. 175-182
  Hartwig Holzapfel; Kai Nickel; Rainer Stiefelhagen
This paper presents an architecture for fusion of multimodal input streams for natural interaction with a humanoid robot as well as results from a user study with our system. The presented fusion architecture consists of an application independent parser of input events, and application specific rules. In the presented user study, people could interact with a robot in a kitchen scenario, using speech and gesture input. In the study, we could observe that our fusion approach is very tolerant against falsely detected pointing gestures. This is because we use speech as the main modality and pointing gestures mainly for disambiguation of objects. In the paper we also report about the temporal correlation of speech and gesture events as observed in the user study.
Keywords: gesture, multimodal architectures, multimodal fusion and multisensory integration, natural language, speech, vision
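The reported tolerance to falsely detected pointing gestures follows from using speech as the main modality; a minimal sketch of that rule (object names and deictic words are illustrative, not the system's grammar):

```python
def resolve(utterance, pointed_object, objects):
    """Speech as the primary modality; a pointing gesture only
    disambiguates deictic references ('this', 'that'). A falsely
    detected gesture is harmless when speech names the object outright."""
    words = utterance.lower().split()
    for obj in objects:
        if obj in words:
            return obj            # speech fully specifies the referent
    if {"this", "that"} & set(words):
        return pointed_object     # deixis: fall back to the gesture
    return None

objects = ["cup", "plate", "kettle"]
print(resolve("bring me the cup", "plate", objects))  # cup (gesture ignored)
print(resolve("bring me that", "plate", objects))     # plate
```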
AROMA: ambient awareness through olfaction in a messaging application, pp. 183-190
  Adam Bodnar; Richard Corbett; Dmitry Nekrasovski
This work explores the properties of different output modalities as notification mechanisms in the context of messaging. In particular, the olfactory (smell) modality is introduced as a potential alternative to visual and auditory modalities for providing messaging notifications. An experiment was performed to compare these modalities as secondary display mechanisms used to deliver notifications to users working on a cognitively engaging primary task. It was verified that the disruptiveness and effectiveness of notifications varied with the notification modality. The olfactory modality was shown to be less effective in delivering notifications than the other modalities, but produced a less disruptive effect on user engagement in the primary task. Our results serve as a starting point for future research into the use of olfactory notification in messaging systems.
Keywords: HCI, ambient awareness, multi-modal interfaces, notification systems, olfactory display, user study
The virtual haptic back for palpatory training, pp. 191-197
  Robert L. Williams II; Mayank Srivastava; John N. Howell; Robert R. Conatser Jr.; David C. Eland; Janet M. Burns; Anthony G. Chila
This paper discusses the Ohio University Virtual Haptic Back (VHB) project, including objectives, implementation, and initial evaluations. Haptics is the science of human tactile sensation; a haptic interface provides force and touch feedback to the user from virtual reality. Our multimodal VHB simulation combines high-fidelity computer graphics with haptic and aural feedback to augment training in palpatory diagnosis in osteopathic medicine, plus related training applications in physical therapy, massage therapy, chiropractic therapy, and other tactile fields. We use the PHANToM haptic interface to capture the trainee's position interactions, with accompanying force feedback, to simulate the back of a live human subject in real time. Our simulation is intended to add a measurable, repeatable component of science to the art of palpatory diagnosis. Based on our experiences in the lab to date, we believe that haptics-augmented computer models have great potential for improving training in various tactile applications. Our main project goals are to (1) provide a novel tool for palpatory diagnosis training, and (2) improve the state of the art in haptics and graphics applied to virtual anatomy.
Keywords: PHANToM, haptics, palpatory diagnosis, training, virtual haptic back
A vision-based sign language recognition system using tied-mixture density HMM BIBAKFull-Text 198-204
  Liang-Guo Zhang; Yiqiang Chen; Gaolin Fang; Xilin Chen; Wen Gao
In this paper, a vision-based medium-vocabulary Chinese sign language recognition (SLR) system is presented. The proposed recognition system consists of two modules. In the first module, robust hand detection, background subtraction, and pupil detection techniques are efficiently combined to precisely extract feature information with the aid of simple colored gloves in an unconstrained environment. Meanwhile, an effective and efficient hierarchical feature description scheme with different scale features to characterize sign language is proposed, where principal component analysis (PCA) is employed to characterize the finger features more elaborately. In the second module, a tied-mixture density hidden Markov model (TMDHMM) framework for SLR is proposed, which can speed up recognition without significant loss of recognition accuracy compared with continuous hidden Markov models (CHMM). Experimental results on 439 frequently used Chinese sign language (CSL) words show that the proposed methods work well for medium-vocabulary SLR in an environment without special constraints, achieving recognition accuracy of up to 92.5%.
Keywords: computer vision, hidden Markov models, human-computer interaction, sign language recognition
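The speed-up in the abstract above comes from tying: every HMM state draws on one shared pool of Gaussians and differs only in its state-specific mixture weights, so the expensive density evaluations happen once per observation. A minimal sketch of that emission computation, with all dimensions and parameters invented for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, S = 8, 4, 3                      # pool size, feature dim, number of states
means = rng.normal(size=(M, D))        # shared Gaussian means
variances = np.ones((M, D))            # shared diagonal covariances
weights = rng.dirichlet(np.ones(M), size=S)  # per-state mixture weights, rows sum to 1

def pool_densities(x):
    """Evaluate all M shared Gaussians once per observation."""
    diff = x - means                                   # (M, D)
    log_p = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=1)
    return np.exp(log_p)                               # (M,)

def emission_likelihoods(x):
    """Each state reuses the same density pool; only the weights differ."""
    p = pool_densities(x)              # computed once, shared by all S states
    return weights @ p                 # (S,) emission likelihoods

x = rng.normal(size=D)
b = emission_likelihoods(x)
print(b.shape)   # (3,)
```

In a continuous HMM each state would carry its own Gaussians, costing S times as many density evaluations per frame; here the per-frame cost is dominated by the single pool evaluation.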
Analysis of emotion recognition using facial expressions, speech and multimodal information BIBAKFull-Text 205-211
  Carlos Busso; Zhigang Deng; Serdar Yildirim; Murtaza Bulut; Chul Min Lee; Abe Kazemzadeh; Sungbok Lee; Ulrich Neumann; Shrikanth Narayanan
The interaction between human beings and computers will be more natural if computers are able to perceive and respond to human non-verbal communication such as emotions. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two, and other, modalities to improve the accuracy and robustness of the emotion recognition system. This paper analyzes the strengths and the limitations of systems based only on facial expressions or acoustic information. It also discusses two approaches used to fuse these two modalities: decision-level and feature-level integration. Using a database recorded from an actress, four emotions were classified: sadness, anger, happiness, and neutral state. Markers on her face allowed detailed facial motions to be captured with a motion capture system, in conjunction with simultaneous speech recordings. The results reveal that the system based on facial expressions gave better performance than the system based on acoustic information alone for the emotions considered. Results also show the complementarity of the two modalities: when the two are fused, the performance and the robustness of the emotion recognition system improve measurably.
Keywords: PCA, SVC, affective states, decision level fusion, emotion recognition, feature level fusion, human-computer interaction (HCI), speech, vision
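The two fusion schemes contrasted in the abstract can be sketched in a few lines. The posteriors and feature dimensions below are invented for illustration; they are not the authors' classifiers:

```python
import numpy as np

classes = ["sadness", "anger", "happiness", "neutral"]
p_face   = np.array([0.10, 0.20, 0.60, 0.10])   # posterior from a facial classifier
p_speech = np.array([0.15, 0.50, 0.25, 0.10])   # posterior from an acoustic classifier

# Decision-level fusion: each modality is classified separately and only the
# outputs are combined, e.g. by a product rule over the posteriors.
p_decision = p_face * p_speech
p_decision /= p_decision.sum()

# Feature-level fusion: raw feature vectors are concatenated and a single
# classifier is trained on the joint vector (shown here only as concatenation).
f_face, f_speech = np.zeros(12), np.zeros(13)    # e.g. marker motions + acoustic features
f_joint = np.concatenate([f_face, f_speech])     # one 25-dim input to one classifier

print(classes[int(np.argmax(p_decision))])   # happiness
```

The decision-level route keeps the unimodal classifiers independent and is easy to retrofit; the feature-level route lets a single model learn cross-modal correlations but requires synchronized features and joint retraining.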
Support for input adaptability in the ICON toolkit BIBAKFull-Text 212-219
  Pierre Dragicevic; Jean-Daniel Fekete
In this paper, we introduce input adaptability as the ability of an application to exploit alternative sets of input devices effectively and to offer users a way of adapting input interaction to suit their needs. We explain why input adaptability must be seriously considered today and show how poorly it is supported by current systems, applications, and tools. We then describe ICon (Input Configurator), an input toolkit that allows interactive applications to achieve a high level of input adaptability. We present the software architecture behind ICon, then the toolkit itself, and give several examples of non-standard interaction techniques that are easy to build and modify using ICon's graphical editor while being hard or impossible to support using regular GUI toolkits.
Keywords: adaptability, input devices, interaction techniques, toolkits, visual programming
User walkthrough of multimodal access to multidimensional databases BIBAKFull-Text 220-226
  M. P. van Esch-Bussemakers; A. H. M. Cremers
This paper describes a user walkthrough that was conducted with an experimental multimodal dialogue system to access a multidimensional music database using a simulated mobile device (including a technically challenging four-PHANToM setup). The main objectives of the user walkthrough were to assess user preferences for certain modalities (speech, graphical and haptic-tactile) to access and present certain types of information, and for certain search strategies when searching and browsing a multidimensional database. In addition, the project aimed at providing concrete recommendations for the experimental setup, multimodal user interface design and evaluation. The results show that recommendations can be formulated both on the use of modalities and search strategies, and on the experimental setup as a whole, including the user interface. In short, it is found that haptically enhanced buttons are preferred for navigating or selecting and speech is preferred for searching the database for an album or artist. A 'direct' search strategy indicating an album, artist or genre is favorable. It can be concluded that participants were able to look beyond the experimental setup and see the potential of the envisioned mobile device and its modalities. Therefore, it was possible to formulate recommendations for future multimodal dialogue systems for multidimensional database access.
Keywords: guidelines, haptic-tactile, multidimensional, multimodal, speech, usability, user walkthrough, visualization
Multimodal interaction under exerted conditions in a natural field setting BIBAKFull-Text 227-234
  Sanjeev Kumar; Philip R. Cohen; Rachel Coulston
This paper evaluates the performance of a multimodal interface under exerted conditions in a natural field setting. The subjects in the present study engaged in a strenuous activity while multimodally performing map-based tasks using handheld computing devices. This activity made the users breathe heavily and become fatigued during the course of the study. We found that the performance of both speech and gesture recognizers degraded as a function of exertion, while the overall multimodal success rate was stable. This stabilization is accounted for by the mutual disambiguation of modalities, which increases significantly with exertion. The system performed better for subjects with a greater level of physical fitness, as measured by their running speed, with more stable multimodal performance and a later degradation of speech and gesture recognition as compared with subjects who were less fit. The findings presented in this paper have a significant impact on design decisions for multimodal interfaces targeted towards highly mobile and exerted users in field environments.
Keywords: evaluation, exertion, field, mobile, multimodal interaction
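Mutual disambiguation, the mechanism credited above with stabilizing multimodal success under exertion, can be illustrated with toy n-best lists: the integrator keeps only semantically compatible (speech, gesture) pairs and rescores them jointly, so a correct lower-ranked hypothesis can still win. The commands and scores below are invented for illustration:

```python
# Each recognizer returns an n-best list of (hypothesis, score).
speech_nbest = [("zoom in", 0.5), ("move here", 0.4)]
gesture_nbest = [("point", 0.6), ("area", 0.35)]

# Only some cross-modal pairings make semantic sense together.
compatible = {("move here", "point"), ("zoom in", "area")}

# Joint rescoring over compatible pairs only.
best = max(
    ((s, g, ps * pg) for s, ps in speech_nbest for g, pg in gesture_nbest
     if (s, g) in compatible),
    key=lambda t: t[2],
)

# "zoom in" tops the unimodal speech list, but the highest-scoring
# compatible pair recovers the intended command instead:
print(best[:2])   # ('move here', 'point')
```

This is why degraded unimodal recognizers can still yield a stable multimodal success rate: each modality prunes the other's errors.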
A segment-based audio-visual speech recognizer: data collection, development, and initial experiments BIBAKFull-Text 235-242
  Timothy J. Hazen; Kate Saenko; Chia-Hao La; James R. Glass
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.
Keywords: audio-visual corpora, audio-visual speech recognition

Poster session 2

A model-based approach for real-time embedded multimodal systems in military aircrafts BIBAKFull-Text 243-250
  Rémi Bastide; David Navarre; Philippe Palanque; Amélie Schyn; Pierre Dragicevic
This paper presents the use of a model-based approach for the formal description of real-time embedded multimodal systems. This modeling technique has been applied in the field of military fighter aircraft. The paper presents the formal description technique, its application to the case study of a multimodal command and control interface for the Rafale aircraft, and its relationship with an architectural model for interactive systems.
Keywords: embedded systems, formal description techniques, model-based approaches
ICARE software components for rapidly developing multimodal interfaces BIBAKFull-Text 251-258
  Jullien Bouchet; Laurence Nigay; Thierry Ganille
Although several real multimodal systems have been built, their development remains a difficult task. In this paper we address this problem by describing a component-based approach, called ICARE, for rapidly developing multimodal interfaces. ICARE stands for Interaction-CARE (Complementarity Assignment Redundancy Equivalence). Our approach relies on two types of software components. Firstly, ICARE elementary components include Device components and Interaction Language components that enable us to develop pure modalities. Secondly, Composition components define combined usages of modalities. Reusing and assembling ICARE components enables rapid development of multimodal interfaces. We have developed several multimodal systems using ICARE, and we illustrate the discussion using one of them: the FACET simulator of the Rafale French military plane cockpit.
Keywords: multimodal interactive systems, software components
MacVisSTA: a system for multimodal analysis BIBAKFull-Text 259-264
  R. Travis Rose; Francis Quek; Yang Shi
The study of embodied communication requires access to multiple data sources, such as multistream video and audio, and various kinds of derived data and metadata: gesture, head, posture, facial expression, and gaze information. The common element that runs through these data is the co-temporality of the multiple modes of behavior. In this paper, we present the multimedia Visualization for Situated Temporal Analysis (MacVisSTA) system for the analysis of multimodal human communication through video, audio, speech transcriptions, and gesture and head orientation data. The system uses a multiple linked representation strategy in which different representations are linked by the current time focus. In this framework, the multiple display components associated with the disparate data types are kept in synchrony, each component serving both as a controller of the system and as a display. Hence the user is able to analyze and manipulate the data from different analytical viewpoints (e.g. through the time-synchronized speech transcription or through motion segments of interest). MacVisSTA supports analysis of the synchronized data at varying timescales. It provides an annotation interface that permits users to code the data into 'music-score' objects, and to make and organize multimedia observations about the data. Hence MacVisSTA integrates flexible visualization with annotation within a single framework. An XML database manager has been created for storage and search of annotation data. We compare the system with other existing annotation tools with respect to functionality and interface design. The software runs on Macintosh OS X computer systems.
Keywords: embodied communication, flexible visualization and annotation, gesture, multimodal interaction, multiple linked representation
Context based multimodal fusion BIBAKFull-Text 265-272
  Norbert Pfleger
We present a generic approach to multimodal fusion which we call context based multimodal integration. Key to this approach is that every multimodal input event is interpreted and enriched with respect to its local turn context. This local turn context comprises all previously recognized input events and the dialogue state that both belong to the same user turn. We show that a production rule system is an elegant way to handle this context based multimodal integration and we describe a first implementation of the so-called PATE system. Finally, we present results from a first evaluation of this approach as part of a human-factors experiment with the COMIC system.
Keywords: fusion, multimodal dialogue systems, multimodal integration, speech and pen input
Emotional Chinese talking head system BIBAKFull-Text 273-280
  Jianhua Tao; Tieniu Tan
A natural human-computer interface requires the integration of realistic audio and visual information for perception and display. In this paper, a lifelike talking head system is proposed. The system converts text to speech with synchronized animation of mouth movements and emotional expression. The talking head is based on a generic 3D human head model, into which a personalized model is incorporated. With texture mapping, the personalized model offers a more natural and realistic look than the generic model. To express emotion, both emotional speech synthesis and emotional facial animation are integrated, and Chinese viseme models are also created. Finally, the emotional talking head system is shown to generate natural and vivid audio-visual output.
Keywords: emotion, facial animation, speech synthesis, talking head
Experiences on haptic interfaces for visually impaired young children BIBAKFull-Text 281-288
  Saija Patomäki; Roope Raisamo; Jouni Salo; Virpi Pasto; Arto Hippula
Visually impaired children do not have equal opportunities to learn and play compared to sighted children. Computers have great potential to correct this problem. In this paper we present a series of studies in which multimodal applications were designed for a group of eleven visually impaired children aged from 3.5 to 7.5 years. We also present our testing procedure, specially adapted for visually impaired young children. During the two-year project it became clear that, with careful design of the tasks and proper use of haptic and auditory features, usable computing environments can be created for visually impaired children.
Keywords: Phantom, blind children, haptic environment, haptic feedback, learning, visually impaired children
Visual touchpad: a two-handed gestural input device BIBAKFull-Text 289-296
  Shahzad Malik; Joe Laszlo
This paper presents the Visual Touchpad, a low-cost vision-based input device that allows for fluid two-handed interactions with desktop PCs, laptops, public kiosks, or large wall displays. Two downward-pointing cameras are attached above a planar surface, and a stereo hand tracking system provides the 3D positions of a user's fingertips on and above the plane. Thus the planar surface can be used as a multi-point touch-sensitive device, but with the added ability to also detect hand gestures hovering above the surface. Additionally, the hand tracker not only provides positional information for the fingertips but also finger orientations. A variety of one and two-handed multi-finger gestural interaction techniques are then presented that exploit the affordances of the hand tracker. Further, by segmenting the hand regions from the video images and then augmenting them transparently into a graphical interface, our system provides a compelling direct manipulation experience without the need for more expensive tabletop displays or touch-screens, and with significantly less self-occlusion.
Keywords: augmented reality, computer vision, direct manipulation, fluid interaction, gestures, hand tracking, perceptual user interface, two hand, virtual keyboard, virtual mouse, visual touchpad
An evaluation of virtual human technology in informational kiosks BIBAKFull-Text 297-302
  Curry Guinn; Rob Hubal
In this paper, we look at the results of using spoken language interactive virtual characters in information kiosks. Users interact with synthetic spokespeople using spoken natural language dialogue. The virtual characters respond with spoken language, body and facial gesture, and graphical images on the screen. We present findings from studies of three different information kiosk applications. As we developed successive kiosks, we applied lessons learned from previous kiosks to improve system performance. For each setting, we briefly describe the application, the participants, and the results, with specific focus on how we increased user participation and improved informational throughput. We tie the results together in a lessons learned section.
Keywords: evaluation, gesture, natural language, spoken dialogue system, virtual humans, virtual reality
Software infrastructure for multi-modal virtual environments BIBAKFull-Text 303-308
  Brian Goldiez; Glenn Martin; Jason Daly; Donald Washburn; Todd Lazarus
Virtual environment systems, especially those supporting multi-modal interactions, require a robust and flexible software infrastructure that supports a wide range of devices, interaction techniques, and target applications. In addition to interactivity needs, a key factor in the robustness of the software is the minimization of latency and, more importantly, the reduction of jitter (the variability of latency). This paper presents a flexible software infrastructure that has demonstrated robustness in initial prototyping. The infrastructure, based on the VESS libraries from the University of Central Florida, simplifies the task of creating multi-modal virtual environments. Our extensions to VESS include numerous features to support new input and output devices for new sensory modalities and interaction techniques, as well as some control over latency and jitter.
Keywords: augmented environments, haptics, latency, multi-modal interfaces, olfaction, software infrastructure, virtual environments
GroupMedia: distributed multi-modal interfaces BIBAKFull-Text 309-316
  Anmol Madan; Ron Caneel; Alex Sandy Pentland
In this paper, we describe the GroupMedia system, which uses wireless wearable computers to measure audio features, head-movement, and galvanic skin response (GSR) for dyads and groups of interacting people. These group sensor measurements are then used to build a real-time group interest index. The group interest index can be used to control group displays, annotate the group discussion for later retrieval, and even to modulate and guide the group discussion itself. We explore three different situations where this system has been introduced, and report experimental results.
Keywords: galvanic skin response, head nodding, human behavior, influence model, interest, prosody, speech features

Demo session 1

Agent and library augmented shared knowledge areas (ALASKA) BIBAKFull-Text 317-318
  Eric R. Hamilton
This paper reports on an NSF-funded effort now underway to integrate three learning technologies that have emerged and matured over the past decade; each has presented compelling and oftentimes moving opportunities to alter educational practice and to render learning more effective. The project seeks a novel way to blend these technologies and to create and test a new model for human-machine partnership in learning settings. The innovation we are prototyping in this project creates an applet-rich shared space whereby a pedagogical agent at each learner's station functions as an instructional assistant to the teacher or professor and tutor to the student. The platform is intended to open a series of new -- and instructionally potent -- interactive pathways.
Keywords: animated agents, applets, collaborative workspace, heterogeneous network, multi-tier system, pedagogical agents
MULTIFACE: multimodal content adaptations for heterogeneous devices BIBAKFull-Text 319-320
  Songsak Channarukul; Susan W. McRoy; Syed S. Ali
We are interested in applying and extending existing frameworks for combining output modalities for adaptations of multimodal content on heterogeneous devices based on user and device models. In this paper, we present Multiface, a multimodal dialog system that allows users to interact using different devices such as desktop computers, PDAs, and mobile phones. The presented content and its modality will be customized to individual users and the device they are using.
Keywords: device-centered adaptation, dialog system, multimodal output, user-centered adaptation
Command and control resource performance predictor (C2RP2) BIBKFull-Text 321-322
  Joseph M. Dalton; Ali Ahmad; Kay Stanney
Keywords: applet, command and control, predictor
A multi-modal architecture for cellular phones BIBKFull-Text 323-324
  Luca Nardelli; Marco Orlandi; Daniele Falavigna
Keywords: VoiceXML, automatic speech recognition, mobile devices, multimodality
'SlidingMap': introducing and evaluating a new modality for map interaction BIBAKFull-Text 325-326
  Matthias Merdes; Jochen Häußler; Matthias Jöst
In this paper, we describe the concept of a new modality for interaction with digital maps. We propose using inclination as a means for panning maps on a mobile computing device, namely a tablet PC. The result is a map which is both physically transportable as well as manipulable with very simple and natural hand movements. We describe a setup for comparing this new modality with the better known modalities of pen-based and joystick-based interaction. Apart from demonstrating the new modality we plan to perform a short evaluation.
Keywords: inclination modality, map interaction, mobile systems
Multimodal interaction for distributed collaboration BIBAKFull-Text 327-328
  Levent Bolelli; Guoray Cai; Hongmei Wang; Bita Mortazavi; Ingmar Rauschert; Sven Fuhrmann; Rajeev Sharma; Alan MacEachren
We demonstrate a same-time different-place collaboration system for managing crisis situations using geospatial information. Our system enables distributed spatial decision-making by providing a multimodal interface to team members. Decision makers in front of large screen displays and/or desktop computers, and emergency responders in the field with tablet PCs can engage in collaborative activities for situation assessment and emergency response.
Keywords: GIS, geocollaboration, interactive maps, multimodal interfaces

Demo session 2

A multimodal learning interface for sketch, speak and point creation of a schedule chart BIBAKFull-Text 329-330
  Ed Kaiser; David Demirdjian; Alexander Gruenstein; Xiaoguang Li; John Niekrasz; Matt Wesson; Sanjeev Kumar
We present a video demonstration of an agent-based test bed application for ongoing research into multi-user, multimodal, computer-assisted meetings. The system tracks a two-person scheduling meeting: one person standing at a touch-sensitive whiteboard creating a Gantt chart, while another person looks on in view of a calibrated stereo camera. The stereo camera performs real-time, untethered, vision-based tracking of the onlooker's head, torso and limb movements, which in turn are routed to a 3D-gesture recognition agent. Using speech, 3D deictic gesture, and 2D object de-referencing, the system is able to track the onlooker's suggestion to move a specific milestone. The system also has a speech recognition agent capable of recognizing out-of-vocabulary (OOV) words as phonetic sequences. Thus when a user at the whiteboard speaks an OOV label name for a chart constituent while also writing it, the OOV speech is combined with letter sequences hypothesized by the handwriting recognizer to yield an orthography, pronunciation and semantics for the new label. These are then learned dynamically by the system and become immediately available for future recognition.
Keywords: multimodal interaction, vision-based body-tracking, vocabulary learning
Real-time audio-visual tracking for meeting analysis BIBAKFull-Text 331-332
  David Demirdjian; Kevin Wilson; Michael Siracusa; Trevor Darrell
We demonstrate an audio-visual tracking system for meeting analysis. A stereo camera and a microphone array are used to track multiple people and their speech activity in real-time. Our system can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting.
Keywords: microphone array, speaker localization, stereo, tracking
Collaboration in parallel worlds BIBAKFull-Text 333-334
  Ashutosh Morde; Jun Hou; S. Kicha Ganapathy; Carlos Correa; Allan Krebs; Lawrence Rabiner
We present a novel paradigm for human to human asymmetric collaboration. There is a need for people at geographically separate locations to seamlessly collaborate in real time as if they are physically co-located. In our system one user (novice) works in the real world and the other user (expert) works in a parallel virtual world. They are assisted in this task by an Intelligent Agent (IA) with considerable knowledge about the environment. Current tele-collaboration systems deal primarily with collaboration purely in the real or virtual worlds. The use of a combination of virtual and real worlds allows us to leverage the advantages from both the worlds.
Keywords: augmented reality, collaboration, distributed systems, intelligent agents, registration, virtual reality
Segmentation and classification of meetings using multiple information streams BIBAKFull-Text 335-336
  Paul E. Rybski; Satanjeev Banerjee; Fernando de la Torre; Carlos Vallespi; Alexander I. Rudnicky; Manuela Veloso
We present a meeting recorder infrastructure used to record and annotate events that occur in meetings. Multiple data streams are recorded and analyzed in order to infer a higher-level state of the group's activities. We describe the hardware and software systems used to capture people's activities as well as the methods used to characterize them.
Keywords: meeting understanding, multi-modal interfaces
A maximum entropy based approach for multimodal integration BIBAKFull-Text 337-338
  Péter Pál Boda
Integration of various user input channels for a multimodal interface is not just an engineering problem. To fully understand users in the context of an application and the current session, solutions are sought that process information from different intentional (i.e. user-originated) sources as well as from passively available sources in a uniform manner. As a first step towards this goal, the work demonstrated here investigates how intentional user input (e.g. speech, gesture) can be seamlessly combined to provide a single semantic interpretation of the user input. For this classical multimodal integration problem, the maximum entropy approach is demonstrated, with 76.52% integration accuracy for the 1-best and 86.77% for the top 3-best candidates. The paper also describes the process that generates multimodal data for training the statistical integrator, using transcribed speech from MIT's Voyager application. The quality of the generated data is assessed by comparison with real inputs to the multimodal version of Voyager.
Keywords: machine learning, maximum entropy, multimodal database, multimodal integration, virtual modality
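A maximum entropy integrator of the kind described above is, in effect, a multinomial logistic model over joint multimodal indicator features: active speech and gesture features are scored against each candidate interpretation and normalized into a posterior. The feature names, weights, and interpretations below are invented for illustration and are not Voyager's actual semantics:

```python
import numpy as np

interpretations = ["show_route", "identify_object", "zoom_map"]
feature_names = ["speech:how_far", "speech:what_is", "gesture:point", "gesture:circle"]
W = np.array([            # one (illustrative) weight vector per interpretation
    [ 1.2, -0.5,  0.8, -0.2],
    [-0.6,  1.4,  1.0, -0.1],
    [-0.3, -0.4, -0.5,  1.3],
])

def integrate(active_features):
    """Map active multimodal indicator features to an n-best interpretation list."""
    x = np.array([1.0 if f in active_features else 0.0 for f in feature_names])
    scores = W @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()                       # maximum entropy (softmax) posterior
    order = np.argsort(-p)             # n-best ranking, as in the reported 1-best/3-best scores
    return [(interpretations[i], float(p[i])) for i in order]

nbest = integrate({"speech:what_is", "gesture:point"})
print(nbest[0][0])   # identify_object
```

In training, the weights W would be fitted by maximum likelihood on (features, interpretation) pairs, which is where the generated multimodal data described in the abstract comes in.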
Multimodal interface platform for geographical information systems (GeoMIP) in crisis management BIBAKFull-Text 339-340
  Pyush Agrawal; Ingmar Rauschert; Keerati Inochanon; Levent Bolelli; Sven Fuhrmann; Isaac Brewer; Guoray Cai; Alan MacEachren; Rajeev Sharma
A novel interface system for accessing geospatial data (GeoMIP) has been developed that realizes a user-centered multimodal speech/gesture interface for addressing some of the critical needs in crisis management. In this system we primarily developed vision sensing algorithms, speech integration, multimodality fusion, and rule-based mapping of multimodal user input to GIS database queries. A demo system of this interface has been developed for the Port Authority NJ/NY and is explained here.
Keywords: GIS, collaboration, human-centered design, interactive maps, multimodal human-computer-interface, speech/gesture recognition

Doctoral spotlight session

Adaptations of multimodal content in dialog systems targeting heterogeneous devices BIBAKFull-Text 341
  Songsak Channarukul
Dialog systems that adapt to different user needs and preferences appropriately have been shown to achieve higher levels of user satisfaction [4]. However, it is also important that dialog systems be able to adapt to the user's computing environment, because people are able to access computer systems using different kinds of devices such as desktop computers, personal digital assistants, and cellular telephones. Each of these devices has a distinct set of physical capabilities, as well as a distinct set of functions for which it is typically used.
   Existing research on adaptation in both hypermedia and dialog systems has focused on how to customize content based on user models [2, 4] and interaction history. Some researchers have also investigated device-centered adaptations that range from low-level adaptations such as conversion of multimedia objects [6] (e.g., video to images, audio to text, image size reduction) to higher-level adaptations based on multimedia document models [1] and frameworks for combining output modalities [3, 5]. However, to my knowledge, no work has been done on integrating and coordinating both types of adaptation interdependently.
   The primary problem I would like to address in this thesis is how multimodal dialog systems can adapt their content and style of interaction, taking the user, the device, and the dependency between them into account. Two main aspects of adaptability that my thesis considers are: (1) adaptability in content presentation and communication, and (2) adaptability in the computational strategies used to achieve the system's and the user's goals.
   Besides general user modeling questions, such as how to acquire information about the user and construct a user model, this thesis also considers issues of device modeling: (1) how can the system employ user and device models to adapt the content and determine the right combination of modalities effectively? (2) how can the system determine the combination of multimodal content that best suits the device? (3) how can one model the characteristics and constraints of devices? and (4) is it possible to generalize device models based on modalities rather than on their typical categories or physical appearance?
Keywords: device-centered adaptation, dialog system, multimodal output, user-centered adaptation
Utilizing gestures to better understand dynamic structure of human communication BIBAKFull-Text 342
  Lei Chen
Motivation: Many researchers have highlighted the importance of gesture in natural human communication. McNeill [4] puts forward the hypothesis that gesture and speech stem from the same mental process and so tend to be both temporally and semantically related. However, in contrast to speech, which surfaces as a linear progression of segments, sounds, and words, gestures appear to be nonlinear, holistic, and imagistic. Gesture adds an important dimension to language understanding due to this property of sharing a common origin with speech while using a very different mechanism for transferring information. Ignoring this information when constructing a model of human communication would limit its potential effectiveness.
   Goal and Method: This thesis concerns the development of methods to effectively incorporate gestural information from a human communication into a computer model to more accurately interpret the content and structure of that communication. Levelt [5] suggests that structure in human communication stems from the dynamic conscious process of language production, during which a conversant organizes the concepts to be expressed, plans the discourse, and selects appropriate words, prosody, and gestures while also correcting errors that occur in this process. Clues related to this conscious processing emerge in both the final speech stream and gestures. This thesis will attempt to utilize these clues to determine the structural elements of human-to-human dialogs, including sentence boundaries, topic boundaries, and disfluency structure. For this purpose, the data driven approach is used. This work requires three important components: corpus generation, feature extraction, and model construction.
   Previous Work: Some work related to each of these components has already been conducted. A data collection and processing protocol for constructing multimodal corpora has been created; details on the video and audio processing can be found in the Data and Annotation section of [3]. To improve the speed of producing a corpus while maintaining its quality, we have surveyed factors impacting the accuracy of forced alignments of transcriptions to audio files [2]. These alignments provide a crucial temporal synchronization between video events and spoken words (and their components) for this research effort. We have also conducted measurement studies in an attempt to understand how to model multimodal conversations. For example, we have investigated the types of gesture patterns that occur during speech repairs [1]. Recently, we constructed a preliminary model combining speech and gesture features for detecting sentence boundaries in videotaped dialogs. This model combines language and prosody models together with a simple gestural model to more effectively detect sentence boundaries [3].
   Future Work: To date, our multimodal corpora involve human monologues and dialogues (see http://vislab.cs.wright.edu/kdi). We are participating in the collection and preparation of a corpus of multi-party meetings (see http://vislab.cs.wright.edu/Projects/Meeting-Analysis). To facilitate the multi-channel audio processing, we are constructing a tool to support accurate audio transcription and alignment. The data from this meeting corpus will enable the development of more sophisticated gesture models allowing us to expand the set of gesture features (e.g., spatial properties of the tracked gestures). Additionally, we will investigate more advanced machine learning methods in an attempt to improve the performance of our models. We also plan to expand our models to phenomena such as topic segmentation.
Keywords: dialog, gesture, language models, multimodal fusion, prosody, sentence boundary detection
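The preliminary model described above combines language, prosody, and gesture evidence to detect sentence boundaries. A minimal sketch of one common way to do this (log-linear interpolation of per-word boundary posteriors) is shown below; the function names, weights, and probabilities are illustrative assumptions, not the thesis's actual model.

```python
import math

def combine_boundary_scores(lm_p, prosody_p, gesture_p,
                            weights=(0.5, 0.3, 0.2)):
    """Log-linearly interpolate per-word sentence-boundary posteriors
    from a language model, a prosody model, and a gesture model."""
    scores = []
    for lp, pp, gp in zip(lm_p, prosody_p, gesture_p):
        log_score = (weights[0] * math.log(lp)
                     + weights[1] * math.log(pp)
                     + weights[2] * math.log(gp))
        scores.append(math.exp(log_score))
    return scores

def detect_boundaries(scores, threshold=0.5):
    """Return indices of words whose combined score marks a boundary."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```

A word strongly supported by all three models keeps a high combined score, while disagreement pulls the score down; the threshold trades precision against recall.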
Multimodal programming for dyslexic students BIBAFull-Text 343
  Dale-Marie Wilson
As the Web's role in society increases, so too does the need for its universality. Access to the Web by all, including people with disabilities, has become a requirement of Web sites, as can be seen in the passing of the Americans with Disabilities Act in 1990. This universality has spilled over into other disciplines, e.g. screen readers for Web browsing; however, Computer Science has not yet made significant efforts to do the same. The main focus of this research is to provide this universal access in the development of virtual learning environments, more specifically in computer programming. To facilitate this access, research into the features of dyslexia is required: what it is, how it affects a person's thought process, and what changes are necessary to accommodate these effects. Also, a complete understanding of the thought process in the creation of a complete computer program is necessary.
   Dyslexia has been diagnosed as affecting the left side of the brain. The left side of the brain processes information in a linear, sequential manner. It is also responsible for processing symbols, which include letters, words and mathematical notations. Thus, dyslexics have problems with the code generation, analysis and implementation steps in the creation of a computer program. One potential solution to this problem is a multimodal programming environment.
   This multimodal environment will be interactive, providing multimodal assistance to the user as they generate, analyze and implement code. This assistance will include the ability to add functions and loops via voice and receiving a spoken description of a code segment that has been selected by the cursor.
Gestural cues for speech understanding BIBKFull-Text 344
  Jacob Eisenstein
Keywords: multimodal natural language processing
Using language structure for adaptive multimodal language acquisition BIBAKFull-Text 345
  Rajesh Chandrasekaran
In human spoken communication, language structure plays a vital role in providing a framework for humans to understand each other. Using language rules, words are combined into meaningful sentences to represent knowledge. Speech-enabled systems based on a pre-programmed Rule Grammar suffer from constraints on vocabulary and sentence structures. To address this problem, in this paper we discuss a language acquisition system that is capable of learning new words and their corresponding semantic meaning by initiating an adaptive dialog with the user. Thus, the vocabulary of the system can be increased in real time by the user. The language acquisition system is provided knowledge about language structure and is capable of accepting multimodal user inputs that include speech, touch, pen-tablet, mouse, and keyboard. We discuss the efficiency of learning new concepts and the ease with which users can teach the system new concepts.
   The multimodal language acquisition system is capable of acquiring, in real time, new words that pertain to objects, actions or attributes and their corresponding meanings. The first step in this process is to detect unknown words in the spoken utterance. Any new word that is detected is classified into one of the above mentioned categories. The second step is to learn from the user the meaning of the word and add it to the semantic database.
   An unknown word is flagged whenever an utterance is not consistent with the pre-programmed Rule Grammar. Because the system can acquire words pertaining to objects, actions, or attributes, we are interested in words that are nouns, verbs, or adjectives. We use a transformation-based part-of-speech tagger, capable of annotating English words with their part of speech, to identify the nouns, verbs, and adjectives in the utterance. These words are looked up in the semantic database, and unknown words are identified. The system then initiates an adaptive dialog with the user, requesting the meaning of the unknown word. When the user has provided the relevant meaning using any of the input modalities, the system checks whether the meaning corresponds to the category of the word, i.e. if the unknown word is a noun then the user can associate only an object with it, and if the unknown word is a verb then only an action can be associated with it. Thus, the system uses the knowledge of the word's occurrence in the sentence to determine what kind of meaning can be associated with it. The language structure thus gives the system a basic knowledge of the unknown word.
Keywords: adaptive dialogue systems, computer language learning, language acquisition, language structure
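The unknown-word detection step described above (POS-tag the utterance, keep content words, flag those missing from the semantic database along with the category of meaning they can take) can be sketched as follows. The tiny lexicon, database, and word "gizmo" below are stand-ins for illustration only; the actual system uses a transformation-based tagger and a real semantic database.

```python
# Stand-in semantic database: known words and their acquired meanings.
SEMANTIC_DB = {"box": "object", "move": "action", "red": "attribute"}

# Stand-in POS lexicon replacing the transformation-based tagger.
POS_LEXICON = {"move": "verb", "the": "det", "red": "adj",
               "box": "noun", "gizmo": "noun"}

# Nouns take object meanings, verbs actions, adjectives attributes.
CONTENT_TAGS = {"noun": "object", "verb": "action", "adj": "attribute"}

def find_unknown_words(utterance):
    """Return (word, expected_meaning_category) pairs that the system
    should ask the user to define via an adaptive dialog."""
    unknown = []
    for word in utterance.lower().split():
        tag = POS_LEXICON.get(word)
        if tag in CONTENT_TAGS and word not in SEMANTIC_DB:
            unknown.append((word, CONTENT_TAGS[tag]))
    return unknown
```

Because the category is derived from the part of speech, the dialog can reject a user-supplied meaning of the wrong kind (e.g. an action offered for a noun).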
Private speech during multimodal human-computer interaction BIBKFull-Text 346
  Rebecca Lunsford
Keywords: cognitive load, human performance, individual differences, multimodal interaction, self-regulatory language, senior users, speaker variability, system adaptation, task difficulty
Projection augmented models: the effect of haptic feedback on subjective and objective human factors BIBKFull-Text 347
  Emily Bennett
Keywords: haptic feedback, projection augmented models
Multimodal interface design for multimodal meeting content retrieval BIBAKFull-Text 348
  Agnes Lisowska
This thesis will investigate which modalities, and in which combinations, are best suited for use in a multimodal interface that allows users to retrieve the content of recorded and processed multimodal meetings. The dual role of multimodality in the system (present in both the interface and the stored data) poses additional challenges. We will extend and adapt established approaches to HCI and multimodality [2, 3] to this new domain, maintaining a strongly user-driven approach to design.
Keywords: Wizard of Oz, multimodal interface, multimodal meetings
Determining efficient multimodal information-interaction spaces for C2 systems BIBAKFull-Text 349
  Leah M. Reeves
Military operations and friendly fire mishaps over the last decade have demonstrated that Command, Control, Communications, Computers, Intelligence, Surveillance, and Reconnaissance (C4ISR) systems may often lack the ability to efficiently and effectively support operations in complex, time critical environments. With the vast increase in the amount and type of information available, the challenge to today's military system designers is to create interfaces that allow warfighters to proficiently process the optimal amount of mission essential data [1]. To meet this challenge, multimodal system technology is showing great promise because, as the technology that supports C4ISR systems advances, it becomes possible to leverage all of the human sensory systems. The implication is that by facilitating the efficient use of a C4ISR operator's multiple information processing resources, substantial gains in the information management capacity of the warfighter-computer integral may be realized [2]. Despite its great promise, however, the potential of multimodal technology as a tool for streamlining interaction within military C4ISR environments may not be fully realized until the following guiding principles are identified:
  • how to combine visualization and multisensory display techniques for given users, tasks, and problem domains;
  • how task attributes should be represented (e.g., via which modality, via multiple modalities);
  • which multimodal interaction technique(s) is most appropriate.
   Due to the current lack of empirical evidence and principle-driven guidelines, designers often encounter difficulties when choosing the most appropriate modal interaction techniques for given users, applications, or specific military command and control (C2) tasks within C4ISR systems. The implication is that inefficient multimodal C2 system design may hinder our military's ability to fully support operations in complex, time critical environments and thus impede warfighters' ability to achieve accurate situational awareness (SA) in a timely manner [1]. Consequently, warfighters often become overwhelmed when provided with more information than they can accurately process. The development of multimodal design guidelines from both a user and task domain perspective is thus critical to the achievement of successful Human Systems Integration (HSI) within military environments such as C2 systems.
   This study provides preliminary empirical support in identifying user attributes, such as spatial ability (p < 0.02) and learning style (p < 0.03), which may aid in developing principle-driven guidelines for how and when to effectively present task-specific modal information to improve C2 warfighters' performance. A preliminary framework for modeling user interaction in multimodal C2 environments is also in development, based on existing theories and models of working memory, as well as on new insights gained from the latest imaging of electromagnetic (e.g., EEG, ERP, MEG) and hemodynamic (e.g., fMRI, PET) changes in the brain while users perform predefined tasks. This research represents an innovative way to both predict and accommodate a user's information processing resources while interacting with multimodal systems. The current results and planned follow-on studies are facilitating the development of principle-driven multimodal design guidelines regarding how and when to adapt modes of interaction to meet the cognitive capabilities of users. Although the initial application of such results is focused on determining how and when modalities should be presented, either in isolation or in combination, to effectively present task-specific information to C4ISR warfighters, this research shows great potential for applicability to the multimodal design community in general.
Keywords: HCI, command and control, guidelines, multimodal design, multisensory
Using spatial warning signals to capture a driver's visual attention BIBAKFull-Text 350
  Cristy Ho
This study was designed to assess the potential benefits of using spatial auditory or vibrotactile warning signals in the domain of driving performance, using a simulated driving task. Across six experiments, participants had to monitor a rapidly presented stream of distractor letters for occasional target digits (simulating an attention-demanding visual task, such as driving). Whenever participants heard an auditory cue (E1-E4) or felt a vibration (E5-E6), they had to check the front and the rearview mirror for the rapid approach of a car from in front or behind and respond accordingly (either by accelerating or braking). The efficacy of various auditory and vibrotactile warning signals in directing a participant's visual attention to the correct environmental position was compared (see Table 1). The results demonstrate the potential utility of semantically-meaningful or spatial auditory and/or vibrotactile warning signals in interface design for directing a driver's, or other interface operator's, visual attention to time-critical events or information.
Keywords: auditory, crossmodal, driving, interface design, spatial attention, verbal, vibrotactile, visual, warning signals
Multimodal interfaces and applications for visually impaired children BIBAKFull-Text 351
  Saija Patomäki
Applications specifically designed for visually impaired children are rare. Additionally, this group of users is often unable to obtain the needed applications and equipment for their homes because of the expense. However, these children's impairment should not preclude them from the benefits and possibilities computers have to offer. In a modern society, the services and applications that computers open up can be considered a necessity for its citizens. This is the core of our research interest: to test various haptic devices and design usable applications that give this special user group the opportunity to become acquainted with computers, so that they are encouraged to use and benefit from the technology later in their lives.
   Research similar to ours, in which haptic sensation is present, has been carried out by Sjöström [3]. He has developed and tested haptic games that are used with the Phantom device [1]. Some of his applications are aimed at visually impaired children.
   During the project "Computer-based learning environment for visually impaired people" we designed, implemented and tested three different applications. Our target group was three- to seven-year-old visually impaired children. The applications were tested in three phases with the chosen subjects, and a special testing procedure was developed during the experiments [2]. The applications were based on haptic and auditory feedback, but a simple graphical interface was available for those who were only partially blind. The chosen haptic device was the Phantom [1], a six-degrees-of-freedom input device used with a pen-like stylus. The stylus is attached to a robotic arm that generates force feedback to stimulate touch.
   The first application consisted of simple materials and path shapes. In the user tests the virtual materials were compared with real ones, and the various path shapes were meant to be traced along with the stylus. The second application was a more game-like environment: there were four haptic rooms in which children had to perform different tasks. The last tested application was a modification of the previous one; its user interface consisted of six rooms, and the tasks in them were simplified based on the results of the previous user tests.
   Because the Phantom device is expensive and also difficult for some of the children to use, we decided to replace it with simpler devices. In our current project, "Multimodal Interfaces for Visually Impaired Children", the applications will be used with haptic devices such as a tactile mouse or a force feedback joystick. Some applications are designed and implemented from scratch, and some are adapted from games originally meant for sighted children. The desired research outcome is practical: to produce workable user interfaces and applications whose functionality and cost are reasonable enough for them to be acquired for the homes of blind children.
Keywords: Phantom, blind children, haptic environment, haptic feedback, learning, visually impaired children
Multilayer architecture in sign language recognition system BIBAKFull-Text 352-353
  Feng Jiang; Hongxun Yao; Guilin Yao
Up to now, analytical or statistical methods have been used in sign language recognition with large vocabularies. Analytical methods such as Dynamic Time Warping (DTW) or Euclidean distance have been used for isolated word recognition, but their performance is not satisfactory because they are easily disturbed by noise. Statistical methods, especially Hidden Markov Models, are commonly used for both continuous sign language and isolated words, but as the vocabulary expands the processing time becomes increasingly unacceptable. Therefore, a multilayer architecture for large-vocabulary sign language recognition is proposed in this paper for the purpose of speeding up the recognition process. In this method the gesture sequence to be recognized is first narrowed down, through a global cursory search, to a set of easily confused words (the confusion set); the gesture is then recognized through a subsequent local search. The generation of the confusion set is realized by a DTW/ISODATA algorithm. Experimental results indicate that this is an effective algorithm for Chinese sign language recognition.
Keywords: DTW/ISODATA, sign language recognition
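The coarse search described above ranks vocabulary templates by an analytical distance and keeps only the nearest few as the confusion set for the finer local search. A minimal, illustrative sketch of that step using plain DTW is shown below; the feature vectors, vocabulary, and function names are assumptions for illustration, not the paper's actual DTW/ISODATA implementation.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences of feature
    vectors, using Manhattan distance as the local frame cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum(abs(x - y) for x, y in zip(seq_a[i - 1], seq_b[j - 1]))
            # Standard DTW recurrence: match, insertion, or deletion.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def coarse_search(query, templates, k=2):
    """Global cursory search: keep the k nearest vocabulary words as the
    confusion set, to be resolved by a finer local search."""
    ranked = sorted(templates, key=lambda w: dtw_distance(query, templates[w]))
    return ranked[:k]
```

Restricting the expensive fine-grained recognition to the k-word confusion set is what gives the multilayer architecture its speedup over searching the full vocabulary.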
Computer vision techniques and applications in human-computer interaction BIBAKFull-Text 354
  Erno Mäkinen
There has been much research on computer vision in the last three decades, and computer vision methods have been developed for many different situations. One example is the detection of human faces. Face detection is hard for computers: faces look different from different viewing directions, facial expressions affect the look of the face, each individual person has a unique face, lighting conditions vary, and so on.
   However, face detection is currently possible under limited conditions. In addition, there are methods that can be used for gender recognition [3], face recognition [5] and facial expression recognition [2].
   Nonetheless, there has been very little research on how to combine these methods, and also quite little on how to apply them in human-computer interaction (HCI).
   Finding sets of techniques that complement each other in a useful way is one research challenge. Some applications take advantage of one or two computer vision techniques. For example, Christian and Avery [1] have developed an information kiosk that uses computer vision to detect potential users from a distance. A similar kiosk has been developed by us at the University of Tampere [4]. There are also some games that use simple computer vision techniques for interaction. However, very few applications use several computer vision techniques, such as face detection, facial expression recognition and gender recognition, together; overall, there has been very little effort to combine different techniques.
   In my research I develop computer vision methods and combine them so that the combined method can detect faces and recognize gender and facial expressions. After the methods are successfully combined, it will be easier to develop HCI applications that take advantage of computer vision. Applications that can be used by a small group of people are my specific interest. These applications allow me to build adaptive user interfaces and to analyze the use of computer vision techniques in improving human-computer interaction.
Keywords: computer vision applications, face detection, facial expression recognition, gender recognition
Multimodal response generation in GIS BIBAFull-Text 355
  Levent Bolelli
Advances in computer hardware and software technologies have enabled sophisticated information visualization techniques, as well as new interaction opportunities, to be introduced in the development of GIS (Geographical Information Systems) applications. In particular, research efforts in computer vision and natural language processing have enabled users to interact with computer applications using natural speech and gestures, which has proven to be effective for interacting with dynamic maps [1, 6]. Pen-based mobile devices and gesture recognition systems enable system designers to define application-specific gestures for carrying out particular tasks, and using a force-feedback mouse for interacting with GIS has been proposed for visually impaired people [4]. These are exciting new opportunities and hold the promise of advancing interaction with computers to a completely new level. The ultimate aim, however, should be to facilitate human-computer communication; that is, equal emphasis should be given to both understanding and generation of multimodal behavior. My proposed research will provide a conceptual framework and a computational model for generating multimodal responses to communicate spatial information along with dynamically generated maps. The model will eventually lead to the development of a computational agent with the reasoning capabilities to distribute the semantic and pragmatic content of the intended response message among speech, deictic gestures and visual information. In other words, the system will be able to select the most natural and effective mode(s) of communicating back to the user.
   Any research in computer science that investigates direct interaction of computers with humans should place human factors in center stage. Therefore, this work will follow a multi-disciplinary approach and integrate ideas from previous research in Psychology, Cognitive Science, Linguistics, Cartography, Geographical Information Science (GIScience) and Computer Science, enabling us to identify and address the human, cartographic and computational issues involved in response planning and to assist users with their spatial decision making by facilitating their visual thinking process as well as reducing their cognitive load. The methodology will be integrated into the design of the DAVE_G prototype [7]: a natural, multimodal, mixed-initiative dialogue interface to GIS. The system is currently capable of recognizing, interpreting and fusing users' naturally occurring speech and gesture requests, and of generating natural speech output. The communication between the system and user is modeled following collaborative discourse theory [2] and maintains a Recipe Graph [5] structure -- based on SharedPlan theory [3] -- to represent the intentional structure of the discourse between the user and the system. One major concern in generating speech responses for dynamic maps is that spatial information cannot be effectively communicated using speech alone. Altering perceptual attributes (e.g. color, size, pattern) of the visual data to direct the user's attention to a particular location on the map is not usually effective, since each attribute bears an inherent semantic meaning; those perceptual attributes should be modified only when the system judges that they are not crucial to the user's understanding of the situation at that stage of the task.
   Gesticulation, on the other hand, is powerful for conveying the location and form of spatially oriented information [6] without manipulating the map, and it has the added benefit of facilitating speech production. My research aims at designing a feasible, extensible and effective multimodal response generation (content planning and modality allocation) model. A plan-based reasoning algorithm and methodology integrated with the Recipe Graph structure has the potential to achieve these goals.
Adaptive multimodal recognition of voluntary and involuntary gestures of people with motor disabilities BIBKFull-Text 356
  Ingmar Rauschert
Keywords: adaptive systems, gesture recognition, motor-disability, multimodal human-computer-interface