HCI Bibliography: Search Results
Database updated: 2016-05-10 Searches since 2006-12-01: 32,227,731
Hosted by ACM SIGCHI
The HCI Bibliography was moved to a new server on 2015-05-12 and again on 2016-01-05, substantially degrading the environment for making updates.
There are no plans to add to the database.
Please send questions or comments to director@hcibib.org.
Query: C.ICMI.15.* Limit: papers Results: 106 Sorted by: Date
Records: 1 to 25 of 106
Sharing Representations for Long Tail Computer Vision Problems Keynote Address 1 / Bengio, Samy Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.1
ACM Digital Library Link
Summary: The long tail phenomenon appears when a small number of objects/words/classes are very frequent and thus easy to model, while many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image captioning, which can be seen as an ultimate rare class problem since each image is attributed to a novel, yet structured, class in the form of a meaningful descriptive sentence.

Interaction Studies with Social Robots Keynote Address 2 / Dautenhahn, Kerstin Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.3
ACM Digital Library Link
Summary: Over the past 10 years we have seen worldwide an immense growth of research and development into companion robots. Those are robots that fulfil particular tasks, but do so in a socially acceptable manner. The companionship aspect reflects the repeated and long-term nature of such interactions, and the potential of people to form relationships with such robots, e.g. as friendly assistants. A number of companion and assistant robots have been entering the market; two of the latest examples are Aldebaran's Pepper robot and Jibo (Cynthia Breazeal). Companion robots are increasingly targeting particular application areas, e.g. as home assistants or therapeutic tools. Research into companion robots needs to address many fundamental research problems concerning perception, cognition, action and learning, but regardless of how sophisticated our robotic systems may be, the potential users need to be taken into account from the early stages of development. The talk will emphasize the need for a highly user-centred approach towards design, development and evaluation of companion robots. An important challenge is to evaluate robots in realistic and long-term scenarios, in order to capture as closely as possible those key aspects that will play a role when using such robots in the real world. In order to illustrate these points, my talk will give examples of interaction studies that my research team has been involved in. This includes studies into how people perceive robots' non-verbal cues, creating and evaluating realistic scenarios for home companion robots using narrative framing, and verbal and tactile interaction of children with the therapeutic and social robot Kaspar. The talk will highlight the issues we encountered when we proceeded from laboratory-based experiments and prototypes to real-world applications.

Connections: 2015 ICMI Sustained Accomplishment Award Lecture Keynote Address 3 (Sustained Accomplishment Award Talk) / Horvitz, Eric Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.5
ACM Digital Library Link
Summary: Our community has long pursued principles and methods for enabling fluid and effortless collaborations between people and computing systems. Forging deep connections between people and machines has come into focus over the last 25 years as a grand challenge at the intersection of artificial intelligence, human-computer interaction, and cognitive psychology. I will review experiences and directions with leveraging advances in perception, learning, and reasoning in pursuit of our shared dreams.

Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits Oral Session 1: Machine Learning in Multimodal Systems / Chatterjee, Moitreya / Park, Sunghyun / Morency, Louis-Philippe / Scherer, Stefan Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.7-14
ACM Digital Library Link
Summary: Human communication involves conveying messages both through verbal and non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless, the task of learning these patterns for a computer by combining cues from multiple modalities is challenging because it requires effective representation of the signals and also taking into consideration the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) modeling the intermodal variations and dependencies; b) representing the data using an apt number of features, such that the necessary patterns are captured but at the same time allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition, in the context of recognizing two essential speaker traits, namely passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data. Each of these perspectives attempts to independently address the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other, we explicitly capture the correlation between the modalities, but in a space of few dimensions, and explore a novel clustering-based kernel similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures semantic similarity of verbal content and preserves word ordering. The experimental results on a recent public dataset show significant improvement of our approach over multiple baselines. Finally, we analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility/passionateness.
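As a rough illustration of the two perspectives the abstract describes (a conditional-independence fusion of modalities versus a classifier built on a joint low-dimensional representation), the sketch below combines two such classifiers into a simple ensemble. The features, dimensions and data are hypothetical placeholders, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' code): an ensemble combining two views of
# multimodal data -- (a) per-modality classifiers fused under a conditional-independence
# assumption, and (b) a classifier trained on a low-dimensional joint projection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X_verbal = rng.normal(size=(n, 50))     # e.g. text/semantic features (hypothetical)
X_nonverbal = rng.normal(size=(n, 30))  # e.g. prosody + facial features (hypothetical)
y = rng.integers(0, 2, size=n)          # e.g. high vs. low perceived passion

# Perspective A: modality-wise classifiers, fused by multiplying class posteriors
# (the conditional-independence assumption).
clf_v = LogisticRegression(max_iter=1000).fit(X_verbal, y)
clf_n = LogisticRegression(max_iter=1000).fit(X_nonverbal, y)
p_a = clf_v.predict_proba(X_verbal) * clf_n.predict_proba(X_nonverbal)
p_a /= p_a.sum(axis=1, keepdims=True)

# Perspective B: capture cross-modal correlation in a small joint space.
X_joint = PCA(n_components=10).fit_transform(np.hstack([X_verbal, X_nonverbal]))
clf_j = LogisticRegression(max_iter=1000).fit(X_joint, y)
p_b = clf_j.predict_proba(X_joint)

# Ensemble: average the two perspectives' posteriors.
y_hat = (0.5 * p_a + 0.5 * p_b).argmax(axis=1)
print("training accuracy of the ensemble:", (y_hat == y).mean())
```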

Personality Trait Classification via Co-Occurrent Multiparty Multimodal Event Discovery Oral Session 1: Machine Learning in Multimodal Systems / Okada, Shogo / Aran, Oya / Gatica-Perez, Daniel Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.15-22
ACM Digital Library Link
Summary: This paper proposes a novel feature extraction framework from multi-party multimodal conversation for inference of personality traits and emergent leadership. The proposed framework represents multimodal features as the combination of each participant's nonverbal activity and group activity. This feature representation makes it possible to compare the nonverbal patterns extracted from participants of different groups in a metric space. It captures how the target member exhibits nonverbal behavior observed in a group (e.g. the member speaks while all members move their body), and can be applied to any kind of multiparty conversation task. Frequent co-occurrent events are discovered using graph clustering from multimodal sequences. The proposed framework is applied to the ELEA corpus, an audio-visual dataset collected from group meetings. We evaluate the framework on the binary classification task of 10 personality traits. Experimental results show that the model trained with co-occurrence features obtained higher accuracy than previous related work on 8 out of 10 traits. In addition, the co-occurrence features improve accuracy by 2% to 17%.
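The co-occurrence idea above can be illustrated with a minimal sketch: binary per-frame event streams for the target participant and for the group are paired, and the frequency of each joint event becomes one feature. The event names and data below are invented, not taken from the ELEA corpus.

```python
# Minimal sketch of co-occurrence features from binarized nonverbal event streams.
import numpy as np

rng = np.random.default_rng(1)
T = 6000  # frames in one meeting (hypothetical)
target_events = {"target_speaks": rng.random(T) < 0.3,
                 "target_moves_body": rng.random(T) < 0.2}
group_events = {"all_members_move": rng.random(T) < 0.1,
                "other_speaks": rng.random(T) < 0.4}

def cooccurrence_features(target, group):
    """Fraction of frames where a target event and a group event are both active."""
    feats = {}
    for t_name, t_seq in target.items():
        for g_name, g_seq in group.items():
            feats[f"{t_name}&{g_name}"] = float(np.mean(t_seq & g_seq))
    return feats

print(cooccurrence_features(target_events, group_events))
```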

Evaluating Speech, Face, Emotion and Body Movement Time-series Features for Automated Multimodal Presentation Scoring Oral Session 1: Machine Learning in Multimodal Systems / Ramanarayanan, Vikram / Leong, Chee Wee / Chen, Lei / Feng, Gary / Suendermann-Oeft, David Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.23-30
ACM Digital Library Link
Summary: We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series based features from these data streams -- the former being statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of co-occurrences, capture how different prototypical body posture or facial configurations co-occur within different time-lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech stream features, in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
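A rough sketch of a histogram-of-co-occurrences feature of the kind described above: given a sequence of prototype (cluster) labels over time, count how often each ordered pair of prototypes occurs at a set of time lags. The labels and lags below are made up for illustration.

```python
# Sketch of histogram-of-co-occurrences features over a label sequence.
import numpy as np

def histogram_of_cooccurrences(labels, n_prototypes, lags=(1, 5, 10)):
    """Return one flattened pair-count matrix per lag, concatenated."""
    labels = np.asarray(labels)
    feats = []
    for lag in lags:
        counts = np.zeros((n_prototypes, n_prototypes))
        for a, b in zip(labels[:-lag], labels[lag:]):
            counts[a, b] += 1
        feats.append(counts.ravel())
    return np.concatenate(feats)

rng = np.random.default_rng(2)
posture_labels = rng.integers(0, 4, size=500)   # e.g. 4 prototypical postures (hypothetical)
print(histogram_of_cooccurrences(posture_labels, n_prototypes=4).shape)  # (3 * 16,)
```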

Gender Representation in Cinematic Content: A Multimodal Approach Oral Session 1: Machine Learning in Multimodal Systems / Guha, Tanaya / Huang, Che-Wei / Kumar, Naveen / Zhu, Yan / Narayanan, Shrikanth S. Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.31-34
ACM Digital Library Link
Summary: The goal of this paper is to enable an objective understanding of gender portrayals in popular films and media through multimodal content analysis. An automated system for analyzing gender representation in terms of screen presence and speaking time is developed. First, we perform independent processing of the video and the audio content to estimate gender distribution of screen presence at shot level, and of speech at utterance level. A measure of the movie's excitement or intensity is computed using audiovisual features for every scene. This measure is used as a weighting function to combine the gender-based screen/speaking time information at shot/utterance level to compute gender representation for the entire movie. Detailed results and analyses are presented on seventeen full length Hollywood movies.
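The weighting scheme described above can be illustrated with a tiny sketch: shot-level gender-presence estimates are combined into a movie-level figure, with each shot weighted by its duration and an audiovisual excitement score. All numbers below are invented.

```python
# Sketch of excitement-weighted aggregation of shot-level gender screen presence.
import numpy as np

shot_female_fraction = np.array([0.2, 0.8, 0.5, 0.0, 1.0])  # per-shot estimate (hypothetical)
shot_duration = np.array([12.0, 8.0, 20.0, 5.0, 15.0])      # seconds
shot_excitement = np.array([0.9, 0.3, 0.6, 0.8, 0.2])       # scene intensity in [0, 1]

weights = shot_duration * shot_excitement
movie_level = np.sum(weights * shot_female_fraction) / np.sum(weights)
print(f"excitement-weighted female screen presence: {movie_level:.2f}")
```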

Effects of Good Speaking Techniques on Audience Engagement Oral Session 2: Audio-Visual, Multimodal Inference / Curtis, Keith / Jones, Gareth J. F. / Campbell, Nick Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.35-42
ACM Digital Library Link
Summary: Understanding audience engagement levels for presentations has the potential to enable richer and more focused interaction with audio-visual recordings. We describe an investigation into automated analysis of multimodal recordings of scientific talks where the use of modalities most typically associated with engagement, such as eye gaze, is not feasible. We first study visual and acoustic features to identify those most commonly associated with good speaking techniques. To understand audience interpretation of good speaking techniques, we engaged human annotators to rate the qualities of the speaker for a series of 30-second video segments taken from a corpus of 9 hours of presentations from an academic conference. Our annotators also watched the corresponding video recordings of the audience at these presentations to estimate the level of audience engagement for each talk. We then explored the effectiveness of multimodal features extracted from the presentation video against Likert-scale ratings of each speaker as assigned by the annotators, and against the manually labelled audience engagement levels. These features were used to build a classifier to rate the qualities of a new speaker; it was able to classify a presenter's rating over an 8-class range with an accuracy of 52%, and by combining these classes into a 4-class range accuracy increases to 73%. We analyse linear correlations between individual speaker-based modalities and actual audience engagement levels to understand the corresponding effect on audience engagement. A further classifier was then built to predict the level of audience engagement for a presentation by analysing the speaker's use of acoustic and visual cues. Using these speaker-based modalities, pre-fused with speaker ratings only, we are able to predict actual audience engagement levels with an accuracy of 68%. By combining these with basic visual features from the audience as a whole, we are able to improve this to an accuracy of 70%.

Multimodal Public Speaking Performance Assessment Oral Session 2: Audio-Visual, Multimodal Inference / Wörtwein, Torsten / Chollet, Mathieu / Schauerte, Boris / Morency, Louis-Philippe / Stiefelhagen, Rainer / Scherer, Stefan Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.43-50
ACM Digital Library Link
Summary: The ability to speak proficiently in public is essential for many professions and in everyday life. Public speaking skills are difficult to master and require extensive training. Recent developments in technology enable new approaches for public speaking training that allow users to practice in engaging and interactive environments. Here, we focus on the automatic assessment of nonverbal behavior and multimodal modeling of public speaking behavior. We automatically identify audiovisual nonverbal behaviors that are correlated to expert judges' opinions of key performance aspects. These automatic assessments enable a virtual audience to provide feedback that is essential for training during a public speaking performance. We utilize multimodal ensemble tree learners to automatically approximate expert judges' evaluations to provide post-hoc performance assessments to the speakers. Our automatic performance evaluation is highly correlated with the experts' opinions with r = 0.745 for the overall performance assessments. We compare multimodal approaches with single modalities and find that the multimodal ensembles consistently outperform single modalities.

I Would Hire You in a Minute: Thin Slices of Nonverbal Behavior in Job Interviews Oral Session 2: Audio-Visual, Multimodal Inference / Nguyen, Laurent Son / Gatica-Perez, Daniel Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.51-58
ACM Digital Library Link
Summary: In everyday life, judgments people make about others are based on brief excerpts of interactions, known as thin slices. Inferences stemming from such minimal information can be quite accurate, and nonverbal behavior plays an important role in impression formation. Because the protagonists are strangers, employment interviews are a case where both nonverbal behavior and thin slices can be predictive of outcomes. In this work, we analyze the predictive validity of thin slices of real job interviews, where slices are defined by the sequence of questions in a structured interview format. We approach this problem from an audio-visual, dyadic, and nonverbal perspective, where sensing, cue extraction, and inference are automated. Our study shows that although nonverbal behavioral cues extracted from thin slices were not as predictive as when extracted from the full interaction, they were still predictive of hirability impressions, with R² values up to 0.34, which was comparable to the predictive validity of human observers on thin slices. Applicant audio cues were found to yield the most accurate results.

Deception Detection using Real-life Trial Data Oral Session 2: Audio-Visual, Multimodal Inference / Pérez-Rosas, Verónica / Abouelenien, Mohamed / Mihalcea, Rada / Burzo, Mihai Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.59-66
ACM Digital Library Link
Summary: Hearings of witnesses and defendants play a crucial role when reaching court trial decisions. Given the high-stake nature of trial outcomes, implementing accurate and effective computational methods to evaluate the honesty of court testimonies can offer valuable support during the decision making process. In this paper, we address the identification of deception in real-life trial data. We introduce a novel dataset consisting of videos collected from public court trials. We explore the use of verbal and non-verbal modalities to build a multimodal deception detection system that aims to discriminate between truthful and deceptive statements provided by defendants and witnesses. We achieve classification accuracies in the range of 60-75% when using a model that extracts and fuses features from the linguistic and gesture modalities. In addition, we present a human deception detection study where we evaluate the human capability of detecting deception in trial hearings. The results show that our system outperforms the human capability of identifying deceit.

Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects Oral Session 3: Language, Speech and Dialog / Skantze, Gabriel / Johansson, Martin / Beskow, Jonas Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.67-74
ACM Digital Library Link
Summary: In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss how multi-modal redundancy (from speech, card movements and head pose) is exploited to maintain meaningful discussions, given that the system has to process conversational speech from both children and adults in a noisy environment. Finally, we present an analysis of 373 interactions, where we investigate the robustness of the system, to what extent the system's attention can shape the users' turn-taking behaviour, and how the system can produce multi-modal turn-taking signals (filled pauses, facial gestures, breath and gaze) to deal with processing delays in the system.

Visual Saliency and Crowdsourcing-based Priors for an In-car Situated Dialog System Oral Session 3: Language, Speech and Dialog / Misu, Teruhisa Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.75-82
ACM Digital Library Link
Summary: This paper addresses issues in situated language understanding in a moving car. We propose a reference resolution method to identify user queries about specific target objects in their surroundings. We investigate methods of predicting which target object is likely to be queried given a visual scene and what kind of linguistic cues users naturally provide to describe a given target object in a situated environment. We propose methods to incorporate the visual saliency of the visual scene as a prior. Crowdsourced statistics of how people describe an object are also used as a prior. We have collected situated utterances from drivers using our research system, which was embedded in a real vehicle. We demonstrate that the proposed algorithms improve target identification rate by 15.1%.
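An illustrative sketch (not the paper's actual system) of how a visual-saliency prior and a crowdsourced description prior might be combined with a simple linguistic match score to rank candidate referents; the objects and scores below are invented.

```python
# Sketch: ranking candidate target objects by combining two priors with a query match score.
import numpy as np

objects = ["coffee shop", "gas station", "church"]
saliency_prior = np.array([0.5, 0.2, 0.3])       # how visually salient each object is (hypothetical)
description_prior = np.array([0.4, 0.3, 0.3])    # how often people describe such objects (hypothetical)
query_match = np.array([0.7, 0.1, 0.2])          # match between query words and each object

score = saliency_prior * description_prior * query_match
score /= score.sum()
print("most likely referent:", objects[int(score.argmax())], score)
```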

Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding Oral Session 3: Language, Speech and Dialog / Chen, Yun-Nung / Sun, Ming / Rudnicky, Alexander I. / Gershman, Anatole Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.83-86
ACM Digital Library Link
Summary: Spoken language interfaces are appearing in various smart devices (e.g. smart-phones, smart TVs, in-car navigation systems) and serve as intelligent assistants (IAs). However, most of them do not consider individual users' behavioral profiles and contexts when modeling user intents. Such behavioral patterns are user-specific and provide useful cues to improve spoken language understanding (SLU). This paper focuses on leveraging app behavior history to improve spoken dialog system performance. We developed a matrix factorization approach that models speech and app usage patterns to predict user intents (e.g. launching a specific app). We collected multi-turn interactions in a WoZ scenario; users were asked to reproduce the multi-app tasks that they had performed earlier on their smart-phones. By modeling latent semantics behind lexical and behavioral patterns, the proposed multi-model system achieves about 52% turn accuracy for intent prediction on ASR transcripts.
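A minimal sketch of the matrix-factorization idea: rows are user turns, columns are lexical features plus launched apps, and a low-rank factorization fills in scores for unobserved app columns so the intended app can be ranked. The vocabulary, apps and matrix below are toy placeholders, not the authors' data or model.

```python
# Sketch: low-rank completion of a turn x (words + apps) matrix to rank the intended app.
import numpy as np
from sklearn.decomposition import NMF

# columns: ["play", "music", "navigate", "home", AppMusic, AppMaps]  (all hypothetical)
M = np.array([
    [1, 1, 0, 0, 1, 0],   # "play music"    -> launched AppMusic
    [0, 0, 1, 1, 0, 1],   # "navigate home" -> launched AppMaps
    [1, 0, 0, 0, 0, 0],   # new turn: "play ..." -- app columns unobserved (0)
], dtype=float)

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(M)
M_hat = W @ model.components_          # reconstructed matrix with filled-in app scores

app_scores = M_hat[2, 4:]              # scores for the two app columns of the new turn
print("predicted app for the new turn:", ["AppMusic", "AppMaps"][int(app_scores.argmax())])
```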

Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video Oral Session 3: Language, Speech and Dialog / Chakravarty, Punarjay / Mirzaei, Sayeh / Tuytelaars, Tinne / Van hamme, Hugo Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.87-90
ACM Digital Library Link
Summary: Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates the same using spatio-temporal features that aim to capture other cues: movement of the head, upper body and hands of active speakers. Speaker directional information, obtained using sound source localization from a microphone array is used to supervise the training of these video features.

Predicting Participation Styles using Co-occurrence Patterns of Nonverbal Behaviors in Collaborative Learning Oral Session 4: Communication Dynamics / Nakano, Yukiko I. / Nihonyanagi, Sakiko / Takase, Yutaka / Hayashi, Yuki / Okada, Shogo Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.91-98
ACM Digital Library Link
Summary: With the goal of assessing participant attitudes and group activities in collaborative learning, this study presents models of participation styles based on co-occurrence patterns of nonverbal behaviors between conversational participants. First, we collected conversations among groups of three people in a collaborative learning situation, wherein each participant had a digital pen and wore a glasses-type eye tracker. We then divided the collected multimodal data into 0.1-second intervals. The discretized data were applied to an unsupervised method to find co-occurrence behavioral patterns. As a result, we discovered 122 multimodal behavioral motifs from more than 3,000 possible combinations of behaviors by three participants. Using the multimodal behavioral motifs as predictor variables, we created regression models for assessing participation styles. The multiple correlation coefficients ranged from 0.74 to 0.84, indicating a good fit between the models and the data. A correlation analysis also enabled us to identify a smaller set of behavioral motifs (fewer than 30) that are statistically significant as predictors of participation styles. These results show that automatically discovered combinations of multiple kinds of nonverbal information with high co-occurrence frequencies observed between multiple participants as well as for a single participant are useful in characterizing the participant's attitudes towards the conversation.

Multimodal Fusion using Respiration and Gaze for Predicting Next Speaker in Multi-Party Meetings Oral Session 4: Communication Dynamics / Ishii, Ryo / Kumano, Shiro / Otsuka, Kazuhiro Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.99-106
ACM Digital Library Link
Summary: Techniques that use nonverbal behaviors to predict turn-taking situations, such as who the next speaker will be and the timing of the next utterance in multi-party meetings, have been receiving a lot of attention recently. It has long been known that gaze is a physical behavior that plays an important role in transferring the speaking turn between humans. Recently, a line of research has focused on the relationship between turn-taking and respiration, a biological signal that conveys information about the intention or preliminary action to start speaking. It has been demonstrated that respiration and gaze behavior separately have the potential to allow predicting the next speaker and the next utterance timing in multi-party meetings. As a multimodal fusion approach to creating models for predicting the next speaker in multi-party meetings, we integrated respiration and gaze behavior, which are extracted from different modalities and are completely different in quality, and implemented a model that uses information about both to predict the next speaker at the end of an utterance. The model uses two-step processing: the first step predicts whether turn-keeping or turn-taking happens; the second predicts the next speaker in turn-taking. We constructed prediction models with either respiration or gaze behavior and with both respiration and gaze behaviors as features and compared their performance. The results suggest that the model with both respiration and gaze behaviors performs better than the one using only respiration or gaze behavior. This reveals that multimodal fusion using respiration and gaze behavior is effective for predicting the next speaker in multi-party meetings. It was also found that gaze behavior is more useful for predicting turn-keeping/turn-taking than respiration, and that respiration is more useful for predicting the next speaker in turn-taking.
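The two-step scheme described above can be sketched as two chained classifiers: the first decides turn-keeping versus turn-taking from respiration and gaze features at the end of an utterance, and the second, applied only to turn-taking cases, picks the next speaker. The features and data below are synthetic placeholders, not the paper's model.

```python
# Schematic two-step next-speaker predictor.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 8))            # respiration + gaze features per utterance end (hypothetical)
turn_taking = rng.integers(0, 2, n)    # 1 = the turn changes hands
next_speaker = rng.integers(0, 3, n)   # id among the 3 non-current participants

step1 = RandomForestClassifier(random_state=0).fit(X, turn_taking)
step2 = RandomForestClassifier(random_state=0).fit(X[turn_taking == 1],
                                                   next_speaker[turn_taking == 1])

def predict(x):
    x = x.reshape(1, -1)
    if step1.predict(x)[0] == 0:
        return "turn-keeping (current speaker continues)"
    return f"turn-taking, next speaker = participant {step2.predict(x)[0]}"

print(predict(X[0]))
```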

Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions Oral Session 4: Communication Dynamics / Oertel, Catharine / Mora, Kenneth A. Funes / Gustafson, Joakim / Odobez, Jean-Marc Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.107-114
ACM Digital Library Link
Summary: Estimating a silent participant's degree of engagement and his or her role within a group discussion can be challenging, as there are no speech-related cues available at the given time. Having this information available, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test in which subjects were asked to assess listener roles and engagement levels in 15-second video clips taken from a corpus of group interviews. Results show that humans are usually able to assess silent participant roles. Using these annotations and a set of multimodal low-level features, such as past speaking activity, backchannels (both visual and verbal), and gaze patterns, we identify the features that are able to distinguish between the different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.

Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors Oral Session 4: Communication Dynamics / Sadoughi, Najmeh / Busso, Carlos Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.115-122
ACM Digital Library Link
Summary: Creating believable behaviors for conversational agents (CAs) is a challenging task, given the complex relationship between speech and various nonverbal behaviors. The two main approaches are rule-based systems, which tend to produce behaviors with limited variations compared to natural interactions, and data-driven systems, which tend to ignore the underlying semantic meaning of the message (e.g., gestures without meaning). We envision a hybrid system, acting as the behavior realization layer in rule-based systems, while exploiting the rich variation in natural interactions. Constrained on a given target gesture (e.g., head nod) and speech signal, the system will generate novel realizations learned from the data, capturing the timely relationship between speech and gestures. An important task in this research is identifying multiple examples of the target gestures in the corpus. This paper proposes a data mining framework for detecting gestures of interest in a motion capture database. First, we train one-class support vector machines (SVMs) to detect candidate segments conveying the target gesture. Second, we use the dynamic time alignment kernel (DTAK) to compare the similarity between the examples (i.e., target gesture) and the given segments. We evaluate the approach for five prototypical hand and head gestures, showing reasonable performance. These retrieved gestures are then used to train a speech-driven framework based on dynamic Bayesian networks (DBNs) to synthesize these target behaviors.
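The retrieval pipeline above can be sketched roughly as follows: a one-class SVM flags candidate motion segments resembling seed examples of the target gesture, and a sequence-alignment distance (plain DTW is used here as a simple stand-in for DTAK) ranks the candidates. Shapes, features and thresholds are invented.

```python
# Sketch: one-class SVM candidate detection followed by alignment-based ranking.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
seed_segments = rng.normal(size=(20, 30, 3))   # 20 examples x 30 frames x 3 channels (hypothetical)
candidates = rng.normal(size=(100, 30, 3))     # segments from the motion database (hypothetical)

# Step 1: one-class SVM on simple per-segment statistics (mean and std per channel).
def summarize(segs):
    return np.hstack([segs.mean(axis=1), segs.std(axis=1)])

ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(summarize(seed_segments))
keep = ocsvm.predict(summarize(candidates)) == 1   # +1 = resembles the target gesture

# Step 2: rank surviving candidates by DTW distance to the nearest seed example.
def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ranked = sorted(np.flatnonzero(keep),
                key=lambda i: min(dtw(candidates[i], s) for s in seed_segments))
print("top retrieved segments:", ranked[:5])
```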

Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input Oral Session 5: Interaction Techniques / Klamka, Konstantin / Siegel, Andreas / Vogt, Stefan / Göbel, Fabian / Stellmach, Sophie / Dachselt, Raimund Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.123-130
ACM Digital Library Link
Summary: For a desktop computer, we investigate how to enhance conventional mouse and keyboard interaction by combining the input modalities gaze and foot. This multimodal approach offers the potential for fluently performing both manual input (e.g., for precise object selection) and gaze-supported foot input (for pan and zoom) in zoomable information spaces in quick succession or even in parallel. For this, we take advantage of fast gaze input to implicitly indicate where to navigate to and additional explicit foot input for speed control while leaving the hands free for further manual input. This allows for taking advantage of gaze input in a subtle and unobtrusive way. We have carefully elaborated and investigated three variants of foot controls incorporating one-, two- and multidirectional foot pedals in combination with gaze. These were evaluated and compared to mouse-only input in a user study using Google Earth as a geographic information system. The results suggest that gaze-supported foot input is feasible for convenient, user-friendly navigation and comparable to mouse input and encourage further investigations of gaze-supported foot controls.

Gaze+Gesture: Expressive, Precise and Targeted Free-Space Interactions Oral Session 5: Interaction Techniques / Chatterjee, Ishan / Xiao, Robert / Harrison, Chris Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.131-138
ACM Digital Library Link
Summary: Humans rely on eye gaze and hand manipulations extensively in their everyday activities. Most often, users gaze at an object to perceive it and then use their hands to manipulate it. We propose applying a multimodal, gaze plus free-space gesture approach to enable rapid, precise and expressive touch-free interactions. We show the input methods are highly complementary, mitigating issues of imprecision and limited expressivity in gaze-alone systems, and issues of targeting speed in gesture-alone systems. We extend an existing interaction taxonomy that naturally divides the gaze+gesture interaction space, which we then populate with a series of example interaction techniques to illustrate the character and utility of each method. We contextualize these interaction techniques in three example scenarios. In our user study, we pit our approach against five contemporary approaches; results show that gaze+gesture can outperform systems using gaze or gesture alone, and in general, approach the performance of "gold standard" input systems, such as the mouse and trackpad.

Digital Flavor: Towards Digitally Simulating Virtual Flavors Oral Session 5: Interaction Techniques / Ranasinghe, Nimesha / Suthokumar, Gajan / Lee, Kuan-Yi / Do, Ellen Yi-Luen Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.139-146
ACM Digital Library Link
Summary: Flavor is often a pleasurable sensory perception we experience daily while eating and drinking. However, the sensation of flavor is rarely considered in the age of digital communication, mainly due to the unavailability of flavors as a digitally controllable medium. This paper introduces a digital instrument (Digital Flavor Synthesizing device), which actuates taste (electrical and thermal stimulation) and smell sensations (controlled scent emitting) together to simulate different flavors digitally. A preliminary user experiment is conducted to study the effectiveness of this method with five predefined flavor stimuli. Experimental results show that users were effectively able to identify different flavors such as minty, spicy, and lemony. Moreover, we outline several challenges ahead along with future possibilities of this technology. In summary, our work demonstrates a novel controllable instrument for flavor simulation, which will be valuable in multimodal interactive systems for rendering virtual flavors digitally.

Different Strokes and Different Folks: Economical Dynamic Surface Sensing and Affect-Related Touch Recognition Oral Session 5: Interaction Techniques / Cang, Xi Laura / Bucci, Paul / Strang, Andrew / Allen, Jeff / MacLean, Karon / Liu, H. Y. Sean Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.147-154
ACM Digital Library Link
Summary: Social touch is an essential non-verbal channel whose great interactive potential can be realized by the ability to recognize gestures performed on inviting surfaces. To assess impact on recognition performance of sensor motion, substrate and coverings, we collected gesture data from a low-cost multitouch fabric pressure-location sensor while varying these factors. For six gestures most relevant in a haptic social robot context plus a no-touch control, we conducted two studies, with the sensor (1) stationary, varying substrate and cover (n=10); and (2) attached to a robot under a fur covering, flexing or stationary (n=16). For a stationary sensor, a random forest model achieved 90.0% recognition accuracy (chance 14.2%) when trained on all data, but as high as 94.6% (mean 89.1%) when trained on the same individual. A curved, flexing surface achieved 79.4% overall but averaged 85.7% when trained and tested on the same individual. These results suggest that under realistic conditions, recognition with this type of flexible sensor is sufficient for many applications of interactive social touch. We further found evidence that users exhibit an idiosyncratic 'touch signature', with potential to identify the toucher. Both findings enable varied contexts of affective or functional touch communication, from physically interactive robots to any touch-sensitive object.
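A bare-bones sketch of this kind of recognition setup: pressure-location frames from a fabric sensor are summarized per gesture window and classified with a random forest. The features, gesture labels and data below are placeholders, not the study's actual pipeline.

```python
# Sketch: random-forest classification of touch gestures from pressure-frame windows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

GESTURES = ["stroke", "pat", "scratch", "squeeze", "tickle", "constant", "no-touch"]
rng = np.random.default_rng(5)

def window_features(frames):
    """frames: (n_frames, rows, cols) pressure images for one gesture window."""
    per_frame = frames.reshape(len(frames), -1)
    return np.concatenate([per_frame.mean(axis=0), per_frame.max(axis=0),
                           per_frame.std(axis=0)])

# Synthetic stand-in data: 700 windows of 20 frames on a 10x10 taxel grid.
X = np.stack([window_features(rng.random((20, 10, 10))) for _ in range(700)])
y = rng.integers(0, len(GESTURES), size=700)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```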

MPHA: A Personal Hearing Doctor Based on Mobile Devices Oral Session 6: Mobile and Wearable / Wu, Yuhao / Jia, Jia / Leung, WaiKim / Liu, Yejun / Cai, Lianhong Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.155-162
ACM Digital Library Link
Summary: As more and more people want to know the condition of their hearing, audiometry is becoming increasingly important. However, traditional audiometric methods require the involvement of audiometers, which are very expensive, and the procedure is time consuming. In this paper, we present mobile personal hearing assessment (MPHA), a novel interactive mode for testing hearing level based on mobile devices. MPHA 1) provides a general method to calibrate sound intensity for mobile devices to guarantee the reliability and validity of the audiometry system; and 2) designs an audiometric correction algorithm for real noisy audiometric environments. The experimental results show that MPHA is reliable and valid compared with conventional audiometric assessment.

Towards Attentive, Bi-directional MOOC Learning on Mobile Devices Oral Session 6: Mobile and Wearable / Xiao, Xiang / Wang, Jingtao Proceedings of the 2015 International Conference on Multimodal Interaction 2015-11-09 p.163-170
ACM Digital Library Link
Summary: AttentiveLearner is a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner converts the built-in camera of mobile devices into both a tangible video control channel and an implicit heart rate sensing channel by analyzing the learner's fingertip transparency changes in real time. In this paper, we report disciplined research efforts in making AttentiveLearner truly practical in real-world use. Through two 18-participant user studies and follow-up analyses, we found that 1) the tangible video control interface is intuitive to use and efficient to operate; 2) heart rate signals implicitly captured by AttentiveLearner can be used to infer both the learner's interests and perceived confusion levels towards the corresponding learning topics; 3) AttentiveLearner can achieve significantly higher accuracy by predicting extreme personal learning events and aggregated learning events.
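A simplified sketch of the camera-based heart-rate sensing AttentiveLearner relies on: the mean brightness of fingertip-covered frames forms a photoplethysmographic signal whose peak spacing gives beats per minute. The signal below is synthesized; a real pipeline would de-trend and band-pass filter actual frame data.

```python
# Sketch: heart-rate estimation from a fingertip brightness signal via peak detection.
import numpy as np
from scipy.signal import find_peaks

fps = 30.0                      # camera frame rate (assumed)
t = np.arange(0, 30, 1 / fps)   # 30 seconds of frames
bpm_true = 72                   # synthetic ground truth
brightness = (0.5 + 0.05 * np.sin(2 * np.pi * (bpm_true / 60.0) * t)
              + 0.01 * np.random.default_rng(6).normal(size=t.size))

# Peaks must be at least ~0.4 s apart (i.e. below 150 bpm).
peaks, _ = find_peaks(brightness, distance=int(0.4 * fps))
bpm_est = 60.0 * (len(peaks) - 1) / ((peaks[-1] - peaks[0]) / fps)
print(f"estimated heart rate: {bpm_est:.1f} bpm")
```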