Sharing Representations for Long Tail Computer Vision Problems
Keynote Address 1
/
Bengio, Samy
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.1
© Copyright 2015 ACM
Summary: The long tail phenomenon appears when a small number of objects/words/classes
are very frequent and thus easy to model, while many more are rare and
thus hard to model. This has always been a problem in machine learning. We
start by explaining why representation sharing in general, and embedding
approaches in particular, can help to represent tail objects. Several embedding
approaches are presented, in increasing levels of complexity, to show how to
tackle the long tail problem, from rare classes to unseen classes in image
classification (the so-called zero-shot setting). Finally, we present our
latest results on image captioning, which can be seen as an ultimate rare class
problem since each image is attributed to a novel, yet structured, class in the
form of a meaningful descriptive sentence.
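To make the embedding idea concrete, the following minimal Python sketch (with random vectors standing in for learned embeddings; it is not the speaker's actual model) assigns an image to the nearest class embedding, which also works for tail or zero-shot classes that were never seen during image training.

    import numpy as np

    def zero_shot_predict(image_embedding, label_embeddings):
        """Assign the image to the class whose label embedding is most similar."""
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        # label_embeddings may contain classes with no training images at all.
        scores = {name: cosine(image_embedding, vec) for name, vec in label_embeddings.items()}
        return max(scores, key=scores.get)

    # Toy usage: random vectors stand in for an image encoder and word embeddings.
    rng = np.random.default_rng(0)
    labels = {"okapi": rng.normal(size=128), "zebra": rng.normal(size=128)}
    print(zero_shot_predict(rng.normal(size=128), labels))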
Interaction Studies with Social Robots
Keynote Address 2
/
Dautenhahn, Kerstin
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.3
© Copyright 2015 ACM
Summary: Over the past 10 years we have seen an immense worldwide growth of research
and development into companion robots. These are robots that fulfil particular
tasks, but do so in a socially acceptable manner. The companionship aspect
reflects the repeated and long-term nature of such interactions, and the
potential of people to form relationships with such robots, e.g. as friendly
assistants. A number of companion and assistant robots have been entering the
market; two of the latest examples are Aldebaran's Pepper robot and Jibo
(Cynthia Breazeal). Companion robots are increasingly targeting particular
application areas, e.g. as home assistants or therapeutic tools. Research into
companion robots needs to address many fundamental research problems concerning
perception, cognition, action and learning, but regardless of how sophisticated
our robotic systems may be, the potential users need to be taken into account
from the early stages of development. The talk will emphasize the need for a
highly user-centred approach towards design, development and evaluation of
companion robots. An important challenge is to evaluate robots in realistic and
long-term scenarios, in order to capture as closely as possible those key
aspects that will play a role when using such robots in the real world. In
order to illustrate these points, my talk will give examples of interaction
studies that my research team has been involved in. These include studies of
how people perceive robots' non-verbal cues, the creation and evaluation of realistic
scenarios for home companion robots using narrative framing, and verbal and
tactile interaction of children with the therapeutic and social robot Kaspar.
The talk will highlight the issues we encountered when we proceeded from
laboratory-based experiments and prototypes to real-world applications.
Connections: 2015 ICMI Sustained Accomplishment Award Lecture
Keynote Address 3 (Sustained Accomplishment Award Talk)
/
Horvitz, Eric
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.5
© Copyright 2015 ACM
Summary: Our community has long pursued principles and methods for enabling fluid and
effortless collaborations between people and computing systems. Forging deep
connections between people and machines has come into focus over the last 25
years as a grand challenge at the intersection of artificial intelligence,
human-computer interaction, and cognitive psychology. I will review experiences
and directions with leveraging advances in perception, learning, and reasoning
in pursuit of our shared dreams.
Combining Two Perspectives on Classifying Multimodal Data for Recognizing
Speaker Traits
Oral Session 1: Machine Learning in Multimodal Systems
/
Chatterjee, Moitreya
/
Park, Sunghyun
/
Morency, Louis-Philippe
/
Scherer, Stefan
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.7-14
© Copyright 2015 ACM
Summary: Human communication involves conveying messages through both verbal and
non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless,
the task of having a computer learn these patterns by combining cues from
multiple modalities is challenging, because it requires effective representation
of the signals while also taking into consideration the complex interactions
between them. From the machine learning perspective this presents a two-fold
challenge: a) modeling the intermodal variations and dependencies; b)
representing the data with an apt number of features, such that the necessary
patterns are captured while concerns such as over-fitting are allayed. In this
work we attempt to address these aspects of multimodal recognition, in the
context of recognizing two essential speaker traits, namely the passion and
credibility of online movie reviewers. We propose a novel ensemble
classification approach that combines two different perspectives on classifying
multimodal data, each of which attempts to independently address
the two-fold challenge. In the first, we combine the features from multiple
modalities but assume inter-modality conditional independence. In the second,
we explicitly capture the correlation between the modalities, but in a
low-dimensional space, and explore a novel clustering-based kernel similarity
approach for recognition. Additionally, this work investigates a recent
technique for encoding text data that captures the semantic similarity of verbal
content and preserves word ordering. The experimental results on a recent
public dataset show a significant improvement of our approach over multiple
baselines. Finally, we also analyze the most discriminative elements of a
speaker's non-verbal behavior that contribute to his or her perceived
credibility or passion.
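As a rough illustration of combining two classification perspectives, the sketch below (hypothetical features and scikit-learn models, not the authors' system) late-fuses a naive Bayes classifier, which assumes conditional independence across modality features, with a kernel classifier trained on a second, lower-dimensional view, by averaging their class probabilities.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB   # perspective 1: conditional independence
    from sklearn.svm import SVC                  # perspective 2: kernel-similarity stand-in

    def ensemble_predict(Xa_train, Xb_train, y_train, Xa_test, Xb_test):
        """Average class probabilities from two classifiers trained on two feature views."""
        clf_a = GaussianNB().fit(Xa_train, y_train)
        clf_b = SVC(probability=True).fit(Xb_train, y_train)
        proba = 0.5 * clf_a.predict_proba(Xa_test) + 0.5 * clf_b.predict_proba(Xb_test)
        return proba.argmax(axis=1)

    # Toy usage with synthetic data: view A (e.g. concatenated modality features),
    # view B (e.g. a low-dimensional correlated representation), binary trait labels.
    rng = np.random.default_rng(1)
    Xa, Xb, y = rng.normal(size=(40, 5)), rng.normal(size=(40, 8)), np.arange(40) % 2
    print(ensemble_predict(Xa, Xb, y, Xa[:5], Xb[:5]))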
Personality Trait Classification via Co-Occurrent Multiparty Multimodal
Event Discovery
Oral Session 1: Machine Learning in Multimodal Systems
/
Okada, Shogo
/
Aran, Oya
/
Gatica-Perez, Daniel
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.15-22
© Copyright 2015 ACM
Summary: This paper proposes a novel feature extraction framework for multi-party
multimodal conversation, for the inference of personality traits and emergent
leadership. The proposed framework represents multimodal features as the
combination of each participant's nonverbal activity and the group activity. This
feature representation makes it possible to compare the nonverbal patterns extracted from
the participants of different groups in a metric space. It captures how the
target member exhibits nonverbal behavior observed in the group (e.g. the member
speaks while all members move their bodies), and can be applied to any kind of
multiparty conversation task. Frequent co-occurring events are discovered from
the multimodal sequences using graph clustering. The proposed framework is applied
to the ELEA corpus, an audiovisual dataset collected from group
meetings. We evaluate the framework on the binary classification of 10
personality traits. Experimental results show that the model trained with
co-occurrence features obtains higher accuracy than previous related work for
8 out of 10 traits. In addition, the co-occurrence features improve the
accuracy by 2% to 17%.
Evaluating Speech, Face, Emotion and Body Movement Time-series Features for
Automated Multimodal Presentation Scoring
Oral Session 1: Machine Learning in Multimodal Systems
/
Ramanarayanan, Vikram
/
Leong, Chee Wee
/
Chen, Lei
/
Feng, Gary
/
Suendermann-Oeft, David
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.23-30
© Copyright 2015 ACM
Summary: We analyze how fusing features obtained from different multimodal data
streams such as speech, face, body movement and emotion tracks can be applied
to the scoring of multimodal presentations. We compute both time-aggregated and
time-series-based features from these data streams -- the former being
statistical functionals and other cumulative features computed over the entire
time series, while the latter, dubbed histograms of co-occurrences, capture how
different prototypical body postures or facial configurations co-occur within
different time-lags of each other over the evolution of the multimodal,
multivariate time series. We examine the relative utility of these features,
along with curated speech stream features, in predicting human-rated scores of
multiple aspects of presentation proficiency. We find that different modalities
are useful in predicting different aspects, even outperforming a naive human
inter-rater agreement baseline for a subset of the aspects analyzed.
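The histogram-of-co-occurrences idea can be sketched as follows, assuming each frame has already been quantized into one of a small set of prototypical postures or expressions (the exact formulation in the paper may differ).

    import numpy as np

    def histogram_of_cooccurrences(labels, n_prototypes, max_lag):
        """Count how often prototype i is followed by prototype j at each lag 1..max_lag.

        labels: per-frame prototype indices (e.g. quantized postures).
        Returns a flattened vector of length max_lag * n_prototypes**2.
        """
        labels = np.asarray(labels)
        hist = np.zeros((max_lag, n_prototypes, n_prototypes))
        for lag in range(1, max_lag + 1):
            for i, j in zip(labels[:-lag], labels[lag:]):
                hist[lag - 1, i, j] += 1
        return hist.ravel()

    # Toy usage on a short label sequence with 3 prototypes and lags of 1 and 2 frames.
    print(histogram_of_cooccurrences([0, 1, 1, 2, 0, 1], n_prototypes=3, max_lag=2))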
Gender Representation in Cinematic Content: A Multimodal Approach
Oral Session 1: Machine Learning in Multimodal Systems
/
Guha, Tanaya
/
Huang, Che-Wei
/
Kumar, Naveen
/
Zhu, Yan
/
Narayanan, Shrikanth S.
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.31-34
© Copyright 2015 ACM
Summary: The goal of this paper is to enable an objective understanding of gender
portrayals in popular films and media through multimodal content analysis. An
automated system for analyzing gender representation in terms of screen
presence and speaking time is developed. First, we perform independent
processing of the video and the audio content to estimate gender distribution
of screen presence at shot level, and of speech at utterance level. A measure
of the movie's excitement or intensity is computed using audiovisual features
for every scene. This measure is used as a weighting function to combine the
gender-based screen/speaking time information at shot/utterance level to
compute gender representation for the entire movie. Detailed results and
analyses are presented for seventeen full-length Hollywood movies.
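The weighting scheme can be pictured with a small sketch: per-shot gender screen time is combined using the scene's excitement score as the weight (the field names below are hypothetical, not the paper's data format).

    def weighted_gender_representation(shots):
        """Return the excitement-weighted fraction of screen time for female characters.

        shots: list of dicts like {"female_time": s, "male_time": s, "excitement": w}.
        """
        female = sum(s["excitement"] * s["female_time"] for s in shots)
        total = sum(s["excitement"] * (s["female_time"] + s["male_time"]) for s in shots)
        return female / total if total else 0.0

    print(weighted_gender_representation([
        {"female_time": 10, "male_time": 30, "excitement": 0.8},
        {"female_time": 25, "male_time": 15, "excitement": 0.4},
    ]))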
Effects of Good Speaking Techniques on Audience Engagement
Oral Session 2: Audio-Visual, Multimodal Inference
/
Curtis, Keith
/
Jones, Gareth J. F.
/
Campbell, Nick
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.35-42
© Copyright 2015 ACM
Summary: Understanding audience engagement levels for presentations has the potential
to enable richer and more focused interaction with audio-visual recordings. We
describe an investigation into automated analysis of multimodal recordings of
scientific talks where the use of modalities most typically associated with
engagement such as eye-gaze is not feasible. We first study visual and acoustic
features to identify those most commonly associated with good speaking
techniques. To understand audience interpretation of good speaking techniques,
we engaged human annotators to rate the qualities of the speaker for a series
of 30-second video segments taken from a corpus of 9 hours of presentations
from an academic conference. Our annotators also watched corresponding video
recordings of the audience to presentations to estimate the level of audience
engagement for each talk. We then explored the effectiveness of multimodal
features extracted from the presentation video against the Likert-scale ratings of
each speaker assigned by the annotators, and against the manually labelled audience
engagement levels. These features were used to build a classifier to rate the
qualities of a new speaker; this classifier was able to assign a presenter's rating
over an 8-class range with an accuracy of 52%. By combining these classes into a
4-class range, accuracy increases to 73%. We analyse linear correlations between
individual speaker-based modalities and actual audience engagement levels to
understand the corresponding effect on audience engagement. A further
classifier was then built to predict the level of audience engagement with a
presentation by analysing the speaker's use of acoustic and visual cues. Using
these speaker-based modalities pre-fused with speaker ratings only, we are able
to predict actual audience engagement levels with an accuracy of 68%. By
combining these with basic visual features from the audience as a whole, we are able to
improve this to an accuracy of 70%.
Multimodal Public Speaking Performance Assessment
Oral Session 2: Audio-Visual, Multimodal Inference
/
Wörtwein, Torsten
/
Chollet, Mathieu
/
Schauerte, Boris
/
Morency, Louis-Philippe
/
Stiefelhagen, Rainer
/
Scherer, Stefan
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.43-50
© Copyright 2015 ACM
Summary: The ability to speak proficiently in public is essential for many
professions and in everyday life. Public speaking skills are difficult to
master and require extensive training. Recent developments in technology enable
new approaches for public speaking training that allow users to practice in
engaging and interactive environments. Here, we focus on the automatic
assessment of nonverbal behavior and multimodal modeling of public speaking
behavior. We automatically identify audiovisual nonverbal behaviors that are
correlated to expert judges' opinions of key performance aspects. These
automatic assessments enable a virtual audience to provide feedback that is
essential for training during a public speaking performance. We utilize
multimodal ensemble tree learners to automatically approximate expert judges'
evaluations to provide post-hoc performance assessments to the speakers. Our
automatic performance evaluation is highly correlated with the experts'
opinions with r = 0.745 for the overall performance assessments. We compare
multimodal approaches with single modalities and find that the multimodal
ensembles consistently outperform single modalities.
I Would Hire You in a Minute: Thin Slices of Nonverbal Behavior in Job
Interviews
Oral Session 2: Audio-Visual, Multimodal Inference
/
Nguyen, Laurent Son
/
Gatica-Perez, Daniel
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.51-58
© Copyright 2015 ACM
Summary: In everyday life, judgments people make about others are based on brief
excerpts of interactions, known as thin slices. Inferences stemming from such
minimal information can be quite accurate, and nonverbal behavior plays an
important role in impression formation. Because the protagonists are strangers,
employment interviews are a case where both nonverbal behavior and thin slices
can be predictive of outcomes. In this work, we analyze the predictive validity
of thin slices of real job interviews, where slices are defined by the sequence
of questions in a structured interview format. We approach this problem from an
audio-visual, dyadic, and nonverbal perspective, where sensing, cue extraction,
and inference are automated. Our study shows that although nonverbal behavioral
cues extracted from thin slices were not as predictive as when extracted from
the full interaction, they were still predictive of hirability impressions with
R² values up to 0.34, which was comparable to the predictive validity of
human observers on thin slices. Applicant audio cues were found to yield the
most accurate results.
Deception Detection using Real-life Trial Data
Oral Session 2: Audio-Visual, Multimodal Inference
/
Pérez-Rosas, Verónica
/
Abouelenien, Mohamed
/
Mihalcea, Rada
/
Burzo, Mihai
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.59-66
© Copyright 2015 ACM
Summary: Hearings of witnesses and defendants play a crucial role when reaching court
trial decisions. Given the high-stake nature of trial outcomes, implementing
accurate and effective computational methods to evaluate the honesty of court
testimonies can offer valuable support during the decision-making process. In
this paper, we address the identification of deception in real-life trial data.
We introduce a novel dataset consisting of videos collected from public court
trials. We explore the use of verbal and non-verbal modalities to build a
multimodal deception detection system that aims to discriminate between
truthful and deceptive statements provided by defendants and witnesses. We
achieve classification accuracies in the range of 60-75% when using a model
that extracts and fuses features from the linguistic and gesture modalities. In
addition, we present a human deception detection study where we evaluate the
human capability of detecting deception in trial hearings. The results show
that our system outperforms the human capability of identifying deceit.
Exploring Turn-taking Cues in Multi-party Human-robot Discussions about
Objects
Oral Session 3: Language, Speech and Dialog
/
Skantze, Gabriel
/
Johansson, Martin
/
Beskow, Jonas
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.67-74
© Copyright 2015 ACM
Summary: In this paper, we present a dialog system that was exhibited at the Swedish
National Museum of Science and Technology. Two visitors at a time could play a
collaborative card sorting game with the robot head Furhat, in which the
three players discuss the solution together. The cards are shown on a touch
table between the players, thus constituting a target for joint attention. We
describe how the system was implemented in order to manage turn-taking and
attention to users and objects in the shared physical space. We also discuss
how multi-modal redundancy (from speech, card movements and head pose) is
exploited to maintain meaningful discussions, given that the system has to
process conversational speech from both children and adults in a noisy
environment. Finally, we present an analysis of 373 interactions, where we
investigate the robustness of the system, to what extent the system's attention
can shape the users' turn-taking behaviour, and how the system can produce
multi-modal turn-taking signals (filled pauses, facial gestures, breath and
gaze) to deal with processing delays in the system.
Visual Saliency and Crowdsourcing-based Priors for an In-car Situated Dialog
System
Oral Session 3: Language, Speech and Dialog
/
Misu, Teruhisa
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.75-82
© Copyright 2015 ACM
Summary: This paper addresses issues in situated language understanding in a moving
car. We propose a reference resolution method to identify user queries about
specific target objects in their surroundings. We investigate methods of
predicting which target object is likely to be queried given a visual scene and
what kind of linguistic cues users naturally provide to describe a given target
object in a situated environment. We propose methods to incorporate the visual
saliency of the visual scene as a prior. Crowdsourced statistics of how people
describe an object are also used as a prior. We have collected situated
utterances from drivers using our research system, which was embedded in a real
vehicle. We demonstrate that the proposed algorithms improve target
identification rate by 15.1%.
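The general idea of combining a saliency prior with linguistic evidence can be sketched as a simple product-of-scores ranking (the candidate fields and scores below are hypothetical, not the system's actual features).

    def resolve_reference(candidates):
        """Rank candidate target objects by saliency prior times description match.

        candidates: list of dicts like
          {"name": str, "saliency": float, "description_match": float},
        where description_match scores how well the user's words fit crowdsourced
        descriptions of the object.
        """
        scored = [(c["saliency"] * c["description_match"], c["name"]) for c in candidates]
        total = sum(score for score, _ in scored) or 1.0
        return sorted(((score / total, name) for score, name in scored), reverse=True)

    print(resolve_reference([
        {"name": "red cafe sign", "saliency": 0.7, "description_match": 0.9},
        {"name": "parked truck", "saliency": 0.3, "description_match": 0.2},
    ]))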
Leveraging Behavioral Patterns of Mobile Applications for Personalized
Spoken Language Understanding
Oral Session 3: Language, Speech and Dialog
/
Chen, Yun-Nung
/
Sun, Ming
/
Rudnicky, Alexander I.
/
Gershman, Anatole
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.83-86
© Copyright 2015 ACM
Summary: Spoken language interfaces are appearing in various smart devices (e.g.
smart-phones, smart TVs, in-car navigation systems) and serve as intelligent
assistants (IAs). However, most of them do not consider individual users'
behavioral profiles and contexts when modeling user intents. Such behavioral
patterns are user-specific and provide useful cues to improve spoken language
understanding (SLU). This paper focuses on leveraging app behavior history
to improve spoken dialog system performance. We developed a matrix
factorization approach that models speech and app usage patterns to predict
user intents (e.g. launching a specific app). We collected multi-turn
interactions in a WoZ scenario, in which users were asked to reproduce the multi-app
tasks that they had performed earlier on their smart-phones. By modeling the latent
semantics behind lexical and behavioral patterns, the proposed multi-model
system achieves about 52% turn accuracy for intent prediction on ASR
transcripts.
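One way to picture the matrix factorization idea is a low-rank reconstruction of a turns-by-columns matrix whose columns stack lexical cues, app-usage cues, and candidate intents; the reconstructed intent columns then score unlabeled turns. The tiny example below is purely illustrative and not the authors' model.

    import numpy as np

    def low_rank_scores(matrix, rank=2):
        """Smooth a (turns x features/intents) matrix with a rank-k SVD approximation."""
        u, s, vt = np.linalg.svd(matrix, full_matrices=False)
        return (u[:, :rank] * s[:rank]) @ vt[:rank]

    turns = np.array([
        # [said "navigate", said "play", used Maps, used Spotify, intent Maps, intent Spotify]
        [1, 0, 1, 0, 1, 0],
        [0, 1, 0, 1, 0, 1],
        [1, 0, 1, 0, 0, 0],   # unlabeled turn: intent columns unknown (zeros)
    ], dtype=float)
    print(np.round(low_rank_scores(turns)[-1, -2:], 2))  # reconstructed intent scores for the last turn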
Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video
Oral Session 3: Language, Speech and Dialog
/
Chakravarty, Punarjay
/
Mirzaei, Sayeh
/
Tuytelaars, Tinne
/
Van hamme, Hugo
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.87-90
© Copyright 2015 ACM
Summary: Active speakers have traditionally been identified in video by detecting
their moving lips. This paper demonstrates the same using spatio-temporal
features that aim to capture other cues: movement of the head, upper body and
hands of active speakers. Speaker directional information, obtained using sound
source localization from a microphone array, is used to supervise the training
of these video features.
Predicting Participation Styles using Co-occurrence Patterns of Nonverbal
Behaviors in Collaborative Learning
Oral Session 4: Communication Dynamics
/
Nakano, Yukiko I.
/
Nihonyanagi, Sakiko
/
Takase, Yutaka
/
Hayashi, Yuki
/
Okada, Shogo
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.91-98
© Copyright 2015 ACM
Summary: With the goal of assessing participant attitudes and group activities in
collaborative learning, this study presents models of participation styles
based on co-occurrence patterns of nonverbal behaviors between conversational
participants. First, we collected conversations among groups of three people in
a collaborative learning situation, wherein each participant had a digital pen
and wore a glasses-type eye tracker. We then divided the collected multimodal
data into 0.1-second intervals. An unsupervised method was then applied to the
discretized data to find co-occurring behavioral patterns. As a result, we
discovered 122 multimodal behavioral motifs from more than 3,000 possible
combinations of behaviors by three participants. Using the multimodal
behavioral motifs as predictor variables, we created regression models for
assessing participation styles. The multiple correlation coefficients ranged
from 0.74 to 0.84, indicating a good fit between the models and the data. A
correlation analysis also enabled us to identify a smaller set of behavioral
motifs (fewer than 30) that are statistically significant as predictors of
participation styles. These results show that automatically discovered
combinations of multiple kinds of nonverbal information, observed with high
co-occurrence frequencies both across multiple participants and for a single
participant, are useful in characterizing a participant's attitudes towards
the conversation.
Multimodal Fusion using Respiration and Gaze for Predicting Next Speaker in
Multi-Party Meetings
Oral Session 4: Communication Dynamics
/
Ishii, Ryo
/
Kumano, Shiro
/
Otsuka, Kazuhiro
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.99-106
© Copyright 2015 ACM
Summary: Techniques that use nonverbal behaviors to predict turn-taking situations,
such as who will be the next speaker and the timing of the next utterance in
multi-party meetings, have recently been receiving a lot of attention. It has long
been known that gaze is a physical behavior that plays an important role in
transferring the speaking turn between humans. Recently, a line of research has
focused on the relationship between turn-taking and respiration, a biological
signal that conveys information about the intention or preliminary action to
start to speak. It has been demonstrated that respiration and gaze behavior
separately have the potential to predict the next speaker and the next
utterance timing in multi-party meetings. To create multimodal fusion
models for predicting the next speaker in multi-party meetings, we integrated
respiration and gaze behavior, which are extracted from different modalities
and are completely different in quality, and implemented a model that uses
information about them to predict the next speaker at the end of an utterance.
The model performs two-step processing: the first step predicts whether
turn-keeping or turn-taking happens; the second predicts the next speaker
in turn-taking. We constructed prediction models with either respiration or
gaze behavior and with both respiration and gaze behaviors as features and
compared their performance. The results suggest that the model with both
respiration and gaze behaviors performs better than the one using only
respiration or gaze behavior. This reveals that multimodal fusion of
respiration and gaze behavior is effective for predicting the next speaker in
multi-party meetings. We also found that gaze behavior is more useful than
respiration for predicting turn-keeping versus turn-taking, and that respiration is
more useful for predicting the next speaker in turn-taking.
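The two-step structure can be sketched schematically with two classifiers, one for turn-keeping versus turn-taking and one for the next speaker; the features and models below are placeholders, not the authors' prediction model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class TwoStepNextSpeaker:
        """Step 1: keep vs. take. Step 2: which listener takes the turn."""

        def fit(self, X_keep, y_keep, X_who, y_who):
            self.keep_clf = LogisticRegression().fit(X_keep, y_keep)  # 1 = turn-taking
            self.who_clf = LogisticRegression().fit(X_who, y_who)     # listener index
            return self

        def predict(self, x_keep, x_who, current_speaker):
            if self.keep_clf.predict([x_keep])[0] == 0:
                return current_speaker                 # turn-keeping: speaker continues
            return self.who_clf.predict([x_who])[0]    # turn-taking: predicted next speaker

    # Toy usage with synthetic respiration/gaze feature vectors of length 4.
    rng = np.random.default_rng(2)
    model = TwoStepNextSpeaker().fit(rng.normal(size=(30, 4)), np.arange(30) % 2,
                                     rng.normal(size=(30, 4)), np.arange(30) % 3)
    print(model.predict(rng.normal(size=4), rng.normal(size=4), current_speaker=0))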
Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the
Classification of Listener Categories in Group Discussions
Oral Session 4: Communication Dynamics
/
Oertel, Catharine
/
Mora, Kenneth A. Funes
/
Gustafson, Joakim
/
Odobez, Jean-Marc
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.107-114
© Copyright 2015 ACM
Summary: Estimating a silent participant's degree of engagement and role within a
group discussion can be challenging, as there are no speech-related cues
available at the given time. Having this information available, however, can
provide important insights into the dynamics of the group as a whole. In this
paper, we study the classification of listeners into several categories
(attentive listener, side participant and bystander). We devised a thin-sliced
perception test in which subjects were asked to assess listener roles and
engagement levels in 15-second video clips taken from a corpus of group
interviews. Results show that humans are usually able to assess silent
participant roles. Using these annotations together with a set of multimodal
low-level features, such as past speaking activity, backchannels (both visual
and verbal), and gaze patterns, we identified the features that are
able to distinguish between the different listener categories. Moreover, the
results show that many of the audio-visual effects observed on listeners in
dyadic interactions, also hold for multi-party interactions. A preliminary
classifier achieves an accuracy of 64%.
Retrieving Target Gestures Toward Speech Driven Animation with Meaningful
Behaviors
Oral Session 4: Communication Dynamics
/
Sadoughi, Najmeh
/
Busso, Carlos
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.115-122
© Copyright 2015 ACM
Summary: Creating believable behaviors for conversational agents (CAs) is a
challenging task, given the complex relationship between speech and various
nonverbal behaviors. The two main approaches are rule-based systems, which tend
to produce behaviors with limited variations compared to natural interactions,
and data-driven systems, which tend to ignore the underlying semantic meaning
of the message (e.g., gestures without meaning). We envision a hybrid system,
acting as the behavior realization layer in rule-based systems, while
exploiting the rich variation in natural interactions. Constrained by a given
target gesture (e.g., head nod) and speech signal, the system will generate
novel realizations learned from the data, capturing the timing relationship
between speech and gestures. An important task in this research is identifying
multiple examples of the target gestures in the corpus. This paper proposes a
data mining framework for detecting gestures of interest in a motion capture
database. First, we train one-class support vector machines (SVMs) to detect
candidate segments conveying the target gesture. Second, we use dynamic time
alignment kernel (DTAK) to compare the similarity between the examples (i.e.,
target gesture) and the given segments. We evaluate the approach for five
prototypical hand and head gestures, showing reasonable performance. These
retrieved gestures are then used to train a speech-driven framework based on
dynamic Bayesian networks (DBNs) to synthesize these target behaviors.
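The candidate-detection step can be sketched with a one-class SVM trained on feature summaries of the target gesture; the DTAK re-ranking step is omitted here, and the feature vectors are synthetic placeholders rather than the paper's motion-capture features.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def find_candidate_segments(target_examples, segments, nu=0.2):
        """Return indices of database segments the one-class SVM accepts as candidates."""
        detector = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(target_examples)
        return np.where(detector.predict(segments) == 1)[0]  # +1 = inlier (candidate)

    # Toy usage: 20 stand-ins for head-nod segments vs. a small mixed database.
    rng = np.random.default_rng(3)
    examples = rng.normal(loc=1.0, size=(20, 6))
    database = np.vstack([rng.normal(size=(20, 6)), rng.normal(loc=1.0, size=(5, 6))])
    print(find_candidate_segments(examples, database))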
Look & Pedal: Hands-free Navigation in Zoomable Information Spaces
through Gaze-supported Foot Input
Oral Session 5: Interaction Techniques
/
Klamka, Konstantin
/
Siegel, Andreas
/
Vogt, Stefan
/
Göbel, Fabian
/
Stellmach, Sophie
/
Dachselt, Raimund
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.123-130
© Copyright 2015 ACM
Summary: For a desktop computer, we investigate how to enhance conventional mouse and
keyboard interaction by combining the input modalities gaze and foot. This
multimodal approach offers the potential for fluently performing both manual
input (e.g., for precise object selection) and gaze-supported foot input (for
pan and zoom) in zoomable information spaces in quick succession or even in
parallel. For this, we take advantage of fast gaze input to implicitly indicate
where to navigate to and additional explicit foot input for speed control while
leaving the hands free for further manual input. This allows for taking
advantage of gaze input in a subtle and unobtrusive way. We have carefully
elaborated and investigated three variants of foot controls incorporating one-,
two- and multidirectional foot pedals in combination with gaze. These were
evaluated and compared to mouse-only input in a user study using Google Earth
as a geographic information system. The results suggest that gaze-supported
foot input is feasible for convenient, user-friendly navigation and comparable
to mouse input, and they encourage further investigation of gaze-supported foot
controls.
Gaze+Gesture: Expressive, Precise and Targeted Free-Space Interactions
Oral Session 5: Interaction Techniques
/
Chatterjee, Ishan
/
Xiao, Robert
/
Harrison, Chris
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.131-138
© Copyright 2015 ACM
Summary: Humans rely on eye gaze and hand manipulations extensively in their everyday
activities. Most often, users gaze at an object to perceive it and then use
their hands to manipulate it. We propose applying a multimodal, gaze plus
free-space gesture approach to enable rapid, precise and expressive touch-free
interactions. We show the input methods are highly complementary, mitigating
issues of imprecision and limited expressivity in gaze-alone systems, and
issues of targeting speed in gesture-alone systems. We extend an existing
interaction taxonomy that naturally divides the gaze+gesture interaction space,
which we then populate with a series of example interaction techniques to
illustrate the character and utility of each method. We contextualize these
interaction techniques in three example scenarios. In our user study, we pit
our approach against five contemporary approaches; results show that
gaze+gesture can outperform systems using gaze or gesture alone, and in
general, approach the performance of "gold standard" input systems, such as the
mouse and trackpad.
Digital Flavor: Towards Digitally Simulating Virtual Flavors
Oral Session 5: Interaction Techniques
/
Ranasinghe, Nimesha
/
Suthokumar, Gajan
/
Lee, Kuan-Yi
/
Do, Ellen Yi-Luen
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.139-146
© Copyright 2015 ACM
Summary: Flavor is often a pleasurable sensory perception we experience daily while
eating and drinking. However, the sensation of flavor is rarely considered in
the age of digital communication, mainly due to the unavailability of flavor as
a digitally controllable medium. This paper introduces a digital instrument
(Digital Flavor Synthesizing device), which actuates taste (electrical and
thermal stimulation) and smell sensations (controlled scent emitting) together
to simulate different flavors digitally. A preliminary user experiment is
conducted to study the effectiveness of this method with five predefined
flavor stimuli. Experimental results show that the users were
effectively able to identify different flavors such as minty, spicy, and
lemony. Moreover, we outline several challenges ahead along with future
possibilities of this technology. In summary, our work demonstrates a novel
controllable instrument for flavor simulation, which will be valuable in
multimodal interactive systems for rendering virtual flavors digitally.
Different Strokes and Different Folks: Economical Dynamic Surface Sensing
and Affect-Related Touch Recognition
Oral Session 5: Interaction Techniques
/
Cang, Xi Laura
/
Bucci, Paul
/
Strang, Andrew
/
Allen, Jeff
/
MacLean, Karon
/
Liu, H. Y. Sean
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.147-154
© Copyright 2015 ACM
Summary: Social touch is an essential non-verbal channel whose great interactive
potential can be realized by the ability to recognize gestures performed on
inviting surfaces. To assess the impact of sensor motion, substrate and
coverings on recognition performance, we collected gesture data from a low-cost
multitouch fabric pressure-location sensor while varying these factors. For six
gestures most relevant in a haptic social robot context plus a no-touch
control, we conducted two studies, with the sensor (1) stationary, varying
substrate and cover (n=10); and (2) attached to a robot under a fur covering,
flexing or stationary (n=16). For a stationary sensor, a random forest model
achieved 90.0% recognition accuracy (chance 14.2%) when trained on all data,
but as high as 94.6% (mean 89.1%) when trained on the same individual. A
curved, flexing surface achieved 79.4% overall but averaged 85.7% when trained
and tested on the same individual. These results suggest that under realistic
conditions, recognition with this type of flexible sensor is sufficient for
many applications of interactive social touch. We further found evidence that
users exhibit an idiosyncratic 'touch signature', with potential to identify
the toucher. Both findings enable varied contexts of affective or functional
touch communication, from physically interactive robots to any touch-sensitive
object.
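For illustration only, a random-forest gesture classifier of the kind described can be set up as below, assuming each touch sample has already been summarized into a fixed-length feature vector (the gesture list and synthetic features are placeholders, not the study's data).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    GESTURES = ["stroke", "pat", "scratch", "squeeze", "tickle", "constant", "no-touch"]

    def train_touch_classifier(features, labels):
        """features: (n_samples, n_features) summaries of pressure/location frames."""
        return RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels)

    # Toy usage with synthetic feature vectors, 10 samples per gesture.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(70, 12))
    y = np.repeat(GESTURES, 10)
    clf = train_touch_classifier(X, y)
    print(clf.predict(X[:3]))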
MPHA: A Personal Hearing Doctor Based on Mobile Devices
Oral Session 6: Mobile and Wearable
/
Wu, Yuhao
/
Jia, Jia
/
Leung, WaiKim
/
Liu, Yejun
/
Cai, Lianhong
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.155-162
© Copyright 2015 ACM
Summary: As more and more people want to know the condition of their hearing,
audiometry is becoming increasingly important. However, the traditional audiometric
method is time-consuming and requires the involvement of audiometers, which are
very expensive. In this paper, we present mobile personal hearing assessment
(MPHA), a novel interactive mode for testing hearing level based on mobile
devices. MPHA 1) provides a general method to calibrate sound intensity for
mobile devices to guarantee the reliability and validity of the audiometry
system; and 2) designs an audiometric correction algorithm for the real noisy
audiometric environment. The experimental results show that MPHA is reliable
and valid compared with conventional audiometric assessment.
Towards Attentive, Bi-directional MOOC Learning on Mobile Devices
Oral Session 6: Mobile and Wearable
/
Xiao, Xiang
/
Wang, Jingtao
Proceedings of the 2015 International Conference on Multimodal Interaction
2015-11-09
p.163-170
© Copyright 2015 ACM
Summary: AttentiveLearner is a mobile learning system optimized for consuming lecture
videos in Massive Open Online Courses (MOOCs) and flipped classrooms.
AttentiveLearner converts the built-in camera of mobile devices into both a
tangible video control channel and an implicit heart rate sensing channel by
analyzing the learner's fingertip transparency changes in real time. In this
paper, we report disciplined research efforts in making AttentiveLearner truly
practical in real-world use. Through two 18-participant user studies and
follow-up analyses, we found that 1) the tangible video control interface is
intuitive to use and efficient to operate; 2) heart rate signals implicitly
captured by AttentiveLearner can be used to infer both the learner's interests
and perceived confusion levels towards the corresponding learning topics; and 3)
AttentiveLearner can achieve significantly higher accuracy when predicting
extreme personal learning events and aggregated learning events.
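The heart-rate sensing principle, estimating beats per minute from the per-frame brightness of a camera covered by the fingertip, can be sketched as below; this is a simplified illustration with a clean synthetic signal, not the system's actual signal processing.

    import numpy as np

    def estimate_heart_rate(brightness, fps):
        """Estimate beats per minute from per-frame mean fingertip brightness.

        Counts strict local maxima above the signal mean as pulse peaks.
        """
        x = np.asarray(brightness, dtype=float)
        x = x - x.mean()
        peaks = np.sum((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]) & (x[1:-1] > 0))
        return 60.0 * peaks * fps / len(x)

    # Toy usage: a clean 72-bpm synthetic pulse sampled at 30 frames per second.
    fps, seconds, bpm = 30, 10, 72
    signal = np.sin(2 * np.pi * (bpm / 60) * np.arange(fps * seconds) / fps)
    print(round(estimate_heart_rate(signal, fps)))  # 72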