
Proceedings of the 2014 ACM International Conference on Multimedia

Fullname: Proceedings of the 22nd ACM International Conference on Multimedia
Editors: Kien A. Hua; Yong Rui; Ralf Steinmetz; Alan Hanjalic; Apostol (Paul) Natsev; Wenwu Zhu
Location: Orlando, Florida
Dates: 2014-Nov-03 to 2014-Nov-07
Standard No: ISBN: 978-1-4503-3063-3; ACM DL: Table of Contents; hcibib: MM14
  1. Keynotes
  2. Best Paper Session
  3. Multimedia Art and Entertainment
  4. Action, Activity, and Event Recognition
  5. Music, Speech and Audio
  6. Deep Learning for Multimedia
  7. Multimedia Grand Challenge
  8. Multimedia HCI and QoE
  9. Multimedia Analysis and Mining
  10. Multimedia Systems
  11. Emotional and Social Signals in Multimedia
  12. High Risks High Rewards
  13. Multimedia Applications
  14. Privacy, Health and Well-being
  15. Multimedia Search and Indexing
  16. Social Media and Crowd
  17. Multimedia Recommendations
  18. Doctoral Symposium 1
  19. Doctoral Symposium 2
  20. Open Source Software Competition 1
  21. Open Source Software Competition 2
  22. Art Exhibit
  23. Demos 1: Searching and Finding
  24. Demos 2: Senses and Sensors
  25. Demos 3: Systems
  26. Posters 1
  27. Posters 2
  28. Posters 3
  29. Tutorials
  30. Workshop Summaries

Keynotes


Bing, the fastest growing image search engine 1
  Harry Shum
Since the launch of Bing (www.bing.com) in June 2009, we have seen Bing web search market share in the US more than double and Bing image search query share quadruple. In this talk, I will share our experience building Bing image search as the fastest growing image search engine, and discuss the challenges and opportunities in image search. Specifically, I will talk about how we have significantly improved image search quality, and built a differentiated image search user experience using NLP, entity, big data, machine learning and computer vision technologies. By leveraging big data from billions of search queries, billions of images on the web and from the social networks, and billions of user clicks, we have designed massive machine learning systems to continuously improve image search quality. With the focus on natural language and entity understanding, for instance, we have improved Bing's ability to understand the user intent beyond queries and keywords. I will demonstrate with many examples how Bing has delivered a superior image search user experience, quantitatively, qualitatively and aesthetically, by utilizing computer vision techniques.
Affective media and wearables: surprising findings 3-4
  Rosalind Picard
Over a decade ago, I suggested that computers will need the skills of emotional intelligence in order to interact with regular people in ways that they perceive as intelligent. Our lab embarked on this journey of 'affective computing' with a focus on first enabling computers to better understand and communicate human emotion. Our main tools have been wearable sensors (several of which we created), video, and audio, coupled with signal processing, machine learning and pattern analysis of multimodal human data. Along the way we encountered several surprises. This talk will highlight some of the challenges we have faced, some accomplishments, and the most surprising and rewarding findings. Our findings reveal the power of the human emotion system not only in intelligence, in social interaction, and in everyday media consumption, but also in autism, epilepsy, and sleep memory formation.
Back and to the future: quality provisioning for multimedia content delivery 5
  Klara Nahrstedt
The Quality Provisioning concept has been with us for at least 25 years and it started with the claim to be one of the necessary building blocks for multimedia content distribution and delivery. A lot of research has been done on Quality of Service and Quality of Experience by the ACM Special Interest Group on Multimedia (SIGMM) community and other research communities. So where are we with respect to broader impact and deployment of Quality Provisioning in multimedia networks, systems and applications? Did it become a necessary building block for multimedia content delivery or not? During the talk I will go back and to the future, discussing my journey regarding quality topics ranging from Quality of Service (QoS) in multimedia networks and end systems, to Experiential Quality (QoE) for current and future multimedia applications. I will reflect on successes and failures of Quality Provisioning mechanisms, policies, algorithms, protocols and management frameworks in multimedia networks, systems and applications as they evolved from up to the point of an almost ubiquitous presence of multimedia services to the on-going discussions about network neutrality and multimedia service provisioning. I will also argue that the future for Quality Provisioning is bright with numerous exciting research problems, since users at this point expect nothing but high quality multimedia content delivery anytime, anywhere, any content, and on any device.

Best Paper Session

Cross-modal Retrieval with Correspondence Autoencoder 7-16
  Fangxiang Feng; Xiaojie Wang; Ruifan Li
The problem of cross-modal retrieval, e.g., using a text query to search for images and vice-versa, is considered in this paper. A novel model involving a correspondence autoencoder (Corr-AE) is proposed here for solving this problem. The model is constructed by correlating hidden representations of two uni-modal autoencoders. A novel optimal objective, which minimizes a linear combination of representation learning errors for each modality and correlation learning error between hidden representations of two modalities, is used to train the model as a whole. Minimization of correlation learning error forces the model to learn hidden representations with only common information in different modalities, while minimization of representation learning error makes hidden representations good enough to reconstruct the input of each modality. A parameter α is used to balance the representation learning error and the correlation learning error. Based on two different multi-modal autoencoders, Corr-AE is extended to two other correspondence models, which we call Corr-Cross-AE and Corr-Full-AE. The proposed models are evaluated on three publicly available data sets from real scenes. We demonstrate that the three correspondence autoencoders perform significantly better than three canonical correlation analysis based models and two popular multi-modal deep models on cross-modal retrieval tasks.
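The objective described in this abstract can be illustrated with a minimal sketch. The abstract only says that the loss is a linear combination balanced by α; the (1 - α)/α weighting, the squared Euclidean distance, and all function and argument names below are assumptions made for illustration, not the authors' implementation.

```python
def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def corr_ae_loss(img, img_recon, txt, txt_recon, h_img, h_txt, alpha):
    """Balance reconstruction and correlation errors with weight alpha.

    Assumed form (not confirmed by the abstract):
        (1 - alpha) * (L_img + L_txt) + alpha * L_corr
    where L_corr is the distance between the two hidden representations.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Representation learning error: how well each modality is reconstructed.
    recon = squared_error(img, img_recon) + squared_error(txt, txt_recon)
    # Correlation learning error: how far apart the hidden codes are.
    corr = squared_error(h_img, h_txt)
    return (1.0 - alpha) * recon + alpha * corr
```

Under this assumed form, alpha = 0 trains two independent autoencoders, while alpha = 1 ignores reconstruction and only pulls the hidden codes together.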
VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events 17-26
  Amirhossein Habibian; Thomas Mensink; Cees G. M. Snoek
This paper proposes a new video representation for few-example event recognition and translation. Different from existing representations, which rely on either low-level features, or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier prediction. Our proposed algorithm prevents the combination of correlated terms which are visually dissimilar by optimizing a joint-objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without joint-objective and alternatives without any embedding, ii) The varying quality of input video descriptions from the web is compensated by harvesting more data, iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. What is more, VideoStory translates a previously unseen video to its most likely description from visual content only.
Say Cheese vs. Smile: Reducing Speech-Related Variability for Facial Emotion Recognition 27-36
  Yelin Kim; Emily Mower Provost
Facial movement is modulated both by emotion and speech articulation. Facial emotion recognition systems aim to discriminate between emotions, while reducing the speech-related variability in facial cues. This aim is often achieved using two key features: (1) phoneme segmentation: facial cues are temporally divided into units with a single phoneme and (2) phoneme-specific classification: systems learn patterns associated with groups of visually similar phonemes (visemes), e.g. P, B, and M. In this work, we empirically compare the effects of different temporal segmentation and classification schemes for facial emotion recognition. We propose an unsupervised segmentation method that does not necessitate costly phonetic transcripts. We show that the proposed method bridges the accuracy gap between a traditional sliding window method and phoneme segmentation, achieving a statistically significant performance gain. We also demonstrate that the segments derived from the proposed unsupervised and phoneme segmentation strategies are similar to each other. This paper provides new insight into unsupervised facial motion segmentation and the impact of speech variability on emotion classification.

Multimedia Art and Entertainment

Analysis/synthesis approaches for creatively processing video signals 37-46
  Javier Villegas; Angus Graeme Forbes
This paper explores methods for the creative manipulation of video signals and the generation of animations through a process of analysis and synthesis. Our approach involves four distinct steps, and different creative outputs based on video inputs can be obtained by choosing different alternatives at each of the steps. First, we decide which features to extract from an input video sequence. Next, we choose a matching strategy to associate the features between a pair of video frames. Then, we choose a way to interpolate between corresponding features within these frames. Finally, we decide how to render these elements when resynthesizing the signal. We illustrate our approach with a range of different examples, including video manipulation experiments, animations, and real-time multimedia installations.
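The four steps named in this abstract (extract, match, interpolate, render) can be sketched as a small pipeline. Every concrete choice below is a hypothetical stand-in chosen for brevity, since the abstract leaves each step open: bright pixels as features, nearest-neighbour matching, linear interpolation, and a text-canvas renderer.

```python
import math


def extract_features(frame, threshold=128):
    """Step 1 (assumed feature): coordinates of pixels brighter than threshold."""
    return [(x, y) for y, row in enumerate(frame)
            for x, value in enumerate(row) if value > threshold]


def match_features(src, dst):
    """Step 2 (assumed strategy): pair each source point with its nearest target."""
    if not dst:
        return []
    return [(p, min(dst, key=lambda q: math.dist(p, q))) for p in src]


def interpolate(pairs, t):
    """Step 3: linear interpolation between matched points, with 0 <= t <= 1."""
    return [((1 - t) * px + t * qx, (1 - t) * py + t * qy)
            for (px, py), (qx, qy) in pairs]


def render(points, width, height):
    """Step 4 (assumed renderer): draw each point as '#' on a text canvas."""
    canvas = [["." for _ in range(width)] for _ in range(height)]
    for x, y in points:
        col, row = round(x), round(y)
        if 0 <= col < width and 0 <= row < height:
            canvas[row][col] = "#"
    return ["".join(row) for row in canvas]
```

Swapping any one function for a different alternative changes the output while leaving the other three steps untouched, which is the flexibility the abstract describes.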
Exploring Principles-of-Art Features For Image Emotion Recognition 47-56
  Sicheng Zhao; Yue Gao; Xiaolei Jiang; Hongxun Yao; Tat-Seng Chua; Xiaoshuai Sun
Emotions can be evoked in humans by images. Most previous works on image emotion analysis mainly used the elements-of-art-based low-level visual features. However, these features are vulnerable and not invariant to the different arrangements of elements. In this paper, we investigate the concept of principles-of-art and its influence on image emotions. Principles-of-art-based emotion features (PAEF) are extracted to classify and score image emotions for understanding the relationship between artistic principles and emotions. PAEF are the unified combination of representation features derived from different principles, including balance, emphasis, harmony, variety, gradation, and movement. Experiments on the International Affective Picture System (IAPS), a set of artistic photography and a set of peer rated abstract paintings, demonstrate the superiority of PAEF for affective image classification and regression (with about 5% improvement on classification accuracy and 0.2 decrease in mean squared error), as compared to the state-of-the-art approaches. We then utilize PAEF to analyze the emotions of master paintings, with promising results.
From Writing to Painting: A Kinect-Based Cross-Modal Chinese Painting Generation System 57-66
  Jiajia Li; Grace Ngai; Stephen C. F. Chan; Kien A. Hua; Hong Va Leong; Alvin Chan
As computer and interaction technologies mature, a much broader range of media is now used for input and output, each of which has its own rich repertoire of techniques, instruments, and cultural heritage. The combination of multiple media can produce novel multimedia human-computer interaction approaches which are more efficient and interesting than traditional single media methods. This paper presents CalliPaint, a system for cross-modal art generation that links together Chinese ink brush calligraphy writing and Chinese landscape painting. We investigate the mapping between the two modalities based on concepts of metaphoric congruence, and implement our findings into a prototype system. A multi-step evaluation experiment with real users suggests that CalliPaint provides a realistic and intuitive experience that allows even novice users to create attractive landscape paintings from writing. Comparison with a general-purpose digital painting software suggests that CalliPaint provides users with a more enjoyable experience. Finally, exhibiting CalliPaint in an open-access location for use by casual users without any training shows that the system is easy to learn.
Gibber: Abstractions for Creative Multimedia Programming 67-76
  Charles Roberts; Matthew Wright; JoAnn Kuchera-Morin; Tobias Höllerer
We describe design decisions informing the development of Gibber, an audiovisual programming environment for the browser. Our design comprises a consistent notation across modalities in addition to high-level abstractions affording intuitive declarations of multimodal mappings, unified timing constructs, and rapid, iterative reinvocations of constructors while preserving the state of audio and visual graphs. We discuss the features of our environment and the abstractions that enable them. We close by describing use cases, including live audiovisual performances and computer science education.

Action, Activity, and Event Recognition

Multiple Features But Few Labels?: A Symbiotic Solution Exemplified for Video Analysis 77-86
  Zhigang Ma; Yi Yang; Nicu Sebe; Alexander G. Hauptmann
Video analysis has been attracting increasing research attention due to the proliferation of internet videos. In this paper, we investigate how to improve the performance of internet-quality video analysis. In particular, we work on the scenario in which few labeled training videos are provided, which has received less attention in multimedia. To begin with, we consider how to more effectively harness the evidence from the low-level features. Researchers have developed several promising features to represent videos to capture the semantic information. However, as videos usually characterize rich semantic contents, the analysis performance by using one single feature is potentially limited. Simply combining multiple features through early fusion or late fusion to incorporate more informative cues is doable but not optimal due to the heterogeneity and different predicting capability of these features. For better exploitation of multiple features, we propose to mine the importance of different features and cast it into the learning of the classification model. Our method is based on multiple graphs from different features and uses the Riemannian metric to evaluate the feature importance. On the other hand, to be able to use limited labeled training videos for a respectable accuracy, we formulate our method in a semi-supervised way. The main contribution of this paper is a novel scheme of evaluating the feature importance that is further cast into a unified framework of harnessing multiple weighted features with limited labeled training videos. We perform extensive experiments on video action recognition and multimedia event recognition, and the comparison to other state-of-the-art multi-feature learning algorithms has validated the efficacy of our framework.
Latent Tensor Transfer Learning for RGB-D Action Recognition 87-96
  Chengcheng Jia; Yu Kong; Zhengming Ding; Yun Raymond Fu
This paper proposes a method to compensate RGB-D images from the original target RGB images by transferring the depth knowledge of source data. Conventional RGB databases (e.g., UT-Interaction database) do not contain depth information since they are captured by RGB cameras. Therefore, the methods designed for RGB databases cannot take advantage of depth information, which proves useful for simplifying intra-class variations and background subtraction. In this paper, we present a novel transfer learning method that can transfer the knowledge from depth information to the RGB database, and use the additional source information to recognize human actions in RGB videos. Our method takes full advantage of 3D geometric information contained within the learned depth data, and thus can further improve action recognition performance. We treat action data as a fourth-order tensor (row, column, frame and sample), and apply latent low-rank transfer learning to learn shared subspaces of the source and target databases. Moreover, we introduce a novel cross-modality regularizer that plays an important role in finding the correlation between RGB and depth modalities, so that more depth information from the source database can be transferred to that of the target. Our method is extensively evaluated on publicly available databases. Results on two action datasets show that our method outperforms existing methods.
3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks 97-106
  Keze Wang; Xiaolong Wang; Liang Lin; Meng Wang; Wangmeng Zuo
Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interactions. This work aims to develop a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structures for a category of activities, i.e. how the activities are to be decomposed in terms of classification. Our model can be regarded as a structured deep architecture, as it extends the convolutional neural networks (CNNs) by incorporating structure alternatives. Specifically, we build the network consisting of 3D convolutions and max-pooling operators over the video segments, and introduce the latent variables in each convolutional layer manipulating the activation of neurons. Our model thus advances existing approaches in two aspects: (i) it acts directly on the raw inputs (grayscale-depth data) to conduct recognition instead of relying on hand-crafted features, and (ii) the model structure can be dynamically adjusted accounting for the temporal variations of human activities, i.e. the network configuration is allowed to be partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters by using the back-propagation algorithm. Our approach is validated in challenging scenarios, and outperforms state-of-the-art methods. A large human activity database of RGB-D videos is presented in addition.
Dynamic Background Learning through Deep Auto-encoder Networks 107-116
  Pei Xu; Mao Ye; Xue Li; Qihe Liu; Yi Yang; Jian Ding
Background learning is a pre-processing step for motion detection, which is a basic step of video analysis. For the static background, many previous works have already achieved good performance. However, the results on learning dynamic backgrounds still leave much to be improved. To address this challenge, in this paper, a novel and practical method is proposed based on deep auto-encoder networks. Firstly, dynamic background images are extracted through a deep auto-encoder network (called Background Extraction Network) from video frames containing moving objects. Then, a dynamic background model is learned by another deep auto-encoder network (called Background Learning Network) using the extracted background images as the input. To be more flexible, our background model can be updated on-line to absorb more training samples. Our main contributions are 1) a cascade of two deep auto-encoder networks which can deal with the separation of dynamic background and foregrounds very efficiently; 2) an online learning method adopted to accelerate the training of the Background Extraction Network. Compared with previous algorithms, our approach obtains the best performance over six benchmark data sets. In particular, the experiments show that our algorithm handles backgrounds with large variation very well.

Music, Speech and Audio

Music Emotion Recognition by Multi-label Multi-layer Multi-instance Multi-view Learning 117-126
  Bin Wu; Erheng Zhong; Andrew Horner; Qiang Yang
Music emotion recognition, which aims to automatically recognize the affective content of a piece of music, has become one of the key components of music searching, exploring, and social networking applications. Although researchers have given more and more attention to music emotion recognition studies, the recognition performance has come to a bottleneck in recent years. One major reason is that experts' labels for music emotion are mostly song-level, while music emotion usually varies within a song. Traditional methods have considered each song as a single instance and have built models based on song-level features. However, they ignored the dynamics of music emotion and failed to capture accurate emotion-feature correlations. In this paper, we model music emotion recognition as a novel multi-label multi-layer multi-instance multi-view learning problem: music is formulated as a hierarchical multi-instance structure (e.g., song-segment-sentence) where multiple emotion labels correspond to at least one of the instances with multiple views of each layer. We propose a Hierarchical Music Emotion Recognition model (HMER) -- a novel hierarchical Bayesian model using sentence-level music and lyrics features. It captures music emotion dynamics with a song-segment-sentence hierarchical structure. HMER also considers emotion correlations between both music segments and sentences. Experimental results show that HMER outperforms several state-of-the-art methods in terms of F1 score and mean average precision.
Song Recommendation for Social Singing Community 127-136
  Kuang Mao; Ju Fan; Lidan Shou; Gang Chen; Mohan Kankanhalli
Nowadays, an increasing number of singing enthusiasts upload their cover songs and share their performances in online social singing communities. They can also listen to and rate other users' song renderings. An important feature of the social singing communities is to recommend appropriate singing-songs which users are able to perform excellently.
   In this paper, we propose a singing-song recommendation framework to make song recommendations in social singing communities. Instead of recommending songs that people like to listen to, we recommend suitable songs that people can sing well. We propose to discover the song difficulty orderings from the song performance ratings of each user. We transform the difficulty orderings into a difficulty graph and propose an iterative inference algorithm to make singing-song recommendations on the difficulty graph. The experimental results show the effectiveness of our proposed framework. To the best of our knowledge, our work is the first study of singing-song recommendation in social singing communities.
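The step from per-user performance ratings to a difficulty graph might look like the sketch below. The ordering rule (a song on which a user is rated higher is treated as easier for that user), the vote-counted edges, and the name `build_difficulty_graph` are assumptions for illustration. The abstract does not specify the iterative inference algorithm, so that part is not sketched.

```python
from collections import defaultdict
from itertools import combinations


def build_difficulty_graph(ratings):
    """Turn per-user performance ratings into weighted difficulty edges.

    ratings: {user: {song: performance_rating}}
    Returns {(easier_song, harder_song): number_of_users_supporting_the_edge}.
    Assumed rule: if a user performs song A better than song B,
    A is taken to be easier than B for that user.
    """
    edges = defaultdict(int)
    for songs in ratings.values():
        # Compare every pair of songs that the same user has performed.
        for a, b in combinations(sorted(songs), 2):
            if songs[a] > songs[b]:
                edges[(a, b)] += 1
            elif songs[b] > songs[a]:
                edges[(b, a)] += 1
            # Equal ratings give no ordering evidence, so no edge is added.
    return dict(edges)
```

For example, two users who both score higher on "s1" than on "s2" yield a single edge ("s1", "s2") with weight 2.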
"Sheldon speaking, Bonjour!": Leveraging Multilingual Tracks for (Weakly) Supervised Speaker Identification 137-146
  Hervé Bredin; Anindya Roy; Nicolas Pécheux; Alexandre Allauzen
We address the problem of speaker identification in multimedia data, and TV series in particular. While speaker identification is traditionally a supervised machine-learning task, our first contribution is to significantly reduce the need for costly preliminary manual annotations through the use of automatically aligned (and potentially noisy) fan-generated transcripts and subtitles.
   We show that both speech activity detection and speech turn identification modules trained in this weakly supervised manner achieve performance similar to that of their fully supervised counterparts (i.e. relying on fine manual speech/non-speech/speaker annotation).
   Our second contribution relates to the use of multilingual audio tracks usually available with this kind of content to significantly improve the overall speaker identification performance. Reproducible experiments (including dataset, manual annotations and source code) performed on the first six episodes of The Big Bang Theory TV series show that combining the French audio track (containing dubbed actor voices) with the English one (with the original actor voices) improves the overall English speaker identification performance by 5% absolute and up to 70% relative on the five main characters.
What's Making that Sound? 147-156
  Kai Li; Jun Ye; Kien A. Hua
In this paper, we investigate techniques to localize the sound source in video made using one microphone. The visual object whose motion generates the sound is located and segmented based on the synchronization analysis of object motion and audio energy. We first apply an effective region tracking algorithm to segment the video into a number of spatial-temporal region tracks, each representing the temporal evolution of an appearance-coherent image structure (i.e., object). We then extract the motion features of each object as its average acceleration in each frame. Meanwhile, Short-term Fourier Transform is applied to the audio signal to extract audio energy feature as the audio descriptor. We further impose a nonlinear transformation on both audio and visual descriptors to obtain the audio and visual codes in a common rank correlation space. Finally, the correlation between an object and the audio signal is simply evaluated by computing the Hamming distance between the audio and visual codes generated in previous steps. We evaluate the proposed method both qualitatively and quantitatively using a number of challenging test videos. In particular, the proposed method is compared with a state-of-the-art audiovisual source localization algorithm. The results demonstrate the superior performance of the proposed algorithm in spatial-temporal localization and segmentation of audio sources in the visual domain.
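The last two steps of this abstract, encoding descriptors in a rank correlation space and comparing them by Hamming distance, can be sketched as follows. The abstract does not give the nonlinear transformation, so the pairwise-comparison encoding here is an assumed stand-in, and the names `sample_pairs`, `rank_code`, and `hamming_distance` are hypothetical.

```python
import random


def sample_pairs(length, n_bits, seed=0):
    """Pick n_bits random pairs of distinct frame indices (shared by both signals)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(length), 2)) for _ in range(n_bits)]


def rank_code(signal, pairs):
    """One bit per index pair (i, j): 1 if signal[i] > signal[j], else 0.

    The code depends only on the rank order of the samples, so any
    monotonically increasing rescaling of the signal gives the same code.
    """
    return [1 if signal[i] > signal[j] else 0 for i, j in pairs]


def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))
```

With this encoding, an object whose per-frame acceleration rises and falls together with the audio energy gets a small Hamming distance to the audio code, regardless of the units of either descriptor.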

Deep Learning for Multimedia

Deep Learning for Content-Based Image Retrieval: A Comprehensive Study 157-166
  Ji Wan; Dayong Wang; Steven Chu Hong Hoi; Pengcheng Wu; Jianke Zhu; Yongdong Zhang; Jintao Li
Learning effective feature representations and similarity measures is crucial to the retrieval performance of a content-based image retrieval (CBIR) system. Despite extensive research efforts for decades, it remains one of the most challenging open problems that considerably hinders the success of real-world CBIR systems. The key challenge has been attributed to the well-known "semantic gap" issue that exists between low-level image pixels captured by machines and high-level semantic concepts perceived by humans. Among various techniques, machine learning has been actively investigated as a possible direction to bridge the semantic gap in the long term. Inspired by recent successes of deep learning techniques for computer vision and other applications, in this paper, we attempt to address an open problem: whether deep learning is a hope for bridging the semantic gap in CBIR, and how much improvement in CBIR tasks can be achieved by exploring the state-of-the-art deep learning techniques for learning feature representations and similarity measures. Specifically, we investigate a framework of deep learning with application to CBIR tasks with an extensive set of empirical studies by examining a state-of-the-art deep learning method (Convolutional Neural Networks) for CBIR tasks under varied settings. From our empirical studies, we find some encouraging results and summarize some important insights for future research.
Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification 167-176
  Zuxuan Wu; Yu-Gang Jiang; Jun Wang; Jian Pu; Xiangyang Xue
Videos contain very rich semantics and are intrinsically multimodal. In this paper, we study the challenging task of classifying videos according to their high-level semantics such as human actions or complex events. Although extensive efforts have been devoted to studying this problem, most existing works combined multiple features using simple fusion strategies and neglected the exploration of inter-class semantic relationships. In this paper, we propose a novel unified framework that jointly learns feature relationships and exploits the class relationships for improved video classification performance. Specifically, these two types of relationships are learned and utilized by rigorously imposing regularizations in a deep neural network (DNN). Such a regularized DNN can be efficiently launched using a GPU implementation with an affordable training cost. By arming the DNN with better capability of exploring both the inter-feature and the inter-class relationships, the proposed regularized DNN is more suitable for identifying video semantics. With extensive experimental evaluations, we demonstrate that the proposed framework exhibits superior performance over several state-of-the-art approaches. On the well-known Hollywood2 and Columbia Consumer Video benchmarks, we obtain the best results reported to date: 65.7% and 70.6% respectively in terms of mean average precision.
Error-Driven Incremental Learning in Deep Convolutional Neural Network for Large-Scale Image Classification 177-186
  Tianjun Xiao; Jiaxing Zhang; Kuiyuan Yang; Yuxin Peng; Zheng Zhang
Supervised learning using deep convolutional neural networks has shown its promise in large-scale image classification tasks. As a building block, it is now well positioned to be part of a larger system that tackles real-life multimedia tasks. An unresolved issue is that such a model is trained on a static snapshot of data. Instead, this paper positions the training as a continuous learning process as new classes of data arrive. A system with such capability is useful in practical scenarios, as it gradually expands its capacity to predict an increasing number of new classes. It is also our attempt to address the more fundamental issue: a good learning system must deal with new knowledge that it is exposed to, much as humans do.
   We developed a training algorithm that grows a network not only incrementally but also hierarchically. Classes are grouped according to similarities, and self-organized into levels. The newly added capacities are divided into component models that predict coarse-grained superclasses and those that return the final prediction within a superclass. Importantly, all models are cloned from existing ones and can be trained in parallel. These models inherit features from existing ones and thus further speed up the learning. Our experiment points out advantages of this approach, and also yields a few important open questions.
Start from Scratch: Towards Automatically Identifying, Modeling, and Naming Visual Attributes BIBAFull-Text 187-196
  Hanwang Zhang; Yang Yang; Huanbo Luan; Shuicheng Yan; Tat-Seng Chua
Higher-level semantics such as visual attributes are crucial for fundamental multimedia applications. We present a novel attribute discovery approach that can automatically identify, model and name attributes from an arbitrary set of image and text pairs that can be easily gathered on the Web. Unlike conventional attribute discovery methods, our approach does not rely on any pre-defined vocabularies or human labeling. Therefore, we are able to build a large visual knowledge base without any human effort. The discovery is based on a novel deep architecture, named Independent Component Multimodal Autoencoder (ICMAE), that can continually learn shared higher-level representations across the visual and textual modalities. With the help of the resultant representations encoding strong visual and semantic evidence, we propose to (a) identify attributes and their corresponding high-quality training images, (b) iteratively model them with maximum compactness and comprehensiveness, and (c) name the attribute models with human-understandable words. To date, the proposed system has discovered 1,898 attributes over 1.3 million pairs of image and text. Extensive experiments on various real-world multimedia datasets demonstrate the quality and effectiveness of the discovered attributes, facilitating multimedia applications such as image annotation and retrieval as compared to the state-of-the-art approaches.

Multimedia Grand Challenge

What are the Fashion Trends in New York? BIBAFull-Text 197-200
  Shintami C. Hidayati; Kai-Lung Hua; Wen-Huang Cheng; Shih-Wei Sun
Fashion is a reflection of the society of a period. Given that New York City is one of the world's fashion capitals, understanding its change in fashion becomes a way to know the society and the times. To keep up with fashion trends, it is important to know what's "in" and what's "out" for a season. Though fashion trends have been analyzed by fashion designers and fashion analysts for a long time, this issue has been ignored in multimedia science. In this paper, we present a novel algorithm that automatically discovers visual style elements representing fashion trends for a certain season. The visual style elements are discovered based on stylistically coherent and unique characteristics. Experimental results on a large number of catwalk show videos demonstrate the effectiveness of our proposed method.
Discovering the City by Mining Diverse and Multimodal Data Streams BIBAFull-Text 201-204
  Yin-Hsi Kuo; Yan-Ying Chen; Bor-Chun Chen; Wen-Yu Lee; Chun-Che Wu; Chia-Hung Lin; Yu-Lin Hou; Wen-Feng Cheng; Yi-Chih Tsai; Chung-Yen Hung; Liang-Chi Hsieh; Winston Hsu
This work attempts to tackle the IBM grand challenge -- seeing the daily life of New York City (NYC) from various perspectives by exploring rich and diverse social media content. Most existing works address this problem by relying on a single media source and covering limited aspects of life. Because different social media are usually chosen for specific purposes, multiple social media mining and integration are essential to understand a city comprehensively. In this work, we first discover the similar and unique natures (e.g., attractions, topics) across social media in terms of visual and semantic perceptions. For example, Instagram users share more food and travel photos while Twitter users discuss sports and news more. Based on these characteristics, we analyze a broad spectrum of life aspects -- trends, events, food, clothing and transportation in NYC by mining a huge amount of diverse and freely available media (e.g., 1.6M Instagram photos, 5.3M Twitter posts). Because transportation logs are hardly available in social media, the NYC Open Data (e.g., 6.5B subway station transactions) is leveraged to visualize temporal traffic patterns. Furthermore, the experiments demonstrate that our approaches can effectively provide an overview of urban life with considerable technical improvement, e.g., 16% relative gains in food recognition accuracy from a hierarchical cross-media learning strategy, and a tenfold reduction in the feature dimensions of sentiment analysis without sacrificing precision.
New Yorker Melange: Interactive Brew of Personalized Venue Recommendations BIBAFull-Text 205-208
  Jan Zahálka; Stevan Rudinac; Marcel Worring
In this paper we propose New Yorker Melange, an interactive city explorer, which navigates New York venues through the eyes of New Yorkers whose taste is similar to that of the interacting user. To gain insight into New Yorkers' preferences and properties of the venues, a dataset of more than a million venue images and associated annotations has been collected from Foursquare, Picasa, and Flickr. As visual and text features, we use semantic concepts extracted by a convolutional deep net and latent Dirichlet allocation topics. To identify different aspects of the venues and topics of interest to the users, we further cluster images associated with them. New Yorker Melange uses an interactive map interface and learns the interacting user's taste using a linear SVM. The SVM model is used to navigate the interacting user's exploration further towards similar users. Experimental evaluation demonstrates that our proposed approach is effective in producing relevant results and that both visual and text modalities contribute to the overall system performance.
ATLAS: Automatic Temporal Segmentation and Annotation of Lecture Videos Based on Modelling Transition Time BIBAFull-Text 209-212
  Rajiv Ratn Shah; Yi Yu; Anwar Dilawar Shaikh; Suhua Tang; Roger Zimmermann
The number of lecture videos available is increasing rapidly, though there is still insufficient accessibility and traceability of lecture video contents. Specifically, it is very desirable to enable people to navigate and access specific slides or topics within lecture videos. To this end, this paper presents the ATLAS system for the VideoLectures.NET challenge (MediaMixer, transLectures) to automatically perform the temporal segmentation and annotation of lecture videos. ATLAS has two main novelties: (i) an SVMhmm model is proposed to learn temporal transition cues and (ii) a fusion scheme is suggested to combine transition cues extracted from heterogeneous information of lecture videos. According to our initial experiments on videos provided by VideoLectures.NET, the proposed algorithm is able to segment and annotate knowledge structures based on fusing temporal transition cues, and the evaluation results are very encouraging, which confirms the effectiveness of our ATLAS system.
Predicting Viewer Perceived Emotions in Animated GIFs BIBAFull-Text 213-216
  Brendan Jou; Subhabrata Bhattacharya; Shih-Fu Chang
Animated GIFs are everywhere on the Web. Our work focuses on the computational prediction of emotions perceived by viewers after they are shown animated GIF images. We evaluate our results on a dataset of over 3,800 animated GIFs gathered from MIT's GIFGIF platform, each with scores for 17 discrete emotions aggregated from over 2.5M user annotations -- the first computational evaluation of its kind for content-based prediction on animated GIFs to our knowledge. In addition, we advocate a conceptual paradigm in emotion prediction that shows that delineating distinct types of emotion is important and that it is useful to be concrete about the emotion target. One of our objectives is to systematically compare different types of content features for emotion prediction, including low-level, aesthetics, semantic and face features. We also formulate a multi-task regression problem to evaluate whether viewer perceived emotion prediction can benefit from jointly learning across emotion classes compared to disjoint, independent learning.
Context-Based Photography Learning using Crowdsourced Images and Social Media BIBAFull-Text 217-220
  Yogesh Singh Rawat; Mohan S. Kankanhalli
This paper presents a photography model based on machine learning which utilizes crowd-sourced images along with social media cues. As scene composition and camera parameters play a vital role in the aesthetics of a captured image, the proposed system addresses the problem of learning photographic composition and camera parameters. Further, we observe that context is an important factor from a photography perspective; we therefore augment the learning with associated contextual information. We define context features based on factors such as time, geo-location, environmental conditions and type of image, which have an impact on photography. The meta information available with crowd-sourced images is utilized for context identification and social media cues are used for photo quality evaluation. We also propose the idea of computing the photographic composition basis, eigenrules and baserules, to support our composition learning method. The trained photography model can provide assistance to the user in determining image composition and camera parameters.
Virtual Portraitist: Aesthetic Evaluation of Selfies Based on Angle BIBAFull-Text 221-224
  Mei-Chen Yeh; Hsiao-Wei Lin
This work addresses the Huawei Grand Challenge that seeks solutions for quality improvement and functionality extension in computational photography. We propose virtual portraitist -- a new method that helps users take selfies from a good angle. Unlike current solutions that mostly use a post-processing step to fix a photograph, the proposed method enables a novel function of recommending a good look before the photo is captured. This is achieved by using an automatic approach for estimating the aesthetic quality score of a selfie based on angle. In particular, a set of distinctive patterns discovered from a collection of online profile pictures is combined with head pose and camera orientation to rate the quality of a selfie. Experiments validate the effectiveness of the approach.
Cross Modal Deep Model and Gaussian Process Based Model for MSR-Bing Challenge BIBAFull-Text 225-228
  Jian Wang; Cuicui Kang; Yonghao He; Shiming Xiang; Chunhong Pan
In the MSR-Bing Image Retrieval Challenge, the contestants are required to design a system that can score the query-image pairs based on the relevance between queries and images. To address this problem, we propose a regression based cross modal deep learning model and a Gaussian Process scoring model. The regression based cross modal deep learning model takes the image features and query features as inputs and outputs the relevance scores directly. The Gaussian Process scoring model regards the challenge as a ranking problem and utilizes the click (or pseudo click) information from both the training set and the development set to predict the relevance scores. The proposed models are used in different situations: matched and mismatched queries. Experiments on the development set show the effectiveness of the proposed models.
Bag-of-Words Based Deep Neural Network for Image Retrieval BIBAFull-Text 229-232
  Yalong Bai; Wei Yu; Tianjun Xiao; Chang Xu; Kuiyuan Yang; Wei-Ying Ma; Tiejun Zhao
This work targets the image retrieval task held by the MSR-Bing Grand Challenge. Image retrieval is considered a challenging task because of the gap between low-level image representation and high-level textual query representation. Recent advances in deep neural networks shed light on narrowing the gap by learning high-level image representations from raw pixels. In this paper, we propose a bag-of-words based deep neural network for the image retrieval task, which learns a high-level image representation and maps images into bag-of-words space. The DNN model is trained on large-scale clickthrough data, and the relevance between a query and an image is measured by the cosine similarity of the query's bag-of-words representation and the image's bag-of-words representation predicted by the DNN. The visual similarity of images is also computed from the high-level image representation extracted via the DNN model. Finally, the PageRank algorithm is used to further improve the ranking list by considering the visual similarity of images for each query. The experimental results achieve state-of-the-art performance and verify the effectiveness of our proposed method.
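The relevance measure named in the abstract above, cosine similarity between two bag-of-words vectors, can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary, the whitespace tokenizer and the function names are hypothetical, and in the paper the image-side vector is predicted by a DNN rather than supplied by hand.

```python
import math
from collections import Counter


def bow_vector(text, vocabulary):
    """Term-count vector of `text` over a fixed vocabulary (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example: a query vector and a stand-in for a predicted image vector.
vocabulary = ["red", "car", "dog"]
query_vec = bow_vector("red car", vocabulary)
image_vec = [0.9, 0.8, 0.1]  # assumed model output, for illustration only
relevance = cosine_similarity(query_vec, image_vec)
```

Because cosine similarity ignores vector length, a query with few words and an image with many predicted words can still score highly when they point in the same direction.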
Click-through-based Subspace Learning for Image Search BIBAFull-Text 233-236
  Yingwei Pan; Ting Yao; Xinmei Tian; Houqiang Li; Chong-Wah Ngo
One of the fundamental problems in image search is to rank image documents according to a given textual query. We address two limitations of the existing image search engines in this paper. First, there is no straightforward way of comparing textual keywords with visual image content. Image search engines therefore depend heavily on the surrounding texts, which are often noisy or too few to accurately describe the image content. Second, ranking functions are trained on query-image pairs labeled by human labelers, which makes the annotation intellectually expensive and hard to scale up.
   We demonstrate that the above two fundamental challenges can be mitigated by jointly exploring subspace learning and the use of click-through data. The former aims to create a latent subspace with the ability to compare information from the originally incomparable views (i.e., textual and visual views), while the latter explores the largely available and freely accessible click-through data (i.e., "crowdsourced" human intelligence) for understanding queries. Specifically, we investigate a series of click-through-based subspace learning techniques (CSL) for image search. We conduct experiments on the MSR-Bing Grand Challenge and the final evaluation performance achieves DCG@25=0.47225. Moreover, the feature dimension is significantly reduced by several orders of magnitude (e.g., from thousands to tens).
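The DCG@25 figure quoted above is a discounted cumulative gain truncated at rank 25. A minimal sketch of the widely used exponential-gain form is given below. The abstract does not state the challenge's relevance grades or any normalization constant, so the graded labels here are assumed for illustration and the function returns the unnormalized sum; a reported value below 1 suggests the challenge applies some additional scaling.

```python
import math


def dcg_at_k(relevances, k=25):
    """Unnormalized DCG over the top-k items of a ranked list.

    relevances: graded relevance labels (non-negative integers) in ranked order.
    Uses gain 2**rel - 1 and discount log2(position + 1).
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )
```

Items beyond position k do not contribute, and placing highly relevant images earlier in the list increases the score because of the logarithmic discount.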

Multimedia HCI and QoE

Perception-Guided Multimodal Feature Fusion for Photo Aesthetics Assessment BIBAFull-Text 237-246
  Luming Zhang; Yue Gao; Chao Zhang; Hanwang Zhang; Qi Tian; Roger Zimmermann
Photo aesthetic quality evaluation is a challenging task in the multimedia and computer vision fields. Conventional approaches suffer from the following three drawbacks: 1) the deemphasized role of semantic content that is many times more important than low-level visual features in photo aesthetics; 2) the difficulty of optimally fusing low-level and high-level visual cues in photo aesthetics evaluation; and 3) the absence of a sequential viewing path in the existing models, as humans perceive visually salient regions sequentially when viewing a photo.
   To solve these problems, we propose a new aesthetic descriptor that mimics humans sequentially perceiving visually/semantically salient regions in a photo. In particular, a weakly supervised learning paradigm is developed to project the local aesthetic descriptors (graphlets in this work) into a low-dimensional semantic space. Thereafter, each graphlet can be described by multiple types of visual features, at both the low level and the high level. Since humans usually perceive only a few salient regions in a photo, a sparsity-constrained graphlet ranking algorithm is proposed that seamlessly integrates both the low-level and the high-level visual cues. Top-ranked graphlets are those visually/semantically prominent graphlets in a photo. They are sequentially linked into a path that simulates the process of humans actively viewing. Finally, we learn a probabilistic aesthetic measure based on such actively viewing paths (AVPs) from the training photos that are marked as aesthetically pleasing by multiple users. Experimental results show that: 1) the AVPs are 87.65% consistent with real human gaze shifting paths, as verified by the eye-tracking data; and 2) our photo aesthetic measure outperforms many of its competitors.
Impact of Ultra High Definition on Visual Attention BIBAFull-Text 247-256
  Hiromi Nemoto; Philippe Hanhart; Pavel Korshunov; Touradj Ebrahimi
Ultra high definition (UHD) TV is rapidly replacing high definition (HD) TV, but little is known of its effects on human visual attention. However, a clear understanding of this effect is important, since accurate models, evaluation methodologies, and metrics for visual attention are essential in many areas, including image and video compression, camera and display manufacturing, artistic content creation, and advertisement. In this paper, we address this problem by creating a dataset of UHD resolution images with corresponding eye-tracking data, and we show that there is a statistically significant difference between viewing strategies when watching UHD and HD content. Furthermore, by evaluating five representative computational models of visual saliency, we demonstrate the decrease in the models' accuracy on UHD content when compared to HD content. Therefore, to improve the accuracy of computational models for higher resolutions, we propose a segmentation-based resolution-adaptive weighting scheme. Our approach demonstrates that taking into account information about the resolution of the images improves the performance of computational models.
An Objective Quality of Experience (QoE) Assessment Index for Retargeted Images BIBAFull-Text 257-266
  Jiangyang Zhang; C.-C. Jay Kuo
Content-aware image resizing (or image retargeting) is a technique that resizes images for optimum display on devices with different resolutions and aspect ratios. Traditional objective quality of experience (QoE) assessment methods are not applicable to retargeted images because the size of a retargeted image is different from its source. In this work, three determining factors for human visual QoE on retargeted images are analyzed. They are global structural distortion (G), local region distortion (L) and loss of salient information (S). Different features are selected to quantify their respective distortion degrees. Then, an objective quality assessment index, called GLS, is proposed to predict viewers' QoE by fusing selected features into one single quality score. Several regression models used for feature fusion are discussed and compared. Experimental results demonstrate that the proposed GLS quality index has stronger correlation with human QoE than other existing objective metrics in retargeted image quality assessment.
Acceptability-based QoE Management for User-centric Mobile Video Delivery: A Field Study Evaluation BIBAFull-Text 267-276
  Wei Song; Dian Tjondronegoro; Ivan Himawan
Effective Quality of Experience (QoE) management for mobile video delivery -- to optimize overall user experience while adapting to heterogeneous use contexts -- is still a big challenge to date. This paper proposes a mobile video delivery system that emphasizes the use of acceptability as the main indicator of QoE to manage the end-to-end factors in delivering mobile video services. The first contribution is a novel framework for a user-centric mobile video system that is based on acceptability-based QoE (A-QoE) prediction models, which were derived from comprehensive subjective studies. The second contribution is a set of results from a field study that evaluates the user experience of the proposed system during realistic usage circumstances, addressing the impacts of perceived video quality, loading speed, interest in content, viewing locations, network bandwidth, display devices, and different video coding approaches, including region-of-interest (ROI) enhancement and center zooming.

Multimedia Analysis and Mining

Weakly-Supervised Image Parsing via Constructing Semantic Graphs and Hypergraphs BIBAFull-Text 277-286
  Wenxuan Xie; Yuxin Peng; Jianguo Xiao
In this paper, we address the problem of weakly-supervised image parsing, whose aim is to automatically determine the class labels of image regions given image-level labels only. In the literature, existing studies focus mainly on the formulation of the weakly-supervised learning problem, i.e., how to propagate class labels from images to regions given an affinity graph of regions. Notably, however, the affinity graph of regions, which is generally constructed in relatively simple settings in existing methods, is of crucial importance to the parsing performance because the weakly-supervised image parsing problem cannot be handled within a single image, and because the affinity graph facilitates label propagation among multiple images. Therefore, in contrast to existing methods, we focus on how to make the affinity graph more descriptive by embedding more semantics into it. We develop two novel graphs by leveraging the weak supervision information carefully: 1) Semantic graph, which is established upon a conventional graph by utilizing the proposed weakly-supervised criteria; 2) Semantic hypergraph, which explores both intra-image and inter-image high-order semantic relevance. Experimental results on two standard datasets demonstrate that the proposed semantic graphs and hypergraphs not only capture more semantic relevance, but also perform significantly better than conventional graphs in image parsing. More remarkably, due to the complementarity of the proposed semantic graphs and hypergraphs, their combination shows even more promising results.
Fused One-vs-All Mid-Level Features for Fine-Grained Visual Categorization BIBAFull-Text 287-296
  Xiaopeng Zhang; Hongkai Xiong; Wengang Zhou; Qi Tian
As an emerging research topic, fine-grained visual categorization has been attracting growing attention in recent years. Due to the large inter-class similarity and intra-class variance, recognizing objects in fine-grained domains is extremely challenging, and sometimes even humans cannot recognize them accurately. The traditional bag-of-words model can obtain desirable results for basic-level category classification by weak alignment using the spatial pyramid matching model, but may easily fail in fine-grained domains since the discriminative features are not only subtle but also extremely localized. The fine differences often get swamped by irrelevant features, and it is virtually impossible to distinguish them. To address the problems above, we propose a new framework for fine-grained visual categorization. We strengthen the spatial correspondence among parts by including foreground segmentation and part localization. Based on the part representations of the images, we learn a large set of mid-level features which are more suitable for fine-grained tasks. Compared with the low-level features directly extracted from the images, the learned one-vs-all mid-level features enjoy the following advantages. First, the dimension of the mid-level features is relatively small. In order to obtain high classification accuracy, the dimension of the low-level features usually reaches several thousand to tens of thousands, and becomes even larger when introducing the spatial pyramid model. However, the dimension of our mid-level features is related to the number of classes, which is far smaller. Second, each entry of the proposed mid-level features is meaningful, which forms a more compact representation of the image. Third, the mid-level features are more robust than the low-level ones, which is helpful for classification. Fourth, the learning process of the mid-level features is independent and can be easily combined with other techniques to boost the performance.
   We evaluate the proposed approach on the extensive fine-grained datasets CUB 200-2011 and Stanford Dogs. By learning the mid-level features based on the popular Fisher vectors and convolutional neural network, we boost the classification accuracy by a considerable margin and advance the state-of-the-art performance in fine-grained visual categorization.
Scalable Visual Instance Mining with Threads of Features BIBAFull-Text 297-306
  Wei Zhang; Hongzhi Li; Chong-Wah Ngo; Shih-Fu Chang
We address the problem of visual instance mining, which is to extract frequently appearing visual instances automatically from a multimedia collection. We propose a scalable mining method by exploiting Thread of Features (ToF). Specifically, ToF, a compact representation that links consistent features across images, is extracted to reduce noise, discover patterns, and speed up processing. Various instances, especially small ones, can be discovered by exploiting correlated ToFs. Our approach is significantly more effective than other methods in mining small instances. At the same time, it is also more efficient, requiring far fewer hash tables. We compare it with several state-of-the-art methods on two fully annotated datasets, MQA and Oxford, showing a large performance gain in mining (especially small) visual instances. We also run our method on another Flickr dataset with one million images as a scalability test. Two applications, instance search and multimedia summarization, are developed from the novel perspective of instance mining, showing the great potential of our method in multimedia analysis.
Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval BIBAFull-Text 307-316
  Yanfei Wang; Fei Wu; Jun Song; Xi Li; Yueting Zhuang
As an important and challenging problem in the multimedia area, multi-modal data understanding aims to explore the intrinsic semantic information across different modalities in a collaborative manner. To address this problem, a possible solution is to effectively and adaptively capture the common cross-modal semantic information by modeling the inherent correlations between the latent topics from different modalities. Motivated by this task, we propose a supervised multi-modal mutual topic reinforce modeling (M³R) approach, which seeks to build a joint cross-modal probabilistic graphical model for discovering the mutually consistent semantic topics via appropriate interactions between model factors (e.g., categories, latent topics and observed multi-modal data). In principle, M³R is capable of simultaneously accomplishing the following two learning tasks: 1) modality-specific (e.g., image-specific or text-specific) latent topic learning; and 2) cross-modal mutual topic consistency learning. By investigating the cross-modal topic-related distribution information, M³R encourages the disentanglement of the semantically consistent cross-modal topics (containing some common semantic information across different modalities). In other words, the semantically co-occurring cross-modal topics are reinforced by M³R through adaptively passing the mutually reinforced messages to each other in the model-learning process. To further enhance the discriminative power of the learned latent topic representations, M³R incorporates the auxiliary information (i.e., categories or labels) into the process of Bayesian modeling, which boosts the modeling capability of capturing the inter-class discriminative information. Experimental results over two benchmark datasets demonstrate the effectiveness of the proposed M³R in cross-modal retrieval.

Multimedia Systems

Quality-adaptive Prefetching for Interactive Branched Video using HTTP-based Adaptive Streaming BIBAFull-Text 317-326
  Vengatanathan Krishnamoorthi; Niklas Carlsson; Derek Eager; Anirban Mahanti; Nahid Shahmehri
Interactive branched video, which allows users to select their own paths through the video, provides creative content designers with great personalization opportunities; however, such video also introduces significant new challenges for the system developer. For example, without careful prefetching and buffer management, the use of multiple alternative playback paths can easily result in playback interruptions. In this paper, we present a full implementation of an interactive branched video player using HTTP-based Adaptive Streaming (HAS) that provides seamless playback even when the users defer their branch path choices to the last possible moment. Our design includes optimized prefetching policies that we derive under a simple optimization framework, effective buffer management of prefetched data, and the use of parallel TCP connections to achieve efficient buffer workahead. Through performance evaluation under a wide range of scenarios, we show that our optimized policies can effectively prefetch data of carefully selected qualities along multiple alternative paths so as to ensure seamless playback, offering users a pleasant viewing experience without playback interruptions.
Self-Organized Inter-Destination Multimedia Synchronization For Adaptive Media Streaming BIBAFull-Text 327-336
  Benjamin Rainer; Christian Timmerer
As social networks have become more pervasive, they have changed how we interact socially. The traditional TV experience has drifted from an event at a fixed location with family or friends to a location-independent and distributed social experience. In addition, more and more Video On-Demand services have adopted pull-based streaming. In order to provide a synchronized and immersive distributed TV experience we introduce self-organized Inter-Destination Multimedia Synchronization (IDMS) for adaptive media streaming. In particular, we adapt the principles of IDMS to MPEG-DASH to synchronize multimedia playback among geographically distributed peers. We introduce session management to MPEG-DASH and propose a Distributed Control Scheme (DCS) to negotiate a reference playback timestamp among the peers participating in an IDMS session. We evaluate our DCS with respect to scalability and the time required to negotiate the reference playback timestamp. Furthermore, we investigate how to compensate for asynchronism using Adaptive Media Playout (AMP) and define a temporal distortion metric for audio and video which allows the impact of playback rate variations to be modeled with respect to QoE. This metric is evaluated based on a subjective quality assessment using crowdsourcing.
Anahita: A System for 3D Video Streaming with Depth Customization BIBAFull-Text 337-346
  Kiana Calagari; Krzysztof Templin; Tarek Elgamal; Khaled Diab; Piotr Didyk; Wojciech Matusik; Mohamed Hefeeda
Producing high-quality stereoscopic 3D content requires significantly more effort than preparing regular video footage. In order to assure good depth perception and visual comfort, 3D videos need to be carefully adjusted to specific viewing conditions before they are shown to viewers. While most stereoscopic 3D content is designed for viewing in movie theaters, where viewing conditions do not vary significantly, adapting the same content for viewing on home TV-sets, desktop displays, laptops, and mobile devices requires additional adjustments. To address this challenge, we propose a new system for 3D video streaming that provides automatic depth adjustments as one of its key features. Our system takes into account both the content and the display type in order to customize 3D videos and maximize their perceived quality. We propose a novel method for depth adjustment that is well-suited for videos of field sports such as soccer, football, and tennis. Our method is computationally efficient and it does not introduce any visual artifacts. We have implemented our 3D streaming system and conducted two user studies, which show: (i) adapting stereoscopic 3D videos for different displays is beneficial, and (ii) our proposed system can achieve up to 35% improvement in the perceived quality of the stereoscopic 3D content.
LiveRender: A Cloud Gaming System Based on Compressed Graphics Streaming BIBAFull-Text 347-356
  Li Lin; Xiaofei Liao; Guang Tan; Hai Jin; Xiaobin Yang; Wei Zhang; Bo Li
In cloud gaming systems, the game program runs at servers in the cloud, while clients access game services by sending input events to the servers and receiving game scenes via video streaming. In this paradigm, servers are responsible for all performance-intensive operations, and thus suffer from poor scalability. An alternative paradigm is called graphics streaming, in which graphics commands and data are offloaded to the clients for local rendering, thereby mitigating the server's burden and allowing more concurrent game sessions. Unfortunately, this approach is bandwidth consuming, due to large amounts of graphics commands and geometry data. In this paper, we present LiveRender, an open source gaming system that remedies the problem by implementing a suite of bandwidth optimization techniques including intra-frame compression, inter-frame compression, and caching, establishing what we call compressed graphics streaming. Experimental results show that the new approach is able to reduce bandwidth consumption by 52-73% compared to raw graphics streaming, with no perceptible difference in video quality and reduced response delay. Compared with the video streaming approach, LiveRender achieves a traffic reduction of 40-90% with even improved video quality and substantially smaller response delay, while enabling higher concurrency at the server.

Emotional and Social Signals in Multimedia

We are not All Equal: Personalizing Models for Facial Expression Analysis with Transductive Parameter Transfer BIBAFull-Text 357-366
  Enver Sangineto; Gloria Zen; Elisa Ricci; Nicu Sebe
Previous works on facial expression analysis have shown that person-specific models are advantageous with respect to generic ones for recognizing facial expressions of new users added to the gallery set. This finding is not surprising, due to the often significant inter-individual variability: different persons have different morphological aspects and express their emotions in different ways. However, acquiring person-specific labeled data for learning models is a very time-consuming process. In this work we propose a new transfer learning method to compute personalized models without labeled target data. Our approach is based on learning multiple person-specific classifiers for a set of source subjects and then directly transferring knowledge about the parameters of these classifiers to the target individual. The transfer process is obtained by learning a regression function which maps the data distribution associated to each source subject to the corresponding classifier's parameters. We tested our approach on two different application domains, Action Units (AUs) detection and spontaneous pain recognition, using publicly available datasets and showing its advantages with respect to the state-of-the-art both in terms of accuracy and computational cost.
Object-Based Visual Sentiment Concept Analysis and Application BIBAFull-Text 367-376
  Tao Chen; Felix X. Yu; Jiawei Chen; Yin Cui; Yan-Ying Chen; Shih-Fu Chang
This paper studies the problem of modeling object-based visual concepts such as "crazy car" and "shy dog" with the goal of extracting emotion-related information from social multimedia content. We focus on detecting such adjective-noun pairs because of their strong co-occurrence relation with image tags about emotions. This problem is very challenging due to the highly subjective nature of adjectives like "crazy" and "shy" and the ambiguity associated with the annotations. However, associating adjectives with concrete physical nouns makes the combined visual concepts more detectable and tractable. We propose a hierarchical system to handle the concept classification in an object-specific manner and decompose the hard problem into object localization and sentiment-related concept modeling. In order to resolve the ambiguity of concepts we propose a novel classification approach by modeling the concept similarity, leveraging an online commonsense knowledge base. The proposed framework also allows us to interpret the classifiers by discovering discriminative features. The comparisons between our method and several baselines show great improvement in classification performance. We further demonstrate the power of the proposed system with a few novel applications such as sentiment-aware music slide shows of personal albums.
An Event Driven Fusion Approach for Enjoyment Recognition in Real-time BIBAFull-Text 377-386
  Florian Lingenfelser; Johannes Wagner; Elisabeth André; Gary McKeown; Will Curran
Social signals and the interpretation of the information they carry are of high importance in Human Computer Interaction. Often used for affect recognition, the cues within these signals are displayed in various modalities. Fusion of multi-modal signals is a natural and interesting way to improve automatic classification of emotions transported in social signals. In most present studies, on uni-modal affect recognition as well as multi-modal fusion, decisions are forced for fixed annotation segments across all modalities. In this paper, we investigate the less prevalent approach of event driven fusion, which indirectly accumulates asynchronous events in all modalities for final predictions. We present a fusion approach, handling short-timed events in a vector space, which is of special interest for real-time applications. We compare results of segmentation based uni-modal classification and fusion schemes to the event driven fusion approach. The evaluation is carried out via detection of enjoyment-episodes within the audiovisual Belfast Story-Telling Corpus.
Correlating Speaker Gestures in Political Debates with Audience Engagement Measured via EEG BIBFull-Text 387-396
  John R. Zhang; Jason Sherwin; Jacek Dmochowski; Paul Sajda; John R. Kender

High Risks High Rewards

How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images BIBAFull-Text 397-406
  Michael Riegler; Martha Larson; Mathias Lux; Christoph Kofler
In this paper, we introduce the concept of intentional framing, defined as the sum of the choices that a photographer makes on how to portray the subject matter of an image. We carry out analysis experiments that demonstrate the existence of a correspondence between image similarity that is calculated automatically on the basis of global feature representations, and image similarity that is perceived by humans at the level of intentional frames. Intentional framing has profound implications: The existence of a fundamental image-interpretation principle that explains the importance of global representations in capturing human-perceived image semantics reaches beyond currently dominant assumptions in multimedia research. The ability of fast global-feature approaches to compete with more 'sophisticated' approaches, which are computationally more complex, is demonstrated using a simple search method (SimSea) to classify a large (2M) collection of social images by tag class. In short, intentional framing provides a principled connection between human interpretations of images and lightweight, fast image processing methods. Moving forward, it is critical that the community explicitly exploits such approaches, as the social image collections that we tackle, continue to grow larger.
A Group Testing Framework for Similarity Search in High-dimensional Spaces BIBAFull-Text 407-416
  Miaojing Shi; Teddy Furon; Hervé Jégou
This paper introduces a group testing framework for detecting large similarities between high-dimensional vectors, such as descriptors used in state-of-the-art description of multimedia documents. At the crossroad of multimedia information retrieval and signal processing, we produce a set of group representations that jointly encode several vectors into a single one, in the spirit of group testing approaches. By comparing a query vector to several of these intermediate representations, we screen the large values taken by the similarities between the query and all the vectors, at a fraction of the cost of exhaustive similarity calculation. Unlike concurrent indexing methods that suffer from the curse of dimensionality, our method exploits the properties of high-dimensional spaces. It therefore complements other strategies for approximate nearest neighbor search. Our preliminary experiments demonstrate the potential of group testing for searching large databases of multimedia objects represented by vectors. We obtain a large improvement in terms of the theoretical complexity, at the cost of a small or negligible decrease of accuracy. We hope that this preliminary work will pave the way to subsequent works for multimedia retrieval with limited resources.
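The screening idea in the abstract above can be illustrated with a minimal sketch. It assumes unit-norm vectors, groups built by simply summing consecutive database vectors, and hypothetical helper names (`build_groups`, `screen`); the paper's actual group construction and decoding may well differ.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_groups(vectors, group_size):
    """Sum consecutive vectors into one representation per group (an assumed scheme)."""
    dim = len(vectors[0])
    groups = []
    for start in range(0, len(vectors), group_size):
        members = list(range(start, min(start + group_size, len(vectors))))
        rep = [sum(vectors[i][d] for i in members) for d in range(dim)]
        groups.append((members, rep))
    return groups

def screen(query, vectors, groups, top_groups):
    """Score the groups first, then compare the query only with members of the best groups."""
    ranked = sorted(groups, key=lambda g: dot(query, g[1]), reverse=True)
    candidates = [i for members, _ in ranked[:top_groups] for i in members]
    return max(candidates, key=lambda i: dot(query, vectors[i]))

random.seed(0)
dim = 256
database = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(200)]
query = database[42]  # illustrative query: a copy of one stored vector
groups = build_groups(database, group_size=10)
best = screen(query, database, groups, top_groups=3)
```

In this toy setting the query is compared with 20 group representations plus 30 candidates instead of all 200 vectors. In high dimensions random unit vectors are nearly orthogonal, which is why the group holding the match tends to stand out.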
Object Segmentation in Images using EEG Signals BIBAFull-Text 417-426
  Eva Mohedano; Graham Healy; Kevin McGuinness; Xavier Giró-i-Nieto; Noel E. O'Connor; Alan F. Smeaton
This paper explores the potential of brain-computer interfaces in segmenting objects from images. Our approach is centered around designing an effective method for displaying the image parts to the users such that they generate measurable brain reactions. When an image region, specifically a block of pixels, is displayed we estimate the probability of the block containing the object of interest using a score based on EEG activity. After several such blocks are displayed, the resulting probability map is binarized and combined with the GrabCut algorithm to segment the image into object and background regions. This study shows that BCI and simple EEG analysis are useful in locating object boundaries in images.
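The binarization step described above might look like the following minimal sketch. The threshold of 0.5, the block layout, and the function name `block_scores_to_mask` are assumptions made for illustration. The label values mirror the "probable background" and "probable foreground" mask constants used by OpenCV's GrabCut, but the GrabCut refinement itself is not run here.

```python
# Label values mirror OpenCV's GrabCut mask constants (GC_PR_BGD = 2, GC_PR_FGD = 3).
PROBABLE_BACKGROUND = 2
PROBABLE_FOREGROUND = 3

def block_scores_to_mask(scores, block_size, threshold=0.5):
    """Expand a grid of per-block probabilities into a per-pixel label mask.

    scores     -- 2D list, one EEG-derived probability per image block (hypothetical input)
    block_size -- side length of each square block, in pixels
    threshold  -- assumed cut-off for calling a block "probable foreground"
    """
    mask = []
    for row in scores:
        pixel_row = []
        for s in row:
            label = PROBABLE_FOREGROUND if s >= threshold else PROBABLE_BACKGROUND
            pixel_row.extend([label] * block_size)
        # Repeat the pixel row once per pixel of block height.
        mask.extend(list(pixel_row) for _ in range(block_size))
    return mask

scores = [[0.1, 0.8],
          [0.6, 0.2]]
mask = block_scores_to_mask(scores, block_size=2)
```

A mask of this shape could then seed a mask-initialized GrabCut run, which refines the coarse block boundaries using the image's color statistics.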
Help Save The Planet: Please Do Adjust Your Picture BIBAFull-Text 427-436
  Oche Ejembi; Saleem N. Bhatti
Allowing digital video users to make choices of picture size and codec would significantly reduce energy usage, electricity costs and the carbon footprint of Internet users. Our empirical investigation shows a difference of up to a factor of 3 in energy usage for video decoding using different codecs at the same picture size and bitrate, on a desktop client system. With video traffic already responsible for the largest and fastest growing proportion of traffic on the Internet, a significant amount of energy, money and carbon output is due to video. We present a simple methodology and metrics that can be used to give an intuitive, quantitative and comparable assessment of the energy usage of video decoding. Providing energy usage information to users would empower them to make sensible choices. We demonstrate how small energy savings for individual client systems could give significant energy savings when considered at a global scale.

Multimedia Applications

Media Experience of Complementary Information and Tweets on a Second Screen BIBAFull-Text 437-446
  Kenta Kusumoto; Teemu Kinnunen; Jari Kätsyri; Heikki Lindroos; Pirkko Oittinen
Television (TV) programs are increasingly augmented with second-screen applications that can provide complementary information and social media interaction related to the program. However, there is a lack of understanding about the optimal design of such applications. In this paper, we study the effects of complementary information and tweets on the media experience indexed by a comprehensive self-report questionnaire. We built a custom second-screen tablet application and conducted an experiment to investigate the effects of complementary information and tweets on the media experience of TV programs. Apart from reducing concentration and immersion, neither the presence of complementary information nor the presence of tweets affected the media experience of the TV programs, whereas the media experience of second-screen content depended on which program it accompanied. Our results demonstrate that the media experience of complementary information and tweets presented on second screens differs between TV programs.
Interactive Line Drawing Recognition and Vectorization with Commodity Camera BIBAFull-Text 447-456
  Pradeep Kumar Jayaraman; Chi-Wing Fu
This paper presents a novel method that interactively recognizes and vectorizes hand-drawn strokes in front of a commodity webcam. Compared to existing methods, which recognize strokes on a completed drawing, our method captures both spatial and temporal information of the strokes, and faithfully vectorizes them with timestamps. By this, we can avoid various stroke recognition ambiguities, enhance the vectorization quality, and recover the stroke drawing order.
   This is a challenging problem, requiring robust tracking of pencil tip, accurate modeling of pen-paper contact, handling pen-paper and hand-paper occlusion, while achieving interactive performance. To address these issues, we develop the following novel techniques. First, we perform robust spatio-temporal tracking of pencil tip by extracting discriminable features, which can be classified with a fast cascade of classifiers. Second, we model the pen-paper contact by analyzing the edge-profile of the acquired trajectory and extracting the portions related to individual strokes. Lastly, we propose a spatio-temporal method to reconstruct meaningful strokes, which are coherent to the stroke drawing continuity and drawing order. By integrating these techniques, our method can support interactive recognition and vectorization of drawn strokes that are faithful to the actual strokes drawn by the user, and facilitate the development of various multimedia applications such as video scribing, cartoon production, and pen input interface.
RAPID: Rating Pictorial Aesthetics using Deep Learning BIBAFull-Text 457-466
  Xin Lu; Zhe Lin; Hailin Jin; Jianchao Yang; James Z. Wang
Effective visual features are essential for computational aesthetic quality rating systems. Existing methods used machine learning and statistical modeling techniques on handcrafted features or generic image descriptors. A recently-published large-scale dataset, the AVA dataset, has further empowered machine learning based approaches. We present the RAPID (RAting PIctorial aesthetics using Deep learning) system, which adopts a novel deep neural network approach to enable automatic feature learning. The central idea is to incorporate heterogeneous inputs generated from the image, which include a global view and a local view, and to unify the feature learning and classifier training using a double-column deep convolutional neural network. In addition, we utilize the style attributes of images to help improve the aesthetic quality categorization accuracy. Experimental results show that our approach significantly outperforms the state of the art on the AVA dataset.
Fashion Parsing with Video Context BIBAFull-Text 467-476
  Si Liu; Xiaodan Liang; Luoqi Liu; Ke Lu; Liang Lin; Shuicheng Yan
In this paper, we explore how to utilize the video context to facilitate fashion parsing. Instead of annotating a large amount of fashion images, we present a general, affordable and scalable solution, which harnesses the rich contexts in easily available fashion videos to boost any existing fashion parser. First, we crawl a large unlabelled fashion video corpus with fashion frames. Then for each fashion video, the cross-frame contexts are utilized for human pose co-estimation, and then video co-parsing to obtain satisfactory fashion parsing results for all frames. More specifically, Sift Flow and super-pixel matching are used to build correspondences across frames, and these correspondences then contextualize the pose estimations and fashion parsing in individual frames. Finally, these parsed video frames are used as the reference corpus for the non-parametric fashion parsing component of the whole solution. Extensive experiments on two benchmark fashion datasets as well as a newly collected challenging Fashion Icon (FI) dataset demonstrate the encouraging performance gain from our general pipeline for fashion parsing.

Privacy, Health and Well-being

Daily Stress Recognition from Mobile Phone Data, Weather Conditions and Individual Traits BIBAFull-Text 477-486
  Andrey Bogomolov; Bruno Lepri; Michela Ferron; Fabio Pianesi; Alex (Sandy) Pentland
Research has proven that stress reduces quality of life and causes many diseases. For this reason, several researchers devised stress detection systems based on physiological parameters. However, these systems require that obtrusive sensors are continuously carried by the user. In our paper, we propose an alternative approach providing evidence that daily stress can be reliably recognized based on behavioral metrics, derived from the user's mobile phone activity and from additional indicators, such as the weather conditions (data pertaining to transitory properties of the environment) and the personality traits (data concerning permanent dispositions of individuals). Our multifactorial statistical model, which is person-independent, obtains an accuracy score of 72.28% for a 2-class daily stress recognition problem. The model is efficient to implement for most multimedia applications due to its highly reduced, low-dimensional feature space (32d). Moreover, we identify and discuss the indicators which have strong predictive power.
Validating an iOS-based Rhythmic Auditory Cueing Evaluation (iRACE) for Parkinson's Disease BIBAFull-Text 487-496
  Shenggao Zhu; Robert J. Ellis; Gottfried Schlaug; Yee Sien Ng; Ye Wang
Movement disorders such as Parkinson's disease (PD) will affect a rapidly growing segment of the population as society continues to age. Rhythmic Auditory Cueing (RAC) is a well-supported evidence-based intervention for the treatment of gait impairments in PD. RAC interventions have not been widely adopted, however, due to limitations in access to personnel, technological, and financial resources. To help "scale up" RAC for wider distribution, we have developed an iOS-based Rhythmic Auditory Cueing Evaluation (iRACE) mobile application to deliver RAC and assess motor performance in PD patients. The touchscreen of the mobile device is used to assess motor timing during index finger tapping, and the device's built-in tri-axial accelerometer and gyroscope to assess step time and step length during walking. Novel machine learning-based gait analysis algorithms have been developed for iRACE, including heel strike detection, step length quantification, and left-versus-right foot identification. The concurrent validity of iRACE was assessed using a clinic-standard instrumented walking mat and a pair of force-sensing resistor sensors. Results from 10 PD patients reveal that iRACE has low error rates (<±1.0%) across a set of four clinically relevant outcome measures, indicating a potentially useful clinical tool.
Towards Efficient Privacy-preserving Image Feature Extraction in Cloud Computing BIBAFull-Text 497-506
  Zhan Qin; Jingbo Yan; Kui Ren; Chang Wen Chen; Cong Wang
As the image data produced by individuals and enterprises is rapidly increasing, Scale-Invariant Feature Transform (SIFT), as a local feature detection algorithm, has been heavily employed in various areas, including object recognition, robotic mapping, etc. In this context, there is a growing need to outsource such image computation with high complexity to the cloud for its economic computing resources and on-demand ubiquitous access. However, how to protect the private image data while enabling image computation becomes a major concern. To address this fundamental challenge, we study the privacy requirements in outsourcing SIFT computation and propose SecSIFT, a high performance privacy-preserving SIFT feature detection system. In previous private image computation works, one common approach is to encrypt the private image in a public key based homomorphic scheme that enables the original processing algorithms designed for the plaintext domain to be performed over the ciphertext domain. In contrast to these works, our system is not restricted by the efficiency limitations of homomorphic encryption schemes. The proposed system distributes the computation procedures of SIFT to a set of independent, co-operative cloud servers, and keeps the outsourced computation procedures as simple as possible to avoid utilizing homomorphic encryption schemes. Thus, it enables implementation with practical computation and communication complexity. Extensive experimental results demonstrate that SecSIFT performs comparably to original SIFT on image benchmarks while being capable of preserving privacy in an efficient way.
User-level psychological stress detection from social media using deep neural network BIBAFull-Text 507-516
  Huijie Lin; Jia Jia; Quan Guo; Yuanyuan Xue; Qi Li; Jie Huang; Lianhong Cai; Ling Feng
It is of significant importance to detect and manage stress before it turns into severe problems. However, existing stress detection methods usually rely on psychological scales or physiological devices, making the detection complicated and costly. In this paper, we explore the automatic detection of individuals' psychological stress via social media. Employing real online micro-blog data, we first investigate the correlations between users' stress and their tweeting content, social engagement and behavior patterns. Then we define two types of stress-related attributes: 1) low-level content attributes from a single tweet, including text, images and social interactions; 2) user-scope statistical attributes through their weekly micro-blog postings, leveraging information of tweeting time, tweeting types and linguistic styles. To combine content attributes with statistical attributes, we further design a convolutional neural network (CNN) with cross autoencoders to generate user-scope content attributes from low-level content attributes. Finally, we propose a deep neural network (DNN) model to incorporate the two types of user-scope attributes to detect users' psychological stress. We test the trained model on four different datasets from major micro-blog platforms including Sina Weibo, Tencent Weibo and Twitter. Experimental results show that the proposed model is effective and efficient in detecting psychological stress from micro-blog data. We believe our model would be useful in developing stress detection tools for mental health agencies and individuals.

Multimedia Search and Indexing

Optimized Distances for Binary Code Ranking BIBAFull-Text 517-526
  Jianfeng Wang; Heng Tao Shen; Shuicheng Yan; Nenghai Yu; Shipeng Li; Jingdong Wang
Binary encoding on high-dimensional data points has attracted much attention due to its computational and storage efficiency. While numerous efforts have been made to encode data points into binary codes, how to calculate the effective distance on binary codes to approximate the original distance is rarely addressed. In this paper, we propose an effective distance measurement for binary code ranking. In our approach, the binary code is firstly decomposed into multiple sub codes, each of which generates a query-dependent distance lookup table. Then the distance between the query and the binary code is constructed as the aggregation of the distances from all sub codes by looking up their respective tables. The entries of the lookup tables are optimized by minimizing the misalignment between the approximate distance and the original distance. Such a scheme is applied to both the symmetric distance and the asymmetric distance. Extensive experimental results show superior performance of the proposed approach over state-of-the-art methods on three real-world high-dimensional datasets for binary code ranking.
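The lookup-and-aggregate scheme described above can be sketched as follows. This is a minimal illustration that assumes binary codes stored as Python integers and hypothetical helper names (`split_code`, `build_tables`, `table_distance`). The table entries here are plain Hamming distances, used only as a placeholder; the paper instead optimizes the entries to approximate the original distance.

```python
def split_code(code, num_bits, sub_bits):
    """Split an integer-encoded binary code into sub codes, most significant first."""
    mask = (1 << sub_bits) - 1
    return [(code >> shift) & mask
            for shift in range(num_bits - sub_bits, -1, -sub_bits)]

def build_tables(query_code, num_bits, sub_bits):
    """Build one query-dependent lookup table per sub code.

    Entry v of a table holds the distance between the query's sub code and the
    sub code value v. Placeholder entries: Hamming distance (an assumption).
    """
    tables = []
    for q_sub in split_code(query_code, num_bits, sub_bits):
        tables.append([bin(q_sub ^ v).count("1") for v in range(1 << sub_bits)])
    return tables

def table_distance(code, tables, num_bits, sub_bits):
    """Aggregate the per-sub-code distances by looking each one up in its table."""
    subs = split_code(code, num_bits, sub_bits)
    return sum(table[s] for table, s in zip(tables, subs))

query = 0b10110010
tables = build_tables(query, num_bits=8, sub_bits=4)
database = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
ranking = sorted(database, key=lambda c: table_distance(c, tables, 8, 4))
```

With Hamming-valued entries the aggregated distance equals the ordinary Hamming distance, so the sketch only shows the mechanics. Replacing the table entries with learned values changes the ranking without changing the lookup cost per code.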
Iterative Multi-View Hashing for Cross Media Indexing BIBAFull-Text 527-536
  Yao Hu; Zhongming Jin; Hongyi Ren; Deng Cai; Xiaofei He
Cross media retrieval engines have gained massive popularity with rapid development of the Internet. Users may perform queries in a corpus consisting of audio, video, and textual information. To make such systems practically possible for large amounts of multimedia data, two critical issues must be carefully considered: (a) reduce the storage as much as possible; (b) model the relationship of the heterogeneous media data. Recently, the academic community has shown that encoding the data into compact binary codes can drastically reduce the storage and computational cost. However, it is still unclear how to integrate multiple information sources properly into the binary code encoding scheme.
   In this paper, we study the cross media indexing problem by learning the discriminative hashing functions to map the multi-view datum into a shared Hamming space. Not only is meaningful within-view similarity required to be preserved, but we also incorporate the between-view correlations into the encoding scheme, where we map the similar points close together and push apart the dissimilar ones. To this end, we propose a novel hashing algorithm called Iterative Multi-View Hashing (IMVH) by taking this information into account simultaneously. To solve this joint optimization problem efficiently, we further develop an iterative scheme to deal with it by using a more flexible quantization model. In particular, an optimal alignment is learned to maintain the between-view similarity in the encoding scheme. The binary codes are obtained by directly solving a series of binary label assignment problems without continuous relaxation to avoid the unnecessary quantization loss. In this way, the proposed algorithm not only greatly improves the retrieval accuracy but also exhibits strong robustness. An extensive set of experiments clearly demonstrates the superior performance of the proposed method against the state-of-the-art techniques on both multimodal and unimodal retrieval tasks.
Rescue Tail Queries: Learning to Image Search Re-rank via Click-wise Multimodal Fusion BIBAFull-Text 537-546
  Xiaopeng Yang; Tao Mei; Yongdong Zhang
Image search engines have achieved good performance for head (popular) queries by leveraging text information and user click data. However, there still remain a large number of tail (rare) queries with relatively unsatisfying search results, which are often overlooked in existing research. Image search for these tail queries therefore provides a grand challenge for research communities. Most existing re-ranking approaches, though effective for head queries, cannot be extended to tail queries. The assumption of these approaches that the re-ranked list should not go far away from the initial ranked list is not applicable to the tail queries. The challenge, thus, lies in how to leverage the possibly unsatisfying initial ranked results and the very limited click data to solve the search intent gap of tail queries.
   To deal with this challenge, we propose to mine relevant information from the very limited click data by leveraging click-wise-based image pairs and query-dependent multimodal fusion. Specifically, we hypothesize that images with more clicks are more relevant to the given query than the ones with no or relatively fewer clicks and the effects of different visual modalities to re-rank images are query-dependent. We therefore propose a novel query-dependent learning to re-rank approach for tail queries, called "click-wise multimodal fusion." The approach can not only effectively expand training data by learning relevant information from the constructed click-wise-based image pairs, but also fully explore the effects of multiple visual modalities by adaptively predicting the query-dependent fusion weights. The experiments conducted on a real-world dataset with 100 tail queries show that our proposed approach can significantly improve initial search results by 10.88% and 9.12% in terms of NDCG@5 and NDCG@10, respectively, and outperform several existing re-ranking approaches.
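For readers unfamiliar with the NDCG@k metric reported above, a minimal sketch follows. It assumes the common formulation with gain 2^rel - 1 and a log2 rank discount; the abstract does not say which NDCG variant the authors used, and the relevance labels below are made up purely for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (gain 2^rel - 1, log2 discount)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels, listed in ranked order.
initial = [0, 2, 1, 0, 2]    # an unsatisfying initial ranking
reranked = [2, 2, 1, 0, 0]   # the same images after an ideal re-ranking

initial_score = ndcg_at_k(initial, 5)
reranked_score = ndcg_at_k(reranked, 5)
```

A re-ranker is judged by how far it lifts this score over the initial ranking, averaged across queries.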
Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search BIBAFull-Text 547-556
  Lu Jiang; Deyu Meng; Teruko Mitamura; Alexander G. Hauptmann
Reranking has been a focal technique in multimedia retrieval due to its efficacy in improving initial retrieval results. Current reranking methods, however, mainly rely on the heuristic weighting. In this paper, we propose a novel reranking approach called Self-Paced Reranking (SPaR) for multimodal data. As its name suggests, SPaR utilizes samples from easy to more complex ones in a self-paced fashion. SPaR is special in that it has a concise mathematical objective to optimize and useful properties that can be theoretically verified. It on one hand offers a unified framework providing theoretical justifications for current reranking methods, and on the other hand generates a spectrum of new reranking schemes. This paper also advances the state-of-the-art self-paced learning research which potentially benefits applications in other fields. Experimental results validate the efficacy and the efficiency of the proposed method on both image and video search tasks. Notably, SPaR achieves by far the best result on the challenging TRECVID multimedia event search task.

Social Media and Crowd

Mining Cross-network Association for YouTube Video Promotion BIBAFull-Text 557-566
  Ming Yan; Jitao Sang; Changsheng Xu
We introduce a novel cross-network collaborative problem in this work: given YouTube videos, to find optimal Twitter followees that can maximize the video promotion on Twitter. Since YouTube videos and Twitter followees distribute on heterogeneous spaces, we present a cross-network association-based solution framework. Three stages are addressed: (1) heterogeneous topic modeling, where YouTube videos and Twitter followees are modeled in topic level; (2) cross-network topic association, where the overlapped users are exploited to conduct cross-network topic distribution transfer; and (3) referrer identification, where the query YouTube video and candidate Twitter followees are matched in the same topic space. Different methods in each stage are designed and compared by qualitative as well as quantitative experiments. Based on the proposed framework, we also discuss the potential applications, extensions, and suggest some principles for future heterogeneous social media utilization and cross-network collaborative applications.
One of a Kind: User Profiling by Social Curation BIBAFull-Text 567-576
  Xue Geng; Hanwang Zhang; Zheng Song; Yang Yang; Huanbo Luan; Tat-Seng Chua
Social Curation Service (SCS) is a new type of emerging social media platform, where users can select, organize and keep track of multimedia contents they like. In this paper, we take advantage of this great opportunity and target the very starting point in social media: user profiling, which supports fundamental applications such as personalized search and recommendation. As compared to other profiling methods in conventional Social Network Services (SNS), our work benefits from the two distinguishable characteristics of SCS: a) organized multimedia user-generated contents, and b) content-centric social network. Based on these two characteristics, we are able to deploy the state-of-the-art multimedia analysis techniques to establish content-based user profiles by extracting user preferences and their social relations. First, we automatically construct a content-based user preference ontology and learn the ontological models to generate comprehensive user profiles. In particular, we propose a new deep learning strategy called multi-task convolutional neural network (mtCNN) to learn profile models and profile-related visual features simultaneously. Second, we propose to model the multi-level social relations offered by SCS to refine the user profiles in a low-rank recovery framework. To the best of our knowledge, our work is the first that explores how social curation can help in content-based social media technologies, taking user profiling as an example. Extensive experiments on 1,293 users and 1.5 million images collected from Pinterest in the fashion domain demonstrate that recommendation methods based on the proposed user profiles are considerably more effective than other state-of-the-art recommendation strategies.
3D Interest Maps From Simultaneous Video Recordings BIBAFull-Text 577-586
  Axel Carlier; Lilian Calvet; Duong Trung Dung Nguyen; Wei Tsang Ooi; Pierre Gurdjos; Vincent Charvillat
We consider an emerging situation where multiple cameras are filming the same event simultaneously from a diverse set of angles. The captured videos provide us with the multiple view geometry and an understanding of the 3D structure of the scene. We further extend this understanding by introducing the concept of 3D interest map in this paper. As most users naturally film what they find interesting from their respective viewpoints, the 3D structure can be annotated with the level of interest, naturally crowdsourced from the users. A 3D interest map can be understood as an extension of saliency maps in the 3D space that captures the semantics of the scene. We evaluate the idea of 3D interest maps on two real datasets, taken from the environment or the cameras that are equipped enough to have an estimation of the poses of cameras and a reasonable synchronization between them. We study two aspects of the 3D interest maps in our evaluation. First, by projecting them into 2D, we compare them to state-of-the-art saliency maps. Second, to demonstrate the usefulness of the 3D interest maps, we apply them to a video mashup system that automatically produces an edited video from one of the datasets.
Crowdsourcing a Reverberation Descriptor Map BIBAFull-Text 587-596
  Prem Seetharaman; Bryan Pardo
Audio production is central to every kind of media that involves sound, such as film, television, and music, and involves transforming audio into a state ready for consumption by the public. One of the most commonly used audio production tools is the reverberator. Current interfaces are often complex and hard to understand. We seek to simplify these interfaces by letting users communicate their audio production objective with descriptive language (e.g. "Make the drums sound bigger."). To achieve this goal, a system must be able to tell whether the stated goal is appropriate for the selected tool (e.g. making the violin warmer using a panning tool does not make sense). If the goal is appropriate for the tool, it must know what actions lead to the goal. Further, the tool should not impose a vocabulary on users, but rather understand the vocabulary users prefer. In this work, we describe SocialReverb, a project to crowdsource a vocabulary of audio descriptors that can be mapped onto concrete actions using a parametric reverberator. We deployed SocialReverb on Mechanical Turk, where 513 unique users described 256 instances of reverberation using 2861 unique words. We used this data to build a concept map showing which words are popular descriptors, which ones map consistently to specific reverberation types, and which ones are synonyms. This promises to enable future interfaces that let the user communicate their production needs using natural language.

Multimedia Recommendations

What Videos Are Similar with You?: Learning a Common Attributed Representation for Video Recommendation BIBAFull-Text 597-606
  Peng Cui; Zhiyu Wang; Zhou Su
Although video recommender systems have become the predominant way for people to obtain video information, their performance is far from satisfactory in that (1) the recommended videos are often mismatched with the users' interests and (2) the recommendation results are, in most cases, hardly understandable for users and therefore cannot persuade them to engage. In this paper, we attempt to address the above problems at the data representation level, and aim to learn a common attributed representation for users and videos in social media with good interpretability, stability and an appropriate level of granularity. The basic idea is to represent videos with users' social attributes, and represent users with content attributes of videos, such that both users and videos can be represented in a common space concatenated by social attributes and content attributes. The video recommendation problem can then be converted into a similarity matching problem in the common space. However, it is still challenging to balance the roles of social attributes and content attributes, learn such a common representation from sparse user-video interactions and deal with the cold-start problem. In this paper, we propose a REgularized Dual-fActor Regression (REDAR) method based on matrix factorization. In this method, social attributes and content attributes are flexibly combined, and social and content information are effectively exploited to alleviate the sparsity problem. An incremental version of REDAR is designed to solve the cold-start problem. We extensively evaluate the proposed method for video recommendation on a real social network dataset, and the results show that, in most cases, the proposed method can achieve a relative improvement of more than 20% compared to state-of-the-art baseline methods.
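The "similarity matching problem in the common space" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes users and videos are already given as vectors over the same concatenated social and content attributes, and it uses cosine similarity as a stand-in matching function. It does not implement REDAR itself; the attribute layout, vectors, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user_vec, videos, top_n):
    """Rank video names by similarity to the user in the shared attribute space."""
    ranked = sorted(videos, key=lambda name: cosine(user_vec, videos[name]), reverse=True)
    return ranked[:top_n]

# Hypothetical layout: [social_attr_1, social_attr_2, content_attr_1, content_attr_2]
user = [1.0, 0.0, 1.0, 0.0]
videos = {
    "a": [1.0, 0.0, 1.0, 0.0],
    "b": [0.0, 1.0, 0.0, 1.0],
    "c": [1.0, 0.0, 0.0, 0.0],
}
top_two = recommend(user, videos, 2)
```

Because both sides live in one space, a recommendation can be explained by the attributes that contribute most to the dot product, which is the kind of interpretability the abstract argues for.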
ADVISOR: Personalized Video Soundtrack Recommendation by Late Fusion with Heuristic Rankings BIBAFull-Text 607-616
  Rajiv Ratn Shah; Yi Yu; Roger Zimmermann
Capturing videos anytime and anywhere, and then instantly sharing them online, has become a very popular activity. However, many outdoor user-generated videos (UGVs) lack a certain appeal because their soundtracks consist mostly of ambient background noise. Aimed at making UGVs more attractive, we introduce ADVISOR, a personalized video soundtrack recommendation system. We propose a fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history. Specifically, we combine confidence scores, produced by SVMhmm models constructed from geographic, visual, and audio features, to obtain different types of video characteristics. Our contributions are threefold. First, we predict scene moods from a real-world video dataset that was collected from users' daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. A series of extensive experiments confirm that our approach performs well and recommends appealing soundtracks for UGVs to enhance the viewing experience.
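The late-fusion idea in the abstract above, combining confidence scores from several models before ranking, can be sketched as a weighted sum over per-model scores. This is only an illustration under stated assumptions: the mood labels, the scores, and the equal weights are hypothetical, and the authors' actual heuristic ranking may differ.

```python
def late_fusion(score_sets, weights):
    """Fuse per-model confidence scores by weighted sum and rank labels, best first."""
    fused = {}
    for scores, weight in zip(score_sets, weights):
        for label, confidence in scores.items():
            fused[label] = fused.get(label, 0.0) + weight * confidence
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical confidence scores from two separately trained models.
geo_scores = {"calm": 0.6, "happy": 0.3, "sad": 0.1}
visual_scores = {"calm": 0.2, "happy": 0.7, "sad": 0.1}

ranking = late_fusion([geo_scores, visual_scores], [0.5, 0.5])
best_label = ranking[0][0]
```

Fusing at the score level keeps each model independent, so a model for a missing modality (for example, no location data) can simply be left out of `score_sets`.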
Social Embedding Image Distance Learning BIBAFull-Text 617-626
  Shaowei Liu; Peng Cui; Wenwu Zhu; Shiqiang Yang; Qi Tian
Image distance (similarity) is a fundamental and important problem in image processing. However, traditional visual features based image distance metrics usually fail to capture human cognition. This paper presents a novel Social embedding Image Distance Learning (SIDL) approach to embed the similarity of collective social and behavioral information into visual space. The social similarity is estimated according to multiple social factors. Then a metric learning method is especially designed to learn the distance of visual features from the estimated social similarity. In this manner, we can evaluate the cognitive image distance based on the visual content of images. Comprehensive experiments are designed to investigate the effectiveness of SIDL, as well as the performance in the image recommendation and reranking tasks. The experimental results show that the proposed approach makes a marked improvement compared to the state-of-the-art image distance metrics. An interesting observation is given to show that the learned image distance can better reflect human cognition.
Improving Content-based and Hybrid Music Recommendation using Deep Learning BIBAFull-Text 627-636
  Xinxi Wang; Ye Wang
Existing content-based music recommendation systems typically employ a two-stage approach. They first extract traditional audio content features such as Mel-frequency cepstral coefficients and then predict user preferences. However, these traditional features, originally not created for music recommendation, cannot capture all relevant information in the audio and thus put a cap on recommendation performance. Using a novel model based on deep belief network and probabilistic graphical model, we unify the two stages into an automated process that simultaneously learns features from audio content and makes personalized recommendations. Compared with existing deep learning based models, our model outperforms them in both the warm-start and cold-start stages without relying on collaborative filtering (CF). We then present an efficient hybrid method to seamlessly integrate the automatically learnt features and CF. Our hybrid method not only significantly improves the performance of CF but also outperforms the traditional feature-based hybrid method.

Doctoral Symposium 1

Medical case retrieval BIBAFull-Text 639-642
  Mario Taschwer
The proposed PhD project addresses the problem of finding descriptions of diseases or patients' health records that are relevant for a given description of patient's symptoms, also known as medical case retrieval (MCR). Designing an automatic multimodal MCR system applicable to general medical data sets still presents an open research problem, as indicated by the ImageCLEF 2013 MCR challenge, where the best submitted runs achieved only moderate retrieval performance and used purely textual techniques. This project therefore aims at designing a multimodal MCR model that is capable of achieving a substantially better retrieval performance on the ImageCLEF data set than state-of-the-art techniques. Moreover, the potential of further improvement by leveraging relevance feedback of medical expert users for long-term learning will be investigated.
Mobile Video Broadcasting Services: Combining Video Composition and Network Efficient Transmission BIBAFull-Text 643-646
  Stefan Wilk; Wolfgang Effelsberg
Mobile video broadcasting services offer the opportunity to instantly share video from mobile handhelds to a large audience over the Internet. However, the quality of those video streams is often reduced by the lack of skills of recording users and the technical limitations of the video capturing devices. Our research focuses on large-scale events that attract dozens of users to record video in parallel. In such scenarios, our aim is to design a collaborative mobile video broadcasting service that composes video streams from multiple sources to create one video stream with enhanced perceived quality. Our realization of such a service combines high quality video streaming and at the same time reduces the network load.
Music-information retrieval in environments containing acoustic noise BIBAFull-Text 647-650
  David Grunberg
In the field of Music-Information Retrieval (Music-IR), algorithms are used to analyze musical signals and estimate high-level features such as tempi and beat locations. These features can then be used in tasks to enhance the experience of listening to music. Most conventional Music-IR algorithms are trained and evaluated on audio that is taken directly from professional recordings with little acoustic noise. However, humans often listen to music in noisy environments, such as dance clubs, crowded bars, and outdoor concert venues. Music-IR algorithms that could function accurately even in these environments would therefore be able to reliably process more of the audio that humans hear. In this paper, I propose methods to perform Music-IR tasks on music that has been contaminated by acoustic noise. These methods incorporate algorithms such as Probabilistic Latent Component Analysis (PLCA) and Harmonic-Percussive Source Separation (HPSS) in order to identify important elements of the noisy musical signal. As an example, a noise-robust beat tracker utilizing these techniques is described.
Automated Multi-Track Mixing and Analysis of Instrument Mixtures BIBAFull-Text 651-654
  Jeffrey Scott
Access to hardware and software tools for producing music has become commonplace in the digital landscape. While the means to produce music have become widely available, significant time must be invested to attain professional results. Mixing multi-channel audio requires techniques and training far beyond the knowledge of the average music software user. Achieving balance and clarity in a mixture comprising a multitude of instrument layers requires experience in evaluating and modifying the individual elements and their sum. Creating a mix involves many technical concerns (level balancing, dynamic range control, stereo panning, spectral balance) as well as artistic decisions (modulation effects, distortion effects, side-chaining, etc.). This work proposes methods to model the relationships between a set of multi-channel audio tracks based on short-time spectral-temporal characteristics and long term dynamics. The goal is to create a parameterized space based on high level perceptual cues to drive processing decisions in a multi-track audio setting.

Doctoral Symposium 2

Local Selection of Features for Image Search and Annotation BIBAFull-Text 655-658
  Jichao Sun
In image applications, direct representations of images typically involve hundreds or thousands of features and not all the features are relevant for any given object. Errors introduced into similarity measurements by irrelevant or noisy features are detrimental to the semantic performance of content-based image retrieval. Feature selection techniques can be used to identify indiscriminative features from the entire image database. However, such a global approach neglects the possibility that the feature importance may vary across different images or classes of images. We propose several techniques for the local selection of features for image databases. By checking the local neighborhood of each image, our methods determine the feature importance with respect to the image and select different feature sets for individual images. We also design methods based on the proposed local selection schemes for KNN graph construction, image search, and graph-based image annotation. We provide experimental results on two image datasets to demonstrate the effectiveness of our methods.
Segmentation and Indexing of Endoscopic Videos BIBAFull-Text 659-662
  Manfred Jürgen Primus
Over the last few years it has become common to archive video recordings of endoscopic surgeries. These videos are of high value for medics, junior doctors, patients and hospital management but currently they are used rarely or not at all. Each day tens to hundreds of hours of new videos are added to archives without metadata that would support content-based search. In order to fully utilize these videos it is necessary to analyze the content of the recordings. Endoscopic videos are in some aspects fundamentally different to other types of videos. Therefore, pre-existing content-based analysis methods must be tested for their ability to operate with this kind of video and, if required, they must be adapted or new methods must be found. In particular, we address video segmentation and indexing in this work. We present our preliminary work and ideas for future work to add content-based information to endoscopic videos.
Learning recognition of semantically relevant video segments from endoscopy videos contributed and edited in a private social network BIBAFull-Text 663-666
  Desara Xhura
Besides the great benefit of minimizing intrusions made in the body, endoscopic surgery has the advantage of producing abundant documentation regarding the procedure as well. Recordings can be used not only to document the surgery but also as a means for learning and improving experts' knowledge. To minimize the time and effort that experts invest in preparing informative endoscopic videos, tools that can automatically identify interesting parts in videos are needed. To achieve this, an annotated data set is required. This paper presents an approach for collecting endoscopic videos and related experts' knowledge. For this, a social network with integrated video annotation and presentation tools is used. Experts can upload, annotate and share their videos with other physicians. In the background their interactions with the videos are recorded, interpreted and used to derive predictive models or improve existing ones. Once a prediction model is derived, its results will be displayed to physicians as suggestions, which can be integrated into their video annotations. Physicians' choice to either keep these suggestions or discard them will serve as feedback to the learned model and will be used to refine the derived knowledge.
Multimodal Alignment of Videos BIBAFull-Text 667-670
  Mario Guggenberger
Most multimedia synchronization methods developed in the past are unimodal and consider only the audio data or the video data. Just recently, methods started to emerge that embrace multimodality by utilizing both audio and video processing to improve synchronization results. Although promising, their results are still not sufficient for fully automatic synchronization of recordings from heterogeneous sources. Video processing is also often too expensive to be used on large corpora of recordings, e.g. as they are commonly produced by crowds at social events. In my doctoral thesis, I will try to develop synchronization methods further by (a) examining fundamental problems that are usually ignored by lab-developed methods and therefore compromise real-world applications, (b) creating a publicly available synchronization-method benchmarking dataset, and (c) developing a low-level video feature based synchronization method with a computational complexity not higher than current state-of-the-art audio-based methods.

Open Source Software Competition 1

libLDB: a library for extracting ultrafast and distinctive binary feature description BIBAFull-Text 671-674
  Xin Yang; Chong Huang; Kwang-Ting (Tim) Cheng
This paper gives an overview of libLDB -- a C++ library for extracting an ultrafast and distinctive binary feature LDB (Local Difference Binary) from an image patch. LDB directly computes a binary string using simple intensity and gradient difference tests on pairwise grid cells within the patch. Relying on integral images, the average intensity and gradients of each grid cell can be obtained by only 4~8 add/subtract operations, yielding an ultrafast runtime. A multiple gridding strategy is applied to capture the distinct patterns of the patch at different spatial granularities, leading to a high distinctiveness of LDB. LDB is very suitable for vision apps which require real-time performance, especially for apps running on mobile handheld devices, such as real-time mobile object recognition and tracking, markerless mobile augmented reality, and mobile panorama stitching. This software is available under the GNU General Public License (GPL) v3.
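The integral-image trick described in the abstract above can be sketched in a few lines. The example below is a simplified illustration, not the libLDB API: it covers only the average-intensity test on a single grid (no gradient tests, no multiple gridding), and all function names are hypothetical. Once the integral image is built, each cell sum costs a fixed handful of additions and subtractions regardless of cell size.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def cell_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in rows y0..y1-1 and columns x0..x1-1, using four table lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

def intensity_bits(img, grid):
    """Binary string from pairwise comparisons of mean intensity over grid x grid cells."""
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    ch, cw = h // grid, w // grid
    means = [
        cell_sum(ii, r * ch, c * cw, (r + 1) * ch, (c + 1) * cw) / (ch * cw)
        for r in range(grid)
        for c in range(grid)
    ]
    return [
        1 if means[i] > means[j] else 0
        for i in range(len(means))
        for j in range(i + 1, len(means))
    ]

# A hypothetical 4x4 patch split into a 2x2 grid of 2x2 cells.
patch = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [5, 5, 3, 3],
    [5, 5, 3, 3],
]
bits = intensity_bits(patch, 2)
```

Descriptors of this form are compared with Hamming distance, which is why binary features are attractive on mobile hardware.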
Caffe: Convolutional Architecture for Fast Feature Embedding BIBAFull-Text 675-678
  Yangqing Jia; Evan Shelhamer; Jeff Donahue; Sergey Karayev; Jonathan Long; Ross Girshick; Sergio Guadarrama; Trevor Darrell
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments.
   Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
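The throughput figure in the Caffe abstract above can be checked with simple arithmetic: 40 million images per day on one GPU corresponds to roughly 2 ms per image. The short calculation below only restates the numbers given in the abstract; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds
images_per_day = 40_000_000         # figure quoted in the abstract

ms_per_image = SECONDS_PER_DAY / images_per_day * 1000
images_per_second = images_per_day / SECONDS_PER_DAY

print(f"{ms_per_image:.2f} ms per image, about {images_per_second:.0f} images per second")
```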
Menpo: A Comprehensive Platform for Parametric Image Alignment and Visual Deformable Models BIBAFull-Text 679-682
  Joan Alabort-i-Medina; Epameinondas Antonakos; James Booth; Patrick Snape; Stefanos Zafeiriou
The Menpo Project, hosted at http://www.menpo.io, is a BSD-licensed software platform providing a complete and comprehensive solution for annotating, building, fitting and evaluating deformable visual models from image data. Menpo is a powerful and flexible cross-platform framework written in Python that works on Linux, OS X and Windows. Menpo has been designed to allow for easy adaptation of Lucas-Kanade (LK) parametric image alignment techniques, and goes a step further in providing all the necessary tools for building and fitting state-of-the-art deformable models such as Active Appearance Models (AAMs), Constrained Local Models (CLMs) and regression-based methods (such as the Supervised Descent Method (SDM)). These methods are extensively used for facial point localisation although they can be applied to many other deformable objects. Menpo makes it easy to understand and evaluate these complex algorithms, providing tools for visualisation, analysis, and performance assessment. A key challenge in building deformable models is data annotation; Menpo expedites this process by providing a simple web-based annotation tool hosted at http://www.landmarker.io. The Menpo Project is thoroughly documented and provides extensive examples for all of its features. We believe the project is ideal for researchers, practitioners and students alike.

Open Source Software Competition 2

VideoLat: An Extensible Tool for Multimedia Delay Measurements BIBAFull-Text 683-686
  Jack Jansen
When using a videoconferencing system there will always be a delay from sender to receiver. Such delays affect human communication, and therefore knowing the delay is a major factor in judging the expected quality of experience of the conferencing system. Additionally, for implementors, tuning the system to reduce delay requires an ability to effectively and easily gather delay metrics on a potentially wide range of settings. In order to support this process, we make available a system called videoLat. VideoLat provides an innovative approach to understand glass-to-glass video delays and speaker-to-microphone delays. VideoLat can be used as-is to do audio and video roundtrip delays of black box systems, but by making it available as open source we want to enable people to extend and modify it for different scenarios, such as measuring one-way delays or delay of camera switching.
The Yael Library BIBAFull-Text 687-690
  Matthijs Douze; Hervé Jégou
This paper introduces Yael, a library implementing computationally intensive functions used in large scale image retrieval, such as neighbor search, clustering and inverted files. The library offers interfaces for C, Python and Matlab. Along with a brief tutorial, we analyze and discuss some of our implementation choices, and their impact on efficiency.
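As a point of reference for the neighbor-search function mentioned in the abstract above, exhaustive k-nearest-neighbor search can be written as below. This is a plain-Python sketch for illustration only. It does not use or reproduce the Yael API, and the function name and data are hypothetical. Libraries such as Yael exist because this computation must be heavily optimized at large scale.

```python
def knn(query, database, k):
    """Return indices of the k database vectors closest to query (squared L2 distance)."""
    distances = [
        (sum((q - x) ** 2 for q, x in zip(query, vector)), index)
        for index, vector in enumerate(database)
    ]
    distances.sort()  # ties are broken by the lower index
    return [index for _, index in distances[:k]]

# Hypothetical 2-D descriptors.
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = knn([0.9, 0.9], database, 2)
```

The cost grows linearly with the database size, which is the motivation for the clustering and inverted-file structures the abstract also lists.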
Loki+Lire: a framework to create web-based multimedia search engines BIBAFull-Text 691-694
  Giuseppe Becchi; Marco Bertini; Lorenzo Cioni; Alberto Del Bimbo; Andrea Ferracani; Daniele Pezzatini; Mathias Lux
In this paper we present Loki+Lire, a framework for the creation of web-based interfaces for search, annotation and presentation of multimedia data. The framework provides tools to ingest, transcode, present, annotate and index different types of media such as images, videos, audio files and textual documents. The front-end is compliant with the latest HTML5 standards, while the back-end allows system administrators to create processing pipelines that can be adapted for different tasks and purposes.
   The system has been developed in a modular way, aiming at creating loosely coupled components, letting developers use it as a whole or select only the parts that are needed to develop their own tools and systems.

Art Exhibit

Audiovisual Resynthesis in an Augmented Reality BIBAFull-Text 695-698
  Parag Kumar Mital
"Resynthesizing Perception" immerses participants within an audiovisual augmented reality using goggles and headphones while they explore their environment. What they hear and see is a computationally generative synthesis of what they would normally hear and see. By demonstrating the associations and juxtapositions the synthesis creates, the aim is to bring to light questions of the nature of representations supporting perception. Two modes of operation are possible. In the first model, while a participant is immersed, salient auditory events from the surrounding environment are stored and continually aggregated to a database. Similarly, for vision, using a model of exogenous attention, proto-objects of the ongoing dynamic visual scene are continually stored using a camera mounted to goggles on the participant's head. The aggregated salient auditory events and proto-objects form a set of representations which are used to resynthesize the microphone and camera inputs in real-time. In the second model, instead of extracting representations from the real world, an existing database of representations already extracted from scenes such as images of paintings and natural auditory scenes are used for synthesizing the real world. This work was previously exhibited at the Victoria and Albert Museum in London. Of the 1373 people to participate in the installation, 21 participants agreed to be filmed and fill out a questionnaire. We report their feedback and show that "Resynthesizing Perception" is an engaging and thought-provoking experience questioning the nature of perceptual representations.
Sound-Light Giblet BIBAFull-Text 699-700
  Charles Roberts
We describe an audiovisual live coding performance created using Gibber, a creative coding environment that runs in the browser. The performance takes advantage of novel affordances for rapidly creating music, shaders, and mappings that tie together audio and visual modalities.
Gone: an interactive experience for two people BIBAFull-Text 701-704
  Michael Riegler; Mathias Lux; Christian Zellot; Lukas Knoch; Horst Schnattler; Sabrina Napetschnig; Julian Kogler; Claus Degendorfer; Norbert Spot; Manuel Zoderer
In this paper we present a multimedia art installation which uses human computer interaction to give the user the possibility to experience how blind people perceive the world and how hard it can be. The installation follows the classical co-operative game pattern, where a group of people -- in this case two -- have to work together to achieve a goal. With these basic characteristics in mind we consider our submission a serious game. With this serious game we show how multimedia can be used for making people more attentive for problems of other people and that multimedia, and in this case serious games, are indeed a useful tool for letting people experience situations they have not been aware of before.
Circles and Sounds BIBAFull-Text 705-708
  Sarah Linebaugh
The exploration of real-time, interactive motion graphics in a live environment is a response to the exponential growth of technology in culture and the human desire to engage in meaningful experiences. Interactive installations draw from the concepts of the New Aesthetic and Relational Aesthetics to create art that depends on cutting-edge technology while emphasizing human interaction and connectedness. The use of graphics that respond in real-time is a hallmark of the medium. Immediate responses from the technology allow the user to engage with the artwork instantaneously. By doing so, they can create meanings that are unique to themselves.
   My interactive art installation, Circles and Sounds, combines art and technology in a way that will allow the user to engage with the technology to create their own real-time graphics and sound experience that is completely unique to them.
Qi Visualizer: An Interactive Pulse Spectrogram Visualization using Mobile Participatory Biometrics BIBAFull-Text 709-712
  Yuan-Yi Fan
Qi is an ancient concept of life energy in various cultures and it represents continuous ineffable dialogues between the human heart and the body. The goal of Qi Visualizer is to translate the ineffable dialogues into a collective data visualization of pulse spectrograms via audience participation using smartphone photoplethysmography. Informed by our earlier studies, we believe the spectrogram is the most effective visual language to invoke broader perception of Qi across cultures and disciplines. In our design process, we extend our notion of audience as constituents of a system for collective expression in this novel Qi Visualizer interactive installation. Specifically, we consider audience members as objects and their responses as processes in a system to guide our visualization strategies. The significance of this work is twofold. In terms of data visualization, we streamline data acquisition and visualization processes into an immediate interactive experience in the Qi Visualizer installation. In terms of interactive art, the significance may be conceived as a new process shifting tactile-textual translations in a pulse examination to imaging-visual transformations of the pulse spectrogram visualization via audience participation. Similar to a voiceprint, a heart rate harmonic signature in the spectrogram is seen as an artistic interpretation of Qi in this installation.
Stoicheia: Architecture, Sound and Tesla's Apotheosis BIBAFull-Text 713-716
  F. Myles Sciotto; Jean-Michel Crettaz
Stoicheia is an immersive installation focusing on the making of a synthetic ecology and informing the process of architecture using sound. After the reception of enigmatic radio signals in 1899, Tesla began work for many years to perfect the receiving and transmitting equipment that was needed to better pick up and translate his aural discoveries. Stoicheia explores similar notions of Tesla's Aether using specific data feeds and scanning techniques to create a dynamic spatial soundscape drawing upon an experience of immediacy, and ephemeral interplay between the agencies that maintain a continuously self-adaptive sonic and physical environment.
States of Diffusion for n+1 devices BIBAFull-Text 717-719
  Lonce Wyse
States of Diffusion is a participative audio installation artwork that explores patterns of sound transformation that move through space as they evolve spectrally. Six one-minute segments exploring themes of intonation and timing of events across possibly many different audio sources are coordinated via a web server. The spatial diffusion flows through mobile devices held by visitors to the gallery. Visitors also exert a subtle influence over sound through the movement of their device and their movement through the gallery.

Demos 1: Searching and Finding

Fine-grained Visual Faceted Search BIBAFull-Text 721-722
  Julien Champ; Alexis Joly; Pierre Bonnet
This paper presents an interactive information retrieval scheme extending classical faceted search with fine-grained visual facets. Consistent visual words are extracted offline by constructing a matching graph at the region image level and splitting the graph in relevant clusters. Visual and textual facets are then indexed in the same faceted search engine, all of them being dynamically reranked according to their filtering efficiency before being presented to the end-user. To enhance their interpretability, each facet value is illustrated by a representative picture automatically selected from the matching graph in an adaptive way.
Real-time query-by-image video search system BIBAFull-Text 723-724
  Andre Araujo; David Chen; Peter Vajda; Bernd Girod
We demonstrate a novel multimedia system that continuously indexes videos and enables real-time search using images, with a broad range of potential applications. Television shows are recorded and indexed continuously, and iconic images from recent events are discovered automatically. Users can query with an uploaded image or an image on the web. When a result is served, the user can play the video clip from the beginning or from the point in time where the retrieved image was found.
Automatic Real-Time Zooming and Panning on Salient Objects from a Panoramic Video BIBAFull-Text 725-726
  Vamsidhar Reddy Gaddam; Ragnar Langseth; Håkon Stensland; Carsten Griwodz; Pål Halvorsen; Øystein Landsverk
The proposed demo shows how our system automatically zooms and pans into tracked objects in panorama videos. At the conference site, we will set up a two-camera version of the system, generating live panorama videos, where the system zooms and pans tracking people using colored hats. Additionally, using a stored soccer game video from a five 2K camera setup at Alfheim stadium in Tromsø from the European league game between Tromsø IL and Tottenham Hotspur, the system automatically follows the ball.
Event Detection in Broadcasting Video for Halfpipe Sports BIBAFull-Text 727-728
  Hao-Kai Wen; Wei-Che Chang; Chia-Hu Chang; Yin-Tzu Lin; Ja-Ling Wu
In this work, a low-cost and efficient system is proposed to automatically analyze the halfpipe (HP) sports videos. In addition to the court color ratio information, we find the player region by using salient object detection mechanisms to face the challenge of motion blurred scenes in HP videos. Besides, a novel and efficient method for detecting the spin event is proposed on the basis of native motion vectors existing in a compressed video. Experimental results show that the proposed system is effective in recognizing the hard-to-be-detected spin events in HP videos.
Wally: A Scalable Distributed Automated Video Surveillance System with Rich Search Functionalities BIBAFull-Text 729-730
  Jianquan Liu; Shoji Nishimura; Takuya Araki
In this demo, we present Wally, a scalable distributed automated video surveillance system. Wally provides users with rich search functionalities for surveillance, such as instant and historical search based on face or clothing feature by specifying multiple conditions. Unlike most existing systems that focus on video analysis but do not well consider system scalability and search functionality, Wally takes into account both of them, and is developed as a commercial level system for realistic surveillance. To achieve this, we 1) employ relational database, in-memory cache, key-value store, and indexing techniques to ensure the system scalability when it meets Big Data, and 2) develop user-friendly interfaces to enable users to search key frames by rich functionalities. It is estimated by our extensive evaluations that Wally can complete a search task within 50 minutes from the videos captured one day by 1000 real-time cameras, while human monitoring needs 1000x24 man-hours.
Deep Search with Attribute-aware Deep Network BIBAFull-Text 731-732
  Junshi Huang; Wei Xia; Shuicheng Yan
In this demo, we present Deep Search, an attribute-aware fashion-related retrieval system, based on the convolutional neural network by taking clothes as a concrete example. In this system, the visual appearance and semantic attributes of products can be seamlessly integrated into our model in a "dividing and combining" manner. Rather than modelling one attribute per classifier, we design a holistic tree-structure network for exploiting multiple attributes simultaneously, with each branch of network corresponding to one attribute. Therefore, the high-level feature from the conjunction layer of the network can comprehensively preserve both the visual and semantic information of products. The promising retrieval results indicate the great potential of neural feature for attribute-aware retrieval task. The supplementary slides can be found in http://goo.gl/zpobN6.
Virtual Director Adapting Visual Presentation to Conversation Context in Group Videoconferencing: An Interactive Demo BIBAFull-Text 733-734
  Rene Kaiser; Wolfgang Weiss; Manolis Falelakis; Marian F. Ursu
In this technical demonstration, we invite visitors to join a group videoconferencing session with remote participants across Europe, using the Vconect system. This research prototype automatically takes decisions about the presentation of available video streams by processing the audio stream from each participant. It observes patterns in turn-taking behavior and dynamically computes conversation metrics to automatically decide, for each conversation node's screen, how to compose (in layout) and mix (in time) the streams originating from the other nodes. The demo invites discussion of future possibilities in video-mediated group communication beyond the currently implemented functionality of the demo system.
SmartVisio: Interactive Sketch Recognition with Natural Correction and Editing BIBAFull-Text 735-736
  Jie Wu; Changhu Wang; Liqing Zhang; Yong Rui
In this work, we introduce the SmartVisio system for interactive hand-drawn shape/diagram recognition. Different from existing work, SmartVisio is a real-time sketch recognition system based on Visio, to recognize hand-drawn flowchart/diagram with flexible interactions. This system enables a user to draw shapes or diagrams on the Visio interface, and then the hand-drawn shapes are automatically converted to formal shapes in real-time. To satisfy the interaction needs from common users, we propose an algorithm to detect a user's correction and editing during drawing, and then recognize in real time. We also propose a novel symbol recognition algorithm to better recognize or differentiate some visually similar shapes. By enabling users' natural correction/editing on various shapes, our system makes flowchart/diagram production much more natural and easier.

Demos 2: Senses and Sensors

Taste+: Digitally Enhancing Taste Sensations of Food and Beverages BIBAFull-Text 737-738
  Nimesha Ranasinghe; Kuan-Yi Lee; Gajan Suthokumar; Ellen Yi-Luen Do
In recent years, digital media technologies have expanded their horizons into non-traditional sensory modalities such as the sense of taste. This demonstration presents 'taste+,' which digitally improves the taste sensations of food and beverages without additional flavoring ingredients. It primarily utilizes weak and controlled electrical pulses on the human tongue to enhance sourness, saltiness, and bitterness of food or beverages. Taste+ consists of two prototype utensils, a bottle and a spoon. Both utensils have embedded electronic control modules to achieve enhanced taste sensations. These modules apply controlled electrical pulses on the tongue through silver electrodes attached to the mouth pieces. Furthermore, different superimposed colors symbolize distinct taste sensations: lemon green represents sour, ocean blue represents salty, and red (as in red wine) represents bitter. The initial experimental results suggested that sourness and saltiness are the main sensations that could be evoked, while bitterness has comparatively mild responses. We also observed several mixed sensations, such as salty and sour together.
Reverbalize: A Crowdsourced Reverberation Controller BIBAFull-Text 739-740
  Prem Seetharaman; Bryan Pardo
One of the most commonly used audio production tools is the reverberator. Reverberators apply subtle or large echo effects to sound and are typically used in commercial audio recordings. Current reverberator interfaces are often complex and hard to understand. In this work, we describe Reverbalize, a novel and easy-to-use interface for a reverberator. Reverbalize uses crowdsourced data to create a 2-dimensional map of adjectives used to describe reverberation (e.g., "underwater"). Adjacent words describe similar reverberation effects. Word size correlates with agreement on the definition of a word. To use Reverbalize, the user simply clicks on the adjective that best describes the desired effect. The tool modifies the sound accordingly. A text search box also lets the user type in the desired word.
SynthAssist: an audio synthesizer programmed with vocal imitation BIBAFull-Text 741-742
  Mark Cartwright; Bryan Pardo
While programming an audio synthesizer can be difficult, if a user has a general idea of the sound they are trying to program, they may be able to imitate it with their voice. In this technical demonstration, we demonstrate SynthAssist, a system that allows the user to program an audio synthesizer using vocal imitation and interactive feedback. This system treats synthesizer programming as an audio information retrieval task. To account for the limitations of the human voice, it compares vocal imitations to synthesizer sounds by using both absolute and relative temporal shapes of relevant audio features, and it refines the query and feature weights using relevance feedback.
Eat as much as you can: a Kinect-based facial rehabilitation game based on mouth and tongue movements BIBAFull-Text 743-744
  Yong-Xiang Wang; Li-Yun Lo; Min-Chun Hu
In this demo, we present a Kinect-based interactive game which provides patients of facial palsy with a better and more fun way to perform facial physical therapy. By letting the user get scores when he/she bites or licks the virtual foods falling from the sky, the game goads the user into exercising their facial muscles. A robust bite action detector and an efficient tip-of-tongue locating method are developed to realize this game. Based on the head pose estimated by the depth clue, we classify the face orientation and locate the mouth region in each frame. The movements of lips are measured to detect the bite action, and different strategies are designed to more accurately detect the tip-of-tongue for various face orientations. The experimental results show that our method can be successfully applied to different users.
A Multimedia E-Health Framework Towards An Interactive And Non-Invasive Therapy Monitoring Environment BIBAFull-Text 745-746
  Ahmad Qamar; Imad Afyouni; Delwar Hossain; Faizan Ur Rehman; Asad Toonsi; Mohamed Abdur Rahman; Saleh Basalamah
This paper presents a multimedia e-health framework to conduct therapy sessions by collecting live therapeutic data of patients in a non-invasive way. Using our proposed framework, a therapist can model complex gestures by mapping them to a set of primitive actions and generate high-level therapies. Two inexpensive 3D motion tracking sensors, a Kinect and a Leap, are used to collect motion data of a given subject. Data can then be displayed on screen in a live manner or recorded for offline replaying and analysis. The system incorporates an intelligent authoring tool for therapy design, and produces live plots that show the quality of improvement metrics for a given patient.
EventEye: Monitoring Evolving Events from Tweet Streams BIBAFull-Text 747-748
  Hongyun Cai; Zhongxian Tang; Yang Yang; Zi Huang
With the rapid growth in popularity of social websites, social event detection has become one of the hottest research topics. However, continuously monitoring social events has not been well studied. In this demo, we present a novel system called EventEye to effectively monitor evolving events and visualize their evolving paths, which are discovered from tweet streams. In particular, four event operations are defined for our proposed stream clustering algorithm to capture the evolutions over time and a multi-layer indexing structure is designed to support efficient event search from large-scale event databases. In our system, events are visualized in different views, including evolution graph, timeline, map view, etc.
3D Immersive Cardiopulmonary Resuscitation (CPR) Trainer BIBAFull-Text 749-750
  Yuan Tian; Suraj Raghuraman; Yin Yang; Xiaohu Guo; Balakrishnan Prabhakaran
Cardiopulmonary resuscitation (CPR) plays a primary role in first-aid treatment. Instead of the traditional instructor-led training course, we propose a virtual reality system which provides an immersive 3D environment for CPR training with visual and haptic feedback. To simulate a real world CPR experience, our immersive trainer system enables a trainee to perform CPR compressions to a virtual human, inside the virtual world. During the training procedure, the trainee can not only watch his/her 3D image performing CPR, but also feel the force feedback from the chest compressions in real-time. To further enhance the visual fidelity, a haptic-enabled deformable model is applied to show the visual change of chest during compression.
Taking good selfies on your phone BIBAFull-Text 751-752
  Mei-Chen Yeh; Hsiao-Wei Lin
Selfies are in vogue because smartphones and social media websites have helped popularize the phenomenon whereby people can easily take and share self-portrait photographs. However, taking good selfies is not always a trivial task for end users. In this demonstration we present a tool for guiding users on which angles to take selfies from. Dissimilar to existing techniques that enhance the aesthetic quality of selfies through a post-processing procedure, the proposed approach focuses on the creation of a good look before the photo is captured. A few distinctive patterns automatically discovered from a collection of online profile pictures are used to score a selfie in angle. The technique should enable new features such as "virtual portraitist" for current consumer cameras and smartphones, providing an interactive approach for capturing selfies.

Demos 3: Systems

WeMash: An Online System for Web Video Mashup BIBAFull-Text 753-754
  Peng Wang; Yang Yang; Zi Huang; Jiewei Cao; Heng Tao Shen
The explosive growth of video content on the Web has been revolutionizing the way people share, exchange and perceive information. At the same time, however, the massive amount of Web video content may degrade user experience in that it is tedious and time-consuming to view relevant videos uploaded by different users and grasp the gist of their content. In this demonstration, we demonstrate a system that takes Web videos related to a query event as the input and automatically generates a short mashup video that can well describe the essence of the query. Specifically, we first identify shot cliques composed of near-duplicate shots and infer their semantic meanings by resorting to relevant Web images. Then a shot clique selection algorithm is performed to select representative content according to the twofold importance of the shot cliques evaluated, namely content importance and semantic importance, with content diversity considered. Finally, the selected shot cliques are organized into a mashup video based on semantic and temporal cues.
FreeViewer: An Intelligent Director for 3D Tele-Immersion System BIBAFull-Text 755-756
  Zhenhuan Gao; Shannon Chen; Klara Nahrstedt
This paper proposes FreeViewer, a 3D Tele-Immersion view-control system that allows viewers to see an arbitrary side of the performer by intelligently choosing the streams of a subset of cameras and changing the point of view in a 3D virtual space. The view change is actuated by changes in the sensor data from wearable devices (e.g., Google Glass, smartphone) on the performer, which monitor the performer's current orientation.
MiSCon: a hot plugging tool for real-time motion-based system control BIBAFull-Text 757-758
  Jun Chen; Chaokun Wang; Lei Yang; Qingfu Wen; Xu Wang
In this demonstration, we propose a hot-plugging tool for real-time motion-based system control, which is more portable and application-independent than existing commercial motion-based sensing devices such as Kinect, Wii and PlayStation Move. This tool captures and recognizes people's real-time motions through the built-in camera of PCs, mobile phones or tablets, and automatically executes the system events that have been mapped to people's customized body motions, e.g., of the head and the fist. The tool frees people from the conventional ways of playing games and using applications, and enables them to customize their preferred ways to control the systems.
CeleLabel: an interactive system for annotating celebrities in web videos BIBAFull-Text 759-760
  Zhineng Chen; Jinfeng Bai; Chong-Wah Ngo; Bailan Feng; Bo Xu
Manual annotation of celebrities in Web videos is an essential task in many people-related Web services. The task, however, poses a significant challenge even to skillful annotators, mainly due to the large quantity of unfamiliar and greatly varied celebrities, and the lack of a customized system for it. This work develops CeleLabel, an interactive system for manually annotating celebrities in the Web video domain. The peculiarity of CeleLabel is to exploit and display multiple types of information that could assist the annotation, including video content, context surrounding and within a video, celebrity images on the Web, and human factors. Using the system, annotators can interactively switch between two views, i.e., merging similar faces and labeling faces with names, to approach the annotation. User studies show that the CeleLabel leads to a much better labeling efficiency and satisfaction.
FoodCam-256: A Large-scale Real-time Mobile Food Recognition System employing High-Dimensional Features and Compression of Classifier Weights BIBAFull-Text 761-762
  Yoshiyuki Kawano; Keiji Yanai
In the demo, we demonstrate a large-scale food recognition system employing high-dimensional Fisher Vectors and linear one-vs-rest classifiers. Since all image recognition processing is performed on the smartphone, the system does not require an external image recognition server, and runs on an ordinary smartphone in real time.
   The proposed system can recognize 256 kinds of food by using the UEC-Food256 food image dataset, which we recently built ourselves, as a training dataset. To implement an image recognition system employing high-dimensional features on mobile devices, we propose a linear weight compression method to save memory. In the experiments, we show that the proposed compression method causes little performance loss while reducing the size of the weight vectors to 1/8. In addition to food recognition, the proposed system can estimate food calories and nutrition and record a user's eating habits.
   In the experiments with 100 kinds of food categories, we achieved a 74.4% classification rate for the top 5 category candidates. The prototype system is open to the public as an Android-based smartphone application.
HMD Viewing Spherical Video Streaming System BIBAFull-Text 763-764
  Daisuke Ochi; Yutaka Kunita; Kensaku Fujii; Akira Kojima; Shinnosuke Iwaki; Junichi Hirose
We propose a video streaming system that lets users view their favorite sections of a video recorded by a spherical (360-degree) camera through an HMD (head-mounted display). Although spherical video streaming tends to require a high bitrate because of its huge image area, our system keeps the bitrate reasonable by assigning a higher bitrate only to the user's viewing area, not to the area outside of it. Technical demos of the system with an Oculus Rift HMD will be performed to demonstrate that it enables users to view images at a bitrate of about 2.5 Mbps.
Jiku director 2.0: a mobile video mashup system with zoom and pan using motion maps BIBAFull-Text 765-766
  Duong Trung Dung Nguyen; Axel Carlier; Wei Tsang Ooi; Vincent Charvillat
In this demonstration, we show an automated mobile video mashup system that takes a set of videos filming the same scene as input, and generates an output mashup video consisting of temporally coherent clips selected from these input videos. The key difference between our system and the existing state of the art is that it can automatically generate virtual close-up shots and three camera operations: zooming in, zooming out, and panning. To achieve this, the system first computes the motion maps of the input videos and then determines a set of rectangles that correspond to highly interesting regions (in terms of motion). The choice of which shot types to use is made heuristically, ensuring diversity and coherency in the content presented in the mashup.
ClockDrift: a mobile application for measuring drift in multimedia devices BIBAFull-Text 767-768
  Mario Guggenberger; Mathias Lux; Laszlo Böszörmenyi
Parallel recordings made at the same event with different devices, e.g. by visitors of a concert, contain semantically the same content but do not run at the same speed when played back in parallel on a computer, which makes their synchronization difficult. This effect, time drift, concerns all current consumer multimedia recording devices and results from their internal clocks not running at the same speed, leading to deviations from their nominal sampling rates. We present a mobile application capable of conducting instant measurements of this time drift, thus helping in determining devices that go well together, or correcting the speed differences in post processing.
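The drift described in this abstract can be illustrated with a short calculation. The following is a minimal sketch, not the ClockDrift application's method: it assumes drift is expressed in parts per million (ppm) relative to the nominal sampling rate, and the function names and the example rates are hypothetical.

```python
def drift_ppm(nominal_hz, actual_hz):
    """Relative deviation of the actual sampling rate from the nominal one, in ppm."""
    return (actual_hz - nominal_hz) / nominal_hz * 1e6


def accumulated_offset_seconds(ppm, duration_s):
    """Time offset that builds up over a recording of the given length."""
    return ppm * 1e-6 * duration_s


def speed_correction_factor(nominal_hz, actual_hz):
    """Factor by which playback speed would be scaled to undo the drift."""
    return actual_hz / nominal_hz


# Hypothetical device: nominally 48 kHz, actually sampling at 48001.2 Hz.
ppm = drift_ppm(48000.0, 48001.2)
offset = accumulated_offset_seconds(ppm, 3600.0)
print(f"drift: {ppm:.1f} ppm, offset after one hour: {offset * 1000:.0f} ms")
```

Under these assumptions, a deviation of only 1.2 Hz at 48 kHz accumulates to roughly 90 ms over an hour, which is enough to make parallel recordings audibly diverge.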

Posters 1

Salable Image Search with Reliable Binary Code BIBAFull-Text 769-772
  Guangxin Ren; Junjie Cai; Shipeng Li; Nenghai Yu; Qi Tian
In many existing image retrieval algorithms, the Bag-of-Words (BoW) model has been widely adopted for image representation. To achieve accurate indexing and efficient retrieval, local features such as the SIFT descriptor are extracted and quantized to visual words. One of the most popular quantization schemes is scalar quantization, which generates a binary signature with an empirical threshold value. However, such a binarization strategy inevitably suffers from the quantization loss induced by each quantized bit, which impairs search performance. In this paper, we investigate the reliability of each bit in scalar quantization and propose a novel reliable binary SIFT feature. We go one step further and incorporate the reliability in both index word expansion and feature similarity. Our proposed approach not only accelerates search by narrowing the search space, but also improves retrieval accuracy by alleviating the impact of unreliable quantized bits. Experimental results demonstrate that the proposed approach achieves significant improvement in retrieval efficiency and accuracy.
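Plain scalar quantization, the baseline this abstract starts from, can be sketched as follows. This is an illustration only and does not implement the paper's reliable binary SIFT: using the descriptor's median as the threshold is an assumption, and the function names and the toy 4-dimensional descriptors are hypothetical (real SIFT descriptors have 128 dimensions).

```python
from statistics import median


def binarize(descriptor, threshold=None):
    """Scalar-quantize a descriptor into bits: 1 where a value exceeds the threshold.

    The default threshold (the descriptor's own median) is an assumption made
    for this sketch, not a value taken from the paper.
    """
    if threshold is None:
        threshold = median(descriptor)
    return [1 if value > threshold else 0 for value in descriptor]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length binary signatures differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy descriptors standing in for two local features.
sig_a = binarize([10, 200, 35, 90])
sig_b = binarize([12, 180, 80, 60])
print(sig_a, sig_b, hamming_distance(sig_a, sig_b))
```

Values that lie close to the threshold flip easily under small perturbations, which is one way to picture the per-bit quantization loss the abstract refers to.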
Chic or Social: Visual Popularity Analysis in Online Fashion Networks BIBAFull-Text 773-776
  Kota Yamaguchi; Tamara L. Berg; Luis E. Ortiz
From Flickr to Facebook to Pinterest, pictures are increasingly becoming a core content type in social networks. But, how important is this visual content and how does it influence behavior in the network? In this paper we study the effects of visual, textual, and social factors on popularity in a large real-world network focused on fashion. We make use of state of the art computer vision techniques for clothing representation, as well as network and text information to predict post popularity in both in-network and out-of-network scenarios. Our experiments find significant statistical evidence that social factors dominate the in-network scenario, but that combinations of content and social factors can be helpful for predicting popularity outside of the network. This in depth study of image popularity in social networks suggests that social factors should be carefully considered for research involving social network photos.
n-gram Models for Video Semantic Indexing BIBAFull-Text 777-780
  Nakamasa Inoue; Koichi Shinoda
We propose n-gram modeling of shot sequences for video semantic indexing, in which semantic concepts are extracted from a video shot. Most previous studies for this task have assumed that video shots in a video clip are independent from each other. We model the time-dependency between them assuming that n-consecutive video shots are dependent. Our models improve the robustness against occlusion and camera-angle changes by effectively using information from the previous video shots. In our experiments on the TRECVID 2012 Semantic Indexing Benchmark, we applied the proposed models to a system using Gaussian mixture models and support vector machines. Mean average precision was improved from 30.62% to 32.14%, which is the best performance on the TRECVID 2012 Semantic Indexing to the best of our knowledge.
Line-Based Drawing Style Description for Manga Classification BIBAFull-Text 781-784
  Wei-Ta Chu; Ying-Chieh Chao
The diversity of drawing styles in manga is easily perceived by humans, but is hard to describe in text. We design computational features derived from line segments to describe drawing styles, enabling style classification such as discriminating manga targeting young boys from manga targeting young girls, and discriminating artworks produced by different artists. With statistical analysis, we found that drawing styles can be effectively characterized by the proposed features, such that explicit (e.g., density of line segments) or implicit (e.g., included angles between lines) observations can be made to facilitate various manga style classification tasks.
Supervised hashing with error correcting codes BIBAFull-Text 785-788
  Fatih Cakir; Stan Sclaroff
One widely-used solution to expedite similarity search of multimedia data is to construct hash functions to map the data into a Hamming space where linear search is known to be fast and often sublinear solutions perform well. In this paper, we propose a Boosting based formulation for supervised learning of the hash functions that is based on Error Correcting Codes. This approach allows us to apply established theoretical results for Boosting in our analysis of our hashing solution. Specifically, we show that the training accuracy in Boosting can be considered as a lower bound on the (empirical) Mean Average Precision (mAP) score. In experiments with three image retrieval benchmarks, the proposed formulation yields significant improvement in mAP over state-of-the-art supervised hashing methods, while using fewer bits in the hash codes.
Pedestrian Attribute Recognition At Far Distance BIBAFull-Text 789-792
  Yubin Deng; Ping Luo; Chen Change Loy; Xiaoou Tang
The capability of recognizing pedestrian attributes, such as gender and clothing style, at far distance, is of practical interest in far-view surveillance scenarios where face and body close-shots are hardly available. We make two contributions in this paper. First, we release a new pedestrian attribute dataset, which is by far the largest and most diverse of its kind. We show that the large-scale dataset facilitates the learning of robust attribute detectors with good generalization performance. Second, we present the benchmark performance by SVM-based method and propose an alternative approach that exploits context of neighboring pedestrian images for improved attribute inference.
MSVA: Musical Street View Animator: An Effective and Efficient Way to Enjoy the Street Views of Your Journey BIBAFull-Text 793-796
  Yin-Tzu Lin; Po-Nien Chen; Chia-Hu Chang; Ja-Ling Wu
Google Maps with Street View (GSV) provides ways to explore the world but lacks efficient ways to present a journey. Hyperlapse provides another way to quickly glimpse the street views along a route; however, its viewing experience is not comfortable and can be tedious when the route is long. In this paper, we provide an efficient and enjoyable way to present the street view sequences of a long journey. The proposed approach produces a street view journey video accompanied by music that is listened to locally. During the move between locations, we use speed control techniques from animation production to improve the viewing experience. User evaluation results show that the proposed method increases user satisfaction in viewing the street view sequences.
QoE-driven Unsupervised Image Categorization for Optimized Web Delivery: Short Paper BIBAFull-Text 797-800
  Parvez Ahammad; Brian Kennedy; Padmapani Ganti; Hariharan Kolam
Due to rapid growth in the richness of web applications and the multitude of wireless devices on which web content can be consumed, optimizing the content delivery of dynamic web applications represents an important technical challenge. One of the keys to solving the wireless web performance puzzle is that of efficient image content delivery. Since the critical factor in image delivery is the end-user experience, we propose a simple quantitative metric for characterizing the Quality of Experience (QoE) for any given image that gets sent through a web delivery service (WDS) pipeline. This quantitative signature, termed VoQS (variation of quality signature), allows any two arbitrary images to be compared in the context of web delivery performance. We then use VoQS in conjunction with an unsupervised learning algorithm to group similarly performing images into coherent groups for optimized web delivery. Using a large database of images compiled from multiple content providers and diverse device-optimized formats, we demonstrate that our approach allows large image databases to be efficiently parsed into coherent groups in a content-dependent and device-targeted manner for optimized image content delivery. Our approach significantly decreases the average bits per image that need to be delivered across large image databases (by 43% in our experiments) while preserving the perceptual quality across the entire image database. We also discuss how such a categorization approach can be leveraged for real-time web delivery of novel image data.
Speech Emotion Recognition Using CNN BIBAFull-Text 801-804
  Zhengwei Huang; Ming Dong; Qirong Mao; Yongzhao Zhan
Deep learning systems, such as Convolutional Neural Networks (CNNs), can infer a hierarchical representation of input data that facilitates categorization. In this paper, we propose to learn affect-salient features for Speech Emotion Recognition (SER) using semi-CNN. The training of semi-CNN has two stages. In the first stage, unlabeled samples are used to learn candidate features by a contractive convolutional neural network with reconstruction penalization. In the second stage, the candidate features are used as the input to semi-CNN to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality and discrimination. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and environment distortion), and outperforms several well-established SER features.
Attractive or Not?: Beauty Prediction with Attractiveness-Aware Encoders and Robust Late Fusion BIBAFull-Text 805-808
  Shuyang Wang; Ming Shao; Yun Fu
Facial attractiveness is an everlasting issue in art and social science. It has also recently drawn considerable attention from the multimedia community. In this paper, we develop a framework highlighting attractiveness-aware features extracted from a pair of auto-encoders to learn human-like assessment of facial beauty. Our approach is fully automatic: it does not require any landmarks and puts no restrictions on the faces' pose, expressions, and lighting conditions, and is therefore applicable to larger and more diverse datasets. To this end, first, a pair of auto-encoders is built with beauty images and non-beauty images respectively, which can be used to extract attractiveness-aware features by putting test images into both encoders. Second, we further enhance the performance using an efficient robust low-rank fusion framework to integrate the predicted confidence scores, which are obtained based on certain kinds of features. We show that our attractiveness-aware model with multiple layers of auto-encoders produces appealing results and performs better than previous appearance-based approaches.
AWtoolbox: Characterizing Audio Information Using Audio Words BIBAFull-Text 809-812
  Chin-Chia Michael Yeh; Ping-Keng Jao; Yi-Hsuan Yang
This paper presents the AWtoolbox, an open-source software designed for extracting the audio word (AW) representation of audio signals. The toolbox comes with a graphical user interface that helps a user design custom AW extraction pipelines and various algorithms for feature encoding, dictionary learning, result rectification, pooling, normalization and others. This paper also reports a benchmark comparing eight AW representations computed by the toolbox against state-of-the-art low-level and mid-level timbre, rhythmic and tonal descriptors of music and sound. The evaluation result shows that sparse coding (SC) based AW representation leads to very competitive performances across the three tested sound and music classification tasks. AWtoolbox is available for download at http://mac.citi.sinica.edu.tw/awtoolbox.
Screencast in the Wild: Performance and Limitations BIBAFull-Text 813-816
  Chih-Fan Hsu; De-Yu Chen; Chun-Ying Huang; Cheng-Hsin Hsu; Kuan-Ta Chen
Displays without associated computing devices are increasingly popular, and the binding between computing devices and displays is no longer one-to-one but more dynamic and adaptive. Screencast technologies enable such dynamic binding over ad hoc one-hop networks or Wi-Fi access points. In this paper, we design and conduct the first detailed measurement study on the performance of state-of-the-art screencast technologies. By varying the user demands and network conditions, we find that Splashtop and Miracast outperform other screencast technologies under typical setups. Our experiments also show that the screencast technologies either: (i) do not dynamically adjust bitrate or (ii) employ a suboptimal adaptation strategy. We suggest that developers of future screencast technologies pay more attention to the bitrate adaptation strategy, e.g., by leveraging a cross-layer optimization paradigm.
One-Pass Video Stabilization on Mobile Devices BIBAFull-Text 817-820
  Wei Jiang; Zhenyu Wu; John Wus; Heather Yu
We study real-time one-pass video stabilization on mobile devices such as smartphones and tablets. A localized camera path planning framework is proposed, where through maintaining a forward or backward-looking buffer window, the short-term camera trajectory is locally optimized according to cinematographic rules. Also, we propose a hybrid auto-corrective path optimization algorithm, where the system automatically switches among different motion models according to the actual video adaptively. Using a multithreading mechanism, the proposed method can be implemented on mobile devices to achieve real-time one-pass video stabilization. Evaluation on a large variety of consumer videos demonstrates the effectiveness of our system.
You Talkin' to Me?: Recognizing Complex Human Interactions in Unconstrained Videos BIBAFull-Text 821-824
  Bo Zhang; Yan Yan; Nicola Conci; Nicu Sebe
Due to the exponential growth of user-generated videos and the prevalence of video-sharing communities such as YouTube and Hulu, recognizing complex human activities in the wild has become increasingly important in the research community. These videos are hard to study due to the frequent changes of camera viewpoint, multiple people moving in the scene, fast body movements, and varied lengths of video clips. In this paper, we propose a novel framework to analyze human interactions in TV shows. Firstly, we exploit the motion interchange pattern (MIP) to detect camera viewpoint changes in a video, and extract the salient motion points in the bounding box that covers the region of interest (ROI) in each frame. Then, we compute the large displacement optical flow for the salient pixels in the bounding box, and build the histogram of oriented optical flow as the motion feature vector for each frame. Finally, the self-similarity matrix (SSM) is adopted to capture the global temporal correlation of frames in a video. After extracting the SSM descriptors, the video feature vector can be constructed through different encoding approaches. The proposed framework works well in practice for unconstrained videos. We validate our approach on the TV human interaction (TVHI) dataset, and the experimental results demonstrate the efficacy of our strategy.
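As a rough illustration of the self-similarity matrix step, the sketch below computes pairwise Euclidean distances between per-frame feature vectors. The distance choice, the function name `self_similarity_matrix`, and the toy three-bin histograms are assumptions for illustration; the abstract does not specify how the authors build their SSM descriptors.

```python
import math


def self_similarity_matrix(frames):
    """Return the symmetric matrix of pairwise distances between frame features."""
    n = len(frames)
    ssm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance is an assumed choice of frame dissimilarity.
            d = math.dist(frames[i], frames[j])
            ssm[i][j] = ssm[j][i] = d
    return ssm


# Hypothetical 3-bin flow histograms for four frames (not data from the paper).
frames = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
ssm = self_similarity_matrix(frames)
```

Frames with identical motion features yield a zero entry, so repeated motion patterns show up as low-valued off-diagonal structure in the matrix.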
Instructional Videos for Unsupervised Harvesting and Learning of Action Examples BIBAFull-Text 825-828
  Shoou-I Yu; Lu Jiang; Alexander Hauptmann
Online instructional videos have become a popular way for people to learn new skills encompassing art, cooking and sports. As watching instructional videos is a natural way for humans to learn, analogously, machines can also gain knowledge from these videos. We propose to utilize the large number of instructional videos available online to harvest examples of various actions in an unsupervised fashion. The key observation is that in instructional videos, the instructor's action is highly correlated with the instructor's narration. By leveraging this correlation, we can exploit the timing of action-related terms in the speech transcript to temporally localize actions in the video and harvest action examples. The proposed method is scalable as it requires no human intervention. Experiments show that the examples harvested are of reasonably good quality, and action detectors trained on data collected by our unsupervised method yield performance comparable to detectors trained with manually collected data on the TRECVID Multimedia Event Detection task.
Beautifying Fisheye Images using Orientation and Shape Cues BIBAFull-Text 829-832
  Xiaobo Wang; Xiaochun Cao; Xiaojie Guo; Zhanjie Song
Fisheye images, with their wide field of view, are becoming increasingly popular in daily life. However, fisheye images usually suffer from misalignment that reduces their visual appeal. In this paper, we develop a computational method for enhancing the aesthetics of such images by exploiting orientation and shape cues. More specifically, the orientation cue is based on the observation that cameras are often oriented when taking photos so that their up vectors are parallel to vertical linear structures in the scene. The shape cue requires that the circular shape be preserved after the fisheye image is repositioned. By employing these two rules as our basic aesthetic guidelines, our method can correct the rotation angle between the camera coordinate system and the world coordinate system to make the virtual camera oriented, and complete the missing part. Experimental results on a number of challenging indoor and outdoor fisheye images show the effectiveness of our approach and demonstrate the superior aesthetics of the proposed method compared to state-of-the-art methods.
Temporal Fusion Approach Using Segment Weight for Affect Recognition from Body Movements BIBAFull-Text 833-836
  Jianfeng Xu; Shigeyuki Sakazawa
In the field of affect recognition, most researchers have focused on using multimodal fusion while temporal fusion techniques have yet to be adequately explored. This paper demonstrates that a powerful temporal fusion approach can significantly improve the performance of affect recognition. As a typical approach in state-of-the-art methods, the segment-based affect recognition technique from body movements is used as a baseline, whereby a body movement is parsed into a sequence of motion clips (called segments) and all segments are treated as being the same using a majority voting strategy. Our basic idea is that different types of segments have different influences/weights in recognizing the affective state. To verify this idea, an entropy-based method is proposed to estimate the segment weights in supervised and unsupervised learning. Furthermore, the recognition results from all segments are fused with the segment weights using the sum rule. Our experimental results on a public data set demonstrate that the segment weights can greatly improve the percentage of correctness (accuracy) from the baseline.
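The difference between the majority-voting baseline and weighted sum-rule fusion can be sketched as follows. The segment weights here are hypothetical constants; the entropy-based weight estimation proposed in the paper is not reproduced, and the function names are illustrative only.

```python
from collections import Counter


def majority_vote(segment_probs):
    """Baseline: every segment casts one equal vote for its top label."""
    votes = Counter(max(p, key=p.get) for p in segment_probs)
    return votes.most_common(1)[0][0]


def weighted_sum_rule(segment_probs, weights):
    """Sum rule: add per-segment class probabilities scaled by segment weight."""
    totals = {}
    for probs, w in zip(segment_probs, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + w * p
    return max(totals, key=totals.get)


# Hypothetical classifier outputs for three motion segments.
segment_probs = [
    {"happy": 0.55, "sad": 0.45},
    {"happy": 0.60, "sad": 0.40},
    {"happy": 0.05, "sad": 0.95},
]
# Assumed weights: the third segment type is treated as most informative.
weights = [0.2, 0.2, 0.6]

baseline_label = majority_vote(segment_probs)
fused_label = weighted_sum_rule(segment_probs, weights)
```

In this toy case the two weakly confident segments win the majority vote, while the weighted sum rule lets the single highly weighted, confident segment decide.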
sTrack: Secure Tracking in Community Surveillance BIBAFull-Text 837-840
  Chun-Te Chu; Jaeyeon Jung; Zicheng Liu; Ratul Mahajan
We present sTrack, a system that can track objects across multiple cameras without sharing any visual information between two cameras except whether an object was seen by both. To achieve this challenging privacy goal, we leverage recent advances in secure two-party computation and multi-camera tracking. We derive a new distance metric learning technique that is more suited for secure computation. Compared to the existing methods, our technique has lower complexity in secure computation without sacrificing the tracking accuracy. We implement it using a new Boolean circuit for secure tracking. Experiments using real datasets show that the performance overhead of secure tracking is low, adding only a few seconds over non-private tracking.
Searching for Recent Celebrity Images in Microblog Platform BIBAFull-Text 841-844
  Na Zhao; Richang Hong; Meng Wang; Xuegang Hu; Tat-Seng Chua
With the explosive growth and widespread accessibility of image content in social media, many users are eagerly searching for the most recent and relevant images on topics of interest. However, most current microblog platforms rely only on textual information, specifically keywords, for image search, which cannot achieve satisfactory results since in most cases image content is inconsistent with textual content. In this paper, we tackle this problem in the application of searching for celebrity images. The proposed method is based on the idea of refining the initial text-based search results by utilizing multimedia plus social information. Given a text search query, we first obtain an initial text-based result. Next, we extract a seed tweet set whose images contain faces recognized as celebrities and whose texts contain the expanded keywords. Third, we extend the seed set based on visual and user information. Lastly, we employ a multi-modal graph-based learning method to rank the obtained tweets by integrating social and visual information. Extensive experiments on data collected from Tencent Weibo demonstrate that our proposed method achieves an approximately 3-fold improvement in results compared to the text baseline typically used in microblog search services.
Organizing Video Search Results to Adapted Semantic Hierarchies for Topic-based Browsing BIBAFull-Text 845-848
  Jiajun Wang; Yu-Gang Jiang; Qiang Wang; Kuiyuan Yang; Chong-Wah Ngo
Organizing video search results into semantically structured hierarchies can greatly improve the efficiency of browsing complex query topics. Traditional hierarchical clustering techniques are inadequate since they lack the ability to generate semantically interpretable structures. In this paper, we introduce an approach to organize video search results into an adapted semantic hierarchy. As many hot search topics such as celebrities and famous cities have Wikipedia pages where hierarchical topic structures are available, we start from the Wikipedia hierarchies and adjust the structures according to the characteristics of the videos returned by a search engine. Ordinary clustering based on textual information of the videos is performed to discover the hidden topic structures in the video search results, which are used to adapt the hierarchy extracted from Wikipedia. After that, a simple optimization problem is formulated to assign the videos to each node of the hierarchy considering three important criteria. Experiments conducted on a YouTube video dataset verify the effectiveness of our approach.
Real-time summarization of user-generated videos based on semantic recognition BIBAFull-Text 849-852
  Xi Wang; Yu-Gang Jiang; Zhenhua Chai; Zichen Gu; Xinyu Du; Dong Wang
User-generated content plays an important role in Internet video-sharing activities. Techniques for summarizing user-generated videos (UGVs) into short representative clips are useful in many applications. This paper introduces an approach for UGV summarization based on semantic recognition. Unlike other types of videos such as movies or broadcast news, where the semantic content may vary greatly across different shots, most UGVs have only a single long shot with relatively consistent high-level semantics. Therefore, a few semantically representative segments are generally sufficient for a UGV summary, and these can be selected based on the distribution of semantic recognition scores. In addition, due to the poor shooting quality of many UGVs, factors such as camera shake and lighting conditions are also considered to achieve more pleasant summaries. Experiments on over 100 UGVs with both subjective and objective evaluations show that our approach clearly outperforms several alternative methods and is highly efficient. Using a regular laptop, it can produce a summary for a 2-minute video in just 10 seconds.
Hacking Chinese Touclick CAPTCHA by Multi-Scale Corner Structure Model with Fast Pattern Matching BIBAFull-Text 853-856
  Yunhang Shen; Rongrong Ji; Donglin Cao; Min Wang
In this paper, we tackle the challenge of hacking CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), which is widely used to distinguish machines from humans in webpage registration and authorization [17]. More specifically, we target the touclick Chinese CAPTCHA, which has recently become popular in mobile application scenarios. Hacking such CAPTCHAs is much more challenging and remains unexplored in the literature. Our main idea is a multi-scale Corner based Structure Model, termed CSM, with a very efficient pattern matching scheme. CSM can accurately capture the intrinsic statistics of touclick Chinese CAPTCHA against background clutter. We demonstrate the efficiency and effectiveness of the proposed approach by extensive experiments on a Chinese touclick CAPTCHA dataset collected from Internet forums. We report encouraging results with an overall success rate of almost 100% and an average detection time of 170 milliseconds. Based on our work, we also provide suggestions on improving the current CAPTCHA-based human-machine identification systems.
Orthogonal Gaussian Process for Automatic Age Estimation BIBAFull-Text 857-860
  Kai Zhu; Dihong Gong; Zhifeng Li; Xiaoou Tang
Age Estimation from facial images has been receiving increasing interest due to its important applications. Among the existing age estimation algorithms, the personalized approaches have been shown to be the most effective ones. However, most of the person-specific approaches (e.g. MTWGP [1], AGES [2], WAS [3]) rely heavily on the availability of training images across different ages for a single subject, which is very difficult to satisfy in practical applications. In order to overcome this problem, we propose a new approach to age estimation, called Orthogonal Gaussian Process (OGP). Compared to standard Gaussian Process, OGP is much more efficient while maintaining the discriminatory power of the standard Gaussian Process. Based on OGP, we further propose an improvement of OGP called anisotropic OGP (A-OGP) to enhance the age estimation performance. Extensive experiments are conducted to demonstrate the state-of-the-art estimation accuracy of our new algorithm on several public-domain face aging datasets: FG-NET face dataset with 82 different subjects, Morph Album 1 dataset with more than 600 subjects, and Morph Album 2 with about 20,000 different subjects.
Points of Interest Detection from Multiple Sensor-Rich Videos in Geo-Space BIBAFull-Text 861-864
  Ying Zhang; Roger Zimmermann; Luming Zhang; David A. Shamma
Recently, the popularity of user-generated videos has highlighted efficient video indexing and browsing as an urgent problem. Points of interest (POI) detection is a technique to address this issue by establishing the implicit relationship among different media resources. The majority of existing studies detect POIs by visual similarity, leveraging computer vision techniques. However, these methods suffer from high computational complexity when processing large-scale video sets and are challenging due to the sparse visual correlation among different consumer videos. The advent of geo-referenced videos provides an opportunity to detect POIs in an efficient manner. In this work, we first propose a probability model to formulate the capture intention distribution for video frames. Second, we detect the POIs by considering the contributions from multiple videos. Evaluations demonstrate that our algorithm successfully detects POIs, reducing the error distance from several dozen meters to a few meters.
Recognizing Human Activities from Smartphone Sensor Signals BIBAFull-Text 865-868
  Arindam Ghosh; Giuseppe Riccardi
In context-aware computing, Human Activity Recognition (HAR) aims to understand the current activity of users from their connected sensors. Smartphones with their various sensors are opening a new frontier in building human-centered applications for understanding users' personal and world contexts. While in-lab and controlled activity recognition systems have yielded very good results, they do not perform well under in-the-wild scenarios. The objectives of this paper are to 1) investigate how the audio signal can complement and improve other on-board sensors (accelerometer and gyroscope) for activity recognition; 2) design and evaluate the fusion of such multiple signal streams to optimize performance and sampling rate; and 3) evaluate the performance of multi-stream human activity recognition under different real end-user activity conditions. We show that fusion of these signal streams, including audio, achieves high performance even at very low sampling rates.
Twitter-driven YouTube Views: Beyond Individual Influencers BIBAFull-Text 869-872
  Honglin Yu; Lexing Xie; Scott Sanner
This paper proposes a novel method to predict increases in YouTube viewcount driven by the Twitter social network. Specifically, we aim to predict two types of viewcount increases: a sudden increase in viewcount (named Jump), and the viewcount shortly after the upload of a new video (named Early). Experiments on hundreds of thousands of videos and millions of tweets show that Twitter-derived features alone can predict whether a video will be in the top 5% for Early popularity with 0.7 Precision@100. Furthermore, our results reveal that while individual influence is indeed important for predicting how Twitter drives YouTube views, it is a diversity of interest from the most active to the least active Twitter users mentioning a video (measured by the variation in their total activity) that is most informative for both Jump and Early prediction. In summary, by going beyond features that quantify individual influence and additionally leveraging collective features of activity variation, we are able to obtain an effective cross-network predictor of Twitter-driven YouTube views.
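For readers unfamiliar with the Precision@100 figure quoted above, the metric is the fraction of the top-k ranked items that are truly relevant. The sketch below uses the standard definition; the identifiers and toy data are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked items that belong to the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k


# Hypothetical ranking of videos by predicted Early popularity.
ranked = ["v1", "v2", "v3", "v4"]
# Hypothetical set of videos that really were in the top 5%.
truly_popular = {"v1", "v3", "v9"}

p_at_4 = precision_at_k(ranked, truly_popular, 4)
```

Under this definition, 0.7 Precision@100 means that 70 of the 100 highest-ranked videos were actually in the target popularity class.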
Decoding Auditory Saliency from FMRI Brain Imaging BIBAFull-Text 873-876
  Shijie Zhao; Xi Jiang; Junwei Han; Xintao Hu; Dajiang Zhu; Jinglei Lv; Tuo Zhang; Lei Guo; Tianming Liu
Given the growing number of available audio streams through a variety of sources and distribution channels, effective and advanced computational audio analysis has received increasing interest in the multimedia field. However, the effectiveness of current audio analysis strategies might be hampered due to the lack of effective representation of high-level semantics perceived by the human and the lack of effective approaches to bridging the gaps between most low-level acoustic features and high-level semantic features. This semantic gap has become the 'bottleneck' problem in audio analysis. In this paper, we propose a computational framework to decode biologically-plausible auditory saliency using high-level features derived from functional magnetic resonance imaging (fMRI) which monitors the human brain's response under the natural stimulus of audio listening. Specifically, we identify meaningful intrinsic brain networks which are involved in audio listening via effective online dictionary learning and sparse representation of whole-brain fMRI signals, reconstruct auditory saliency features using those identified brain network components, and perform group-wise analysis to identify consistent 'brain decoders' of the saliency features across different excerpts and participants. Experimental results demonstrate that the auditory saliency features are effectively decoded via our methods, which potentially provide opportunities for various applications in the multimedia field.
Crowd Counting via Head Detection and Motion Flow Estimation BIBAFull-Text 877-880
  Huiyuan Fu; Huadong Ma; Hongtian Xiao
Crowd counting under heavy occlusion is highly desirable for public security. However, few studies have addressed this goal. Most previous systems count passing people robustly only when there is no heavy occlusion; otherwise they have to estimate the crowd size approximately. To solve this difficult problem, this paper proposes an effective algorithm that combines head detection and motion flow estimation. First, we detect each head in the crowd by using our proposed scene-adaptive scheme on depth data. Then, we estimate the motion flow based on the interest points in each head region on color data. By repeating the above steps, we ultimately obtain multi-direction crowd counting results. Based on this approach, we have built a practical system for reliable crowd counting. Extensive experimental results show that our system is effective.
Semantic feature projection for continuous emotion analysis BIBAFull-Text 881-884
  Prasanth Lade; Troy McDaniel; Sethuraman Panchanathan
Affective computing researchers have recently been focusing on continuous emotion dimensions like arousal and valence. This dual coordinate affect space can explain many of the discrete emotions like sadness, anger, joy, etc. In the area of continuous emotion recognition, Principal Component Analysis (PCA) models are generally used to enhance the performance of various image and audio features by projecting them to a new space where the new features are less correlated. We instead propose that quantizing and projecting the features to a latent topic space performs better than PCA. Specifically, we extract these topic features using Latent Dirichlet Allocation (LDA) models. We show that topic models project the original features to a latent feature space that is more coherent and useful for continuous emotion recognition than PCA. Unlike PCA, where no semantics can be attributed to the new features, topic features can have a visual and semantic interpretation which can be used in personalized HCI applications and assistive technologies. Our hypothesis in this work has been validated using the AVEC 2012 continuous emotion challenge dataset.
Real-time crowd detection based on gradient magnitude entropy model BIBAFull-Text 885-888
  Huiyuan Fu; Huadong Ma
Reliable, real-time crowd detection is one of the most important tasks in intelligent video surveillance systems. Previous works either count the number of pedestrians in the crowd directly or use holistic features of crowd scenes for crowd detection. However, the former methods fail in complex crowded scenes, and the latter face difficulties in feature selection. In this paper, we propose a simple but effective model, the Gradient Magnitude Entropy (GME) model, for crowd detection. Our model is based on a key observation: the value of GME in a region increases as the number of pedestrians grows. Thus, we can estimate the degree of crowding when the value of GME is larger than some threshold, without counting the number of pedestrians. Extensive experiments show that our GME model outperforms state-of-the-art techniques on several challenging datasets. Furthermore, our method runs in real time for practical surveillance applications.
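The abstract does not define GME precisely. One plausible reading, sketched below, is the Shannon entropy of a histogram of gradient magnitudes over a region. The central-difference gradient, the bin count, and the function name are all assumptions for illustration rather than the authors' formulation.

```python
import math


def gradient_magnitude_entropy(gray, bins=8):
    """Shannon entropy (bits) of the gradient-magnitude histogram of a region.

    `gray` is a 2D list of intensities; only interior pixels are used.
    """
    height, width = len(gray), len(gray[0])
    mags = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the horizontal and vertical gradient.
            gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0
            gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0
            mags.append(math.hypot(gx, gy))
    top = max(mags)
    if top == 0:
        return 0.0  # A perfectly flat region has no gradient variation.
    counts = [0] * bins
    for m in mags:
        counts[min(int(m / top * bins), bins - 1)] += 1
    total = len(mags)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


# Hypothetical 5x5 regions: an empty background and one with a bright blob.
flat = [[0] * 5 for _ in range(5)]
blob = [row[:] for row in flat]
blob[2][2] = 9

flat_gme = gradient_magnitude_entropy(flat)
blob_gme = gradient_magnitude_entropy(blob)
```

A region would then be flagged as crowded when its entropy exceeds a threshold, which matches the thresholding idea in the abstract without counting individuals.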
Face Recognition from Multiple Images per Subject BIBAFull-Text 889-892
  Yang Mu; Henry Lo; Wei Ding; Dacheng Tao
For face recognition, we show that knowing that each subject corresponds to multiple face images can improve classification performance. For domains such as video surveillance, it is easy to deduce which group of images belongs to the same subject; in domains such as family album identification, we lose group membership information but there is still a group of images for each subject. We define these two types of problems as multiple faces per subject. In this paper, we propose a Bipart framework to take advantage of this group information in the testing set as well as in the training set. From these two sources of information, two models are learned independently and combined to form a unified discriminative distance space. Furthermore, this framework is generalized to allow both subspace learning and distance metric learning methods to take advantage of this group information. Bipart is evaluated on the multiple faces per subject problem using several benchmark datasets, including video and static image data, subjects of various ages, various lighting conditions, and many facial expressions. Comparisons against state-of-the-art distance and subspace learning methods demonstrate much better performance when utilizing group information with the Bipart framework.
Admission Control for Wireless Adaptive HTTP Streaming: An Evidence Theory Based Approach BIBAFull-Text 893-896
  Zhisheng Yan; Chang Wen Chen; Bin Liu
In this research, we propose an evidence theory based admission control scheme for wireless cellular adaptive HTTP streaming systems. This novel scheme allows us to effectively address the uncertainty and inaccuracy in QoE management and network estimation, and seamlessly grant or deny the access requests. Specifically, building on recent work on the QoE continuum model and the QoE continuum driven adaptation algorithm, we utilize Dempster-Shafer evidence theory to assign a proper degree of belief to admission, rejection, and uncertainty decisions for each user's evidence. We can then strategically combine the weighted evidence of multiple users and make the final decision. The evaluation results show that the proposed scheme can provide satisfactory QoE for both existing and new users while still achieving comparable bandwidth efficiency.
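To make the evidence-combination step concrete, the sketch below applies Dempster's rule of combination to two mass functions over the frame {admit, reject}, with the "uncertain" mass assigned to the whole frame. The mass values and the function name are hypothetical, and the per-user weighting described in the abstract is not reproduced.

```python
def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps "admit", "reject" and "uncertain" to masses that
    sum to 1, where "uncertain" is the mass on the whole frame {admit, reject}.
    """
    # Conflict: one source supports admit while the other supports reject.
    conflict = m1["admit"] * m2["reject"] + m1["reject"] * m2["admit"]
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    norm = 1.0 - conflict

    def fused(label):
        # A singleton is supported by agreement or by pairing with uncertainty.
        return (
            m1[label] * m2[label]
            + m1[label] * m2["uncertain"]
            + m1["uncertain"] * m2[label]
        ) / norm

    return {
        "admit": fused("admit"),
        "reject": fused("reject"),
        "uncertain": m1["uncertain"] * m2["uncertain"] / norm,
    }


# Hypothetical evidence from two existing users about a new access request.
user_a = {"admit": 0.6, "reject": 0.1, "uncertain": 0.3}
user_b = {"admit": 0.5, "reject": 0.2, "uncertain": 0.3}
decision = combine(user_a, user_b)
```

Two sources that both lean toward admission reinforce each other: the fused belief in "admit" is higher than either input, while the uncertain mass shrinks.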
Matrix Completion for Cross-view Pairwise Constraint Propagation BIBAFull-Text 897-900
  Zheng Yang; Yao Hu; Haifeng Liu; Huajun Chen; Zhaohui Wu
As pairwise constraints are usually easier to access than label information, pairwise constraint propagation attracts more and more attention in semi-supervised learning. Most existing pairwise constraint propagation methods are based on the canonical graph propagation model, which depends heavily on the edge weights in the graph and cannot preserve local and global consistency simultaneously. In order to address this drawback, we cast cross-view pairwise constraint propagation into a problem of low-rank matrix completion and propose a Matrix Completion method for cross-view Pairwise Constraint Propagation (MCPCP). With the low-rank requirement and graph regularization, our MCPCP can preserve local and global consistency simultaneously. We develop an algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem. Finally, the effectiveness of MCPCP is demonstrated in cross-view multimedia retrieval.
Cross-Media Hashing with Neural Networks BIBAFull-Text 901-904
  Yueting Zhuang; Zhou Yu; Wei Wang; Fei Wu; Siliang Tang; Jian Shao
Cross-media hashing, which conducts cross-media retrieval by embedding data from different modalities into a common low-dimensional Hamming space, has attracted intensive attention in recent years. This is motivated by two facts: a) multi-modal data is widespread, e.g., the web images on Flickr are associated with tags, and b) hashing is an effective technique for large-scale high-dimensional data processing, which is exactly the situation of cross-media retrieval. Inspired by recent advances in deep learning, we propose a cross-media hashing approach based on multi-modal neural networks. By requiring in the learning objective that a) the hash codes for relevant cross-media data be similar, and b) the hash codes be discriminative for predicting the class labels, the learned Hamming space is expected to capture the cross-media semantic relationships well and to be semantically discriminative. The experiments on two real-world data sets show that our approach achieves superior cross-media retrieval performance compared with the state-of-the-art methods.
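Once both modalities are embedded as binary codes in a shared Hamming space, retrieval reduces to ranking by Hamming distance. The sketch below shows only that retrieval step with made-up 8-bit codes; how the neural networks learn the codes is outside its scope, and all names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which two non-negative integer codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database, top_k=2):
    """Return the keys of the top_k database items closest to the query code."""
    ranked = sorted(database, key=lambda name: hamming(query_code, database[name]))
    return ranked[:top_k]


# Hypothetical hash code produced for a text query.
text_query = 0b10110010
# Hypothetical hash codes produced for images in the same Hamming space.
image_codes = {
    "img_cat": 0b10110011,
    "img_dog": 0b01001100,
    "img_kitten": 0b10100010,
}

results = retrieve(text_query, image_codes)
```

Because the distance is an XOR followed by a bit count, the comparison is cheap enough to scan very large collections, which is the practical appeal of hashing noted in the abstract.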
How Your Portrait Impresses People?: Inferring Personality Impressions from Portrait Contents BIBAFull-Text 905-908
  Jie Nie; Peng Cui; Yan Yan; Lei Huang; Zhen Li; Zhiqiang Wei
Whenever we look at a stranger's portrait, besides the observable appearance, we implicitly and subconsciously form a personality impression. It is quite interesting to ask how a portrait impresses people. This paper presents a novel method to infer personality impressions from portraits. First, a questionnaire is used to demonstrate the consistency of people's impressions. Then, personality-related features are explored through statistical analysis. Finally, a Support Vector Machine is trained on these features. Experimental results demonstrate that our method achieves a precision of 52.14% and a recall of 52.78% on inferring 4 personalities from 2,463 randomly selected portraits of people downloaded from "Google images". Improvements of 44.04% and 37.91% are reported compared to a baseline method. Feature contribution analysis reveals the correspondence between portrait contents and personality impressions. Demonstrations with respect to visual patterns in portrait collages of different personalities further demonstrate the effectiveness of the proposed method. Furthermore, we apply our method to analyze portraits of Hillary Clinton and obtain an interesting multifaceted picture of this famous politician, which is another proof of both our concept and method.
Who's Time Is It Anyway?: Investigating the Accuracy of Camera Timestamps BIBAFull-Text 909-912
  Bart Thomee; Jose G. Moreno; David A. Shamma
People take photos all over the world at all times of day, each photo depicting a place and a moment worth capturing. In the context of multimedia analysis and social computing, accurate location and time information about where and when these photos were taken is of importance for understanding event semantics, image content and many other purposes. While location information associated with photos is known to be relatively accurate, time is not. From a sample of 10 million public Flickr photos, we observe that for 37% of the photos the camera timestamp differs by more than an hour from the GPS timestamp, with respect to local time at the location where the photo was taken. Erroneous time information may adversely influence the correctness of any kind of temporal analysis that relies on camera timestamps, as well as research and real-world applications that require accurate knowledge of when and where photos were captured. In light of our observations, we propose a simple yet effective metadata-only technique for improving the accuracy of camera timestamps.
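The discrepancy being measured can be sketched as follows: convert the GPS timestamp (UTC) to local time at the capture location and subtract it from the camera's clock reading. This illustrates only the measurement, not the correction technique the paper proposes. The function name, the fixed UTC offset argument, and the sample values are assumptions.

```python
from datetime import datetime, timedelta, timezone


def camera_clock_error(camera_local, gps_utc, utc_offset_hours):
    """Return camera time minus GPS-derived local time at the capture location.

    `camera_local` is a naive datetime as written by the camera clock,
    `gps_utc` is a timezone-aware UTC datetime, and `utc_offset_hours` is the
    assumed local offset from UTC where the photo was taken.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    gps_local = gps_utc.astimezone(local_tz).replace(tzinfo=None)
    return camera_local - gps_local


# Hypothetical photo: camera clock reads 15:30, GPS reports 18:00 UTC,
# and the capture location is assumed to be at UTC-5.
camera_time = datetime(2014, 11, 3, 15, 30)
gps_time = datetime(2014, 11, 3, 18, 0, tzinfo=timezone.utc)

error = camera_clock_error(camera_time, gps_time, utc_offset_hours=-5)
off_by_more_than_an_hour = abs(error) > timedelta(hours=1)
```

A real pipeline would need to look up the correct offset, including daylight saving time, from the photo's coordinates rather than pass it in as a constant.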
Jointly Discovering Fine-grained and Coarse-grained Sentiments via Topic Modeling BIBAFull-Text 913-916
  Hanqi Wang; Fei Wu; Xi Li; Siliang Tang; Jian Shao; Yueting Zhuang
The ever-increasing volume of user-generated content in social media and other web services makes it highly desirable to discover users' opinions on all kinds of topics. Motivated by the assumption that individual words and paragraphs in documents deliver fine-grained (e.g., "laudatory", "annoyed" or "boring") and coarse-grained (e.g., positive, negative or neutral) sentiments about certain topics, respectively, this paper focuses on a deeper thematic level to jointly disentangle fine-grained and coarse-grained opinions towards topics in terms of sentiment analysis, with a model named LDA with multi-grained sentiments (MgS-LDA). As a result, the proposed MgS-LDA not only discovers the topics in social media, but also identifies opinions about a given topic in terms of fine-grained and coarse-grained sentiment. Results of several experiments show that our proposed MgS-LDA achieves better performance on both sentiment classification and topic modeling than related methods.
A Real-Time Smart Assistant for Video Surveillance Through Handheld Devices BIBAFull-Text 917-920
  Hao Kuang; Benjamin Guthier; Mukesh Saini; Dwarikanath Mahapatra; Abdulmotaleb El Saddik
In a remote surveillance system, a high resolution surveillance camera streams its video to a user's handheld device. Such devices are unable to make use of the high resolution video due to their limited display size and bandwidth. In this paper, we propose a method to assist the mobile operator of the surveillance camera in focusing on sensitive regions of the video. Our system automatically identifies relevant regions. We introduce a pan and zoom strategy to ensure that the operator is able to see fine details in these areas while maintaining contextual knowledge. Regions of interest are identified using foreground detection as well as face and body detection. The efficacy of the proposed method is demonstrated through a user study. Our proposed method was reported to be more useful than two comparable approaches for getting an understanding of the activities in a surveillance scene while maintaining context.

Posters 2

Modeling the Interest-Forgetting Curve for Music Recommendation BIBAFull-Text 921-924
  Jun Chen; Chaokun Wang; Jianmin Wang
Music recommendation plays a key role in our daily lives as well as in the multimedia industry. This paper adapts the memory forgetting curve to model the human interest-forgetting curve for music recommendations based on the observation of recency effects in people's listening to music. Two music recommendation methods are proposed using this model with respect to the sequence-based and the IFC-based transition probabilities, respectively. We also bring forward a learning method to approximate the globally optimal or personalized interest-forgetting speed(s). The experimental results show that our methods can significantly improve the accuracy of music recommendations. Moreover, the IFC-based method outperforms the sequence-based method when the recommendation list is short.
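The classic memory forgetting curve is commonly written as an exponential decay, R = exp(-t / S), where t is elapsed time and S is a strength parameter. The sketch below applies that form to listening history as an assumed stand-in for the interest-forgetting curve; the paper's actual model, transition probabilities, and learned forgetting speeds are not reproduced, and all names and values are hypothetical.

```python
import math


def retention(elapsed, strength):
    """Exponential forgetting curve: remaining interest after `elapsed` time units."""
    return math.exp(-elapsed / strength)


def interest_scores(history, now, strength):
    """Sum decayed interest per song over a list of (song, listen_time) events."""
    scores = {}
    for song, listened_at in history:
        scores[song] = scores.get(song, 0.0) + retention(now - listened_at, strength)
    return scores


# Hypothetical listening log with times in arbitrary units.
history = [("song_a", 0), ("song_b", 8), ("song_a", 9)]
scores = interest_scores(history, now=10, strength=5.0)
```

A smaller strength value makes interest fade faster, which corresponds to the global or per-user "forgetting speed" that the abstract says is learned from data.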
Learning to Assess Image Retargeting BIBAFull-Text 925-928
  Bahetiyaer Bare; Ke Li; Weiyi Wang; Bo Yan
Content-aware image retargeting enables images to fit different devices with various aspect ratios while preserving salient contents. Meanwhile, assessing the quality of image retargeting and unifying both subjective and objective evaluation have become a prominent challenge. In this paper, we propose an image quality assessment based on a Radial Basis Function (RBF) neural network. We propose a new feature for image retargeting evaluation, which adapts structural similarity (SSIM) and saliency. By also including other existing features, we build a neural network to assess the quality of the retargeted image. The neural network is trained to combine the above-mentioned features. The accuracy of our proposed assessment is verified by simulations and it possesses huge practical significance.
Efficient Image Retargeting via Adaptive Pixel Fusion BIBAFull-Text 929-932
  Bo Yan; Xiaochu Yang; Ke Li
This paper presents a new image retargeting method, which is able to provide both high efficiency and quality. Firstly, our method calculates the image resizing factors in both horizontal and vertical directions. Then, it resizes the image in both directions sequentially in order to achieve the target aspect ratio with the help of our proposed mapping functions. Finally, the reconstructed image can be uniformly scaled to the target resolution. Experimental results show that our method is not only cost effective in terms of computational resource, but also provides visually pleasing resized images.
Learning Compact Face Representation: Packing a Face into an int32 BIBAFull-Text 933-936
  Haoqiang Fan; Mu Yang; Zhimin Cao; Yujing Jiang; Qi Yin
This paper addresses the problem of producing a very compact representation of a face image for large-scale face search and analysis tasks. Traditionally, the compactness of face representation is achieved by a dimension reduction step after representation extraction. However, the dimension reduction usually degrades the discriminative ability of the original representation drastically. In this paper, we present a deep learning framework which optimizes the compactness and discriminative ability jointly. The learnt representation can be as compact as 32 bits (same as the int32) and still produce highly discriminative performance (91.4% on LFW benchmark). Based on the extreme compactness, we show that traditional face analysis tasks (e.g. gender analysis) can be effectively solved by a Look-Up-Table approach given a large-scale face data set.
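To make the "face in an int32" idea concrete, the following sketch shows how a 32-bit binary code could be packed into one integer, compared by Hamming distance, and used as a key in a look-up table. It does not reproduce the authors' network or training; the helper names and the majority-vote table are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def pack_bits(bits):
    """Pack 32 binary values into one unsigned 32-bit integer (first bit most significant)."""
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two 32-bit codes."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")

def build_lookup(codes, labels):
    """Map each code to the most frequent label seen with it (hypothetical LUT)."""
    votes = defaultdict(Counter)
    for code, label in zip(codes, labels):
        votes[code][label] += 1
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}

lut = build_lookup([1, 1, 2], ["f", "f", "m"])
```

With codes this short, a prediction for a new face reduces to a single dictionary access on its code, which is what makes a table-based approach plausible at large scale.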
Person Search in a Scene by Jointly Modeling People Commonness and Person Uniqueness BIBAFull-Text 937-940
  Yuanlu Xu; Bingpeng Ma; Rui Huang; Liang Lin
This paper presents a novel framework for a multimedia search task: searching for a person in a scene using human body appearance. Existing works mostly focus on two independent problems related to this task, i.e., people detection and person re-identification. However, a sequential combination of these two components does not solve the person search problem seamlessly for two reasons: 1) the errors in people detection are carried into person re-identification unavoidably; 2) the setting of person re-identification is different from that of person search which is essentially a verification problem. To bridge this gap, we propose a unified framework which jointly models the commonness of people (for detection) and the uniqueness of a person (for identification). We demonstrate superior performance of our approach on public benchmarks compared with the sequential combination of the state-of-the-art detection and identification algorithms.
Object Tracking using Reformative Transductive Learning with Sample Variational Correspondence BIBAFull-Text 941-944
  Tao Zhuo; Peng Zhang; Yanning Zhang; Wei Huang; Hichem Sahli
Tracking-by-learning strategies have effectively solved many challenging problems for visual tracking. When labeled samples are limited, the learning performance can be improved by exploiting unlabeled ones. Thus, a key issue for semi-supervised learning is the label assignment of the unlabeled samples, which is the principal focus of transductive learning. Unfortunately, the optimization scheme employed by transductive learning is hard to apply to online tracking because of the large amount of computation it requires for sample labeling. In this paper, a reformative transductive learning method is proposed with the variational correspondence between the learning samples, which is utilized to build an effective matching cost function for more efficient label assignment during the learning of representative separators. By using a weighted accumulative average to update the coefficients via a fixed budget of support vectors, the proposed tracker has been demonstrated to outperform most of the state-of-the-art trackers.
Multimodal Dynamic Networks for Gesture Recognition BIBAFull-Text 945-948
  Di Wu; Ling Shao
Multimodal input is a real-world situation in gesture recognition applications such as sign language recognition. In this paper, we propose a novel bi-modal (audio and skeleton joints) dynamic network for gesture recognition. First, state-of-the-art dynamic Deep Belief Networks are deployed to extract high level audio and skeletal joint representations. Then, instead of traditional late fusion, we adopt another layer of perceptron for cross modality learning taking the input from each individual net's penultimate layer. Finally, to account for temporal dynamics, the learned shared representations are used for estimating the emission probability to infer action sequences. In particular, we demonstrate that multimodal feature learning extracts semantically meaningful shared representations that outperform individual modalities, and we show the efficacy of the early fusion scheme against the traditional method of late fusion.
Event Detection based on Twitter Enthusiasm Degree for Generating a Sports Highlight Video BIBAFull-Text 949-952
  Keisuke Doman; Taishi Tomita; Ichiro Ide; Daisuke Deguchi; Hiroshi Murase
This paper presents a Twitter-based event detection method based on "Twitter Enthusiasm Degrees (TED)" toward generating a highlight video of a sports game. Existing methods not only depend on both languages and sports types but also often falsely detect non-target events. In contrast, the proposed method detects sports events using TEDs calculated from several kinds of string features independent of languages and sports. We applied the proposed method to actual sports games, and compared the detected events with the events present in broadcasted highlight videos, and confirmed the effectiveness and the language and sports type independencies of the proposed method.
Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback BIBAFull-Text 953-956
  Hong Zhang; Junsong Yuan; Xingyu Gao; Zhenyu Chen
Different types of multimedia data express high-level semantics from different aspects. How to learn comprehensive high-level semantics from different types of data and enable efficient cross-media retrieval has become an emerging hot issue. There are abundant statistical and semantic correlations among heterogeneous low-level media content, which makes it challenging to query cross-media data effectively. In this paper, we propose a new cross-media retrieval method based on short-term and long-term relevance feedback. Our method mainly focuses on two typical types of media data, i.e. image and audio. First, we build a multimodal representation via statistical canonical correlation between image and audio feature matrices, and define a cross-media distance metric for similarity measure; then we propose an optimization strategy based on relevance feedback, which fuses short-term learning results and long-term accumulated knowledge into the objective function. Experiments on an image-audio dataset have demonstrated the superiority of our method over several existing algorithms.
Transfer in Photography Composition BIBAFull-Text 957-960
  Hui-Tang Chang; Yu-Chiang Frank Wang; Ming-Syan Chen
In this paper, we propose a novel photography recomposition method, which aims at transferring the photography composition of a reference image to an input image automatically. Without any user interaction, our approach first identifies the salient foreground objects or image regions of interest, and the recomposition is performed by solving a graph-matching based optimization task. With an additional post-processing step to preserve the locality and boundary information of the recomposed visual components, we can solve the task of photography recomposition without the use of any prior knowledge on photography or predetermined image aesthetics rules. Experiments on a variety of images, including transferring the photography composition from real photos, sketches or even paintings, confirm the effectiveness of our proposed method.
Eyes Whisper Depression: A CCA based Multimodal Approach BIBAFull-Text 961-964
  Heysem Kaya; Albert Ali Salah
This paper presents our work on ACM MM Audio Visual Emotion Corpus 2013 (AVEC 2013) depression recognition sub-challenge using the baseline features in accordance with the challenge protocol. We use Canonical Correlation Analysis for audio-visual fusion as well as covariate extraction for the target task. The video baseline provides histograms of local phase quantization features extracted from 4x4=16 regions of the detected face. We summarize the video features over segments of length 20 seconds using mode and range functionals. We observe that features of the range functional that measure the variance tendency provide statistically significantly higher canonical correlation than mode functional features that measure the mean tendency. Moreover, when audio-visual features are used with varying number of covariates per region, the regions that were consistently found the best are the ones corresponding to two eyes and the right part of the mouth.
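The segment summarization step mentioned above can be sketched as follows: per-frame feature vectors are grouped into fixed-length segments, and each feature dimension is reduced to its mode and its range (maximum minus minimum). The segment length in frames, the tie-breaking rule for the mode, and the assumption that feature values are discrete enough for a mode to be meaningful are all illustrative choices, not details taken from the paper.

```python
from statistics import multimode

def segment_functionals(frames, frames_per_segment):
    """Summarize per-frame feature vectors segment by segment.

    frames: list of equal-length feature vectors, one per frame.
    Returns a list of (modes, ranges) pairs, one pair per segment.
    Ties for the mode are broken by taking the smallest value (assumption).
    """
    summaries = []
    for start in range(0, len(frames), frames_per_segment):
        segment = frames[start:start + frames_per_segment]
        columns = list(zip(*segment))  # one tuple of values per feature dimension
        modes = [min(multimode(col)) for col in columns]
        ranges = [max(col) - min(col) for col in columns]
        summaries.append((modes, ranges))
    return summaries

frames = [[1, 5], [1, 7], [2, 5], [3, 0]]
summary = segment_functionals(frames, frames_per_segment=3)
```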
Just Browsing?: Understanding User Journeys in Online TV BIBAFull-Text 965-968
  Yehia Elkhatib; Rebecca Killick; Mu Mu; Nicholas Race
Understanding the dynamics of user interactions and the behaviour of users as they browse for content is vital for advancements in content discovery, service personalisation, and recommendation engines which ultimately improve quality of user experience. In this paper, we analyse how more than 1,100 users browse an online TV service over a period of six months. Through the use of model-based clustering, we identify distinctive groups of users with discernible browsing patterns that vary during the course of the day.
Inductive Transfer Deep Hashing for Image Retrieval BIBAFull-Text 969-972
  Xinyu Ou; Lingyu Yan; Hefei Ling; Cong Liu; Maolin Liu
With the explosive increase of online images, fast similarity search is increasingly critical for large scale image retrieval. Several hashing methods have been proposed to accelerate image retrieval. A promising way is semantic hashing, which designs compact binary codes for a large number of images so that semantically similar images are mapped to similar codes. Supervised methods can handle such semantic similarity but they are prone to overfitting when the labeled data is scarce or noisy. In this paper, we concentrate on this issue and propose a novel Inductive Transfer Deep Hashing (ITDH) approach for semantic hashing based image retrieval. A transfer deep learning algorithm has been employed to learn the robust image representation, and the neighborhood-structure preserved method has been used to map the image into discriminative hash codes in Hamming space. The combination of the two techniques ensures that we obtain a good feature representation and a fast query speed without depending on large amounts of labeled data. Experimental results demonstrate that the proposed approach is superior to some state-of-the-art methods.
What Can We Learn about Motion Videos from Still Images? BIBAFull-Text 973-976
  Jianguang Zhang; Yahong Han; Jinhui Tang; Qinghua Hu; Jianmin Jiang
Human action recognition from motion videos plays an important role in multimedia analysis. Different from the temporal cues of action series in motion videos, the motion tendency can also be revealed from the still images or key frames. Thus, if the action knowledge in related still images can be well adapted to the target motion videos, we would have a great chance to improve the performance of video action recognition. In this paper, we propose a framework of Still-to-Motion Adaptation (SMA) for human action recognition. Common visual features are extracted both from the related images and target videos' key frames, by which the gap between still images and videos is bridged. Meanwhile, to utilize the unlabeled training videos in target domain, we incorporate a semi-supervised process into our framework. By minimizing the difference of action prediction from still features and motion features, we formulate the still-to-motion adaptation into a joint optimization process. Experiments successfully demonstrate the effectiveness of the proposed framework and show the better performance of action recognition compared with the state-of-the-art methods. We also analyze the impact on the recognition results of target videos by knowledge adaptation from still images.
Modeling Attributes from Category-Attribute Proportions BIBAFull-Text 977-980
  Felix X. Yu; Liangliang Cao; Michele Merler; Noel Codella; Tao Chen; John R. Smith; Shih-Fu Chang
Attribute-based representation has been widely used in visual recognition and retrieval due to its interpretability and cross-category generalization properties. However, classic attribute learning requires manually labeling attributes on the images, which is very expensive, and not scalable. In this paper, we propose to model attributes from category-attribute proportions. The proposed framework can model attributes without attribute labels on the images. Specifically, given a multi-class image dataset with N categories, we model an attribute, based on an N-dimensional category-attribute proportion vector, where each element of the vector characterizes the proportion of images in the corresponding category having the attribute. The attribute learning can be formulated as a learning from label proportion (LLP) problem. Our method is based on a newly proposed machine learning algorithm called ∝SVM. Finding the category-attribute proportions is much easier than manually labeling images, but it is still not a trivial task. We further propose to estimate the proportions from multiple modalities such as human commonsense knowledge, NLP tools, and other domain knowledge. The value of the proposed approach is demonstrated by various applications including modeling animal attributes, visual sentiment attributes, and scene attributes.
Exploiting Correlation Consensus: Towards Subspace Clustering for Multi-modal Data BIBAFull-Text 981-984
  Yang Wang; Xuemin Lin; Lin Wu; Wenjie Zhang; Qing Zhang
Often, a data object described by many features can be decomposed as multi-modalities, which always provide complementary information to each other. In this paper, we study subspace clustering for multi-modal data by effectively exploiting data correlation consensus across modalities, while keeping individual modalities well encapsulated. Our technique can yield a more ideal data similarity matrix, which encodes strong data correlations for the cross-modal data objects in the same subspace.
   To these ends, we propose a novel angular based regularizer coupled with our objective function, which is aided by trace lasso and minimized to yield sparse representation vectors encoding data correlations in multiple modalities. As a result, the sparse code vectors of the same cross-modal data have small angular difference so as to achieve the data correlation consensus simultaneously. This can generate a compatible data similarity matrix for multi-modal data. The final subspace clustering result is obtained by applying spectral clustering to this data similarity matrix. The effectiveness of our approach is validated by experiments conducted on real-world image datasets.
Learning Multimodal Neural Network with Ranking Examples BIBAFull-Text 985-988
  Xinyan Lu; Fei Wu; Xi Li; Yin Zhang; Weiming Lu; Donghui Wang; Yueting Zhuang
To support cross-modal information retrieval, cross-modal learning to rank approaches utilize ranking examples (e.g., an example may be a text query and its corresponding ranked images) to learn an appropriate ranking (similarity) function. However, the fact that each modality is represented with intrinsically different low-level features hinders these approaches from better reducing the heterogeneity-gap between the modalities and thus giving satisfactory retrieval results. In this paper, we consider learning with neural networks, from the perspective of optimizing the listwise ranking loss of the cross-modal ranking examples. The proposed model, named Cross-Modal Ranking Neural Network (CMRNN), benefits from the advances of both neural networks in learning high-level semantics and learning to rank techniques in learning ranking functions, such that the learned cross-modal ranking function is implicitly embedded in the learned high-level representation for data objects with different modalities (e.g., text and imagery) to perform cross-modal retrieval directly. We compare CMRNN to existing state-of-the-art cross-modal ranking methods on two datasets and show that it achieves a better performance.
Supervised Discriminative Hashing for Compact Binary Codes BIBAFull-Text 989-992
  Viet Anh Nguyen; Jiwen Lu; Minh N. Do
Binary hashing has been increasingly popular for efficient similarity search in large-scale vision problems. This paper presents a novel Supervised Discriminative Hashing (SDH) method by jointly modeling the global and local manifold structures. Specifically, a family of discriminative hash functions is designed to map data points of the original high-dimensional space into nearby compact binary codes while preserving the geometrical similarity and discriminant properties in both global and local neighborhoods. Furthermore, the quantization loss between the original data and the binary codes together with the even binary code distribution are also taken into account in the optimization to generate more efficient and compact binary codes. Experimental results have demonstrated the proposed method outperforms the state-of-the-art.
Personalized Visual Vocabulary Adaption for Social Image Retrieval BIBAFull-Text 993-996
  Zhenxing Niu; Shiliang Zhang; Xinbo Gao; Qi Tian
With the popularity of mobile devices and social networks, users can easily build their personalized image sets. Thus, personalized image analysis, indexing, and retrieval have become important topics in social media analysis. Because of users' diverse preferences, their personalized image sets are usually related to specific topics and show large feature distribution bias from general Internet images. Therefore, the visual vocabulary trained on general Internet images may not fit users' personalized image sets very well. To improve the image retrieval performance on personalized image sets, we propose the personalized visual vocabulary adaption which removes non-discriminative visual words and replaces them with more exact and discriminative ones, i.e., adapts a general vocabulary toward a specific user's image set. The proposed algorithm updates the visual vocabulary during off-line feature quantization, and operates on a limited number of visual words, hence shows satisfying efficiency. Extensive experiments of image search on public datasets demonstrate the efficiency and superior performance of our approach.
Co-Saliency Detection via Base Reconstruction BIBAFull-Text 997-1000
  Xiaochun Cao; Yupeng Cheng; Zhiqiang Tao; Huazhu Fu
Co-saliency aims at detecting common saliency in a series of images, which is useful for a variety of multimedia applications. In this paper, we cast co-saliency detection as a reconstruction problem: the foreground could be well reconstructed by using the reconstruction bases, which are extracted from each image and have similar appearances in the feature space. We first obtain a candidate set by measuring the saliency prior of each image. Relevance information among the multiple images is utilized to remove the inaccurate reconstruction bases. Finally, with the updated reconstruction bases, we rebuild the images and provide the reconstruction error regarded as a negative correlational value in co-saliency measurement. The satisfactory quantitative and qualitative experimental results on two benchmark datasets demonstrate the efficiency and effectiveness of our method.
Automatic Facial Image Annotation and Retrieval by Integrating Voice Label and Visual Appearance BIBAFull-Text 1001-1004
  Hong-Wun Jheng; Bor-Chun Chen; Yan-Ying Chen; Winston Hsu
Annotation is important for managing and retrieving a large number of photos, but it is generally labor-intensive and time-consuming. However, speaking while taking photos is straightforward and effortless, and using voice for annotation is faster than typing words. To best reduce the manual cost of annotating photos, we propose a novel framework which utilizes the scarce spoken annotations recorded while capturing as voice labels and automatically labels every facial image in the photo collection. To accomplish this goal, we employ a probabilistic graphical model which integrates voice labels and visual appearances for inference. Combined with group prior estimation and gender attribute association, we can achieve an outstanding performance on the proposed synthesized group photo collections.
Query-Adaptive Hash Code Ranking for Fast Nearest Neighbor Search BIBAFull-Text 1005-1008
  Tianxu Ji; Xianglong Liu; Cheng Deng; Lei Huang; Bo Lang
Recently hash-based nearest neighbor search has become attractive in many applications due to its compressed storage and fast query speed. However, the quantization in the hashing process usually degenerates its discriminative power when using Hamming distance ranking. To enable fine-grained ranking, hash bit weighting has proved to be a promising solution. Though achieving satisfying performance improvement, state-of-the-art weighting methods usually heavily rely on the projection's distribution assumption, and thus can hardly be directly applied to more general types of hashing algorithms. In this paper, we propose a new ranking method named QRank with query-adaptive bitwise weights by exploiting both the discriminative power of each hash function and their complement for nearest neighbor search. QRank is a general weighting method for all kinds of hashing algorithms without any strict assumptions. Experimental results on two well-known benchmarks MNIST and NUS-WIDE show that the proposed method can achieve up to 17.11% performance gains over state-of-the-art methods.
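The general idea of bitwise weighting can be sketched as a weighted Hamming distance: each mismatching bit contributes its own weight instead of a constant 1, which breaks the many ties that plain Hamming ranking produces. The abstract does not say how QRank derives its query-adaptive weights, so in this sketch the weights are simply supplied by the caller; all names here are hypothetical.

```python
def weighted_hamming(query_bits, item_bits, weights):
    """Sum the weights of the bit positions where the two codes differ."""
    return sum(w for q, b, w in zip(query_bits, item_bits, weights) if q != b)

def rank(query_bits, database, weights):
    """Rank items by weighted Hamming distance to the query.

    database: list of (item_id, bits) pairs.
    Ties are broken by item_id so the ordering is deterministic.
    """
    ordered = sorted(
        database,
        key=lambda item: (weighted_hamming(query_bits, item[1], weights), item[0]),
    )
    return [item_id for item_id, _ in ordered]

database = [("a", [0, 0, 1]), ("b", [1, 1, 1]), ("c", [1, 0, 1])]
order = rank([1, 0, 1], database, weights=[0.9, 0.2, 0.5])
```

In this toy database, items "a" and "b" are both one bit away from the query under plain Hamming distance; the weights separate them because they disagree with the query on differently weighted bits.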
Systematic Assessment of the Video Recording Position for User-generated Event Videos BIBAFull-Text 1009-1012
  Stefan Wilk; Wolfgang Effelsberg
The increasing capabilities of mobile handhelds to record high definition videos enable us to share moments with friends and the public. In large-scale events such as concerts, the manifold of potential views recorded by the audience allows an event to be watched from different perspectives. In this work, we elaborate on the effects of different recording positions on quality and show that especially with retail smart phone camera sensors the quality differs as the position changes. This is a first systematic approach to describe why user-generated video sequences differ in terms of quality depending on changing recording positions. To achieve a sound understanding, we conduct a large-scale systematic subjective study using crowdsourcing, inspecting the effect of distance and orientation of a recorder in relation to the event taking place. Our results indicate an effect of the recording position on the perceived quality of a video.
BAP: Bimodal Attribute Prediction for Zero-Shot Image Categorization BIBAFull-Text 1013-1016
  Hanhui Li; Donghui Li; Xiaonan Luo
Recent advances in attribute-based methods provide the zero-shot learning problem with practical solutions. In attribute-based methods, visual attributes are introduced to fill the gap between low-level image features and high-level semantic information. This paper proposes a novel bimodal attribute prediction model called BAP, which can better predict visual attributes in images. BAP fuses advantages of the conventional direct attribute prediction (DAP) and indirect attribute prediction (IAP) on the level of attribute prediction. It contains an attribute-classifier pooling process that generates a large number of base classifiers and a combination strategy that integrates these classifiers. We explore and propose four BAP models with different combination strategies in this paper, and experimentally show that our BAP outperforms the conventional models both in offline and online zero-shot image categorization.
Building a Self-Learning Eye Gaze Model from User Interaction Data BIBAFull-Text 1017-1020
  Michael Xuelin Huang; Tiffany C. K. Kwok; Grace Ngai; Hong Va Leong; Stephen C. F. Chan
Most eye gaze estimation systems rely on explicit calibration, which is inconvenient to the user, limits the amount of possible training data and consequently the performance. Since there is likely a strong correlation between gaze and interaction cues, such as cursor and caret locations, a supervised learning algorithm can learn the complex mapping between gaze features and the gaze point by training on incremental data collected implicitly from normal computer interactions. We develop a set of robust geometric gaze features and a corresponding data validation mechanism that identifies good training data from noisy interaction-informed data collected in real-use scenarios. Based on a study of gaze movement patterns, we apply behavior-informed validation to extract gaze features that correspond with the interaction cue, and data-driven validation provides another level of crosschecking using previous good data. Experimental evaluation shows that the proposed method achieves an average error of 4.06°, and demonstrates the effectiveness of the proposed gaze estimation method and corresponding validation mechanism.
Toward an Estimation of User Tagging Credibility for Social Image Retrieval BIBAFull-Text 1021-1024
  Alexandru Lucian Ginsca; Adrian Popescu; Bogdan Ionescu; Anil Armagan; Ioannis Kanellos
Existing image retrieval systems exploit textual and/or visual information to return results. The retrieval process is mostly focused on data themselves and disregards the data sources. In Web 2.0 platforms, the quality of annotations provided by different users can vary strongly. To account for this variability, we complement existing methods by introducing user tagging credibility in the retrieval process. Tagging credibility is automatically estimated by leveraging a large set of visual concept classifiers learned with Overfeat, a convolutional neural network (CNN) based feature. A good image retrieval system should return results that are both relevant and diversified and tackle both challenges. We diversify results by using a k-Means algorithm with user based cluster ranking. We increase relevance by favoring images uploaded from users with good credibility estimates. Evaluation is performed on DIV400, a publicly available social image retrieval dataset. Experiments show that our method provides interesting performances compared to existing approaches.
Affective Image Retrieval via Multi-Graph Learning BIBAFull-Text 1025-1028
  Sicheng Zhao; Hongxun Yao; You Yang; Yanhao Zhang
Images can convey rich emotions to viewers. Recent research on image emotion analysis mainly focused on affective image classification, trying to find features that can classify emotions better. We concentrate on affective image retrieval and investigate the performance of different features on different kinds of images in a multi-graph learning framework. Firstly, we extract commonly used features of different levels for each image. Generic features and features derived from elements-of-art are extracted as low-level features. Attributes and interpretable principles-of-art based features are viewed as mid-level features, while semantic concepts described by adjective noun pairs and facial expressions are extracted as high-level features. Secondly, we construct a single graph for each kind of feature to test the retrieval performance. Finally, we combine the multiple graphs together in a regularization framework to learn the optimized weights of each graph to efficiently explore the complementation of different features. Extensive experiments are conducted on five datasets and the results demonstrate the effectiveness of the proposed method.
Recognizing Thousands of Legal Entities through Instance-based Visual Classification BIBAFull-Text 1029-1032
  Valentin Leveau; Alexis Joly; Olivier Buisson; Pierre Letessier; Patrick Valduriez
This paper considers the problem of recognizing legal entities in visual contents in a similar way to named-entity recognizers for text documents. Whereas previous works were restricted to the recognition of a few tens of logotypes, we generalize the problem to the recognition of thousands of legal persons, each being modeled by a rich corporate identity automatically built from web images. We introduce a new geometrically-consistent instance-based classification method that is shown to outperform state-of-the-art techniques on several challenging datasets while being much more scalable. Further experiments performed on an automatic web crawl of 5,824 legal entities demonstrate the scalability of the approach.
Automatic fine-grained hyperlinking of videos within a closed collection using scene segmentation BIBAFull-Text 1033-1036
  Evlampios Apostolidis; Vasileios Mezaris; Mathilde Sahuguet; Benoit Huet; Barbora Cervenková; Daniel Stein; Stefan Eickeler; José Luis Redondo Garcia; Raphaël Troncy; Lukás Pikora
This paper introduces a framework for establishing links between related media fragments within a collection of videos. A set of analysis techniques is applied for extracting information from different types of data. Visual-based shot and scene segmentation is performed for defining media fragments at different granularity levels, while visual cues are detected from keyframes of the video via concept detection and optical character recognition (OCR). Keyword extraction is applied on textual data such as the output of OCR, subtitles and metadata. This set of results is used for the automatic identification and linking of related media fragments. The proposed framework exhibited competitive performance in the Video Hyperlinking sub-task of MediaEval 2013, indicating that video scene segmentation can provide more meaningful segments, compared to other decomposition methods, for hyperlinking purposes.
Automatic Maya hieroglyph retrieval using shape and context information BIBAFull-Text 1037-1040
  Rui Hu; Carlos Pallan Gayol; Guido Krempel; Jean-Marc Odobez; Daniel Gatica-Perez
We propose an automatic Maya hieroglyph retrieval method integrating shape and glyph context information. Two recent local shape descriptors, Gradient Field Histogram of Orientation Gradient (GF-HOG) and Histogram of Orientation Shape Context (HOOSC), are evaluated. To encode the context information, we propose to convert each Maya glyph block into a first-order Markov chain and apply the co-occurrence of neighbouring glyphs. The retrieval results obtained based on visual matching are therefore re-ranked. Experimental results show that our method can significantly improve the glyph retrieval accuracy even with a basic co-occurrence model. Furthermore, two unique glyph datasets are contributed which can be used as novel shape benchmarks in future research.
A Dataset and Taxonomy for Urban Sound Research BIBAFull-Text 1041-1044
  Justin Salamon; Christopher Jacoby; Juan Pablo Bello
Automatic urban sound classification is a growing area of research with applications in multimedia retrieval and urban informatics. In this paper we identify two main barriers to research in this area -- the lack of a common taxonomy and the scarceness of large, real-world, annotated data. To address these issues we present a taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes. The challenges presented by the new dataset are studied through a series of experiments using a baseline classification system.
Instructive Video Retrieval Based on Hybrid Ranking and Attribute Learning: A Case Study on Surgical Skill Training BIBAFull-Text 1045-1048
  Lin Chen; Peng Zhang; Baoxin Li
Video-based systems have been increasingly used in various training tasks in applications like sports, dancing, and surgery. One key task in automating such systems is to automatically select reference videos for a given training video of a trainee. In this paper, we formulate a new problem of instructive video retrieval and propose a solution using both attribute learning and learning to rank. The method first evaluates a user's skill attributes by relative attribute learning. Then, the most critical skill attribute in need of improvement is selected and reported to the user. Finally, a hybrid learning-to-rank method is employed to retrieve instructive videos from a dataset, which serve as references for the user. Two main technical problems are solved in this method. First, we combine both skill and visual features to characterize skill superiority and context similarity. Second, we propose a hybrid ranking approach that works with both pair-wise and point-wise labels of the data. The benefit of the proposed method over other heuristic methods is demonstrated by both objective and subjective experiments, using surgical training videos as a case study.
A Real-Time Smart Display Detection System BIBAFull-Text 1049-1052
  Shu Shi; John Barrus
A smart display detection system is proposed that allows users to connect with displays using mobile cameras. The smart displays dynamically update a server with information about the current screen content and the system matches captured images from mobile devices with the screen information. A synchronized timestamped matching strategy is employed to achieve high performance in detecting screens playing motion intensive video and an aggressive feature selection method is used to minimize bandwidth requirements.
Pose Maker: A Pose Recommendation System for Person in the Landscape Photographing BIBAFull-Text 1053-1056
  Shuang Ma; Yangyu Fan; Chang Wen Chen
Posing like a fashion model and taking professional-grade photos are two challenging tasks, especially for novice users. Pose Maker, an innovative system for pose and photo composition recommendation and synthesis, is developed in this work. Given a user-provided clothing color and gender, the system not only offers suitable poses, but also helps users take high visual quality photos by generating the visual effect of person-in-the-landscape pictures. To recommend poses, we first propose a view-specific professional learning model to help users select several compatible poses for a given image. Based on the selection results, a hierarchical pose image synthesis module is designed to synthesize the selected poses along with the scene to be captured in the most suitable position and size, taking into consideration several important factors in picture composition, including aesthetic photography principles, color harmony and focal length parameters. Extensive experimental evaluations and analysis on test images of various conditions demonstrate the effectiveness of the proposed system.
Representing Musical Patterns via the Rhythmic Style Histogram Feature BIBAFull-Text 1057-1060
  Matthew Prockup; Jeffrey Scott; Youngmoo E. Kim
When listening to music, humans often focus on melodic and rhythmic elements to identify specific songs or genres. While these representations may be quite simple, they still capture and differentiate higher level aspects of music such as expressive intent and musical style. In this work we seek to extract and represent rhythmic patterns from a polyphonic corpus of audio encompassing a number of styles. A compact feature is designed that probabilistically models rhythmic activations within musical beat divisions through histograms of Inter-Onset-Intervals (IOI). Onset detection functions are calculated from multiple frequency bands of a perceptually motivated filter bank. This allows for patterns of lower pitched and higher pitched onsets to be described separately. Through a set of supervised and unsupervised experiments, we show that this feature is well suited for a variety of tasks in which quantifying rhythmic style is necessary.
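A beat-relative inter-onset-interval histogram of the kind mentioned in this abstract might be sketched as follows. The sketch assumes onset times and a beat period are already known and handles a single band, whereas the abstract describes onsets from multiple frequency bands. The function name and the `bins_per_beat` and `max_beats` parameters are hypothetical.

```python
def ioi_histogram(onset_times, beat_period, bins_per_beat=4, max_beats=2):
    """Normalized histogram of inter-onset intervals in fractions of a beat.

    Bin i counts intervals of roughly (i + 1) / bins_per_beat beats.
    """
    n_bins = bins_per_beat * max_beats
    counts = [0] * n_bins
    for earlier, later in zip(onset_times, onset_times[1:]):
        ioi_in_beats = (later - earlier) / beat_period
        index = round(ioi_in_beats * bins_per_beat) - 1
        if 0 <= index < n_bins:  # intervals outside the histogram range are dropped
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Onsets in seconds at 120 BPM (beat period 0.5 s).
hist = ioi_histogram([0.0, 0.25, 0.5, 1.0, 1.5, 2.5], beat_period=0.5)
```

Computing one such histogram per frequency band and concatenating them would let low-pitched and high-pitched onset patterns be described separately, as the abstract suggests.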
Mobile Photo Album Management with Multiscale Timeline BIBAFull-Text 1061-1064
  Kolbeinn Karlsson; Wei Jiang; Dong-Qing Zhang
Traditional photo browsing systems developed for PCs are inefficient for browsing and searching of large photo albums on mobile devices due to the small screen size and limited mobile processing power. We propose a new concept in this paper, the multiscale timeline, where photos are grouped into clusters and displayed sequentially on a scaled timeline with user controllable time scales, enabling multiscale overview of the photo album for efficient browsing and searching. To address the slow speed of re-clustering when new photos are added, a new incremental spectral clustering algorithm is further developed, which is an order of magnitude faster than the traditional spectral clustering algorithm and its conventional incremental version. Our implementation of the system on mobile devices shows a better user experience and browsing efficiency based on the experiments over large real-world photo collections.
Interactive Audio Web Development Workflow BIBAFull-Text 1065-1068
  Lonce Wyse
New low-level sound synthesis capabilities have recently become available in Web browsers. However, there is a considerable gap between the enabling technology for interactive audio and its wide-spread adoption in Web media content. We identify several areas where technologies are necessary to support the various stages of development and deployment, describe systems we have developed to address those needs, and show how they work together within a specific Web content development scenario.
What Strikes the Strings of Your Heart?: Multi-Label Dimensionality Reduction for Music Emotion Analysis BIBAFull-Text 1069-1072
  Yang Liu; Yan Liu; Yu Zhao; Kien A. Hua
Music can convey and evoke powerful emotions. This amazing ability has fascinated the general public and also attracted researchers from different fields to discover the relationship between music and emotion. Psychologists have indicated that some specific characteristics of rhythm, harmony, and melody, as well as their combinations, can evoke certain kinds of emotions. Their hypotheses are based on real-life experience and proved by psychological paradigms on human beings. Aiming at the same target, this paper intends to design a systematic and quantitative framework and answer three questions of wide interest: 1) what are the intrinsic features embedded in the music signal that essentially evoke human emotions; 2) to what extent these features influence human emotions; and 3) whether the findings from computational models are consistent with the existing research results from psychological experiments. We formulate the problem as a multi-label dimensionality reduction problem and provide the optimal solution. The proposed multi-emotion similarity preserving embedding technique not only shows better performance on two standard music emotion datasets but also demonstrates some interesting observations for further research in this interdisciplinary topic.
Fusing Music and Video Modalities Using Multi-timescale Shared Representations BIBAFull-Text 1073-1076
  Bing Xu; Xiaogang Wang; Xiaoou Tang
We propose a deep learning architecture to solve the problem of multimodal fusion of multi-timescale temporal data, using music and video parts extracted from Music Videos (MVs) in particular. We capture the correlations between music and video at multiple levels by learning shared feature representations with Deep Belief Networks (DBN). The shared representations combine information from multiple modalities for decision making tasks, and are used to evaluate matching degrees between modalities and to retrieve matched modalities using single or multiple modalities as input. Moreover, we propose a novel deep architecture to handle temporal data at multiple timescales. When processing long sequences with varying length, we propose to extract hierarchical shared representations by concatenating deep representations at different levels, and to perform decision fusion with a feed forward neural network, which takes input from predictions of local and global classifiers trained with shared representations at each level. The effectiveness of our method is demonstrated through MV classification and retrieval.

Posters 3

Representing And Recognizing Motion Trajectories: A Tube And Droplet Approach BIBAFull-Text 1077-1080
  Yang Zhou; Weiyao Lin; Hang Su; Jianxin Wu; Jinjun Wang; Yu Zhou
This paper addresses the problem of representing and recognizing motion trajectories. We first propose to derive scene-related equipotential lines for points in a motion trajectory and concatenate them to construct a 3D tube for representing the trajectory. Based on this 3D tube, a droplet-based method is further proposed which derives a "water droplet" from the 3D tube and recognizes trajectory activities accordingly. Our proposed 3D tube can effectively embed both motion and scene-related information of a motion trajectory while the proposed droplet-based method can suitably catch the characteristics of the 3D tube for activity recognition. Experimental results demonstrate the effectiveness of our approach.
Multi-modal Language Models for Lecture Video Retrieval BIBAFull-Text 1081-1084
  Huizhong Chen; Matthew Cooper; Dhiraj Joshi; Bernd Girod
We propose Multi-modal Language Models (MLMs), which adapt latent variable techniques for document analysis to exploring co-occurrence relationships in multi-modal data. In this paper, we focus on the application of MLMs to indexing text from slides and speech in lecture videos, and subsequently employ a multi-modal probabilistic ranking function for lecture video retrieval. The MLM achieves highly competitive results against well established retrieval methods such as the Vector Space Model and Probabilistic Latent Semantic Analysis. When noise is present in the data, retrieval performance with MLMs is shown to improve with the quality of the spoken text extracted from the video.
Food Detection and Recognition Using Convolutional Neural Network BIBAFull-Text 1085-1088
  Hokuto Kagaya; Kiyoharu Aizawa; Makoto Ogawa
In this paper, we apply a convolutional neural network (CNN) to the tasks of detecting and recognizing food images. Because of the wide diversity of types of food, image recognition of food items is generally very difficult. However, deep learning has been shown recently to be a very powerful image recognition technique, and CNN is a state-of-the-art approach to deep learning. We applied CNN to the tasks of food detection and recognition through parameter optimization. We constructed a dataset of the most frequent food items in a publicly available food-logging system, and used it to evaluate recognition performance. CNN showed significantly higher accuracy than did traditional support-vector-machine-based methods with handcrafted features. In addition, we found that the convolution kernels show that color dominates the feature extraction process. For food image detection, CNN also showed significantly higher accuracy than a conventional method did.
Locality Preserving Discriminative Hashing BIBAFull-Text 1089-1092
  Kang Zhao; Hongtao Lu; Yangcheng He; Shaokun Feng
Hashing for large scale similarity search has become more and more popular because of its improvement in computational speed and storage reduction. Semi-supervised Hashing (SSH) has been proven effective since it integrates both labeled and unlabeled data to leverage semantic similarity while remaining robust to overfitting. However, it ignores the global label information and the local structure of the feature space. In this paper, we concentrate on these two issues and propose a novel semi-supervised hashing method called Locality Preserving Discriminative Hashing which combines two classical dimensionality reduction approaches, Linear Discriminant Analysis (LDA) and Locality Preserving Projection (LPP). The proposed method presents a rigorous formulation in which the supervised term tries to maintain the global information of the labeled data while the unsupervised term provides effective regularization to model local relationships of the unlabeled data. We apply an efficient sequential procedure to learn the hashing functions. Experimental comparisons with other state-of-the-art methods on three large scale datasets demonstrate the effectiveness and efficiency of our method.
Augmented Image Retrieval using Multi-order Object Layout with Attributes BIBAFull-Text 1093-1096
  Xiaochun Cao; Xingxing Wei; Xiaojie Guo; Yahong Han; Jinhui Tang
In image retrieval, users' search intention is usually specified by textual queries, exemplar images, concept maps, and even sketches, which can only express the search intention partially. These query strategies lack the ability to indicate the Regions Of Interest (ROIs) and represent the spatial or semantic correlations among the ROIs, which results in the so-called semantic gap between users' search intention and images' low-level visual content. In this paper, we propose a novel image search method, which allows the users to indicate any number of ROIs within the query as well as utilize various semantic concepts and spatial relations to search images. Specifically, we first propose a structured descriptor to jointly represent the categories, attributes, and spatial relations among objects. Then, based on the defined descriptor, our method ranks the images in the database according to the matching scores w.r.t. the category, attribute, and spatial relations. We conduct experiments on the aPascal and aYahoo datasets, and experimental results show the advantage of the proposed method compared to the state of the art.
The Stack-of-Rings Interface for Large-Scale Image Browsing on Mobile Touch Devices BIBAFull-Text 1097-1100
  Klaus Schoeffmann
We propose a new interface for browsing large image collections on touch-enabled devices, which allows for faster visual search than a common grid interface. The proposed interface consists of 3D rings arranged in a vertical stack. Each ring can be rotated independently with a horizontal wipe gesture and the whole stack can be moved up and down with vertical wipe gestures. We present results from a user study with 16 participants, where we compared the performance of the proposed interface to the one provided by the default image browsing interface on tablets. The evaluation results show that the proposed interface performs significantly faster than the default interface on the tablet and achieves a significantly better rating by the users.
Automatic Personality and Interaction Style Recognition from Facebook Profile Pictures BIBAFull-Text 1101-1104
  Fabio Celli; Elia Bruni; Bruno Lepri
In this paper, we address the issue of personality and interaction style recognition from profile pictures in Facebook. We recruited volunteers among Facebook users and collected a dataset of profile pictures, labeled with gold standard self-assessed personality and interaction style labels. Then, we exploited a bag-of-visual-words technique to extract features from pictures. Finally, different machine learning approaches were used to test the effectiveness of these features in predicting personality and interaction style traits. Our good results show that this task is very promising, because profile pictures convey a lot of information about a user and are directly connected to impression formation and identity management.
Automatic Image Cropping using Visual Composition, Boundary Simplicity and Content Preservation Models BIBAFull-Text 1105-1108
  Chen Fang; Zhe Lin; Radomir Mech; Xiaohui Shen
Cropping is one of the most common tasks in image editing for improving the aesthetic quality of a photograph. In this paper, we propose a new, aesthetic photo cropping system which combines three models: visual composition, boundary simplicity, and content preservation. The visual composition model measures the quality of composition for a given crop. Instead of manually defining rules or score functions for composition, we learn the model from a large set of well-composed images via discriminative classifier training. The boundary simplicity model measures the clearness of the crop boundary to avoid object cutting-through. The content preservation model computes the amount of salient information kept in the crop to avoid excluding important content. By assigning a hard lower bound constraint on the content preservation and linearly combining the scores from the visual composition and boundary simplicity models, the resulting system achieves significant improvement over recent cropping methods in both quantitative and qualitative evaluation.
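The score combination described at the end of this abstract (a hard lower bound on content preservation, then a linear combination of the other two scores) might look like the sketch below. The three model scores are assumed to be already computed and scaled to [0, 1]. The threshold `min_content`, the mixing `weight` and all names are hypothetical values for illustration.

```python
def score_crop(composition, boundary_simplicity, content_kept,
               min_content=0.6, weight=0.5):
    """Return a combined score, or None if the crop drops too much content."""
    if content_kept < min_content:  # hard lower-bound constraint
        return None
    return weight * composition + (1 - weight) * boundary_simplicity

def best_crop(candidates, **options):
    """Pick the name of the highest-scoring crop that passes the constraint."""
    scored = []
    for crop in candidates:
        score = score_crop(crop["composition"], crop["boundary"],
                           crop["content"], **options)
        if score is not None:
            scored.append((score, crop["name"]))
    return max(scored)[1] if scored else None

candidates = [
    {"name": "A", "composition": 0.9, "boundary": 0.9, "content": 0.3},
    {"name": "B", "composition": 0.6, "boundary": 0.8, "content": 0.7},
    {"name": "C", "composition": 0.5, "boundary": 0.5, "content": 0.9},
]
chosen = best_crop(candidates)
```

Crop "A" has the best composition and boundary scores but is rejected outright because it keeps too little salient content, so "B" is chosen.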
Mask Assisted Object Coding with Deep Learning for Object Retrieval in Surveillance Videos BIBAFull-Text 1109-1112
  Kezhen Teng; Jinqiao Wang; Min Xu; Hanqing Lu
Retrieving visual objects from a large-scale video dataset is a focus of multimedia research, but it remains a challenging task due to imprecise object extraction and partial occlusion. This paper presents a novel approach to efficiently encode and retrieve visual objects, which addresses some practical complications in surveillance videos. Specifically, we take advantage of the mask information to assist object representation, and develop an encoding method by utilizing highly nonlinear mapping with a deep neural network. Furthermore, we add some occluded noise into the learning process to enhance the robustness in dealing with background noise and partial occlusions. A real-life surveillance video dataset containing over 10 million objects is built to evaluate the proposed approach. Experimental results show our approach significantly outperforms state-of-the-art solutions for object retrieval in large-scale video datasets.
Estimate Gaze Density by Incorporating Emotion BIBAFull-Text 1113-1116
  Huiying Liu; Min Xu; Xiangjian He; Jinqiao Wang
Gaze density estimation has attracted many research efforts in the past years. The factors considered in the existing methods include low level feature saliency, spatial position, and objects. Emotion, as an important factor driving attention, has not been taken into account. In this paper, we are the first to estimate gaze density by incorporating emotion. To estimate the emotion intensity of each position in an image, we consider three aspects: generic emotional content, facial expression intensity, and emotional objects. Generic emotional content is estimated by using multiple instance learning, which is employed to train an emotion detector from weakly labeled images. Facial expression intensity is estimated by using a ranking method. Emotional objects are detected, taking blood/injury and worm/snake as examples. Finally, emotion intensity, low level feature saliency, and spatial position are fused through a linear support vector machine to estimate gaze density. The performance is tested on a public eye tracking dataset. Experimental results indicate that incorporating emotion does improve the performance of gaze density estimation.
Trajectory Based Jump Pattern Recognition in Broadcast Volleyball Videos BIBAFull-Text 1117-1120
  Chun-Chieh Hsu; Hua-Tsung Chen; Chien-Li Chou; Chien-Peng Ho; Suh-Yin Lee
Jump actions are typically accompanied by spiking and imply significant events in volleyball matches. In this paper, we propose an effective system capable of jump pattern recognition in player moving trajectories from long broadcast volleyball videos. First, the entire video is segmented into clips of rallies by shot segmentation and whistle detection. Then, camera calibration is adopted to find the correspondence between coordinates in the video frames and real-world coordinates. With the homographic transformation matrix computed, real-world player moving trajectories can be derived by a sequence of tracked player locations in video frames. Jump patterns are recognized from the player moving trajectory by using a sliding window scheme with physics-based validation and context constraint. Finally, the jump locations can be estimated and jump tracks can be separated from the planar moving tracks. The experiments conducted on broadcast volleyball videos show promising results.
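The mapping from tracked image locations to real-world court coordinates mentioned in this abstract is a standard planar homography. A minimal sketch follows, assuming the 3x3 matrix has already been obtained from camera calibration; the example matrix and point values are made up for illustration.

```python
def apply_homography(H, x, y):
    """Map an image point (x, y) to plane coordinates with a 3x3 homography."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the homogeneous coordinate

def project_trajectory(H, image_points):
    """Convert a sequence of tracked image locations to court coordinates."""
    return [apply_homography(H, x, y) for x, y in image_points]

# Hypothetical calibration result with a perspective term in the last row.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.001, 1.0]]
court_track = project_trajectory(H, [(100.0, 0.0), (100.0, 1000.0)])
```

Because a homography only models points on the court plane, airborne positions do not map consistently, which is one reason jump segments need separate handling from planar movement.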
Fast Instance Search Based on Approximate Bichromatic Reverse Nearest Neighbor Search BIBAFull-Text 1121-1124
  Masakazu Iwamura; Nobuaki Matozaki; Koichi Kise
In the TRECVID Instance Search (INS) task, it is known that use of BM25, which is an improvement of TF-IDF, greatly improves retrieval performance. Its calculation, however, incurs a tremendous computational cost, which makes its use intractable. In this paper, we present an efficient method for computing it. Since the BM25 is obtained by solving the bichromatic reverse nearest neighbor (BRNN) search problem, we propose an approximate method for the problem based on the state-of-the-art approximate nearest neighbor search method, bucket distance hashing (BDH). An experiment using the TRECVID INS 2012 dataset showed that the proposed method reduced computational cost to less than 1/3500 of the brute-force search while maintaining accuracy.
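For readers unfamiliar with BM25, the sketch below shows the textbook Okapi BM25 score over term counts. It is not the BRNN-based computation this paper proposes; in instance search the "terms" would be visual words. The defaults `k1=1.2` and `b=0.75` are common conventional values, and the IDF form with `+ 1` inside the logarithm is one widely used non-negative variant, both chosen here as assumptions.

```python
import math

def bm25_score(query_terms, doc_term_counts, doc_len, avg_doc_len,
               doc_freq, n_docs, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for a bag of query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_term_counts.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)  # number of documents containing the term
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

doc_freq = {"w1": 10, "w2": 500}
once = bm25_score(["w1"], {"w1": 1}, 100, 100, doc_freq, 1000)
twice = bm25_score(["w1"], {"w1": 2}, 100, 100, doc_freq, 1000)
rare = bm25_score(["w1"], {"w1": 1, "w2": 1}, 100, 100, doc_freq, 1000)
common = bm25_score(["w2"], {"w1": 1, "w2": 1}, 100, 100, doc_freq, 1000)
```

Compared with plain TF-IDF, the term-frequency contribution saturates as `tf` grows and is normalized by document length.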
A Robust Panel Extraction Method for Manga BIBAFull-Text 1125-1128
  Xufang Pang; Ying Cao; Rynson W. H. Lau; Antoni B. Chan
Automatically extracting frames/panels from digital comic pages is crucial for techniques that facilitate comic reading on mobile devices with limited display areas. However, automatic panel extraction for manga, i.e., Japanese comics, can be especially challenging, largely because of its complex panel layout design mixed with various visual symbols throughout the page. In this paper, we propose a robust method for automatically extracting panels from digital manga pages. Our method first extracts the panel block by closing open panels and identifying a page background mask. It then performs a recursive binary splitting to partition the panel block into a set of sub-blocks, where an optimal splitting line at each recursive level is determined adaptively.
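Recursive binary splitting of a panel block can be sketched on a small binary grid. This simplified version swaps in a much cruder criterion than the adaptive optimal splitting line the abstract describes: it splits only along fully blank, axis-aligned rows or columns. The grid encoding (1 for panel content, 0 for background) and the function name are assumptions.

```python
def split_panels(page, top, left, bottom, right):
    """Return panel boxes (top, left, bottom, right), bottom/right exclusive."""
    rows = [r for r in range(top, bottom)
            if any(page[r][c] for c in range(left, right))]
    cols = [c for c in range(left, right)
            if any(page[r][c] for r in range(top, bottom))]
    if not rows or not cols:
        return []  # region holds only background
    # Trim blank margins so every remaining blank line is an interior gutter.
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    occupied_rows, occupied_cols = set(rows), set(cols)
    for r in range(top, bottom):
        if r not in occupied_rows:  # horizontal gutter: split above and below
            return (split_panels(page, top, left, r, right)
                    + split_panels(page, r + 1, left, bottom, right))
    for c in range(left, right):
        if c not in occupied_cols:  # vertical gutter: split left and right
            return (split_panels(page, top, left, bottom, c)
                    + split_panels(page, top, c + 1, bottom, right))
    return [(top, left, bottom, right)]  # no gutter left: this is one panel

page = [
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
panels = split_panels(page, 0, 0, len(page), len(page[0]))
```

Real manga layouts contain slanted borders and open panels, which is why the paper closes open panels and chooses splitting lines adaptively rather than relying on blank gutters.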
RIFF: Retina-inspired Invariant Fast Feature Descriptor BIBAFull-Text 1129-1132
  Song Wu; Michael S. Lew
We present the Retina-inspired Invariant Fast Feature, RIFF, which was designed for invariance to scale, rotation, and affine image deformations. The feature descriptor is based on pair-wise comparisons over a sampling pattern loosely based on the human retina and introduces a method for improving accuracy by maximizing the discriminatory power of the point set. A performance evaluation with regard to bag of words based image retrieval on several well-known international datasets demonstrates that the RIFF descriptor has competitive performance to the state-of-the-art descriptors (e.g. SIFT, SURF, BRISK, and FREAK).
Stevens' Power Law in 3D Tele-immersion: Towards Subjective Modeling of Multimodal Cyber Interaction BIBAFull-Text 1133-1136
  Sabrina Schulte; Shannon Chen; Klara Nahrstedt
In this paper we verify the insufficiency of Stevens' power law to describe the relationship between QoS and QoE factors. User studies that target different types of application scenarios of 3D Tele-immersion (3DTI) are conducted and the results show no significant power trend in the relationship between packet loss and perceptual quality metrics. We further verify that activity characteristics, activity objectives, and users' roles in the 3DTI session also have profound effects on the service quality aside from the QoS level. Thus, simple one-factor psychophysical laws are inadequate to serve as a QoS-QoE mapping model.
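For reference, Stevens' power law in its usual form relates perceived magnitude \psi to stimulus intensity I through a constant k and an exponent a. Reading I as a QoS factor such as packet loss and \psi as a perceptual quality rating is an assumption made here to connect the law to the abstract.

```latex
\psi(I) = k\, I^{a}
\qquad\Longleftrightarrow\qquad
\log \psi(I) = \log k + a \log I
```

A power trend would therefore show up as a straight line with slope a on log-log axes, which is the kind of relationship the user studies reportedly did not find.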
Multi-view Multi-task Feature Extraction for Web Image Classification BIBAFull-Text 1137-1140
  Zhiqiang Zuo; Yong Luo; Dacheng Tao; Chao Xu
The features used in many multimedia analysis-based applications are frequently of very high dimension. Feature extraction offers several advantages in highly dimensional cases, and many recent studies have used multi-task feature extraction approaches, which often outperform single-task feature extraction approaches. However, most of these methods are limited in that they only consider data represented by a single type of feature, even though features usually represent images from multiple views. We therefore propose a novel multi-view multi-task feature extraction (MVMTFE) framework for handling multi-view features for image classification. In particular, MVMTFE simultaneously learns the feature extraction matrix for each view and the view combination coefficients. In this way, MVMTFE not only handles correlated and noisy features, but also utilizes the complementarity of different views to further help reduce feature redundancy in each view. An alternating algorithm is developed for problem optimization and each sub-problem can be efficiently solved. Experiments on a real-world web image dataset demonstrate the effectiveness and superiority of the proposed method.
Image Re-ranking with an Alternating Optimization BIBAFull-Text 1141-1144
  Shanmin Pang; Jianru Xue; Zhanning Gao; Qi Tian
In this work, we propose an efficient image re-ranking method, without additional memory cost compared with the baseline method [8], to re-rank all retrieved images. The motivation of the proposed method is that there are usually many visual words in the query image that only give votes to irrelevant images. With this observation, we propose to use only visual words which can help to find relevant images to re-rank the retrieved images. To achieve the goal, we first find some similar images to the query by maximizing a quadratic function when given an initial ranking of the retrieved images. Then we select query visual words with an alternating optimization strategy: (1) at each iteration, select words based on the similar images that we have found and (2) in turn, update the similar images with the selected words. These two steps are repeated until convergence. Experimental results on standard benchmark datasets show that the proposed method outperforms spatial based re-ranking methods.
Automatic Removal of Visual Stop-Words BIBAFull-Text 1145-1148
  Edgar Roman-Rangel; Stephane Marchand-Maillet
This paper presents a new methodology for the automatic estimation of the optimal number of visual words that can be removed from a visual dictionary, such that no harm is induced in the discriminative potential of the resulting bag-of-visual-words representations. The proposed approach relies on a special definition of the entropy of each visual word when considered as a random variable, and a new definition of the overlap of class models computed with a normalized Bhattacharyya coefficient. We combined our proposed methodology with a recent approach that labels visual words as stop-words, showing that this combination is beneficial to reduce the dimensionality of bag representations, while obtaining good results in terms of classification accuracy and retrieval performance.
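As background for the overlap measure named in this abstract, the sketch below computes the standard Bhattacharyya coefficient between two discrete distributions. The paper's own normalized variant and its entropy definition are not given in the abstract and are not reproduced here. The function name and the example histograms are illustrative.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Standard Bhattacharyya coefficient of two discrete distributions.

    Both inputs are assumed to be non-negative and to sum to 1; the result
    ranges from 0 (no overlap) to 1 (identical distributions).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Class models as normalized visual-word histograms (made-up values).
identical = bhattacharyya_coefficient([0.5, 0.5], [0.5, 0.5])
disjoint = bhattacharyya_coefficient([1.0, 0.0], [0.0, 1.0])
partial = bhattacharyya_coefficient([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

A high overlap between class models would suggest that the words involved carry little discriminative information.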
Automatic Image Synthesis from Keywords Using Scene Context BIBAFull-Text 1149-1152
  Sho Inaba; Asako Kanezaki; Tatsuya Harada
Text is one of the simplest ways to express one's idea, and an image is one of the most impactful ways to do so. Therefore, if a system can synthesize an image from text without direct user manipulation, novel image synthesis applications will be opened to users without artistic skills. In such a system, the objects to synthesize are declared in text. However, little information about positional relations and scale of objects is provided, so it must be estimated using common sense. In this paper, we develop a system that can automatically synthesize objects into an image, given the background image and class name of the target object. With the background image and keywords as inputs, images of the objects to synthesize are searched automatically. Although some previously developed systems can synthesize an image from sketches and paintings, this is the first system that can estimate the position, scale, and appearance of objects and automatically synthesize them into images without direct user input. We propose a scene context, which indicates the position, scale, and appearance of synthesizing objects. The contribution of this paper is twofold: (1) the scene context extraction method for automatic image synthesis and (2) application of automatic image synthesis using the scene context.
Face-Based Automatic Personality Perception BIBAFull-Text 1153-1156
  Noura Al Moubayed; Yolanda Vazquez-Alvarez; Alex McKay; Alessandro Vinciarelli
Automatic Personality Perception is the task of automatically predicting the personality traits people attribute to others. This work presents experiments where such a task is performed by mapping facial appearance into the Big-Five personality traits, namely Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. The experiments are performed over the pictures of the FERET corpus, originally collected for biometrics purposes, for a total of 829 individuals. The results show that it is possible to automatically predict whether a person is perceived to be above or below median with an accuracy close to 70 percent (depending on the trait).
Facial Attribute Space Compression by Latent Human Topic Discovery BIBAFull-Text 1157-1160
  Chia-Hung Lin; Yan-Ying Chen; Bor-Chun Chen; Yu-Lin Hou; Winston Hsu
Facial attributes are important information for a variety of machine vision tasks including recognition, classification, and retrieval. There is a strong need to detect various facial attributes such as gender and age, and detecting more attributes consumes more computation and storage resources. Therefore, we propose a compression framework to find fewer significant Latent Human Topics (LHT) to approximate more facial attributes. LHT is a combination of attribute correlations obtained by transferring the facial attribute space to a compressed space with Singular Value Decomposition (SVD). Using the proposed scheme, we can easily detect the facial attributes from a face image by quickly reconstructing the compressed labels automatically detected by a few LHT classifiers. Experimental results show that our system can achieve similar performance with substantially fewer dimensions compared to the original number of facial attributes, and it even shows slight improvements because LHT carry informative attribute correlations learned from data.
Emotional Analysis of Music: A Comparison of Methods BIBAFull-Text 1161-1164
  Mohammad Soleymani; Anna Aljanaki; Yi-Hsuan Yang; Michael N. Caro; Florian Eyben; Konstantin Markov; Björn W. Schuller; Remco Veltkamp; Felix Weninger; Frans Wiering
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using the valence-arousal model, the songs were annotated both in terms of the emotions that were expressed by the whole excerpt and dynamically with 1 Hz temporal resolution. Each song received 10 annotations on Amazon Mechanical Turk and the annotations were averaged to form a ground truth. Four different systems from three teams and the organizers were employed to tackle this problem in an open challenge. We compare their performances and discuss the best practices. While the effect of a larger feature set was not very apparent in the static emotion estimation, the combination of a comprehensive feature set and a recurrent neural network that models temporal dependencies has largely outperformed the other proposed methods for dynamic music emotion estimation.
Clothing Retrieval Based on Local Similarity with Multiple Images BIBAFull-Text 1165-1168
  Masaru Mizuochi; Asako Kanezaki; Tatsuya Harada
Recently, the online shopping market has expanded, which has advanced studies of clothing retrieval via image search. In this study, we develop a novel clothing retrieval system that considers local similarity, in which users can retrieve desired clothes that are globally similar to one image and partially similar to another image. We propose a method of coding global features by merging local descriptors extracted from multiple images. Furthermore, we design a system that re-evaluates the output of similar image search by the similarity of local regions. We demonstrate that our method increased the probability of users finding their desired clothes from 39.7% to 55.1%, compared to a standard similar image search system with global features of a single image. Statistical significance is confirmed using t-tests.
Convolutional Network Features for Scene Recognition BIBAFull-Text 1169-1172
  Markus Koskela; Jorma Laaksonen
Convolutional neural networks have recently been used to obtain record-breaking results in many vision benchmarks. In addition, the intermediate layer activations of a trained network when exposed to new data sources have been shown to perform very well as generic image features, even when there are substantial differences between the original training data of the network and the new domain. In this paper, we focus on scene recognition and show that convolutional networks trained on mostly object recognition data can successfully be used for feature extraction in this task as well. We train a total of four networks with different training data and architectures, and show that the proposed method combining multiple scales and multiple features obtains state-of-the-art performance on four standard scene datasets.
Perceived Audio Quality for Streaming Stereo Music BIBAFull-Text 1173-1176
  Andrew Hines; Eoin Gillen; Damien Kelly; Jan Skoglund; Anil Kokaram; Naomi Harte
Users of audio-visual streaming services expect an ever increasing quality of experience. Channel bandwidth remains a bottleneck commonly addressed with lossy compression schemes for both the video and audio streams. Anecdotal evidence suggests a strongly perceived link between bit rate and quality. This paper presents three audio quality listening experiments using the ITU MUSHRA methodology to assess a number of audio codecs typically used by streaming services. They were assessed for a range of bit rates using three presentation modes: consumer quality headphones, studio quality headphones, and loudspeakers. Our results indicate that with consumer quality headphones, listeners did not differentiate between codecs with bit rates greater than 48 kb/s (p>=0.228). For studio quality headphones and loudspeakers, AAC-LC at 128 kb/s and higher was differentiated over other codecs (p<=0.001). The results provide insights into quality of experience that will guide future development of objective audio quality metrics.
Discriminating Native from Non-Native Speech Using Fusion of Visual Cues BIBAFull-Text 1177-1180
  Christos Georgakis; Stavros Petridis; Maja Pantic
The task of classifying accent, as belonging to a native language speaker or a foreign language speaker, has been so far addressed by means of the audio modality only. However, features extracted from the visual modality have been successfully used to extend or substitute audio-only approaches developed for speech or language recognition. This paper presents a fully automated approach to discriminating native from non-native speech in English, based exclusively on visual appearance features from speech. Long Short-Term Memory Neural Networks (LSTMs) are employed to model accent-related speech dynamics and yield accent-class predictions. Subject-independent experiments are conducted on speech episodes captured by mobile phones from the challenging MOBIO Database. We establish a text-dependent scenario, using only those recordings in which all subjects read the same paragraph. Our results show that decision level fusion of networks trained with complementary appearance descriptors consistently leads to performance improvement over single-feature systems, with the highest gain in accuracy reaching 7.3%. The best feature combinations achieve classification accuracy of 75%, rendering the proposed method a useful accent classification tool in cases of missing or noisy audio stream.
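As an illustration only, the decision-level fusion mentioned in the abstract above can be sketched as a weighted average of the class probabilities produced by several single-feature networks. The averaging rule, the function name fuse_decisions, and the example labels and probabilities are assumptions made for this sketch; the abstract does not state which fusion rule the authors used.

```python
def fuse_decisions(per_model_probs, weights=None):
    """Decision-level fusion by (weighted) averaging of class probabilities.

    per_model_probs: one dict per model, mapping class label -> probability.
    weights: optional per-model weights; equal weights are assumed if omitted.
    Returns (predicted_label, fused_probabilities).
    NOTE: averaging is an assumed fusion rule, not taken from the abstract.
    """
    if not per_model_probs:
        raise ValueError("need at least one model output")
    if weights is None:
        weights = [1.0] * len(per_model_probs)
    if len(weights) != len(per_model_probs):
        raise ValueError("one weight per model is required")

    total = sum(weights)
    labels = list(per_model_probs[0].keys())
    fused = {
        label: sum(w * probs[label] for w, probs in zip(weights, per_model_probs)) / total
        for label in labels
    }
    # The fused decision is the class with the highest averaged probability.
    return max(fused, key=fused.get), fused


# Hypothetical outputs of two networks trained on different appearance descriptors.
label, fused = fuse_decisions([
    {"native": 0.6, "non-native": 0.4},
    {"native": 0.3, "non-native": 0.7},
])
```

In this made-up example the two networks disagree, and the fused decision follows the more confident one.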
A Framework of Mobile Visual Search Based on the Weighted Matching of Dominant Descriptor BIBAFull-Text 1181-1184
  Guoyu Lan; Heng Qi; Keqiu Li; Kai Lin; Wenyu Qu; Zhiyang Li
As an interesting kind of mobile application, Mobile Visual Search (MVS) has attracted extensive research efforts from both academia and industry. Most MVS systems adopt the client-server framework, in which transmission latency caused by the limited bandwidth of wireless networks is a big problem. To address this problem, state-of-the-art work focuses on designing low bit-rate descriptors for MVS. However, little work focuses on reducing the number of descriptors. To further reduce the latency, we propose a novel MVS framework based on the weighted matching of dominant descriptors. Firstly, we present an affinity propagation based algorithm for dominant descriptor selection. Secondly, we propose a weighted feature matching method that considers the differences between dominant descriptors in feature matching. With the proposed framework, we not only reduce the network latency in MVS, but also avoid transmitting useless descriptors, which improves the retrieval accuracy of MVS. The experimental results on the Stanford MVS data set show that, when using CHoG descriptors, the proposed framework outperforms the existing framework by reducing the amount of data transmitted by more than 40% and increasing the average retrieval accuracy by 5%.
Efficient Cross-Domain Image Retrieval by Multi-Level Matching and Spatial Verification for Structural Similarity BIBAFull-Text 1185-1188
  Che-Chun Lee; Yin-Hsi Kuo; Winston H. Hsu; Shin'ichi Satoh; Sebastian Agethen
Content-based image retrieval (CBIR) techniques are important for browsing the rapidly growing collection of Web images. However, traditional CBIR methods usually fail when the query and database images are in different domains. Instead of focusing on a specific domain, we propose a method to solve the general cross-domain image retrieval problem. This method focuses on the structure of image content rather than pixel-level detail, so it is particularly useful for matching images across different visual domains, e.g., paintings, hand-drawn sketches or photographs. To provide an efficient and effective solution, we analyze the bag-of-words (BoW) matching procedure to find out what causes it to fail in the cross-domain setting. We observe that it is necessary to apply different matching constraints for cross-domain image retrieval; therefore, we propose a multi-level matching process which dynamically selects the most suitable matching constraint for feature matching, and we adopt a fast spatial verification to describe the structural similarity. Our method is much faster than the state-of-the-art solution and achieves better performance in the experiments.
Learning Like a Toddler: Watching Television Series to Learn Vocabulary from Images and Audio BIBAFull-Text 1189-1192
  Emre Yilmaz; Konstantinos Rematas; Tinne Tuytelaars; Hugo Van hamme
This paper presents the initial findings of our efforts to build an unsupervised multimodal vocabulary learning scheme in a realistic scenario. For this purpose, a new multimodal dataset, called Musti3D, has been created. The Musti3D database contains episodes from an animation series for toddlers. Annotated with audiovisual information, this database is used for the investigation of a non-negative matrix factorization (NMF)-based audiovisual learning technique. The performance of the technique, i.e. correctly matching the audio and visual representations of the objects, has been evaluated by gradually reducing the level of supervision starting from the ground truth transcriptions. Moreover, we have performed experiments using different visual representations and time spans for combining the audiovisual information. The preliminary results show the feasibility of the proposed audiovisual learning framework.
Cuteness Recognition and Localization in the Photos of Animals BIBAFull-Text 1193-1196
  Yu Bao; Jing Yang; Liangliang Cao; Haojie Li; Jinhui Tang
Among the growing number of photos on social media websites, "cute" images of animals are particularly attractive to Internet users. This paper considers building an automatic model which can distinguish cute images from non-cute ones. To make the recognition results more interpretable, considerable effort is made to find which part of the animal appears attractive to human users. To validate our proposed method, we collect three new datasets of different animals, i.e., cats, dogs, and rabbits, with both cute and non-cute images. Our model obtains promising performance in distinguishing cute images from non-cute ones. Moreover, it outperforms the classical models with not only better recognition accuracy, but also more intuitive localization of the cuteness in the images. The contribution of this paper is three-fold: (1) we collect new datasets for cuteness recognition, (2) we extend the powerful Fisher Vector representation to localize cute parts in animal recognition, and (3) extensive experimental results show that our proposed method can recognize cute cats, dogs, and rabbits.
A 3D Fingertips Detecting and Tracking Algorithm based on the Sliding Window BIBAFull-Text 1197-1200
  Jun Wu; Wenjing Qiao; Cailiang Kuang; Zhenbao Liu; Shuhui Bu; Junwei Han
In this paper, we propose a 3D finger detecting and tracking algorithm (3dFDT) based on a sliding window for RGB-D sequences. Microsoft Kinect is utilized as a 3D depth camera, which provides RGB images and depth data simultaneously. However, the depth data are very noisy when the background is white. We focus on tracking the up and down movements of the fingertips for use in applications such as playing the piano on a table or "in the air". First, a statistical skin ellipse model is applied to detect the hand skin region, and pseudo-contours are removed to obtain the hand region of interest. Then, the region center is taken as the center of the palm. Next, the fingertips are located precisely using convex defect detection on the refined hand contours. Furthermore, the depth information is aligned to the fingertips in the RGB channels. Finally, the sliding window strategy is employed to stabilize the depth information, which is very noisy, especially when the background is a white surface. The experimental results show that our proposed method is effective and can be applied in real-time applications for non-contact interaction.
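The final, sliding-window step of the abstract above can be illustrated with a minimal sketch. The abstract does not say which statistic is computed over the window, so the median used here is an assumption, as are the class name SlidingWindowDepth, the window size, and the convention that a missing depth reading is passed as None.

```python
from collections import deque
from statistics import median


class SlidingWindowDepth:
    """Stabilize one fingertip's depth with a median over the last `size` frames.

    The median is an assumed choice for this sketch; the abstract only states
    that a sliding window is used to stabilize noisy depth values.
    """

    def __init__(self, size=5):
        if size < 1:
            raise ValueError("window size must be at least 1")
        self._window = deque(maxlen=size)  # oldest readings drop out automatically

    def update(self, depth_mm):
        """Add the newest depth reading (None = no valid reading this frame).

        Returns the stabilized depth, or None if no valid reading was seen yet.
        """
        if depth_mm is not None:
            self._window.append(depth_mm)
        if not self._window:
            return None
        return median(self._window)


# Hypothetical per-frame depth readings in millimetres; 1500 is a noise spike.
tracker = SlidingWindowDepth(size=3)
smoothed = [tracker.update(d) for d in [800, 802, 1500, 801]]
```

With a window of three frames, the isolated spike never becomes the reported value, which is the kind of stabilization the abstract describes for white backgrounds.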
Temporal Dropout of Changes Approach to Convolutional Learning of Spatio-Temporal Features BIBAFull-Text 1201-1204
  Dubravko Culibrk; Nicu Sebe
The paper addresses the problem of learning features that can account for temporal dynamics present in videos. Although deep convolutional learning methods revolutionized several areas of multimedia and computer vision, there have been relatively few proposals dealing with ways in which these methods can be enabled to make use of motion information, critical to the extraction of useful information from video. We propose a temporal dropout of changes approach for this, which allows us to consider temporal information over a series of frames without increasing the number of training parameters of the network.
   To illustrate the potential of the proposed methodology, we focus on the problem of dynamic texture classification. Dynamic textures represent an important form of dynamics present in video data, so far not considered within the framework of deep learning.
   Initial results presented in the paper show that the proposed approach, based on a well-known deep convolutional neural network, can achieve state-of-the-art performance on two well-known and challenging dynamic texture classification data sets (DynTex++ and UCLA dynamic texture).
Real-Time HDR Panorama Video BIBAFull-Text 1205-1208
  Lorenz Kellerer; Vamsidhar Reddy Gaddam; Ragnar Langseth; Haakon Stensland; Carsten Griwodz; Dag Johansen; Paal Halvorsen
Interest in wide field of view panorama video is increasing. In this respect, we have an application that uses an array of cameras that overlook a soccer stadium. The inputs of these cameras are stitched together to provide a panoramic view of the stadium. One of the challenges we face is that large parts of the field are obscured by shadows on sunny days. Such circumstances cause unsatisfactory video quality. We have therefore implemented and evaluated multiple algorithms related to high dynamic range (HDR) video. The evaluation shows that a combination of several approaches gives the most useful results in our scenario.
Dynamic Resource Management in Cloud-based Distributed Virtual Environments BIBAFull-Text 1209-1212
  Yunhua Deng; Siqi Shen; Zhe Huang; Alexandru Iosup; Rynson Lau
As an elastic hosting platform, cloud computing has been attracting much attention for transferring compute-intensive applications from static self-hosting to flexible cloud-based hosting. Distributed virtual environments (DVEs), which typically involve massive numbers of users interacting at the same time and feature significant workload dynamics, either spatial due to in-game user mobility or temporal due to the fluctuating user population, are potentially suitable applications for cloud-based hosting because of their need for resource elasticity. We explore dynamic resource management for cloud-based DVEs by taking into account their multi-level workload dynamics, which differentiate them from other applications. Simulation results demonstrate the advantages of our developed methods over existing ones.
Interactive 3D Animation Creation and Viewing System based on Motion Graph and Pose Estimation Method BIBAFull-Text 1213-1216
  Masayuki Furukawa; Yasuhiro Akagi; Yukiko Kawai; Hiroshi Kawasaki
This paper proposes an interactive 3D animation system aimed specifically at efficient control of human motion. Although there are various commercial products for creating movies and game content, they are still difficult for non-professional users to handle. To ease the creation process and encourage general users to use 3D animation, e.g., in fields such as education and medicine, we propose a system using Kinect. The skeleton-model data of human motion estimated by Kinect is processed to generate a Motion Graph, and the data is finally restructured automatically for 3D character models. We also propose an efficient 3D animation viewing system based on a touch interface for tablet devices, which enables intuitive control of multiple motions of human activity. To evaluate the effectiveness of the method, we implemented a prototype system and created several 3D animations.
Color Transfer based on Spatial Structure for Telepresence BIBAFull-Text 1217-1220
  Kentaro Yamada; Hiroshi Sankoh; Sei Naito
In this paper, we propose a novel color transfer method based on spatial structure. This work considers an immersive telepresence system, in which distant users can feel as if they are present at a place other than their true location. In preceding studies, the region of an attendee in a remote room is extracted and synthesized on the display of a local room with a preset background image that is similar to the local room. However, the difference between the structure of the preset background image and that of the remote room image often degrades the image quality of the synthesized video. For example, when a part of the human region is occluded by an object that does not exist in the preset background image, the incomplete human region is shown even though the occluding object is absent. To solve this problem, instead of using a preset background image, we propose a method that applies a color transfer technique to images of the remote room. By changing the colors of the remote room to match the colors of the local room, our proposed method lets users feel as if they are in the same room as the other telepresence participant. Furthermore, we have improved the similarity between the rooms based on spatial structure, which was overlooked by conventional color transfer methods. The experimental results show that the proposed method can provide the same-room experience for users of the telepresence system through the color similarity of the remote and local rooms.
Human Computer Interface for Quadriplegic People Based on Face Position/gesture Detection BIBAFull-Text 1221-1224
  Zhen-Peng Bian; Junhui Hou; Lap-Pui Chau; Nadia Magnenat-Thalmann
This paper proposes a human computer interface for quadriplegic people that uses a single depth camera. The nose position is employed to control the cursor, while commands are given through the status of the mouth. The detection of the nose position and the mouth status is based on a randomized decision tree algorithm. The experimental results show that the proposed interface is comfortable, easy to use, and robust, and that it outperforms existing assistive technology.
A Multi-Touch DJ Interface with Remote Audience Feedback BIBAFull-Text 1225-1228
  Lasse Farnung Laursen; Masataka Goto; Takeo Igarashi
Current DJ interfaces lack direct support for typical digital communication common in social media. We present a novel DJ interface for live internet broadcast performances with remote audience feedback integration. Our multi-touch interface is designed for a table top display, featuring a time-line based visualization. Two studies are presented involving seven DJs, culminating in four live broadcasts gathering and analyzing data to better understand both the DJ and audience perspective. This study is one of the first to look closer at DJs and remote audiences. We present useful insight for future interaction design between DJs and remote audiences, and interface integrated audience feedback.

Tutorials

Immersive 3D Communication BIBAFull-Text 1229-1230
  Wanmin Wu; Cha Zhang
The last few decades have witnessed tremendous advances in telecommunication, with the invention of technologies such as radio, telephone, voice-over-IP, and video conferencing. While all these communication tools are useful and valuable, the ultimate goal of telecommunication is to enable fully immersive remote interaction in a way that simulates or even surpasses the face-to-face experience. Immersive 3D communication technologies are developed aiming at that goal. The objective of this tutorial is to present an overview of the recent advances in immersive 3D communication. Topics include the basics of human 3D perception, new systems and algorithms in real-time 3D scene capture and reconstruction, 3D data compression and dissemination, 3D displays, etc. We intend to provide insights into the latest immersive 3D communication technologies, and highlight some open research challenges for the future.
Over-the-Top Content Delivery: State of the Art and Challenges Ahead BIBAFull-Text 1231-1232
  Christian Timmerer; Ali C. Begen
In this tutorial we present the state of the art and challenges ahead in over-the-top content delivery. In particular, the goal of this tutorial is to provide an overview of adaptive media delivery, specifically in the context of HTTP adaptive streaming (HAS) including the recently ratified MPEG-DASH standard. The main focus of the tutorial will be on the common problems in HAS deployments such as client design, QoE optimization, multi-screen and hybrid delivery scenarios, and synchronization issues. For each problem, we will examine proposed solutions along with their pros and cons. In the last part of the tutorial, we will look into the open issues and review the work-in-progress and future research directions.
Emerging Topics on Personalized and Localized Multimedia Information Systems BIBAFull-Text 1233-1234
  Yi Yu; Kiyoharu Aizawa; Toshihiko Yamasaki; Roger Zimmermann
We are experiencing an era with a rapid increase of data relevant to different aspects of users' daily life. On the one hand, such data contains personal information of each individual user. On the other hand, it also reflects user behaviors related to society as the data of more users is aggregated. These data could not only be very beneficial for studying various lifestyle patterns, but also be used to generate more descriptive and explanatory analysis across the landscape of diverse multimedia data. Using personal mobile devices and web services to systematically explore interesting aspects of people's world has attracted much attention recently. This is a full-day tutorial that addresses emerging topics on personalized and localized multimedia technologies and applications and emphasizes knowledge sensing and discovery in the multimedia landscape. This tutorial aims to deliver an overall introduction to multimedia landscapes with multimedia processing, contextual data acquisition, people's activity logs, data analytics, and geographic-aware multimedia sharing and delivery, and serves as an important lecture on fundamental and advanced research areas of personalized and localized multimedia information systems.
Learning Knowledge Bases for Text and Multimedia BIBAFull-Text 1235-1236
  Lexing Xie; Haixun Wang
Knowledge acquisition, representation, and reasoning have been among the long-standing challenges in artificial intelligence and related application areas. Only in the past few years have massive amounts of structured and semi-structured data that directly or indirectly encode human knowledge become widely available, turning the knowledge representation problems into a computational grand challenge with feasible solutions in sight. The research and development on knowledge bases is becoming a lively fusion area among web information extraction, machine learning, databases and information retrieval, with knowledge over images and multimedia emerging as another new frontier of representation and acquisition. This tutorial aims to present a gentle overview of knowledge bases on text and multimedia, including representation, acquisition, and inference. The content of this tutorial is intended for surveying the field, as well as for educating practitioners and aspiring researchers.
Social multimedia computing BIBAFull-Text 1237-1238
  Peng Cui; Lexing Xie; Jitao Sang; Changsheng Xu
This article summarizes the corresponding full-day tutorial at ACM Multimedia 2014. This tutorial reviews recent progress in social multimedia computing from two perspectives: social-sensed multimedia computing (3 hours) and user-centric social multimedia computing (3 hours).
Video hyperlinking BIBAFull-Text 1239-1240
  Vasileios Mezaris; Benoit Huet
This is the abstract for the "Video Hyperlinking" tutorial, presented as part of the 2014 ACM Multimedia Conference. Video hyperlinking is the introduction of links that originate from pieces of video material and point to other relevant content, be it video or any other form of digital content. The tutorial presents the state of the art in video hyperlinking approaches and in relevant enabling technologies, such as video analysis and multimedia indexing and retrieval. Several alternative strategies, based on text, visual and/or audio information are introduced, evaluated and discussed, providing the audience with details on what works and what doesn't on real broadcast material.
An Introduction to Arts and Digital Culture Inside Multimedia BIBAFull-Text 1241-1242
  David A. Shamma; Daragh Byrne
The Arts and Digital Culture program has offered a high quality forum for the presentation of interactive and arts-based multimedia applications at the annual ACM Multimedia conference for over a decade. This tutorial will explore the evolution of this program as a guide to new authors considering future participation in this program. By surveying both past technical and past exhibited contributions, this tutorial will offer guidance to artists, researchers and practitioners on success at this multifaceted, interdisciplinary forum at ACM Multimedia.

Workshop Summaries

AVEC 2014: the 4th international audio/visual emotion challenge and workshop BIBAFull-Text 1243-1244
  Michel Valstar; Björn W. Schuller; Jarek Krajewski; Roddy Cowie; Maja Pantic
The fourth Audio-Visual Emotion Challenge and workshop, AVEC 2014, was held in conjunction with ACM Multimedia'14. Like the 2013 edition of AVEC, the workshop/challenge addresses the interpretation of social signals represented in both audio and video in terms of high-level continuous dimensions from a large number of clinically depressed patients and controls, with a sub-challenge on estimating self-reported severity of depression. In this summary, we mainly describe participation and its conditions.
The Workshop on Computational Personality Recognition 2014 BIBAFull-Text 1245-1246
  Fabio Celli; Bruno Lepri; Joan-Isaac Biel; Daniel Gatica-Perez; Giuseppe Riccardi; Fabio Pianesi
The Workshop on Computational Personality Recognition aims to define the state-of-the-art in the field and to provide tools for future standard evaluations in personality recognition tasks. In the WCPR14 we released two different datasets: one of YouTube Vlogs and one of Mobile Phone interactions. We structured the workshop in two tracks: an open shared task, where participants can do any kind of experiment, and a competition. We also distinguished two tasks: A) personality recognition from multimedia data, and B) personality recognition from text only. In this paper we discuss the results of the workshop.
CrowdMM14 -- 2014 International ACM Workshop on Crowdsourcing for Multimedia BIBAFull-Text 1247-1248
  Judith A. Redi; Mathias Lux
The power of crowds, leveraging a large number of human contributors and the capabilities of human computation, has enormous potential to address key challenges in the area of multimedia research. This power is, however, difficult to exploit: challenges arise from the fact that a community of users or workers is a complex and dynamic system highly sensitive to changes in the form and the parameterization of their activities. Since 2012, the International ACM Workshop on Crowdsourcing for Multimedia (CrowdMM) has been the venue for collecting new insights on the effective deployment of crowdsourcing towards boosting Multimedia research. In its third edition, CrowdMM14 especially focuses on contributions that propose solutions for the key challenges that face widespread adoption of crowdsourcing paradigms in the multimedia research community. These include: identification of optimal crowd members (e.g., user expertise, worker reliability), providing effective explanations (i.e., good task design), controlling noise and quality in the results, designing incentive structures that do not breed cheating, adversarial environments, gathering necessary background information about crowd members without violating privacy, and controlling descriptions of tasks.
EMASC14: 1st International Workshop on Emerging Multimedia Applications and Services for Smart Cities BIBAFull-Text 1249-1250
  M. Anwar Hossain; Abdulmotaleb El Saddik
The smart city is a vision of the future city -- with increasingly instrumented, inter-connected and intelligent urban systems -- to improve the quality of life in many aspects including public safety, healthcare, transportation, and energy. With the ever-increasing presence of multimodal sensors in the smart city infrastructure, multimedia plays an indispensable role. The proliferation of multimedia, sensors, pervasive devices, and infrastructures for realizing the smart city has brought many challenges that are the core focus of the EMASC workshop.
GeoMM 2014: the third ACM multimedia workshop on geotagging and its applications in multimedia BIBAFull-Text 1251-1252
  Liangliang Cao; Gerald Friedland; Lexing Xie
It is our great pleasure to welcome you to the Third ACM Workshop on Geotagging and Its Applications in Multimedia -- GeoMM'14. This year's event continues the workshops in 2012 and 2013, with the goal of building a forum for the presentation and synthesis of vision and insight from leading experts and practitioners on the developing directions of geotagging research related to multimedia. Following the success in previous years, the GeoMM workshop serves as a venue for the premier research in geotagging and multimedia, and continues to attract submissions from a diverse set of researchers, who address newly arising problems within this emerging field. Five regular papers are presented in this workshop, covering a number of novel applications and new methodologies. An invited paper is also presented to introduce the related MediaEval 2014 Placing task, which consists of 5 million geotagged photos and 25,000 geotagged videos. We believe this workshop will benefit a growing body of research in this broad field.
HuEvent'14: 2014 workshop on human-centered event understanding from multimedia BIBAFull-Text 1253-1254
  Ansgar Scherp; Vasileios Mezaris; Bogdan Ionescu; Francesco De Natale
This workshop focuses on the human-centered aspects of understanding events from multimedia content. This includes the notion of objects and their relation to events. The workshop brings together researchers from the different areas in multimedia and beyond that are interested in understanding the concept of events.
ImmersiveMe'14: 2nd ACM international workshop on immersive media experiences BIBAFull-Text 1255-1256
  Teresa Chambel; Paula Viana; V. Michael Bove; Sharon Strover; Graham Thomas
The 2nd ACM International Workshop on Immersive Media Experiences (ImmersiveMe'14) at ACM Multimedia aims at bringing together researchers, students, media producers, service providers and industry players in the emergent area of immersive media experiences, through the exploration of different scenarios, applications, and neighboring fields. This second edition, after a successful first edition at ACM Multimedia 2013, provides a platform for presenting on-going work, to consolidate and tie different research communities working on this engaging area, as well as to point directions for the future.
WISMM'14 -- First ACM International Workshop on Internet-Scale Multimedia Management BIBAFull-Text 1257-1258
  Roger Zimmermann; Yi Yu
Advanced technologies in consumer electronic products have enabled individual users to record, transmit and receive images and videos with mobile devices. Every day people create and consume massive amounts of multimedia information and data by engaging with various mobile Internet services. With a wide variety of multimedia information and data around us being aggregated over time, the Internet is getting increasingly information-centric. We are experiencing an age of increasing demands on how we host various people's online engagements and how to augment people's lives in the physical world with more personalized smart services. This workshop addresses a focused but broad research theme with an emphasis on how to manage and derive value from multimedia data in the social Internet landscape to facilitate the connections between users' physical world and their online activities.
3rd International Workshop on Socially-Aware Multimedia (SAM'14) BIBAFull-Text 1259-1260
  Pablo Cesar; David A. Shamma; Matthew Cooper; Aisling Kelliher
Multimedia social communication is becoming commonplace. Television is becoming smart and social; media sharing applications are transforming the way we converse and recall events; and videoconferencing is a common application on our computers, phones, tablets and even televisions. The confluence of computer-mediated interaction, social networking, and multimedia content is radically reshaping social communications, bringing new challenges and opportunities. This workshop, in its third edition, provides an opportunity to explore socially-aware multimedia, in which the social dimension of mediated interactions between people is considered to be as important as the characteristics of the media content. Even though this social dimension is implicitly addressed in some current solutions, further research is needed to better understand what makes multimedia socially-aware.
Summary Abstract for the 3rd ACM International Workshop on Multimedia Analysis for Ecological Data BIBAFull-Text 1261-1262
  Concetto Spampinato; Vasileios Mezaris; Marco Cristani
The 3rd ACM International Workshop on Multimedia Analysis for Ecological Data (MAED'14) is held as part of ACM Multimedia 2014.
   MAED'14 follows the two previous workshops in the series, MAED'12 and MAED'13, held as part of ACM Multimedia 2012 and ACM Multimedia 2013 respectively. It is concerned with the processing, interpretation, and visualisation of ecology-related multimedia content, with the aim of supporting biologists in analysing and monitoring natural environments.
PIVP 2014: First International Workshop on Perception Inspired Video Processing BIBAFull-Text 1263-1264
  Hari Kalva; Homer Chen; Gerardo Fernandez Escribano; Velibor Adzic
This is an MM'14 Workshop Summary Abstract for PIVP'14 -- 1st International Workshop on Perception Inspired Video Processing. The workshop provided a venue for researchers involved in perceptual video processing and coding to present their current work and discuss future directions. The workshop program consisted of a keynote talk, oral presentations of full papers, and an interactive poster session for the short papers. All participants took part in the discussion panel at the end of the workshop, exchanging ideas and suggestions for future work.
Serious Games 2014: International Workshop on Serious Games BIBAFull-Text 1265-1266
  Wolfgang Effelsberg; Stefan Göbel
The ACM First International Workshop on Serious Games was held on November 7, 2014. It was co-located with ACM's International Conference on Multimedia in Orlando. The purpose of the workshop was to bring together researchers and practitioners working on the development and use of serious games. The rationale behind co-locating this workshop with ACM Multimedia was that games are indeed multimedia systems, involving 3D graphics, integrated audio processing, human factors for the input devices, and often also the real-time transmission of events over computer networks.