
Proceedings of the 2013 ACM International Conference on Multimedia

Fullname: Proceedings of the 21st ACM International Conference on Multimedia
Editors: Alejandro (Alex) Jaimes; Nicu Sebe; Nozha Boujemaa; Daniel Gatica-Perez; David A. Shamma; Marcel Worring; Roger Zimmermann
Location: Barcelona, Spain
Dates: 2013-Oct-21 to 2013-Oct-25
Standard No: ISBN: 978-1-4503-2404-5; ACM DL: Table of Contents; hcibib: MM13
  1. Keynote address
  2. Best paper session
  3. Experience
  4. Music & play
  5. Similarity search
  6. Art, performance, and sports
  7. Brave new topics: social and cognitive aspects
  8. Action and event recognition
  9. Streaming and synchronization
  10. Keynote address
  11. Multimedia grand challenge
  12. Demos
  13. Posters
  14. Security and forensics
  15. Open source software
  16. Multimodal analysis
  17. Social dynamics
  18. Annotation
  19. Scene understanding
  20. Doctoral symposium
  21. Art session overview
  22. Workshops overview
  23. Technical

Keynote address

Multimedia framed 1-2
  Elizabeth Churchill

Best paper session

"Wow! you are so beautiful today!" BIBAFull-Text 3-12
  Luoqi Liu; Hui Xu; Junliang Xing; Si Liu; Xi Zhou; Shuicheng Yan
Beauty e-Experts, a fully automatic system for hairstyle and facial makeup recommendation and synthesis, is developed in this work. Given a user-provided frontal face image with short/bound hair and no/light makeup, the Beauty e-Experts system can not only recommend the most suitable hairdo and makeup, but also show the synthetic effects. To obtain enough knowledge for beauty modeling, we build the Beauty e-Experts Database, which contains 1,505 attractive female photos with a variety of beauty attributes and beauty-related attributes annotated. Based on this Beauty e-Experts Database, two problems are considered for the Beauty e-Experts system: what to recommend and how to wear, which mirror the process of selecting a hairstyle and cosmetics in daily life. For the what-to-recommend problem, we propose a multiple tree-structured super-graphs model to explore the complex relationships among the high-level beauty attributes, mid-level beauty-related attributes and low-level image features, and then based on this model, the most compatible beauty attributes for a given facial image can be efficiently inferred. For the how-to-wear problem, an effective and efficient facial image synthesis module is designed to seamlessly synthesize the recommended hairstyle and makeup into the user facial image. Extensive experimental evaluations and analyses on test images under various conditions demonstrate the effectiveness of the proposed system.
GIANT: geo-informative attributes for location recognition and exploration 13-22
  Quan Fang; Jitao Sang; Changsheng Xu
This paper considers the problem of automatically discovering geo-informative attributes for location recognition and exploration. An attribute is expected to be both discriminative and representative: it corresponds to a distinctive visual pattern and is associated with a semantic interpretation. To this end, we analyze attributes at the region level. Each segmented region in the training set is assigned a binary latent variable indicating its discriminative capability. A latent learning framework is proposed for discriminative region detection and geo-informative attribute discovery. Moreover, we use user-generated content to obtain the semantic interpretation for the discovered visual attributes. The proposed approach is evaluated on a challenging dataset including Google Street View and Flickr photos. Experimental results show that: (1) geo-informative attributes are discriminative and useful for location recognition; (2) the discovered semantic interpretation is meaningful and can be exploited for further explorations.
Online human gesture recognition from motion data streams 23-32
  Xin Zhao; Xue Li; Chaoyi Pang; Xiaofeng Zhu; Quan Z. Sheng
Online human gesture recognition has a wide range of applications in computer vision, especially in human-computer interaction applications. The recent introduction of cost-effective depth cameras has started a new trend of research on body-movement gesture recognition. However, there are two major challenges: i) how to continuously recognize gestures from unsegmented streams, and ii) how to differentiate different styles of the same gesture from other types of gestures. In this paper, we solve these two problems with a new effective and efficient feature extraction method that uses a dynamic matching approach to construct a feature vector for each frame, improving sensitivity to the features of different gestures while decreasing sensitivity to the features of gestures within the same class. Our comprehensive experiments on the MSRC-12 Kinect Gesture and MSR-Action3D datasets demonstrate superior performance over state-of-the-art approaches.
Attribute-augmented semantic hierarchy: towards bridging semantic gap and intention gap in image retrieval 33-42
  Hanwang Zhang; Zheng-Jun Zha; Yang Yang; Shuicheng Yan; Yue Gao; Tat-Seng Chua
This paper presents a novel Attribute-augmented Semantic Hierarchy (A2SH) and demonstrates its effectiveness in bridging both the semantic and intention gaps in Content-based Image Retrieval (CBIR). A2SH organizes the semantic concepts into multiple semantic levels and augments each concept with a set of related attributes, which describe the multiple facets of the concept and act as the intermediate bridge connecting the concept and low-level visual content. A hierarchical semantic similarity function is learnt to characterize the semantic similarities among images for retrieval. To better capture user search intent, a hybrid feedback mechanism is developed, which collects hybrid feedback on attributes and images. This feedback is then used to refine the search results based on A2SH. We develop a content-based image retrieval system based on the proposed A2SH. We conduct extensive experiments on a large-scale data set of over one million Web images. Experimental results show that the proposed A2SH can characterize the semantic affinities among images accurately and can shape user search intent precisely and quickly, leading to more accurate search results as compared to state-of-the-art CBIR solutions.

Experience

Robust evaluation for quality of experience in crowdsourcing 43-52
  Qianqian Xu; Jiechao Xiong; Qingming Huang; Yuan Yao
Strategies exploiting crowdsourcing are increasingly being applied in the area of Quality of Experience (QoE) for multimedia. They enable researchers to conduct experiments with a more diverse set of participants and at a lower economic cost than conventional laboratory studies. However, a major challenge for crowdsourcing tests is the detection and control of outliers, which may arise due to different test conditions, human errors or abnormal variations in context. For this purpose, it is desired to develop a robust evaluation methodology to deal with crowdsourceable data, which are possibly incomplete, imbalanced, and distributed on a graph. In this paper, we propose a robust rating scheme based on robust regression and Hodge Decomposition on graphs, to assess QoE using crowdsourcing. The scheme shows that the removal of outliers in crowdsourcing experiments would be helpful for purifying data and could provide us with more reliable results. The effectiveness of the proposed scheme is further confirmed by experimental studies on both simulated examples and real-world data.
Size does matter: how image size affects aesthetic perception? 53-62
  Wei-Ta Chu; Yu-Kuang Chen; Kuan-Ta Chen
There is no doubt that an image's content determines how people assess the image aesthetically. Previous works have shown that image contrast, saliency features, and the composition of objects may jointly determine whether or not an image is perceived as aesthetically pleasing. In addition to an image's content, the way the image is presented may affect how much viewers appreciate it. For example, it may be assumed that a picture will always look better when it is displayed in a larger size. Is this "the-bigger-the-better" rule always valid? If not, in what situations is it invalid?
   In this paper, we investigate how an image's resolution (pixels) and physical dimensions (inches) affect viewers' appreciation of it. Based on large-scale aesthetic assessments of 100 images displayed in a variety of resolutions and physical dimensions, we show that an image's size significantly affects its aesthetic rating in a complicated way. Normally a picture looks better when it is bigger, but it may look worse depending on its content. We develop a set of regression models to predict a picture's absolute and relative aesthetic levels at a given display size based on its content and compositional features. In addition, we analyze the essential features that lead to the size-dependent property of image aesthetics. It is hoped that this work will motivate further research by showing that both content and presentation should be considered when evaluating an image's aesthetic appeal.
Non-reference audio quality assessment for online live music recordings 63-72
  Zhonghua Li; Ju-Chiang Wang; Jingli Cai; Zhiyan Duan; Hsin-Min Wang; Ye Wang
Immensely popular video sharing websites such as YouTube have become the most important sources of music information for Internet users and the most prominent platform for sharing live music. The audio quality of this huge amount of live music recordings, however, varies significantly due to factors such as environmental noise, location, and recording device. Yet most video search engines do not take audio quality into consideration when retrieving and ranking results. Given that most users prefer live music videos with better audio quality, we propose the first automatic, non-reference audio quality assessment framework for live music video search online. We first construct two annotated datasets of live music recordings. The first dataset contains 500 human-annotated pieces, and the second contains 2,400 synthetic pieces systematically generated by adding noise effects to clean recordings. Then, we formulate the assessment task as a ranking problem and try to solve it using a learning-based scheme. To validate the effectiveness of our framework, we perform both objective and subjective evaluations. Results show that our framework significantly improves the ranking performance of live music recording retrieval and can prove useful for various real-world music applications.
Enabling low bitrate mobile visual recognition: a performance versus bandwidth evaluation 73-82
  Yu-Chuan Su; Tzu-Hsuan Chiu; Yan-Ying Chen; Chun-Yen Yeh; Winston H. Hsu
The rapid development of technologies in both hardware and software has made content-based multimedia services feasible on mobile devices such as smartphones and tablets, and a strong need for mobile visual search and recognition has emerged. While many real applications of visual recognition require large-scale recognition systems, the same technologies that support server-based scalable visual recognition may not be feasible on mobile devices due to resource constraints. Although the client-server framework ensures scalability, the real-time response is subject to the limitation on network bandwidth. Therefore, the main challenge for a mobile visual recognition system should be the recognition bitrate, that is, the amount of data transmitted at a given recognition performance. For this work, we exploit and compare various strategies such as compact features, feature compression, feature signatures by hashing, image scaling, etc., to enable low bitrate mobile visual recognition. We argue that a thumbnail image is a competitive candidate for low bitrate visual recognition because it carries multiple features at once and multi-feature fusion is important as the size of the semantic space increases. Our evaluations on two subsets of ImageNet, each containing more than 10,000 images, with 19 and 137 categories respectively, verify the efficacy of thumbnail images. We further suggest a new strategy that combines a single (local) feature signature and the thumbnail image, which achieves significant bitrate reduction from (average) 102,570 to 4,661 bytes with merely (overall) 10% performance degradation.

Music & play

Competitive affective gaming: winning with a smile 83-92
  André Mourão; João Magalhães
Human-computer interaction (HCI) is expanding towards natural modalities of human expression. Gestures, body movements and other affective interaction techniques can change the way computers interact with humans. In this paper, we propose to extend existing interaction paradigms by including facial expression as a controller in videogames. NovaEmötions is a multiplayer game where players score by acting an emotion through a facial expression. We designed an algorithm to offer an engaging interaction experience using the facial expression. Despite the novelty of the interaction method, our game scoring algorithm kept players engaged and competitive. A user study done with 46 users showed the success and potential for the usage of affective-based interaction in videogames, i.e., the facial expression as the sole controller in videogames. Moreover, we released a novel facial expression dataset with over 41,000 images. These face images were captured in a novel and realistic setting: users playing games where a player's facial expression has an impact on the game score.
Tracking-based interaction for object creation in mobile augmented reality 93-102
  Wolfgang Hürst; Joris Dekker
In this work, we evaluate the feasibility of tracking-based interaction using a mobile phone's or tablet's camera in order to create and edit 3D objects in augmented reality applications. We present a feasibility study investigating if and how gestures made with your finger can be used to create such objects. A revised interface design is evaluated in a user study with 24 subjects that reveals a high usability and entertainment value, but also identifies issues such as ergonomic discomfort and imprecise input for complex tasks. Hence, our results suggest huge potential for this type of interaction in the entertainment, edutainment, and leisure domain, but limited usefulness for serious applications.
Physical modelling and supervised training of a virtual string quartet 103-112
  Graham Percival; Nicholas Bailey; George Tzanetakis
This work improves the realism of synthesis and performance of string quartet music by generating audio through physical modelling of the violins, viola, and cello. To perform music with the physical models, virtual musicians interpret the musical score and generate actions which control the physical models. The resulting audio and haptic signals are examined with support vector machines, which adjust the bowing parameters in order to establish and maintain a desirable timbre. This intelligent feedback control is trained with human input, but after the initial training is completed, the virtual musicians perform autonomously. The system can synthesize and control different instruments of the same type (e.g., multiple distinct violins) and has been tested on two distinct string quartets (total of 8 violins, 2 violas, 2 cellos). In addition to audio, the system creates a video animation of the instruments performing the sheet music.
Using quadratic programming to estimate feature relevance in structural analyses of music 113-122
  Jordan B. L. Smith; Elaine Chew
To identify repeated patterns and contrasting sections in music, it is common to use self-similarity matrices (SSMs) to visualize and estimate structure. We introduce a novel application for SSMs derived from audio recordings: using them to learn about the potential reasoning behind a listener's annotation. We use SSMs generated by musically-motivated audio features at various timescales to represent contributions to a structural annotation. Since a listener's attention can shift among musical features (e.g., rhythm, timbre, and harmony) throughout a piece, we further break down the SSMs into section-wise components and use quadratic programming (QP) to minimize the distance between a linear sum of these components and the annotated description. We posit that the optimal section-wise weights on the feature components may indicate the features to which a listener attended when annotating a piece, and thus may help us to understand why two listeners disagreed about a piece's structure. We discuss some examples that substantiate the claim that feature relevance varies throughout a piece, using our method to investigate differences between listeners' interpretations, and lastly propose some variations on our method.
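As a rough illustration of the self-similarity matrices and the weighted-sum objective mentioned in this abstract, the sketch below builds an SSM from a toy feature sequence and evaluates the squared distance between a weighted sum of component matrices and a target matrix. The feature values, the function names, and the use of cosine similarity are assumptions made for illustration; the abstract does not specify them, and no quadratic programming solver is included.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity_matrix(frames):
    """Entry (i, j) is the similarity between frame i and frame j."""
    n = len(frames)
    return [[cosine(frames[i], frames[j]) for j in range(n)] for i in range(n)]


def combination_error(weights, components, target):
    """Squared distance between sum_k weights[k] * components[k] and target.

    This is the kind of quantity a QP solver would minimize over the weights.
    """
    n = len(target)
    err = 0.0
    for i in range(n):
        for j in range(n):
            mix = sum(w * c[i][j] for w, c in zip(weights, components))
            err += (mix - target[i][j]) ** 2
    return err


# Hypothetical per-frame features for a piece with sections A, B, A.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
ssm = self_similarity_matrix(frames)
```

In this toy sequence the first and last frames are identical, so the corresponding off-diagonal SSM entry is as high as the diagonal, which is how a repeated section shows up.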

Similarity search

Topology preserving hashing for similarity search 123-132
  Lei Zhang; Yongdong Zhang; Jinhui Tang; Xiaoguang Gu; Jintao Li; Qi Tian
Binary hashing has been widely used for efficient similarity search. Learning efficient codes has become a research focus and it is still a challenge. In many cases, real-world data lies on a low-dimensional manifold, which should be taken into account to capture meaningful neighbors with hashing. What matters about a manifold is its topology, which represents the neighborhood relationships between its subregions and the relative proximities between the neighbors of each subregion, e.g. the relative ranking of neighbors of each subregion. Most existing hashing methods try to preserve the neighborhood relationships by mapping similar points to close codes, while ignoring the neighborhood rankings. Moreover, most hashing methods fail to provide a good ranking for query results since they use Hamming distance as the similarity metric, and in practice, there are often a lot of results sharing the same distance to a query. In this paper, we propose a novel hashing method to solve these two issues jointly. The proposed method is referred to as Topology Preserving Hashing (TPH). TPH is distinct from prior works by preserving the neighborhood rankings of data points in Hamming space. The learning stage of TPH is formulated as a generalized eigendecomposition problem with closed form solutions. Experimental comparisons with other state-of-the-art methods on three well-known image benchmarks demonstrate the efficacy of the proposed method.
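The ranking problem this abstract describes, many results sharing the same Hamming distance to a query, can be seen in a small sketch. The 4-bit codes and item names below are made up for illustration, and the code shows only plain Hamming ranking, not TPH itself.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


# Hypothetical 4-bit hash codes for a query and four database items.
query = 0b1010
database = {"a": 0b1011, "b": 0b1000, "c": 0b0010, "d": 0b0101}

distances = {name: hamming(query, code) for name, code in database.items()}

# Items a, b and c all sit at distance 1, so Hamming distance alone
# cannot order them relative to each other.
ranked = sorted(database, key=lambda name: distances[name])
```

With b-bit codes there are only b + 1 possible distance values, so ties of this kind become common as the database grows.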
Order preserving hashing for approximate nearest neighbor search 133-142
  Jianfeng Wang; Jingdong Wang; Nenghai Yu; Shipeng Li
In this paper, we propose a novel method to learn similarity-preserving hash functions for approximate nearest neighbor (NN) search. The key idea is to learn hash functions by maximizing the alignment between the similarity orders computed from the original space and the ones in the Hamming space. The problem of mapping the NN points into different hash codes is taken as a classification problem in which the points are categorized into several groups according to the Hamming distances to the query. The hash functions are optimized from the classifiers pooled over the training points. Experimental results demonstrate the superiority of our approach over existing state-of-the-art hashing techniques.
Linear cross-modal hashing for efficient multimedia search 143-152
  Xiaofeng Zhu; Zi Huang; Heng Tao Shen; Xin Zhao
Most existing cross-modal hashing methods suffer from the scalability issue in the training phase. In this paper, we propose a novel cross-modal hashing approach with time complexity linear in the training data size, to enable scalable indexing for multimedia search across multiple modals. Taking both the intra-similarity in each modal and the inter-similarity across different modals into consideration, the proposed approach aims at effectively learning hash functions from large-scale training datasets. More specifically, for each modal, we first partition the training data into k clusters and then represent each training data point with its distances to the k centroids of the clusters. Interestingly, such a k-dimensional data representation can reduce the time complexity of the training phase from traditional O(n²) or higher to O(n), where n is the training data size, leading to practical learning on large-scale datasets. We further prove that this new representation preserves the intra-similarity in each modal. To preserve the inter-similarity among data points across different modals, we transform the derived data representations into a common binary subspace in which binary codes from all the modals are "consistent" and comparable. The transformation simultaneously outputs the hash functions for all modals, which are used to convert unseen data into binary codes. Given a query of one modal, it is first mapped into the binary codes using the modal's hash functions, followed by matching the database binary codes of any other modals. Experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art.
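The k-dimensional representation described in this abstract, each point replaced by its distances to k cluster centroids, can be sketched as follows. The centroids are assumed to be already available (the clustering step is not shown), and the function names and toy data are hypothetical rather than taken from the paper.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid_representation(points, centroids):
    """Map each point to its vector of distances to the k centroids.

    The work is proportional to n * k, i.e. linear in the number of points n
    for a fixed number of centroids k.
    """
    return [[euclidean(p, c) for c in centroids] for p in points]


# Toy 2-D data; the two centroids are assumed given, e.g. from a clustering step.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
centroids = [[0.0, 0.0], [6.0, 8.0]]

rep = centroid_representation(points, centroids)
```

Each row of `rep` has length k regardless of the original feature dimension, which is what lets different modals be brought into representations of a common size.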
Online multimodal deep similarity learning with application to image retrieval 153-162
  Pengcheng Wu; Steven C. H. Hoi; Hao Xia; Peilin Zhao; Dayong Wang; Chunyan Miao
Recent years have witnessed extensive studies on distance metric learning (DML) for improving similarity search in multimedia information retrieval tasks. Despite their successes, most existing DML methods suffer from two critical limitations: (i) they typically attempt to learn a linear distance function on the input feature space, in which the assumption of linearity limits their capacity of measuring the similarity on complex patterns in real-world applications; (ii) they are often designed for learning distance metrics on uni-modal data, which may not effectively handle the similarity measures for multimedia objects with multimodal representations. To address these limitations, in this paper, we propose a novel framework of online multimodal deep similarity learning (OMDSL), which aims to optimally integrate multiple deep neural networks pretrained with stacked denoising autoencoder. In particular, the proposed framework explores a unified two-stage online learning scheme that consists of (i) learning a flexible nonlinear transformation function for each individual modality, and (ii) learning to find the optimal combination of multiple diverse modalities simultaneously in a coherent process. We conduct an extensive set of experiments to evaluate the performance of the proposed algorithms for multimodal image retrieval tasks, in which the encouraging results validate the effectiveness of the proposed technique.

Art, performance, and sports

One-man-band: a touch screen interface for producing live multi-camera sports broadcasts 163-172
  Eric Foote; Peter Carr; Patrick Lucey; Yaser Sheikh; Iain Matthews
Generating live broadcasts of sporting events requires a coordinated crew of camera operators, directors, and technical personnel to control and switch between multiple cameras to tell the evolving story of a game. In this paper, we present a unimodal interface concept that allows one person to cover live sporting action by controlling multiple cameras and determining which view to broadcast. The interface exploits the structure of sports broadcasts which typically switch between a zoomed-out game-camera view (which records the strategic team-level play), and a zoomed-in iso-camera view (which captures the animated adversarial relations between opposing players). The operator simultaneously controls multiple pan-tilt-zoom cameras by pointing at a location on the touch screen, and selects which camera to broadcast using one or two points of contact. The image from the selected camera is superimposed on top of a wide-angle view captured from a context-camera which provides the operator with periphery information (which is useful for ensuring good framing while controlling the camera). We show that by unifying directorial and camera operation functions, we can achieve comparable broadcast quality to a multi-person crew, while reducing cost, logistical, and communication complexities.
Tele echo tube: beyond cultural and imaginable boundaries 173-182
  Hiroki Kobayashi; Michitaka Hirose; Akio Fujiwara; Kazuhiko Nakamura; Kaoru Sezaki; Kaoru Saito
Currently, human-computer interaction (HCI) is primarily focused on human-centric interactions; however, people experience many nonhuman-centric interactions during the course of a day. Interactions with nature, such as experiencing the sounds of birds or trickling water, can imprint the beauty of nature in our memories. In this context, this paper presents an interface of such nonhuman interactions to observe people's reaction to the interactions through an imaginable interaction with a mythological creature. Tele Echo Tube (TET) is a speaking tube interface that acoustically interacts with a deep mountain echo through the slightly vibrating lampshade-like interface. TET allows users to interact with the mountain echo in real time through an augmented echo-sounding experience with the vibration over a satellite data network. This novel interactive system can create an imaginable presence of the mythological creature in the undeveloped natural locations beyond our cultural and imaginable boundaries. The results indicate that users take the reflection of the sound as a cue that triggers the nonlinguistic believability in the form of the mythological metaphor of the mountain echo. This echo-like experience of believable interaction in an augmented reality between a human and nature gave the users an imaginable presence of the mountain echo with a high degree of excitement. This paper describes the development and integration of nonhuman-centric design protocols, requirements, methods, and context evaluation.
eHeritage of shadow puppetry: creation and manipulation 183-192
  Min Lin; Zhenzhen Hu; Si Liu; Meng Wang; Richang Hong; Shuicheng Yan
To preserve the precious traditional heritage Chinese shadow puppetry, we propose the puppetry eHeritage, including a creator module and a manipulator module. The creator module accepts a frontal view face image and a profile face image of the user as input, and automatically generates the corresponding puppet, which looks like the original person and meanwhile has some typical characteristics of traditional Chinese shadow puppetry. In order to create the puppet, we first extract the central profile curve and warp the reference puppet eye and eyebrow to the shape of the frontal view eye and eyebrow. Then we transfer the puppet texture to the real face area. The manipulator module can accept the script provided by the user as input and automatically generate the motion sequences. Technically, we first learn atomic motions from a set of shadow puppetry videos. A scripting system converts the user's input to atomic motions, and finally synthesizes the animation based on the atomic motion instances. For better visual effects, we propose the sparsity optimization over simplexes formulation to automatically assemble weighted instances of different atomic actions into a smooth shadow puppetry animation sequence. We evaluate the performance of the creator module and the manipulator module sequentially. Extensive experimental results on the creation of puppetry characters and puppetry plays well demonstrate the effectiveness of the proposed system.
Hybrid robotic/virtual pan-tilt-zoom cameras for autonomous event recording 193-202
  Peter Carr; Michael Mistry; Iain Matthews
We present a method to generate aesthetic video from a robotic camera by incorporating a virtual camera operating on a delay, and a hybrid controller which uses feedback from both the robotic and virtual cameras. Our strategy employs a robotic camera to follow a coarse region-of-interest identified by a realtime computer vision system, and then resamples the captured images to synthesize the video that would have been recorded along a smooth, aesthetic camera trajectory. The smooth motion trajectory is obtained by operating the virtual camera on a short delay so that immediate future events are known exactly. Previous autonomous camera installations have employed either robotic cameras or stationary wide-angle cameras with subregion cropping. Robotic cameras track the subject using realtime sensor data, and regulate a smoothness-latency trade-off through control gains. Fixed cameras post-process the data and suffer significant reductions in image resolution when the subject moves freely over a large area.
   Our approach provides a solution for broadcasting events from locations that camera operators cannot easily access. We can also offer broadcasters additional actuated camera angles without the overhead of additional human operators. Experiments on our prototype system for college basketball illustrate how our approach better mimics human operators compared to traditional robotic control approaches, while avoiding the loss in resolution that occurs with fixed camera systems.
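The idea of operating a virtual camera on a short delay, so that a few future frames are available when choosing the current view, can be illustrated with a centered moving average. This is a stand-in chosen for simplicity: the abstract does not say how the smooth trajectory is computed, and the function name, window size, and pan angles below are assumptions.

```python
def delayed_smooth(positions, lookahead):
    """Centered moving average over a window of +/- `lookahead` frames.

    The output for frame t depends on frames up to t + lookahead, so in a
    live system it could only be produced `lookahead` frames late.
    """
    n = len(positions)
    out = []
    for t in range(n):
        lo = max(0, t - lookahead)
        hi = min(n, t + lookahead + 1)
        window = positions[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Hypothetical jittery pan angles (degrees) from a tracker.
raw_pan = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0, 0.0]
smooth_pan = delayed_smooth(raw_pan, lookahead=1)
```

A larger `lookahead` gives a smoother path at the cost of more delay, which mirrors the smoothness-latency trade-off the abstract attributes to robotic cameras.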

Brave new topics: social and cognitive aspects

Social life networks: a multimedia problem? 203-212
  Amarnath Gupta; Ramesh Jain
Connecting people to the resources they need is a fundamental task for any society. We present the idea of a technology that can be used by the middle tier of a society so that it uses people's mobile devices and social networks to connect the needy with providers. We conceive of a world observatory called the Social Life Network (SLN) that connects together people and things and monitors for people's needs as their life situations evolve. To enable such a system we need SLN to register and recognize situations by combining people's activities and data streaming from personal devices and environment sensors, and based on the situations make the connections when possible. But is this a multimedia problem? We show that many pattern recognition, machine learning, sensor fusion and information retrieval techniques used in multimedia-related research are deeply connected to the SLN problem. We sketch the functional architecture of such a system and show the place for these techniques.
Unveiling the multimedia unconscious: implicit cognitive processes and multimedia content analysis 213-222
  Marco Cristani; Alessandro Vinciarelli; Cristina Segalin; Alessandro Perina
One of the main findings of cognitive sciences is that automatic processes of which we are unaware shape, to a significant extent, our perception of the environment. The phenomenon applies not only to the real world, but also to multimedia data we consume every day. Whenever we look at pictures, watch a video or listen to audio recordings, our conscious attention efforts focus on the observable content, but our cognition spontaneously perceives intentions, beliefs, values, attitudes and other constructs that, while being outside of our conscious awareness, still shape our reactions and behavior. So far, multimedia technologies have neglected such a phenomenon to a large extent. This paper argues that taking into account cognitive effects is possible and it can also improve multimedia approaches. As a supporting proof-of-concept, the paper shows not only that there are visual patterns correlated with the personality traits of 300 Flickr users to a statistically significant extent, but also that the personality traits (both self-assessed and attributed by others) of those users can be inferred from the images these latter post as "favourite".
Large-scale visual sentiment ontology and detectors using adjective noun pairs BIBAFull-Text 223-232
  Damian Borth; Rongrong Ji; Tao Chen; Thomas Breuel; Shih-Fu Chang
We address the challenge of sentiment analysis from visual content. In contrast to existing methods which infer sentiment or emotion directly from visual low-level features, we propose a novel approach based on understanding of the visual concepts that are strongly related to sentiments. Our key contribution is two-fold: first, we present a method built upon psychological theories and web mining to automatically construct a large-scale Visual Sentiment Ontology (VSO) consisting of more than 3,000 Adjective Noun Pairs (ANP). Second, we propose SentiBank, a novel visual concept detector library that can be used to detect the presence of 1,200 ANPs in an image. The VSO and SentiBank are distinct from existing work and will open a gate towards various applications enabled by automatic sentiment analysis. Experiments on detecting sentiment of image tweets demonstrate significant improvement in detection accuracy when comparing the proposed SentiBank based predictors with the text-based approaches. The effort also leads to a large publicly available resource consisting of a visual sentiment ontology, a large detector library, and the training/testing benchmark for visual sentiment analysis.
Indexing billions of images for sketch-based retrieval BIBAFull-Text 233-242
  Xinghai Sun; Changhu Wang; Chao Xu; Lei Zhang
Because of the popularity of touch-screen devices, it has become a highly desirable feature to retrieve images from a huge repository by matching with a hand-drawn sketch. Although searching images via keywords or an example image has been successfully launched in some commercial search engines of billions of images, it is still very challenging for both academia and industry to develop a sketch-based image retrieval system on a billion-level database. In this work, we systematically study this problem and try to build a system to support query-by-sketch for two billion images. The raw edge pixel and Chamfer matching are selected as the basic representation and matching in this system, owing to the superior performance compared with other methods in extensive experiments. To get a more compact feature and a faster matching, a vector-like Chamfer feature pair is introduced, based on which the complex matching is reformulated as the crossover dot-product of feature pairs. Based on this new formulation, a compact shape code is developed to represent each image/sketch by projecting the Chamfer features to a linear subspace followed by a non-linear source coding. Finally, the multi-probe Kmedoids-LSH is leveraged to index database images, and the compact shape codes are further used for fast reranking. Extensive experiments show the effectiveness of the proposed features and algorithms in building such a sketch-based image search system.
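As an illustration of the basic matching step named in this abstract, the following is a minimal sketch of brute-force Chamfer distance between two sets of edge pixels. It is not the paper's method: the vector-like feature pair, the dot-product reformulation, the shape codes and the LSH index are not shown, and the function names and toy coordinates are assumptions made for the example.

```python
import math

def directed_chamfer(src, dst):
    """Mean distance from each edge pixel in src to its nearest pixel in dst."""
    if not src or not dst:
        raise ValueError("both edge-pixel sets must be non-empty")
    total = 0.0
    for (x, y) in src:
        # Nearest-neighbour search by brute force; fine for a toy example only.
        total += min(math.hypot(x - u, y - v) for (u, v) in dst)
    return total / len(src)

def symmetric_chamfer(a, b):
    """Average of the two directed distances, so the measure is symmetric."""
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))

# Hypothetical edge pixels: a horizontal stroke and the same stroke shifted by one row.
sketch_edges = [(0, 0), (1, 0), (2, 0)]
image_edges = [(0, 1), (1, 1), (2, 1)]
distance = symmetric_chamfer(sketch_edges, image_edges)
```

A smaller distance means the sketch and the image edges align more closely; identical edge sets give a distance of zero.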
Clickage: towards bridging semantic and intent gaps via mining click logs of search engines BIBAFull-Text 243-252
  Xian-Sheng Hua; Linjun Yang; Jingdong Wang; Jing Wang; Ming Ye; Kuansan Wang; Yong Rui; Jin Li
The semantic gap between low-level visual features and high-level semantics has been investigated for decades but still remains a big challenge in multimedia. When "search" became one of the most frequently used applications, the "intent gap", the gap between query expressions and users' search intents, emerged. Researchers have been focusing on three approaches to bridge the semantic and intent gaps: 1) developing more representative features, 2) exploiting better learning approaches or statistical models to represent the semantics, and 3) collecting more training data with better quality. However, it remains a challenge to close the gaps. In this paper, we argue that the massive amount of click data from commercial search engines provides a data set that is unique in the bridging of the semantic and intent gaps. Search engines generate millions of click data (a.k.a. image-query pairs), which provide almost "unlimited" yet strong connections between semantics and images, as well as connections between users' intents and queries. To study the intrinsic properties of click data and to investigate how to effectively leverage this huge amount of data to bridge the semantic and intent gaps is a promising direction to advance multimedia research. In the past, the primary obstacle was that no such dataset was available to the public research community. This changes as Microsoft has released new large-scale real-world image click data to the public. This paper presents preliminary studies on the power of large-scale click data with a variety of experiments, such as building large-scale concept detectors, tag processing, search, definitive tag detection, intent analysis, etc., with the goal of inspiring deeper research based on this dataset.
Latent feature learning in social media network BIBAFull-Text 253-262
  Zhaoquan Yuan; Jitao Sang; Yan Liu; Changsheng Xu
The current trend in social media analysis and application is to use pre-defined features and devote effort to the later model development modules to meet the end tasks. In this work, we claim that representation is critical to the end tasks and contributes much to the model development module. We provide evidence that specially learned features address the diverse, heterogeneous and collective characteristics of social media data well. Therefore, we propose to transfer the focus from model development to latent feature learning, and present a general feature learning framework based on the popular deep architecture. In particular, following the proposed framework, we design a novel relational generative deep learning model to test the idea on link analysis tasks in social media networks. We show that the derived latent features embed both the media content and their observed links well, leading to improvement in the social media tasks of user recommendation and social image annotation.

Action and event recognition

Learning latent spatio-temporal compositional model for human action recognition BIBAFull-Text 263-272
  Xiaodan Liang; Liang Lin; Liangliang Cao
Action recognition is an important problem in multimedia understanding. This paper addresses this problem by building an expressive compositional action model. We model one action instance in the video with an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed to a layout of deformable parts. In this way, our model can identify a Spatio-Temporal And-Or Graph (STAOG) to represent the latent structure of actions e.g. triple jumping, swinging and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes in bottom for detecting various action parts within video patches; (ii) the or-nodes over bottom, i.e. switch variables to activate their children leaf-nodes for structural variability; (iii) the and-nodes within an anchor frame for verifying spatial composition; and (iv) the root-node at top for aggregating scores over temporal anchor frames. Moreover, the contextual interactions are defined between leaf-nodes in both spatial and temporal domains. For model training, we develop a novel weakly supervised learning algorithm which iteratively determines the structural configuration (e.g. the production of leaf-nodes associated with the or-nodes) along with the optimization of multi-layer parameters. By fully exploiting spatio-temporal compositions and interactions, our approach handles well large intra-class action variance (e.g. different views, individual appearances, spatio-temporal structures). The experimental results on the challenging databases demonstrate superior performance of our approach over other methods.
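The four-layer structure described in this abstract can be pictured with a small scoring sketch. This is a simplified illustration under stated assumptions, not the authors' model: leaf scores are given as plain numbers, each or-node selects its best-scoring leaf, each and-node sums its or-nodes within one anchor frame, and the root sums over anchor frames. Contextual interactions, part deformation and the weakly supervised training are omitted, and all names and numbers are hypothetical.

```python
def or_node_score(leaf_scores):
    """An or-node acts as a switch: it activates its best-scoring child leaf."""
    return max(leaf_scores)

def and_node_score(or_nodes):
    """An and-node composes all or-nodes belonging to one anchor frame."""
    return sum(or_node_score(leaves) for leaves in or_nodes)

def root_score(anchor_frames):
    """The root aggregates and-node scores over the temporal anchor frames."""
    return sum(and_node_score(frame) for frame in anchor_frames)

# Hypothetical video: two anchor frames, each with two or-nodes of two leaves.
video = [
    [[0.2, 0.9], [0.5, 0.1]],  # anchor frame 1
    [[0.4, 0.3], [0.7, 0.8]],  # anchor frame 2
]
score = root_score(video)
```

In this toy setup the chosen leaves are 0.9 and 0.5 in the first frame and 0.4 and 0.8 in the second, so the structural choice made by the or-nodes is visible directly in the total.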
Exploring discriminative pose sub-patterns for effective action classification BIBAFull-Text 273-282
  Xu Zhao; Yuncai Liu; Yun Fu
Articulated configuration of human body parts is an essential representation of human motion, therefore is well suited for classifying human actions. In this work, we propose a novel approach to exploring the discriminative pose sub-patterns for effective action classification. These pose sub-patterns are extracted from a predefined set of 3D poses represented by hierarchical motion angles. The basic idea is motivated by the two observations: (1) There exist representative sub-patterns in each action class, from which the action class can be easily differentiated. (2) These sub-patterns frequently appear in the action class. By constructing a connection between frequent sub-patterns and the discriminative measure, we develop the SSPI, namely, the Support Sub-Pattern Induced learning algorithm for simultaneous feature selection and feature learning. Based on the algorithm, discriminative pose sub-patterns can be identified and used as a series of "magnetic centers" on the surface of normalized super-sphere for feature transform. The "attractive forces" from the sub-patterns determine the direction and step-length of the transform. This transformation makes a feature more discriminative while maintaining dimensionality invariance. Comprehensive experimental studies conducted on a large scale motion capture dataset demonstrate the effectiveness of the proposed approach for action classification and the superior performance over the state-of-the-art techniques.
Human activities recognition using depth images BIBAFull-Text 283-292
  Raj Gupta; Alex Yong-Sang Chia; Deepu Rajan
We present a new method to classify human activities by leveraging on the cues available from depth images alone. Towards this end, we propose a descriptor which couples depth and spatial information of the segmented body to describe a human pose. Unique poses (i.e. codewords) are then identified by a spatial-based clustering step. Given a video sequence of depth images, we segment humans from the depth images and represent these segmented bodies as a sequence of codewords. We exploit unique poses of an activity and the temporal ordering of these poses to learn subsequences of codewords which are strongly discriminative for the activity. Each discriminative subsequence acts as a classifier and we learn a boosted ensemble of discriminative subsequences to assign a confidence score for the activity label of the test sequence. Unlike existing methods which demand accurate tracking of 3D joint locations or couple depth with color image information as recognition cues, our method requires only the segmentation masks from depth images to recognize an activity. Experimental results on the publicly available Human Activity Dataset (which comprises 12 challenging activities) demonstrate the validity of our method, where we attain a precision/recall of 78.1%/75.4% when the person was not seen before in the training set, and 94.6%/93.1% when the person was seen before.
We are not equally negative: fine-grained labeling for multimedia event detection BIBAFull-Text 293-302
  Zhigang Ma; Yi Yang; Zhongwen Xu; Nicu Sebe; Alexander G. Hauptmann
Multimedia event detection (MED) is an effective technique for video indexing and retrieval. Current classifier training for MED treats the negative videos equally. However, many negative videos may resemble the positive videos in different degrees. Intuitively, we may capture more informative cues from the negative videos if we assign them fine-grained labels, thus benefiting the classifier learning. Aiming for this, we use a statistical method on both the positive and negative examples to get the decisive attributes of a specific event. Based on these decisive attributes, we assign the fine-grained labels to negative examples to treat them differently for more effective exploitation. The resulting fine-grained labels may not be accurate enough to characterize the negative videos. Hence, we propose to jointly optimize the fine-grained labels with the knowledge from the visual features and the attributes representations, which brings mutual reciprocality. Our model obtains two kinds of classifiers, one from the attributes and one from the features, which incorporate the informative cues from the fine-grained labels. The outputs of both classifiers on the testing videos are fused for detection. Extensive experiments on the challenging TRECVID MED 2012 development set have validated the efficacy of our proposed approach.

Streaming and synchronization

Joserlin: joint request and service scheduling for peer-to-peer non-linear media access BIBAFull-Text 303-312
  Zhen Wei Zhao; Wei Tsang Ooi
A peer-to-peer non-linear media streaming system needs to schedule both on-demand and prefetch requests carefully so as to reduce the server load and ensure good user experience. In this work, we propose Joserlin, a joint request and service scheduling solution that not only alleviates request contentions (requests compete for limited service capacity), but also schedules the prefetch requests by considering their contributions to potential reduction of server load. In particular, we propose a novel request binning algorithm to prevent self-contention among on-demand requests issued from the same peer. A service and rejection policy is devised to resolve contention among on-demand requests issued from different neighbors. More importantly, Joserlin employs a gain function to prioritize prefetch requests at both requesters and responders, and a prefetch request issuing algorithm to fully utilize available upload bandwidth. Evaluation with traces collected from a popular networked virtual environment shows that Joserlin leads to a 20% to 60% reduction in server load.
FlashStream: a multi-tiered storage architecture for adaptive HTTP streaming BIBAFull-Text 313-322
  Moonkyung Ryu; Umakishore Ramachandran
Video streaming on the Internet is popular and the need to store and stream video content using CDNs is continually on the rise thanks to services such as Hulu and Netflix. Adaptive HTTP streaming using the deployed CDN infrastructure has become the de facto standard for meeting the increasing demand for video streaming on the Internet. The storage architecture that is used for storing and streaming the video content is the focus of this study. Hard-disk as the storage medium has been the norm for enterprise-class storage servers for the longest time. More recently, multi-tiered storage servers (incorporating SSDs) such as Sun's ZFS and Facebook's flashcache offer an alternative to disk-based storage servers for enterprise applications. Both these systems use the SSD as a cache between the DRAM and the hard disk. The thesis of our work is that the current state of the art in multi-tiered storage systems, architected for general-purpose enterprise workloads, does not cater to the unique needs of adaptive HTTP streaming. We present FlashStream, a multi-tiered storage architecture that addresses the unique needs of adaptive HTTP streaming. Like ZFS and flashcache, it also incorporates SSDs as a cache between the DRAM and the hard disk. The key architectural elements of FlashStream include optimal write granularity to overcome the write amplification effect of flash memory SSDs and a QoS-sensitive caching strategy that monitors the activity of the flash memory SSDs to ensure that video streaming performance is not hampered by the caching activity. We have implemented FlashStream and experimentally compared it with ZFS and flashcache for adaptive HTTP streaming workloads. We show that FlashStream outperforms both these systems for the same hardware configuration. Specifically, it is better by a factor of two compared to its nearest competitor, namely ZFS. In addition, we have compared FlashStream with a traditional two-level storage architecture (DRAM + HDDs), and have shown that, for the same investment cost, FlashStream provides 33% better performance and 94% better energy efficiency.
Early event-driven (EED) RTCP feedback for rapid IDMS BIBAFull-Text 323-332
  Mario Montagud; Fernando Boronat; Hans Stokking
Inter-Destination Media Synchronization (IDMS) is essential in the emerging media consumption paradigm, which is radically evolving from passive and isolated services towards dynamic and interactive group shared experiences. This paper concentrates on improving a standardized RTP/RTCP-based solution for IDMS. In particular, novel Early Event-Driven (EED) RTCP feedback reporting mechanisms are designed to overcome latency issues and to enable higher flexibility, dynamism and accuracy when using RTP/RTCP for IDMS. The faster reaction on dynamic situations (e.g., detection of asynchrony or channel change delays) and a finer granularity for synchronizing media-related events, while preserving the RTCP bandwidth bounds, are validated through simulation tests.
Orchestration: tv-like mixing grammars applied to video-communication for social groups BIBAFull-Text 333-342
  Marian F. Ursu; Martin Groen; Manolis Falelakis; Michael Frantzis; Vilmos Zsombori; Rene Kaiser
This paper reports research into video-mediated synchronous communication within social groups. The ultimate aim of the research is to create a more natural medium for interaction, aware of the context in which it operates, able to continuously adapt itself to the communication needs and optimise the way in which it captures and transmits aspects of the communication. This, it is hypothesised, can be achieved by equipping each of the various locations involved in the communication with multiple controllable video cameras and microphones, and mixing the resulting content through techniques similar to those used in television -- a process referred to as "orchestration". Through orchestration, each location should be able to receive the appropriate perspectives and levels of detail, thus generating experiences in which the spatial separation between participants is minimised. The paper defines the concept of orchestration and presents two major evaluation experiments that provide supporting evidence for the main assumption and motivate further research, in richer interaction contexts, into this concept.

Keynote address

The space between the images BIBAFull-Text 343-344
  Leonidas J. Guibas
Multimedia content has become a ubiquitous presence on all our computing devices, spanning the gamut from live content captured by device sensors such as smartphone cameras to immense databases of images, audio and video stored in the cloud. As we try to maximize the utility and value of all these petabytes of content, we often do so by analyzing each piece of data individually and foregoing a deeper analysis of the relationships between the media. Yet with more and more data, there will be more and more connections and correlations, because the data captured comes from the same or similar objects, or because of particular repetitions, symmetries or other relations and self-relations that the data sources satisfy. This is particularly true for media of a geometric character, such as GPS traces, images, videos, 3D scans, 3D models, etc.
   In this talk we focus on the "space between the images", that is on expressing the relationships between different multimedia data items. We aim to make such relationships explicit, tangible, first-class objects that themselves can be analyzed, stored, and queried -- irrespective of the media they originate from. We discuss mathematical and algorithmic issues on how to represent and compute relationships or mappings between media data sets at multiple levels of detail. We also show how to analyze and leverage networks of maps and relationships, small and large, between inter-related data. The network can act as a regularizer, allowing us to benefit from the "wisdom of the collection" in performing operations on individual data sets or in map inference between them.
   We will illustrate these ideas using examples from the realm of 2D images and 3D scans/shapes -- but these notions are more generally applicable to the analysis of videos, graphs, acoustic data, biological data such as microarrays, homeworks in MOOCs, etc. This is an overview of joint work with multiple collaborators, as will be discussed in the talk.

Multimedia grand challenge

Lecture video segmentation by automatically analyzing the synchronized slides BIBAFull-Text 345-348
  Xiaoyin Che; Haojin Yang; Christoph Meinel
In this paper we propose a solution which segments lecture video by analyzing its supplementary synchronized slides. The slide content is derived automatically by an OCR (Optical Character Recognition) process with an approximate accuracy of 90%. Then we partition the slides into different subtopics by examining their logical relevance. Since the slides are synchronized with the video stream, the subtopics of the slides indicate exactly the segments of the video. Our evaluation reveals that the average segment length for each lecture ranges from 5 to 15 minutes, and that 45% of the segments obtained from the test datasets are logically reasonable.
Activity-aware adaptive compression: a morphing-based frame synthesis application in 3DTI BIBAFull-Text 349-352
  Shannon Chen; Pengye Xia; Klara Nahrstedt
In view of the different demands on quality of service of different user activities in the 3D Tele-immersive (3DTI) environment, we combine activity recognition and real-time morphing-based compression and present the Activity-Aware Adaptive Compression. We implement this scheme on our 3DTI platform: the TEEVE Endpoint, which is a runtime engine to handle the creation, transmission and rendering of 3DTI data. A user study as well as objective evaluation of the scheme show that it can achieve 25% more bandwidth saving compared to conventional 3D data compression such as zlib without perceptible degradation in the user experience.
Flickr-tag prediction using multi-modal fusion and meta information BIBAFull-Text 353-356
  Yu-Chuan Su; Tzu-Hsuan Chiu; Guan-Long Wu; Chun-Yen Yeh; Felix Wu; Winston Hsu
We present our evaluation and analysis on the Yahoo! Large-scale Flickr-tag Image Classification dataset. Our evaluations show that by combining multiple features and different classification models, the MAP of tag prediction can be significantly improved over ordinary linear classification. Further analysis shows that some tags are given not because of the visual content but because of the meta information of images. Our experiments show that we can make more accurate predictions on certain tags using meta information without any training process, compared with visual content based classifiers. Combining the meta information with multi-feature and multi-model fusion, we achieve significantly better performance than simple linear classification. We also evaluate the performance of various mid-level features, and the results suggest that the "Concept Bank" feature may be a promising direction for the task.
Structured exploration of who, what, when, and where in heterogeneous multimedia news sources BIBAFull-Text 357-360
  Brendan Jou; Hongzhi Li; Joseph G. Ellis; Daniel Morozoff-Abegauz; Shih-Fu Chang
We present a fully automatic system from raw data gathering to navigation over heterogeneous news sources, including over 18k hours of broadcast video news, 3.58M online articles, and 430M public Twitter messages. Our system addresses the challenge of extracting "who," "what," "when," and "where" from a truly multimodal perspective, leveraging audiovisual information in broadcast news and those embedded in articles, as well as textual cues in both closed captions and raw document content in articles and social media. Performed over time, we are able to extract and study the trend of topics in the news and detect interesting peaks in news coverage over the life of the topic. We visualize these peaks in trending news topics using automatically extracted keywords and iconic images, and introduce a novel multimodal algorithm for naming speakers in the news. We also present several intuitive navigation interfaces for interacting with these complex topic structures over different news sources.
Towards a comprehensive computational model for aesthetic assessment of videos BIBAFull-Text 361-364
  Subhabrata Bhattacharya; Behnaz Nojavanasghari; Tao Chen; Dong Liu; Shih-Fu Chang; Mubarak Shah
In this paper we propose a novel aesthetic model emphasizing psycho-visual statistics extracted from multiple levels in contrast to earlier approaches that rely only on descriptors suited for image recognition or based on photographic principles. At the lowest level, we determine dark-channel, sharpness and eye-sensitivity statistics over rectangular cells within a frame. At the next level, we extract Sentibank features (1,200 pre-trained visual classifiers) on a given frame, that invoke specific sentiments such as "colorful clouds", "smiling face" etc. and collect the classifier responses as frame-level statistics. At the topmost level, we extract trajectories from video shots. Using viewer's fixation priors, the trajectories are labeled as foreground, and background/camera on which statistics are computed. Additionally, spatio-temporal local binary patterns are computed that capture texture variations in a given shot. Classifiers are trained on individual feature representations independently. On thorough evaluation of 9 different types of features, we select the best features from each level -- dark channel, affect and camera motion statistics. Next, corresponding classifier scores are integrated in a sophisticated low-rank fusion framework to improve the final prediction scores. Our approach demonstrates strong correlation with human prediction on 1,000 broadcast quality videos released by NHK as an aesthetic evaluation dataset.
Multi-factor segmentation for topic visualization and recommendation: the MUST-VIS system BIBAFull-Text 365-368
  Chidansh Amitkumar Bhatt; Andrei Popescu-Belis; Maryam Habibi; Sandy Ingram; Stefano Masneri; Fergus McInnes; Nikolaos Pappas; Oliver Schreer
This paper presents the MUST-VIS system for the MediaMixer/VideoLectures.NET Temporal Segmentation and Annotation Grand Challenge. The system allows users to visualize a lecture as a series of segments represented by keyword clouds, with relations to other similar lectures and segments. Segmentation is performed using a multi-factor algorithm which takes advantage of the audio (through automatic speech recognition and word-based segmentation) and video (through the detection of actions such as writing on the blackboard). The similarity across segments and lectures is computed using a content-based recommendation algorithm. Overall, the graph-based representation of segment similarity appears to be a promising and cost-effective approach to navigating lecture databases.
Beauty is here: evaluating aesthetics in videos using multimodal features and free training data BIBAFull-Text 369-372
  Yanran Wang; Qi Dai; Rui Feng; Yu-Gang Jiang
The aesthetics of videos can be used as a useful clue to improve user satisfaction in many applications such as search and recommendation. In this paper, we demonstrate a computational approach to automatically evaluate the aesthetics of videos, with particular emphasis on identifying beautiful scenes. Using a standard classification pipeline, we analyze the effectiveness of a comprehensive set of features, ranging from low-level visual features, mid-level semantic attributes, to style descriptors. In addition, since there is limited public training data with manual labels of video aesthetics, we explore freely available resources with a simple assumption that people tend to share more aesthetically appealing works than unappealing ones. Specifically, we use images from DPChallenge and videos from Flickr as positive training data and the Dutch documentary videos as negative data, where the latter contain mostly old materials of low visual quality. Our extensive evaluations show that combining multiple features is helpful, and very promising results can be obtained using the noisy but annotation-free training data. On the NHK Multimedia Challenge dataset, we attain a Spearman's rank correlation coefficient of 0.41.
Metadata enrichment for news video retrieval: a graph-based propagation approach BIBAFull-Text 373-376
  Kong-Wah Wan; Wei-Yun Yau; Sujoy Roy
This paper summarizes our contribution to the Technicolor Rich Multimedia Retrieval from Input Videos Grand Challenge. We hold the view that semantic analysis of a given news video is best performed in the text domain. Starting with a noisy text obtained from applying Automatic Speech Recognition (ASR), a graph-based approach is then used to enrich the text by propagating labels from visually similar videos culled from parallel (YouTube) News sources. From the enriched text, we next extract salient keywords to form a query to a news video search engine, retrieving a larger corpus of related news video. Compared to a baseline method that only uses the ASR text, significant improvement in precision has been obtained, indicating that retrieval has benefited from the ingestion of the external labels. Capitalizing on the enriched metadata, we find that videos are more amenable to the Wikipedia-based Explicit Semantic Analysis (ESA), resulting in better support for subtopic news video retrieval. We apply our methods to an in-house live news search portal, and report on several best practices.
Human action recognition by fast dense trajectories BIBAFull-Text 377-380
  Zongbo Hao; Qianni Zhang; Ebroul Izquierdo; Nan Sang
In this paper, we propose the fast dense trajectories algorithm for human action recognition. Dense trajectories are robust to fast irregular motions and outperform other state-of-the-art descriptors such as KLT tracker or SIFT descriptors. However, the use of dense trajectories is time consuming. To improve the efficiency, we extract feature trajectories in the ROI rather than in the whole frames, and we use temporal pyramids to adapt to different action speeds. We evaluate the method on the dataset of Huawei/3DLife -- 3D human reconstruction and action recognition Grand Challenge in ACM Multimedia 2013. Experimental results show a significant improvement over the dense trajectories descriptor in real-time performance, and adaptability to different action speeds.
Scalable training with approximate incremental Laplacian eigenmaps and PCA BIBAFull-Text 381-384
  Eleni Mantziou; Symeon Papadopoulos; Yiannis Kompatsiaris
The paper describes the approach, the experimental settings, and the results obtained by the proposed methodology at the ACM Yahoo! Multimedia Grand Challenge. Its main contribution is the use of fast and efficient features with a highly scalable semi-supervised learning approach, the Approximate Laplacian Eigenmaps (ALEs), and its extension, by computing the test set incrementally for learning concepts in time linear to the number of images (both labelled and unlabelled). A combination of two local visual features combined with the VLAD feature aggregation method and PCA is used to improve the efficiency and time complexity. Our methodology achieves somewhat better accuracy compared to the baseline (linear SVM) in small training sets, but improves the performance as the training data increase. Performing ALE fusion on a training set of 50K/concept resulted in a MiAP score of 0.4223, which was among the highest scores of the proposed approach.
Estimating beauty ratings of videos using supervoxels BIBAFull-Text 385-388
  Gökhan Yildirim; Appu Shaji; Sabine Süsstrunk
The major low-level perceptual components that influence the beauty ratings of video are color, contrast, and motion. To estimate the beauty ratings of the NHK dataset, we propose to extract these features based on supervoxels, which are a group of pixels that share similar color and spatial information through the temporal domain. Recent beauty methods use frame-level processing for visual features and disregard the spatio-temporal aspect of beauty. In this paper, we explicitly model this property by introducing supervoxel-based visual and motion features. In order to create a beauty estimator, we first identify 60 videos (either beautiful or not beautiful) in the NHK dataset. We then train a neural network regressor using the supervoxel-based features and binary beauty ratings. We rate the 1000 videos in the NHK dataset and rank them according to their ratings. When comparing our rankings with the actual rankings of the NHK dataset, we obtain a Spearman correlation coefficient of 0.42.
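Both NHK aesthetics entries in this session report a Spearman rank correlation between predicted and actual rankings. As a reminder of what that figure measures, here is a minimal pure-Python sketch that computes Spearman's coefficient as the Pearson correlation of average ranks. The function names and the toy ratings are assumptions for illustration and are unrelated to the NHK data; the sketch does not handle the degenerate case where one input is constant.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted ratings and reference ranks for four videos.
predicted = [0.1, 0.4, 0.35, 0.8]
reference = [1, 3, 2, 4]
rho = spearman(predicted, reference)
```

A coefficient of 1 means the two orderings agree exactly, -1 means they are reversed, and values near 0 mean little monotonic agreement.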
Action recognition using invariant features under unexampled viewing conditions BIBAFull-Text 389-392
  Litian Sun; Kiyoharu Aizawa
A great challenge in real-world applications of action recognition is the lack of sufficient label information because of variance in the recording viewpoint and differences between individuals. A system that can adapt itself according to these variances is required for practical use. We present a generic method for extracting view-invariant features from skeleton joints. These view-invariant features are further refined using a stacked, compact autoencoder. To model the challenge of real-world applications, two unexampled test settings (NewView and NewPerson) are used to evaluate the proposed method. Experimental results with these test settings demonstrate the effectiveness of our method.
Search-based relevance association with auxiliary contextual cues BIBAFull-Text 393-396
  Chun-Che Wu; Kuan-Yu Chu; Yin-Hsi Kuo; Yan-Ying Chen; Wen-Yu Lee; Winston H. Hsu
In this work, we aim to solve the Bing challenge provided by Microsoft. The task is to design an effective and efficient measure of how well query terms describe the images (image-query pairs) crawled from the web. We observe that the provided image-query pairs (e.g., text-based image retrieval results) are usually related to their surrounding text; however, the relationship between image content seems to be ignored. Hence, we attempt to integrate the visual information for better ranking results. In addition, we found that many query terms are related to people (e.g., celebrities) and that users might issue similar queries (click logs) in the search engine. Therefore, in this work, we propose a relevance association by investigating the effectiveness of different auxiliary contextual cues (i.e., face, click logs, visual similarity). Experimental results show that the proposed method achieves a 16% relative improvement over the original ranking results. For people-related queries in particular, it achieves a 45.7% relative improvement.
Image search by graph-based label propagation with image representation from DNN BIBAFull-Text 397-400
  Yingwei Pan; Ting Yao; Kuiyuan Yang; Houqiang Li; Chong-Wah Ngo; Jingdong Wang; Tao Mei
Our objective is to estimate the relevance of an image to a query for image search purposes. We address two limitations of the existing image search engines in this paper. First, there is no straightforward way of bridging the gap between semantic textual queries and users' search intents on the one hand and visual image content on the other. Image search engines therefore primarily rely on static and textual features. Visual features are mainly used to identify potentially useful recurrent patterns or relevant training examples for complementing search by image reranking. Second, image rankers are trained on query-image pairs labeled by human experts, making the annotation intellectually expensive and time-consuming. Furthermore, the labels may be subjective when the queries are ambiguous, resulting in difficulty in predicting the search intention. We demonstrate that the aforementioned two problems can be mitigated by exploring the use of click-through data, which can be viewed as the footprints of user searching behavior, as an effective means of understanding queries. The correspondences between an image and a query are determined by whether the image was searched and clicked by users under the query in a commercial image search engine. We therefore hypothesize that the click counts of an image in response to a query serve as an indication of its relevance. For each new image, our proposed graph-based label propagation algorithm employs neighborhood graph search to find the nearest neighbors on an image similarity graph built up with visual representations from deep neural networks and further aggregates their clicked queries/click counts to get the labels of the new image. We conduct experiments on MSR-Bing Grand Challenge and the results show consistent performance gain over various baselines. In addition, the proposed approach is very efficient, completing annotation of each query-image pair within just 15 milliseconds on a regular PC.
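The propagation step described above (find visually similar images, then aggregate their clicked queries and click counts) can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: it uses brute-force cosine similarity instead of neighborhood graph search, short toy vectors instead of deep-network features, and a similarity-weighted sum that is an assumption of this sketch. All names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_labels(new_feature, indexed, k=2):
    """Score queries for a new image from its k most similar indexed images.

    indexed: list of (feature_vector, {query: click_count}) pairs.
    Returns {query: score}, where each neighbor contributes its click
    count weighted by its similarity to the new image (assumed weighting).
    """
    scored = [(cosine(new_feature, feature), clicks) for feature, clicks in indexed]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, clicks in neighbors:
        for query, count in clicks.items():
            scores[query] += similarity * count
    return dict(scores)


# Toy index: three images with 2-D features and their click logs.
index = [
    ([1.0, 0.0], {"cat": 10}),
    ([0.9, 0.1], {"cat": 5, "kitten": 2}),
    ([0.0, 1.0], {"dog": 7}),
]
labels = propagate_labels([1.0, 0.0], index, k=2)
```

With these toy values, the new image inherits "cat" and "kitten" from its two nearest neighbors, while "dog" from the dissimilar image is not propagated.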

Demos

Real-time salient object detection BIBAFull-Text 401-402
  Chia-Ju Lu; Chih-Fan Hsu; Mei-Chen Yeh
Salient object detection techniques have a variety of multimedia applications of broad interest. However, the detection must be fast to truly aid in these processes. There exist many robust algorithms tackling the salient object detection problem but most of them are computationally demanding. In this demonstration we show a fast salient object detection system implemented in a conventional PC environment. We examine the challenges faced in the design and development of a practical system that can achieve accurate detection in real-time.
Kanji snap: an OCR-based smartphone application for learning Japanese kanji characters BIBAFull-Text 403-404
  Kiia Korpi; Kiyoharu Aizawa
As optical character recognition techniques improve, new opportunities to improve existing systems open up. There are, for example, OCR reading aids for the visually impaired and applications for translating foreign text by capturing photos with smartphones. But one field that has not made use of the advancing technology is the learning environment. This demo concentrates on incorporating optical character recognition into a smartphone application for learning Japanese kanji characters. With the application, users can take photos of kanji in their everyday environment and look up detailed information and translations easily. Users can practice those kanji with vocabulary lists and quizzes and track their study progress.
Mobile video browsing with the ThumbBrowser BIBAFull-Text 405-406
  Marco A. Hudelist; Klaus Schoeffmann; Laszlo Boeszoermenyi
In this work we propose an early prototype of a video browser for mobile devices with touchscreens. We concentrate on utilizing the thumbs because of the natural posture used with the devices when watching videos in landscape mode. The controls are only displayed when the user touches the screen and automatically rearrange themselves depending on the position of the thumbs. A combination of a radial menu and an extended seeker control with hierarchical browsing and bookmarking features enables the user to navigate quickly through videos.
Physiognomy master: a novel personality analysis system based on facial features BIBAFull-Text 407-408
  Che-Hao Hsu; Kai-Lung Hua; Wen-Huang Cheng
In this demo, we present the proposed "Physiognomy Master." It is a novel practical personality analysis system based on facial features. We first design five facial features that are essential for face reading. We then construct a database to record the facial features' values from a number of volunteers. In the meantime, the volunteers are also invited to fill out a professional personality test. The relations between the facial features and the personality traits are then learned. Given a test subject or an input frontal face image, the proposed system will produce the associated personality report by fusing the personality scores from the people who have similar facial features in the constructed database. The fusing mechanism is based on the idea that people with similar facial features possess similar personality characteristics. The proposed system is a powerful tool in numerous kinds of social interactions, such as personnel selection, team composition, and marriage matching.
LAVES: an instant mobile video search system based on layered audio-video indexing BIBAFull-Text 409-410
  Wu Liu; Feibin Yang; Yongdong Zhang; Qinghua Huang; Tao Mei
This demonstration presents an innovative instant mobile video search system based on layered audio-video indexing, called "LAVES." Through the system, users can discover videos by simply pointing their phones at a screen to capture just a few seconds of what they are watching. Unlike most existing mobile video search applications, which simply send the original video query to the cloud, the proposed mobile system is one of the first attempts towards instant and progressive video search leveraging the light-weight computing capacity of mobile devices. The system is able to index large-scale video data using the layered audio-video indexing technique on the cloud, as well as extract light-weight joint audio-video signatures in real time and perform a bipartite-graph-based progressive search process on the devices. On a 600-hour video dataset, the system outperforms state-of-the-art methods, achieving 90.79% precision when the query video is less than 10 seconds long.
Freesound technical demo BIBAFull-Text 411-412
  Frederic Font; Gerard Roma; Xavier Serra
Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.
Visualizing web mash-ups for in-situ vision-based mobile AR applications BIBAFull-Text 413-414
  Yu You; Ville-Veikko Mattila
Augmented reality applications are gaining popularity due to the increased capabilities of modern mobile devices. Creating AR content, however, is tedious and traditionally done in desktop environments by professionals, with extensive knowledge and/or even programming skills required. In this demo, we demonstrate a complete mobile approach for creating vision-based AR in both indoor and outdoor environments. Using hyperlinks, Web mashups are built by normal users without programming skills to dynamically augment the physical world.
repoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data BIBAFull-Text 415-416
  Oscar Mayor; Quim Llimona; Marco Marchini; Panos Papiotis; Esteban Maestre
In this technical demo we present repoVizz (http://repovizz.upf.edu), an integrated online system capable of structural formatting and remote storage, browsing, exchange, annotation, and visualization of synchronous multi-modal, time-aligned data. Motivated by a growing need for data-driven collaborative research, repoVizz aims to resolve commonly encountered difficulties in sharing or browsing large collections of multi-modal data. In its current state, repoVizz is designed to hold time-aligned streams of heterogeneous data: audio, video, motion capture, physiological signals, extracted descriptors, annotations, et cetera. Most popular formats for audio and video are supported, while Broadcast WAVE or CSV formats are adopted for streams other than audio or video (e.g., motion capture or physiological signals). The data itself is structured via customized XML files, allowing the user to (re-)organize multi-modal data in any hierarchical manner, as the XML structure only holds metadata and pointers to data files. Datasets are stored in an online database, allowing the user to interact with the data remotely through a powerful HTML5 visual interface accessible from any standard web browser; this feature can be considered a key aspect of repoVizz since data can be explored, annotated, or visualized from any location or device. Data exchange and upload/download are made easy and secure via a number of data conversion tools and a user/permission management system.
Small objects query suggestion in a large web-image collection BIBAFull-Text 417-418
  Pierre Letessier; Nicolas Hervé; Julien Champ; Alexis Joly; Olivier Buisson; Amel Hamzaoui
State-of-the-art visual search methods allow small rigid objects (e.g. logos, paintings, etc.) to be retrieved efficiently in very large image datasets. Users' perception of the classical query-by-window paradigm is however affected by the fact that many submitted queries actually return nothing or only junk results. We demonstrate in this demo that the perception can be radically different if the objects of interest are instead suggested to the user by pre-computing relevant clusters of instances. Impressive results involving very small objects discovered in a web collection of 110K images are demonstrated through a simple interactive GUI.
Video2Sentence and vice versa BIBAFull-Text 419-420
  Amirhossein Habibian; Cees G. M. Snoek
In this technical demonstration, we showcase a multimedia search engine that retrieves a video from a sentence, or a sentence from a video. The key novelty is our machine translation capability that exploits a cross-media representation for both the visual and textual modality using concept vocabularies. We will demonstrate the translations using arbitrary web videos and sentences related to everyday events. What is more, we will provide an automatically generated explanation, in terms of concept detectors, on why a particular video or sentence has been retrieved as the most likely translation.
A tool for catching back your preferred videos from physical collages BIBAFull-Text 421-422
  Christoph Korinke; Mohamad Rabbath; Dennis Lamken; Susanne Boll
Images and videos are easy for inexperienced users to create with intuitive devices like smartphones. We propose a system which enables users to extract panoramas from videos for the creation of collages which can be printed. Single frames used for panoramas are indexed on the basis of SURF features. As a result, users can take a photo with the smartphone of a panorama within the printed collage or zoom into the digital version of a collage to select an area of interest and retrieve the related video using the SURF features index. Moreover, users can pick a photo from a location where a video was previously recorded and retrieve a related video as well. Our approach supports the maintenance of a large number of videos.
Pl@ntNet mobile app BIBAFull-Text 423-424
  Hervé Goëau; Pierre Bonnet; Alexis Joly; Vera Bakic; Julien Barbe; Itheri Yahiaoui; Souheil Selmi; Jennifer Carré; Daniel Barthélémy; Nozha Boujemaa; Jean-François Molino; Grégoire Duché; Aurélien Péronnet
Pl@ntNet is an image sharing and retrieval application for the identification of plants, available on iPhone and iPad devices. Contrary to previous content-based identification applications, it can work with several parts of the plant including flowers, leaves, fruits and bark. It also allows integrating users' observations into the database thanks to a collaborative workflow involving the members of a social network specialized in plants. Data collected so far makes it one of the largest mobile plant identification tools.
Determining exposure values from HDR histograms for smartphone photography BIBAFull-Text 425-426
  Benjamin Guthier; Kalun Ho; Stephan Kopf; Wolfgang Effelsberg
We present a novel system to assist users in choosing suitable exposure values for photography on smartphones. The user specifies the desired shape of the histogram of the captured image by adjusting the parameters of a function. The system then uses the smartphone's camera and varying exposure values to calculate a high dynamic range (HDR) histogram of the scene. It contains the entire brightness range from the darkest to the brightest area of the scene. For any exposure value, the HDR histogram can be used to synthesize a conventional histogram of a low dynamic range (LDR) image as if captured by this exposure value. This process is much faster than capturing the image itself. By comparing a number of synthesized LDR histograms to the user's desired histogram shape, the system determines an exposure value that best meets the user's preference.
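The selection loop described above (synthesize an LDR histogram from the HDR histogram for each candidate exposure, then compare it with the desired shape) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a linear camera response that saturates at 1.0, represents the HDR histogram as (radiance, pixel count) pairs, and uses an L1 distance between normalized histograms. All names and numbers are hypothetical.

```python
def synthesize_ldr_histogram(hdr_hist, exposure, bins):
    """Simulate the LDR histogram an image taken at `exposure` would have.

    hdr_hist: list of (radiance, pixel_count) pairs.
    Assumes pixel value = min(radiance * exposure, 1.0) (linear response
    with saturation), quantized into `bins` equal-width bins.
    """
    ldr = [0] * bins
    for radiance, count in hdr_hist:
        value = min(radiance * exposure, 1.0)
        index = min(int(value * bins), bins - 1)  # saturated pixels go to the top bin
        ldr[index] += count
    return ldr


def histogram_distance(h1, h2):
    """L1 distance between two histograms after normalizing each to sum 1."""
    total1, total2 = sum(h1), sum(h2)
    return sum(abs(a / total1 - b / total2) for a, b in zip(h1, h2))


def best_exposure(hdr_hist, desired, candidates):
    """Pick the candidate exposure whose synthesized histogram is closest to `desired`."""
    return min(
        candidates,
        key=lambda e: histogram_distance(
            synthesize_ldr_histogram(hdr_hist, e, len(desired)), desired
        ),
    )


# Toy scene: half the pixels at radiance 0.125, half at 0.25.
scene = [(0.125, 50), (0.25, 50)]
desired_shape = [0, 1, 1, 0]  # user wants the mass in the mid-tones
chosen = best_exposure(scene, desired_shape, candidates=[1, 2, 4, 8])
```

Because each candidate only requires a pass over the histogram bins rather than a new capture, many exposure values can be evaluated cheaply, which matches the speed argument in the abstract.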
Semantic dispatching of multimedia news with MEWS BIBAFull-Text 427-428
  Julien Law-To; Gregory Grefenstette; Rémi Landais
Online news comes in rich multimedia form: video, audio, photos, in addition to traditional text. Recent advances in semantically-rich text processing, in speech-to-text processing, and in image processing allow us to develop new ways of presenting and enriching news stories. Here we present MEWS, a Multimedia nEWS platform, which enriches news browsing according to media (text, images, and video) and to the automatically detected type of news (music, general news, politics).
Cloud based multimedia analytic platform BIBAFull-Text 429-430
  Peng Wu; Rares Vernica; Qian Lin
Multimedia Analytic Platform is a cloud based service that exposes state-of-the-art multimedia technologies for mobile and web application development. As a product-quality service platform, it offers comprehensive API documentation, code examples, a service description and a trial sandbox for each multimedia technology. The use of cloud storage and a distributed computing framework allows the service platform to run robustly and efficiently. The technologies currently supported by the platform include face detection, face verification, face demographic estimation, feature extraction, image matching, and image collage. Since its initial public launch in October 2012, it has been adopted by universities and third party companies for course support and application development.
Adaptable and personalized game-based training system for fall prevention BIBAFull-Text 431-432
  Sandro Hardy; Stefan Göbel; Ralf Steinmetz
Digital games which incorporate movements of the player's body in their gameplay are becoming more and more popular. An increasing number of doctors and physical therapists use such games for training exercises, although these games are not designed to achieve predefined training goals. Various studies show that the training effects of these games are small in comparison with classic exercises. Therapists request more accessible and more flexible games. In this paper we present an adaptive game for fall prevention based on the adaptation and exergame analysis framework StoryTecRT, which allows the adaptation of parameters that impact the accessibility, acceptance and training load of a game. This paper includes an insight into the framework and the implementation as well as first evaluation results.
AdVisual: a visual-based advertising system BIBAFull-Text 433-434
  Chao Dong; Shifeng Chen; Xiaoou Tang
In this work, we present a visual-based contextual advertising system, AdVisual. It is designed for content service providers to effectively select relevant ads for online videos. First, it will analyze each video and extract high level semantic visual information including specific objects, people, and significant scenes. Then ads highly related to the visual concepts are associated with the corresponding shots. As AdVisual is a user interaction system, it allows users to select favorite ads relevant to the video. By saliency detection, selected ads will be displayed as an overlay window embedded at the non-intrusive part of the shot.
Multi-screen cloud social TV: transforming TV experience into 21st century BIBAFull-Text 435-436
  Yichao Jin; Tian Xie; Yonggang Wen; Haiyong Xie
Nowadays, TV experience has been transformed from the traditional "laid-back" video watching experience to a "lean-forward" social and multi-screen experience. In this demo, we design and develop a multi-screen cloud social TV system in response to this trend. Our system is built upon two enabling technologies, including a cloud based back-end infrastructure and a multi-screen front-end application. We demonstrate two key features of our system based on real user scenarios, including a living-room video watching experience with remote viewers, and the video teleportation as an enhanced multi-screen experience.
"Wow! you are so beautiful today!" BIBAFull-Text 437-438
  Luoqi Liu; Hui Xu; Si Liu; Junliang Xing; Xi Zhou; Shuicheng Yan
In this demo, we present Beauty e-Experts, a fully automatic system for hairstyle and facial makeup recommendation and synthesis. Given a user-provided frontal facial image with short/bound hair and no/light makeup, the Beauty e-Experts system can not only recommend the most suitable hairstyle and makeup, but also show the synthesis effects. Two problems are considered for the Beauty e-Experts system: what to recommend and how to wear, which describe a similar process of selecting and applying hairstyle and cosmetics in our daily life. For the what-to-recommend problem, we propose a multiple tree-structured super-graphs model to explore the complex relationships among the beauty attributes, beauty-related attributes and image features, and then based on this model, the most suitable beauty attributes for a given facial image can be efficiently inferred. For the how-to-wear problem, a facial image synthesis module is designed to seamlessly blend the recommended hairstyle and makeup into the user facial image. Extensive experimental evaluations and analysis on testing images well demonstrate the effectiveness of the proposed system.
OSCOR: an orientation sensor data correction system for mobile generated contents BIBAFull-Text 439-440
  Guanfeng Wang; Beomjoo Seo; Yifang Yin; Roger Zimmermann; Zhijie Shen
In addition to positioning data, other sensor information, such as orientation data, has become a useful and powerful contextual feature. Such auxiliary information can facilitate higher-level semantic description inferences in many multimedia applications, e.g., video tagging and video summarization. However, sensor data collected from current mobile devices is often not accurate enough for upstream multimedia analysis. An effective orientation data correction system for mobile multimedia content has been an elusive goal so far. Here we present a system, termed Oscor, which aims to improve the accuracy of noisy orientation sensor measurements generated by mobile devices during image and video recording. We provide a user-friendly camera interface to facilitate the gathering of additional information, which enables the correction process on the server side. Geographic field-of-view (FOV) visualizations based on the original and corrected sensor data help users understand the corrected contextual information and how the erroneous data may affect further processes.
OTMedia: the French TransMedia news observatory BIBAFull-Text 441-442
  Nicolas Hervé; Marie-Luce Viaud; Jérôme Thièvre; Agnès Saulnier; Julien Champ; Pierre Letessier; Olivier Buisson; Alexis Joly
Who said What, Where and How? How are images, video and stories spreading out? Who produces the information? OTMedia addresses these questions by continuously collecting, enriching and analysing more than 1500 streams of French media from TV, Radio, Web, AFP, and Twitter. Two studies on media produced by end users with the OTMedia framework are presented.
TEEVE endpoint: towards the ease of 3D tele-immersive application development BIBAFull-Text 443-444
  Pengye Xia; Klara Nahrstedt
We present TEEVE Endpoint, a runtime engine that handles the creation, transmission and rendering of 3D Tele-immersive (3DTI) data and provides application programming interfaces (APIs) that let developers easily create 3DTI applications.
eHeritage of shadow puppetry: creation and manipulation BIBAFull-Text 445-446
  Zhenzhen Hu; Min Lin; Si Liu; Meng Wang; Richang Hong; Shuicheng Yan
In this demo, we propose the puppetry eHeritage, including a creator module and a manipulator module, to preserve the precious traditional heritage of Chinese shadow puppetry. The creator module accepts a frontal face image and a profile face image of the user as input, and automatically generates the corresponding puppet, which looks like the original person while preserving typical characteristics of traditional Chinese shadow puppetry. The manipulator module accepts a script provided by the user as input and automatically generates the motion sequences. For better visual effects, we propose the sparsity optimization over simplexes formulation.
Gesture-based control of physical modeling sound synthesis: a mapping-by-demonstration approach BIBAFull-Text 447-448
  Jules Françoise; Norbert Schnell; Frédéric Bevilacqua
We address the issue of mapping between gesture and sound for gesture-based control of physical modeling sound synthesis. We propose an approach called mapping by demonstration, allowing users to design the mapping by performing gestures while listening to sound examples. The system is based on a multimodal model able to learn the relationships between gestures and sounds.
News rover: exploring topical structures and serendipity in heterogeneous multimedia news BIBAFull-Text 449-450
  Hongzhi Li; Brendan Jou; Joseph G. Ellis; Daniel Morozoff; Shih-Fu Chang
News stories are rarely understood in isolation. Every story is driven by key entities that give the story its context. Persons, places, times, and several surrounding topics can often succinctly represent a news event, but are only useful if they can be both identified and linked together. We introduce a novel architecture called News Rover for re-bundling broadcast video news, online articles, and Twitter content. The system utilizes these many multimodal sources to link and organize content by topics, events, persons and time. We present two intuitive interfaces for navigating content by topics and their related news events as well as serendipitously learning about a news topic. These two interfaces trade off between user-controlled and serendipitous exploration of news while retaining the story context. The novelty of our work includes the linking of multi-source, multimodal news content to extracted entities and topical structures for contextual understanding, and its visualization in intuitive active and passive interfaces.
A novel framework for collaborative video recommendation, interest discovery and friendship suggestion based on semantic profiling BIBAFull-Text 451-452
  Marco Bertini; Alberto Del Bimbo; Andrea Ferracani; Francesco Gelli; Daniele Maddaluno; Daniele Pezzatini
Two important challenges for social networks are the creation of targeted and personalized content for their users, selecting the most interesting material from the huge amount of user-generated content, and keeping user engagement, e.g. through creation and curation of users' profiles. In this demo we show a system for video commenting, sharing and interest discovery that combines recommendation algorithms, clustering techniques, tools for video tagging and evaluation of semantic resources relatedness. Combining these tools and techniques it becomes possible to provide personalized multimedia services and to improve and propagate interests and inter-personal connections through the network.
euTV: a system for media monitoring and publishing BIBAFull-Text 453-454
  Marco Bertini; Alberto Del Bimbo; George Ioannidis; Emile Bijk; Isabel Trancoso; Hugo Meinedo
In this paper, we describe the euTV system, which provides a flexible approach to collect, manage, annotate and publish collections of images, videos and textual documents. The system is based on a Service Oriented Architecture that allows combining and orchestrating a large set of web services for automatic and manual annotation, retrieval, browsing, ingestion and authoring of multimedia sources. euTV tools have been used to create several publicly available vertical applications, addressing different use cases. Positive results of user evaluations have shown that the system can be effectively used to create different types of applications.
CAMMA: contextual advertising system for multimodal news aggregations BIBAFull-Text 455-456
  Giuliano Armano; Alessandro Giuliani; Alberto Messina; Maurizio Montagnuolo
This demo paper describes a system for contextual advertising on aggregations of multimodal news items. The prototype is intended to demonstrate how modern content analysis techniques can be profitably used to automate tasks commonly performed by humans such as the planning of the computer-assisted advertising content.
Flarty: recommending art routes using check-ins latent topics BIBAFull-Text 457-458
  Alberto Del Bimbo; Andrea Ferracani; Daniele Pezzatini
In this demo we present Flarty, a mobile location-based social network for the dynamic construction and recommendation of art routes in the city of Florence, Italy, via item-based similarity algorithms, place topic extraction and user interest modeling. To achieve this goal, Flarty derives knowledge from users' check-ins and combines clustering techniques and recommendation algorithms, as well as features such as geo-location, to define groups of similar artworks or POIs (Points Of Interest) and to compute the most efficient routes likely to meet users' interests. Model analysis takes into account ratings, topics extracted from textual features associated with the POIs, and users' preferences computed by applying collaborative filtering techniques to their past behavior.
SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content BIBAFull-Text 459-460
  Damian Borth; Tao Chen; Rongrong Ji; Shih-Fu Chang
A picture is worth one thousand words, but what words should be used to describe the sentiment and emotions conveyed in the increasingly popular social multimedia? We demonstrate a novel system which combines sound structures from psychology and the folksonomy extracted from social multimedia to develop a large visual sentiment ontology consisting of 1,200 concepts and associated classifiers called SentiBank. Each concept, defined as an Adjective Noun Pair (ANP), is made of an adjective strongly indicating emotions and a noun corresponding to objects or scenes that have a reasonable prospect of automatic detection. We believe such large-scale visual classifiers offer a powerful mid-level semantic representation enabling high-level sentiment analysis of social multimedia. We demonstrate novel applications made possible by SentiBank including live sentiment prediction of social media and visualization of visual content in a rich intuitive semantic space.
Augmented and interactive video playback based on global camera pose BIBAFull-Text 461-462
  Junsheng Fu; Lixin Fan; Yu You; Kimmo Roimela
This paper proposes a video playback system that allows users to expand the field of view to surrounding environments that are not visible in the original video frame, arbitrarily change the viewing angles, and see superimposed point-of-interest (POI) data in an augmented reality manner during the video playback. The processing consists of two main steps: in the first step, the client uploads a video to the GeoVideo Engine, and the GeoVideo Engine extracts the geo-metadata and returns them to the client; in the second step, the client requests POIs from the server and then renders the video with the POIs.
EigenNews: a personalized news video delivery platform BIBAFull-Text 463-464
  Matt C. Yu; Peter Vajda; David M. Chen; Sam S. Tsai; Maryam Daneshi; Andre F. Araujo; Huizhong Chen; Bernd Girod
We demonstrate EigenNews, a personalized television news system. Upon visiting the EigenNews website, a user is shown a variety of news videos which have been automatically selected based on her individual preferences. These videos are extracted from 16 continually recorded television programs using a multimodal segmentation algorithm. Relevant metadata for each video are generated by linking videos to online news articles. Selected news videos can be watched in three different layouts and on various devices.
NovaEmötions: winning with a smile BIBAFull-Text 465-466
  André Mourão; João Magalhães
Human-computer interaction (HCI) is expanding towards natural modalities of human expression. Gestures, body movements and other affective interaction techniques can change the way computers interact with humans. In this demo, we display a fully playable version of NovaEmötions, a competitive game where players score by acting an emotion through a facial expression. The game is designed to offer a competitive playing experience using only facial expressions. Despite the novelty of the interaction method, our game scoring algorithm kept players engaged and competitive.
Tell me what happened here in history BIBAFull-Text 467-468
  Jia Chen; Qin Jin; Weipeng Zhang; Shenghua Bao; Zhong Su; Yong Yu
This demo shows our system, which takes a landmark image as input, recognizes the landmark in the image, and returns historical events of the landmark with related photos. Unlike existing landmark-related research, we focus on the temporal dimension of a landmark. Our system automatically recognizes the landmark, shows historical events chronologically and provides detailed photos for the events. To build these functions, we fuse information from multiple online resources.
Group TV: a cloud based social TV for group social experience BIBAFull-Text 469-470
  Xiaoyan Wang; Lifeng Sun; Shou Wang
Social TV allows people to meet their demand for social experience while watching videos. Existing solutions cannot balance interests within a group and provide good viewing quality for all members, who have different network bandwidths and devices. Realizing that people in a group are more tolerant, trusting and willing to help each other, we developed a cloud-based social TV for online group video service. We emphasize group behavior in our application and introduce the concepts of tolerance and trust between users to balance their interests. The group recommendation results shown in our system achieve high overall satisfaction while providing diverse content. We also design a collaborative video distribution technology to optimize network resource utilization for a fluent, high-quality viewing experience. The entire operation of the system emphasizes simplicity, collaboration and smoothness, providing an excellent social TV experience.
GeSoDeck: a geo-social event detection and tracking system BIBAFull-Text 471-472
  Xingyu Gao; Juan Cao; Zhiwei Jin; Xin Li; Jintao Li
This demonstration presents a novel geo-social event detection and tracking system based on geographical pattern mining and content analysis, called "GeSoDeck". With our system, a user can find out which events have happened. Unlike most existing social event detection applications, GeSoDeck aims to detect events with high accuracy and efficiency, and to track them as well. Given a geographical area, the system can not only detect diverse social events in this area using geographical pattern mining and density-based K-means clustering, but also track the representative tweets of a detected event in real time, mining its geographical diffusion trajectory on the map and the temporal pattern of the retweeting process. On a realistic dataset collected from Sina Weibo, the system outperforms the state-of-the-art methods.
Stereotime: a wireless 2D and 3D switchable video communication system BIBAFull-Text 473-474
  You Yang; Qiong Liu; Yue Gao; Binbin Xiong; Li Yu; Huanbo Luan; Rongrong Ji; Qi Tian
Mobile 3D video communication, especially with 2D and 3D compatibility, is a new paradigm for both video communication and 3D video processing. Current techniques face challenges on mobile devices, where bundled constraints such as computation resources and compatibility must be considered. In this work, we present a wireless 2D and 3D switchable video communication system that addresses these challenges, and name it Stereotime. Zig-Zag fast object segmentation, depth cue detection and merging, and texture-adaptive view generation are used for 3D scene reconstruction. We show the functionalities and compatibilities on 3D mobile devices in a WiFi network environment.
MagicBrush: image search by color sketch BIBAFull-Text 475-476
  Xinghai Sun; Changhu Wang; Avneesh Sud; Chao Xu; Lei Zhang
In this paper, we showcase the MagicBrush system, a novel painting-based image search engine. This system enables users to draw a color sketch as a query to find images. Unlike existing works on sketch-based image retrieval, most of which focus on matching the shape structure without carefully considering other important visual modalities, MagicBrush takes into account the indispensable value of "color" related to "shape", and makes use of both the shape and color expectations that users usually have when they are imagining or searching for an image. To achieve this, we 1) develop a user-friendly interface to allow users to easily "paint out" their colorful visual expectations; 2) design a compact feature "color-edge word" to encode both shape and color information in an organic way; and 3) develop a novel matching and index structure to support real-time response over 6.4 million images. By taking into account both shape and color information, the MagicBrush system helps users to vividly present what they are imagining, and retrieve images in a more natural way.
Jiku director: a mobile video mashup system BIBAFull-Text 477-478
  Duong-Trung-Dung Nguyen; Mukesh Saini; Vu-Thanh Nguyen; Wei Tsang Ooi
In this technical demonstration, we present a Web-based application called Jiku Director that automatically creates a mashup video from event videos uploaded by users. The system runs an algorithm that considers view quality (shakiness, tilt, occlusion), video quality (blockiness, contrast, sharpness, illumination, burned pixels), and spatial-temporal diversity (shot angles, shot lengths) to create a mashup video with smooth shot transitions while covering the event from different perspectives.
WeCard: a multimodal solution for making personalized electronic greeting cards BIBAFull-Text 479-480
  Huijie Lin; Jia Jia; Hanyu Liao; Lianhong Cai
In this demo, we build a practical system, WeCard, to generate personalized multimodal electronic greeting cards based on parametric emotional talking avatar synthesis technologies. Given user-input greeting text and a facial image, WeCard intelligently and automatically generates personalized speech with expressive, lip-motion-synchronized facial animation. Besides the parametric talking avatar synthesis, WeCard incorporates two key technologies: 1) an automatic face mesh generation algorithm based on MPEG-4 FAPs (Facial Animation Parameters) extracted by the face alignment algorithm; 2) an emotional audio-visual speech synchronization algorithm based on DBN. More specifically, WeCard merges the user's preferred electronic card scene with the emotional talking avatar animation, turning the final content into a Flash or video file that can be easily shared with friends. In this way, WeCard can help you make your multimodal greetings more attractive, beautiful, and sincere.


Classifying tag relevance with relevant positive and negative examples BIBAFull-Text 485-488
  Xirong Li; Cees G. M. Snoek
Image tag relevance estimation aims to automatically determine whether what people label about images is factually present in the pictorial content. Unlike previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue for the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects the positive and negative examples deemed most relevant with respect to the given tag from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present-day methods. Given extracted visual features, our system can process up to 3,787 tags per second for each image. The new system is both effective and efficient for tag relevance estimation.
Non-rigid target tracking based on 'flow-cut' in pair-wise frames with online hough forests BIBAFull-Text 489-492
  Tao Zhuo; Yanning Zhang; Peng Zhang; Wei Huang; Hichem Sahli
In conventional online-learning-based tracking studies, fixed-shape appearance modeling is often used for training sample generation, as it is simple and convenient to apply. However, for more general non-rigid and articulated objects, this strategy may regard some background areas as foreground, which is likely to deteriorate the learning process. Recently published works use more than one patch to represent a non-rigid object with foreground object segmentation, but most of these segmentations for target representation are performed on single frames only. Since the motion information between consecutive frames is not considered by these approaches, accurate segmentation is hard to achieve when the background is similar to the target. In this work, we propose a novel model for non-rigid object segmentation by incorporating consecutive gradient flow between pair-wise frames into a Gibbs energy function. With help from motion information, the irregular target areas can be segmented more accurately during precise boundary convergence. The proposed segmentation model is incorporated into a semi-supervised online tracking framework for training sample generation. We test the proposed tracker on challenging videos involving heavy intrinsic variations and occlusions. The experiments demonstrate a significant improvement in tracking accuracy and robustness in comparison with other state-of-the-art tracking works.
Object coding on the semantic graph for scene classification BIBAFull-Text 493-496
  Jingjing Chen; Yahong Han; Xiaochun Cao; Qi Tian
In scene classification, a scene can be considered as a set of object cliques. Objects inside each clique have semantic correlations with each other, while two objects from different cliques are relatively independent. To utilize these correlations for better recognition performance, we propose a new method -- Object Coding on the Semantic Graph -- to address the scene classification problem. We first exploit prior knowledge by collecting statistics on a large number of labeled images and calculating the dependency degree between objects. Then, a graph is built to model the semantic correlations between objects. This semantic graph captures semantics by treating the objects as vertices and the object affinities as the weights of edges. By encoding this semantic knowledge into the semantic graph, object coding is conducted to automatically select a set of object cliques that have strong semantic correlations to represent a specific scene. The experimental results show that Object Coding on the Semantic Graph can improve the classification accuracy.
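The graph-building step described above can be illustrated with a minimal sketch. The abstract does not define the "dependency degree", so the Jaccard overlap used here is an assumed stand-in, and the function name `cooccurrence_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def cooccurrence_graph(images: Iterable[Set[str]]) -> Dict[FrozenSet[str], float]:
    """Build weighted edges between objects from per-image label sets.

    Edge weight is the Jaccard overlap of the image sets containing each
    object (an assumed affinity, not the paper's dependency degree).
    """
    single: Counter = Counter()  # number of images containing each object
    pair: Counter = Counter()    # number of images containing both objects
    for labels in images:
        single.update(labels)
        pair.update(frozenset(p) for p in combinations(sorted(labels), 2))
    # Jaccard: |A and B| / (|A| + |B| - |A and B|)
    return {p: n / (sum(single[o] for o in p) - n) for p, n in pair.items()}


labeled = [{"bed", "lamp"}, {"bed", "lamp", "desk"}, {"car", "road"}, {"bed"}]
graph = cooccurrence_graph(labeled)
```

Objects that never appear together (here "bed" and "car") simply have no edge, which matches the idea that objects from different cliques are relatively independent.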
Beyond bag of words: image representation in sub-semantic space BIBAFull-Text 497-500
  Chunjie Zhang; Shuhui Wang; Chao Liang; Jing Liu; Qingming Huang; Haojie Li; Qi Tian
Due to the semantic gap, low-level features are not able to represent images well semantically. Besides, traditional semantics-related image representations may not be able to cope with large inter-class variations and are not very robust to noise. To solve these problems, in this paper, we propose a novel image representation method in the sub-semantic space. First, exemplar classifiers are trained by separating each training image from the others and serve as the weak semantic similarity measurement. Then a graph is constructed by combining the visual similarity and weak semantic similarity of these training images. We partition this graph into visually and semantically similar sub-sets. Each sub-set of images is then used to train classifiers in order to separate this sub-set from the others. The learned sub-set classifiers are then used to construct a sub-semantic space based representation of images. This sub-semantic space is not only more semantically meaningful but also more reliable and resistant to noise. Finally, we categorize images using this sub-semantic space based representation on several public datasets to demonstrate the effectiveness of the proposed method.
Speaking swiss: languages and venues in foursquare BIBAFull-Text 501-504
  Darshan Santani; Daniel Gatica-Perez
Due to increasing globalization, urban societies are becoming more multicultural. The availability of large-scale digital mobility traces e.g. from tweets or checkins provides an opportunity to explore multiculturalism that until recently could only be addressed using survey-based methods. In this paper we examine a basic facet of multiculturalism through the lens of language use across multiple cities in Switzerland. Using data obtained from Foursquare over 330 days, we present a descriptive analysis of linguistic differences and similarities across five urban agglomerations in a multicultural, western European country.
What are the distance metrics for local features? BIBAFull-Text 505-508
  Zhendong Mao; Yongdong Zhang; Qi Tian
Previous research has found that the distance metric for similarity estimation is determined by the underlying data noise distribution. The well-known Euclidean (L2) and Manhattan (L1) metrics are then justified when the additive noise is Gaussian and Exponential, respectively. However, finding a suitable distance metric for local features is still a challenge when the underlying noise distribution is unknown and could be neither Gaussian nor Exponential. To address this issue, we introduce a modeling framework for arbitrary noise distributions and propose a generalized distance metric for local features based on this framework. We prove that the proposed distance is equivalent to the L2 or the L1 distance when the noise is Gaussian or Exponential, respectively. Furthermore, we justify the Hamming metric when the noise meets the given conditions. In that case, the proposed distance is a linear mapping of the Hamming distance. The proposed metric has been extensively tested on a benchmark data set with five state-of-the-art local features: SIFT, SURF, BRIEF, ORB and BRISK. Experiments show that our framework better models the real noise distributions and that more robust results can be obtained by using the proposed distance metric.
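For reference, the three classical metrics named in this abstract can be written as below. This is a minimal sketch of the standard definitions only, not the generalized metric the paper proposes; the function names are illustrative.

```python
import math
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors packed as ints."""
    return bin(a ^ b).count("1")
```

Real-valued descriptors such as SIFT and SURF are usually compared with L1 or L2, while binary descriptors such as BRIEF, ORB and BRISK are usually compared with the Hamming distance.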
Salient object detection in videos by optimal spatio-temporal path discovery BIBAFull-Text 509-512
  Ye Luo; Junsong Yuan
Many consumer videos focus on and follow salient objects in a scene. Detecting such salient objects is thus of great interest to video analytics and search. Instead of detecting salient objects in individual frames separately, we propose to detect and track salient objects simultaneously by finding a spatio-temporal path of the highest saliency density in the video. As salient video objects usually appear in consecutive frames, leveraging the motion coherence of videos makes it possible to detect salient objects more robustly. Without any prior knowledge of the salient objects, our method can automatically detect salient objects of different shapes and sizes, and is able to handle noisy saliency maps and moving cameras. Experimental results on two public datasets demonstrate the effectiveness of the proposed method on salient video object detection.
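The idea of searching for a high-saliency path through consecutive frames can be sketched with a small dynamic program. This is a simplified illustration under stated assumptions, not the paper's algorithm: each frame is assumed to offer a few candidate windows given as hypothetical `(x, y, score)` tuples, the path must use one candidate in every frame, and motion coherence is modeled as a maximum per-axis displacement `max_step`.

```python
from typing import List, Tuple

Candidate = Tuple[float, float, float]  # (x, y, saliency score); assumed layout


def best_path(frames: List[List[Candidate]], max_step: float) -> Tuple[float, List[int]]:
    """Pick one candidate per frame to maximize total saliency,
    allowing at most max_step displacement per axis between frames."""
    if not frames or any(not f for f in frames):
        raise ValueError("every frame needs at least one candidate")
    neg = float("-inf")
    # score[t][j]: best total saliency of a path ending at candidate j of frame t
    score = [[c[2] for c in frames[0]]]
    back = [[-1] * len(frames[0])]
    for t in range(1, len(frames)):
        row_score, row_back = [], []
        for x, y, s in frames[t]:
            best, arg = neg, -1
            for i, (px, py, _) in enumerate(frames[t - 1]):
                reachable = abs(x - px) <= max_step and abs(y - py) <= max_step
                if reachable and score[t - 1][i] > best:
                    best, arg = score[t - 1][i], i
            row_score.append(best + s if arg >= 0 else neg)
            row_back.append(arg)
        score.append(row_score)
        back.append(row_back)
    end = max(range(len(score[-1])), key=lambda j: score[-1][j])
    if score[-1][end] == neg:
        raise ValueError("no path satisfies the motion constraint")
    # Follow the back-pointers from the last frame to the first.
    path = [end]
    for t in range(len(frames) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][end], path


frames = [
    [(0, 0, 0.9), (10, 10, 0.2)],
    [(1, 0, 0.1), (10, 11, 0.3)],
    [(2, 0, 0.8), (50, 50, 1.0)],
]
total, path = best_path(frames, max_step=2)
```

In this toy input the isolated high-scoring candidate in the last frame is ignored because no coherent path reaches it, which is the kind of robustness to noisy per-frame saliency that the abstract describes.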
Multiview semi-supervised ranking for automatic image annotation BIBAFull-Text 513-516
  Ali Fakeri-Tabrizi; Massih R. Amini; Patrick Gallinari
Most photo sharing sites give their users the opportunity to manually label images. The labels collected that way are usually very incomplete due to the size of the image collections: most images are not labeled according to all the categories they belong to, and, conversely, many classes have relatively few representative examples. Automated image systems that can deal with small amounts of labeled examples and unbalanced classes are thus necessary to better organize and annotate images. In this work, we propose a multiview semi-supervised bipartite ranking model that leverages the information contained in unlabeled sets of images in order to improve the prediction performance, using multiple descriptions, or views, of images. For each topic class, our approach first learns as many view-specific rankers as there are available views, using the labeled data only. These rankers are then improved iteratively by adding pseudo-labeled pairs of examples on which all view-specific rankers agree over the ranking of examples within these pairs. We report on experiments carried out on the NUS-WIDE dataset, which show that the multiview ranking process improves predictive performance when a small number of labeled examples is available, especially for unbalanced classes. We also show that our approach achieves significant improvements over a state-of-the-art semi-supervised multiview classification model.
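The agreement test used to pick pseudo-labeled pairs can be sketched as follows. This is a minimal illustration assuming each view-specific ranker is represented by a dictionary of scores per image; the name `agreed_pairs` and the data layout are hypothetical, and the retraining loop is omitted.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def agreed_pairs(
    view_scores: Sequence[Dict[str, float]], unlabeled: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return (higher, lower) pairs of unlabeled items that every
    view-specific ranker orders the same way."""
    pairs = []
    for a, b in combinations(unlabeled, 2):
        diffs = [scores[a] - scores[b] for scores in view_scores]
        if all(d > 0 for d in diffs):
            pairs.append((a, b))
        elif all(d < 0 for d in diffs):
            pairs.append((b, a))
        # Pairs with any disagreement or tie are left out.
    return pairs


views = [
    {"img1": 0.9, "img2": 0.4, "img3": 0.5},  # e.g. scores from a color view
    {"img1": 0.7, "img2": 0.6, "img3": 0.2},  # e.g. scores from a texture view
]
pseudo = agreed_pairs(views, ["img1", "img2", "img3"])
```

Only the pairs returned here would be added to the training set in the next iteration; the pair of img2 and img3 is dropped because the two views rank it in opposite orders.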
How do we deep-link?: leveraging user-contributed time-links for non-linear video access BIBAFull-Text 517-520
  Raynor Vliegendhart; Babak Loni; Martha Larson; Alan Hanjalic
This paper studies a new way of accessing videos in a non-linear fashion. Existing non-linear access methods allow users to jump into videos at points that depict specific visual concepts or that are likely to elicit affective reactions. We believe that deep-link comments, which occur unprompted on social video sharing platforms, offer a new opportunity beyond existing methods. With deep-link comments, viewers express themselves about a particular moment in a video by including a time-code. Deep-link comments are special because they reflect viewer perceptions of noteworthiness, which include, but extend beyond, depicted conceptual content and induced affective reactions. Based on deep-link comments collected from YouTube, we develop a Viewer Expressive Reaction Variety (VERV) taxonomy that captures how viewers deep-link. We validate the taxonomy with a user study on a crowdsourcing platform and discuss how it extends conventional relevance criteria. We carry out experiments which show that deep-link comments can be automatically filtered and sorted into VERV categories.
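A deep-link comment is recognizable by the time-code it contains, which can be illustrated with a short sketch. The accepted formats (`m:ss`, `mm:ss`, `h:mm:ss`) and the name `extract_timecodes` are assumptions for illustration; the abstract does not say how the authors detect time-codes.

```python
import re
from typing import List

# Assumed time-code formats: m:ss, mm:ss or h:mm:ss, not embedded in a longer
# digit/colon run.
TIMECODE = re.compile(r"(?<![\d:])(?:(\d{1,2}):)?(\d{1,2}):([0-5]\d)(?![\d:])")


def extract_timecodes(comment: str) -> List[int]:
    """Return the offset in seconds of every time-code found in a comment."""
    offsets = []
    for hours, minutes, seconds in TIMECODE.findall(comment):
        offsets.append(int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds))
    return offsets


def is_deep_link(comment: str) -> bool:
    """A comment counts as a deep-link comment if it has at least one time-code."""
    return bool(extract_timecodes(comment))
```

For example, `extract_timecodes("lol at 1:23")` yields `[83]`, a value a player could seek to directly.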
Compact bag-of-words visual representation for effective linear classification BIBAFull-Text 521-524
  Xiaodan Zhuang; Shuang Wu; Pradeep Natarajan
Bag-of-words approaches have been shown to achieve state-of-the-art performance in large-scale multimedia event detection. However, the commonly used histogram representation of bag-of-words requires large codebook sizes and expensive nonlinear kernel based classifiers for optimal performance. To address these two issues, we present a two-part generative model for compact visual representation, based on the i-vector approach recently proposed for speech and audio modeling. First, we use a Gaussian mixture model (GMM) to model the joint distribution of local descriptors. Second, we use a low-dimensional factor representation that constrains the GMM parameters to a subspace that preserves most of the information. We further extend this method to incorporate overlapping spatial regions, forming a highly compact visual representation that achieves superior performance with fast linear classifiers. We evaluate the method on a large video dataset used in the TRECVID 2011 MED evaluation. With linear classifiers, the proposed representation, with one-tenth of the storage footprint, outperforms soft quantization histograms used in the top performing TRECVID 2011 MED systems.
Large-scale web video shot ranking based on visual features and tag co-occurrence BIBAFull-Text 525-528
  Do Hang Nga; Keiji Yanai
In this paper, we propose a novel ranking method, VisualTextualRank, which extends [1] and [2]. Our method is based on a random walk over a bipartite graph to effectively integrate visual information of video shots and tag information of Web videos. Note that instead of treating the textual information as an additional feature for shot ranking, we explore the mutual reinforcement between shots and the textual information of their corresponding videos to improve shot ranking. We apply our proposed method to a system that automatically extracts relevant video shots of specific actions from Web videos [3]. Based on our experimental results, we demonstrate that our ranking method can improve the performance of video shot retrieval.
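Mutual reinforcement on a bipartite graph can be illustrated with a generic two-step random walk between shots and tags. This is a minimal sketch, not VisualTextualRank itself: it uses only shot-tag edges (no visual similarity), assumes every shot has at least one tag, and the name `co_rank` is hypothetical.

```python
from typing import Dict, List


def co_rank(shot_tags: Dict[str, List[str]], iterations: int = 50) -> Dict[str, float]:
    """Score shots by alternately propagating score mass from shots to tags
    and back along the edges of a shot-tag bipartite graph."""
    tags = sorted({t for ts in shot_tags.values() for t in ts})
    tag_shots = {t: [s for s, ts in shot_tags.items() if t in ts] for t in tags}
    shot_score = {s: 1.0 / len(shot_tags) for s in shot_tags}
    for _ in range(iterations):
        # Each shot splits its score evenly over its tags.
        tag_score = {
            t: sum(shot_score[s] / len(shot_tags[s]) for s in tag_shots[t])
            for t in tags
        }
        # Each tag splits its score evenly over the shots that carry it.
        shot_score = {
            s: sum(tag_score[t] / len(tag_shots[t]) for t in shot_tags[s])
            for s in shot_tags
        }
    return shot_score


scores = co_rank({"s1": ["jump", "dance"], "s2": ["jump"], "s3": ["dance", "jump"]})
```

With tag edges alone the walk settles on scores proportional to how many tags a shot has, which is why a practical ranker would also weight edges by visual or textual relevance.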
Locality preserving verification for image search BIBAFull-Text 529-532
  Shanmin Pang; Jianru Xue; Nanning Zheng; Qi Tian
Establishing correct correspondences between two images has a wide range of applications, such as 2D and 3D registration, structure from motion, and image retrieval. In this paper, we propose a new matching method based on spatial constraints. The proposed method has linear time complexity and is efficient when applied to image retrieval. The main assumption behind our method is that the local geometric structure among a feature point and its neighbors is not easily affected by either geometric or photometric transformations, and thus should be preserved in their corresponding images. We model this local geometric structure by linear coefficients that reconstruct the point from its neighbors. The method is flexible, as it can not only estimate the number of correct matches between two images efficiently, but also determine the correctness of each match accurately. Furthermore, it is simple and easy to implement. When applied to re-ranking images in an image search engine, the proposed method outperforms state-of-the-art techniques.
Undo the codebook bias by linear transformation for visual applications BIBAFull-Text 533-536
  Chunjie Zhang; Yifan Zhang; Shuhui Wang; Junbiao Pang; Chao Liang; Qingming Huang; Qi Tian
The bag of visual words model (BoW) and its variants have demonstrated their effectiveness for visual applications and have been widely used by researchers. The BoW model first extracts local features and generates the corresponding codebook; the elements of a codebook are viewed as visual words. The local features within each image are then encoded to get the final histogram representation. However, the codebook is dataset dependent and has to be generated for each image dataset. This costs a lot of computational time and weakens the generalization power of the BoW model. To solve these problems, in this paper, we propose to undo the dataset bias by codebook linear transformation. To represent every point within the local feature space using Euclidean distance, the number of bases should be no less than the number of space dimensions. Hence, each codebook can be viewed as a linear transformation of these bases. In this way, we can transform the pre-learned codebooks for a new dataset. However, not all of the visual words are equally important for the new dataset; it would be more effective to make a selection using sparsity constraints and choose the most discriminative visual words for transformation. We propose an alternative optimization algorithm to jointly search for the optimal linear transformation matrices and the encoding parameters. Image classification experimental results on several image datasets show the effectiveness of the proposed method.
GLocal structural feature selection with sparsity for multimedia data understanding BIBAFull-Text 537-540
  Yan Yan; Zhongwen Xu; Gaowen Liu; Zhigang Ma; Nicu Sebe
The selection of discriminative features is an important and effective technique for many multimedia tasks. Using irrelevant features in classification or clustering tasks could deteriorate the performance. Thus, designing efficient feature selection algorithms to remove the irrelevant features is a possible way to improve the classification or clustering performance. With the successful usage of sparse models in image and video classification and understanding, imposing structural sparsity in feature selection has been widely investigated during the past years. Motivated by the merit of sparse models, we propose a novel feature selection method using a sparse model in this paper. Unlike the state of the art, our method is built upon the l2,p-norm and simultaneously considers both the global and local (GLocal) structures of the data distribution. Our method is more flexible in selecting the discriminating features as it is able to control the degree of sparseness. Moreover, considering both global and local structures of the data distribution makes our feature selection process more effective. An efficient algorithm is proposed to solve the l2,p-norm sparsity optimization problem in this paper. Experiments performed on real-world image and video datasets show the effectiveness of our feature selection method compared to several state-of-the-art methods.
Score-informed audio decomposition and applications BIBAFull-Text 541-544
  Jonathan Driedger; Harald Grohganz; Thomas Prätzlich; Sebastian Ewert; Meinard Müller
The separation of different sound sources from polyphonic music recordings constitutes a complex task since one has to account for different musical and acoustical aspects. In recent years, various score-informed procedures have been suggested where musical cues such as pitch, timing, and track information are used to support the source separation process. In this paper, we discuss a framework for decomposing a given music recording into notewise audio events which serve as elementary building blocks. In particular, we introduce an interface that employs the additional score information to provide a natural way for a user to interact with these audio events. By simply selecting arbitrary note groups within the score, a user can access, modify, or analyze corresponding events in a given audio recording. In this way, our framework not only opens up new ways for audio editing applications, but also serves as a valuable tool for evaluating and better understanding the results of source separation algorithms.
Background subtraction via coherent trajectory decomposition BIBAFull-Text 545-548
  Zhixiang Ren; Liang-Tien Chia; Deepu Rajan; Shenghua Gao
Background subtraction, the task of detecting moving objects in a scene, is an important step in video analysis. In this paper, we propose an efficient background subtraction method based on coherent trajectory decomposition. We assume that the trajectories from the background lie in a low-rank subspace, and that foreground trajectories are sparse outliers to this background subspace. Meanwhile, a Markov Random Field (MRF) is used to encode the spatial coherency and trajectory consistency. With the low-rank decomposition and the MRF, our method can better handle videos with a moving camera and obtain a coherent foreground. Experimental results on a video dataset show that our method achieves very competitive performance.
Motion matters: a novel framework for compressing surveillance videos BIBAFull-Text 549-552
  Xiaojie Guo; Siyuan Li; Xiaochun Cao
Currently, video surveillance plays a very important role in the fields of public safety and security. Storing these videos, which usually contain extremely long sequences, requires huge space. Video compression techniques such as H.264/AVC can relieve the storage load to some extent. However, existing codecs are not sufficiently effective and efficient for encoding surveillance videos, as they do not specifically consider the characteristics of surveillance videos, i.e. that the background is highly redundant. This paper introduces a novel framework for compressing such videos. We first train a background dictionary based on a small number of observed frames. With the trained background dictionary, we then separate every frame into the background and motion (foreground), and store the compressed motion together with the reconstruction coefficient of the background corresponding to the background dictionary. The decoding is carried out on the encoded frame in an inverse procedure. The experimental results on extensive surveillance videos demonstrate that our proposed method significantly reduces the size of videos while achieving much higher PSNR than state-of-the-art codecs.
Spatialized audio multiparty teleconferencing with commodity miniature microphone array BIBAFull-Text 553-556
  Viet Anh Nguyen; Shengkui Zhao; Tien Dung Vu; Douglas L. Jones; Minh N. Do
This paper presents a Spatialized Audio Multiparty Teleconferencing (SAMT) system that offers a radically new communication experience for group teleconferencing. The system includes our recently developed 3D audio technologies: 3D sound source localization (SSL) and 3D audio capture and reproduction using a low-cost, compact microphone array. In essence, the SAMT system offers 3D audio capture capability and spatial audio perception with multiple participants at a site, capabilities that teleconferencing solutions still lack. In addition to being able to identify and automatically track the active speaker, the system allows more compelling visual presentation for effective communication. Requiring only a low-cost microphone array and a consumer depth camera, the proposed system runs reliably and comfortably in real time on a commodity laptop or desktop PC. With such a minimal deployment requirement, we present a variety of user experiences created by SAMT.
Learning articulated body models for people re-identification BIBAFull-Text 557-560
  Davide Baltieri; Roberto Vezzani; Rita Cucchiara
People re-identification is a challenging problem in surveillance and forensics. It aims at associating multiple instances of the same person which have been acquired from different points of view and after a temporal gap. Image-based appearance features are usually adopted but, in addition to their intrinsically low discriminability, they are subject to perspective and view-point issues. We propose to completely change the approach by mapping local descriptors extracted from RGB-D sensors on a 3D body model for creating a view-independent signature. An original bone-wise color descriptor is generated and reduced with PCA to compute the person signature. The virtual bone set used to map appearance features is learned using a recursive splitting approach. Finally, people matching for re-identification is performed using Relaxed Pairwise Metric Learning, which simultaneously provides feature reduction and weighting. Experiments on a specific dataset created with the Microsoft Kinect sensor and the OpenNI libraries prove the advantages of the proposed technique with respect to state-of-the-art methods based on 2D or non-articulated 3D body models.
Facial landmark localization based on hierarchical pose regression with cascaded random ferns BIBAFull-Text 561-564
  Zhanpeng Zhang; Wei Zhang; Jianzhuang Liu; Xiaoou Tang
The main challenge of facial landmark localization in real-world applications is that large changes of head pose and facial expression cause substantial image appearance variations. To avoid high dimensional regression in the 3D and 2D facial pose spaces simultaneously, we propose a hierarchical pose regression approach, estimating the head rotation, facial components and landmarks hierarchically. The regression process works in a unified cascaded fern framework. We present generalized gradient boosted ferns (GBFs) for the regression framework, which give better performance than traditional ferns. The framework also achieves real-time performance. We verify our method on the latest benchmark datasets. The results show that it outperforms state-of-the-art methods in both accuracy and speed.
Image context discovery from socially curated contents BIBAFull-Text 565-568
  Akisato Kimura; Katsuhiko Ishiguro; Makoto Yamada; Alejandro Marcos Alvarez; Kaori Kataoka; Kazuhiko Murasaki
This paper proposes a novel method of discovering a set of image contents sharing a specific context (attributes or implicit meaning) with the help of image collections obtained from social curation platforms. Socially curated contents are promising for analyzing various kinds of multimedia information, since they are manually filtered and organized based on specific individual preferences, interests or perspectives. Our proposed method fully exploits the process of social curation: (1) how image contents are manually grouped together by users, and (2) how image contents are distributed in the platform. Our method reveals that image contents with a specific context are naturally grouped together and that every image content includes a wide variety of contexts that cannot necessarily be verbalized in text. A preliminary experiment with a small collection of a million images yields a promising result.
Moment feature based forensic detection of resampled digital images BIBAFull-Text 569-572
  Lu Li; Jianru Xue; Zhiqiang Tian; Nanning Zheng
Forensic detection of resampled digital images has become an important technology among many others to establish the integrity of digital visual content. This paper proposes a moment feature based method to detect resampled digital images. Rather than concentrating on the positions of characteristic resampling peaks, we utilize a moment feature to exploit the periodic interpolation characteristics in the frequency domain. Not only the positions of resampling peaks but also the amplitude distribution is taken into consideration. With the extracted moment feature, a trained SVM classifier is used to detect resampled digital images. Extensive experimental results show the validity and efficiency of the proposed method.
Towards precise POI localization with social media BIBAFull-Text 573-576
  Adrian Popescu; Aymen Shabou
Points of interest (POIs) are a core component of geographical databases and of location based services. POI acquisition was performed by domain experts but associated costs and access difficulties in many regions of the world reduce the coverage of manually built geographical databases. With the availability of large geotagged multimedia datasets on the Web, a sustained research effort was dedicated to automatic POI discovery and characterization. However, in spite of its practical importance, POI localization was only marginally addressed. To compute POI coordinates, an assumption was made that the more data were available, the more precise the localization would be. Here we shift the focus of the process from data quantity to data quality. Given a set of geotagged Flickr photos associated with a POI, close-up classification is used to trigger a spatial clustering process. To evaluate the newly introduced method against other localization schemes, we create an accurate ground truth. We show that significant localization error reductions are obtained compared to a coordinate averaging approach and to an X-Means clustering scheme.
Sim-min-hash: an efficient matching technique for linking large image collections BIBAFull-Text 577-580
  Wan-Lei Zhao; Hervé Jégou; Guillaume Gravier
One of the most successful methods to link all similar images within a large collection is min-Hash, which is a way to significantly speed up the comparison of images when the underlying image representation is bag-of-words. However, the quantization step of min-Hash introduces significant information loss. In this paper, we propose a generalization of min-Hash, called Sim-min-Hash, to compare sets of real-valued vectors. We demonstrate the effectiveness of our approach when combined with the Hamming embedding similarity. Experiments on large-scale popular benchmarks demonstrate that Sim-min-Hash is more accurate and faster than min-Hash for similar image search. Linking a collection of one million images described by 2 billion local descriptors is done in 7 minutes on a single core machine.
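For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of standard min-Hash over sets of visual-word IDs. It is not the Sim-min-Hash method of the paper; the function names, the linear hash family, and the toy word IDs are all assumptions made for illustration.

```python
import random

PRIME = 2_147_483_647  # a large prime; visual-word IDs are assumed to be smaller


def make_hash_funcs(num_hashes, seed=0):
    """Build random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_signature(visual_words, hash_funcs):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(h(w) for w in visual_words) for h in hash_funcs]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature entries estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


hash_funcs = make_hash_funcs(256)
image_a = {3, 17, 42, 99, 512}   # hypothetical bag-of-words (as a set)
image_b = {3, 17, 42, 99, 700}   # shares 4 of 6 distinct words with image_a
sig_a = minhash_signature(image_a, hash_funcs)
sig_b = minhash_signature(image_b, hash_funcs)
print(estimated_jaccard(sig_a, sig_b))  # an estimate of the true Jaccard, 4/6
```

Because each image is reduced to a set of quantized word IDs before hashing, any information lost in quantization is invisible to the signature, which is the limitation the abstract refers to.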
Recognizing the royals: leveraging computerized face recognition for identifying subjects in ancient artworks BIBAFull-Text 581-584
  Ramya Srinivasan; Amit Roy-Chowdhury; Conrad Rudolph; Jeanette Kohl
We present a work that explores the feasibility of automated face recognition technologies for analyzing identities in works of portraiture, and in the process provide additional evidence to settle some long-standing questions in art history. Works of portrait art bear the mark of visual interpretation of the artist. Moreover, the number of samples available to model these effects is often limited. From a set of portraiture of the Renaissance and Baroque periods, where the identities of subjects are known, we derive appropriate features that are based on domain knowledge of artistic renderings, and learn and validate statistical models for the distribution of the match and non-match scores, which we refer to as portrait feature space (PFS). Thereafter, we use this PFS on a number of cases that have been "open questions" to art historians. They are usually in the form of validating two portraits as belonging to the same person. Using statistical hypothesis tests on the PFS, we provide quantitative measures of similarity for each of these questions. It is, to the best of our knowledge, the first study that applies automated face recognition technologies to the analysis of portraits of multiple subjects in various forms -- paintings, death masks, sculptures.
CollARt: a tool for creating 3D photo collages using mobile augmented reality BIBAFull-Text 585-588
  Asier Marzo; Oscar Ardaiz
A collage is an artistic composition made by assembling different parts to create a new whole. This procedure can be applied to assembling three-dimensional objects. In this paper we present CollARt, a Mobile Augmented Reality application which allows users to create 3D photo collages. Virtual pieces are textured with pictures taken with the camera and can be blended with real objects. A preliminary user study (N=12) revealed that participants were able to create interesting works of art. The evaluation also suggested that the possibility of itinerantly mixing virtual pieces with the real world increases creativity.
Spatio-temporal fisher vector coding for surveillance event detection BIBAFull-Text 589-592
  Qiang Chen; Yang Cai; Lisa Brown; Ankur Datta; Quanfu Fan; Rogerio Feris; Shuicheng Yan; Alex Hauptmann; Sharath Pankanti
We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, called MoSIFT and generated by pair-wise video frames. A Gaussian Mixture Model (GMM) is learned to model the distribution of the low level features. Then for each sliding window, the Fisher vector encoding is used to generate the sample representation. The model is learnt using a Linear SVM for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low level visual features as a Gaussian Mixture Model and to generate an intermediate vector representation for bag of features. FV encoding uses higher order statistics in place of histograms in the standard BoW. FV has several good properties: (a) it can naturally separate the video specific information from the noisy local features and (b) we can use a linear model for this representation. We build an efficient implementation for FV encoding which can attain a 10 times speed-up over real-time. We also take advantage of non-trivial object localization techniques to feed into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the results of all other teams' submissions in TRECVID SED 2012 on four of the seven event types.
Efficient image and tag co-ranking: a Bregman divergence optimization method BIBAFull-Text 593-596
  Lin Wu; Yang Wang; John Shepherd
Ranking in image search has attracted considerable attention. Many graph-based algorithms have been proposed to solve this problem. Despite their remarkable success, these approaches are restricted to their separated image networks. To improve the ranking performance, one effective strategy is to work beyond the separated image graph by leveraging fruitful information from manual semantic labeling (i.e., tags) associated with images, which leads to the technique of co-ranking images and tags, a representative method that aims to explore the reinforcing relationship between image and tag graphs. The idea of co-ranking is implemented by adopting the paradigm of random walks. However, two problems in co-ranking remain open: the high computational complexity and the out-of-sample problem. To address the challenges above, in this paper, we cast the co-ranking process into a Bregman divergence optimization framework under which we transform the original random walk into an equivalent optimal kernel matrix learning problem. Enhanced by this new formulation, we derive a novel extension to achieve a better performance for both in-sample and out-of-sample cases. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our approach.
Real-time privacy-preserving moving object detection in the cloud BIBAFull-Text 597-600
  Kuan-Yu Chu; Yin-Hsi Kuo; Winston H. Hsu
With the advance of cloud computing, a growing number of applications have been migrating to the cloud for its robustness and scalability. However, sending raw data to cloud-based service providers generally puts our privacy at risk, especially for cloud-based surveillance systems, where privacy is one of the major concerns because daily life is continuously recorded. Thus, privacy-preserving intelligent analytics are in dire need. In this preliminary research, we investigate real-time privacy-preserving moving object detection in the encrypted cloud-based surveillance videos. Moving object detection is one of the core techniques and can further enable other applications (e.g., object tracking, action recognition, etc.). One possible approach is using homomorphic encryption which provides corresponding operations between unencrypted and encrypted data. However, homomorphic encryption is impractical in real cases because of formidable computations and bulky storage consumption. In this paper, we propose an efficient and secure encryption framework, which entails real-time analytics (e.g., moving object detection) in encrypted video streams. Experiments confirm that the proposed method can achieve accuracy similar to detection on original raw frames.
With one look: robust face recognition using single sample per person BIBAFull-Text 601-604
  De-An Huang; Yu-Chiang Frank Wang
In this paper, we address the problem of robust face recognition using single sample per person. Given only one training image per subject of interest, our proposed method is able to recognize query images with illumination or expression changes, or even the corrupted ones due to occlusion. In order to model the above intra-class variations, we advocate the use of external data (i.e., images of subjects not of interest) for learning an exemplar-based dictionary. This dictionary provides auxiliary yet representative information for handling intra-class variation, while the gallery set containing one training image per class preserves separation between different subjects for recognition purposes. Our experiments on two face datasets confirm the effectiveness and robustness of our approach, which is shown to outperform state-of-the-art sparse representation based methods.
Weakly-supervised multi-class object detection using multi-type 3D features BIBAFull-Text 605-608
  Asako Kanezaki; Yasuo Kuniyoshi; Tatsuya Harada
We propose a weakly-supervised learning method for object detection using color and depth images of a real environment attached with object labels. The proposed method applies Multiple Instance Learning to find proper instances of the objects in training images. This method is novel in the sense that it learns multiple objects simultaneously in a way to balance the scores of each training sample across all object classes. Moreover, we combine 3D features considering different properties, that is, color texture, grayscale texture, and surface curvature, to improve the performance. We show that our method surpasses a conventional method using color and depth images. Furthermore, we evaluate its performance with our new dataset consisting of color and depth images with weak labels of 100 objects and various backgrounds.
Querying for video events by semantic signatures from few examples BIBAFull-Text 609-612
  Masoud Mazloom; Amirhossein Habibian; Cees G. M. Snoek
We aim to query web video for complex events using only a handful of video query examples, where the standard approach learns a ranker from hundreds of examples. We consider a semantic signature representation, consisting of off-the-shelf concept detectors, to capture the variance in semantic appearance of events. Since it is unknown what similarity metric and query fusion to use in such an event retrieval setting, we perform three experiments on unconstrained web videos from the TRECVID event detection task. It reveals that: retrieval with semantic signatures using normalized correlation as similarity metric outperforms a low-level bag-of-words alternative, multiple queries are best combined using late fusion with an average operator, and event retrieval is preferred over event classification when less than eight positive video examples are available.
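The retrieval recipe summarized in this abstract (normalized correlation as the similarity, late fusion of multiple queries by averaging) can be sketched as follows. This is only an illustration: normalized correlation is taken here to mean the mean-centered (Pearson) correlation, which is an assumption, and the signature values, video names, and function names are hypothetical.

```python
import math


def normalized_correlation(u, v):
    """Mean-centered correlation between two semantic signatures."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den if den > 0 else 0.0


def late_fusion_average(query_signatures, video_signature):
    """Score a video against each query separately, then average the scores."""
    scores = [normalized_correlation(q, video_signature)
              for q in query_signatures]
    return sum(scores) / len(scores)


def rank_videos(query_signatures, collection):
    """Return video names sorted by fused similarity, best first."""
    scored = {name: late_fusion_average(query_signatures, sig)
              for name, sig in collection.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical concept-detector scores (one value per concept).
queries = [[0.9, 0.1, 0.8, 0.2], [0.8, 0.2, 0.7, 0.1]]
collection = {
    "video_a": [0.85, 0.15, 0.75, 0.2],  # similar concept profile
    "video_b": [0.1, 0.9, 0.2, 0.8],     # opposite concept profile
}
print(rank_videos(queries, collection))
```

Late fusion keeps each query example's evidence separate until the final averaging step, in contrast to early fusion, which would merge the query signatures before computing a single similarity.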
Towards cover group thumbnailing BIBAFull-Text 613-616
  Peter Grosche; Meinard Müller; Joan Serrà
In this paper we investigate whether we can extract the commonalities shared by a group of cover songs or versions of the same musical piece. As a main contribution, we introduce the concept of cover group thumbnail, which is the most representative, essential subsequence for an entire group of versions. Opposed to previous approaches, we jointly consider all versions of a given song to compute a single cover group template, which then shows a high degree of robustness against version-specific aspects. To compute such a template, we introduce a modification of a recent audio thumbnailing technique. To evaluate the reliability of our conceptual contribution, we consider the task of template-based version identification, where we show comparable accuracies to existing systems.
Multi-feature canonical correlation analysis for face photo-sketch image retrieval BIBAFull-Text 617-620
  Dihong Gong; Zhifeng Li; Jianzhuang Liu; Yu Qiao
Automatic face photo-sketch image retrieval has attracted great attention in recent years due to its important applications in real life. The major difficulty in automatic face photo-sketch image retrieval lies in the fact that there exists great discrepancy between the different image modalities (photo and sketch). In order to reduce such discrepancy and improve the performance of automatic face photo-sketch image retrieval, we propose a new framework called multi-feature canonical correlation analysis (MCCA) to effectively address this problem. The MCCA is an extension and improvement of the canonical correlation analysis (CCA) algorithm using multiple features combined with two different random sampling methods in feature space and sample space. In this framework, we first represent each photo or sketch using a patch-based local feature representation scheme, in which histograms of oriented gradients (HOG) and multi-scale local binary pattern (MLBP) serve as the local descriptors. Canonical correlation analysis (CCA) is then performed on a collection of random subspaces to construct an ensemble of classifiers for photo-sketch image retrieval. Extensive experiments on two public-domain face photo-sketch datasets (CUFS and CUFSF) clearly show that the proposed approach obtains a substantial improvement over the state-of-the-art.
Hand and foot gesture interaction for handheld devices BIBAFull-Text 621-624
  Zhihan Lu; Muhammad Sikandar Lal Khan; Shafiq Ur Réhman
In this paper we present a hand and foot based immersive multimodal interaction approach for handheld devices. A smart phone based immersive football game is designed as a proof of concept. Our proposed method combines input modalities (i.e. hand & foot) and provides a coordinated output to both modalities along with audio and video. In this work, human foot gestures are detected and tracked using a template matching method and the Tracking-Learning-Detection (TLD) framework. We evaluated our system's usability through a user study in which we asked participants to evaluate the proposed interaction method. Our preliminary evaluation demonstrates the efficiency and ease of use of the proposed multimodal interaction approach.
AirTouch panel: a re-anchorable virtual touch panel BIBAFull-Text 625-628
  Shih-Yao Lin; Chuen-Kai Shie; Shen-Chi Chen; Yi-Ping Hung
To achieve maximum mobility, device-less approaches for home appliance remote control have received increasing attention in recent years. In this paper, we propose a screen-less virtual touch panel, called AirTouch Panel, which can be positioned at any place with various orientations around users. The proposed virtual touch panel provides a potential ability to remotely control the home appliances, such as television, air conditioner, and so on. The proposed system allows users to anchor the panel at the place with comfortable poses. If the users want to change panel's position or orientation, they only need to re-anchor it, and then the panel will be reset. In this paper, our main contribution is to design a re-anchorable virtual panel for digital home remote control. Most importantly, we explore the design of such imaginary interface through two user studies. In our user studies, we analyze task completion time, satisfaction rate, and the number of miss-clicks. We are interested in the feasibility issues, for example, proper click gesture, panel size and button size, etc. Moreover, based on the AirTouch Panel, we also developed an intelligent TV to demonstrate the usability for controlling home appliance.
Creation of individual photo selections: read preferences from the users' eyes BIBAFull-Text 629-632
  Tina Walber; Chantal Neuhaus; Steffen Staab; Ansgar Scherp; Ramesh Jain
The automated selection of satisfying subsets from large collections of photos is a central challenge in multimedia research. Objective criteria like the depiction of persons or the photo quality are met by existing approaches. But it is difficult to know the users' personal interest, which plays an important role in the selection process. The expected spread of devices with eye tracking support in the near future allows us to measure this interest in a new way. In an experiment with 12 participants, we derive the most interesting photos of a collection for every person from gaze information recorded during the free viewing of the photos. We can show that the eye tracking information delivers valuable information about the users' preferences by comparing the results to a manual selection. The selection based on gaze information significantly outperforms baseline approaches and improves the results by up to 17%. For photo sets of personal interest this improvement is even up to 23%.
Strong geometrical consistency in large scale partial-duplicate image search BIBAFull-Text 633-636
  Junqiang Wang; Jinhui Tang; Yu-Gang Jiang
The state-of-the-art partial-duplicate image search systems rely heavily on the matching of local features like SIFT. Independently matching local features across two images ignores the overall geometric structure and therefore may incur many false matches. To reduce such matches, several geometry verification methods have been proposed. This paper introduces a new geometry verification method named Strong Geometry Consistency (SGC), which uses the orientation, scale and location information of the local feature points to accurately and quickly remove the false matches. We also propose a simple scale weighting (SW) strategy, which gives feature points with larger scales greater weights, based on the intuition that a larger-scale feature point tends to be more robust for image search as it occupies a larger area of an image. Extensive experiments performed on three popular datasets show that SGC significantly outperforms state-of-the-art geometry verification methods, and SW can further boost the performance with marginal additional computation.
Segmental multi-way local pooling for video recognition BIBAFull-Text 637-640
  Ilseo Kim; Sangmin Oh; Arash Vahdat; Kevin Cannons; A. G. Amitha Perera; Greg Mori
In this work, we address the problem of complex event detection on unconstrained videos. We introduce a novel multi-way feature pooling approach which leverages segment-level information. The approach is simple and widely applicable to diverse audio-visual features. Our approach uses a set of clusters discovered via unsupervised clustering of segment-level features. Depending on feature characteristics, not only scene-based clusters but also motion/audio-based clusters can be incorporated. Then, every video is represented with multiple descriptors, where each descriptor is designed to relate to one of the pre-built clusters. For classification, intersection kernel SVMs are used where the kernel is obtained by combining multiple kernels computed from corresponding per-cluster descriptor pairs. Evaluation on TRECVID'11 MED dataset shows a significant improvement by the proposed approach beyond the state-of-the-art.
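The kernel construction mentioned at the end of this abstract can be illustrated with a small sketch. The histogram intersection kernel below is the standard definition; combining the per-cluster kernels by a plain average is an assumption made for illustration, since the abstract does not say how the kernels are combined. The descriptor values are hypothetical.

```python
def intersection_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def combined_kernel(video_a, video_b):
    """Average the intersection kernels of corresponding per-cluster descriptors.

    Each video is a list of descriptors, one per pre-built cluster.
    Averaging is an assumed combination rule, not taken from the paper.
    """
    per_cluster = [intersection_kernel(da, db)
                   for da, db in zip(video_a, video_b)]
    return sum(per_cluster) / len(per_cluster)


# Two videos, each with one descriptor per cluster (2 clusters, 3 bins each).
video_a = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
video_b = [[0.4, 0.4, 0.2], [0.6, 0.2, 0.2]]
print(combined_kernel(video_a, video_b))
```

The resulting value can be placed in a precomputed Gram matrix and passed to any SVM implementation that accepts custom kernels.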
Efficient video quality assessment based on spacetime texture representation BIBAFull-Text 641-644
  Peng Peng; Kevin Cannons; Ze-Nian Li
Most existing video quality metrics measure temporal distortions based on optical-flow estimation, which typically has limited descriptive power of visual dynamics and low efficiency. This paper presents a unified and efficient framework to measure temporal distortions based on a spacetime texture representation of motion. We first propose an effective motion-tuning scheme to capture temporal distortions along motion trajectories by exploiting the distributive characteristic of the spacetime texture. Then we reuse the motion descriptors to build a self-information based spatiotemporal saliency model to guide the spatial pooling. Finally, a comprehensive quality metric is developed by combining the temporal distortion measure with the spatial distortion measure. Our method demonstrates high efficiency and excellent correlation with the human perception of video quality.
Fitted spectral hashing BIBAFull-Text 645-648
  Yu Wang; Sheng Tang; Yalin Zhang; JinTao Li; DanYi Chen
Spectral hashing (SpH) is an efficient and simple binary hashing method, which assumes that data are sampled from a multidimensional uniform distribution. However, this assumption is too restrictive in practice. In this paper we propose an improved method, Fitted Spectral Hashing, to relax this distribution assumption. Our work is based on the fact that one-dimensional data of any distribution can be mapped to a uniform distribution without changing the local neighbor relations among data items. We have found that this mapping on each PCA direction has a certain regular pattern and can be fitted well by an S-curve function, the Sigmoid function. With more parameters, a Fourier function also fits the data well. Thus, with the Sigmoid function and the Fourier function, we propose two binary hashing methods. Experiments show that our methods are efficient and outperform state-of-the-art methods.
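The fact this abstract relies on, that one-dimensional data can be mapped to a uniform distribution without disturbing the ordering of items, can be illustrated with a rank-based (empirical CDF) mapping. Note that this sketch uses raw ranks rather than the fitted Sigmoid or Fourier functions the paper proposes; the function name and sample values are assumptions.

```python
def to_uniform(values):
    """Map 1D values into (0, 1) by rank, i.e. via the empirical CDF.

    The mapping is monotone, so the relative order of items (and hence
    their local neighbor relations along this axis) is preserved.
    Ties receive distinct ranks in input order, a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    mapped = [0.0] * n
    for rank, index in enumerate(order):
        mapped[index] = (rank + 0.5) / n
    return mapped


# Heavily skewed projections along one (hypothetical) PCA direction.
projections = [10.0, 0.1, 3.0, 1000.0]
print(to_uniform(projections))  # [0.625, 0.125, 0.375, 0.875]
```

A fitted parametric curve, as described in the abstract, would play the same role as this rank lookup while also applying to unseen points.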
Using emotional context from article for contextual music recommendation BIBAFull-Text 649-652
  Chih-Ming Chen; Ming-Feng Tsai; Jen-Yu Liu; Yi-Hsuan Yang
This paper proposes a context-aware approach that recommends music to a user based on the user's emotional state predicted from the article the user writes. We analyze the association between user-generated text and music by using a real-world dataset with user, text, music tripartite information collected from the social blogging website LiveJournal. The audio information represents various perceptual dimensions of music listening, including danceability, loudness, mode, and tempo; the emotional text information consists of bag-of-words and three dimensional affective states within an article: valence, arousal and dominance. To combine these factors for music recommendation, a factorization machine-based approach is taken. Our evaluation shows that the emotional context information mined from user-generated articles does improve the quality of recommendation, compared to either the collaborative filtering approach or the content-based approach.
Revisiting the VLAD image representation BIBAFull-Text 653-656
  Jonathan Delhumeau; Philippe-Henri Gosselin; Hervé Jégou; Patrick Pérez
Recent works on image retrieval have proposed to index images by compact representations encoding powerful local descriptors, such as the closely related VLAD and Fisher vector. By combining such a representation with a suitable coding technique, it is possible to encode an image in a few dozen bytes while achieving excellent retrieval results. This paper revisits some assumptions proposed in this context regarding the handling of "visual burstiness", and shows that ad-hoc choices are implicitly done which are not desirable. Focusing on VLAD without loss of generality, we propose to modify several steps of the original design. Albeit simple, these modifications significantly improve VLAD and make it compare favorably against the state of the art.
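As background for this abstract, the baseline VLAD aggregation can be sketched as below: each local descriptor is assigned to its nearest codebook centroid, the residuals are summed per centroid, and the concatenated result is L2-normalized. This is the original formulation, not the modified steps the paper proposes; the tiny codebook and descriptors are hypothetical.

```python
import math


def nearest_centroid(descriptor, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((d - c) ** 2
                          for d, c in zip(descriptor, centroids[k])),
    )


def vlad(descriptors, centroids):
    """Sum residuals per centroid, concatenate, then L2-normalize."""
    dim = len(centroids[0])
    accumulators = [[0.0] * dim for _ in centroids]
    for desc in descriptors:
        k = nearest_centroid(desc, centroids)
        for j in range(dim):
            accumulators[k][j] += desc[j] - centroids[k][j]
    flat = [value for row in accumulators for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


centroids = [[0.0, 0.0], [10.0, 10.0]]               # toy 2-word codebook
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]  # toy local descriptors
print(vlad(descriptors, centroids))
```

A centroid that attracts many near-identical descriptors accumulates a large residual sum and can dominate the final vector, which is the burstiness issue the abstract alludes to.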
Human behavior sensing for tag relevance assessment BIBAFull-Text 657-660
  Mohammad Soleymani; Sebastian Kaltwang; Maja Pantic
Users react differently to non-relevant and relevant tags associated with content. These spontaneous reactions can be used for labeling large multimedia databases. We present a method to assess tag relevance to images using the non-verbal bodily responses, namely, electroencephalogram (EEG), facial expressions, and eye gaze. We conducted experiments in which 28 images were shown to 28 subjects once with correct and another time with incorrect tags. The goal of our system is to detect the responses to non-relevant tags and consequently filter them out. Therefore, we trained classifiers to detect the tag relevance from bodily responses. We evaluated the performance of our system using a subject independent approach. The precision at top 5% and top 10% detections were calculated and results of different modalities and different classifiers were compared. The results show that eye gaze outperforms the other modalities in tag relevance detection both overall and for top ranked results.
Robust facial expressions recognition using 3D average face and ameliorated AdaBoost BIBAFull-Text 661-664
  Jinhui Chen; Yasuo Ariki; Tetsuya Takiguchi
One of the most crucial techniques in Computer Vision is facial recognition, especially the automatic estimation of facial expressions. However, in real-time facial expression recognition, when a face turns sideways, expression feature extraction becomes difficult as the camera view changes, and recognition accuracy degrades significantly. Many conventional methods are based on static images or limited to situations in which the face is viewed from the front. In this paper, a method that combines Look-Up-Table (LUT) AdaBoost with the three-dimensional average face is proposed to solve the problem mentioned above. To evaluate the proposed method, an experiment comparing it with the conventional method was carried out. These approaches show promising results and very good success rates. This paper covers several methods that can improve results by making the system more robust.
Visual business recognition: a multimodal approach BIBAFull-Text 665-668
  Amir Roshan Zamir; Afshin Dehghan; Mubarak Shah
In this paper we investigate a new problem called visual business recognition. Automatic identification of businesses in images is an interesting task with plenty of potential applications especially for mobile device users. We propose a multimodal approach which incorporates business directories, textual information, and web images in a unified framework. We assume the query image is associated with a coarse location tag and utilize business directories for extracting an over complete list of nearby businesses which may be visible in the image. We use the name of nearby businesses as search keywords in order to automatically collect a set of relevant images from the web and perform image matching between them and the query. Additionally, we employ a text processing method customized for business recognition which is assisted by nearby business names; we fuse the information acquired from image matching and text processing in a probabilistic framework to recognize the businesses. We tested the proposed algorithm on a challenging set of user-uploaded and street view images with promising results for this new application.
3D view synthesis with inter-view consistency BIBAFull-Text 669-672
  David Wolinski; Olivier Le Meur; Josselin Gautier
In this paper, we propose a new pipeline to synthesize virtual views by extrapolation. It allows us to generate virtual views far away from each other, each presenting the exact same level of quality. This inter-view consistency is key to seamlessly navigate between viewpoints. Its computational cost is also lower than that of existing approaches. We compare the proposed approach with state-of-the-art methods and show the effectiveness of this new view synthesis pipeline.
Improving event detection using related videos and relevance degree support vector machines BIBAFull-Text 673-676
  Christos Tzelepis; Nikolaos Gkalelis; Vasileios Mezaris; Ioannis Kompatsiaris
In this paper, a new method that exploits related videos for the problem of event detection is proposed, where related videos are videos that are closely but not fully associated with the event of interest. In particular, the Weighted Margin SVM formulation is modified so that related class observations can be effectively incorporated in the optimization problem. The resulting Relevance Degree SVM is especially useful in problems where only a limited number of training observations is provided, e.g., for the EK10Ex subtask of TRECVID MED, where only ten positive and ten related samples are provided for the training of a complex event detector. Experimental results on the TRECVID MED 2011 dataset verify the effectiveness of the proposed method.
Consistent stereo image editing BIBAFull-Text 677-680
  Tao Yan; Shengfeng He; Rynson W. H. Lau; Yun Xu
Stereo images and videos have become very popular in recent years, and techniques for processing this media are attracting a lot of attention. In this paper, we extend the shift-map method for stereo image editing. Our method simultaneously processes the left and right images at the pixel level using a global optimization algorithm. It enforces photo consistency between the two images and preserves 3D scene structures. It also addresses the occlusion and disocclusion problem, which may enable many stereo image editing functions, such as depth mapping, object depth adjustment and non-homogeneous image resizing. Our experiments show that the proposed method produces high quality results in various editing functions.
Superpixel segmentation based structural scene recognition BIBAFull-Text 681-684
  Shuhui Bu; Zhenbao Liu; Junwei Han; Jun Wu
This paper presents a novel structural model based scene recognition method. To address the low content discriminability caused by the regular-grid image division used in previous scene recognition methods, we partition an image into a pre-defined set of regions by superpixel segmentation. Classification is then modelled by introducing a structural model which has the capability of organizing unordered features of image patches. In the implementation, CENTRIST, which is robust for scene recognition, is used as the original image feature, and a bag-of-words representation is used to capture the local appearances of an image. In addition, we incorporate the differences between adjacent superpixels as edge features. Our models are trained using structural SVM. Two state-of-the-art scene datasets are adopted to evaluate the proposed method. The experimental results show that the recognition accuracy is significantly improved by the proposed method.
Evaluation of salient point methods BIBAFull-Text 685-688
  Song Wu; Michael Lew
Processing visual content in images and videos is a challenging task associated with the development of modern computer vision. Because salient point approaches can represent distinctive and affine invariant points in images, many approaches have been proposed over the past decade. Each method has particular advantages and limitations and may be appropriate in different contexts. In this paper we evaluate the performance of a wide set of salient point detectors and descriptors. We begin by comparing diverse salient point algorithms (SIFT, SURF, BRIEF, ORB, FREAK, BRISK, STAR, GFTT and FAST) with regard to repeatability, recall and precision and then move to accuracy and stability in real-time video tracking.
Cross-media topic mining on wikipedia BIBAFull-Text 689-692
  Xikui Wang; Yang Liu; Donghui Wang; Fei Wu
As a collaborative wiki-based encyclopedia, Wikipedia provides a huge number of articles in various categories. In addition to their text corpus, Wikipedia also contains plenty of images, which make the articles more intuitive for readers to understand. To better organize these visual and textual data, one promising area of research is to jointly model the embedding topics across multi-modal data (i.e., cross-media) from Wikipedia. In this work, we propose to learn the projection matrices that map the data from heterogeneous feature spaces into a unified latent topic space. Different from previous approaches, by imposing l1 regularizers on the projection matrices, only a small number of relevant visual/textual words are associated with each topic, which makes our model more interpretable and robust. Furthermore, the correlations of Wikipedia data in different modalities are explicitly considered in our model. The effectiveness of the proposed topic extraction algorithm is verified by several experiments conducted on real Wikipedia datasets.
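One common way an l1 regularizer yields sparse matrices is the soft-thresholding (proximal) step used by many l1 solvers. The sketch below illustrates only that effect. The abstract does not say which solver the authors use, and the function name `soft_threshold`, the toy matrix, and the threshold `lam` are assumptions made for illustration.

```python
def soft_threshold(matrix, lam):
    """Apply the l1 proximal operator entrywise: sign(x) * max(|x| - lam, 0)."""
    result = []
    for row in matrix:
        new_row = []
        for x in row:
            magnitude = max(abs(x) - lam, 0.0)
            new_row.append(magnitude if x >= 0 else -magnitude)
        result.append(new_row)
    return result


# Hypothetical 2-topic x 4-word projection matrix (toy values, not from the paper).
projection = [[0.9, -0.05, 0.02, -0.7],
              [0.01, 0.6, -0.03, 0.04]]
sparse_projection = soft_threshold(projection, lam=0.1)

# Count how many word weights survive per topic after thresholding.
nonzero_per_topic = [sum(1 for x in row if x != 0) for row in sparse_projection]
```

Entries smaller in magnitude than `lam` are set to zero, so each topic keeps only its few strongest word associations, which is the interpretability effect the abstract describes.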
A multigrid approach for bandwidth and display resolution aware streaming of 3D deformations BIBAFull-Text 693-696
  Yuan Tian; Yin Yang; Xiaohu Guo; Balakrishnan Prabhakaran
In this paper, we propose a novel multimedia system that adaptively streams animation according to display resolution and/or network bandwidth. A Multigrid-like technique is used in this framework to accelerate the convergence rate of the optimization of the nonlinear deformation energy. The computation is performed from the coarsest mesh at the top level to the finest mesh at the bottom level and then goes back to the top again. Such V-shape calculation provides great flexibility for the networked environment. Clients are able to receive the data stream corresponding to their display resolution and network bandwidth. A more compact form of deformation data packaging is also used in this system such that a cube element only needs six parameters instead of 24 variables as used in regular mesh representation, which significantly reduces the network overhead for the streaming.
Error recovered hierarchical classification BIBAFull-Text 697-700
  Shiai Zhu; Xiao-Yong Wei; Chong-Wah Ngo
Hierarchical classification (HC) is a popular and efficient way of detecting semantic concepts in images. However, conventional HC, which always selects the branch with the highest classification response to go on, has the risk of propagating serious errors from higher levels of the hierarchy to the lower levels. We argue that the highest-response-first strategy is too arbitrary, because the candidate nodes are considered individually, which ignores the semantic relationship among them. In this paper, we propose a novel method for HC, which is able to utilize the semantic relationship among candidate nodes and their children to recover the responses of unreliable classifiers of the candidate nodes, with the hope of providing branch selection with a more globally valid and semantically consistent view. The experimental results show that the proposed method outperforms the conventional HC methods and achieves a satisfactory balance between accuracy and efficiency.
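The conventional highest-response-first descent that the abstract criticizes can be sketched as follows. This shows only the baseline, not the proposed error-recovery method. The tree layout, the `responses` scores, and all names are hypothetical.

```python
def greedy_descent(children, responses, root):
    """Follow the child with the highest classifier response until a leaf."""
    path = [root]
    node = root
    while children.get(node):
        node = max(children[node], key=lambda child: responses[child])
        path.append(node)
    return path


# Hypothetical concept hierarchy.
children = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car", "bus"],
}
# Toy responses: the top-level classifiers narrowly prefer "vehicle",
# even though the "dog" classifier responds very strongly.
responses = {"animal": 0.48, "vehicle": 0.52,
             "dog": 0.95, "cat": 0.10, "car": 0.30, "bus": 0.20}

path = greedy_descent(children, responses, "entity")
```

In this toy case the descent commits to "vehicle" at the first level and never examines the strong "dog" response, which is the kind of error propagation the abstract describes.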
Time matters!: capturing variation in time in video using fisher kernels BIBAFull-Text 701-704
  Ionut Mironica; Jasper Uijlings; Negar Rostamzadeh; Bogdan Ionescu; Nicu Sebe
In video, global features are often used for reasons of computational efficiency, where each global feature captures information of a single video frame. But frames in video change over time, so an important question is: how can we meaningfully aggregate frame-based features in order to preserve the variation in time? In this paper we propose to use the Fisher Kernel to capture variation in time in video. While in this approach the temporal order is lost, it captures both subtle variation in time, such as that caused by a moving bicycle, and drastic variations in time, such as the changing of shots in a documentary. Our work should not be confused with a Bag of Local Visual Features approach, where one captures the visual variation of local features in both time and space indiscriminately. Instead, each feature measures a complete frame; hence we capture variation in time only.
   We show that our framework is highly general, reporting improvements using frame-based visual features, body-part features, and audio features on three diverse datasets: We obtain state-of-the-art results on the UCF50 human action dataset and improve the state-of-the-art on the MediaEval 2012 video-genre benchmark and on the ADL daily activity recognition dataset.
A multimodal probabilistic model for gesture-based control of sound synthesis BIBAFull-Text 705-708
  Jules Françoise; Norbert Schnell; Frédéric Bevilacqua
In this paper, we propose a multimodal approach to create the mapping between gesture and sound in interactive music systems. Specifically, we propose to use a multimodal HMM to jointly model the gesture and sound parameters. Our approach is compatible with a learning method that allows users to define the gesture-sound relationships interactively. We describe an implementation of this method for the control of physical modeling sound synthesis. Our model shows promise for capturing expressive gesture variations while guaranteeing a consistent relationship between gesture and sound.
Modeling local descriptors with multivariate Gaussians for object and scene recognition BIBAFull-Text 709-712
  Giuseppe Serra; Costantino Grana; Marco Manfredi; Rita Cucchiara
Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description and compare its capabilities to histogram based approaches. We use the multivariate Gaussian distribution, applied over the SIFT descriptors, extracted with dense sampling on a spatial pyramid. Every distribution is converted to a high-dimensional descriptor by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. Experiments on Caltech-101 and ImageCLEF2011 are performed using the Stochastic Gradient Descent solver, which makes it possible to deal with large scale datasets and high dimensional feature spaces.
Anchor concept graph distance for web image re-ranking BIBAFull-Text 713-716
  Shi Qiu; Xiaogang Wang; Xiaoou Tang
Web image re-ranking aims to automatically refine the initial text-based image search results by employing visual information. A strong line of work in image re-ranking relies on building image graphs that require computing distances between image pairs. In this paper, we present Anchor Concept Graph Distance (ACG Distance), a novel distance measure for image re-ranking. For a given textual query, an Anchor Concept Graph (ACG) is automatically learned from the initial text-based search results. The nodes of the ACG (i.e., anchor concepts) and their correlations model the semantic structure of the images to be re-ranked well. Images are projected to the anchor concepts. The projection vectors undergo a diffusion process over the ACG, and are then used to compute the ACG distance. The ACG distance reduces the semantic gap and better represents distances between images. Experiments on the MSRA-MM and INRIA datasets show that the ACG distance consistently outperforms existing distance measures and significantly improves state-of-the-art methods in image re-ranking.
Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues BIBAFull-Text 717-720
  Esra Acar; Frank Hopfgartner; Sahin Albayrak
Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level in order to further improve the performance of violence detection. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as visual features. In order to learn a violence model, we choose two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using an SVM-based decision fusion.
A 3D tele-immersion streaming approach using skeleton-based prediction BIBAFull-Text 721-724
  Suraj Raghuraman; Karthik Venkatraman; Zhanyu Wang; Balakrishnan Prabhakaran; Xiaohu Guo
3D collaborative Tele-Immersive environments allow reconstruction of real world 3D scenes in the virtual world across multiple physical locations. This kind of reconstruction results in a lot of 3D data being transmitted over the internet in real time. Current systems allow for transmission only at low frame rates due to the large volume of data and network bandwidth restrictions. In this paper we propose a prediction based approach that generates future frames by animating the live model based on a few skeleton points. By doing so, the magnitude of data transmitted is reduced to a few hundred bytes. The prediction errors are corrected when an entire frame is received. This approach allows minimal amounts (a few bytes) of data to be transmitted per frame, thus allowing for high frame rates while still maintaining an acceptable visual quality of reconstruction at the receiver side.
Fast image/video collection summarization with local clustering BIBAFull-Text 725-728
  Shuhei Tarashima; Go Irie; Ken Tsutsuguchi; Hiroyuki Arai; Yukinobu Taniguchi
Image/video collection summarization is an emerging paradigm to provide an overview of contents stored in massive databases. Existing algorithms require at least O(N) time to generate a summary, which makes them inapplicable to online scenarios. Assuming that contents are represented as a sparse graph, we propose a fast image/video collection summarization algorithm using local graph clustering. After a query node is specified, our algorithm first finds a small sub-graph near the query without looking at the whole graph, and then selects a smaller number of nodes that are diverse from each other. Our algorithm thus provides a summary in nearly constant time in the number of contents. Experimental results demonstrate that our algorithm is more than 1500 times faster than a state-of-the-art method, with comparable summarization quality.
Spot the differences: from a photograph burst to the single best picture BIBAFull-Text 729-732
  H. Emrah Tasli; Jan C. van Gemert; Theo Gevers
With the rise of the digital camera, people nowadays typically take several near-identical photos of the same scene to maximize the chances of a good shot. This paper proposes a user-friendly tool for exploring a personal photo gallery and selecting or even creating the best shot of a scene from among its multiple alternatives. This functionality is realized through a graphical user interface where the best viewpoint can be selected from a generated panorama of the scene. Once the viewpoint is selected, the user is able to explore possible alternatives coming from the other images. Using this tool, one can explore a photo gallery efficiently. Moreover, additional compositions from other images are also possible. With such additional compositions, one can go from a burst of photographs to the single best one. Even funny compositions of images, where you can duplicate a person in the same image, are possible with our proposed tool.
Semantic pooling for complex event detection BIBAFull-Text 733-736
  Qian Yu; Jingen Liu; Hui Cheng; Ajay Divakaran; Harpreet Sawhney
Complex event detection is very challenging in open-source collections such as YouTube videos, which usually comprise very diverse visual contents involving various object, scene and action concepts. Not all of them, however, are relevant to the event. In other words, a video may contain a lot of "junk" information which is harmful for recognition. Hence, we propose a semantic pooling approach to tackle this issue. Unlike the conventional pooling over the entire video or specific spatial regions of a video, we employ a discriminative approach to acquire abstract semantic "regions" for pooling. For this purpose, we first associate low-level visual words with semantic concepts via their co-occurrence relationship. We then pool the low-level features separately according to their semantic information. The proposed semantic pooling strategy also provides a new mechanism for incorporating semantic concepts for low-level feature based event recognition. We evaluate our approach on the TRECVID MED [1] dataset and the results show that semantic pooling consistently improves the performance compared with conventional pooling strategies.
SwarmVision: autonomous aesthetic multi-camera interaction BIBAFull-Text 737-740
  George Legrady; Danny Bazo; Marco Pinter
A platform of exploratory networked robotic cameras was created, utilizing an aesthetic approach to experimentation. Initiated by research in autonomous swarm robotic camera behavior, SwarmVision is an installation consisting of multiple Pan-Tilt-Zoom cameras on rails positioned above spectators in an exhibition space, where each camera behaves autonomously based on its own rules of computer vision and control. Each of the cameras is programmed to detect visual information of interest based on a different algorithm, and each negotiates with the other two, influencing what subject matter to study in a collective way. The emergent behaviors of the system illustrate an ongoing process of scene reconstruction and video-based behavior generation.
Segmenting music through the joint estimation of keys, chords and structural boundaries BIBAFull-Text 741-744
  Johan Pauwels; Geoffroy Peeters
In this paper, we introduce a new approach to music structure segmentation that is based on the joint estimation of structural segments, keys and chords in one probabilistic framework. More precisely, the boundaries of a structure segment are determined by detecting key changes and by utilizing the difference in prior probability of chord transitions according to their position in a structural segment. In contrast to many of the recent approaches to structural segmentation, this system does not work with self-similarity matrices, although it has been designed to integrate this kind of approach into the framework at a later stage. However, just the current version of the system, using only the estimated harmony, is already producing encouraging results, especially with respect to the precise localization of the boundaries.
3D teleimmersive activity classification based on application-system metadata BIBAFull-Text 745-748
  Aadhar Jain; Ahsan Arefin; Raoul Rivas; Chien-nan Chen; Klara Nahrstedt
Being able to detect and recognize human activities is essential for 3D collaborative applications for efficient quality of service provisioning and device management. A broad range of research has been devoted to analyzing media data to identify human activity, which requires knowledge of the data format, application-specific coding techniques and computationally expensive image analysis. In this paper, we propose a human activity detection technique based on application generated metadata and related system metadata. Our approach does not depend on a specific data format or coding technique. We evaluate our algorithm with different cyber-physical setups, and show that we can achieve very high accuracy (above 97%) by using a good learning model.
Object co-segmentation via discriminative low rank matrix recovery BIBAFull-Text 749-752
  Yong Li; Jing Liu; Zechao Li; Yang Liu; Hanqing Lu
The goal of this paper is to simultaneously segment the object regions appearing in a set of images of the same object class, known as object co-segmentation. Unlike typical methods, which simply assume that the regions common among images are the object regions, we additionally consider the disturbance from consistent backgrounds, and take regions that are not only common but also salient among images to be the object regions. To this end, we propose a Discriminative Low Rank matrix Recovery (DLRR) algorithm to divide the over-completely segmented regions (i.e., superpixels) of a given image set into object and non-object ones. In DLRR, a low-rank matrix recovery term is adopted to detect salient regions in an image, while a discriminative learning term is used to distinguish the object regions from all the superpixels. An additional regularization term is introduced to jointly measure the disagreement between the predicted saliency and the objectiveness probability corresponding to each superpixel of the image set. For the unified learning problem that connects the above three terms, we design an efficient optimization procedure based on block-coordinate descent. Extensive experiments are conducted on two public datasets, i.e., MSRC and iCoseg, and the comparisons with some state-of-the-art methods demonstrate the effectiveness of our work.
πLDA: document clustering with selective structural constraints BIBAFull-Text 753-756
  Siliang Tang; Hanqi Wang; Jian Shao; Fei Wu; Ming Chen; Yueting Zhuang
Segments, such as sentence boundaries in texts or annotated regions in images, can be considered as useful structural constraints (i.e., priors) for unsupervised topic modeling. However, some segment units (e.g., words in texts or visual words in images) inside a given segment may be irrelevant to the topic of this segment due to their characteristics. This paper proposes a model called πLDA, which introduces a latent variable π into LDA, a traditional topic model, to capture the characteristic of each segment unit. That is to say, the πLDA model is used to determine whether a segment unit is assigned (or selected) to the topic embedded in its corresponding segment. Compared with other approaches that assume all the segment units in one segment share a common topic, our proposed πLDA has the selective ability to discover the discriminative segment units (e.g., informative words or visual words). Experimental results and their interpretations are presented to demonstrate the promising performance of our method.
Con-text: text detection using background connectivity for fine-grained object classification BIBAFull-Text 757-760
  Sezer Karaoglu; Jan C. van Gemert; Theo Gevers
This paper focuses on fine-grained classification by detecting photographed text in images. We introduce a text detection method that does not try to detect all possible foreground text regions but instead aims to reconstruct the scene background to eliminate non-text regions. Object cues such as color, contrast, and objectiveness are used in cooperation with a random forest classifier to detect background pixels in the scene. Results on two publicly available datasets, ICDAR03 and the fine-grained Building subcategories of ImageNet, show the effectiveness of the proposed method.
Relative spatial features for image memorability BIBAFull-Text 761-764
  Jongpil Kim; Sejong Yoon; Vladimir Pavlovic
Recent studies in image memorability showed that the memorability of an image is a measurable quantity and is closely correlated with semantic attributes. However, the intrinsic characteristics of memorability are not yet fully understood. It has been reported that, contrary to popular belief, unusualness or aesthetic beauty of the image may not be positively correlated with image memorability. This counter-intuitive characteristic of memorability hinders a better understanding of image memorability and its applicability. In this paper, we investigate two new spatial features that are closely correlated with image memorability yet intuitively explainable. We propose the Weighted Object Area (WOA), which jointly considers the location and size of objects, and the Relative Area Rank (RAR), which captures the relative unusualness of the size of objects. We empirically demonstrate their useful correlation with image memorability. Results show that both WOA and RAR can improve memorability prediction. In addition, we provide evidence that the RAR can effectively capture object-centric unusualness of size.
Automatic Egyptian hieroglyph recognition by retrieving images as texts BIBAFull-Text 765-768
  Morris Franken; Jan C. van Gemert
In this paper we propose an approach for automatically recognizing ancient Egyptian hieroglyphs from photographs. To this end we first manually annotated and segmented a large collection of nearly 4,000 hieroglyphs. In our automatic approach we localize and segment each individual hieroglyph, determine the reading order and subsequently evaluate 5 visual descriptors in 3 different matching schemes to evaluate visual hieroglyph recognition. In addition to visual-only cues, we use a corpus of Egyptian texts to learn language models that help re-rank the visual output.
Query-dependent visual dictionary adaptation for image reranking BIBAFull-Text 769-772
  Jialong Wang; Cheng Deng; Wei Liu; Rongrong Ji; Xiangyu Chen; Xinbo Gao
Although text-based image search engines are popular for ranking images of users' interest, the state-of-the-art ranking performance is still far from satisfactory. One major issue comes from the visual similarity metric used in the ranking operation, which depends solely on visual features. To tackle this issue, one feasible method is to incorporate semantic concepts, also known as image attributes, into image ranking. However, the optimal combination of visual features and image attributes remains unknown. In this paper, we propose a query-dependent image reranking approach that leverages higher-level attribute detection among the top returned images to adapt the dictionary built over the visual features in a query-specific fashion. We start by learning transposition probabilities between visual codewords and attributes offline, then utilize the probabilities to adapt the dictionary online, and finally produce a query-dependent and semantics-induced metric for image ranking. Extensive evaluations on several benchmark image datasets demonstrate the effectiveness and efficiency of the proposed approach in comparison with state-of-the-art methods.
Correlated-spaces regression for learning continuous emotion dimensions BIBAFull-Text 773-776
  Mihalis A. Nicolaou; Stefanos Zafeiriou; Maja Pantic
Adopting continuous dimensional annotations for affective analysis has been gaining increasing attention from researchers over the past years. Due to the idiosyncratic nature of this problem, many subproblems have been identified, spanning from the fusion of multiple continuous annotations to exploiting output-correlations amongst emotion dimensions. In this paper, we first empirically answer several important questions which have so far found partial or no answers in the related literature. In more detail, we study the correlation of each emotion dimension (i) with other emotion dimensions and (ii) with basic emotions (e.g., happiness, anger). As a measure for comparison, we use video and audio features. Interestingly enough, we find that (i) each emotion dimension is more correlated with other emotion dimensions than with face and audio features, and similarly (ii) that each basic emotion is more correlated with emotion dimensions than with audio and video features. A similar conclusion holds for discrete emotions, which are found to be more highly correlated with emotion dimensions than with audio and/or video features. Motivated by these findings, we present a novel regression algorithm (Correlated-Spaces Regression, CSR), inspired by Canonical Correlation Analysis (CCA), which learns output-correlations and performs supervised dimensionality reduction and multimodal fusion by (i) projecting features extracted from all modalities and labels onto a common space where their inter-correlation is maximised and (ii) learning mappings from the projected feature space onto the projected, uncorrelated label space.
RealSense: directional interaction for proximate mobile sharing using built-in orientation sensors BIBAFull-Text 777-780
  Chien-Pang Lin; Cheng-Yao Wang; Hou-Ren Chen; Wei-Chen Chu; Mike Y. Chen
We present RealSense, a technology that enables users to easily share media files with proximate users by performing directional gestures on mobile devices. RealSense leverages the natural human group behavior of forming a circle and facing the center of the group. By continuously monitoring the directional heading of each device using only built-in orientation sensors, RealSense can compute the relative direction between all the devices. It simplifies media sharing because users do not need to look up and specify the user IDs and device IDs of the intended recipients. We first evaluated the feasibility and design of RealSense, including the orientation sensor error and the minimal arc degree for selecting recipients. We then compared RealSense with three other common sharing interactions: 1) linear menu, 2) pie menu, and 3) NFC. Our results show that participants preferred RealSense over other sharing interactions, especially for groups of participants who were unacquainted with each other.
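The geometric idea in the abstract, that headings alone determine relative directions when everyone stands on a circle facing its center, can be sketched as follows. This is an illustrative reconstruction under that stated assumption, not the authors' implementation. The function name `relative_direction` and the compass convention (0 degrees = north, increasing clockwise) are assumptions.

```python
import math


def relative_direction(heading_a, heading_b):
    """Angle in degrees, clockwise from device A's heading, at which B appears.

    Assumes both devices sit on a common circle and face its center.
    Headings are compass degrees (0 = north, increasing clockwise).
    """
    def position(heading):
        rad = math.radians(heading)
        # A device facing the center stands on the opposite side of the circle
        # from the direction it is facing; coordinates are (east, north).
        return (-math.sin(rad), -math.cos(rad))

    ax, ay = position(heading_a)
    bx, by = position(heading_b)
    # Compass bearing from A's position to B's position.
    bearing = math.degrees(math.atan2(bx - ax, by - ay))
    # Express the bearing relative to A's heading, normalized to [-180, 180).
    return (bearing - heading_a + 180.0) % 360.0 - 180.0


straight_ahead = relative_direction(0, 180)   # B stands directly opposite A
to_the_right = relative_direction(0, 270)     # B stands a quarter circle away
to_the_left = relative_direction(0, 90)
```

A user whose device reads 0 degrees would, under these assumptions, swipe straight ahead to reach the device reading 180 degrees and 45 degrees to either side to reach devices a quarter circle away.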
Understanding and classifying image tweets BIBAFull-Text 781-784
  Tao Chen; Dongyuan Lu; Min-Yen Kan; Peng Cui
Social media platforms now allow users to share images alongside their textual posts. These image tweets make up a fast-growing percentage of tweets, but, unlike their text-only counterparts, have not been studied in depth. We study a large corpus of image tweets in order to uncover what people post about and the correlation between the tweet's image and its text. We show that an important functional distinction is between visually-relevant and visually-irrelevant tweets, and that we can successfully build an automated classifier utilizing text, image and social context features to distinguish these two classes, obtaining a macro F1 of 70.5%.
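For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it for a two-class problem like the one in the abstract. The label strings and the toy predictions are invented for illustration and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels.
y_true = ["relevant", "relevant", "relevant", "irrelevant"]
y_pred = ["relevant", "relevant", "irrelevant", "irrelevant"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally, macro F1 does not let a majority class dominate the score, which matters when the two tweet classes are unevenly represented.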
User interest and social influence based emotion prediction for individuals BIBAFull-Text 785-788
  Yun Yang; Peng Cui; Wenwu Zhu; Shiqiang Yang
Emotions play significant roles in daily life, making emotion prediction important. To date, most state-of-the-art methods predict emotion for the masses, which makes them invalid for individuals. In this paper, we propose a novel emotion prediction method for individuals based on user interest and social influence. To balance user interest and social influence, we further propose a simple yet efficient weight learning method in which the weights are obtained from users' behaviors. We perform experiments on a real social media network, with 4,257 users and 2,152,037 microblogs. The experimental results demonstrate that our method outperforms traditional methods with significant performance gains.
Bimodal log-linear regression for fusion of audio and visual features BIBAFull-Text 789-792
  Ognjen Rudovic; Stavros Petridis; Maja Pantic
One of the most commonly used audiovisual fusion approaches is feature-level fusion, where the audio and visual features are concatenated. Although this approach has been successfully used in several applications, it does not take into account interactions between the features, which can be a problem when one or both modalities have noisy features. In this paper, we investigate whether feature fusion based on explicit modelling of interactions between audio and visual features can enhance the performance of the classifier that performs feature fusion using simple concatenation of the audio-visual features. To this end, we propose a log-linear model, named Bimodal Log-linear regression, which accounts for interactions between the features of the two modalities. The performance of the target classifiers is measured in the task of laughter-vs-speech discrimination, since both laughter and speech are naturally audiovisual events. Our experiments on the MAHNOB laughter database suggest that feature fusion based on explicit modelling of interactions between the audio-visual features leads to an improvement of 3% over the standard feature concatenation approach, when a log-linear model is used as the base classifier. Finally, the most and least influential features can be easily identified by observing their interactions.

Security and forensics

Facilitating fashion camouflage art BIBAFull-Text 793-802
  Ranran Feng; Balakrishnan Prabhakaran
Artists and fashion designers have recently been creating a new form of art, Camouflage Art, which can be used to prevent computer vision algorithms from detecting faces. This digital art technique combines makeup and hair styling, or other modifications such as facial painting, to help avoid automatic face detection. In this paper, we first study the camouflage interference and its effectiveness on several current state-of-the-art techniques in face detection/recognition, and then present a tool that can facilitate digital art design for such camouflage that can fool these computer vision algorithms. This tool can find the prominent or decisive features from facial images that constitute the face being recognized, and give suggestions for camouflage options (makeup, styling, paints) on particular facial features or facial parts. Testing of this tool shows that it can effectively aid artists or designers in creating camouflage-thwarting designs. The evaluation of suggested camouflages applied to 40 celebrities across eight different face recognition systems (both non-commercial and commercial) shows that the subject is unrecognizable 82.5% to 100% of the time when using the suggested camouflage.
An efficient image homomorphic encryption scheme with small ciphertext expansion BIBAFull-Text 803-812
  Peijia Zheng; Jiwu Huang
The field of image processing in the encrypted domain has been given increasing attention because of its extensive potential applications, for example, providing efficient and secure solutions for privacy-preserving applications in untrusted environments. One obstacle to the widespread use of these techniques is the ciphertext expansion of several orders of magnitude caused by existing homomorphic encryption schemes. In this paper, we provide a way to tackle this issue for image processing in the encrypted domain. By using characteristics of the image format, we develop an image encryption scheme that limits ciphertext expansion while preserving the homomorphic property. The proposed encryption scheme first encrypts image pixels with an existing probabilistic homomorphic cryptosystem, and then compresses the whole encrypted image in order to save storage space. Our scheme has a much smaller ciphertext expansion factor than the element-wise encryption scheme, while preserving the homomorphic property. No additional interactive protocols are required when applying secure signal processing tools to the compressed encrypted image. We present a fast algorithm for the encryption and the compression of the proposed image encryption scheme, which speeds up the computation and makes our scheme much more efficient. Analyses of the security, ciphertext expansion ratio, and computational complexity are also conducted. Our experiments demonstrate the validity of the proposed algorithms. The proposed scheme is suitable for use as an image encryption method in secure image processing applications.
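To make the homomorphic property and the expansion problem concrete, the sketch below uses a toy Paillier cryptosystem, a well-known probabilistic additively homomorphic scheme. The abstract does not name the cryptosystem the authors build on, so the choice of Paillier is an assumption, and the tiny primes are for illustration only and offer no security. The authors' compression step is not shown.

```python
import math
import random

# Toy key with tiny primes, for illustration only (not secure).
p, q = 61, 53
n = p * q
n_sq = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # modular inverse; this shortcut is valid because g = n + 1


def encrypt(m):
    """Probabilistic encryption: c = g^m * r^n mod n^2 with a fresh random r."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n


# Multiplying two ciphertexts adds the underlying pixel values.
pixel_a, pixel_b = 100, 55
c_sum = (encrypt(pixel_a) * encrypt(pixel_b)) % n_sq
recovered = decrypt(c_sum)
```

Each ciphertext is a number modulo n squared, so with a realistically sized modulus an 8-bit pixel encrypted element-wise occupies thousands of bits, which is the expansion the abstract aims to reduce.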
Large-scale multimedia content analysis using scientific workflows BIBAFull-Text 813-822
  Ricky J. Sethi; Yolanda Gil; Hyunjoon Jo; Andrew Philpot
Analyzing web content, particularly multimedia content, for security applications is of great interest. However, it often requires deep expertise in data analytics that is not always accessible to non-experts. Our approach is to use scientific workflows that capture expert-level methods to examine web content. We use workflows to analyze the image and text components of multimedia web posts separately, as well as by a multimodal fusion of both image and text data. In particular, we re-purpose workflow fragments to do the multimedia analysis and create additional components for the fusion of the image and text modalities. In this paper, we present preliminary work which focuses on a Human Trafficking Detection task to help deter human trafficking of minors by thus fusing image and text content from the web. We also examine how workflow fragments save time and effort in multimedia content analysis while bringing together multiple areas of machine learning and computer vision. We further export these workflow fragments using linked data as web objects.

Open source software

Waisda?: video labeling game BIBAFull-Text 823-826
  Michiel Hildebrand; Maarten Brinkerink; Riste Gligorov; Martijn van Steenbergen; Johan Huijkman; Johan Oomen
The Waisda? video labeling game is a crowdsourcing tool to collect user-generated metadata for video clips. It follows the paradigm of games-with-a-purpose, where two or more users play against each other by entering tags that describe the content of the video. Players score points by entering the same tags as one of the other players. As a result, each video that is played in the game is annotated with tags that are anchored to a time point in the video. Waisda? has been deployed in two projects with videos from Dutch broadcasters. With the open source version of Waisda?, crowdsourcing of video annotation becomes available for any online video collection.
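The scoring rule described above (points for entering the same tag as another player, with tags anchored to a time point) can be sketched as follows. This is not the Waisda? implementation: the time window, the point value, the case-insensitive comparison, and the function name score_tags are all assumptions made for illustration.

```python
from collections import defaultdict


def score_tags(entries, window=10.0, points=50):
    """Score time-anchored tags.

    entries: list of (player, tag, time_in_seconds).
    A tag scores when a different player entered the same tag
    within `window` seconds (hypothetical rule and values).
    """
    scores = defaultdict(int)
    for i, (player, tag, t) in enumerate(entries):
        for j, (other, other_tag, other_t) in enumerate(entries):
            if (
                i != j
                and other != player
                and other_tag.lower() == tag.lower()
                and abs(t - other_t) <= window
            ):
                scores[player] += points
                break  # each entry scores at most once
    return dict(scores)


entries = [
    ("ann", "dog", 12.0),
    ("bob", "Dog", 15.5),   # matches ann's tag within the window
    ("ann", "car", 40.0),   # nobody else entered "car"
    ("bob", "dog", 90.0),   # same tag, but too far away in time
]
print(score_tags(entries))  # {'ann': 50, 'bob': 50}
```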
GamingAnywhere: an open-source cloud gaming testbed BIBAFull-Text 827-830
  Chun-Ying Huang; De-Yu Chen; Cheng-Hsin Hsu; Kuan-Ta Chen
While cloud gaming opens new business opportunities, it also poses tremendous challenges as the Internet only provides best-effort service and gamers are hard to please. Although researchers have various ideas to improve cloud gaming systems, existing cloud gaming systems are closed and proprietary, and cannot be used to evaluate these ideas. We present GamingAnywhere, the first open-source cloud gaming system, which is extensible, portable, and configurable. GamingAnywhere may be used by: (i) researchers and engineers to implement and test their new ideas, (ii) service providers to develop cloud gaming services, and (iii) gamers to set up private cloud gaming systems. Details on GamingAnywhere are given in this paper. We firmly believe GamingAnywhere will stimulate future studies on cloud gaming and real-time interactive distributed systems.
The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time BIBAFull-Text 831-834
  Johannes Wagner; Florian Lingenfelser; Tobias Baur; Ionut Damian; Felix Kistler; Elisabeth André
Automatic detection and interpretation of social signals carried by voice, gestures, mimics, etc. will play a key role for next-generation interfaces as it paves the way towards a more intuitive and natural human-computer interaction. The paper at hand introduces Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under GPL at http://openssi.net.
Recent developments in openSMILE, the munich open-source multimedia feature extractor BIBAFull-Text 835-838
  Florian Eyben; Felix Weninger; Florian Gross; Björn Schuller
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
ImproveMyCity: an open source platform for direct citizen-government communication BIBAFull-Text 839-842
  Ioannis Tsampoulatidis; Dimitrios Ververidis; Panagiotis Tsarchopoulos; Spiros Nikolopoulos; Ioannis Kompatsiaris; Nicos Komninos
ImproveMyCity is an open source platform that enables residents to directly report to their public administration local issues about their neighborhood such as discarded trash bins, faulty street lights, broken tiles on sidewalks, illegal advertising boards, etc. The reported issues are automatically transmitted to the appropriate office in public administration so as to schedule their settlement. Reporting is feasible both through a web- and a smartphone-based front-end that adopt a map-based visualization, which makes reporting a user-friendly and intriguing process. The management and routing of incoming issues is performed through a back-end infrastructure that serves as an integrated management system with easy to use interfaces. Apart from reporting a new issue, both front-ends allow the citizens to add comments or vote on existing issues, which adds a social dimension to the collected content. Finally, the platform also makes provision for informing the citizens about the progress status of the reported issue and in this way facilitates the establishment of a two-way dialogue between the citizen and public administration.
LIRE: open source image retrieval in Java BIBAFull-Text 843-846
  Mathias Lux
Content based image retrieval has been around for some time. There are lots of different test data sets, lots of published methods and techniques, and manifold retrieval challenges, where content based image retrieval is of interest. LIRE is a Java library that provides a simple way to index and retrieve millions of images based on the images' contents. LIRE is robust and well tested and is not only recommended by the websites of ImageCLEF and MediaEval, but is also employed in industry. This paper gives an overview of LIRE, its use and capabilities, and reports on retrieval and runtime performance.
Golden retriever: a Java based open source image retrieval engine BIBAFull-Text 847-850
  Lazaros T. Tsochatzidis; Chryssanthi Iakovidou; Savvas A. Chatzichristofis; Yiannis S. Boutalis
Golden Retriever Image Retrieval Engine (GRire) is an open source light weight Java library developed for Content Based Image Retrieval (CBIR) tasks, employing the Bag of Visual Words (BOVW) model. It provides a complete framework for creating CBIR systems, including image analysis tools, classifiers, weighting schemes etc., for efficient indexing and retrieval procedures. Its eminent feature is its extensibility, achieved through the open source nature of the library as well as a user-friendly embedded plug-in system. GRire is available on-line along with install and development documentation at http://www.grire.net and on its Google Code page http://code.google.com/p/grire. It is distributed either as a Java library or as a standalone Java application, both GPL licensed.
Stage framework: an HTML5 and CSS3 framework for digital publishing BIBAFull-Text 851-854
  Rami Aamulehto; Mikko Kuhna; Jussi Tarvainen; Pirkko Oittinen
In this paper we present Stage Framework, an HTML5 and CSS3 framework for digital book, magazine and newspaper publishing. The framework offers publishers the means and tools for publishing editorial content in the HTML5 format using a single web application. The approach is cross-platform and is based on open web standards. Stage Framework serves as an alternative for platform-specific native publications using pure HTML5 to deliver book, magazine and newspaper content while retaining the familiar gesture interaction of native applications. Available gesture actions include, for example, the page swipe and kinetic scrolling. The magazine browsing view relies entirely on CSS3 3D Transforms and Transitions, thus utilizing hardware acceleration in most devices and platforms. The web application also features a magazine stand, which can be used to offer issues of multiple publications. Developed as a part of master's thesis research, the framework has been published under the GPL and MIT licenses and is available to everyone via the framework website (http://stageframework.com) and the GitHub repository (http://github.com/ralatalo/stage).
ESSENTIA: an open-source library for sound and music analysis BIBAFull-Text 855-858
  Dmitry Bogdanov; Nicolas Wack; Emilia Gómez; Sankalp Gulati; Perfecto Herrera; Oscar Mayor; Gerard Roma; Justin Salamon; José Zapata; Xavier Serra
We present Essentia 2.0, an open-source C++ library for audio analysis and audio-based music information retrieval released under the Affero GPL license. It contains an extensive collection of reusable algorithms which implement audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal and high-level music descriptors. The library is also wrapped in Python and includes a number of predefined executable extractors for the available music descriptors, which facilitates its use for fast prototyping and allows setting up research experiments very rapidly. Furthermore, it includes a Vamp plugin to be used with Sonic Visualiser for visualization purposes. The library is cross-platform and currently supports Linux, Mac OS X, and Windows systems. Essentia is designed with a focus on the robustness of the provided music descriptors and is optimized in terms of the computational cost of the algorithms. The provided functionality, specifically the music descriptors included in-the-box and signal processing algorithms, is easily expandable and allows for both research experiments and development of large-scale industrial applications.
SCReen adjusted panoramic effect: SCRAPE BIBAFull-Text 859-862
  Carl Flynn; David Monaghan; Noel E. O'Connor
A Cave Automatic Virtual Environment (CAVE) is an enclosed virtual reality room that uses multiple projectors to display images across its surfaces. It uses one or more computers to synchronise and combine the images and allows users to control virtual worlds using a host of interaction devices. Traditionally, a CAVE is used by a single user at any one time and by utilising some form of motion sensing, the user's head position can be tracked to allow for first virtual perception. The images are then displayed in stereographic 3D in order to complete the virtual reality effect. Professional CAVE installations are expensive and can cost upwards of several hundred thousand euros. This tends to act as a significant barrier to their propagation. However, as the reduction in cost of high specification computers, projectors and graphics cards continues apace, it has sparked a renewed interest in CAVE environments and given rise to the realistic possibility of setting up low cost, amateur CAVEs. Unfortunately, one of the greatest disadvantages of CAVE systems is the lack of inexpensive, easy to use, specialised software. In this paper we present an open source and easy to use CAVE software toolkit called SCReen Adjusted Panoramic Effect or SCRAPE for short. We believe that SCRAPE is the first major piece in a longer-term vision that aims to bring easy to set up, easy to use, portable CAVE systems to all types of non-expert users.
Orcc: multimedia development made easy BIBAFull-Text 863-866
  Herve Yviquel; Antoine Lorence; Khaled Jerbi; Gildas Cocherel; Alexandre Sanchez; Mickael Raulet
In this paper, we present Orcc, an open-source development environment that aims at enhancing multimedia development by offering all the advantages of dataflow programming: flexibility, portability and scalability. To do so, Orcc embeds two rich Eclipse-based editors that ease the writing of dataflow applications, a simulator that allows quick validation of the written code, and a multi-target compiler that is able to translate any dataflow program, written in the RVC-CAL language, into an equivalent description in both hardware and software languages. Orcc has already been used to successfully write tens of multimedia applications, such as a video decoder supporting the new High Efficiency Video Coding standard, which clearly demonstrates the ability of the environment to develop complex applications. Moreover, results show scalable performance on multi-core platforms and real-time decoding frame rates on HD sequences.

Multimodal analysis

Human vs machine: establishing a human baseline for multimodal location estimation BIBAFull-Text 867-876
  Jaeyoung Choi; Howard Lei; Venkatesan Ekambaram; Pascal Kelm; Luke Gottlieb; Thomas Sikora; Kannan Ramchandran; Gerald Friedland
Over the recent years, the problem of video location estimation (i.e., estimating the longitude/latitude coordinates of a video without GPS information) has been approached with diverse methods and ideas in the research community and significant improvements have been made. So far, however, systems have only been compared against each other and no systematic study on human performance has been conducted. Based on a human-subject study with 11,900 experiments, this article presents a human baseline for location estimation for different combinations of modalities (audio, audio/video, audio/video/text). Furthermore, this article compares state-of-the-art location estimation systems with the human baseline. Although the overall performance of humans' multimodal video location estimation is better than current machine learning approaches, the difference is quite small: For 41% of the test set, the machine's accuracy was superior to the humans. We present case studies and discuss why machines did better for some videos and not for others. Our analysis suggests new directions and priorities for future work on the improvement of location inference algorithms.
Cross-media semantic representation via bi-directional learning to rank BIBAFull-Text 877-886
  Fei Wu; Xinyan Lu; Zhongfei Zhang; Shuicheng Yan; Yong Rui; Yueting Zhuang
In multimedia information retrieval, most classic approaches tend to represent different modalities of media in the same feature space. Existing approaches take either one-to-one paired data or uni-directional ranking examples (i.e., utilizing only text-query-image ranking examples or image-query-text ranking examples) as training examples, which do not make full use of bi-directional ranking examples (bi-directional ranking means that both text-query-image and image-query-text ranking examples are utilized in the training period) to achieve a better performance. In this paper, we consider learning a cross-media representation model from the perspective of optimizing a listwise ranking problem while taking advantage of bi-directional ranking examples. We propose a general cross-media ranking algorithm to optimize the bi-directional listwise ranking loss with a latent space embedding, which we call Bi-directional Cross-Media Semantic Representation Model (Bi-CMSRM). The latent space embedding is discriminatively learned by the structural large margin learning for optimization with certain ranking criteria (mean average precision in this paper) directly. We evaluate Bi-CMSRM on the Wikipedia and NUS-WIDE datasets and show that the utilization of the bi-directional ranking examples achieves a much better performance than only using the uni-directional ranking examples.
Listen, look, and gotcha: instant video search with mobile phones by layered audio-video indexing BIBAFull-Text 887-896
  Wu Liu; Tao Mei; Yongdong Zhang; Jintao Li; Shipeng Li
Mobile video is quickly becoming a mass consumer phenomenon. More and more people are using their smartphones to search and browse video content while on the move. In this paper, we have developed an innovative instant mobile video search system through which users can discover videos by simply pointing their phones at a screen to capture a very few seconds of what they are watching. The system is able to index large-scale video data using a new layered audio-video indexing approach in the cloud, as well as extract light-weight joint audio-video signatures in real time and perform progressive search on mobile devices. Unlike most existing mobile video search applications that simply send the original video query to the cloud, the proposed mobile system is one of the first attempts at instant and progressive video search leveraging the light-weight computing capacity of mobile devices. The system is characterized by four unique properties: 1) a joint audio-video signature to deal with the large aural and visual variances associated with the query video captured by the mobile phone, 2) layered audio-video indexing to holistically exploit the complementary nature of audio and video signals, 3) light-weight fingerprinting to comply with mobile processing capacity, and 4) a progressive query process to significantly reduce computational costs and improve the user experience -- the search process can stop anytime once a confident result is achieved. We have collected 1,400 query videos captured by 25 mobile users from a dataset of 600 hours of video. The experiments show that our system outperforms state-of-the-art methods by achieving 90.79% precision when the query video is less than 10 seconds and 70.07% even when the query video is less than 5 seconds.
Parallel field alignment for cross media retrieval BIBAFull-Text 897-906
  Xiangbo Mao; Binbin Lin; Deng Cai; Xiaofei He; Jian Pei
Cross media retrieval systems have received increasing interest in recent years. Due to the semantic gap between low-level features and high-level semantic concepts of multimedia data, many researchers have explored joint-model techniques in cross media retrieval systems. Previous joint-model approaches usually focus on two traditional ways to design cross media retrieval systems: (a) fusing features from different media data; (b) learning different models for different media data and fusing their outputs. However, the process of fusing features or outputs will lose both low- and high-level abstraction information of media data. Hence, both ways do not really reveal the semantic correlations among the heterogeneous multimedia data. In this paper, we introduce a novel method for the cross media retrieval task, named Parallel Field Alignment Retrieval (PFAR), which integrates a manifold alignment framework from the perspective of vector fields. Instead of fusing original features or outputs, we consider the cross media retrieval as a manifold alignment problem using parallel fields. The proposed manifold alignment algorithm can effectively preserve the metric of data manifolds, model heterogeneous media data and project their relationship into intermediate latent semantic spaces during the process of manifold alignment. After the alignment, the semantic correlations are also determined. In this way, the cross media retrieval task can be resolved by the determined semantic correlations. Comprehensive experimental results have demonstrated the effectiveness of our approach.

Social dynamics

Analysis and forecasting of trending topics in online media streams BIBAFull-Text 907-916
  Tim Althoff; Damian Borth; Jörn Hees; Andreas Dengel
Among the vast information available on the web, social media streams capture what people currently pay attention to and how they feel about certain topics. Awareness of such trending topics plays a crucial role in multimedia systems such as trend aware recommendation and automatic vocabulary selection for video concept detection systems. Correctly utilizing trending topics requires a better understanding of their various characteristics in different social media streams. To this end, we present the first comprehensive study across three major online and social media streams, Twitter, Google, and Wikipedia, covering thousands of trending topics during an observation period of an entire year. Our results indicate that depending on one's requirements one does not necessarily have to turn to Twitter for information about current events and that some media streams strongly emphasize content of specific categories. As our second key contribution, we further present a novel approach for the challenging task of forecasting the life cycle of trending topics in the very moment they emerge. Our fully automated approach is based on a nearest neighbor forecasting technique exploiting our assumption that semantically similar topics exhibit similar behavior.
We demonstrate on a large-scale dataset of Wikipedia page view statistics that forecasts by the proposed approach are about 9-48k views closer to the actual viewing statistics compared to baseline methods and achieve a mean average percentage error of 45-19% for time periods of up to 14 days.
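The nearest neighbor forecasting idea (semantically similar topics are assumed to behave similarly) can be illustrated with a short sketch. Everything specific here is an assumption rather than the authors' method: topics are represented by hypothetical feature vectors, similarity is cosine similarity, the forecast is the plain average of the k most similar past view curves, and the error measure is the usual mean absolute percentage error, which is our reading of the abstract's "mean average percentage error".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def forecast(query_vec, history, k=2):
    """Average the daily view curves of the k most similar past topics.

    history: list of (feature_vector, daily_view_series).
    """
    ranked = sorted(history, key=lambda item: cosine(query_vec, item[0]), reverse=True)[:k]
    horizon = min(len(series) for _, series in ranked)
    return [sum(series[d] for _, series in ranked) / len(ranked) for d in range(horizon)]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


history = [
    ([1.0, 0.0], [100, 50, 25]),
    ([0.9, 0.1], [200, 100, 50]),
    ([0.0, 1.0], [10, 10, 10]),   # dissimilar topic, ignored for k=2
]
print(forecast([1.0, 0.05], history, k=2))  # [150.0, 75.0, 37.5]
```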
Why not, WINE?: towards answering why-not questions in social image search BIBAFull-Text 917-926
  Sourav S. Bhowmick; Aixin Sun; Ba Quan Truong
Despite considerable progress in recent years on Tag-based Social Image Retrieval (TagIR), state-of-the-art TagIR systems fail to provide a systematic framework for end users to ask why certain images are not in the result set of a given query and provide an explanation for such missing results. However, for human users, such why-not questions arise naturally when expected images are missing in the query results returned by a TagIR system. Clearly, it would be very helpful to users if they could pose follow-up why-not questions to seek clarifications on missing images in query results. In this work, we take the first step to systematically answer the why-not questions posed by end-users on TagIR systems. Our answer not only involves the reason why desired images are missing in the results but also suggestions on how the query can be altered so that the user can view these missing images in sufficient number. We present three explanation models, namely result reordering, query relaxation, and query substitution, that enable us to explain a variety of why-not questions. We present an algorithm called WINE (Why-not questIon aNswering Engine) that exploits these models to answer why-not questions efficiently. Experiments on the NUS-WIDE dataset demonstrate the effectiveness as well as the benefits of WINE.
Automatic generation of social media snippets for mobile browsing BIBAFull-Text 927-936
  Wenyuan Yin; Tao Mei; Chang Wen Chen
The ongoing revolution in media consumption from traditional PCs to the pervasiveness of mobile devices is driving the adoption of social media in our daily lives. More and more people are using their mobile devices to enjoy social media content while on the move. However, mobile display constraints create challenges for presenting and authoring the rich media content on screens with limited display size. This paper presents an innovative system to automatically generate magazine-like social media visual summaries, called "snippets," for efficient mobile browsing. The system excerpts the most salient and dominant elements, i.e., a major picture element and a set of textual elements, from the original media content, and composes these elements into a text overlaid image by maximizing information perception. In particular, we investigate a set of aesthetic rules and visual perception principles to optimize the layout of the extracted elements by considering display constraints. As a result, browsing the snippet on mobile devices is just like quickly glancing at a magazine. To the best of our knowledge, this paper represents one of the first attempts at automatic social media snippet generation by studying aesthetic rules and visual perception principles. We have conducted experiments and user studies with social posts from news entities. We demonstrated that the generated snippets are effective at representing media content in a visually appealing and compact way, leading to a better user experience when consuming social media content on mobile devices.
Temporal encoded F-formation system for social interaction detection BIBAFull-Text 937-946
  Tian Gan; Yongkang Wong; Daqing Zhang; Mohan S. Kankanhalli
In the context of a social gathering, such as a cocktail party, the memorable moments are generally captured by professional photographers or by the participants. The latter case is often undesirable because many participants would rather enjoy the event instead of being occupied by the photo-taking task. Motivated by this scenario, we propose the use of a set of cameras to automatically take photos. Instead of performing dense analysis on all cameras for photo capturing, we first detect the occurrence and location of social interactions via F-formation detection. In the sociology literature, F-formation is a concept used to define social interactions, where each detection only requires the spatial location and orientation of each participant. This information can be robustly obtained with additional Kinect depth sensors. In this paper, we propose an extended F-formation system for robust detection of interactions and interactants. The extended F-formation system employs a heat-map based feature representation for each individual, namely Interaction Space (IS), to model their location, orientation, and temporal information. Using the temporally encoded IS for each detected interactant, we propose a best-view camera selection framework to detect the corresponding best view camera for each detected social interaction. The extended F-formation system is evaluated with synthetic data on multiple scenarios. To demonstrate the effectiveness of the proposed system, we conducted a user study to compare our best view camera ranking with human's ranking using real-world data.

Annotation

Towards efficient sparse coding for scalable image annotation BIBAFull-Text 947-956
  Junshi Huang; Hairong Liu; Jialie Shen; Shuicheng Yan
Nowadays, content-based retrieval methods are still the development trend of the traditional retrieval systems. Image labels, as one of the most popular approaches for the semantic representation of images, can fully capture the representative information of images. To achieve high retrieval performance, precise annotation of images becomes inevitable. However, given the massive number of images on the Internet, one cannot annotate all the images without a scalable and flexible (i.e., training-free) annotation method. In this paper, we particularly investigate the problem of accelerating sparse coding based scalable image annotation, whose off-the-shelf solvers are generally inefficient on large-scale datasets. By leveraging the prior that most reconstruction coefficients should be zero, we develop a general and efficient framework to derive an accurate solution to the large-scale sparse coding problem through solving a series of much smaller-scale subproblems. In this framework, an active variable set, which expands and shrinks iteratively, is maintained, with each snapshot of the active variable set corresponding to a subproblem. Meanwhile, the convergence of our proposed framework to the global optimum is theoretically provable. To further accelerate the proposed framework, a sub-linear time complexity hashing strategy, e.g. Locality-Sensitive Hashing, is seamlessly integrated into our framework. Extensive empirical experiments on NUS-WIDE and IMAGENET datasets demonstrate that orders-of-magnitude acceleration is achieved by the proposed framework for large-scale image annotation, along with zero/negligible accuracy loss for the cases without/with hashing speed-up, compared to the expensive off-the-shelf solvers.
Learning with limited and noisy tagging BIBAFull-Text 957-966
  Yingming Li; Zhongang Qi; Zhongfei (Mark) Zhang; Ming Yang
With the rapid development of social networks, tagging has become an important means responsible for such rapid development. A robust tagging method must have the capability to meet two challenging requirements: limited labeled training samples and noisy labeled training samples. In this paper, we investigate this challenging problem of learning with limited and noisy tagging and propose a discriminative model, called SpSVM-MC, that exploits both labeled and unlabeled data through a semi-parametric regularization and incorporates the multi-label constraints into the optimization. While SpSVM-MC is a general method for learning with limited and noisy tagging, in the evaluations we focus on the specific application of noisy image tagging with limited labeled training samples on a benchmark dataset. Theoretical analysis and extensive evaluations in comparison with state-of-the-art literature demonstrate that SpSVM-MC stands out with superior performance.
Picture tags and world knowledge: learning tag relations from visual semantic sources BIBAFull-Text 967-976
  Lexing Xie; Xuming He
This paper studies the use of everyday words to describe images. The common saying has it that 'a picture is worth a thousand words'; here we ask: which thousand? The proliferation of tagged social multimedia data presents a challenge to understanding collective tag-use at large scale -- one can ask if patterns from photo tags help understand tag-tag relations, and how they can be leveraged to improve visual search and recognition. We propose a new method to jointly analyze three distinct visual knowledge resources: Flickr, ImageNet/WordNet, and ConceptNet. This allows us to both quantify the visual relevance of tags and learn their relationships. We propose a novel network estimation algorithm, Inverse Concept Rank, to infer incomplete tag relationships. We then design an algorithm for image annotation that takes into account both image and tag features. We analyze over 5 million photos with over 20,000 visual tags. The statistics from this collection lead to good results for image tagging, relationship estimation, and generalizing to unseen tags. This is a first step in analyzing picture tags and everyday semantic knowledge. Potential other applications include generating natural language descriptions of pictures, as well as validating and supplementing knowledge databases.
Annotation for free: video tagging by mining user search behavior BIBAFull-Text 977-986
  Ting Yao; Tao Mei; Chong-Wah Ngo; Shipeng Li
The problem of tagging is mostly considered from the perspectives of machine learning and data-driven philosophy. A fundamental issue that underlies the success of these approaches is the visual similarity, ranging from the nearest neighbor search to manifold learning, to identify similar instances of an example for tag completion. The need to search millions of visual examples in high-dimensional feature space, however, makes the task computationally expensive. Moreover, the results can suffer from robustness problems when the underlying data, such as online videos, are rich in semantics and the similarity is difficult to learn from low-level features. This paper studies the exploration of user searching behavior through click-through data, which is largely available and freely accessible by search engines, for learning video relationships and applying these relationships as an economical way of annotating online videos. We demonstrated that, by a simple approach using co-click statistics, promising results were obtained in contrast to feature-based similarity measurement. Furthermore, considering the long tail effect that a few videos dominate most clicks, a new method based on polynomial semantic indexing is proposed to learn a latent space for alleviating the sparsity problem of click-through data. The proposed approaches are then applied for three major tasks in tagging: tag assignment, ranking, and enrichment. On a bipartite graph constructed from click-through data with over 15 million queries and 20 million video URL clicks, we showed that annotation can be performed for free with competitive performance and minimal computing resources, representing a new and promising paradigm for video tagging in addition to machine learning and data-driven methodologies.
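As a rough illustration of the "simple approach using co-click statistics", the sketch below counts how many queries led to clicks on both videos of a pair and then passes tags from co-clicked neighbours to an untagged video. The log format, the weighting by raw co-click count, and the function names (co_click_counts, propagate_tags) are assumptions for illustration; the polynomial semantic indexing step is not shown.

```python
from collections import defaultdict
from itertools import combinations


def co_click_counts(click_log):
    """Count, for each video pair, the queries that led to clicks on both.

    click_log: iterable of (query, video_id) pairs.
    """
    videos_per_query = defaultdict(set)
    for query, video in click_log:
        videos_per_query[query].add(video)
    counts = defaultdict(int)
    for videos in videos_per_query.values():
        for a, b in combinations(sorted(videos), 2):
            counts[(a, b)] += 1
    return dict(counts)


def propagate_tags(target, tags, counts, top_n=3):
    """Suggest tags for `target`, weighting neighbours' tags by co-click count."""
    votes = defaultdict(int)
    for (a, b), c in counts.items():
        if target in (a, b):
            neighbour = b if a == target else a
            for tag in tags.get(neighbour, ()):
                votes[tag] += c
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]


log = [
    ("cute cat", "v1"), ("cute cat", "v2"),
    ("funny cat", "v1"), ("funny cat", "v2"), ("funny cat", "v3"),
    ("news", "v4"),
]
counts = co_click_counts(log)
tags = {"v2": ["cat", "kitten"], "v3": ["cat", "funny"]}
print(propagate_tags("v1", tags, counts))  # ['cat', 'kitten', 'funny']
```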

Scene understanding

Static saliency vs. dynamic saliency: a comparative study BIBAFull-Text 987-996
  Tam V. Nguyen; Mengdi Xu; Guangyu Gao; Mohan Kankanhalli; Qi Tian; Shuicheng Yan
Recently visual saliency has attracted wide attention of researchers in the computer vision and multimedia field. However, most of the visual saliency-related research was conducted on still images for studying static saliency. In this paper, we give a comprehensive comparative study for the first time of dynamic saliency (video shots) and static saliency (key frames of the corresponding video shots), and two key observations are obtained: 1) video saliency is often different from, yet closely related to, image saliency, and 2) camera motions, such as tilting, panning or zooming, affect dynamic saliency significantly. Motivated by these observations, we propose a novel camera motion and image saliency aware model for dynamic saliency prediction. The extensive experiments on two static-vs-dynamic saliency datasets collected by us show that our proposed method outperforms the state-of-the-art methods for dynamic saliency prediction. Finally, we also introduce the application of dynamic saliency prediction for dynamic video captioning, assisting people with hearing impairments to better enjoy videos with only off-screen voices, e.g., documentary films, news videos and sports videos.
Building holistic descriptors for scene recognition: a multi-objective genetic programming approach BIBAFull-Text 997-1006
  Li Liu; Ling Shao; Xuelong Li
Real-world scene recognition has been one of the most challenging research topics in computer vision, due to the tremendous intraclass variability and the wide range of scene categories. In this paper, we successfully apply an evolutionary methodology to automatically synthesize domain-adaptive holistic descriptors for the task of scene recognition, instead of using hand-tuned descriptors. We address this as an optimization problem by using multi-objective genetic programming (MOGP). Specifically, a set of primitive operators and filters are first randomly assembled in the MOGP framework as tree-based combinations, which are then evaluated by two objective fitness criteria i.e., the classification error and the tree complexity. Finally, the best-so-far solution selected by MOGP is regarded as the (near-)optimal feature descriptor for scene recognition. We have evaluated our approach on three realistic scene datasets: MIT urban and nature, SUN and UIUC Sport. Experimental results consistently show that our MOGP-generated descriptors achieve significantly higher recognition accuracies compared with state-of-the-art hand-crafted and machine-learned features.
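As a rough illustration of the two-objective selection described in the abstract above (classification error and tree complexity), the following sketch keeps only the candidates that are not dominated on both criteria. It is a generic Pareto filter, not the paper's MOGP implementation; the candidate tuples and the name `pareto_front` are hypothetical.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on (error, complexity).

    candidates: list of (name, error, complexity); lower is better for both.
    The tuple layout is an assumption made for this sketch.
    """
    front = []
    for name, err, size in candidates:
        # A candidate is dominated if another one is no worse on both
        # objectives and strictly better on at least one.
        dominated = any(
            (e <= err and s <= size) and (e < err or s < size)
            for _, e, s in candidates
        )
        if not dominated:
            front.append(name)
    return front


descriptors = [
    ("A", 0.20, 5),   # small tree, moderate error
    ("B", 0.15, 9),   # larger tree, lower error
    ("C", 0.25, 7),   # worse than A on both objectives
    ("D", 0.15, 12),  # same error as B but more complex
]
front = pareto_front(descriptors)
```

Here `A` and `B` survive as trade-offs between accuracy and size, while `C` and `D` are discarded; a final descriptor would then be picked from the surviving set.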
Scale based region growing for scene text detection BIBAFull-Text 1007-1016
  Junhua Mao; Houqiang Li; Wengang Zhou; Shuicheng Yan; Qi Tian
Scene text is widely observed in our daily life and has many important multimedia applications. Unlike document text, scene text usually exhibits large variations in font and language, and suffers from low resolution, occlusions and complex background. In this paper, we present a novel scale-based region growing algorithm for scene text detection. We first distinguish SIFT features in text regions from those in background by exploring the inter- and intra-statistics of SIFT features. Then scene text regions in images are identified by scale-based region growing, which explores the geometric context of SIFT keypoints in local regions. Our algorithm is very effective to detect multilingual text in various fonts, sizes, and with complex background. In addition, it offers insights on efficiently deploying local features in numerous applications, such as visual search. We evaluate our algorithm on three datasets and achieve the state-of-the-art performance.
Visual interestingness in image sequences BIBAFull-Text 1017-1026
  Helmut Grabner; Fabian Nater; Michel Druey; Luc Van Gool
Interestingness is said to be the power of attracting or holding one's attention (because something is unusual or exciting, etc.). We, as humans, have the great capacity to direct our visual attention and judge the interestingness of a scene. Consider for example the image sequence in the figure on the right. The spider in front of the camera or the snow on the lens are examples of events that deviate from the context since they violate the expectations, and therefore are considered interesting. On the other hand, weather changes or a camera shift, do not raise human attention considerably, even though large regions of the image are influenced. In this work we firstly investigate what humans consider as "interesting" in image sequences. Secondly we propose a computer vision algorithm to automatically spot these interesting events. To this end, we integrate multiple cues inspired by cognitive concepts and discuss why and to what extent the automatic discovery of visual interestingness is possible.

Doctoral symposium

Using tagged images of low visual ambiguity to boost the learning efficiency of object detectors BIBAFull-Text 1027-1030
  Elisavet Chatzilari
Motivated by the abundant availability of user-generated multimedia content, a data augmentation approach that enhances an initial manually labelled training set with regions from user tagged images is presented. Initially, object detection classifiers are trained using a small number of manually labelled regions as the training set. Then, a set of positive regions is automatically selected from a large number of loosely tagged images, pre-segmented by an automatic segmentation algorithm, to enhance the initial training set. In order to overcome the noisy nature of user tagged images and the lack of information about the pixel level annotations, the main contribution of this work is the introduction of the visual ambiguity term. Visual ambiguity is caused by the visual similarity of semantically dissimilar concepts with respect to the employed visual representation and analysis system (i.e. segmentation, feature space, classifier) and, in this work, is modelled so that the images where ambiguous concepts co-exist are penalized. Preliminary experimental results show that the employment of visual ambiguity guides the selection process away from the ambiguous images and, as a result, allows for better separation between the targeted true positive and the undesired negative regions.
Projective identity and procedural rhetoric in educational multimedia: towards the enrichment of programming self-concept and growth mindset with fantasy role-play BIBAFull-Text 1031-1034
  Michael James Scott
There is a growing movement in the behavioral sciences towards exploring more situated, pragmatic and ontological accounts of human learning. Positive psychology shows that a reciprocal relationship may exist between self-concept and the development of expertise, while social psychology reveals that mindsets about the nature of personal traits can have profound impacts on practice behavior. Thus, nurturing psychological constructs through the learning environment may empower students, enabling them to learn more effectively. Educational multimedia is known to support learning in a range of contexts, but its role in facilitating such self-enrichment has seldom been explored. Consequently, it is not clear which designs can aid both self enhancement and skill development. This doctoral symposium paper proposes that an interplay between projective identity and procedural rhetoric, delivered in the form of a fantasy role-playing experience, could be one such practice. Early experiments in the area of introductory programming show promise, but raise questions about external validity, educationally relevant effect sizes and how multimedia elements within the tool could be utilized more effectively to enhance these effects.
Recognition of complex events in open-source web-scale videos: a bottom up approach BIBAFull-Text 1035-1038
  Subhabrata Bhattacharya
Recognition of complex events in unconstrained Internet videos is a challenging research problem. In this symposium proposal, we present a systematic decomposition of complex events into hierarchical components and make an in-depth analysis of how existing research is being used to cater to various levels of this hierarchy. We also identify three key stages where we make novel contributions which are necessary to not only improve the overall recognition performance, but also develop richer understanding of these events. At the lowest level, our contributions include (a) compact covariance descriptors of appearance and motion features used in sparse coding framework to recognize realistic actions and gestures, and (b) a Lie-algebra based representation of dominant camera motion present in video shots which can be used as a complementary feature for video analysis. In the next level, we propose an (c) efficient maximum likelihood estimate based representation from low-level features computed from videos which demonstrates state of the art performance in large scale visual concept detection, and finally, we propose to (d) model temporal interactions between concepts detected in video shots through two new discriminative feature spaces derived from Linear dynamical systems which eventually boosts event recognition performance. In all cases, we conduct thorough experiments to demonstrate promising performance gains over some of the prominent approaches.
Motion compensated compressed domain watermarking BIBAFull-Text 1039-1042
  Tanima Dutta
Security has become an important issue in multimedia applications. The embedding of watermark bits in the compressed domain is less computationally expensive as full decoding and re-encoding is not required. Motion coherency is an essential property to resist temporal frame averaging based attacks. The design of a motion compensated embedding method in the compressed domain is a challenging task. As far as we know, no such embedding method has been explored yet. In this paper, we propose a motion compensated compressed domain embedding method within a short video neighborhood that gives acceptable visual quality, embedding capacity, and robustness. The simulation results show the effectiveness of the proposed method.
Social interaction detection using a multi-sensor approach BIBAFull-Text 1043-1046
  Tian Gan
In the context of a social gathering, such as a cocktail party, the memorable moments are often captured by professional photographers or the participants. The latter case is generally undesirable because many participants would rather enjoy the event instead of being occupied by the tedious photo capturing task. Motivated by this scenario, we propose an automated social event photo-capture framework which, given multiple sensor data streams and information from the Web as input, will output visually appealing photos of the social event. Our proposal consists of three components: (1) social attribute extraction from both the physical space and the cyberspace; (2) social attribute fusion; and (3) active camera control. Current work is presented and we conclude with expected contributions as well as future direction.
Virtual director technology for social video communication and live event broadcast production BIBAFull-Text 1047-1050
  Rene Kaiser
This thesis investigates several aspects of Virtual Director technology, i.e. software capable of intelligent real-time selection of live media streams. It addresses several research questions in this interdisciplinary field with respect to how a generic Virtual Director framework can be constructed, and how its behavior can be modeled and formalized to realize professional applications with many parallel users within real-time constraints. Prototypes have been built for the applications of group videoconferencing and live event broadcast. The engine executes cinematic principles aiming to enhance the user experience. In group videoconferencing, a Virtual Director aims to support communication goals by selecting from multiple available streams, i.e. automating cuts between shots according to the communication situation. In event broadcast, it enables personalization by framing, animating and cutting virtual camera views as croppings from a high-resolution panorama. While the technical approach and framework have been evaluated in lab experiments, further evaluation involving potential users and cinematic professionals is ongoing.
Gesture -- sound mapping by demonstration in interactive music systems BIBAFull-Text 1051-1054
  Jules Françoise
In this paper we address the issue of mapping between gesture and sound in interactive music systems. Our approach, which we call mapping by demonstration, aims at learning the mapping from examples provided by users while interacting with the system. We propose a general framework for modeling gesture -- sound sequences based on a probabilistic, multimodal and hierarchical model. Two orthogonal modeling aspects are detailed and we describe planned research directions to improve and evaluate the proposed models.
Learning representations for affective video understanding BIBAFull-Text 1055-1058
  Esra Acar
Among the ever growing available multimedia data, finding multimedia content which matches the current mood of users is a challenging problem. Choosing discriminative features for the representation of video segments is a key issue in designing video affective content analysis algorithms, where no dominant feature representation has emerged yet. Most existing affective content analysis methods either use low-level audio-visual features or generate hand-crafted higher level representations. In this work, we propose to use deep learning methods, in particular, convolutional neural networks (CNNs), in order to learn mid-level representations from automatically extracted raw features. We exploit only the audio modality in the current framework and employ Mel-Frequency Cepstral Coefficients (MFCC) features in order to build higher level audio representations. We use the learned representations for the affective classification of music video clips. We choose multi-class support vector machines (SVMs) for classifying video clips into affective categories. Preliminary results on a subset of the DEAP dataset show that a significant improvement is obtained when we learn higher level representations instead of using low-level features directly for video affective content analysis. We plan to further extend this work and include visual modality as well. We will generate mid-level visual representations using CNNs and fuse these visual representations with mid-level audio representations both at feature- and decision-level for video affective content analysis.
Context-aware gesture recognition in classical music conducting BIBAFull-Text 1059-1062
  Alvaro Sarasua
Body movement has received increasing attention in music technology research in recent years. Some new musical interfaces make use of gestures to control music in a meaningful and intuitive way. A typical approach is to use the orchestra conducting paradigm, in which the computer that generates the music would be a virtual orchestra conducted by the user. However, although conductors' gestures are complex and their meaning can vary depending on the musical context, this context-dependency remains to be explored. We propose a method to study context-dependency of body and facial gestures of conductors in orchestral classical music based on temporal clustering of gestures into actions, followed by an analysis of the evolution of audio features after action occurrences. For this, multi-modal data (audio, video, motion capture) will be recorded in real live concert and rehearsal situations using unobtrusive techniques.
Bringing the sport stadium atmosphere to remote fans BIBAFull-Text 1063-1066
  Pedro Centieiro
While watching live sports broadcasts we do not feel as emotionally connected with the performers and the in-venue fans as if we were watching live, where the event takes place. Moreover, remote and in-venue fans do not share a social connection, resulting in fragmented social experiences. This work intends to establish a new paradigm that explores the use of mobile devices to enhance remote fans' interaction with a live event. This new paradigm will provide them with an emotional and social experience by bringing the stadium atmosphere, its immersion, and emotional levels, to remote supporters. As a result, remote fans will be more engaged in the broadcasted sports, and both remote and in-venue fans will feel part of the same community.
Automatic melodic and structural analysis of music material for enriched concert related experiences BIBAFull-Text 1067-1070
  Juan J. Bosch
This PhD thesis proposal deals with the automatic analysis of musical audio, focusing on the estimation of the predominant melodic lines, which are used as a basis for extracting musical themes, and (along with other features) for structure recognition. The main focus is set on classical western music in large ensemble settings, which poses interesting research challenges to current state-of-the art algorithms. We will study the limitations of current approaches in this genre, and elaborate specific descriptors and methods, combining audio based analysis with further sources of knowledge and modalities. The creation of appropriate datasets will also be a main aspect, in order to properly evaluate the developed approaches. This work will be used to enrich musical concert related experiences, from music consumers to editors.
Design, development and evaluation of an adaptive and standardized RTP/RTCP-based IDMS solution BIBAFull-Text 1071-1074
  Mario Montagud
Inter-Destination Media Synchronization (IDMS) is essential for enabling pleasant shared media experiences. The goal of my PhD thesis is to design, develop and evaluate an advanced RTP/RTCP-based IDMS solution fitting the requirements of the emerging distributed media consumption paradigm. In particular, standard compliant extensions to RTCP are being specified to allow for an accurate, adaptive and dynamic IDMS control when using RTP for streaming media. Moreover, the feasibility and suitability of several architectural schemes for exchanging the IDMS information, algorithms for allowing a dynamic IDMS monitoring and control, as well as adjustment techniques are being investigated. Objective and subjective testing are being conducted to validate the satisfactory performance of our IDMS solution and to provide insights about the users' tolerance on asynchrony levels in different IDMS scenarios.
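To make the IDMS control loop in the abstract above more concrete, the following sketch computes the asynchrony across receivers and a corrective playout adjustment. It is a simplified assumption-laden illustration, not the RTCP extension or the algorithms of the thesis: the report format, the "synchronize to the most lagged receiver" policy, the 0.08 s threshold, and the name `idms_adjustments` are all hypothetical.

```python
def idms_adjustments(playout_positions, threshold=0.08):
    """Compute group asynchrony and per-receiver playout adjustments.

    playout_positions: dict mapping receiver id to its media position in
    seconds, all sampled at the same wall-clock instant (assumed input).
    threshold: tolerated asynchrony in seconds (an arbitrary example value).
    """
    # Assumed policy: align everyone to the most lagged receiver.
    reference = min(playout_positions.values())
    asynchrony = max(playout_positions.values()) - reference

    if asynchrony <= threshold:
        # Within tolerance: no receiver needs to adjust.
        return asynchrony, {r: 0.0 for r in playout_positions}

    # A negative adjustment means the receiver should delay its playout.
    return asynchrony, {r: reference - p for r, p in playout_positions.items()}


reports = {"alice": 10.0, "bob": 10.5, "carol": 10.25}
asynchrony, adjustments = idms_adjustments(reports)
```

With these example reports the group is 0.5 s out of sync, so `bob` and `carol` would be asked to delay playout by 0.5 s and 0.25 s respectively; other reference policies (for example the mean position) would fit the same interface.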
Visual object analysis using regions and interest points BIBAFull-Text 1075-1078
  Carles Ventura
This dissertation research will explore region-based and interest points based image representations, two of the most-used image models for object detection, image classification, and visual search among other applications. We will analyze the relationship between both representations with the goal of proposing a new hybrid representation that takes advantage of the strengths and overcomes the weaknesses of both approaches. More specifically, we will focus on the gPb-owt-ucm segmentation algorithm and the SIFT local features since they are the most contrasted techniques in their respective fields. Furthermore, using an object retrieval benchmark, this dissertation research will analyze three basic questions: (i) the usefulness of an interest points hierarchy based on a contour strength signal, (ii) the influence of the context on both interest points location and description, and (iii) the analysis of regions as spatial support for bundling interest points.

Art session overview

The ACM multimedia 2013 art exhibition BIBAFull-Text 1079-1082
  Marc Cavazza; Antonio Camurri
The Art Exhibition of ACM Multimedia 2013 has attracted significant work from a variety of digital artists collaborating with research institutions. We have endeavored to select exhibits that achieved an interesting balance between technology and artistic intent. The techniques underpinning these artworks are relevant to several technical tracks of the conference, in particular those dealing with human-centered and interactive media. We briefly review how the various installations revisit current topics in Multimedia research, focusing more specifically on their approach to dynamic content generation, user experience, multimodality, and affective interfaces. Once again, the unique blend of technology and user experience does not limit itself to showcasing recent advances in interactive media, and should be of interest to all conference participants as well as the general public the exhibition space will be open to.

Workshops overview

4th ACM/IEEE ARTEMIS 2013 international workshop on analysis and retrieval of tracked events and motion in imagery streams BIBAFull-Text 1083-1084
  Anastasios Doulamis; Nikolaos Doulamis; Marco Bertini; Jordi Gonzalez; Thomas Moeslund
In this paper, we give a short summary of the papers presented at ACM ARTEMIS 2013, which is held in Barcelona, Spain, in conjunction with ACM Multimedia. The workshop covers feature analysis at both low and high level for efficient event detection, retrieval of multimedia events and objects, video synchronization issues, and event and behavior recognition from visual data. All papers were classified into three sessions of a single track workshop. The first session, named "Video Features and Scene Analysis", includes articles that handle low level and high level visual analysis appropriate for event detection. The second session, entitled "Retrieval of Multimedia Objects/Events", applies schemes for media data retrieval and video synchronization. Finally, the third session, "Analysis of Visual Events", describes algorithms for detecting actions, behaviors and events in complex visual scenes.
Workshop summary for the 3rd international audio/visual emotion challenge and workshop (AVEC'13) BIBAFull-Text 1085-1086
  Michel Valstar; Björn Schuller; Jarek Krajewski; Roddy Cowie; Maja Pantic
The third Audio-Visual Emotion Challenge and workshop AVEC 2013 will be held in conjunction with ACM Multimedia'13. Like the 2012 edition of AVEC, the workshop/challenge addresses the interpretation of social signals represented in both audio and video in terms of the high-level continuous dimensions arousal and valence, but importantly this year the data is that of a large number of clinically depressed patients and controls, with a sub-challenge in self-reported severity of depression estimation. Like both previous AVECs, the aim is to bring together the audio and video analysis communities.
Workshop summary for the 5th international workshop on multimedia for cooking and eating activities (CEA'13) BIBAFull-Text 1087-1088
  Kiyoharu Aizawa; Yoko Yamakata; Takuya Funatomi
This summary introduces the aim of the CEA'13 workshop and the list of papers presented in the workshop.
ACM multimedia 2013 workshop on crowdsourcing for multimedia BIBAFull-Text 1089-1090
  Kuan-Ta Chen; Wei-Ta Chu; Martha Larson
The topic "Crowdsourcing for Multimedia" encompasses the full range of techniques that combine human intelligence and a large number of individual contributors to advance the state of the art in multimedia research. The ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia (CrowdMM 2013) provided a forum for presenting new crowdsourcing techniques, exchanging innovative crowdsourcing ideas, and discussing crowdsourcing best practices for multimedia. The workshop program consisted of presented papers, a keynote speech and a panel discussion. A special feature of this year's workshop was the "Crowdsourcing for Multimedia Ideas Competition", the results of which were presented at the workshop.
Second ACM multimedia workshop on geotagging and its applications in multimedia (GeoMM 2013) BIBAFull-Text 1091-1092
  Liangliang Cao; Gerald Friedland; Pascal Kelm
The Workshop on Geotagging and Its Applications in Multimedia (GeoMM 2013) focuses on new applications and methods in geotagging and geo-location support systems. As location-based multimedia becomes more and more popular in the era of Web and mobile applications, the increase in the use of geotagging and improvements in geo-location support systems open up a new dimension for the description, organization and manipulation of multimedia data. This new dimension radically expands the usefulness of multimedia data both for daily users of the Internet and social networking sites and for experts in particular application scenarios. The workshop serves as a venue for the premier research in geotagging and multimedia, and continues to attract submissions from a diverse set of researchers, who address newly arising problems within this emerging field.
Fourth international workshop on human behavior understanding (HBU 2013) BIBAFull-Text 1093-1094
  Albert Ali Salah; Hayley Hung; Oya Aran; Hatice Gunes
With advances in pattern recognition and multimedia computing, it became possible to analyze human behavior via multimodal sensors, at different time-scales and at different levels of interaction and interpretation. This ability opens up enormous possibilities for multimedia and multimodal interaction, with a potential of endowing the computers with a capacity to attribute meaning to users' attitudes, preferences, personality, social relationships, etc., as well as to understand what people are doing, the activities they have been engaged in, their routines and lifestyles. This workshop gathers researchers dealing with the problem of modeling human behavior under its multiple facets with particular attention to interactions in arts, creativity, entertainment and edutainment.
Immersive media experiences: ImmersiveMe 2013 workshop at ACM multimedia BIBAFull-Text 1095-1096
  Teresa Chambel; V. Michael Bove; Sharon Strover; Paula Viana; Graham Thomas
Immersive media has the potential for strong impact on users' emotions and their sense of presence and engagement. The main objective of this workshop is to bring together researchers, students, media producers, service providers and industry players in the area of emergent immersive media. The workshop will provide a platform for a deep discussion on ongoing work, recent achievements and experiences. It is expected not only to consolidate experiences but also to identify aspects where strong collaboration among all the interested players is needed and to point towards future working directions.
The third ACM international workshop on interactive multimedia on mobile and portable devices (IMMPD'13) BIBAFull-Text 1097-1098
  Jiebo Luo; Caifeng Shan; Ling Shao; Minoru Etoh
With mobile and portable devices becoming ubiquitous in people's daily life, how to design user interfaces for these products that enable natural, intuitive and fun interaction is one of the main challenges the multimedia community is facing. Following previous successful events, the third ACM International workshop on Interactive Multimedia on Mobile and Portable Devices (IMMPD'13) aims to bring together researchers from both academia and industry in domains including computer vision, audio and speech processing, machine learning, pattern recognition, communications, human-computer interaction, and media technology to share and discuss recent advances in interactive multimedia.
2nd international workshop on socially-aware multimedia (SAM'13) BIBAFull-Text 1099-1100
  Pablo Cesar; Matthew Cooper; David A. Shamma; Doug Williams
Multimedia social communication is becoming commonplace. Television is becoming smart and social; media sharing applications are transforming the way we converse and recall events; and videoconferencing is a common application on our computers, phones, tablets and even televisions. The confluence of computer-mediated interaction, social networking, and multimedia content is radically reshaping social communications, bringing new challenges and opportunities. This workshop, in its second edition, provides an opportunity to explore socially-aware multimedia, in which the social dimension of mediated interactions between people is considered to be as important as the characteristics of the media content. Even though this social dimension is implicitly addressed in some current solutions, further research is needed to better understand what makes multimedia socially-aware.
Summary abstract for the 2nd ACM international workshop on multimedia analysis for ecological data BIBAFull-Text 1101-1102
  Concetto Spampinato; Vasileios Mezaris; Jacco van Ossenbruggen
The 2nd ACM International Workshop on Multimedia Analysis for Ecological Data (MAED'13) is held as part of ACM Multimedia 2013. MAED'13, following the first workshop of the MAED series (MAED'12) that was held as part of ACM Multimedia 2012, is concerned with the processing, interpretation, and visualization of ecology-related multimedia content with the aim to support biologists in their investigations for analyzing and monitoring natural environments.
ACM MM MIIRH 2013: workshop on multimedia indexing and information retrieval for healthcare BIBAFull-Text 1103-1104
  Jenny Benois-Pineau; Alexia Briassouli; Alexander Hauptmann
Healthcare systems are depending on increasingly sophisticated and ubiquitous technology, while telehealth is rapidly gaining importance with the advent of low-cost and effective technological solutions in medicine. The increase in the worldwide elderly population and the burden this is inflicting upon the workforce, societies and economies are making remote care and independent living at home a necessity. MIIRH is the first workshop on multimedia analysis for remote care and assisted living solutions that enable people who are incapacitated in some regard to continue living independently at home and remain active members of society. The topics addressed in MIIRH are extremely timely, as multitudes of cost-effective and high quality care solutions are already being developed and used, rendering the examination of new medical and healthcare paradigms an absolute necessity.
Summary abstract for the 1st ACM international workshop on personal data meets distributed multimedia BIBAFull-Text 1105-1106
  Vivek K. Singh; Tat-Seng Chua; Ramesh Jain; Alex Sandy Pentland
Multimedia data are now created at a macro, public scale as well as at an individual, personal scale. While distributed multimedia streams (e.g. images, microblogs, and sensor readings) have recently been combined to understand multiple spatio-temporal phenomena like epidemic spreads, seasonal patterns, and political situations, personal data (via mobile sensors, quantified-self technologies) are now being used to identify user behavior, intent, affect, social connections, health, gaze, and interest level in real time. An effective combination of the two types of data can revolutionize multiple applications ranging from healthcare, to mobility, to product recommendation, to content delivery. Building systems at this intersection can lead to better orchestrated media systems that may also improve users' social, emotional and physical well-being. For example, users trapped in risky hurricane situations can receive personalized evacuation instructions based on their health, mobility parameters, and distance to nearest shelter. This workshop brings together researchers interested in exploring novel techniques that combine multiple streams at different scales (macro and micro) to understand and react to each user's needs.


Semantic technologies for multimedia content: foundations and applications BIBAFull-Text 1107-1108
  Ansgar Scherp
Higher-level semantics for multimedia content is essential to answer questions like "Give me all presentations of German Physicists of the 20th century". The tutorial provides an introduction to and overview of such semantics and the developments in multimedia metadata. It introduces current advancements for describing media on the web using Linked Open Data and other more expressive semantic technologies. The application of such technologies will be shown with concrete examples.
Towards next generation multimedia recommendation systems BIBAFull-Text 1109-1110
  Jialie Shen; Xian-Sheng Hua; Emre Sargin
Empowered by advances in information technology, such as social media networks, digital libraries and mobile computing, an ever-increasing amount of multimedia data is emerging. As the key technology for addressing the problem of information overload, multimedia recommendation systems have received a lot of attention from both industry and academia. This course aims to 1) provide a series of detailed reviews of the state of the art in multimedia recommendation; 2) analyze key technical challenges in developing and evaluating next generation multimedia recommendation systems from different perspectives; and 3) give some predictions about the road that lies ahead of us.
Crowdsourcing for multimedia research BIBAFull-Text 1111-1112
  Mohammad Soleymani; Martha Larson
Crowdsourcing techniques make use of the intelligent contributions of a large number of human crowd members. This tutorial introduces researchers to the applications of crowdsourcing to multimedia analysis with the aim of allowing them to understand the potential and limitations of crowdsourcing tools and techniques. We emphasize the fact that crowdsourcing represents a further development along a pre-existing continuum of techniques, and discuss the added advantages that new developments offer. We provide a basic overview of human computation, with an emphasis on example cases in which crowdsourcing has been applied to generate data sets, to improve automatic multimedia content analysis, and to elicit user needs or multimedia system requirements. Different techniques and considerations in using human computation methods to acquire high-quality data and annotations are discussed and demonstrated.
Massive-scale multimedia semantic modeling BIBAFull-Text 1113-1114
  John R. Smith; Liangliang Cao
Visual data is exploding! 500 billion consumer photos are taken each year world-wide, and 633 million photos are taken per year in NYC alone. 120 new video-hours are uploaded to YouTube per minute. The explosion of digital multimedia data is creating a valuable open source for insights. However, the unconstrained nature of 'image/video in the wild' makes it very challenging for automated computer-based analysis. Furthermore, the most interesting content in the multimedia files is often complex in nature, reflecting a diversity of human behaviors, scenes, activities and events. To address these challenges, this tutorial will provide a unified overview of two emerging techniques, semantic modeling and massive-scale visual recognition, with the goal of both introducing people from different backgrounds to this exciting field and reviewing state-of-the-art research in the new computational era.
Social interactions over geographic-aware multimedia systems BIBAFull-Text 1115-1116
  Roger Zimmermann; Yi Yu
User-centric Internet multimedia scenes challenge us to discover more interesting events and topics based on matching users' needs associated with personal preferences, geographic interests and social norms. Geotagged multimedia contents from online social sites, e.g., Flickr and Twitter, provide large volumes of data about many given locations. Hence, location is one of the most important user-generated contexts and contains rich information about an individual's interests and behavior. Location-based social multimedia streams (e.g., tweets, videos, images) can provide us with socially complementary information to predict users' needs.
   By making use of geo-tagging, a cohesive set of social multimedia streams can be published to facilitate a more accurate analysis of user-centric big data information and further assess user tastes based on location activities. This tutorial not only delivers a better understanding of the basics of location-aware contextual descriptions and their relations to social multimedia scenes, but may also serve to highlight relationships that can be collaboratively applied to multimodal retrieval and recommendation technology.
Multimedia information retrieval: music and audio BIBFull-Text 1117-1118
  Markus Schedl; Emilia Gómez; Masataka Goto
Blending the physical and the virtual in music technology: from interface design to multi-modal signal processing BIBAFull-Text 1119-1120
  George Tzanetakis; Sidney Fels; Michael Lyons
Recent years have seen a significant increase in interest in rich multi-modal user interfaces going beyond conventional mouse/keyboard/screen interaction. The new interface technologies are broadly impacting music technology and culture. New musical interfaces use a variety of sensing (and actuating) modalities to receive and present information to users, and often require techniques from signal processing and machine learning in order to extract and fuse high-level information from noisy, high-dimensional signals over time. Hence they pose many interesting signal processing challenges while offering fascinating possibilities for new research. At the same time, the richness of possibilities for new forms of musical interaction requires a new approach to the design of musical technologies and has implications for performance aesthetics and music pedagogy. This tutorial begins with a general and gentle introduction to the theory and practice of the design of new technologies for musical creation and performance. It continues with an overview of signal processing and machine learning methods which are needed for more advanced work in new musical interface design.
Privacy concerns of sharing multimedia in social networks BIBAFull-Text 1121-1122
  Gerald Friedland
This article summarizes the corresponding 3-hour tutorial at ACM Multimedia 2013.