Preface
Page: iii-v (3)
Author: Xian-Sheng Hua, Marcel Worring and Tat-Seng Chua
DOI: 10.2174/9781608052158113010002
List of Contributors
Page: vi-ix (4)
Author: Xian-Sheng Hua, Marcel Worring and Tat-Seng Chua
DOI: 10.2174/9781608052158113010003
An Image Decomposition Approach to Large-scale Image Retrieval
Page: 3-25 (23)
Author: Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
DOI: 10.2174/9781608052158113010004
Abstract
Thanks to the rapid development of the Internet and of image capture devices, the number of images available online has grown exponentially. Efficient indexing and retrieval methods are crucial in order to leverage the web image dataset. This has an important impact on a number of research areas such as image recognition, image retrieval and computer graphics. In this chapter, we review the currently popular image representations and the corresponding large-scale indexing technologies. For global representations, we review tree- and hash-based index structures. For local features, which have recently received much attention for their invariance to lighting, scale and rotation, we review inverted list indexing and the related “long query problem”. Then we introduce an image decomposition approach that converts the local feature representation from high-dimensional sparse feature vectors to (relatively) low-dimensional dense feature vectors with residual information. We also discuss a specially designed index structure to facilitate efficient storage and retrieval for this image representation. At the end of the chapter, we present extensive experimental results on a 2.3 million image database to demonstrate the efficacy of the image decomposition approach.
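To make the decomposition idea concrete, here is a minimal numpy sketch, not the authors' exact method: a sparse bag-of-visual-words vector is projected onto a small learned basis to obtain a dense code, and whatever the code cannot explain is kept as a residual. All names and dimensions are illustrative.

```python
import numpy as np

# Toy illustration of sparse-to-dense decomposition with a residual.
rng = np.random.default_rng(0)
vocab_size, n_images, k = 1000, 200, 32

# Sparse bag-of-visual-words histograms (most entries zero).
X = rng.random((n_images, vocab_size)) * (rng.random((n_images, vocab_size)) < 0.02)

# Learn a k-dimensional basis from the collection (here via truncated SVD).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:k]                       # (k, vocab_size)

def decompose(x, basis):
    """Return (dense code, residual) for one sparse BoW vector."""
    dense = basis @ x                # low-dimensional dense code
    residual = x - basis.T @ dense   # what the dense code cannot explain
    return dense, residual

dense, residual = decompose(X[0], basis)
print(dense.shape, np.count_nonzero(np.abs(residual) > 1e-6))
```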
Near-Duplicate Web Video Detection
Page: 26-57 (32)
Author: Xiao Wu, Wan-Lei Zhao, Chong-Wah Ngo and Alexander G. Hauptmann
DOI: 10.2174/9781608052158113010005
Abstract
The explosive expansion of the social web has made overwhelming amounts of web videos available, among which there are a large number of near-duplicate videos. Current web video search results rely exclusively on text keywords or user-supplied tags. A search on the keywords of a typical popular video often returns many duplicate and near-duplicate videos in the top results. Efficient near-duplicate web video detection is essential for effective search, retrieval, browsing and annotation. Due to the large variety of near-duplicate web video types, ranging from simple formatting to complex editing, accurate detection generally comes at the cost of time complexity, particularly for web-scale video applications. On the other hand, timely response to user queries is one important factor that fuels the popularity of the social web. This chapter reviews approaches for near-duplicate web video detection from different technical viewpoints: combining global and local features, integrating content and contextual information, and visual-word based scalable retrieval.
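As a toy illustration of the global-feature side mentioned above (a sketch with assumed toy features, not the chapter's detectors), a cheap whole-video signature can prune obvious non-duplicates before any expensive local-feature or contextual analysis:

```python
import numpy as np

def video_signature(frame_histograms):
    """Average per-frame color histograms into one global signature."""
    sig = np.mean(frame_histograms, axis=0)
    return sig / (np.linalg.norm(sig) + 1e-12)

def is_near_duplicate(sig_a, sig_b, threshold=0.95):
    """Cosine-similarity test; survivors go on to finer local matching."""
    return float(sig_a @ sig_b) >= threshold

rng = np.random.default_rng(1)
a = video_signature(rng.random((30, 64)))   # 30 frames, 64-bin histograms
b = video_signature(rng.random((30, 64)))
print(is_near_duplicate(a, b))
```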
An Efficient Visual Representation and Indexing Scheme for Large Scale Video Retrieval
Page: 58-79 (22)
Author: Xiao Zhang, Gang Hua and Lei Zhang
DOI: 10.2174/9781608052158113010006
Abstract
This chapter presents technical advancements toward web-scale video retrieval. Two methods for efficient visual representation and indexing of web videos are discussed. First, a video representation named the interest seam image is presented, which considers both the spatial and the temporal information contained in a video and is therefore more discriminative than previous video representations (such as those based on key-frames). Second, an indexing system capable of dealing with web-scale data is presented. The system combines both local and global descriptors, and embeds the geometric configuration of interest points into the index to simultaneously improve retrieval accuracy, speed and memory footprint. The efficacy and efficiency of the proposed methods are demonstrated in large-scale experiments on real web videos.
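The interest seam idea can be approximated with standard seam-carving dynamic programming. The sketch below is a loose illustration rather than the chapter's implementation: it finds the vertical seam of maximum accumulated interest in each frame's saliency map and stacks the frame pixels along those seams as columns of a small 2D image; the frames and saliency maps are random stand-ins.

```python
import numpy as np

def max_interest_seam(energy):
    """Column index per row of the vertical seam with maximal total
    interest, found by the classic seam-carving dynamic program."""
    h, w = energy.shape
    acc = energy.copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            acc[i, j] += acc[i - 1, lo:hi].max()
    seam = [int(np.argmax(acc[-1]))]               # best bottom cell
    for i in range(h - 2, -1, -1):                 # backtrack upward
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam.append(lo + int(np.argmax(acc[i, lo:hi])))
    return seam[::-1]

rng = np.random.default_rng(0)
frames = rng.random((10, 24, 32))                  # 10 toy grayscale frames
saliency = rng.random((10, 24, 32))                # stand-in interest maps
seam_image = np.column_stack(
    [f[np.arange(f.shape[0]), max_interest_seam(s)]
     for f, s in zip(frames, saliency)])
print(seam_image.shape)                            # (height, n_frames)
```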
Large-Scale Online Multi-Labeled Annotation for Multimedia Search and Mining
Page: 80-110 (31)
Author: Guo-Jun Qi and Hong-Jiang Zhang
DOI: 10.2174/9781608052158113010007
Abstract
In this chapter, we briefly review online learning algorithms that enable content-based multimedia annotation which scales to large multimedia collections as well as to large sets of semantic concepts. Multimedia search can use annotated semantic concepts to achieve efficient content-based indexing, a promising direction toward real content-based multimedia search. However, due to the large number of multimedia samples and semantic concepts, existing techniques for automatic multimedia annotation cannot handle large-scale corpora and concept sets, in terms of both annotation accuracy and computational cost. To enable large-scale semantic concept annotation, a practical multimedia annotation method ought to be scalable along both the sample dimension and the concept-label dimension. In real-world settings, large numbers of unlabeled multimedia samples arrive consecutively in batches, together with an initial pre-labeled training set on which a preliminary multi-label classifier is built. For each arriving batch, a multi-label active learning engine selects a set of unlabeled samples, together with a selected set of labels, for which label confirmation is requested from human labelers. An online learner then updates the original classifier by taking the newly labeled sample-label pairs into consideration. This process repeats until all data have arrived; during the process, new labels, even those without any pre-labeled training samples, can be incorporated at any time. In this chapter, we review large-scale online active annotation for Internet multimedia in terms of these two basic techniques, active learning and online learning. By combining them in a unified framework, scalable multimedia annotation can be achieved in an online manner, so that both annotation accuracy and efficiency are significantly improved.
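The batch-by-batch loop described above can be sketched as follows. This is a schematic with toy data and a random stand-in oracle, using scikit-learn's SGDClassifier as the per-label online learner and margin-based uncertainty for pair selection; it is not the chapter's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_labels, dim, budget = 3, 16, 5
models = [SGDClassifier() for _ in range(n_labels)]   # one per label

# Initial pre-labeled training set.
X0 = rng.standard_normal((40, dim))
Y0 = (rng.random((40, n_labels)) < 0.4).astype(int)
for k, m in enumerate(models):
    m.partial_fit(X0, Y0[:, k], classes=[0, 1])

def oracle(i, k):                        # stand-in for a human labeler
    return int(rng.random() < 0.4)

for batch in range(4):                   # batches arrive consecutively
    Xb = rng.standard_normal((60, dim))
    # Uncertainty of every (sample, label) pair: distance to the boundary.
    margins = np.abs(np.column_stack([m.decision_function(Xb) for m in models]))
    for idx in np.argsort(margins, axis=None)[:budget]:   # most uncertain pairs
        i, k = np.unravel_index(idx, margins.shape)
        y = oracle(i, k)                                  # label confirmation
        models[k].partial_fit(Xb[i:i + 1], [y])           # online update
```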
(Un)Tagged Social Image Retrieval
Page: 111-129 (19)
Author: Xirong Li, Cees G.M. Snoek and Marcel Worring
DOI: 10.2174/9781608052158113010008
Abstract
Social image retrieval is increasingly important for managing and accessing the rapidly growing collection of socially tagged images. In this chapter, we address social image retrieval from two directions, tackling the subjectiveness and the incompleteness of social tagging, respectively. To make subjective social tagging objective, we introduce a simple and effective neighbor voting algorithm to estimate the relevance of a tag with respect to the visual content it describes. To build a concept index for untagged or incompletely tagged images, we study a new learning scenario in which concept detectors are trained on negative examples created by social tagging rather than by traditional expert labeling. Empirical studies on realistic subsets of Flickr data demonstrate the potential of the proposed algorithms for searching (un)tagged social images.
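The neighbor voting rule is simple enough to state in a few lines. Below is a minimal sketch with toy features and tags (the chapter's algorithm operates on Flickr-scale data): a tag's relevance to an image is the number of votes it receives from the image's k visual neighbors, minus the number of votes expected by chance.

```python
import numpy as np

def tag_relevance(query_idx, feats, tag_sets, tag, k=5):
    """Votes for `tag` among the k visual neighbors, minus the prior."""
    dists = np.linalg.norm(feats - feats[query_idx], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]          # skip the image itself
    votes = sum(tag in tag_sets[i] for i in neighbors)
    prior = k * sum(tag in t for t in tag_sets) / len(tag_sets)
    return votes - prior                            # > 0 suggests relevance

rng = np.random.default_rng(0)
feats = rng.random((100, 8))                        # toy visual features
tag_sets = [set(rng.choice(["dog", "cat", "beach"], size=2))
            for _ in range(100)]
print(tag_relevance(0, feats, tag_sets, "dog"))
```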
Adapting Web-based Video Concept Detectors for Different Target Domains
Page: 130-167 (38)
Author: Damian Borth, Adrian Ulges and Thomas M. Breuel
DOI: 10.2174/9781608052158113010009
Abstract
In this chapter, we address the visual learning of automatic concept detectors from web video as available from services like YouTube. While web video allows much more efficient, flexible, and scalable concept learning than expert labels, web-based detectors perform poorly when applied to different domains (such as specific TV channels). We address this domain change problem with a novel approach which, after initial training on web content, performs a highly efficient online adaptation on the target domain.
In quantitative experiments on data from YouTube and from the TRECVID campaign, we first validate that domain change appears to be the key problem for web-based concept learning, with a much more significant impact than other phenomena like label noise. Second, the proposed adaptation approach is shown to improve the accuracy of web-based detectors significantly, even over SVMs trained on the target domain. Finally, we extend our approach with active learning such that adaptation can be interleaved with manual annotation for an efficient exploration of novel domains.
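As a rough sketch of the online-adaptation idea, assuming a linear detector with logistic loss rather than the chapter's specific model, a web-trained weight vector can be refined with cheap stochastic gradient steps as labeled target-domain examples arrive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adapt_online(w, stream, lr=0.1):
    """Logistic-regression SGD updates, starting from web-trained weights."""
    for x, y in stream:                      # target-domain (feature, label)
        grad = (sigmoid(w @ x) - y) * x
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_web = rng.standard_normal(32)              # stands in for the web-trained model
stream = [(rng.standard_normal(32), rng.integers(0, 2)) for _ in range(100)]
w_adapted = adapt_online(w_web, stream)
```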
Social Tag Processing for Internet Images
Page: 168-203 (36)
Author: Dong Liu, Xian-Sheng Hua and Hong-Jiang Zhang
DOI: 10.2174/9781608052158113010010
Abstract
Online social image websites such as Flickr and Zooomr allow users to manually annotate their uploaded images with freely chosen tags, which are then used as indexing keywords to facilitate image search and other applications. However, although the tags are provided by the users, they are generally imprecise and incomplete, and many of them are intrinsically unrelated to the visual content. Moreover, the tags associated with a social image are generally treated uniformly, without any importance or relevance information with respect to the content. These imprecise, incomplete and uniform characteristics of the tags have significantly limited tag-based applications, such as social image search and browsing. In this chapter, we discuss tag processing techniques for improving the quality of the manually entered tags of social images on the Internet, including tag filtering, completion and ranking. We also show various applications that benefit from the processed tags.
Flickr Groups: Multimedia Communities for Multimedia Analysis
Page: 204-222 (19)
Author: Radu-Andrei Negoescu and Daniel Gatica-Perez
DOI: 10.2174/9781608052158113010011
Abstract
In this chapter we review current work that leverages the metadata of large online social networks, in particular Flickr Groups. We briefly present this hugely successful Flickr feature and discuss the various ways in which metadata stemming from users’ interactions with and within groups has been exploited by researchers to improve on state-of-the-art search and browsing algorithms. We then review recent works that have made use of Flickr Groups, whether as a data source, as a way of filtering content, or as a way to reach users for automatic analysis or user studies, and conclude by pointing out potential directions for future research.
Word2Image: A System for Visual Interpretation of Concepts
Page: 223-239 (17)
Author: Haojie Li, Jinhui Tang, Guangda Li and Tat-Seng Chua
DOI: 10.2174/9781608052158113010012
Abstract
Besides traditional textual descriptions that convey the meaning of a word or concept, visual illustration is a complementary and often more intuitive way to interpret a concept. A technique that converts words to images is therefore desirable, though very difficult to build. Since a concept usually has several semantic aspects, we need multiple correct and semantically rich images to represent the concept under different contexts. In this chapter, we explore how to leverage web image collections and existing knowledge resources to fulfill this task, and we develop a novel multimedia application system named Word2Image. Various techniques, including correlation analysis and semantic and visual clustering, are adapted in our system to produce sets of high-quality, precise, diverse and representative images that visually translate a given concept. Objective and subjective evaluations show the feasibility and effectiveness of our approach.
Flickr Distance for Internet Multimedia Search and Mining
Page: 240-272 (33)
Author: Lei Wu and Xian-Sheng Hua
DOI: 10.2174/9781608052158113010013
Abstract
In this chapter, we introduce the Flickr distance (FD), a measure of the visual correlation between concepts. The relationship between concepts reflects human perception, which is formed mainly on the basis of visual information; it is therefore natural to mine conceptual correlations from visual data.
Flickr distance is calculated in two steps: concept modeling and concept distance estimation. In the first step, each concept is assumed to have multiple states, such as front views, side views and multiple semantics, each of which is treated as a latent topic. For each concept, a collection of related images is obtained from the web, and a latent topic visual language model (LTVLM) is built to capture these states. In the second step, the distance between two concepts is estimated by the Jensen-Shannon (JS) divergence between their LTVLMs.
Unlike traditional conceptual distance measurements, which are based on web text documents, FD is based on visual information. Compared with the WordNet distance, FD easily scales up with the size of the conceptual corpus. Compared with the normalized Google distance (NGD) and the tag concurrence distance (TCD), FD uses visual information and can properly measure more kinds of conceptual relations. We apply FD to multimedia-related tasks and find it more helpful than NGD. Based on FD, we also construct a large-scale visual conceptual network (VCNet) to store knowledge of conceptual relationships. Experiments show that FD is more consistent with human perception and helps boost the performance of several applications over existing methods.
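The second step, the JS divergence between two concept models, is easy to illustrate. The sketch below uses two toy distributions in place of the LTVLMs (which are themselves distributions over visual words per latent topic):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence; 0 for identical models, at most 1 (base 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2, 0.0])   # toy stand-in for concept A's model
q = np.array([0.1, 0.4, 0.4, 0.1])   # toy stand-in for concept B's model
print(js_divergence(p, q))
```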
Geospatial Web Image Mining
Page: 273-305 (33)
Author: Dirk Ahlers, Susanne Boll and Philipp Sandhaus
DOI: 10.2174/9781608052158113010014
Abstract
One commonly asked question when confronted with a photograph is “Where is this place?” When talking about a place mentioned on the Web, the question arises “What does this place look like?” Today, these questions cannot reliably be answered for Web images, as they typically do not reveal their relationship to an actual geographic position. Analysis of the keywords surrounding the images, or of the image content alone, has not yet achieved results that would allow deriving precise location information to select representative images. Photos that are reliably tagged with place names or areas cover only a small fraction of the available images and also remain at the keyword level.
Results can be improved for arbitrary Web images by combining features from the Web page as image context and the images themselves as content. We propose a location-based search for Web images that allows finding images that are only implicitly related to a geographic position without having to rely on explicit tagging or metadata. Our spatial Web image search engine first crawls and identifies location-related information on Web pages to determine a geographic relation of the Web page, and then extends this geospatial reference further to assess an image’s location. Combining context and content analysis, we are able to determine if the image actually is a realistic representative photograph of a place and related to a geographic position.
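A schematic of the two-stage pipeline, with a two-entry toy gazetteer standing in for real geoparsing and with no content analysis at all, might look like this:

```python
# Toy gazetteer; real systems use full gazetteers plus disambiguation.
GAZETTEER = {"oldenburg": (53.14, 8.21), "tucson": (32.22, -110.97)}

def geoparse(page_text):
    """Return coordinates of the first known place name on the page."""
    for token in page_text.lower().split():
        if token in GAZETTEER:
            return GAZETTEER[token]
    return None

def geotag_images(page_text, image_urls):
    """Propagate the page's geographic reference to its images."""
    location = geoparse(page_text)
    return {url: location for url in image_urls} if location else {}

print(geotag_images("Harbour festival in Oldenburg this weekend",
                    ["http://example.org/photo1.jpg"]))
```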
Weighted Data Fusion for CBMIR
Page: 306-352 (47)
Author: Peter Wilkins and Alan F. Smeaton
DOI: 10.2174/9781608052158113010015
Abstract
In this chapter we present an overview of data fusion and how it can be applied to the task of Internet multimedia search, specifically content-based multimedia search. The chapter focuses primarily on the weighted combination of ranked results from different retrieval experts to formulate a final ranking for a given content-based information need. The types of data under examination are low-level multimedia features, such as colour histograms and edge descriptors. The chapter reports an extensive series of experiments on a sizable collection of visual media, from which a set of interesting and surprising results emerges.
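The core weighted-combination step can be illustrated with a CombSUM-style fusion over min-max-normalized expert scores. The experts, scores, and weights below are toy values; learning good weights is exactly what the chapter investigates.

```python
from collections import defaultdict

def fuse(expert_results, weights):
    """expert_results: {expert: {doc_id: score}}; returns fused ranking."""
    fused = defaultdict(float)
    for expert, scores in expert_results.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for doc, s in scores.items():
            fused[doc] += weights[expert] * (s - lo) / span   # min-max norm
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

results = {"colour": {"a": 0.9, "b": 0.4, "c": 0.2},
           "edge":   {"b": 0.8, "c": 0.7, "a": 0.1}}
print(fuse(results, {"colour": 0.6, "edge": 0.4}))
```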
Cross-modality Indexing, Browsing and Search of Distance Learning Media on the Web
Page: 353-366 (14)
Author: Alon Efrat, Arnon Amir, Kobus Barnard and Quanfu Fan
DOI: 10.2174/9781608052158113010016
Abstract
People across the world increasingly need to search and browse distance-learning media, which calls for better indexing techniques. In this chapter, we introduce such an indexing technique that makes full use of the slide channel. The slide channel is created by matching slides to video segments, and the video can then be better indexed by mining the information in the slide channel. We also introduce several applications that benefit from the slide channel.
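A toy version of the matching step, with random feature vectors standing in for real slide and keyframe descriptors, assigns each keyframe to its nearest slide and merges runs of equal assignments into slide-channel segments:

```python
import numpy as np

def build_slide_channel(keyframe_feats, slide_feats):
    """Return [(slide_index, start_frame, end_frame)] segments."""
    dists = np.linalg.norm(
        keyframe_feats[:, None, :] - slide_feats[None, :, :], axis=2)
    assign = dists.argmin(axis=1)          # nearest slide per keyframe
    segments, start = [], 0
    for i in range(1, len(assign) + 1):    # merge runs of equal assignments
        if i == len(assign) or assign[i] != assign[start]:
            segments.append((int(assign[start]), start, i - 1))
            start = i
    return segments

rng = np.random.default_rng(0)
print(build_slide_channel(rng.random((12, 16)), rng.random((3, 16))))
```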
Web Music Indexing and Search
Page: 367-396 (30)
Author: Lie Lu and Zhiwei Li
DOI: 10.2174/9781608052158113010017
Abstract
This chapter presents music indexing schemes that exploit two data sources: the surrounding text on web pages and metadata describing music attributes. We first present a framework to index and organize web music by discovering its inherent structure attributes with trustworthy domain knowledge. In this approach, layered LSI spaces are first built to represent the hierarchically structured domain knowledge, a music object representation is then constructed through hyperlink analysis, and the structure attributes of a music object are finally discovered by matching against the domain knowledge. This approach also suggests a new way to organize dispersed information on the Surface Web by using trustworthy Deep Web knowledge. We further present our work on enhancing music indexing with automatic music annotation, which attempts to automatically annotate a music object with a set of semantic labels (tags) in order to create metadata describing its attributes and to facilitate music search, organization, and recommendation. Besides modeling music annotation as a multi-label binary classification task, we also attempt to discover the correlation between semantic labels and present an approach to collective music semantic annotation.
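The LSI ingredient of the framework can be sketched in a few lines: build a term-document matrix from music-related text and project the documents into a low-rank latent space where related documents align. The corpus below is a toy stand-in, and the chapter's layered construction over structured domain knowledge is not reproduced here.

```python
import numpy as np

docs = ["beethoven symphony classical orchestra",
        "rock guitar album band",
        "classical piano sonata beethoven"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * S[:k]          # documents in the k-dim LSI space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vecs[0], doc_vecs[2]))   # the two classical docs align
```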
Video Recommendation
Page: 397-414 (18)
Author: Tao Mei and Kiyoharu Aizawa
DOI: 10.2174/9781608052158113010018
Abstract
This chapter introduces the basic techniques for general recommender systems, as well as specific research on video recommendation, which has become an interesting and important topic in video search and mining. We first define video recommendation and survey the three principal approaches to it, i.e., collaborative filtering, content-based, and hybrid approaches. We also discuss the connection and difference between video recommendation and search. Then, we introduce several recent exemplary recommendation systems, including: (1) “graph-based video recommendation”, which builds a random walk graph based on user-video pairs [5]; (2) “contextual video recommendation by multimodal relevance and user feedback”, which does not require a large collection of user profiles [30, 43]; and (3) “consumer-generated video recommendation”, which discusses ranking distances for recommendation and proposes an edit distance for ranking [22]. The first system uses a collaborative filtering-based approach, while the latter two are content-based.
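As a sketch of the graph-based approach in (1), assuming a simple 0/1 user-video interaction matrix rather than the system's actual graph construction, a random walk with restart over the bipartite graph scores unseen videos for a query user:

```python
import numpy as np

def recommend(adj, user, restart=0.15, iters=50):
    """adj: (n_users, n_videos) 0/1 interaction matrix."""
    n_u, n_v = adj.shape
    # Row-stochastic transition matrix over the bipartite graph.
    P = np.zeros((n_u + n_v, n_u + n_v))
    P[:n_u, n_u:] = adj / np.maximum(adj.sum(1, keepdims=True), 1)
    P[n_u:, :n_u] = adj.T / np.maximum(adj.sum(0)[:, None], 1)
    r = np.zeros(n_u + n_v); r[user] = 1.0        # restart at the query user
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * P.T @ p + restart * r
    scores = p[n_u:] * (adj[user] == 0)           # rank unseen videos only
    return np.argsort(scores)[::-1]

adj = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]])
print(recommend(adj, user=0))
```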
Visual Query Suggestion for Internet Image Search
Page: 415-433 (19)
Author: Zheng-Jun Zha, Linjun Yang and Meng Wang
DOI: 10.2174/9781608052158113010019
Abstract
Query suggestion is an effective approach to bridge the Intention Gap between the users’ search intents and queries. Most existing search engines are able to automatically suggest a list of textual query terms based on users’ current query input, which can be called Textual Query Suggestion. This chapter introduces a new query suggestion scheme named Visual Query Suggestion (VQS) which is dedicated to image search. VQS provides a more effective query interface to help users to precisely express their search intents by joint text and image suggestions. When a user submits a textual query, VQS first provides a list of suggestions, each containing a keyword and a collection of representative images in a dropdown menu. Once the user selects one of the suggestions, the corresponding keyword will be added to complement the initial query as the new textual query, while the image collection will be used as the visual query to further represent the search intent. VQS then performs image search based on the new textual query using text search techniques, as well as content-based visual retrieval to refine the search results by using the corresponding images as query examples.
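The final search stage, combining the expanded textual query with the suggestion's example images, can be sketched as a linear blend of the text score and the best visual similarity to any query image. All features, scores, and the blending weight below are illustrative assumptions, not VQS internals.

```python
import numpy as np

def vqs_search(text_candidates, cand_feats, query_image_feats, alpha=0.5):
    """Blend text score with max visual similarity to the query images."""
    ranked = []
    for (doc, text_score), f in zip(text_candidates, cand_feats):
        sims = query_image_feats @ f / (
            np.linalg.norm(query_image_feats, axis=1) * np.linalg.norm(f))
        ranked.append((doc, alpha * text_score + (1 - alpha) * sims.max()))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(0)
cands = [("img%d" % i, rng.random()) for i in range(5)]   # (id, text score)
print(vqs_search(cands, rng.random((5, 8)), rng.random((3, 8))))
```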
Image Search by Color Sketch
Page: 434-455 (22)
Author: Jingdong Wang, Yinghai Zhao and Xian-Sheng Hua
DOI: 10.2174/9781608052158113010020
Abstract
Most existing Web image search engines exploit the text associated with images (e.g., the surrounding text or the image title) to index them. Searching images by their visual content remains difficult and challenging. Tagging is one solution, but it is limited to certain specific styles or types of images (e.g., dominant color, clip art, line drawing, or photo). In this chapter, we present a new interactive image search system that enables users to interactively indicate their search intention, namely how colors are spatially distributed in the desired images, by scribbling a few color strokes, or by dragging an image and highlighting a few regions of interest, in an intuitive way. Moreover, we propose an effective scheme to mine the latent search intention from the user’s input. Experimental results demonstrate the effectiveness and efficiency of the proposed system.
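A toy version of the matching behind such a system, assuming a coarse grid of mean colors as the spatial color signature (the chapter's actual intention-mining scheme is richer), could look like this: only the grid cells the user actually painted contribute to the distance.

```python
import numpy as np

def grid_colors(image, rows=4, cols=4):
    """Mean RGB per grid cell for an (H, W, 3) image."""
    h, w, _ = image.shape
    return np.array([[image[i*h//rows:(i+1)*h//rows,
                            j*w//cols:(j+1)*w//cols].mean((0, 1))
                      for j in range(cols)] for i in range(rows)])

def sketch_distance(sketch_grid, painted_mask, image_grid):
    """Color distance averaged over the cells the user painted."""
    diff = np.linalg.norm(sketch_grid - image_grid, axis=2)
    return float((diff * painted_mask).sum() / max(painted_mask.sum(), 1))

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
sketch = np.zeros((4, 4, 3)); mask = np.zeros((4, 4))
sketch[0, :] = [0.4, 0.6, 0.9]; mask[0, :] = 1      # blue stroke at the top
print(sketch_distance(sketch, mask, grid_colors(img)))
```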
Index
Page: 456-471 (16)
Author: Xian-Sheng Hua, Marcel Worring and Tat-Seng Chua
DOI: 10.2174/9781608052158113010021
Introduction
With the explosion of video and image data available on the Internet, on desktops and on mobile devices, multimedia search has gained immense importance. Moreover, mining semantics and other useful information from large-scale multimedia data to facilitate online and local multimedia content analysis, search, and related applications has gained increasing attention from both academia and industry. The rapid increase of multimedia data has brought new challenges to multimedia content analysis and retrieval, especially in terms of scalability. On the other hand, large-scale multimedia data also provides new opportunities to address these challenges and other long-standing problems in multimedia analysis. The massive associated metadata, context and social information available on the Internet, desktops and mobile devices, and the large number of grassroots users, are valuable resources that can be leveraged to solve these difficulties. This is the first reference book on the subject of Internet multimedia search and mining, and it will be extremely useful for graduate students, researchers and working professionals in the fields of information technology and multimedia content analysis.