In this half-day tutorial we focus on the computer vision challenges in internet video search, present methods for achieving state-of-the-art performance while maintaining efficient execution, and indicate how to obtain spatiotemporal improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition, the leading competition for video search engines run by NIST, where we have achieved consistent top-2 performance over the years, including the 2008, 2009, 2010, and 2011 editions. The tutorial is especially meant for researchers and practitioners who are new to the field of video search (introductory), people who have started in this direction (intermediate), or people who are interested in a summary of the state of the art in this exciting area (general interest).
The scientific topic of video search is dominated by five major challenges:
- the sensory gap between an object and its many appearances due to the accidental sensing conditions;
- the semantic gap between a visual concept and its lingual representation;
- the model gap between the number of notions in the world and the capacity to learn them;
- the query-context gap between the information need and the possible retrieval solutions;
- the interface gap between the tiny window the screen offers and the amount of data.
Comparative evaluation of methods and systems is imperative to appreciate progress. We discuss the data, tasks, and results of TRECVID, the leading benchmark. In addition, we discuss the many derived community initiatives in creating annotations, baselines, and software for repeatable experiments. We conclude the course with our perspective on the many challenges and opportunities ahead for the visual search community.
Lecture Topics
The technical content of our short course on video search engines is organized as follows:
- Problem statement: scientific, social, and business,
- Course organization: fundamentals, semantics, search, evaluation.
- Invariance: the sensory and semantic gap,
- Local shape: Gaussians, Gabors, and Loweans,
- Texture: natural image statistics, gradients, Weibulls,
- Color: light source, reflection, and representation,
- Motion: optic flow, tracking.
- Descriptors: SIFT, SURF, Daisy, HOG3D, STIP, ColorSIFT,
- Words: hard assignment, soft-assignment, difference coding, geometry,
- Similarities: nearest neighbor, histogram intersection, etc.,
- Classifiers: support vector machines, random forests,
- Localized objects: the visual extent of an object, selective search for where the object might be.
- Concept and event detection: annotation efforts, crowdsourcing, detector performance,
- Translating queries to detectors: textual, visual, semantic, and their combination,
- Interacting with the user through the interface gap: browsing and learning.
- NIST TRECVID Benchmark: data, tasks, and results,
- Benchmark criticism: broad-domain applicability, repeatability, VideOlympics showcase,
- Resources: annotations, baselines, and software,
- Demonstration of the MediaMill Semantic Video Search Engine.
- Concluding remarks: achievements and discussion,
- Future work: challenges and opportunities for the computer vision community.
Several relevant papers are listed on our publication server.
Cees G.M. Snoek received the M.Sc. degree in business information systems (2000) and the Ph.D. degree in computer science (2005), both from the University of Amsterdam, Amsterdam, The Netherlands. He is currently an Assistant Professor in the Intelligent Systems Lab at the University of Amsterdam. He was a visiting scientist at Carnegie Mellon University, Pittsburgh, PA (2003) and at the University of California, Berkeley, CA (2010-2011). His research interest is video and image search. He has published over 100 refereed book chapters, journal papers, and conference papers, and serves on the program committees of the major conferences in multimedia, computer vision, and information retrieval. Dr. Snoek is the lead researcher of the MediaMill Semantic Video Search Engine, which is a consistent top performer in the yearly NIST TRECVID evaluations. He is a co-initiator and co-organizer of the VideOlympics, co-chair of the SPIE Multimedia Content Access conference, and a member of the editorial boards of IEEE MultiMedia and IEEE Transactions on Multimedia. He lectures post-doctoral courses at international conferences and European summer schools. He is the recipient of an NWO Veni award (2008), a Fulbright Junior Scholarship (2010), an NWO Vidi award (2012), and the Netherlands Prize for ICT Research (2012), all for research excellence. Several of his Ph.D. students have won best paper awards, including the IEEE Transactions on Multimedia Prize Paper Award.
Arnold W.M. Smeulders graduated from the Technical University of Delft in physics in 1977 (M.Sc.) and from Leyden University in medicine in 1982 (Ph.D.), on the topic of visual pattern analysis. In 1994, he became full professor in visual information analysis at the University of Amsterdam. He has an interest in cognitive vision, content-based image retrieval, and the picture-language question. He has written over 350 papers in refereed journals and conferences and has been cited 11,000 times. He received a Fulbright grant at Yale University in 1987, and has held visiting professorships at the City University of Hong Kong; Tsukuba, Japan; Modena, Italy; and Cagliari, Italy. He was elected Fellow of the International Association for Pattern Recognition. He was an associate editor of IEEE Transactions on PAMI and of IJCV. Currently, he is with the national research institute CWI, scientific director of the large public-private COMMIT research program in the Netherlands, and chair of the policy committee for ICT research in the Netherlands. He has graduated 40 Ph.D. students.