Parallelism and distribution for very large scale content-based image retrieval

Gylfi Thor Gudmunsson
TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract: The scale of multimedia collections has grown very fast over the last few years. Facebook stores more than 100 billion images, and 200 million more are added every day. To cope with this growth, methods for content-based image retrieval must adapt gracefully; the work presented in this thesis goes in this direction. Two observations drove the design of the high-dimensional indexing technique presented here. First, the collections are so huge, typically several terabytes, that they must be kept on secondary storage; addressing disk-related issues is thus central to our work. Second, all CPUs are now multi-core and clusters of machines are commonplace, so parallelism and distribution are both key for fast indexing and high-throughput batch-oriented searching. This manuscript describes a high-dimensional indexing technique called eCP, whose design accounts for the constraints associated with using disks, parallelism, and distribution. At its core is a non-iterative, unstructured vectorial quantization scheme; eCP builds on an existing indexing scheme that is main-memory oriented. Our first contribution is a set of extensions for processing very large data collections, reducing indexing costs, and making the best use of disks. The second contribution proposes multi-threaded algorithms for both building and searching, harnessing the power of multi-core processors. Datasets for evaluation contain about 25 million images, or over 8 billion SIFT descriptors. The third contribution addresses distributed computing: we adapt eCP to the MapReduce programming model and use the Hadoop framework and HDFS for our experiments. This time we evaluate eCP's ability to scale up with a collection of 100 million images (more than 30 billion SIFT descriptors) and its ability to scale out by running experiments on more than 100 machines.
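The core idea mentioned in the abstract, a non-iterative unstructured vectorial quantization scheme in the cluster-pruning family, can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation (eCP additionally handles disks, multi-threading, and MapReduce, and uses a hierarchy of leaders); the function names, flat single-level structure, and parameters below are illustrative assumptions. Leaders are sampled at random rather than refined by k-means iterations, and a query probes only the few nearest clusters:

```python
import numpy as np

def build_index(vectors, num_clusters, seed=0):
    """Non-iterative quantization: leaders are sampled, never refined."""
    rng = np.random.default_rng(seed)
    leader_ids = rng.choice(len(vectors), size=num_clusters, replace=False)
    leaders = vectors[leader_ids]
    # Assign every vector to its nearest leader (one pass, no iterations).
    dists = np.linalg.norm(vectors[:, None, :] - leaders[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    clusters = {c: np.where(assignment == c)[0] for c in range(num_clusters)}
    return leaders, clusters

def search(query, leaders, clusters, vectors, k=5, probe=1):
    """Approximate k-NN: scan only the `probe` clusters nearest to the query."""
    order = np.linalg.norm(leaders - query, axis=1).argsort()[:probe]
    candidates = np.concatenate([clusters[c] for c in order])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[dists.argsort()[:k]]
```

Skipping the k-means refinement loop is what makes index construction a single pass over the data, which matters when the collection (billions of SIFT descriptors) lives on disk; probing more clusters at query time trades speed for accuracy.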
Cited literature: 43 references
Submitted on: Thursday, January 9, 2014
Last modification on: Saturday, June 25, 2022
Long-term archiving on: Thursday, April 10, 2014


Version validated by the jury (STAR)


HAL Id: tel-00926069, version 1


Gylfi Thor Gudmunsson. Parallelism and distribution for very large scale content-based image retrieval. Other [cs.OH]. Université Rennes 1, 2013. English. ⟨NNT : 2013REN1S082⟩. ⟨tel-00926069⟩

