Parallelism and distribution for very large scale content-based image retrieval

Gylfi Thor Gudmundsson 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : The scale of multimedia collections has grown very fast over the last few years. Facebook stores more than 100 billion images, and 200 million more are added every day. To cope with this growth, methods for content-based image retrieval must adapt gracefully. The work presented in this thesis goes in this direction. Two observations drove the design of the high-dimensional indexing technique presented here. First, the collections are so large, typically several terabytes, that they must be kept on secondary storage. Addressing disk-related issues is thus central to our work. Second, all CPUs are now multi-core and clusters of machines are commonplace. Parallelism and distribution are both key to fast indexing and high-throughput batch-oriented searching. This manuscript describes a high-dimensional indexing technique called eCP. Its design takes into account the constraints associated with using disks, parallelism and distribution. At its core is a non-iterative unstructured vector quantization scheme. eCP builds on an existing indexing scheme that is main-memory oriented. Our first contribution is a set of extensions for processing very large data collections, reducing indexing costs and making the best use of disks. The second contribution proposes multi-threaded algorithms for both building and searching the index, harnessing the power of multi-core processors. The evaluation datasets contain about 25 million images, or over 8 billion SIFT descriptors. The third contribution addresses distributed computing. We adapt eCP to the MapReduce programming model and use the Hadoop framework and HDFS for our experiments. This time we evaluate eCP's ability to scale up with a collection of 100 million images, or more than 30 billion SIFT descriptors, and its ability to scale out by running experiments on more than 100 machines.
Document type : Theses
Cited literature [43 references]

https://tel.archives-ouvertes.fr/tel-00926069
Contributor : Abes Star
Submitted on : Thursday, January 9, 2014 - 9:46:20 AM
Last modification on : Friday, November 16, 2018 - 1:27:52 AM
Long-term archiving on : Thursday, April 10, 2014 - 2:00:11 AM

File

GUDMUNDSSON_Gylfi_Thor.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00926069, version 1

Citation

Gylfi Thor Gudmundsson. Parallelism and distribution for very large scale content-based image retrieval. Other [cs.OH]. Université Rennes 1, 2013. English. ⟨NNT : 2013REN1S082⟩. ⟨tel-00926069⟩

Metrics

Record views : 615
File downloads : 1006