
Massive distribution for indexing and mining time series

Djamel-Edine Yagoubi 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Time series arise in many application domains such as finance, agronomy, health, earth monitoring, and weather forecasting, to name a few. Because of advances in sensor technology, such applications may produce millions to trillions of time series per day, requiring fast analytical and summarization techniques. The processing of these massive volumes of data has opened up new challenges in time series data mining. In particular, indexing techniques, which have shown poor performance when processing large databases, need to be improved. In this thesis, we focus on the problem of parallel similarity search in such massive sets of time series. For this, we first need to develop efficient search operators that can query a very large distributed database of time series with low response times. The search operator can be implemented by using an index constructed before executing the queries. The objective of indices is to improve the speed of data retrieval operations. In databases, an index is a data structure that, based on search criteria, efficiently locates the data entries satisfying the requirements. Indices often make the response time of the lookup operation sublinear in the database size. After reviewing the state of the art, we propose three novel approaches for parallel indexing and querying of large time series datasets. First, we propose DPiSAX, a novel and efficient parallel solution that includes a parallel index construction algorithm that takes advantage of distributed environments to build iSAX-based indices over vast volumes of time series efficiently. Our solution also involves a parallel query processing algorithm that, given a similarity query, exploits the available processors of the distributed system to efficiently answer the query in parallel by using the constructed parallel index. Second, we propose RadiusSketch, a random projection-based approach that scales nearly linearly in parallel environments and provides high-quality answers.
RadiusSketch includes a parallel index construction algorithm that takes advantage of distributed environments to efficiently build sketch-based indices over very large databases of time series, and then query the databases in parallel. Third, we propose ParCorr, an efficient parallel solution for detecting similar time series across distributed data streams. ParCorr uses the sketch principle for representing the time series. Our solution includes a parallel approach for incremental computation of the sketches in sliding windows and a partitioning approach that projects sketch vectors of time series into subvectors and builds a distributed grid structure. Our solutions have been evaluated using real and synthetic datasets, and the results confirm their high efficiency compared to the state of the art.
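The sketch principle mentioned in the abstract can be illustrated with a small, generic example. The code below is a minimal sketch of random projection with +1/-1 vectors, not the thesis's RadiusSketch or ParCorr implementation; the function names, the sketch size, and the seed are all assumptions chosen for illustration. It shows that the distance between two short sketches, scaled by the square root of the sketch size, approximates the Euclidean distance between the original series.

```python
import math
import random


def make_projection(length, sketch_size, seed=0):
    """Build sketch_size random vectors of +1/-1 entries (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(length)]
            for _ in range(sketch_size)]


def sketch(series, projection):
    """Sketch = dot product of the series with each random vector."""
    return [sum(x * r for x, r in zip(series, row)) for row in projection]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimated_distance(sketch_a, sketch_b):
    """Estimate the distance between the original series from their sketches.

    Each squared sketch coordinate difference has expectation ||a - b||^2,
    so averaging over the sketch size and taking the root gives the estimate.
    """
    return euclidean(sketch_a, sketch_b) / math.sqrt(len(sketch_a))


if __name__ == "__main__":
    data_rng = random.Random(42)
    a = [data_rng.gauss(0.0, 1.0) for _ in range(256)]
    b = [data_rng.gauss(0.0, 1.0) for _ in range(256)]
    projection = make_projection(length=256, sketch_size=64)
    print("exact distance:    ", euclidean(a, b))
    print("estimated distance:", estimated_distance(sketch(a, projection),
                                                    sketch(b, projection)))
```

Because the sketch is a linear map, the sketch of a sliding window can in principle be updated from the values entering and leaving the window rather than recomputed from scratch; how the thesis does this is not specified in the abstract.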

Cited literature [80 references]
Contributor : Reza Akbarinia
Submitted on : Wednesday, December 5, 2018 - 11:41:49 AM
Last modification on : Friday, May 17, 2019 - 11:38:56 AM
Long-term archiving on : Wednesday, March 6, 2019 - 1:50:10 PM


Files produced by the author(s)


  • HAL Id : tel-01945348, version 1



Djamel-Edine Yagoubi. Massive distribution for indexing and mining time series. Numerical Analysis [cs.NA]. Université de Montpellier, 2018. English. ⟨tel-01945348⟩


