Computations on Massive Data Sets : Streaming Algorithms and Two-party Communication

Abstract : In this PhD thesis, we consider two computational models that address problems that arise when processing massive data sets.

The first model is the Data Streaming Model. When processing massive data sets, random access to the input data is very costly. Therefore, streaming algorithms only have restricted access to the input data: they sequentially scan the input data once or only a few times. In addition, streaming algorithms use a random access memory whose size is sublinear in the length of the input. Sequential input access and sublinear memory are drastic limitations when designing algorithms. The major goal of this PhD thesis is to explore the limitations and the strengths of the streaming model.

The second model is the Communication Model. When data is processed by multiple computational units at different locations, the message exchange that the participating parties need to synchronize their calculations is often a bottleneck. The amount of communication should hence be as small as possible. A particular setting is the one-way two-party communication setting. Here, two parties collectively compute a function of the input data that is split among the two parties, and the whole message exchange reduces to a single message from one party to the other one.

We study the following four problems in the context of streaming algorithms and one-way two-party communication:

(1) Matchings in the Streaming Model. We are given a stream of edges of a graph G=(V,E) with n=|V|, and the goal is to design a streaming algorithm that computes a matching using a random access memory of size O(n polylog n). The Greedy matching algorithm fits into this setting and computes a matching of size at least 1/2 times the size of a maximum matching. A long-standing open question is whether the Greedy algorithm is optimal if no assumption about the order of the input stream is made. We show that it is possible to improve on the Greedy algorithm if the input stream is in uniform random order. Furthermore, we show that with two passes an approximation ratio strictly larger than 1/2 can be obtained if no assumption on the order of the input stream is made.

(2) Semi-matchings in Streaming and in Two-party Communication. A semi-matching in a bipartite graph G=(A,B,E) is a subset of edges that matches all A vertices exactly once to B vertices, not necessarily in an injective way. The goal is to minimize the maximal number of A vertices that are matched to the same B vertex. We show that for any 0 ≤ ε ≤ 1, there is a one-pass streaming algorithm that computes an O(n^((1-ε)/2))-approximation using Õ(n^(1+ε)) space. Furthermore, we provide upper and lower bounds on the two-party communication complexity of this problem, as well as new results on the structure of semi-matchings.

(3) Validity of XML Documents in the Streaming Model. An XML document of length n is a sequence of opening and closing tags. A DTD is a set of local validity constraints for an XML document. We study streaming algorithms for checking whether an XML document fulfills the validity constraints of a given DTD. Our main result is an O(log n)-pass streaming algorithm with 3 auxiliary streams and O(log^2 n) space for this problem. Furthermore, we present one-pass and two-pass sublinear space streaming algorithms for checking validity of XML documents that encode binary trees.

(4) Budget-Error-Correcting under Earth-Mover-Distance. We study the following one-way two-party communication problem. Alice and Bob have sets of n points on a d-dimensional grid [Δ]^d for an integer Δ. Alice sends a small sketch of her points to Bob, and Bob adjusts his point set towards Alice's point set so that the Earth-Mover-Distance between Bob's points and Alice's points decreases. For any k>0, we show that there is an almost tight randomized protocol with communication cost Õ(kd) such that Bob's adjustments lead to an O(d)-approximation compared to the k best possible adjustments that Bob could make.
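
As an illustration of problem (1), the Greedy matching algorithm mentioned in the abstract can be sketched in a few lines. This is a minimal sketch and not code from the thesis: the function name `greedy_matching`, the edge representation as pairs of vertex labels, and the example stream are assumptions made for illustration only. The sketch makes one pass over the edge stream and keeps an edge exactly when both of its endpoints are still unmatched, so it stores at most n vertex labels and at most n/2 edges.

```python
from typing import Hashable, Iterable, List, Set, Tuple

# Hypothetical edge type for this sketch: a pair of vertex labels.
Edge = Tuple[Hashable, Hashable]


def greedy_matching(edge_stream: Iterable[Edge]) -> List[Edge]:
    """Single pass over the stream: keep an edge iff both endpoints are free."""
    matched: Set[Hashable] = set()   # vertices already covered by the matching
    matching: List[Edge] = []        # edges kept so far

    for u, v in edge_stream:
        # Skip self-loops and edges that touch an already matched vertex.
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


if __name__ == "__main__":
    # Path a-b-c-d with the middle edge arriving first.
    stream = [("b", "c"), ("a", "b"), ("c", "d")]
    print(greedy_matching(stream))  # [('b', 'c')]
```

On this example stream the sketch keeps only the edge (b, c), while the maximum matching {(a, b), (c, d)} has two edges, so the result is exactly half of the optimum. If the same edges arrive in the order (a, b), (b, c), (c, d), the sketch finds both edges of the maximum matching, which shows how much the outcome depends on the arrival order.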
Document type :
Theses

Cited literature [35 references]

https://tel.archives-ouvertes.fr/tel-00859643
Contributor : Abes Star
Submitted on : Monday, September 9, 2013 - 10:32:16 AM
Last modification on : Friday, January 4, 2019 - 5:32:57 PM
Long-term archiving on : Tuesday, December 10, 2013 - 4:24:46 AM

Identifiers

  • HAL Id : tel-00859643, version 1

Citation

Christian Konrad. Computations on Massive Data Sets : Streaming Algorithms and Two-party Communication. Other [cs.OH]. Université Paris Sud - Paris XI, 2013. English. ⟨NNT : 2013PA112120⟩. ⟨tel-00859643⟩

Metrics

Record views : 746
File downloads : 614