An Information-Theoretic Approach to Distributed Learning. Distributed Source Coding Under Logarithmic Loss

Yigit Ugur

Résumé

One substantial question, that is often argumentative in learning theory, is how to choose a `good' loss function that measures the fidelity of the reconstruction to the original. Logarithmic loss is a natural distortion measure in the settings in which the reconstructions are allowed to be `soft', rather than `hard' or deterministic. In other words, rather than just assigning a deterministic value to each sample of the source, the decoder also gives an assessment of the degree of confidence or reliability on each estimate, in the form of weights or probabilities. This measure has appreciable mathematical properties which establish some important connections with lossy universal compression. Logarithmic loss is widely used as a penalty criterion in various contexts, including clustering and classification, pattern recognition, learning and prediction, and image processing. Considering the high amount of research which is done recently in these fields, the logarithmic loss becomes a very important metric and will be the main focus as a distortion metric in this thesis. In this thesis, we investigate a distributed setup, so-called the Chief Executive Officer (CEO) problem under logarithmic loss distortion measure. Specifically, agents observe independently corrupted noisy versions of a remote source, and communicate independently with a decoder or CEO over rate-constrained noise-free links. The CEO also has its own noisy observation of the source and wants to reconstruct the remote source to within some prescribed distortion level where the incurred distortion is measured under the logarithmic loss penalty criterion. One of the main contributions of the thesis is the explicit characterization of the rate-distortion region of the vector Gaussian CEO problem, in which the source, observations and side information are jointly Gaussian. For the proof of this result, we first extend Courtade-Weissman's result on the rate-distortion region of the discrete memoryless (DM) $K$-encoder CEO problem to the case in which the CEO has access to a correlated side information stream which is such that the agents' observations are independent conditionally given the side information and remote source. Next, we obtain an outer bound on the region of the vector Gaussian CEO problem by evaluating the outer bound of the DM model by means of a technique that relies on the de Bruijn identity and the properties of Fisher information. The approach is similar to Ekrem-Ulukus outer bounding technique for the vector Gaussian CEO problem under quadratic distortion measure, for which it was there found generally non-tight; but it is shown here to yield a complete characterization of the region for the case of logarithmic loss measure. Also, we show that Gaussian test channels with time-sharing exhaust the Berger-Tung inner bound, which is optimal. Furthermore, application of our results allows us to find the complete solutions of three related problems: the quadratic vector Gaussian CEO problem with \textit{determinant} constraint, the vector Gaussian distributed hypothesis testing against conditional independence problem and the vector Gaussian distributed Information Bottleneck problem. With the known relevance of the logarithmic loss fidelity measure in the context of learning and prediction, developing algorithms to compute the regions provided in this thesis may find usefulness in a variety of applications where learning is performed distributively. Motivated from this fact, we develop two type algorithms: i) Blahut-Arimoto (BA) type iterative numerical algorithms for both discrete and Gaussian models in which the joint distribution of the sources are known; and ii) a variational inference type algorithm in which the encoding mappings are parameterized by neural networks and the variational bound approximated by Monte Carlo sampling and optimized with stochastic gradient descent for the case in which there is only a set of training data is available. Finally, as an application, we develop an unsupervised generative clustering framework that uses the variational Information Bottleneck (VIB) method and models the latent space as a mixture of Gaussians. This generalizes the VIB which models the latent space as an isotropic Gaussian which is generally not expressive enough for the purpose of unsupervised clustering. We illustrate the efficiency of our algorithms through some numerical examples.

Une problème de fond, qui suscite un intérêt croissant en théorie de l’apprentissage statistique, est de déterminer des limites d'apprentissage d'une architecture donnée. Aussi, comment choisir la structure la mieux adaptée a un problème donné? Quelle fonction 'risque' fait le plus sens? Ces questions sont d'autant plus importantes que les principales critiques faites aux approches actuelles reposent sur le fait que celles-ci sont très largement expérimentales, et donc sans garanties réelles de reproductibilité ('black-box' approaches). Dans cette thèse, nous adoptons une approche théorie de l'information au problème. En réalité, le problème d'inférence statistique est fondamentalement un problème de codage de source sous mesure de distorsion logarithmique ('logarithmic loss'). Cela peut être démontre a partir des principes de base. La mesure de distorsion logarithmique est largement utilisée comme critère de pénalité dans divers contextes, notamment la prédiction linéaire, la classification, reconnaissance de formes, ainsi que le traitement d'images. Motivés par le problème d'apprentissage a partir de données partielles, nous étudions les limites d'apprentissage du problème de 'Multiview learning'. Nous établissons des limites fondamentales sur la capacité d'apprentissage dans ce contexte en étudiant, et résolvent, le problème de codage de source distribue sous contrainte de distorsion logarithmique. Dans ce problème, dit 'Chief Executive Officer (CEO) problem', des agents observent les versions bruitées d'une source distante. Les agents communiquent de façon indépendante avec un décodeur commun via des liens a capacité finie. Le décodeur commun, dit aussi CEO, a également sa propre observation bruite de la source et souhaite reconstruire la source distante avec un niveau de fidélité donne, ou la fidélité est mesurée selon le critère de pénalité de perte logarithmique. Une des contributions principales de cette thèse est la caractérisation explicite de la région taux de compression / distorsion du problème de codage distribue de sources (CEO problem) sous perte logarithmique dans le cas ou les sources sont vectorielles et Gaussiennes. Pour la preuve de ce résultat, nous commençons par étendre un résultat important de Courtade-Weissman pour le cas spécifique de même modèle ou le CEO ne possède pas sa propre observation dans le cas discret sans mémoire. Ensuite, nous établissons une borne externe sur la région taux de compression / distorsion du problème du problème CEO avec sources vectorielles Gaussiennes en évaluant la limite extérieure du modèle discret sans mémoire au moyen d'une technique qui s'appuie sur l'identité de Bruijn et les propriétés des informations de Fisher. L'application de nos résultats au problème d'apprentissage via la technique Information Bottleneck nous permet d'établir le compromis optimal entre niveau de précision et capacité de généralisation. De plus, nous développons des algorithmes d'apprentissage distribue, qui sont applicables aussi bien au cas de données discrètes (exemple, classification) et que de données continues (exemple régression linéaire). En particulier, un de nos algorithme est de type inférence variationnelle dans lequel les mappings sont paramétrés par des réseaux de neurones et la fonction coût approchée par échantillonnage de Monte Carlo et optimisé avec descente de gradient stochastique. Nous montrons que nos algorithmes convergent; et nous proposons des structures d'apprentissage basées sur les réseaux de neurones qui permettent d'implémenter efficacement ces algorithmes. Les algorithmes et architectures développés dans cette these sont valides par des résultats expérimentales. Enfin, en tant qu'application, nous développons un cadre de clustering génératif non supervisé qui utilise le modèle VIB et modélise l'espace latent comme un mélange de gaussiennes.

An Information-Theoretic Approach to Distributed Learning. Distributed Source Coding Under Logarithmic Loss

Approche théorie de l’information à l’apprentissage statistique. Codage distribué de sources sous mesure de fidélité logarithmique

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Lien texte intégral

Citer

Exporter

Collections

Partager