Restricted Boltzmann machines : from compositional representations to protein sequence analysis

Abstract : Restricted Boltzmann machines (RBM) are graphical models that jointly learn a probability distribution and a representation of data. Despite their simple architecture, they can learn complex data distributions very well, such as the MNIST database of handwritten digits. Moreover, they are empirically known to learn compositional representations of data, i.e. representations that effectively decompose configurations into their constitutive parts. However, not all variants of RBM perform equally well, and few theoretical arguments exist to explain these empirical observations. In the first part of this thesis, we ask how such a simple model can learn such complex probability distributions and representations. By analyzing an ensemble of RBM with random weights using the replica method, we have characterised a compositional regime for RBM, and shown under which conditions (statistics of weights, choice of transfer function) it can and cannot arise. Both qualitative and quantitative predictions obtained with our theoretical analysis are in agreement with observations from RBM trained on real data. In the second part, we present an application of RBM to protein sequence analysis and design. Owing to their large size, it is very difficult to run physical simulations of proteins, and to predict their structure and function. It is however possible to infer information about a protein's structure from the way its sequence varies across organisms. For instance, Boltzmann machines can leverage correlations between mutations to predict the spatial proximity of amino acids in the sequence. Here, we have shown on several synthetic and real protein families that, provided a compositional regime is enforced, RBM can go beyond structure and extract extended motifs of coevolving amino acids that reflect phylogenetic, structural and functional constraints within proteins. Moreover, RBM can be used to design new protein sequences with putative functional properties by recombining these motifs at will. Lastly, we have designed new training algorithms and model parametrizations that significantly improve RBM generative performance, to the point where it can compete with state-of-the-art generative models such as Generative Adversarial Networks or Variational Autoencoders on medium-scale data.
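For readers unfamiliar with the model named in the abstract, the following is a minimal sketch of a restricted Boltzmann machine, not the thesis's implementation. It assumes the simplest textbook case of binary (Bernoulli) visible and hidden units, whereas the thesis considers several transfer functions. All function and parameter names (`energy`, `gibbs_step`, `weights`, `visible_bias`, `hidden_bias`) are illustrative choices, and training is omitted entirely. The sketch shows the joint energy over visible and hidden units and one step of alternating (block) Gibbs sampling, which is possible because the units within a layer are conditionally independent given the other layer.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def energy(v, h, weights, visible_bias, hidden_bias):
    # E(v, h) = - sum_i g_i v_i - sum_m c_m h_m - sum_{i,m} w[i][m] v_i h_m
    # (Bernoulli-Bernoulli RBM; an assumption of this sketch).
    e = -sum(g * vi for g, vi in zip(visible_bias, v))
    e -= sum(c * hm for c, hm in zip(hidden_bias, h))
    e -= sum(weights[i][m] * v[i] * h[m]
             for i in range(len(v)) for m in range(len(h)))
    return e


def hidden_probabilities(v, weights, hidden_bias):
    # P(h_m = 1 | v): hidden units are independent given the visible layer.
    return [sigmoid(hidden_bias[m]
                    + sum(weights[i][m] * v[i] for i in range(len(v))))
            for m in range(len(hidden_bias))]


def visible_probabilities(h, weights, visible_bias):
    # P(v_i = 1 | h): visible units are independent given the hidden layer.
    return [sigmoid(visible_bias[i]
                    + sum(weights[i][m] * h[m] for m in range(len(h))))
            for i in range(len(visible_bias))]


def sample_bernoulli(probs, rng):
    # Draw one binary value per probability.
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, weights, visible_bias, hidden_bias, rng):
    # One alternating update: sample h given v, then a new v given h.
    h = sample_bernoulli(hidden_probabilities(v, weights, hidden_bias), rng)
    v_new = sample_bernoulli(visible_probabilities(h, weights, visible_bias), rng)
    return v_new, h


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy parameters: 2 visible units, 2 hidden units (arbitrary values).
    weights = [[1.0, -1.0], [0.5, 0.0]]
    visible_bias = [0.1, -0.2]
    hidden_bias = [0.0, 0.3]
    v = [1, 0]
    for _ in range(5):
        v, h = gibbs_step(v, weights, visible_bias, hidden_bias, rng)
        print(v, h, energy(v, h, weights, visible_bias, hidden_bias))
```

In this picture, each hidden unit responds to a weighted pattern of visible units. The compositional regime discussed in the abstract concerns how many hidden units are strongly activated by a typical data configuration.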

Cited literature : 252 references

https://tel.archives-ouvertes.fr/tel-02183417
Contributor : Abes Star
Submitted on : Monday, July 15, 2019 - 12:05:13 PM
Last modification on : Tuesday, July 16, 2019 - 1:29:39 AM

File

Tubiana-2018-These.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02183417, version 1

Citation

Jérôme Tubiana. Restricted Boltzmann machines : from compositional representations to protein sequence analysis. Physics [physics]. PSL Research University, 2018. English. ⟨NNT : 2018PSLEE039⟩. ⟨tel-02183417⟩

Metrics

Record views : 92
File downloads : 48