Typologies textuelles et partitions musicales : dissimilarités, classification et autocorrélation (Textual typologies and musical scores: dissimilarities, classification and autocorrelation)

Abstract : The first part of this thesis focuses on formalism and methods and is built on three formalised concepts: a contingency table, a Euclidean dissimilarity matrix and an exchange matrix. These concepts allow several Data Analysis and Machine Learning methods to be expressed and developed: Correspondence Analysis (CA), interpreted as a particular case of Multidimensional Scaling; classification and clustering, combined with Schoenberg transformations; and the autocorrelation and cross-autocorrelation indices, adapted to multivariate analysis and able to accommodate various families of neighbourhoods. In the second part, these methods are applied in an Exploratory Data Analysis of textual and musical data of various types.

For textual data, we cluster clauses into discourse types based on the distribution of part-of-speech (POS) tags in the clauses. Although the statistical link between POS tags and discourse types is significant, the results obtained with the K-means algorithm or a fuzzy variant of it, possibly combined with a Schoenberg transformation, remain difficult to interpret. We also address the multi-label classification of turns into dialog acts, again based on the POS tags they contain, but also on lemmas and on the meaning of verbs. The results obtained by discriminant analysis combined with a Schoenberg transformation are promising. Finally, we examine textual autocorrelation, defined in terms of similarities between positions in a text, thought of as a sequence of localised units. In particular, the alternation of word lengths in a text is studied for a family of neighbourhoods of variable span. We also consider presence-absence similarities, based on the occurrence of specific POS tags, as well as semantic similarities between textual positions.

Regarding musical data, we propose to represent a musical score as a contingency table. We begin by using CA and the autocorrelation index to uncover underlying structures within each score. We then apply the same approach to the different voices of a score, with a procedure akin to a fuzzy variant of multiple correspondence analysis that makes use of the cross-autocorrelation index. Both in whole scores and in the individual voices they contain, repeated structures are indeed detected, provided they are not transposed. Finally, we propose to cluster twenty musical scores by four different composers, each score being represented by a contingency table, by introducing a similarity index between pairs of configurations. A majority of the scores are thus successfully grouped by composer.
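
The three building blocks named in the abstract (contingency table, squared Euclidean dissimilarity matrix, exchange matrix) can be illustrated with a minimal Python sketch. It is not taken from the thesis and rests on several assumptions: the dissimilarities derived from a contingency table are taken to be the usual chi-square dissimilarities between row profiles, the neighbourhood is a hypothetical periodic "lag" neighbourhood with uniform weights, and the autocorrelation index is assumed to have the Moran-like form (Δ − Δ_loc) / Δ, which the abstract does not spell out. All function names are illustrative.

```python
import numpy as np


def chi2_dissimilarities(table):
    """Squared chi-square dissimilarities between the rows of a contingency table.

    Assumes every row and every column of the table has a positive total.
    """
    table = np.asarray(table, dtype=float)
    col_weights = table.sum(axis=0) / table.sum()        # relative column totals
    profiles = table / table.sum(axis=1, keepdims=True)  # row profiles
    diff = profiles[:, None, :] - profiles[None, :, :]
    return (diff ** 2 / col_weights).sum(axis=2)


def periodic_exchange_matrix(n, lag):
    """Symmetric exchange matrix linking position i to positions i +/- lag.

    Assumption for illustration: the sequence is treated as periodic, so that
    every position carries the same weight 1/n.
    """
    E = np.zeros((n, n))
    for i in range(n):
        E[i, (i + lag) % n] += 0.5 / n
        E[i, (i - lag) % n] += 0.5 / n
    return E


def autocorrelation_index(D, E):
    """Assumed index (inertia - local inertia) / inertia.

    D holds squared Euclidean dissimilarities; the weights are the row sums of E.
    """
    f = E.sum(axis=1)
    inertia = 0.5 * f @ D @ f
    local_inertia = 0.5 * (E * D).sum()
    return (inertia - local_inertia) / inertia


# Toy contingency table: rows 0 and 1 have identical profiles.
toy_table = [[2, 2], [1, 1], [4, 0]]
D_table = chi2_dissimilarities(toy_table)

# Toy "text" in which short and long words strictly alternate.
word_lengths = np.array([1.0, 3.0, 1.0, 3.0, 1.0, 3.0])
D_lengths = (word_lengths[:, None] - word_lengths[None, :]) ** 2
delta_lag1 = autocorrelation_index(D_lengths, periodic_exchange_matrix(6, 1))
delta_lag2 = autocorrelation_index(D_lengths, periodic_exchange_matrix(6, 2))
```

On this toy sequence the assumed index is close to -1 at lag 1 (neighbours always differ) and close to +1 at lag 2 (neighbours always match), which is the kind of alternation pattern the abstract mentions for word lengths. Changing the `lag` argument mimics a family of neighbourhoods of variable span.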

Cited literature : 125 references

Contributor : Christelle Cocco
Submitted on : Thursday, October 16, 2014 - 3:04:26 PM
Last modification on : Tuesday, January 29, 2019 - 8:06:23 AM
Long-term archiving on : Saturday, January 17, 2015 - 10:15:35 AM


  • HAL Id : tel-01074904, version 1



Christelle Cocco. Typologies textuelles et partitions musicales : dissimilarités, classification et autocorrélation. Méthodes et statistiques. Université de Lausanne, 2014. French. ⟨tel-01074904⟩


