
Concepts et algorithmes pour la découverte des structures formelles des langues

Hervé Déjean 1
1 Equipe Hultech - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract : This work describes a method for uncovering syntactic structures from untagged corpora (no lexicon, only raw text). It can be seen as a continuation of Zellig Harris's distributional work from the 1950s. Following the distributional hypothesis, it uses only formal criteria, with no recourse to semantics.

The method rests on a simple view of language: a linear object in which the boundaries (beginning and end) of the different structures are marked by characteristic elements. The structures delimited in this way are the simple (non-recursive) phrase and the clause, both of which are defined formally and cross-linguistically. The boundary indicators (BIs) of phrases are morphemes (bound or free); the BIs of clauses are morphemes and phrases.
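As a rough illustration of segmentation driven by boundary indicators, the sketch below splits a token sequence wherever a beginning BI appears. It is not the procedure of the thesis: it assumes the beginning BIs are already known free morphemes supplied by hand, ignores bound morphemes and ending BIs, and the names `segment` and `begin_markers` are hypothetical.

```python
def segment(tokens, begin_markers):
    """Split tokens into chunks, opening a new chunk at each beginning marker.

    Assumption: begin_markers is a hand-supplied set of free morphemes.
    Consecutive markers (e.g. "on the") stay in the same chunk.
    """
    chunks, current = [], []
    for tok in tokens:
        is_begin = tok.lower() in begin_markers
        # Close the current chunk only if it already holds a non-marker token.
        if is_begin and any(t.lower() not in begin_markers for t in current):
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks


tokens = "the cat sleeps on the mat".split()
print(segment(tokens, {"the", "on"}))
# [['the', 'cat', 'sleeps'], ['on', 'the', 'mat']]
```

The first chunk still contains the verb, which shows why beginning markers alone are too crude: ending indicators and clause-level elements would be needed as well.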

From this theoretical structure, we derive the list of all the categories an element can belong to (beginning and ending BIs of phrases and clauses). Once structures and categories are identified, we build specific contexts for each category in order to classify all the words in the texts. These contexts are built from prototypical elements, which are easily identified using formal criteria: their behaviour with respect to punctuation marks. A word can thus be assigned to several categories. Categorization deals first with clause elements (such as conjunctions and verb phrases), then with noun phrases.
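One way to picture the use of punctuation as a formal criterion is to measure how often each word directly follows a punctuation mark. Words with a high ratio would be candidates for prototypical clause-beginning elements. This is only a sketch under that assumption: the abstract does not specify the actual criterion, and `PUNCT`, `after_punct_ratio` and `min_count` are hypothetical names.

```python
from collections import Counter

# Assumed punctuation inventory; a real corpus would need a fuller set.
PUNCT = {".", ",", ";", ":", "!", "?"}


def after_punct_ratio(tokens, min_count=2):
    """Return, for each word seen at least min_count times, the fraction
    of its occurrences that immediately follow a punctuation mark."""
    total, after = Counter(), Counter()
    prev = None
    for tok in tokens:
        if tok not in PUNCT:
            word = tok.lower()
            total[word] += 1
            if prev in PUNCT:
                after[word] += 1
        prev = tok
    return {w: after[w] / n for w, n in total.items() if n >= min_count}


text = "he left , but she stayed . but he came back , and she left ."
ratios = after_punct_ratio(text.split())
print(ratios["but"], ratios["he"])
# 1.0 0.0
```

On this toy input, "but" always follows punctuation while "he" never does. The `min_count` threshold drops rare words such as "and", whose ratio would be unreliable.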

The method yields a categorization of words and a segmentation of the corpus into phrases. These concepts and algorithms were partially tested on several natural languages, including French, German, Turkish, Vietnamese, and Swahili.
Contributor : Hervé Déjean
Submitted on : Tuesday, September 4, 2007 - 1:45:17 PM
Last modification on : Tuesday, October 19, 2021 - 11:34:55 PM
Long-term archiving on : Friday, April 9, 2010 - 1:32:55 AM


  • HAL Id : tel-00169572, version 1


Hervé Déjean. Concepts et algorithmes pour la découverte des structures formelles des langues. Théorie et langage formel [cs.FL]. Université de Caen, 1998. Français. ⟨tel-00169572⟩


