
Guided data selection for predictive models

Marie Le Guilly 1, 2 
2 BD - Base de Données
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : Databases and machine learning (ML) have historically evolved as two separate domains: databases are used to store and query data, while ML is devoted to the inference of predictive models, clustering, etc. Despite its apparent simplicity, the “data preparation” step of ML applications turns out to be the most time-consuming one in practice. Interestingly, this step forms the bridge between databases and ML. In this setting, we raise and address three main problems related to data selection for building predictive models. First, the database usually contains more than the data of interest: how can the data the analyst wants be separated from the data she does not want? We propose to view this problem as an imbalanced classification between the tuples of interest and the rest of the database, and we develop an undersampling method based on the functional dependencies of the database. Second, we discuss how to write the query returning the tuples of interest. We propose a SQL query completion solution based on data semantics that starts from a very general query and helps the analyst refine it until it selects the data she requires. Third, assuming the data has been successfully extracted from the database, a natural question follows: is the selected data suited to the ML problem under consideration? Since building a predictive model from the features to the class to predict amounts to providing a function, we point out that it makes sense to first assess whether that function exists in the data. This existence can be studied through the prism of functional dependencies, and we show how they can be used to understand a model’s limitations and to refine the initial data selection if necessary.
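The third point of the abstract can be illustrated with a minimal sketch, which is not taken from the thesis. It checks whether a functional dependency from a set of features to a class attribute holds in a small dataset by looking for counterexamples, that is, tuples that agree on all features but disagree on the class. The function names (`fd_violations`, `fd_holds`) and the toy rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fd_violations(rows, features, target):
    """Return the feature combinations that map to more than one target value.

    Each returned entry is a counterexample to the functional dependency
    features -> target. Hypothetical helper, not the thesis's algorithm.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in features)
        seen[key].add(row[target])
    return {key: values for key, values in seen.items() if len(values) > 1}


def fd_holds(rows, features, target):
    """True when no two rows agree on the features but differ on the target."""
    return not fd_violations(rows, features, target)


# Hypothetical toy data: the last two rows share the same features
# but carry different classes, so no function can map features to class.
rows = [
    {"age": 30, "smoker": "no", "risk": "low"},
    {"age": 30, "smoker": "yes", "risk": "high"},
    {"age": 45, "smoker": "no", "risk": "low"},
    {"age": 45, "smoker": "no", "risk": "high"},
]

print(fd_holds(rows, ["age", "smoker"], "risk"))
print(fd_holds(rows[:3], ["age", "smoker"], "risk"))
```

When the dependency does not hold, the conflicting groups returned by `fd_violations` indicate which tuples no classifier could predict consistently from these features alone, which is one way to read the abstract's remark about understanding a model's limitations.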
Contributor : ABES STAR
Submitted on : Monday, February 1, 2021 - 2:42:21 PM
Last modification on : Tuesday, June 1, 2021 - 2:08:08 PM
Long-term archiving on : Sunday, May 2, 2021 - 7:29:51 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03127360, version 1


Marie Le Guilly. Guided data selection for predictive models. Databases [cs.DB]. Université de Lyon, 2020. English. ⟨NNT : 2020LYSEI072⟩. ⟨tel-03127360⟩


