Mining Software Engineering Data for Useful Knowledge

Boris Baldassari 1, 2
2 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : Repositories used in the development of software projects contain a great deal of valuable information on tools, methods and development practices. This information is not put to use today because it is difficult to retrieve (e.g. for distributed teams using many different tools), because its nature implies specific characteristics (semi-structured, incomplete or inconsistent data with skewed distributions), and because its interpretation and analysis depend heavily on the knowledge and constraints of the domain. The research work carried out in Maisqual addresses these concerns by investigating methods borrowed from the data mining field to help understand, assess and improve the quality of software projects. To that end we identified some common issues encountered in the development and assessment of software projects, and established a list of data mining techniques that could bring useful answers. Several open source software projects were analysed, using metrics extracted from both the code itself and the development process (mainly configuration management and mailing lists). Because of the intrinsic nature of this data and its semantic context, some methods appear better suited than others, e.g. unsupervised clustering, outlier detection and regression analysis. The output of this work is threefold. First, some of the techniques developed were introduced in SQuORE, a professional-grade software quality assessment tool published by SQuORING Technologies; the work accomplished on the analysis process and quality models also brought useful insights for the company's team and product. A second contribution to the domain is the methodological and tooling framework set up for our analysis, which relies on safe and semantically consistent measurement techniques. Finally, we published several software-related data sets under a free licence for the academic and industry communities, to foster research in this area of knowledge.
Contributor : Preux Philippe
Submitted on : Monday, April 4, 2016 - 11:51:34 AM
Last modification on : Thursday, February 21, 2019 - 10:52:49 AM
Long-term archiving on : Monday, November 14, 2016 - 2:51:26 PM


  • HAL Id : tel-01297400, version 1


Boris Baldassari. Mining Software Engineering Data for Useful Knowledge. Machine Learning [stat.ML]. Université de Lille, 2014. English. ⟨tel-01297400⟩


