
Machine learning for performance modelling on colossal software configuration spaces

Hugo Martin
Abstract: Almost all of today's software systems are configurable. Through options, users can modify a system's behavior, add or remove capabilities, improve its performance, or adapt it to different situations. Each option is tied to certain parts of the code, and ensuring that these parts work well together, or cannot be used together, is one of the challenges in developing and using such software products, known as Software Product Lines (SPLs). While this may seem relatively simple with a few options, some systems combine thousands of options spread over millions of lines of code, making the task far more complex.

Over the past decade, researchers have begun applying machine learning techniques to SPL problems. A key problem is predicting properties of the software, such as the execution speed of a task, which can vary greatly depending on the configuration used. Measuring a property for every configuration can be costly and complex, or even impossible in the most extreme cases. Building a model that predicts the system's properties from measurements of only a small fraction of the possible configurations is a task at which machine learning excels. Various solutions have been developed, but they have only been validated in cases where the number of options is quite small, whereas a large portion of SPLs have hundreds or even thousands of options. Without testing machine learning solutions on systems with that many options, it is impossible to know whether they are suitable for such cases. The first contribution of this thesis is the application of machine learning algorithms to a Software Product Line at a scale never before achieved.
Using Linux and its 15,000 options, we determined that linear algorithms, as well as those specialized for SPLs, cannot work properly at this scale. Only tree-based algorithms and neural networks were able to provide a fairly accurate model within reasonable time and memory budgets.

The second contribution is the Feature Ranking List, a list of options ranked by the importance of their impact on a target software property, generated by an improved feature-selection procedure based on decision trees. We evaluated its effect on Linux kernel binary size prediction models under the same conditions as the first contribution. It delivers the desired and best-known effect of feature selection: a major speed-up in learning time, as well as a significant improvement in accuracy for most of the previously considered algorithms.

The third contribution is the improvement of automated performance specialization and its evaluation on different SPLs, including Linux. Performance specialization is a process that adds constraints to an SPL so that configurations meet a performance threshold defined by the user, helping them configure the software. The results show that it is possible to obtain a sufficiently accurate set of rules, even on Linux.
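To make the first two contributions concrete, here is a minimal sketch of the general approach: train a tree-based model on a sample of measured configurations, then rank options by decision-tree feature importance, in the spirit of the Feature Ranking List. The data is synthetic and tiny (50 options, not the 15,000 of Linux), and scikit-learn's stock importances stand in for the thesis's improved feature-selection procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_configs, n_options = 2000, 50  # toy scale; the thesis works with ~15,000 options

# Each row is a configuration: option i enabled (1) or disabled (0).
X = rng.integers(0, 2, size=(n_configs, n_options))

# Synthetic "binary size": a few options dominate, plus one interaction term.
weights = np.zeros(n_options)
weights[:5] = [800, 450, 300, 150, 90]
y = 10_000 + X @ weights + 200 * X[:, 0] * X[:, 1] + rng.normal(0, 20, n_configs)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Feature-ranking analogue: options sorted by decreasing importance.
ranking = np.argsort(model.feature_importances_)[::-1]
print("top options:", ranking[:5])
print("held-out R^2:", round(model.score(X_te, y_te), 3))
```

On this synthetic data the influential options surface at the top of the ranking; the ranking can then be used to discard low-impact options before retraining, which is where the speed-up reported in the thesis comes from.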
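The third contribution can also be sketched in a few lines, under stated assumptions: label each configuration by whether it meets a user-defined threshold, fit a shallow decision tree, and read its paths as constraints over options. The data, threshold, and rule-extraction step here are illustrative stand-ins, not the thesis's actual specialization procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1000, 10))  # 1000 configurations, 10 options
size = 10_000 + 800 * X[:, 0] + 450 * X[:, 1] + rng.normal(0, 20, 1000)

threshold = 10_500                         # user-defined performance budget
meets = (size <= threshold).astype(int)    # 1 = acceptable configuration

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, meets)
rules = export_text(clf, feature_names=[f"OPTION_{i}" for i in range(10)])
print(rules)  # each root-to-leaf path is a candidate constraint on options
```

Paths leading to the "acceptable" class become constraints that can be imposed on the configurator, which is the essence of specialization: the user only sees configurations predicted to stay within the performance budget.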
Submitted on : Saturday, June 18, 2022 - 6:48:43 AM
Last modification on : Friday, August 5, 2022 - 2:54:52 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03698474, version 1


Hugo Martin. Machine learning for performance modelling on colossal software configuration spaces. Artificial Intelligence [cs.AI]. Université Rennes 1, 2021. English. ⟨NNT : 2021REN1S117⟩. ⟨tel-03698474⟩


