IMPROVEMENTS FOR SYMBOLIC REGRESSION ALGORITHMS: OPTIMIZATION, SELECTION, MODEL SIMPLIFICATION, AND SELF-ADAPTATION
Symbolic Regression (SR) is a regression method that searches for mathematical expressions describing the relationship between variables, balancing prediction accuracy and interpretability. The unrestricted search space of solutions is infinitely large, and current approaches face several challenges: i) optimal expression parameters are hard to calculate, ii) candidate selection during the search can affect the outcome, iii) expressions can grow without a justified improvement in prediction error, and iv) the search often follows fixed hyper-parameters, ignoring the current state of the solutions. This thesis compiles the articles developed during the doctorate, each addressing one of these problems to improve the search-space exploration of the genetic programming algorithm and obtain models with high accuracy and low complexity. Linear and non-linear optimization methods were combined, and the results showed that optimizing non-linear and linear coefficients separately yields the best predictive performance, at the cost of longer execution time and larger expressions, while non-linear optimization alone performs worst. The ϵ-lexicase selection method was enhanced and proved superior to several SR algorithms on a benchmark of real-world and synthetic data. A new simplification method using memoization and locality-sensitive hashing was proposed, reducing both the prediction error and the occurrence of non-linear functions in the final expressions. A framework with a self-adaptive strategy for search-space exploration was proposed, achieving Pareto-optimal performance in terms of accuracy and simplicity of solutions on a benchmark of real-world and synthetic problems, dominating contemporary SR approaches. These contributions improved different current algorithms, and each can be seen as an alternative way of exploring the search space of solutions.
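As context for the selection contribution summarized above, the following is a minimal sketch of standard ϵ-lexicase selection, the baseline algorithm the thesis enhances, not the thesis's specific variant. The function name and the error-matrix representation are illustrative assumptions; the per-case ϵ is computed as the median absolute deviation of the errors on that case, as in La Cava et al. (2016).

    import random
    import statistics

    def epsilon_lexicase_select(population, error_matrix):
        """Sketch of standard epsilon-lexicase selection (baseline, not
        the thesis's enhanced variant).

        population:   list of candidate expressions.
        error_matrix: error_matrix[i][j] = error of individual i on case j.
        """
        n_cases = len(error_matrix[0])
        # Per-case epsilon: median absolute deviation (MAD) of the errors
        # on that case, as in La Cava et al. (2016).
        epsilons = []
        for j in range(n_cases):
            col = [row[j] for row in error_matrix]
            med = statistics.median(col)
            epsilons.append(statistics.median(abs(e - med) for e in col))

        candidates = list(range(len(population)))
        cases = list(range(n_cases))
        random.shuffle(cases)  # each selection event uses a fresh case order

        for j in cases:
            best = min(error_matrix[i][j] for i in candidates)
            # Keep only individuals within epsilon of the best error
            # on the current case.
            candidates = [i for i in candidates
                          if error_matrix[i][j] <= best + epsilons[j]]
            if len(candidates) == 1:
                break
        # If several candidates survive all cases, pick one at random.
        return population[random.choice(candidates)]

The mechanism worth noting is that selection pressure comes from sequences of individual training cases rather than an aggregated fitness, which preserves specialists; the ϵ threshold relaxes the filter so that near-best continuous errors are not discarded.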