CURRENT CHALLENGES OF SYMBOLIC REGRESSION: OPTIMIZATION, SELECTION, MODEL SIMPLIFICATION, AND BENCHMARKING
Symbolic Regression (SR) is a regression method that aims to discover mathematical expressions
that describe the relationship between variables; it is most often implemented through Genetic
Programming, an evolutionary algorithm inspired by biological evolution. Its appeal lies in combining
predictive accuracy with interpretable models, but its promise is limited by several long-standing
challenges: model parameters are difficult to optimize, the choice of selection scheme can steer
the search, and models often grow unnecessarily complex. In addition, current methods must be constantly
re-evaluated to understand the SR landscape. This thesis addresses these challenges through a
sequence of studies conducted throughout the doctorate, each focusing on an important aspect of
the SR search process. First, I investigate parameter optimization, obtaining insights into its role in
improving predictive accuracy, albeit with trade-offs in runtime and expression size. Next, I study
parent selection, exploring ϵ-lexicase selection to choose parents more likely to generate well-performing
offspring. The focus then turns to simplification, where I introduce a novel method based on
memoization and locality-sensitive hashing that reduces redundancy and yields simpler, more
accurate models. All of these contributions are implemented in a multi-objective evolutionary
SR library, which achieves Pareto-optimal trade-offs between accuracy and simplicity
on benchmarks of real-world and synthetic problems, outperforming several contemporary
SR approaches. The thesis concludes by reimagining a well-known large-scale symbolic regression
benchmark suite to reassess the SR landscape, demonstrating that the proposed method
achieves Pareto-optimal performance.
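For readers unfamiliar with ϵ-lexicase selection, the following is a minimal sketch of the standard selection loop (filter candidates case by case, keeping those within ϵ of the best error on each randomly ordered training case). The function name and data layout here are illustrative and are not the API of the library developed in this thesis.

```python
import random

def epsilon_lexicase_select(population, errors, eps):
    """Select one parent via epsilon-lexicase selection.

    population: list of individuals
    errors: errors[i][c] = error of individual i on training case c
    eps: per-case epsilon thresholds (e.g. the median absolute
         deviation of errors on each case)
    """
    candidates = list(range(len(population)))
    cases = list(range(len(errors[0])))
    random.shuffle(cases)  # cases are considered in random order
    for c in cases:
        if len(candidates) == 1:
            break
        # keep only candidates within eps of the best error on this case
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] <= best + eps[c]]
    # ties after all cases are broken uniformly at random
    return population[random.choice(candidates)]
```

With ϵ set to zero this reduces to plain lexicase selection; nonzero ϵ relaxes the per-case filter, which is what makes the method effective on continuous-valued regression errors.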