A better symbolic regression method, by explicitly considering units

Symbolic regression is a technique that helps us understand how different pieces of data relate to each other, by finding mathematical equations that describe the relationships at play. I have high expectations for symbolic regression methods and advocate for them because, by providing explicit equations, they are in principle highly interpretable in a straightforward way, in contrast to most modern AI models, which behave like black boxes we can't understand, making it difficult to know how and why they work.
A new work by Tenachi et al., available as a preprint on arXiv, presents an approach that uses deep reinforcement learning to find equations for the variables of a dataset while taking into account the units associated with the data. This helps eliminate physically impossible solutions and improves performance by restricting the freedom of the equation generator.
Index
- **Introduction**
- **About symbolic regression, and introducing the new methodology**
- **How the new symbolic regression method works**
- **Concluding notes, preprint, and code**
Introduction
A fundamental problem in the natural sciences and in engineering is that of discovering quantitative relationships between the independent and dependent variables measured for a given system. While one can model such relationships perfectly well through black-box models like those produced by most modern AI methods, it is far more desirable to quantify the relationships in terms of symbolic equations, like say a = F/m or I = I0·exp(-kt).
Why? Well, there are a number of reasons:
- Equations are interpretable and thus more intelligible to the human mind, possibly even connecting the variables through relationships backed up by science or engineering concepts already known. For example, from a = F/m we can directly understand that a stronger force applied to an object causes a stronger acceleration, in a linear fashion; and from I = I0·exp(-kt) we know the decay of the dependent variable is exponential over time, from which we can do some math to obtain equations for half-lives, linearized forms (through logarithms), and more.
- If interpretable in simple terms, equations connecting variables have a good chance of relating directly to the underlying concepts, ideas and axioms of the science or engineering problem being modeled. Think for example of I = I0·exp(-kt) applied to radioactive decay, from whose derivative (dI/dt = -kI) we can understand that the rate at which the isotope decays is proportional to the number of radioactive nuclei left in the sample at that time; hence this is a first-order process.
- Propagating independent variables (inputs) through an equation to model the outputs (dependent variables) is an extremely fast, virtually immediate calculation, compared to propagating the inputs through all the units of a neural network. This might not matter for individual predictions, but it becomes relevant when large numbers of predictions need to be made. Besides, equations that relate variables symbolically can be integrated into other programs seamlessly by just plugging in the fitted expressions, as in the short sketch after this list. See for example a great use case of this here.
- Importantly, equations that relate variables analytically may extrapolate more safely outside the domains from which the input data was sampled.
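To make the speed and manipulability points concrete, here is a minimal sketch in plain NumPy, using the decay example from above with hypothetical fitted constants (the values of I0 and k are just illustrative):

```python
# A minimal sketch of why a fitted closed-form equation is cheap to evaluate
# and easy to reason about analytically. Constants are hypothetical.
import numpy as np

I0, k = 100.0, 0.3   # hypothetical fitted constants (counts, 1/s)

def intensity(t):
    """I(t) = I0 * exp(-k t): the fitted closed-form model."""
    return I0 * np.exp(-k * t)

def intensity_rate(t):
    """dI/dt = -k * I(t): follows directly from the expression itself."""
    return -k * intensity(t)

t = np.linspace(0.0, 10.0, 1_000_000)
I = intensity(t)               # a million predictions in one vectorized call
half_life = np.log(2) / k      # obtained by simple algebra, no model needed
```

Doing the same with a trained neural network would require shipping the whole model and running a forward pass per prediction, with no closed-form half-life to be extracted at all.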
None of these points holds, at least not in such a straightforward manner, for regular machine learning models, which work by mixing and transforming signals through many nested functions and combinations until they model the dependent variables properly, in ways that might work perfectly but are hardly interpretable, if at all.
All of the above explains why, when a relationship can be modeled through equations, that is usually the better choice. But what happens when you cannot find a mathematical expression relating the variables? Then you either try regular neural networks, or you try symbolic regression.
About symbolic regression, and introducing the new methodology
Symbolic regression aims to find free-form symbolic analytical functions that fit the variables of a given dataset, making it more general than methods that simply fit coefficients inside predefined linear or nonlinear functions. To be clear: symbolic regression is not just fitting equations but actually finding their expressions (and then, yes, fitting the coefficients they contain). Summarizing from the preprint's introduction in different terms, the advantages of symbolic regression include compactness, generalization (meaning that, if correct, an analytical expression can be much better at extrapolating outside of the training range), and intelligibility and interpretability.
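To see the difference between fitting coefficients and finding structure, here is a toy illustration (my own, not the paper's method): we hand-enumerate a few candidate expression structures, fit the coefficients inside each, and keep the structure that explains the data best. A real symbolic regression engine generates and searches these structures automatically:

```python
# Toy illustration of symbolic regression's core idea: search over expression
# STRUCTURES, then fit the coefficients inside each candidate structure.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5, 50)
y = 2.5 * np.exp(-0.7 * x) + rng.normal(0, 0.01, x.size)  # hidden law + noise

# Hand-enumerated candidates; a real method would generate these itself.
candidates = {
    "a*x + b":      lambda x, a, b: a * x + b,
    "a*exp(b*x)":   lambda x, a, b: a * np.exp(b * x),
    "a*log(x) + b": lambda x, a, b: a * np.log(x) + b,
}

for name, f in candidates.items():
    params, _ = curve_fit(f, x, y, p0=[1.0, -0.1], maxfev=10000)
    mse = np.mean((f(x, *params) - y) ** 2)
    print(f"{name:13s} params={np.round(params, 2)} mse={mse:.4f}")
# The exponential structure wins: structure search plus coefficient fitting.
```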
To make these advantages clear, and to understand the role that symbolic regression can play in modeling datasets, have a look at the recent advances and examples I have covered from the literature on symbolic regression for scientific applications:
Real-world applications of symbolic regression
Google proposes new method to derive analytical expressions for terms in quantum mechanics…
Unfortunately, implementing symbolic regression to discover new physical laws is extremely challenging. Even more so is achieving compact, simple equations that can truly be interpreted in terms of the underlying science or engineering concepts. I stress this because many symbolic regression models come up with equations so complicated that they end up being very hard to interpret, adding little or nothing over standard machine learning models. And even in the most successful cases, finding simple enough equations usually requires long execution times, as the program needs to explore the branches of a huge tree of possible mathematical operations that must be combined and tested while building up candidate equations.
These last two points, i.e. the need for simpler equations and faster convergence toward them, were the main motivation behind the new work by Tenachi et al. And here's their main contribution: the authors realized that the units of the variables to be connected by the symbolic regression procedure impose a strong constraint on the shape of the equation. They then explored how this fact could be exploited to optimize equation search, and came up with a specific framework to incorporate information about physical units into the symbolic regression procedure.
By including units as constraints during the equation search process, the new framework effectively addresses the combinatorial challenge of the huge space of trial expressions. The search space shrinks dramatically, resulting in a much faster search for equations. Moreover, the authors found that the procedure yields simpler expressions, which are hence more interpretable, and more accurate than those obtained by other symbolic regression methods.
How the new symbolic regression method works
At its core, the framework includes a novel symbolic embedding, here tailored for physics but in principle extensible more broadly, that makes it possible to track and control the units of each symbol generated in a partially composed mathematical expression. As a result, the procedure automatically steers the search exclusively through paths where units remain consistent. As it runs, it uses a recurrent neural network to generate analytical expressions and cycles them through steps of reinforcement learning under the constraints imposed by the units, thus producing physically meaningful combinations of the input variables.
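A common way to implement this kind of bookkeeping, and a minimal sketch of the core idea (my own illustration, not PhySO's actual API), is to track units as vectors of exponents over base dimensions and only allow tokens that keep a partially built expression dimensionally consistent:

```python
# Minimal sketch: units as exponent vectors over [length, time, mass],
# used to prune tokens that would break dimensional consistency.
import numpy as np

UNITS = {                        # exponents over [L, T, M]
    "v": np.array([1, -1, 0]),   # velocity: L T^-1
    "t": np.array([0, 1, 0]),    # time
    "x": np.array([1, 0, 0]),    # length
}

def allowed_operands(required, variables):
    """For '+' or '-', operands must carry exactly the required units."""
    return [s for s, u in variables.items() if np.array_equal(u, required)]

# While composing an expression whose result must be a length, only "x"
# survives the filter; "v" and "t" are pruned before they are even tried.
print(allowed_operands(np.array([1, 0, 0]), UNITS))  # -> ['x']

# For '*' the exponent vectors add, for '/' they subtract:
print(UNITS["v"] + UNITS["t"])                       # -> [1 0 0], a length
```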
In more detail, the procedure begins by generating symbolic expressions as binary trees, where each node represents a symbol drawn from a library of available symbols. The expressions are treated as sequences of categorical vectors, and token sequences are generated with the recurrent neural network. (By the way, these categorical vectors can be tuned to incorporate prior knowledge, and some priors are indeed adopted at this stage: for example, restricting the maximum possible size of the analytical expressions, allowing no more than two levels of nested trigonometric operations, and forbidding self-nesting of exponential and log operators, which is unusual in the sciences.)
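A binary expression tree can be handled as a flat token sequence in prefix notation, which is what makes it amenable to sequence models like RNNs. Here is a small sketch of that representation; the symbol library and names are illustrative, not the paper's:

```python
# Sketch: expressions as prefix token sequences over a symbol library.
import math

ARITY = {"add": 2, "mul": 2, "exp": 1, "x": 0, "t": 0, "c": 0}

def is_complete(tokens):
    """A prefix sequence encodes a full tree when open slots reach zero."""
    slots = 1
    for tok in tokens:
        slots += ARITY[tok] - 1
    return slots == 0

def evaluate(tokens, env):
    """Recursively evaluate a prefix sequence; returns (value, rest)."""
    head, rest = tokens[0], tokens[1:]
    if ARITY[head] == 0:
        return env[head], rest
    a, rest = evaluate(rest, env)
    if ARITY[head] == 1:
        return math.exp(a), rest          # only 'exp' is unary here
    b, rest = evaluate(rest, env)
    return (a + b, rest) if head == "add" else (a * b, rest)

# mul(c, exp(mul(c, t))) ~ the I0*exp(-k*t) shape, as a flat sequence:
seq = ["mul", "c", "exp", "mul", "c", "t"]
print(is_complete(seq))                                   # True
print(evaluate(seq, {"x": 1.0, "t": 2.0, "c": -0.5})[0])  # -0.5*exp(-1.0)
```

Priors such as a maximum expression length or limits on nesting can then be enforced simply by masking token probabilities during generation.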
The generated expressions are subjected to physical unit constraints, computed in situ by a procedure that derives the required units whenever possible and leaves them free otherwise. Then, in the reinforcement learning part, a set of trial symbolic functions is generated as described, and a reward is computed for each function by confronting it with the data. The network is then asked to generate a new batch of trial functions, and is encouraged to produce better results by reinforcing behavior associated with high reward values. Candidates are sampled based not only on the output of the recurrent network but also on the local units constraints derived from the prior distribution, which ensures the physical correctness of each token choice.
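As a rough, heavily simplified sketch of that loop (my paraphrase of the risk-seeking policy-gradient recipe from the deep symbolic regression literature this work builds on, not PhySO's code): sample a batch of candidate sequences, score each against the data, and nudge the generator toward the tokens used by the best-scoring candidates. The reward function below is a stub standing in for a fit-to-data measure such as a squashed normalized error:

```python
# Crude REINFORCE-style sketch of the sample/score/reinforce cycle.
import numpy as np

rng = np.random.default_rng(0)
TOKENS = ["add", "mul", "exp", "x", "t", "c"]
logits = np.zeros(len(TOKENS))        # stand-in for the RNN's output

def sample_token():
    p = np.exp(logits - logits.max()); p /= p.sum()
    return rng.choice(len(TOKENS), p=p)   # in PhySO, also masked by units

def reward(seq, data):
    """Placeholder: e.g. 1 / (1 + fit error) against the dataset."""
    return rng.uniform()

for step in range(100):
    batch = [[sample_token() for _ in range(6)] for _ in range(32)]
    rewards = np.array([reward(seq, None) for seq in batch])
    baseline = np.quantile(rewards, 0.9)  # risk-seeking: only the best count
    for seq, r in zip(batch, rewards):
        if r > baseline:
            for tok in seq:               # nudge probabilities of good tokens
                logits[tok] += 0.01 * (r - baseline)
```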
Notably, the methodology allows the candidate functions to contain constants with fixed units but free numerical values, which makes it possible to model situations where the problem involves some unknown physical scales. Finally, the optimal values of the constants are found via gradient descent using a standard optimization routine.
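For that final step, a minimal illustration: once a structure like I0·exp(-k·t) has been chosen, its free constants can be fitted numerically. Here I use SciPy's L-BFGS-B as a stand-in optimizer (an assumption for the sketch, not necessarily the authors' exact routine):

```python
# Sketch: fitting the free constants of a chosen expression structure.
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0, 10, 200)
I_obs = 100 * np.exp(-0.3 * t)          # synthetic "measurements"

def loss(params):
    I0, k = params
    return np.mean((I0 * np.exp(-k * t) - I_obs) ** 2)

fit = minimize(loss, x0=[50.0, 0.5], method="L-BFGS-B")
print(fit.x)   # ~ [100, 0.3]: numerical values for constants of fixed units
```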
Concluding notes, preprint, and code
After demonstrating the effectiveness of the new approach on a range of examples from astrophysics, the authors of this new work aim to build a powerful general-purpose symbolic regression algorithm for other physical sciences.
You can read the full preprint here at arXiv:
Deep symbolic regression for physics guided by units constraints: toward the automated discovery of…
And you can try the program here:
GitHub – WassimTenachi/PhySO: Physical Symbolic Optimization
You may also find useful the presentation by the preprint's lead author, and the Twitter thread he rolled out:
As Tenachi himself concludes in his Twitter thread,
While neural networks are excellent tools for modeling physical systems, they lack interpretability and generalization capabilities. [The new method] offers a chance to open up these black boxes and recover underlying equations, which, as physicists, we all know and love.