Quick and Dirty Way to Fit Regression Models Using (Only) SQL.

Author:Murphy  |  View: 22470  |  Time: 2025-03-23 18:56:27
Photo by Michael Dziedzic on Unsplash

SQL programmers hardly fit any ML models.

Someone else will do it unless they have either Python or R knowledge. While Python and scikit-learn are often my go-to tools for Machine Learning, it is worth noting that SQL can also do some quick and dirty model fitting.

Regression models are a common one almost everyone needs. I remember using it in high school physics for the first time.

In such situations, if you have your data in a Postgres table, you don't have to leave your Sql environment to fit such trivial models.

3 SQL Optimization Techniques That Can Instantly Boost Query Speed

Here's how we do this.

Regression modeling in SQL

Postgres has built-in utilities to work with regression models. You don't have to install or activate any special modules.

We can easily fit linear regression models quickly and make predictions using them.

A linear regression model is about finding the equation of a line that generalizes the dataset. Thus, we only need to find the line's intercept and slope.

The regr_slope and regr_intercept functions help us with this task.

Let's suppose we have a table with the rainfall and temperature columns. And we need to predict the missing values in the temperature column using the rainfall information.

Here's a SELECT statement that retrieves the rainfall and temperature values from the weather table, where missing temperature values are filled with predictions using the regr_slope and regr_intercept functions:

SELECT 
  rainfall, 
  CASE 
    WHEN temperature IS NULL 
    THEN regr_slope(temperature, rainfall) * rainfall 
         + regr_intercept(temperature, rainfall) 
    ELSE temperature 
  END AS temperature
FROM 
  weather;

We use a CASE statement to check if the temperature value is missing (i.e., NULL). If it is missing, we use the regr_slope and regr_intercept Functions to predict the temperature value based on the corresponding rainfall value. Otherwise, we use the original temperature value.

These 5 SQL Techniques Cover ~80% of Real-Life Projects

Persisting the predictions

If I want to fill in the missing values in the table permanently, I can use a slightly modified version of the above code. You can create a materialized view for the predictions or insert the forecasts into a different table.

-- Populate the table with predicted temperature values
INSERT INTO predicted_temperature (rainfall, temperature)
SELECT 
  t1.rainfall, 
  CASE 
    WHEN t1.temperature IS NULL 
    THEN regr_slope(t2.temperature, t2.rainfall) * t1.rainfall 
         + regr_intercept(t2.temperature, t2.rainfall) 
    ELSE t1.temperature 
  END AS temperature
FROM 
  weather t1
  LEFT JOIN weather t2 ON t1.rainfall = t2.rainfall
WHERE 
  t1.temperature IS NULL;

And here's the same but creating a materialized view. Materialized views are simply queries saved in the database along with their results from the latest run.

CREATE MATERIALIZED VIEW predicted_temperature_mv AS
SELECT 
  rainfall, 
  CASE 
    WHEN temperature IS NULL 
    THEN regr_slope(temperature, rainfall) * rainfall 
         + regr_intercept(temperature, rainfall) 
    ELSE temperature 
  END AS temperature
FROM 
  weather;

Python To SQL – I Can Now Load Data 20X Faster

Limitations of fitting regression models in SQL

Anyone who worked on regression models before could attest that the above examples still need to be completed. Fitting a correct model is more complex than what we've discussed here.

Especially in real-life situations, we should consider more than one independent variable to predict the dependent one. Sometimes we have to use polynomial order in fitting the regression models rather than the linear one. We may also need to use them both.

Postgres's regression utilities aren't capable of handling such complex modeling. We can only build a linea regressor with one independent variable.

SQL on Pandas – My New Favorite for 10X Speed.

Conclusion

Linear regression models are probably the most used ones for predicting continuous data. Data scientists often use it as a starting point for more complex ML modeling.

Although we need the support of programming languages such as Python for more sophisticated machine-learning tasks, simple tasks like linear regressions can be done inside SQL itself.

I hope this little technique discussed in this post will help you in your daily work.


Thanks for reading, friend! If you enjoyed my article, let's keep in touch on LinkedIn, Twitter, and Medium.

Not a Medium member yet? Please use this link to become a member because, at no extra cost for you, I earn a small commission for referring you.

Tags: Artificial Intelligence Data Science Machine Learning Software Engineering Sql

Comment