Regression in Python using R-style formula – it’s easy!
I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.
Before running this you will need to install the pandas, statsmodels and patsy packages. If you’re using conda you should be able to do this by running the following from the terminal:
conda install pandas statsmodels patsy
(and then say yes when it asks you to confirm it)
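If you’re not using conda, installing the same packages with pip should work too:
pip install pandas statsmodels patsy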
import pandas as pd
from statsmodels.formula.api import ols
Before we can do any regression, we need some data – so let’s read some data on cars:
df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")
You may have noticed from the code above that you can just give a URL to the read_csv function and it will download it and open it – handy!
Anyway, here is the data:
df.head()
| | Model | MPG | Cylinders | Engine Disp | Horsepower | Weight | Accelerate | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | amc ambassador dpl | 15.0 | 8 | 390.0 | 190 | 3850 | 8.5 | 70 | American |
| 1 | amc gremlin | 21.0 | 6 | 199.0 | 90 | 2648 | 15.0 | 70 | American |
| 2 | amc hornet | 18.0 | 6 | 199.0 | 97 | 2774 | 15.5 | 70 | American |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American |
| 4 | buick estate wagon (sw) | 14.0 | 8 | 455.0 | 225 | 3086 | 10.0 | 70 | American |
Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:
df.corr()
| | MPG | Cylinders | Engine Disp | Horsepower | Weight | Accelerate | Year |
|---|---|---|---|---|---|---|---|
| MPG | 1.000000 | -0.777618 | -0.805127 | -0.778427 | -0.832244 | 0.423329 | 0.580541 |
| Cylinders | -0.777618 | 1.000000 | 0.950823 | 0.842983 | 0.897527 | -0.504683 | -0.345647 |
| Engine Disp | -0.805127 | 0.950823 | 1.000000 | 0.897257 | 0.932994 | -0.543800 | -0.369855 |
| Horsepower | -0.778427 | 0.842983 | 0.897257 | 1.000000 | 0.864538 | -0.689196 | -0.416361 |
| Weight | -0.832244 | 0.897527 | 0.932994 | 0.864538 | 1.000000 | -0.416839 | -0.309120 |
| Accelerate | 0.423329 | -0.504683 | -0.543800 | -0.689196 | -0.416839 | 1.000000 | 0.290316 |
| Year | 0.580541 | -0.345647 | -0.369855 | -0.416361 | -0.309120 | 0.290316 | 1.000000 |
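One thing to watch out for: on newer versions of pandas (2.0 onwards), corr() will complain about the non-numeric columns (Model and Origin here), so you may need to tell it to only use the numeric columns:
df.corr(numeric_only=True)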
Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:
model = ols("MPG ~ Year", data=df)
results = model.fit()
The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, and on the right is the independent variable. The ols function is nice and easy: we just give it the formula, and then the DataFrame to get the data from (in this case, it’s called df). We then call fit() to actually do the regression.
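The formula syntax can do more than simple sums of variables too. As a quick sketch (not something we need for this dataset), you can put interactions and transformations straight into the formula – patsy evaluates the formula with access to whatever you’ve imported, so numpy functions work:
import numpy as np
# Main effects of Weight and Horsepower plus their interaction
ols("MPG ~ Weight * Horsepower", data=df).fit()
# Log-transform a variable inside the formula itself
ols("MPG ~ np.log(Weight) + Year", data=df).fit()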
We can easily get a summary of the results here – including all sorts of crazy statistical measures!
results.summary()
Dep. Variable: | MPG | R-squared: | 0.337 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.335 |
Method: | Least Squares | F-statistic: | 198.3 |
Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 1.08e-36 |
Time: | 10:42:17 | Log-Likelihood: | -1280.6 |
No. Observations: | 392 | AIC: | 2565. |
Df Residuals: | 390 | BIC: | 2573. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -70.0117 | 6.645 | -10.536 | 0.000 | -83.076 -56.947 |
| Year | 1.2300 | 0.087 | 14.080 | 0.000 | 1.058 1.402 |
Omnibus: | 21.407 | Durbin-Watson: | 1.121 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 15.843 |
Skew: | 0.387 | Prob(JB): | 0.000363 |
Kurtosis: | 2.391 | Cond. No. | 1.57e+03 |
We can do a more complex model easily too. First let’s list the columns of the data to remind us what variables we have:
df.columns
Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight', 'Accelerate', 'Year', 'Origin'], dtype='object')
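As an aside, the Engine Disp column has a space in its name, so it can’t be referred to directly in a formula – if you wanted to use it you could quote it with patsy’s Q():
# Q() lets you use a column whose name isn't a valid Python identifier
ols('MPG ~ Q("Engine Disp") + Weight', data=df).fit()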
We can now add in more variables – doing multiple regression:
model = ols("MPG ~ Year + Weight + Horsepower", data=df)
results = model.fit()
results.summary()
Dep. Variable: | MPG | R-squared: | 0.808 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.807 |
Method: | Least Squares | F-statistic: | 545.4 |
Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 9.37e-139 |
Time: | 10:42:17 | Log-Likelihood: | -1037.4 |
No. Observations: | 392 | AIC: | 2083. |
Df Residuals: | 388 | BIC: | 2099. |
Df Model: | 3 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -13.7194 | 4.182 | -3.281 | 0.001 | -21.941 -5.498 |
| Year | 0.7487 | 0.052 | 14.365 | 0.000 | 0.646 0.851 |
| Weight | -0.0064 | 0.000 | -15.768 | 0.000 | -0.007 -0.006 |
| Horsepower | -0.0050 | 0.009 | -0.530 | 0.597 | -0.024 0.014 |
Omnibus: | 41.952 | Durbin-Watson: | 1.423 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 69.490 |
Skew: | 0.671 | Prob(JB): | 8.14e-16 |
Kurtosis: | 4.566 | Cond. No. | 7.48e+04 |
We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although we can also see that the p-value for Horsepower is very high. If we remove Horsepower from the regression then it barely changes the results:
model = ols("MPG ~ Year + Weight", data=df)
results = model.fit()
results.summary()
Dep. Variable: | MPG | R-squared: | 0.808 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.807 |
Method: | Least Squares | F-statistic: | 819.5 |
Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 3.33e-140 |
Time: | 10:42:17 | Log-Likelihood: | -1037.6 |
No. Observations: | 392 | AIC: | 2081. |
Df Residuals: | 389 | BIC: | 2093. |
Df Model: | 2 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -14.3473 | 4.007 | -3.581 | 0.000 | -22.224 -6.470 |
| Year | 0.7573 | 0.049 | 15.308 | 0.000 | 0.660 0.855 |
| Weight | -0.0066 | 0.000 | -30.911 | 0.000 | -0.007 -0.006 |
Omnibus: | 42.504 | Durbin-Watson: | 1.425 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 71.997 |
Skew: | 0.670 | Prob(JB): | 2.32e-16 |
Kurtosis: | 4.616 | Cond. No. | 7.17e+04 |
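If you want a more formal check that dropping Horsepower is justified, statsmodels can compare the two nested models with an F-test – here’s a quick sketch that refits both models under illustrative names:
from statsmodels.stats.anova import anova_lm
full = ols("MPG ~ Year + Weight + Horsepower", data=df).fit()
reduced = ols("MPG ~ Year + Weight", data=df).fit()
# F-test for whether the extra Horsepower term improves the fit
anova_lm(reduced, full)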
We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in C() to force it to be a categorical variable.
model = ols("MPG ~ Year + Origin", data=df)
results = model.fit()
results.summary()
Dep. Variable: | MPG | R-squared: | 0.579 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.576 |
Method: | Least Squares | F-statistic: | 178.0 |
Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 1.42e-72 |
Time: | 10:42:17 | Log-Likelihood: | -1191.5 |
No. Observations: | 392 | AIC: | 2391. |
Df Residuals: | 388 | BIC: | 2407. |
Df Model: | 3 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -61.2643 | 5.393 | -11.360 | 0.000 | -71.868 -50.661 |
| Origin[T.European] | 7.4784 | 0.697 | 10.734 | 0.000 | 6.109 8.848 |
| Origin[T.Japanese] | 8.4262 | 0.671 | 12.564 | 0.000 | 7.108 9.745 |
| Year | 1.0755 | 0.071 | 15.102 | 0.000 | 0.935 1.216 |
Omnibus: | 10.231 | Durbin-Watson: | 1.656 |
---|---|---|---|
Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 10.589 |
Skew: | 0.402 | Prob(JB): | 0.00502 |
Kurtosis: | 2.980 | Cond. No. | 1.60e+03 |
You can see here that Patsy has automatically created dummy variables for Origin: in this case, European and Japanese, with the ‘default’ (reference) level being American. You can configure how this is done very easily – see the patsy documentation on categorical data for details.
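For example, here’s a sketch that uses patsy’s Treatment coding to make Japanese the reference level instead of American (results_alt is just an illustrative name):
results_alt = ols("MPG ~ Year + C(Origin, Treatment(reference='Japanese'))", data=df).fit()
results_alt.summary()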
Just for reference, you can easily get any of the statistical outputs as attributes on the results object:
results.rsquared
0.57919459237581172
results.params
Intercept             -61.264305
Origin[T.European]      7.478449
Origin[T.Japanese]      8.426227
Year                    1.075484
dtype: float64
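Other outputs can be pulled out in the same way – for example, these are all standard attributes and methods on the fitted results object:
results.pvalues     # p-values for each coefficient
results.conf_int()  # confidence intervals, as a DataFrame
results.bse         # standard errors of the coefficients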
You can also really easily use the model to predict based on values you’ve got:
results.predict({'Year':90, 'Origin':'European'})
array([ 43.00766095])
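You can also pass in a DataFrame to predict for several sets of values at once – something like this (using the same column names that appear in the formula):
new_data = pd.DataFrame({'Year': [76, 80, 82],
                         'Origin': ['American', 'European', 'Japanese']})
results.predict(new_data)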
Just a small typo – I think the prediction should have a year of 90 rather than 1990 (there were two-digit years in the original df). Otherwise we get a very impressive fuel economy 1900 years on 🙂
Good catch – I’ve fixed it now.
Thanks!
However, if speed is a concern, one is better off using scikit-learn.
Interesting – I hadn’t considered speed in this, but it sounds like an interesting topic for a future blog post, thanks!
Thank you so much!!! I am exploring Python for different functions (I use R) and this tutorial was IMMENSELY helpful.
Hi Robin,
This is really interesting, and I am wondering if you could advise on one final thing.
I’m able to get predictions based on passing in the following for one result:
results.predict({'Property_Type': 'T', 'Lease_Type': 'L', 'City': 'LONDON'})
However, if I want to pass in the other part of my data as a data frame to test it, could you advise the syntax for that? It doesn’t work when I pass in the data frame.
Could even speak offline if you’ve 5 minutes spare and I could share with you the colab book I’m working on.
Trying to predict house price based on a number of categories!
Many thanks
Jon
Thanks a lot, you just saved my life.