## Regression in Python using R-style formula – it’s easy!

I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the `pandas`

, `statsmodels`

and `patsy`

packages. If you’re using conda you should be able to do this by running the following from the terminal:

`conda install statsmodels patsy`

(and then say yes when it asks you to confirm it)

```
import pandas as pd
from statsmodels.formula.api import ols
```

Before we can do any regression, we need some data – so lets read some data on cars:

```
df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")
```

You may have noticed from the code above that you can just give a URL to the `read_csv`

function and it will download it and open it – handy!

Anyway, here is the data:

```
df.head()
```

Model | MPG | Cylinders | Engine Disp | Horsepower | Weight | Accelerate | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|

0 | amc ambassador dpl | 15.0 | 8 | 390.0 | 190 | 3850 | 8.5 | 70 | American |

1 | amc gremlin | 21.0 | 6 | 199.0 | 90 | 2648 | 15.0 | 70 | American |

2 | amc hornet | 18.0 | 6 | 199.0 | 97 | 2774 | 15.5 | 70 | American |

3 | amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American |

4 | buick estate wagon (sw) | 14.0 | 8 | 455.0 | 225 | 3086 | 10.0 | 70 | American |

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the `corr()`

method:

```
df.corr()
```

MPG | Cylinders | Engine Disp | Horsepower | Weight | Accelerate | Year | |
---|---|---|---|---|---|---|---|

MPG | 1.000000 | -0.777618 | -0.805127 | -0.778427 | -0.832244 | 0.423329 | 0.580541 |

Cylinders | -0.777618 | 1.000000 | 0.950823 | 0.842983 | 0.897527 | -0.504683 | -0.345647 |

Engine Disp | -0.805127 | 0.950823 | 1.000000 | 0.897257 | 0.932994 | -0.543800 | -0.369855 |

Horsepower | -0.778427 | 0.842983 | 0.897257 | 1.000000 | 0.864538 | -0.689196 | -0.416361 |

Weight | -0.832244 | 0.897527 | 0.932994 | 0.864538 | 1.000000 | -0.416839 | -0.309120 |

Accelerate | 0.423329 | -0.504683 | -0.543800 | -0.689196 | -0.416839 | 1.000000 | 0.290316 |

Year | 0.580541 | -0.345647 | -0.369855 | -0.416361 | -0.309120 | 0.290316 | 1.000000 |

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

```
model = ols("MPG ~ Year", data=df)
results = model.fit()
```

The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, on the right is the independent variable. The `ols`

method is nice and easy, we just give it the formula, and then the DataFrame to use to get the data from (in this case, it’s called `df`

). We then call `fit()`

to actually do the regression.

We can easily get a summary of the results here – including all sorts of crazy statistical measures!

```
results.summary()
```

Dep. Variable: | MPG | R-squared: | 0.337 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.335 |

Method: | Least Squares | F-statistic: | 198.3 |

Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 1.08e-36 |

Time: | 10:42:17 | Log-Likelihood: | -1280.6 |

No. Observations: | 392 | AIC: | 2565. |

Df Residuals: | 390 | BIC: | 2573. |

Df Model: | 1 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|

Intercept | -70.0117 | 6.645 | -10.536 | 0.000 | -83.076 -56.947 |

Year | 1.2300 | 0.087 | 14.080 | 0.000 | 1.058 1.402 |

Omnibus: | 21.407 | Durbin-Watson: | 1.121 |
---|---|---|---|

Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 15.843 |

Skew: | 0.387 | Prob(JB): | 0.000363 |

Kurtosis: | 2.391 | Cond. No. | 1.57e+03 |

We can do a more complex model easily too. First lets list the columns of the data to remind us what variables we have:

```
df.columns
```

Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight', 'Accelerate', 'Year', 'Origin'], dtype='object')

We can now add in more variables – doing multiple regression:

```
model = ols("MPG ~ Year + Weight + Horsepower", data=df)
results = model.fit()
results.summary()
```

Dep. Variable: | MPG | R-squared: | 0.808 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.807 |

Method: | Least Squares | F-statistic: | 545.4 |

Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 9.37e-139 |

Time: | 10:42:17 | Log-Likelihood: | -1037.4 |

No. Observations: | 392 | AIC: | 2083. |

Df Residuals: | 388 | BIC: | 2099. |

Df Model: | 3 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|

Intercept | -13.7194 | 4.182 | -3.281 | 0.001 | -21.941 -5.498 |

Year | 0.7487 | 0.052 | 14.365 | 0.000 | 0.646 0.851 |

Weight | -0.0064 | 0.000 | -15.768 | 0.000 | -0.007 -0.006 |

Horsepower | -0.0050 | 0.009 | -0.530 | 0.597 | -0.024 0.014 |

Omnibus: | 41.952 | Durbin-Watson: | 1.423 |
---|---|---|---|

Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 69.490 |

Skew: | 0.671 | Prob(JB): | 8.14e-16 |

Kurtosis: | 4.566 | Cond. No. | 7.48e+04 |

We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although we can see that the P value for the `Horsepower`

is very high. If we remove `Horsepower`

from the regression then it barely changes the results:

```
model = ols("MPG ~ Year + Weight", data=df)
results = model.fit()
results.summary()
```

Dep. Variable: | MPG | R-squared: | 0.808 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.807 |

Method: | Least Squares | F-statistic: | 819.5 |

Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 3.33e-140 |

Time: | 10:42:17 | Log-Likelihood: | -1037.6 |

No. Observations: | 392 | AIC: | 2081. |

Df Residuals: | 389 | BIC: | 2093. |

Df Model: | 2 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|

Intercept | -14.3473 | 4.007 | -3.581 | 0.000 | -22.224 -6.470 |

Year | 0.7573 | 0.049 | 15.308 | 0.000 | 0.660 0.855 |

Weight | -0.0066 | 0.000 | -30.911 | 0.000 | -0.007 -0.006 |

Omnibus: | 42.504 | Durbin-Watson: | 1.425 |
---|---|---|---|

Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 71.997 |

Skew: | 0.670 | Prob(JB): | 2.32e-16 |

Kurtosis: | 4.616 | Cond. No. | 7.17e+04 |

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called `Origin`

. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in `C()`

to force it to be a categorical variable.

```
model = ols("MPG ~ Year + Origin", data=df)
results = model.fit()
results.summary()
```

Dep. Variable: | MPG | R-squared: | 0.579 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.576 |

Method: | Least Squares | F-statistic: | 178.0 |

Date: | Sat, 20 Aug 2016 | Prob (F-statistic): | 1.42e-72 |

Time: | 10:42:17 | Log-Likelihood: | -1191.5 |

No. Observations: | 392 | AIC: | 2391. |

Df Residuals: | 388 | BIC: | 2407. |

Df Model: | 3 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|

Intercept | -61.2643 | 5.393 | -11.360 | 0.000 | -71.868 -50.661 |

Origin[T.European] | 7.4784 | 0.697 | 10.734 | 0.000 | 6.109 8.848 |

Origin[T.Japanese] | 8.4262 | 0.671 | 12.564 | 0.000 | 7.108 9.745 |

Year | 1.0755 | 0.071 | 15.102 | 0.000 | 0.935 1.216 |

Omnibus: | 10.231 | Durbin-Watson: | 1.656 |
---|---|---|---|

Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 10.589 |

Skew: | 0.402 | Prob(JB): | 0.00502 |

Kurtosis: | 2.980 | Cond. No. | 1.60e+03 |

You can see here that Patsy has automatically created extra variables for `Origin`

: in this case, European and Japanese, with the ‘default’ being American. You can configure how this is done very easily – see here.

Just for reference, you can easily get any of the statistical outputs as attributes on the `results`

object:

```
results.rsquared
```

0.57919459237581172

```
results.params
```

Intercept -61.264305 Origin[T.European] 7.478449 Origin[T.Japanese] 8.426227 Year 1.075484 dtype: float64

You can also really easily use the model to predict based on values you’ve got:

```
results.predict({'Year':90, 'Origin':'European'})
```

array([ 43.00766095])

**Categorised as:** Programming, Python

Just a small typo – I think the prediction should have a year of 90 rather than 1990 (there were two figure dates in the original df). Otherwise we get a very impressive fuel economy 1900 years on ðŸ™‚

Good catch – I’ve fixed it now.

Thanks!

However, if speed is a concern, one is better off using scikit learn.

Interesting – I hadn’t considered speed in this, but it sounds like an interesting topic for a future blog post, thanks!

[…] Regression in Python using R-style formula Ã¢â‚¬â€œ itÃ¢â‚¬â„¢s easy! […]

Thank you so much!!! I am exploring Python for different functions (I use R) and this tutorial was IMMENSELY helpful.

Hi Robin,

This is really interesting, and I am wondering if you could advise on one final thing.

I’m able to get predictions based on passing in the following for one result:

results.predict({‘Property_Type’:’T’,’Lease_Type’:’L’,’City’:’LONDON’})

However, if I want to pass in the other part of my data as a data frame to test it, could you advise the syntax for that as it doesn’t work when I pass in the data frame.

Could even speak offline if you’ve 5 minutes spare and I could share with you the colab book I’m working on.

Trying to predict house price based on a number of categories!

Many thanks

Jon

Thanks a lot, you just save my life.