# Robin's Blog

## Regression in Python using R-style formula – it’s easy!

I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the `pandas`, `statsmodels` and `patsy` packages. If you’re using conda you should be able to do this by running the following from the terminal:

`conda install pandas statsmodels patsy`

(and then say yes when it asks you to confirm it)

```python
import pandas as pd
from statsmodels.formula.api import ols
```

Before we can do any regression, we need some data – so let’s read some data on cars:

```python
df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")
```

You may have noticed from the code above that you can just give a URL to the `read_csv` function and it will download it and open it – handy!

Anyway, here is the data:

```python
df.head()
```
```
                     Model   MPG  Cylinders  Engine Disp  Horsepower  Weight  Accelerate  Year    Origin
0       amc ambassador dpl  15.0          8        390.0         190    3850         8.5    70  American
1              amc gremlin  21.0          6        199.0          90    2648        15.0    70  American
2               amc hornet  18.0          6        199.0          97    2774        15.5    70  American
3            amc rebel sst  16.0          8        304.0         150    3433        12.0    70  American
4  buick estate wagon (sw)  14.0          8        455.0         225    3086        10.0    70  American
```

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the `corr()` method:

```python
df.corr()
```
```
                  MPG  Cylinders  Engine Disp  Horsepower    Weight  Accelerate      Year
MPG          1.000000  -0.777618    -0.805127   -0.778427 -0.832244    0.423329  0.580541
Cylinders   -0.777618   1.000000     0.950823    0.842983  0.897527   -0.504683 -0.345647
Engine Disp -0.805127   0.950823     1.000000    0.897257  0.932994   -0.543800 -0.369855
Horsepower  -0.778427   0.842983     0.897257    1.000000  0.864538   -0.689196 -0.416361
Weight      -0.832244   0.897527     0.932994    0.864538  1.000000   -0.416839 -0.309120
Accelerate   0.423329  -0.504683    -0.543800   -0.689196 -0.416839    1.000000  0.290316
Year         0.580541  -0.345647    -0.369855   -0.416361 -0.309120    0.290316  1.000000
```

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

```python
model = ols("MPG ~ Year", data=df)
results = model.fit()
```

The ‘formula’ we used above is the same as R uses: the dependent variable goes on the left and the independent variable on the right. The `ols` function is nice and easy: we just give it the formula and the DataFrame to get the data from (in this case, it’s called `df`), and then call `fit()` to actually do the regression.

We can easily get a summary of the results here – including all sorts of crazy statistical measures!

```python
results.summary()
```
```
Dep. Variable:     MPG     R-squared:        0.337
Model:             OLS     Adj. R-squared:   0.335
No. Observations:  392     F-statistic:      198.3

              coef    std err     t       P>|t|   [95.0% Conf. Int.]
Intercept  -70.0117    6.645   -10.536    0.000   -83.076   -56.947
Year         1.2300    0.087    14.080    0.000     1.058     1.402
```

We can do a more complex model easily too. First let’s list the columns of the data to remind us what variables we have:

```python
df.columns
```
```
Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight',
       'Accelerate', 'Year', 'Origin'],
      dtype='object')
```

We can now add in more variables – doing multiple regression:

```python
model = ols("MPG ~ Year + Weight + Horsepower", data=df)
results = model.fit()
results.summary()
```
```
Dep. Variable:     MPG     R-squared:        0.808
Model:             OLS     Adj. R-squared:   0.807
No. Observations:  392     F-statistic:      545.4

               coef    std err     t       P>|t|   [95.0% Conf. Int.]
Intercept   -13.7194    4.182    -3.281    0.001   -21.941    -5.498
Year          0.7487    0.052    14.365    0.000     0.646     0.851
Weight       -0.0064    0.000   -15.768    0.000    -0.007    -0.006
Horsepower   -0.0050    0.009    -0.530    0.597    -0.024     0.014
```

We can see that bringing in some extra variables has increased the R² value from ~0.3 to ~0.8 – although we can also see that the p-value for `Horsepower` is very high. If we remove `Horsepower` from the regression then it barely changes the results:
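
Rather than reading p-values off the summary table, you can also get them programmatically from `results.pvalues`. A minimal sketch with made-up data (not the real cars dataset), where a noise-only `Horsepower` column stands in for the real one:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the cars data (made-up numbers).
# Horsepower here is pure noise, so the model should find it insignificant.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Year": rng.integers(70, 83, size=n).astype(float),
    "Weight": rng.normal(3000, 500, size=n),
    "Horsepower": rng.normal(100, 20, size=n),
})
toy["MPG"] = -13.7 + 0.7 * toy["Year"] - 0.006 * toy["Weight"] + rng.normal(scale=2, size=n)

results = ols("MPG ~ Year + Weight + Horsepower", data=toy).fit()
print(results.pvalues)
```

Each entry in `results.pvalues` corresponds to one term in the formula, so you can filter out weak terms with something like `results.pvalues[results.pvalues > 0.05]`.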

```python
model = ols("MPG ~ Year + Weight", data=df)
results = model.fit()
results.summary()
```
```
Dep. Variable:     MPG     R-squared:        0.808
Model:             OLS     Adj. R-squared:   0.807
No. Observations:  392     F-statistic:      819.5

              coef    std err     t       P>|t|   [95.0% Conf. Int.]
Intercept  -14.3473    4.007    -3.581    0.000   -22.224    -6.470
Year         0.7573    0.049    15.308    0.000     0.660     0.855
Weight      -0.0066    0.000   -30.911    0.000    -0.007    -0.006
```

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called `Origin`. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in `C()` to force it to be a categorical variable.
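
As a quick sketch of that with a tiny made-up dataset (not the cars data), the implicit and explicit forms give the same fit:

```python
import pandas as pd
from statsmodels.formula.api import ols

# Tiny made-up dataset: Origin is a string column, so patsy treats it as
# categorical automatically; wrapping it in C() makes that explicit.
toy = pd.DataFrame({
    "MPG": [15, 18, 30, 32, 25, 27],
    "Origin": ["American", "American", "Japanese", "Japanese",
               "European", "European"],
})

implicit = ols("MPG ~ Origin", data=toy).fit()
explicit = ols("MPG ~ C(Origin)", data=toy).fit()

# Both fits produce the same dummy coding, just with different parameter names
print(implicit.params)
print(explicit.params)
```

`C()` becomes necessary when a categorical variable is stored as numbers (e.g. `Cylinders`), since patsy would otherwise treat it as continuous.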

```python
model = ols("MPG ~ Year + Origin", data=df)
results = model.fit()
results.summary()
```
```
Dep. Variable:     MPG     R-squared:        0.579
Model:             OLS     Adj. R-squared:   0.576
No. Observations:  392     F-statistic:      178.0

                       coef    std err     t       P>|t|   [95.0% Conf. Int.]
Intercept           -61.2643    5.393   -11.360    0.000   -71.868   -50.661
Origin[T.European]    7.4784    0.697    10.734    0.000     6.109     8.848
Origin[T.Japanese]    8.4262    0.671    12.564    0.000     7.108     9.745
Year                  1.0755    0.071    15.102    0.000     0.935     1.216
```

You can see here that Patsy has automatically created extra variables for `Origin`: in this case European and Japanese, with the ‘default’ (reference) level being American. You can configure how this coding is done very easily – see the patsy documentation on categorical coding.
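
For example, patsy’s `Treatment` coding lets you pick which level is the baseline. A small sketch with the same kind of made-up data as above (not the real cars dataset):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up dataset; Treatment(reference=...) picks the baseline level
toy = pd.DataFrame({
    "MPG": [15, 18, 30, 32, 25, 27],
    "Origin": ["American", "American", "Japanese", "Japanese",
               "European", "European"],
})

# Make Japanese the reference level instead of the (alphabetical) default
results = ols("MPG ~ C(Origin, Treatment(reference='Japanese'))", data=toy).fit()
print(results.params)
```

The intercept now represents the Japanese mean, and the other coefficients are offsets relative to it.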

Just for reference, you can easily get any of the statistical outputs as attributes on the `results` object:

```python
results.rsquared
```
`0.57919459237581172`
```python
results.params
```
```
Intercept            -61.264305
Origin[T.European]     7.478449
Origin[T.Japanese]     8.426227
Year                   1.075484
dtype: float64
```

You can also really easily use the model to predict based on values you’ve got:

```python
results.predict({'Year': 90, 'Origin': 'European'})
```
`array([ 43.00766095])`
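
You can also pass a DataFrame with several rows to `predict` to get predictions in bulk. A sketch using a small made-up fit (hypothetical numbers, not the cars data):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical fit on made-up data, then predictions for several new cars at once
toy = pd.DataFrame({
    "MPG": [15, 18, 30, 32, 25, 27],
    "Year": [70, 71, 78, 80, 75, 76],
    "Origin": ["American", "American", "Japanese", "Japanese",
               "European", "European"],
})
results = ols("MPG ~ Year + Origin", data=toy).fit()

new_cars = pd.DataFrame({"Year": [82, 82], "Origin": ["American", "Japanese"]})
predictions = results.predict(new_cars)
print(predictions)
```

Patsy remembers the categorical levels from the fit, so the new data is dummy-coded consistently with the training data.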

Categorised as: Programming, Python

1. Ron Hartley-Davies says:

Just a small typo – I think the prediction should have a year of 90 rather than 1990 (there were two-figure dates in the original df). Otherwise we get a very impressive fuel economy 1900 years on.

2. Robin Wilson says:

Good catch – I’ve fixed it now.

Thanks!

3. Tanmay says:

However, if speed is a concern, one is better off using scikit learn.

4. Robin Wilson says:

Interesting – I hadn’t considered speed in this, but it sounds like an interesting topic for a future blog post, thanks!

5. […] Regression in Python using R-style formula – it’s easy! […]

6. Arpan says:

Thank you so much!!! I am exploring Python for different functions (I use R) and this tutorial was IMMENSELY helpful.