Robin's Blog

Orthogonal Distance Regression in Python

Linear regression is often used to estimate the relationship between two variables – essentially by drawing the ‘line of best fit’ on a graph. The standard mathematical method for this is known as least squares, and it aims to minimise the sum of the squared errors across all of the points. The key question is: how do you calculate the error (also known as the residual) for each point?

In standard linear regression, the aim is to predict the Y value from the X value – so the sensible thing to do is to calculate the error in the Y values (shown as the gray lines in the image below). However, sometimes it’s more sensible to take into account the error in both X and Y (as shown by the dotted red lines in the image below) – for example, when you know that your measurements of X are uncertain, or when you don’t want to favour the errors of one variable over the other.

[Figure: Normal_vs_ODR – comparison of standard vertical residuals (gray lines) and orthogonal residuals (dotted red lines)]

Orthogonal Distance Regression (ODR) is a method that can do this (orthogonal in this context means perpendicular – so it calculates errors perpendicular to the line, rather than just ‘vertically’). Unfortunately, it’s a lot more complicated to implement than standard linear regression, but fortunately there is some lovely Fortran code called ODRPACK that does it for us. Even more fortunately, the lovely scipy people have wrapped this Fortran code in the scipy.odr Python module.

However, because of the complexity of the underlying method, using the scipy.odr module is a lot harder than the simple scipy.stats.linregress function – so I’ve written some code to make it easier. The code below provides a function, orthoregress, which implements ODR and behaves in exactly the same way as the linregress function. For more information, see the ODRPACK User Guide, and the scipy.odr documentation.
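The original code listing isn’t reproduced here, but a minimal sketch of such an orthoregress function, built on the documented scipy.odr API, might look like the following (the function name follows the post; the internals are my reconstruction, and the real version would also return the r-value, p-value and standard error to fully match linregress):

```python
import numpy as np
from scipy.odr import ODR, Model, Data
from scipy.stats import linregress

def orthoregress(x, y):
    """Fit a straight line by orthogonal distance regression.

    Returns (slope, intercept). The ordinary least-squares fit is
    used as the starting estimate for the iterative ODR solver.
    """
    def linear(p, x):
        # p[0] is the slope, p[1] the intercept
        return p[0] * x + p[1]

    # Ordinary least squares gives a good initial guess for ODR
    ols = linregress(x, y)
    model = Model(linear)
    data = Data(x, y)
    odr = ODR(data, model, beta0=[ols[0], ols[1]])
    output = odr.run()
    return output.beta[0], output.beta[1]
```

As with linregress, you just pass in your arrays of x and y values; unlike linregress, the fit treats errors in both variables symmetrically.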


Categorised as: Academic, Programming, Python


8 Comments

  1. Fredrik says:

    Thank you for a very interesting post on regression using Python. I had never heard of Orthogonal Distance Regression so I learned both about that technique AND how to perform it. Very pedagogical post. Thank you, again.

  2. zhpmatrix says:

    Hi, it’s really wonderful to read your blog!

    Can you tell me more about “it calculates errors perpendicular to the line, rather than just ‘vertically’”?

    For me it is just the same – am I right?

    Thank you very much!

  3. Robin Wilson says:

    Hi,

    The error calculation is actually different: the vertical errors are shown in gray in the image, and the perpendicular errors are shown in dotted red lines. The vertical errors are just the errors in the ‘y’ direction, whereas the perpendicular errors take into account errors in both ‘x’ and ‘y’.

    Hope that helps!

    Robin
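    To make the difference concrete, here is a tiny worked example (my own numbers, not from the post): for the line y = x and the point (0, 1), the vertical error is 1, but the perpendicular distance to the line is smaller by a factor of √2.

```python
import math

# Line y = x (slope 1, intercept 0) and a point at (0, 1)
slope, intercept = 1.0, 0.0
px, py = 0.0, 1.0

vertical = abs(py - (slope * px + intercept))         # error measured in y only
perpendicular = vertical / math.sqrt(1 + slope ** 2)  # shortest distance to the line

print(vertical, perpendicular)  # 1.0 vs ~0.707
```

    The steeper the line, the bigger the gap between the two measures – which is exactly why the two regressions give different answers.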

  4. Daniel says:

    If you have data with errors in y and x, is it OK to just replace “dat = Data(x, y)” with “dat = RealData(xdata, y=ydata, sx=error_x, sy=error_y)”? It produces estimates that are really close to the ones I got using a ROOT fit with TGraphErrors, at least…
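    For what it’s worth, RealData(x, y, sx=..., sy=...) is indeed the documented way to pass measurement uncertainties to scipy.odr – sx and sy are the standard deviations of the x and y measurements, which scipy converts into weights for the fit. A sketch with made-up data and error bars:

```python
import numpy as np
from scipy.odr import ODR, Model, RealData

def linear(p, x):
    return p[0] * x + p[1]

# Made-up measurements roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sx = np.full_like(x, 0.1)  # standard deviation of each x measurement
sy = np.full_like(y, 0.2)  # standard deviation of each y measurement

data = RealData(x, y=y, sx=sx, sy=sy)
odr = ODR(data, Model(linear), beta0=[1.0, 0.0])
out = odr.run()
slope, intercept = out.beta
```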

  5. Vincent says:

    Just want to point out, for the case of the “line of best fit” you are presenting, there exists a solution which can be calculated analytically (and of course much faster than using scipy.odr). Actually, two solutions exist and one is optimal, minimizing the sum of perpendicular distances.

    See: http://mathworld.wolfram.com/LeastSquaresFittingPerpendicularOffsets.html

    Great post though… got me started.
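    For the straight-line case Vincent describes, the perpendicular-offsets fit can indeed be computed directly. One equivalent analytic route (my choice of formulation, rather than the MathWorld quadratic itself) is to take the principal axis of the centred data, since the direction of largest variance is the total-least-squares line:

```python
import numpy as np

def tls_line(x, y):
    """Fit a line minimising perpendicular distances, via the
    principal axis (first eigenvector) of the data's covariance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x.mean(), y.mean()
    cov = np.cov(x - xm, y - ym)
    evals, evecs = np.linalg.eigh(cov)
    # Direction of largest variance (fails for a vertical line, where dx = 0)
    dx, dy = evecs[:, np.argmax(evals)]
    slope = dy / dx
    intercept = ym - slope * xm
    return slope, intercept
```

    This gives the same line as the analytical perpendicular-offsets solution, with no iterative solver involved.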

  6. fmath says:

    I would like to follow up on Vincent’s answer: he is right that an easier analytical solution exists, but only when the curve is a straight line y = a + bx. For other non-linear formulas an analytical solution may or may not exist, and when y = f(a, b, x) for some non-separable function f() there is no closed-form solution at all, so a numerical method using scipy or similar is needed.
