Robin's Blog

Pandas-FSDR: a simple function for finding significant differences in pandas DataFrames

In the spirit of my Previously Unpublicised Code series, today I’m going to share Pandas-FSDR. This is a simple library with one function which finds significant differences between two columns in a pandas DataFrame.

For example, imagine you had the following data frame:

Subject UK World
Biology 50 40
Geography 75 80
Computing 100 50
Maths 1500 1600

You may be interested in the differences between the values for the UK and the World (these could be test scores or something similar). Pandas-FSDR will tell you – by running one function you can get output like this:

  • Maths is significantly smaller for UK (1500 for UK compared to 1600 for World)
  • Computing is significantly larger for UK (100 for UK compared to 50 for World)

Differences are calculated in absolute and relative terms, and all thresholds can be altered by changing parameters to the function. The function will even output pre-formatted Markdown text for display in an IPython notebook, inclusion in a dashboard or similar. The output above was created by running this code:

result = FSDR(df, 'UK', 'World', rel_thresh=30, abs_thresh=75)

This is a pretty simple function, but I thought it might be worth sharing. I originally wrote it for some contract data science work I did years ago, where I was sharing the output of Jupyter Notebooks with clients directly, and wanted something that would ‘write the text’ of the comparisons for me, so it could be automatically updated when I had new data. If you don’t want it to write anything then it’ll just output a list of row indices which have significant differences.

Anyway, it’s nothing special but someone may find it useful.

The code and full documentation are available in the Pandas-FSDR Github repository

Categorised as: Previously Unpublicised Code, Programming, Python

Leave a Reply

Your email address will not be published. Required fields are marked *