Robin's Blog

Introducing recipy: effortless provenance tracking with Python

By the time this blog post is published, I will have finished my presentation about recipy at EuroSciPy (see the abstract for my talk), and so I thought it would be a good time to introduce recipy to the wider world. I’ve been looking for something like recipy for ages – and I suggested the idea at the Collaborations Workshop 2015 Hack Day. I got together in a team with Raquel Alegre and Janneke van der Zwaan, and our implementation of recipy won the Hack Day prize! I’m very excited about where it could go next, but first I ought to explain what it is:

So, have you ever run a Python script to produce some outputs and then forgotten exactly how you created them? For example, you created plot.png a few weeks ago and now you want to use it in a publication, but you can’t remember how you created it. By adding a single line of code to your script, recipy will log your inputs, outputs and code each time you run the script, and you can then query the resulting database to find out how exactly plot.png was created.

Does this sound good to you? If so, read on to find out how to use it.

Installation is stupidly simple: pip install recipy

Using it is also very simple – just take a Python script like this:

import pandas as pd
from matplotlib.pyplot import *

data = pd.read_csv('data.csv')

data.plot(x='year', y='temperature')
savefig('graph.png')

data.temperature = data.temperature - 273
data.to_csv('output_kelvin.csv')

and add a single extra line of code to the top:

import recipy
import pandas as pd
from matplotlib.pyplot import *
...(code continues as above)...

Now you can just run the script as usual, and you’ll see a little bit of extra output on stdout:

recipy run inserted, with ID 1b40ce05-c587-4f5d-bfae-498e64d71a6c

This just shows that recipy has recorded this particular run of your code.

Once you’ve done this you can query your recipy database using the recipy command-line tool. For example, you can run:

$ recipy search graph.png

Run ID: 1b40ce05-c587-4f5d-bfae-498e64d71a6c
Created by robin on 2015-08-27T20:50:23
Ran /Users/robin/code/euroscipy/recipy/example_script.py using /Users/robin/.virtualenvs/recipypres/bin/python
Git: commit 4efa33fc6e0a81e9c16c522377f07f9bf66384e2, in repo /Users/robin/code/euroscipy, with origin None
Environment: Darwin-14.3.0-x86_64-i386-64bit, python 2.7.9 (default, Feb 10 2015, 03:28:08)
Inputs:
  /Users/robin/code/euroscipy/recipy/data.csv

Outputs:
  /Users/robin/code/euroscipy/recipy/graph.png
  /Users/robin/code/euroscipy/recipy/output_kelvin.csv

** Previous runs creating this output have been found. Run with --all to show. **
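
To make the fields in that output a bit more concrete, here is a rough sketch of how a record like this could be assembled and searched using only the standard library. It is not recipy's actual code: describe_run, search_runs and the dictionary keys are hypothetical names chosen for illustration, and a plain list stands in for the database.

```python
import datetime
import getpass
import os
import platform
import sys
import uuid


def describe_run(script_path, inputs=(), outputs=()):
    """Build a dictionary with the kind of fields shown in the search output.

    Hypothetical helper for illustration; not part of recipy.
    """
    try:
        author = getpass.getuser()
    except Exception:
        author = "unknown"  # no username available in this environment
    return {
        "id": str(uuid.uuid4()),                 # unique ID for this run
        "author": author,
        "date": datetime.datetime.now().isoformat(timespec="seconds"),
        "script": script_path,
        "interpreter": sys.executable,           # path to the running Python
        "environment": platform.platform(),      # OS and architecture string
        "python_version": sys.version,
        "inputs": list(inputs),
        "outputs": list(outputs),
    }


def search_runs(runs, filename):
    """Return the runs that list `filename` among their outputs."""
    return [
        run for run in runs
        if any(os.path.basename(path) == filename for path in run["outputs"])
    ]


# A list stands in for the database in this sketch.
run = describe_run(
    "example_script.py",
    inputs=["data.csv"],
    outputs=["graph.png", "output_kelvin.csv"],
)
matches = search_runs([run], "graph.png")
print("Run ID:", matches[0]["id"])
```

Matching on the file name alone is a simplification made for this sketch; the output above shows that recipy records full paths.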

You can also view these runs in a GUI by running recipy gui, which will give you a web interface like:

[Screenshot: RecipyGUI, the recipy web interface]

There are more ways to search, and to dig into the details of particular runs: see recipy --help. Full documentation is available on GitHub – including information about how this all works under the hood (it’s all to do with the crazy magic of sys.meta_path).
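
For the curious, the sys.meta_path idea can be sketched in a few dozen lines. The example below is not recipy's implementation, just a minimal illustration of the general technique: a finder placed at the front of sys.meta_path intercepts the import of one module and swaps in a loader that wraps a function once the module has loaded, so each call gets recorded. PatchingFinder, PatchingLoader, LOGGED_OUTPUTS and the toy_plotlib module are all hypothetical names invented for this sketch.

```python
import functools
import importlib
import importlib.abc
import os
import sys
import tempfile

# Hypothetical in-memory log standing in for a real database.
LOGGED_OUTPUTS = []


class PatchingLoader(importlib.abc.Loader):
    """Delegate to the real loader, then wrap one function of the module."""

    def __init__(self, real_loader, function_name):
        self.real_loader = real_loader
        self.function_name = function_name

    def create_module(self, spec):
        return self.real_loader.create_module(spec)

    def exec_module(self, module):
        self.real_loader.exec_module(module)  # run the module's own code
        original = getattr(module, self.function_name)

        @functools.wraps(original)
        def wrapper(path, *args, **kwargs):
            # Assumes the first argument is the output path.
            LOGGED_OUTPUTS.append(os.path.abspath(path))
            return original(path, *args, **kwargs)

        setattr(module, self.function_name, wrapper)


class PatchingFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that intercepts the import of a single module."""

    def __init__(self, module_name, function_name):
        self.module_name = module_name
        self.function_name = function_name

    def find_spec(self, fullname, path, target=None):
        if fullname != self.module_name:
            return None  # not ours: fall through to the normal finders
        # Ask the other finders for the real spec, skipping ourselves.
        for finder in sys.meta_path:
            if finder is self or not hasattr(finder, "find_spec"):
                continue
            spec = finder.find_spec(fullname, path, target)
            if spec is not None:
                spec.loader = PatchingLoader(spec.loader, self.function_name)
                return spec
        return None


def demo():
    """Import a toy module through the hook and call its wrapped function."""
    with tempfile.TemporaryDirectory() as tmp:
        # A stand-in for a plotting library with a savefig-style function.
        with open(os.path.join(tmp, "toy_plotlib.py"), "w") as f:
            f.write(
                "def savefig(path):\n"
                "    with open(path, 'w') as out:\n"
                "        out.write('pretend this is a PNG')\n"
            )
        finder = PatchingFinder("toy_plotlib", "savefig")
        sys.meta_path.insert(0, finder)  # must come before the default finders
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            toy_plotlib = importlib.import_module("toy_plotlib")
            target = os.path.join(tmp, "graph.png")
            toy_plotlib.savefig(target)  # the call is logged, then executed
            return target, os.path.exists(target)
        finally:
            # Undo the changes to the import system.
            sys.meta_path.remove(finder)
            sys.path.remove(tmp)
            sys.modules.pop("toy_plotlib", None)


if __name__ == "__main__":
    demo()
    print(LOGGED_OUTPUTS)
```

A hook like this only sees imports that happen after it has been installed, which would explain why the import recipy line needs to go at the very top of the script.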

So – please install recipy (pip install recipy), let me know what you think of it (feel free to comment here, or email me at robin AT rtwilson.com), and please submit issues on GitHub for any bugs you run into (pull requests would be even nicer!).


Categorised as: Programming, Python


2 Comments

  1. Marcel Stimberg says:

    Looks very interesting! Are you aware of Sumatra (https://pythonhosted.org/Sumatra/)? It’s not quite the same approach (you have to explicitly state which input files you use) but it tries to solve the same problem.
