Category Archives: Computing
This is the first of a number of posts based upon discussions I had at the Collaborations Workshop 2013 (#CollabW13 on Twitter) in Oxford, UK. During one of the sessions I described a simple technique that I try to use to increase the sustainability, reproducibility and releasability of the code I write, the data I collect and the results of my work – and people thought the idea was great, and that I should blog about it…
So, what is this wonderful technique:
On Friday afternoon (when you’re bored and can’t be bothered to do any hard work…) spend an hour or two cleaning up and documenting your work from that week
It’s a very simple idea, but it really does work – and it also gives you something to do during the last afternoon of the week when you’re feeling tired and can’t really be bothered. If you can’t commit to doing it every week, you can try every fortnight or month – and I even go as far as adding it as an event to my calendar, to try and stop people arranging meetings with me then!
So, what sort of things can you do during this time?
- Document your code: Depending on the project and the intended use of the documentation, this can be anything from adding some better comments to your code, to documenting individual functions/methods (for example, using docstrings in Python) or writing high-level documentation of the whole system.
- Refactor your code: Refactoring is a “disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behaviour” – that is, basically tidying up, cleaning up, and possibly redesigning your code. If you’re anything like me, the code you write when actually doing science isn’t very neat, nice or well-designed – because you’re focused on the science at that point. These few hours on a Friday are your time to focus on the code for a bit…
- Generalise your code to create a library: There are probably a number of things in your code that could be used in many other programs – things like reading certain file formats, performing statistical operations or applying algorithms to array data. This is a perfect time to take these algorithms and ‘decouple’ them from their immediate application in this code so that they can be easily used in other programs you may write in the future. Ideally, this generalised functionality can be packaged into a library and distributed for others to use. My Py6S library was created in this way: I took code that I was using to run the 6S model, generalised it into a library, documented it well, released it – and it has taken on a life of its own!
- Add README files to your folders: If you’re anything like me, you’ll have loads of folders containing data and code – some of which isn’t particularly well named and may not have metadata. One of the easiest (and most effective) ways to deal with this is to create simple README files in each folder explaining what is in the folder, where it came from, what format it’s in – basically anything that you think you’ll want to know about it in a year’s time if you come back to it. I can say from experience just how useful having these README files is!
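To make the documentation suggestion above concrete, here is a minimal sketch of a Python docstring. The function and the file format it reads are entirely hypothetical – the point is just to show the kind of information worth recording:

```python
import csv


def read_reflectance(filename):
    """Read a reflectance spectrum from a CSV file.

    Parameters:
        filename: path to a two-column CSV of wavelength (nm) and reflectance.

    Returns:
        A list of (wavelength, reflectance) tuples, as floats.
    """
    with open(filename) as f:
        return [(float(w), float(r)) for w, r in csv.reader(f)]
```

Even a docstring this short answers the questions you will otherwise be asking yourself in a year's time: what goes in, what comes out, and what the units are.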
The key benefit of all of these is that it makes it so much easier to come back to your research later on, and it also makes it so much easier for you to share your research, make it reproducible and allow others to build upon it – and the great thing is that it doesn’t even take that much work. Thirty seconds writing a few notes in a README file could easily save you a week of work in a year’s time, and extending your code into a library would allow you to re-use it in other projects without much extra work.
Another similar idea that was mentioned by someone else at the Collaborations Workshop was for Research Councils to force people to add extra time to the end of their grants to do this sort of thing – although personally, I think it is a far better idea to do this as you go along. Trying to add documentation to some code that you wrote two years ago is often quite challenging…
So, there it is – a simple way to use up some time at the end of the week (when you can’t really be bothered to do anything ‘new’) which will significantly improve the sustainability, reproducibility and releasability of your code and data. Try it out, and let us know how you do in the comments below!
Recently I was shocked to find that there didn’t seem to be a simple tool which would convert BibTeX files to COINS metadata span tags – so I wrote one!
That sentence probably made no sense to you – so let’s go through it in a bit more depth. I use LaTeX to write all of my reports for my PhD, and therefore I keep all of my references in BibTeX format. I also use BibTeX format to keep a list of all of my publications, which I then use to automatically produce a nice listing of my publications on my website. I’ve recently become a big fan of Zotero, which will import references from webpages with a single click. This works for many sites like Google Scholar, Web of Knowledge, Science Direct etc – and I wanted the same for my publications page.
Examining the information given on the Zotero Exposing Metadata page suggests that one of the ways to expose this metadata in an HTML page is to use COINS (ContextObjects in Spans). This involves putting a number of strange-looking <SPAN> elements into your HTML page, which Zotero (and various other tools like Mendeley) will then use to automatically add the bibliographic data to their database.
So, how should I create the COINS metadata? Well, you can generate one item at a time using the online generator, or you can export items from Zotero as COINS, but neither of these methods can be automated. I’d really like to have a simple command-line tool that would take a BibTeX file and produce COINS metadata for all of the entries in the file…
So that’s what I created! It’s called bib2coins and it is available on the Python Package Index; to install it, simply run pip install bib2coins and it will automatically be placed on your path. You can then just run it as bib2coins bibtexfile.bib and it will print out a load of COINS spans to standard output – just ready for you to pipe into the middle of a HTML file!
The code is fairly simple, and uses a BibTeX parser written by Vassilios Karakoidas combined with my own code to create the COINS spans themselves. It is not finished yet, and currently works well for journals and ‘inproceedings’ items but hasn’t been tested on much else (I haven’t written any books, so I’m not so concerned about creating COINS metadata for them!). However, I will be updating this tool to support more bibliographic item types in the near future.
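To give a flavour of what a COINS span actually is: it is just an empty <span> element with the class Z3988, whose title attribute is a URL-encoded OpenURL query. Here is a rough sketch of the idea in Python – the field names shown are illustrative OpenURL keys, not necessarily exactly what bib2coins emits:

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2


def make_coins_span(fields):
    """Build a COINS <span> from a dict of OpenURL key/value pairs."""
    return '<span class="Z3988" title="%s"></span>' % urlencode(sorted(fields.items()))


span = make_coins_span({
    'ctx_ver': 'Z39.88-2004',
    'rft.atitle': 'An example article title',
    'rft.date': '2012',
})
```

The span is invisible on the rendered page, but tools like Zotero scan the HTML for class="Z3988" and decode the title attribute back into a bibliographic record.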
Sphinx is a great tool for documenting Python programs (and lots of other things – I recently saw a lecturer who had done all of his lecture notes using Sphinx!) and I’ve used it for my new project (which will be announced on this blog in the next few days). Now that the project is near release, I wanted to get the documentation onto ReadTheDocs to make it nice and easily accessible (and so that it can be easily built every time I commit to GitHub).
The theory is that you just point ReadTheDocs to your GitHub repository, it finds the Sphinx conf.py file, and all is fine. However, if you use any module outside of the standard library and you’re using the Sphinx autodoc extension, it will fail to build the documentation. This is because the Python code that you are documenting must be importable for autodoc to work, and if it tries to import a module that doesn’t exist on a default Python install then an error will be produced.
The ReadTheDocs FAQ says that you can setup a pip_requirements file to install any modules that are needed for your code, but this won’t work for any modules that include C code. This is understandable – as ReadTheDocs don’t want any random C code executing on their server – but it means that trying to build the docs for any code that uses numpy, scipy or matplotlib (or many other modules) will fail.
The FAQ suggests how to solve this – using a so-called ‘mock’. This is an object that pretends to be one of these modules, so that it can be imported, but doesn’t actually do anything. This doesn’t matter, as it is not normally necessary to actually run the code to produce the docs, just to be able to import it. However, the code that is provided by ReadTheDocs doesn’t work for any modules that you import using the * operator – for example, from matplotlib import *. After asking a StackOverflow question, I found out how to fix the code that ReadTheDocs provide, but a comment suggested a far easier approach: simply add code like the following to the top of your conf.py file:
import sys
import mock

MOCK_MODULES = ['numpy', 'scipy', 'matplotlib', 'matplotlib.pyplot', 'scipy.interpolate']
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = mock.Mock()
In the MOCK_MODULES list put the names of all of the modules that you import. It is important to list submodules (such as matplotlib.pyplot) as well as the main modules. After committing your changes and pushing to GitHub, you should find that your docs compile properly.
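You can see why this trick works with a quick experiment: once a Mock has been installed in sys.modules, any import of that name succeeds, and any attribute access or call on it just returns another mock object instead of failing. (Here I use unittest.mock from the standard library – the standalone mock package behaves the same way; the module name 'fakenumpy' is made up for this demo.)

```python
import sys
from unittest import mock

# Pretend a module called 'fakenumpy' exists (the name is made up for this demo)
sys.modules['fakenumpy'] = mock.Mock()

import fakenumpy               # succeeds, even though no such module is installed
arr = fakenumpy.zeros((3, 3))  # computes nothing, but crucially doesn't raise
```

Since autodoc only needs the import to succeed in order to read your docstrings, a do-nothing stand-in like this is all that's required.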
In a project recently I was struggling to find a way to parse strings that contain a date range, for example:
- 27th-29th June 2010
- Tuesday 29 May -> Sat 2 June 2012
- From 27th to 29th March 1999
None of the Python modules I investigated (including parsedatetime) seemed to be able to cope with the range of strings that I had to deal with. I investigated patching parsedatetime to allow it to do what I wanted, but I found it very hard to get into the code. So, I thought, why not write my own…
So I did, and I’ve released it under the LGPL and you can install it right now by running:
pip install daterangeparser
The current version will parse a wide range of formats (see the examples in the documentation) and will deal with individual dates as well as date ranges. The API is very simple – just import the parse method and run it, giving the date range string as an argument. For example:
from daterangeparser import parse

print parse("14th-19th Feb 2010")
This will produce an output tuple with two datetime objects in it: the start and end date of the range you gave.
The parser is built using PyParsing – a great Python parsing framework that I have found very easy to get to grips with. It is incredibly powerful, very easy to use, and really shows how limited regular expressions can be! Now that I’ve done this I have an urge to use PyParsing to write parsers for all of the horrible scientific data formats that I have to deal with in my PhD….watch this space!
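As a taste of what PyParsing looks like, here is a toy grammar in the same spirit – this is my own simplified sketch for one fixed format, not the actual daterangeparser grammar:

```python
from pyparsing import Optional, Suppress, Word, nums, oneOf

# A day number with an optional ordinal suffix, e.g. '14th'
day = Word(nums, max=2) + Suppress(Optional(oneOf("st nd rd th")))
month = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
year = Word(nums, exact=4)

# Matches strings like '14th-19th Feb 2010'
date_range = day("start") + Suppress("-") + day("end") + month("month") + year("year")

result = date_range.parseString("14th-19th Feb 2010")
```

The grammar reads almost like the format it describes, and the named results ("start", "end" and so on) can be pulled straight out of the parse result – try expressing that optional ordinal suffix in a regular expression and keeping it readable!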
This problem is known by various names such as:
- Ctrl-Space doesn’t do anything in Eclipse!
- Why can’t I get auto-complete to work properly in Eclipse?
- I’ve just set up a new University computer and things don’t work like they do on my laptop (maybe that one’s just me…)
It’s actually very simple to solve, but the problem has nothing to do with Eclipse itself. First of all, let’s see what the problem is:
You’ve just installed Eclipse, are starting to do some programming in it, and want to use the very handy auto-complete feature. So, you type part of a function name and press Ctrl-Space, waiting for the magic to work and the rest of the name to be automatically typed….but it doesn’t happen!
In the image above (which unfortunately doesn’t include the cursor) I had typed ST, and pressed Ctrl-Space to autocomplete it but nothing happened.
When trying to fix this myself, I went into the Eclipse options (Windows->Preferences, then General->Keys) and tried to find the command for auto-complete. Helpfully, it’s not called autocomplete or anything like that – it’s called Content Assist. This showed me that, as I expected, Content Assist was assigned to Ctrl-Space:
So why wasn’t Eclipse picking this up? I tried setting the key for Content Assist manually, but when I deleted the text in the key binding box and pressed Ctrl-Space, it showed that only Ctrl registered – somehow the spacebar press was being ‘eaten up’ by something else. What could it be?
The simple answer is: the Windows language services utility – obvious really! This seems to be set by default (at least some of the time) to switch between languages using Ctrl-Space. On my personal computer I only have one language set up (English (UK)), but on the university computers there are loads – French, German, Italian, Chinese (simplified) etc. You can find out what languages you have set up by going to Control Panel -> Region and Language -> Keyboards and Languages (tab) and then Change Keyboards (again, how obvious…). You’ll see a list of the languages installed – remove any that you don’t want (click the language, then click the Remove button). That fixed it for me, but you can also check the Advanced Key Settings tab to make sure that none of the keyboard shortcuts that are set include Ctrl-Space.
Once you’ve done that, Ctrl-Space should work nicely.
I’ve just discovered something that I feel I must share here – partly to make more people aware of it, and partly so I don’t forget it. In the IDL programming language you will sometimes find your program interrupted by a line saying something like:
% Program caused arithmetic error: Floating divide by 0
Sometimes it will be obvious where the error is – but often you can spend ages looking for it (just like with segfaults in C…). What I only just found out is that if you run the command

!EXCEPT=2

at the IDL prompt before running your program, you will get a far more informative error message like
% Program caused arithmetic error: Floating divide by 0
% Detected at JUNK 3 junk.pro
The key thing is that this tells you what line the error occurred on (line 3 of junk.pro in the above example) – which helps you to narrow down the problem far more quickly.
More details on the values that !EXCEPT can take are available here – basically the options are no messages, unhelpful messages and helpful messages.
This is very useful – but just beware that running with !EXCEPT=2 all the time will slow down your code, so only do it if you need to for debugging purposes.
As part of my research I do a fair amount of data collection in the field. Some of the instruments I use are very modern and connect to a computer via USB, interacting with custom-written client software which allows such luxuries as timed logging, triggered logging and local calibration. However, a number of the instruments are older and don’t have computer-based logging capability, requiring you to log data to their internal memory and then download it later.
This is often perfectly satisfactory, but timing can be an issue: when taking measurements using a number of instruments, it is often important to make sure that they are all taken at the same time. For example, if spectral measurements are being taken and other instruments (such as sunshine sensors, like the sensor shown below) are being used to gather data for atmospherically correcting the spectra, then it is very important that the measurements are simultaneous. This is particularly a problem in areas of fast-changing weather like the UK, where sky conditions can change very quickly.
A tool called SJinn allows you to send simple strings over an RS-232 (standard serial port) connection and then obtain data sent back by the instruments. One of the examples given by SJinn is the following:
rs232 -b600 -p7n2 -s"\n" -r16
This sends a newline character over the serial port (at 600 baud with 7 data bits and 2 stop bits) and then returns the next 16 characters sent on the line. In this case, it would provide the voltage measured by a digital voltmeter. As this is simply a command-line tool, it is very easy to combine into scripts, and thus use to collect timed measurements (e.g. via the use of the cron daemon). I have used similar techniques to obtain measurements from the sunshine sensor shown above – a script for which will be available on my website soon.
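As a sketch of how the cron approach might look (the log path and once-a-minute schedule here are made up for illustration), a crontab entry could run the same command and timestamp each reading – note that % has to be escaped as \% inside a crontab line:

```shell
# m h dom mon dow  command
* * * * * echo "$(date -u +\%Y-\%m-\%dT\%H:\%M:\%SZ) $(rs232 -b600 -p7n2 -s'\n' -r16)" >> $HOME/voltmeter.log
```

Each line of the log file then contains a UTC timestamp followed by the instrument reading, which makes it straightforward to match readings against measurements from the other instruments later.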
You may find, as I have done recently, that a network printer installed on a Windows Vista machine suddenly starts showing as Offline, even when other machines on the network can access it fine. I originally thought it would be an IP address issue, but it turned out not to be anything to do with that. In fact, the solution was far simpler – but also slightly strange…
It turns out that Windows Vista automatically enables SNMP support for networked printers, and if it can’t get a response to an SNMP message then it assumes the printer is offline. SNMP stands for Simple Network Management Protocol and is a way of getting information from network devices (such as routers, servers and printers), mainly for the purposes of finding out whether there are any problems with them. A number of networked printers implement SNMP and will respond to queries with information, but some don’t. My printer (a fairly old Lexmark T640) is one of the ones that doesn’t implement it – so of course Vista will never get a response to an SNMP message. The result is that the printer starts showing as offline at a seemingly random time, because Vista has just sent an SNMP message to it and it hasn’t responded.
Thankfully there is a simple way to fix this – and it just involves telling Vista not to try and communicate with the printer via SNMP. Simply right-click on the printer in the Printers window, choose the Ports tab, and select Configure Port. At the bottom you will see a checkbox saying something like SNMP Status Enable. Untick that, and the printer should start showing as online again.
(Update: If this doesn’t work, then try the method described in Coxy’s comment, below)
The first piece of software in my series of essential OS X software is a very handy tool which reminds you when you’ve forgotten to attach a file to an email. How does it do this? Well, it searches for key words in the email and reminds you if, for example, you use the word ‘attached’ without attaching a file.
This sort of functionality is already present in a number of other email apps such as GMail and Thunderbird, but isn’t present by default in OS X’s mail application. However, this free tool will add it. Simply download it from http://eaganj.free.fr/code/mail-plugin/ and follow the instructions (just make sure you download the beta version if you’ve got Snow Leopard, or it won’t work!)
I have recently discovered PyDev – a Python IDE which runs within Eclipse. Although I’d given up on big all-singing, all-dancing IDEs a few years ago I’m really liking it. The Ctrl-Space completion is very handy, as are the number of refactorings that are available from the menus.
Anyway, I use the Enthought Python Distribution (EPD) on my Mac, as it provides Python with a number of important scientific libraries (NumPy, SciPy, Matplotlib etc) in an easy-to-install package for OS X. It’s really handy – and is free for academic use. The only problem with using EPD is that applications can sometimes get confused between EPD and the Apple-provided version of Python.
It turns out that PyDev is one of those applications. If you follow the PyDev installation instructions, it suggests you click the Auto Config button to configure your Python interpreter. This will not work for EPD! Instead, (after deleting the interpreter you have configured already, if you’ve already configured one), click the New button and then fill in the fields as below:
Interpreter Name: This is just a name to refer to the interpreter by – it can be anything you like. I tend to use EPDPython.
Interpreter Path: You’ll need to find the python executable provided by EPD. This is normally located somewhere like:
The best way to find it is to navigate from /Library down the path, choosing the most sensible folder at each stage. When you get to the Versions folder, make sure you choose the latest version (highest number) folder, and then choose the bin directory and then the python executable. Once this is done, PyDev will automatically find the relevant folders to add to your PYTHONPATH, and everything will be working.