Slightly boringly, this is very similar to my last post – but it’s also something useful that you may want to know, and that I’ll probably forget if I don’t write it down somewhere.
I run convolutions a lot on satellite images, and Landsat images are around 8000 x 8000 pixels. Using a random 8000 x 8000 pixel image with a 3 x 3 kernel (a size I often use), I find that convolve2d takes 2.9 seconds, whereas convolve takes 1.1 seconds – a useful speed-up, particularly if you’re running a loop of various convolutions. The same sort of speed-up can be seen with larger kernel sizes: for example, with a 9 x 9 kernel, convolve2d takes 14.7 seconds and convolve takes 6.8 seconds – an even more useful speed-up.
When I switched my code from convolve2d to convolve, I had to do a bit of playing around to get the various options set to ensure that I got the same results. For reference, the following two lines of code are equivalent:
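The exact lines from my code aren’t reproduced here, but assuming convolve refers to scipy.ndimage.convolve and convolve2d to scipy.signal.convolve2d, the matching calls look something like this for an odd-sized kernel (the boundary-handling options are the part that needs care):

import numpy as np
from scipy.ndimage import convolve
from scipy.signal import convolve2d

image = np.random.rand(8000, 8000)
kernel = np.ones((3, 3)) / 9.0

# Both calls pad the edges with zeros and return an output the same size
# as the input, so for an odd-sized kernel the results are identical.
result_slow = convolve2d(image, kernel, mode='same', boundary='fill', fillvalue=0)
result_fast = convolve(image, kernel, mode='constant', cval=0.0)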
Interestingly these functions are both within scipy, so there must be a reason for them both to exist. Can anyone enlighten me? My naive view is that they could be merged – or if not, then a note could be added to the documentation saying that convolve is far faster!
This entry in my series covers manifestoclouds: my code for producing word clouds from political party manifestos.
This is very simple, generic code that just ties together a few libraries – and is by no means restricted to just political party manifestos – but I keep it around because I find it useful occasionally. I use the Python WordCloud library and matplotlib to produce the word clouds – after manually extracting the text from the PDFs using one of a myriad of tools depending on your platform.
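The core of the code is only a few lines; a minimal sketch of the approach (the filename and figure settings here are illustrative, not necessarily those in the repository) looks something like:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the manually-extracted manifesto text
with open('manifesto.txt') as f:
    text = f.read()

# Build the word cloud and display it with matplotlib
wc = WordCloud(width=800, height=600, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()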
The results look quite nice – and the results for the political party manifestos are quite politically interesting:
Everyone is very focused on people and what the party will do, but there are some interesting differences. The Conservatives have the joint focus on continuing things they have been doing during their years in coalition, but also doing various new things now that they’re no longer ‘held back’ by the Lib Dems. The Green party focus a lot on local things, and have a focus on what the public (both in terms of people, and in terms of the public sector) can do. The Labour party are talking a lot about needs, but also talk about work more than the Tories (surprisingly!).
Anyway, ignoring the politics, the code is available on Github as usual.
Recently I’ve been investigating a key dataset in my research, and really seeking to understand what is causing the patterns that I see. I realised that it would be really useful if I could plot an interactive scatter plot in Python, and then hover over points to find out further information about them.
Putting this into more technical terms: I wanted a scatter plot with tooltips (or ‘hover boxes’, or whatever else you want to call them) that showed information from the other columns in the pandas DataFrame that held the original data.
I’m fairly experienced with creating graphs using matplotlib, but I have very little experience with other systems (although I had played a bit with bokeh and mpld3). So, I emailed my friend and colleague Max Albert, asking if he knew of a nice simple function that looked a bit like this:
# df is a pandas DataFrame with columns A, B, C and D
#
# scatter plot of A vs B, with a hover tool giving the
# values of A, B, C and D
plot(A, B, 'x')
# same, but only showing C (to deal with DataFrames with
# loads of columns)
plot(A, B, 'x', cols=['C'])
He got back to me saying that he didn’t know of a function that worked quite like that, but that it was relatively simple to do in bokeh – and sent some example code. I’ve now tidied that code up and created a function that is very similar to the example above.
My function is called scatter_with_hover, and using it, the two examples above would be written as:
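I haven’t reproduced the full function here, but assuming it takes the DataFrame followed by the x and y column names (as in the gist linked at the end of this post), the two examples above become roughly:

# df is a pandas DataFrame with columns A, B, C and D, and scatter_with_hover
# is the function from the gist linked below

# scatter plot of A vs B, with a hover tool giving the values of A, B, C and D
scatter_with_hover(df, 'A', 'B')

# same, but only showing C in the hover tool
scatter_with_hover(df, 'A', 'B', cols=['C'])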
You can pass all sorts of other parameters to the function too – such as marker shapes to use (circles, squares etc), a figure to plot on to, and any other parameters that the bokeh scatter function accepts (such as color, size etc).
Anyway, the code is available at this gist – feel free to use it in any way you want, and I hope it’s useful.
As a fellow of the Software Sustainability Institute, I have talked a lot about how important software is in science, and how we need to make sure we recognise its contribution. Writing good scientific software takes a lot of work, and can have a lot of impact on the community, but it tends not to be recognised anywhere near as much as writing papers. This needs to change!
Depsy, a new project run by ImpactStory, is an awesome new tool which allows you to investigate impact metrics for scientific software which has been published to the Python and R package indexes (PyPI and CRAN respectively). For example, if you go to the page for Py6S, you see something like this (click to enlarge):
This gives various statistics about my Py6S library, such as number of downloads, number of citations, and the PageRank value in the dependency network. However, far more importantly it gives the percentiles for each of these too. This is particularly useful for non-computing readers who may have absolutely no idea how many downloads to expect for a scientific Python library (what counts as "good"? 20 downloads? 20,000 downloads?). Here, they can see that Py6S is at the 89th percentile for downloads, 77th for citations and 84th for dependencies, and has an overall ‘impact’ percentile of 95%.
As well as making me feel nice and warm inside, this gives me some actual numbers to show that the time I spent on Py6S hasn’t been wasted, and it has actually had an impact. You can take this further and click on my name on the bottom left, and get to my metrics page, which shows that I am at the 91st percentile for impact…again something that makes me feel good, but can also be used in job applications, etc.
I won’t tell you how long I spent browsing around various projects and people that I know, but it was far too long – time when I really should have been doing something else instead (fixing bugs in an algorithm, I think).
Anyway, I thought I’d finish this post by summarising a few of the brilliant things about Depsy, and a few of the areas that could be improved. I’ve discussed all of the latter issues with the authors, and they are very keen to continue working on Depsy and improve it – and are currently actively seeking funding to do this. So, if you happen to have lots of money, please throw it their way!
So, great things:
The percentiles. I’ve talked about this above, but they’re really great. (I think percentiles should be used a lot more in life generally, but that’s another blog post…)
Their incredibly detailed about page, which links to huge amounts of information on how every statistic is calculated
The open-source code that powers it, which allows you to check their algorithms and hack around with it (yes, I did spend a little while checking to make sure that they weren’t vulnerable to Little Bobby Tables attacks)
The responsiveness of the development team: Jason and Heather have responded to my emails quickly, and taken on board lots of my comments already.
A few things that will be improved in the future, and a few more general concerns:
Currently they try to guess whether a library is research software or not, but it seems to get a fair few things wrong. I believe they’re actively working on this.
Citation searching is a bit dodgy at the moment. For example, it shows one citation for Py6S, but actually there is a paper on Py6S which has seven citations on Google Scholar. My idea for CITATION files could help them to link up packages and papers – so that’s another reason to add a CITATION file to your project!
Not all scientific software is in Python and R, or hosted on Github, so they are hoping to expand their sources in the future.
More generally, the issue with all metrics is that they never quite manage to capture what you’re interested in. Unfortunately, metrics are a fact of life in academia – but we need to be careful that we don’t start blindly applying ‘software impact metrics’ in unsuitable situations. Still, even if these metrics do start being used stupidly, they’re definitely better than the Impact Factor (after all, we know how they’re calculated, for a start!).
So, enough waffling from me: go and check out Depsy, if you find any problems then post an issue on Github – and tell your friends!
This is just a brief public service announcement reporting something that I’ve just found: np.percentile is a lot faster than scipy.stats.scoreatpercentile – almost an order of magnitude faster in some cases.
Someone recently asked me why on earth I was using scoreatpercentile anyway – it turns out that np.percentile was only added in numpy 1.7, which was released part-way through my PhD in February 2013, which is why the scipy function is still used in some of my code.
In my code I frequently calculate percentiles from satellite images represented as large 2D numpy arrays – and the speed differences can be quite astounding:
Image size     scoreatpercentile   percentile   Speedup
100 x 100      595 µs              169 µs       3.5x
1000 x 1000    84 ms               13 ms        6.5x
3000 x 3000    927 ms              104 ms       9x
8000 x 8000    8 s                 1 s          8x
As you can see, we get 3-4 times speedup for even small arrays (100 x 100, so 10,000 elements), and up to 8-9 times speedup for large arrays (tens of millions of elements).
Anyway, the two functions have very similar signatures and options – the only thing missing from np.percentile is the ability to set hard upper or lower limits – so it should be fairly easy to switch over, and it’s worth it for the speed boost!
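As a quick illustration of the switch (the 95th percentile and the array size here are arbitrary choices of mine, not from the timings above):

import numpy as np
from scipy.stats import scoreatpercentile

image = np.random.rand(3000, 3000)

# Both functions flatten the array by default and use linear interpolation,
# so the results should agree
slow = scoreatpercentile(image, 95)
fast = np.percentile(image, 95)

assert np.isclose(slow, fast)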
I was going to post this as one of my ‘previously unpublicised code’ posts, but that would be stretching the title a bit, as I have previously blogged about my implementation of the van Heuklon (1979) ozone model.
This is just a brief update (in the spirit of my ‘previously unpublicised code’ posts) to say that I have now packaged up my implementation as a proper python package, added a bit more documentation, and a license and all of the other things that a proper library should have. I can’t say it’s the most exciting library ever – it has one function in it – but it could be useful for someone.
So, next time you need to calculate ozone concentrations (and, be honest, you know the time will come when you need to do that), just run:
and then you can simply run something like this to get all of the ozone concentration estimates you could ever need:
from vanHOzone import get_ozone_conc
get_ozone_conc(50.3, -1.34, '2015-03-14 13:00')
The full documentation is in the docstring:
get_ozone_conc(lat, lon, timestamp)
Returns the ozone contents in matm-cm for the given latitude/longitude and timestamp (provided as either a datetime object or a string in any sensible format - I strongly recommend using an ISO 8601 format of yyyy-mm-dd) according to van Heuklon's Ozone model.
The model is described in Van Heuklon, T. K. (1979). Estimating atmospheric ozone for solar radiation models. Solar Energy, 22(1), 63-68.
This function uses numpy functions, so you can pass arrays and it will return an array of results. The timestamp argument can be either an array/list or a single value. If it is a single value then this will be used for all lat/lon values given.
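For example, a quick sketch of the array behaviour described above (the coordinate values here are made up):

import numpy as np
from vanHOzone import get_ozone_conc

# Hypothetical coordinates – any arrays of latitudes and longitudes will do
lats = np.array([50.3, 51.0, 52.5])
lons = np.array([-1.34, -0.5, 1.2])

# A single timestamp is used for all of the lat/lon values, and an array
# of ozone concentrations (in matm-cm) is returned
concs = get_ozone_conc(lats, lons, '2015-03-14 13:00')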
(As an aside, this is my first time releasing a conda package, and I think it worked ok!)
No, I’m not seven years old (even though I may act like it sometimes)…but this blog is! My first blog post was on the 11th November 2008 – a rather philosophical post entitled Wisdom and Knowledge – and so on this anniversary I thought I’d look at my blog over the years, the number of visits I’ve got, my most popular posts, and so on.
The earliest snapshot of my website available at www.archive.org is shown below, with my second ever post on the home page (an article that was later republished in the Headington School magazine).
Traffic
Unfortunately it’s a little difficult to look at statistics across the whole of this blog’s life – as I used to use Piwik for website analytics, but switched to Google Analytics part-way through 2014 (when Piwik just got too slow to use effectively). Cleverly, I didn’t export my Piwik data first – so I can’t look at any statistics prior to 2014.
However, I can say that since August 2014, around 240,000 people have viewed my blog, with around 300,000 individual page views – a fairly impressive total, averaging around 15,000-20,000 visitors per month, but increasing in recent months to around 25,000 per month.
There’s a very noticeable pattern in page views, with significantly fewer views at weekends – probably reflecting the work-like nature of my blog.
The vast majority of these views come from so-called ‘organic searches’ – basically just people searching online for something. The referrals I get tend to be either from social networks (mainly Twitter, but sometimes Facebook too), or from ‘blog aggregators’ that some of my posts are contributed to (my R posts can be found on R Bloggers and my Python posts on Planet Python).
Most popular posts
So, what are these people coming to read? Well, a lot of them are coming to find out how to fix a network printer suddenly showing as offline in Windows – in fact, 51% of all of my page views have been for that post! It’s actually become quite popular – being linked from Microsoft support pages, Microsoft TechNet forums, SpiceWorks forums and more.
My second most popular post is also a how to post – in this case, how to solve Ctrl-Space autocomplete not working in Eclipse (10% of page views), and my third most popular ‘post’ is actually the second page of comments for the network printer post…obviously people are very keen to fix their network printers!
These posts make sense as the most frequently viewed posts: they are either posts that help solve very frustrating problems, posts which have been linked from high-traffic websites (like the Guardian), posts which show you how to do something useful, or posts that have received a lot of attention on Twitter.
My favourite posts
It’s hard to choose some favourites from the 130-odd (some very odd!) posts that I’ve published – but I’m going to try and choose some from various categories:
Want to write some code? Get away from your computer – I reference this post a lot when teaching programming, as it is far too tempting (particularly for newbies) to start fiddling around with the code, rather than actually thinking!
(These posts had good ideas in them, but I never really followed them up…maybe in the future)
Rules of thumb in remote sensing – a list of common ‘rules of thumb’ in my field and roughly where they come from. I think this sort of ‘common sense’ should be documented more in most fields.
Why do I blog?
Well, partly because it’s fun, partly because it’s useful and partly because I feel I ought to. Let’s take those separately:
It’s fun to write interesting articles about things. I enjoy sharing my knowledge about topics that I know a lot about, and I enjoy writing the more philosophical articles.
It’s useful because it records information that I tend to forget (I can never remember why I get this error message without looking it up – and I’ve found my own blog from Google searches many times!), it gets me known in the community, it publicises the things that I do, and it shares useful information with other people.
I feel that I have an obligation to share things, from two perspectives: 1) as a scientist, who is funded mostly by taxpayers, I feel I need to share my research work with the public, and 2) as a user of lots of open-source software, I feel a need to share some of my code, and share my knowledge about programming.
(For another useful guide on reasons to blog as an academic, see Matt Might’s article – in fact, read his whole blog – it’s great!)
Conclusion
I started this blog as an experiment: I had no idea what I would write, and no idea whether anyone would read it. I posted very little in the first few years, but have been posting semi-regularly since 2011, and people actually seem to be reading it. Hopefully, I will continue for another seven years…
Linear regression is often used to estimate the relationship between two variables – basically by drawing the ‘line of best fit’ on a graph. The mathematical method used for this is known as Least Squares, and aims to minimise the sum of the squared errors for each point. The key question here is: how do you calculate the error (also known as the residual) for each point?
In standard linear regression, the aim is to predict the Y value from the X value – so the sensible thing to do is to calculate the error in the Y values (shown as the gray lines in the image below). However, sometimes it’s more sensible to take into account the error in both X and Y (as shown by the dotted red lines in the image below) – for example, when you know that your measurements of X are uncertain, or when you don’t want to favour the errors of one variable over the other.
Orthogonal Distance Regression (ODR) is a method that can do this (orthogonal in this context means perpendicular – so it calculates errors perpendicular to the line, rather than just ‘vertically’). Unfortunately, it’s a lot more complicated to implement than standard linear regression, but fortunately there is some lovely Fortran code called ODRPACK that does it for us. Even more fortunately, the lovely scipy people have wrapped this Fortran code in the scipy.odr Python module.
However, because of the complexity of the underlying method, using the scipy.odr module is a lot harder than the simple scipy.stats.linregress function – so I’ve written some code to make it easier. The code below provides a function, orthoregress, which implements ODR and behaves in exactly the same way as the linregress function. For more information, see the ODRPACK User Guide, and the scipy.odr documentation.
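The original listing isn’t reproduced here; the sketch below shows the general approach – using the ordinary least-squares fit as the starting estimate for the ODR fit – and may differ in detail from my published version:

from scipy import odr
from scipy.stats import linregress


def orthoregress(x, y):
    """Perform Orthogonal Distance Regression on x and y.

    Returns values in the same order as linregress: slope, intercept,
    r-value, p-value, stderr. The r, p and stderr values are taken from
    the ordinary least-squares fit, as ODR has no direct equivalents.
    """
    # Ordinary linear regression, used as the starting estimate for ODR
    lin = linregress(x, y)

    # The model to fit: y = B[0] * x + B[1]
    def f(B, x):
        return B[0] * x + B[1]

    model = odr.Model(f)
    data = odr.Data(x, y)
    fit = odr.ODR(data, model, beta0=lin[0:2]).run()

    return list(fit.beta) + [lin[2], lin[3], lin[4]]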
When looking through my profile on Github recently, I realised that I had over fifty repositories – and a number of these weren’t really used much by me anymore, but probably contained useful code that no-one really knows about! So, I’m going to write a series of posts giving brief descriptions of the code and what it does, and point people to the Github repository and any documentation (if available). I’m also going to take this opportunity to ensure that every repository that I publicise has a README file and a LICENSE file.
So, let’s get going with the first repository, which is RTWIDL: a set of useful functions for the IDL programming language. It’s slightly incorrect to call this "unpublicised" code as there has been a page on my website for a while, but it isn’t described in much detail there.
These functions were written during my Summer Bursary at the University of Southampton, and are mainly focused on loading data from file formats that aren’t supported natively by IDL. Specifically, I have written functions to load data from the OceanOptics SpectraSuite software (used to record spectra from instruments like the USB2000), Delta-T logger output files, and NEODC Ames format files. This latter format is interesting – it’s a modification of the NASA Ames file format, so that it can store datetime information as the independent variable. Unfortunately, due to this change none of the ‘standard’ functions for reading NASA Ames format data in IDL will work with this data. Quite a lot of data is available in this format, as for a number of years it was the format of choice of the NERC Earth Observation Data Centre (NEODC) in the UK (see their documentation on the format). Each of these three functions has detailed documentation, in PDF format, available here.
As well as these functions, there are also a few utility functions for checking whether ENVI is running, loading files into ENVI without ENVI ‘taking over’ the IDL variable, and displaying images with default min-max scaling. These aren’t documented so well, but should be fairly self-explanatory.
NASA image ID AS17-148-22727 is famous. Although you may not recognise the number, you will almost certainly recognise the image:
This was taken by NASA Apollo astronauts on the 7th December 1972, while the Apollo 17 mission was on its way to the moon. It has become one of the most famous photographs ever taken, and has been widely credited as providing an important contribution to the ‘green revolution’, which was rapidly growing in the early 1970s.
It wasn’t, in fact, the first image of the whole Earth to be taken from space – the first images were taken by the ATS-III satellite in 1967, but limitations in satellite imaging at the time meant that their quality was significantly lower:
The Apollo astronauts had the benefit of a high-quality Hasselblad film camera to take their photographs – hence the significantly higher quality.
Part of the reason that this image has become so famous is that there haven’t been any more like it – Apollo 17 was the last manned lunar mission, and we haven’t sent humans that far away from the Earth since. NASA has released a number of mosaics of satellite images from satellites such as MODIS and Landsat and called these ‘Blue Marble’ images – but they’re not quite the same thing!
Anyway, fast-forward to 2015, when NASA launched DSCOVR, the Deep Space Climate Observatory satellite (after a long political battle with climate-change-denying politicians in the United States). Rather than orbiting relatively close to the Earth (around 700km for polar-orbiting satellites, and 35,000km for geostationary satellites), DSCOVR sits at the ‘Earth-Sun L1 Lagrangian point’, around 1.5 million km from Earth!
At this distance, it can do a lot of exciting things – such as monitoring the solar wind and coronal mass ejections – as it has a continuous view of both the Sun and the Earth. From our point of view here, the Earth part of this is the most interesting – as it will constantly view the sunlit side of the Earth, allowing it to take ‘Blue Marble’ images multiple times a day.
The instrument that acquires these images is called the Earth Polychromatic Imaging Camera (EPIC, hence my little pun in the title), and it takes many sequential Blue Marble images every day. At least a dozen of these images each day are made available on the DSCOVR:EPIC website – within 36 hours of being taken. As well as providing beautiful images (with multiple images per day meaning that it’s almost certain that one of the images will cover the part of the Earth where you live!), EPIC can also be used for proper remote-sensing science – allowing scientists to monitor vegetation health, aerosol effects and more at a very broad spatial scale (but with a very high frequency).
So, in forty years we have moved from a single photograph, taken by a human risking his life around 45,000km from Earth, to multiple daily photographs taken by a satellite orbiting at 1.5 million km from Earth – and providing useful scientific data at the same time (other instruments on DSCOVR will provide warnings of ‘solar storms’ to allow systems on Earth to be protected before the ‘storm’ arrives – but that’s not my area of expertise).
So, click this link and go and look at what the Earth looked like yesterday, from 1.5 million km away, and marvel at the beautiful planet we live on.