I've been doing a bit of freelancing 'on the side' for a while - but now I've made it official: I am available for freelance work. Please look at my new website or contact me if you're interested in what I can do for you, or carry on reading for more details.
Since I stopped working as an academic, and took time out to focus on my own work and look after my new baby, I've been trying to find something which allows me to fit my work nicely around the rest of my life. I've done some short part-time contracts and various bits of freelance work - and I've now decided that freelancing is the way forward.
I've created a new freelance website which explains what I do and the experience I have - but to summarise here, my areas of focus are:
Remote Sensing - I am an expert at processing satellite and aerial imagery, and have processed time-series of thousands of images for a range of clients. I can help you produce useful information from raw satellite data, and am particularly experienced at atmospheric remote sensing and atmospheric correction.
GIS - I can process geographic data from a huge range of sources into a coherent data library, perform analyses and produce outputs in the form of static maps, webmaps and reports.
Data science - I have experience processing terabytes of data to produce insights which were used directly by the United Nations, and I can apply the same skills to processing your data: whether it is a single questionnaire or a huge automatically-generated dataset. I am particularly experienced at making research reproducible and self-documenting.
Python - I am an experienced Python programmer, and maintain a number of open-source modules (such as Py6S). I produce well-written, Pythonic code with high-quality tests and documentation.
The testimonials on my website show how much previous clients have valued the work I've done for them.
I've heard from various people that they were rather put off by the nature of the auction that I ran for a day's work from me - so if you were interested in working with me but wanted a more standard sort of contract, and more than a day's work, then please get in touch and we can discuss how we could work together.
(I'm aware that the last few posts on the blog have been focused on the auction for work, and this announcement of freelance work. Don't worry - I've got some more posts lined up which are more along my usual lines. Stay tuned for posts on Leaflet webmaps and machine learning of large raster stacks)
Just a quick reminder that you've only got until next Tuesday to bid for a day's work from me - so get bidding here.
The full details and rules are available in my previous post, but basically I'll do a day's work for the highest bidder in this auction - working on coding, data science, GIS/remote sensing, teaching...pretty much anything in my areas of expertise. This could be a great way to get some work from me for a very reasonable price - so please have a look, and share with anyone else who you think might be interested.
Summary: I will do a day's work for the highest bidder in this auction. This could mean you get a day's work from me very cheaply. Please read all of this post carefully, and then submit your bid here before 5th Feb.
This experiment is based very heavily on David MacIver's experiment in auctioning off a day's work (see his blog posts introducing it, and summarising the results). It seemed to work fairly well for him, and I am interested to see how it will work for me.
So, if you win this auction, I will do one day (8 hours) of work for you, on a project of your choosing. If you've been following this blog then you'll have a reasonable idea of what sort of things I can do - but to jog your memory, here are some ideas:
Working on an open-source project: I could work to add features to, fix bugs in, or document, an open-source project of mine - probably either Py6S or recipy.
Pair programming: I could work with you to write some code - it could be code to do pretty-much anything, but I'm most experienced in data science, geographical data processing, computer vision, remote sensing/GIS and similar areas.
Programming: I could write some code by myself, to do a reasonably-simple task of your choosing, providing the well-documented code for you to use or develop further. As above, this could be anything, but would work best if it were in my areas of expertise.
Data science: I could do some analysis of a reasonably simple dataset for you, providing well-documented code to allow you to extend the analysis.
GIS/Remote Sensing: I could perform some remote sensing/GIS analysis on a dataset, potentially producing well-designed maps as outputs.
Teaching: I could work with you, online or in person, to help you understand a topic with which I am familiar - for example, Python programming, data science, computer science, remote sensing, GIS and so on.
Review & comments: I could review and give comments on documents in my areas of expertise, for example, a draft paper, chapter of a thesis, or similar.
These are just a few ideas of things I could do - I am happy to do most things, although I will let you know if I think that I do not have the required expertise to do what you are requesting.
The bid is only for me to work for 8 hours, so I strongly suggest either a short self-contained project, or something that can be stopped at any point and still be useful. If you want me to continue working past 8 hours then I would be happy to negotiate some further work - but this would be entirely outside of the bidding process.
The 8 hours work will likely be split over multiple days: due to my health I find working for 8 hours straight to be very difficult, so I will probably do the work in two or three chunks. I am happy to do the work entirely independently, or to work in close collaboration with you.
If I produce something tangible as part of this work (eg. some code, some documentation) then I will give you the rights to do whatever you wish with these (the only exception being work on my open-source projects, for which I will require you to agree to release the work under the same open-source license as the rest of the project).
Following David's lead, the auction will be a Vickrey Auction, where all bids are secret, and the highest bidder wins but pays the second highest bidder's bid. This means that the mathematically best amount to bid is exactly the amount you are willing to pay for my time.
If there is only one bidder, then you will get a day of my work and pay nothing for it.
If there is a tie for top place then I will pick the work I most want to do, and charge the highest bid.
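For the programmers reading this, the winner-and-price logic above can be sketched in a few lines of Python (the function name and the tie-breaking shortcut here are my own illustration - in a real tie I'd pick the project I most want to do, not just the first bidder):

```python
def resolve_auction(bids):
    """bids: dict mapping bidder name to bid amount.

    Vickrey rules: the highest bidder wins but pays the second-highest
    bid; a lone bidder pays nothing; a tie for top place pays the top bid.
    """
    if not bids:
        return None, 0.0
    # Sort bidders from highest to lowest bid
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, top_bid = ranked[0]
    if len(ranked) == 1:
        return winner, 0.0  # only one bidder: a day's work for free
    second_bid = ranked[1][1]
    # On a tie for top place the winner pays the (equal) top bid;
    # otherwise they pay the second-highest bid
    price = top_bid if second_bid == top_bid else second_bid
    return winner, price

print(resolve_auction({"alice": 120, "bob": 80}))  # ('alice', 80)
print(resolve_auction({"carol": 50}))              # ('carol', 0.0)
```

This is also a nice way to see why bidding your true valuation is the best strategy: your bid only determines whether you win, never what you pay.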
The auction closes at 23:59 UTC on the 5th February 2019. Bids submitted after that time will be invalid.
The day of work must be claimed by the end of March 2019. I will contact the winner to arrange dates and times. I will send an invoice after the work is completed, and this must be paid within 30 days.
If your company wants to bid then I am happy to invoice them after the work is complete and, within reason, jump through the necessary hoops to get the invoice paid.
If you wish me to work in-person then I will invoice you for travel costs on top of the bid payment. Work can only be carried out in a wheelchair accessible building, and in general I would prefer remote work.
If you ask me to do something illegal, unethical, or just something that I firmly do not want to do, then I will delete your bid. If you would have been one of the top bidders then I will inform you of this.
After the auction is over, and the work has been completed, I will post on this blog a summary of the bids received, the winning bid and so on.
To go ahead and submit your bid, please fill in the form here.
The quick summary of this post is: I give talks. You might like them. Here are some details of talks I've done. Feel free to invite me to speak to your group - contact me at [email protected]. Read on for more details.
I enjoy giving talks on a variety of subjects to a range of groups. I've mentioned some of my programming talks on my blog before, but I haven't mentioned anything about my other talks so far. I've spoken at amateur science groups (Cafe Scientifique or U3A science groups and similar), programming conferences (EuroSciPy, PyCon UK etc), schools (mostly to sixth form students), unconferences (including short talks made up on the day) and at academic conferences.
Feedback from audiences has been very good. I've won the 'best talk' prize at a number of events including the Computational Modelling Group at the University of Southampton, the Student Conference on Complexity Science, and EuroSciPy. A local science group recently wrote:
"The presentation that Dr Robin Wilson gave on Complex systems in the world around us to our Science group was excellent. The clever animated video clips, accompanied by a clear vocal description gave an easily understood picture of the underlining principles involved. The wide range of topics taken from situations familiar to everyone made the examples pertinent to all present and maintained their interest throughout. A thoroughly enjoyable and thought provoking talk."
A list of talks I've done, with a brief summary for each talk, is at the end of this post. I would be happy to present any of these talks at your event - whether that is a science group, a school Geography class, a programming meet-up or something else appropriate. Just get in touch on [email protected].
All of these are illustrated with lots of images and videos - and one even has live demonstrations of complex system models. They're designed for people with an interest in science, but they don't assume any specific knowledge - everything you need is covered from the ground up.
Monitoring the environment from space
Hundreds of satellites orbit the Earth every day, collecting data that is used for monitoring almost all aspects of the environment. This talk will introduce you to the world of satellite imaging, take you beyond the ‘pretty pictures’ to the scientific data behind them, and show you how the data can be applied to monitor plant growth, air pollution and more.
From segregation to sand dunes: complex systems in the world around us
‘Complex’ systems are all around us, and are often difficult to understand and control. In this talk you will be introduced to a range of complex systems – including segregation in cities, sand dune development, traffic jams, weather forecasting, the cold war and more – and I will show how looking at these systems in a decentralised way can be useful in understanding and controlling them.

I'm also working on a talk for a local science and technology group on railway signalling, which should be fascinating. I'm happy to come up with new talks in areas that I know a lot about - just ask.
These are illustrated with code examples, and can be made suitable for a range of events including local programming meet-ups, conferences, keynotes, schools and more.
Writing Python to process millions of rows of mobile data - in a weekend
In April 2015 there was a devastating earthquake in Nepal, killing thousands and displacing hundreds of thousands more. Robin Wilson was working for the Flowminder Foundation at the time, and was given the task of processing millions of rows of mobile phone call records to try and extract useful information on population displacement due to the disaster. The aid agencies wanted this information as quickly as possible – so he was given the unenviable task of trying to produce preliminary outputs in one bank-holiday weekend… This talk is the story of how he wrote code in Python to do this, and what can be learnt from his experience. Along the way he’ll show how Python enables rapid development, introduce some lesser-used built-in data structures, explain how strings and dictionaries work, and show a slightly different approach to data processing.
xarray: the power of pandas for multidimensional arrays
"I wish there was a way to easily manipulate this huge multi-dimensional array in Python...", I thought, as I stared at a huge chunk of satellite data on my laptop. The data was from a satellite measuring air quality - and I wanted to slice and dice the data in some supposedly simple ways. Using pure numpy was just such a pain. What I wished for was something like pandas - with datetime indexes, fancy ways of selecting subsets, group-by operations and so on - but something that would work with my huge multi-dimensional array.
The solution: xarray - a wonderful library which provides the power of pandas for multi-dimensional data. In this talk I will introduce the xarray library by showing how just a few lines of code can answer questions about my data that would take a lot of complex code to answer with pure numpy - questions like 'What is the average air quality in March?', 'What is the time series of air quality in Southampton?' and 'What is the seasonal average air quality for each census output area?'.
After demonstrating how these questions can be answered easily with xarray, I will introduce the fundamental xarray data types, and show how indexes can be added to raw arrays to fully utilise the power of xarray. I will discuss how to get data in and out of xarray, and how xarray can use dask for high-performance data processing on multiple cores, or distributed across multiple machines. Finally I will leave you with a taster of some of the advanced features of xarray - including seamless access to data via the internet using OpenDAP, complex apply functions, and xarray extension libraries.
recipy: effortless provenance in Python
Imagine the situation: You’ve written some wonderful Python code which produces a beautiful output: a graph, some wonderful data, a lovely musical composition, or whatever. You save that output, naturally enough, as awesome_output.png. You run the code a couple of times, each time making minor modifications. You come back to it the next week/month/year. Do you know how you created that output? What input data? What version of your code? If you’re anything like me then the answer will often, frustratingly, be “no”.
This talk will introduce recipy, a Python module that will save you from this situation! With the addition of a single line of code to the top of your Python files, recipy will log each run of your code to a database, keeping track of all of your input files, output files and the code that was used - as well as a lot of other useful information. You can then query this easily and find out exactly how that output was created.
In this talk you will hear how to install and use recipy and how it will help you, how it hooks into Python and how you can help with further development.
Decentralised systems, complexity theory, self-organisation and more
This talk/lesson is very similar to my complex systems talk described above, but is altered to make it more suitable for use in schools. So far I have run this as a lesson in the International Baccalaureate Theory of Knowledge (TOK) course, but it would also be suitable for A-Level students studying a wide range of subjects.
GIS/Remote sensing for geographers
I've run a number of lessons for sixth form geographers introducing them to the basics of GIS and remote sensing. These topics are often included in the curriculum for A-Level or equivalent qualifications, but it's often difficult to teach them without help from outside experts. In this lesson I provide an easily-understood introduction to GIS and remote sensing, taking the students from no knowledge at all to a basic understanding of the methods involved, and then run a discussion session looking at potential uses of GIS/RS in topics they have recently covered. This discussion session really helps the content stick in their minds and relates it to the rest of their course.
As an experienced programmer, and someone with formal computer science education, I have provided input to a range of computing lessons at sixth-form level. This has included short talks and part-lessons covering various programming topics, including examples of 'programming in the real world' and discussions on structuring code for larger projects. Recently I have provided one-on-one support to A-Level students on their coursework projects, including guidance on code structure, object-oriented design, documentation and GUI/backend interfaces.
A while back a friend on Twitter pointed me towards a question on the GIS StackExchange site about the 6S model, asking if "that was the thing you wrote". I didn't write the 6S model (Eric Vermote and colleagues did that), but I did write a fairly well-used Python interface to the 6S model, so I know a fair amount about it.
The question was about atmospherically correcting radiance values using 6S. When you configure the atmospheric correction mode in 6S you give it a radiance value measured at the sensor, and it outputs an atmospherically-corrected reflectance value. Simple. However, it also outputs three coefficients: xa, xb and xc, which can be used to atmospherically correct other at-sensor radiance values. These coefficients are used in the following formulae, given in the 6S output:

y = xa * (measured radiance) - xb
acr = y / (1 + xc * y)

where acr is the atmospherically-corrected reflectance.
The person asking the question had found that when he used the formula to correct the same radiance that he had corrected using 6S itself, he got a different answer. In his case, the result from 6S itself was 0.02862, but when he ran his at-sensor radiance through the formula he got a different answer: 0.02879, a difference of 0.6%.
I was intrigued by this question, as I've used 6S for a long time and never noticed this before...strangely, I'd never thought to check! The rest of this post is basically a copy of my answer on the StackExchange site, but with a few bits of extra explanation.
I thought originally that it might be an issue with the parameterisation of 6S - but I tried a few different parameterisations myself and came up with the same issue - I was getting a slightly different atmospherically-corrected reflectance when putting the coefficients through the formula, compared to the reflectance that was output by the 6S model directly.
The 6S manual is very detailed, but somehow never seems to answer the questions that I have - for example, it doesn't explain anywhere how the three coefficients are calculated. It does, however, have an example output file which includes the atmospheric correction results (see the final page of Part 1 of the manual). This includes the following outputs:
********************************************************************************
 atmospheric correction result
 -----------------------------
 input apparent reflectance            : 0.100
 measured radiance [w/m2/sr/mic]       : 38.529
 atmospherically corrected reflectance
   Lambertian case : 0.22180
   BRDF case       : 0.22180
 coefficients xa xb xc : 0.00685  0.03885  0.06835
 y=xa*(measured radiance)-xb;  acr=y/(1.+xc*y)
********************************************************************************
If you work through the calculation using the formula given you find that the result of the calculation doesn't match the 6S output. Let me say that again: in the example provided by the 6S authors, the model output and formula don't match! I couldn't quite believe this...
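You can check this for yourself with a couple of lines of Python, using the coefficients and measured radiance from the manual's example output above:

```python
# Coefficients and at-sensor radiance from the example output
# on the final page of Part 1 of the 6S manual
xa, xb, xc = 0.00685, 0.03885, 0.06835
radiance = 38.529

# Apply the formula exactly as printed in the 6S output
y = xa * radiance - xb
acr = y / (1.0 + xc * y)

print("%.5f" % acr)  # 0.22166 - but 6S itself reports 0.22180
```

The formula gives roughly 0.22166, while the model output in the same file is 0.22180: a small but real discrepancy.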
So, I wondered if the formula was some sort of simple curve fitting to a few outputs from 6S, and would therefore be expected to have a small error compared to the actual model outputs. As mentioned earlier, the manual explains a lot of things in a huge amount of detail, but is completely silent on the calculation of these coefficients. Luckily the 6S source code is available to download. Less conveniently, the source code is written in Fortran 77!
I am by no means an expert in Fortran 77 (in fact, I've never written any Fortran code in real life), but I've had a dig into the code to try and find out how the coefficients are calculated.
If you want to follow along, the code to calculate the coefficients starts at line 3382 of main.f. The actual coefficients are set in lines 3393-3397:
(strangely xb is set twice, to the same value, and another coefficient xap is set, which never seems to be used - I have no idea why!).
It's fairly obvious from this code that there is no complicated curve fitting algorithm used - the coefficients are simply algebraic manipulations of other variables used in the model. For example, xc is set to the value of the variable sast, which, through a bit of detective work, turns out to be the total spherical albedo (see line 3354). You can check this in the 6S output: the value of xc is always the same as the total spherical albedo which is shown a few lines further up in the output file. Similarly xb is calculated based on various variables including tgasm, which is the total global gas transmittance and sdtott, which is the total downward scattering transmittance, and so on. (These variables are difficult to decode, because Fortran 77 has a limit of six characters for variable names, so they aren't very descriptive!).
I was stumped at this point, until I thought about numerical precision. I realised that the xa coefficient has a number of zeros after the decimal point, and wondered if there might not be enough significant figures to produce an accurate output when using the formula. It turned out this was the case, but I'll go through how I altered the 6S code to test this.
Line 3439 of main.f is responsible for writing the coefficients to the file. It consists of:
This tells Fortran to write the output to the file/output stream iwr using the format code specified at line 944, and write the three variables xa, xb and xc. Looking at line 944 (that is, the line given a Fortran line number of 944, which is actually line 3772 in the file...just to keep you on your toes!) we see:
  944 format(1h*,6x,40h coefficients xa xb xc                 :,
 s    3(f8.5,1x),
 s    ' y=xa*(measured radiance)-xb; acr=y/(1.+xc*y)',
This rather complicated line explains how to format the output. The key bit is 3(f8.5,1x) which tells Fortran to write a floating point number (f) with a maximum width of 8 characters, and 5 decimal places (8.5) followed by a space (1x), and to repeat that three times (the 3(...)). We can alter this to print out more decimal places - for example, I changed it to 3(f10.8,1x), which gives us 8 decimal places. If we do this, then we find that the output runs into the *'s that are at the end of each line, so we need to alter a bit of the rest of the line to reduce the number of spaces after the text coefficients xa xb xc. The final, working line looks like this:
  944 format(1h*,6x,35h coefficients xa xb xc            :,
 s    3(f10.8,1x),
 s    ' y=xa*(measured radiance)-xb; acr=y/(1.+xc*y)',
If you alter this line in main.f and recompile 6S, you will see that your output looks like this:
********************************************************************************
 atmospheric correction result
 -----------------------------
 input apparent reflectance            : 0.485
 measured radiance [w/m2/sr/mic]       : 240.000
 atmospherically corrected reflectance
   Lambertian case : 0.45439
   BRDF case       : 0.45439
 coefficients xa xb xc : 0.00297362  0.20291930  0.24282509
 y=xa*(measured radiance)-xb;  acr=y/(1.+xc*y)
********************************************************************************
If you then apply the formula you will find that the output of the formula, and the output of the model match - at least, to the number of decimal places of the model output.
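Using the full-precision coefficients from the modified output above, it's easy to demonstrate how much the default 5-decimal-place rounding costs:

```python
radiance = 240.0

def acr(radiance, xa, xb, xc):
    # The correction formula printed in the 6S output
    y = xa * radiance - xb
    return y / (1.0 + xc * y)

# Full-precision coefficients from the recompiled 6S output
xa, xb, xc = 0.00297362, 0.20291930, 0.24282509
full = acr(radiance, xa, xb, xc)

# The same coefficients rounded to the 5 decimal places that
# 6S prints by default
truncated = acr(radiance, round(xa, 5), round(xb, 5), round(xc, 5))

model = 0.45439  # the reflectance reported by 6S itself
print(abs(full - model))       # tiny: agrees to ~5 decimal places
print(abs(truncated - model))  # far larger: the rounding is to blame
```

With the truncated coefficients the error is around 0.15%, very close to the difference I saw from the unmodified 6S code; with the full-precision coefficients it essentially disappears.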
In my tests of this, I got the following for the original 6S code:
Perc Diff: 0.1507718536%
(the percentage difference I was getting was smaller than the questioner found - but that will just depend on the parameterisation used)
and this for my altered 6S code:
Perc Diff: -0.0009364659%
A lot better!
For reference, to investigate this I used Py6S, the Python interface to the 6S model that I wrote. I used the following functions to calculate the result of the formula from a Py6S SixS object, and to work out the percentage difference between the formula and the model output:
def calc_acr(radiance, xa, xb, xc):
    y = xa * radiance - xb
    acr = y / (1.0 + xc * y)
    return acr

def calc_acr_from_obj(radiance, s):
    return calc_acr(radiance, s.outputs.coef_xa, s.outputs.coef_xb, s.outputs.coef_xc)

def difference_between_formula_and_model(s):
    formula = calc_acr_from_obj(s.outputs.measured_radiance, s)
    model = s.outputs.atmos_corrected_reflectance_lambertian
    diff = model - formula
    perc_diff = (diff / model) * 100
    print("Model: %.10f" % model)
    print("Formula: %.10f" % formula)
    print("Perc Diff: %.10f%%" % perc_diff)
and my example errors above came from running Py6S using the following parameterisation:
Just as a slight addendum, if you're atmospherically-correcting Sentinel-2 data with 6S then you might want to consider using ARCSI - an atmospheric correction tool that uses Py6S internally, but does a lot of the hard work for you. The best way to learn ARCSI is with their tutorial document.
As I mentioned in the previous post, I attended - and spoke at - PyCon UK 2018 in Cardiff. Last time I provided a link to my talk on xarray - this time I want to provide some general thoughts on the conference, some suggested talks to watch, and a particular comment on the creche/childcare that was available.
In summary: I really enjoyed my time at PyCon UK and I would strongly suggest you attend. Interestingly, for the first time I think I got more out of some of the informal activities than out of the talks themselves - people always say that the 'hallway track' is one of the best bits of a conference, but I'd never really found that before.
So, what bits did I particularly enjoy?
Of the many talks that I attended, I'd particularly recommend watching the videos of:
There were two other things that went on that were very interesting. One was a 'bot competition' run by Peter Inglesby, where you had to write Python bots to play Connect 4 against each other. I didn't have the time (or energy!) to write a bot, but I enjoyed looking at the code of the various bots that won at the end - some very clever techniques in there! Some of the details of the bots are described in this presentation at the end of the conference.
On the final day of the conference, people traditionally take part in 'sprints' - working on a whole range of Python projects. However, this year there was another activity taking place during the sprints day: a set of 'Lean Coffee' discussions run by David MacIver. I won't go into the way this worked in detail, as David has written a post all about it, but I found it a very satisfying way to finish the conference. We had discussions about a whole range of issues - including the best talks at the conference, how to encourage new speakers, testing methods for Python code, other good conferences, how to get the most out of the 'hallway track' and lots more. Because of the way the 'Lean Coffee' works, each discussion is time-bound, and only occurs if the majority of the people around the table are interested in it - so it felt far more efficient than most group discussions I've been in. I left wanting to run some Lean Coffee sessions myself sometime (and, while writing this, am kicking myself for not suggesting it at a local unconference I went to last week!). I may also have volunteered myself to run some more sessions like it during the main conference next year - wait to hear more on that front.
My wife and I wouldn't have been able to attend PyCon UK without their childcare offer. The childcare is described on the conference website, but there isn't a huge amount of detail. My aim in this section is to provide a bit more real-world information on how it actually worked and what it was like - along with some cute photos.
So, having said we wanted to use the creche when we booked our tickets, we got an email a few days before the conference asking us to provide our child's name, age and any special requirements. We turned up on the first day at about 8:45 (the first session started at 9:00), not really sure what to expect, and found a room for the creche just outside of the main hall (the Assembly Room). It was a fairly small room, but that didn't matter as there weren't that many children.
Inside there were two nursery staff, from Brecon Mobile Childcare. They specialise in doing childcare at conferences, parties, weddings and so on - so they were used to looking after children that they didn't know very well. They introduced themselves to us, and to our son, and got us to fill in a form with our details and his details, including emergency contact details for us. We talked a little about his routine and when he tends to nap, snack and so on, and then we kissed him goodbye and left. They assured us that if he got really upset and they couldn't settle him (because they didn't know him very well) then they'd call our mobiles and we could come and calm him down. We could then go off and enjoy the conference - and, in fact, the staff suggested that we shouldn't come visiting during the breaks as that was likely to just upset him as he'd have to say goodbye to Mummy and Daddy multiple times.
I think there were something like 5 children there on the first day, ranging in age from about six months to ten years. The room had a variety of toys in it suitable for various different ages (including colouring and board games for the older ones, and soft toys and play mats for the younger ones), plus a small TV showing some children's TV programmes (Teletubbies was on when we came in).
We came back at lunchtime and found that he'd had a good time. He cried a little when we left, but stopped in about a minute, and the staff engaged him with some of the toys. He'd had a short nap in his pram (we left that with them in the room) and had a few of his snacks. We collected him for lunch and took him down to the main lunch hall to get some food.
PyCon UK make it very clear that children are welcomed in all parts of the conference venue, and no-one looked at us strangely for having a child with us at lunchtime. Various other attendees engaged with our son nicely, and we soon had him sitting on a seat and eating some of the food provided. Those with younger children should note that there wasn't any special food provided for children: our son was nearly 18 months old, so he could just eat the same as us, but younger children may need food brought specially for them. There also weren't any high chairs around, which could have been useful - but our son managed fairly well sitting on a chair and then on the floor, and didn't make too much mess.
After eating lunch we took him for a walk in his pram around the park outside the venue, with the aim of getting him to sleep. We didn't manage to get him to sleep, but he did get some fresh air. We then took him up to the creche room again and said goodbye, and left him to have fun playing with the staff for the afternoon.
We were keen to go to the lightning talks that afternoon, so went to the main hall at 5:30pm in time for them. Part-way through the talks, when popping to the toilet, we found one of the creche staff outside the main hall with our son. It turned out that the creche only continued until 5:30, not until 6:30 when the conference actually finished. We were a little surprised by this (and gave feedback to the organisers saying that the creche should finish when the main conference finishes), but it didn't actually cause us much problem. We'd been told that children are welcome in any of the talks - and the lightning talks are more informal than most of the talks - so we brought him into the main hall and played with him at the back.
He enjoyed wandering around with his Mummy's conference badge around his neck, and kept walking up and down the aisle smiling at people. Occasionally he got a bit too near the front, and we were asked very nicely by one of the organisers the next day to try and keep him out of the main eye-line of the speakers as it can be a bit distracting for them, but we were assured that they were more than happy to have him in the room. He even did some of his climbing over Mummy games at the back, and then breastfed for a bit, and no-one minded at all.
The rest of the days were much like the first, except that there were fewer children in the creche, and therefore only one member of staff. For most of the days there were just two children: our son, and a ten-year-old girl. On the last day (the sprints day) there was just our son, Julian. On some of these days the staff member was able to take Julian out for a walk in his pram, which was nice, and got him a bit of fresh air.
So, that's pretty much all there is to say about the creche. It worked very well, and it allowed both my wife and me to attend - something which isn't possible with most conferences. We were happy to leave our son with the staff, and he seemed to have a nice time. We'll definitely use the creche again!
Last week I attended PyCon UK 2018 in Cardiff, and had a great time. I'm going to write a few posts about this conference - and this first one is focused on my talk.
I spoke in the 'PyData' track, with a talk entitled 'XArray: the power of pandas for multidimensional arrays'. PyCon UK always do a great job of getting the videos up online very quickly, so you can watch the video of my talk below:
The slides for my talk are available here and a Github repository with the notebook which was used to create the slides here.
I think the talk went fairly well, although I found my positioning a bit awkward as I was trying to keep out of the way of the projector, while also being in range of the microphone, and trying to use my pointer to point out specific parts of the screen.
Feedback was generally good, with some useful questions afterwards, and a number of positive comments from people throughout the rest of the conference. One person emailed me to say that my talk was "the highlight of the conference" for him - which was very pleasing. My tweet with a link to the video of my talk also got a number of retweets, including from the PyData and NumFOCUS accounts, which got it quite a few views.
In the interests of full transparency, I have posted online the full talk proposal that I submitted, as this may be helpful to others trying to come up with PyCon talk proposals.
Next up in my PyCon UK series of posts: a general review of the conference.
During the Nepal earthquake response project I worked on, we were gradually getting access to historical mobile phone data for use in our analyses. I wanted to keep track of which days of data we had got access to, and which ones we were still waiting for.
I wrote a simple script to print out a list of days that we had data for - but that isn't very easy to interpret. Far easier would be a calendar with the relevant days highlighted. I thought this would be very difficult to generate - but then I found the pcal utility, which makes it easy to produce something like this:

I'm not going to go into huge detail here, as the pcal man page is very comprehensive - and pcal can do far more than I show here. However, to create an output like the one shown above you'll need to put together a list of dates in a text file. Here's what my dates.txt file looks like:
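(My actual file isn't reproduced here, but as an illustrative sketch based on the pcal man page, a date file can look something like the following - the dates and annotations are made up. An asterisk after a date marks that day as a 'holiday', which pcal highlights on the calendar:)

```
year 2015
05/12* data received
05/13* data received
05/15* data received
```

You can then generate a PostScript calendar for a given month with something like `pcal -f dates.txt 5 2015 > calendar.ps`.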
Back in 2012, I wrote the following editorial for SENSED, the magazine of the Remote Sensing and Photogrammetry Society. I found it recently while looking through back issues, and thought it deserved a wider audience, as it is still very relevant. I've made a few updates to the text, but it is mostly as published.
In this editorial, I'd like to delve a bit deeper into our subject, and talk about the assumptions that we all make when doing our work.
In a paper written almost twenty years ago, Duggin and Robinove produced a list of assumptions which they thought were implicit in most remote sensing analyses. These were:
1. There is a very high degree of correlation between the surface attributes of interest, the optical properties of the surface, and the data in the image.
2. The radiometric calibration of the sensor is known for each pixel.
3. The atmosphere does not affect the correlation (see 1 above), or the atmospheric correction perfectly corrects for this.
4. The sensor spatial response characteristics are accurately known at the time of image acquisition.
5. The sensor spectral response and calibration characteristics are accurately known at the time of image acquisition.
6. Image acquisition conditions were adequate to provide good radiometric contrast between the features of interest and the background.
7. The scale of the image is appropriate to detect and quantify the features of interest.
8. The correlation (see 1 above) is invariant across the image.
9. The analytical methods used are appropriate and adequate to the task.
10. The imagery is analysed at the appropriate scale.
11. There is a method of verifying the accuracy with which ground attributes have been determined, and this method is uniformly sensitive across the image.
I firmly believe that now is a very important time to start examining this list more closely. We are in an era when products are being produced routinely from satellites: end-user products such as land-cover maps, but also products designed to be used by the remote sensing community, such as atmospherically-corrected surface reflectance products. Similarly, GUI-based 'one-click' software is being produced which purports to perform very complicated processing, such as atmospheric correction or vegetation canopy modelling, very easily.
My question to you, as scientists and practitioners in the field, is: have you stopped to examine the assumptions underlying the products you use? And even if you're not using products such as those above, have you looked at your own analysis to see whether it really stands up to a scrutiny of its assumptions?
I suspect the answer is no - it certainly was for me until recently. There is a great temptation to use satellite-derived products without really looking into how they are produced and the assumptions that may have been made in their production process (seriously, read the Algorithm Theoretical Basis Document!). Ask yourself, are those assumptions valid for your particular use of the data?
Looking at the list of assumptions above, I can see a number which are very problematic. Number 8 is one that I have struggled with myself: how do I know whether the correlation between the ground data of interest and the image data is uniform across the image? I suspect it isn't - but I'd need a lot of ground data to test it, and even then, what could I do about it? Of course, number 11 causes lots of problems for validation studies too. Numbers 4 and 5 are primarily related to the calibration of the sensors, which is normally managed by the operators themselves. We might not be able to do anything about it - but have we considered it, particularly when using older and therefore less well-calibrated data?
As a relatively young member of the field, it may seem like I'm 'teaching my grandparents to suck eggs', and I'm sure this is familiar to many of you. Those of you who have been in the field a while have probably read the paper - more recent entrants may not have done so. Regardless of experience, I think we could all do with thinking these assumptions through a bit more. So go on, have a read of the list above, maybe read the paper, and have a think about your last project: were your assumptions valid?
I'm interested in doing some more detailed work on the Duggin and Robinove paper, possibly leading to a new paper revisiting their assumptions in the modern era of remote sensing. If you're interested in collaborating with me on this then please get in touch via [email protected]
This is another entry in my 'Previously Unpublicised Code' series - explanations of code that has been sitting on my Github profile for ages, but has never been discussed publicly before. This time, I'm going to talk about BankClassify, a tool for automatically classifying transactions on bank statements into categories like Supermarket, Eating Out and Mortgage. It is an interactive command-line application that looks like this:
For each entry in your bank statement, it will guess a category, and let you correct it if necessary - learning from your corrections.
I've been using this tool for a number of years now, as I never managed to find another tool that did quite what I wanted. I wanted to have an interactive classification process where the computer guessed a category for each transaction but you could correct it if it got it wrong. I also didn't want to be restricted in what I could do with the data once I'd categorised it - I wanted a simple CSV output, so I could just analyse it using pandas. BankClassify meets all my needs.
If you want to use BankClassify as it is written at the moment then you'll need to be banking with Santander, as it can currently only import text-format data files downloaded from Santander Online Banking. However, if you've got a bit of Python programming ability (quite likely if you're reading this blog) then you can write another file import function, and use the rest of the module as-is. To get going, just look at the README in the repository.
So, how does this work? Well, it uses a Naive Bayesian classifier - a very simple machine learning tool that is often used for spam filtering (see this excellent article by Paul Graham introducing its use for spam filtering). It simply splits text into tokens (more on this later) and uses training data to calculate the probability that text containing each specific token belongs in each category. The term 'naive' refers to the simplifying, and probably incorrect, assumptions the method makes: that the features are independent of each other, that a uniform prior distribution is appropriate, and so on.
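To make the idea concrete, here's a minimal toy sketch of the technique. This is not BankClassify's actual code (which delegates all of this to textblob, as described below) - it just illustrates how token counts from past statements turn into category probabilities, using add-one smoothing:

```python
from collections import defaultdict
import math

def train(samples):
    """samples: list of (text, category) tuples, e.g. from past statements."""
    token_counts = defaultdict(lambda: defaultdict(int))  # category -> token -> count
    category_counts = defaultdict(int)                    # category -> number of samples
    for text, category in samples:
        category_counts[category] += 1
        for token in text.upper().split():
            token_counts[category][token] += 1
    return token_counts, category_counts

def classify(text, token_counts, category_counts):
    """Return the category with the highest log posterior probability."""
    total = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category, count in category_counts.items():
        score = math.log(count / total)  # log prior for this category
        denom = sum(token_counts[category].values()) + len(token_counts[category]) + 1
        for token in text.upper().split():
            # add-one smoothing, so unseen tokens don't zero out the probability
            score += math.log((token_counts[category].get(token, 0) + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

training = [("CARD PAYMENT TO TESCO STORES", "Supermarket"),
            ("CARD PAYMENT TO SHELL TOTHILL", "Petrol")]
counts, cats = train(training)
print(classify("CARD PAYMENT TO TESCO", counts, cats))  # -> Supermarket
```

Summing log probabilities rather than multiplying raw probabilities is the standard trick to avoid floating-point underflow when there are many tokens.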
Creating a Naive Bayesian classifier in Python is very easy, using the textblob package. There is a great tutorial on building a classifier using textblob here, but I'll run quickly through my code anyway:
First we load all the previous data from the aptly-named AllData.csv file, and pass it to the _get_training function to get the training data from this file in a format acceptable to textblob. This is basically a list of tuples, each of which contains (text, classification). In our case, the text is the description of the transaction from the bank statement, and the classification is the category that we want to assign it to. For example ("CARD PAYMENT TO SHELL TOTHILL,2.04 GBP, RATE 1.00/GBP ON 29-08-2013", "Petrol"). We use the _extractor function to split the text into tokens and generate 'features' from these tokens. In our case this is simply a function that splits the text by either spaces or the '/' symbol, and creates a boolean feature with the value True for each token it sees.
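A feature extractor of that kind can be sketched roughly like this - a simplified stand-in for the _extractor function described above, assuming only the split-on-spaces-or-slashes behaviour:

```python
import re

def extractor(text):
    """Split a transaction description into tokens on spaces or '/',
    and build a boolean feature dict of the kind textblob classifiers expect."""
    tokens = re.split(r"[ /]+", text)
    return {token: True for token in tokens if token}

print(extractor("RATE 1.00/GBP ON 29-08-2013"))
# -> {'RATE': True, '1.00': True, 'GBP': True, 'ON': True, '29-08-2013': True}
```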
Now we've got the classifier, we read in the new data (_read_santander_file) and the list of categories (_read_categories) and then get down to the classification (_ask_with_guess). The classification just calls the classifier.classify method, giving it the text to classify. We then do a bit of work to nicely display the list of categories (I use colorama to do nice fonts and colours in the terminal) and ask the user whether the guess is correct. If it is, then we just save the category to the output file - but if it isn't we call the classifier.update function with the correct tuple of (text, classification), which will update the probabilities used within the classifier to take account of this new information.
That's pretty much it - all of the rest of the code is just plumbing that joins this together. It shows how easy it is to produce a useful tool with a simple machine learning technique.
Just as a brief aside, you can do interesting things with the classifier object, like ask it to tell you what the most informative features are:
Most Informative Features
IN = True Cheque : nan = 6.5 : 1.0
UNIVERSITY = True Cheque : nan = 6.5 : 1.0
PAYMENT = None Cheque : nan = 6.5 : 1.0
COWHERDS = True Eating : nan = 6.5 : 1.0
CARD = None Cheque : nan = 6.5 : 1.0
CHEQUE = True Cheque : nan = 6.5 : 1.0
TICKETOFFICESALE = True Travel : nan = 6.5 : 1.0
SOUTHAMPTON = True Cheque : nan = 6.5 : 1.0
CRAFT = True Craft : nan = 4.3 : 1.0
LTD = True Craft : nan = 4.3 : 1.0
HOBBY = True Craft : nan = 4.3 : 1.0
RATE = None Cheque : nan = 2.8 : 1.0
GBP = None Cheque : nan = 2.8 : 1.0
SAINSBURYS = True Superm : nan = 2.6 : 1.0
WAITROSE = True Superm : nan = 2.6 : 1.0
Here we can see that tokens like IN, UNIVERSITY, PAYMENT and SOUTHAMPTON are highly predictive of the category Cheque (as most of my cheque pay-ins are shown in my statement as PAID IN AT SOUTHAMPTON UNIVERSITY), and that CARD not existing as a feature is also highly predictive of the category being cheque (fairly obviously). Names of supermarkets also appear there as highly predictive for the Supermarket class and TICKETOFFICESALE for Travel (as that is what is displayed on my statement for a ticket purchase at my local railway station). You can even see some of my food preferences in there, with COWHERDS being highly predictive of the Eating Out category.
So, have a look at the code on Github, and have a play with it - let me know if you do anything cool.