Web scraping for nurses, using Python

The internet is full of structured data. Learn about how to harvest and recombine it, all with the power of Python!

As I count down the days to returning to study in the new year, I'm immensely pleased to have finally achieved the goal that motivated me to learn to code in the first place! Given how important RSS is to gathering news and commentary articles easily, I've been frustrated by the recent trend for sites not to provide a feed of new articles. Even NICE discontinued their feeds, meaning if you want to be notified of updates you have to subscribe to a newsletter like it's still the late 90s!

I'd previously got around this using something called Yahoo Pipes, a visionary node-graph tool to process and recombine information from around the web. It was way ahead of its time, so naturally Yahoo killed it when their finances started looking rocky.

Remember when Yahoo were this ambitious? RIP.

So I and many others started looking around for alternatives, the standout being a cloud-based service from Kimono Labs. They made a Chrome extension that turned any website into a data API, including an RSS feed, with just a few clicks. You needed to see it in action to understand just how cool that was.

Of course Kimono Labs got bought out and the whole thing was shut down. Hnnngh...

Twice burned and missing essential sources for my newsreader, I decided enough was enough: I would learn how to do web scraping myself, because how hard could it be? Turns out not hard at all, thanks to the power of the Python programming language and the spectacular community who freely share their hard work. Which leads us to Scrapy, the leading web scraping framework for Python. It's powerful and flexible, and you should care about it as a nurse because, with a little ramp-up time, it gives you a world of data to play with and learn from.

Let's start with a simple example, which is my own use case: generating a feed of new articles from any website. Once you've installed Scrapy and its dependencies, a spider is configured something like this:

import scrapy


class BthNhsSpider(scrapy.Spider):
    # The name Scrapy uses to identify this spider, e.g. `scrapy crawl bthnhs`
    name = "bthnhs"
    # The page(s) the spider requests when it starts
    start_urls = [
        'http://www.nhs.uk/news/Pages/NewsIndex.aspx',
    ]

    def parse(self, response):
        # Each article summary on the index page sits in a <div class="pad">
        for article in response.css('div.pad'):
            # Yield one item per article, mixing XPath and CSS selectors
            yield {
                'title': article.xpath('h3/a/span/text()').extract_first(),
                'link': article.xpath('h3').css('a::attr(href)').extract_first(),
                'date': article.xpath('p[@class="date"]/text()').extract_first(),
                'description': article.xpath('p[@class="copy"]/text()').extract_first(),
            }

This is a "spider" in Scrapy parlance. This one requests the first page of articles from the fab Behind The Headlines blog on NHS Choices (they fact-check press coverage of health science). Once the page has downloaded, you can extract any information that can be identified by a CSS selector or an XPath query. If that sounds scary, it really isn't: I was familiar with CSS from basic web design, and while XPath gave me a bit of a learning curve, there are some excellent guides and tutorials, along with an interactive shell to help you play with your data. Where the code says "yield", I'm selecting the information I need to construct an RSS feed.

Elsewhere in the Scrapy configuration I've defined a "pipeline" that can process the data before saving it out to a file or database; in my case it just saves everything to an XML file that my Django install is pointed at. Then all you need is Django's built-in feed generator to turn the data into an RSS-compliant format served through your website. Rough sketches of both pieces follow.
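For the curious, here's roughly what those two pieces might look like. First, a minimal sketch of the pipeline (the file, class and module names are illustrative placeholders, not my exact code), collecting every scraped item and writing the lot to an XML file when the spider finishes:

# pipelines.py - gather items in memory, dump them to XML on spider close
from xml.etree import ElementTree as ET


class XmlExportPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: create the document root
        self.root = ET.Element('articles')

    def process_item(self, item, spider):
        # Called for every item the spider yields
        article = ET.SubElement(self.root, 'article')
        for key, value in item.items():
            ET.SubElement(article, key).text = value
        return item

    def close_spider(self, spider):
        # Called once at the end: write everything out in one go
        ET.ElementTree(self.root).write('articles.xml', encoding='utf-8')

The pipeline is switched on in the project's settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.XmlExportPipeline': 300,  # 'myproject' is a placeholder
}

And on the Django side, a rough sketch of the built-in syndication framework serving the data back out as RSS (the model and field names here are hypothetical):

# feeds.py - Django turns these items into an RSS-compliant feed
from django.contrib.syndication.views import Feed

from .models import Article  # hypothetical model holding the scraped items


class BehindTheHeadlinesFeed(Feed):
    title = "Behind the Headlines"
    link = "/feeds/bthnhs/"
    description = "New articles from NHS Choices' Behind The Headlines."

    def items(self):
        return Article.objects.order_by('-date')[:20]

    def item_title(self, item):
        return item.title

    def item_description(self, item):
        return item.description

    def item_link(self, item):
        return item.link

Hook the feed class up to a URL pattern and your newsreader can subscribe to it like any other feed.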

So I'm happy! I'm now capable of solving my own problem and no one can take it away from me again. But what might be next? Web scraping is step one in all sorts of interesting areas so I thought I'd leave you with a few examples to whet your appetite.

You're a health union comms officer and you want to build an interactive web app showing how English nurse wages are constantly falling behind other countries. No problem! NHS Digital publish regular wage data for the NHS; in the US you can get similar data from the Bureau of Labor Statistics, and so on for other countries. You'll also want a measure of purchasing power parity to account for cost-of-living differences, and perhaps some tax data too. You can scrape all that information from various internet sources on an ongoing basis, finesse it, throw it in a database and present it however you want. Hooray for web scraping! (Incidentally, with 2015 data, the gap in average nurse wages between England and the US is ~$20k on my napkin math...).
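To make the purchasing-power step concrete, here's a toy version of the calculation. Every figure below is a made-up placeholder, not a real statistic; in practice the numbers would come from your scraped sources:

# ppp_sketch.py - the PPP adjustment step, with placeholder figures
uk_salary_gbp = 26000      # hypothetical average English nurse salary
us_salary_usd = 70000      # hypothetical average US nurse salary
ppp_gbp_per_usd = 0.70     # hypothetical PPP conversion factor (GBP per USD)

# Express the UK salary in PPP-adjusted dollars so the two are comparable
uk_salary_ppp_usd = uk_salary_gbp / ppp_gbp_per_usd
wage_gap = us_salary_usd - uk_salary_ppp_usd
print("PPP-adjusted gap: ${:,.0f}".format(wage_gap))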

MonkeyLearn, a machine learning analytics service, have a great blog post about using Scrapy to train a sentiment analysis module to categorise user reviews. How about applying that to the various sources of NHS feedback, such as reviews on NHS Choices or PatientOpinion? Or go further and integrate comments from social media channels like Twitter and Facebook too! What hospital wouldn't appreciate an automated indicator of sentiment across every public source? Web scraping found some really negative feedback? Have it automatically sent to the PALS team inbox for review! (But please read the Data Protection Act before actually doing that...).
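As a taste of the sentiment step, here's a minimal sketch using NLTK's built-in VADER analyser as a stand-in for a service like MonkeyLearn; the reviews and the flagging threshold are purely illustrative:

# sentiment_sketch.py - flag strongly negative feedback for follow-up
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-off setup first: nltk.download('vader_lexicon')
analyser = SentimentIntensityAnalyzer()

reviews = [
    "The ward staff were kind and kept me informed throughout.",
    "The care was terrible and the staff were rude.",
]

for review in reviews:
    scores = analyser.polarity_scores(review)
    # 'compound' runs from -1 (very negative) to +1 (very positive)
    if scores['compound'] < -0.5:
        print("Flag for PALS review:", review)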

Okay, one more. Take the followers of a popular Twitter nursing account; @WeNurses would be a good hub. Scrape, or use the Twitter API, to iterate through its followers and note any whose profile contains the word "student" and also links to a website. Take those links and crawl them with Scrapy, trying the usual places for blog RSS feeds (WordPress, for example, normally serves one at mysite.com/feed); this step is sketched below. Now you have the world's most extensive list of student nurse blogs. You could just subscribe to the feeds and enjoy reading, but how about throwing the posts back at MonkeyLearn or Python's nltk library for keyword extraction? Suddenly you have a uniquely nuanced view of what students tend to blog about, ranked by keyword frequency.
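That feed-hunting step might look something like this rough sketch, assuming you've already collected the list of blog URLs (the paths tried are just the common conventions):

# feed_finder.py - probe the usual feed locations on a blog
import requests

COMMON_FEED_PATHS = ['/feed', '/rss', '/feed.xml', '/atom.xml']


def find_feed(site_url):
    """Return the first URL under site_url that answers with XML, or None."""
    for path in COMMON_FEED_PATHS:
        url = site_url.rstrip('/') + path
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # site down or unreachable; try the next candidate
        if response.ok and 'xml' in response.headers.get('Content-Type', ''):
            return url
    return None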

I'll keep saying it: this is not hard to implement when Python is so approachable and its libraries are this powerful. The effort I've dedicated to it amounts to just a few weeks of spare time. You're simply writing enough script to glue the pieces together and give them a purpose, and I can't think of a profession better placed than nursing to put this to use for the benefit of service users. I hope I've sparked your imagination and that you'll think about giving it a try yourself!

It always helps to have a good book to kick off with, so here's a fun guide to get you started: Web Scraping with Python: Collecting Data from the Modern Web.