Scrapers
From Plings Info
Python Scraping
These are website and spreadsheet scraping scripts, written in Python, that output plings XML files. They live in a git repository; the files can be browsed on the web or downloaded using git:
git clone git://gitorious.org/plings-scrapers/plings-scrapers.git plings-scrapers
These have been released under the MIT license, to permit them to be reused as widely as possible.
Website Scrapers
The scrapers do not all work the same way. Older ones take no arguments and scrape the site directly into an XML file. Newer scrapers take a single argument, either get or scrape. get pulls the webpages into a cache directory (which you may need to create). scrape does the actual conversion of the HTML files in this cache directory to XML.
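The get/scrape split described above can be sketched roughly as follows. This is a minimal illustration, not code from the repository: the page list, the cache location, the XML element names and the function names are all assumptions, and a real scraper would parse the cached HTML rather than just record the file name.

```python
import urllib.request
from pathlib import Path
from xml.etree import ElementTree as ET

CACHE_DIR = Path("cache")      # hypothetical cache location
PAGES = {                      # hypothetical page list: cache filename -> URL
    "events.html": "http://example.com/events",
}


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def get(pages=PAGES, cache_dir=CACHE_DIR, fetcher=fetch):
    """Phase 1: pull each page into the cache directory."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)   # create the cache if missing
    for filename, url in pages.items():
        (cache_dir / filename).write_bytes(fetcher(url))


def scrape(cache_dir=CACHE_DIR, outname="out.xml"):
    """Phase 2: convert every cached HTML file into one XML file."""
    root = ET.Element("activities")                # hypothetical element names
    for path in sorted(Path(cache_dir).glob("*.html")):
        activity = ET.SubElement(root, "activity")
        # A real scraper would parse the HTML here; this only records the source.
        ET.SubElement(activity, "source").text = path.name
    ET.ElementTree(root).write(outname, encoding="utf-8", xml_declaration=True)


def main(argv):
    """Dispatch on the single command-line argument, get or scrape."""
    if argv == ["get"]:
        get()
    elif argv == ["scrape"]:
        scrape()
    else:
        return "usage: scraper.py get|scrape"
    return 0

# In a script, finish with: sys.exit(main(sys.argv[1:]))
```

Keeping the two phases separate means the conversion step can be rerun and debugged against the cached pages without hitting the website again.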
| Name | Files | Url | Status | Scraped Data | Structure | API Method | Recurred? | Cron? | Full Address? | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| B-Active | bactive.py | [1] | Live | [2] | New | XML POST | No | Yes | Yes? | Poor address data |
| BBC Blast | bbcblast.py, bbclast_in.py | [3] | Inactive | [4] | Website->CSV->API | SOAP | No | No | | |
| Bristol | bristol.py | [5] | Live | [6] | Website->Cache->API | XML POST | No | No | | |
| Bolton | bolton.py | [7] | Live | [8] | SOAP->API | XML POST | No | Yes | Yes | |
| Cheshire | cheshire.py | [9] | Live | [10] | Website->Cache->API | XML POST | No | No | | |
| Cambridgeshire | cambridgeshire.py | [11] | Pending | | New | XML POST | Yes | Yes | | |
| Coventry Active | coventry.py | [12] | Live | [13] | Website->Cache->API | XML POST | Yes | No | | |
| East Riding | east_riding.py, windmill/east_riding.py | [14] | Inactive | [15] | Website->Pagelist->Cache->API | XML POST | No | No | | Creates the pagelist using the windmill framework |
| Hackney | hackney.py | [16] | Live | [17] | Website->Cache->API | XML POST | No | No | | Contains interesting start/end time regexes |
| Hertfordshire | hertfordshire.py | [18] | Live | [19] | New | XML POST | Yes | Yes | | |
| Lancashire (see Scrapers_Lancashire) | lancashire.py, lancashire_defs.py | [20] | Live | [21] | Website->Cache->API | XML POST | No | No | | |
| Luton Museums | luton_museums.py | [22] | Inactive | | Website->API | XML POST | No | No | | |
| Luton Youth | luton_youth.py | [23] | Inactive | | Website->API | XML POST | Yes | No | | |
| Manchester | manchester.py | [24] | Live | [25] | Website->Cache->API | XML POST | Yes | No | | Contains interesting latlng2postcode function |
| National Fishing | national_fishing.py | [26] | Seasonal | [27] | Website->Cache->API | XML POST | No | No | | |
| North Yorkshire (Gimi) | north_yorkshire.py, north_yorkshire.sh | [28] | Live (little data) | [29] | Website->Cache->CSV->API | XML POST | No | Yes | No | Only about a week's data available |
| SkyRide | skyride.py | [30] | Seasonal | [31] | Website->Cache->API | XML POST | No | | | |
| Warrington | warrington.py | [32] | Live (little data) | [33] | Website->Cache->API | XML POST | No | Yes | No | Contains mapmagic function (scrape from Google map) |
| Wirral | wirral.py | [34] | Live | [35] | Website->Cache->API | XML POST | No | Yes | No | |
| Worcestershire | worcestershire.py | [36] | Live | [37] | Website->Cache->API | XML POST | No | Yes | No | |
Spreadsheet Parsers
These only make sense if you have the spreadsheets from the Local Authorities in question.
- islington.py
- islington_halfterm.py
- leicester.py
- lewisham.py
- norfolk.py
- staffordshire.py
- stockport.py
Other
NAFIS
- nafis.py
Not actually a plings scraper, but used instead to take information from [38] and place it into a spreadsheet. There is also a scraperwiki version.
Scrapers in Progress
Scrapers that have not yet put data into the live database:
- tameside.py (unfinished, poorly structured data)
- tameside-whatson.py (poor data content)
- luton.py (unfinished, poorly structured data)
- walkit.py (unfinished, poorly structured data)
- youngscotwow.py (poor time and postcode information)
External API scrapers
- culture24.py (unfinished, requires an API key to get the data we need)
Plings Input Libraries
- plings.py, plings_base.py, plings_sax.py and plings_soap.py
Classes and methods that make it easier to input data to plings from Python. They provide an Activities class which can be used to create and send plings API data.
- plings.py is the oldest, and creates an XML file using DOM methods. (This XML file can then be sent to the server using HTTP POST, for example using xml_post.py)
- plings_sax.py is similar but uses SAX methods, so is faster and therefore generally the preferred option
- plings_soap.py uses the SOAP methods
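To illustrate what "SAX methods" means for output, here is a minimal sketch that streams XML one element at a time using the standard library's XMLGenerator, instead of building a whole DOM tree in memory first. It is not the code in plings_sax.py; the function name, element names and field values are made up for the example.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_activities(out, activities):
    """Stream activities to `out` one element at a time, SAX-style."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("activities", {})         # hypothetical element names
    for fields in activities:
        gen.startElement("activity", {})
        for name, value in fields.items():
            gen.startElement(name, {})
            gen.characters(str(value))         # escapes &, < and > for us
            gen.endElement(name)
        gen.endElement("activity")
    gen.endElement("activities")
    gen.endDocument()


buffer = io.StringIO()
write_activities(buffer, [{"ActivitySourceID": "42", "Name": "Judo & Karate"}])
xml_text = buffer.getvalue()
```

Because each element is written as soon as it is known, memory use stays flat however many activities are produced, which is the usual reason a streaming writer beats a DOM writer on large outputs.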
Example of use:
    from plings_sax import Activities

    activities = Activities()
    activities.outname = "somefile.xml"

    activity = activities.create_activity()
    activity.add_field("ActivitySourceID", id)  # etc.
    activity.venue.add_field("Name", name)  # etc.
    # Starts and Ends may be added like the above, or activity.setup_recurring may be used

    # Organisation is optional
    activity.add_org()
    activity.org.add_field("Name", "Blah")  # etc.
    # ....
    activity.finish()  # For SOAP, this actually creates the activity
    # ....
    activities.finish()
All three of these implementations inherit from the base class, which also provides some helper functions:
- activity.hash(venue_name) - Strips whitespace then hashes -> useful for creating ids from venue names
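A helper of this kind might look like the sketch below. The name venue_hash, the choice of SHA-256 and the decision to remove all whitespace (rather than only leading and trailing whitespace) are assumptions for illustration; the actual method may differ on each point.

```python
import hashlib


def venue_hash(venue_name):
    """Remove whitespace from a venue name, then return a hex digest."""
    stripped = "".join(venue_name.split())     # drops spaces, tabs and newlines
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

With this approach, "Town Hall" and " Town  Hall " map to the same id, so small spacing differences between scraped pages do not create duplicate venues.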
Feel free to use these yourself, but bear in mind they are incomplete and may change dramatically.
Tools
- xml_post.py
Because plings.py only creates an XML file, xml_post.py is a small script that actually posts the data to plings. It expects a config.py file containing a dict that maps Local Authority (LA) names to API keys.
- split.py
Splits a large plings XML file into smaller files, each containing a specified number of activities. This is necessary so that the server doesn't run out of memory during import.
- picklecat.py
Display the data in a pickle file.
- api_response_csv.py
Generate a csv file of errors from the response given by the XML Post API.
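As an illustration of the splitting that split.py performs, the sketch below divides a parsed XML document into chunks with a fixed number of child elements each. It is not the script itself: it assumes every activity is a direct child of the root element, and the function names and output file naming are hypothetical.

```python
from copy import deepcopy
from xml.etree import ElementTree as ET


def split_activities(root, chunk_size):
    """Return new root elements holding at most `chunk_size` children each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    children = list(root)
    chunks = []
    for start in range(0, len(children), chunk_size):
        part = ET.Element(root.tag, dict(root.attrib))  # same tag and attributes
        part.extend([deepcopy(c) for c in children[start:start + chunk_size]])
        chunks.append(part)
    return chunks


def split_file(inname, chunk_size, prefix="part"):
    """Write each chunk to prefix_001.xml, prefix_002.xml, ... and return the names."""
    names = []
    root = ET.parse(inname).getroot()
    for number, part in enumerate(split_activities(root, chunk_size), start=1):
        name = f"{prefix}_{number:03d}.xml"
        ET.ElementTree(part).write(name, encoding="utf-8", xml_declaration=True)
        names.append(name)
    return names
```

For example, split_file("somefile.xml", 500) would write files of at most 500 activities each. This version reads the whole input into memory on the client, which is acceptable when the memory limit being worked around is on the server.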
See Also
- Spreadsheet conversion - to convert spreadsheets to the standard plings spreadsheet format.

