Scrapers

From Plings Info

Python Scraping

These are website and spreadsheet scraping scripts, written in Python, that output plings XML files. They are kept in a git repository. The files can be browsed on the web or downloaded using git:

git clone git://gitorious.org/plings-scrapers/plings-scrapers.git plings-scrapers

These have been released under the MIT license, to permit them to be reused as widely as possible.

Website Scrapers

The scrapers do not all work the same way. Older ones take no arguments and scrape the site directly into an XML file. Newer ones take a single argument, either get or scrape: get pulls the web pages into a cache directory (which you may need to create), and scrape converts the HTML files in that cache directory to XML.
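The get/scrape split can be sketched as follows. This is a minimal illustration written for Python 3, not code from the repository: the cache directory name, the function names and the TitleParser class are all assumptions, and the scrape step only collects page titles where a real scraper would build plings XML. Unlike the scrapers described above, the sketch creates the cache directory itself.

```python
import os
import urllib.request
from html.parser import HTMLParser

CACHE_DIR = "cache"  # assumed name; the real scrapers may use a different directory


class TitleParser(HTMLParser):
    """Collects the text of the <title> element as a stand-in for real parsing."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def get(urls, cache_dir=CACHE_DIR):
    """Download each page into the cache directory, one file per URL."""
    os.makedirs(cache_dir, exist_ok=True)
    for index, url in enumerate(urls):
        with urllib.request.urlopen(url) as response:
            body = response.read()
        with open(os.path.join(cache_dir, "page%d.html" % index), "wb") as out:
            out.write(body)


def scrape(cache_dir=CACHE_DIR):
    """Parse every cached HTML file; no network access is needed here."""
    titles = []
    for name in sorted(os.listdir(cache_dir)):
        if not name.endswith(".html"):
            continue
        parser = TitleParser()
        path = os.path.join(cache_dir, name)
        with open(path, encoding="utf-8", errors="replace") as page:
            parser.feed(page.read())
        titles.append(parser.title.strip())
    return titles


def main(argv, urls):
    """Dispatch on the single command-line argument: 'get' or 'scrape'."""
    if len(argv) != 2 or argv[1] not in ("get", "scrape"):
        print("usage: %s get|scrape" % argv[0])
        return 1
    if argv[1] == "get":
        get(urls)
    else:
        for title in scrape():
            print(title)
    return 0
```

Keeping the download separate from the parsing means the conversion can be rerun and debugged against the cached pages without hitting the website again.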

Name Files Url Status Scraped Data Structure API Method Recurred? Cron? Full Address? Notes
B-Active bactive.py [1] Live [2] New XML POST No Yes Yes? Poor address data
BBC Blast bbcblast.py, bbclast_in.py [3] Inactive [4] Website->CSV->API SOAP No No
Bristol bristol.py [5] Live [6] Website->Cache->API XML POST No No
Bolton bolton.py [7] Live [8] SOAP->API XML POST No Yes Yes
Cheshire cheshire.py [9] Live [10] Website->Cache->API XML POST No No
Cambridgeshire cambridgeshire.py [11] Pending New XML POST Yes Yes
Coventry Active coventry.py [12] Live [13] Website->Cache->API XML POST Yes No
East Riding east_riding.py, windmill/east_riding.py [14] Inactive [15] Website->Pagelist->Cache->API XML POST No No Creates the pagelist using the windmill framework
Hackney hackney.py [16] Live [17] Website->Cache->API XML POST No No Contains interesting start/end time regexes
Hertfordshire hertfordshire.py [18] Live [19] New XML POST Yes Yes
Lancashire (see Scrapers_Lancashire) lancashire.py, lancashire_defs.py [20] Live [21] Website->Cache->API XML POST No No
Luton Museums luton_museums.py [22] Inactive Website->API XML POST No No
Luton Youth luton_youth.py [23] Inactive Website->API XML POST Yes No
Manchester manchester.py [24] Live [25] Website->Cache->API XML POST Yes No Contains interesting latlng2postcode function
National Fishing national_fishing.py [26] Seasonal [27] Website->Cache->API XML POST No No
North Yorkshire (Gimi) north_yorkshire.py, north_yorkshire.sh [28] Live (Little data) [29] Website->Cache->CSV->API XML POST No Yes No Only about a week's data available
SkyRide skyride.py [30] Seasonal [31] Website->Cache->API XML POST No
Warrington warrington.py [32] Live (Little Data) [33] Website->Cache->API XML POST No Yes No Contains mapmagic function (scrape from google map)
Wirral wirral.py [34] Live [35] Website->Cache->API XML POST No Yes No
Worcestershire worcestershire.py [36] Live [37] Website->Cache->API XML POST No Yes No

Spreadsheet Parsers

These only make sense if you have the spreadsheets from the Local Authorities in question.

  • islington.py
  • islington_halfterm.py
  • leicester.py
  • lewisham.py
  • norfolk.py
  • staffordshire.py
  • stockport.py

Other

NAFIS

nafis.py

Not actually a plings scraper, but used instead to take information from [38] and place it into a spreadsheet. There is also a scraperwiki version.

Scrapers in Progress

Scrapers that have not yet put data into the live database:

  • tameside.py (unfinished, poorly structured data)
  • tameside-whatson.py (poor data content)
  • luton.py (unfinished, poorly structured data)
  • walkit.py (unfinished, poorly structured data)
  • youngscotwow.py (poor time and postcode information)

External API scrapers

  • culture24.py (unfinished, requires an API key to get the data we need)

Plings Input Libraries

plings.py, plings_base.py, plings_sax.py and plings_soap.py

Classes and methods that make it easier to input data to plings from Python. They provide an Activities class, which can be used to create and send plings API data.

  • plings.py is the oldest, and creates an XML file using DOM methods. (This XML file can then be sent to the server using HTTP POST, for example with xml_post.py.)
  • plings_sax.py is similar but uses SAX methods, so it is faster and is generally the preferred option.
  • plings_soap.py uses the SOAP methods.
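The difference between the DOM and SAX approaches can be illustrated with the standard library. The sketch below is not the plings code, and the element names activities and activity are placeholders rather than the real plings schema. It only shows that a DOM writer builds the whole tree in memory before serialising it, while a SAX-style writer emits each element as soon as it is produced.

```python
import io
from xml.dom import minidom
from xml.sax.saxutils import XMLGenerator


def build_with_dom(names):
    """Build the whole tree in memory, then serialise it in one go."""
    doc = minidom.Document()
    root = doc.createElement("activities")  # placeholder element name
    doc.appendChild(root)
    for name in names:
        node = doc.createElement("activity")  # placeholder element name
        node.appendChild(doc.createTextNode(name))
        root.appendChild(node)
    return root.toxml()


def build_with_sax(names):
    """Write each element as it is produced; no tree is kept in memory."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startElement("activities", {})
    for name in names:
        gen.startElement("activity", {})
        gen.characters(name)  # special characters such as & are escaped
        gen.endElement("activity")
    gen.endElement("activities")
    return out.getvalue()
```

For a large number of activities the streaming version avoids holding every node in memory at once, which is one plausible reason for the speed difference noted above.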

Example of use:

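from plings_sax import Activities

# Illustrative placeholder values; a real scraper extracts these from its source
source_id = "example-1"
venue_name = "Example Venue"

activities = Activities()
activities.outname = "somefile.xml"

activity = activities.create_activity()
activity.add_field("ActivitySourceID", source_id)  # repeat for the other activity fields
activity.venue.add_field("Name", venue_name)  # repeat for the other venue fields
# Starts and Ends may be added like the above, or activity.setup_recurring may be used

# Organisation is optional
activity.add_org()
activity.org.add_field("Name", "Blah")  # repeat for the other organisation fields

activity.finish()  # For SOAP, this is what actually creates the activity
# ... create and finish further activities in the same way ...
activities.finish()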

All three of these classes inherit from the base class, which also provides some helper functions:

  • activity.hash(venue_name) - Strips whitespace then hashes -> useful for creating ids from venue names
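A helper of this kind might look like the sketch below. The function name venue_hash is hypothetical, and two details are assumptions because the page does not state them: that all whitespace is removed (not just leading and trailing), and that SHA-256 is the hash. The real method may differ on both points.

```python
import hashlib


def venue_hash(venue_name):
    """Remove all whitespace from the name, then return a hex digest of the rest."""
    stripped = "".join(venue_name.split())  # assumption: every whitespace run is dropped
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()  # assumption: SHA-256
```

Because whitespace is discarded first, spellings such as "Town Hall" and " Town  Hall " produce the same ID, which helps when a source site formats venue names inconsistently.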

Feel free to use these yourself, but bear in mind they are incomplete and may change dramatically.

Tools

xml_post.py

Because plings.py only creates an XML file, xml_post.py is a small script for actually posting the data to plings. It expects a config.py file containing a dict that maps local authority (LA) names to API keys.
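As a rough sketch of this arrangement, the code below looks up an API key by LA name and builds an HTTP POST request with the XML as its body. Everything specific here is an assumption made for illustration: the API_KEYS variable name and its entry, the placeholder URL, the build_post function, and above all the way the key is attached (a made-up X-Api-Key header). The real script and API may do this differently.

```python
import urllib.request

# Stand-in for config.py: a dict mapping LA names to API keys.
# The variable name and the entry are made up for illustration.
API_KEYS = {
    "ExampleShire": "not-a-real-key",
}

# Placeholder endpoint; substitute the real XML POST URL.
POST_URL = "http://example.invalid/xml-post"


def build_post(la_name, xml_bytes, url=POST_URL):
    """Build (but do not send) an HTTP POST request carrying the XML body."""
    api_key = API_KEYS[la_name]  # raises KeyError for an unknown LA name
    request = urllib.request.Request(url, data=xml_bytes, method="POST")
    request.add_header("Content-Type", "text/xml")
    # Assumption: the key travels in a custom header; the real API may differ.
    request.add_header("X-Api-Key", api_key)
    return request
```

Passing the returned object to urllib.request.urlopen would send it. Keeping the keys in a separate config.py means they do not have to be committed alongside the scrapers.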

split.py

Splits a large plings XML file into smaller files, each containing a specified number of activities. This is necessary so that the server doesn't run out of memory during import.
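The splitting step could be sketched like this. It assumes the activities are direct children of the root element and uses the placeholder tag name activity; the function name split_activities is also hypothetical. It returns the pieces as strings rather than writing files, to keep the example short.

```python
import xml.etree.ElementTree as ET


def split_activities(xml_text, per_file, item_tag="activity"):
    """Return a list of XML strings, each holding at most per_file items."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    root = ET.fromstring(xml_text)
    items = root.findall(item_tag)  # direct children only
    chunks = []
    for start in range(0, len(items), per_file):
        # Copy the root tag and attributes, then attach one slice of items.
        part = ET.Element(root.tag, root.attrib)
        part.extend(items[start:start + per_file])
        chunks.append(ET.tostring(part, encoding="unicode"))
    return chunks
```

Each returned string is a complete document with the same root element, so every piece can be posted on its own.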

picklecat.py

Display the data in a pickle file.
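A tool like this needs very little code. The sketch below is an assumption about how it might work, not the actual script: it loads a single pickled object and pretty-prints it.

```python
import pickle
import pprint


def picklecat(path):
    """Load one pickled object from path, pretty-print it, and return it."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    pprint.pprint(data)
    return data
```

Unpickling can execute arbitrary code, so this should only be run on pickle files you created yourself or otherwise trust.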

api_response_csv.py

Generates a CSV file of errors from the response given by the XML POST API.
