Scrapers Lancashire
From Plings Info
- These bugs should now be fixed. Please use the code from the git repository as it is more up to date.
The Lancashire scraper needed some work.
Below is code that builds on the code from Scrapers, BUT it still needs fixing! The outstanding problem is described under To Do.
Instructions
Download or checkout a version of the code from Scrapers
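You will need to create a cache directory (if it does not already exist). Within that you will need a 'lancashire' directory, and within that a 'venue' and a 'repeats' directory. The 'get' stage in the code below also reads and writes a 'tmp' directory at the same level, so create that too.
cache
- lancashire
  - venue
  - repeats
  - tmp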
If you do not create these, the script will tell you!
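The same layout can be created from Python. This is only a sketch, written for Python 3 (the scraper itself is Python 2). The function name make_cache_dirs is made up for illustration, and 'tmp' is included because the 'get' stage in the code below lists and writes cache/lancashire/tmp.

```python
import os


def make_cache_dirs(root="cache"):
    """Create the cache tree the scraper expects; existing directories are left alone."""
    base = os.path.join(root, "lancashire")
    for sub in ("tmp", "venue", "repeats"):
        # exist_ok=True makes the call safe to repeat on an existing tree.
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base
```

Calling make_cache_dirs() from the directory that holds lancashire.py should give the tree shown above.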
Back up the lancashire.py file in the checked out code, then replace lancashire.py with the code below. You will also need to make sure the lancashire_defs.py file is in the same directory.
Running the script
The parser runs in 2 stages.
First, to get the data from the website, it downloads html pages into the cache/lancashire directory. To do this, run the lancashire.py file with the 'get' argument:
python lancashire.py get
Once you have some data, you need to parse it and turn it into plings API ready XML. To do this, run the lancashire.py file with the 'scrape' argument:
python lancashire.py scrape
A file, lancashire.xml, will be created in the same directory as the script.
About the Lancashire Data
In order to get the data (I think) the script does the following.
1. Pulls a list of activities from the search results pages, e.g. http://yps.lancashire.gov.uk/activity-search?ft=%25&page=0, and stores the html in a temporary directory (cache/lancashire/tmp). Line 27 of lancashire.py (the for i in range(0,1) loop) sets the number of pages to work through. This is a manual setting.
2. Goes through the stored html, visits each activity's page and stores the html of those pages.
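Both steps follow the same "use the cached copy if there is one, otherwise download and save it" pattern. A minimal sketch of that pattern for Python 3 follows. The name cached_fetch and the optional fetch parameter are made up here so the helper can be tried without touching the network; lancashire.py itself inlines this logic with urllib2.

```python
import os
from urllib.request import urlopen


def cached_fetch(cache_path, url, fetch=None):
    """Return the page at url, reading it from cache_path if a copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    if fetch is None:
        # Default downloader; pass your own callable to avoid network access.
        fetch = lambda u: urlopen(u).read()
    html = fetch(url)
    # Save the page so the next run does not download it again.
    with open(cache_path, "wb") as f:
        f.write(html)
    return html
```

Because the cache is checked first, re-running the 'get' stage only downloads pages that are not already stored.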
Once you have some stored data, you can run lancashire.py with the 'scrape' argument. Some of the information is pulled directly from the activity page, but the rest is found by following the venue link and the 'repeats' link from that page.
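The venue and repeats pages are identified by ids taken from those two links. The sketch below (Python 3) uses the same two regular expressions that lancashire.py defines as venre and repre. The helper names and the "/content/example-venue" link are made up for illustration; "/node/42838/repeats" is the form noted in the code comments.

```python
import re

# Same patterns as venre and repre in lancashire.py.
VENUE_RE = re.compile(r"/content/(.*)")
REPEATS_RE = re.compile(r"/node/([0-9]+)/repeats")


def venue_id(href):
    """Return the part of a venue link after /content/, or None if it does not match."""
    m = VENUE_RE.match(href)
    return m.group(1) if m else None


def repeats_id(href):
    """Return the node number from a repeats link, or None if it does not match."""
    m = REPEATS_RE.match(href)
    return m.group(1) if m else None


print(venue_id("/content/example-venue"))  # example-venue
print(repeats_id("/node/42838/repeats"))   # 42838
```

Returning None for a link that does not match lets the caller skip the activity instead of failing on m.group(1).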
As data is found, it is added to an XML builder using activity.add_field, which is part of the plings_sax module.
To add all instances of the activity, we make a copy of the activity, append the date to the id, and store the XML from there.
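The cloning step can be sketched as follows (Python 3). Plain dicts stand in for plings_sax activities, and the names parse_occurrence and expand_instances are made up for illustration. The date format 'Mon 14/8/10 - ' comes from the comments in lancashire.py; the HH:MM time format is an assumption based on the "simple 24hr clock" comment. strptime with an explicit day-first format is used here because dateutil.parser.parse reads ambiguous dates month-first unless dayfirst=True is passed.

```python
import copy
from datetime import datetime


def parse_occurrence(date_text, start_text, end_text):
    """Turn 'Mon 14/8/10 - ', '18:30', '20:00' into a date plus ISO start/end strings."""
    # Drop the trailing ' - ' and the weekday name, keeping only '14/8/10'.
    day_text = date_text.strip(" -").split()[-1]
    day = datetime.strptime(day_text, "%d/%m/%y").date()
    start = datetime.combine(day, datetime.strptime(start_text.strip(), "%H:%M").time())
    end = datetime.combine(day, datetime.strptime(end_text.strip(), "%H:%M").time())
    return day, start.isoformat(), end.isoformat()


def expand_instances(activity, occurrences):
    """Return one copy of activity per occurrence, with the date appended to its id."""
    instances = []
    for date_text, start_text, end_text in occurrences:
        day, starts, ends = parse_occurrence(date_text, start_text, end_text)
        clone = copy.deepcopy(activity)
        clone["id"] = activity["id"] + "-" + day.isoformat()
        clone["Starts"] = starts
        clone["Ends"] = ends
        instances.append(clone)
    return instances
```

For an activity with id "abc" and an occurrence on 14/8/10, the clone gets the id "abc-2010-08-14", which keeps each instance's id unique as long as an activity runs at most once per day.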
To Do
It looks as though we can parse all the data we need out of the html files. I'm stuck at the part that turns it into plings API ready XML.
I think the only stumbling block is the 'Description' data.
To recreate: Run
python lancashire.py scrape
and look at the resulting lancashire.xml file. It will contain partly formed XML, broken around the <Description> tag.
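To make it work: Uncomment line 89 of lancashire.py (the commented-out line just after the call to lancashire_defs.description(page))
#desc["description"] = "hello world"
to overwrite the parsed version of the description. Run the script again and you should see well formed XML in lancashire.xml.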
The problem may well lie in lancashire_defs.py where the description function sits and pulls the data from the html page
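One possible cause, not verified against plings_sax: in the description function, findAll(text=True) returns a list of text fragments rather than a single string, so add_field is handed a list. The sketch below (Python 3) shows one way to flatten such a list before it goes into the XML. The helper name flatten_text and the sample fragments are made up, and ElementTree stands in for the plings_sax builder.

```python
import xml.etree.ElementTree as ET


def flatten_text(fragments):
    """Join text fragments into one string and collapse runs of whitespace."""
    return " ".join(" ".join(fragments).split())


# A list of fragments, like the one findAll(text=True) would hand back.
fragments = ["Come and try ", "\n", "climbing & abseiling", " every week.\n"]
description = flatten_text(fragments)

activity = ET.Element("Activity")
ET.SubElement(activity, "Description").text = description
xml_text = ET.tostring(activity, encoding="unicode")
print(xml_text)
```

If this is the cause, the equivalent change in lancashire_defs.py would be to join the fragments before storing them in desc["description"].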
Code
lancashire.py
#!/usr/bin/env python
# Copyright (c) 2010 Ben Webb <bjwebb67@googlemail.com>
# Released as free software under the MIT license,
# see the LICENSE file for details.

import BeautifulSoup
import urllib2
from plings_sax import Activities
import datetime
import dateutil.parser
import re
import sys
import os
import lancashire_defs

##Regular expressions for use later to match parsed URLs
venre = re.compile("/content/(.*)")
repre = re.compile("/node/([0-9]+)/repeats")

cachedir = "cache/lancashire/"
baseurl = "http://yps.lancashire.gov.uk/"

if len(sys.argv) < 2:
    print "You must supply an action: get or scrape"
#get pulls data from the website, scrape parses it (and pulls additional info)
elif sys.argv[1] == "get":
    for i in range(0,1): # 0 is the first page, 1 is the last page. Use this for testing then increase the page numbers when working
        if str(i)+".html" in os.listdir(os.path.join(cachedir, "tmp")):
            html = open(os.path.join(cachedir, "tmp", str(i)+".html")).read()
        else:
            url = baseurl + "activity-search?ft=%25&page=" + str(i)
            html = urllib2.urlopen(url).read()
            f = open(os.path.join(cachedir, "tmp", str(i)+".html"), "w")
            f.write(html)
            f.close()
        page1 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        for act in page1.findAll(attrs={"class": "search-activity"}):
            a = act.find("a")
            url2 = baseurl + a["href"]
            print url2
            id = a["href"]
            if not (id+".html").encode("utf8") in os.listdir(cachedir):
                html = urllib2.urlopen(url2.encode("utf8")).read()
                print id
                f = open(os.path.join(cachedir, id.encode("utf8")+".html"), "w")
                f.write(html)
                f.close()

elif sys.argv[1] == "scrape":
    activities = Activities()
    activities.outname = "lancashire.xml"
    for file in os.listdir(cachedir):
        try:
            file = file.decode("utf8")
            if file.endswith(".html"):
                id = file[:-len(".html")]
                print id

                ##Create our activity
                activity = activities.create_activity()

                ##Write the ActivitySourceID
                activity.add_field("ActivitySourceID", id)

                ##Open the saved HTML for parsing
                html = open(os.path.join(cachedir, file)).read()
                page = BeautifulSoup.BeautifulSoup(html, fromEncoding="utf-8", convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)

                ##Find the activity name
                activity.add_field("Name", page.find("h2", "with-tabs").contents[0])
                print page.find("h2", "with-tabs").contents[0]

                ##Activity contact details
                data = lancashire_defs.contact_details(page)
                #print data.keys()
                print data["Contact name:"]
                print data["Contact phone:"]
                print data["Contact e-mail:"]

                ##Activity Age ranges
                age_range = lancashire_defs.ages(page)
                print age_range["From age:"]
                print age_range["To age:"]

                ##Activity Description
                desc = lancashire_defs.description(page)
                #desc["description"] = "hello world"
                #print desc["description"]

                activity.add_field("Description", desc["description"])
                activity.add_field("ContactName", data["Contact name:"])
                activity.add_field("ContactNumber", data["Contact phone:"])
                try:
                    activity.add_field("MinAge", age_range["From age:"])
                    activity.add_field("MaxAge", age_range["To age:"])
                except KeyError:
                    pass

                ##Get the venue details
                ##First each activity page has a URL to the venue - use this url as a venue id
                ##aaahhh - could have used venre.match
                m = re.sub("\/content\/","",page.find("fieldset", "fieldgroup group-act-venue").find("a")["href"])
                venid = m
                #print m
                #break

                ##Now see if we have stored the venue page in the cache. If not grab it from the web
                if (venid+".html").encode("utf8") in os.listdir(os.path.join(cachedir, "venue")):
                    html = open(os.path.join(cachedir, "venue", venid+".html")).read()
                else:
                    venurl = baseurl + "content/" + venid
                    print venurl
                    html = urllib2.urlopen(venurl).read()
                    f = open(os.path.join(cachedir, "venue", venid.encode("utf8")+".html"), "w")
                    f.write(html)
                    f.close()

                ##Open the venue page for parsing
                page2 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)

                ##VenueID we know, postcode and Name are easily found and added to the activity XML in the venue braces
                activity.venue.add_field("ProviderVenueID", venid)
                activity.venue.add_field("Name", page2.find("div", "block-top-right-orange").find("div", "titles").h2.contents[0])
                activity.venue.add_field("Postcode", page2.find("span", "postal-code").contents[0])
                print page2.find("div", "block-top-right-orange").find("div", "titles").h2.contents[0]
                print page2.find("span", "postal-code").contents[0]
                #break

                ##Find the Phone Number and Main Contact data
                data2 = lancashire_defs.venue_details(page2)
                print data2["Phone Number:"]
                print data2["Main Contact:"]
                #break

                ##Add phone, contact names and website details to the activity XML
                activity.venue.add_field("ContactPhone", data2["Phone Number:"])
                try:
                    names = data2["Main Contact:"].split(" ")
                    activity.venue.add_field("ContactForename", names[0])
                    activity.venue.add_field("ContactSurname", names[1])
                    activity.venue.add_field("Website", baseurl+"content/"+venid)
                except IndexError:
                    pass

                ##Now we look for the link to 'Repeats' this tells us all the dates of this activity
                ##This grabs the node id of the link. It is of the form /node/42838/repeats
                m = repre.match(page.find("a", text="Repeats").parent["href"])
                realid = m.group(1)

                ##Check to see if we have a file with this id in the cache - if not fetch it
                if realid+".html" in os.listdir(os.path.join(cachedir, "repeats")):
                    html = open(os.path.join(cachedir, "repeats", realid+".html")).read()
                else:
                    venurl = baseurl + "node/" + realid + "/repeats"
                    print venurl
                    html = urllib2.urlopen(venurl).read()
                    f = open(os.path.join(cachedir, "repeats", realid+".html"), "w")
                    f.write(html)
                    f.close()

                ##Now we open and parse the 'repeats' html to pull out start and end dates.
                page3 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)

                ##datefield is a div of many odd/even divs containing data and time spans
                datefield = page3.find("div", "field field-type-datetime field-field-act-date")
                #print datefield
                #for d in datefield.findAll("span", "date-display-single"):
                for d in datefield.findAll("div","field-item odd"):
                    ##We need to get the date, start and end fields from the HTML and convert to an iso timestamp
                    ##date is in 'Mon 14/8/10 - ' format
                    ##times are in simple 24hr clock format
                    activity_date = d.find("span", "date-display-single").contents[0].strip(" - ")
                    activity_date = dateutil.parser.parse(activity_date).date()
                    #print activity_date
                    start_time = d.find("span","date-display-start").contents[0].strip()
                    hours_to_time = dateutil.parser.parse(start_time).time()
                    start_iso = datetime.datetime.combine(activity_date, hours_to_time).isoformat()
                    print start_iso
                    end_time = d.find("span","date-display-end").contents[0].strip()
                    hours_to_time = dateutil.parser.parse(end_time).time()
                    end_iso = datetime.datetime.combine(activity_date, hours_to_time).isoformat()
                    print end_iso
                    #activity2 is a clone of the data gathered so far with the activity id getting an iso datetime added to it
                    activity2 = activities.cloneactivity(activity, activity.id+"-"+activity_date.isoformat())
                    activity2.add_field("Starts", start_iso)
                    activity2.add_field("Ends", end_iso)
                    activity2.finish()
                #activity.add_field("Starts", start_iso)
                #activity.add_field("Ends", end_iso)
                #activity.finish()
        except KeyError:
            pass
        except AttributeError:
            pass
        except ValueError:
            pass
        except urllib2.HTTPError:
            pass
        except urllib2.URLError:
            from time import sleep
            sleep(10)
        except IOError:
            print "IOError"
    activities.finish()
    #print activities.doc.toprettyxml(indent=" ")
lancashire_defs.py
#!/usr/bin/env python
# Copyright (c) 2010 David Carpenter <caprenter@gmail.com>
# Released as free software under the MIT license,
# see the LICENSE file for details.

def contact_details(page):
    data = {}
    for div in page.findAll("fieldset", { "class" : "fieldgroup group-act-contact" }):
        name = div.find("div", "field field-type-text field-field-act-contact-name")
        phone = div.find("div", "field field-type-text field-field-act-contact-phone")
        email = div.find("div", "field field-type-email field-field-act-email")
        #print email
        if name:
            l = name.find("div", "field-label")
            label = l.contents[0].strip()
            #print label
            #for div2 in div.find("div", {"class" : "field-items"}):
            l2 = name.find("div", "field-item odd")
            contact_name = l2.contents[0].strip()
            #print contact_name
            data [label] = contact_name
        if email:
            l = email.find("div", "field-label")
            label = l.contents[0].strip()
            #print label
            #for div2 in div.find("div", {"class" : "field-items"}):
            l2 = email.find("div", "field-item odd")
            #print l2
            email_address = l2.find("a").contents[0].strip()
            #print email_address
            data [label] = email_address
        if phone:
            #print phone
            l = phone.find("div", "field-label")
            #print l
            label = l.contents[0].strip()
            #print label
            #for div2 in div.find("div", {"class" : "field-items"}):
            l2 = phone.find("div", "field-item odd")
            phone_no1 = l2.contents[0].strip()
            l3 = phone.find("div", "field-item even")
            #print l3
            if l3:
                phone_no2 = l3.contents[0].strip()
                phone_no1 = phone_no1 + " / " + phone_no2
            data [label] = phone_no1
    return data

def ages(page):
    age_range = {}
    minage = page.find("div", "field field-type-number-integer field-field-act-from-age")
    #print minage
    minage_label = minage.find("div","field-label-inline-first").contents[0].strip()
    minage_value = minage.find("div","field-item odd")
    minage_value = minage_value.div.nextSibling.extract().strip()
    age_range [minage_label] = minage_value
    maxage = page.find("div", "field field-type-number-integer field-field-act-to-age")
    #print maxage
    maxage_label = maxage.find("div","field-label-inline-first").contents[0].strip()
    maxage_value = maxage.find("div","field-item odd")
    maxage_value = maxage_value.div.nextSibling.extract().strip()
    age_range [maxage_label] = maxage_value
    return age_range

def description(page):
    desc = {}
    description_html = page.find("div", "field field-type-text field-field-act-desc")
    html = description_html.find("div","field-item odd").findAll(text=True) #findAll(text=True) removes html tags from the string
    desc["description"] = html
    print html
    return desc

def venue_details(page):
    data = {}
    phone = page.find("div","field field-type-text field-field-telephone")
    main_contact = page.find("div","field field-type-text field-field-main-contact")
    if main_contact:
        contact_name = main_contact.find("div","field-item odd").contents[0].strip()
        l = main_contact.find("div", "field-label")
        label = l.contents[0].strip()
        data [label] = contact_name
    if phone:
        #print phone
        l = phone.find("div", "field-label")
        #print l
        label = l.contents[0].strip()
        #print label
        #for div2 in div.find("div", {"class" : "field-items"}):
        l2 = phone.find("div", "field-item odd")
        phone_no1 = l2.contents[0].strip()
        l3 = phone.find("div", "field-item even")
        #print l3
        if l3:
            phone_no2 = l3.contents[0].strip()
            phone_no1 = phone_no1 + " / " + phone_no2
        data [label] = phone_no1
    return data

