Scrapers Lancashire


These bugs should now be fixed. Please use the code from the git repository as it is more up to date.

The Lancashire scraper needed some work.

The code below builds on the code from Scrapers, but it still needs fixing.

Instructions

Download or check out a version of the code from Scrapers.

You will need to create a cache directory if it does not already exist. Within it you need a 'lancashire' directory, and within that 'tmp', 'venue' and 'repeats' directories (these are the names the script reads from and writes to):

cache
  - lancashire
      - tmp
      - venue
      - repeats

If you do not create these, the script will tell you!
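If you would rather not create the directories by hand, something like the following could do it. This is only a sketch: ensure_cache_dirs is a hypothetical helper that is not part of the Scrapers code, and the directory names are taken from the paths that lancashire.py reads and writes (cache/lancashire/tmp, venue and repeats).

```python
import os

def ensure_cache_dirs(root="cache"):
    """Create root/lancashire/{tmp,venue,repeats} if they are missing.

    Hypothetical helper, not part of the original scraper.
    Returns the list of directories that had to be created.
    """
    created = []
    for sub in ("tmp", "venue", "repeats"):
        path = os.path.join(root, "lancashire", sub)
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates root and root/lancashire
            created.append(path)
    return created
```

Calling ensure_cache_dirs() once from the directory that holds lancashire.py, before running 'get', should be enough.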

Back up the lancashire.py file in the checked-out code, then replace lancashire.py with the code below. You will also need to make sure the lancashire_defs.py file is in the same directory.

Running the script

The parser runs in 2 stages.

First, to get the data from the website, it downloads html pages into the cache/lancashire directory. To do this, run the lancashire.py file with the 'get' argument

python lancashire.py get

Once you have some data you need to parse it and turn it into plings API ready XML. To do this, run the lancashire.py file with the 'scrape' argument

python lancashire.py scrape

A file, lancashire.xml, will be made in the same directory as the script.

About the Lancashire Data

In order to get the data (I think) the script does the following.

1. Pulls a list of activities from the search results pages, e.g. http://yps.lancashire.gov.uk/activity-search?ft=%25&page=0, and stores the HTML in a temporary directory. Line 27 of lancashire.py sets the number of pages to work through; this is a manual setting.

2. Goes through the stored HTML, visits each activity's page and stores the HTML of those pages.
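Both steps follow the same "use the cached copy if there is one, otherwise download and save it" pattern. A minimal sketch of that pattern is below. cached_fetch and its fetch parameter are hypothetical names used for illustration; the real script inlines this logic and calls urllib2.urlopen directly.

```python
import os

def cached_fetch(url, cache_path, fetch):
    """Return the body for url, preferring the copy stored at cache_path.

    fetch is any callable that takes a URL and returns the page body as
    bytes. It is only called when cache_path does not exist yet.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    body = fetch(url)
    with open(cache_path, "wb") as f:
        f.write(body)
    return body
```

Passing the downloader in as an argument means the caching can be tried out without touching the live website.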

Once you have some stored data, you can run lancashire.py with the 'scrape' argument. Some of the information is pulled directly from the activity page, but the rest is found by following the venue link and the 'Repeats' link from that page.

As data is found, it is added to an XML builder using activity.add_field, which is part of the plings_sax module.

To add all instances of the activity, we make a copy of the activity, append the date to the id, and store the XML from there.
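The date handling for each instance can be sketched on its own, without plings_sax. The sketch below uses only the standard library rather than dateutil, and assumes the date text looks like 'Mon 14/8/10 - ' (day/month/year) and the times like '18:00'. parse_uk_date and instance_fields are hypothetical names, not functions from the scraper.

```python
from datetime import datetime

def parse_uk_date(text):
    """Parse text such as 'Mon 14/8/10 - ' as day/month/year."""
    # Drop the trailing ' - ' and the weekday name, keeping '14/8/10'.
    day_month_year = text.strip(" -").split()[-1]
    return datetime.strptime(day_month_year, "%d/%m/%y").date()

def instance_fields(activity_id, date_text, start_text, end_text):
    """Build the id and ISO timestamps for one dated instance."""
    day = parse_uk_date(date_text)
    start = datetime.strptime(start_text.strip(), "%H:%M").time()
    end = datetime.strptime(end_text.strip(), "%H:%M").time()
    return {
        "id": activity_id + "-" + day.isoformat(),
        "Starts": datetime.combine(day, start).isoformat(),
        "Ends": datetime.combine(day, end).isoformat(),
    }
```

Parsing the date explicitly as day/month/year avoids a date such as 3/8/10 being read as 8 March instead of 3 August.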

To Do

It looks as though we can parse all the data we need out of the html files. I'm stuck at the part that turns it into plings API ready XML.

I think the only stumbling block is the 'Description' data.

To recreate: Run

python lancashire.py scrape

and look at the resulting lancashire.xml file. The XML will be only partly formed, and broken around the <Description> tag.

To make it work: Uncomment line 89

#desc["description"] = "hello world"

to overwrite the parsed version of the description. Run the script again and you should see well-formed XML in lancashire.xml.

The problem may well lie in lancashire_defs.py, where the description function pulls the data from the HTML page.
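One possible cause, not confirmed against plings_sax: in BeautifulSoup 3, findAll(text=True) returns a list of text nodes rather than a single string, so desc["description"] ends up holding a list, whereas the "hello world" workaround passes a plain string. If that is the problem, joining the fragments before calling add_field may be enough. join_text_fragments below is a hypothetical helper that shows the idea on an ordinary list of strings.

```python
def join_text_fragments(fragments):
    """Collapse a list of text fragments into one string.

    Runs of whitespace (including newlines between fragments) are
    reduced to single spaces, and the ends are trimmed.
    """
    return " ".join(" ".join(fragments).split())
```

In description() this would mean storing join_text_fragments(html) instead of html, assuming add_field expects a single string.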

Code

lancashire.py

#!/usr/bin/env python
# Copyright (c) 2010 Ben Webb <bjwebb67@googlemail.com>
# Released as free software under the MIT license,
# see the LICENSE file for details.
import BeautifulSoup
import urllib2
from plings_sax import Activities
import datetime
import dateutil.parser
import re
import sys
import os
import lancashire_defs
 
##Regular expressions for use later to match parsed URLs
venre = re.compile("/content/(.*)")
repre = re.compile("/node/([0-9]+)/repeats")
 
cachedir = "cache/lancashire/"
baseurl = "http://yps.lancashire.gov.uk/"
 
 
if len(sys.argv) < 2:
    print "You must supply an action: get or scrape" #get pulls data from the website, scrape parses it (and pulls additional info)
 
elif sys.argv[1] == "get":
    for i in range(0,1): # 0 is the first page, 1 is the last page. Use this for testing then increase the page numbers when working
        if str(i)+".html" in os.listdir(os.path.join(cachedir, "tmp")):
            html = open(os.path.join(cachedir, "tmp", str(i)+".html")).read()
        else:
            url = baseurl + "activity-search?ft=%25&page=" + str(i)
            html = urllib2.urlopen(url).read()
            f = open(os.path.join(cachedir, "tmp", str(i)+".html"), "w")
            f.write(html)
            f.close()
 
        page1 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
 
        for act in page1.findAll(attrs={"class": "search-activity"}):
            a = act.find("a")
            url2 = baseurl + a["href"]
            print url2
            id = a["href"]
            if not (id+".html").encode("utf8") in os.listdir(cachedir):
                html = urllib2.urlopen(url2.encode("utf8")).read()
                print id
                f = open(os.path.join(cachedir, id.encode("utf8")+".html"), "w")
                f.write(html)
                f.close()
 
elif sys.argv[1] == "scrape":
    activities = Activities()
    activities.outname = "lancashire.xml"
 
    for file in os.listdir(cachedir):
        try:
            file = file.decode("utf8")
            if not file.endswith(".html"): continue ##skip the tmp, venue and repeats directories
            id = file[:-len(".html")]
            print id
 
            ##Create our activity
            activity = activities.create_activity()
            ##Write the ActivitySourceID
            activity.add_field("ActivitySourceID", id)
 
            ##Open the saved HTML for parsing
            html = open(os.path.join(cachedir, file)).read()
            page = BeautifulSoup.BeautifulSoup(html, fromEncoding="utf-8", convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
 
            ##Find the activity name
            activity.add_field("Name", page.find("h2", "with-tabs").contents[0])
            print page.find("h2", "with-tabs").contents[0]
 
            ##Activity contact details
            data = lancashire_defs.contact_details(page)
            #print data.keys()
            print data["Contact name:"]
            print data["Contact phone:"]
            print data["Contact e-mail:"]
 
            ##Activity Age ranges
            age_range = lancashire_defs.ages(page)
            print age_range["From age:"]
            print age_range["To age:"]
 
            ##Activity Description
            desc = lancashire_defs.description(page)
            #desc["description"] = "hello world"
            #print desc["description"]
 
 
            activity.add_field("Description", desc["description"])
            activity.add_field("ContactName", data["Contact name:"])
            activity.add_field("ContactNumber", data["Contact phone:"])
            try:
                activity.add_field("MinAge", age_range["From age:"])
                activity.add_field("MaxAge", age_range["To age:"])
            except KeyError: pass
 
            ##Get the venue details
            ##First each activity page has a URL to the venue - use this url as a venue id
            ##aaahhh - could have used venre.match
            m = re.sub("\/content\/","",page.find("fieldset", "fieldgroup group-act-venue").find("a")["href"])
            venid = m
            #print m
            #break
 
            ##Now see if we have stored the venue page in the cache. If not grab it from the web
            if (venid+".html").encode("utf8") in os.listdir(os.path.join(cachedir, "venue")):
                html = open(os.path.join(cachedir, "venue", venid+".html")).read()
            else:
                venurl = baseurl + "content/" + venid
                print venurl
                html = urllib2.urlopen(venurl).read()
                f = open(os.path.join(cachedir, "venue", venid.encode("utf8")+".html"), "w")
                f.write(html)
                f.close()
 
            ##Open the venue page for parsing
            page2 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
 
            ##VenueID we know, postcode and Name are easily found and added to the activity XML in the venue braces
            activity.venue.add_field("ProviderVenueID", venid)
            activity.venue.add_field("Name", page2.find("div", "block-top-right-orange").find("div", "titles").h2.contents[0])
            activity.venue.add_field("Postcode", page2.find("span", "postal-code").contents[0])
            print page2.find("div", "block-top-right-orange").find("div", "titles").h2.contents[0]
            print page2.find("span", "postal-code").contents[0]
            #break
 
            ##Find the Phone Number and Main Contact data
            data2 = lancashire_defs.venue_details(page2)
            print data2["Phone Number:"]
            print data2["Main Contact:"]
            #break
 
            ##Add phone, contact names and website details to the activity XML
            activity.venue.add_field("ContactPhone", data2["Phone Number:"])
            try:
                names = data2["Main Contact:"].split(" ")
                activity.venue.add_field("ContactForename", names[0])
                activity.venue.add_field("ContactSurname", names[1])
                activity.venue.add_field("Website", baseurl+"content/"+venid)
            except IndexError: pass
 
            ##Now we look for the link to 'Repeats' this tells us all the dates of this activity
            ##This grabs the node id of the link. It is of the form /node/42838/repeats
            m = repre.match(page.find("a", text="Repeats").parent["href"])
            realid = m.group(1)
 
            ##Check to see if we have a file with this id in the cache - if not fetch it
            if realid+".html" in os.listdir(os.path.join(cachedir, "repeats")):
                html = open(os.path.join(cachedir, "repeats", realid+".html")).read()
            else:
                venurl = baseurl + "node/" + realid + "/repeats"
                print venurl
                html = urllib2.urlopen(venurl).read()
                f = open(os.path.join(cachedir, "repeats", realid+".html"), "w")
                f.write(html)
                f.close()
 
            ##Now we open and parse the 'repeats' html to pull out start and end dates.
            page3 = BeautifulSoup.BeautifulSoup(html, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
 
            ##datefield is a div of many odd/even divs containing date and time spans
            datefield = page3.find("div", "field field-type-datetime field-field-act-date")
            #print datefield
 
            #for d in datefield.findAll("span", "date-display-single"):
            for d in datefield.findAll("div","field-item odd"):
                ##We need to get the date, start and end fields from the HTML and convert to an iso timestamp
                ##date is in 'Mon 14/8/10 - ' format
                ##times are in simple 24hr clock format
                activity_date = d.find("span", "date-display-single").contents[0].strip(" - ")
                activity_date = dateutil.parser.parse(activity_date, dayfirst=True).date() ##dates are day/month/year
                #print activity_date
 
                start_time = d.find("span","date-display-start").contents[0].strip()
                hours_to_time = dateutil.parser.parse(start_time).time()
                start_iso = datetime.datetime.combine(activity_date, hours_to_time).isoformat()
                print start_iso
 
                end_time = d.find("span","date-display-end").contents[0].strip()
                hours_to_time = dateutil.parser.parse(end_time).time()
                end_iso = datetime.datetime.combine(activity_date, hours_to_time).isoformat()
                print end_iso
 
 
                #activity2 is a clone of the data gathered so far with the activity id getting an iso datetime added to it
                activity2 = activities.cloneactivity(activity, activity.id+"-"+activity_date.isoformat())
                activity2.add_field("Starts", start_iso)
                activity2.add_field("Ends", end_iso)                
                activity2.finish()
            #activity.add_field("Starts", start_iso)
            #activity.add_field("Ends", end_iso)
            #activity.finish()
        except KeyError: pass
        except AttributeError: pass
        except ValueError: pass
        except urllib2.HTTPError: pass
        except urllib2.URLError:
            from time import sleep
            sleep(10)
        except IOError:
            print "IOError"
 
    activities.finish()
    #print activities.doc.toprettyxml(indent="    ")

lancashire_defs.py

#!/usr/bin/env python
# Copyright (c) 2010 David Carpenter <caprenter@gmail.com>
# Released as free software under the MIT license,
# see the LICENSE file for details.
 
def contact_details(page):
  data = {}
  for div in page.findAll("fieldset", { "class" : "fieldgroup group-act-contact" }):
     name = div.find("div", "field field-type-text field-field-act-contact-name")
     phone = div.find("div", "field field-type-text field-field-act-contact-phone")
     email = div.find("div", "field field-type-email field-field-act-email")
     #print email
     if name:
        l = name.find("div", "field-label")
        label = l.contents[0].strip()
        #print label
        #for div2 in div.find("div", {"class" : "field-items"}):
        l2 = name.find("div", "field-item odd")
        contact_name = l2.contents[0].strip()
        #print contact_name
        data [label] = contact_name
     if email:
        l = email.find("div", "field-label")
        label = l.contents[0].strip()
        #print label
        #for div2 in div.find("div", {"class" : "field-items"}):
        l2 = email.find("div", "field-item odd")
        #print l2
        email_address = l2.find("a").contents[0].strip()
        #print email_address
        data [label] = email_address
     if phone:
        #print phone
        l = phone.find("div", "field-label")
        #print l
        label = l.contents[0].strip()
        #print label
        #for div2 in div.find("div", {"class" : "field-items"}):
        l2 = phone.find("div", "field-item odd")
        phone_no1 = l2.contents[0].strip()
        l3 = phone.find("div", "field-item even")
        #print l3
        if l3:
          phone_no2 = l3.contents[0].strip()
          phone_no1 = phone_no1 + " / " + phone_no2
        data [label] = phone_no1
  return data
 
 
 
def ages(page):
  age_range = {}
  minage = page.find("div", "field field-type-number-integer field-field-act-from-age")
  #print minage
  minage_label = minage.find("div","field-label-inline-first").contents[0].strip()
  minage_value = minage.find("div","field-item odd")
  minage_value = minage_value.div.nextSibling.extract().strip() 
  age_range [minage_label] = minage_value
 
  maxage = page.find("div", "field field-type-number-integer field-field-act-to-age")
  #print maxage
  maxage_label = maxage.find("div","field-label-inline-first").contents[0].strip()
  maxage_value = maxage.find("div","field-item odd")
  maxage_value = maxage_value.div.nextSibling.extract().strip() 
  age_range [maxage_label] = maxage_value 
  return age_range
 
def description(page):
  desc = {}
  description_html = page.find("div", "field field-type-text field-field-act-desc")
  html = description_html.find("div","field-item odd").findAll(text=True) #findAll(text=True) removes html tags from the string
  desc["description"] = html
  print html
  return desc
 
 
 
def venue_details(page):
  data = {}
  phone =  page.find("div","field field-type-text field-field-telephone")
  main_contact =  page.find("div","field field-type-text field-field-main-contact")
  if main_contact:
        contact_name = main_contact.find("div","field-item odd").contents[0].strip()
        l = main_contact.find("div", "field-label")
        label = l.contents[0].strip()
        data [label] = contact_name
  if phone:
        #print phone
        l = phone.find("div", "field-label")
        #print l
        label = l.contents[0].strip()
        #print label
        #for div2 in div.find("div", {"class" : "field-items"}):
        l2 = phone.find("div", "field-item odd")
        phone_no1 = l2.contents[0].strip()
        l3 = phone.find("div", "field-item even")
        #print l3
        if l3:
          phone_no2 = l3.contents[0].strip()
          phone_no1 = phone_no1 + " / " + phone_no2
        data [label] = phone_no1
  return data