Tuesday, March 09, 2010
Screen scraping joy
Been doing a touch of screen scraping, scripting with Ruby, against a target that was ‘unwilling’. A few observations:
- Using Mechanize (available in various forms for Perl, Python and Ruby) is a must. I started with the Ruby HTTP library, then went to Curb (Ruby bindings for libcurl), but having the pages you retrieve abstracted into an object that you can manipulate in familiar terms (like, say, page.forms_with :name => "choose_colour") helps you concentrate on the peculiarities of your task.
- Replicating the path of a real user is important. The server may be holding session variables, so jumping straight between pages that a regular user could not navigate between can generate error pages (though not necessarily helpful ones; see the next point).
- Don’t count on friendly HTTP errors from the server, as it might not know it has done anything wrong.
- If the page output looks OK but you cannot parse it, check for funny characters hidden in the HTML. I found ASCII nulls dotted about; these are initially hard to spot for somewhat obvious reasons. Browsers can deal with this kind of dodginess but XML parsers, as @fidothe reminds me, must ignore the elements in which such characters occur. I was able to do this to get around the problem:
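require 'mechanize'

@agent = Mechanize.new

# Wrap the agent's own methods on its singleton class so every page
# comes back with the nulls already stripped.
class << @agent
  alias :orig_get :get
  # fetch_page is an internal method of the Mechanize version I was using;
  # it may not exist in other releases.
  alias :orig_fetch_page :fetch_page

  # remove the chaff characters
  def get(options, parameters = [], referer = nil)
    page = orig_get(options, parameters, referer)
    page.body = page.body.gsub(/\x00/, "")
    page
  end

  def fetch_page(params)
    page = orig_fetch_page(params)
    page.body = page.body.gsub(/\x00/, "")
    page
  end
end

\x00 matches the ASCII null in the sample code. My original script had the literal character in the pattern instead; I was able to select and paste it from an HTML dump with both vim and a GUI text editor, but it tends to be less than visible in the wild and YMMV.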
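Since the nulls were so hard to see, here is a rough sketch of how you might hunt for hidden control characters in a page body before deciding what to strip. It is plain Ruby with no Mechanize involved; find_control_chars and the sample HTML are made up for illustration and are not part of my original script.

```ruby
# Hypothetical helper (not from the original script): list the control
# characters hiding in a page body, together with their byte offsets.
def find_control_chars(body)
  hits = []
  body.each_byte.with_index do |byte, offset|
    # Bytes below 0x20 are control characters; tab, LF and CR are
    # legitimate whitespace in HTML, so skip those.
    next if byte >= 0x20 || [0x09, 0x0A, 0x0D].include?(byte)
    hits << [offset, format("0x%02X", byte)]
  end
  hits
end

# Made-up sample standing in for a scraped page body.
html = "<p>Hel\x00lo</p>\n<p>Wor\x00ld</p>"

find_control_chars(html).each do |offset, code|
  puts "#{code} at byte #{offset}"
end

# Same substitution as in the agent override, applied to the sample.
cleaned = html.gsub(/\x00/, "")
puts cleaned
```

With a real page you would pass page.body instead of the sample string. If the report shows anything other than nulls, the pattern in the override would need widening to match.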
Labels: ruby, screen.scraper, scripting
