Tuesday, March 09, 2010

Screen scraping joy

Been doing a touch of screen scraping, scripting with Ruby, against a target that was ‘unwilling’. A few observations:

  • Using Mechanize (available in various forms for Perl, Python and Ruby [homepage for the latter]) is a must. I started with Ruby’s standard Net::HTTP library, then moved to Curb (Ruby bindings for libcurl), but having the pages you retrieve abstracted into an object that you can manipulate in familiar terms (like, say, page.forms_with :name => "choose_colour") helps you concentrate on the peculiarities of your task.
  • Replicating the path of a real user is important. The server may hold session variables, so jumping straight between items that a regular user could not navigate between will generate error pages. But see the next point.
  • Don’t count on friendly HTTP errors from the server, as it might not know it has done anything wrong. An error page can come back looking like any other successful response.
  • If the page output looks OK but you cannot parse it, check for funny characters hidden in the HTML. I found ASCII nulls dotted about; these are initially hard to spot for somewhat obvious reasons. Browsers can deal with this kind of dodginess but XML parsers, as @fidothe reminds me, must ignore the elements in which such characters occur. I was able to do this to get around the problem:

    require 'mechanize'

    @agent = Mechanize.new
    class << @agent
      alias :orig_get :get
      alias :orig_fetch_page :fetch_page

      # remove the chaff characters (ASCII nulls) from every page fetched
      def get(options, parameters = [], referer = nil)
        page = orig_get(options, parameters, referer)
        page.body = page.body.gsub("\0", "")
        page
      end

      def fetch_page(params)
        page = orig_fetch_page(params)
        page.body = page.body.gsub("\0", "")
        page
      end
    end

    "\0" is the Ruby escape for ASCII null in the sample code. I was also able to select and paste the literal character from an HTML dump with both vim and a GUI text editor, but it tends to be less than visible in the wild, so the escape is easier to read. YMMV.

  • Assume that what you’re doing is an unwelcome task. If the points above don’t give you that impression, other curiosities probably will.
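
If you want to check whether a page is affected before patching the agent, something like the following might help. It is only a sketch using plain Ruby strings: null_count, scrub_nulls, soft_error? and the ERROR_MARKERS phrases are names and values I have made up for illustration, not part of Mechanize, and the marker text would need to match whatever your target actually prints.

```ruby
# Hypothetical helpers for inspecting a scraped page body.
# None of these names come from Mechanize; they work on plain strings.

# Count the ASCII nulls so you know whether the page is affected at all.
def null_count(html)
  html.count("\0")
end

# Return a copy of the body with the nulls removed.
def scrub_nulls(html)
  html.delete("\0")
end

# Made-up example phrases; replace them with the wording your target
# uses on its error pages.
ERROR_MARKERS = ["session has expired", "an error has occurred"].freeze

# Guess whether a page is really an error page, whatever the HTTP
# status said, by looking for a known phrase in the body.
def soft_error?(html)
  text = html.downcase
  ERROR_MARKERS.any? { |marker| text.include?(marker) }
end

sample = "<p>Choose a col\0our</p>"
puts "nulls found: #{null_count(sample)}"
puts scrub_nulls(sample)
```

With Mechanize you would pass page.body to these helpers. Matching on phrases is crude, but it is about all you can do when the server does not report its own errors.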

