Friday, 31 July 2015

Scrapy rules and regular expression

I'm trying to use Scrapy to scrape information from geonames.org. More specifically, I want to retrieve the 10 largest cities for each country. My starting URL is http://ift.tt/1llrHys. On this page, I want to follow each URL that matches the regex:

/countries/\w{2}/.*\.html

Then, on the followed pages (the country pages), I want to follow the URL with the structure http://ift.tt/pxPylTXX/largest-cities-in-YYYY.html, where XX is the two-letter country code and YYYY is the name of the country, which can be of variable length. The code below doesn't work. I suspect the regex in the second rule is the problem, but I'm not sure.

# Since Scrapy 1.0 these classes live in scrapy.spiders and
# scrapy.linkextractors; the scrapy.contrib paths are deprecated.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import os

class MySpider(CrawlSpider):
    name = 'geocodeSpider'
    allowed_domains = ['www.geonames.org']
    start_urls = ['http://ift.tt/1llrHys']

    # Remove any output file left over from a previous run.
    fileName = "largest_cities.txt"
    try:
        os.remove(os.path.join('geocode/output', fileName))
    except OSError:
        pass

    rules = (
        # Country pages: followed, but not parsed (no callback).
        Rule(LinkExtractor(allow=(r'/countries/\w{2}/.*\.html', ))),
        # "Largest cities" pages: handed to parse_item.
        Rule(LinkExtractor(allow=(r'/\w{2}/largest-cities-in-.*\.html', )), callback='parse_item'),
    )

    def parse_item(self, response):
        ...
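
To narrow down whether the regexes are really at fault, the patterns can be checked outside Scrapy. The sketch below is only an illustration: the two sample URLs are made up to follow the structure described above, and it assumes that the link extractor applies each `allow` pattern as a regex search against the absolute URL. It also shows why a single `.` in place of `.*` cannot match a country name of variable length.

```python
import re

# Patterns corresponding to the two rules, with '.*' standing for a
# page name of arbitrary length.
COUNTRY_PAGE = re.compile(r'/countries/\w{2}/.*\.html')
LARGEST_CITIES = re.compile(r'/\w{2}/largest-cities-in-.*\.html')

# Variant with a single '.', which matches exactly one character.
SINGLE_DOT = re.compile(r'/\w{2}/largest-cities-in-.\.html')

# Hypothetical sample URLs, invented for this check.
country_url = 'http://www.geonames.org/countries/FR/france.html'
cities_url = 'http://www.geonames.org/FR/largest-cities-in-france.html'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


print(matches(COUNTRY_PAGE, country_url))    # country page vs. first rule
print(matches(LARGEST_CITIES, cities_url))   # cities page vs. second rule
print(matches(LARGEST_CITIES, country_url))  # country page vs. second rule
print(matches(SINGLE_DOT, cities_url))       # cities page vs. single-dot variant
```

If both patterns match the URLs they are meant for, the problem probably lies elsewhere, for example in the start URL or in `allowed_domains`.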

Thanks a lot for your help!!

TG
