I'm trying to use Scrapy to scrape information from geonames.org. More specifically, I want to retrieve the 10 largest cities for each country. My starting URL is http://ift.tt/1llrHys. On this page, I want to follow each URL that matches the regex:
/countries/\w{2}/.*\.html
Then, on each followed page (the country pages), I want to follow the URL with the structure http://ift.tt/pxPylTXX/largest-cities-in-YYYY.html, where XX is the two-letter country code and YYYY is the name of the country, which can be of variable length. The code below doesn't work. I suspect the problem is the regex in the second rule, but I'm not sure.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import re
import os


class MySpider(CrawlSpider):
    name = 'geocodeSpider'
    allowed_domains = ['www.geonames.org']
    start_urls = ['http://ift.tt/1llrHys']
    fileName = "largest_cities.txt"

    # Remove the output file left over from a previous run, if any.
    try:
        os.remove(os.path.join('geocode/output', fileName))
    except OSError:
        pass

    rules = (
        # Follow the link to each country page (no callback, so links are followed).
        Rule(LinkExtractor(allow=(r'/countries/\w{2}/.*\.html', )),),
        # On each country page, parse the "largest cities" page.
        Rule(LinkExtractor(allow=(r'/\w{2}/largest-cities-in-.*\.html', )), callback='parse_item'),
    )

    def parse_item(self, response):
        ...
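To rule the regexes in or out, the two patterns can be checked on their own with Python's re module, outside Scrapy. This is only a minimal sketch: the sample URLs and the classify helper are assumptions made up for illustration from the URL structure described above, and it relies on my understanding that LinkExtractor applies each allow pattern with search semantics (a match anywhere in the absolute URL).

```python
import re

# The two patterns from the rules; ".*" allows a page name of any length.
COUNTRY_PAGE = re.compile(r'/countries/\w{2}/.*\.html')
LARGEST_CITIES = re.compile(r'/\w{2}/largest-cities-in-.*\.html')

# For comparison: a lone "." matches exactly one character before ".html".
SINGLE_CHAR = re.compile(r'/\w{2}/largest-cities-in-.\.html')

# Hypothetical sample URLs, assumed from the structure described in the question.
SAMPLES = [
    'http://www.geonames.org/countries/',
    'http://www.geonames.org/countries/FR/france.html',
    'http://www.geonames.org/FR/largest-cities-in-france.html',
]


def classify(url):
    """Return which of the two rules (if any) would pick up this URL."""
    if LARGEST_CITIES.search(url):
        return 'largest-cities'
    if COUNTRY_PAGE.search(url):
        return 'country'
    return 'no match'


if __name__ == '__main__':
    for url in SAMPLES:
        print(classify(url), url)
    # The single-character variant does not match a multi-letter country name.
    print(SINGLE_CHAR.search(SAMPLES[2]) is not None)
```

If both patterns classify the sample URLs as expected here, the regexes themselves are probably not what stops the crawl, and the other spider settings (for example allowed_domains versus the host of the start URL) would be the next thing to check.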
Thanks a lot for your help!!
TG