python - Failed to crawl element of specific website with scrapy spider
I want to get the website addresses of job ads, so I wrote a Scrapy spider. I want to extract the value of the XPath `//article/dl/dd/h2/a[@class="job-title"]/@href`.
When I execute the spider with the command:

```shell
scrapy crawl auseek -a addsthreshold=3
```
the variable `urls`, which is used to hold the extracted values, is empty. Can anyone help me figure it out?

Here is the code:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals
from myproj.items import AdItem
import urlparse
import time

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'this is the start url function'
        log.msg("pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        print 'test element:', urls[0].encode("ascii")
        for url in urls:
            postfix = url.getAttribute('href')
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)
        return

    def parse_ad(self, response):
        print 'this is the parse_ad function'
        hxs = Selector(response)
        item = AdItem()
        log.msg("pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%y%m%d', time.localtime(time.time()))

        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('get enough website address')
        return item
```
The problem is this line:

```python
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
```
`urls` is empty when I try to print it out. I can't figure out why it doesn't work or how to correct it. Please help.
Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for:

```shell
curl http://www.seek.com.au/jobs/in-australia/ | grep job-title
```

You should try PhantomJS or Selenium instead.
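A minimal sketch of that approach: render the page in a real (headless) browser so the JavaScript-built anchors exist, then extract them with the question's XPath. Note this is an illustration, not the asker's spider: the extraction helper uses the Python 3 stdlib (`ElementTree`, `urllib.parse`) as a stand-in for Scrapy's `Selector`, and the webdriver choice is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

def extract_job_links(html, base_url):
    """Pull job-ad hrefs out of already-rendered, well-formed markup.

    Mirrors the question's XPath //article/dl/dd/h2/a[@class="job-title"]/@href
    (ElementTree only supports the simpler .//a[...] form of it).
    """
    root = ET.fromstring(html)
    anchors = root.findall('.//a[@class="job-title"]')
    # Resolve relative hrefs against the page URL, as the spider's loop intended.
    return [urljoin(base_url, a.get('href')) for a in anchors]

def fetch_rendered_html(url):
    """Load the page in a headless browser so its JavaScript runs first."""
    from selenium import webdriver
    driver = webdriver.PhantomJS()  # or webdriver.Firefox(), per taste
    try:
        driver.get(url)
        return driver.page_source  # the DOM after JavaScript has executed
    finally:
        driver.quit()
```

With this, `extract_job_links(fetch_rendered_html(start_url), start_url)` would return the list that `hxs.xpath(...).extract()` currently leaves empty.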
After examining the network requests in Chrome, the job listings appear to have originated from this JSONP request. It should be easy to retrieve whatever you need from it.
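A JSONP response is just JSON wrapped in a callback call, e.g. `someCallback({...});`, so it can be unwrapped without a browser at all. A small sketch, assuming a generic wrapper (the actual seek.com.au endpoint and callback name are not reproduced here):

```python
import json
import re

def parse_jsonp(body):
    """Strip a callback(...) wrapper and parse the JSON payload inside."""
    # Match "callbackName( <payload> );" where the name may be dotted ($.cb, ns.fn).
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', body, re.DOTALL)
    if not match:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))
```

You would fetch the JSONP URL seen in Chrome's network panel (with `requests` or `urllib`), pass the body through `parse_jsonp`, and read the job URLs straight out of the resulting dict, skipping the HTML scraping entirely.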