python - Failed to crawl element of specific website with scrapy spider


I want the website addresses of jobs, so I wrote a Scrapy spider. I want the values of the XPath //article/dl/dd/h2/a[@class="job-title"]/@href. When I execute the spider with the command:

scrapy crawl AuSeek -a addsthreshold=3

the variable "urls" that is used to preserve the extracted values is empty. Can someone help me figure out why?

Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals

from myproj.items import AdItem
import urlparse
import time

class AuSeekSpider(CrawlSpider):
    name = "AuSeek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'this is the start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        # Extract the href of every job-title anchor on the listing page.
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        print 'test element:', urls[0].encode("ascii")
        for url in urls:
            # The XPath already returns the href strings, so join them directly.
            url = urlparse.urljoin(response.url, url)
            yield Request(url, callback=self.parse_ad)
        return

    def parse_ad(self, response):
        print 'this is the parse_ad function'
        hxs = Selector(response)

        item = AdItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d', time.localtime(time.time()))

        # Stop the crawl once enough ad pages have been collected.
        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Got enough website addresses')
        return item

The problem is this line:

urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract() 

"urls" is empty when I try to print it out. I can't figure out why it doesn't work or how to correct it. Please help.

Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for:

curl http://www.seek.com.au/jobs/in-australia/ | grep job-title 

You should try PhantomJS or Selenium instead.
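For example, here is a minimal sketch of the Selenium route, assuming a local Firefox driver is installed; the a.job-title CSS selector simply mirrors the XPath from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.seek.com.au/jobs/in-australia/")
    # Wait until the JavaScript-rendered job anchors are present in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.job-title")))
    anchors = driver.find_elements_by_css_selector("a.job-title")
    urls = [a.get_attribute("href") for a in anchors]
    print urls
finally:
    driver.quit()

Because Selenium drives a real browser, the JavaScript that builds the listing has already run by the time you read the hrefs.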

After examining the network requests in Chrome, the job listings appear to have originated from this JSONP request. It should be easy to retrieve whatever you need from it.
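As a rough sketch of that approach (the endpoint URL below is only a placeholder for whatever request shows up in Chrome's Network tab, and the JSON field names are assumptions), you could add two callbacks like these to the spider and parse the response with the json module instead of scraping HTML:

import json
import re
from scrapy.http import Request

# Methods to drop into AuSeekSpider -- a sketch only; the URL and the JSON
# field names are guesses, not the real endpoint.
def parse_start_url(self, response):
    # Placeholder: copy the actual JSONP URL from Chrome's Network tab.
    jsonp_url = "http://www.seek.com.au/some/jsonp/endpoint?callback=cb"
    yield Request(jsonp_url, callback=self.parse_jsonp)

def parse_jsonp(self, response):
    # Strip the JSONP wrapper, e.g. cb({...});  ->  {...}
    body = re.sub(r'^[^(]*\(', '', response.body)
    body = re.sub(r'\)\s*;?\s*$', '', body)
    data = json.loads(body)
    for job in data.get('jobs', []):                       # assumed field name
        yield Request(job['url'], callback=self.parse_ad)  # assumed field name

Hitting the JSON endpoint directly avoids both the JavaScript rendering problem and the HTML parsing altogether.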

