python - Failed to crawl element of specific website with scrapy spider
I want to get the website addresses of job ads, so I wrote a Scrapy spider. I want to extract the value of the XPath `//article/dl/dd/h2/a[@class="job-title"]/@href`.
When I execute the spider with the command:

```shell
scrapy crawl auseek -a addsthreshold=3
```
the variable `urls`, which is used to hold the extracted values, is empty. Can anyone help me figure it out?

Here is the code:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals
from myproj.items import AdItem
import urlparse
import time

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'this is the start url function'
        log.msg("pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        print 'test element:', urls[0].encode("ascii")
        for url in urls:
            postfix = url.getAttribute('href')
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)
        return

    def parse_ad(self, response):
        print 'this is the parse_ad function'
        hxs = Selector(response)
        item = AdItem()
        log.msg("pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%y%m%d', time.localtime(time.time()))

        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('get enough website address')
        return item
```
The problem is this line:

```python
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
```
`urls` is empty when I try to print it out. I can't figure out why it doesn't work or how to correct it. Please help.
Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for:

```shell
curl http://www.seek.com.au/jobs/in-australia/ | grep job-title
```

You should try PhantomJS or Selenium instead.
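A minimal sketch of that approach: render the page in a real (headless) browser so the JavaScript-built anchors exist, then extract them with the question's XPath. Note this is an illustration, not the asker's spider: the extraction helper uses the Python 3 stdlib (`ElementTree`, `urllib.parse`) as a stand-in for Scrapy's `Selector`, and the webdriver choice is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

def extract_job_links(html, base_url):
    """Pull job-ad hrefs out of already-rendered, well-formed markup.

    Mirrors the question's XPath //article/dl/dd/h2/a[@class="job-title"]/@href
    (ElementTree only supports the simpler .//a[...] form of it).
    """
    root = ET.fromstring(html)
    anchors = root.findall('.//a[@class="job-title"]')
    # Resolve relative hrefs against the page URL, as the spider's loop intended.
    return [urljoin(base_url, a.get('href')) for a in anchors]

def fetch_rendered_html(url):
    """Load the page in a headless browser so its JavaScript runs first."""
    from selenium import webdriver
    driver = webdriver.PhantomJS()  # or webdriver.Firefox(), per taste
    try:
        driver.get(url)
        return driver.page_source  # the DOM after JavaScript has executed
    finally:
        driver.quit()
```

With this, `extract_job_links(fetch_rendered_html(start_url), start_url)` would return the list that `hxs.xpath(...).extract()` currently leaves empty.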
After examining the network requests in Chrome, the job listings appear to have originated from this JSONP request. It should be easy to retrieve whatever you need from it.
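A JSONP response is just JSON wrapped in a callback call, e.g. `someCallback({...});`, so it can be unwrapped without a browser at all. A small sketch, assuming a generic wrapper (the actual seek.com.au endpoint and callback name are not reproduced here):

```python
import json
import re

def parse_jsonp(body):
    """Strip a callback(...) wrapper and parse the JSON payload inside."""
    # Match "callbackName( <payload> );" where the name may be dotted ($.cb, ns.fn).
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', body, re.DOTALL)
    if not match:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))
```

You would fetch the JSONP URL seen in Chrome's network panel (with `requests` or `urllib`), pass the body through `parse_jsonp`, and read the job URLs straight out of the resulting dict, skipping the HTML scraping entirely.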