django - scrappy status page for failed spiders -

- February 15, 2015

i have made spider crawl news , here code

class abcspider(xmlfeedspider): handle_httpstatus_list = [404, 500] name = 'abctv' allowed_domains = ['abctvnepal.com.np'] start_urls = [     'http://www.abctvnepal.com.np', ]  def parse(self, response):      if response.status in self.handle_httpstatus_list:         return request(url="http://google.com", callback=self.after_404)      hxs = htmlxpathselector(response) # xpath selector     sites = hxs.select('//div[@class="marlr respo-left"]/div/div/h3')     items = []     site in sites:         item = newsitem()         item['title'] = escape(''.join(site.select('a/text()').extract())).strip()         item['link'] = escape(''.join(site.select('a/@href').extract())).strip()         item['description'] = escape(''.join(site.select('p/text()').extract()))         item = request(item['link'],meta={'item': item},callback=self.parse_detail)         items.append(item)     return items  def parse_detail(self, response):     item = response.meta['item']     sel = htmlxpathselector(response)     details = sel.select('//div[@class="entry"]/p/text()').extract()     detail = ''     piece in details:         detail = detail + piece     item['details'] = detail     item['location'] = detail.split(",",1)[0]     item['published_date'] = (detail.split(" ",1)[1]).split(" ",1)[0]+' '+((detail.split(" ",1)[1]).split(" ",1)[1]).split(" ",1)[0]          return item  def after_404(self, response):     print response.url

what want if spider dont work or dont crawl want show status page saying spider not working. how can that?? how can make status page ?? ??

i have integrated django. can make url in django status display. if yes how

i can steps take without providing clear examples (better thank links anyway)

create django project
create single view in project
this single view has able somehow connect webcrawler :p. there several ways of doing it:
write status updates database (you can include django project python path , gain access django orm in crawler). you'll have create models hold data, not hard.
you can use kind of message queue (might want check out http://www.celeryproject.org/). might complicated option requires setting , configuring different software.
or check if process running executing shell command in view , confirming if process of correct pid exists or not.
return data based on approach 4. 5. or 6 view.

Search This Blog

CSS

django - scrappy status page for failed spiders -

Comments

Post a Comment

Popular posts from this blog

qml - Is it possible to implement SystemTrayIcon functionality in Qt Quick application -

double exclamation marks in haskell -

javascript - How to get D3 Tree link text to transition smoothly? -