django - Scrapy status page for failed spiders
I have made a spider to crawl news. Here is the code:
    # Import paths below are for the old Scrapy 0.x API that this code uses.
    from scrapy.contrib.spiders import XMLFeedSpider
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    # NewsItem and escape come from the asker's project; their imports were not shown.


    class AbcSpider(XMLFeedSpider):
        handle_httpstatus_list = [404, 500]
        name = 'abctv'
        allowed_domains = ['abctvnepal.com.np']
        start_urls = [
            'http://www.abctvnepal.com.np',
        ]

        def parse(self, response):
            # 404 and 500 responses reach this callback because of handle_httpstatus_list.
            if response.status in self.handle_httpstatus_list:
                return Request(url="http://google.com", callback=self.after_404)

            hxs = HtmlXPathSelector(response)  # XPath selector
            sites = hxs.select('//div[@class="marlr respo-left"]/div/div/h3')
            items = []
            for site in sites:
                item = NewsItem()
                item['title'] = escape(''.join(site.select('a/text()').extract())).strip()
                item['link'] = escape(''.join(site.select('a/@href').extract())).strip()
                item['description'] = escape(''.join(site.select('p/text()').extract()))
                # Follow the article link and carry the partly filled item along in meta.
                request = Request(item['link'], meta={'item': item}, callback=self.parse_detail)
                items.append(request)
            return items

        def parse_detail(self, response):
            item = response.meta['item']
            sel = HtmlXPathSelector(response)
            details = sel.select('//div[@class="entry"]/p/text()').extract()
            detail = ''.join(details)
            item['details'] = detail
            item['location'] = detail.split(",", 1)[0]
            item['published_date'] = (detail.split(" ", 1)[1]).split(" ", 1)[0] + ' ' + ((detail.split(" ", 1)[1]).split(" ", 1)[1]).split(" ", 1)[0]
            return item

        def after_404(self, response):
            print(response.url)
What I want is this: if the spider doesn't work or doesn't crawl, I want to show a status page saying that the spider is not working. How can I do that? How can I make a status page?
I have integrated the spider with Django. Can I make a URL in Django for the status display? If yes, how?
I can give you the steps to take, without providing clear examples (better than links anyway):
- Create a Django project.
- Create a single view in the project.
- This single view has to be able to connect to the web crawler somehow :p. There are several ways of doing it:
  - Write status updates to a database (you can include your Django project in the Python path and gain access to the Django ORM in the crawler). You'll have to create models to hold the data, which is not hard.
  - Use some kind of message queue (you might want to check out http://www.celeryproject.org/). This might be the most complicated option, as it requires setting up and configuring different software.
  - Or check whether the process is running by executing a shell command in the view and confirming whether a process with the correct PID exists.
- Return data from the view based on whichever of the three approaches above you chose.
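As a rough illustration of the last option (checking for the crawler's PID), here is a minimal sketch. It swaps the shell command for os.kill(pid, 0), which asks the operating system whether a process exists without sending it a signal; this works on POSIX systems only. It also assumes the crawler writes its PID to a file when it starts. That PID file and the function names is_process_running and spider_status are hypothetical and not part of the original question.

```python
import errno
import os


def is_process_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks only;
        # no signal is actually delivered to the process.
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        # Any other error (typically ESRCH) means there is no such process.
        return exc.errno == errno.EPERM
    return True


def spider_status(pid_file):
    """Build the text a status view could return.

    pid_file is a hypothetical file that the crawler is assumed to
    write its own PID into when it starts.
    """
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return "spider not working (no valid PID file)"
    if is_process_running(pid):
        return "spider running (pid %d)" % pid
    return "spider not working (pid %d not found)" % pid
```

A Django view could then return something like HttpResponse(spider_status(path_to_pid_file)). Note that this only tells you the process is alive, not that it is crawling successfully; for that, the database approach (recording the time of the last successful crawl) is probably a better fit.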