Issues with my Crawl Spider and Pagination: Only Values from the First Page are Extracted
My task is to take stock updates from one of our suppliers: www.sportsshoes.com.
The issue I'm facing is that, despite the crawl spider visiting each page of a category, it only returns the data from the first page. The same happens if I try to scrape each page independently, i.e. if I tell it to scrape the third page of a category, it still returns the results of the first page.
My code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from sportshoes.items import SportshoesItem
import urlparse
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = "tennis"
    allowed_domains = ["sportsshoes.com"]
    start_urls = ["http://www.sportsshoes.com/products/shoe/tennis/",
                  "http://www.sportsshoes.com/products/shoe/tennis#page=2",
                  "http://www.sportsshoes.com/products/shoe/tennis#page=3"]

    rules = (Rule(SgmlLinkExtractor(allow=(),
                                    restrict_xpaths=('//div[@class="product-detail"]',)),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//html")
        items = []
        for titles in titles:
            item = SportshoesItem()
            item["productname"] = titles.select("//h1[@id='product_title']/span/text()").extract()
            item["size"] = titles.select('//option[@class="sizeoption"]/text()').extract()
            item["sku"] = titles.select("//div[@id='product_ref']/strong/text()").extract()
            items.append(item)
        return items
PS: I had also used this method:

rules = (Rule(SgmlLinkExtractor(allow=(),
                                restrict_xpaths=('//div[@class="paginator"]',)),
              follow=True),
         Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="hproduct product"]',)),
              callback="parse_items", follow=True),)
Those #page=2 and #page=3 links are considered to be the same page by Scrapy: it interprets them as in-page named anchor references, so they are not downloaded twice. They do mean something in a browser though, because of JavaScript.
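As a quick illustration (not part of the original answer), you can see this with w3lib, one of Scrapy's URL helper dependencies: the fragment is simply dropped when the request URL is canonicalized, so both URLs collapse to the same request.

from w3lib.url import canonicalize_url

# The fragment (#page=...) is never sent to the server and is stripped here
print(canonicalize_url("http://www.sportsshoes.com/products/shoe/tennis#page=2"))
print(canonicalize_url("http://www.sportsshoes.com/products/shoe/tennis#page=3"))
# both print: http://www.sportsshoes.com/products/shoe/tennis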
If you inspect what happens in your browser's developer tools when you click on the "next" page links, you'll notice AJAX calls to http://www.sportsshoes.com/ajax/products/search.php.
They are HTTP POST requests, with parameters similar to the following:
page:3
search-option[show]:20
search-option[sort]:relevency
q:
na:ytowont9
sa:ytoyontpoja7ytoyontzojm6imtlesi7czoxmzoichjvzhvjdf9jbgfzcyi7czo2oij2ywx1zteio3m6ndoic2hvzsi7fwk6mtthoji6e3m6mzoia2v5ijtzoju6innwb3j0ijtzojy6inzhbhvlmsi7czo2oij0zw5uaxmio319
aav:ytowont9
layout:undefined
The responses to these AJAX calls are XML documents embedding the HTML that contains the products of the next pages, which ends up replacing the first page's products:
<?xml version="1.0" encoding="UTF-8" ?>
<response success="true">
  <message>success</message>
  <value key="html"><![CDATA[<div id="extra-data" data-extra-na="ytowont9"...
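As a sketch of what consuming such a response could look like (this is not from the original answer: parse_ajax_page is a hypothetical callback, and the link XPath is borrowed from the question's second rule), the embedded HTML can be pulled out of the CDATA block and fed to a second selector:

import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

def parse_ajax_page(self, response):
    # The XML wraps the products' markup inside <value key="html"><![CDATA[...]]>
    xxs = XmlXPathSelector(response)
    embedded_html = xxs.select('//value[@key="html"]/text()').extract()[0]

    # Run a regular HTML selector over the extracted fragment
    hxs = HtmlXPathSelector(text=embedded_html)
    for href in hxs.select('//div[@class="hproduct product"]//a/@href').extract():
        yield Request(urlparse.urljoin(response.url, href),
                      callback=self.parse_items)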
You will have to emulate these AJAX calls to get the data for the other pages. Note that these POST requests contain a special header: X-Requested-With: XMLHttpRequest
To tell Scrapy to send POST requests, you can use the "method" parameter when creating Request objects.
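For example, a minimal sketch of such a request (the parameter values mirror the ones observed above; request_search_page is a hypothetical helper, parse_ajax_page is the hypothetical callback sketched earlier, and the exact form fields are assumptions taken from the browser's network tab):

import urllib
from scrapy.http import Request

def request_search_page(self, page_number):
    formdata = {
        "page": str(page_number),
        "search-option[show]": "20",
        "search-option[sort]": "relevency",
        "q": "",
        # "na", "sa", "aav" and "layout" would be filled in with the values
        # captured in the browser's developer tools
    }
    return Request(
        url="http://www.sportsshoes.com/ajax/products/search.php",
        method="POST",
        body=urllib.urlencode(formdata),
        headers={
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        callback=self.parse_ajax_page,
    )

Alternatively, scrapy.http.FormRequest builds the POST body and Content-Type header for you from a formdata dict; you would still add the X-Requested-With header yourself.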