Issues with my Crawl Spider and Pagination: Only Values from the First Page are Extracted
My task is to take stock updates from one of our suppliers: www.sportsshoes.com.
The issue I'm facing is that, despite the crawl spider visiting each page of a category, it only returns the data from the first page. The same happens if I try to scrape each page independently, i.e. if I tell it to scrape the third page of a category, it still returns the results of the first page.
My code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from sportshoes.items import SportshoesItem
import urlparse
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = "tennis"
    allowed_domains = ["sportsshoes.com"]
    start_urls = ["http://www.sportsshoes.com/products/shoe/tennis/",
                  "http://www.sportsshoes.com/products/shoe/tennis#page=2",
                  "http://www.sportsshoes.com/products/shoe/tennis#page=3"]

    rules = (Rule(SgmlLinkExtractor(allow=(),
                                    restrict_xpaths=('//div[@class="product-detail"]',)),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//html")
        items = []
        for titles in titles:
            item = SportshoesItem()
            item["productname"] = titles.select("//h1[@id='product_title']/span/text()").extract()
            item["size"] = titles.select('//option[@class="sizeoption"]/text()').extract()
            item["sku"] = titles.select("//div[@id='product_ref']/strong/text()").extract()
            items.append(item)
        return items
PS: I had also used this method:

rules = (Rule(SgmlLinkExtractor(allow=(),
                                restrict_xpaths=('//div[@class="paginator"]',)),
              follow=True),
         Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="hproduct product"]',)),
              callback="parse_items", follow=True),)
Those #page=2 and #page=3 links are considered to be the same page by Scrapy: it interprets them as in-page named anchor references, so they are not downloaded twice. They do mean something in a browser though, because of JavaScript.
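As a quick illustration (not part of the original answer), you can see this with w3lib, one of Scrapy's URL helper dependencies: the fragment is simply dropped when the request URL is canonicalized, so both URLs collapse to the same request.

from w3lib.url import canonicalize_url

# The fragment (#page=...) is never sent to the server and is stripped here
print(canonicalize_url("http://www.sportsshoes.com/products/shoe/tennis#page=2"))
print(canonicalize_url("http://www.sportsshoes.com/products/shoe/tennis#page=3"))
# both print: http://www.sportsshoes.com/products/shoe/tennis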
If you inspect what happens in your browser's developer tools when you click on the "next" page links, you'll notice AJAX calls to http://www.sportsshoes.com/ajax/products/search.php.
They are HTTP POST requests, with parameters similar to the following:
page:3
search-option[show]:20
search-option[sort]:relevency
q:
na:ytowont9
sa:ytoyontpoja7ytoyontzojm6imtlesi7czoxmzoichjvzhvjdf9jbgfzcyi7czo2oij2ywx1zteio3m6ndoic2hvzsi7fwk6mtthoji6e3m6mzoia2v5ijtzoju6innwb3j0ijtzojy6inzhbhvlmsi7czo2oij0zw5uaxmio319
aav:ytowont9
layout:undefined
The responses to these AJAX calls are XML documents embedding the HTML that contains the products of the next pages, which ends up replacing the first page's products:
<?xml version="1.0" encoding="UTF-8" ?>
<response success="true">
  <message>success</message>
  <value key="html"><![CDATA[<div id="extra-data" data-extra-na="ytowont9"...
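As a sketch of what consuming such a response could look like (this is not from the original answer: parse_ajax_page is a hypothetical callback, and the link XPath is borrowed from the question's second rule), the embedded HTML can be pulled out of the CDATA block and fed to a second selector:

import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

def parse_ajax_page(self, response):
    # The XML wraps the products' markup inside <value key="html"><![CDATA[...]]>
    xxs = XmlXPathSelector(response)
    embedded_html = xxs.select('//value[@key="html"]/text()').extract()[0]

    # Run a regular HTML selector over the extracted fragment
    hxs = HtmlXPathSelector(text=embedded_html)
    for href in hxs.select('//div[@class="hproduct product"]//a/@href').extract():
        yield Request(urlparse.urljoin(response.url, href),
                      callback=self.parse_items)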
You will have to emulate these AJAX calls to get the data for the other pages. Note that these POST requests contain a special header: X-Requested-With: XMLHttpRequest
To tell Scrapy to send POST requests, you can use the "method" parameter when creating Request objects.
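For example, a minimal sketch of such a request (the parameter values mirror the ones observed above; request_search_page is a hypothetical helper, parse_ajax_page is the hypothetical callback sketched earlier, and the exact form fields are assumptions taken from the browser's network tab):

import urllib
from scrapy.http import Request

def request_search_page(self, page_number):
    formdata = {
        "page": str(page_number),
        "search-option[show]": "20",
        "search-option[sort]": "relevency",
        "q": "",
        # "na", "sa", "aav" and "layout" would be filled in with the values
        # captured in the browser's developer tools
    }
    return Request(
        url="http://www.sportsshoes.com/ajax/products/search.php",
        method="POST",
        body=urllib.urlencode(formdata),
        headers={
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        callback=self.parse_ajax_page,
    )

Alternatively, scrapy.http.FormRequest builds the POST body and Content-Type header for you from a formdata dict; you would still add the X-Requested-With header yourself.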