
How To Use Scrapy For Amazon.com Links After "next" Button?

I am relatively new to Python and Scrapy. I'm trying to scrape the links in 'Customers who bought this item also bought'. For example: http://www.amazon.com/Confessions-Economic-Hi

Solution 1:

So I understand you were able to scrape these "Customers Who Bought This Item Also Bought" product details. As you probably saw, these are within a ul in a div with class "shoveler-content":

<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
    <a class="back-button" onclick="return false;" style="" href="#Back">
    <div class="shoveler-content">
        <ul tabindex="-1">
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
                <div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
                ...
                </div>
            </li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
        </ul>
    </div>
    <a class="next-button" onclick="return false;" style="" href="#Next"><span class="auiTestSprite s_shvlNext">...</span></a>
</div>
</div>

If you watch the network activity in your browser of choice (via Firebug or Chrome's Inspect tool) while clicking the "next" button to load more suggested products, you'll see an AJAX query to this sort of URL:

http://www.amazon.com
    /gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
    id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
    &pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
    &shovelerName=purchase

(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)

The id query argument holds a list of ASINs, which are the next suggested products. Why 12 ASINs when only 6 are displayed? Probably some in-page caching for the next "next" click the user is likely to make.
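As a rough sketch, the query URL above could be rebuilt from a list of ASINs as follows. The function name build_shoveler_url and its keyword arguments are illustrative only; the path and parameter values are copied from the sample URL and may well differ for other product categories.

```python
from urllib.parse import urlencode

# Endpoint path copied from the sample AJAX URL above.
BASE = ("http://www.amazon.com/gp/product/features/similarities/"
        "shoveler/cell-render.html/ref=pd_sim_kstore")


def build_shoveler_url(asins, pos, ref_tag="pd_sim_kstore",
                       wdg="ebooks_display_on_website",
                       shoveler_name="purchase"):
    """Build the AJAX URL for one batch of ASINs (hypothetical helper)."""
    query = urlencode(
        {
            "id": ",".join(asins),   # comma-separated ASIN list
            "pos": pos,              # position value seen in the sample URL
            "refTag": ref_tag,
            "wdg": wdg,
            "shovelerName": shoveler_name,
        },
        safe=",",  # keep the commas in the id list unescaped
    )
    return BASE + "?" + query


if __name__ == "__main__":
    print(build_shoveler_url(["B00261OOWQ", "B003XQEVUI"], pos=7))
```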

What do you get back from this AJAX query? Still within your browser's inspect tool, you'll see the response is of type application/json, and the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:

<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
    <a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title">
        <div class="product-image"><img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" /></div>
        Home Game: An Accidental Guide to Fatherhood
    </a>
    <div class="byline"><span class="carat">&#8250</span><a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a></div>
    <div class="rating-price">
        <span class="rating-stars"><span class="crAvgStars" style="white-space:no-wrap;"><span class="asinReviewsSummary" name="B00261OOWQ"><a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7"><span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars"><span>4.1 out of 5 stars</span></span></a>&nbsp;</span>
            (<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
        </span></span>
    </div>
    <div class="binding-platform"> Kindle Edition </div>
    <div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>

So you basically get the same markup as in the original page's suggested-products section, i.e. what sits in each <li> of <div class="shoveler-content"><ul>.
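A minimal sketch of decoding such a response, assuming the body really is a JSON array of HTML strings shaped like the snippet above. FaceoutParser and parse_shoveler_response are made-up names, and only the ASIN and the product link are pulled out here; it sticks to the standard library, though Scrapy selectors or BeautifulSoup would do the same job.

```python
import json
from html.parser import HTMLParser


class FaceoutParser(HTMLParser):
    """Collects the ASIN and product link from one HTML snippet."""

    def __init__(self):
        super().__init__()
        self.asin = None
        self.link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The outer <div> carries the ASIN in its data-asin attribute.
        if tag == "div" and self.asin is None and "data-asin" in attrs:
            self.asin = attrs["data-asin"]
        # The product link is the <a> with class "sim-img-title".
        elif (tag == "a" and self.link is None
              and attrs.get("class") == "sim-img-title"):
            self.link = attrs.get("href")


def parse_shoveler_response(body):
    """Decode the JSON array and parse each HTML snippet in it."""
    products = []
    for snippet in json.loads(body):
        parser = FaceoutParser()
        parser.feed(snippet)
        parser.close()
        products.append({"asin": parser.asin, "link": parser.link})
    return products
```

The links in the snippets are relative, so they would still need to be joined with http://www.amazon.com before being followed.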

But how do you get the ASIN codes to put in the AJAX query's id parameter?

Well, in the product page, you'll notice this section:

<div id="purchaseSimsData" class="sims-data" style="display:none" data-baseAsin="B005CRQ2OE" data-featureId="pd_sim" data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore" data-wdg="ebooks_display_on_website" data-widgetName="purchase">
    B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
    B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
    B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
    B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
    B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
    B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
    B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
    B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
    B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
    B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
    B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
    B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
    B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
    B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>

which looks like the ASINs of all the suggested products.
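Pulling that list out of the page could look roughly like this. It is a sketch that assumes the hidden div keeps the id purchaseSimsData and contains no nested elements, as in the sample above; SimsDataParser and extract_asins are illustrative names.

```python
from html.parser import HTMLParser


class SimsDataParser(HTMLParser):
    """Collects the text inside <div id="purchaseSimsData">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "purchaseSimsData":
            self._inside = True

    def handle_endtag(self, tag):
        # Assumes no nested <div> inside the data div.
        if tag == "div" and self._inside:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def asins(self):
        text = "".join(self._chunks)
        # The ASINs are comma-separated, with newlines and indentation.
        return [part.strip() for part in text.split(",") if part.strip()]


def extract_asins(page_html):
    """Return the ASIN list found in the product page's HTML."""
    parser = SimsDataParser()
    parser.feed(page_html)
    parser.close()
    return parser.asins
```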

Therefore, I suggest you emulate successive AJAX queries to get the suggested products, 12 ASINs at a time, decode each response using the json package, and then parse each HTML snippet to extract the product info you want.
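The batching itself might be sketched like this. In the sample URL, pos=7 goes with an id list that matches entries 7 to 18 of the hidden div, so this assumes pos is the 1-based position of the first ASIN in each batch and that the first 6 products are already on the page. Continuing with pos=19, 31, and so on is a guess, not something observed. The name iter_batches is made up for the example.

```python
def iter_batches(asins, already_shown=6, batch_size=12):
    """Yield (pos, batch) pairs for successive AJAX queries.

    pos is assumed to be the 1-based index of the batch's first ASIN,
    which matches pos=7 in the sample URL; later values are a guess.
    """
    for start in range(already_shown, len(asins), batch_size):
        yield start + 1, asins[start:start + batch_size]


if __name__ == "__main__":
    # Placeholder ASINs standing in for the list from the hidden div.
    sample = ["ASIN%02d" % i for i in range(1, 31)]
    for pos, batch in iter_batches(sample):
        print(pos, ",".join(batch))
```

Each (pos, batch) pair would supply the pos and id query arguments of one request. In a Scrapy spider, that would mean yielding one Request per batch and handling the JSON in its callback.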

Solution 2:

I would recommend avoiding Scrapy, especially since you're a beginner. Use the awesome Requests module for downloading pages (https://github.com/kennethreitz/requests) and BeautifulSoup for parsing them (http://www.crummy.com/software/BeautifulSoup/).
