Implementing Web Scraping using Scrapy
Objective: Scrape Amazon Baby Clothing Sets product data and store it in MongoDB, using MongoDB Compass to view it
- Used Scrapy, pywin32, scrapy-user-agents and PyMongo to create a spider for scraping
- Applied HTML concepts and the SelectorGadget Chrome extension to select similarly placed data collectively
- Bypassed restrictions with user agents and proxies, defined an item pipeline, and coded extraction across multiple pages
- Pycharm
- Python Version - 3.10
- MongoDB
- Git and GitHub
- Chrome Browser
The venv folder is created automatically in AmazonCrawling when a virtual environment is selected while creating the project, and the .idea folder just stores project-related settings.
HTML and Scraping Basics
- Create a new project in PyCharm (with a venv)
- Install the packages (only once; upgrade them if already installed) - scrapy, pywin32 and scrapy-user-agents
- scrapy does the actual scraping
- pywin32 integrates Python with Windows (needed for Scrapy on Windows)
- scrapy-user-agents creates fake user agents for every request (not needed when scraping a site for the first time)
- Go to the terminal -> down arrow on the right -> Command Prompt -> the venv should be active
- Execute the commands below
- scrapy startproject amazoncrawl (only needed the first time; the project is already started when reworking on it)
- cd amazoncrawl
- scrapy genspider amazon_spiders amazon.com (only needed the first time, to create the spider for crawling) (any name and website can be used)
- Go to items.py and define a field for each variable to be scraped, with a # comment giving each variable's name
- Get the SelectorGadget extension on Chrome and enable it
- Go to amazon_spiders.py -> in the spider class, after name, write start_urls = ['url'] -> remove allowed_domains -> copy each XPath from SelectorGadget and paste it -> assign items['varname'] = varname, then yield items and pass (a sketch of items.py and the spider follows this list)
- Execute the below commands in the terminal
- cls (clears screen)
- scrapy crawl amazon_spiders (always needed to see the scraped data) (when reworking on the project, run cd amazoncrawl before this)
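A minimal sketch of what items.py and amazon_spiders.py might look like after these steps; the class names follow Scrapy's defaults for this project, while the field names (product_name, price), the start URL and the XPaths are illustrative assumptions, not necessarily the exact ones used here:

# items.py - one Field per variable to scrape
import scrapy

class AmazoncrawlItem(scrapy.Item):
    product_name = scrapy.Field()  # product title
    price = scrapy.Field()         # listed price

# spiders/amazon_spiders.py - spider skeleton
import scrapy
from ..items import AmazoncrawlItem

class AmazonSpidersSpider(scrapy.Spider):
    name = 'amazon_spiders'
    start_urls = ['https://www.amazon.in/s?k=clothing&i=apparel']  # assumed listing URL

    def parse(self, response):
        items = AmazoncrawlItem()
        # placeholder XPaths - replace with the ones copied from SelectorGadget
        items['product_name'] = response.xpath('//span[@class="a-size-base-plus"]/text()').extract()
        items['price'] = response.xpath('//span[@class="a-price-whole"]/text()').extract()
        yield items
        pass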
Visit https://www.google.com/robots.txt (or any site's /robots.txt) to check what extraction the website allows
- Bypass restrictions
- Using the Googlebot user agent (Googlebot):
- Search for it and open the whatismybrowser.com result (any one of the listed strings works) (this method can extract any number of pages)
- Add this in settings.py under USER_AGENT:
USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
- Using user agents (Different users):
- Install scrapy-user-agents.
- With this, rotating through 2200 user agents, you do not get disallowed from accessing links after scrolling through once (will extract 25-35 pages at most)
- Add in settings.py under middlewares:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
- Use proxies (Different IP addresses):
- Similar to the second method (rotating user agents).
- Install scrapy-proxy-pool.
- In settings.py, add PROXY_POOL_ENABLED = True after ROBOTSTXT_OBEY = True, and under the middlewares add:
DOWNLOADER_MIDDLEWARES = {
# ...
'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
# ...
}
- Run scrapy crawl amazon_spiders again
- Create a data -> db folder in the C drive (i.e., C:\data\db)
- Open mongod from program files and then close it after some time
- Check that some files are there in the db folder
- Open MongoDB Compass and click connect
- Install pymongo package
- pymongo connects Python to MongoDB
- Uncomment ITEM_PIPELINES in settings.py (it is commented out by default) so the pipeline is enabled
- In pipelines.py,
- Inside __init__ of the pipeline class, set up the connection (using the username and port number, or a connection string), the database and the collection
- In the process_item function of the class, add each item to the collection as a dictionary document (a sketch follows this list)
- Run scrapy crawl amazon_spiders again
- Open MongoDB Compass and refresh
- The database will be there
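A minimal sketch of the pipeline, assuming the default generated class name AmazoncrawlPipeline, a local MongoDB instance on the default port 27017, and illustrative database/collection names (amazon_db, products); the matching ITEM_PIPELINES entry in settings.py is shown as well:

# pipelines.py - store every scraped item in MongoDB via pymongo
import pymongo

class AmazoncrawlPipeline:
    def __init__(self):
        # connection to a local MongoDB instance (or pass a full connection string)
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['amazon_db']        # database name (assumed)
        self.collection = self.db['products']     # collection name (assumed)

    def process_item(self, item, spider):
        # insert the item into the collection as a plain dictionary document
        self.collection.insert_one(dict(item))
        return item

# settings.py - the pipeline must be enabled here
ITEM_PIPELINES = {
    'amazoncrawl.pipelines.AmazoncrawlPipeline': 300,
}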
For scraping multiple pages, in the spider file, add page_number = 2 as the second line of the spider class (right after name) and add the following at the end of the parse() function (after yield items and before pass):
next_page = 'https://www.amazon.in/s?k=clothing&i=apparel&rh=n%3A1571271031%2Cn%3A1953173031&dc&page=' + str(AmazonSpidersSpider.page_number) + '&qid=1648063354&rnid=3576079031&ref=sr_pg_' + str(AmazonSpidersSpider.page_number)
if AmazonSpidersSpider.page_number <= 334:
    yield response.follow(next_page, callback=self.parse)
    AmazonSpidersSpider.page_number += 1
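To make the placement concrete, here is a sketch of the spider class with pagination added; the extraction lines are abbreviated, the start URL is assumed, and the class/field names follow the earlier sketch:

import scrapy
from ..items import AmazoncrawlItem

class AmazonSpidersSpider(scrapy.Spider):
    name = 'amazon_spiders'
    page_number = 2  # class attribute: the next results page to request
    start_urls = ['https://www.amazon.in/s?k=clothing&i=apparel']  # assumed listing URL

    def parse(self, response):
        items = AmazoncrawlItem()
        # ... items['varname'] = response.xpath(...) assignments as before ...
        yield items
        # build the URL of the next results page and follow it until page 334
        next_page = ('https://www.amazon.in/s?k=clothing&i=apparel&rh=n%3A1571271031%2Cn%3A1953173031&dc&page='
                     + str(AmazonSpidersSpider.page_number)
                     + '&qid=1648063354&rnid=3576079031&ref=sr_pg_'
                     + str(AmazonSpidersSpider.page_number))
        if AmazonSpidersSpider.page_number <= 334:
            yield response.follow(next_page, callback=self.parse)
            AmazonSpidersSpider.page_number += 1
        pass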
In the Command Prompt terminal, run:
- git init
- git add . && git commit -m "initial commit" (May ask for email and username if committing for the first time from your computer) (Will see n files changed)
- git branch -M main
- git remote add origin https://github.com/rashika-dabas/Amazon-Crawling.git
- git push -u origin main (May ask for your GitHub password)