
How to design a Web Crawler


Gainlo link : http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/

Features:

  1. Crawl a given URL pool. (How do we identify the initial pool of URLs?)
  2. Parse the HTML page returned by a URL.
  3. Add newly discovered URLs to the URL pool.
  4. Index the HTML data to enable easy retrieval. (Can happen in an offline pipeline.)
  5. Throttle the frequency of visits to URLs.
  6. Avoid unintentionally visiting the same URL repeatedly.
  7. Avoid infinite loops, since pages often link to each other.
  8. Decide the criteria for crawling a URL.

Estimation: Just plugging in some numbers here to get a feel for how much storage is needed and how much time is available to crawl each URL. If they are unrealistic, we can adjust them to get a more realistic estimate.

Storage:

  Number of new URLs crawled per day = 10 million
  Number of new URLs crawled per year = 10 million * 365 ≈ 3.65 billion
  Number of new URLs crawled in 10 years ≈ 36.5 billion
  Indexing and storing data per URL = 100 MB

Storage needed for indexing/storing 36.5 billion URLs over 10 years = 100 * 10^6 * 36.5 * 10^9 bytes ≈ 3.65 * 10^18 bytes ≈ 3,650 PB (1 PB = 10^15 bytes)

Time taken to crawl a URL: Time to crawl one URL, in seconds = (24 * 60 * 60) / (10 * 10^6) = 86400 / 10^7 = 8.64 * 10^-3 seconds = 8640 microseconds.
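A quick Python sanity check of the arithmetic above (the crawl rate and the per-URL storage size are the assumed inputs from this estimate):

```python
URLS_PER_DAY = 10_000_000           # assumed crawl rate from the estimate above
BYTES_PER_URL = 100 * 10**6         # assumed 100 MB of indexed/stored data per URL

urls_in_10_years = URLS_PER_DAY * 365 * 10          # ~36.5 billion URLs
storage_bytes = urls_in_10_years * BYTES_PER_URL
print(storage_bytes / 10**15, "PB")                 # -> 3650.0 PB (~3.65 exabytes)

seconds_per_day = 24 * 60 * 60
budget = seconds_per_day / URLS_PER_DAY             # serial time budget per URL
print(budget * 10**6, "microseconds")               # -> 8640.0 microseconds
```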

Design Goals:

  1. Latency - the lower the better, since we want changes to existing URLs and newly added URLs to be crawled/indexed as soon as possible.
  2. Consistency - it is fine if one server has fresher results for a URL than another, as long as eventual consistency is guaranteed.
  3. Availability - a highly available system increases the speed at which we can crawl websites.

High Level Design: APIs:

  1. URLResult getUnscannedURLsFromPool();
  2. CrawlResults CrawlAUrl(url_id);
  3. AddUrlToPool(url_id);
  4. bool IsUrlAlreadyScanned(url_id);
  5. void ParseURLdata(UrlData);
  6. bool ShouldUrlBeCrawled(url_id);
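A rough sketch of how these APIs could look behind a single service interface, assuming an in-memory URL pool and scanned set; the class name, Python types, and method bodies are illustrative, not part of the original design:

```python
from typing import List, Optional

class CrawlerAPI:
    """In-memory sketch of the APIs listed above."""

    def __init__(self):
        self.url_pool: List[str] = []   # URLs waiting to be crawled
        self.scanned: set = set()       # url_ids that have already been crawled

    def get_unscanned_urls_from_pool(self, limit: int = 100) -> List[str]:
        return [u for u in self.url_pool if u not in self.scanned][:limit]

    def crawl_a_url(self, url_id: str) -> Optional[str]:
        # Fetch and return the raw HTML for url_id (see the message-flow sketch below).
        ...

    def add_url_to_pool(self, url_id: str) -> None:
        if url_id not in self.url_pool and not self.is_url_already_scanned(url_id):
            self.url_pool.append(url_id)

    def is_url_already_scanned(self, url_id: str) -> bool:
        return url_id in self.scanned

    def parse_url_data(self, url_data: str) -> None:
        # Extract links and words, then index them (see the parsing/indexing sketches below).
        ...

    def should_url_be_crawled(self, url_id: str) -> bool:
        # Placeholder policy: crawl anything not yet scanned.
        return not self.is_url_already_scanned(url_id)
```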

Message Flow: On a single machine, here is how the flow would look:

  1. Get a URL to scan.
  2. Issue an HTTP GET request to fetch the content of the website/URL.
  3. Parse the contents of the result: (a) index the data, (b) add links to other URLs to the URL pool.
  4. Continue crawling, i.e., go to step 1.
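A minimal single-machine version of this loop might look like the sketch below; `parse_page` and `add_to_index` are assumed helpers (sketched later in the parsing and storage sections), and the page limit only keeps the example bounded:

```python
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    pool = deque(seed_urls)      # URL pool seeded with the initial URLs
    seen = set(seed_urls)
    while pool and max_pages > 0:
        url = pool.popleft()                                   # 1. get a URL to scan
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", "replace")  # 2. HTTP GET the content
        except Exception:
            continue                                           # skip unreachable URLs
        links, words = parse_page(url, html)                   # 3. parse the result
        add_to_index(url, words)                               # 3a. index the data
        for link in links:                                     # 3b. add new URLs to the pool
            if link not in seen:
                seen.add(link)
                pool.append(link)
        max_pages -= 1                                         # 4. continue crawling (step 1)
```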

Deep Dive: Application Server: To achieve the design goals (latency/availability), we can have multiple application servers crawling URLs in parallel. A load balancer can be used to distribute the load across these servers.

DataBase/Storage:

  1. We need to keep track of the URL pool.
  2. Store the indexed data from these websites. Simple example of indexing: "Hello" was seen in URL 1 (5 times), URL 2 (2 times), and URL 3 (1 time); "World" was seen in URL 3 (1 time), URL 4 (10 times), and URL 5 (20 times). Now when someone searches for "Hello World", we return URL 3, the only URL containing both words.
  3. Database schema: keyword -> list of (URL ID, frequency) pairs.
  4. NoSQL vs. SQL: given the scale and the limited relationships, we can go with a NoSQL DB for this.
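A minimal in-memory sketch of the keyword -> (URL ID, frequency) index and the intersection query described in the example above; a real system would keep these postings in the NoSQL store rather than a Python dict:

```python
from collections import Counter, defaultdict

index = defaultdict(dict)           # keyword -> {url_id: frequency}

def add_to_index(url_id, words):
    for word, freq in Counter(w.lower() for w in words).items():
        index[word][url_id] = freq

def search(query):
    # Return the URL IDs that contain every word in the query.
    postings = [set(index.get(w.lower(), {})) for w in query.split()]
    return set.intersection(*postings) if postings else set()

# The example from the text above:
add_to_index(1, ["Hello"] * 5)
add_to_index(2, ["Hello"] * 2)
add_to_index(3, ["Hello", "World"])
add_to_index(4, ["World"] * 10)
add_to_index(5, ["World"] * 20)
print(search("Hello World"))        # -> {3}
```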

Frequency of Crawling: The frequency of crawling a given website could be based on several factors:

  1. The crawling frequency can be decided based on the robots.txt file of a given site (see the sketch after this list).
  2. If a website has fresh content, it can be crawled again. How can we find out whether a website has fresh content? We could probably use the site's RSS feeds to detect fresh content.
  3. If a website is popular, that could also be a factor in crawling it more frequently.
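For point 1, Python's standard library can read robots.txt, including the nonstandard Crawl-delay directive when a site provides one. A sketch, where the "MyCrawler" user agent and the 1-second default delay are assumptions:

```python
from urllib.robotparser import RobotFileParser

def crawl_policy(site):
    rp = RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    rp.read()                                    # fetch and parse the site's robots.txt
    allowed = rp.can_fetch("MyCrawler", site)    # is our user agent allowed to crawl?
    delay = rp.crawl_delay("MyCrawler")          # Crawl-delay in seconds, or None if absent
    return allowed, delay or 1.0                 # fall back to an assumed 1 s politeness delay

# allowed, delay = crawl_policy("https://example.com")
```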

Data Deduplication (DeDup) (Bloom filter approach): Since several different servers are crawling different websites, they may discover the same URLs multiple times and try to add them to the URL pool. How do we ensure the URL pool doesn't contain the same URL multiple times? Essentially we need a HashSet that keeps a unique list of URLs, but at the scale of URLs we are dealing with, a HashSet of that size could require a lot of memory. A Bloom filter can be used for this purpose instead: it tells us whether a URL is definitely not in the pool or probably in the pool, and the probability of incorrectly concluding that a URL has already been visited (a false positive) can be kept very low by sizing the filter appropriately.

Bloom Filter (video): https://www.youtube.com/watch?v=bgzUdBVr5tE
Bloom Filter (Wikipedia): https://en.wikipedia.org/wiki/Bloom_filter
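A minimal Bloom filter sketch: a fixed-size bit array with k bit positions per URL derived from SHA-256. The bit-array size and hash count below are arbitrary; a real deployment would size them from the expected number of URLs and the target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions for this URL from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        # False -> definitely never added; True -> probably added (small false-positive rate).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/")
print(seen.might_contain("http://example.com/"))   # True
print(seen.might_contain("http://example.org/"))   # almost certainly False
```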

Parsing HTML data: Parsing and indexing of HTML data can happen in an offline pipeline. This should not block the crawling of websites.
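The offline parser mainly needs to pull out links (for the URL pool) and text tokens (for the indexer). A sketch using the standard library's html.parser; `parse_page` here is the assumed helper referenced in the message-flow sketch above:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Collects outgoing links and visible text tokens from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.words.extend(data.split())

def parse_page(url, html):
    parser = PageParser(url)
    parser.feed(html)
    return parser.links, parser.words   # links feed the URL pool; words feed the indexer
```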