Home
Links to other Wiki pages: FAQ
- The first day has been all about picking the right tool for the job, and two tools became good choices to go for: CasperJS and PhantomJS.
- These experimental tools have the following pros (a minimal sketch follows this day's notes):
- They can handle the session-based nature of the given links.
- The same code can be reused for multiple URLs with minimal modification.
- I previously tried my luck with Python-based scrapers (we used these before in most of our work), but they did not handle the given URLs well, so I went for a tool with good automation features and wide forum support.
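As a rough illustration of the session-handling point above, the sketch below drives one of the report pages inside a single headless browser session; the output filename is an assumption and is only for illustration.

```js
// Minimal CasperJS sketch: the report pages carry ASP.NET session/view state,
// so every step runs inside the same browser session rather than as
// independent HTTP requests. The output filename is an assumption.
var casper = require('casper').create();

casper.start('http://tsc.gov.in/Report/Physical/RptStateWiseSelective_net.aspx?id=PHY');

casper.then(function () {
    // Session cookies and view state from the first request are reused here,
    // so follow-up interactions behave as they would for a real user.
    this.capture('statewise-report.png');
});

casper.run();
```

The same skeleton can be pointed at a different report URL by changing only the `start()` argument, which is where the reuse advantage comes from.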
- Today I am playing around with the tools, looking for a possible code model for the web pages.
- I have some doubts about how the crawlers and scrapers should work, along with some other bottlenecks that came up:
- The tag structure of the web pages could change at any time; this may call for code changes later on, or for more robust code up front.
- Copy-pasting: this is manual.
- Finalizing the structure of the flat file scraped from the pages.
- After a brief conversation with Anand, I see two probable ways the data can be obtained:
- through code, or
- through human labor (copy-pasting)
- To clarify this further, Anand has arranged a call with Amrtha tomorrow.
- Sat down for a Skype call with Amrtha and Anand this morning.
- The notes Anand took during the call can be found on the FAQ page of this repo.
- I was able to develop a basic working crawler that drills down through links and takes snapshots of the pages (a sketch follows this day's notes).
- Much of the trouble I had was with learning the CasperJS and PhantomJS frameworks.
- CasperJS is much less crude than PhantomJS as far as I can see now.
- Having problems committing the code, so I might take some time to troubleshoot this.
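A minimal sketch of what such a link-drilling snapshot crawler could look like, assuming the drill-down targets are plain anchor tags; the selector and snapshot filenames are assumptions.

```js
// Sketch of a link-drilling snapshot crawler with CasperJS.
// Assumption: the drill-down targets are ordinary <a> elements.
var casper = require('casper').create();
var links = [];

casper.start('http://tsc.gov.in/NBA/NBAHome.aspx', function () {
    // Collect the hrefs of every anchor on the landing page.
    links = this.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('a'),
            function (a) { return a.href; }
        );
    });
});

casper.then(function () {
    // Visit each collected link and capture a snapshot of the rendered page.
    links.forEach(function (url, i) {
        casper.thenOpen(url, function () {
            this.capture('snapshot-' + i + '.png');
        });
    });
});

casper.run();
```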
- Finished a few starting scripts for the web pages.
- The same scrapers cannot be used for all the web pages, as they have varying tag structures.
- The scripts are not stable, so I had to rely more on the CasperJS side than on the jQuery side.
- Targeting scrapers for the first three links today.
- I might skip the third link, as it gives weird responses sometimes.
- Still facing some trouble pushing to the repos.
- Finished the two scrapers but have yet to push them.
- Targeting scrapers for the next three links.
- Some of these have a different kind of querying mechanism, so I had to adapt features from CasperJS.
- It looks like we might need config files to supply the signatures.
- Anand suggested doing this later.
- I might push this to tomorrow afternoon.
- Finished the rest of the scrapers, up to the eighth.
- Time to stabilize them.
- I might have a talk with Anand about this part and come back to it.
- Meeting with Nisha and Amritha at 10am.
- Looking into the possibility of having a config.json file to supply the link-specific grab references (a sketch follows this day's notes).
- Anand has approved the idea, but for now I have hard-coded everything.
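A rough sketch of what the config.json idea could look like once the hard-coded values are lifted out; the config layout and key names (`url`, `rowSelector`, `output`) are assumptions, not the actual scheme used in the repo.

```js
// Sketch: drive the scrapers from config.json instead of hard-coded values.
// The config layout and key names here are assumptions.
var fs = require('fs');
var casper = require('casper').create();

var config = JSON.parse(fs.read('config.json'));

casper.start();

config.links.forEach(function (link) {
    casper.thenOpen(link.url, function () {
        // Grab row text using the CSS selector configured for this link.
        var rows = this.evaluate(function (selector) {
            return Array.prototype.map.call(
                document.querySelectorAll(selector),
                function (tr) { return tr.innerText; }
            );
        }, link.rowSelector);
        fs.write(link.output, rows.join('\n'), 'w');
    });
});

casper.run();
```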
- Code development for the day.
- Code development for the day.
- Facing an issue common to all the code files.
- The issue is on the framework side.
- Searching for a way to keep the source files modular.
- Anand asked me to go through CommonJS (new to me).
- Made a commit with all the core functionality working.
- The sad thing is I did not make the files modular, so the code length is insanely large (a CommonJS sketch follows below).
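For reference, CommonJS-style modules in CasperJS/PhantomJS look roughly like the sketch below; the module and function names are made up for illustration.

```js
// table_utils.js — a hypothetical helper module (name and contents are assumptions).
function tableToCsv(rows) {
    // Turn an array of row arrays into simple CSV lines.
    return rows.map(function (cells) {
        return cells.join(',');
    }).join('\n');
}

exports.tableToCsv = tableToCsv;
```

```js
// In a scraper script, the helper is pulled back in with require().
var tableUtils = require('./table_utils');
console.log(tableUtils.tableToCsv([['State', 'Amount'], ['Bihar', '1234']]));
```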
- Sick today.
- Anand did not have commit access to the Arghyam repo until the evening.
- He made a commit that reduced the code to a reasonable size.
- Code Development.
- Code Development.
- Code Development.
- Had a short meeting with Nisha.
- Clarified my doubts about the next batch of links:
- http://tsc.gov.in/Report/Financial/RptStateLevelFinyrwise_net.aspx?id=FIN
- http://tsc.gov.in/Report/Financial/RptPercentageFinComponentStatewiseDistrictwise_net.aspx?id=FIN
- http://tsc.gov.in/Report/Physical/RptStateWiseSelective_net.aspx?id=PHY
- http://tsc.gov.in/Report/MonitoringStatusReport/RptAchofIEC_HRDStatewiseDistrictwiseDetails.aspx?fin=2012-2013&id=AIP
- http://tsc.gov.in/Report/MonitoringStatusReport/RptAIPQuarterPlanVsAchievementStatewise_net.aspx?fin=2012-2013&id=AIP
- http://tsc.gov.in/NBA/NBAHome.aspx
- http://indiawater.gov.in/IMISReports/NRDWPPanchayatProfile.aspx?IPanchayat=0000035761&PanchName=ADHELAI
- http://indiawater.gov.in/IMISReports/NRDWPMain.aspx
- Sick today.
- Installed CasperJS RC5 + PhantomJS 1.7.0
- Found issues with CasperJS RC5 + PhantomJS 1.7
- Filed the same at https://github.com/n1k0/casperjs/issues/317
- The issue with CasperJS was fixed immediately.
- Optimized: 1, 2, 4 & 8; Added: 10 & 12
- Added: 9, 11 & 13
- Meeting with Nisha & Amrtha
- More time needs to be budgeted for stabilizing the scrapers.
- Asked for more links that need to be scraped for the project.
- Nisha has sent me more links to scrape.
- Refactored file names.
- The scraper for the 13th link needs to be run on EC2, as it requires good bandwidth.
- Scrapers 17, 18 and 29 done.
- Refactoring the code to rely more on CSS selectors than on jQuery (a sketch follows below).
- Lots of remote commits to include optimizations.
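A small sketch of the CSS-first direction mentioned above: extracting cell text with native `document.querySelectorAll` inside `evaluate()` instead of injecting jQuery. The URL comes from the project's link list and the selector is an assumption about the page structure.

```js
// Sketch: rely on native CSS selectors inside evaluate() rather than jQuery.
// The 'table td' selector below is an assumption about the page structure.
var casper = require('casper').create();

casper.start('http://indiawater.gov.in/IMISReports/NRDWPMain.aspx', function () {
    var cells = this.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('table td'),
            function (td) { return td.innerText; }
        );
    });
    this.echo(cells.join(' | '));
});

casper.run();
```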
- Local commits for optimization changes.
- Local commits for optimization changes.
- Local commits for optimization changes.
- Local commits for optimization changes.