v-pravin edited this page Jan 4, 2013 · 7 revisions

Links to other Wiki pages: FAQ

Work Documentation

Day 1 : 14. 11. 2012

  1. The first day has been all about finding the right tool for the job, and the two below became good candidates:
    1. PhantomJS
    2. CasperJS
  2. These tools have the following pros:
    1. They can handle the session-based nature of the given links.
    2. The same code can be reused for multiple URLs with minimal modifications.
  3. I previously tried my luck with Python-based scrapers (we have used these in most of our earlier work), but they did not handle the given URLs well, so I went for a tool with good automation features and wide forum support.

Day 2 : 15. 11. 2012

  1. Today I am playing around with the tools, looking for a possible code model for the web pages.
  2. I have some doubts about how the crawlers and scrapers should work, and a few other bottlenecks have come up:
    1. The tag structure of the web pages could change at any time; this may require code changes later, or more robust code up front.
    2. Copy-pasting is manual.
    3. Finalizing the structure of the flat file scraped from the pages.
  3. After a brief conversation with Anand, I see two probable ways the data can be got:
    1. through code, or
    2. through human labor (copy-pasting)
  4. To clarify this further, Anand has arranged a call with Amrtha tomorrow.
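The flat-file structure in point 2.3 was still open; one simple option is a single delimited line per scraped record. A minimal sketch in plain JavaScript, where the field names are entirely hypothetical:

```javascript
// One possible flat-file layout: a pipe-delimited line per scraped record.
// The field names below are hypothetical; the real structure was still
// being finalized at this point.
var FIELDS = ['district', 'block', 'habitation', 'status'];

function toFlatLine(record) {
  return FIELDS.map(function (field) {
    // Escape the delimiter so a value cannot break the column layout.
    return String(record[field] || '').replace(/\|/g, '\\|');
  }).join('|');
}

var line = toFlatLine({ district: 'Salem', block: 'Attur', habitation: 'X', status: 'Covered' });
// line is 'Salem|Attur|X|Covered'
```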

Day 3 : 16. 11. 2012

  1. Sat down for a Skype call with Amrtha and Anand this morning.
  2. The notes Anand took during the call can be found on the FAQ page of this repo.
  3. I managed to develop a basic working crawler that drills down through links and takes snapshots of the pages.
  4. Much of my difficulty was with learning the CasperJS and PhantomJS frameworks.
  5. CasperJS is much less crude than PhantomJS, as far as I can see now.
  6. Having problems committing the code, so I might need some time to troubleshoot this.
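The basic crawler from point 3 can be sketched roughly as below. It is factored as a function that takes the CasperJS instance, so under casperjs you would call it with require('casper').create(); the start URL and snapshot paths are placeholders, and the actual code in the repo differs:

```javascript
// Sketch of a crawler that drills down through the links on a start page
// and snapshots each page it visits. `casper` is a CasperJS instance;
// the URL and snapshot paths are placeholders.
function crawlAndSnapshot(casper, startUrl) {
  casper.start(startUrl, function () {
    this.capture('snapshots/root.png');
    // getElementsAttribute('a', 'href') returns the href of every link.
    var links = this.getElementsAttribute('a', 'href');
    links.forEach(function (href, i) {
      casper.thenOpen(href, function () {
        this.capture('snapshots/page-' + i + '.png');
      });
    });
  });
  casper.run();
}
```

Under casperjs this would be invoked as `crawlAndSnapshot(require('casper').create(), startUrl)`.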

Day 4 : 18. 11. 2012

  1. Finished a few starting scripts for the webpages.
  2. The same scrapers cannot be used for all the web pages, as they have varying tag structures.
  3. The scripts are not stable, so I had to rely more on the CasperJS side than on the jQuery side.
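Since one scraper cannot serve all the pages (point 2), each link ends up with its own extraction routine. A sketch of a per-page dispatch table, where the page keys and selectors are made up for illustration:

```javascript
// Each link gets its own extraction routine keyed by page name, because
// the tag structures differ too much for a shared one. The keys and
// selectors here are illustrative only.
var extractors = {
  link1: function (casper) { return casper.fetchText('table.data tr'); },
  link2: function (casper) { return casper.fetchText('div.report li'); }
};

function extractFor(page, casper) {
  if (!extractors[page]) {
    throw new Error('no extractor for ' + page);
  }
  return extractors[page](casper);
}
```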

Day 5 : 19. 11. 2012

  1. Targeting scrapers for the first three links today.
  2. I might skip the third link, since it sometimes gives weird responses.
  3. Still facing some trouble pushing to the repo.
  4. Finished the two scrapers but have yet to push them.

Day 6 : 20. 11. 2012

  1. Targeting scrapers for the next three links.
  2. Some of these have a different kind of querying mechanism, so I had to adapt features from CasperJS.
  3. Looks like we might need config files to supply the signatures.
  4. Anand suggested doing this later.
  5. I might push this by tomorrow afternoon.
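The different querying mechanism in point 2 mostly means pages that sit behind a search form, which CasperJS handles with casper.fill(). A sketch, again factored so the CasperJS instance is passed in; the form selector and query fields are placeholder names:

```javascript
// Sketch for pages reached through a search form rather than plain links.
// casper.fill(selector, values, submit) fills the form and, with
// submit === true, submits it. 'form#query' is a placeholder selector.
function queryAndCapture(casper, url, query, shot) {
  casper.start(url, function () {
    this.fill('form#query', query, true);
  });
  casper.then(function () {
    this.capture(shot);  // snapshot the results page
  });
  casper.run();
}
```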

Day 7 : 21. 11. 2012

  1. Finished the rest of the scrapers, up to the eighth.
  2. Time to stabilize them.
  3. For this part I might have a talk with Anand and come back.

Day 8 : 22. 11. 2012

  1. Meeting with Nisha and Amritha at 10am.

Day 9 : 23. 11. 2012

  1. Looking into the possibility of a config.json file to supply the link-specific grab references.
  2. Anand has approved the idea, but for now I have hard-coded everything.
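A sketch of what such a config.json could look like: one entry per link carrying that link's grab reference (URL plus selectors). All the names here are illustrative, since everything is still hard-coded at this point:

```javascript
// Illustrative config.json contents, one entry per link; the keys, URLs
// and selectors are made up. In PhantomJS/CasperJS the file itself would
// be read with something like JSON.parse(require('fs').read('config.json')).
var configJson = JSON.stringify({
  link1: { url: 'http://example.org/a', rows: 'table.data tr' },
  link2: { url: 'http://example.org/b', rows: 'div.report li' }
});

var config = JSON.parse(configJson);

function signatureFor(page) {
  if (!config[page]) {
    throw new Error('no config entry for ' + page);
  }
  return config[page];
}
```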

Day 10 : 26. 11. 2012

  1. Code development for the day.

Day 11 : 27. 11. 2012

  1. Code development for the day.

Day 12 : 28. 11. 2012

  1. Facing a common issue across all the code files.
  2. The issue is on the framework side.

Day 13 : 29. 11. 2012

  1. Searching for a way to keep the source files modular.
  2. Anand asked me to go through CommonJS (new to me).
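CommonJS is the module convention PhantomJS and CasperJS support through require() and module.exports. A minimal sketch of pulling shared helpers out of the individual scrapers into one module; the file name and the helpers themselves are hypothetical:

```javascript
// lib/helpers.js -- shared code pulled out of the individual scrapers and
// exposed with the CommonJS module.exports convention, so each scraper can
// do: var helpers = require('./lib/helpers');
// (The file name and these helpers are hypothetical.)
var helpers = {
  // Escape the flat-file delimiter inside a value.
  escape: function (value) {
    return String(value).replace(/\|/g, '\\|');
  },
  // Join one record's values into a pipe-delimited flat-file row.
  joinRow: function (values) {
    return values.map(helpers.escape).join('|');
  }
};

module.exports = helpers;
```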

Day 14 : 30. 11. 2012

  1. Made a commit with all the core functionality working.
  2. The sad thing is I did not make the code modular, so it is insanely long.

Day 15 : 03. 12. 2012

  1. Sick today.

Day 16 : 04. 12. 2012

  1. Anand did not have commit access to the Arghyam repo until evening.
  2. He then made a commit that reduced the code to a reasonable size.

Day 17 : 05. 12. 2012

  1. Code Development.

Day 18 : 06. 12. 2012

  1. Code Development.

Day 19 : 07. 12. 2012

  1. Code Development.

Day 20 : 10. 12. 2012

  1. Had a short meeting with Nisha.
  2. Clarified my doubts on the next batch of links.

Day 21 : 11. 12. 2012

  1. Sick today.

Day 22 : 12. 12. 2012

  1. Installed CasperJS RC5 + PhantomJS 1.7.0
  2. Found issues with CasperJS RC5 + PhantomJS 1.7
  3. Filed the same at https://github.com/n1k0/casperjs/issues/317

Day 23 : 13. 12. 2012

  1. The issue with CasperJS was fixed immediately.
  2. Optimized scrapers 1, 2, 4 & 8; added scrapers 10 & 12.

Day 24 : 14. 12. 2012

  1. Added scrapers 9, 11 & 13.

Day 25 : 17. 12. 2012

  1. Meeting with Nisha & Amrtha.
  2. More time needs to be budgeted for stabilizing the scrapers.
  3. Asked for more links that need to be scraped for the project.

Day 26 : 18. 12. 2012

  1. Nisha has sent me more links to scrape.
  2. Refactored file names.

Day 27 : 19. 12. 2012

  1. The scraper for the 13th link needs to run on EC2, as it requires good bandwidth.

Day 28 : 20. 12. 2012

  1. Scrapers 17, 18 and 29 done.

Day 29 : 21. 12. 2012

  1. Refactoring the code to use plain CSS selectors rather than jQuery.
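The refactor here swaps selections done through injected jQuery for plain CSS selectors evaluated in the page. A before/after sketch, with a placeholder selector; argument passing to evaluate() varies between CasperJS versions, so treat this as illustrative:

```javascript
// Before: counting rows through jQuery injected into the page.
//   var rows = this.evaluate(function () {
//     return $('table.data tr').length;
//   });
// After: the same count through a plain CSS selector inside evaluate(),
// which drops the jQuery dependency. Factored for illustration:
function countRows(casper, selector) {
  return casper.evaluate(function (sel) {
    return document.querySelectorAll(sel).length;
  }, selector);
}
```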

Day 30 : 24. 12. 2012

  1. Lots of remote commits to include optimizations.

Day 31 : 26. 12. 2012

  1. Local commits for optimization changes.

Day 32 : 27. 12. 2012

  1. Local commits for optimization changes.

Day 33 : 28. 12. 2012

  1. Local commits for optimization changes.

Day 34 : 31. 12. 2012

  1. Local commits for optimization changes.