Home
Links to other Wiki pages: FAQ
- The first day has been all about picking the right tool for the job, and two tools became good choices to go for: CasperJS and PhantomJS.
- These experimental tools have the following pros (a minimal sketch follows this day's notes):
- They can handle the session-based nature of the given links.
- The same code can be reused for multiple URLs with minimal modification.
- I previously tried my luck with Python-based scrapers (we used these before in most of our work), but they did not handle the given URLs well, so I went for a tool with good automation features and wide forum support.
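As a rough illustration of the session-handling point above, the sketch below drives one of the report pages inside a single headless browser session; the output filename is an assumption and is only for illustration.

```js
// Minimal CasperJS sketch: the report pages carry ASP.NET session/view state,
// so every step runs inside the same browser session rather than as
// independent HTTP requests. The output filename is an assumption.
var casper = require('casper').create();

casper.start('http://tsc.gov.in/Report/Physical/RptStateWiseSelective_net.aspx?id=PHY');

casper.then(function () {
    // Session cookies and view state from the first request are reused here,
    // so follow-up interactions behave as they would for a real user.
    this.capture('statewise-report.png');
});

casper.run();
```

The same skeleton can be pointed at a different report URL by changing only the `start()` argument, which is where the reuse advantage comes from.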
- Today I am playing around with the tools, looking for a possible code model for the web pages.
- I have some doubts about how the crawlers and scrapers should work, along with some other bottlenecks that came up:
- The tag structure of the web pages could change at any time; this may call for code changes later on, or for more robust code up front.
- Copy-pasting: this is manual.
- Finalizing the structure of the flat file scraped from the pages.
- After a brief conversation with Anand, I see two probable ways the data can be obtained:
- through code, or
- through human labor (copy-pasting)
- To clarify this further, Anand has arranged a call with Amrtha tomorrow.
- Sat down for a Skype call with Amrtha and Anand this morning.
- The notes Anand took during the call can be found on the FAQ page of this repo.
- I was able to develop a basic working crawler that drills down through links and takes snapshots of the pages (a sketch follows this day's notes).
- Much of the trouble I had was with learning the CasperJS and PhantomJS frameworks.
- CasperJS is much less crude than PhantomJS as far as I can see now.
- Having problems committing the code, so I might take some time to troubleshoot this.
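A minimal sketch of what such a link-drilling snapshot crawler could look like, assuming the drill-down targets are plain anchor tags; the selector and snapshot filenames are assumptions.

```js
// Sketch of a link-drilling snapshot crawler with CasperJS.
// Assumption: the drill-down targets are ordinary <a> elements.
var casper = require('casper').create();
var links = [];

casper.start('http://tsc.gov.in/NBA/NBAHome.aspx', function () {
    // Collect the hrefs of every anchor on the landing page.
    links = this.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('a'),
            function (a) { return a.href; }
        );
    });
});

casper.then(function () {
    // Visit each collected link and capture a snapshot of the rendered page.
    links.forEach(function (url, i) {
        casper.thenOpen(url, function () {
            this.capture('snapshot-' + i + '.png');
        });
    });
});

casper.run();
```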
- Finished a few starting scripts for the web pages.
- The same scrapers cannot be used for all the web pages, as they have varying tag structures.
- The scripts are not stable, so I had to rely more on the CasperJS side than on the jQuery side.
- Targeting scrapers for the first three links today.
- I might skip the third link, as it gives weird responses sometimes.
- Still facing some trouble pushing to the repos.
- Finished the two scrapers but have yet to push them.
- Targeting scrapers for the next three links.
- Some of these have a different kind of querying mechanism, so I had to adapt features from CasperJS.
- It looks like we might need config files to supply the signatures.
- Anand suggested doing this later.
- I might push this to tomorrow afternoon.
- Finished the rest of the scrapers, up to the eighth.
- Time to stabilize them.
- I might have a talk with Anand about this part and come back to it.
- Meeting with Nisha and Amritha at 10am.
- Looking into the possibility of having a config.json file to supply the link-specific grab references (a sketch follows this day's notes).
- Anand has approved the idea, but for now I have hard-coded everything.
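A rough sketch of what the config.json idea could look like once the hard-coded values are lifted out; the config layout and key names (`url`, `rowSelector`, `output`) are assumptions, not the actual scheme used in the repo.

```js
// Sketch: drive the scrapers from config.json instead of hard-coded values.
// The config layout and key names here are assumptions.
var fs = require('fs');
var casper = require('casper').create();

var config = JSON.parse(fs.read('config.json'));

casper.start();

config.links.forEach(function (link) {
    casper.thenOpen(link.url, function () {
        // Grab row text using the CSS selector configured for this link.
        var rows = this.evaluate(function (selector) {
            return Array.prototype.map.call(
                document.querySelectorAll(selector),
                function (tr) { return tr.innerText; }
            );
        }, link.rowSelector);
        fs.write(link.output, rows.join('\n'), 'w');
    });
});

casper.run();
```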
- Code development for the day.
- Code development for the day.
- Facing an issue common to all the code files.
- The issue is on the framework side.
- Searching for a way to keep the source files modular.
- Anand asked me to go through CommonJS (new to me).
- Made a commit with all the core functionality working.
- The sad thing is I did not make the files modular, so the code length is insanely large (a CommonJS sketch follows below).
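For reference, CommonJS-style modules in CasperJS/PhantomJS look roughly like the sketch below; the module and function names are made up for illustration.

```js
// table_utils.js — a hypothetical helper module (name and contents are assumptions).
function tableToCsv(rows) {
    // Turn an array of row arrays into simple CSV lines.
    return rows.map(function (cells) {
        return cells.join(',');
    }).join('\n');
}

exports.tableToCsv = tableToCsv;
```

```js
// In a scraper script, the helper is pulled back in with require().
var tableUtils = require('./table_utils');
console.log(tableUtils.tableToCsv([['State', 'Amount'], ['Bihar', '1234']]));
```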
- Sick today.
- Anand did not have commit access to the Arghyam repo until the evening.
- He made a commit that reduced the code to a reasonable size.
- Code Development.
- Code Development.
- Code Development.
- Had a short meeting with Nisha.
- Clarified my doubts about the next batch of links:
- http://tsc.gov.in/Report/Financial/RptStateLevelFinyrwise_net.aspx?id=FIN
- http://tsc.gov.in/Report/Financial/RptPercentageFinComponentStatewiseDistrictwise_net.aspx?id=FIN
- http://tsc.gov.in/Report/Physical/RptStateWiseSelective_net.aspx?id=PHY
- http://tsc.gov.in/Report/MonitoringStatusReport/RptAchofIEC_HRDStatewiseDistrictwiseDetails.aspx?fin=2012-2013&id=AIP
- http://tsc.gov.in/Report/MonitoringStatusReport/RptAIPQuarterPlanVsAchievementStatewise_net.aspx?fin=2012-2013&id=AIP
- http://tsc.gov.in/NBA/NBAHome.aspx
- http://indiawater.gov.in/IMISReports/NRDWPPanchayatProfile.aspx?IPanchayat=0000035761&PanchName=ADHELAI
- http://indiawater.gov.in/IMISReports/NRDWPMain.aspx
- Sick today.
- Installed CasperJS RC5 + PhantomJS 1.7.0
- Found issues with CasperJS RC5 + PhantomJS 1.7
- Filed the same at https://github.com/n1k0/casperjs/issues/317
- The issue with CasperJS was fixed immediately.
- Optimized: 1, 2, 4 & 8; Added: 10 & 12
- Added: 9, 11 & 13
- Meeting with Nisha & Amrtha
- More time needs to be budgeted for stabilizing the scrapers.
- Asked for more links that need to be scraped for the project.
- Nisha has sent me more links to scrape.
- Refactored file names.
- The scraper for the 13th link needs to be run on EC2, as it requires good bandwidth.
- Scrapers 17, 18 and 29 done.
- Refactoring the code to rely more on CSS selectors than on jQuery (a sketch follows below).
- Lots of remote commits to include optimizations.
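A small sketch of the CSS-first direction mentioned above: extracting cell text with native `document.querySelectorAll` inside `evaluate()` instead of injecting jQuery. The URL comes from the project's link list and the selector is an assumption about the page structure.

```js
// Sketch: rely on native CSS selectors inside evaluate() rather than jQuery.
// The 'table td' selector below is an assumption about the page structure.
var casper = require('casper').create();

casper.start('http://indiawater.gov.in/IMISReports/NRDWPMain.aspx', function () {
    var cells = this.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('table td'),
            function (td) { return td.innerText; }
        );
    });
    this.echo(cells.join(' | '));
});

casper.run();
```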
- Local commits for optimization changes.
- Local commits for optimization changes.
- Local commits for optimization changes.
- Local commits for optimization changes.