<html>
<head>
<meta charset="UTF-8">
<title>
Scrapelet
</title>
</head>
<body>
<h1>Scrapelet</h1>
This is a JavaScript <a id="marklet" href="#">scraper</a> bookmarklet. You can use it to scrape lists of URLs. It assumes that all the URLs present items rendered from the same template. You show the scraper one example of the item you want to scrape, and it seeks out similar instances on the pages and scrapes them.
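<p>
The idea of seeking out "similar instances" can be illustrated with a small sketch. This is not the code in scrapelet.js; it shows one assumed approach, in which two elements count as similar when they share the same chain of tag names up to the root. The node shape (<code>tag</code>, <code>text</code>, <code>children</code>) and the function names are hypothetical, and plain objects stand in for DOM elements so the snippet can run outside a browser.
</p>

```javascript
// Hypothetical sketch: plain objects stand in for DOM elements.
// Each node is { tag: string, text?: string, children?: node[] }.

// Collect every node together with its tag path from the root,
// for example "table/tr/td".
function collectPaths(node, prefix, out) {
  var path = prefix ? prefix + '/' + node.tag : node.tag;
  out.push({ node: node, path: path });
  (node.children || []).forEach(function (child) {
    collectPaths(child, path, out);
  });
  return out;
}

// Return all nodes whose tag path equals the example node's path.
function findSimilar(root, example) {
  var all = collectPaths(root, '', []);
  var target = all.filter(function (entry) {
    return entry.node === example;
  })[0];
  if (!target) return [];
  return all
    .filter(function (entry) { return entry.path === target.path; })
    .map(function (entry) { return entry.node; });
}

// A tree shaped like the demo table on this page.
var cells = ['first cell', 'second cell', 'third cell', 'fourth cell'].map(
  function (text) { return { tag: 'td', text: text }; }
);
var table = {
  tag: 'table',
  children: [
    { tag: 'tr', children: [cells[0], cells[1]] },
    { tag: 'tr', children: [cells[2], cells[3]] }
  ]
};

// Using the first cell as the example selects all four cells.
var similar = findSimilar(table, cells[0]).map(function (n) { return n.text; });
console.log(similar);
```

<p>
Matching on the tag path alone is deliberately crude. A real page would probably need class names or sibling positions as well, which is one reason a tool like this can be fragile when the content varies.
</p>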
<h2>Pros and Cons</h2>
This is a pure JavaScript, in-browser scraper. It uses your own browser to do the scraping. The advantage of this approach is that the scraper has all the access capabilities of your browser: it automatically uses your login credentials to access restricted content. Also, if a page uses JavaScript to render its content, the scraper can run that JavaScript and extract the content after it has been rendered.
<p>
On the downside, as a pure-JavaScript scraper it is subject to the browser's restrictions on JavaScript, in particular the same-origin policy. It cannot scrape from multiple web sites, because it is only allowed to access content from the site that holds the page where you started the scraping process.
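<p>
To make the single-site restriction concrete, here is a minimal sketch that splits a list of URLs into those a page script could read and those it could not. It assumes the limit is the browser's same-origin rule, where scheme, host, and port must all match. The function name <code>partitionByOrigin</code> and the example.com addresses are illustrative only and are not part of the scraper.
</p>

```javascript
// Split a list of URLs into those on the same origin as the start page
// and those that a page script would not be allowed to read.
function partitionByOrigin(startUrl, urls) {
  var origin = new URL(startUrl).origin;
  var result = { allowed: [], blocked: [] };
  urls.forEach(function (candidate) {
    // Relative URLs are resolved against the start page.
    var resolved = new URL(candidate, startUrl);
    var bucket = resolved.origin === origin ? result.allowed : result.blocked;
    bucket.push(resolved.href);
  });
  return result;
}

var split = partitionByOrigin('https://example.com/list/page1.html', [
  'page2.html',                      // relative: same origin
  'https://example.com/other/item',  // same scheme and host
  'https://example.org/item',        // different host
  'http://example.com/item'          // different scheme
]);
console.log(split);
```

<p>
Note that the last address is blocked even though the host name matches, because the scheme differs.
</p>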
<p>
An even bigger downside is that this is a fragile, highly experimental tool. It does not handle variability in the content well.
<h2>Installation</h2>
There's nothing to install. All you need to do is drag
this <a id="marklet1" href="#">scraper</a>
bookmarklet link into your bookmarks collection.
<h2>Usage</h2>
To use this scraper, visit the web site that you want
to scrape (you can only scrape from a single web site). Then click the bookmarklet. That opens a dialog for selecting an example of the item you want to scrape, using the mouse and arrow keys. You can then enter a list of URLs you'd like to scrape. All the URLs need to be on the same site as the example page you used to teach the scraper. After scraping completes, a page opens containing a table of results.
<p>
<b>Allow popups.</b> The scraper works by opening new windows. If your browser complains about popups, make sure to permit them.
<p>
<b>Be patient.</b> The scraper spends a few seconds waiting for each page to "settle down" (in case the page uses some JavaScript rendering) before it jumps in and extracts the content.
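<p>
The "load, wait, then extract" pattern can be sketched roughly as follows. This is an illustration under assumptions, not the scraper's own code: <code>scrapeAfterSettle</code>, <code>loadPage</code>, and <code>extract</code> are hypothetical names, and the 3000 ms delay is a guess at "a few seconds". The timer is passed in so that the demo can run instantly with a fake one.
</p>

```javascript
// Hypothetical sketch of "load, wait, then extract".
// loadPage: returns a handle to the opened page (window.open in a browser).
// extract:  reads the content out of that handle.
// schedule: timer function; defaults to setTimeout, injectable for testing.
function scrapeAfterSettle(loadPage, extract, settleMs, onDone, schedule) {
  var timer = schedule || setTimeout;
  var page = loadPage();
  timer(function () {
    onDone(extract(page));
  }, settleMs);
}

// Demo with stand-ins: the "page" fills in its content only after loading,
// the way a script-rendered page would.
var fakePage = { content: null };
var demoResult = null;
var requestedDelay = null;

scrapeAfterSettle(
  function () { return fakePage; },
  function (page) { return page.content; },
  3000, // assumed delay; the text above only says "a few seconds"
  function (value) { demoResult = value; },
  function (callback, ms) {
    // Fake timer: simulate late rendering, then fire immediately.
    requestedDelay = ms;
    fakePage.content = 'rendered late';
    callback();
  }
);
console.log(demoResult);
```

<p>
Extracting immediately after <code>loadPage</code> would have read <code>null</code> here, which is why the wait matters for pages that render their content with scripts.
</p>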
<h2>Results</h2>
Some scrapers ask you to select the particular thing you'd like to extract. This scraper just extracts <b>everything</b>. It gives you a table of all possible pieces of the items it found. The top line counts the number of items with a value in each column. After you copy and paste the table into a spreadsheet, you can delete all the columns you don't need.
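<p>
The shape of such a results table, including the top line of per-column counts, might look like the sketch below. The item format (an object mapping a column name to the text found) and the name <code>buildTable</code> are assumptions made for illustration; the scraper's real column naming is not described here.
</p>

```javascript
// Hypothetical sketch: each scraped item is an object mapping a column
// name to the text found there; missing pieces are simply absent.
function buildTable(items) {
  // Collect every column seen in any item, in first-seen order.
  var columns = [];
  items.forEach(function (item) {
    Object.keys(item).forEach(function (key) {
      if (columns.indexOf(key) === -1) columns.push(key);
    });
  });

  // Top line: how many items have a non-empty value in each column.
  var counts = columns.map(function (col) {
    return items.filter(function (item) {
      return Boolean(item[col]);
    }).length;
  });

  // One row per item, with empty strings where a piece is missing.
  var rows = items.map(function (item) {
    return columns.map(function (col) { return item[col] || ''; });
  });

  return { columns: columns, counts: counts, rows: rows };
}

var resultTable = buildTable([
  { title: 'first item', price: '10' },
  { title: 'second item' },
  { title: 'third item', price: '12', note: 'sale' }
]);
console.log(resultTable.columns, resultTable.counts);
```

<p>
A column with a low count, such as <code>note</code> in this example, is a likely candidate for deletion once the table is in a spreadsheet.
</p>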
<h2>Try it</h2> If you want to try this scraper, just click one of the
links above to launch it. Then, select a cell of the table below as
the scraping example. Or, you can select one of the headers in order
to scrape the headers out of this page. Finally, enter the URL of
this page when asked what you want to scrape. It should extract the
cells from the table below.
<table style="border: 3px solid black">
<tr>
<td>first cell</td><td>second cell</td>
</tr>
<tr>
<td>third cell</td><td>fourth cell</td>
</tr>
</table>
<script>
(function() {
  // Build a javascript: URL that injects scrapelet.js into whatever page
  // the bookmarklet is clicked on.
  var markletStart =
    "javascript:(function(){var%20jsCode=document.createElement('script');jsCode.setAttribute('src','";
  var markletEnd = "');document.body.appendChild(jsCode);}());";
  // Resolve scrapelet.js relative to this page. Unlike searching for the
  // last '/', this ignores any query string or fragment in the address.
  var scriptUrl = new URL('scrapelet.js', window.location.href).href;
  var marklet = markletStart + scriptUrl + markletEnd;
  // Point both "scraper" links on this page at the bookmarklet.
  document.getElementById("marklet").href = marklet;
  document.getElementById("marklet1").href = marklet;
})();
</script>
</body>