This repository has been archived by the owner on Apr 13, 2022. It is now read-only.

Dropbox: Stops after one scroll #6

Open
theoryshaw opened this issue Jan 11, 2021 · 31 comments

Comments

@theoryshaw
Member

Using https://webscraper.io/, the scrape stops after one scroll. Anyone have any clues?

{
  "_id": "dropbox2",
  "startUrl": ["https://www.dropbox.com/events?date=22-8-2020"],
  "selectors": [
    { "id": "tr1", "type": "SelectorElementScroll", "parentSelectors": ["_root"], "selector": "tr.mc-media-row", "multiple": true, "delay": 2000 },
    { "id": "date", "type": "SelectorText", "parentSelectors": ["tr1"], "selector": "div.mc-media-cell-text-detail", "multiple": false, "regex": "", "delay": 0 },
    { "id": "span", "type": "SelectorHTML", "parentSelectors": ["tr1"], "selector": "span", "multiple": false, "regex": "", "delay": 0 },
    { "id": "link", "type": "SelectorLink", "parentSelectors": ["tr1"], "selector": "a:first", "multiple": false, "delay": 0 }
  ]
}

@gitcoinbot

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 0.0932 ETH (100.0 USD @ $1072.99/ETH) attached to it.


@felixniemeyer

felixniemeyer commented Jan 11, 2021

Hi theoryshaw,

I tried to reproduce your issue, and in Firefox the scrolling worked. Were you trying it in Chrome?

Although the scrolling worked fine, the results were wrong and missing a lot of entries (maybe this is also what you were referring to when you said the scrolling wasn't working).

I found a possible reason for that:
Whatever UI framework Dropbox is using, it reuses the table-row elements and only replaces their contents. After scrolling the page, webscraper may not reconsider elements that already existed (or overrides them with the most recent values?).

If you simply want to download the data once, here is an alternative method that might work for you...

Re-request the events data from Dropbox's server and adjust the page_size:

  • open the dropbox/events page
  • open your browser's developer tools (F12)
  • click on the Network tab
  • reload your page
  • filter the network results for "/events/ajax"
  • right click the first entry
  • click on "Edit and Resend"
  • in the request body replace "page_size=25" with "page_size=250"
    (the request body looks something like this: is_xhr=true&t=...&page_size=25&ns_ids=2199782192%2C2410245600%2C94858538&timestamp=1610389011&include_avatars=true)
  • wait for the request to finish
  • right click on the request in the list of requests
  • choose "copy response"

there you have it, as JSON. You'd still need to distill your desired data fields from it (e.g. use a regex to get the links out of the HTML string)

here is what one entry from this response looks like:

{
	"0": {
		"ago": "54 min. ago",
		"pkey": "836799753 94858538",
		"is_dup": false,
		"name": "asdasd",
		"timestamp": 1610386540,
		"event_blurb": "You added the file <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a>",
		"context_blurb": null,
		"avatar_url": null,
		"blurb": "You added the file <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a>.",
		"id": 836799753,
		"ns_id": 94858538
	}
}
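The "use a regex to get the links out of the HTML string" step can be sketched as a small Node helper. This is a sketch only; the href pattern assumes anchors shaped like the sample entry above.

```javascript
// Sketch: extract hrefs from a blurb/event_blurb HTML string with a regex.
// Assumes anchors of the form <a target='_blank' href='/event_details/...'>name</a>,
// as in the sample entry above.
function extractLinks(blurbHtml) {
  const links = [];
  const re = /href='([^']+)'/g;
  let match;
  while ((match = re.exec(blurbHtml)) !== null) {
    links.push(match[1]);
  }
  return links;
}

const blurb =
  "You added the file <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a>";
console.log(extractLinks(blurb)); // logs ['/event_details/58540542/94858538/804719346/0']
```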

there may be an upper limit for page_size that the API allows. Maybe check the last entry from your response against the last entry you can see on the page to make sure you've got everything.
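If you'd rather script the replay than use "Edit and Resend", the same request can be rebuilt and re-sent from the devtools console. A sketch under assumptions: buildBody and fetchEvents are hypothetical helper names, and the token and ns_ids values are placeholders you'd copy from your own captured request.

```javascript
// Sketch: rebuild and replay the /events/ajax request with a larger page_size.
// buildBody and fetchEvents are hypothetical names; the token and ns_ids values
// must be copied from your own captured request. Easiest to run from the
// devtools console on dropbox.com/events, where the session cookie is sent
// automatically.
function buildBody(token, nsIds, pageSize) {
  return new URLSearchParams({
    is_xhr: 'true',
    t: token,
    page_size: String(pageSize),
    ns_ids: nsIds.join(','),
    timestamp: String(Math.floor(Date.now() / 1000)),
    include_avatars: 'true',
  }).toString();
}

function fetchEvents(token, nsIds, pageSize) {
  return fetch('https://www.dropbox.com/events/ajax', {
    method: 'POST',
    headers: {
      'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
      'x-requested-with': 'XMLHttpRequest',
    },
    body: buildBody(token, nsIds, pageSize),
  }).then((r) => r.text());
}
```

Usage would look like fetchEvents(yourToken, yourNsIds, 250).then(console.log); the body fields mirror the captured request body shown in the steps above.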

Hope this helps

Cheers

Felix

@theoryshaw
Member Author

Hey Felix,
Yes it was Chrome.
I just tried Firefox too, and it does the same thing, that is, it only scrolls once and then quits, but it does produce the data.
I'm looking for a solution where it continues to scroll, and continues to collect data until I quit the routine. So, yes, looking for an automated scenario.
I've had https://webscraper.io/ work in this capacity on other sites.

After scrolling the page, webscraper may not reconsider elements that were already existing (or overrides it with the most recent values?).

Maybe that's why it's not working and it cannot be realized using Webscraper.

@felixniemeyer

felixniemeyer commented Jan 11, 2021

Ok, just to make sure it's not about the specific date in the startUrl:
can you please try whether scrolling works when you use this:

{
  "_id": "dropbox2",
  "startUrl": ["https://www.dropbox.com/events"],
  "selectors": [
    { "id": "tr1", "type": "SelectorElementScroll", "parentSelectors": ["_root"], "selector": "tr.mc-media-row", "multiple": true, "delay": 2000 },
    { "id": "date", "type": "SelectorText", "parentSelectors": ["tr1"], "selector": "div.mc-media-cell-text-detail", "multiple": false, "regex": "", "delay": 0 },
    { "id": "span", "type": "SelectorHTML", "parentSelectors": ["tr1"], "selector": "span", "multiple": false, "regex": "", "delay": 0 },
    { "id": "link", "type": "SelectorLink", "parentSelectors": ["tr1"], "selector": "a:first", "multiple": false, "delay": 0 }
  ]
}

Btw. the alternative approach I've described in the earlier post lets you download hundreds of events with a few clicks... check it out if you haven't already.

@theoryshaw
Member Author

I tried that too, but unfortunately it didn't work either. :\

@gitcoinbot

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work for 0.0932 ETH (94.04 USD @ $1065.8/ETH) has been submitted by:


@agbilotia1998

agbilotia1998 commented Jan 13, 2021

Hi @theoryshaw can you please check this repo https://github.com/agbilotia1998/dropbox-event-scraper

For getting the data, open the Network tab in Chrome, filter by /events/ajax, right click on the first result and click Copy → Copy as Node.js fetch.

The node.js fetch should look something like this:

fetch("https://www.dropbox.com/events/ajax", {
  "headers": {
    "accept": "text/plain, */*; q=0.01",
    "accept-language": "en-US,en;q=0.9,hi;q=0.8",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "x-requested-with": "XMLHttpRequest",
    "cookie": "locale=en; gvc=MTUyOTQ5Njc3NzI0MzM2MjI3NjI3OTEwNDI3MzE5ODkwMDUzOTk5; _ga=GA1.2.1247140008.1610457563; last_active_role=personal; _gid=GA1.2.158964997.1610556117; lid=AADNocw-0mTg7-gag6H_g5o7PCFcRnkAkoqm5xhmb57mfg; blid=AABr-ubOkuR8Ln_zR7AM_k_G0k2ZWMswwU9A9RuR7JclHQ; __Host-ss=8vnB8wnhTY; jar=W3sibnMiOiA4OTY1NzU5NzYwLCAicmVtZW1iZXIiOiB0cnVlLCAidWlkIjogMzgyNjkyOTA0MCwgImgiOiAiIiwgImV4cGlyZXMiOiAxNzA1MTY0MTIwfV0%3D; t=mXT3hPS0uP5Jqyp1uAgR5IGs; preauth=; __Host-js_csrf=mXT3hPS0uP5Jqyp1uAgR5IGs; bjar=W3sidGVhbV9pZCI6ICIiLCAicm9sZSI6ICJwZXJzb25hbCIsICJ1aWQiOiAzODI2OTI5MDQwLCAic2Vzc19pZCI6IDI3MTE1NDM3MTYzODcyODI4MTU0MTk2OTAzNzA2NzYyMjQyMTc0NSwgImV4cGlyZXMiOiAxNzA1MTY0MTIwLCAidXNlcl9naWQiOiAiQUFxYW4zZU5ORzVqSndqX0FqNlZ6aDR6In1d; db-help-center-uid=ZXlKMllXeDFaU0k2SUhzaWRXbGtJam9nTXpneU5qa3lPVEEwTUgwc0lDSnphV2R1WVhSMWNtVWlPaUFpUVVGQlJrcFpaa2xZVFZoRU0xQldiREJNVUdGQldXTmplVkJxZDNsNFpWSm5UbDlZY21GamJGaGFMVlJLZHlKOQ%3D%3D; utag_main=v_id:0176f6c03fae00c3bce964dc6e5803079002d07100b7e$_sn:2$_se:3$_ss:0$_st:1610557921231$ses_id:1610556113157%3Bexp-session$_pn:1%3Bexp-session"
  },
  "referrer": "https://www.dropbox.com/events",
  "referrerPolicy": "origin-when-cross-origin",
  "body": "is_xhr=true&t=mXT3hPS0uP5Jqyp1uAgR5IGs&page_size=25&ns_ids=8965546032%2C8965759760&timestamp=1610559894&include_avatars=true",
  "method": "POST",
  "mode": "cors"
});

copy the options object from the call above, like:

{
  "headers": {
    "accept": "text/plain, */*; q=0.01",
    "accept-language": "en-US,en;q=0.9,hi;q=0.8",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "x-requested-with": "XMLHttpRequest",
    "cookie": "locale=en; gvc=MTUyOTQ5Njc3NzI0MzM2MjI3NjI3OTEwNDI3MzE5ODkwMDUzOTk5; _ga=GA1.2.1247140008.1610457563; last_active_role=personal; _gid=GA1.2.158964997.1610556117; lid=AADNocw-0mTg7-gag6H_g5o7PCFcRnkAkoqm5xhmb57mfg; blid=AABr-ubOkuR8Ln_zR7AM_k_G0k2ZWMswwU9A9RuR7JclHQ; __Host-ss=8vnB8wnhTY; jar=W3sibnMiOiA4OTY1NzU5NzYwLCAicmVtZW1iZXIiOiB0cnVlLCAidWlkIjogMzgyNjkyOTA0MCwgImgiOiAiIiwgImV4cGlyZXMiOiAxNzA1MTY0MTIwfV0%3D; t=mXT3hPS0uP5Jqyp1uAgR5IGs; preauth=; __Host-js_csrf=mXT3hPS0uP5Jqyp1uAgR5IGs; bjar=W3sidGVhbV9pZCI6ICIiLCAicm9sZSI6ICJwZXJzb25hbCIsICJ1aWQiOiAzODI2OTI5MDQwLCAic2Vzc19pZCI6IDI3MTE1NDM3MTYzODcyODI4MTU0MTk2OTAzNzA2NzYyMjQyMTc0NSwgImV4cGlyZXMiOiAxNzA1MTY0MTIwLCAidXNlcl9naWQiOiAiQUFxYW4zZU5ORzVqSndqX0FqNlZ6aDR6In1d; db-help-center-uid=ZXlKMllXeDFaU0k2SUhzaWRXbGtJam9nTXpneU5qa3lPVEEwTUgwc0lDSnphV2R1WVhSMWNtVWlPaUFpUVVGQlJrcFpaa2xZVFZoRU0xQldiREJNVUdGQldXTmplVkJxZDNsNFpWSm5UbDlZY21GamJGaGFMVlJLZHlKOQ%3D%3D; utag_main=v_id:0176f6c03fae00c3bce964dc6e5803079002d07100b7e$_sn:2$_se:3$_ss:0$_st:1610557921231$ses_id:1610556113157%3Bexp-session$_pn:1%3Bexp-session"
  },
  "referrer": "https://www.dropbox.com/events",
  "referrerPolicy": "origin-when-cross-origin",
  "body": "is_xhr=true&t=mXT3hPS0uP5Jqyp1uAgR5IGs&page_size=25&ns_ids=8965546032%2C8965759760&timestamp=1610559894&include_avatars=true",
  "method": "POST",
  "mode": "cors"
}

and paste it in options.json in the above repo.

Run the script by executing docker build -t scraper . followed by docker run scraper, and you will get a list of events.

@theoryshaw
Member Author

Thanks @agbilotia1998. I was looking for a solution using webscraper.io, however, this might work, as well.

I'm somewhat tech savvy, but can you do a video screen capture of the steps you are taking to run this script?

@agbilotia1998

Sure, will do a screen capture.

@theoryshaw
Member Author

theoryshaw commented Jan 14, 2021

I think I got it, but it appears to only scrape the first page of events. Is it possible for it to keep scrolling and continue to capture data? That's what webscraper.io does.
Also, is it possible to export this to a CSV?

@agbilotia1998

The current page size is 250; when I tested with 35 events, all of them appeared. Can we get on a call and look at the issue? Also, it's possible to export results as CSV; let me know what particular fields you require in the CSV.

@agbilotia1998

agbilotia1998 commented Jan 14, 2021

I've added the code to write the extracted data to a CSV file; let me know if you require any changes. Execute docker build -t scraper ., then docker run --name scraper_container scraper, and then docker cp scraper_container:/scraper/output.csv ./output.csv
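For reference, the CSV-flattening step can be sketched in plain Node. This is a sketch only, not the repo's actual code; the field names follow the sample entry posted earlier in the thread.

```javascript
// Sketch: flatten /events/ajax entries into CSV text. The field list matches
// the sample entry shown earlier in this thread; adjust it if your response
// differs.
function toCsv(entries) {
  const fields = ['name', 'timestamp', 'ago', 'event_blurb', 'blurb'];
  const esc = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`; // RFC 4180-style quoting
  const header = fields.join(',');
  const rows = entries.map((e) => fields.map((f) => esc(e[f])).join(','));
  return [header, ...rows].join('\n');
}

const entries = [
  { name: 'asdasd', timestamp: 1610386540, ago: '54 min. ago', event_blurb: 'You added a file', blurb: 'You added a file.' },
];
console.log(toCsv(entries));
```

In the repo's setup this output would end up in output.csv, which the docker cp step then copies out of the container.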

@theoryshaw
Member Author

I will test the CSV.
I need a solution that exports months and months of events. So it would seem that the solution needs to automatically scroll down the events page in order to bring up more events to export.

@agbilotia1998

Right now the page size is 250; you can increase the page size to get more data.

@theoryshaw
Member Author

You can export these fields.
[screenshot of the desired fields]

@agbilotia1998

As of now there are name, timestamp, ago and blurb fields; do you need all of them?

@theoryshaw
Member Author

just add 'event blurb' and that should suffice.

@theoryshaw
Member Author

would increasing the page size be able to get months and months of data?

@agbilotia1998

I've added event_blurb as well. Yes, increasing the page size should be able to get months of data.

@theoryshaw
Member Author

Seems the field shown in this screenshot is missing from the csv: [screenshot]

@agbilotia1998

The last field in csv is blurb

@theoryshaw
Member Author

'event blurb' is included, but 'blurb' is not.

@theoryshaw
Member Author

sorry, you are correct, i see 'blurb' now.

@agbilotia1998

Okay, great!!

@theoryshaw
Member Author

awesome solution..thx.
let me see if I can get months of data, if so, i'll shoot you payment
Thanks!

@theoryshaw
Member Author

I increased the page_size to 2500, but got the following error...

[screenshot of the error in index.js of dropbox-event-scraper]

@agbilotia1998

Can we get on a call?

@theoryshaw
Member Author

sure, just emailed you a jitsi link.

@agbilotia1998

I've updated the script, please check!

@theoryshaw
Member Author

Nicely done Ayush! thanks for the help. Will shoot you the funds.
