Dropbox: Stops after one scroll #6
Issue Status: 1. Open 2. Started 3. Submitted 4. Done This issue now has a funding of 0.0932 ETH (100.0 USD @ $1072.99/ETH) attached to it.
|
Hi theoryshaw, I tried to reproduce your issue, and in Firefox the scrolling worked. Were you trying in Chrome? Although the scrolling worked, the results were wrong and many entries were missing (maybe that is what you meant when you said the scrolling wasn't working). I found a possible reason for that: If you simply want to download the data once, here is an alternative method that might work for you: re-request the events data from Dropbox's server and adjust the page_size:
There you have it, as JSON. You'd still need to extract your desired data fields from it (e.g. use a regex to get the links out of the HTML string). Here is what one entry from this response looks like: {
"0": {
"ago": "vor 54 Min.",
"pkey": "836799753 94858538",
"is_dup": false,
"name": "asdasd",
"timestamp": 1610386540,
"event_blurb": "Sie haben die Datei <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a> hinzugefügt",
"context_blurb": null,
"avatar_url": null,
"blurb": "Sie haben die Datei <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a> hinzugefügt.",
"id": 836799753,
"ns_id": 94858538
}
} There may be an upper limit on the page_size that the API allows. Check the last entry in your response against the last entry you can see on the page to make sure you've got everything. Cheers, Felix |
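As a rough illustration of the "use a regex to get the links out of the HTML string" step, here is a minimal Python sketch. It assumes the response has already been parsed into a dict shaped like the sample entry above. The names `LINK_RE` and `extract_links` are made up for this example and are not part of any Dropbox API. A regex is only adequate here because the blurbs are short, simple anchor tags; a real HTML parser would be more robust.

```python
import re
from html import unescape

# One entry, trimmed from the sample response shown above.
events = {
    "0": {
        "name": "asdasd",
        "timestamp": 1610386540,
        "event_blurb": "Sie haben die Datei <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a> hinzugefügt",
    }
}

# Matches an <a> tag whose href is single- or double-quoted and
# captures the href value (group 2) and the link text (group 3).
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref=(['"])(.*?)\1[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(blurb):
    """Return (href, text) pairs for every anchor in an HTML blurb.

    Hypothetical helper: a missing or null blurb yields an empty list.
    """
    return [
        (unescape(m.group(2)), unescape(m.group(3)))
        for m in LINK_RE.finditer(blurb or "")
    ]


for key, event in events.items():
    for href, text in extract_links(event["event_blurb"]):
        print(key, href, text)
```

The hrefs in the sample are relative paths, so they would still need to be joined with the site's base URL before they can be opened.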
Hey Felix,
Maybe that's why it's not working and cannot be done using Web Scraper. |
Ok, just to make sure it's not about the specific date in the startUrl:
Btw. the alternative approach I've described in the earlier post lets you download hundreds of events with a few clicks... check it out if you haven't already. |
I tried that too, but unfortunately it didn't work either. :\ |
Issue Status: 1. Open 2. Started 3. Submitted 4. Done Work for 0.0932 ETH (94.04 USD @ $1065.8/ETH) has been submitted by:
|
Hi @theoryshaw, can you please check this repo: https://github.com/agbilotia1998/dropbox-event-scraper To get the data, open the Network tab in Chrome, filter by The node.js fetch should look something like this:
copy the object from above like:
and paste it in options.json in the above repo. Run the script by executing |
Thanks @agbilotia1998. I was looking for a solution using webscraper.io, however, this might work, as well. I'm somewhat tech savvy, but can you do a video screen capture of the steps you are taking to run this script? |
Sure, will do a screen capture. |
The current page size is 250; when I tested with 35 events, all of them appeared. Can we get on a call to look at the issue? It's also possible to export the results as CSV. Let me know which fields you need in the CSV. |
I've added code to write the extracted data to a CSV file. Let me know if you need any changes. Execute |
I will test the CSV. |
Right now the page size is 250, you can increase the page size to get more data. |
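One way to judge whether a given page size reaches far enough back is to look at the oldest timestamp in the response. The following Python sketch assumes the response is a dict of entries like the sample earlier in this thread, and that `timestamp` is a Unix time in seconds. `oldest_event_time` and `covers` are hypothetical helper names, not functions from the repo.

```python
from datetime import datetime, timezone

# Trimmed sample entry from the response shown earlier in the thread.
events = {
    "0": {"name": "asdasd", "timestamp": 1610386540},
}


def oldest_event_time(events):
    """Return the UTC datetime of the oldest entry, or None if empty."""
    stamps = [entry["timestamp"] for entry in events.values()]
    if not stamps:
        return None
    return datetime.fromtimestamp(min(stamps), tz=timezone.utc)


def covers(events, wanted_start):
    """True if the response reaches back at least to wanted_start."""
    oldest = oldest_event_time(events)
    return oldest is not None and oldest <= wanted_start


# Does the response reach back to the date used in the startUrl?
wanted = datetime(2020, 8, 22, tzinfo=timezone.utc)
print(oldest_event_time(events), covers(events, wanted))
```

If `covers` returns False, the response stops short of the wanted date, so a larger page size (up to whatever limit the server allows) would be needed.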
As of now there are name, timestamp, ago and blurb fields. Do you need all of them? |
just add 'event blurb' and that should suffice. |
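For reference, the CSV export with the extra field could look roughly like the Python sketch below. This is not the code from the repo (which is a node.js script). The column order in `FIELDS` and the helper name `events_to_csv` are assumptions made for illustration.

```python
import csv
import io

# Fields discussed in the thread; the column order is an assumption.
FIELDS = ["name", "timestamp", "ago", "blurb", "event_blurb"]

# Sample entry from the response shown earlier in the thread.
events = {
    "0": {
        "ago": "vor 54 Min.",
        "name": "asdasd",
        "timestamp": 1610386540,
        "event_blurb": "Sie haben die Datei <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a> hinzugefügt",
        "blurb": "Sie haben die Datei <a target='_blank' href='/event_details/58540542/94858538/804719346/0'>test (17) (1) (1).web</a> hinzugefügt.",
    }
}


def events_to_csv(events, out):
    """Write one CSV row per event to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for event in events.values():
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow({field: event.get(field, "") for field in FIELDS})


buffer = io.StringIO()
events_to_csv(events, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass that file object as `out`.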
would increasing the page size be able to get months and months of data? |
I've added |
The last field in csv is |
sorry, you are correct, i see 'blurb' now. |
Okay, great!! |
awesome solution..thx. |
Can we get on a call? |
sure, just emailed you a jitsi link. |
I've updated the script, please check! |
Nicely done Ayush! thanks for the help. Will shoot you the funds. |
Using https://webscraper.io/, the scrape stops after one scroll. Does anyone have any clues?
{
  "_id": "dropbox2",
  "startUrl": ["https://www.dropbox.com/events?date=22-8-2020"],
  "selectors": [
    {
      "id": "tr1",
      "type": "SelectorElementScroll",
      "parentSelectors": ["_root"],
      "selector": "tr.mc-media-row",
      "multiple": true,
      "delay": 2000
    },
    {
      "id": "date",
      "type": "SelectorText",
      "parentSelectors": ["tr1"],
      "selector": "div.mc-media-cell-text-detail",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "span",
      "type": "SelectorHTML",
      "parentSelectors": ["tr1"],
      "selector": "span",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "link",
      "type": "SelectorLink",
      "parentSelectors": ["tr1"],
      "selector": "a:first",
      "multiple": false,
      "delay": 0
    }
  ]
}