
[7] Adding s3 getters to the website scraping code #10

Merged 3 commits from 7-scraping-nesta-website into ccid_demo on Oct 30, 2024

Conversation

@beingkk (Contributor) commented Oct 30, 2024

Closes #7

Hi @helenCNode, you can merge this code and then download the scraped data by running these commands from the terminal:

python dsp_nesta_brain/getters/nesta.py --download
python dsp_nesta_brain/getters/nesta.py --unzip

You can then use the following function to load the metadata table in Python:

from dsp_nesta_brain.getters.nesta import load_metadata
metadata_df = load_metadata()

At the moment, each scraped web page is saved as a txt file. For further processing, you'd need to read each file back in and parse it with BeautifulSoup.
For example, you'll see that I've added a _scrape() function to your scrape module, which is just the BeautifulSoup component of your scrape() function; that could be used to go through all the txt files, as in the sketch below.
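
A minimal sketch of that loop, assuming the unzipped txt files sit in a local data directory and that _scrape() accepts a raw HTML string and returns the parsed content (the import path, directory location, and signature are all assumptions, so adjust them to match the actual module):

from pathlib import Path

from dsp_nesta_brain.pipeline.scrape import _scrape  # import path is an assumption

DATA_DIR = Path("data/nesta_website")  # assumed location of the unzipped txt files

parsed = {}
for txt_file in DATA_DIR.glob("*.txt"):
    html = txt_file.read_text(encoding="utf-8")
    # _scrape() is assumed here to take raw HTML and return the parsed content
    parsed[txt_file.stem] = _scrape(html)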

Hope this helps!

@beingkk beingkk requested a review from helenCNode October 30, 2024 12:11
@beingkk (Author) commented Oct 30, 2024

Oh, and no need to review the s3 utils - they are copied from another project and have already been reviewed.

@beingkk beingkk linked an issue Oct 30, 2024 that may be closed by this pull request
@helenCNode helenCNode merged commit 4b0df46 into ccid_demo Oct 30, 2024
@beingkk beingkk deleted the 7-scraping-nesta-website branch November 11, 2024 12:00
Linked issue: Scraping the Nesta website (#7)