
[7] Adding s3 getters to the website scraping code #10

Merged 3 commits from 7-scraping-nesta-website into ccid_demo on Oct 30, 2024

Conversation

@beingkk (Contributor) commented Oct 30, 2024

Closes #7

Hi @helenCNode, you can merge this code and then download the scraped data by running these commands from the terminal:

python dsp_nesta_brain/getters/nesta.py --download
python dsp_nesta_brain/getters/nesta.py --unzip

You can then use the following function to load the metadata table in Python:

from dsp_nesta_brain.getters.nesta import load_metadata
metadata_df = load_metadata()

At the moment, each scraped web page is saved as a txt file. For further processing, you'd need to read each file back in and parse it with BeautifulSoup.
For example, you'll see that I've added a _scrape() function to your scrape module, which is just the BeautifulSoup component of your scrape() function; that could be used to go through all the txt files, as in the sketch below.
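
A minimal sketch of that loop, assuming the unzipped txt files sit in a local data directory and that _scrape() accepts a raw HTML string and returns the parsed content (the import path, directory location, and signature are all assumptions, so adjust them to match the actual module):

from pathlib import Path

from dsp_nesta_brain.pipeline.scrape import _scrape  # import path is an assumption

DATA_DIR = Path("data/nesta_website")  # assumed location of the unzipped txt files

parsed = {}
for txt_file in DATA_DIR.glob("*.txt"):
    html = txt_file.read_text(encoding="utf-8")
    # _scrape() is assumed here to take raw HTML and return the parsed content
    parsed[txt_file.stem] = _scrape(html)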

Hope this helps!

@beingkk beingkk requested a review from helenCNode October 30, 2024 12:11
@beingkk (Author) commented Oct 30, 2024

Oh, and no need to review the s3 utils - they are copied from another project and have already been reviewed.

@beingkk beingkk linked an issue Oct 30, 2024 that may be closed by this pull request
@helenCNode helenCNode merged commit 4b0df46 into ccid_demo Oct 30, 2024
@beingkk beingkk deleted the 7-scraping-nesta-website branch November 11, 2024 12:00
Linked issue: Scraping the Nesta website (#7)