feat: lesson about using a framework #1303
Conversation
I get your point about avoiding type hints. However, in the case of the handler:
```python
@crawler.router.default_handler
async def handle_listing(context):
    ...
```
it leaves the reader without code completion or static analysis when working with the `context` object.
In my opinion, type hints should be included here. We have been using them across all docs & examples.
Just a suggestion for you to reconsider, not a request.
Other than that, good job 🙂, and the code seems to be working.
> From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.
>
> We genuinely believe scraping beginners will like it more, since it lets them create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
I think this is fine, but if you want more reasons, you can check out this PR.
Thanks for the review! I see your point and I will indeed reconsider adding the type hint, at least for the context. It would be an easier decision if the type name weren't 28 characters long, but you're right about the benefits for people with editors like VS Code, where we can assume some level of automatic code completion.
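For the record, the annotated handler would look something like the following sketch (the context class for the BeautifulSoup crawler; the exact import path may differ between Crawlee versions):

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext  # import path may vary by Crawlee version

@crawler.router.default_handler
async def handle_listing(context: BeautifulSoupCrawlingContext):
    # With the annotation, editors can offer completions and static analysis
    # for attributes like `context.soup`.
    ...
```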
> 1. We perform imports and specify an asynchronous `main()` function.
> 1. Inside, we first create a crawler. The crawler objects control the scraping. This particular crawler is of the BeautifulSoup flavor.
> 1. In the middle, we give the crawler a nested asynchronous function `handle_listing()`. Using a Python decorator (that line starting with `@`), we tell it to treat it as a default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace.
> 1. The function ends with running the crawler with the product listing URL. We await the crawler to finish its work.
> 1. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery will run our `main()` function.
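Put together, a minimal sketch of the program these steps describe could look like this (assuming Crawlee's `BeautifulSoupCrawler`; the listing URL is a placeholder and the import path may vary by Crawlee version):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main():
    # The crawler object controls the scraping; this one parses responses with BeautifulSoup.
    crawler = BeautifulSoupCrawler()

    # The decorator registers the nested function as the default handler for HTTP responses.
    @crawler.router.default_handler
    async def handle_listing(context):
        # Find the page title in the parsed soup and print its text without surrounding whitespace.
        print(context.soup.title.text.strip())

    # Run the crawler on the product listing URL (placeholder) and wait for it to finish.
    await crawler.run(["https://example.com/products"])


if __name__ == "__main__":
    # When the file runs as a standalone program, asyncio runs our main() function.
    asyncio.run(main())
```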
This PR introduces a new lesson to the Python course for beginners in scraping. The lesson is about working with a framework. Decisions I made:
Crawlee feedback
Regarding Crawlee, I didn't have much trouble writing this lesson, apart from the part where I wanted to provide hints on how to do this:

I couldn't find a good example in the docs, and I was afraid that even if I provided pointers to all the individual pieces, the student wouldn't be able to figure it out.

Also, I wanted to link to the docs when pointing out that `enqueue_links()` has a `limit` argument, but I couldn't find `enqueue_links()` in the docs. I found this, which is weird. It's not clear what object is documented, or what it even is; it feels like some internals, not regular docs of a method. I can probably guess how it ended up this way, but I don't think it's useful, and I decided I don't want to send people from the course to that page.

One more thing: I do think that Crawlee should log some "progress" information about requests made or, especially, items scraped. It's weird to run the program and then just stare at it as if it hung, waiting to see whether anything happens. For example, Scrapy logs how many items per minute I've scraped, which I personally find super useful.
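Going back to `enqueue_links()`: for illustration, the hint about the `limit` argument would amount to roughly this inside a handler (a sketch; it relies only on `enqueue_links()` accepting a `limit` keyword, as described above):

```python
@crawler.router.default_handler
async def handle_listing(context):
    # Enqueue links found on the current page, but cap how many get added to the queue.
    await context.enqueue_links(limit=10)
```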