feat: add exercise

honzajavorek committed Nov 28, 2024
1 parent 75d559e commit f947c03

Showing 2 changed files with 99 additions and 2 deletions.

@@ -325,7 +325,7 @@ For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp)

Your output should look something like this:

-```text
+```py
{'title': 'Senior Full Stack Developer',
 'company': 'Baserow',
 'url': 'https://www.python.org/jobs/7705/',

@@ -419,7 +419,8 @@ Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy)

If you export the dataset as a JSON, you should see something like this:

-```text
+<!-- eslint-skip -->
+```json
[
  {
    "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus",

@@ -493,3 +494,99 @@ Hints:
```

</details>

### Use Crawlee to find the rating of the most popular Netflix films

The [Global Top 10](https://www.netflix.com/tudum/top10) page contains a table of the films currently most popular on Netflix worldwide. Scrape the movie names, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and look up the film's rating. Each item you push to Crawlee's default dataset should contain the following data:

- URL of the film's imdb.com page
- Title
- Rating
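
If it helps, here's a minimal sketch of pushing one such item from a handler that runs on a film's IMDb page. The `...` values and the handler label are placeholders for parts you'll figure out yourself:

```py
@crawler.router.handler("...")  # placeholder label
async def handle_film_page(context):
    await context.push_data({
        "url": context.request.url,  # URL of the IMDb page being handled
        "title": ...,  # film title scraped from the page
        "rating": ...,  # rating scraped from the page, e.g. "5.0/10"
    })
```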

If you export the dataset as a JSON, you should see something like this:

<!-- eslint-skip -->
```json
[
  {
    "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1",
    "title": "The Merry Gentlemen",
    "rating": "5.0/10"
  },
  {
    "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1",
    "title": "Hot Frosty",
    "rating": "5.4/10"
  },
  ...
]
```

For each name from the Global Top 10, you'll need to construct a `Request` object with an IMDb search URL. Take the following code snippet as a hint for how to do it:

```py
...
from urllib.parse import quote_plus

async def main():
    ...

    @crawler.router.default_handler
    async def handle_netflix_table(context):
        requests = []
        for name_cell in context.soup.select(...):
            name = name_cell.text.strip()
            # quote_plus() URL-encodes the film name so it's safe to use as a query parameter
            imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
            requests.append(Request.from_url(imdb_search_url, label="..."))
        await context.add_requests(requests)

    ...
...
```

When following the first search result, you may find it handy to know that `context.enqueue_links()` takes a `limit` keyword argument, which caps the number of requests it enqueues.
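
For example, a handler for the IMDb search results page could enqueue just the first result like this (a sketch only; the labels and the CSS selector are placeholders, not necessarily the ones you'll need):

```py
@crawler.router.handler("...")
async def handle_imdb_search(context):
    # Enqueue at most one of the links matched by the selector
    await context.enqueue_links(selector="...", label="...", limit=1)
```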

<details>
<summary>Solution</summary>

```py
import asyncio
from urllib.parse import quote_plus

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_netflix_table(context):
        # Collect film names from the Global Top 10 table and enqueue an IMDb search for each
        requests = []
        for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
            name = name_cell.text.strip()
            imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
            requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
        await context.add_requests(requests)

    @crawler.router.handler("IMDB_SEARCH")
    async def handle_imdb_search(context):
        # Follow only the first search result
        await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)

    @crawler.router.handler("IMDB")
    async def handle_imdb(context):
        rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
        rating_text = context.soup.select_one(rating_selector).text.strip()
        await context.push_data({
            "url": context.request.url,
            "title": context.soup.select_one("h1").text.strip(),
            "rating": rating_text,
        })

    await crawler.run(["https://www.netflix.com/tudum/top10"])
    await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)

if __name__ == '__main__':
    asyncio.run(main())
```

</details>
