Backend work for public collections: thumbnails, url list, upload pages, and so on #2198

tw4l · 2024-12-03T21:31:05Z

Fixes #2182

This rather large PR adds the rest of what should be needed for public collections work in the frontend.

New API endpoints

/public/orgs/{org_slug}/collections/{coll_id}: Public GET endpoint for a public collection (returns 404 collection_not_found if the org with that slug doesn't exist or isn't public, or if the collection id doesn't exist or isn't public)
/public/orgs/{org_slug}/collections/{coll_id}/download: Streaming download endpoint for a public collection (returns 404 collection_not_found if the org with that slug doesn't exist or isn't public, or if the collection id doesn't exist or isn't public; returns 403 if Collection.allowPublicDownload is false)
/orgs/{oid}/collections/{coll_id}/urls: Paginated GET list of urls in collection sorted descending by snapshot count, with information about each individual snapshot (including page_id, needed for below), for use in selecting a home url/snapshot for collections in the frontend
/orgs/{oid}/collections/{coll_id}/home-url: POST endpoint to set collection home url/start snapshot by page_id
/orgs/{oid}/collections/{coll_id}/thumbnail: PUT endpoint to upload collection thumbnail as a stream. Will replace existing thumbnail if called on a collection that already has one.
/orgs/{oid}/collections/{coll_id}/thumbnail: DELETE endpoint to remove collection thumbnail
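
As a rough illustration of the response shape described for the /urls endpoint above, here is a minimal in-memory sketch that groups page records by URL, sorts by snapshot count descending, and paginates. The function name, the input record fields, and the response keys (`items`, `total`, `count`, `snapshots`) are assumptions made for the example; only `page_id`, `ts`, and `status` come from this PR, and the actual backend presumably does this in a database query rather than in Python.

```python
from collections import defaultdict
from typing import Any, Dict, List


def list_collection_urls(
    pages: List[Dict[str, Any]], page: int = 1, page_size: int = 10
) -> Dict[str, Any]:
    """Group pages by URL, sort by snapshot count (descending), and paginate.

    Hypothetical helper: input records are assumed to carry id, url, ts, status.
    """
    snapshots_by_url: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in pages:
        snapshots_by_url[record["url"]].append(
            {"page_id": record["id"], "ts": record["ts"], "status": record["status"]}
        )

    items = [
        {"url": url, "count": len(snapshots), "snapshots": snapshots}
        for url, snapshots in snapshots_by_url.items()
    ]
    # Most-captured URLs first; ties are broken by URL so the order is stable.
    items.sort(key=lambda item: (-item["count"], item["url"]))

    start = (page - 1) * page_size
    return {
        "items": items[start : start + page_size],
        "total": len(items),
        "page": page,
        "pageSize": page_size,
    }
```

The `page_id` of whichever snapshot the user picks is what would then be sent to the home-url endpoint.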

Changes to existing API endpoints

/public-collections/{org_slug}: Public collection list now supports pagination, only returns public fields for collections, and has had the path modified to /public/orgs/{org_slug}/collections following offline feedback
Several pages endpoints that previously only supported /crawls/ in their path, e.g. /orgs/{oid}/crawls/all/pages/reAdd, now support the /uploads/ and /all-crawls/ namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For /orgs/{oid}/namespace/all/pages/reAdd, crawls or uploads will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent, but I didn't want to expand the scope of this PR any further).
/orgs/{oid}/namespace/all/pages/reAdd now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added.
The collection PATCH endpoint now accepts two new fields: defaultThumbnailName, which selects a default thumbnail that is served by the frontend, and allowPublicDownload, which determines whether the collection is downloadable when/if it is made public.
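
To make the namespace filtering for the reAdd endpoint concrete, here is a small sketch of how a route namespace could map to a type filter. The mapping values (`"crawl"`, `"upload"`), the `type` field on each item, and both function names are assumptions for illustration, not the backend's actual identifiers.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping from route namespace to an item type filter.
# None means "no filter": the all-crawls namespace covers both types.
NAMESPACE_TO_TYPE: Dict[str, Optional[str]] = {
    "crawls": "crawl",
    "uploads": "upload",
    "all-crawls": None,
}


def type_filter_for_namespace(namespace: str) -> Optional[str]:
    """Return the type filter implied by the route namespace."""
    if namespace not in NAMESPACE_TO_TYPE:
        raise ValueError(f"unsupported namespace: {namespace}")
    return NAMESPACE_TO_TYPE[namespace]


def select_for_readd(items: List[Dict[str, Any]], namespace: str) -> List[Dict[str, Any]]:
    """Pick the archived items whose pages would be re-added for this namespace."""
    wanted = type_filter_for_namespace(namespace)
    return [item for item in items if wanted is None or item["type"] == wanted]
```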

Other big changes

New uploads will now have their pages read into the database! Collection page counts now also include uploads!
A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly!
Adds a new ImageFile subclass of BaseFile for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints

Other less big changes

Collections now have a dateEarliest and dateLatest that we can generate a range from, based on the earliest and latest snapshot in the collection
Public collections now have their own model that restricts the fields we share to unauthenticated users
Collections now have a caption field that can be used to set a short caption to use for public collection views (presumably with more visibility than full description)
The process of adding pages to the database from WACZ files is more flexible to account for older WACZ files whose page lists may deviate from the current norm (e.g. non-UUID page ids)
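
The dateEarliest and dateLatest fields mentioned above amount to a min/max over snapshot timestamps. A minimal sketch, assuming the timestamps are already available as `datetime` objects (the function name is hypothetical, and the backend may well compute this in a database aggregation instead):

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def collection_date_range(
    timestamps: Iterable[datetime],
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (dateEarliest, dateLatest), or (None, None) for an empty collection."""
    ts_list = list(timestamps)
    if not ts_list:
        return None, None
    return min(ts_list), max(ts_list)
```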

Testing

Backend tests have been added or modified for all API endpoint changes.

To test the migration:

Spin up main locally
Upload some WACZ files in one or more orgs and add them to some collections
Re-deploy from this branch
Verify in the backend logs that the migration appears to have run successfully
Verify that background job container (in default namespace) logs look good
Verify that pages have been added for uploads and that collections have been updated accordingly (look at data in Network tab for collection replay.json endpoint on the collection page)

backend/btrixcloud/models.py

tw4l · 2024-12-04T16:42:33Z

@SuaYoo I added a max size validation of 2 MB for thumbnails (following YouTube's example). It'll be easy to raise it or set different limits for other types of user-uploaded image files in the future. I'm wondering if we also want to restrict file types? I'm calculating a MIME type for the image files based on the filename they're uploaded with; we could use that, or try to detect the type via magic numbers.
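
For reference, the two type-detection options and the size cap could look roughly like the sketch below. It uses only the standard library; the constant name, function names, and the set of signatures checked are assumptions for the example, not what the PR implements.

```python
import mimetypes
from typing import Iterable, Optional

MAX_THUMBNAIL_BYTES = 2 * 1024 * 1024  # 2 MB cap; hypothetical constant name

# A few well-known image signatures (magic numbers); not an exhaustive list.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def mime_from_filename(filename: str) -> Optional[str]:
    """Option 1: guess the MIME type from the uploaded filename's extension."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def mime_from_magic(header: bytes) -> Optional[str]:
    """Option 2: detect the MIME type from the first bytes of the file."""
    for magic, mime in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return mime
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image/webp"
    return None


def read_limited(chunks: Iterable[bytes], max_bytes: int = MAX_THUMBNAIL_BYTES) -> bytes:
    """Accumulate streamed chunks, failing as soon as the size cap is exceeded."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise ValueError("thumbnail exceeds maximum size")
    return bytes(buffer)
```

The filename-based guess is cheap but trusts the client; the magic-number check reads the content itself, so the two can disagree when a file is misnamed.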

SuaYoo

Based on the availability of API endpoints, this looks like it covers everything we need on the backend so far! I'm also able to get it running locally.

SuaYoo · 2024-12-17T22:52:05Z

Documenting some feedback from Discord:

Ran into an issue when using the /urls endpoint with prefix search. urlPrefix doesn't seem to accept values after ?, for example https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page. Passing in an encoded value to urlPrefix doesn't return any results.
Unlisted collections should still return in the public GET endpoint, regardless of whether the org itself is public.

tw4l · 2024-12-18T05:54:33Z

@SuaYoo these are both now fixed, pushed to dev, and tested
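
Going by the commit titles in this PR ("URL decode urlPrefix in /urls endpoint" and "Escape special characters in url_prefix regex"), the prefix fix can be sketched as follows. This is a minimal illustration using the standard library; the function name is hypothetical, and the real code presumably passes the pattern to a database query rather than compiling it in Python.

```python
import re
from urllib.parse import unquote


def url_prefix_pattern(url_prefix: str) -> "re.Pattern[str]":
    """Build an anchored, literal prefix pattern from a possibly URL-encoded value."""
    # Decode percent-escapes first, so an encoded "?" or "&" becomes the literal character.
    decoded = unquote(url_prefix)
    # Escape regex metacharacters such as "?" and "." so they only match themselves.
    return re.compile("^" + re.escape(decoded))
```

Without the escaping step, a `?` in the prefix would be read as a regex quantifier rather than a literal character, which is consistent with prefixes containing a query string failing to match.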

SuaYoo

Tested newest changes locally, working well!

SuaYoo · 2024-12-23T19:02:10Z

@tw4l mind if I merge this into the feature branch?

tw4l · 2024-12-23T19:14:19Z

done! :)

Fixes #2182

This rather large PR adds the rest of what should be needed for public collections work in the frontend.

New API endpoints include:

- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail

Changes to existing API endpoints include:

- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added.

Other big changes:

- New uploads will now have their pages read into the database! Collection page counts now also include uploads
- A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints

tw4l requested review from ikreymer and SuaYoo December 3, 2024 21:32

SuaYoo reviewed Dec 3, 2024

backend/btrixcloud/models.py

SuaYoo self-requested a review December 4, 2024 02:54

SuaYoo force-pushed the public-collections-feature branch from 25ad642 to 26740b6 December 4, 2024 02:55

SuaYoo reviewed Dec 4, 2024

backend/btrixcloud/models.py

tw4l force-pushed the issue-2182-thumbnail-backend branch from 883a4db to d4d1fa6 December 4, 2024 16:43

SuaYoo force-pushed the public-collections-feature branch from d6f4549 to b7f8ac1 December 9, 2024 16:58

tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from a759584 to ff67749 December 9, 2024 22:07

SuaYoo force-pushed the public-collections-feature branch from b7f8ac1 to c7f827c December 10, 2024 01:46

SuaYoo force-pushed the issue-2182-thumbnail-backend branch from fe2ea9c to 5504722 December 10, 2024 02:00

tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from 6f90a14 to 970e25f December 10, 2024 21:36

SuaYoo force-pushed the public-collections-feature branch from c7f827c to 323070b December 11, 2024 00:32

SuaYoo force-pushed the issue-2182-thumbnail-backend branch from 970e25f to f249608 December 11, 2024 00:32

tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from fa6493c to d148f46 December 11, 2024 22:16

SuaYoo self-requested a review December 12, 2024 15:39

SuaYoo approved these changes Dec 12, 2024

SuaYoo force-pushed the public-collections-feature branch from 323070b to efc3e1d December 16, 2024 17:33

SuaYoo force-pushed the issue-2182-thumbnail-backend branch 3 times, most recently from 145f64e to 76134c6 December 16, 2024 18:00

SuaYoo force-pushed the public-collections-feature branch from efc3e1d to 76daafb December 17, 2024 17:31

tw4l added 4 commits December 17, 2024 09:33

Add list endpoint to get sorted list of URLs in collection

9c7c920

Response is sorted desc by page count match and includes an array containing page_id, ts, and status for each snapshot with that URL.

Add endpoint to set or update collection home url

2041be5

Fixups

02a7d1e

Use updated response for /home-urls endpoint

d75c4af

tw4l added 4 commits December 17, 2024 09:33

Add public collection download endpoint

4b9a60d

Implement and enforce Collection.allowPublicDownload

7cf0b62

Fix linting

7ac420d

Make sure coll out models return allowPublicDownload as bool

d00cc39

SuaYoo force-pushed the issue-2182-thumbnail-backend branch from 76134c6 to d00cc39 December 17, 2024 17:33

SuaYoo mentioned this pull request Dec 17, 2024

feat: Collection thumbnails, start page, and public view updates #2209

Merged

tw4l added 3 commits December 17, 2024 15:43

Make migration idempotent - don't readd existing upload pages

4482b38

Reformat migration

73813dd

Fix typing

2dce950

SuaYoo self-requested a review December 17, 2024 22:50

tw4l force-pushed the issue-2182-thumbnail-backend branch from f5ee561 to 2835e71 December 18, 2024 04:23

Allow getting and downloading public collections if org profile disabled

979884e

tw4l force-pushed the issue-2182-thumbnail-backend branch from 2835e71 to 979884e December 18, 2024 04:45

URL decode urlPrefix in /urls endpoint

320b218

tw4l force-pushed the issue-2182-thumbnail-backend branch from ba7e296 to 320b218 December 18, 2024 04:57

tw4l added 2 commits December 18, 2024 00:45

Do URL decoding inside list urls method

fb0ad7b

Escape special characters in url_prefix regex

f882409

tw4l mentioned this pull request Dec 18, 2024

Add page count to crawl model #2257

Open

SuaYoo approved these changes Dec 23, 2024

tw4l merged commit 1a15ea4 into public-collections-feature Dec 23, 2024
5 checks passed

tw4l deleted the issue-2182-thumbnail-backend branch December 23, 2024 19:14
