Backend work for public collections: thumbnails, url list, upload pages, and so on #2198

Merged
tw4l merged 57 commits into public-collections-feature from issue-2182-thumbnail-backend
Dec 23, 2024

@tw4l tw4l commented Dec 3, 2024

Fixes #2182

This rather large PR adds the rest of what should be needed for public collections work in the frontend.

New API endpoints

  • /public/orgs/{org_slug}/collections/{coll_id}: Public GET endpoint for public collection (returns 404 collection_not_found if org with slug doesn't exist or isn't public, collection id doesn't exist or isn't public)
  • /public/orgs/{org_slug}/collections/{coll_id}/download: Streaming download endpoint for public collection (returns 404 collection_not_found if org with slug doesn't exist or isn't public, collection id doesn't exist or isn't public; 403 if Collection.allowPublicDownload is false)
  • /orgs/{oid}/collections/{coll_id}/urls: Paginated GET list of urls in collection sorted descending by snapshot count, with information about each individual snapshot (including page_id, needed for below), for use in selecting a home url/snapshot for collections in the frontend
  • /orgs/{oid}/collections/{coll_id}/home-url: POST endpoint to set collection home url/start snapshot by page_id
  • /orgs/{oid}/collections/{coll_id}/thumbnail: PUT endpoint to upload collection thumbnail as a stream. Will replace existing thumbnail if called on a collection that already has one.
  • /orgs/{oid}/collections/{coll_id}/thumbnail: DELETE endpoint to remove collection thumbnail
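
The status rules described for the public download endpoint can be sketched as a small decision function. This is only an illustration of the rules as written in the list above, not the actual backend code: the `Org` and `Collection` classes, the `is_public` flag, and the 403 detail string are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Org:
    slug: str
    is_public: bool  # hypothetical flag name


@dataclass
class Collection:
    id: str
    is_public: bool  # hypothetical flag name
    allowPublicDownload: bool  # field name taken from the PR description


def public_download_status(
    org: Optional[Org], coll: Optional[Collection]
) -> Tuple[int, str]:
    """Return (HTTP status, detail) for a public download request."""
    # Missing or non-public org, or missing or non-public collection:
    # all of these collapse into the same 404 so nothing is leaked.
    if org is None or not org.is_public or coll is None or not coll.is_public:
        return 404, "collection_not_found"
    # The collection is visible, but downloads have not been enabled.
    if not coll.allowPublicDownload:
        return 403, "not_allowed"  # placeholder detail string (assumption)
    return 200, "ok"
```

Returning the same 404 for every "not visible" case means an unauthenticated caller cannot distinguish a private collection from one that does not exist.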

Changes to existing API endpoints

  • /public-collections/{org_slug}: Public collection list now supports pagination, only returns public fields for collections, and has had the path modified to /public/orgs/{org_slug}/collections following offline feedback
  • Several pages endpoints that previously only supported /crawls/ in their path, e.g. /orgs/{oid}/crawls/all/pages/reAdd, now support /uploads/ and /all-crawls/ namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For /orgs/{oid}/namespace/all/pages/reAdd, crawls or uploads will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent, but I didn't want to expand the scope of this PR any further).
  • /orgs/{oid}/namespace/all/pages/reAdd now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added.
  • The collection PATCH endpoint now accepts two new fields: defaultThumbnailName, which selects a default thumbnail that is served by the frontend, and allowPublicDownload, which determines whether the collection is downloadable when/if it is made public.
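
As a rough sketch of the namespace behavior described for the reAdd endpoint: `crawls` and `uploads` act as a type filter, while `all-crawls` applies to everything. The stored type values (`"crawl"`, `"upload"`) and the function names are assumptions made for illustration; the real backend may represent this differently.

```python
from typing import Dict, List, Optional

# Namespaces named in the PR description.
NAMESPACES = ("crawls", "uploads", "all-crawls")


def crawl_type_filter(namespace: str) -> Optional[str]:
    """Map a route namespace to a type filter (None means no filter)."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unsupported namespace: {namespace}")
    if namespace == "crawls":
        return "crawl"  # assumed stored type value
    if namespace == "uploads":
        return "upload"  # assumed stored type value
    return None  # all-crawls: affect both types


def select_for_readd(items: List[Dict[str, str]], namespace: str) -> List[Dict[str, str]]:
    """Pick the items whose pages would be re-added for this namespace."""
    wanted = crawl_type_filter(namespace)
    return [item for item in items if wanted is None or item["type"] == wanted]
```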

Other big changes

  • New uploads will now have their pages read into the database! Collection page counts now also include uploads!
  • A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly!
  • Adds a new ImageFile subclass of BaseFile for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints

Other less big changes

  • Collections now have a dateEarliest and dateLatest that we can generate a range from, based on the earliest and latest snapshot in the collection
  • Public collections now have their own model that restricts the fields we share to unauthenticated users
  • Collections now have a caption field that can be used to set a short caption to use for public collection views (presumably with more visibility than full description)
  • The process of adding pages to the database from WACZ files is more flexible to account for older WACZ files whose page lists may deviate from the current norm (e.g. non-UUID page ids)
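
To make the dateEarliest/dateLatest item concrete, here is a minimal sketch of deriving the range from snapshot timestamps. The function name and the choice to return `None` for an empty collection are assumptions; only the two field names come from the list above.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def collection_date_range(
    snapshot_times: Iterable[datetime],
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (dateEarliest, dateLatest) for a collection's snapshots."""
    times = list(snapshot_times)
    if not times:
        # An empty collection has no range (assumed behavior).
        return None, None
    return min(times), max(times)
```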

Testing

Backend tests have been added or modified for all API endpoint changes.

To test the migration:

  • Spin up main locally
  • Upload some WACZ files in one or more orgs and add them to some collections
  • Re-deploy from this branch
  • Verify that migration appeared to work successfully in backend logs
  • Verify that background job container (in default namespace) logs look good
  • Verify that pages have been added for uploads and that collections have been updated accordingly (look at data in Network tab for collection replay.json endpoint on the collection page)

@tw4l tw4l requested review from ikreymer and SuaYoo December 3, 2024 21:32
@SuaYoo SuaYoo self-requested a review December 4, 2024 02:54
@SuaYoo SuaYoo force-pushed the public-collections-feature branch from 25ad642 to 26740b6 Compare December 4, 2024 02:55
tw4l commented Dec 4, 2024

@SuaYoo I added a max size validation for thumbnails of 2 MB (following YouTube's example). It'll be easy to raise it or set different limits for other types of user-uploaded image files in the future. I'm wondering if we also want to restrict file types? I'm currently calculating a MIME type for the image files based on the filename they're uploaded with. We could use that, or try to detect the type via magic numbers.
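
For reference, the two options discussed here (size limit on a streamed upload, MIME type from filename versus magic numbers) could look roughly like the sketch below. It uses only the standard library; the function names, the binary interpretation of 2 MB, and the short magic-number table are assumptions, not what the PR implements.

```python
import mimetypes
from typing import Iterable, Optional

# 2 MB limit from the comment above; 2 * 1024 * 1024 is an assumed interpretation.
MAX_THUMBNAIL_BYTES = 2 * 1024 * 1024

# A few well-known image signatures (deliberately not exhaustive).
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def read_limited(chunks: Iterable[bytes], max_bytes: int = MAX_THUMBNAIL_BYTES) -> bytes:
    """Consume a stream of chunks, failing as soon as the limit is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError("thumbnail exceeds size limit")
    return bytes(buf)


def mime_from_filename(filename: str) -> Optional[str]:
    """Guess the MIME type from the file extension only."""
    return mimetypes.guess_type(filename)[0]


def mime_from_magic(data: bytes) -> Optional[str]:
    """Guess the MIME type from the leading bytes of the file."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return None
```

The filename approach trusts whatever extension the client sends, while the magic-number approach inspects the content itself, so the two can disagree for a mislabeled file.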

@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch from 883a4db to d4d1fa6 Compare December 4, 2024 16:43
@SuaYoo SuaYoo force-pushed the public-collections-feature branch from d6f4549 to b7f8ac1 Compare December 9, 2024 16:58
@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from a759584 to ff67749 Compare December 9, 2024 22:07
@SuaYoo SuaYoo force-pushed the public-collections-feature branch from b7f8ac1 to c7f827c Compare December 10, 2024 01:46
@SuaYoo SuaYoo force-pushed the issue-2182-thumbnail-backend branch from fe2ea9c to 5504722 Compare December 10, 2024 02:00
@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from 6f90a14 to 970e25f Compare December 10, 2024 21:36
@SuaYoo SuaYoo force-pushed the public-collections-feature branch from c7f827c to 323070b Compare December 11, 2024 00:32
@SuaYoo SuaYoo force-pushed the issue-2182-thumbnail-backend branch from 970e25f to f249608 Compare December 11, 2024 00:32
@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch 2 times, most recently from fa6493c to d148f46 Compare December 11, 2024 22:16
@SuaYoo SuaYoo self-requested a review December 12, 2024 15:39
@SuaYoo SuaYoo left a comment

Based on the availability of API endpoints, this looks like it covers everything we need on the backend so far! I'm also able to get it running locally.

@SuaYoo SuaYoo force-pushed the public-collections-feature branch from 323070b to efc3e1d Compare December 16, 2024 17:33
@SuaYoo SuaYoo force-pushed the issue-2182-thumbnail-backend branch 3 times, most recently from 145f64e to 76134c6 Compare December 16, 2024 18:00
@SuaYoo SuaYoo force-pushed the public-collections-feature branch from efc3e1d to 76daafb Compare December 17, 2024 17:31
tw4l added 4 commits December 17, 2024 09:33
Response is sorted desc by page count match and includes an array
containing page_id, ts, and status for each snapshot with that URL.
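
The response shape described in this commit message can be sketched as a grouping step over page records. The `page_id`, `ts`, and `status` keys come from the message; the input record layout and the `url`, `count`, and `snapshots` keys are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_snapshots_by_url(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group page records by URL, sorted descending by snapshot count."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for page in pages:
        grouped[page["url"]].append(
            {"page_id": page["id"], "ts": page["ts"], "status": page["status"]}
        )
    results = [
        {"url": url, "count": len(snaps), "snapshots": snaps}
        for url, snaps in grouped.items()
    ]
    # URLs with the most snapshots come first.
    results.sort(key=lambda r: r["count"], reverse=True)
    return results
```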
@SuaYoo SuaYoo self-requested a review December 17, 2024 22:50
SuaYoo commented Dec 17, 2024

Documenting some feedback from Discord:

  • Ran into an issue when using the /urls endpoint with prefix search. urlPrefix doesn't seem to accept values after ?, for example [frontend/src/features/collections/select-collection-start-page.ts](https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page). Passing in an encoded value to urlPrefix doesn't return any results.
  • Unlisted collections should still return in the public GET endpoint, regardless of whether the org itself is public.
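
This is not a diagnosis of the actual bug, but it illustrates two places where a `?` in a prefix value can get lost: the client must percent-encode the value so it is not read as the start of a new query string, and a server that turns the prefix into a regular expression must escape it. The helper names are hypothetical; only the `urlPrefix` parameter name comes from the feedback above.

```python
import re
from urllib.parse import urlencode


def build_urls_query(url_prefix: str, page: int = 1) -> str:
    """Build a query string with the prefix safely percent-encoded."""
    # urlencode escapes ?, &, = and + inside the value.
    return urlencode({"urlPrefix": url_prefix, "page": page})


def prefix_regex(url_prefix: str) -> "re.Pattern[str]":
    """Compile a prefix match that treats ? and other metacharacters literally."""
    return re.compile("^" + re.escape(url_prefix))
```

Without `re.escape`, the `?` in a prefix such as `index.php?title=` would make the preceding character optional instead of matching a literal question mark.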

@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch from f5ee561 to 2835e71 Compare December 18, 2024 04:23
@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch from 2835e71 to 979884e Compare December 18, 2024 04:45
@tw4l tw4l force-pushed the issue-2182-thumbnail-backend branch from ba7e296 to 320b218 Compare December 18, 2024 04:57
tw4l commented Dec 18, 2024

Documenting some feedback from Discord:

* Ran into an issue when using the `/urls` endpoint with prefix search. `urlPrefix` doesn't seem to accept values after `?`, for example `[frontend/src/features/collections/select-collection-start-page.ts](https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page)`. Passing in an encoded value to `urlPrefix` doesn't return any results.

* Unlisted collections should still return in the public GET endpoint, regardless of whether the org itself is public.

@SuaYoo these are both now fixed, pushed to dev, and tested

@SuaYoo SuaYoo left a comment

Tested newest changes locally, working well!

SuaYoo commented Dec 23, 2024

@tw4l mind if I merge this into the feature branch?

@tw4l tw4l merged commit 1a15ea4 into public-collections-feature Dec 23, 2024
5 checks passed
@tw4l tw4l deleted the issue-2182-thumbnail-backend branch December 23, 2024 19:14
tw4l commented Dec 23, 2024

@tw4l mind if I merge this into the feature branch?

done! :)

SuaYoo pushed a commit that referenced this pull request Dec 23, 2024
Fixes #2182 

This rather large PR adds the rest of what should be needed for public
collections work in the frontend.

New API endpoints include:

- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail

Changes to existing API endpoints include:

- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.

Other big changes:

- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
SuaYoo pushed a commit that referenced this pull request Jan 6, 2025
SuaYoo pushed a commit that referenced this pull request Jan 7, 2025
SuaYoo pushed a commit that referenced this pull request Jan 7, 2025
SuaYoo pushed a commit that referenced this pull request Jan 9, 2025