Backend work for public collections: thumbnails, url list, upload pages, and so on #2198
@SuaYoo I added a max size validation of 2 MB for thumbnails (following YouTube's example). It'll be easy to raise it or set different limits for other types of user-uploaded image files in the future. I'm wondering if we also want to restrict file types? I'm calculating a MIME type for the image files based on the filename they're uploaded with; we could use that, or try to detect the type via magic numbers.
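For reference, one way the size cap and magic-number detection discussed here could fit together is sketched below. The names `MAX_THUMBNAIL_BYTES`, `sniff_image_type`, and `validate_thumbnail`, the reading of "2 MB" as 2 × 1024 × 1024 bytes, and the set of accepted formats are all illustrative assumptions, not the code in this PR.

```python
from typing import Optional

# Assumption: "2 MB" is read as 2 * 1024 * 1024 bytes; the PR may use another value.
MAX_THUMBNAIL_BYTES = 2 * 1024 * 1024

# Well-known leading byte signatures ("magic numbers") for common image formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a MIME type based on the leading bytes, or None if unrecognized."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_thumbnail(data: bytes) -> str:
    """Return the detected MIME type, or raise ValueError if the upload is rejected."""
    if len(data) > MAX_THUMBNAIL_BYTES:
        raise ValueError("thumbnail exceeds maximum size")
    mime = sniff_image_type(data)
    if mime is None:
        raise ValueError("unrecognized image type")
    return mime
```

Since the thumbnail is uploaded as a stream, a real check would likely count bytes as chunks arrive instead of buffering the whole file first; only the first dozen or so bytes are needed for the type check.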
Based on the availability of API endpoints, this looks like it covers everything we need on the backend so far! I'm also able to get it running locally.
The response is sorted in descending order by the number of matching pages and includes an array containing `page_id`, `ts`, and `status` for each snapshot with that URL.
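To make that response shape concrete, here is a rough in-memory sketch of the grouping and ordering. In the backend this would presumably be done by a database query; the function name `group_pages_by_url`, the `count` and `snapshots` keys, and the tie-breaking by URL are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_pages_by_url(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group page records by URL and sort URLs by snapshot count, descending."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for page in pages:
        # Keep only the per-snapshot fields mentioned in the comment above.
        grouped[page["url"]].append(
            {"page_id": page["page_id"], "ts": page["ts"], "status": page["status"]}
        )
    results = [
        {"url": url, "count": len(snapshots), "snapshots": snapshots}
        for url, snapshots in grouped.items()
    ]
    # Most-captured URLs first; ties are broken by URL to keep the order stable.
    results.sort(key=lambda item: (-item["count"], item["url"]))
    return results
```

A frontend picking a home URL would then read `page_id` from one of the entries in `snapshots`.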
Documenting some feedback from Discord:
@SuaYoo these are both now fixed, pushed to dev, and tested.
Tested newest changes locally, working well!
@tw4l mind if I merge this into the feature branch?
done! :)
Fixes #2182
This rather large PR adds the rest of what should be needed for public collections work in the frontend.
New API endpoints include:
- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail
Changes to existing API endpoints include:
- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added.
Other big changes:
- New uploads will now have their pages read into the database! Collection page counts now also include uploads
- A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints
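As an illustration of the namespace filtering described for `reAdd`, a minimal sketch follows. The stored type values `"crawl"` and `"upload"`, the query keys, and the helper names `crawl_type_for_namespace` and `build_crawl_query` are assumptions made for this example and are not taken from the PR.

```python
from typing import Any, Dict, Optional

# Assumption: route namespace -> stored crawl type; "all-crawls" applies no filter.
_NAMESPACE_TO_TYPE: Dict[str, Optional[str]] = {
    "crawls": "crawl",
    "uploads": "upload",
    "all-crawls": None,
}


def crawl_type_for_namespace(namespace: str) -> Optional[str]:
    """Map a route namespace to a crawl type filter, or None for no filter."""
    if namespace not in _NAMESPACE_TO_TYPE:
        raise ValueError(f"unsupported namespace: {namespace}")
    return _NAMESPACE_TO_TYPE[namespace]


def build_crawl_query(oid: str, namespace: str) -> Dict[str, Any]:
    """Build a query dict selecting the org's crawls affected by the request."""
    query: Dict[str, Any] = {"oid": oid}
    crawl_type = crawl_type_for_namespace(namespace)
    if crawl_type is not None:
        # Only restrict by type when the namespace names a specific kind.
        query["type"] = crawl_type
    return query
```

Under these assumptions, a request under `/uploads/` would only touch uploads, while `/all-crawls/` would touch everything in the org.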
Fixes #2182

This rather large PR adds the rest of what should be needed for public collections work in the frontend.

New API endpoints
- `/public/orgs/{org_slug}/collections/{coll_id}`: Public GET endpoint for a public collection (returns 404 `collection_not_found` if the org with that slug doesn't exist or isn't public, or if the collection id doesn't exist or isn't public)
- `/public/orgs/{org_slug}/collections/{coll_id}/download`: Streaming download endpoint for a public collection (returns 404 `collection_not_found` if the org with that slug doesn't exist or isn't public, or if the collection id doesn't exist or isn't public; 403 if `Collection.allowPublicDownload` is false)
- `/orgs/{oid}/collections/{coll_id}/urls`: Paginated GET list of URLs in the collection, sorted descending by snapshot count, with information about each individual snapshot (including `page_id`, needed for the next endpoint), for use in selecting a home URL/snapshot for collections in the frontend
- `/orgs/{oid}/collections/{coll_id}/home-url`: POST endpoint to set the collection home URL/start snapshot by `page_id`
- `/orgs/{oid}/collections/{coll_id}/thumbnail`: PUT endpoint to upload a collection thumbnail as a stream. Replaces the existing thumbnail if called on a collection that already has one.
- `/orgs/{oid}/collections/{coll_id}/thumbnail`: DELETE endpoint to remove the collection thumbnail

Changes to existing API endpoints
- `/public-collections/{org_slug}`: Public collection list now supports pagination, only returns public fields for collections, and has had its path changed to `/public/orgs/{org_slug}/collections` following offline feedback
- Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent, but I didn't want to expand the scope of this PR any further than it already has been).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added.
- `defaultThumbnailName` field, as well as the new `allowPublicDownload` field to determine whether the collection is downloadable when/if it is made public.

Other big changes
- New uploads will now have their pages read into the database! Collection page counts now also include uploads
- A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints

Other less big changes
- `dateEarliest` and `dateLatest` that we can generate a range from, based on the earliest and latest snapshot in the collection
- `caption` field that can be used to set a short caption to use for public collection views (presumably with more visibility than the full description)

Testing
Backend tests have been added or modified for all API endpoint changes.
To test the migration:
main
locally
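Separately from the migration steps, the 404/403 behavior listed for the public collection endpoints can be summarized in a small sketch. It uses plain dataclasses instead of the backend's real models; the `is_public` field names, the default for `allowPublicDownload`, and the function `public_collection_status` are hypothetical, and the 403 error detail is left unset because the description does not specify it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Org:
    slug: str
    is_public: bool  # hypothetical field name


@dataclass
class Collection:
    id: str
    is_public: bool  # hypothetical field name
    allowPublicDownload: bool = True  # field named in the PR; default is assumed


def public_collection_status(
    org: Optional[Org], coll: Optional[Collection], download: bool = False
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, error detail) for a public collection request."""
    # A missing or non-public org or collection is reported as not found.
    if org is None or not org.is_public or coll is None or not coll.is_public:
        return 404, "collection_not_found"
    # Downloads are additionally gated by the collection's own flag.
    if download and not coll.allowPublicDownload:
        return 403, None  # detail string not given in the PR description
    return 200, None
```

Returning 404 for both the missing and the non-public cases means the response does not reveal whether a private org or collection exists.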