feat(recap): Enable appellate PDF purchases #4948

ERosendo · 2025-01-20T22:53:13Z

This PR addresses issue #4861.

Key Changes:

Enhanced download_pacer_pdf_by_rd method: This method now incorporates a check for the court_id of the recap document. Based on the court_id, the method adjusts its behavior to ensure proper PDF downloads.
Adds a custom error message when trying to purchase ACMS PDFs.

I created a separate issue to review and explore potential methods for purchasing PDFs from ACMS courts.

mlissner

A couple little things from me, but I leave it to @albertisfu to do the full review. Thank you both!

cl/corpus_importer/tasks.py

cl/recap/tasks.py

This commit replaces direct court_id retrieval from the db in the appellate court check within the if statement with the existing pacer_court_id variable.

albertisfu

This looks good and works properly for purchasing appellate documents without attachments.

But I tested purchasing a main appellate document with attachments and the purchase failed. However, the issue seems to lie in Juriscraper, where the download_pdf method in AppellateDocketReport does not use the make_doc1_url method to build the document URL.
As a result, it does not change the fourth digit from 0 to 1, and instead of retrieving the PDF, it retrieves the attachment page, causing the process to fail.

ERosendo · 2025-01-23T06:05:33Z

@albertisfu Thanks for the review!

> But I tested purchasing a main appellate document with attachments and the purchase failed.

I believe this behavior is correct. Appellate attachments don't have a main document. Therefore, if a user tries to use the /recap-fetch endpoint to purchase a document, and it's actually an attachment page, the request should fail. I don't think changing the fourth digit will work the way we want. I'm sure it would result in the purchase of one of the documents from the attachment page, which would then be added as the main document.

Here's an example (I uninstalled the extension to prevent the entry from being updated):

The first entry from case 25-1055 has attachments, but it's currently listed as a regular entry. This is because we haven't yet processed an upload that would update the entry with the proper attachment metadata.

If a user requests to purchase the document for this entry, CL will use 003014889026 as the pacer_doc_id to retrieve it. However, this will only lead to the attachment page, which contains links to the attached documents.

Furthermore, when examining the attachment page, you'll find that CL would have bought Attachment 3 if we changed the fourth digit.
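For reference, "changing the fourth digit" amounts to the string manipulation below. According to this thread, the ID with a 0 in that position leads to the attachment page, while the same ID with a 1 leads to a PDF. This is a minimal sketch: `skip_attachment_page` is a hypothetical helper name, and the input validation is an added assumption.

```python
def skip_attachment_page(pacer_doc_id: str) -> str:
    """Return the ID with its fourth digit set to 1 (hypothetical helper)."""
    if len(pacer_doc_id) < 4 or not pacer_doc_id.isdigit():
        raise ValueError(f"unexpected pacer_doc_id: {pacer_doc_id!r}")
    return f"{pacer_doc_id[:3]}1{pacer_doc_id[4:]}"


# The ID quoted above for the first entry of case 25-1055.
print(skip_attachment_page("003014889026"))  # 003114889026
```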

On the other hand, since you brought this up, I tried buying attachments from an entry we already show as having attachments, and it didn't work. So, I think we can fix this issue by doing the following:

If someone tries to buy a recap document, but we don't know whether it's really an attachment page, we send the pacer_doc_id exactly as it is stored in our database.
If we already know they're trying to buy an attachment, we change the fourth digit of the pacer_doc_id (it's okay to do this now that we know it's an attachment) before we try to download the PDF using the download_pacer_pdf_by_rd function.

def download_pacer_pdf_by_rd(
    rd_pk: int,
    pacer_case_id: str,
    pacer_doc_id: str,
    session_data: SessionData,
    magic_number: str | None = None,
    de_seq_num: str | None = None,
) -> tuple[Response | None, str]:
    ...
    rd = RECAPDocument.objects.get(pk=rd_pk)
    pacer_court_id = map_cl_to_pacer_id(rd.docket_entry.docket.court_id)
    s = ProxyPacerSession(
        cookies=session_data.cookies, proxy=session_data.proxy_address
    )
    if is_appellate_court(pacer_court_id):
        report = AppellateDocketReport(pacer_court_id, s)
+       pacer_doc_id = (
+           pacer_doc_id
+           if not rd.attachment_number
+           else f"{pacer_doc_id[:3]}1{pacer_doc_id[4:]}"
+       )
        r, r_msg = report.download_pdf(
            pacer_doc_id=pacer_doc_id, pacer_case_id=pacer_case_id
        )
    else:
        report = FreeOpinionReport(pacer_court_id, s)
        r, r_msg = report.download_pdf(
            pacer_case_id, pacer_doc_id, magic_number, de_seq_num=de_seq_num
        )
    return r, r_msg

Let me know what you think.

albertisfu · 2025-01-23T17:41:12Z

Thanks @ERosendo for the details.

I recreated the process using the example you provided, and I believe this is not an issue:

> Furthermore, when examining the attachment page, you'll find that CL would have bought Attachment 3 if we changed the fourth digit.

This is because the pacer_doc_id shown in the docket entry on the docket report can differ from that of Attachment 1. When the fourth digit is changed from 0 to 1 to purchase it, the ID still points to the same PDF, which in this case corresponds to Attachment 3.

So, if a user buys the "main document" (which is actually Attachment 3) when the attachment data is not available, the PDF for Attachment 3 will still be correctly retrieved when changing the fourth digit from 0 to 1. However, it will initially be displayed as the "Main document" until the attachment metadata is retrieved:

Once the attachment metadata is received, that "Main document" will be converted to Attachment 3, which corresponds to the PDF previously purchased:

Do you think this can still be a problem?

I agree with your logic for fixing the pacer_doc_id so that appellate attachments can also be purchased. However, if you also agree that the behavior described above is not an issue, then perhaps changing the fourth digit from 0 to 1 could be applied in all cases. If so, this logic might be better implemented in Juriscraper?

ERosendo · 2025-01-23T18:06:46Z

@albertisfu Thanks for looking into this. However, I'm still hesitant about purchasing and using this document as the main document for the following reasons:

I'm concerned that users might be confused by the document header, which says "Document: 1-3," when they access the main document. I suspect this could lead to reports of data inconsistencies 😄
Purchasing the document and updating the docket entry would remove the "buy on pacer" button from the docket report. This could discourage CL users from uploading the attachment page, as they might assume the document is already available. Additionally, the document would appear available through the API, making it difficult for users of the /recap-fetch endpoint to identify the entry as a potential attachment page.
From an extension perspective, adding the PDF as the main document would likely display the recap icon next to the entry in the docket report, which could be misleading for extension users and might prevent them from accessing the attachment page.

I think we should try adding an error message in Juriscraper that informs users that the document might be an attachment and that they should retry the fetch request. This would help maintain data accuracy.

What do you think?

albertisfu · 2025-01-23T18:21:50Z

Yeah, those are valid concerns! I think adding an error message in Juriscraper would be helpful.
If the data retrieved is not a PDF, can we easily identify whether it’s actually an attachment page? If so, I think we can add an error message in CL in the Fetch queue like this:

This PACER document is an attachment, and we don't have the attachment metadata yet. Try purchasing the attachment page first and try again.
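One way to answer the "is it a PDF?" half of that question is to look at the start of the response body, since PDF files begin with the `%PDF-` marker. The sketch below is an assumption about how such a check could look, not a description of what Juriscraper does. `classify_download` is a hypothetical helper, and treating an HTML response as a probable attachment page is a simplification.

```python
def classify_download(content: bytes) -> str:
    """Roughly classify a response body as 'pdf', 'html', or 'unknown'."""
    # Only the first bytes matter; ignore leading whitespace and letter case.
    head = content[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

A caller could attach the attachment-page message only when the result is `"html"` and fall back to a generic failure message otherwise.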

ERosendo · 2025-01-24T21:22:38Z

@albertisfu Here's the juriscraper PR:

freelawproject/juriscraper#1309

In commit 0f6ad2e, I added a test to ensure we handle failed purchases correctly by adding the error message to the processing queue. The Juriscraper PR is just adding another error message for a different scenario.

I believe it's safe to merge this PR independently of the juriscraper PR.

albertisfu · 2025-01-24T22:24:15Z

> In commit 0f6ad2e, I added a test to ensure we handle failed purchases correctly by adding the error message to the processing queue. The Juriscraper PR is just adding another error message for a different scenario.

Thanks. I took a look at your Juriscraper PR https://github.com/freelawproject/juriscraper/pull/1309/files, which modifies the error message for failed purchases that returned an attachment page instead.

I’m not entirely sure about the error message you’re using since it mentions "our system," which refers to CourtListener. Considering that Juriscraper is widely used outside of CL, I was thinking of a more generic message like:

Unable to download PDF. An attachment page was returned instead.

This message could then be used in CL to assign the FQ message to a more detailed one for CL API users, such as the one you have in Juriscraper:

"This PACER document is part of an attachment page. "
"Our system currently lacks the metadata for this attachment. "
"Please purchase the attachment page and try again."

What do you think? If you agree, should the logic to enrich the FQ message be added in this PR?

Additionally, will the logic you suggested in #4948 (comment) to handle appellate attachment PDF purchases be included in a separate PR?
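The split proposed here, a generic message from Juriscraper that CL swaps for a more detailed one, could be sketched as follows. The two message strings are the ones quoted in this comment. `enrich_fq_message` is a hypothetical name, and matching on the exact generic text is an assumption about how the two projects would agree on the signal.

```python
GENERIC_MSG = "Unable to download PDF. An attachment page was returned instead."
DETAILED_MSG = (
    "This PACER document is part of an attachment page. "
    "Our system currently lacks the metadata for this attachment. "
    "Please purchase the attachment page and try again."
)


def enrich_fq_message(r_msg: str) -> str:
    """Swap the generic download error for the CL-specific explanation."""
    if GENERIC_MSG in r_msg:
        return DETAILED_MSG
    # Any other error message is passed through unchanged.
    return r_msg
```

Keeping the CourtListener wording on the CL side means other Juriscraper users only ever see the generic message.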

ERosendo · 2025-01-24T23:29:33Z

> should the logic to enrich the FQ message be added in this PR?

@albertisfu I like your suggestion. Let's add that logic in this PR.

> will the logic you suggested in #4948 (comment) to handle appellate attachment PDF purchases be included in a separate PR?

I apologize! I was under the impression that I had pushed those changes along with the new test.

This commit adds logic to adjust the pacer_doc_id used for purchasing PDF attachments

ERosendo · 2025-01-25T02:38:09Z

@albertisfu I've implemented your suggestions, including the error message and purchase attachment tweak. Ready for another review!

mlissner · 2025-01-26T06:58:32Z

I'm ducking out of this one and leaving it to you guys. You're doing great! If you are both happy, let's do it.

albertisfu

Thanks, @ERosendo! This looks pretty close. Just a small comment, and I think we're ready to go.

Also, I've merged the Juriscraper PR. We should just remember to do a release before appellate attachment purchases are enabled.

cl/corpus_importer/tasks.py

- Refactored download_pacer_pdf_by_rd to improve type hints for pacer_doc_id.
- Fixed the logic for handling attachment document purchases by simplifying the process of updating the fourth digit of the pacer_doc_id.

albertisfu · 2025-01-28T16:30:44Z

Thanks @ERosendo, this looks great now! Set to auto-merge.

ERosendo added 2 commits January 20, 2025 18:48

feat(recap): Enable appellate PDF purchases

8f756b8

feat(recap): Avoid trying to purchase ACMS PDFs

992f856

ERosendo marked this pull request as ready for review January 20, 2025 23:17

ERosendo requested a review from mlissner January 20, 2025 23:17

ERosendo assigned mlissner Jan 20, 2025

mlissner approved these changes Jan 22, 2025

cl/corpus_importer/tasks.py (outdated)

cl/recap/tasks.py (outdated)

mlissner requested a review from albertisfu January 22, 2025 00:44

mlissner assigned albertisfu and unassigned mlissner Jan 22, 2025

ERosendo added 2 commits January 22, 2025 11:57

feat(corpus_importer): Refines download_pacer_pdf_by_rd helper

7e7684b

This commit replaces direct court_id retrieval from the db in the appellate court check within the if statement with the existing pacer_court_id variable.

feat(recap): Add helper function to check if document is from ACMS

812ff64

albertisfu reviewed Jan 22, 2025


albertisfu assigned ERosendo and unassigned albertisfu Jan 22, 2025

tests(recap): Add test for handling failed purchase logic

0f6ad2e

feat(corpus_importer): Improve PDF purchase logic

1b184ae

This commit adds logic to adjust the pacer_doc_id used for purchasing PDF attachments

ERosendo force-pushed the 4861-feat-enable-appellate-pdf-purchases branch from 63c9b50 to 1b184ae Compare January 25, 2025 01:26

feat(corpus_importer): Improve error message for appellate purchases

1218b96

ERosendo requested a review from mlissner January 25, 2025 02:38

ERosendo assigned albertisfu and unassigned ERosendo Jan 25, 2025

mlissner removed their request for review January 26, 2025 06:58

mlissner requested a review from albertisfu January 26, 2025 06:58

albertisfu requested changes Jan 28, 2025

cl/corpus_importer/tasks.py (outdated)

ERosendo added 2 commits January 28, 2025 09:00

Merge branch 'main' into 4861-feat-enable-appellate-pdf-purchases

f572976

fix(corpus_importer): Updates download_pacer_pdf_by_rd signature

f820dc5

- Refactored download_pacer_pdf_by_rd to improve type hints for pacer_doc_id.
- Fixed the logic for handling attachment document purchases by simplifying the process of updating the fourth digit of the pacer_doc_id.

ERosendo requested a review from albertisfu January 28, 2025 14:25

Merge branch 'main' into 4861-feat-enable-appellate-pdf-purchases

b3953ac

albertisfu enabled auto-merge January 28, 2025 16:30

albertisfu approved these changes Jan 28, 2025


albertisfu merged commit 7e3a45a into main Jan 28, 2025
15 checks passed

albertisfu deleted the 4861-feat-enable-appellate-pdf-purchases branch January 28, 2025 16:39
