feat(recap): Enable appellate PDF purchases #4948

Merged
albertisfu merged 10 commits into main from 4861-feat-enable-appellate-pdf-purchases
Jan 28, 2025

@ERosendo ERosendo commented Jan 20, 2025

This PR addresses issue #4861

Key Changes:

  • Enhanced download_pacer_pdf_by_rd method: This method now incorporates a check for the court_id of the recap document. Based on the court_id, the method adjusts its behavior to ensure proper PDF downloads.

  • Adds a custom error message when trying to purchase ACMS PDFs.
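The branching described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `pick_download_route`, `APPELLATE_COURT_IDS`, the `is_acms` flag, and the error text are all hypothetical, and how an ACMS document is actually detected is left to the caller.

```python
from __future__ import annotations

# Hypothetical sketch: none of these names come from CourtListener or Juriscraper.
# Assumed set of appellate PACER court IDs (ca1..ca11, cadc, cafc).
APPELLATE_COURT_IDS = {f"ca{n}" for n in range(1, 12)} | {"cadc", "cafc"}


def pick_download_route(
    pacer_court_id: str, is_acms: bool = False
) -> tuple[str | None, str]:
    """Return (route, message); route is None when the purchase is unsupported."""
    if is_acms:
        # Assumed wording; the PR only says a custom error message is added.
        return None, "Purchases of ACMS documents are not supported yet."
    if pacer_court_id in APPELLATE_COURT_IDS:
        return "appellate", ""
    return "district", ""
```

With this sketch, `pick_download_route("ca1")` selects the appellate route, while a district court ID such as `"nysd"` falls through to the district route.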

I created a separate issue to review and explore potential methods for purchasing PDFs from ACMS courts.

@ERosendo ERosendo marked this pull request as ready for review January 20, 2025 23:17
@ERosendo ERosendo requested a review from mlissner January 20, 2025 23:17
Member

@mlissner mlissner left a comment

A couple little things from me, but I leave it to @albertisfu to do the full review. Thank you both!

cl/corpus_importer/tasks.py (outdated, resolved)
cl/recap/tasks.py (outdated, resolved)
@mlissner mlissner requested a review from albertisfu January 22, 2025 00:44
@mlissner mlissner assigned albertisfu and unassigned mlissner Jan 22, 2025
This commit replaces direct court_id retrieval from the db in the appellate court check within the if statement with the existing pacer_court_id variable.
Contributor

@albertisfu albertisfu left a comment

This looks good and works properly for purchasing appellate documents without attachments.

But I tested purchasing a main appellate document with attachments and the purchase failed. However, the issue seems to lie in Juriscraper, where the download_pdf method in AppellateDocketReport does not use the make_doc1_url method to build the document URL.
As a result, it does not change the fourth digit from 0 to 1, and instead of retrieving the PDF it retrieves the attachment page, causing the process to fail.

@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Jan 22, 2025
@ERosendo
Contributor Author

@albertisfu Thanks for the review!

> But I tested purchasing a main appellate document with attachments and the purchase failed.

I believe this behavior is correct. Appellate attachments don't have a main document. Therefore, if a user tries to use the /recap-fetch endpoint to purchase a document, and it's actually an attachment page, the request should fail. I don't think changing the fourth digit will work the way we want; I'm sure it will result in the purchase of one of the documents from the attachment page, which will then be added as the main document.

Here's an example (I uninstalled the extension to prevent the entry from being updated):

The first entry from case 25-1055 has attachments, but it's currently listed as a regular entry. This is because we haven't yet processed an upload that would update the entry with the proper attachment metadata

image

If a user requests to purchase the document for this entry, CL will use 003014889026 as the pacer_doc_id to retrieve it. However, this will only lead to the attachment page, which contains links to the attached documents.

image

Furthermore, when examining the attachment page, you'll find that CL would have bought Attachment 3 if we changed the fourth digit.


On the other hand, since you brought this up, I tried buying attachments from an entry we already show as having attachments, and it didn't work. So, I think we can fix this issue by doing the following:

  1. If someone tries to buy a recap document, but we don't know if it's really an attachment page, we send the pacer_doc_id just like it is in our database.

  2. If we already know they're trying to buy an attachment, we change the fourth digit of the pacer_doc_id (it's okay to do this now that we know it's an attachment) before we try to download the PDF using the download_pacer_pdf_by_rd function.

def download_pacer_pdf_by_rd(
    rd_pk: int,
    pacer_case_id: str,
    pacer_doc_id: str,
    session_data: SessionData,
    magic_number: str | None = None,
    de_seq_num: str | None = None,
) -> tuple[Response | None, str]:
    ...
    rd = RECAPDocument.objects.get(pk=rd_pk)
    pacer_court_id = map_cl_to_pacer_id(rd.docket_entry.docket.court_id)
    s = ProxyPacerSession(
        cookies=session_data.cookies, proxy=session_data.proxy_address
    )
    if is_appellate_court(pacer_court_id):
        report = AppellateDocketReport(pacer_court_id, s)
+       pacer_doc_id = (
+           pacer_doc_id
+           if not rd.attachment_number
+           else f"{pacer_doc_id[:3]}1{pacer_doc_id[4:]}"
+       )
        r, r_msg = report.download_pdf(
            pacer_doc_id=pacer_doc_id, pacer_case_id=pacer_case_id
        )
    else:
        report = FreeOpinionReport(pacer_court_id, s)
        r, r_msg = report.download_pdf(
            pacer_case_id, pacer_doc_id, magic_number, de_seq_num=de_seq_num
        )
    return r, r_msg
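The digit swap in the snippet above can also be pulled out into a small standalone helper, which makes it easier to test in isolation. This is a minimal sketch, assuming the ID is always handled as a string; `doc_id_for_purchase` is a hypothetical name and is not part of CourtListener or Juriscraper.

```python
from __future__ import annotations


def doc_id_for_purchase(pacer_doc_id: str, attachment_number: int | None) -> str:
    """Return the ID to send to PACER (hypothetical helper).

    The fourth digit is set to 1 only when the document is already known to
    be an attachment; otherwise the ID is passed through unchanged.
    """
    if not attachment_number:
        # Unknown or main document: use the ID exactly as stored.
        return pacer_doc_id
    # Known attachment: replace the character at index 3 with "1".
    return f"{pacer_doc_id[:3]}1{pacer_doc_id[4:]}"
```

Using the ID from the example earlier in this comment, `doc_id_for_purchase("003014889026", None)` returns `"003014889026"` unchanged, while `doc_id_for_purchase("003014889026", 3)` returns `"003114889026"`.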

let me know what you think.

@albertisfu
Contributor

Thanks @ERosendo for the details.

I recreated the process using the example you provided, and I believe this is not an issue:

> Furthermore, when examining the attachment page, you'll find that CL would have bought Attachment 3 if we changed the fourth digit.

This is because the pacer_doc_id shown for the docket entry on the docket report can differ from that of Attachment 1. When the fourth digit is changed from 0 to 1 to purchase it, it still points to the same PDF, which in this case corresponds to Attachment 3.

So, if a user buys the "main document" (which is actually Attachment 3) when the attachment data is not available, the PDF for Attachment 3 will still be correctly retrieved when changing the fourth digit from 0 to 1. However, it will initially be displayed as the "Main document" until the attachment metadata is retrieved:

Screenshot 2025-01-23 at 11 39 43 a m

Screenshot 2025-01-23 at 11 30 08 a m

Once the attachment metadata is received, that "Main document" will be converted to Attachment 3, which corresponds to the PDF previously purchased:

Screenshot 2025-01-23 at 11 32 44 a m

Do you think this can still be a problem?

I agree with your logic for fixing the pacer_doc_id so that appellate attachments can also be purchased. However, if you also agree that the behavior described above is not an issue, then perhaps changing the fourth digit from 0 to 1 could be applied in all cases. If so, this logic might be better implemented in Juriscraper?

@ERosendo
Contributor Author

@albertisfu Thanks for looking into this. However, I'm still hesitant about purchasing and using this document as the main document for the following reasons:

  1. I'm concerned that users might be confused by the document header, which says "Document: 1-3," when they access the main document. I suspect this could lead to reports of data inconsistencies 😄

  2. Purchasing the document and updating the docket entry would remove the "buy on pacer" button from the docket report. This could discourage CL users from uploading the attachment page, as they might assume the document is already available. Additionally, the document would appear available through the API, making it difficult for users of the /recap-fetch endpoint to identify the entry as a potential attachment page.

  3. From an extension perspective, adding the PDF as the main document would likely display the recap icon next to the entry in the docket report, which could be misleading for extension users and might prevent them from accessing the attachment page.

I think we should try adding an error message in Juriscraper that informs users that the document might be an attachment and that they should retry the fetch request. This would help maintain data accuracy.

What do you think?

@albertisfu
Contributor

Yeah, those are valid concerns! I think adding an error message in Juriscraper would be helpful.
If the data retrieved is not a PDF, can we easily identify whether it’s actually an attachment page? If so, I think we can add an error message in CL in the Fetch queue like this:

This PACER document is an attachment, and we don't have the attachment metadata yet. Try purchasing the attachment page first and try again.
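One cheap first check for "is the data a PDF at all" is the file header, since PDF files begin with the bytes `%PDF-`. The sketch below only separates PDF responses from HTML ones; deciding whether an HTML response is specifically an attachment page would need parsing that is not shown here. `check_purchase_response` and its messages are hypothetical, not existing Juriscraper or CourtListener code.

```python
from __future__ import annotations


def check_purchase_response(content: bytes) -> tuple[bool, str]:
    """Return (is_pdf, message) for a downloaded response body (hypothetical)."""
    if content.startswith(b"%PDF-"):
        # PDF files start with this header, so treat the purchase as successful.
        return True, ""
    if b"<html" in content[:1024].lower():
        # An HTML page came back; it may be an attachment page, but confirming
        # that would require inspecting the page itself.
        return False, "Unable to download PDF. An HTML page was returned instead."
    return False, "Unable to download PDF."
```

A caller could then attach the returned message to the fetch queue entry instead of failing silently.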

@ERosendo
Contributor Author

@albertisfu Here's the juriscraper PR:

freelawproject/juriscraper#1309

In commit 0f6ad2e, I added a test to ensure we handle failed purchases correctly by adding the error message to the processing queue. The Juriscraper PR is just adding another error message for a different scenario.

I believe it's safe to merge this PR independently of the juriscraper PR.

@albertisfu
Contributor

> In commit 0f6ad2e, I added a test to ensure we handle failed purchases correctly by adding the error message to the processing queue. The Juriscraper PR is just adding another error message for a different scenario.

Thanks. I took a look at your Juriscraper PR https://github.com/freelawproject/juriscraper/pull/1309/files which modifies the error message for failed purchases that returned an attachment page instead.

I’m not entirely sure about the error message you’re using since it mentions "our system," which refers to CourtListener. Considering that Juriscraper is widely used outside of CL, I was thinking of a more generic message like:

Unable to download PDF. An attachment page was returned instead.

This message could then be used in CL to assign the FQ message to a more detailed one for CL API users, such as the one you have in Juriscraper:

"This PACER document is part of an attachment page. "
"Our system currently lacks the metadata for this attachment. "
"Please purchase the attachment page and try again."
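A rough sketch of that mapping, using the two messages quoted in this comment. The function and constant names are hypothetical, and matching on the message text is an assumption about how CL would recognize the generic Juriscraper error.

```python
# Generic message proposed for Juriscraper (quoted from this thread).
GENERIC_ATTACHMENT_PAGE_MSG = (
    "Unable to download PDF. An attachment page was returned instead."
)

# More detailed message for CL API users (quoted from this thread).
CL_ATTACHMENT_PAGE_MSG = (
    "This PACER document is part of an attachment page. "
    "Our system currently lacks the metadata for this attachment. "
    "Please purchase the attachment page and try again."
)


def enrich_fq_message(r_msg: str) -> str:
    """Swap the generic download error for the CL-specific one (hypothetical)."""
    if GENERIC_ATTACHMENT_PAGE_MSG in r_msg:
        return CL_ATTACHMENT_PAGE_MSG
    # Any other message is passed through untouched.
    return r_msg
```

Keeping the generic text in one constant means the check only needs updating in a single place if the Juriscraper wording changes.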

What do you think? If you agree, should the logic to enrich the FQ message be added in this PR?

Additionally, will the logic you suggested in #4948 (comment) to handle appellate attachment PDF purchases be included in a separate PR?

@ERosendo
Contributor Author

> should the logic to enrich the FQ message be added in this PR?

@albertisfu I like your suggestion. Let's add that logic in this PR.

> will the logic you suggested in #4948 (comment) to handle appellate attachment PDF purchases be included in a separate PR?

I apologize! I was under the impression that I had pushed those changes along with the new test

This commit adds logic to adjust the pacer_doc_id used for purchasing PDF attachments
@ERosendo ERosendo force-pushed the 4861-feat-enable-appellate-pdf-purchases branch from 63c9b50 to 1b184ae Compare January 25, 2025 01:26
@ERosendo
Contributor Author

@albertisfu I've implemented your suggestions, including the error message and purchase attachment tweak. Ready for another review!

@ERosendo ERosendo requested a review from mlissner January 25, 2025 02:38
@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Jan 25, 2025
@mlissner mlissner removed their request for review January 26, 2025 06:58
@mlissner mlissner requested a review from albertisfu January 26, 2025 06:58
@mlissner
Member

I'm ducking out of this one and leaving it to you guys. You're doing great! If you are both happy, let's do it.

Contributor

@albertisfu albertisfu left a comment

Thanks, @ERosendo! This looks pretty close. Just a small comment, and I think we're ready to go.

Also, I've merged the Juriscraper PR. We should just remember to do a release before appellate attachment purchases are enabled.

cl/corpus_importer/tasks.py (outdated, resolved)
- Refactored download_pacer_pdf_by_rd to improve type hints for pacer_doc_id.

- Fixed the logic for handling attachment document purchases by simplifying the process of updating the fourth digit of the pacer_doc_id.
@ERosendo ERosendo requested a review from albertisfu January 28, 2025 14:25
@albertisfu
Contributor

Thanks @ERosendo this looks great now! Set to auto-merge.

@albertisfu albertisfu enabled auto-merge January 28, 2025 16:30
@albertisfu albertisfu merged commit 7e3a45a into main Jan 28, 2025
15 checks passed
@albertisfu albertisfu deleted the 4861-feat-enable-appellate-pdf-purchases branch January 28, 2025 16:39