Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge "sibling" RSS item that belong to a multi-entry docket item #217

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 39 additions & 5 deletions juriscraper/pacer/rss_feeds.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
from ..lib.html_utils import html_unescape
from ..lib.log_tools import make_default_logger
from ..lib.string_utils import harmonize, clean_string
from ..lib.utils import previous_and_next

logger = make_default_logger()

Expand Down Expand Up @@ -98,19 +99,52 @@ def _parse_text(self, text):

@property
def data(self):
"""Override this to create a list of docket-like objects instead of the
usual dict that is usually provided by the docket report.
"""Return a list of docket-like objects instead of the usual dict that
is usually provided by the BaseDocketReport superclass.

When CMECF generates the RSS feed, it breaks up items with
multiple consecutive entries into multiple RSS items with
identical timestamp/id/title. We reverse that and recombine
those items.
"""
if self._data is not None:
return self._data

data_list = []
for entry in self.feed.entries:
for previous_entry, entry, next_entry in previous_and_next(
self.feed.entries):
data = self.metadata(entry)

# We are guaranteed to only have a single docket entry for each
# RSS item, and thus we use data['docket_entries'][0] below.
# Coming up with an alternative data representation here and
# then transforming it into what CL expects after we're done
# iterating over the list is just not worth the bother.
data[u'docket_entries'] = self.docket_entries(entry)
# BUT: Guarantee this condition persists into the future:
assert len(data[u'docket_entries']) <= 1

# If this entry and the immediately prior entry match
# in metadata, then add the current description to
# the previous entry's and continue the loop.
if (
data_list and data_list[-1][u'docket_entries']
and data[u'docket_entries']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like these two lines are of a different nature than the rest of the equality lines. Is it worth pulling them out of the if statement here? For example, maintain the logic, but do these types of tests in an earlier if statement? Something like:

if data_list and data_list[-1][u'docket_entries'] and data[u'docket_entries']:
    if all([entry.title == previous_entry.title, ...]):

It might allow you to put the comments about equality closer to where the equality is tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err...I think it is advantageous to not materially increase the nesting level of logic with gratuitous nested ifs.
The comments didn't seem to me to be far removed from equality — to me they made sense before the word if.

Perhaps your reaction is to the fact that the comments don't discuss the first 2 lines of the if expressions…that's because I was focusing on what was maybe a little specialized. But certainly the comment could be broadened to discus the first 2 lines, i.e. "If there are docket entries associated with the previous entry and the current entry, and if those two entries match in metdata…"

But also, there's nothing stopping us from putting comments in the midst of the if expressions, so if there's really a desire for "Instructions at the point of need" then that can just happen, there's no need to change the structure of the if statement.

So if you want to change the structure of the if, probably a better reason is required. I actually did try the all() style and concluded erroneously that it didn't work here (I think I omitted the []), but that said it's not terribly familiar to me so I wouldn't vote for it though I don't oppose it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the extraneous nesting bugged me too. If you pull all this logic into a function, you could reverse the logic here and do:

if not all([this, that, the other thing]):
    return  # Or what have you

That'd avoid the deep logic and keep the two things apart?

and entry.title == previous_entry.title
and entry.link == previous_entry.link
and entry.id == previous_entry.id
and entry.published == previous_entry.published
):
data_list[-1][u'docket_entries'][0][u'short_description'] += (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like a line like this deserves a comment, if even a short one. Or maybe break the bits into variables for readability?

previous_entry_desc = data_list[-1][u'docket_entries'][0][u'short_description']
current_entry_desc = data[u'docket_entries'][0][u'short_description']
combined_desc = ' AND '.join([previous_entry_desc, current_entry_desc])
data_list[-1][u'docket_entries'][0][u'short_description'] = combined_desc

That's a bit wordy, but clearer. This multiline thing is pretty bad.

Copy link
Contributor Author

@johnhawkinson johnhawkinson May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think that is clearer.
Forcing the repetition of data_list[-1][u'docket_entries'][0][u'short_description'] is something I would like to avoid, because it requires the reader to parse the 5 levels of hierarchy in their brain twice, and it's not obvious until inspection that they are the same. Edit: esp. when they're not consecutive.

Of course, the flipside is that I suspect you have an unfamiliarity with the += operator and do not love it the way you should, and no amount of cajoling will instill that love in you until you have worked with it for a long time. To my eye the avoidance of complex repetition is the strongest case for the so-called "addition assignment" operator. (The real problem is you can't do many of the usual fixes because the complexity is in an lvalue not just an rvalue.

And I definitely don't think that ' AND '.join([a, b])
is somehow clearer than a+' AND '+b.

I'll think about how to write this better.

At one point I had prevdata (or previous_data) instead of data_list[-1] but I axed it because I felt it wasn't helping. (d59e007)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm familiar with the +=. It's more about how you have three lines of code to do one small thing. That's a smell, but I'll await further approaches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me:

I suspect A and B

Mike:

Not A.

So B then. As I feared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I suspect you won't go for this one, but it kind of jumped out at me so I thought I'd offer it here:

    def entry(d):
        return d[u'docket_entries'][0]

    def get_append_shorttext(de, text=''):
        de[u'short_description'] += text
        return de[u'short_description']

    previous_entry = entry(data_list[-1])
    current_text = get_append_shorttext(entry(data))
    get_append_shorttext(previous_entry, ' AND ' + current_text)

Think of it as a conversation-starter :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's a lot of indirection in there, and get_append_shorttext is either getting or setting depending on the text argument....not great.

Hm. What about something like:

previous_entry = data_list[-1][u'docket_entries'][0]
current_entry = data[u'docket_entries'][0]
previous_entry[u'short_description'] += ' AND ' + current_entry[u'short_description']

That gets us back down to three lines, avoids repetition of the 5 levels of hierarchy, and uses += instead of a join? No additional helper functions needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, while I'm functionally happen with that, it fails PEP-8. And it's actually kind of tricky to fix, and it's why I introduced the helper function, which I agree is totally ridiculous and I'm not sure I could dignify by calling it a strawman:

#        1         2         3         4         5         6         7         8
#2345678901234567890123456789012345678901234567890123456789012345678901234567890
                previous_entry[u'short_description'] += ' AND ' +current_entry[u'short_description']  #1
                previous_entry[u'short_description'] += (  #2
                    ' AND ' + current_entry[u'short_description'])
                previous_entry[u'short_description'] += (  #3
                    ' AND ' +
                    current_entry[u'short_description'])
                previous_entry[u'short_description'] += (  #4
                    ' AND ' +
                    current_entry[u'short_description']
                )
                previous_entry[u'short_description'] += \ 
                    ' AND ' + current_entry[u'short_description']  #5

Although I find 5 tempting, I think that backslash terminates are disfavored (although not against PEP8?).
And of 2,3,4 I dunno. I had the sense you wanted to avoid multiple lines for this expression, and yet there's a certain additional clarity. What's your preference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 5 has my vote followed closely by 1 (a minor PEP8 violation isn't the end of the world in my book). 2, 3, and 4 just don't seem worth it. I leave the rest of this to you. Let's land this.

' AND ' +
data[u'docket_entries'][0][u'short_description'])
continue

data[u'parties'] = None
data[u'docket_entries'] = self.docket_entries(entry)
if data[u'docket_entries'] and data['docket_number']:
if data[u'docket_entries'] and data[u'docket_number']:
data_list.append(data)

self._data = data_list
return data_list

Expand Down Expand Up @@ -146,7 +180,7 @@ def docket_entries(self, entry):
u'date_filed': date(*entry.published_parsed[:3]),
u'document_number': self._get_value(self.document_number_regex,
entry.summary),
u'description': '',
u'description': u'',
u'short_description': html_unescape(
self._get_value(self.short_desc_regex, entry.summary)),
}
Expand Down
29 changes: 1 addition & 28 deletions tests/examples/pacer/rss_feeds/nysb_1.json
Original file line number Diff line number Diff line change
Expand Up @@ -717,34 +717,7 @@
"description": "",
"document_number": "47",
"pacer_doc_id": "126018830304",
"short_description": "Motion, Redact (Fee) (NOT to be used for redacting in Transcripts)"
}
],
"docket_number": "16-35015",
"jurisdiction": "",
"jury_demand": "",
"nature_of_suit": "",
"pacer_case_id": "263474",
"parties": null,
"referred_to_str": ""
},
{
"assigned_to_str": "",
"case_name": "Angela S. Bittencourt",
"cause": "",
"court_id": "nysb",
"date_converted": null,
"date_discharged": null,
"date_filed": null,
"date_terminated": null,
"demand": "",
"docket_entries": [
{
"date_filed": "2018-04-19",
"description": "",
"document_number": "47",
"pacer_doc_id": "126018830304",
"short_description": "Motion, Redact (Fee) (NOT to be used for redacting in Transcripts)"
"short_description": "Motion, Redact (Fee) (NOT to be used for redacting in Transcripts) AND Motion, Redact (Fee) (NOT to be used for redacting in Transcripts)"
}
],
"docket_number": "16-35015",
Expand Down
Loading