Deduplication in the UP MVP #14103
Hey team! Please add your planning poker estimate with Zenhub @adegolier @arnejduranovic @brick-green @david-navapbc @jack-h-wang @jalbinson @mkalish @thetaurean
Hey team! Please add your planning poker estimate with Zenhub @adegolier @arnejduranovic @brick-green @jack-h-wang @jalbinson @mkalish @thetaurean
Please add your planning poker estimate with Zenhub @david-navapbc
@brandonnava please have a look - need reqs for this one
Updated with a specific use case and a product rationale section outlining the scenario that warrants duplicate detection.
Q: What does it mean to "error out"? Do you want us to generate a WARNING or ERROR that shows up in the history API? Anything else?
Q: What is the de-duplication logic? Do we compare select fields of a message or the whole message byte for byte? Is this logic specific to a particular sender or shared by all senders? Whatever this logic is, how long of a record should we keep (should we compare to messages of all time, or just the last 6 months, or whatever)?
In the context of "some messages pass validation and others error out the sender can fix the errored messages", "error out" means that a specific item of the report cannot be processed by the pipeline (e.g., exception caught, invalid HL7 formatting, missing/unsupported values).
I would not necessarily classify a duplicate item as an error or warning. But it would be useful to have visibility into which items (and thus reports) are duplicates (and thus removed), so that RS admins can identify and eventually contact senders that are prone to sending duplicate data.
I'm not exactly clear on "how" hashing works, whether it analyzes and compares an item's container size or looks at the specific contents/fields. But the goal is to find a solution that does not use too much processing power and is fast. I assume byte for byte is the least efficient method. And if specific fields within the item are needed, I suggest using the ones that CDPH uses, which are:
All senders. It would be neat (but beyond our current scope) to be able to adjust (or turn on and off) the deduplication schema/rules for specific senders/receivers. Therefore, if it's simple and quick to build in that flexibility, it would be great to set the foundation for future iterations.
If it doesn't severely increase processing time (e.g., adds less than a second), keep hashes for 7 years or as long as RS keeps the data (whichever comes first). If it does significantly impact processing efficiency (e.g., each item takes longer than 1 sec to check for duplication), set the timeframe for comparison to the last year. @brandonnava Feel free to add your thoughts and/or disagreements.
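For reference, a minimal sketch of what field-based hashing could look like (a hypothetical helper, not ReportStream's current implementation). SHA-256 over a handful of normalized field values takes microseconds per item, so the per-item cost is dominated by looking up the stored digest, not the hashing itself; byte-for-byte comparison of full messages is never needed, because only fixed-size digests are stored and compared.

```kotlin
import java.security.MessageDigest

// Sketch of field-based item hashing. The field list is illustrative;
// the actual criteria are still being decided in this issue.
fun itemHash(fields: List<String?>): String {
    // Normalize blanks so a missing value and an empty string compare equal,
    // and join with a separator that cannot appear inside the values.
    val canonical = fields.joinToString("\u0000") { it?.trim()?.lowercase() ?: "" }
    return MessageDigest.getInstance("SHA-256")
        .digest(canonical.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
}
```

Lowercasing may be too aggressive for case-sensitive identifiers; that is a normalization choice to settle alongside the field list.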
From your detailed post (thank you!) I see the following additional requirements that should be added, plus a CHOICE (PICK ONE):
Not much to add on your points, Jess; your responses pretty much align with what we took away from our meeting on this with Mo. The one thing to add for new requirement 1 in Arnej's list: part of our brainstorm with Mo was trying to figure out whether deduping by message ID, patient ID, and test identifier/code would be enough to catch a message that was resent because the first send had an error that needed fixing, or whether we also need to check another field to account for that use case (I know we mentioned MSH status or OBR.25, though it doesn't seem like either of those is quite right). For new requirement 3, I'm wondering whether we want the feature to turn on/off based on sender or receiver. It seems we'd end up asking receivers whether they want dupe detection on or not. And on 4, we can talk about whether it's a warning or not. On instinct a warning seems right, but technically don't we allow messages with warnings to be processed and not ones with errors? By that logic, since the message wouldn't get processed/sent, it would be an error. The flip side is that you usually want to tell the sender about errors so they can fix the message, but this isn't a type of error that involves improving the message in any way; it just means don't send it again.
Internal (to report) - this is what we have in the COVID Pipeline
External (to report) - this covers the proposed scenario of a sender re-submitting a file
External
External
Outcome/Objective
An MVP version of duplicate detection is implemented in the UP for ELR use cases (ORU-R01).
Context
Deduplication was originally built into the CP in response to a specific sender bug. At the beginning of July 2022, a sender (PMG) introduced a bug in their system that resulted in test results not being marked as sent, so their system started sending the same messages over and over again. This continued for a few days before we turned off auth for that sender. In those few days PMG was sending >800,000 messages/day. We worked with PMG and they eventually fixed the bug.
In follow-up to that incident, we decided to implement deduplication logic to catch resends in the event a similar situation occurred in the future, the idea being that duplicate reports would be discarded without the need to completely disable the sender.
Due to a bug in the implementation, the deduplication feature never worked for the UP and needs to be re-implemented.
Furthermore, to solve the file limit issue, we recently decoupled the batching step within the pipeline so that we can process messages one at a time. However, we still allow senders to send batched messages. Thus we will have scenarios where a sender sends a batch, some messages in that batch are good and get processed, and others error out. When we ask the sender to fix the errors and resend, if they resend in batch, they could resend the whole batch, including the messages that were already good and processed. To avoid processing the same messages twice, we need a way to detect the duplicate messages in the batch and ensure the duplicates are not processed; only the messages that needed fixing should get through.
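As a sketch of that batch case (reusing the hypothetical itemHash helper sketched earlier in this thread, with seenHashes standing in for whatever hash store the pipeline keeps), only items whose hash has not already been recorded flow on:

```kotlin
// Drop items already processed on a previous submission, so that only the
// corrected items in a resent batch get through. `seenHashes` is a stand-in
// for the pipeline's real hash store (hypothetical).
fun filterNewItems(batch: List<String>, seenHashes: Set<String>): List<String> =
    batch.filter { item -> itemHash(listOf(item)) !in seenHashes }
```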
Some receivers, such as the CA Department of Public Health, have deduplication systems in place, but not all receivers are fortunate enough to have the same resources. ReportStream therefore aims to give all receivers access to improved data accuracy by deduplicating prior to delivery.
Use Cases
As a receiver of the UP, I would like to not receive duplicative data, so that I can have a more accurate account of the data I'm receiving.
As a sender to the UP, I would like to submit reports to ReportStream in the most convenient way possible, even if those reports sometimes contain previously submitted, duplicate messages.
Scenarios
“Items” refer to the messages contained within a report. An item may contain multiple DiagnosticReports and Observations (including AOEs).
SCENARIOS SUPPORTED BY THIS DEDUPLICATION MVP
Example: As of Nov 2024, SimpleReport does not have a deduplication feature. An organization that uses SimpleReport's _CSV Uploader_ adds new test result data by appending new rows to a previously submitted CSV file (new + old content) and uploads it to SimpleReport. SimpleReport then submits this report to ReportStream (assigning a new message header for each upload instance).
SCENARIOS NOT SUPPORTED BY THIS DEDUPLICATION MVP
This MVP will only deduplicate on an individual sender basis.
Product Requirements
Once a duplicate message is detected, prevent it from being further processed.
All incoming items of a report should be compared to all items submitted in at least the past year. This excludes data submitted prior to the implementation of the deduplication feature.
Duplicate detection can be configured 'on' or 'off' for a specified sender (at the individual sender level, not the sender org level). The default for senders is 'on', meaning deduplication is active. However, if a sender's dedupe setting is configured 'off', the message contents will still be hashed in the event the configuration is toggled back to 'on' in the future (see the configuration sketch after these requirements).
Senders shall not be able to change their deduplication configuration without intervention by, or a request through, a ReportStream admin.
If an item is determined to be a duplicate, ReportStream shall log (application log AND action log) an error noting that the item was rejected. The log message should be: ERR, UNKNOWN, "Duplicate message was detected and removed."
For the MVP iteration, the deduplication feature will apply only to test result (ORU_R01) data. However, keep in mind that future iterations may implement deduplication for other data types, such as test orders (for ETOR).
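A minimal sketch of the per-sender toggle requirement above (names are hypothetical, and it reuses the itemHash sketch from earlier in this thread). Note that the hash is recorded even when the sender's deduplication is off, per the requirement that history not be lost if the setting is later toggled back on:

```kotlin
// Hypothetical per-sender setting; deduplication defaults to 'on'.
data class SenderSettings(val deduplicate: Boolean = true)

// Returns true if the item should continue through the pipeline.
fun processItem(item: String, settings: SenderSettings, store: MutableSet<String>): Boolean {
    val hash = itemHash(listOf(item))
    // add() returns false when the hash was already present, i.e. a duplicate.
    // The hash is always recorded, regardless of the toggle.
    val isDuplicate = !store.add(hash)
    // Only reject duplicates when the feature is on for this sender.
    return !(settings.deduplicate && isDuplicate)
}
```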
Acceptance criteria
Technical Design Considerations
De-duplication in the COVID pipeline currently occurs in the Receive function, but that likely does not make sense for the UP. It's worth considering whether the logic in the UP should live somewhere else (e.g., the convert step).
The current logic for de-duping needs to be reconsidered for the UP. Likely the item hash should be generated from the FHIR bundle so that the logic only needs to be written once; the issue is that if mappings change, the FHIR bundle will differ even if the original message is the same (see the bundle sketch after these considerations).
What if a message fails because of a bug on our side? If the sender resends it, would hashing see it as the same message and filter it out?
If a value is blank or not provided, consider how the hash will represent those blank fields (so that it compares equal to other messages with the same blank fields).
Consider how to purge stored hashes once they have "expired" after a year (see the purge sketch after these considerations).
Should MessageID uniqueness be enforced here as well?
In the case of SimpleReport, some unique IDs (accession number [specimen ID], CLIA number, etc.) are not guaranteed to be unique or validated, because some users of SimpleReport do not have mature systems to generate and track them. If that is the case for SR, it could be true for other senders as well, making deduplication strategies less effective or error-prone. To mitigate accidentally detecting and removing non-duplicates, we recommend being more strict with the deduplication criteria (comparing across many more data elements). This theoretically allows more duplicates to pass through RS, but lessens the risk of accidentally removing valid data.
Verify that the deduplication criteria works for all types of ELR messages (e.g., ELIMS, RADxMARS, full-ELR) and doesn't accidentally remove non-duplicates.
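If the hash is generated from the FHIR bundle (as raised above), extraction might look like the sketch below, using HAPI FHIR R4 types and the two DiagnosticReport fields already named in this issue; treat the field choice as illustrative. The mapping-drift concern stands either way: this hash changes whenever the sender-to-FHIR mapping changes, even if the original message does not.

```kotlin
import org.hl7.fhir.r4.model.Bundle
import org.hl7.fhir.r4.model.DiagnosticReport

// Pull hashable values out of a converted FHIR bundle. effective[x] is a
// choice type, so at most one of the two branches is populated per report.
fun bundleHashInputs(bundle: Bundle): List<String?> =
    bundle.entry.map { it.resource }
        .filterIsInstance<DiagnosticReport>()
        .flatMap { dr ->
            listOf(
                if (dr.hasEffectiveDateTimeType()) dr.effectiveDateTimeType.valueAsString else null,
                if (dr.hasEffectivePeriod()) dr.effectivePeriod.startElement.valueAsString else null
            )
        }
```

These values would feed the same itemHash helper sketched earlier; nulls are normalized there, which also addresses the blank-field consideration above.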
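For the purge consideration, a scheduled delete is probably sufficient; a sketch assuming a PostgreSQL item_hash table with a created_at column (both names hypothetical):

```kotlin
import java.sql.Connection

// Delete hashes older than the retention window. An index on created_at
// keeps this cheap; run it on a schedule (e.g., nightly).
fun purgeExpiredHashes(conn: Connection, retentionDays: Int = 365) {
    conn.prepareStatement(
        "DELETE FROM item_hash WHERE created_at < now() - make_interval(days => ?)"
    ).use { stmt ->
        stmt.setInt(1, retentionDays)
        stmt.executeUpdate()
    }
}
```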
Criteria for an item (ORU-R01 message) to be considered a duplicate
Combine all the following into a hash. Should the hashes match, remove the corresponding item(s) that correlate to the subsequent instance(s).
DiagnosticReport.effectiveDateTime
DiagnosticReport.effectivePeriod.start
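A sketch of the "remove subsequent instances" rule, reusing the hypothetical itemHash helper from earlier: the first item with a given criteria hash is kept, and every later repeat is dropped. Each item is represented here simply as its list of criteria field values (an assumed shape, not the pipeline's actual item type):

```kotlin
// Keep the first occurrence of each criteria hash; drop later repeats.
fun dropSubsequentDuplicates(items: List<List<String?>>): List<List<String?>> {
    val seen = HashSet<String>()
    // Set.add() returns false for a hash that was already seen.
    return items.filter { seen.add(itemHash(it)) }
}
```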