
Deduplication in the UP MVP #14103

Open · 7 tasks
mkalish opened this issue Apr 18, 2024 · 10 comments · May be fixed by #17143
Labels
blocked Issue State label to flag PRs and issues to show they are blocked Epic ZenHub Epic label platform Platform Team

Comments

mkalish commented Apr 18, 2024

Outcome/Objective

An MVP version of deduplication detection is implemented in the UP for ELR use cases (ORU-R01).

Context

Deduplication was originally built into the CP in response to a specific sender bug. In early July 2022, a sender (PMG) introduced a bug in their system that resulted in test results not being marked as sent, so their system started sending the same messages over and over again. This continued for a few days before we turned off auth for that sender; during that time PMG was sending >800,000 messages/day. We worked with PMG and they eventually fixed the bug.

Following that incident, we decided to implement deduplication logic to catch resends in the event a similar situation occurred in the future, the idea being that duplicate reports would be discarded without having to disable the sender entirely.

Due to a bug in the implementation, the deduplication feature never worked for the UP and needs to be re-implemented.

Furthermore, to solve the file limit issue we recently decoupled the batching step within the pipeline so that we can process messages one at a time. However, we still allow senders to send batched messages, so we will have scenarios where a sender sends a batch, some messages in that batch are processed successfully, and others error out. When we ask the sender to fix the errors and resend, they may resend the whole batch, including the messages that were already processed. To avoid processing the same messages twice, we need a way to detect the duplicates in the batch and ensure they are not processed; only the messages that needed fixing should get through.

Some receivers, such as the CA Department of Public Health, have deduplication systems in place, but not all receivers have the resources to build one. ReportStream therefore aims to give all its receivers access to the same improvement in data accuracy by deduplicating prior to delivery.

Use Cases

  • As a receiver of the UP, I would like to not receive duplicative data – so that I can have a more accurate account of data that I’m receiving.

  • As a sender to the UP, I would like to submit reports to ReportStream in the most convenient method possible, which may sometimes contain previously submitted, duplicate messages.

Scenarios

“Items” refer to the messages contained within a report. An item may contain multiple DiagnosticReports and Observations (including AOEs).
SCENARIOS SUPPORTED BY THIS DEDUPLICATION MVP

1. Deduplicating reports
This is what was historically supported by the COVID Pipeline.
Sender submits Report1 on Nov 1 and submits Report1 (with no changes to the content) on Nov 2. The deduplication feature will remove any subsequent submissions of Report1 (in this case, the one submitted on Nov 2).


2. Deduplicating items within the same report
This is what was historically supported by the COVID Pipeline.
Sender submits Report1 with items A, B, C, A within it. The deduplication feature will remove any subsequent instances of item A.
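
Scenario 2 amounts to keeping only the first instance of each item within a report. A minimal illustrative sketch (Python for illustration; `dedupe_within_report` is a hypothetical name and the built-in `hash` stands in for the real item hash):

```python
def dedupe_within_report(items: list) -> list:
    # Keep the first occurrence of each item; drop subsequent duplicates.
    seen, kept = set(), []
    for item in items:
        h = hash(item)  # stand-in for the real content-based item hash
        if h not in seen:
            seen.add(h)
            kept.append(item)
    return kept
```

For the scenario above, `["A", "B", "C", "A"]` would come out as `["A", "B", "C"]`.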


3. Deduplicating items across reports
Sender submits two reports: Report1 with items A, B, C & Report2 with items D, E, A. The deduplication feature will detect that item A is a duplicate across multiple submitted reports and remove any subsequent instances of item A (in this case, the one that was submitted within Report2). Items D and E will continue to be processed.


Example: As of Nov 2024, SimpleReport does not have a deduplication feature. An organization that uses SimpleReport’s _CSV Uploader_ adds new test result data by appending new rows to a previously submitted CSV file (new + old content) and uploads it to SimpleReport. SimpleReport will then submit this report to ReportStream (assigning a new message header for each upload instance).

SCENARIOS NOT SUPPORTED BY THIS DEDUPLICATION MVP

4. Deduplicating items across senders
Sender ABC submits Report1 [with items A, B, C]. Sender XYZ submits Report1 [with items A, B, C]. The two reports are exactly identical. A deduplication feature would detect not only all the items as duplicates, but also the reports themselves, and remove subsequent instances.


This MVP will only deduplicate on an individual sender basis.

5. Deduplicating observations across items
Sender submits Report1 with items A and B. Item A contains test result observations for COVID, flu, and RSV. Item B contains the same COVID and flu test result observations (for the same patient, test performed, time, etc.) and an additional observation for syphilis. A deduplication feature will detect that the COVID and Flu observations from Item A and B are identical and remove any subsequent instances (in this case, the ones from item B). The syphilis observation from Item B will continue to be processed by the pipeline.


6. Deduplicating across systems
ReportStream sends Report1 with items A, B, C to a Public Health Dept. AIMS sends Report2 with items D, E, A. A deduplication feature (most likely on the PHD’s side) will detect item A as a duplicate and remove subsequent instances of it.


Product Requirements

  1. Once a duplicate message is detected, prevent it from being further processed.

  2. All incoming items of a report should be compared to all items submitted in at least the past year. This excludes data submitted prior to the implementation of the deduplication feature.

  3. Duplication detection can be configured 'on' or 'off' for a specified sender (at the individual sender level, not the sender org level). The default for senders is 'on', meaning deduplication is active. However, if a sender's dedupe setting is 'off', the message contents will still be hashed in the event the configuration is toggled back to 'on' in the future.

  4. Senders shall not be able to change their deduplication configuration without intervention by, or a request through, a ReportStream admin.

  5. If an Item is determined to be a duplicate, ReportStream shall log (application log AND action log) an error that an item was rejected. The log message should be: ERR, UNKNOWN, "Duplicate message was detected and removed."

  6. For the MVP iteration, the deduplication feature will apply only to test results (ORU_R01) data types. However, one must keep in mind that there may be future iterations that implement deduplication for other data types, such as test orders (for ETOR).
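
Requirement 3 (hash even when dedupe is off) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual ReportStream implementation; the names (`SENDER_SETTINGS`, `seen_hashes`, `handle_item`) and the in-memory store are hypothetical stand-ins for sender settings and the persisted hash table.

```python
import hashlib

# Hypothetical sender settings; dedupe defaults to 'on' (requirement 3).
SENDER_SETTINGS = {"pmg.default": {"dedupe": True}, "elims.lab1": {"dedupe": False}}
seen_hashes: set = set()  # stand-in for persisted item hashes

def handle_item(sender: str, canonical_fields: str) -> str:
    h = hashlib.sha256(canonical_fields.encode("utf-8")).hexdigest()
    dedupe_on = SENDER_SETTINGS.get(sender, {}).get("dedupe", True)
    if dedupe_on and h in seen_hashes:
        # Would log: ERR, UNKNOWN, "Duplicate message was detected and removed."
        return "rejected-duplicate"
    # The hash is stored even when dedupe is 'off', so toggling it back
    # 'on' later still catches resends of earlier messages.
    seen_hashes.add(h)
    return "processed"
```

Note how a sender with dedupe 'off' never has items rejected, but its hashes are still recorded for a possible future toggle.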

Acceptance criteria

  • All the requirements above are met in staging and production
  • Run a series of tests that represent the supported scenarios above
  • Each test scenario will simulate at least 5 duplicate items/ reports being detected and removed
  • Confirm that the duplications are being logged correctly (syntax and timing)
  • Set deduplication feature ‘off’ for all senders under the ELIMS org.
  • Prior to deployment, notify Deliver and Product Mgrs of potential deduplication dependencies with active pilot sender partners.
  • Technical design and final implementation are documented

Technical Design Considerations

  • De-duplication in the COVID pipeline currently occurs in the Receive function, but that likely does not make sense for the UP. It's worth considering whether the logic in the UP should live somewhere else (e.g., the convert step).

  • The current de-duping logic needs to be reconsidered for the UP. The item hash should likely be generated from the FHIR bundle so that the logic only needs to be written once; the catch is that if mappings change, the FHIR bundle will differ even when the original message is the same.

  • What if a message fails because of a bug on our side? If the sender resent it, would hashing see it as the same message and filter it out?

  • If a value is blank or not provided, consider how the hash will represent those blank fields (so that it can compare other messages with the same blank fields).

  • Consider how to purge stored hashes once they have "expired" after a year.

  • Should MessageID uniqueness be enforced here as well?

  • In the case of SimpleReport, some unique IDs (accession number [specimen ID], CLIA number, etc.) are not guaranteed to be unique or validated due to some users of SimpleReport not having mature systems to generate and track them. If that is the case for SR, it could be true for other senders as well, making deduplication strategies less-effective or error prone. To mitigate accidentally detecting and removing non-duplicates, we recommend being more strict (comparing across many more data elements) with the deduplication criteria. This theoretically allows for more duplicates to pass through RS, but lessens the risk of accidentally removing data that is valid.

  • Verify that the deduplication criteria works for all types of ELR messages (e.g., ELIMS, RADxMARS, full-ELR) and doesn't accidentally remove non-duplicates.
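
The blank-field consideration above can be addressed by canonicalizing fields before hashing. A minimal sketch (Python for illustration; `canonicalize` and the `^EMPTY^` sentinel are assumptions, not existing ReportStream code):

```python
def canonicalize(fields: dict) -> str:
    # Sort keys and map missing/blank values to one fixed sentinel, so an
    # absent field and an empty-string field hash identically.
    parts = []
    for key in sorted(fields):
        value = fields[key]
        if value is None or str(value).strip() == "":
            parts.append(f"{key}=^EMPTY^")
        else:
            parts.append(f"{key}={str(value).strip()}")
    return "|".join(parts)
```

With this, two messages that differ only in whether a field is blank versus absent produce the same canonical string, and therefore the same hash.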

Criteria for an item (ORU-R01 message) to be considered a duplicate
Combine all of the following into a hash. Should the hashes match, remove the corresponding item(s) that correlate to the subsequent instance(s).

| Data Element | HL7 | FHIR |
| --- | --- | --- |
| Specimen ID | SPM.2 | Specimen.identifier |
| Accession number | SPM.30 (not v2.5.1 compatible) | Specimen.accessionIdentifier |
| Specimen collection date/time (if different from other date/time) | SPM.17 | Specimen.collection.collectedDateTime |
| Patient ID | PID.3 | Patient.identifier |
| Patient name | PID.5 | Patient.name |
| Patient DOB | PID.7 | Patient.birthDate, birthDate.extension[1].valueDateTime |
| Results Rpt/Status Chng – Date/Time | OBR.22 | DiagnosticReport.issued |
| Result status | OBR.25 | DiagnosticReport.status |
| Performing organization / testing facility CLIA | OBX.23 | Observation.performer -> Organization.identifier.value |
| Performing organization / testing facility name | OBX.23 | Observation.performer -> Organization.name |
| Test performed code | OBX.3.1 | Observation.resource.code.coding.code |
| Test performed code system | OBX.3.3 | Observation.resource.code.coding.system |
| Date/time of the observation (appears in multiple HL7 locations) | OBX.14, OBR.7, SPM.17 | Observation.resource.issued, DiagnosticReport.effectiveDateTime, DiagnosticReport.effectivePeriod.start |
| Observation value / test result code | OBX.5 | Observation.resource.valueCodeableConcept.coding.code |
| Observation value / test result code system | OBX.3.3 | Observation.resource.valueCodeableConcept.coding.system |
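
As an illustration of how the elements above might feed a single hash, here is a hedged Python sketch; the field names and `dedupe_hash` are hypothetical, and a real implementation would extract these values from the FHIR bundle:

```python
import hashlib

# Field keys mirroring the criteria table above (names are illustrative).
FIELDS = [
    "specimenId", "accessionNumber", "specimenCollectedDateTime",
    "patientId", "patientName", "patientDob",
    "resultStatusDateTime", "resultStatus",
    "performingOrgClia", "performingOrgName",
    "testPerformedCode", "testPerformedCodeSystem",
    "observationDateTime", "resultCode", "resultCodeSystem",
]

def dedupe_hash(item: dict) -> str:
    # Only the criteria fields participate; any other data is ignored, so
    # cosmetic differences elsewhere do not defeat deduplication.
    canonical = "|".join(f"{f}={item.get(f) or ''}" for f in FIELDS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two items that agree on every criteria field hash identically even if other content differs; changing any single criteria field changes the hash.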
@mkalish mkalish added the platform Platform Team label Apr 18, 2024
@mkalish mkalish added the ready-for-refinement Ticket is a point where we can productively discuss it label Apr 18, 2024
JFisk42 commented Apr 18, 2024

@Andrey-Glazkv

Please add your planning poker estimate with Zenhub @david-navapbc

@Andrey-Glazkv

@brandonnava please have a look - need reqs for this one

@arnejduranovic arnejduranovic added the blocked Issue State label to flag PRs and issues to show they are blocked label May 13, 2024
@brandonnava

Updated with a specific use case and a product rationale section outlining the scenario that warrants duplication detection.

@arnejduranovic arnejduranovic removed the ready-for-refinement Ticket is a point where we can productively discuss it label Jul 22, 2024
@brandonnava brandonnava changed the title Implement and re-enable duplication detection for the UP Implement and re-enable duplication detection for the UP MVP Nov 12, 2024

arnejduranovic commented Nov 14, 2024

Detection should work at the individual message level, such that if a sender submits a batch where some messages pass validation and others error out, the sender can fix the errored messages and resubmit the whole batch without the messages that succeeded the first time being sent on to the receiver a second time (the duplicates in the batch are detected and removed).

Q: What does it mean to "error out"? Do you want us to generate a WARNING or ERROR that shows up in the history API? Anything else?

Q: What is the de-duplication logic? Do we compare select fields of a message or the whole message byte for byte? Is this logic specific to a particular sender or shared by all senders? Whatever this logic is, how long of a record should we keep (should we compare to messages of all time or just last 6 months or whatever)?

@brandonnava brandonnava changed the title Implement and re-enable duplication detection for the UP MVP [TICKET WIP] Design UP deduplication strategy Nov 14, 2024
@jsutantio

Q: What does it mean to "error out"?

In the context of "some messages pass validation and others error out the sender can fix the errored messages", "error out" means that a specific item of the report cannot be processed by the pipeline (e.g., exception caught, invalid HL7 formatting, missing/ unsupported values).

Do you want us to generate a WARNING or ERROR that shows up in the history API? Anything else?

I would not necessarily classify a duplicate item as an error or warning. But it would be useful to have visibility into which items (and thus reports) are duplicates (and thus removed) so that RS admins can identify and eventually contact senders that are prone to sending duplicate data.
If it doesn't overload our logs or compromise efficiency, I would like the system to log the de-duplication of an item (plus any additional info needed for an RS admin to trace back to the original message/report).
One nice-to-have feature (out of scope) is the ability to log an error (viewable via the History API) when the ENTIRE report is a duplicate, not just portions of it. That seems like an erroneous use case, as compared to the current use case of someone re-uploading a report with partially corrected items.

Q: What is the de-duplication logic? Do we compare select fields of a message or the whole message byte for byte?

I'm not exactly clear on how the hashing works – whether it analyzes and compares an item's container size or looks at specific contents/fields. But the goal is a solution that is fast and does not use too much processing power. I assume byte-for-byte is the least efficient method. If specific fields within the item are needed, I suggest using the ones that CDPH uses, which are:

  • message id
  • patient identifier
  • test identifier/ code

Is this logic specific to a particular sender or shared by all senders?

All senders. It would be neat (but beyond our current scope) for us to be able to adjust (or turn on and off) the deduplication schema/ rules for specific senders/ receivers. Therefore, if it's simple and quick to build in that flexibility, it would be great to set the foundation for future iterations.

Whatever this logic is, how long of a record should we keep (should we compare to messages of all time or just last 6 months or whatever)?

If it doesn't severely slow down processing (e.g., each check takes less than a second), keep hashes for 7 years or for as long as RS keeps the data (whichever comes first). If it does significantly impact processing efficiency (e.g., each item takes longer than 1 sec to check for duplication), set the timeframe for comparison to the last year.
I assume there are ways to increase efficiency and reduce storage size, such as partitioning by sender and then performing deduplication.
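
The partition-by-sender idea can be sketched as follows (illustrative Python; `hashes_by_sender` and `is_duplicate` are hypothetical names, and a real store would be a database table keyed by sender rather than in-memory sets):

```python
from collections import defaultdict

# One hash set per sender, so each lookup only scans that sender's
# history; this also matches the MVP's per-sender dedup scope.
hashes_by_sender: dict = defaultdict(set)

def is_duplicate(sender: str, item_hash: str) -> bool:
    seen = hashes_by_sender[sender]
    if item_hash in seen:
        return True
    seen.add(item_hash)
    return False
```

An identical hash from a different sender is not flagged, which is exactly the per-sender scoping the MVP commits to.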

@brandonnava Feel free to add your thoughts and/ or disagreements.


arnejduranovic commented Nov 15, 2024

From your detailed post (thank you!) I am seeing the following additional requirements should get added to the Product Requirement(s) section (please confirm or deny):

CHOICE (PICK ONE):
1.A: A message shall be considered a duplicate of another message if at least its message Id, patient id, and test identifier/code match another message's.
- In HL7v2, these are the following fields:
- In FHIR, these are the following elements:
1.B: A message shall be considered a duplicate of another message if it is an exact string match of another message

  1. All incoming messages should be compared to all messages submitted in at least the past year, regardless of the sender of each message.

  2. Duplication detection logic shall run on an incoming message only if the feature is configured 'on' for that sender. However, their submitted messages shall still be taken into account for future duplication detection scenarios.

  3. If an Item is determined to be a duplicate, ReportStream shall log (application log AND action log) an error that an item was rejected. The log message should be of the form: JESSICA TODO


brandonnava commented Nov 15, 2024

Not much to add on your points, Jess; your responses pretty much align with what we took away from our meeting on this with Mo.

The one thing to add for new requirement 1 in Arnej's list: part of our brainstorm with Mo was trying to figure out whether deduping by message id, patient id, and test identifier/code would be enough to catch the case of a message resent because the first send had an error that needed fixing, or whether we also need to check another field to account for that (we mentioned MSH status or OBR.25, though neither seems quite right).

For new requirement 3, I'm wondering whether we want the feature to turn on/off based on sender or receiver. It seems like we'd end up asking receivers whether they want dupe detection on or not.

And on 4, we can talk about whether it's a warning or not. Instinctively a warning seems right, but technically we allow messages with warnings to be processed and not ones with errors; by that logic, since the message wouldn't get processed/sent, it would be an error. The flip side is that errors usually mean we want the sender to fix the message, but this isn't the kind of error that involves improving the message in any way; it just means don't send it again.

@jsutantio

Internal (to report) - this is what we have in the COVID Pipeline
Sender submits a report with items A, B, C, A within it. "Internal" de-dupe would detect that item A is a duplicate and get rid of it. "Internal" de-dupe does not detect a duplicate of item A in another submitted report.

External (to report) - this covers the proposed scenario of a sender re-submitting a file
Sender submits two reports: Report 1 with items A, B, C & Report 2 with items D, E, A. "External" de-dupe would detect that item A is a duplicate across multiple submitted reports.
Because this looks across different reports, the effort to implement the "External" de-dupe is larger than originally scoped.

External External
ReportStream sends Report 1 with items A, B, C
AIMS sends Report 2 with items D, E, A
Public Health Dpt receives duplicates of item A. Some PHDs have de-duplication systems in place (but some don't). ReportStream has no way to help with "External External" scenarios.

@jsutantio jsutantio changed the title [TICKET WIP] Design UP deduplication strategy Deduplication in the UP MVP Nov 15, 2024
@jsutantio jsutantio added Epic ZenHub Epic label and removed Epic ZenHub Epic label labels Nov 15, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP Deduplication in the UP MVP [Prod Req WIP] Nov 18, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP [Prod Req WIP] Deduplication in the UP MVP [WIP] Nov 21, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP [WIP] Deduplication in the UP MVP Nov 25, 2024
@JFisk42 JFisk42 linked a pull request Jan 24, 2025 that will close this issue
5 tasks