
Deduplication in the UP MVP #14103

Open · 7 tasks
mkalish opened this issue Apr 18, 2024 · 10 comments · May be fixed by #17143
Labels
blocked Issue State label to flag PRs and issues to show they are blocked Epic ZenHub Epic label platform Platform Team

Comments

mkalish commented Apr 18, 2024

Outcome/Objective

An MVP version of deduplication detection is implemented in the UP for ELR use cases (ORU-R01).

Context

Deduplication was originally built into the CP in response to a specific sender bug. In early July 2022, a sender (PMG) introduced a bug in their system that resulted in test results not being marked as sent, so their system started sending the same messages over and over again. This continued for a few days before we turned off auth for that sender; during that time PMG was sending >800,000 messages/day. We worked with PMG and they eventually fixed the bug.

Following that incident, we decided to implement deduplication logic to catch resends in the event a similar situation occurred in the future, the idea being that duplicate reports would be discarded without having to disable the sender entirely.

Due to a bug in the implementation, the deduplication feature never worked for the UP and needs to be re-implemented.

Furthermore, to solve the file limit issue we recently decoupled the batching step within the pipeline so that we can process messages one at a time. However, we still allow senders to send batched messages, so we will have scenarios where a sender sends a batch, some messages in that batch are processed successfully, and others error out. When we ask the sender to fix the errors and resend, they may resend the whole batch, including the messages that were already processed. To avoid processing the same messages twice, we need a way to detect the duplicates in the batch and ensure they are not processed; only the messages that needed fixing should get through.

Some receivers, such as the CA Department of Public Health, have deduplication systems in place, but not all receivers have the resources to build one. ReportStream therefore aims to give all its receivers access to the same improvement in data accuracy by deduplicating prior to delivery.

Use Cases

  • As a receiver of the UP, I would like to not receive duplicative data – so that I can have a more accurate account of data that I’m receiving.

  • As a sender to the UP, I would like to submit reports to ReportStream in the most convenient method possible, which may sometimes contain previously submitted, duplicate messages.

Scenarios

“Items” refer to the messages contained within a report. An item may contain multiple DiagnosticReports and Observations (including AOEs).
SCENARIOS SUPPORTED BY THIS DEDUPLICATION MVP

1. Deduplicating reports
This is what was historically supported by the COVID Pipeline.
Sender submits Report1 on Nov 1 and submits Report1 (with no changes to the content) on Nov 2. The deduplication feature will remove any subsequent submissions of Report1 (in this case, the one submitted on Nov 2).


2. Deduplicating items within the same report
This is what was historically supported by the COVID Pipeline.
Sender submits Report1 with items A, B, C, A within it. The deduplication feature will remove any subsequent instances of item A.
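
Scenario 2 amounts to keeping only the first instance of each item within a report. A minimal illustrative sketch (Python for illustration; `dedupe_within_report` is a hypothetical name and the built-in `hash` stands in for the real item hash):

```python
def dedupe_within_report(items: list) -> list:
    # Keep the first occurrence of each item; drop subsequent duplicates.
    seen, kept = set(), []
    for item in items:
        h = hash(item)  # stand-in for the real content-based item hash
        if h not in seen:
            seen.add(h)
            kept.append(item)
    return kept
```

For the scenario above, `["A", "B", "C", "A"]` would come out as `["A", "B", "C"]`.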


3. Deduplicating items across reports
Sender submits two reports: Report1 with items A, B, C & Report2 with items D, E, A. The deduplication feature will detect that item A is a duplicate across multiple submitted reports and remove any subsequent instances of item A (in this case, the one that was submitted within Report2). Items D and E will continue to be processed.


Example: As of Nov 2024, SimpleReport does not have a deduplication feature. An organization that uses SimpleReport’s _CSV Uploader_ adds new test result data by appending new rows to a previously submitted CSV file (new + old content) and uploads it to SimpleReport. SimpleReport will then submit this report to ReportStream (assigning a new message header for each upload instance).

SCENARIOS NOT SUPPORTED BY THIS DEDUPLICATION MVP

4. Deduplicating items across senders
Sender ABC submits Report1 [with items A, B, C]. Sender XYZ submits Report1 [with items A, B, C]. The two reports are exactly identical. A deduplication feature would detect not only all the items as duplicates, but also the reports themselves, and remove subsequent instances.


This MVP will only deduplicate on an individual sender basis.

5. Deduplicating observations across items
Sender submits Report1 with items A and B. Item A contains test result observations for COVID, flu, and RSV. Item B contains the same COVID and flu test result observations (for the same patient, test performed, time, etc.) and an additional observation for syphilis. A deduplication feature will detect that the COVID and Flu observations from Item A and B are identical and remove any subsequent instances (in this case, the ones from item B). The syphilis observation from Item B will continue to be processed by the pipeline.


6. Deduplicating across systems
ReportStream sends Report1 with items A, B, C to a Public Health Dept. AIMS sends Report2 with items D, E, A. A deduplication feature (most likely on the PHD’s side) will detect item A as a duplicate and remove subsequent instances of it.


Product Requirements

  1. Once a duplicate message is detected, prevent it from being further processed.

  2. All incoming items of a report should be compared to all items submitted in at least the past year. This excludes data submitted prior to the implementation of the deduplication feature.

  3. Duplication detection can be configured 'on' or 'off' for a specified sender (at the individual sender level, not the sender org level). The default for senders is 'on', meaning deduplication is active. However, if a sender's dedupe setting is 'off', the message contents will still be hashed in the event the configuration is toggled back to 'on' in the future.

  4. Senders shall not be able to change their deduplication configuration without intervention by, or a request through, a ReportStream admin.

  5. If an Item is determined to be a duplicate, ReportStream shall log (application log AND action log) an error that an item was rejected. The log message should be: ERR, UNKNOWN, "Duplicate message was detected and removed."

  6. For the MVP iteration, the deduplication feature will apply only to test results (ORU_R01) data types. However, one must keep in mind that there may be future iterations that implement deduplication for other data types, such as test orders (for ETOR).
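
Requirement 3 (hash even when dedupe is off) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual ReportStream implementation; the names (`SENDER_SETTINGS`, `seen_hashes`, `handle_item`) and the in-memory store are hypothetical stand-ins for sender settings and the persisted hash table.

```python
import hashlib

# Hypothetical sender settings; dedupe defaults to 'on' (requirement 3).
SENDER_SETTINGS = {"pmg.default": {"dedupe": True}, "elims.lab1": {"dedupe": False}}
seen_hashes: set = set()  # stand-in for persisted item hashes

def handle_item(sender: str, canonical_fields: str) -> str:
    h = hashlib.sha256(canonical_fields.encode("utf-8")).hexdigest()
    dedupe_on = SENDER_SETTINGS.get(sender, {}).get("dedupe", True)
    if dedupe_on and h in seen_hashes:
        # Would log: ERR, UNKNOWN, "Duplicate message was detected and removed."
        return "rejected-duplicate"
    # The hash is stored even when dedupe is 'off', so toggling it back
    # 'on' later still catches resends of earlier messages.
    seen_hashes.add(h)
    return "processed"
```

Note how a sender with dedupe 'off' never has items rejected, but its hashes are still recorded for a possible future toggle.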

Acceptance criteria

  • All the requirements above are met in staging and production
  • Run a series of tests that represent the supported scenarios above
  • Each test scenario will simulate at least 5 duplicate items/ reports being detected and removed
  • Confirm that the duplications are being logged correctly (syntax and timing)
  • Set deduplication feature ‘off’ for all senders under the ELIMS org.
  • Prior to deployment, notify Deliver and Product Mgrs of potential deduplication dependencies with active pilot sender partners.
  • Technical design and final implementation are documented

Technical Design Considerations

  • De-duplication in the COVID pipeline currently occurs in the Receive function, but that likely does not make sense for the UP. It's worth considering whether the logic in the UP should live somewhere else (e.g., the convert step).

  • The current de-duping logic needs to be reconsidered for the UP. The item hash should likely be generated from the FHIR bundle so that the logic only needs to be written once; the catch is that if mappings change, the FHIR bundle will differ even when the original message is the same.

  • What if a message fails because of a bug on our side? If the sender resent it, would hashing see it as the same message and filter it out?

  • If a value is blank or not provided, consider how the hash will represent those blank fields (so that it can compare other messages with the same blank fields).

  • Consider how to purge stored hashes once they have "expired" after a year.

  • Should MessageID uniqueness be enforced here as well?

  • In the case of SimpleReport, some unique IDs (accession number [specimen ID], CLIA number, etc.) are not guaranteed to be unique or validated due to some users of SimpleReport not having mature systems to generate and track them. If that is the case for SR, it could be true for other senders as well, making deduplication strategies less-effective or error prone. To mitigate accidentally detecting and removing non-duplicates, we recommend being more strict (comparing across many more data elements) with the deduplication criteria. This theoretically allows for more duplicates to pass through RS, but lessens the risk of accidentally removing data that is valid.

  • Verify that the deduplication criteria works for all types of ELR messages (e.g., ELIMS, RADxMARS, full-ELR) and doesn't accidentally remove non-duplicates.
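
The blank-field consideration above can be addressed by canonicalizing fields before hashing. A minimal sketch (Python for illustration; `canonicalize` and the `^EMPTY^` sentinel are assumptions, not existing ReportStream code):

```python
def canonicalize(fields: dict) -> str:
    # Sort keys and map missing/blank values to one fixed sentinel, so an
    # absent field and an empty-string field hash identically.
    parts = []
    for key in sorted(fields):
        value = fields[key]
        if value is None or str(value).strip() == "":
            parts.append(f"{key}=^EMPTY^")
        else:
            parts.append(f"{key}={str(value).strip()}")
    return "|".join(parts)
```

With this, two messages that differ only in whether a field is blank versus absent produce the same canonical string, and therefore the same hash.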

Criteria for an item (ORU-R01 message) to be considered a duplicate
Combine all of the following into a hash. Should the hashes match, remove the corresponding item(s) that correlate to the subsequent instance(s).

| Data Element | HL7 | FHIR |
| --- | --- | --- |
| Specimen ID | SPM.2 | Specimen.identifier |
| Accession number | SPM.30 (not v2.5.1 compatible) | Specimen.accessionIdentifier |
| Specimen collection date/time (if different from other date/time) | SPM.17 | Specimen.collection.collectedDateTime |
| Patient ID | PID.3 | Patient.identifier |
| Patient name | PID.5 | Patient.name |
| Patient DOB | PID.7 | Patient.birthDate, birthDate.extension[1].valueDateTime |
| Results Rpt/Status Chng – Date/Time | OBR.22 | DiagnosticReport.issued |
| Result status | OBR.25 | DiagnosticReport.status |
| Performing organization / testing facility CLIA | OBX.23 | Observation.performer -> Organization.identifier.value |
| Performing organization / testing facility name | OBX.23 | Observation.performer -> Organization.name |
| Test performed code | OBX.3.1 | Observation.resource.code.coding.code |
| Test performed code system | OBX.3.3 | Observation.resource.code.coding.system |
| Date/time of the observation (appears in multiple HL7 locations) | OBX.14, OBR.7, SPM.17 | Observation.resource.issued, DiagnosticReport.effectiveDateTime, DiagnosticReport.effectivePeriod.start |
| Observation value / test result code | OBX.5 | Observation.resource.valueCodeableConcept.coding.code |
| Observation value / test result code system | OBX.3.3 | Observation.resource.valueCodeableConcept.coding.system |
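
As an illustration of how the elements above might feed a single hash, here is a hedged Python sketch; the field names and `dedupe_hash` are hypothetical, and a real implementation would extract these values from the FHIR bundle:

```python
import hashlib

# Field keys mirroring the criteria table above (names are illustrative).
FIELDS = [
    "specimenId", "accessionNumber", "specimenCollectedDateTime",
    "patientId", "patientName", "patientDob",
    "resultStatusDateTime", "resultStatus",
    "performingOrgClia", "performingOrgName",
    "testPerformedCode", "testPerformedCodeSystem",
    "observationDateTime", "resultCode", "resultCodeSystem",
]

def dedupe_hash(item: dict) -> str:
    # Only the criteria fields participate; any other data is ignored, so
    # cosmetic differences elsewhere do not defeat deduplication.
    canonical = "|".join(f"{f}={item.get(f) or ''}" for f in FIELDS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two items that agree on every criteria field hash identically even if other content differs; changing any single criteria field changes the hash.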
@mkalish mkalish added the platform Platform Team label Apr 18, 2024
@mkalish mkalish added the ready-for-refinement Ticket is a point where we can productively discuss it label Apr 18, 2024
JFisk42 commented Apr 18, 2024

@Andrey-Glazkv

Please add your planning poker estimate with Zenhub @david-navapbc

@Andrey-Glazkv

@brandonnava please have a look - need reqs for this one

@arnejduranovic arnejduranovic added the blocked Issue State label to flag PRs and issues to show they are blocked label May 13, 2024
@brandonnava

Updated with a specific use case and a product rationale section outlining the scenario that warrants duplication detection.

@arnejduranovic arnejduranovic removed the ready-for-refinement Ticket is a point where we can productively discuss it label Jul 22, 2024
@brandonnava brandonnava changed the title Implement and re-enable duplication detection for the UP Implement and re-enable duplication detection for the UP MVP Nov 12, 2024

arnejduranovic commented Nov 14, 2024

Detection should work at the individual message level, such that if a sender submits a batch where some messages pass validation and others error out, the sender can fix the errored messages and resubmit the whole batch without the messages that succeeded the first time being sent on to the receiver a second time (the duplicates in the batch are detected and removed).

Q: What does it mean to "error out"? Do you want us to generate a WARNING or ERROR that shows up in the history API? Anything else?

Q: What is the de-duplication logic? Do we compare select fields of a message or the whole message byte for byte? Is this logic specific to a particular sender or shared by all senders? Whatever this logic is, how long of a record should we keep (should we compare to messages of all time or just last 6 months or whatever)?

@brandonnava brandonnava changed the title Implement and re-enable duplication detection for the UP MVP [TICKET WIP] Design UP deduplication strategy Nov 14, 2024
@jsutantio

Q: What does it mean to "error out"?

In the context of "some messages pass validation and others error out the sender can fix the errored messages", "error out" means that a specific item of the report cannot be processed by the pipeline (e.g., exception caught, invalid HL7 formatting, missing/ unsupported values).

Do you want us to generate a WARNING or ERROR that shows up in the history API? Anything else?

I would not necessarily classify a duplicate item as an error or warning. But it would be useful to have visibility into which items (and thus reports) are duplicates (and thus removed) so that RS admins can identify and eventually contact senders that are prone to sending duplicate data.
If it doesn't overload our logs or compromise efficiency, I would like the system to log the de-duplication of an item (plus any additional info needed for an RS admin to trace back to the original message/report).
One nice-to-have feature (out of scope) is the ability to log an error (viewable via the History API) when the ENTIRE report is a duplicate, not just portions of it. That seems like an erroneous use case, as compared to the current use case of someone re-uploading a report with partially corrected items.

Q: What is the de-duplication logic? Do we compare select fields of a message or the whole message byte for byte?

I'm not exactly clear on how the hashing works – whether it analyzes and compares an item's container size or looks at specific contents/fields. But the goal is a solution that is fast and does not use too much processing power. I assume byte-for-byte is the least efficient method. If specific fields within the item are needed, I suggest using the ones that CDPH uses, which are:

  • message id
  • patient identifier
  • test identifier/ code

Is this logic specific to a particular sender or shared by all senders?

All senders. It would be neat (but beyond our current scope) for us to be able to adjust (or turn on and off) the deduplication schema/ rules for specific senders/ receivers. Therefore, if it's simple and quick to build in that flexibility, it would be great to set the foundation for future iterations.

Whatever this logic is, how long of a record should we keep (should we compare to messages of all time or just last 6 months or whatever)?

If it doesn't severely slow down processing (e.g., each check takes less than a second), keep hashes for 7 years or for as long as RS keeps the data (whichever comes first). If it does significantly impact processing efficiency (e.g., each item takes longer than 1 sec to check for duplication), set the timeframe for comparison to the last year.
I assume there are ways to increase efficiency and reduce storage size, such as partitioning by sender and then performing deduplication.
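
The partition-by-sender idea can be sketched as follows (illustrative Python; `hashes_by_sender` and `is_duplicate` are hypothetical names, and a real store would be a database table keyed by sender rather than in-memory sets):

```python
from collections import defaultdict

# One hash set per sender, so each lookup only scans that sender's
# history; this also matches the MVP's per-sender dedup scope.
hashes_by_sender: dict = defaultdict(set)

def is_duplicate(sender: str, item_hash: str) -> bool:
    seen = hashes_by_sender[sender]
    if item_hash in seen:
        return True
    seen.add(item_hash)
    return False
```

An identical hash from a different sender is not flagged, which is exactly the per-sender scoping the MVP commits to.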

@brandonnava Feel free to add your thoughts and/ or disagreements.


arnejduranovic commented Nov 15, 2024

From your detailed post (thank you!) I am seeing the following additional requirements should get added to the Product Requirement(s) section (please confirm or deny):

CHOICE (PICK ONE):
1.A: A message shall be considered a duplicate of another message if at least its message Id, patient id, and test identifier/code match another message's.
- In HL7v2, these are the following fields:
- In FHIR, these are the following elements:
1.B: A message shall be considered a duplicate of another message if it is an exact string match of another message

  1. All incoming messages should be compared to all messages submitted in at least the past year, regardless of the sender of each message.

  2. Duplication detection logic shall run on an incoming message only if the feature is configured 'on' for that sender. However, their submitted messages shall still be taken into account for future duplication detection scenarios.

  3. If an Item is determined to be a duplicate, ReportStream shall log (application log AND action log) an error that an item was rejected. The log message should be of the form: JESSICA TODO


brandonnava commented Nov 15, 2024

Not much to add on your points, Jess; your responses pretty much align with what we took away from our meeting on this with Mo.

The one thing to add for new requirement 1 in Arnej's list: part of our brainstorm with Mo was trying to figure out whether deduping by message id, patient id, and test identifier/code would be enough to catch the case of a message resent because the first send had an error that needed fixing, or whether we also need to check another field to account for that (we mentioned MSH status or OBR.25, though neither seems quite right).

For new requirement 3, I'm wondering whether we want the feature to turn on/off based on sender or receiver. It seems like we'd end up asking receivers whether they want dupe detection on or not.

And on 4, we can talk about whether it's a warning or not. Instinctively a warning seems right, but technically we allow messages with warnings to be processed and not ones with errors; by that logic, since the message wouldn't get processed/sent, it would be an error. The flip side is that errors usually mean we want the sender to fix the message, but this isn't the kind of error that involves improving the message in any way; it just means don't send it again.

@jsutantio

Internal (to report) - this is what we have in the COVID Pipeline
Sender submits a report with items A, B, C, A within it. "Internal" de-dupe would detect that item A is a duplicate and get rid of it. "Internal" de-dupe does not detect a duplicate of item A in another submitted report.

External (to report) - this covers the proposed scenario of a sender re-submitting a file
Sender submits two reports: Report 1 with items A, B, C & Report 2 with items D, E, A. "External" de-dupe would detect that item A is a duplicate across multiple submitted reports.
Because this looks across different reports, the effort to implement the "External" de-dupe is larger than originally scoped.

External External
ReportStream sends Report 1 with items A, B, C
AIMS sends Report 2 with items D, E, A
Public Health Dpt receives duplicates of item A. Some PHDs have de-duplication systems in place (but some don't). ReportStream has no way to help with "External External" scenarios.

@jsutantio jsutantio changed the title [TICKET WIP] Design UP deduplication strategy Deduplication in the UP MVP Nov 15, 2024
@jsutantio jsutantio added Epic ZenHub Epic label and removed Epic ZenHub Epic label labels Nov 15, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP Deduplication in the UP MVP [Prod Req WIP] Nov 18, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP [Prod Req WIP] Deduplication in the UP MVP [WIP] Nov 21, 2024
@jsutantio jsutantio changed the title Deduplication in the UP MVP [WIP] Deduplication in the UP MVP Nov 25, 2024
@JFisk42 JFisk42 linked a pull request Jan 24, 2025 that will close this issue
5 tasks