BUG qsv diff produces different results for the same command #2443
Comments
Thanks for the heads up @datatraveller1 . I know it uses parallelized hashing, which may account for this non-deterministic behavior. Copying in @janriemer. He contributed the Hopefully, he can quickly identify the reason why |
Following up on this, @datatraveller1: on my end it does produce different results between runs, but the diff result is always correct. Only the order in which the diff rows are listed changes. Perhaps the results just need to be sorted. |
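To tell "same rows in a different order" apart from "genuinely different rows", two diff outputs can be compared after sorting their lines. The helper below is only a minimal sketch for that check; the function name and the sample rows in the usage note are made up for illustration and do not reflect the exact output format of `qsv diff`.

```rust
/// Returns true when both outputs contain exactly the same lines,
/// regardless of the order in which the lines appear.
/// (Hypothetical helper for triaging this issue, not part of qsv.)
fn same_rows_ignoring_order(a: &str, b: &str) -> bool {
    let mut lines_a: Vec<&str> = a.lines().collect();
    let mut lines_b: Vec<&str> = b.lines().collect();
    // Sorting both sides removes any dependence on the listing order.
    lines_a.sort_unstable();
    lines_b.sort_unstable();
    lines_a == lines_b
}
```

Calling `same_rows_ignoring_order(first_run, second_run)` on the contents of two `diff.csv` files returns `true` if only the ordering differs, and `false` if rows are missing or extra, which is the case reported in the next comment.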
Hi @jqnatividad, thank you. However, I sometimes get wrong results, with more or fewer rows, on the second or a later call. Hopefully I can create small sample files soon. |
Hi @jqnatividad @datatraveller1 First, thank you @datatraveller1 for the bug report and thank you @jqnatividad for tagging me. A strong suspicion: @datatraveller1, in your command, you're using column Thank you! Also: are you able to give a rough estimate of how many CSV lines we are talking about that are diffed? Just for additional context: The following solved issue (June 2024) might be related, although the title doesn't suggest it: On the topic of sorting
|
Follow up: Duplicate entries are written to stderr (according to usage text, if I understand correctly). |
Thank you @janriemer . ok, so I was now able to produce small sample files. The keys are unique. a.csv:
b.csv:
command: Output with several calls on the MS Windows 11 command line (cmd.exe). The first one is the correct result.
|
I can confirm same on
if I run |
Thank you @ondohotola - I forgot to mention my version (the same on MS Windows 11 64bit):
With my example, it may also happen that the first result I get is wrong and the second or third is the correct one. |
... and if I have understood |
Thank you all for all the comments and this very minimal example @datatraveller1 ! This helps a lot! ❤️ I'm currently trying to reproduce (on Linux, though...) |
On same random error |
I think I found the bug. The following by @datatraveller1 was a very important hint!
Yes! This statement is correct, @datatraveller1!
The exact issue
The issue happens when specifying the Exact line where the error occurs (the Line 180 in 183c835
And here for the other (right-hand side) csv file: Line 197 in 183c835
When has this been introduced
This new feature has been introduced in v0.130.0 of
The potential solution
When I remove the
Why have our tests not caught this?
Unfortunately, the second and third columns in our test are also unique, hiding the bug: Lines 308 to 309 in 183c835
Additional bug discovered regarding sorting
The same problem with the off-by-one error seems to be present regarding specifying column names to sort: Line 238 in 183c835
Workaround
@datatraveller1 until this bug is fixed, please use comma-separated indices when specifying key (and sort) columns, so in your specific case it should be: This will always work.
Next steps
I'll provide a proper fix for it during the weekend. Thank you everyone for your patience and collaboration. ❤️ |
I just tried a But when I use one of my files with a larger sample I get a difference even though the straight Try this on your original file? |
I get the expected empty result when comparing the same big csv file with itself. Note that, as mentioned by @janriemer, the key (first column) has to be unique in your file. |
ah, ok, thank you |
@janriemer However, I still encounter sorting issues with commands like |
@janriemer I created a small example. The different sorting order may happen if the key order of file 1 differs from file 2 (see art_no 5 and 6). file a2.csv:
file b2.csv:
command: result (first and second call):
|
@datatraveller1 Oh, that doesn't look good.😬 This will definitely be a bug in csv-diff itself and not in the Thank you for reporting!❤ I'll have a closer look and can hopefully make it deterministic. |
I think I found the sorting bug in The following can result in Both rows need to be marked as (please translate
    if for_deleted_a < for_added_a {
        &for_deleted_a
    } else {
        &for_added_a
    }
    .cmp(if for_deleted_b < for_added_b {
        &for_deleted_b
    } else {
        &for_added_b
    })
so I think in case of And I need to rename those variables - what was I thinking!? 😨 |
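One way a comparison like the one above can lead to run-to-run differences is sketched below. This is a simplified model, not the `csv-diff` code: the `Row` and `Kind` types, both comparator names, and the tie-break rule (deleted before added) are assumptions made for illustration. The idea is that when two distinct rows compare as equal by line number alone, a stable sort keeps whatever order they arrived in, so a varying arrival order (for example from parallel processing) shows up in the output.

```rust
use std::cmp::Ordering;

// Hypothetical, simplified model of a diff row: its kind and a line number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Deleted,
    Added,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Row {
    kind: Kind,
    line: u64,
}

/// Compares by line number only: distinct rows on the same line tie.
fn cmp_by_line_only(a: &Row, b: &Row) -> Ordering {
    a.line.cmp(&b.line)
}

/// Assumed tie-break for illustration: on equal lines, Deleted sorts first.
fn cmp_with_tie_break(a: &Row, b: &Row) -> Ordering {
    a.line
        .cmp(&b.line)
        .then_with(|| rank(a.kind).cmp(&rank(b.kind)))
}

fn rank(kind: Kind) -> u8 {
    match kind {
        Kind::Deleted => 0,
        Kind::Added => 1,
    }
}

/// Sorts a copy of the rows with the given comparator.
fn sorted_with(mut rows: Vec<Row>, cmp: fn(&Row, &Row) -> Ordering) -> Vec<Row> {
    rows.sort_by(cmp); // stable sort: ties keep their incoming order
    rows
}
```

With a deleted row and an added row that share a line number, `sorted_with(rows, cmp_by_line_only)` returns them in whichever order they were passed in, while `sorted_with(rows, cmp_with_tie_break)` returns the same order for every input permutation.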
This fixes part of dathere#2443, where sorting the diff result by line has been non-deterministic. For further details, please see the MR in `csv-diff`: https://gitlab.com/janriemer/csv-diff/-/merge_requests/31
This fixes the conversion from column name -> index for `diff` options `--key` and `--sort-columns`, which is part of dathere#2443. The issue was that `enumeration idx` and `1` had been added to the already correct result (obtained via `.position` iterator method). So now, only `.position` is used to get the correct idx of a column name. Before this change, one test has tried to validate the logic of providing column names instead of indices, but the test itself was wrong, as the diff result that was validated did not have the correct sort order. This test is now also fixed. Additionally, this provides some more tests for error conditions regarding name -> index conversion.
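The off-by-one described in this commit message can be illustrated with a small sketch. It is a reconstruction from the wording above, not the actual qsv source: the function names, the header list, and the exact shape of the buggy expression (`position + enumeration index + 1`) are assumptions.

```rust
/// Assumed shape of the bug: the enumeration index and 1 are added on top
/// of the (already correct, zero-based) result of `.position`.
fn buggy_indices(headers: &[&str], names: &[&str]) -> Option<Vec<usize>> {
    names
        .iter()
        .enumerate()
        .map(|(i, name)| {
            headers
                .iter()
                .position(|h| h == name)
                .map(|pos| pos + i + 1)
        })
        .collect()
}

/// Fixed variant: only `.position` is used, so each name maps to its own
/// zero-based column index. Returns None if any name is not a header.
fn fixed_indices(headers: &[&str], names: &[&str]) -> Option<Vec<usize>> {
    names
        .iter()
        .map(|name| headers.iter().position(|h| h == name))
        .collect()
}
```

For a hypothetical header row `["art_no", "name", "price"]` and the key name `art_no`, the fixed variant yields index 0, while the buggy variant yields index 1 and therefore keys on a different column. If that other column happens to be unique as well, the diff still looks plausible, which matches the explanation of why the original test did not catch the problem.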
Thanks heaps @janriemer ! |
Hey folks 👋 just want to let you know that the bug regarding non-deterministic sorting behavior is now erased from the universe and can never ever happen again! The comparison function that is used for sorting the diff result by lines is now formally verified to be correct! It uses the Kani model checker for the proof. And it even runs in CI now! 🚀 The following is the actual test function that verifies/proves that the comparison function, used for sorting the diff result by lines, never returns All other scenarios with respect to line ordering and kind of row ( Looking forward to eventually formally verifying the actual diffing algorithm of Happy diffing! 🤓 |
This is an interesting issue. Have you noticed that successive invocations of the same `qsv diff` command give different results? The results are usually correct, but sometimes wrong.
`qsv diff --key=art_no a.csv b.csv -o diff.csv`
If you call this command twice or more, the file `diff.csv` often (but not always) contains different content.
I have not yet succeeded in creating a small test file without confidential data, but perhaps you can already do something with this information.