Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG qsv diff produces different results for the same command #2443

Closed
datatraveller1 opened this issue Jan 14, 2025 · 22 comments
Closed

BUG qsv diff produces different results for the same command #2443

datatraveller1 opened this issue Jan 14, 2025 · 22 comments

Comments

@datatraveller1
Copy link

datatraveller1 commented Jan 14, 2025

This is an interesting issue. Have you noticed that successive invocations of the same command with qsv diff give different results? The results are usually correct, but sometimes wrong.

qsv diff --key=art_no a.csv b.csv -o diff.csv

If you call this command twice or more, the file diff.csv often (but not always) contains different content.

I have not yet succeeded in creating a small test file without confidential data, but perhaps you can already do something with this information.

@jqnatividad
Copy link
Collaborator

Thanks for the heads up @datatraveller1 . I know it uses parallelized hashing, which may account for this non-deterministic behavior.

Copying in @janriemer.

He contributed the diff command, which takes advantage of his csv-diff crate...

Hopefully, he can quickly identify the reason why diff sometimes produces different results...

@jqnatividad
Copy link
Collaborator

Following up on this @datatraveller1 , on my end, it does produce different results but the diff result is always correct, it's just how the diff result is listed.

Perhaps, the results just need to be sorted.

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 16, 2025

Hi @jqnatividad Thank you, however I sometimes get wrong results with more or less rows after the second or further call. Hopefully I can create small sample files soon.

@janriemer
Copy link

janriemer commented Jan 16, 2025

Hi @jqnatividad @datatraveller1

First, thank you @datatraveller1 for the bug report and thank you @jqnatividad for tagging me.

A strong suspicion

@datatraveller1 in your command, you're using column art_no as primary key. The values in that column must be unique per csv file. diff doesn't verify this uniqueness property. Can you please verify that there are no duplicate entries in that column in both csv files (there should be a command in qsv for checking duplicates, but I have to look it up see follow up post). Otherwise diff will produce wrong results.

Thank you!

Also: are you able to give a rough estimate on how many csv lines we are talking about that are diffed?

Just for additional context: The following solved issue (June 2024) might be related, although the title doesn't suggest it:
#1892

On the topic of sorting diffs result

@jqnatividad

Following up on this @datatraveller1 , on my end, it does produce different results but the diff result is always correct, it's just how the diff result is listed.

Perhaps, the results just need to be sorted.

This is very surprising to me. The diff result should definitely be deterministic. Do you have a concrete example of this that I can look at or is it the exact (confidential) data @datatraveller1 tries to diff (in that case, see suspicion above)?

Surprising, because by default diff sorts by line (there is also the option of sorting by certain columns) and our tests validate this (otherwise we had flaky tests, which I don't think is the case).

Edit: I've thought about this some more: sorting by columns can actually lead to different sortings between runs (when the values in the columns that are sorted are not unique), but as you've said, the result itself should have the same "structure".

PR that implements diff sort by line:
#808

PR that implements diff sort by column:
#827


I hope we get this issue solved soon. Please let me know if I can be of any other help.

@janriemer
Copy link

janriemer commented Jan 16, 2025

Follow up:
Found the command for checking duplicates in certain columns:

dedup command in qsv
with -s option.

Duplicate entries are written to stderr (according to usage text, if I understand correctly).

@datatraveller1
Copy link
Author

Thank you @janriemer . ok, so I was now able to produce small sample files. The keys are unique.

a.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musical Concepts
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

b.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

command:
qsv diff --key=art_no a.csv b.csv

Output with several calls on the MS Windows 11 command line (cmd.exe). The first one is the correct result.

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
+,2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
+,3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

@ondohotola
Copy link

I can confirm same on Mac OS 15.2 ARM using

qsv 2.1.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.653;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;19.20 GiB-0 B-12.05 GiB-24.00 GiB (aarch64-apple-darwin compiled with Rust 1.84) prebuilt

if I run qsv diff a.csv b.csv repeatedly I get the correct result, by the way

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 16, 2025

Thank you @ondohotola - I forgot to mention my version (the same on MS Windows 11 64bit):

qsv 2.1.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.653;prompt;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;4.75 GiB-1.89 GiB-1008.00 MiB-5.94 GiB (x86_64-pc-windows-msvc compiled with Rust 1.84) prebuilt

With my example, it may also happen that the first result I get is wrong and the second or third is the correct one.

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 16, 2025

qsv diff --key=0 a.csv b.csv
seems to work as expected - always the same correct result:

C:\test\qsv>qsv diff --key=0 a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte

... and if I have understood qsv diff --help correctly, qsv diff --key=0 a.csv b.csv is the same as @ondohotola 's command:
qsv diff a.csv b.csv

@janriemer
Copy link

Thank you all for all the comments and this very minimal example @datatraveller1 ! This helps a lot! ❤️

I'm currently trying to reproduce (on Linux, though...)

@ondohotola
Copy link

On Ubuntu 24.04.1 LTS using qsv 2.1.0-mimalloc-apply;fetch;foreach;Luau 0.653;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;4.63 GiB-3.95 GiB-5.32 GiB-5.79 GiB (x86_64-unknown-linux-gnu compiled with Rust 1.84) compiled

same random error

@janriemer
Copy link

I think I found the bug.

The following by @datatraveller1 was a very important hint!

qsv diff --key=0 a.csv b.csv seems to work as expected - always the same correct result:

C:\test\qsv>qsv diff --key=0 a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte

... and if I have understood qsv diff --help correctly, qsv diff --key=0 a.csv b.csv is the same as @ondohotola 's command: qsv diff a.csv b.csv

Yes! This statement is correct, @datatraveller1!

The exact issue

The issue happens when specifying the --key as a header name, because diff converts this internally into a column index, but the conversion is wrong (off-by-one error)!

Exact line where the error occurs (the + 1 is the bug):

.map(|pos| pos + index + 1)

And here for the other (right-hand side) csv file:

.map(|pos| pos + index + 1)

When has this been introduced

This new feature has been introduced in v0.130.0 of qsv in PR #2010.

The potential solution

When I remove the + 1 in both places the following command always works correctly:
qsv diff --key=art_no a.csv b.csv

Why have our tests not caught this?

Unfortunately, the second and third column in our test is also unique, hiding the bug:

qsv/tests/test_diff.rs

Lines 308 to 309 in 183c835

svec!["1", "foo", "bar"],
svec!["2", "fooz", "bart"],

Additional bug discovered regarding sorting

The same problem with the off-by-one error seems to be present regarding specifying column names to sort:

.map(|pos| pos + index + 1)

Workaround

@datatraveller1 until this bug is not fixed, please use comma-separated indices when specifying key (and sort) columns, so in your specific case it should be:
qsv diff --key=0 a.csv b.csv

This will always work.

Next steps

I'll provide a proper fix for it during the weekend.

Thank you everyone for your patience and collaboration. ❤️

@ondohotola
Copy link

ondohotola commented Jan 17, 2025

I just tried a qsv diff a.csv a.csv and arrive at the expected result.

But when I use one of my files with a larger sample I get a difference even though the straight diff c.csv c.csv produces the expected nil difference.

Try this on your original file?

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 17, 2025

I get the wished nil result when comparing the same big csv file. Note that as mentioned by @janriemer, the key (first column) has to be unique in your file c.csv when you call qsv diff c.csv c.csv (which equals to qsv diff --key=0 c.csv c.csv ).

@ondohotola
Copy link

ah, ok, thank you

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 17, 2025

@janriemer However, I still encounter sorting issues with commands like qsv diff --key=0 a2.csv b2.csv for big files. The result is correct but the sorting order of the key pairs may differ from call to call. I'm not sure if fixing the "the off-by-one error" will solve the sorting issue?

@datatraveller1
Copy link
Author

datatraveller1 commented Jan 17, 2025

@janriemer I created a small example. The different sorting order may happen if the key order of file 1 differs from file 2 (see art_no 5 and 6).

file a2.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musical Concepts
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
4,Cembalowerke - Gustav Leonhardt,Das alte Werk
5,Cembalowerke - Gustav Leonhardt,Das alte Werk
6,Kantaten - Nikolaus Harnoncourt,Das alte Werk

file b2.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
4,Cembalo works - Gustav Leonhardt,Das alte Werk
6,Cantatas - Nikolaus Harnoncourt,Das neue Werk
5,Cembalo works - Gustav Leonhardt,Das alte Werk

command:
qsv diff --key=0 a2.csv b2.csv

result (first and second call):

C:\test\qsv>qsv diff --key=0 a2.csv b2.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,4,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,4,Cembalo works - Gustav Leonhardt,Das alte Werk
-,6,Kantaten - Nikolaus Harnoncourt,Das alte Werk
+,6,Cantatas - Nikolaus Harnoncourt,Das neue Werk
-,5,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,5,Cembalo works - Gustav Leonhardt,Das alte Werk

C:\test\qsv>qsv diff --key=0 a2.csv b2.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,4,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,4,Cembalo works - Gustav Leonhardt,Das alte Werk
-,5,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,5,Cembalo works - Gustav Leonhardt,Das alte Werk
-,6,Kantaten - Nikolaus Harnoncourt,Das alte Werk
+,6,Cantatas - Nikolaus Harnoncourt,Das neue Werk

C:\test\qsv>

@janriemer
Copy link

janriemer commented Jan 18, 2025

@datatraveller1 Oh, that doesn't look good.😬

This will definitely be a bug in csv-diff itself and not in the diff command.

Thank you for reporting!❤ I'll have a closer look and can hopefully make it deterministic.

@janriemer
Copy link

janriemer commented Jan 18, 2025

I think I found the sorting bug in csv-diff:

https://gitlab.com/janriemer/csv-diff/-/blob/1234a93026a0edc09b0e03faa9c160b8c66e1e4a/src/diff_result.rs#L109-118

The following can result in Equal comparison and is therefore not deterministic (when taking your csv data as example), when the lines with art_no 6 are put into the internal diff result (that will be sorted later on) before the lines with art_no 5.

Both rows need to be marked as Modify in the result for this to happen (won't happen for Added or Deleted rows).

(please translate for_deleted_x with line_deleted_x and for_added_x with line_added_x)

                if for_deleted_a < for_added_a {
                    &for_deleted_a
                } else {
                    &for_added_a
                }
                .cmp(if for_deleted_b < for_added_b {
                    &for_deleted_b
                } else {
                    &for_added_b
                })

so I think in case of Equal we need to compare for_deleted_a.cmp(for_deleted_b).then(for_added_a.cmp(for_added_b)) to cover all cases and be deterministic.

And I need to rename those variables - what was I thinking!? 😨

janriemer pushed a commit to janriemer/qsv that referenced this issue Jan 19, 2025
This fixes part of dathere#2443, where sorting the diff result by line
has been non-deterministic.

For further details, please see the MR in `csv-diff`:
https://gitlab.com/janriemer/csv-diff/-/merge_requests/31
janriemer pushed a commit to janriemer/qsv that referenced this issue Jan 19, 2025
This fixes the conversion from column name -> index for `diff` options
`--key` and `--sort-columns`, which is part of dathere#2443.

The issue was that `enumeration idx` and `1` had been added to the already
correct result (obtained via `.position` iterator method).
So now, only `.position` is used to get the correct idx of a column name.

Before this change, one test has tried to validate the logic of providing
column names instead of indices, but the test itself was wrong, as the
diff result that was validated did not have the correct sort order.
This test is now also fixed.

Additionally, this provides some more tests for error conditions
regarding name -> index conversion.
@janriemer
Copy link

Hey everyone,

PRs #2456 and #2457 incoming...! 🚀

When both are successfully merged, this issue can be closed.

@jqnatividad
Copy link
Collaborator

Thanks heaps @janriemer !

Just merged #2456 and #2457 and closing this issue.

@janriemer
Copy link

janriemer commented Feb 3, 2025

Hey folks 👋

just want to let you know that the bug regarding non-deterministic sorting behavior is now erased from the universe and can never ever happen again!

The comparison function that is used for sorting the diff result by lines is now formally verified to be correct! It uses the kani model checker for the proof. And it even runs in CI now! 🚀

The following is the actual test function that verifies/proofs that the comparison function, used for sorting the diff result by lines, never returns Ordering::Equal and is therefore always deterministic:

https://gitlab.com/janriemer/csv-diff/-/blob/466a73806d2345cb4eb4902b31b6db0c18bae49a/src/diff_row.rs#L194-262

All other scenarios with respect to line ordering and kind of row (Add, Delete, Modify) are checked as well. If you're interested, please have a look at the whole MR here.

Looking forward to eventually formally verify the actual diffing algorithm of csv-diff itself. 😏

Happy diffing! 🤓

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants