BUG qsv diff produces different results for the same command #2443

datatraveller1 · 2025-01-14T21:04:42Z

This is an interesting issue. Have you noticed that successive invocations of the same command with qsv diff give different results? The results are usually correct, but sometimes wrong.

qsv diff --key=art_no a.csv b.csv -o diff.csv

If you call this command twice or more, the file diff.csv often (but not always) contains different content.

I have not yet succeeded in creating a small test file without confidential data, but perhaps you can already do something with this information.

The text was updated successfully, but these errors were encountered:

jqnatividad · 2025-01-15T03:16:33Z

Thanks for the heads up @datatraveller1 . I know it uses parallelized hashing, which may account for this non-deterministic behavior.

Copying in @janriemer.

He contributed the diff command, which takes advantage of his csv-diff crate...

Hopefully, he can quickly identify the reason why diff sometimes produces different results...

jqnatividad · 2025-01-16T16:16:48Z

Following up on this @datatraveller1 , on my end, it does produce different results but the diff result is always correct, it's just how the diff result is listed.

Perhaps, the results just need to be sorted.

datatraveller1 · 2025-01-16T17:38:47Z

Hi @jqnatividad Thank you, however I sometimes get wrong results with more or less rows after the second or further call. Hopefully I can create small sample files soon.

janriemer · 2025-01-16T19:29:41Z

Hi @jqnatividad @datatraveller1

First, thank you @datatraveller1 for the bug report and thank you @jqnatividad for tagging me.

A strong suspicion

@datatraveller1 in your command, you're using column art_no as primary key. The values in that column must be unique per csv file. diff doesn't verify this uniqueness property. Can you please verify that there are no duplicate entries in that column in both csv files (there should be a command in qsv for checking duplicates, ~~but I have to look it up~~ see follow up post). Otherwise diff will produce wrong results.

Thank you!

Also: are you able to give a rough estimate on how many csv lines we are talking about that are diffed?

Just for additional context: The following solved issue (June 2024) might be related, although the title doesn't suggest it:
#1892

On the topic of sorting `diff`s result

@jqnatividad

Following up on this @datatraveller1 , on my end, it does produce different results but the diff result is always correct, it's just how the diff result is listed.

Perhaps, the results just need to be sorted.

This is very surprising to me. The diff result should definitely be deterministic. Do you have a concrete example of this that I can look at or is it the exact (confidential) data @datatraveller1 tries to diff (in that case, see suspicion above)?

Surprising, because by default diff sorts by line (there is also the option of sorting by certain columns) and our tests validate this (otherwise we had flaky tests, which I don't think is the case).

Edit: I've thought about this some more: sorting by columns can actually lead to different sortings between runs (when the values in the columns that are sorted are not unique), but as you've said, the result itself should have the same "structure".

PR that implements diff sort by line:
#808

PR that implements diff sort by column:
#827

I hope we get this issue solved soon. Please let me know if I can be of any other help.

janriemer · 2025-01-16T19:34:03Z

Follow up:
Found the command for checking duplicates in certain columns:

dedup command in qsv with -s option.

Duplicate entries are written to stderr (according to usage text, if I understand correctly).

datatraveller1 · 2025-01-16T21:00:38Z

Thank you @janriemer . ok, so I was now able to produce small sample files. The keys are unique.

a.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musical Concepts
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

b.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

command:
qsv diff --key=art_no a.csv b.csv

Output with several calls on the MS Windows 11 command line (cmd.exe). The first one is the correct result.

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
+,2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

C:\test\qsv>qsv diff --key=art_no a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
+,3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua

ondohotola · 2025-01-16T21:14:38Z

I can confirm same on Mac OS 15.2 ARM using

qsv 2.1.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.653;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;19.20 GiB-0 B-12.05 GiB-24.00 GiB (aarch64-apple-darwin compiled with Rust 1.84) prebuilt

if I run qsv diff a.csv b.csv repeatedly I get the correct result, by the way

datatraveller1 · 2025-01-16T21:21:03Z

Thank you @ondohotola - I forgot to mention my version (the same on MS Windows 11 64bit):

qsv 2.1.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.653;prompt;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;4.75 GiB-1.89 GiB-1008.00 MiB-5.94 GiB (x86_64-pc-windows-msvc compiled with Rust 1.84) prebuilt

With my example, it may also happen that the first result I get is wrong and the second or third is the correct one.

datatraveller1 · 2025-01-16T21:31:06Z

qsv diff --key=0 a.csv b.csv
seems to work as expected - always the same correct result:

C:\test\qsv>qsv diff --key=0 a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte

... and if I have understood qsv diff --help correctly, qsv diff --key=0 a.csv b.csv is the same as @ondohotola 's command:
qsv diff a.csv b.csv

janriemer · 2025-01-16T21:36:49Z

Thank you all for all the comments and this very minimal example @datatraveller1 ! This helps a lot! ❤️

I'm currently trying to reproduce (on Linux, though...)

ondohotola · 2025-01-16T21:53:55Z

On Ubuntu 24.04.1 LTS using qsv 2.1.0-mimalloc-apply;fetch;foreach;Luau 0.653;to;polars-0.45.1:py-1.19.0:74dad84;self_update-8-8;4.63 GiB-3.95 GiB-5.32 GiB-5.79 GiB (x86_64-unknown-linux-gnu compiled with Rust 1.84) compiled

same random error

janriemer · 2025-01-16T23:14:30Z

I think I found the bug.

The following by @datatraveller1 was a very important hint!

qsv diff --key=0 a.csv b.csv seems to work as expected - always the same correct result:
C:\test\qsv>qsv diff --key=0 a.csv b.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
... and if I have understood qsv diff --help correctly, qsv diff --key=0 a.csv b.csv is the same as @ondohotola 's command: qsv diff a.csv b.csv

Yes! This statement is correct, @datatraveller1!

The exact issue

The issue happens when specifying the --key as a header name, because diff converts this internally into a column index, but the conversion is wrong (off-by-one error)!

Exact line where the error occurs (the + 1 is the bug):

qsv/src/cmd/diff.rs

Line 180 in 183c835

.map(|pos| pos + index + 1)

And here for the other (right-hand side) csv file:

qsv/src/cmd/diff.rs

Line 197 in 183c835

.map(|pos| pos + index + 1)

When has this been introduced

This new feature has been introduced in v0.130.0 of qsv in PR #2010.

The potential solution

When I remove the + 1 in both places the following command always works correctly:
qsv diff --key=art_no a.csv b.csv

Why have our tests not caught this?

Unfortunately, the second and third column in our test is also unique, hiding the bug:

qsv/tests/test_diff.rs

Lines 308 to 309 in 183c835

    
           svec!["1", "foo", "bar"], 
        
           svec!["2", "fooz", "bart"],

Additional bug discovered regarding sorting

The same problem with the off-by-one error seems to be present regarding specifying column names to sort:

qsv/src/cmd/diff.rs

Line 238 in 183c835

.map(|pos| pos + index + 1)

Workaround

@datatraveller1 until this bug is not fixed, please use comma-separated indices when specifying key (and sort) columns, so in your specific case it should be:
qsv diff --key=0 a.csv b.csv

This will always work.

Next steps

I'll provide a proper fix for it during the weekend.

Thank you everyone for your patience and collaboration. ❤️

ondohotola · 2025-01-17T14:39:15Z

I just tried a qsv diff a.csv a.csv and arrive at the expected result.

But when I use one of my files with a larger sample I get a difference even though the straight diff c.csv c.csv produces the expected nil difference.

Try this on your original file?

datatraveller1 · 2025-01-17T14:53:56Z

I get the wished nil result when comparing the same big csv file. Note that as mentioned by @janriemer, the key (first column) has to be unique in your file c.csv when you call qsv diff c.csv c.csv (which equals to qsv diff --key=0 c.csv c.csv ).

ondohotola · 2025-01-17T17:27:29Z

ah, ok, thank you

datatraveller1 · 2025-01-17T17:43:31Z

@janriemer However, I still encounter sorting issues with commands like qsv diff --key=0 a2.csv b2.csv for big files. The result is correct but the sorting order of the key pairs may differ from call to call. I'm not sure if fixing the "the off-by-one error" will solve the sorting issue?

datatraveller1 · 2025-01-17T18:44:22Z

@janriemer I created a small example. The different sorting order may happen if the key order of file 1 differs from file 2 (see art_no 5 and 6).

file a2.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musical Concepts
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
4,Cembalowerke - Gustav Leonhardt,Das alte Werk
5,Cembalowerke - Gustav Leonhardt,Das alte Werk
6,Kantaten - Nikolaus Harnoncourt,Das alte Werk

file b2.csv:

art_no,title,operator
1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
2,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
3,Praeconium Solitudinis - Werke für Violine solo,Nova Antiqua
4,Cembalo works - Gustav Leonhardt,Das alte Werk
6,Cantatas - Nikolaus Harnoncourt,Das neue Werk
5,Cembalo works - Gustav Leonhardt,Das alte Werk

command:
qsv diff --key=0 a2.csv b2.csv

result (first and second call):

C:\test\qsv>qsv diff --key=0 a2.csv b2.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,4,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,4,Cembalo works - Gustav Leonhardt,Das alte Werk
-,6,Kantaten - Nikolaus Harnoncourt,Das alte Werk
+,6,Cantatas - Nikolaus Harnoncourt,Das neue Werk
-,5,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,5,Cembalo works - Gustav Leonhardt,Das alte Werk

C:\test\qsv>qsv diff --key=0 a2.csv b2.csv
diffresult,art_no,title,operator
-,1,Die Klaviermusik/Klavierwerke,Musical Concepts
+,1,Die Klaviermusik/Klavierwerke,Musikalische Konzepte
-,4,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,4,Cembalo works - Gustav Leonhardt,Das alte Werk
-,5,Cembalowerke - Gustav Leonhardt,Das alte Werk
+,5,Cembalo works - Gustav Leonhardt,Das alte Werk
-,6,Kantaten - Nikolaus Harnoncourt,Das alte Werk
+,6,Cantatas - Nikolaus Harnoncourt,Das neue Werk

C:\test\qsv>

janriemer · 2025-01-18T10:52:14Z

@datatraveller1 Oh, that doesn't look good.😬

This will definitely be a bug in csv-diff itself and not in the diff command.

Thank you for reporting!❤ I'll have a closer look and can hopefully make it deterministic.

janriemer · 2025-01-18T11:37:13Z

I think I found the sorting bug in csv-diff:

https://gitlab.com/janriemer/csv-diff/-/blob/1234a93026a0edc09b0e03faa9c160b8c66e1e4a/src/diff_result.rs#L109-118

The following can result in Equal comparison and is therefore not deterministic (when taking your csv data as example), when the lines with art_no 6 are put into the internal diff result (that will be sorted later on) before the lines with art_no 5.

Both rows need to be marked as Modify in the result for this to happen (won't happen for Added or Deleted rows).

(please translate for_deleted_x with line_deleted_x and for_added_x with line_added_x)

                if for_deleted_a < for_added_a {
                    &for_deleted_a
                } else {
                    &for_added_a
                }
                .cmp(if for_deleted_b < for_added_b {
                    &for_deleted_b
                } else {
                    &for_added_b
                })

so I think in case of Equal we need to compare for_deleted_a.cmp(for_deleted_b).then(for_added_a.cmp(for_added_b)) to cover all cases and be deterministic.

And I need to rename those variables - what was I thinking!? 😨

This fixes part of dathere#2443, where sorting the diff result by line has been non-deterministic. For further details, please see the MR in `csv-diff`: https://gitlab.com/janriemer/csv-diff/-/merge_requests/31

This fixes the conversion from column name -> index for `diff` options `--key` and `--sort-columns`, which is part of dathere#2443. The issue was that `enumeration idx` and `1` had been added to the already correct result (obtained via `.position` iterator method). So now, only `.position` is used to get the correct idx of a column name. Before this change, one test has tried to validate the logic of providing column names instead of indices, but the test itself was wrong, as the diff result that was validated did not have the correct sort order. This test is now also fixed. Additionally, this provides some more tests for error conditions regarding name -> index conversion.

janriemer · 2025-01-19T21:52:01Z

Hey everyone,

PRs #2456 and #2457 incoming...! 🚀

When both are successfully merged, this issue can be closed.

jqnatividad · 2025-01-20T03:27:58Z

Thanks heaps @janriemer !

Just merged #2456 and #2457 and closing this issue.

janriemer · 2025-02-03T22:58:00Z

Hey folks 👋

just want to let you know that the bug regarding non-deterministic sorting behavior is now erased from the universe and can never ever happen again!

The comparison function that is used for sorting the diff result by lines is now formally verified to be correct! It uses the kani model checker for the proof. And it even runs in CI now! 🚀

The following is the actual test function that verifies/proofs that the comparison function, used for sorting the diff result by lines, never returns Ordering::Equal and is therefore always deterministic:

https://gitlab.com/janriemer/csv-diff/-/blob/466a73806d2345cb4eb4902b31b6db0c18bae49a/src/diff_row.rs#L194-262

All other scenarios with respect to line ordering and kind of row (Add, Delete, Modify) are checked as well. If you're interested, please have a look at the whole MR here.

Looking forward to eventually formally verify the actual diffing algorithm of csv-diff itself. 😏

Happy diffing! 🤓

jqnatividad mentioned this issue Jan 18, 2025

qsv diff -s <arg> if both files use the same delimiter #2447

Closed

janriemer mentioned this issue Jan 19, 2025

Bump csv-diff to 0.1.1 #2456

Merged

janriemer mentioned this issue Jan 19, 2025

diff: Fix name to index conversion #2457

Merged

jqnatividad closed this as completed Jan 20, 2025

datatraveller1 mentioned this issue Feb 8, 2025

diff showing different results for the same input #2509

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG qsv diff produces different results for the same command #2443

BUG qsv diff produces different results for the same command #2443

datatraveller1 commented Jan 14, 2025 •

edited

Loading

jqnatividad commented Jan 15, 2025

jqnatividad commented Jan 16, 2025

datatraveller1 commented Jan 16, 2025 •

edited

Loading

janriemer commented Jan 16, 2025 •

edited

Loading

janriemer commented Jan 16, 2025 •

edited

Loading

datatraveller1 commented Jan 16, 2025

ondohotola commented Jan 16, 2025

datatraveller1 commented Jan 16, 2025 •

edited

Loading

datatraveller1 commented Jan 16, 2025 •

edited

Loading

janriemer commented Jan 16, 2025

ondohotola commented Jan 16, 2025

janriemer commented Jan 16, 2025

ondohotola commented Jan 17, 2025 •

edited

Loading

datatraveller1 commented Jan 17, 2025 •

edited

Loading

ondohotola commented Jan 17, 2025

datatraveller1 commented Jan 17, 2025 •

edited

Loading

datatraveller1 commented Jan 17, 2025 •

edited

Loading

janriemer commented Jan 18, 2025 •

edited

Loading

janriemer commented Jan 18, 2025 •

edited

Loading

janriemer commented Jan 19, 2025

jqnatividad commented Jan 20, 2025

janriemer commented Feb 3, 2025 •

edited

Loading

BUG qsv diff produces different results for the same command #2443

BUG qsv diff produces different results for the same command #2443

Comments

datatraveller1 commented Jan 14, 2025 • edited Loading

jqnatividad commented Jan 15, 2025

jqnatividad commented Jan 16, 2025

datatraveller1 commented Jan 16, 2025 • edited Loading

janriemer commented Jan 16, 2025 • edited Loading

A strong suspicion

On the topic of sorting diffs result

janriemer commented Jan 16, 2025 • edited Loading

datatraveller1 commented Jan 16, 2025

ondohotola commented Jan 16, 2025

datatraveller1 commented Jan 16, 2025 • edited Loading

datatraveller1 commented Jan 16, 2025 • edited Loading

janriemer commented Jan 16, 2025

ondohotola commented Jan 16, 2025

janriemer commented Jan 16, 2025

The exact issue

When has this been introduced

The potential solution

Why have our tests not caught this?

Additional bug discovered regarding sorting

Workaround

Next steps

ondohotola commented Jan 17, 2025 • edited Loading

datatraveller1 commented Jan 17, 2025 • edited Loading

ondohotola commented Jan 17, 2025

datatraveller1 commented Jan 17, 2025 • edited Loading

datatraveller1 commented Jan 17, 2025 • edited Loading

janriemer commented Jan 18, 2025 • edited Loading

janriemer commented Jan 18, 2025 • edited Loading

janriemer commented Jan 19, 2025

jqnatividad commented Jan 20, 2025

janriemer commented Feb 3, 2025 • edited Loading

datatraveller1 commented Jan 14, 2025 •

edited

Loading

datatraveller1 commented Jan 16, 2025 •

edited

Loading

janriemer commented Jan 16, 2025 •

edited

Loading

On the topic of sorting `diff`s result

janriemer commented Jan 16, 2025 •

edited

Loading

datatraveller1 commented Jan 16, 2025 •

edited

Loading

datatraveller1 commented Jan 16, 2025 •

edited

Loading

ondohotola commented Jan 17, 2025 •

edited

Loading

datatraveller1 commented Jan 17, 2025 •

edited

Loading

datatraveller1 commented Jan 17, 2025 •

edited

Loading

datatraveller1 commented Jan 17, 2025 •

edited

Loading

janriemer commented Jan 18, 2025 •

edited

Loading

janriemer commented Jan 18, 2025 •

edited

Loading

janriemer commented Feb 3, 2025 •

edited

Loading