Cramer's V Correlation Inconsistency for Identical Variables #735

okankcb · 2023-10-02T07:26:24Z

Hello,

There appears to be an inconsistency in VerticaPy's Cramér's V correlation when it is applied to two identical variables. For two identical variables, the coefficient should be 1; however, this is not the case with VerticaPy versions 0.8.1 and 0.12.0.

To reproduce the issue, I used the Iris dataset. I created a copy of the SepalWidthCm variable and computed Cramér's V between the original and the copy. The coefficient obtained was 0.97 when it should have been 1, which suggests an inaccuracy in the method used.
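For reference, the expected value of 1 can be checked without a database. The following is a minimal pure-Python sketch of the textbook Cramér's V (no bias correction), computed over the full contingency table. It is not VerticaPy's implementation; the function name `cramers_v` and the sample values are hypothetical stand-ins, not the actual Iris column.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Textbook Cramér's V over the full contingency table (no bias correction)."""
    n = len(x)
    row = Counter(x)            # marginal counts of x
    col = Counter(y)            # marginal counts of y
    obs = Counter(zip(x, y))    # observed joint counts
    chi2 = 0.0
    # Iterate over every (row, column) cell, including cells never observed.
    for a, na in row.items():
        for b, nb in col.items():
            expected = na * nb / n
            chi2 += (obs.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical stand-in values (not the real SepalWidthCm column).
widths = [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 3.5, 2.9, 3.2, 3.0]
v = cramers_v(widths, list(widths))
print(v)  # equal to 1 up to floating-point rounding
```

For two identical columns with k distinct values, chi2 works out to n * (k - 1), so the ratio under the square root is exactly 1 regardless of cardinality.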

Below is a screenshot of the results obtained using the Iris dataset:

Kind regards,
Okan

oualib · 2023-10-12T09:42:20Z

Hi Okan (@okankcb),

Thank you for reporting this issue.
VerticaPy's Cramér's V correlation uses the well-known formula given in the 'Calculation' section of:
https://en.wikipedia.org/wiki/Cramér%27s_V
However, VerticaPy uses 'APPROXIMATE_COUNT_DISTINCT' instead of 'COUNT DISTINCT' for better performance. Combined with the database's own computation, this means the result for a variable with high cardinality can be approximate.
In your case, you are using a numerical column, which has a high cardinality. This may be the reason for such a large approximation error.
We can add an option to use 'COUNT DISTINCT' instead of 'APPROXIMATE_COUNT_DISTINCT'. It might solve your problem.
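To get a feel for how much an approximate distinct count could move the result, here is a rough sketch. It assumes the estimated cardinality only enters the formula through the denominator term min(k - 1, r - 1), which this thread does not confirm. The helper name and the cardinality of 20 are hypothetical.

```python
from math import sqrt


def v_identical_with_estimated_k(true_k, estimated_k):
    """Cramér's V for two identical columns when the denominator uses an
    estimated number of categories instead of the true one.

    For identical columns chi2 = n * (true_k - 1), so n cancels and
    V = sqrt((true_k - 1) / (estimated_k - 1)).
    """
    return sqrt((true_k - 1) / (estimated_k - 1))


# Hypothetical cardinality of 20 with a few possible estimates around it.
for est in (19, 20, 21, 22):
    print(est, v_identical_with_estimated_k(20, est))
```

Under this assumption an overestimated count pushes V below 1 and an underestimated count pushes it above 1, so an exact 'COUNT DISTINCT' would remove that particular source of error.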

Badr

oualib · 2023-10-29T19:46:29Z

Indeed, the computation shown on Wikipedia is incorrect. The chi2 was not taking into account the expected values off the diagonal. I just opened a PR to correct it.

Now the result is correct and works for all use-cases. Thank you for raising this issue.
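One way to read the fix, sketched in plain Python: for two identical columns every observed pair sits on the diagonal of the contingency table, but the off-diagonal cells still have a non-zero expected count and each contributes to chi2. Skipping them makes V fall below 1. This is only an illustration of that effect, not the code from the PR; `chi2_sum` and the sample data are hypothetical.

```python
from collections import Counter
from math import sqrt


def chi2_sum(x, y, include_empty_cells):
    """Chi-squared statistic, optionally skipping cells with no observations."""
    n = len(x)
    row, col, obs = Counter(x), Counter(y), Counter(zip(x, y))
    total = 0.0
    for a, na in row.items():
        for b, nb in col.items():
            observed = obs.get((a, b), 0)
            if observed == 0 and not include_empty_cells:
                continue  # drops the off-diagonal cells for identical columns
            expected = na * nb / n
            total += (observed - expected) ** 2 / expected
    return total


# Hypothetical sample: a column compared with itself.
x = ["a", "a", "b", "b", "b", "c"]
n, k = len(x), len(set(x))

v_full = sqrt(chi2_sum(x, x, include_empty_cells=True) / (n * (k - 1)))
v_partial = sqrt(chi2_sum(x, x, include_empty_cells=False) / (n * (k - 1)))
print(v_full, v_partial)  # v_full is 1 up to rounding; v_partial is smaller
```

A query that only aggregates the pairs actually present in the data would naturally produce the second variant, which is why every cell of the table has to be enumerated explicitly.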

oualib · 2023-10-29T19:46:53Z

@okankcb please have a look at PR #783 for more information.

oualib added the "Bug" (Something isn't working) and "Core - vDataFrame" (Anything related to the vDataFrame object) labels Oct 12, 2023

oualib linked a pull request Oct 29, 2023 that will close this issue

Correcting Cramer's V #783

Merged

oualib closed this as completed in #783 Oct 30, 2023
