There appears to be an inconsistency in the Cramér correlation method (Cramér's V) when it is applied to two identical variables. For two identical variables, the coefficient should be 1; however, this is not the case with VerticaPy versions 0.8.1 and 0.12.0.
To reproduce this error, I used the Iris dataset. I created a copy of the SepalWidthCm variable and applied the Cramér correlation method between the original and the copy. The coefficient obtained was 0.97 when it should have been 1, which points to a possible inaccuracy in the method used.
[Screenshot: results obtained using the Iris dataset]
Kind regards,
Okan
Thank you for reporting this issue.
VerticaPy's Cramér's V correlation uses the well-known formula given in the 'Calculation' section of https://en.wikipedia.org/wiki/Cramér%27s_V
However, for better performance, VerticaPy uses 'APPROXIMATE_COUNT_DISTINCT' instead of 'COUNT DISTINCT'. Because that part of the calculation is approximated in the database, the result for a variable with high cardinality can itself be approximate.
In your case, you are using a numerical column, which has a high cardinality. That may be the reason for such a large approximation error.
We can add an option to use 'COUNT DISTINCT' instead of 'APPROXIMATE_COUNT_DISTINCT'. It might solve your problem.
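For reference, here is a minimal sketch of the formula from that Wikipedia section, written with plain NumPy and exact category counts. It is not VerticaPy's implementation; the helper name `cramers_v` and the sample values are made up for illustration. It assumes each variable has at least two distinct values.

```python
import numpy as np


def cramers_v(x, y):
    """Cramér's V for two equal-length sequences of category labels.

    Hypothetical helper for illustration only (not a VerticaPy function).
    Assumes each variable has at least two distinct values.
    """
    # Map each label to an integer code (exact distinct count, no approximation).
    x_codes = np.unique(np.asarray(x), return_inverse=True)[1].ravel()
    y_codes = np.unique(np.asarray(y), return_inverse=True)[1].ravel()
    r = int(x_codes.max()) + 1  # number of categories of x
    k = int(y_codes.max()) + 1  # number of categories of y

    # Contingency table of observed counts.
    observed = np.zeros((r, k))
    np.add.at(observed, (x_codes, y_codes), 1)
    n = observed.sum()

    # Expected counts under independence, for every cell of the table.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min(r - 1, k - 1))))


# Made-up values standing in for a numerical column and its copy.
values = [3.0, 3.5, 3.0, 2.8, 3.5, 3.2, 3.0]
v_identical = cramers_v(values, list(values))
print(v_identical)  # equals 1 up to floating-point rounding
```

With exact counts, a variable compared against its own copy gives 1 (up to rounding), which is the behaviour expected in the report above.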
Indeed, our computation of the formula shown on Wikipedia was incorrect: the chi2 statistic was not taking into account the expected values off the diagonal. I just opened a PR to correct it.
Now the result is correct and works for all use-cases. Thank you for raising this issue.
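To illustrate why ignoring the off-diagonal cells matters, here is a small sketch using made-up category counts. It is not the code from the PR. For two identical variables, every observation falls on the diagonal of the contingency table, but the expected counts off the diagonal are still positive and each contributes to chi2. The sketch assumes the faulty variant summed only over cells that actually contain observations.

```python
import numpy as np

# Made-up category counts for one variable; its copy has the same counts.
counts = np.array([1.0, 3.0, 1.0, 2.0])
n = counts.sum()
k = len(counts)

# Identical variables: all observations sit on the diagonal.
observed = np.diag(counts)
# Expected counts under independence are positive in every cell.
expected = np.outer(counts, counts) / n

terms = (observed - expected) ** 2 / expected

# Full chi2: sum over every cell, including those with zero observations.
chi2_full = terms.sum()
# Assumed faulty variant: sum only over cells that contain observations.
chi2_observed_only = terms[observed > 0].sum()

v_full = float(np.sqrt(chi2_full / (n * (k - 1))))
v_observed_only = float(np.sqrt(chi2_observed_only / (n * (k - 1))))

print(v_full)           # 1 up to floating-point rounding
print(v_observed_only)  # strictly below 1
```

Summing over all cells gives chi2 = n(k - 1) for identical variables, so V is exactly 1; dropping the empty cells removes a positive amount from chi2 and pushes V below 1.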