Cramer's V Correlation Inconsistency for Identical Variables #735

Closed
okankcb opened this issue Oct 2, 2023 · 3 comments · Fixed by #783
Labels
Bug Something isn't working. Core - vDataFrame Anything related to the vDataFrame object.

Comments


okankcb commented Oct 2, 2023

Hello,

There appears to be an inconsistency in the Cramér's V correlation method when it is applied to two identical variables. The coefficient for two identical variables should be 1, but that is not the case with VerticaPy versions 0.8.1 and 0.12.0.

To reproduce the error, I used the Iris dataset: I created a copy of the SepalWidthCm variable and computed the Cramér's V correlation between the two variables. The coefficient obtained was 0.97 when it should have been 1, which suggests an inaccuracy in the method used.
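As a reference point for the expected value, here is a minimal pure-Python sketch of Cramér's V computed from the full contingency table. It is not VerticaPy's implementation: the function name `cramers_v` and the sample values are made up for illustration and are not taken from the Iris dataset.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V of two equal-length sequences, treated as categorical."""
    n = len(x)
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y
    obs = Counter(zip(x, y))     # observed joint counts
    chi2 = 0.0
    # Sum over every cell of the table, including cells never observed.
    for a, na in row.items():
        for b, nb in col.items():
            expected = na * nb / n
            chi2 += (obs.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        raise ValueError("each variable needs at least two distinct values")
    return sqrt(chi2 / (n * k))


# Hypothetical values, not real Iris measurements.
values = [3.0, 3.0, 2.9, 3.1, 3.4, 3.4, 3.4, 2.9]
v_identical = cramers_v(values, list(values))
print(v_identical)
```

For a variable compared with an exact copy of itself, the chi-squared statistic works out to n × (k − 1), where k is the number of distinct values, so the coefficient is 1 up to floating-point rounding.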

Below is a screenshot of the results obtained using the Iris dataset:

image

Kind regards,
Okan

@oualib oualib added Bug Something isn't working. Core - vDataFrame Anything related to the vDataFrame object. labels Oct 12, 2023
Member

oualib commented Oct 12, 2023

Hi Okan (@okankcb),

Thank you for reporting this issue.
VerticaPy's Cramér's V correlation uses the well-known formula given in the 'Calculation' section of:
https://en.wikipedia.org/wiki/Cramér%27s_V
However, for performance, VerticaPy uses 'APPROXIMATE_COUNT_DISTINCT' instead of 'COUNT DISTINCT'. Combined with the in-database calculation, this means the result for a variable with high cardinality can be approximate.
In your case, you are using a numerical column, which has high cardinality. That may be the reason for this large approximation error.
We can add an option to use 'COUNT DISTINCT' instead of 'APPROXIMATE_COUNT_DISTINCT', which might solve your problem.
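To illustrate why an approximate distinct count could matter, the sketch below shows how the coefficient changes when the cardinality in the denominator is off by one. It is only an illustration of the hypothesis above: the helper `cramers_v_from_chi2` and the numbers are hypothetical, and it assumes the distinct count enters the formula only through the `min(k - 1, r - 1)` term.

```python
from math import sqrt


def cramers_v_from_chi2(chi2, n, k, r):
    """Cramér's V from a chi-squared statistic, sample size n,
    and the number of distinct values k and r of the two variables."""
    return sqrt(chi2 / (n * min(k - 1, r - 1)))


# Hypothetical setup: 1000 rows, 20 distinct values, variable vs. its copy.
n, true_k = 1000, 20
chi2 = n * (true_k - 1)  # chi-squared of a variable against an identical copy

exact = cramers_v_from_chi2(chi2, n, true_k, true_k)
# Same statistic, but the distinct count is overestimated by one.
approx = cramers_v_from_chi2(chi2, n, true_k + 1, true_k + 1)

print(exact, approx)
```

With the exact count the coefficient is 1, while an overestimate of a single distinct value already pulls it down to about 0.975, so a small error in the cardinality estimate is enough to produce a visible deviation.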

Badr

@oualib oualib linked a pull request Oct 29, 2023 that will close this issue

oualib commented Oct 29, 2023

Indeed, the computation shown in Wikipedia is incorrect. The chi2 was not considering expected values off the diagonal. I just opened a PR to correct it.

Now the result is correct and works for all use-cases. Thank you for raising this issue.
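One possible reading of the off-diagonal remark can be sketched as follows. This is an assumption about the kind of error involved, not a description of the actual VerticaPy code or of PR #783: the second function sums the chi-squared terms only over value pairs that actually occur in the data, which for two identical variables means only the diagonal cells. All names and values are hypothetical.

```python
from collections import Counter
from math import sqrt


def chi2_all_cells(x, y):
    """Chi-squared summed over every cell of the contingency table."""
    n = len(x)
    row, col, obs = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (obs.get((a, b), 0) - na * nb / n) ** 2 / (na * nb / n)
        for a, na in row.items()
        for b, nb in col.items()
    )


def chi2_observed_cells_only(x, y):
    """Chi-squared summed only over cells with at least one observation.
    Cells with zero observations are skipped, although each of them
    should contribute its expected count to the statistic."""
    n = len(x)
    row, col, obs = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (o - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
        for (a, b), o in obs.items()
    )


# Hypothetical values, not real Iris measurements.
values = [3.0, 3.0, 2.9, 3.1, 3.4, 3.4, 3.4, 2.9]
n, k = len(values), len(set(values))

v_full = sqrt(chi2_all_cells(values, values) / (n * (k - 1)))
v_partial = sqrt(chi2_observed_cells_only(values, values) / (n * (k - 1)))
print(v_full, v_partial)
```

On this toy data the full sum gives a coefficient of 1, while the observed-cells-only sum gives about 0.87, which is the same kind of shortfall as the 0.97 reported above.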


oualib commented Oct 29, 2023

@okankcb, please have a look at PR #783 for more information.
