Correcting Cramer's V #783

Merged
oualib merged 3 commits into master from correcting_cramer_1
Oct 30, 2023

Conversation

oualib
Member

@oualib oualib commented Oct 29, 2023

This PR corrects the computation of Cramer's V.
TODO:

  • optimise the corr code.

@oualib oualib added the "Core - vDataFrame" label (anything related to the vDataFrame object) Oct 29, 2023
@oualib oualib requested a review from mail4umar October 29, 2023 19:43
@oualib oualib self-assigned this Oct 29, 2023
@oualib oualib linked an issue Oct 29, 2023 that may be closed by this pull request
@oualib oualib requested a review from vikash018 October 29, 2023 19:51
@oualib
Member Author

oualib commented Oct 29, 2023

@vikash018 you might need to update correlation tests for Cramer's V.

To compute it using Python:

import numpy as np
import pandas as pd
import scipy.stats as ss
import verticapy.datasets as vpd  # assumed import behind the `vpd` alias used below

def cramers_corrected_stat(confusion_matrix):
    """Calculate the Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma (2013),
    Journal of the Korean Statistical Society 42: 323-328.
    """
    # Pearson chi-square statistic of the contingency table
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # total number of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2, floored at 0
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    # Bias-corrected numbers of rows and columns
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

df = vpd.load_titanic().to_pandas()  # any dataset with categorical columns works
confusion_matrix = pd.crosstab(df["pclass"], df["age"])

cramers_corrected_stat(confusion_matrix)

Result: 0.2940639041452328
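
For reference, the quantities computed by the snippet above can be written out as follows. This is only a transcription of that code into notation, with $r$ rows, $k$ columns and $n$ observations in the contingency table; the tilde symbols are names chosen here for the corrected values.

```latex
\varphi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)

\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}

\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
```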

VerticaPy should now return the same value:

vpd.load_titanic().corr(["pclass", "age"], method="cramer")

You can check other column pairs the same way.
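
Two easy extra cases to check are a variable compared with itself, which is the case raised in the linked issue, and two perfectly balanced independent variables. The sketch below is a standard-library-only reimplementation of the corrected statistic for that purpose. The function name `corrected_cramers_v` and the toy data are hypothetical, and this is not VerticaPy's implementation.

```python
from collections import Counter
from math import sqrt


def corrected_cramers_v(x, y):
    """Bias-corrected Cramér's V for two equal-length sequences of labels."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))
    r, k = len(row_totals), len(col_totals)
    if n < 2 or r < 2 or k < 2:
        raise ValueError("need at least 2 observations and 2 levels per variable")

    # Pearson chi-square summed over every cell, empty cells included
    # (Counter returns 0 for label pairs that never occur).
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        # Degenerate case, e.g. every observation has its own level.
        return float("nan")
    return sqrt(phi2_corr / denom)


# Hypothetical toy data: three balanced levels, 60 observations.
levels = ["a", "b", "c"] * 20
v_identical = corrected_cramers_v(levels, levels)

# Two balanced binary variables whose four combinations occur equally often.
left = ["a", "b"] * 30
right = ["u", "u", "v", "v"] * 15
v_independent = corrected_cramers_v(left, right)

print(v_identical)    # 1 up to floating-point rounding
print(v_independent)  # 0.0
```

For identical variables the corrected numerator and denominator coincide algebraically, so the statistic is 1 regardless of the number of levels. That makes this a convenient regression check alongside the titanic example.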

@oualib oualib merged commit e929cda into master Oct 30, 2023
@mail4umar mail4umar deleted the correcting_cramer_1 branch September 24, 2024 19:39
Labels
Core - vDataFrame Anything related to the vDataFrame object.
Development

Successfully merging this pull request may close these issues.

Cramer's V Correlation Inconsistency for Identical Variables
2 participants