How to include structural zeros? #152

Open
windisch opened this issue Aug 23, 2023 · 2 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@windisch

What's the preferred way to model structural zeros in a Formula?

Assume the following toy example: I have a $3\times 2$ contingency table that looks like this

|   | e | f |
|---|---|---|
| a | 1 | 0 |
| b | 2 | 3 |
| c | 4 | 0 |

given as a pandas dataframe as follows:

import pandas as pd

df = pd.DataFrame(
    data={
        'F1': ['a', 'a', 'b', 'b', 'c', 'c'],
        'F2': ['e', 'f', 'e', 'f', 'e', 'f'],
        'n': [1, 0, 2, 3, 4, 0],
    })

The combinations $(a, f)$ and $(c, f)$ are structural zeros (i.e., it's impossible to have non-zero values in these cells). Now, assume I want to fit the model `n ~ C(F1):C(F2)` on that data as follows:

from formulaic import Formula

y, X = Formula('n ~ C(F1):C(F2)').get_model_matrix(df, ensure_full_rank=False)

then the corresponding variables `C(F1)[T.a]:C(F2)[T.f]` and `C(F1)[T.c]:C(F2)[T.f]` are columns of `X`. Is there a way to remove these parameters directly in the formula? Is there another concept in formulaic for dealing with this type of constraint?

@matthewwardrop
Owner

Hi @windisch,

Apologies for the delay in my response. Life has been pretty hectic of late.

At present, there is no way to handle this in Formulaic (short of deleting these columns after the model matrix is created). Is there precedent for supporting this kind of transformation in other formula implementations? (This isn't a requisite for including it in Formulaic, but it does help to think through how others have solved this issue).

If we were to add support for this, I think the easiest approach would be to generate the matrix as is, and then remove any columns that are identically zero. This does mean that some unnecessary work is done, which is a little inelegant... but I'm not sure it makes sense to pass around richer metadata than this. Of course, that means it could just as easily be done outside formulaic too.
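A minimal sketch of the "remove identically-zero columns" idea, using pandas only. One caveat worth noting: in the toy data above, the rows for $(a, f)$ and $(c, f)$ are present (with `n = 0`), so their interaction indicators equal 1 in those rows and the columns are not identically zero. The sketch therefore assumes the structural-zero rows are filtered out first. The `structural_zeros` set and the helpers `interaction_indicators` and `drop_all_zero_columns` are hypothetical names for illustration, and the hand-built indicator matrix is only a stand-in for the `X` that Formulaic would produce.

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    "F1": ["a", "a", "b", "b", "c", "c"],
    "F2": ["e", "f", "e", "f", "e", "f"],
    "n": [1, 0, 2, 3, 4, 0],
})

# Assumption: the analyst knows which cells are structural zeros;
# a zero count alone cannot distinguish them from sampling zeros.
structural_zeros = {("a", "f"), ("c", "f")}


def interaction_indicators(data, levels1, levels2):
    """Hand-rolled stand-in for the F1:F2 columns of a model matrix."""
    cols = {}
    for l1, l2 in itertools.product(levels1, levels2):
        hit = (data["F1"] == l1) & (data["F2"] == l2)
        cols[f"F1[{l1}]:F2[{l2}]"] = hit.astype(int)
    return pd.DataFrame(cols, index=data.index)


def drop_all_zero_columns(X):
    """Return X without its all-zero columns, plus the dropped names."""
    zero_cols = [c for c in X.columns if (X[c] == 0).all()]
    return X.drop(columns=zero_cols), zero_cols


# Remove the rows that correspond to structural-zero cells.
keep = [(f1, f2) not in structural_zeros for f1, f2 in zip(df["F1"], df["F2"])]
observed = df.loc[keep]

X_full = interaction_indicators(observed, ["a", "b", "c"], ["e", "f"])
X_reduced, dropped = drop_all_zero_columns(X_full)
print(dropped)
```

Because `drop_all_zero_columns` relies only on ordinary DataFrame operations, it should apply equally to a pandas model matrix returned by `get_model_matrix`.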

In an ideal world, what would you like to see done?

@matthewwardrop added the enhancement and question labels on Dec 20, 2023
@lwiklendt

> In an ideal world, what would you like to see done?

When creating a model matrix, an extra argument could be supported, such as `formulaic.model_matrix(formula, df, drop_structural_zeros=True)`, which would return a matrix with all structural-zero columns dropped.

A more manual approach could be to give the resulting `ModelSpec` methods for dropping columns, so that the spec can be modified iteratively, at the researcher's discretion. A researcher could build the full model, check which columns contain only zeros (e.g. `cols_to_drop = model_mat.columns[((model_mat != 0).sum() == 0)].to_list()`), drop those columns from the model spec (`model_mat.model_spec.drop(cols_to_drop)`, returning a new spec), and then run `get_model_matrix` on the updated spec.
