
Edge case affecting trimming #8

Open
xyjprc opened this issue Feb 7, 2021 · 0 comments

xyjprc (Owner) commented Feb 7, 2021

The line in question (pyHON/BuildRulesFastParameterFreeFreq.py, line 124):

if KLD(MaxDivergence(Distribution[Curr]), Distr) < KLDThreshold(order+1, Curr):

Francois (@fqueyroi) raised the following issue in an email:

Hi Jian,

I forgot to thank you for your previous answer, sorry about that; it was very helpful.
My colleague (Célestin Coquidé, in cc) and I are still working on higher-order networks following your important results.

We have a question regarding the elimination of the max order parameter in your paper
"Efficient modeling of higher-order dependencies in networks: from algorithm to application for anomaly detection" (Saebi et al., 2020).
In the file "pyHONv2_algorithm.pdf" in your GitHub repo (which I believe describes the algorithm from the paper), line 38 corresponds to Eq. (1): the maximum value of KLD(Extended_Distr, Valid_Distr) is -log2(min(Valid_Distr)), i.e. the maximum is achieved when (one of) the least likely outcomes under the valid rule becomes the only possible outcome when trying to extend the rule.

However, in the Python script "pyHON/BuildRulesFastParameterFreeFreq.py", the quantity actually computed (function ExtendRule, line 124) is different. It is the KLD that would be achieved if (one of) the least likely outcomes under the current rule became the only possible outcome when trying to extend the rule. Assuming i = argmin(Current_Distr), the returned quantity is -log2(Valid_Distr[i]).

The latter quantity is not well defined, since the minimum of Current_Distr may not be unique, and the entries sharing the lowest probability can correspond to different values in Valid_Distr. When this happens, the quantity computed at line 124 is arbitrary: it depends on the order in which values are stored in the Python dicts.
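For concreteness, here is a minimal sketch of the two functions involved, written from the description above rather than copied from the repo; kld and max_divergence are simplified stand-ins for the KLD and MaxDivergence functions:

import math

def kld(a, b):
    # Kullback-Leibler divergence in bits; assumes every outcome of `a`
    # with nonzero probability also appears in `b`.
    return sum(p * math.log2(p / b[k]) for k, p in a.items() if p > 0)

def max_divergence(distr):
    # As described above: all probability mass goes to (one of) the least
    # likely outcomes of `distr`. When the minimum is not unique, which
    # key is picked depends on the dict's insertion order.
    least_likely = min(distr, key=distr.get)
    return {least_likely: 1.0}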

As an example, assume we have
Distr_Valid = {A: 0.7, B: 0.3}
Distr_Current = {A: 0.5, B: 0.5}
Distr_Extended = {A: 1}

where Extended is one possible extension of the current rule.
In this example, the call MaxDivergence(Distribution[Curr]) could return {A: 1}, and the computed maximum would be
-log_2(0.7) = 0.51,
which is actually lower than the maximum possible divergence over extensions,
max KLD(Distr_Extended, Distr_Valid) = -log_2(0.3) = 1.74, achieved by the extension {B: 1}.
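Plugging the example distributions into the sketch above reproduces both numbers, assuming the tie in Distr_Current is broken in favour of A:

valid = {'A': 0.7, 'B': 0.3}
current = {'A': 0.5, 'B': 0.5}

# What line 124 effectively computes: the tie in `current` is broken by
# insertion order, so all mass lands on A.
print(kld(max_divergence(current), valid))  # -log2(0.7) = 0.514...

# The upper bound of Eq. (1): the worst case puts all mass on B.
print(-math.log2(min(valid.values())))      # -log2(0.3) = 1.736...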

This issue arose when testing both approaches on some datasets: we actually obtained different sets of rules.

I believe the threshold described in the paper can be computed by changing line 124 from
if KLD(MaxDivergence(Distribution[Curr]), Distr) < KLDThreshold(order+1, Curr):
to
if KLD(MaxDivergence(Distr), Distr) < KLDThreshold(order+1, Curr):
or, even simpler, to
if -math.log(min(Distr.values()),2) < KLDThreshold(order+1, Curr):
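Both fixes compute the same upper bound; the second just avoids building the one-point distribution. A minimal sketch of that simplified check, reusing the snippet above (Distr, order, Curr and KLDThreshold are the identifiers already used in the repo):

# Largest KLD any extension of a valid rule can achieve: all probability
# mass placed on the valid rule's least likely outcome.
def max_possible_kld(valid_distr):
    return -math.log2(min(valid_distr.values()))

# The trimming test then reads:
# if max_possible_kld(Distr) < KLDThreshold(order + 1, Curr):
#     ... stop extending this rule ...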

Maybe we missed something; we would like to have your opinion on this.

Take care and happy new year!
François Queyroi

#################

My thoughts:

I think you are right. My intention was to compare the maximum possible divergence from the last Valid rule's distribution against the KLD threshold, to allow trimming of the search path; there is no need to involve the distribution of Curr. Under certain edge cases, my implementation may be less effective at trimming.
