
Edge case affecting trimming #8

Open
xyjprc opened this issue Feb 7, 2021 · 0 comments

xyjprc (Owner) commented Feb 7, 2021

The line in question (pyHON/BuildRulesFastParameterFreeFreq.py, line 124):

if KLD(MaxDivergence(Distribution[Curr]), Distr) < KLDThreshold(order+1, Curr):

Francois (@fqueyroi) raised the following issue in an email:

Hi Jian,

I forgot to thank you for your previous answer, sorry about that; it was very helpful.
My colleague (Célestin Coquidé, in cc) and I are still working on higher-order networks following your important results.

We have a question regarding the elimination of the max order parameter in your paper
"Efficient modeling of higher-order dependencies in networks: from algorithm to application for anomaly detection" (Saebi et al., 2020).
In the file "pyHONv2_algorithm.pdf" in your GitHub repo (which I believe describes the algorithm from the paper), line 38 corresponds to Eq. (1): the maximum value of KLD(Extended_Distr, Valid_Distr) is -log2(min(Valid_Distr)), i.e. the maximum is achieved when (one of) the least likely outcomes under the valid rule becomes the only possible outcome when trying to extend the rule.

However, in the Python script "pyHON/BuildRulesFastParameterFreeFreq.py", the quantity actually computed (function ExtendRule, line 124) is different. It is the KLD that would be achieved if (one of) the least likely outcomes under the current rule became the only possible outcome when trying to extend the rule. Assuming i = argmin(Current_Distr), the returned quantity is -log2(Valid_Distr[i]).

The latter quantity is not well defined, since the minimum of Current_Distr may not be unique, and the entries sharing the lowest probability can correspond to different values in Valid_Distr. When this happens, the quantity computed at line 124 is arbitrary: it depends on the order in which values are stored in the Python dicts.
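For concreteness, here is a minimal sketch of the two functions involved, written from the description above rather than copied from the repo; kld and max_divergence are simplified stand-ins for the KLD and MaxDivergence functions:

import math

def kld(a, b):
    # Kullback-Leibler divergence in bits; assumes every outcome of `a`
    # with nonzero probability also appears in `b`.
    return sum(p * math.log2(p / b[k]) for k, p in a.items() if p > 0)

def max_divergence(distr):
    # As described above: all probability mass goes to (one of) the least
    # likely outcomes of `distr`. When the minimum is not unique, which
    # key is picked depends on the dict's insertion order.
    least_likely = min(distr, key=distr.get)
    return {least_likely: 1.0}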

As an example, assume we have
Distr_Valid = {A: 0.7, B: 0.3}
Distr_Current = {A: 0.5, B: 0.5}
Distr_Extended = {A: 1}

where Extended is one possible extension of the current rule.
In this example, the call MaxDivergence(Distribution[Curr]) could return {A: 1}, and the computed maximum would be
-log_2(0.7) = 0.51,
which is actually lower than the maximum possible divergence over extensions,
max KLD(Distr_Extended, Distr_Valid) = -log_2(0.3) = 1.74, achieved by the extension {B: 1}.
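Plugging the example distributions into the sketch above reproduces both numbers, assuming the tie in Distr_Current is broken in favour of A:

valid = {'A': 0.7, 'B': 0.3}
current = {'A': 0.5, 'B': 0.5}

# What line 124 effectively computes: the tie in `current` is broken by
# insertion order, so all mass lands on A.
print(kld(max_divergence(current), valid))  # -log2(0.7) = 0.514...

# The upper bound of Eq. (1): the worst case puts all mass on B.
print(-math.log2(min(valid.values())))      # -log2(0.3) = 1.736...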

This issue arose when testing both approaches on some datasets: we actually obtained different sets of rules.

I believe the threshold described in the paper can be computed by changing line 124 from
if KLD(MaxDivergence(Distribution[Curr]), Distr) < KLDThreshold(order+1, Curr):
to
if KLD(MaxDivergence(Distr), Distr) < KLDThreshold(order+1, Curr):
or, even simpler, to
if -math.log(min(Distr.values()),2) < KLDThreshold(order+1, Curr):
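Both fixes compute the same upper bound; the second just avoids building the one-point distribution. A minimal sketch of that simplified check, reusing the snippet above (Distr, order, Curr and KLDThreshold are the identifiers already used in the repo):

# Largest KLD any extension of a valid rule can achieve: all probability
# mass placed on the valid rule's least likely outcome.
def max_possible_kld(valid_distr):
    return -math.log2(min(valid_distr.values()))

# The trimming test then reads:
# if max_possible_kld(Distr) < KLDThreshold(order + 1, Curr):
#     ... stop extending this rule ...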

Maybe we missed something; we would like to have your opinion on this.

Take care and happy new year!
François Queyroi

#################

My thoughts:

I think you are right. My intention was to compare the maximum possible divergence from the last Valid rule's distribution against the KLD threshold, to allow trimming of the search path; there is no need to involve the distribution of Curr. Under certain edge cases, my implementation may be less effective at trimming.
