AUC decreases A LOT after re-generating cached data #3

Open
FeiGSSS opened this issue Oct 10, 2022 · 1 comment

FeiGSSS commented Oct 10, 2022

Hi,
When I delete the cached data you provide for the property prediction datasets and regenerate it myself with your code, the property prediction AUC drops substantially. With the cached data you provide, however, I can reproduce the reported results.
I have checked that I am using the required versions of pysmiles and networkx.

FeiGSSS commented Oct 11, 2022

Specifically, I found an inconsistency: the node features in the provided cached data are not aligned with the feature_encoder.
For instance, as shown below, the charge attribute (second column) of every node in the first DGL graph of the BBBP dataset is 22:

In [1]: import dgl

In [2]: bbbp = dgl.load_graphs("./BBBP.bin")[0]

In [3]: bbbp[0].ndata["feature"]
Out[3]: 
tensor([[ 8, 22, 27, 30],
        [17, 22, 27, 33],
        [17, 22, 27, 31],
        [17, 22, 27, 33],
        [14, 22, 27, 31],
        [17, 22, 27, 32],
        [17, 22, 27, 31],
        [ 6, 22, 27, 31],
        [17, 22, 27, 32],
        [ 6, 22, 27, 30],
        [17, 22, 28, 30],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 30],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 30]])

However, loading the feature_encoder saved with a pretrained model, such as gcn_1024/feature_enc.pkl, gives:

In [6]: with open("../../saved/gcn_1024/feature_enc.pkl", "rb") as f:
   ...:     feature_encoder = pkl.load(f)
   ...: 

In [7]: feature_encoder
Out[7]: 
{'element': {'Li': 0,
  'Mn': 1,
  'O': 2,
  'Zr': 3,
  'Cl': 4,
  'Na': 5,
  'In': 6,
  'Cu': 7,
  'Sb': 8,
  'Pb': 9,
  'F': 10,
  'K': 11,
  'B': 12,
  'Ge': 13,
  'N': 14,
  'Hg': 15,
  'As': 16,
  'Zn': 17,
  'Ru': 18,
  'Mg': 19,
  'Si': 20,
  'S': 21,
  'Cr': 22,
  'Sn': 23,
  'P': 24,
  'Ta': 25,
  'C': 26,
  'Bi': 27,
  'Pt': 28,
  'Cd': 29,
  'Ti': 30,
  'Xe': 31,
  'Al': 32,
  'Br': 33,
  'Se': 34,
  'Ga': 35,
  'Ag': 36,
  'I': 37,
  'unknown': 38},
 'charge': {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, 'unknown': 45},
 'aromatic': {False: 46, True: 47, 'unknown': 48},
 'hcount': {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, 'unknown': 54}}

The indices of the charge attribute start at 39, so under this encoder every value in the cached BBBP features above (at most 33) falls within the element range (0 to 38).
I think this is why the AUC drops so much after I regenerate the node features of the BBBP dataset. The node feature matrix generated with the above feature_encoder is:

In [4]: bbbp[0].ndata["feature"]
Out[4]: 
tensor([[ 4, 39, 46, 49],
        [26, 39, 46, 52],
        [26, 39, 46, 50],
        [26, 39, 46, 52],
        [14, 39, 46, 50],
        [26, 39, 46, 51],
        [26, 39, 46, 50],
        [ 2, 39, 46, 50],
        [26, 39, 46, 51],
        [ 2, 39, 46, 49],
        [26, 39, 47, 49],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 49],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 49]])
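
The mismatch can also be checked programmatically. The following is only a minimal sketch, not code from the repository: the helper names `index_ranges` and `misaligned_columns` are hypothetical, the column order (element, charge, aromatic, hcount) is assumed from the outputs above, and `toy_encoder` is an abbreviated stand-in for the real feature_enc.pkl. It reports, for each attribute, the values that fall outside the index range the encoder assigns to that attribute.

```python
# Assumed column order of ndata["feature"], inferred from the outputs above.
ATTRIBUTES = ["element", "charge", "aromatic", "hcount"]


def index_ranges(feature_encoder):
    """Map each attribute name to the (min, max) index its values may take."""
    return {
        name: (min(mapping.values()), max(mapping.values()))
        for name, mapping in feature_encoder.items()
    }


def misaligned_columns(rows, feature_encoder, attributes=ATTRIBUTES):
    """Return {attribute: sorted out-of-range values} for columns that do not fit the encoder."""
    ranges = index_ranges(feature_encoder)
    problems = {}
    for col, name in enumerate(attributes):
        lo, hi = ranges[name]
        bad = sorted({row[col] for row in rows if not lo <= row[col] <= hi})
        if bad:
            problems[name] = bad
    return problems


# Abbreviated encoder with the same layout as the one printed above
# (only a few elements are listed; the index ranges are unchanged).
toy_encoder = {
    "element": {"Li": 0, "O": 2, "Cl": 4, "N": 14, "C": 26, "unknown": 38},
    "charge": {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, "unknown": 45},
    "aromatic": {False: 46, True: 47, "unknown": 48},
    "hcount": {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, "unknown": 54},
}

# A few rows copied from the cached and regenerated feature matrices above.
cached_rows = [[8, 22, 27, 30], [17, 22, 27, 33], [17, 22, 28, 31]]
regenerated_rows = [[4, 39, 46, 49], [26, 39, 46, 52], [26, 39, 47, 50]]

cached_problems = misaligned_columns(cached_rows, toy_encoder)
regenerated_problems = misaligned_columns(regenerated_rows, toy_encoder)
print("cached:", cached_problems)
print("regenerated:", regenerated_problems)
```

On real data, the rows could be obtained with `bbbp[0].ndata["feature"].tolist()` and the encoder loaded from the pickle file as shown earlier. An empty result only means that every value lies in its attribute's range; it does not prove that the cached data and the encoder were built from the same vocabulary.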
