BayesDB falls over when passed some really tiny values #388
What else did you try to conclude that 'most tiny values are OK'? This setup satisfies the hypotheses of probcomp/crosscat#85.
The following script records 46 failures out of 200 experiments:

```python
import pylab
pylab.ion()

from bayeslite.bayesdb import bayesdb_open

bad = []
for i in pylab.arange(.999, 1.001, 0.00001):
    print(i)
    try:
        bdb = bayesdb_open()
        bdb.sql_execute('CREATE TABLE D (c float)')
        bdb.sql_execute('INSERT INTO D (c) VALUES (%g)' % (-1.0010415476e-146*i))
        # bdb.sql_execute('INSERT INTO D (c) VALUES (0)')
        bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
        bdb.execute('INITIALIZE 1 MODELS for D_cc')
        bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')
    except AssertionError as e:
        assert any('bad X_L before' in a for a in e.args)
        bad.append(i*-1.0010415476e-146)

pylab.hist(bad, bins=100)
pylab.show()
```
@alxempirical You have an insert 0 commented out, which would nix the bug 85 connection. Can you give us a little more history on that?
The test does not fail if you insert a single 0 into the table (as opposed to leaving that insert commented out). The possible connection to #85 is unclear to me, as the sufficient statistics …
The values don't have to be particularly tiny after all. The example also fails with -0.00114661120796.
And for larger values, they all fail.
Here is an example with two values inserted into the table:

```python
bdb.sql_execute('INSERT INTO D (c) VALUES (2952788047.4), (2952788169.25)')
```

Failure is:

```
AssertionError: bad X_L
before {'column_partition': {'assignments': [0], 'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_hypers': [{'mu': 2952788092.0783334, 's': 7423.711238379478, 'r': 0.5484124898473131, 'fixed': 0.0, 'nu': 2.0}], 'view_state': [{'column_component_suffstats': [[{'sum_x': 2952788169.25, 'sum_x_squared': 8.718957972462767e+18, 'N': 1.0}, {'sum_x': 2952788047.4, 'sum_x_squared': 8.718957252868305e+18, 'N': 1.0}]], 'row_partition_model': {'counts': [1, 1], 'hypers': {'alpha': 1.2030250360821166}}, 'column_names': [u'c']}]}
after {u'column_partition': {u'assignments': [0], u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_hypers': [{u'mu': 2952788055.5233335, u's': 3445.7815188027907, u'r': 0.6299605249474366, u'nu': 1.2311444133449163, u'fixed': 0.0}], u'view_state': [{u'column_component_suffstats': [[{u'sum_x': 2952788169.25, u'sum_x_squared': 8.718957972462767e+18, u'N': 1.0}, {u'sum_x': 2952788047.4, u'sum_x_squared': 8.718957252868305e+18, u'N': 1.0}]], u'row_partition_model': {u'counts': [1, 1], u'hypers': {u'alpha': 2.0}}, u'column_names': [u'c']}]}
```
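To see why sufficient statistics of this form are fragile at this scale, one can recompute the variance of those two values from `sum_x`/`sum_x_squared` as stored in the dump, versus a centered computation. This is a standalone sketch (variable names are mine, not crosscat's):

```python
# Standalone sketch of the cancellation hazard at this scale.  With data near
# 3e9, sum_x_squared is ~8.7e18, so a double's rounding error there
# (~eps * 8.7e18, i.e. roughly 2e3) is comparable to the true variance.
xs = [2952788047.4, 2952788169.25]
n = float(len(xs))

sum_x = sum(xs)
sum_x_sq = sum(x * x for x in xs)
mean = sum_x / n

naive_var = sum_x_sq / n - mean * mean             # difference of ~8.7e18 terms
stable_var = sum((x - mean) ** 2 for x in xs) / n  # ~3711.86, computed stably

print(naive_var)    # can be off by thousands, and can even come out negative
print(stable_var)   # ~3711.86 (population variance)
```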
Here's an example with 10 values:

```python
bdb.sql_execute('INSERT INTO D (c) VALUES '
                '(32003.9998936), (32004.0002834), (32003.999974), '
                '(32003.9998355), (32004.0003761), (32003.9997346), '
                '(32003.9999356), (32004.0003391), (32004.0002069), '
                '(32003.999385)')
```
I think these multi-value examples reflect a different bug, though. I think crosscat is having a hard time with the fact that the variance is so low relative to the mean. Probably some kind of pre-analysis massaging is in order (e.g., zeroing the mean).
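For illustration, here is the kind of massaging that has in mind, sketched against the 10-value example above. It assumes the `bdb` handle and an empty table `D` from the reproduction script earlier in the thread, and it is not an existing bayeslite feature:

```python
# Hypothetical pre-analysis massaging: subtract the sample mean so the stored
# sum-of-squares (~1e9 per value here) no longer dwarfs the variance (~1e-7)
# by roughly 16 orders of magnitude.
values = [32003.9998936, 32004.0002834, 32003.999974, 32003.9998355,
          32004.0003761, 32003.9997346, 32003.9999356, 32004.0003391,
          32004.0002069, 32003.999385]
mean = sum(values) / len(values)
centered = [v - mean for v in values]   # same spread, magnitudes now ~1e-4

bdb.sql_execute('INSERT INTO D (c) VALUES ' +
                ', '.join('(%r)' % (v,) for v in centered))
```

Full standardization (also dividing by the sample standard deviation) would avoid trading one extreme scale for another, since the centered values here are themselves quite small.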
Although, here's an example of a failure case with mean / variance < 1e5:
I think that must be a breakdown in the prior. Again, zeroing the mean would likely help here.
We could probably do better by using Welford's algorithm for incrementally computing normal sufficient statistics, which stores the mean and n*var instead of the sum and the sum of squares or similar. This would require figuring out the changes to the math in the continuous component model, its sufficient statistics, and the updates to the hyperparameters. I'm not sure it's the right thing -- I haven't looked closely enough to be sure what goes wrong in update_continuous_hypers in all cases -- but it is a plausible place to investigate. We have code in bayeslite to compute Welford's algorithm, in …
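For concreteness, here is a minimal standalone sketch of Welford's recurrence for the running mean and the sum of squared deviations (M2, i.e. n*var). This is just the textbook update, not the bayeslite code referred to above, and the two data values are the ones from the earlier two-value example:

```python
def welford_update(state, x):
    """One Welford step: state is (n, mean, m2), where m2 = n * variance."""
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)    # uses the updated mean; terms stay well-scaled
    return (n, mean, m2)

state = (0, 0.0, 0.0)
for x in [2952788047.4, 2952788169.25]:
    state = welford_update(state, x)

n, mean, m2 = state
print(mean)      # ~2952788108.325
print(m2 / n)    # ~3711.86 (population variance), with no catastrophic
                 # cancellation, unlike the sum / sum-of-squares form
```

Because the update only ever squares deviations from the running mean, the overall scale of the data drops out of the subtraction that hurts the sum-of-squares form.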
Passing some really tiny values results in a NaN logscore. Most tiny values are OK, so it may be a cancellation or roundoff error.
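As a purely hypothetical sketch of how roundoff at this scale could turn into a NaN (an assumption about the mechanism, not a trace of crosscat's actual code): once a value near 1e-146 is squared, further products of such terms underflow to exactly zero, and combining the resulting non-finite log terms gives NaN.

```python
import math

# Hypothetical illustration only: quantities derived from a value of this
# magnitude quickly fall off the bottom of the IEEE double range.
x = -1.0010415476e-146
sum_x_sq = x * x                     # ~1.002e-292, just above the subnormals
tiny_product = sum_x_sq * sum_x_sq   # underflows to exactly 0.0
print(tiny_product == 0.0)           # True

# The log of an underflowed quantity is -inf, and a difference of two such
# log terms (as in a difference of log densities) is NaN.
log_a = math.log(tiny_product) if tiny_product > 0 else float('-inf')
log_b = float('-inf')
print(log_a - log_b)                 # nan
```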