
BayesDB falls over when passed some really tiny values #388

Closed
alxempirical opened this issue Mar 28, 2016 · 13 comments
@alxempirical
Contributor

It results in a NaN logscore. Most tiny values are OK, so it may be a cancellation or roundoff error.

E.g.

from bayeslite.bayesdb import bayesdb_open
bdb = bayesdb_open()
bdb.sql_execute('CREATE TABLE D (c float)')
bdb.sql_execute('INSERT INTO D (c) VALUES (-1.0010415476e-146)')
# bdb.sql_execute('INSERT INTO D (c) VALUES (0)')
bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
bdb.execute('INITIALIZE 1 MODELS for D_cc')
bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')

Results in

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-168-ba232d0a75d6> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/var/folders/8w/h52wthts2gl7kn2sx3g5nql00000gn/T/py614324xn''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/var/folders/8w/h52wthts2gl7kn2sx3g5nql00000gn/T/py614324xn''');exec(compile(__code, '''/tmp/tst.py''', 'exec'));

/tmp/tst.py in <module>()
      6 bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
      7 bdb.execute('INITIALIZE 1 MODELS for D_cc')
----> 8 bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in execute(self, string, bindings)
    220             bindings = ()
    221         return self._maybe_trace(
--> 222             self.tracer, self._do_execute, string, bindings)
    223 
    224     def _maybe_trace(self, tracer, meth, string, bindings):

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in _maybe_trace(self, tracer, meth, string, bindings)
    228         if tracer:
    229             tracer(string, bindings)
--> 230         return meth(string, bindings)
    231 
    232     def _qid(self):

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in _do_execute(self, string, bindings)
    269         else:
    270             raise ValueError('>1 phrase in string')
--> 271         cursor = bql.execute_phrase(self, phrase, bindings)
    272         return self._empty_cursor if cursor is None else cursor
    273 

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bql.py in execute_phrase(bdb, phrase, bindings)
    464             max_seconds=phrase.seconds,
    465             ckpt_iterations=phrase.ckpt_iterations,
--> 466             ckpt_seconds=phrase.ckpt_seconds)
    467         return empty_cursor(bdb)
    468 

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/metamodels/crosscat.py in analyze_models(self, bdb, generator_id, modelnos, iterations, max_seconds, ckpt_iterations, ckpt_seconds)
   1036                     assert diagnostics['logscore'][-1][i] is not None
   1037                     assert not math.isnan(diagnostics['logscore'][-1][i]), \
-> 1038                         'bad X_L before %r after %r' % (X_L, X_L_list_0[i])
   1039                     assert 0 < len(diagnostics['num_views'])
   1040                     assert 0 < len(diagnostics['column_crp_alpha'])

AssertionError: bad X_L before {'column_partition': {'assignments': [0], 'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_hypers': [{'mu': -1.0010415476e-146, 's': 2.2250738585072014e-308, 'r': 1.0, 'fixed': 0.0, 'nu': 1.0}], 'view_state': [{'column_component_suffstats': [[{'sum_x': -1.0010415476e-146, 'sum_x_squared': 1.0020841800214032e-292, 'N': 1.0}]], 'row_partition_model': {'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_names': [u'c']}]} after {u'column_partition': {u'assignments': [0], u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_hypers': [{u'mu': -1.0010415476e-146, u's': 2.2250738585072014e-308, u'r': 1.0, u'nu': 1.0, u'fixed': 0.0}], u'view_state': [{u'column_component_suffstats': [[{u'sum_x': -1.0010415476e-146, u'sum_x_squared': 1.0020841800214032e-292, u'N': 1.0}]], u'row_partition_model': {u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_names': [u'c']}]}
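One detail worth noting in the dump above: the `s` hyperparameter is 2.2250738585072014e-308, which is exactly the smallest positive normal double. That is consistent with the underflow/roundoff guess. A minimal plain-Python sketch (illustrative only, not crosscat's actual code path) of how arithmetic near that boundary can turn into a NaN logscore:

```python
import math
import sys

# 's' in the dumped column_hypers is 2.2250738585072014e-308, which is
# exactly sys.float_info.min, the smallest positive *normal* double.
# That suggests a scale-like quantity underflowed and got clamped.
assert sys.float_info.min == 2.2250738585072014e-308

tiny = -1.0010415476e-146
assert tiny * tiny > 0.0          # the square (~1e-292) is still representable
assert (tiny * tiny) ** 2 == 0.0  # but one more squaring underflows to zero

# Once a scale underflows to 0.0, log-density arithmetic degenerates:
# log(0) is -inf, and -inf minus -inf is NaN.
assert math.isnan(float('-inf') - float('-inf'))
print("underflow path to NaN confirmed")
```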
@riastradh-probcomp
Contributor

What else did you try to conclude that 'most tiny values are OK'?

This setup satisfies the hypotheses of probcomp/crosscat#85.

@alxempirical
Contributor Author

The following script records 46 failures out of 200 experiments:

import pylab
pylab.ion()
from bayeslite.bayesdb import bayesdb_open
bad = []
for i in pylab.arange(.999, 1.001, 0.00001):
    print(i)
    try:
        bdb = bayesdb_open()
        bdb.sql_execute('CREATE TABLE D (c float)')
        bdb.sql_execute('INSERT INTO D (c) VALUES (%g)' % (-1.0010415476e-146*i))
        # bdb.sql_execute('INSERT INTO D (c) VALUES (0)')
        bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
        bdb.execute('INITIALIZE 1 MODELS for D_cc')
        bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')
    except AssertionError as e:
        assert any('bad X_L before' in a for a in e.args)
        bad.append(i*-1.0010415476e-146)

pylab.hist(bad, bins=100)
pylab.show()

@gregory-marton
Contributor

@alxempirical You have an insert 0 commented out, which would nix the bug 85 connection. Can you give us a little more history on that?

@alxempirical
Contributor Author

The test does not fail if you insert a single 0 into the table (as opposed to some of the other values), that's all.

The possible connection to #85 is unclear to me, as the sufficient statistics are nonzero, according to the AssertionError reports.

Best,
Alex

On Tue, Mar 29, 2016 at 9:48 AM, Gregory Marton [email protected] wrote:

> @alxempirical https://github.com/alxempirical You have an insert 0 commented out, which would nix the bug 85 connection. Can you give us a little more history on that?



@alxempirical
Contributor Author

The values don't have to be particularly tiny after all. The example also fails with -0.00114661120796.

@alxempirical
Contributor Author

And for larger values, they all fail.

@alxempirical
Contributor Author

Here is an example with two values inserted into the table.

bdb.sql_execute('INSERT INTO D (c) VALUES (2952788047.4), (2952788169.25)')

Failure is

AssertionError: bad X_L before {'column_partition': {'assignments': [0], 'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_hypers': [{'mu': 2952788092.0783334, 's': 7423.711238379478, 'r': 0.5484124898473131, 'fixed': 0.0, 'nu': 2.0}], 'view_state': [{'column_component_suffstats': [[{'sum_x': 2952788169.25, 'sum_x_squared': 8.718957972462767e+18, 'N': 1.0}, {'sum_x': 2952788047.4, 'sum_x_squared': 8.718957252868305e+18, 'N': 1.0}]], 'row_partition_model': {'counts': [1, 1], 'hypers': {'alpha': 1.2030250360821166}}, 'column_names': [u'c']}]} after {u'column_partition': {u'assignments': [0], u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_hypers': [{u'mu': 2952788055.5233335, u's': 3445.7815188027907, u'r': 0.6299605249474366, u'nu': 1.2311444133449163, u'fixed': 0.0}], u'view_state': [{u'column_component_suffstats': [[{u'sum_x': 2952788169.25, u'sum_x_squared': 8.718957972462767e+18, u'N': 1.0}, {u'sum_x': 2952788047.4, u'sum_x_squared': 8.718957252868305e+18, u'N': 1.0}]], u'row_partition_model': {u'counts': [1, 1], u'hypers': {u'alpha': 2.0}}, u'column_names': [u'c']}]}
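For what it's worth, this pair is a textbook catastrophic-cancellation case if the variance is formed from sum_x and sum_x_squared (which is exactly what the suffstats above store): the sums of squares are around 8.7e18, where adjacent doubles are roughly 2048 apart, yet the true variance is only a few thousand. A quick plain-Python illustration (not crosscat's actual code):

```python
x = [2952788047.4, 2952788169.25]
n = len(x)
mean = sum(x) / n

# Two-pass (shifted) variance: well-conditioned.
var_direct = sum((v - mean) ** 2 for v in x) / n

# Sums-of-squares variance, as stored in the suffstats above: subtracts
# two numbers of magnitude ~8.7e18, where the double spacing is ~2048,
# so the rounding error can be comparable to the true variance itself.
sum_x_sq = sum(v * v for v in x)
var_sumsq = sum_x_sq / n - mean * mean

print(var_direct, var_sumsq)
```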

@alxempirical
Contributor Author

Here's an example with 10 values:

bdb.sql_execute('INSERT INTO D (c) VALUES (32003.9998936), (32004.0002834), (32003.999974), (32003.9998355), (32004.0003761), (32003.9997346), (32003.9999356), (32004.0003391), (32004.0002069), (32003.999385)')

@alxempirical
Contributor Author

I think these multi-value examples reflect a different bug, though. I think crosscat is having a hard time with the fact that the variance is so low relative to the mean. Probably some kind of pre-analysis massaging is in order (e.g., zeroing the mean).
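As a sketch of that massaging (a hypothetical pre-processing step, not something bayeslite currently does): subtract a rough center from the data before forming sufficient statistics, and the sums-of-squares variance becomes well-conditioned. Using the 10-value example above:

```python
# Hypothetical "zero the mean" pre-processing: shift the data first so
# sum_x_squared stays small relative to the variance.
x = [32003.9998936, 32004.0002834, 32003.999974, 32003.9998355,
     32004.0003761, 32003.9997346, 32003.9999356, 32004.0003391,
     32004.0002069, 32003.999385]
shift = sum(x) / len(x)          # any rough center works; the mean is natural
centered = [v - shift for v in x]

n = len(centered)
sum_x = sum(centered)
sum_x_sq = sum(v * v for v in centered)
# Same sums-of-squares formula as before, but now there is no giant
# cancellation: both terms are tiny.
var = sum_x_sq / n - (sum_x / n) ** 2
print(var)
```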

@alxempirical
Contributor Author

Although, here's an example of a failure case with mean / variance < 1e5:

(-8349033494.51), (-8349033446.63), (-8349033483.54), (-8349033459.14), (-8349033564.8), (-8349033545.5), (-8349033500.63), (-8349033544.15), (-8349033524.37), (-8349033519.97), (-8349033527.93), (-8349033519.13), (-8349033470.27), (-8349033465.52), (-8349033505.4), (-8349033471.94), (-8349033508.88), (-8349033585.78), (-8349033472.22), (-8349033517.04), (-8349033526.65), (-8349033395.85), (-8349033480.05), (-8349033449.03), (-8349033563.84), (-8349033577.02), (-8349033603.64), (-8349033478.46), (-8349033435.14), (-8349033546.53), (-8349033519.77), (-8349033555.33), (-8349033447.69), (-8349033433.39), (-8349033506.26), (-8349033455.17), (-8349033487.83), (-8349033574.37), (-8349033578.44), (-8349033527.87), (-8349033531.29), (-8349033495.57), (-8349033475.23), (-8349033446.2), (-8349033489.05), (-8349033450.93), (-8349033499.91), (-8349033544.6), (-8349033490.2), (-8349033471.97), (-8349033522.98), (-8349033469.58), (-8349033571.43), (-8349033476.97), (-8349033530.76), (-8349033483.41), (-8349033546.37), (-8349033530.12), (-8349033549.73), (-8349033546.71), (-8349033504.6), (-8349033494.51), (-8349033528.86), (-8349033537.92), (-8349033489.17), (-8349033510.14), (-8349033575.92), (-8349033438.21), (-8349033419.29), (-8349033522.73), (-8349033540.04), (-8349033457.77), (-8349033439.46), (-8349033491.53), (-8349033517.42), (-8349033500.49), (-8349033528.28), (-8349033527.71), (-8349033546.02), (-8349033515.45), (-8349033447.93), (-8349033542.91), (-8349033450.29), (-8349033565.23), (-8349033500.77), (-8349033561.56), (-8349033505.69), (-8349033461.75), (-8349033464.14), (-8349033517.55), (-8349033571.39), (-8349033532.61), (-8349033434.58), (-8349033531.11), (-8349033435.72), (-8349033426.6), (-8349033500.53), (-8349033499.87), (-8349033483.35), (-8349033565.21)

@alxempirical
Contributor Author

I think that must be a breakdown in the prior. Again, zeroing the mean would likely help here.

@riastradh-probcomp
Contributor

We could probably do better by using Welford's algorithm for incrementally computing normal sufficient statistics, which stores the mean and n*var instead of the sum and the sum of squares or similar. This would require figuring out the changes to the math in the continuous component model, its sufficient statistics, and the updates to the hyperparameters.

I'm not sure it's the right thing -- I haven't looked closely enough to be sure what goes wrong in update_continuous_hypers in all cases -- but it is a plausible place to investigate. We have code in bayeslite to compute Welford's algorithm, in src/stats.py.
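For reference, a minimal sketch of Welford's update, carrying the running mean and M2 (the sum of squared deviations, i.e. n*var) instead of the raw sums. This is illustrative only, not the bayeslite src/stats.py implementation, and the continuous component model's hyperparameter updates would still need the math reworked:

```python
def welford(xs):
    """Return (mean, variance) computed by Welford's incremental algorithm.

    Stores the running mean and M2 = sum of squared deviations, never the
    raw sum of squares, so a huge mean with a tiny spread does not cause
    catastrophic cancellation.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n if n else float('nan'))

# Values like those in the failing examples: huge mean, tiny spread.
mean, var = welford([-8349033494.51, -8349033446.63, -8349033483.54])
print(mean, var)
```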

@fsaad
Collaborator

fsaad commented Dec 17, 2017

d2376c1

@fsaad fsaad closed this as completed Dec 17, 2017