
BayesDB falls over when passed some really tiny values #388

Closed
alxempirical opened this issue Mar 28, 2016 · 13 comments
@alxempirical
Contributor

It results in a NaN logscore. Most tiny values are OK, so it may be a cancellation or roundoff error.

E.g.

from bayeslite.bayesdb import bayesdb_open
bdb = bayesdb_open()
bdb.sql_execute('CREATE TABLE D (c float)')
bdb.sql_execute('INSERT INTO D (c) VALUES (-1.0010415476e-146)')
# bdb.sql_execute('INSERT INTO D (c) VALUES (0)')
bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
bdb.execute('INITIALIZE 1 MODELS for D_cc')
bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')

Results in

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-168-ba232d0a75d6> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/var/folders/8w/h52wthts2gl7kn2sx3g5nql00000gn/T/py614324xn''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/var/folders/8w/h52wthts2gl7kn2sx3g5nql00000gn/T/py614324xn''');exec(compile(__code, '''/tmp/tst.py''', 'exec'));

/tmp/tst.py in <module>()
      6 bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
      7 bdb.execute('INITIALIZE 1 MODELS for D_cc')
----> 8 bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in execute(self, string, bindings)
    220             bindings = ()
    221         return self._maybe_trace(
--> 222             self.tracer, self._do_execute, string, bindings)
    223 
    224     def _maybe_trace(self, tracer, meth, string, bindings):

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in _maybe_trace(self, tracer, meth, string, bindings)
    228         if tracer:
    229             tracer(string, bindings)
--> 230         return meth(string, bindings)
    231 
    232     def _qid(self):

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bayesdb.py in _do_execute(self, string, bindings)
    269         else:
    270             raise ValueError('>1 phrase in string')
--> 271         cursor = bql.execute_phrase(self, phrase, bindings)
    272         return self._empty_cursor if cursor is None else cursor
    273 

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/bql.py in execute_phrase(bdb, phrase, bindings)
    464             max_seconds=phrase.seconds,
    465             ckpt_iterations=phrase.ckpt_iterations,
--> 466             ckpt_seconds=phrase.ckpt_seconds)
    467         return empty_cursor(bdb)
    468 

/Users/alx/Google Drive/alien/probcomp/bayeslite/bayeslite/metamodels/crosscat.py in analyze_models(self, bdb, generator_id, modelnos, iterations, max_seconds, ckpt_iterations, ckpt_seconds)
   1036                     assert diagnostics['logscore'][-1][i] is not None
   1037                     assert not math.isnan(diagnostics['logscore'][-1][i]), \
-> 1038                         'bad X_L before %r after %r' % (X_L, X_L_list_0[i])
   1039                     assert 0 < len(diagnostics['num_views'])
   1040                     assert 0 < len(diagnostics['column_crp_alpha'])

AssertionError: bad X_L before {'column_partition': {'assignments': [0], 'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_hypers': [{'mu': -1.0010415476e-146, 's': 2.2250738585072014e-308, 'r': 1.0, 'fixed': 0.0, 'nu': 1.0}], 'view_state': [{'column_component_suffstats': [[{'sum_x': -1.0010415476e-146, 'sum_x_squared': 1.0020841800214032e-292, 'N': 1.0}]], 'row_partition_model': {'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_names': [u'c']}]} after {u'column_partition': {u'assignments': [0], u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_hypers': [{u'mu': -1.0010415476e-146, u's': 2.2250738585072014e-308, u'r': 1.0, u'nu': 1.0, u'fixed': 0.0}], u'view_state': [{u'column_component_suffstats': [[{u'sum_x': -1.0010415476e-146, u'sum_x_squared': 1.0020841800214032e-292, u'N': 1.0}]], u'row_partition_model': {u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_names': [u'c']}]}
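One detail worth noting in the dump above: the `s` hyperparameter is 2.2250738585072014e-308, which is exactly the smallest positive normal double. That is consistent with the underflow/roundoff guess. A minimal plain-Python sketch (illustrative only, not crosscat's actual code path) of how arithmetic near that boundary can turn into a NaN logscore:

```python
import math
import sys

# 's' in the dumped column_hypers is 2.2250738585072014e-308, which is
# exactly sys.float_info.min, the smallest positive *normal* double.
# That suggests a scale-like quantity underflowed and got clamped.
assert sys.float_info.min == 2.2250738585072014e-308

tiny = -1.0010415476e-146
assert tiny * tiny > 0.0          # the square (~1e-292) is still representable
assert (tiny * tiny) ** 2 == 0.0  # but one more squaring underflows to zero

# Once a scale underflows to 0.0, log-density arithmetic degenerates:
# log(0) is -inf, and -inf minus -inf is NaN.
assert math.isnan(float('-inf') - float('-inf'))
print("underflow path to NaN confirmed")
```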
@riastradh-probcomp
Contributor

What else did you try to conclude that 'most tiny values are OK'?

This setup satisfies the hypotheses of probcomp/crosscat#85.

@alxempirical
Contributor Author

The following script records 46 failures out of 200 experiments:

import pylab
pylab.ion()
from bayeslite.bayesdb import bayesdb_open
bad = []
for i in pylab.arange(.999, 1.001, 0.00001):
    print(i)
    try:
        bdb = bayesdb_open()
        bdb.sql_execute('CREATE TABLE D (c float)')
        bdb.sql_execute('INSERT INTO D (c) VALUES (%g)' % (-1.0010415476e-146*i))
        # bdb.sql_execute('INSERT INTO D (c) VALUES (0)')
        bdb.execute('CREATE GENERATOR D_cc FOR D USING crosscat("c" numerical)')
        bdb.execute('INITIALIZE 1 MODELS for D_cc')
        bdb.execute('ANALYZE D_cc FOR 1 ITERATIONS WAIT')
    except AssertionError as e:
        assert any('bad X_L before' in a for a in e.args)
        bad.append(i*-1.0010415476e-146)

pylab.hist(bad, bins=100)
pylab.show()

@gregory-marton
Contributor

@alxempirical You have an insert 0 commented out, which would nix the bug 85 connection. Can you give us a little more history on that?

@alxempirical
Contributor Author

The test does not fail if you insert a single 0 into the table (as opposed to some of the other values), that's all.

The possible connection to #85 is unclear to me, as the sufficient statistics are nonzero, according to the AssertionError reports.

Best,
Alex

On Tue, Mar 29, 2016 at 9:48 AM, Gregory Marton [email protected] wrote:

> @alxempirical https://github.com/alxempirical You have an insert 0 commented out, which would nix the bug 85 connection. Can you give us a little more history on that?



@alxempirical
Contributor Author

The values don't have to be particularly tiny after all. The example also fails with -0.00114661120796.

@alxempirical
Contributor Author

And for larger values, they all fail.

@alxempirical
Contributor Author

Here is an example with two values inserted into the table.

bdb.sql_execute('INSERT INTO D (c) VALUES (2952788047.4), (2952788169.25)')

Failure is

AssertionError: bad X_L before {'column_partition': {'assignments': [0], 'counts': [1], 'hypers': {'alpha': 1.0}}, 'column_hypers': [{'mu': 2952788092.0783334, 's': 7423.711238379478, 'r': 0.5484124898473131, 'fixed': 0.0, 'nu': 2.0}], 'view_state': [{'column_component_suffstats': [[{'sum_x': 2952788169.25, 'sum_x_squared': 8.718957972462767e+18, 'N': 1.0}, {'sum_x': 2952788047.4, 'sum_x_squared': 8.718957252868305e+18, 'N': 1.0}]], 'row_partition_model': {'counts': [1, 1], 'hypers': {'alpha': 1.2030250360821166}}, 'column_names': [u'c']}]} after {u'column_partition': {u'assignments': [0], u'counts': [1], u'hypers': {u'alpha': 1.0}}, u'column_hypers': [{u'mu': 2952788055.5233335, u's': 3445.7815188027907, u'r': 0.6299605249474366, u'nu': 1.2311444133449163, u'fixed': 0.0}], u'view_state': [{u'column_component_suffstats': [[{u'sum_x': 2952788169.25, u'sum_x_squared': 8.718957972462767e+18, u'N': 1.0}, {u'sum_x': 2952788047.4, u'sum_x_squared': 8.718957252868305e+18, u'N': 1.0}]], u'row_partition_model': {u'counts': [1, 1], u'hypers': {u'alpha': 2.0}}, u'column_names': [u'c']}]}
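For what it's worth, this pair is a textbook catastrophic-cancellation case if the variance is formed from sum_x and sum_x_squared (which is exactly what the suffstats above store): the sums of squares are around 8.7e18, where adjacent doubles are roughly 2048 apart, yet the true variance is only a few thousand. A quick plain-Python illustration (not crosscat's actual code):

```python
x = [2952788047.4, 2952788169.25]
n = len(x)
mean = sum(x) / n

# Two-pass (shifted) variance: well-conditioned.
var_direct = sum((v - mean) ** 2 for v in x) / n

# Sums-of-squares variance, as stored in the suffstats above: subtracts
# two numbers of magnitude ~8.7e18, where the double spacing is ~2048,
# so the rounding error can be comparable to the true variance itself.
sum_x_sq = sum(v * v for v in x)
var_sumsq = sum_x_sq / n - mean * mean

print(var_direct, var_sumsq)
```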

@alxempirical
Contributor Author

Here's an example with 10 values:

bdb.sql_execute('INSERT INTO D (c) VALUES (32003.9998936), (32004.0002834), (32003.999974), (32003.9998355), (32004.0003761), (32003.9997346), (32003.9999356), (32004.0003391), (32004.0002069), (32003.999385)')

@alxempirical
Contributor Author

I think these multi-value examples reflect a different bug, though. I think crosscat is having a hard time with the fact that the variance is so low relative to the mean. Probably some kind of pre-analysis massaging is in order (e.g., zeroing the mean).
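As a sketch of that massaging (a hypothetical pre-processing step, not something bayeslite currently does): subtract a rough center from the data before forming sufficient statistics, and the sums-of-squares variance becomes well-conditioned. Using the 10-value example above:

```python
# Hypothetical "zero the mean" pre-processing: shift the data first so
# sum_x_squared stays small relative to the variance.
x = [32003.9998936, 32004.0002834, 32003.999974, 32003.9998355,
     32004.0003761, 32003.9997346, 32003.9999356, 32004.0003391,
     32004.0002069, 32003.999385]
shift = sum(x) / len(x)          # any rough center works; the mean is natural
centered = [v - shift for v in x]

n = len(centered)
sum_x = sum(centered)
sum_x_sq = sum(v * v for v in centered)
# Same sums-of-squares formula as before, but now there is no giant
# cancellation: both terms are tiny.
var = sum_x_sq / n - (sum_x / n) ** 2
print(var)
```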

@alxempirical
Contributor Author

Although, here's an example of a failure case with mean / variance < 1e5:

(-8349033494.51), (-8349033446.63), (-8349033483.54), (-8349033459.14), (-8349033564.8), (-8349033545.5), (-8349033500.63), (-8349033544.15), (-8349033524.37), (-8349033519.97), (-8349033527.93), (-8349033519.13), (-8349033470.27), (-8349033465.52), (-8349033505.4), (-8349033471.94), (-8349033508.88), (-8349033585.78), (-8349033472.22), (-8349033517.04), (-8349033526.65), (-8349033395.85), (-8349033480.05), (-8349033449.03), (-8349033563.84), (-8349033577.02), (-8349033603.64), (-8349033478.46), (-8349033435.14), (-8349033546.53), (-8349033519.77), (-8349033555.33), (-8349033447.69), (-8349033433.39), (-8349033506.26), (-8349033455.17), (-8349033487.83), (-8349033574.37), (-8349033578.44), (-8349033527.87), (-8349033531.29), (-8349033495.57), (-8349033475.23), (-8349033446.2), (-8349033489.05), (-8349033450.93), (-8349033499.91), (-8349033544.6), (-8349033490.2), (-8349033471.97), (-8349033522.98), (-8349033469.58), (-8349033571.43), (-8349033476.97), (-8349033530.76), (-8349033483.41), (-8349033546.37), (-8349033530.12), (-8349033549.73), (-8349033546.71), (-8349033504.6), (-8349033494.51), (-8349033528.86), (-8349033537.92), (-8349033489.17), (-8349033510.14), (-8349033575.92), (-8349033438.21), (-8349033419.29), (-8349033522.73), (-8349033540.04), (-8349033457.77), (-8349033439.46), (-8349033491.53), (-8349033517.42), (-8349033500.49), (-8349033528.28), (-8349033527.71), (-8349033546.02), (-8349033515.45), (-8349033447.93), (-8349033542.91), (-8349033450.29), (-8349033565.23), (-8349033500.77), (-8349033561.56), (-8349033505.69), (-8349033461.75), (-8349033464.14), (-8349033517.55), (-8349033571.39), (-8349033532.61), (-8349033434.58), (-8349033531.11), (-8349033435.72), (-8349033426.6), (-8349033500.53), (-8349033499.87), (-8349033483.35), (-8349033565.21)

@alxempirical
Contributor Author

I think that must be a breakdown in the prior. Again, zeroing the mean would likely help here.

@riastradh-probcomp
Contributor

We could probably do better by using Welford's algorithm for incrementally computing normal sufficient statistics, which stores the mean and n*var instead of the sum and the sum of squares or similar. This would require figuring out the changes to the math in the continuous component model, its sufficient statistics, and the updates to the hyperparameters.

I'm not sure it's the right thing -- I haven't looked closely enough to be sure what goes wrong in update_continuous_hypers in all cases -- but it is a plausible place to investigate. We have code in bayeslite to compute Welford's algorithm, in src/stats.py.
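For reference, a minimal sketch of Welford's update, carrying the running mean and M2 (the sum of squared deviations, i.e. n*var) instead of the raw sums. This is illustrative only, not the bayeslite src/stats.py implementation, and the continuous component model's hyperparameter updates would still need the math reworked:

```python
def welford(xs):
    """Return (mean, variance) computed by Welford's incremental algorithm.

    Stores the running mean and M2 = sum of squared deviations, never the
    raw sum of squares, so a huge mean with a tiny spread does not cause
    catastrophic cancellation.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n if n else float('nan'))

# Values like those in the failing examples: huge mean, tiny spread.
mean, var = welford([-8349033494.51, -8349033446.63, -8349033483.54])
print(mean, var)
```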

@fsaad
Collaborator

fsaad commented Dec 17, 2017

d2376c1

@fsaad fsaad closed this as completed Dec 17, 2017