[bug] fix MT2203 RNG non-uniformity and random bin indices in decision forest training #2592
base: main
Conversation
/intelci: run
special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee886edb-4adb-f114-a7aa-a4bf010d0e2e
Performance comparisons are available upon request.
The test_distribution for ExtraTreesRegressor is not fixed. This requires further analysis.
special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee891bc7-4b31-f12d-a430-a4bf010d0e2e
/intelci: run
Private CI shows this causes an issue with ExtraTreesClassification now; will need to investigate.
LGTM
RNGsInst<unsigned int, cpu> rng;
services::internal::TArray<unsigned int, cpu> temp(burn);

Suggested change:

RNGsInst<uint32_t, cpu> rng;
services::internal::TArray<uint32_t, cpu> temp(burn);
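The suggestion above swaps the platform-dependent unsigned int for the fixed-width uint32_t, presumably so the element width of the RNG scratch buffer is explicit and identical on every platform (unsigned int is only guaranteed to be at least 16 bits wide).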
/intelci: run ml_benchmarks
Description
Each MT2203 RNG engine is uniform on its own when drawing samples. However, when the outputs of two or more engines are aggregated, the initial combined values are not uniform. Because the decision forest algorithm must guarantee randomness between trees (each tree has its own RNG engine), a burn-in is introduced: each engine discards an initial block of RNG values until the engine collection is empirically uniform. The performance cost of this burn-in is imperceptible.
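A minimal sketch of the burn-in idea, using std::mt19937 as a stand-in (MT2203 is an oneMKL engine family not available in the C++ standard library); makeTreeEngines, the per-tree seeding scheme, and the burn count are illustrative assumptions, not the code in this PR, which uses oneDAL's internal RNGsInst and a TArray scratch buffer as shown in the review snippet above:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch only: one independently seeded engine per tree, each of which
// discards ("burns") a fixed block of initial outputs before any tree
// consumes random numbers, so the aggregated streams across engines
// start from an empirically uniform point.
std::vector<std::mt19937> makeTreeEngines(std::size_t nTrees,
                                          std::uint32_t seed,
                                          std::uint64_t burn)
{
    std::vector<std::mt19937> engines;
    engines.reserve(nTrees);
    for (std::size_t i = 0; i < nTrees; ++i)
    {
        // Hypothetical seeding scheme, for illustration only.
        std::mt19937 engine(seed + static_cast<std::uint32_t>(i));
        engine.discard(burn); // skip the non-uniform initial segment
        engines.push_back(engine);
    }
    return engines;
}
```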
The second issue is in the binary search used to find a split for ExtraTrees (regressor and classifier). In its previous orientation the search could fail to find the largest bin left edge, so it has been reoriented to always guarantee a valid split. This change stems from the ambiguity of using a binning approach with the Extra Trees algorithm definition. All uses of the .min parameter are removed, and it is therefore removed entirely from IndexedFeatures and the initial binning scripts.
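A minimal sketch of the reoriented lookup, assuming the bin left edges are sorted in ascending order; findBinForSplit is a hypothetical helper for illustration, not the oneDAL routine:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the largest bin left edge that is <= splitValue,
// i.e. the bin the previous search orientation could miss, clamping to
// bin 0 so a valid split is always produced.
std::size_t findBinForSplit(const std::vector<double> & binLeftEdges, double splitValue)
{
    // upper_bound yields the first edge strictly greater than splitValue;
    // stepping back one lands on the largest edge <= splitValue.
    auto it = std::upper_bound(binLeftEdges.begin(), binLeftEdges.end(), splitValue);
    if (it == binLeftEdges.begin()) return 0;
    return static_cast<std::size_t>(it - binLeftEdges.begin()) - 1;
}
```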
This will fix the following deselected_tests from sklearnex:
tests/test_multioutput.py::test_classifier_chain_tuple_order
ensemble/tests/test_forest.py::test_distribution
However, this changes the determinism of the trees used in the sklearnex tests, which means some tests that previously passed by chance could now fail.
This non-uniformity negatively impacts both random forests, in the bootstrapping process, and extra trees, in the initially chosen splits.
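A minimal sketch of the kind of empirical uniformity check this implies, again with std::mt19937 standing in for the MT2203 family; the engine count, bin count, and sequential seeding are illustrative assumptions:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>

// Draws the first output of many sequentially seeded engines and histograms
// them into 16 bins. Deviations from a flat histogram across engines are the
// kind of aggregated non-uniformity the burn-in is meant to remove.
int main()
{
    constexpr std::size_t kEngines = 1u << 16;
    constexpr std::size_t kBins    = 16;
    std::array<std::size_t, kBins> hist{};

    for (std::size_t seed = 0; seed < kEngines; ++seed)
    {
        std::mt19937 engine(static_cast<std::uint32_t>(seed));
        const std::uint32_t first = engine();
        hist[first / (UINT32_MAX / kBins + 1)] += 1; // map value to one of kBins ranges
    }
    for (std::size_t b = 0; b < kBins; ++b)
    {
        std::printf("bin %2zu: %zu\n", b, hist[b]);
    }
    return 0;
}
```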
Changes proposed in this pull request: