[bug] fix MT2203 RNG non-uniformity and random bin indices in decision forest training #2592
base: main
Conversation
/intelci: run
special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee886edb-4adb-f114-a7aa-a4bf010d0e2e
Performance comparisons are available upon request.
The test_distribution for ExtraTreesRegressor is not fixed. This requires further analysis.
special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee891bc7-4b31-f12d-a430-a4bf010d0e2e
/intelci: run
Private CI shows this causes an issue with ExtraTreesClassification now; will need to investigate.
LGTM
RNGsInst<unsigned int, cpu> rng;
services::internal::TArray<unsigned int, cpu> temp(burn);

Suggested change:

RNGsInst<uint32_t, cpu> rng;
services::internal::TArray<uint32_t, cpu> temp(burn);
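The suggestion above swaps the platform-dependent unsigned int for the fixed-width uint32_t, presumably so the element width of the RNG scratch buffer is explicit and identical on every platform (unsigned int is only guaranteed to be at least 16 bits wide).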
/intelci: run ml_benchmarks
Description
Each MT2203 RNG engine is uniform on its own when drawing samples. However, when the outputs of two or more engines are aggregated, the initial combined values are not uniform. Because the decision forest algorithm must guarantee randomness between trees (each tree has its own RNG engine), a burn-in is introduced: each engine discards an initial block of RNG values until the engine collection is empirically uniform. The performance cost of this burn-in is imperceptible.
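A minimal sketch of the burn-in idea, using std::mt19937 as a stand-in (MT2203 is an oneMKL engine family not available in the C++ standard library); makeTreeEngines, the per-tree seeding scheme, and the burn count are illustrative assumptions, not the code in this PR, which uses oneDAL's internal RNGsInst and a TArray scratch buffer as shown in the review snippet above:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch only: one independently seeded engine per tree, each of which
// discards ("burns") a fixed block of initial outputs before any tree
// consumes random numbers, so the aggregated streams across engines
// start from an empirically uniform point.
std::vector<std::mt19937> makeTreeEngines(std::size_t nTrees,
                                          std::uint32_t seed,
                                          std::uint64_t burn)
{
    std::vector<std::mt19937> engines;
    engines.reserve(nTrees);
    for (std::size_t i = 0; i < nTrees; ++i)
    {
        // Hypothetical seeding scheme, for illustration only.
        std::mt19937 engine(seed + static_cast<std::uint32_t>(i));
        engine.discard(burn); // skip the non-uniform initial segment
        engines.push_back(engine);
    }
    return engines;
}
```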
The second issue is in the binary search used to find a split for ExtraTrees (regressor and classifier). In its previous orientation the search could fail to find the largest bin left edge, so it has been reoriented to always guarantee a valid split. This change stems from the ambiguity of using a binning approach with the Extra Trees algorithm definition. All uses of the .min parameter are removed, and it is therefore removed entirely from IndexedFeatures and the initial binning scripts.
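A minimal sketch of the reoriented lookup, assuming the bin left edges are sorted in ascending order; findBinForSplit is a hypothetical helper for illustration, not the oneDAL routine:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the largest bin left edge that is <= splitValue,
// i.e. the bin the previous search orientation could miss, clamping to
// bin 0 so a valid split is always produced.
std::size_t findBinForSplit(const std::vector<double> & binLeftEdges, double splitValue)
{
    // upper_bound yields the first edge strictly greater than splitValue;
    // stepping back one lands on the largest edge <= splitValue.
    auto it = std::upper_bound(binLeftEdges.begin(), binLeftEdges.end(), splitValue);
    if (it == binLeftEdges.begin()) return 0;
    return static_cast<std::size_t>(it - binLeftEdges.begin()) - 1;
}
```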
This will fix the following deselected_tests from sklearnex:
tests/test_multioutput.py::test_classifier_chain_tuple_order
ensemble/tests/test_forest.py::test_distribution
However, this changes the determinism of the trees used in the sklearnex tests, which means some tests that previously passed by chance could now fail.
This non-uniformity negatively impacts both random forests, in the bootstrapping process, and extra trees, in the initially chosen splits.
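A minimal sketch of the kind of empirical uniformity check this implies, again with std::mt19937 standing in for the MT2203 family; the engine count, bin count, and sequential seeding are illustrative assumptions:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>

// Draws the first output of many sequentially seeded engines and histograms
// them into 16 bins. Deviations from a flat histogram across engines are the
// kind of aggregated non-uniformity the burn-in is meant to remove.
int main()
{
    constexpr std::size_t kEngines = 1u << 16;
    constexpr std::size_t kBins    = 16;
    std::array<std::size_t, kBins> hist{};

    for (std::size_t seed = 0; seed < kEngines; ++seed)
    {
        std::mt19937 engine(static_cast<std::uint32_t>(seed));
        const std::uint32_t first = engine();
        hist[first / (UINT32_MAX / kBins + 1)] += 1; // map value to one of kBins ranges
    }
    for (std::size_t b = 0; b < kBins; ++b)
    {
        std::printf("bin %2zu: %zu\n", b, hist[b]);
    }
    return 0;
}
```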
Changes proposed in this pull request: