Version 0.38.1
--------------
This is a critical bug fix release addressing:
https://github.com/numba/numba/issues/3006
The bug does not impact users of conda packages from Anaconda or the Intel
Python Distribution (but it does impact conda-forge). It does not impact pip
users installing wheels from PyPI.
This only impacts a small number of users where:
* The ICC runtime (specifically libsvml) is present in the user's environment.
* The user is using an llvmlite statically linked against a version of LLVM
that has not been patched with SVML support.
* The platform is 64-bit.
The release fixes a code generation path that could lead to the production of
incorrect results under the above situation.
Fixes:
* PR #3007: Augment SVML detection with llvmlite SVML patch detection.
Contributors:
The following people contributed to this release.
* Stuart Archibald (core dev)
Version 0.38.0
--------------
Following on from the bug fix focus of the last release, this release swings
back towards the addition of new features and usability improvements based on
community feedback. This release is comparatively large! Three key features/
changes to note are:
* Numba (via llvmlite) is now backed by LLVM 6.0; general vectorization is
improved as a result. A significant, long-standing LLVM bug that was causing
corruption was also found and fixed.
* Further considerable improvements in vectorization are made available as
Numba now supports Intel's short vector math library (SVML).
Try it out with `conda install -c numba icc_rt`.
* CUDA 8.0 is now the minimum supported CUDA version.
Other highlights include:
* Bug fixes to `parallel=True` have enabled more vectorization opportunities
when using the ParallelAccelerator technology.
* Much effort has gone into improving error reporting and the general usability
of Numba. This includes highlighted error messages and performance tips
documentation. Try it out with `conda install colorama`.
* A number of new NumPy functions are supported: `np.convolve`, `np.correlate`,
`np.reshape`, `np.transpose`, `np.random.permutation`, `np.real` and
`np.imag`. Further, `np.searchsorted` now supports the `side` kwarg, and
`np.argsort` now supports the `kind` kwarg with `quicksort` and `mergesort`
available.
* The Numba extension API has gained the ability to operate more easily with
functions from Cython modules through the use of
`numba.extending.get_cython_function_address` to obtain function addresses
for direct use in `ctypes.CFUNCTYPE` (see the sketch after this list).
* Numba now allows the passing of jitted functions (and containers of jitted
functions) as arguments to other jitted functions.
* The CUDA functionality has gained support for a larger selection of bit
manipulation intrinsics, as well as SELP, and has had a number of bugs fixed.
* Initial work to support the PPC64LE platform has been added; full support
is, however, waiting on the LLVM 6.0.1 release, as it contains critical
patches not present in 6.0.0. It is hoped that any remaining issues will be
fixed in the next release.
* The capacity for advanced users/compiler engineers to define their own
compilation pipelines.
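As an illustrative sketch of the new Cython interoperability (the use of
SciPy's `cython_special` module here is an assumption for demonstration, not
part of the release itself)::

    import ctypes
    from numba.extending import get_cython_function_address

    # Obtain the address of scipy.special.cython_special.j0, a Bessel
    # function with a double(double) signature.
    addr = get_cython_function_address("scipy.special.cython_special", "j0")
    functype = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)
    j0 = functype(addr)
    print(j0(1.0))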
Enhancements:
* PR #2660: Support bools from cffi in nopython.
* PR #2741: Enhance error message for undefined variables.
* PR #2744: Add diagnostic error message to test suite discovery failure.
* PR #2748: Added Intel SVML optimizations as opt-out choice working by default
* PR #2762: Support transpose with axes arguments.
* PR #2777: Add support for np.correlate and np.convolve
* PR #2779: Implement np.random.permutation
* PR #2801: Passing jitted functions as args
* PR #2802: Support np.real() and np.imag()
* PR #2807: Expose `import_cython_function`
* PR #2821: Add kwarg 'side' to np.searchsorted
* PR #2822: Adds stable argsort
* PR #2832: Fixups for llvmlite 0.23/llvm 6
* PR #2836: Support `index` method on tuples
* PR #2839: Support for np.transpose and np.reshape.
* PR #2843: Custom pipeline
* PR #2847: Replace signed array access indices in unsigned prange loop body
* PR #2859: Add support for improved error reporting.
* PR #2880: This adds a github issue template.
* PR #2881: Build recipe to clone Intel ICC runtime.
* PR #2882: Update TravisCI to test SVML
* PR #2893: Add reference to the data buffer in array.ctypes object
* PR #2895: Move to CUDA 8.0
Fixes:
* PR #2737: Fix #2007 (part 1). Empty array handling in np.linalg.
* PR #2738: Fix install_requires to allow pip getting pre-release version
* PR #2740: Fix 2208. Generate better error message.
* PR #2765: Fix Bit-ness
* PR #2780: PowerPC reference counting memory fences
* PR #2805: Fix six imports.
* PR #2813: Fix #2812: gufunc scalar output bug.
* PR #2814: Fix the build post #2727
* PR #2831: Attempt to fix #2473
* PR #2842: Fix issue with test discovery and broken CUDA drivers.
* PR #2850: Add rtsys init guard and test.
* PR #2852: Skip vectorization test with targets that are not x86
* PR #2856: Prevent printing to stdout in `test_extending.py`
* PR #2864: Correct C code to prevent compiler warnings.
* PR #2889: Attempt to fix #2386.
* PR #2891: Removed test skipping for inspect_cfg
* PR #2898: Add guard to parallel test on unsupported platforms
* PR #2907: Update change log for PPC64LE LLVM dependency.
* PR #2911: Move build requirement to llvmlite>=0.23.0dev0
* PR #2912: Fix random permutation test.
* PR #2914: Fix MD list syntax in issue template.
Documentation Updates:
* PR #2739: Explicitly state default value of error_model in docstring
* PR #2803: DOC: parallel vectorize requires signatures
* PR #2829: Add Python 2.7 EOL plan to docs
* PR #2838: Use automatic numbering syntax in list.
* PR #2877: Add performance tips documentation.
* PR #2883: Fix #2872: update rng doc about thread/fork-safety
* PR #2908: Add missing link and ref to docs.
* PR #2909: Tiny typo correction
ParallelAccelerator enhancements/fixes:
* PR #2727: Changes to enable vectorization in ParallelAccelerator.
* PR #2816: Array analysis for transpose with arbitrary arguments
* PR #2874: Fix dead code eliminator not to remove a call with side-effect
* PR #2886: Fix ParallelAccelerator arrayexpr repr
CUDA enhancements:
* PR #2734: More Constants From cuda.h
* PR #2767: Add len(..) Support to DeviceNDArray
* PR #2778: Add More Device Array API Functions to CUDA Simulator
* PR #2824: Add CUDA Primitives for Population Count
* PR #2835: Emit selp Instructions to Avoid Branching
* PR #2867: Full support for CUDA device attributes
CUDA fixes:
* PR #2768: Don't Compile Code on Every Assignment
* PR #2878: Fixes a Win64 issue with the test in Pr/2865
Contributors:
The following people contributed to this release.
* Abutalib Aghayev
* Alex Olivas
* Anton Malakhov
* Dong-hee Na
* Ehsan Totoni (core dev)
* John Zwinck
* Josh Wilson
* Kelsey Jordahl
* Nick White
* Olexa Bilaniuk
* Rik-de-Kort
* Siu Kwan Lam (core dev)
* Stan Seibert (core dev)
* Stuart Archibald (core dev)
* Thomas Arildsen
* Todd A. Anderson (core dev)
Version 0.37.0
--------------
This release focuses on bug fixing and stability but also adds a few new
features including support for Numpy 1.14. The key change for Numba core was the
long-awaited addition of the final tranche of thread-safety improvements that
allow Numba to be run concurrently on multiple threads without hitting known
thread safety issues inside LLVM itself. Further, a number of fixes and
enhancements went into the CUDA implementation and ParallelAccelerator gained
some new features and underwent some internal refactoring.
Misc enhancements:
* PR #2627: Remove hacks to make llvmlite threadsafe
* PR #2672: Add ascontiguousarray
* PR #2678: Add Gitter badge
* PR #2691: Fix #2690: add intrinsic to convert array to tuple
* PR #2703: Test runner feature: failed-first and last-failed
* PR #2708: Patch for issue #1907
* PR #2732: Add support for array.fill
Misc Fixes:
* PR #2610: Fix #2606 lowering of optional.setattr
* PR #2650: Remove skip for win32 cosine test
* PR #2668: Fix empty_like from readonly arrays.
* PR #2682: Fixes 2210, remove _DisableJitWrapper
* PR #2684: Fix #2340, generator error yielding bool
* PR #2693: Add travis-ci testing of NumPy 1.14, and also check on Python 2.7
* PR #2694: Avoid type inference failure due to a typing template rejection
* PR #2695: Update llvmlite version dependency.
* PR #2696: Fix tuple indexing codegeneration for empty tuple
* PR #2698: Fix #2697 by deferring deletion in the simplify_CFG loop.
* PR #2701: Small fix to avoid tempfiles being created in the current directory
* PR #2725: Fix 2481, LLVM IR parsing error due to mutated IR
* PR #2726: Fix #2673: incorrect fork error msg.
* PR #2728: Alternative to #2620. Remove dead code ByteCodeInst.get.
* PR #2730: Add guard for test needing SciPy/BLAS
Documentation updates:
* PR #2670: Update communication channels
* PR #2671: Add docs about diagnosing loop vectorizer
* PR #2683: Add docs on const arg requirements and on const mem alloc
* PR #2722: Add docs on numpy support in cuda
* PR #2724: Update doc: warning about unsupported arguments
ParallelAccelerator enhancements/fixes:
Parallel support for `np.arange` and `np.linspace` is added, as is support for
`np.mean`, `np.std` and `np.var`. This was performed as part of a general
refactor and cleanup of the core ParallelAccelerator code.
* PR #2674: Core pa
* PR #2704: Generate Dels after parfor sequential lowering
* PR #2716: Handle matching directly supported functions
CUDA enhancements:
* PR #2665: CUDA DeviceNDArray: Support numpy transpose API
* PR #2681: Allow Assigning to DeviceNDArrays
* PR #2702: Make DummyArray do High Dimensional Reshapes
* PR #2714: Use CFFI to Reuse Code
CUDA fixes:
* PR #2667: Fix CUDA DeviceNDArray slicing
* PR #2686: Fix #2663: incorrect offset when indexing cuda array.
* PR #2687: Ensure Constructed Stream Bound
* PR #2706: Workaround for unexpected warp divergence due to exception raising
code
* PR #2707: Fix regression: cuda test submodules not loading properly in
runtests
* PR #2731: Use more challenging values in slice tests.
* PR #2720: A quick testsuite fix to not run the new cuda testcase in the
multiprocess pool
Contributors:
The following people contributed to this release.
* Coutinho Menezes Nilo
* Daniel
* Ehsan Totoni
* Nick White
* Paul H. Liu
* Siu Kwan Lam
* Stan Seibert
* Stuart Archibald
* Todd A. Anderson
Version 0.36.2
--------------
This is a bugfix release that provides minor changes to address:
* PR #2645: Avoid CPython bug with ``exec`` in older 2.7.x.
* PR #2652: Add support for CUDA 9.
Version 0.36.1
--------------
This release continues to add new features to the work undertaken in partnership
with Intel on ParallelAccelerator technology. Other changes of note include the
compilation chain being updated to use LLVM 5.0 and the production of conda
packages using conda-build 3 and the new compilers that ship with it.
NOTE: A version 0.36.0 was tagged for internal use but not released.
ParallelAccelerator:
NOTE: The ParallelAccelerator technology is under active development and should
be considered experimental.
New features relating to ParallelAccelerator, from work undertaken with Intel,
include the addition of the `@stencil` decorator for ease of implementation of
stencil-like computations, support for general reductions, and slice and
range fusion for parallel slice/bit-array assignments. Documentation on both the
use and implementation of the above has been added. Further, a new debug
environment variable `NUMBA_DEBUG_ARRAY_OPT_STATS` is made available to give
information about which operators/calls are converted to parallel for-loops.
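As a minimal sketch of the `@stencil` decorator (the 3-point moving-average
kernel below is purely illustrative)::

    import numpy as np
    from numba import njit, stencil

    @stencil
    def kernel(a):
        # Relative indexing: a[0] is the current element, a[-1] and a[1]
        # its neighbours; border elements of the result default to zero.
        return (a[-1] + a[0] + a[1]) / 3

    @njit
    def smooth(a):
        return kernel(a)

    print(smooth(np.arange(10.0)))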
ParallelAccelerator features:
* PR #2457: Stencil Computations in ParallelAccelerator
* PR #2548: Slice and range fusion, parallelizing bitarray and slice assignment
* PR #2516: Support general reductions in ParallelAccelerator
ParallelAccelerator fixes:
* PR #2540: Fix bug #2537
* PR #2566: Fix issue #2564.
* PR #2599: Fix nested multi-dimensional parfor type inference issue
* PR #2604: Fixes for stencil tests and cmath sin().
* PR #2605: Fixes issue #2603.
Additional features of note:
This release of Numba (and llvmlite) is updated to use LLVM version 5.0 as the
compiler back end. The main change to Numba to support this was the addition
of a custom symbol tracker to avoid the calls to LLVM's `ExecutionEngine` that
were
crashing when asking for non-existent symbol addresses. Further, the conda
packages for this release of Numba are built using conda build version 3 and the
new compilers/recipe grammar that are present in that release.
* PR #2568: Update for LLVM 5
* PR #2607: Fixes abort when getting address to "nrt_unresolved_abort"
* PR #2615: Working towards conda build 3
Thanks to community feedback and bug reports, the following fixes were also
made.
Misc fixes/enhancements:
* PR #2534: Add tuple support to np.take.
* PR #2551: Rebranding fix
* PR #2552: relative doc links
* PR #2570: Fix issue #2561, handle missing successor on loop exit
* PR #2588: Fix #2555. Disable libpython.so linking on linux
* PR #2601: Update llvmlite version dependency.
* PR #2608: Fix potential cache file collision
* PR #2612: Fix NRT test failure due to increased overhead when running in coverage
* PR #2619: Fix dubious pthread_cond_signal not in lock
* PR #2622: Fix `np.nanmedian` for all NaN case.
* PR #2633: Fix markdown in CONTRIBUTING.md
* PR #2635: Make the dependency on compilers for AOT optional.
CUDA support fixes:
* PR #2523: Fix invalid cuda context in memory transfer calls in another thread
* PR #2575: Use CPU to initialize xoroshiro states for GPU RNG. Fixes #2573
* PR #2581: Fix cuda gufunc mishandling of scalar arg as array and out argument
Version 0.35.0
--------------
This release includes some exciting new features as part of the work
performed in partnership with Intel on ParallelAccelerator technology.
There are also some additions made to Numpy support and small but
significant fixes made as a result of considerable effort spent chasing bugs
and implementing stability improvements.
ParallelAccelerator:
NOTE: The ParallelAccelerator technology is under active development and should
be considered experimental.
New features relating to ParallelAccelerator, from work undertaken with Intel,
include support for a larger range of `np.random` functions in `parallel`
mode, printing Numpy arrays in nopython mode, the capacity to initialize Numpy
arrays directly from list comprehensions, and the axis argument to `.sum()`.
Documentation on the ParallelAccelerator technology implementation has also
been added. Further, a large amount of work on equivalence relations was
undertaken to enable runtime checks of broadcasting behaviours in parallel mode.
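A rough sketch of the comprehension and axis-reduction support (illustrative
only, not taken from the release)::

    import numpy as np
    from numba import njit

    @njit
    def demo(n):
        # Initialize an array directly from a list comprehension...
        a = np.array([x * 2.0 for x in range(n)])
        # ...and reduce a 2D view of it along an axis.
        return a.reshape(2, n // 2).sum(axis=0)

    print(demo(10))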
ParallelAccelerator features:
* PR #2400: Array comprehension
* PR #2405: Support printing Numpy arrays
* PR #2438: Support more np.random functions in ParallelAccelerator
* PR #2482: Support for sum with axis in nopython mode.
* PR #2487: Adding developer documentation for ParallelAccelerator technology.
* PR #2492: Core PA refactor adds assertions for broadcast semantics
ParallelAccelerator fixes:
* PR #2478: Rename cfg before parfor translation (#2477)
* PR #2479: Fix broken array comprehension tests on unsupported platforms
* PR #2484: Fix array comprehension test on win64
* PR #2506: Fix for 32-bit machines.
Additional features of note:
Support for `np.take`, `np.finfo`, `np.iinfo` and `np.MachAr` in nopython
mode is added. Further, three new environment variables are added: two for
overriding the CPU target/features and another to warn if `parallel=True` was
set but no such transform was possible.
* PR #2490: Implement np.take and ndarray.take
* PR #2493: Display a warning if parallel=True is set but not possible.
* PR #2513: Add np.MachAr, np.finfo, np.iinfo
* PR #2515: Allow environ overriding of cpu target and cpu features.
Due to expansion of the test farm and a focus on fixing bugs, the following
fixes were also made.
Misc fixes/enhancements:
* PR #2455: add contextual information to runtime errors
* PR #2470: Fixes #2458, poor performance in np.median
* PR #2471: Ensure LLVM threadsafety in {g,}ufunc building.
* PR #2494: Update doc theme
* PR #2503: Remove hacky code added in 2482 and feature enhancement
* PR #2505: Serialise env mutation tests during multithreaded testing.
* PR #2520: Fix failing cpu-target override tests
CUDA support fixes:
* PR #2504: Enable CUDA toolkit version testing
* PR #2509: Disable tests generating code unavailable in lower CC versions.
* PR #2511: Fix Windows 64 bit CUDA tests.
Version 0.34.0
--------------
This release adds a significant set of new features arising from combined work
with Intel on ParallelAccelerator technology. It also adds list comprehension
and closure support, support for Numpy 1.13 and a new, faster, CUDA reduction
algorithm. For Linux users this release is the first to be built on Centos 6,
which will be the new base platform for future releases. Finally a number of
thread-safety, type inference and other smaller enhancements and bugs have been
fixed.
ParallelAccelerator features:
NOTE: The ParallelAccelerator technology is under active development and should
be considered experimental.
The ParallelAccelerator technology is accessed via a new "nopython" mode option
"parallel". The ParallelAccelerator technology attempts to identify operations
which have parallel semantics (for instance adding a scalar to a vector), fuse
together adjacent such operations, and then parallelize their execution across
a number of CPU cores. This is essentially auto-parallelization.
In addition to the auto-parallelization feature, explicit loop based
parallelism is made available through the use of `prange` in place of `range`
as a loop iterator.
More information and examples on both auto-parallelization and `prange` are
available in the documentation and examples directory respectively.
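A minimal `prange` sketch (illustrative; see the documentation for full
examples)::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def par_sum(a):
        s = 0.0
        # Iterations may execute in parallel across CPU cores; the
        # accumulation into `s` is treated as a reduction.
        for i in prange(a.shape[0]):
            s += a[i]
        return s

    print(par_sum(np.ones(1000000)))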
As part of the necessary work for ParallelAccelerator, support for closures
and list comprehensions is added:
* PR #2318: Transfer ParallelAccelerator technology to Numba
* PR #2379: ParallelAccelerator Core Improvements
* PR #2367: Add support for len(range(...))
* PR #2369: List comprehension
* PR #2391: Explicit Parallel Loop Support (prange)
The ParallelAccelerator features are available on all supported platforms and
Python versions, with the following exceptions (support is planned for a
future release):
* The combination of Windows operating systems with Python 2.7.
* Systems running 32 bit Python.
CUDA support enhancements:
* PR #2377: New GPU reduction algorithm
CUDA support fixes:
* PR #2397: Fix #2393, always set alignment of cuda static memory regions
Misc Fixes:
* PR #2373, Issue #2372: 32-bit compatibility fix for parfor related code
* PR #2376: Fix #2375 missing stdint.h for py2.7 vc9
* PR #2378: Fix deadlock in parallel gufunc when kernel acquires the GIL.
* PR #2382: Forbid unsafe casting in bitwise operation
* PR #2385: docs: fix Sphinx errors
* PR #2396: Use 64-bit RHS operand for shift
* PR #2404: Fix threadsafety logic issue in ufunc compilation cache.
* PR #2424: Ensure consistent iteration order of blocks for type inference.
* PR #2425: Guard code to prevent the use of 'parallel' on win32 + py27
* PR #2426: Basic test for Enum member type recovery.
* PR #2433: Fix up the parfors tests with respect to windows py2.7
* PR #2442: Skip tests that need BLAS/LAPACK if scipy is not available.
* PR #2444: Add test for invalid array setitem
* PR #2449: Make the runtime initialiser threadsafe
* PR #2452: Skip CFG test on 64bit windows
Misc Enhancements:
* PR #2366: Improvements to IR utils
* PR #2388: Update README.rst to indicate the proper version of LLVM
* PR #2394: Upgrade to llvmlite 0.19.*
* PR #2395: Update llvmlite version to 0.19
* PR #2406: Expose environment object to ufuncs
* PR #2407: Expose environment object to target-context inside lowerer
* PR #2413: Add flags to pass through to conda build for buildbot
* PR #2414: Add cross compile flags to local recipe
* PR #2415: A few cleanups for rewrites
* PR #2418: Add getitem support for Enum classes
* PR #2419: Add support for returning enums in vectorize
* PR #2421: Add copyright notice for Intel contributed files.
* PR #2422: Patch code base to work with np 1.13 release
* PR #2448: Adds in warning message when using 'parallel' if cache=True
* PR #2450: Add test for keyword arg on .sum-like and .cumsum-like array
methods
Version 0.33.0
--------------
This release resolved several performance issues caused by atomic
reference counting operations inside loop bodies. New optimization
passes have been added to reduce the impact of these operations. We
observe speed improvements between 2x-10x in affected programs due to
the removal of unnecessary reference counting operations.
There are also several enhancements to the CUDA GPU support:
* A GPU random number generator based on the `xoroshiro128+ algorithm <http://xoroshiro.di.unimi.it/>`_ is added.
  See details and examples in :ref:`documentation <cuda-random>`, and a short sketch after this list.
* ``@cuda.jit`` CUDA kernels can now call ``@jit`` and ``@njit``
CPU functions and they will automatically be compiled as CUDA device
functions.
* CUDA IPC memory API is exposed for sharing memory between processes.
See usage details in :ref:`documentation <cuda-ipc-memory>`.
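A brief sketch of the GPU RNG API (assumes a CUDA-capable GPU is present;
illustrative only)::

    import numpy as np
    from numba import cuda
    from numba.cuda.random import (create_xoroshiro128p_states,
                                   xoroshiro128p_uniform_float32)

    @cuda.jit
    def fill_uniform(rng_states, out):
        i = cuda.grid(1)
        if i < out.size:
            # Each thread draws from its own RNG state.
            out[i] = xoroshiro128p_uniform_float32(rng_states, i)

    blocks, threads = 16, 64
    rng_states = create_xoroshiro128p_states(blocks * threads, seed=1)
    out = np.zeros(blocks * threads, dtype=np.float32)
    fill_uniform[blocks, threads](rng_states, out)
    print(out.mean())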
Reference counting enhancements:
* PR #2346, Issue #2345, #2248: Add extra refcount pruning after inlining
* PR #2349: Fix refct pruning not removing refct op with tail call.
* PR #2352, Issue #2350: Add refcount pruning pass for function that does not need refcount
CUDA support enhancements:
* PR #2023: Supports CUDA IPC for device array
* PR #2343, Issue #2335: Allow CPU jit decorated function to be used as cuda device function
* PR #2347: Add random number generator support for CUDA device code
* PR #2361: Update autotune table for CC: 5.3, 6.0, 6.1, 6.2
Misc fixes:
* PR #2362: Avoid test failure due to typing to int32 on 32-bit platforms
* PR #2359: Fixed nogil example that threw a TypeError when executed.
* PR #2357, Issue #2356: Fix fragile test that depends on how the script is executed.
* PR #2355: Fix cpu dispatcher referenced as attribute of another module
* PR #2354: Fixes an issue with caching when function needs NRT and refcount pruning
* PR #2342, Issue #2339: Add warnings to inspection when it is used on unserialized cached code
* PR #2329, Issue #2250: Better handling of missing op codes
Misc enhancements:
* PR #2360: Adds missing values in error message interp.
* PR #2353: Handle when get_host_cpu_features() raises RuntimeError
* PR #2351: Enable SVML for erf/erfc/gamma/lgamma/log2
* PR #2344: Expose error_model setting in jit decorator
* PR #2337: Align blocking terminate support for fork() with new TBB version
* PR #2336: Bump llvmlite version to 0.18
* PR #2330: Core changes in PR #2318
Version 0.32.0
--------------
In this release, we are upgrading to LLVM 4.0. A lot of work has been done
to fix many race-condition issues inside LLVM when the compiler is
used concurrently, which is likely when Numba is used with Dask.
Improvements:
* PR #2322: Suppress test error due to unknown but consistent error with tgamma
* PR #2320: Update llvmlite dependency to 0.17
* PR #2308: Add details to error message on why cuda support is disabled.
* PR #2302: Add os x to travis
* PR #2294: Disable remove_module on MCJIT due to memory leak inside LLVM
* PR #2291: Split parallel tests and recycle workers to tame memory usage
* PR #2253: Remove the pointer-stuffing hack for storing meminfos in lists
Fixes:
* PR #2331: Fix a bug in the GPU array indexing
* PR #2326: Fix #2321 docs referring to non-existing function.
* PR #2316: Fixing more race-condition problems
* PR #2315: Fix #2314. Relax strict type check to allow optional type.
* PR #2310: Fix race condition due to concurrent compilation and cache loading
* PR #2304: Fix intrinsic 1st arg not a typing.Context as stated by the docs.
* PR #2287: Fix int64 atomic min-max
* PR #2286: Fix #2285 `@overload_method` not linking dependent libs
* PR #2303: Missing import statements to interval-example.rst
Version 0.31.0
--------------
In this release, we added preliminary support for debugging with GDB
version >= 7.0. The feature is enabled by setting the ``debug=True`` compiler
option, which causes GDB compatible debug info to be generated.
The CUDA backend also gained limited debugging support so that source locations
are shown in memory-checking and profiling tools.
For details, see :ref:`numba-troubleshooting`.
Also, we added the ``fastmath=True`` compiler option to enable unsafe
floating-point transformations, which allows LLVM to auto-vectorize more code.
Other important changes include upgrading to LLVM 3.9.1 and adding support for
Numpy 1.12.
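For instance, a simple reduction that LLVM will typically only vectorize once
unsafe transformations are permitted (a minimal sketch)::

    import numpy as np
    from numba import njit

    # fastmath=True allows e.g. reassociation of floating-point
    # additions, which enables vectorization of this loop.
    @njit(fastmath=True)
    def total(a):
        acc = 0.0
        for x in a:
            acc += x
        return acc

    print(total(np.ones(1000)))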
Improvements:
* PR #2281: Update for numpy1.12
* PR #2278: Add CUDA atomic.{max, min, compare_and_swap}
* PR #2277: Add about section to conda recipes to identify license and other
metadata in Anaconda Cloud
* PR #2271: Adopt itanium C++-style mangling for CPU and CUDA targets
* PR #2267: Add fastmath flags
* PR #2261: Support dtype.type
* PR #2249: Changes for llvm3.9
* PR #2234: Bump llvmlite requirement to 0.16 and add install_name_tool_fixer to
mviewbuf for OS X
* PR #2230: Add python3.6 to TravisCi
* PR #2227: Enable caching for gufunc wrapper
* PR #2170: Add debugging support
* PR #2037: inspect_cfg() for easier visualization of the function operation
Fixes:
* PR #2274: Fix nvvm ir patch in mishandling "load"
* PR #2272: Fix breakage to cuda7.5
* PR #2269: Fix caching of copy_strides kernel in cuda.reduce
* PR #2265: Fix #2263: error when linking two modules with dynamic globals
* PR #2252: Fix path separator in test
* PR #2246: Fix overuse of memory in some system with fork
* PR #2241: Fix #2240: __module__ in dynamically created function not a str
* PR #2239: Fix fingerprint computation failure preventing fallback
Version 0.30.1
--------------
This is a bug-fix release to enable Python 3.6 support. In addition,
there is now early Intel TBB support for parallel ufuncs when building from
source with TBBROOT defined. The TBB feature is not enabled in our official
builds.
Fixes:
* PR #2232: Fix name clashes with _Py_hashtable_xxx in Python 3.6.
Improvements:
* PR #2217: Add Intel TBB threadpool implementation for parallel ufunc.
Version 0.30.0
--------------
This release adds preliminary support for Python 3.6, but no official build is
available yet. A new system reporting tool (``numba --sysinfo``) is added to
provide system information to help core developers in replication and debugging.
See below for other improvements and bug fixes.
Improvements:
* PR #2209: Support Python 3.6.
* PR #2175: Support ``np.trace()``, ``np.outer()`` and ``np.kron()``.
* PR #2197: Support ``np.nanprod()``.
* PR #2190: Support caching for ufunc.
* PR #2186: Add system reporting tool.
Fixes:
* PR #2214, Issue #2212: Fix memory error with ndenumerate and flat iterators.
* PR #2206, Issue #2163: Fix ``zip()`` consuming extra elements in early
exhaustion.
* PR #2185, Issue #2159, #2169: Fix rewrite pass affecting objmode fallback.
* PR #2204, Issue #2178: Fix annotation for liftedloop.
* PR #2203: Fix Appveyor segfault with Python 3.5.
* PR #2202, Issue #2198: Fix target context not initialized when loading from
ufunc cache.
* PR #2172, Issue #2171: Fix optional type unpacking.
* PR #2189, Issue #2188: Disable freezing of big (>1MB) global arrays.
* PR #2180, Issue #2179: Fix invalid variable version in looplifting.
* PR #2156, Issue #2155: Fix divmod, floordiv segfault on CUDA.
Version 0.29.0
--------------
This release extends the support of recursive functions to include direct and
indirect recursion without explicit function type annotations. See new example
in `examples/mergesort.py`. Newly supported numpy features include array
stacking functions, np.linalg.eig* functions, np.linalg.matrix_power, np.roots
and array to array broadcasting in assignments.
This release depends on llvmlite 0.14.0. CUDA 8 is supported but not
required.
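A small sketch of type-inferred direct recursion (distinct from the mergesort
example)::

    from numba import njit

    @njit
    def fib(n):
        # No explicit function type annotation is required.
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(10))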
Improvements:
* PR #2130, #2137: Add type-inferred recursion with docs and examples.
* PR #2134: Add ``np.linalg.matrix_power``.
* PR #2125: Add ``np.roots``.
* PR #2129: Add ``np.linalg.{eigvals,eigh,eigvalsh}``.
* PR #2126: Add array-to-array broadcasting.
* PR #2069: Add hstack and related functions.
* PR #2128: Allow for vectorizing a jitted function. (thanks to @dhirschfeld)
* PR #2117: Update examples and make them test-able.
* PR #2127: Refactor interpreter class and its results.
Fixes:
* PR #2149: Workaround MSVC9.0 SP1 fmod bug kb982107.
* PR #2145, Issue #2009: Fixes kwargs for jitclass ``__init__`` method.
* PR #2150: Fix slowdown in objmode fallback.
* PR #2050, Issue #1259: Fix liveness problem with some generator loops.
* PR #2072, Issue #1995: Right shift of unsigned LHS should be logical.
* PR #2115, Issue #1466: Fix inspect_types() error due to mangled variable name.
* PR #2119, Issue #2118: Fix array type created from record-dtype.
* PR #2122, Issue #1808: Fix returning a generator due to datamodel error.
Version 0.28.1
--------------
This is a bug-fix release to resolve packaging issues with setuptools
dependency.
Version 0.28.0
--------------
Amongst other improvements, this version again improves the level of
support for linear algebra -- functions from the :mod:`numpy.linalg`
module. Also, our random generator is now guaranteed to be thread-safe
and fork-safe.
Improvements:
* PR #2019: Add the ``@intrinsic`` decorator to define low-level
subroutines callable from JIT functions (this is considered
a private API for now).
* PR #2059: Implement ``np.concatenate`` and ``np.stack``.
* PR #2048: Make random generation fork-safe and thread-safe, producing
independent streams of random numbers for each thread or process.
* PR #2031: Add documentation of floating-point pitfalls.
* Issue #2053: Avoid polling in parallel CPU target (fixes severe performance
regression on Windows).
* Issue #2029: Make default arguments fast.
* PR #2052: Add logging to the CUDA driver.
* PR #2049: Implement the built-in ``divmod()`` function.
* PR #2036: Implement the ``argsort()`` method on arrays.
* PR #2046: Improving CUDA memory management by deferring deallocations
until certain thresholds are reached, so as to avoid breaking asynchronous
execution.
* PR #2040: Switch the CUDA driver implementation to use CUDA's
"primary context" API.
* PR #2017: Allow ``min(tuple)`` and ``max(tuple)``.
* PR #2039: Reduce fork() detection overhead in CUDA.
* PR #2021: Handle structured dtypes with titles.
* PR #1996: Rewrite looplifting as a transformation on Numba IR.
* PR #2014: Implement ``np.linalg.matrix_rank``.
* PR #2012: Implement ``np.linalg.cond``.
* PR #1985: Rewrite even trivial array expressions, which opens the door
for other optimizations (for example, ``array ** 2`` can be converted
into ``array * array``).
* PR #1950: Have ``typeof()`` always raise ValueError on failure.
Previously, it would either raise or return None, depending on the input.
* PR #1994: Implement ``np.linalg.norm``.
* PR #1987: Implement ``np.linalg.det`` and ``np.linalg.slogdet``.
* Issue #1979: Document integer width inference and how to workaround.
* PR #1938: Numba is now compatible with LLVM 3.8.
* PR #1967: Restrict ``np.linalg`` functions to homogeneous dtypes. Users
wanting to pass mixed-typed inputs have to convert explicitly, which
makes the performance implications more obvious.
Fixes:
* PR #2006: ``array(float32) ** int`` should return ``array(float32)``.
* PR #2044: Allow reshaping empty arrays.
* Issue #2051: Fix refcounting issue when concatenating tuples.
* Issue #2000: Make Numpy optional for setup.py, to allow ``pip install``
to work without Numpy pre-installed.
* PR #1989: Fix assertion in ``Dispatcher.disable_compile()``.
* Issue #2028: Ignore filesystem errors when caching from multiple processes.
* Issue #2003: Allow unicode variable and function names (on Python 3).
* Issue #1998: Fix deadlock in parallel ufuncs that reacquire the GIL.
* PR #1997: Fix random crashes when AOT compiling on certain Windows platforms.
* Issue #1988: Propagate jitclass docstring.
* Issue #1933: Ensure array constants are emitted with the right alignment.
Version 0.27.0
--------------
Improvements:
* Issue #1976: improve error message when non-integral dimensions are given
to a CUDA kernel.
* PR #1970: Optimize the power operator with a static exponent.
* PR #1710: Improve contextual information for compiler errors.
* PR #1961: Support printing constant strings.
* PR #1959: Support more types in the print() function.
* PR #1823: Support ``compute_50`` in CUDA backend.
* PR #1955: Support ``np.linalg.pinv``.
* PR #1896: Improve the ``SmartArray`` API.
* PR #1947: Support ``np.linalg.solve``.
* Issue #1943: Improve error message when an argument fails typing.
* PR #1927: Support ``np.linalg.lstsq``.
* PR #1934: Use system functions for hypot() where possible, instead of our
own implementation.
* PR #1929: Add cffi support to ``@cfunc`` objects.
* PR #1932: Add user-controllable thread pool limits for parallel CPU target.
* PR #1928: Support self-recursion when the signature is explicit.
* PR #1890: List all lowering implementations in the developer docs.
* Issue #1884: Support ``np.lib.stride_tricks.as_strided()``.
Fixes:
* Issue #1960: Fix sliced assignment when source and destination areas are
overlapping.
* PR #1963: Make CUDA print() atomic.
* PR #1956: Allow 0d array constants.
* Issue #1945: Allow using Numpy ufuncs in AOT compiled code.
* Issue #1916: Fix documentation example for ``@generated_jit``.
* Issue #1926: Fix regression when caching functions in an IPython session.
* Issue #1923: Allow non-intp integer arguments to carray() and farray().
* Issue #1908: Accept non-ASCII unicode docstrings on Python 2.
* Issue #1874: Allow ``del container[key]`` in object mode.
* Issue #1913: Fix set insertion bug when the lookup chain contains deleted
entries.
* Issue #1911: Allow function annotations on jitclass methods.
Version 0.26.0
--------------
This release adds support for the ``@cfunc`` decorator for exporting
Numba-jitted functions to third-party APIs that take C callbacks. Most of the
overhead of using jitclasses inside the interpreter is eliminated. Support for
decompositions in ``numpy.linalg`` is added. Finally, Numpy 1.11 is
supported.
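A minimal sketch of the new decorator (illustrative only)::

    import ctypes
    from numba import cfunc, types

    @cfunc(types.float64(types.float64, types.float64))
    def add(x, y):
        return x + y

    # add.address is a plain C function pointer that can be handed to
    # any API expecting a callback; here it is exercised via ctypes.
    proto = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double,
                             ctypes.c_double)
    print(proto(add.address)(1.0, 2.0))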
Improvements:
* PR #1889: Export BLAS and LAPACK wrappers for pycc.
* PR #1888: Faster array power.
* Issue #1867: Allow "out" keyword arg for dufuncs.
* PR #1871: ``carray()`` and ``farray()`` for creating arrays from pointers.
* PR #1855: ``@cfunc`` decorator for exporting as ctypes function.
* PR #1862: Add support for ``numpy.linalg.qr``.
* PR #1851: jitclass support for '_' and '__' prefixed attributes.
* PR #1842: Optimize jitclass in Python interpreter.
* Issue #1837: Fix CUDA simulator issues with device function.
* PR #1839: Add support for decompositions from ``numpy.linalg``.
* PR #1829: Support Python enums.
* PR #1828: Add support for ``numpy.random.rand()`` and
  ``numpy.random.randn()``.
* Issue #1825: Use of 0-darray in place of scalar index.
* Issue #1824: Scalar arguments to object mode gufuncs.
* Issue #1813: Let bitwise bool operators return booleans, not integers.
* Issue #1760: Optional arguments in generators.
* PR #1780: Numpy 1.11 support.
Version 0.25.0
--------------
This release adds support for ``set`` objects in nopython mode. It also
adds support for many missing Numpy features and functions. It improves
Numba's compatibility and performance when using a distributed execution
framework such as dask, distributed or Spark. Finally, it removes
compatibility with Python 2.6, Python 3.3 and Numpy 1.6.
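A quick sketch of homogeneous set support (illustrative only)::

    import numpy as np
    from numba import njit

    @njit
    def count_unique(a):
        seen = set()
        for x in a:
            seen.add(x)
        return len(seen)

    print(count_unique(np.array([1, 2, 2, 3])))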
Improvements:
* Issue #1800: Add erf(), erfc(), gamma() and lgamma() to CUDA targets.
* PR #1793: Implement more Numpy functions: np.bincount(), np.diff(),
np.digitize(), np.histogram(), np.searchsorted() as well as NaN-aware
reduction functions (np.nansum(), np.nanmedian(), etc.)
* PR #1789: Optimize some reduction functions such as np.sum(), np.prod(),
np.median(), etc.
* PR #1752: Make CUDA features work in dask, distributed and Spark.
* PR #1787: Support np.nditer() for fast multi-array indexing with
broadcasting.
* PR #1799: Report JIT-compiled functions as regular Python functions
when profiling (allowing to see the filename and line number where a
function is defined).
* PR #1782: Support np.any() and np.all().
* Issue #1788: Support the iter() and next() built-in functions.
* PR #1778: Support array.astype().
* Issue #1775: Allow the user to set the target CPU model for AOT compilation.
* PR #1758: Support creating random arrays using the ``size`` parameter
to the np.random APIs.
* PR #1757: Support len() on array.flat objects.
* PR #1749: Remove Numpy 1.6 compatibility.
* PR #1748: Remove Python 2.6 and 3.3 compatibility.
* PR #1735: Support the ``not in`` operator as well as operator.contains().
* PR #1724: Support homogeneous sets in nopython mode.
* Issue #875: make compilation of array constants faster.
Fixes:
* PR #1795: Fix a massive performance issue when calling Numba functions
with distributed, Spark or a similar mechanism using serialization.
* Issue #1784: Make jitclasses usable with NUMBA_DISABLE_JIT=1.
* Issue #1786: Allow using linear algebra functions when profiling.
* Issue #1796: Fix np.dot() memory leak on non-contiguous inputs.
* PR #1792: Fix static negative indexing of tuples.
* Issue #1771: Use fallback cache directory when __pycache__ isn't writable,
such as when user code is installed in a system location.
* Issue #1223: Use Numpy error model in array expressions (e.g. division
by zero returns ``inf`` or ``nan`` instead of raising an error).
* Issue #1640: Fix np.random.binomial() for large n values.
* Issue #1643: Improve error reporting when passing an invalid spec to
``@jitclass``.
* PR #1756: Fix slicing with a negative step and an omitted start.
Version 0.24.0
--------------
This release introduces several major changes, including the ``@generated_jit``
decorator for flexible specializations (as with Julia's ``@generated`` macro)
and the SmartArray array wrapper type that allows seamless transfer of array
data between the CPU and the GPU.
This will be the last version to support Python 2.6, Python 3.3 and Numpy 1.6.
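A brief sketch of compile-time specialization with ``@generated_jit``
(illustrative only)::

    from numba import generated_jit, types

    @generated_jit(nopython=True)
    def magnitude(x):
        # Here `x` is the Numba type of the argument; the returned
        # function is what gets compiled for that type.
        if isinstance(x, types.Complex):
            return lambda x: (x.real ** 2 + x.imag ** 2) ** 0.5
        else:
            return lambda x: abs(x)

    print(magnitude(-3.0), magnitude(3 + 4j))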
Improvements:
* PR #1723: Improve compatibility of JIT functions with the Python profiler.
* PR #1509: Support array.ravel() and array.flatten().
* PR #1676: Add SmartArray type to support transparent data management in
multiple address spaces (host & GPU).
* PR #1689: Reduce startup overhead of importing Numba.
* PR #1705: Support registration of CFFI types as corresponding to known
Numba types.
* PR #1686: Document the extension API.
* PR #1698: Improve warnings raised during type inference.
* PR #1697: Support np.dot() and friends on non-contiguous arrays.
* PR #1692: cffi.from_buffer() improvements (allow more pointer types,
allow non-Numpy buffer objects).
* PR #1648: Add the ``@generated_jit`` decorator.
* PR #1651: Implementation of np.linalg.inv using LAPACK. Thanks to
Matthieu Dartiailh.
* PR #1674: Support np.diag().
* PR #1673: Improve error message when looking up an attribute on an
unknown global.
* Issue #1569: Implement runtime check for the LLVM locale bug.
* PR #1612: Switch to LLVM 3.7 in sync with llvmlite.
* PR #1624: Allow slice assignment of sequence to array.
* PR #1622: Support slicing tuples with a constant slice.
Fixes:
* Issue #1722: Fix returning an optional boolean (bool or None).
* Issue #1734: NRT decref bug when variable is del'ed before being defined,
leading to a possible memory leak.
* PR #1732: Fix tuple getitem regression for CUDA target.
* PR #1718: Mishandling of optional to optional casting.
* PR #1714: Fix .compile() on a JIT function not respecting ._can_compile.
* Issue #1667: Fix np.angle() on arrays.
* Issue #1690: Fix slicing with an omitted stop and a negative step value.
* PR #1693: Fix gufunc bug in handling scalar formal arg with non-scalar
input value.
* PR #1683: Fix parallel testing under Windows.
* Issue #1616: Use system-provided versions of C99 math where possible.
* Issue #1652: Reductions of bool arrays (e.g. sum() or mean()) should
return integers or floats, not bools.
* Issue #1664: Fix regression when indexing a record array with a constant
index.
* PR #1661: Disable AVX on old Linux kernels.
* Issue #1636: Allow raising an exception looked up on a module.
Version 0.23.1
--------------
This is a bug-fix release to address several regressions introduced
in the 0.23.0 release, and a couple other issues.
Fixes:
* Issue #1645: CUDA ufuncs were broken in 0.23.0.
* Issue #1638: Check tuple sizes when passing a list of tuples.
* Issue #1630: Parallel ufunc would keep eating CPU even after finishing
under Windows.