forked from ROCm/Thrust
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCHANGELOG
864 lines (708 loc) · 35.6 KB
/
CHANGELOG
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
#######################################
# Thrust v1.9.0 #
#######################################
Summary
TODO
Breaking API Changes
None.
New Features
Types
thrust::transform_output_iterator
New Examples
transform_output_iterator demonstrates use of a transform_output_iterator -
a new fancy output iterator which transform output before storing result the memory
Other Enhancements
If C++11 support is enabled, functors do not have to inherit from thrust::unary_function/thrust::binary_function
anymore when using them with thrust::transform_iterator.
The performance of thrust::unique* is improved.
If C++11 support is enabled, the move constructor and move assignment operator have been implemented
for host_vector,device_vector,cpp::vector,cuda::vector,omp::vector and tbb::vector.
Bug Fixes
calculating sin(complex<double>) no longer has precision loss to float
Known Issues
TODO
Acknowledgments
Thanks to Manuel Schiller for contributing a C++11 based enhancement regarding the deduction of
functor return types, improving the performance of thrust::unique and implementing transform_output_iterator
Thanks to Thibault Notargiacomo for the implementation of move semantics for the vector_base based class.
#######################################
# Thrust v1.8.3-2 #
#######################################
Summary
Small bug fixes
Introduces THRUST_PATCH_NUMBER macro, defined in thrust/version.h, to track bug fixes after a new CUDA release.
New Examples
range_view demonstrates use of a view: a non-owning wrapper for an iterator range with a container-like interface
Bug Fixes
copy_if, set_operations, reduce_by_key, and their ilks access temporary data in a user provided stream instead of a default one
{min,max,minmax}_element can now accept raw device pointer with device execution policy
If C++11 support is enabled, functors do not have to inherit from thrust::unary_function/thrust::binary_function
anymore when using them with thrust::transform_iterator.
clear() operations on vector types no longer requires the element type to have a default constructor
Known Issues
TODO
#######################################
# Thrust v1.8.2 #
#######################################
Summary
Small bug fixes
Bug Fixes
Avoid warnings and errors concerning user functions called from __host__ __device__ functions
#632 CUDA set_intersection_by_key error
#651 thrust::copy between host & device is not interoperable with thrust::cuda::par.on(stream)
#664 CUDA for_each ignores execution policy's stream
Known Issues
#628 CUDA's reduce_by_key fails on sm_50 devices
#######################################
# Thrust v1.8.1 #
#######################################
Summary
Small bug fixes
Bug Fixes
#615 CUDA for_each accesses illegal memory locations when given a large range
#620 CUDA's reduce_by_key fails on large input
Known Issues
#628 CUDA's reduce_by_key fails on sm_50 devices
#######################################
# Thrust v1.8.0 #
#######################################
Summary
Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams,
and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code,
providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing
Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy
allows users to require sequential algorithm execution in the calling thread and makes a
sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to
request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm
implementations provide substantial performance improvements.
Breaking API Changes
None.
New Features
Algorithms in CUDA __device__ code
Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.
Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:
__device__ int my_device_sort(int *data, size_t n)
{
thrust::sort(thrust::device, data, data + n);
}
The following execution policies are supported in CUDA __device__ code:
thrust::seq
thrust::cuda::par
thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA
Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.
Execution Policies
CUDA Streams
The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm
execution should occur on a given stream:
// execute for_each on stream s
thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary
storage or returning results to the CPU.
thrust::seq
The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:
// execute for_each sequentially in this thread
thrust::for_each(thrust::seq, begin, end, my_functor);
Other
The new thrust::complex template provides complex number support.
New Examples
simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
async_reduce demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.
Other Enhancements
CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
fallback_allocator example is simpler.
Bug Fixes
#364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
#371 do not redefine __CUDA_ARCH__
#379 fix crash when dereferencing transform_iterator on the CPU
#391 avoid use of uppercase variable names
#392 fix thrust::copy between cusp::complex & std::complex
#396 program compiled with gcc < 4.3 hangs during comparison sort
#406 fallback_allocator.cu example checks device for unified addressing support
#417 avoid using std::less<T> in binary search algorithms
#418 avoid various warnings
#443 including version.h no longer configures default systems
#578 nvcc produces warnings when sequential algorithms are used with cpu systems
Known Issues
When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, & thrust::stable_sort_by_key may
fail to link in some cases with nvcc -rdc=true.
The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.
Acknowledgments
Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
Thanks to Filipe Maia for contributing the implementation of thrust::complex.
#######################################
# Thrust v1.7.2 #
#######################################
Summary
Small bug fixes
Bug Fixes
Avoid use of std::min in generic find implementation
#######################################
# Thrust v1.7.1 #
#######################################
Summary
Small bug fixes
Bug Fixes
Eliminate identifiers in set_operations.cu example with leading underscore
Eliminate unused variable warning in CUDA reduce_by_key implemention
Avoid deriving function objects from std::unary_function and std::binary_function
#######################################
# Thrust v1.7.0 #
#######################################
Summary
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as
well as several new algorithms and performance improvements. With this new
interface, users may directly control how algorithms execute as well as details
such as the allocation of temporary storage. Key/value versions of thrust::merge
and the set operation algorithms have been added, as well stencil versions of
partitioning algorithms. thrust::tabulate has been introduced to tabulate the
values of functions taking integers. For 32b types, new CUDA merge and set
operations provide 2-15x faster performance while a new CUDA comparison sort
provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation
provides 80% faster performance.
Breaking API Changes
Dispatch
Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead
of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.
See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.
thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.
Iterators
iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
Other
normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
Placeholder expressions may no longer include the comma operator.
New Features
Execution Policies
Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.
For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
The following execution policies are supported in this version:
thrust::host
thrust::device
thrust::cpp::par
thrust::cuda::par
thrust::omp::par
thrust::tbb::par
Algorithms
free
get_temporary_buffer
malloc
merge_by_key
partition with stencil
partition_copy with stencil
return_temporary_buffer
set_difference_by_key
set_intersection_by_key
set_symmetric_difference_by_key
set_union_by_key
stable_partition with stencil
stable_partition_copy with stencil
tabulate
New Examples
uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.
Other Enhancements
Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
CUDA merge performance is 2-15x faster.
CUDA comparison sort performance is 1.3-4x faster.
CUDA set operation performance is 1.5-15x faster.
TBB reduce_by_key performance is 80% faster.
Several algorithms have been parallelized with TBB.
Support for user allocators in vectors has been improved.
The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
Warnings have been eliminated in various contexts.
Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
Documentation about algorithm requirements have been improved.
Simplified the minimal_custom_backend example.
Simplified the cuda/custom_temporary_allocation example.
Simplified the cuda/fallback_allocator example.
Bug Fixes
#248 fix broken counting_iterator<float> behavior with OpenMP
#231, #209 fix set operation failures with CUDA
#187 fix incorrect occupancy calculation with CUDA
#153 fix broken multigpu behavior with CUDA
#142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
#208 correctly initialize elements in temporary storage when necessary
#16 fix compilation error when sorting bool with CUDA
#10 fix ambiguous overloads of reinterpret_tag
Known Issues
g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation.
Acknowledgments
Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.
#######################################
# Thrust v1.6.0 #
#######################################
Summary
Thrust v1.6.0 provides an interface for customization and extension and a new
backend system based on the Threading Building Blocks library. With this
new interface, programmers may customize the behavior of specific algorithms
as well as control the allocation of temporary storage or invent entirely new
backends. These enhancements also allow multiple different backend systems
such as CUDA and OpenMP to coexist within a single program. Support for TBB
allows Thrust programs to integrate more naturally into applications which
may already employ the TBB task scheduler.
Breaking API Changes
The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
thrust::host_space_tag has been renamed thrust::host_system_tag
thrust::device_space_tag has been renamed thrust::device_system_tag
thrust::any_space_tag has been renamed thrust::any_system_tag
thrust::iterator_space has been renamed thrust::iterator_system
New Features
Backend Systems
Threading Building Blocks (TBB) is now supported
Functions
for_each_n
raw_reference_cast
Types
pointer
reference
New Examples
cuda/custom_temporary_allocation
cuda/fallback_allocator
device_ptr
expand
minimal_custom_backend
raw_reference_cast
set_operations
Other Enhancements
thrust::for_each now returns the end of the input range similar to most other algorithms
thrust::pair and thrust::tuple have swap functionality
all CUDA algorithms now support large data types
iterators may be dereferenced in user __device__ or __global__ functions
the safe use of different backend systems is now possible within a single binary
Bug Fixes
#469 min_element and max_element algorithms no longer require a const comparison operator
Known Issues
cudafe++.exe may crash when parsing TBB headers on Windows.
#######################################
# Thrust v1.5.3 #
#######################################
Summary
Small bug fixes
Bug Fixes
Avoid warnings about potential race due to __shared__ non-POD variable
#######################################
# Thrust v1.5.2 #
#######################################
Summary
Small bug fixes
Bug Fixes
Fixed warning about C-style initialization of structures
#######################################
# Thrust v1.5.1 #
#######################################
Summary
Small bug fixes
Bug Fixes
Sorting data referenced by permutation_iterators on CUDA produces invalid results
#######################################
# Thrust v1.5.0 #
#######################################
Summary
Thrust v1.5.0 provides introduces new programmer productivity and performance
enhancements. New functionality for creating anonymous "lambda" functions has
been added. A faster host sort provides 2-10x faster performance for sorting
arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
arithmetic types with the OpenMP backend the combined performance improvement
is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
(8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
performance.
Breaking API Changes
device_ptr<void> no longer unsafely converts to device_ptr<T> without an
explicit cast. Use the expression
device_pointer_cast(static_cast<int*>(void_ptr.get()))
to convert, for example, device_ptr<void> to device_ptr<int>.
New Features
Functions
stencil-less transform_if
Types
lambda placeholders
New Examples
lambda
Other Enhancements
host sort is 2-10x faster for arithmetic types
OMP sort provides speedup over host sort
reduce_by_key is 2-3x faster
reduce_by_key no longer requires O(N) temporary storage
CUDA scan algorithms are 10-40% faster
host_vector and device_vector are now documented
out-of-memory exceptions now provide detailed information from CUDART
improved histogram example
device_reference now has a specialized swap
reduce_by_key and scan algorithms are compatible with discard_iterator
Removed Functionality
Bug Fixes
#44 allow host_vector to compile when value_type uses __align__
#198 allow adjacent_difference to permit safe in-situ operation
#303 make thrust thread-safe
#313 avoid race conditions in device_vector::insert
#314 avoid unintended adl invocation when dispatching copy
#365 fix merge and set operation failures
Known Issues
None
Acknowledgments
Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.
#######################################
# Thrust v1.4.0 #
#######################################
Summary
Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
and performance improvements. New set theoretic algorithms operating on
sorted sequences have been added. Additionally, a new fancy iterator
allows discarding redundant or otherwise unnecessary output from
algorithms, conserving memory storage and bandwidth.
Breaking API Changes
Eliminations
thrust/is_sorted.h
thrust/utility.h
thrust/set_intersection.h
thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
New Features
Functions
copy_n
merge
set_difference
set_symmetric_difference
set_union
Types
discard_iterator
Device support
Compute Capability 2.1 GPUs
New Examples
run_length_decoding
Other Enhancements
Compilation warnings are substantially reduced in various contexts.
The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key,
and thrust::stable_sort_by_key are substantially reduced.
A fast sort implementation is used when sorting primitive types with thrust::greater.
The performance of thrust::set_intersection is improved.
The performance of thrust::fill is improved on SM 1.x devices.
A code example is now provided in each algorithm's documentation.
thrust::reverse now operates in-place
Removed Functionality
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
thrust::experimental::cuda::ogl_interop_allocator
thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
Operations which modify the elements of a thrust::device_vector are no longer
available from source code compiled without nvcc when the device backend is CUDA.
Instead, use the idiom from the cpp_interop example.
Bug Fixes
#212 set_intersection works correctly for large input sizes.
#275 counting_iterator and constant_iterator work correctly with OpenMP as the
backend when compiling with optimization
#256 min and max correctly return their first argument as a tie-breaker
#248 NDEBUG is interpreted correctly
Known Issues
nvcc may generate code containing warnings when compiling some Thrust algorithms.
When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue
benign pointer advisories.
When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key,
and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.
Acknowledgments
Thanks to David Tarjan for improving the performance of set_intersection.
Thanks to Duane Merrill for continued help with sort.
Thanks to Nathan Whitehead for help with CUDA Toolkit integration.
#######################################
# Thrust v1.3.0 #
#######################################
Summary
Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
and performance enhancements.
Performance of the sort and sort_by_key algorithms is improved by as much
as 3x in certain situations. The performance of stream compaction algorithms,
such as copy_if, is improved by as much as 2x. Reduction performance is
also improved, particularly for small input sizes.
CUDA errors are now converted to runtime exceptions using the system_error
interface. Combined with a debug mode, also new in v1.3, runtime errors
can be located with greater precision.
Lastly, a few header files have been consolidated or renamed for clarity.
See the deprecations section below for additional details.
Breaking API Changes
Promotions
thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
thrust::next::gather has been renamed thrust::gather
thrust::next::gather_if has been renamed thrust::gather_if
thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
Deprecations
thrust::copy_when has been renamed thrust::deprecated::copy_when
thrust::absolute_value has been renamed thrust::deprecated::absolute_value
The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
The header thrust/utility.h is now deprecated; use thrust/swap.h instead
The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
Eliminations
thrust::deprecated::gather
thrust::deprecated::gather_if
thrust/experimental/arch.h and the functions therein
thrust/sorting/merge_sort.h
thrust/sorting/radix_sort.h
New Features
Functions
exclusive_scan_by_key
find
find_if
find_if_not
inclusive_scan_by_key
is_partitioned
is_sorted_until
mismatch
partition_point
reverse
reverse_copy
stable_partition_copy
Types
system_error and related types
experimental::cuda::ogl_interop_allocator
bit_and, bit_or, and bit_xor
Device support
gf104-based GPUs
New Examples
opengl_interop.cu
repeated_range.cu
simple_moving_average.cu
sparse_vector.cu
strided_range.cu
Other Enhancements
Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
Performance of thrust::copy_if is substantially improved
Performance of thrust::reduce and related reductions is improved
THRUST_DEBUG mode added
Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
The number of compiler warnings generated by Thrust has been substantially reduced
Comparison sort now works correctly for input sizes > 32M
min & max usage no longer collides with <windows.h> definitions
Compiling against the OpenMP backend no longer requires nvcc
Performance of device_vector initialized in .cpp files is substantially improved in common cases
Performance of thrust::sort_by_key on the host is substantially improved
Removed Functionality
nvcc 2.3 is no longer supported
Bug Fixes
Debug device code now compiles correctly
thrust::uninitialized_copy and thrust::unintialized_fill now dispatch constructors on the device rather than the host
Known Issues
#212 set_intersection is known to fail for large input sizes
partition_point is known to fail for 64b types with nvcc 3.2
Acknowledgments
Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
Thanks to Erich Elsen for contributing an implementation of find_if
Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
Thanks to Cliff Woolley for help with testing
#######################################
# Thrust v1.2.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 3.1
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48.
#######################################
# Thrust v1.2.0 #
#######################################
Summary
Thrust v1.2 introduces support for compilation to multicore CPUs
and the Ocelot virtual machine, and several new facilities for
pseudo-random number generation. New algorithms such as set
intersection and segmented reduction have also been added. Lastly,
improvements to the robustness of the CUDA backend ensure
correctness across a broad set of (uncommon) use cases.
Breaking API Changes
thrust::gather's interface was incorrect and has been removed.
The old interface is deprecated but will be preserved for Thrust
version 1.2 at thrust::deprecated::gather &
thrust::deprecated::gather_if. The new interface is provided at
thrust::next::gather & thrust::next::gather_if. The new interface
will be promoted to thrust:: in Thrust version 1.3. For more details,
please refer to this thread:
http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd
The thrust::sorting namespace has been deprecated in favor of the
top-level sorting functions, such as thrust::sort() and
thrust::sort_by_key().
New Features
Functions
reduce_by_key
set_intersection
tie
unique_copy
unique_by_key
unique_copy_by_key
Types
Random Number Generation
discard_block_engine
default_random_engine
linear_congruential_engine
linear_feedback_shift_engine
minstd_rand
minstd_rand0
normal_distribution (experimental)
ranlux24
ranlux48
ranlux24_base
ranlux48_base
subtract_with_carry_engine
taus88
uniform_int_distribution
uniform_real_distribution
xor_combine_engine
Functionals
project1st
project2nd
Fancy Iterators
permutation_iterator
reverse_iterator
Device support
Add support for multicore CPUs via OpenMP
Add support for Fermi-class GPUs
Add support for Ocelot virtual machine
New Examples
cpp_integration
histogram
mode
monte_carlo
monte_carlo_disjoint_sequences
padded_grid_reduction
permutation_iterator
row_sum
run_length_encoding
segmented_scan
stream_compaction
summary_statistics
transform_iterator
word_count
Other Enhancements
vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
integer sorting performance is improved when max is large but (max - min) is small and when min is negative
performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
support for nvcc 3.0
Removed Functionality
removed support for equal between host & device sequences
removed support for gather() and scatter() between host & device sequences
Bug Fixes
# 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
# 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
# 46 gather & scatter handle any space iterators correctly
# 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
# 52 avoid collisions with common user macros such as BLOCK_SIZE
# 62 provide better documentation for device_reference
# 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
# 102 eliminated a race condition in device_vector::erase
various compilation warnings eliminated
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
Acknowledgments
Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
Thanks to Tom Bradley for contributing an implementation of normal_distribution
Thanks to Joseph Rhoads for contributing the example summary_statistics
#######################################
# Thrust v1.1.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.
#######################################
# Thrust v1.1.0 #
#######################################
Summary
Thrust v1.1 introduces fancy iterators, binary search functions, and
several specialized reduction functions. Experimental support for
segmented scan has also been added.
Breaking API Changes
counting_iterator has been moved into the thrust namespace (previously thrust::experimental)
New Features
Functions
copy_if
lower_bound
upper_bound
vectorized lower_bound
vectorized upper_bound
equal_range
binary_search
vectorized binary_search
all_of
any_of
none_of
minmax_element
advance
inclusive_segmented_scan (experimental)
exclusive_segmented_scan (experimental)
Types
pair
tuple
device_malloc_allocator
Fancy Iterators
constant_iterator
counting_iterator
transform_iterator
zip_iterator
New Examples
computing the maximum absolute difference between vectors
computing the bounding box of a two-dimensional point set
sorting multiple arrays together (lexicographical sorting)
constructing a summed area table
using zip_iterator to mimic an array of structs
using constant_iterator to increment array values
Other Enhancements
added pinned memory allocator (experimental)
added more methods to host_vector & device_vector (issue #4)
added variant of remove_if with a stencil argument (issue #29)
scan and reduce use cudaFuncGetAttributes to determine grid size
exceptions are reported when temporary device arrays cannot be allocated
Bug Fixes
#5 make vector work for larger data types
#9 stable_partition_copy doesn't respect OutputIterator concept semantics
#10 scans should return OutputIterator
#16 make algorithms work for larger data types
#27 dispatch radix_sort even when comp=less<T> is explicitly provided
Known Issues
Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1
uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device.
inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver
#######################################
# Thrust v1.0.0 #
#######################################
Breaking API changes
Rename top level namespace komrade to thrust.
Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface.
Rename range() to sequence() to avoid collision with Boost.Range.
Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if().
New Features
Add C++0x style cbegin() & cend() methods to host_vector & device_vector.
Add transform_if function.
Add stencil versions of replace_if() & replace_copy_if().
Allow counting_iterator to work with for_each().
Allow types with constructors in comparison sort & reduce.
Other Enhancements
merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device.
Bug fixes
Workaround an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6)
Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)