forked from adaptivecomputing/torque
CHANGELOG
c - crash b - bug fix e - enhancement f - new feature n - note
4.2.0
f - Support the MIC architecture. This was co-developed with Doug Johnson at
Ohio Supercomputer Center (OSC) and provides support for the Intel® MIC
architecture similar to GPU support in TORQUE.
4.1.4
e - When in cray mode, write physmem and availmem in addition to totmem so that
Moab correctly reads memory info.
e - Specifying size, nodes, and mppwidth are all mutually exclusive, so reject
job submissions that attempt to specify more than one of these. TRQ-1185.
b - Merged changes for revision 7000 by hand because the merge was not clean. This
fixes problems with a deadlock when doing job dependencies using synccount/syncwith.
TRQ-1374
b - Fix a segfault in req_jobobit due to an off-by-one error. TRQ-1361.
e - Add the svn revision to --version outputs. TRQ-1357.
b - Fix a race condition in mom hierarchy reporting. TRQ-1378.
b - Fixed pbs_mom so epilogue will only run once. TRQ-1134
b - Fix some debug output escaping into job output. TRQ-1360.
b - Fixed a problem where server threads all get stuck in a poll. The problem
was an infinite loop created in socket_wait_for_read if poll returned -1.
TRQ-1382
b - Fix a Cray-mode bug with jobs ending immediately when spanning nodes of
different proc counts when specifying -l procs. TRQ-1365.
b - Don't fail to make the tmpdir for sister moms. bugzilla #220, TRQ-1403.
c - Fix crashes due to unprotected array accesses. TRQ-1395.
b - Fixed a deadlock in get_parent_dest_queues when the queue_parent_name
and queue_dest_name are the same. TRQ-1413. 11/7/12
b - Fixed segfault in req_movejob where the job ji_qhdr was NULL. TRQ-1416
b - Fix a conflict in the code for heterogeneous jobs and regular jobs.
b - For alps jobs, use the login nodes evenly even when one goes down. TRQ-1317.
b - Display the correct 'Assigned Cpu Count' in momctl output. TRQ-1307.
b - Make pbs_original_connect() no longer hang if the host is down. TRQ-1388.
b - Make epilogues run only once and be executed by the child and not the main
pbs_mom process. TRQ-937.
b - Reduce the error messages in HA mode from moms. They now only log errors if
no server could be contacted. TRQ-1385.
b - Fixed a seg-fault in send_depend_req. Also fixed a deadlock in depend_on_term.
TRQ-1430 and TRQ-1436
b - Fixed a null pointer dereference seg-fault when checking for disallowed types
TRQ-1408.
b - Fix a counting problem when running multi-req ALPS jobs (cray only). TRQ-1431.
b - Remove red herring error messages 'did not find work task for local request'.
These tasks are no longer created since issue_Drequest blocks until it gets the
reply and then processes it. TRQ-1423.
b - Fixed a problem where qsub was not applying the submit filter when given in the torque.cfg
file. TRQ-1446
e - When the mom has no jobs, check the aux path to make sure it is clean and
that we aren't leaving any files there. TRQ-1240.
b - Made it so that threads taken up by poll job tasks cannot consume all available
threads in the thread pool. This will make it so other work can continue if
poll jobs get stuck for whatever reason and that the server will recover. TRQ-1433
b - Fix a deadlock when recording alps reservations. TRQ-1421.
b - Fixed a segfault in req_jobobit caused by NULL pointer assignment to variable
pa. TRQ-1467
b - Fixed deadlock in remove_array. remove_array was calling get_array with allarrays_mutex
locked. TRQ-1466
b - Fixed a problem with an end of file error when running momctl -dx. TRQ-1432.
b - Fix a deadlock in rare cases on job insertion. TRQ-1472.
b - Fix a deadlock after restarting pbs_server when it was SIGKILL'd before a
job array was done cloning. TRQ-1474.
b - Fix a Cray-related deadlock. Always lock the reporter mom before a compute
node. TRQ-1445
b - Additional fix for TRQ-1472. In rm_request on the mom pbs_tcp_timeout was
getting set to 0 which made it so the MOM would fail reading incoming data
if it had not already arrived. This would cause momctl -to fail with an
end of file message.
e - Add a safety net to resend any obits for exiting jobs on the mom that still
haven't cleaned up after five minutes. TRQ-1458.
b - Fix cray running jobs being cancelled after a restart due to jobs not being
set to the login nodes. TRQ-1482.
b - Fix a bug that using -V got rid of -v. TRQ-1457.
b - Make qsub -I -x work again. TRQ-1483.
c - Fix a potential crash when getting the status of a login node in cray mode.
TRQ-1491.
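The TRQ-1433 entry above describes keeping poll job tasks from consuming every thread in the pool. The changelog does not say how pbs_server enforces this, so the following is only a minimal sketch of one common approach: a counting semaphore that caps how many tasks of one class may be in the pool at once. The class name `CappedSubmitter`, the `max_poll_tasks` parameter, and the choice to refuse (rather than queue) excess poll tasks are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CappedSubmitter:
    """Limit how many pool threads one class of task can occupy (illustrative only)."""

    def __init__(self, pool, max_poll_tasks):
        self._pool = pool
        # Hypothetical cap; the changelog does not state the real limit.
        self._slots = threading.BoundedSemaphore(max_poll_tasks)

    def submit_poll_task(self, fn, *args):
        """Submit a poll task, or return None if the cap is already reached."""
        if not self._slots.acquire(blocking=False):
            return None  # caller can retry on the next polling interval

        def run():
            try:
                return fn(*args)
            finally:
                # Free the slot even if the task raised.
                self._slots.release()

        try:
            return self._pool.submit(run)
        except Exception:
            # The pool refused the task, so give the slot back.
            self._slots.release()
            raise

    def submit_other(self, fn, *args):
        """Other work is never counted against the poll-task cap."""
        return self._pool.submit(fn, *args)
```

With a pool of four workers and a cap of two, two stuck poll tasks still leave two workers free for other requests, which is the recovery property the entry describes.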
4.1.3
b - fix a security loophole that potentially allowed an interactive job to run
as root due to not resetting a value when $attempt_to_make_dir and $tmpdir
are set. TRQ-1078.
b - fix down_on_error for the server. TRQ-1074.
b - prevent pbs_server from spinning in select due to sockets in CLOSE_WAIT.
TRQ-1161.
e - Have pbs_server save the queues each time before exiting so that legacy
formats are converted to xml after upgrading. TRQ-1120.
b - Fix phantom jobs being left on the pbs_moms and blocking jobs for Cray
hardware. TRQ-1162. (Thanks Matt Ezell)
b - Fix a race condition on free'd memory when checking for orphaned alps
reservations. TRQ-1181. (Thanks Matt Ezell)
b - If interrupted when reading the terminal type for an interactive job continue
trying to read instead of giving up. TRQ-1091.
b - Fix displaying elapsed time for a job. TRQ-1133.
b - Make offlining nodes persistent after shutting down. TRQ-1087.
b - Fixed a memory leak when calling net_move. net_move allocates memory for args
and starts a thread on send_job. However, args were not getting released
in send_job. TRQ-1199
b - Changed pbs_connect to check for a server name. If it is passed in only that
server name is tried for a connection. If no server name is given then the
default list is used. The previous behavior was to try the name passed in and
the default server list. This would lead to confusion in utilities like qstat
when querying for a specific server. If the server specified was not available,
information from the remaining list would still be returned.
TRQ-1143.
e - Make issue_Drequest wait for the reply and have functions continue processing
immediately after instead of the added overhead of using the threadpool.
c - tm_adopt() calls caused pbs_mom to crash. Fix this. TRQ-1210.
b - Array element 0 wasn't showing up in qstat -t output. TRQ-1155.
b - Cores with multiple processing units were being incorrectly assigned in cpusets.
Additionally, multi-node jobs were getting the cpu list from each node in each
cpuset, also causing problems. TRQ-1202.
b - Finding subjobs (for heterogeneous jobs) wasn't compatible with hostnames that
have dashes. TRQ-1229.
b - Removed the call to wait_request in the main_loop on pbs_server. All of our communication
is handled directly and there is no longer a need to wait for an out of band
reply from a client. TRQ-1161.
e - Modified output for qstat -r. Expanded Req'd Time to include seconds and centered Elap Time
over its column.
b - Fixed a bug found at Univ. of Michigan where a corrupt .JB file would cause
pbs_server to seg-fault and restart.
b - Don't leave quotes on any arguments passed to the resource list. TRQ-1209.
b - Fix a race condition that causes deadlock when two threads are routing the same job.
b - Fixed a bug with qsub where environment variables were not getting populated with the
-v option. TRQ-1228.
b - This time for sure. TRQ-1228. When max_queuable or max_user_queuable were set it
was still possible to go over the limit. This was because a job is qualified
in the call to req_quejob but does not get inserted into the queue until svr_enquejob
is called in req_commit, four network requests later. In a multi-threaded environment
this allowed several jobs to be qualified and put in the pipeline before they
were actually committed to a queue.
b - If max_user_queuable or max_queuable were set on a queue TORQUE would not honor
the limit when filling those queues from a routing queue. This has now
been fixed. TRQ-1088.
b - Fixed seg-fault when running jobs asynchronously. TRQ-1252.
b - Fixed a bug with SIGHUP to pbs_server. The signal handler (change_logs()) does file I/O
which is not allowed for signal interruption. This caused pbs_server to be up but
unresponsive to any commands. TRQ-1250 and TRQ-1224
b - Job dependencies didn't work with display_server_suffix=false. Fixed. TRQ-1255.
b - Don't report alps reservation ids if a node is in interactive mode. TRQ-1251.
b - Only attempt to cancel an orphaned alps reservation a maximum of one time per
iteration. TRQ-1251.
b - Fix a deadlock when recording an alps reservation on the server side. Cray only.
TRQ-1272.
c - Fix mismanagement of the ji_globid. TRQ-1262.
c - Setting display_job_server_suffix=false crashed with job arrays. Fixed. bugzilla #216
b - Restore the asynchronous functionality. TRQ-1284.
e - Made it so pbs_server will come up even if a job cannot recover because of a missing
job dependency. TRQ-1287
b - Fixed a segfault in the path from do_tcp to tm_request to tm_eof. In this path we freed
the tcp channel three times. The call to DIS_tcp_cleanup was removed from tm_eof and
tm_request. TRQ-1232.
b - Fix a deadlock in logging when the machine is out of disk space. TRQ-1302.
b - Fixed a deadlock which occurs when there is a job with a dependency that is being moved
from a routing queue to an execution queue. TRQ-1294
e - Retry cleanup with the mom every 20 seconds for jobs that are stuck in an exiting state.
TRQ-1299.
b - Enabled qsub filters to be accessed from a non-default location. TRQ-1127
b - Put the ability to write the resources_used data to the accounting logs. This was in 4.1.1
and 4.1.2 but failed to make it into 4.1.3. TRQ-1329
c - Fix a double free if the same chan is stored on two tasks for a job. TRQ-1299.
b - Changed pbs_original_connect to retry a failed connect attempt
MAX_RETRIES (5) times before returning failure. This will
reduce the number of client commands that fail due to a connection
failure. TRQ-1355
b - Fix the proliferation of "Non-digit found where a digit was expected" messages, due
to an off-by-one error. TRQ-1230.
b - Fixed a deadlock caused by queue not getting released when jobs are aborted when
moving jobs from a routing queue to an execution queue. TRQ-1344.
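The second TRQ-1228 entry above describes a check-then-act race: a job was qualified against max_queuable in one request but only counted once it was committed several requests later. The changelog does not describe the actual fix, so this is only a sketch of one standard remedy, reserving a slot at qualification time under a lock. The class and method names (`QueueLimit`, `qualify`, `commit`, `abandon`, `dequeue`) are hypothetical and do not correspond to TORQUE functions.

```python
import threading


class QueueLimit:
    """Count queued jobs plus jobs that are qualified but not yet committed."""

    def __init__(self, max_queuable):
        self._max = max_queuable
        self._lock = threading.Lock()
        self._queued = 0    # jobs actually in the queue
        self._pending = 0   # jobs qualified and still in the pipeline

    def qualify(self):
        """Check the limit and reserve a slot in one atomic step."""
        with self._lock:
            if self._queued + self._pending >= self._max:
                return False
            self._pending += 1
            return True

    def commit(self):
        """Turn a reservation into a queued job."""
        with self._lock:
            self._pending -= 1
            self._queued += 1

    def abandon(self):
        """Release a reservation when the submission fails before commit."""
        with self._lock:
            self._pending -= 1

    def dequeue(self):
        """A queued job left the queue, so its slot becomes available again."""
        with self._lock:
            self._queued -= 1
```

Because pending reservations count against the limit, any number of threads can call `qualify()` concurrently without the total exceeding `max_queuable`.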
4.1.2
e - Add the ability to run a single job partially on CRAY hardware and partially
on hardware external to the CRAY in order to allow visualization of
large simulations.
4.1.1
e - pbs_server will now detect and release orphaned ALPS reservations
b - Fixed a deadlock with nodes in stream_eof after call to svr_connect.
b - resources_used information now appears in the accounting log again
TRQ-1083 and bugzilla 198.
b - Fixed a seg-fault found at LBNL where freeaddrinfo would crash because
of uninitialized memory.
b - Fixed a deadlock in handle_complete_second_time. We were not unlocking
when exiting svr_job_purge.
e - Added the wrappers lock_ji_mutex and unlock_ji_mutex to do the mutex locking
for all job->ji_mutex locks.
e - admins can now set the global max_user_queuable limit using qmgr. TRQ-978.
b - No longer make multiple alps reservation parameters for each alps reservation.
This creates problems for the aprun -B command.
b - Fix a problem running extremely large jobs with alps 1.1 and 1.2. Reservations
weren't correctly created in the past. TRQ-1092.
b - Fixed a deadlock with a queue mutex caused by calling qstat -a <queue1> <queue2>
b - Fixed a memory corruption bug, double free in check_if_orphaned. To fix this
issue_Drequest was modified to always free the batch request regardless of
any errors.
b - Fix a potential segfault when using munge but not having set authorized users.
TRQ-1102
b - Added a modified version of a patch submitted by Matt Ezell for Bugzilla 207.
This fixes a seg-fault in qsub if Moab passes an environment variable without
a value.
b - fix an error in parsing environment variables with commas, newlines, etc. TRQ-1113
b - fixed a deadlock with array jobs running simultaneously with qstat.
b - Fixed qsub -v option. Variable list was not getting passed in to job environment.
TRQ-1128
b - TRQ-1116. mail is now sent on job start again.
b - TRQ-1118. Cray jobs are now recovered correctly after a restart.
b - TRQ-1109. Fixed x11 forwarding for interactive jobs. (qsub -I -X). Previous to
this fix interactive jobs would not run any x applications such as xterm, xclock,
etc.
b - TRQ-1161, Fixes a problem where TORQUE gets into a high CPU utilization condition.
The problem was that in the function process_pbs_server_port there was no
error returned if the call to getpeername() failed in the default case.
b - TRQ-1161. This fixes another case that would cause a thread to spin on poll
in start_process_pbs_server_port. A call to the dis function would return
an error but the code would close the connection and return the error code which
was a value less than 20. start_process_pbs_server_port did not recognize the low
error code value and would keep calling into process_pbs_server_port.
b - qdel'ing a running job in the cray environment was trying to communicate with the
cray compute instead of the login node. This is now fixed. TRQ-1184.
b - TRQ-1161. Fixed a problem in stream_eof where a svr_connect was used to connect
to a MOM to see if it was still there. On successful connection the connection
is closed but the wrong function (close_conn) with the wrong argument (the
handle returned by svr_connect()) was used. Replaced with svr_disconnect
b - Make it so that procct is never shown to Moab or users. TRQ-872.
b - TRQ-1182. Fixed a problem where jobs with dependencies were deleted on
the restart of pbs_server.
b - TRQ-1199. Fixed memory leaks found by Valgrind. Fixed a leak when routing jobs
to a remote server, memory leak with procct, memory leak creating queues,
memory leak with mom_server_valid_message_source and a memory leak in req_track.
4.1.0
e - make free_nodes() only look at nodes in the exec_host list and not examine
all nodes to check if the job at hand was there. This should greatly speed
up freeing nodes.
f - add the server parameter interactive_jobs_can_roam (Cray only). When set to
true, interactive jobs can have any login as mother superior, but by default
all interactive jobs will have their submit_host as mother superior
b - Fixed TRQ-696. Jobs get stuck in running state.
b - Fixed a problem where interactive jobs using X-forwarding would fail
because TORQUE thought DISPLAY was not set. The problem was that
DISPLAY was set using lowercase internally. TRQ-1010
e - Add a hostname/address caching feature to alleviate stress on DNS.
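The hostname/address caching entry above can be pictured with a small sketch. This is not TORQUE's implementation (which is in C and is not described in the changelog); it only shows the general idea of resolving each name once and answering later lookups from memory. The class `AddressCache` and its methods are hypothetical, and the sketch deliberately omits expiry, which a real cache would need when addresses change.

```python
import socket
import threading


class AddressCache:
    """Remember name -> address results so repeated lookups skip DNS."""

    def __init__(self, resolver=socket.gethostbyname):
        # The resolver is injectable so the cache can be exercised without a network.
        self._resolver = resolver
        self._lock = threading.Lock()
        self._by_name = {}
        self._by_addr = {}

    def lookup(self, hostname):
        """Return the address for hostname, resolving only on a cache miss."""
        with self._lock:
            addr = self._by_name.get(hostname)
        if addr is None:
            # Resolve outside the lock so a slow DNS server does not block
            # threads whose lookups are already cached.
            addr = self._resolver(hostname)
            with self._lock:
                self._by_name[hostname] = addr
                self._by_addr[addr] = hostname
        return addr

    def name_for(self, addr):
        """Reverse lookup from the cache only; returns None if never seen."""
        with self._lock:
            return self._by_addr.get(addr)
```

Two threads that miss on the same name at the same moment may both resolve it; the result is the same either way, so the sketch accepts that duplicate work in exchange for not holding the lock during resolution.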
4.0.3
b - fix qdel -p all - was performing a qdel all. TRQ-947
b - fix some memory leaks in 4.0.2 on the mom and server TRQ-944
c - TRQ-973. Fix a possibility of a segfault in netcounter_incr()
b - removed memory manager from alloc_br and free_br to solve a memory leak
b - fixes to communications between pbs_sched and pbs_server. TRQ-884
b - fix server crash caused by gpu mode not being right after gpus=x:. TRQ-948.
b - fix logic in torque.setup so it does not say successfully started when
trqauthd failed to start. TRQ-938.
b - fix segfaults on job deletes, dependencies, and cases where a batch
request is held in multiple places. TRQ-933, 988, 990
e - TRQ-961/bugzilla-176 - add the configure option --with-hwloc-path=PATH
to allow installing hwloc to a non-default location.
c - fix a crash when using job dependencies that fail - TRQ-990
e - Cache addresses and names to prevent calling getnameinfo() and getaddrinfo()
too often. TRQ-993
c - fix a crash around re-running jobs
e - change so some Moab environment variables will be put into environment for
the prologue and epilogue scripts. TRQ-967.
b - make command line arguments override the job script arguments. TRQ-1033.
b - fix a pbs_mom crash when using blcr. TRQ-1020.
e - Added patch to buildutils/pbs_mkdirs.in which enables pbs_mkdirs to run
silently. Patch submitted by Bas van der Vlies. Bugzilla 199.
4.0.2
e - Change so init.d script variables get set based on the configure command.
TRQ-789, TRQ-792.
b - Fix so qrun jobid[] does not cause pbs_server segfault. TRQ-865.
b - Fix to validate qsub -l nodes=x against resources_max.nodes the same as v2.4.
TRQ-897.
b - bugzilla #185. Empty arrays should no longer be loaded and now when qdel'ed
they will be deleted.
b - bugzilla #182. The serverdb will now correctly write out memory allocated.
b - bugzilla #188. The deadlock when using job logging is resolved
b - bugzilla #184. pbs_server will no longer log an erroneous error when the 12th
job array is submitted.
e - Allow pbs_mom to change the user's group on stderr/stdout files. Enabled by configuring
Torque with CFLAGS='-DRESETGROUP'. TRQ-908.
e - Have the parent intermediate mom process wait for the child to open the demux before
moving on for more precise synchronization for radix jobs.
e - Changed the way jobs queued in a routing queue are updated. A thread is now launched
at startup and by default checks every 10 seconds to see if there are jobs
in the routing queues that can be promoted to execution queues.
b - Fix so pbs_mom will compile when configured with --with-nvml-lib=/usr/lib and
--with-nvml-include. TRQ-926.
b - fix pbs_track to add its process to the cpuset as well. TRQ-925.
b - Fix so gpu count gets written out to server nodes file when using
--enable-nvidia-gpus. TRQ-927.
b - change pbs_server to listen on all interfaces. TRQ-923
b - Fix so "pbs_server --ha" does not fail when checking path for server.lock file. TRQ-907.
b - Fixed a problem in qmgr where only 9 commands could be completed before a failure.
Bugzilla 192 and TRQ-931
b - Fix to prevent deadlock on server restart with completed job that had a dependency.
TRQ-936.
b - prevent TORQUE from losing connectivity with Moab when starting jobs asynchronously
TRQ-918
b - prevent the API from segfaulting when passed a negative socket descriptor
b - don't allow pbs_tcp_timeout to ever be less than 5 minutes - may be temporary
b - fix pbs_server so it fails if another instance of pbs_server is already
running on same port. TRQ-914.
4.0.1
b - Fix trqauthd init scripts to use correct path to trqauthd.
b - fix so multiple stage in/out files can again be used with qsub -W
b - fix so comma separated file list can be used with qsub -W stagein/stageout.
Matches qsub documentation again.
b - Only seed the random number generator once
b - The code to run the epilogue set of scripts was removed when refactoring the
obit code. The epilogues are now run as part of post_epilogue. preobit_reply
is no longer used.
b - if using a default hierarchy and moms on non-default ports, pass that information
along in the hierarchy
e - Make pbs_server contact pbs_moms in the order in which they appear in the hierarchy
in order to reduce errors on start-up of a large cluster.
b - fix another possibility for deadlock with routing queues
e - move some of the main loop functionality to the threadpool in order to increase
responsiveness.
e - Enabled the configuration to be able to write the path of the library directory
to /etc/ld.so.conf.d in a file named libtorque.conf. The file will be created
by default during make install. The configuration can be made to not install this
file by using the configure option --without-loadlibfile
b - Fixed a bug where Moab was using the option SYNCJOBID=TRUE which allows Moab
to create the job ids in TORQUE. With this in place if TORQUE were terminated
it would delete all jobs submitted through msub when pbs_server was restarted.
This fix recovers all jobs whether submitted with msub or qsub when pbs_server
restarts.
b - fix for where pbsnodes displays outdated gpu_status information.
b - fix problem with '+ and segfault when using multiple node gpu requests.
b - Fixed a bug in svr_connect. If the value for func were null then the newly
created connection was not added to the svr_conn table. This was not right.
We now always add the new connection to svr_conn.
b - fix problem with mom segfault when using 8 or more gpus on mom node.
b - Fix so child pbs_mom does not remain running after qdel on slow starting job.
TRQ-860.
b - Made it so the MOM will let pbs_server know it is down after momctl -s is invoked.
e - Made it so localhost is no longer hard coded. The string comes from getnameinfo.
b - fix a mom hierarchy error for running the moms on non-default ports
b - Fix server segfault for where mom in nodes file is not in mom_hierarchy. TRQ-873.
b - Fix so pbs_mom won't segfault after a qdel is done for a job that is still
running the prologue. TRQ-832.
b - Fix for segfault when using routing queues in pbs_server. TRQ-808
b - Fix so epilogue.precancel runs only once and only for cancelled jobs. TRQ-831.
b - Added a close socket to validate_socket to properly terminate the connection.
Moved the free of the incoming variable sock to process_svr_conn from the
beginning of the function to the end. This fixed a problem where the client
would always get a RST when trying to close its end of the connection.
b - routing to a routing queue now works again, TRQ-905, bugzilla 186
b - Fix server segfaults that happened doing qhold for blcr job. TRQ-900.
n - TORQUE 4.0.1 released 5/3/2012
4.0.0
e - make a threadpool for TORQUE server. The number of threads is
customizable using min_threads and max_threads, and idle time before
exiting can be set using thread_idle_seconds.
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
e - remove the forking from pbs_server running a job, the thread handling the request just
waits until the job is run.
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
of every individual job
e - no longer fork to send mail, just use a thread
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
cores and uses only physical cores for the job. It is true by default.
(contributed by Dr. Bernd Kallies)
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
ask for user input from the command line. The call to ask if the user wants to continue
was moved higher in the initialization process and some of the wording changed to
reflect what is now happening.
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
start instead of failing silently.
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
to nodes. We only loop over each node once at a maximum.
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
perform pbs_iff's functionality, increasing speed and enabling future security
enhancements
e - add mom_hierarchy functionality for reporting. The file is located in
<TORQUE_HOME>/server_priv/mom_hierarchy, and can be written to tell moms to send
updates to other moms who will pass them on to pbs_server. See docs for details
e - add a unit testing framework (check). It is compiled with --with-check and tests
are executed using make check. The framework is complete but not many tests have
been written as of yet.
b - Made changes to IM protocol where commands were not either waiting for a reply
or not sending a reply. Also made changes to close connections that were left
open.
b - Fix for where qmgr record_job_info is True and server hangs on startup.
e - Mom rejection messages are now passed back to qrun when possible
e - Added the option -c for startup. By default, the server attempts to send the mom
hierarchy file to all moms on startup, and all moms update the server and request
the hierarchy file. If both are trying to do this at once, it can cause a lot of
traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
haven't contacted it, reducing this traffic.
e - Added mom parameter -w to reduce start times. With this parameter, the mom waits to send its
first update until the server sends it the mom hierarchy file, or until 10
minutes have passed. This should reduce large cluster startup times.
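The $use_smt entry above says that setting it to false skips logical cores and uses only physical cores. TORQUE does this through hwloc in C; the sketch below only models the selection rule on a plain dictionary. The function name, the `cores` layout (core id mapped to its processing unit ids), and the choice of keeping the lowest-numbered unit per core are assumptions for illustration.

```python
def select_processing_units(cores, use_smt=True):
    """Return the processing unit ids a job may use.

    cores: dict mapping a physical core id to the list of logical
           processing unit ids on that core (hypothetical layout).
    """
    selected = []
    for core_id in sorted(cores):
        units = sorted(cores[core_id])
        if not units:
            continue  # a core with no usable units contributes nothing
        if use_smt:
            selected.extend(units)      # every hardware thread on the core
        else:
            selected.append(units[0])   # one unit per physical core
    return selected
```

For a four-core machine with two hardware threads per core, `use_smt=True` yields eight ids and `use_smt=False` yields four, one per core.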
3.0.5
b - fix for writing too much data when job_script is saved to job log.
b - fix for where pbs_mom would not automatically set gpu mode.
b - fix for aligning qstat -r output when configured with -DTXT.
e - Change size of transfer block used on job rerun from 4k to 64k.
b - With nvidia gpus, TORQUE was losing the directive of what nodes it should
run the job on from Moab. Corrected.
e - add the $PBS_WALLTIME variable to jobs, thanks to a patch from Mark Roberts
n - change moab_array_compatible server parameter so it defaults to true
e - change to allow pbs_mom to run if configured with --enable-nvidia-gpus but
installed on a node without Nvidia gpus.
3.0.4
c - fix a buffer being overrun with nvidia gpus enabled
b - no longer leave zombie processes when munge authenticating.
b - no longer reject procs if it is the second argument to -l
b - when having pbs_mom re-read the config file, old servers were kept, and pbs_mom
attempted to communicate with those as well. Now they are cleared and only the
new server(s) are contacted.
b - pbsnodes -l can now search on all valid node states
e - Added functionality that allows the values for the server parameter
authorized_users to use wild cards for both the user and host portion.
e - Improvements in munge handling of client connections and authentication.
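The authorized_users entry above allows wild cards in both the user and the host portion. The exact wildcard syntax TORQUE accepts is not given in the changelog, so the following sketch assumes entries of the form user@host and shell-style patterns as implemented by Python's `fnmatch` module. The function name `is_authorized` and the case-insensitive host comparison are illustrative choices, not TORQUE behavior.

```python
from fnmatch import fnmatchcase


def is_authorized(user, host, patterns):
    """Return True if user@host matches any 'user@host' pattern.

    Either side of a pattern may contain shell-style wildcards such as '*'.
    """
    for pattern in patterns:
        user_pat, sep, host_pat = pattern.partition("@")
        if not sep:
            continue  # skip entries without an '@' separator
        # User names are compared case-sensitively, host names are not.
        if fnmatchcase(user, user_pat) and fnmatchcase(host.lower(), host_pat.lower()):
            return True
    return False
```

For example, the pattern `alice@login*.example.org` would admit alice from any login host in that domain, while `*@admin.example.org` would admit every user from one host.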
3.0.3
b - fix for bugzilla #141 - qsub was overwriting the path variable in PBSD_authenticate
e - automatically create and mount /dev/cpuset when TORQUE is configured but the cpuset
directory isn't there
b - fix a bug where node lines past 256 characters were rejected. This buffer has been
made much larger (8192 characters)
b - clear out exec_gpus as needed
b - fix for bugzilla #147 - recreate $PBS_NODESFILE file when restarting a blcr
checkpointed job
b - Applied patch submitted by Eric Roman for resmom/Makefile.am (Bugzilla #147)
b - Fix for adding -lcr for BLCR makefiles (Bugzilla #146)
c - fix a potential segfault when using asynchronous runjob with an array slot limit
b - fix bugzilla #135, stagein was deleting directory instead of file
b - fix bugzilla #133, qsub submit filter, the -W arguments are not all there
e - add a mom config option - $attempt_to_make_dir - to give the user the option to
have TORQUE attempt to create the directories for their output file if they don't exist
b - Fixed momctl to return an error on failure. Prior to this fix momctl always returned 0
regardless of success or failure.
e - Change to allow qsub -l ncpus=x:gpus=x which adds a resource list entry for both
b - fix so user epilogues are run as user instead of root
b - No longer report a completion code if a job is pre-empted using qrerun.
c - Fix a crash in record_jobinfo() - this is fixed by backporting dynamic strings from
4.0.0 so that all of the resizing is done in a central location, fixing the crash.
b - No longer count down walltime for jobs that are suspending or have stopped running
for any other reasons
e - add a mom config option - $ext_pwd_retry - to specify # of retries on
checking for password validity.
3.0.2
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write it
b - fix a potential buffer overflow security issue in job names and host address names
b - restore += functionality for nodes when using qmgr. It was overwriting old properties
b - fix bugzilla #134, qmgr -= was deleting all entries
e - added the ability in qsub to submit jobs requesting total gpus for job instead of gpus per node:
-l ncpus=X,gpus=Y
b - do not prepend ${HOME} with the current dir for -o and -e in qsub
e - allow an administrator using the proxy user submission to also set the job id to be used
in TORQUE. This makes TORQUE easier to use in grid configurations.
b - fix jobs named with -J not always having the server name appended correctly
b - make it so that jobs named like arrays via -J have legal output and error file names
b - make a fix for ATTR_node_exclusive - qsub wasn't accepting -n as a valid argument
3.0.1
e - updated qsub's man page to include ATTR_node_exclusive
b - when updating the nodes file, write out the ports for the mom if needed
b - fix a bug for non-NUMA systems that was continuously increasing memory values
e - the queue files are now stored as XML, just like the serverdb
e - Added code from 2.5-fixes which will try and find nodes that did not
resolve when pbs_server started up. This is in reference to Bugzilla
bug 110.
e - make gpus compatible with NUMA systems, and add the node attribute
numa_gpu_node_str for an additional way to specify gpus on node boards
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
wasn't deleted unless a geometry request was made.
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
jobs weren't always re-queued on pbs_server, but were being deleted instead.
e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on
pbs_server. We recommend --with-tcp-retry-limit=2
n - Changing the way to set ATTR_node_exclusive from -E to -n, in order to continue
compatibility with Moab.
b - preserve the order on array strings in TORQUE, like the route_destinations for a
routing queue
b - fix bugzilla #111, multi-line environment variables causing errors in TORQUE.
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
characters
b - restored functionality for -W umask as reported in bugzilla 115
b - Updated torque.spec.in to be able to handle the snapshot names of builds.
b - fix pbs_mom -q to work with parallel jobs
b - Added code to free the mom.lock file during MOM shutdown.
e - Added new MOM configure option job_starter. This option will execute
the script submitted in qsub to the executable or script provided
as the argument to the job_starter option of the MOM configure file.
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - altered the prologue/epilogue code to allow root squashing
f - added the mom config parameter $reduce_prolog_checks. This makes it so TORQUE only checks
to verify that the file is a regular file and is executable.
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
b - fix a segfault when receiving an obit for a job that no longer exists
e - Added options to conditionally build munge, BLCR, high-availability, cpusets,
and spooling. Also allows customization of the sendmail path and allows for
optional XML conversion to serverdb.
b - also remove the procct resource when it is applied because of a default
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
true and no acl_groups are defined.
3.0.0
e - serverdb is now stored as xml, this is no longer configurable.
f - added --enable-numa-support for supporting NUMA-type architectures. We
have tested this build on UV and Altix machines. The server treats the
mom as a node with several special numa nodes embedded, and the pbs_mom
reports on these numa nodes instead of itself as a whole.
f - for numa configurations, pbs_mom creates cpusets for memory as well as
cpus
e - adapted the task manager interface to interact properly with NUMA
systems, including tm_adopt
e - Added autogen.sh to make life easier in a Makefile.in-less world.
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
at install time. The file only shows examples and a link to the
TORQUE documentation.
f - added ATTR_node_exclusive to allow a job to have a node exclusively.
f - added --enable-memacct to use an extra protocol in order to
accurately track jobs that exceed their memory limits and kill
them
e - when ATTR_node_exclusive is set, reserve the entire node (or entire
numa node if applicable) in the cpuset
n - Changed the protocol versions for all client-to-server, mom-to-server and
mom-to-mom protocols from 1 to 2. The changes to the protocol in this version
of TORQUE will make it incompatible with previous versions.
e - when a select statement is used, tally up the memory requests and mark
the total in the resource list. This allows memory enforcement for
NUMA jobs, but doesn't affect others as memory isn't enforced for
multinode jobs
e - add an asynchronous option to qdel
b - do not reply when an asynchronous reply has already been sent
e - make the mem, vmem, and cput usage available on a per-mom basis using momctl -d2
(Dr. Bernd Kallies)
e - move the memory monitor functionality to linux/mom_mach.c in order to store the
more accurate statistics for usage, and still use it for applying limits.
(Dr. Bernd Kallies)
e - when pbs_mom is compiled to use cpusets, instead of looking at all processes,
only examine the ones in cpuset task files. For busy machines (especially large
systems like UVs) this can greatly reduce job monitoring/harvesting times.
(Dr. Bernd Kallies)
e - when cpusets are configured and memory pressure enabled, add the ability to
check memory pressure for a job. Using $memory_pressure_threshold and
$memory_pressure_duration in the mom's config, the admin sets a threshold at
which a job becomes a problem. If duration is set, the job will be killed if
it exceeds the threshold for the configured number of checks. If duration isn't
set, then an error is logged.
(Dr. Bernd Kallies)
e - change pbs_track to look for the executable in the existing path so it doesn't always
need a complete path.
(Dr. Bernd Kallies)
e - report sessions on a per numa node basis when NUMA is enabled
(Dr. Bernd Kallies)
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
(request no mail on qsub) was not always being recognized.
e - Merged buildutils/torque.spec.in from 2.4-fixes.
Refactored torque spec file to comply with established RPM best
practices, including the following:
- Standard installation locations based on RPM macro configuration
(e.g., %{_prefix})
- Latest upstream RPM conditional build semantics with fallbacks for
older versions of RPM (e.g., RHEL4)
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
planned
- Basic working configuration automatically generated at install-time
- Reduce the number of unnecessary subpackages by consolidating where
it makes sense and using existing RPM features (e.g., --excludedocs).
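
The memory pressure entry above ($memory_pressure_threshold and
$memory_pressure_duration) describes a small decision procedure. The Python
sketch below only models that description; it is not pbs_mom's C code. The
function name, the action strings, and the assumption that the counter resets
once pressure falls back to or below the threshold are all hypothetical.

```python
def check_memory_pressure(pressure, threshold, duration, over_count):
    """Return (action, new_over_count) for one monitoring pass.

    Hypothetical model of the changelog description, not TORQUE's code.
    threshold/duration stand in for $memory_pressure_threshold and
    $memory_pressure_duration; None means "not configured".
    """
    if threshold is None or pressure <= threshold:
        # Assumption: being at or under the threshold resets the counter.
        return "ok", 0
    over_count += 1
    if duration is None:
        # No duration configured: only an error is logged.
        return "log", over_count
    if over_count >= duration:
        # Threshold exceeded for the configured number of checks.
        return "kill", over_count
    return "wait", over_count
```

A caller would keep over_count per job and feed it back in on every check.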
2.5.10
b - Fixed a problem where pbs_mom would crash if check_pwd returned NULL. This could
happen, for example, if LDAP was down and getpwnam returned NULL.
e - Added code to delete a job on the MOM if a job is in the EXITED substate and
going through the scan_for_exiting code. This happens when an obit has been
sent but the PBS_BATCH_DeleteJob request that follows the obit reply has not been
received from the server on the MOM. This fix allows the MOM to delete the
job and free up resources even if the server for some reason does not send
the delete job request.
b - TRQ-608: Removed code to check for blocking mode in write_nonblocking_socket().
Fixes problem with interactive jobs (qsub -I) exiting prematurely.
c - fix a buffer being overrun with nvidia gpus enabled (backported from 3.0.4)
b - Fixed a problem in 2.5.9 where the job_array structure was modified
without changing the version or creating an upgrade path. This made
it incompatible with previous versions of TORQUE 2.5 and 3.0.
Added new array structure job_array_259. This is the original torque
2.5.9 job_array structure with the num_purged element added in the middle
of the structure. job_array_259 was created so users could upgrade from 2.5.9
and 3.0.3 to later versions of TORQUE. The job_array structure was
modified by moving the num_purged element to the bottom of the structure.
pbsd_init now has an upgrade path for job arrays from version 3 to version
4. However, there is an exceptional case when upgrading from 2.5.9 or 3.0.3
where pbs_server must be started using a new -u option.
b - no longer leave zombie processes when munge authenticating. (backported from 3.0.4)
2.5.9
e - change mom to only log "cannot find nvidia-smi in PATH" once when built
with --enable-nvidia-gpus and running on a node that does not have Nvidia
drivers installed.
b - Change so gpu states get set/unset correctly. Fixes problems with multiple
exclusive jobs being assigned to same gpu and where next job gets rejected
because gpu state was not reset after last shared gpu job finished.
e - Added a 1 millisecond sleep to src/lib/Libnet/net_client.c client_to_svr()
if connect fails with EADDRINUSE, EINVAL, or EADDRNOTAVAIL. For these cases
TORQUE will retry the connect. This fix increases the chance of success
on the next iteration.
b - Changes to decrease some gpu error messages and to detect unusual gpu
drivers and configurations.
b - Change so user cannot impersonate a different user when using munge.
e - Added new option to torque.cfg named TRQ_IFNAME. This allows the user to designate
a preferred outbound interface for TORQUE requests. The interface is the name
of the NIC interface, for example eth0.
e - Added instructions concerning the server parameter moab_array_compatible to the
README.array_changes file.
b - Fixed a problem where pbs_server would seg-fault if munged was not running. It would
also seg-fault if an invalid credential were sent from a client. The seg-fault
occurred in the same place for both cases.
b - Fixed a problem where jobs dependent on an array using afteranyarray would not start
when a job element of the array completed.
b - Fixed a bug where an array job's .AZ file would not be deleted when the array job was done.
e - Modified qsub so that it will set PBS_O_HOST on the server from the incoming interface.
(with this fix QSUBHOST from torque.cfg will no longer work. Do we need to make it
to override the host name?)
b - fix so user epilogues are run as user instead of root (backported from 3.0.3)
b - fix to prevent pbs_server from hanging when doing server to server job moves.
(backported from 3.0.3)
b - Fixed a problem where array jobs would always lose their state when pbs_server was
restarted. Array jobs now retain their previous state between restarts of the server
the same as non-array jobs. This fix takes care of a problem where Moab and TORQUE
would get out of sync on jobs because of this discrepancy between states.
b - Made a fix related to procct. If no resources were requested on the qsub line, previous
versions of TORQUE did not create a Resource_List attribute, specifically a node and
nodect element for Resource_List. Adding these broke some applications. Now, if no
nodes or procs resources are requested, the procct is set to 1 without creating
the nodes element.
e - Changed enable-job-create to with-job-create with an optional CFLAG argument.
--with-job-create=<CFLAG options>
e - Changed qstat.c to display 6 instead of 5 digits for Req'd Memory for a qstat -a.
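
The 2.5.9 change to client_to_svr() (a 1 millisecond sleep before retrying a
failed connect) can be pictured with the Python sketch below. It is an
illustration only, not the C code in net_client.c: connect_with_retry, the
max_tries limit, and the injectable sleep parameter are hypothetical names
introduced here.

```python
import errno
import time

# Errors for which the changelog says the connect is retried.
RETRYABLE = {errno.EADDRINUSE, errno.EINVAL, errno.EADDRNOTAVAIL}


def connect_with_retry(connect, max_tries=5, delay=0.001, sleep=time.sleep):
    """Call connect() until it succeeds or fails with a non-retryable error.

    max_tries is a hypothetical limit; the changelog does not state one.
    """
    for attempt in range(max_tries):
        try:
            return connect()
        except OSError as exc:
            last_try = attempt == max_tries - 1
            if exc.errno not in RETRYABLE or last_try:
                raise
            sleep(delay)  # 1 millisecond pause before the next attempt
```

Passing sleep as a parameter keeps the sketch easy to exercise without real
delays.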
2.5.8
e - added util function getpwnam_ext() that has retry and errno logging
capability for calls to getpwnam().
c - fix a potential segfault when using asynchronous runjob with an array slot limit
(backported from 3.0.3)
b - In pbs_original_connect() only the first NCONNECT entries of the connection table
were checked for availability. NCONNECT is defined as 10. However, the connection
table is PBS_NET_MAX_CONNECTIONS in size. PBS_NET_MAX_CONNECTIONS is 10240.
NCONNECT is now defined as PBS_NET_MAX_CONNECTIONS.
b - fix bugzilla #135, stagein was deleting directory instead of file (backported
from 3.0.3)
b - If the resources nodes or procs are not submitted on the qsub command line then
the nodes attribute does not get set. This causes a problem if procct is set on
queues because there is no proc count available to evaluate. This fix sets
a default nodes value of 1 if the nodes or procs resources are not requested.
e - Change so Nvidia drivers 260, 270 and above are recognized.
e - Added server attribute no_mail_force which when set True eliminates all
e-mail when job mail_points is set to "n"
2.5.7
e - Added new qsub argument -F. This argument takes a quoted string as
an argument. The string is a list of space separated commandline
arguments which are available to the job script.
b - Fixed a potential buffer overflow problem in src/resmom/checkpoint.c function
mom_checkpoint_recover. I modified the code to change strcpy and strcat to strncpy
and strncat.
b - Fixed a bug for high availability. The -l listener option for pbs_server was not
complete and did not allow pbs_server to properly communicate with the scheduler.
Also fixed a bug with job dependencies where the second server or later in the
$TORQUE_HOME/server_name directory was not added as part of the job dependency
so dependent jobs would get stuck on hold if the current server was not the first
server in the server_name file.
2.5.6
b - Made changes to record_jobinfo and supporting functions to be
able to use dynamically allocated buffers for data. This fixed
a problem where incoming data overran fixed sized buffers.
b - Updated torque.spec.in to be able to handle the snapshot
names of builds.
e - Added new MOM configure option job_starter. This option will execute
the script submitted in qsub to the executable or script provided
as the argument to the job_starter option of the MOM configure file.
b - fixed a problem with pbs_server high availability where the current
server could not keep the HA lock. The problem was a result of truncating
the directory name where the lock file was kept. TORQUE would fail to
validate permissions because it would do a stat on the wrong directory.
b - Added code to free the mom.lock file during MOM shutdown.
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - Added new symbol JOB_EXEC_OVERLIMIT. When a job exceeds a limit (e.g. walltime) the
job will fail with the JOB_EXEC_OVERLIMIT value and
also produce an abort case for mailing purposes. Previous to this change
a job exceeding a limit returned 0 on success and no mail
was sent to the user if requested on abort.
e - Added options to buildutils/torque.spec.in to conditionally build munge, BLCR,
high-availability, cpusets, and spooling. Also allows customization of the
sendmail path and allows for optional XML conversion to serverdb.
b - --with-tcp-retry-limit now actually changes things without needing to run autoheader
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
the logic checking the queue against the user request used an && when it needed a || in the
comparison.
e - The -e and -o options of qsub allow a user to specify a path or optionally a filename for output.
If the path given by the user ended with a directory name but no '/' character at the end then
TORQUE was confused and would not convert the .OU or .ER file to the final output/error file. The
code has now been changed to stat the path to see if the end path element is a file or directory
and handle it appropriately.
e - Added new MOM configuration option $rpp_throttle. The syntax for this in the
$TORQUE_HOME/mom_priv/config file is $rpp_throttle <value> where value is a long
representing microseconds. Setting this value causes rpp data to pause after every
sendto for <value> microseconds. This may help with large jobs where full data does
not arrive at sister nodes.
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write to it
(backported from 3.0.2)
b - Added patch from Michael Jennings to buildutils/torque.spec.in. This patch
allows an rpm configured with DRMAA to complete even if all of the
support files are not present on the system.
b - committed patch submitted by Michael Jennings to fix bug 130. TORQUE on the MOM would call
lstat as root when it should call it as user in open_std_file.
f - Added the ability to detect Nvidia gpus using nvidia-smi (default) or NVML.
Server receives gpu statuses from pbs_mom. Added server attribute auto_node_gpu
that allows automatically setting number of gpus for nodes based on gpu
statuses. Added new configure options --enable-nvidia-gpus,
--with-nvml-include and --with-nvml-lib.
c - fix a segfault when using --enable-nvidia-gpus and pbs_mom has Nvidia driver
older than 260 that still has nvidia-smi command
e - Added capability to automatically set mode on Nvidia gpus. Added support for
gpu reseterr option on qsub. The nodes file will be updated with Nvidia gpu
count when the --enable-nvidia-gpus configure option is used. Moved some code
out of job_purge_thread to prevent segfault on mom.
e - Applied patch submitted by Eric Roman. This patch addresses some build issues
with BLCR, and fixes an error where BLCR would report -ENOSUPPORT when trying
to checkpoint a parallel job. The patch adds a --with-blcr option to configure
to find the path to the BLCR libraries. There are --with-blcr-include,
--with-blcr-lib and --with-blcr-bin to override the search paths, if necessary.
The last option, --with-blcr-bin is used to generate contrib/blcr/checkpoint_script
and contrib/blcr/restart_script from the information supplied at configure time.
b - Fixed problem where calling qstat with a non-existent job id would hang the qstat
command. This was only a problem when configured with MUNGE.
b - fix a potential buffer overflow security issue in job names and host address names
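
The $rpp_throttle entry in this release describes pausing for <value>
microseconds after every sendto. A minimal Python sketch of that pacing idea
follows. It is not the RPP implementation: send_throttled and its parameters
are hypothetical, and the sendto argument here is any callable that takes one
packet.

```python
import time


def send_throttled(packets, sendto, throttle_usec=0, sleep=time.sleep):
    """Send each packet, pausing throttle_usec microseconds after every send.

    throttle_usec plays the role of the $rpp_throttle value; 0 disables
    the pause. Returns the number of packets handed to sendto.
    """
    sent = 0
    for packet in packets:
        sendto(packet)
        sent += 1
        if throttle_usec > 0:
            sleep(throttle_usec / 1_000_000)  # microseconds to seconds
    return sent
```

Spacing out sends this way trades throughput for a lower chance of
overrunning the receiver, which matches the stated goal of helping large jobs
whose data does not fully arrive at sister nodes.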
2.5.5
b - change so gpus get written back to nodes file
e - make it so that even if an array request has multiple consecutive '%' the slot
limit will be set correctly
b - Fixed bug in job_log_open where the global variable logpath was freed instead
of joblogpath.
b - Fixed memory leak in function procs_requested.
b - Validated incoming data for escape_xml to prevent a seg-fault with incoming
null pointers
e - Added submit_host and init_work_dir as job attributes. These two
values are now displayed with a qstat -f. The submit_host is
the name of the host from where the job was submitted. init_work_dir
is the working directory as in PBS_O_WORKDIR.
e - change so blcr checkpoint jobs can restart on different node. Use
configure --enable-blcr to allow.
b - remove the use of a GNU specific function, and fix an error for solaris builds
b - Updated PBS_License.txt to remove the implication that the software
is not freely redistributable.
b - remove the $PBS_GPUFILE when job is done on mom
b - fix a race condition when issuing a qrerun followed by a qdel that caused
the job to be queued instead of deleted sometimes.
e - Implemented Bugzilla Bug 110. If a host in the nodes file cannot be resolved
at startup the server will try once every 5 minutes until the node
will resolve and it will add it to the nodes list.
e - Added a "create" method to pbs_server init.d script so a serverdb file
can be created if it does not exist at startup time. This is an enhancement
in reference to Bugzilla bug 90.
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
past the end of the line input if there is no newline character at the end of the nodes
file.
e - To fix Bugzilla Bug 121 I created a thread in job_purge on the mom in the file src/resmom/job_func.c
All job purging now happens on its own thread. If any of the system calls fail to return
the thread will hang but the MOM will still be able to process work.
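
The 2.5.5 entry about array requests with multiple consecutive '%' characters
can be illustrated with a small parser. The Python sketch below assumes a
request of the form RANGE%LIMIT and assumes that repeated '%' characters are
simply collapsed; the changelog does not say how TORQUE handles them
internally, and split_slot_limit is a hypothetical name.

```python
def split_slot_limit(array_request):
    """Split 'RANGE%LIMIT' into (range_text, slot_limit or None).

    Assumption: extra '%' characters before the limit are ignored.
    """
    range_text, sep, rest = array_request.partition("%")
    if not sep:
        return range_text, None  # no slot limit given
    limit_text = rest.lstrip("%")  # tolerate repeated '%' characters
    return range_text, int(limit_text) if limit_text else None
```

For example, under these assumptions "0-9%%3" and "0-9%3" both yield the
range "0-9" with a slot limit of 3.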
2.5.4
f - added the ability to track gpus. Users set gpus=X in the nodes file for
relevant node, and then request gpus in the nodes request:
-l nodes=X[:ppn=Y][:gpus=Z]. The gpus appear in $PBS_GPUFILE, a new
environment variable, in the form: <hostname>-gpu<index> and in a
new job attribute exec_gpus:
<hostname>-gpu/<index>[+<hostname>-gpu/<index>...]
b - clean up job mom checkpoint directory on checkpoint failure
e - Bugzilla bug 91. Check the status before the service is actually started.
(Steve Traylen - CERN)
e - Bugzilla bug 89. Only touch lock/subsys files if service actually starts.
(Steve Traylen - CERN)
c - when using job_force_cancel_time, fix a crash in rare cases
e - add server parameter moab_array_compatible. When set to true, this parameter
places a limit hold on jobs past the slot limit. Once one of the unheld jobs
completes or is deleted, one of the held jobs is freed.
b - fix a potential memory corruption for walltime remaining for jobs
(Vikentsi Lapa)
b - fix potential buffer overrun in pbs_sched (Bugzilla #98, patch from
Stephen Usher @ University of Oxford)
e - check if a process still exists before killing it and sleeping. This speeds up
the time for killing a task considerably, although this will show mostly for
SMP/NUMA systems, but it will help everywhere.
(Dr. Bernd Kallies)
b - Fix for reque failures on mom. Forked pbs_mom would silently segfault and
job was left in Exiting state.
b - change so "mom_checkpoint_job_has_checkpoint" and "execing command" log
messages do not always get logged
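
The gpu tracking entry in this release gives two textual formats: lines of
the form <hostname>-gpu<index> in $PBS_GPUFILE, and the exec_gpus attribute
<hostname>-gpu/<index>[+<hostname>-gpu/<index>...]. The Python sketch below
builds both from a list of (hostname, index) pairs. The function names and
the input representation are hypothetical; only the output formats come from
the entry.

```python
def gpu_file_lines(assignments):
    """Return $PBS_GPUFILE-style lines, one per (hostname, index) pair."""
    return ["%s-gpu%d" % (host, index) for host, index in assignments]


def exec_gpus(assignments):
    """Return the exec_gpus-style string for the same pairs."""
    return "+".join("%s-gpu/%d" % (host, index) for host, index in assignments)
```

The host names used with these helpers are placeholders, not values TORQUE
defines.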
2.5.3
b - stop reporting errors on success when modifying array ranges
b - don't try to set the user id multiple times
b - added some retrying to get connection and changed some log messages when
doing a pbs_alterjob after a checkpoint
c - fix segfault in tracejob. It wasn't malloc'ing space for the null
terminator
e - add the variables PBS_NUM_NODES and PBS_NUM_PPN to the job environment
(TRQ-6)
e - be able to append to the job's variable_list through the API
(TRQ-5)
e - Added support for munge authentication. This is an alternative for the
default ruserok remote authentication and pbs_iff. This is a compile
time option. The configure option to use is --enable-munge-auth.
Ken Nielson (TRQ-7) September 15, 2010.
b - fix the dependency hold for arrays. They were accidentally cleared
before (RT 8593)
e - add a logging statement if sendto fails at any points in rpp_send_out
b - Applied patch submitted by Will Nolan to fix bug 76.
"blocking read does not time out using signal handler"
b - fix a bug in the $spool_as_final_name code if HAVE_WORDEXP is
undefined
b - Bugzilla bug 84. Security bug on the way checkpoint is being handled.
(Robin R. - Miami Univ. of Ohio)
e - Now saving serverdb as an xml file instead of a byte-dump, thus
allowing canned installations without qmgr scripts, as well as more
portability. Able to upgrade automatically from 2.1, 2.3, and 2.4
b - fix to cleanup job files on mom after a BLCR job is checkpointed and held
b - make the tcp reading buffer able to grow dynamically to read larger
values in order to avoid "invalid protocol" messages
e - change so checkpoint files are transferred as the user, not as root.
f - Added configure option --with-servchkptdir which allows specifying path
for server's checkpoint files
b - could not set the server HA parameters lock_file_update_time and
lock_file_check_time previously. Fixed.
e - qpeek now has the options --ssh, --rsh, --spool, --host, -o, and
-e. Can now output both the STDOUT and STDERR files. Eliminated
numlines, which didn't work.
b - fix to prevent a possible segfault when using checkpointing.
2.5.2
e - Allow the nodes file to use the syntax node[0-100] in the name to
create identical nodes with names node0, node1, ..., node100.
(also node[000-100] => node000, node001, ... node100)
b - fix support of the 'procs' functionality for qsub.
b - remove square brackets [] from job and default stdout/stderr filenames
for job arrays (fixes conflict with some non-bash shells)
n - fix build system so README.array_changes is included in tar.gz file made
with "make dist"
n - fix build system so contrib/pbsweb-lite-0.95.tar.gz, contrib/qpool.gz
and contrib/README.pbstools are included in the tar.gz file made
with "make dist"
c - fixed crash when moving the job to a different queue (bugzilla 73)
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
at install time. The file only shows examples and a link to the
TORQUE documentation. This enhancement was first committed to trunk.
c - fix pbs_server crash from invalid qsub -t argument
b - fix so blcr checkpoint jobs work correctly when put on hold
b - fixed bugzilla #75 where pbs_server would segfault with a double free when
calling qalter on a running job or job array.
e - Changed free_br back to its original form and modified copy_batchrequest
to make a copy of the rq_extend element which will be freed in
free_br.
b - fix condition where job array "template" may not get cleaned up properly
after a server restart
b - fix to get new pagg ID and add additional CSA records when restarting from
checkpoint
e - added documentation for pbs_alterjob_async(), pbs_checkpointjob(),
pbs_fbserver(), pbs_get_server_list() and pbs_sigjobasync().
b - Committed patch from Eygene Ryanbinkin to fix bug 61. /dev/null would
under some circumstances have its permissions modified when jobs exited
on a compute node.
e - add --enable-top-tempdir-only to only create the top directory of the
job's temporary directory when configured
b - make the code for reconnecting to the server more robust, and remove
elements of not connecting if a job isn't running
e - allow input of walltime in the format of [DD]:HH:MM:SS
b - Fix so BLCR checkpoint files get copied to server on qchkpt and periodic
checkpoints
c - corrected a segfault when display_job_server_suffix is set to false
and job_suffix_alias was unset.
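
The nodes file range syntax added in 2.5.2 (node[0-100] and node[000-100])
can be sketched as a name expansion. The Python below is an illustration of
the two examples in the entry, not the server's parser. expand_node_name is a
hypothetical name, and treating a leading zero in the start value as the
padding width is an assumption drawn from the node[000-100] example.

```python
import re

# prefix[START-END]suffix, with decimal START and END
_RANGE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand_node_name(name):
    """Expand 'node[0-3]' into ['node0', 'node1', 'node2', 'node3']."""
    match = _RANGE.match(name)
    if match is None:
        return [name]  # no range: the name stands for itself
    prefix, start, end, suffix = match.groups()
    # Assumption: a zero-padded start value sets the field width.
    width = len(start) if len(start) > 1 and start[0] == "0" else 0
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]
```

With this sketch node[000-100] yields node000 through node100, and a name
without brackets is returned unchanged.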
2.5.1
b - modified Makefile.in and Makefile.am at root to include contrib/AddPrivileges
2.5.0
e - Added new server config option alias_server_name. This option allows
the MOM to add an additional server name to the list
of trusted addresses. The point of this is to be able to handle
alias ip addresses. UDP requests that come into an aliased ip address
are returned through the primary ip address in TORQUE. Because
the address of the reply packet from the server is not the same address
to which the MOM sent its HELLO1 request, the MOM would otherwise drop the
packet and could not be added to the server.
n - auto_node_np will now adjust np values down as well as up.
e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
TORQUE simply recorded the value of x for procs in Resource_List. It
now takes that value and allocates x processors packed on any available
node. (Ken Nielson Adaptive Computing. June 17, 2010)
f - added full support (server-scheduler-mom) for Cygwin (UIIP NAS of Belarus,
uiip.bas-net.by)
b - fixed EINPROGRESS handling in net_client.c. This error appears on every
connection attempt and requires separate handling. The old erroneous
handling caused a large network delay, especially on Cygwin.
e - improved signal processing after connecting in client_to_svr and added own
implementation of bindresvport for OS which lack it (Igor Ilyenko,
UIIP Minsk)
f - created permission checking of Windows (Cygwin) users, using mkpasswd,
mkgroup and own functions IamRoot, IamUser (Yauheni Charniauski,
UIIP Minsk)
f - created permission checking of submitted jobs (Vikentsi Lapa,
UIIP Minsk)
f - Added the --disable-daemons configure option for starting server-sched-mom
as Windows services; cygrunsrv.exe puts them into the background
independently.
e - Adapted output of Cygwin's diagnostic information (Yauheni
Charniauski, UIIP Minsk)
b - Changed pbsd_main to call daemonize_server early only if
high_availability_mode is set.
e - added new qmgr server attributes (clone_batch_size, clone_batch_delay)
for controlling job cloning (Bugzilla #4)
e - added new qmgr attribute (checkpoint_defaults) for setting default
checkpoint values on Execution queues (Bugzilla #1)
e - print a more informative error if pbs_iff isn't found when trying to
authenticate a client
e - added qmgr server attribute job_start_timeout, specifies timeout to be
used for sending job to mom. If not set, tcp_timeout is used.
e - added -DUSESAVEDRESOURCES code that uses servers saved resources used
for accounting end record instead of current resources used for jobs that
stopped running while mom was not up.
e - TORQUE job arrays now use arrays to hold the job pointers and not
linked lists (allows constant lookup).
f - Allow users to delete a range of jobs from the job array (qdel -t)
f - Added a slot limit to the job arrays - this restricts the number of
jobs that can concurrently run from one job array.
f - added support for holding ranges of jobs from an array with a single
qhold (using the -t option).
f - now ranges of jobs in an array can be modified through qalter
(using the -t option).
f - jobs can now depend on arrays using these dependencies:
afterstartarray, afterokarray, afternotokarray, afteranyarray
f - added support for using qrls on arrays with the -t option
e - complete overhaul of job array submission code
f - by default show only a single entry in qstat output for the whole array
(qstat -t expands the job array)
f - server parameter max_job_array_size limits the number of jobs allowed
in an array
b - job arrays can no longer circumvent max_user_queuable
b - job arrays can no longer circumvent max_queuable
f - added server parameter max_slot_limit to restrict slot limits
e - changed array names from jobid-index to jobid[index] for consistency
2.4.13
e - change so blcr checkpoint jobs can restart on different node. Use
configure --enable-blcr to allow. (Bugzilla 68, backported from 2.5.5)
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
(backported from 3.0.1)
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
wasn't deleted unless a geometry request was made. (backported from 3.0.1)
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
jobs weren't always re-queued on pbs_server, but were being deleted instead. (backported from 3.0.1)
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
characters (backported from 3.0.1)
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
past the end of the line input if there is no newline character at the end of the nodes
file.
b - Updated torque.spec.in to be able to handle the snapshot
names of builds.
b - Merged revisions 4555, 4556 and 4557 from 2.5-fixes branch. These revisions fix problems in
High availability mode and also a problem where the MOM was not releasing the lock on
mom.lock on exit.
b - fix pbs_mom -q to work with parallel jobs (backported from 3.0.1)
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
(backported from 3.0.1)
b - fix a segfault when receiving an obit for a job that no longer exists (backported from 3.0.1)
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
the logic checking the queue against the user request used an && when it needed a || in the
comparison.
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
true and no acl_groups are defined. (backported from 3.0.1)