==================================================
libtorque, a multithreaded I/O event library
Copyright © 2009--2021 Nick Black <[email protected]>
Render this document with fixed-width fonts!
==================================================
___________________________________________________________________
888 ,e, 888 d8 "...tear the roof off the sucka..."
888 " 888 88e d88 e88 88e 888,8, e88 888 8888 8888 ,e e,
888 888 888 888b d88888 d888 888b 888 " d888 888 8888 8888 d88 88b
888 888 888 888P 888 Y888 888P 888 Y888 888 Y888 888P 888 ,
888 888 888 88" 888 "88 88" 888 "88 888 "88 88" "YeeP"
_____________________________________________ 888 _________________
continuation-based unix i/o for manycore numa\888/© nick black 20xx
Wiki - https://nick-black.com/dankwiki/index.php/Libtorque
Mailing list - http://groups.google.com/group/libtorque-devel
GitHub project page - http://github.com/dankamongmen/libtorque
Primary git repository - git://github.com/dankamongmen/libtorque.git
Bugzilla - https://nick-black.com/bugzilla/buglist.cgi?product=libtorque
I. History and licensing
II. Minimum requirements
1. architecture
2. operating system
3. compiler
4. libc/pthreads
5. cpuset
6. numa
7. cuda
8. ssl
9. dns
A. doc
B. misc
III. Building libtorque
IV. Design issues
1. design docs
V. Writing libtorque applications
1. overview
2. common mistakes
VI. FAQ
1. building
2. general use
3. file descriptors
4. signals
----------------------------------------------------------------libtorque-----
-=+ I. History and licensing +=-
----------------------------------------------------------------libtorque-----
libtorque was conceived as a project for Professor Richard Vuduc's Fall 2009
"CSE 6230: Tools and Applications for High Performance Computing" at the
Georgia Institute of Technology. The original proposal for libtorque is
available at:
https://nick-black.com/tabpower/cse6230proposal.pdf
libtorque is licensed under version 2 of the Apache License:
http://www.apache.org/licenses/LICENSE-2.0.html
A copy can be found in the toplevel file COPYING.
Development of libtorque would have been impossible without the extraordinary
grace, patience and benevolence of management at McAfee Research, particularly
Dmitri Alperovitch and Dr. Sven Krasser.
----------------------------------------------------------------libtorque-----
-=+ II. Minimum requirements +=-
----------------------------------------------------------------libtorque-----
--architecture requirements---------------------------------------------------
Only x86 processors with the CPUID instruction are currently supported (most
everything from the Pentium Pro onwards). Further hardware support is intended.
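As a rough illustration of the CPUID requirement (this is not libtorque's own detection code), the following C sketch asks the compiler-provided <cpuid.h> helper for the highest basic CPUID leaf. The function name cpuid_max_basic_leaf and the HAVE_X86_CPUID macro are hypothetical; __get_cpuid_max() is the GCC/Clang helper, which reports 0 when the instruction is unavailable.

```c
#include <stddef.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>          /* GCC/Clang CPUID helpers, x86 only */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Highest supported basic CPUID leaf, or 0 when CPUID is unavailable
 * (either a non-x86 build or a processor predating the instruction). */
unsigned int cpuid_max_basic_leaf(void)
{
#if HAVE_X86_CPUID
    return __get_cpuid_max(0, NULL);
#else
    return 0;
#endif
}
```

A nonzero result means the processor answers CPUID queries; a zero result on an x86 build would indicate hardware older than what this section describes.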
--operating system requirements-----------------------------------------------
libtorque has been tested on Linux (versions 2.6.19 through 3.2.6), and
FreeBSD (version 7.1). It might work on earlier versions of Linux. Support for
other operating systems, and earlier versions, is intended.
--compiler requirements-------------------------------------------------------
libtorque is reliant upon GNU Make and the GNU Compiler Collection. gcc is
tracked quite closely, and only recent versions might be supported at any time;
4.3 is the minimum gcc version explicitly supported or recommended.
gcc 4.2, and also llvm using the 4.2 frontend, appear to work if some -W
options are removed from WFLAGS. The results have not been extensively tested.
--libc/pthreads requirements--------------------------------------------------
On Linux, the GNU C Library is required, using the NPTL threading
implementation (NPTL is the default on 2.6 kernels since GNU libc 2.3.2).
Versions 2.5 through 2.10 have been tested.
On FreeBSD, only the libthr threading implementation is explicitly supported or
recommended (this is the default in FreeBSD 7, and the only supported mode in
FreeBSD 8). If rebuilding world, ensure NO_LIBTHR is not active in make.conf.
If using another pthread library as the default, bind libpthread references to
libthr via the following entries in /etc/libmap.conf:
libpthread.so.2 libthr.so.2
libpthread.so libthr.so
If using a 32-bit version of the library on a 64-bit system, place these same
lines in /etc/libmap32.conf. The mapping may be restricted to libtorque if
necessary (this author recommends general use of the libthr implementation).
--cpuset requirements---------------------------------------------------------
On FreeBSD, the native code added during 7.1 development is used.
On Linux, administrative support for cpusets requires CONFIG_CPUSET to be
enabled in the kernel (if cpuset partitioning is in effect, a "cpuset" or
"cgroups" filesystem will be mounted on /dev/cpuset). Affinities can and will
still be used by libtorque without this support, but it will be difficult to
partition processing and memory elements up among processes. Affinities have
been part of Linux since 2.5.8. See the Linux kernel's
Documentation/cpusets.txt and libtorque bug #14
(https://nick-black.com/bugzilla/show_bug.cgi?id=14) for more info. If cgroups
are used, you likely also want CONFIG_GROUP_SCHED.
The SGI libcpuset library (http://oss.sgi.com/projects/cpusets/) was evaluated,
but I decided against it due to stability, portability and maintenance issues.
Version 1.0 was tested.
--numa requirements---------------------------------------------------------
On Linux, the libNUMA library (http://oss.sgi.com/projects/libnuma/) is used.
Version 2.0.3 has been tested. CONFIG_NUMA must be enabled in the kernel; if
NUMA is properly supported, devices/system/node* directories will be
present in mounted sysfs filesystems. FreeBSD does not, to my knowledge, expose
NUMA details as of 7.2.
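One way to check for the sysfs node directories mentioned above is to scan a directory for node<N> entries. This is an illustrative sketch only: count_node_dirs is a hypothetical helper, and the usual sysfs mount point /sys is an assumption about the local system.

```c
#include <ctype.h>
#include <dirent.h>
#include <string.h>

/* Count entries named "node<digits>" in dir.
 * Returns -1 if dir cannot be opened. */
int count_node_dirs(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *ent;
    int count = 0;

    if (d == NULL) {
        return -1;
    }
    while ((ent = readdir(d)) != NULL) {
        /* Match node0, node1, ... but not other files in the directory. */
        if (strncmp(ent->d_name, "node", 4) == 0 &&
            isdigit((unsigned char)ent->d_name[4])) {
            ++count;
        }
    }
    closedir(d);
    return count;
}
```

For example, count_node_dirs("/sys/devices/system/node") returning -1 or 0 suggests that the kernel lacks CONFIG_NUMA or that sysfs is not mounted at /sys.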
--cuda requirements---------------------------------------------------------
On Linux, the "Driver API" libcuda library
(http://www.nvidia.com/object/cuda_get.html) is used. Version 2.3 has been
tested.
--ssl requirements----------------------------------------------------------
OpenSSL is supported. Version 0.9.8 has been tested.
GnuTLS support is being considered.
--dns requirements----------------------------------------------------------
GNU adns is supported. Version 1.4 has been tested.
C-ares support is being considered. We might roll our own, one designed for
highly concurrent operation.
--doc requirements----------------------------------------------------------
Building the man pages (distributed in DocBook XML) requires xsltproc (part of
the GNOME project's libxslt) and DocBook. A network connection is required if
the DocBook DTDs and XSL stylesheets are not installed; building the
documentation will be much faster with local copies. Install:
- docbook-xml, docbook-xsl, xsltproc (Debian)
- textproc/docbook-xml, textproc/docbook-xsl, textproc/xsltproc (FreeBSD)
Building the other documentation (papers, figures, etc) requires GraphViz's
dot(1) utility. Versions 2.20.2 through 2.26.0 have been tested. Install:
- graphviz (Debian)
- graphics/graphviz (FreeBSD)
--misc requirements---------------------------------------------------------
Exuberant Ctags is required to build the tagfile. Install:
- devel/ctags (FreeBSD)
- exuberant-ctags (Debian)
----------------------------------------------------------------libtorque-----
-=+ III. Building libtorque +=-
----------------------------------------------------------------libtorque-----
If you have downloaded a release tarball, "configure" will already be present.
If you're building from a source checkout, you'll need the GNU Autotools. Run
"autoreconf -fi" to (re)generate "configure".
Run "./configure" and "make" to build the library, and "make install" to
install it.
Environment variables can affect the build by overriding defaults:
DESTDIR (Installation prefix. Default: /usr/local)
DOCPREFIX (Doc installation prefix. Default: /usr/local/share (Linux),
/usr/local (FreeBSD))
CC (C compiler executable. Default: gcc-4.4 (Linux), gcc44 (FreeBSD))
TAGBIN (Source tag generator. Default: exctags if on path, otherwise ctags)
XSLTPROC (XSL processor. Default: xsltproc)
MARCH/MTUNE (Code generation settings. See below)
Build policy can be modified by defining certain variables:
LIBTORQUE_WITHOUT_ADNS (do not build in GNU adns support)
LIBTORQUE_WITHOUT_CUDA (do not build in NVIDIA CUDA support)
LIBTORQUE_WITHOUT_OPENSSL (do not build in OpenSSL support)
LIBTORQUE_WITHOUT_NUMA (do not build in libNUMA support)
LIBTORQUE_WITHOUT_EV (do not build libev-based testing binaries)
LIBTORQUE_WITHOUT_WERROR (do not compile with -Werror -- use is discouraged)
Changing environment variables ought be followed by the 'clean' target;
this is one of the very few times the 'clean' target must be used.
By default, libtorque is built optimizing for the buildhost's µ-architecture
and ISA, using gcc 4.3's "native" option to -march and -mtune. If you don't
have gcc 4.3 or greater, you'll need to define appropriate march and mtune
values for your system (see gcc's "Submodel Options" info page). Libraries
intended to be run on arbitrary x86 hardware must be built with MARCH
explicitly defined as "generic", and MTUNE unset. The resulting libraries will
be wretchedly suboptimal on the vast majority of x86 processors.
From the toplevel, invoke GNU make. On Linux, 'make' is almost always GNU make.
On FreeBSD, the devel/gmake Port supplies GNU make as 'gmake'. This will build
the libtorque library, and run the supplied unit tests. Unit test failures are
promoted to full build failures. The install target can then be run to install
the library.
Note: The 'install' target depends on unit testing targets, and thus will not
install a known-unsafe library. This might be undesirable when hacking on the
library, and testing with another application. The 'unsafe-install' target is
provided to facilitate such operation. Its use is not typically recommended.
The 'deinstall' target will remove the files installed by that version of
libtorque (it cannot remove files installed only by previous versions). Since
libtorque does not install any active configuration files, use of 'deinstall'
is thus recommended prior to updating and rebuilding libtorque. Non-existence
of files is not considered an error by the 'deinstall' target.
libtorque can be brought up to date via 'git pull'. The 'clean' target ought
never be necessary to run, save when hacking on the build process itself (or
changing build parameters, as noted above), or (re)moving source files.
----------------------------------------------------------------libtorque-----
-=+ IV. Design Issues +=-
----------------------------------------------------------------libtorque-----
- Execution unit detection, differentiation, and effective use. This might
have to deal with symmetric multiprocessing, one or many multicore packages,
simultaneous multithreading (i.e. HyperThreading), heterogeneous cores, limited
cpusets, and processors which are removed from or added to the workset at
runtime. Power management capabilities, functional units, memory and I/O
paths and interconnection properties all play roles in data placement and
event scheduling. Instruction set details ought not matter so much.
libtorque will initially operate as the sole user of any processing units it
is allocated; consideration of other processes, if it exists, is incidental.
Later, this might change. We might support prioritizing within a cpuset, so
that for instance two libtorque programs can share the entirety of a cpuset,
but stomp on each other minimally. It would of course generally be best to
combine these various components into a single libtorque program.
- Memory detection, differentiation and effective use. This might have to deal
with unified vs split caches, n-way associativities, line sizes, total store
sizes, page sizes and types, prefetching, eviction policies, DMA into DRAM
or even cache SRAM, multiprocessor coherence and sharing, inclusive and
exclusive levels, bank count, and TLB sizes. libtorque is not expected to
take into consideration memory pipelining, writethrough vs writeback,
memory bandwidth, or absolute latency.
libtorque will, for instance, want to generally schedule two functionally
pipelined gyres on a shared die, whereas functionally parallel codes might be
usually scheduled irrespective of die-sharing. Stacks can freely alias one
another across exclusive, independent caches, but ought not relative to a
shared cache. Meanwhile, multiple states scheduled on a given thread ought
not be aliasing. These issues combine in complex, interesting ways as the
eventspace becomes irregular, and states must be moved among processors (for
instance, select a processor serving no aliasing states if one's available).
- Not only event-handling, but also event receipt must be scheduled. Any given
set of threads can invoke event discovery, on shared or distinct sets of
events, where shared events could employ shared or distinct kernel-side event
sets. Multiple listeners on an event means more flexibility, but also more
communication and wasted work; it is likely better to move the event.
If no more than one thread can wait for an event, and either one-shot
handling or edge-triggering is used, a majority of locking and possible
contention can be excised from the core.
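The effect of edge-triggering can be seen with plain epoll(7) on Linux. The sketch below is not taken from libtorque; watch_edge_triggered and edge_fires_once are hypothetical names. It registers the read end of a pipe with EPOLLET and shows that a single write is reported once, even though the byte is never read.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for edge-triggered read notification. Returns 0 or -1. */
int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,
        .data = { .fd = fd },
    };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Returns 1 if one write yields exactly one notification, 0 if not,
 * and -1 if the pipe or epoll instance could not be set up. */
int edge_fires_once(void)
{
    struct epoll_event ev;
    int p[2], epfd, first, second, result = -1;

    if (pipe(p) != 0) {
        return -1;
    }
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (watch_edge_triggered(epfd, p[0]) == 0 &&
            write(p[1], "x", 1) == 1) {
            first = epoll_wait(epfd, &ev, 1, 0);  /* reports the new data */
            second = epoll_wait(epfd, &ev, 1, 0); /* nothing new: no event */
            result = (first == 1 && second == 0);
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

Because the notification is not repeated, only the thread that received it needs to act on the descriptor, which is the property the paragraph above relies on.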
--design docs---------------------------------------------------------------
Various design documents can be found in the doc/ subdirectory. Included among
them are:
doc/mteventqueues - "Event Queues and Threads"
doc/termination - "Termination"
----------------------------------------------------------------libtorque-----
-=+ V. Writing libtorque applications +=-
----------------------------------------------------------------libtorque-----
--overview------------------------------------------------------------------
The only interfaces available to users of libtorque are those in libtorque.h,
which attempts to be authoritative and current regarding technical details.
Numerous example applications live in tools/testing/ and various src/
directories. That having been said:
- A torque_ctx is required to use any libtorque functionality. A program
may use more than one torque_ctx, although this constrains event handling
and is thus not generally optimal. This support exists because:
- multiple libraries used by an app might each use libtorque
- multiple-architecture processes might one day need it
- it seems unlikely that refusing to support multiple contexts would lead
to any bugs being discovered more quickly
This is not primarily a security- or billing-related issue; to effect
QoS and accounting, multiple libtorque applications ought be run in distinct
operating system containers. Alternatively, use libtorque's priority system
in conjunction with handrolled stats.
- A torque_ctx can be created only via torque_init(). It cannot be used
after passing it to torque_stop(). Side-effects of torque_init()
include:
- (re-)detection of system topology and processor details
- populating allocated processors with an event thread each (note that
N libtorque contexts in a process lead to N threads per processor,
assuming the process's cpuset doesn't change between initializations)
- SIGPIPE is ignored if it was previously handled via the default action
- SIGTERM will be intercepted by some instance of libtorque subject to
operating system-specific rules. See kqueue(7) or signalfd(2) (this also
applies to any signals registered via torque_addsignal())
- allocation of moderate amounts of memory and a handful of file descriptors
- Add event sources to the libtorque context via torque_add*(). Fundamental
event sources include:
- file descriptors (rx / tx)
- signals (rx)
- timers (absolute or relative; see timerfd(2) or kqueue(7))
- filesystem events (see inotify(7) or kqueue(7))
- pthread condition variables
Synthesized atop these are numerous derived sources (event systems)...
- SSLv3/TLSv1 servers and clients
- DNS queries
- Network events (via netlink/PF_ROUTE sockets)
...and also stream transforms:
- SSLv3/TLSv1
- gzip/bzip2
- architecture-adaptive buffering
Event sources may be registered with more than one libtorque context; the
events will be repeated to (and thus handled by) all associated contexts.
- Once an event source is registered, libtorque immediately begins dispatching
callbacks for it. No more than one libtorque thread will dispatch a given
event's handler at once (though subsequent events may be handled by any
thread). Locking is thus only necessary for mutable data referenced by
multiple events' handlers (including, for instance, thread-unsafe libraries
called by potentially-concurrent handlers).
This is efficiently implemented via exclusive use of edge-triggered I/O
notification (we'd otherwise need locks in the event dispatching).
Edge-triggered I/O is covered in epoll_wait(2) and kqueue(2); most important
to note is that all available data must be read in each callback, or events
will cease to be generated. This means every dequeuing operation (read(2),
accept(2), etc) must be repeated until either:
- An attempt to dequeue returns with EAGAIN or EWOULDBLOCK. Further
read-type events will be processed and dispatched as they occur.
- Further handling would block on some other resource, a mutex for instance
or perhaps buffer space. Ensure appropriate related continuations are
registered, and compose a read-type callback across them (this is the
most general definition of an event queue as mentioned in epoll_wait(2)).
- EOF is reached (read() returns 0). Either close(2) the descriptor or, if
still writing, ensure appropriate related continuations are registered,
as no more read-type events will be dispatched.
- The connection is invalidated, in which case it must be close(2)d lest it
possibly be leaked (there is no assurance of further read-type events).
- It is (currently) critical that handlers not block. Only non-blocking or
asynchronous I/O operations ought be used, and preferably only file
descriptors explicitly marked non-blocking. Rather than sleeping on a
contended mutex, update the continuation and yield the processing context.
Remember that non-blocking operation is typically meaningless in the context
of a disk-backed read(2); asynchronous I/O is thus preferable for disk reads
(especially since a "non-blocking" read retried or failed at the block layer
can block arbitrarily). Major computations upon which handling is dependent
ought be implemented via libtorque's opportunistic or dedicated
compute-thread infrastructure and more fine-grained continuations.
- Handlers can themselves call libtorque functions (even torque_stop()),
even on their own contexts (this is of course necessary for any kind of
accept(2)ing socket).
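The draining rule described above for edge-triggered callbacks might look like the following inside a read handler. This is a sketch using only POSIX calls, not libtorque's API: set_nonblocking, drain_fd and the drain_status enum are hypothetical names, and a real handler would process each buffer rather than just count bytes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

enum drain_status { DRAIN_AGAIN, DRAIN_EOF, DRAIN_ERROR };

/* Add O_NONBLOCK to fd's status flags, preserving the others. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read from a non-blocking fd until it would block, reaches EOF, or
 * fails. The number of bytes consumed is added to *total. */
enum drain_status drain_fd(int fd, size_t *total)
{
    char buf[4096];

    assert(total != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            *total += (size_t)n;  /* a real handler would process buf here */
        } else if (n == 0) {
            return DRAIN_EOF;     /* peer closed: no more read events */
        } else if (errno == EINTR) {
            continue;             /* interrupted before any data arrived */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return DRAIN_AGAIN;   /* fully drained: wait for the next event */
        } else {
            return DRAIN_ERROR;   /* invalidated: caller should close(2) fd */
        }
    }
}
```

The three return values correspond to the termination cases listed above: DRAIN_AGAIN means the callback may return and wait for the next event, while DRAIN_EOF and DRAIN_ERROR leave the close(2) decision to the caller.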
--common mistakes-----------------------------------------------------------
- Using torque_addfd() for a listen(2)ing socket. The proper activation is
torque_addfd_unbuffered(). The accept(2)ing callback will never be
invoked from a default (buffered) fd.
- Failing to account for EINTR or short returns. Whether a system call
interrupted by signal delivery is automatically restarted depends on the
operating system and libc, the capability operand, the system call, and the
current signal-handling state. Since libtorque intercepts signals prior to
uninterested threads' receipt (see signalfd(2) and kqueue(2)), applications
needn't worry about yielding on EINTR, nor unbounded looping thereon. Also,
EINTR is indicated only when no data had been moved upon signal delivery; a
short result is returned otherwise.
- Failure to mask signals registered with libtorque in all other threads.
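For the EINTR and short-return item in the list above, a common pattern is to loop until the whole buffer has been handed to the kernel or the descriptor would block. The helper below is a hypothetical sketch (write_some is not a libtorque function) using only write(2).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write up to len bytes from buf to fd, retrying on EINTR and on short
 * writes. Returns the number of bytes written, which is less than len
 * if the descriptor would block or failed partway through, or -1 if an
 * error occurred before anything was written (errno is left set). */
ssize_t write_some(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n > 0) {
            done += (size_t)n;   /* short write: loop for the remainder */
            continue;
        }
        if (n < 0 && errno == EINTR) {
            continue;            /* interrupted before any data moved */
        }
        if (n < 0 && done == 0) {
            return -1;           /* EAGAIN or a hard error, no progress */
        }
        break;                   /* report partial progress to the caller */
    }
    return (ssize_t)done;
}
```

A return value smaller than len means the caller should keep the unwritten tail and retry once the descriptor next becomes writable.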
----------------------------------------------------------------libtorque-----
-=+ VI. FAQ +=-
----------------------------------------------------------------libtorque-----
--building------------------------------------------------------------------
Q: I get errors about NUMA-related functionality.
A: See the NUMA requirements in section II.6. If you can't provide the
required minimum support, build with LIBTORQUE_WITHOUT_NUMA.
--general use---------------------------------------------------------------
Q: Can an (unthreaded) program use libtorque even if it doesn't use -pthread
during compilation?
A: Yes. Most of the binaries built as part of libtorque don't use -pthread;
see CFLAGS vs MT_CFLAGS in the GNUmakefile.
--file descriptors----------------------------------------------------------
Q: Why is torque_addfd() failing on very high fds?
A: Did your file descriptor rlimit change after the relevant torque_ctx was
created? torque_init() detects and uses the file descriptor rlimit to shape
some internal arrays, and will reject file descriptors outside this range.
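To see which descriptors fall inside the limit at a given moment, the limit can be queried with getrlimit(2). The sketch below rests on assumptions: fd_within_rlimit is a hypothetical helper, and comparing against the soft limit is a guess, since this README does not say whether libtorque uses the soft or the hard value.

```c
#include <sys/resource.h>

/* Returns 1 if fd is below the current soft RLIMIT_NOFILE, 0 if it is
 * not (or is negative), and -1 if the limit cannot be read. */
int fd_within_rlimit(int fd)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    if (fd < 0) {
        return 0;
    }
    return rl.rlim_cur == RLIM_INFINITY || (rlim_t)fd < rl.rlim_cur;
}
```

If the limit must be raised with setrlimit(2), doing so before calling torque_init() rather than afterwards should avoid the mismatch described in the answer.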
--signals-------------------------------------------------------------------
Q: Why can't I listen to SIGTERM, SIGKILL or SIGSTOP?
A: SIGKILL and SIGSTOP can't be caught or used through signalfd/kqueue, so any
attempt to use them will be rejected. Libtorque uses SIGTERM internally, so
attempts to use it will also be rejected (technically, it uses
EVTHREAD_TERM, which is #defined on current platforms to SIGTERM).
Q: Will externally-generated signals be delivered to libtorque threads, or
other threads, or both?
A: libtorque uses POSIX threads, and reflects those semantics. An IPC signal
will be delivered to an arbitrary thread which is not masking that signal.
By default, libtorque threads mask all possible signals (all save KILL,
STOP, and TERM -- see above), and thus signals will prefer other threads.
When a signal is registered with libtorque, that signal will be unmasked in
at least one libtorque thread.
In summary: to ensure delivery to non-libtorque threads, don't register the
relevant signals with libtorque, and mask them prior to calling
torque_init(). To ensure handling within libtorque, mask the relevant
signals (in all threads) prior to calling torque_init(), and register
them for handling. Always be sure to keep SIGTERM blocked.
Q: Will signal handlers be called if libtorque is listening for that signal?
A: Signals delivered to libtorque threads will be consumed. It doesn't matter
anyway, since you ought have the relevant signals blocked in other threads.
Furthermore, libtorque might modify the (process-wide) signal handler.