README

                 ==================================================
                    libtorque, a multithreaded I/O event library
                 Copyright © 2009--2021 Nick Black <dank@qemfd.net>
                    Render this document with fixed-width fonts!
                 ==================================================
___________________________________________________________________
888 ,e, 888        d8           "...tear the roof off the sucka..."
888  "  888 88e   d88    e88 88e  888,8,  e88 888 8888 8888  ,e e,
888 888 888 888b d88888 d888 888b 888 "  d888 888 8888 8888 d88 88b
888 888 888 888P  888   Y888 888P 888    Y888 888 Y888 888P 888   ,
888 888 888 88"   888    "88 88"  888     "88 888  "88 88"   "YeeP"
_____________________________________________ 888 _________________
continuation-based unix i/o for manycore numa\888/© nick black 20xx

Wiki - https://nick-black.com/dankwiki/index.php/Libtorque
Mailing list - http://groups.google.com/group/libtorque-devel
GitHub project page - http://github.com/dankamongmen/libtorque
Primary git repository - git://github.com/dankamongmen/libtorque.git
Bugzilla - https://nick-black.com/bugzilla/buglist.cgi?product=libtorque

 I. History and licensing
 II. Minimum requirements
  1. architecture
  2. operating system
  3. compiler
  4. libc/pthreads
  5. cpuset
  6. numa
  7. cuda
  8. ssl
  9. dns
  A. doc
  B. misc
 III. Building libtorque
 IV. Design issues
  1. design docs
 V. Writing libtorque applications
  1. overview
  2. common mistakes
 VI. FAQ
  1. building
  2. general use
  3. file descriptors
  4. signals

----------------------------------------------------------------libtorque-----
-=+ I. History and licensing +=-
----------------------------------------------------------------libtorque-----

libtorque was conceived as a project for Professor Richard Vuduc's Fall 2009
"CSE 6230: Tools and Applications for High Performance Computing" at the
Georgia Institute of Technology. The original proposal for libtorque is
available at:

        https://nick-black.com/tabpower/cse6230proposal.pdf

libtorque is licensed under version 2 of the Apache License:

        http://www.apache.org/licenses/LICENSE-2.0.html

A copy can be found in the toplevel file COPYING.

Development of libtorque would have been impossible without the extraordinary
grace, patience and benevolence of management at McAfee Research, particularly
Dmitri Alperovitch and Dr. Sven Krasser.

----------------------------------------------------------------libtorque-----
-=+ II. Minimum requirements +=-
----------------------------------------------------------------libtorque-----

--architecture requirements---------------------------------------------------

Only x86 processors with the CPUID instruction are currently supported (most
everything from the Pentium Pro onwards). Further hardware support is intended.

--operating system requirements-----------------------------------------------

libtorque has been tested on Linux (versions 2.6.19 through 3.2.6), and
FreeBSD (version 7.1). It might work on earlier versions of Linux. Support for
other operating systems, and earlier versions, is intended.

--compiler requirements-------------------------------------------------------

libtorque is reliant upon GNU Make and the GNU Compiler Collection. gcc is
tracked quite closely, and only recent versions might be supported at any time;
4.3 is the minimum gcc version explicitly supported or recommended

gcc 4.2, and also llvm using the 4.2 frontend, appear to work if some -W
options are removed from WFLAGS. The results have not been extensively tested.

--libc/pthreads requirements--------------------------------------------------

On Linux, the GNU C Library is required, using the NPTL threading
implementation (NPTL is the default on 2.6 kernels since GNU libc 2.3.2).
Versions 2.5 through 2.10 have been tested.

On FreeBSD, only the libthr threading implementation is explicitly supported or
recommended (this is the default in FreeBSD 7, and the only supported mode in
FreeBSD 8)). If rebuilding world, ensure NO_LIBTHR is not active in make.conf.
If using another pthread library as the default, bind libpthread references to
libthr via the following entries in /etc/libmap.conf:

        libpthread.so.2         libthr.so.2
        libpthread.so           libthr.so

If using a 32-bit version of the library on a 64-bit system, place these same
lines in /etc/libmap32.conf. The mapping may be restricted to libtorque if
necessary (this author recommends general use of the libthr implementation).

--cpuset requirements---------------------------------------------------------

On FreeBSD, the native code added during 7.1 development is used.

On Linux, administrative support for cpusets requires CONFIG_CPUSET to be
enabled in the kernel (if cpuset partitioning is in effect, a "cpuset" or
"cgroups" filesystem will be mounted on /dev/cpuset). Affinities can and will
still be used by libtorque without this support, but it will be difficult to
partition processing and memory elements up among processes. Affinities have
been part of Linux since 2.5.8. See the Linux kernel's
Documentation/cpusets.txt and libtorque bug #14
(https://nick-black.com/bugzilla/show_bug.cgi?id=14) for more info. If cgroups
are used, you likely also want CONFIG_GROUP_SCHED.

The SGI libcpuset library (http://oss.sgi.com/projects/cpusets/) was evaluated,
but I decided against it due to stability, portability and maintenance issues.
Version 1.0 was tested.

--numa requirements---------------------------------------------------------

On Linux, the libNUMA library (http://oss.sgi.com/projects/libnuma/) is used.
Version 2.0.3 has been tested. CONFIG_NUMA must be enabled in the kernel; if
NUMA is properly supported, devices/system/node* directories will be
present in mounted sysfs filesystems. FreeBSD does not, to my knowledge, expose
NUMA details as of 7.2.

--cuda requirements---------------------------------------------------------

On Linux, the "Driver API" libcuda library
(http://www.nvidia.com/object/cuda_get.html) is used. Version 2.3 has been
tested.

--ssl requirements----------------------------------------------------------

OpenSSL is supported. Version 0.9.8 has been tested.

GnuTLS support is being considered.

--dns requirements----------------------------------------------------------

GNU adns is supported. Version 1.4 has been tested.

C-ares support is being considered. We might roll our own, one designed for
highly concurrent operation.

--doc requirements----------------------------------------------------------

Building the man pages (distributed in Docbook XML) requires xsltproc (part of
the GNOME project's libxslt) and DocBook. A network connection is required if
the Docbook DTD's and XSL stylesheets are not installed; building the
documentation will be much faster with local copies. Install:

 - docbook-xml, docbook-xsl, xsltproc (Debian)
 - textproc/docbook-xml, textproc/docbook-xsl, textproc/xsltproc (FreeBSD)

Building the other documentation (papers, figures, etc) requires GraphViz's
dot(1) utility. Version 2.20.2--2.26.0 have been tested. Install:

 - graphviz (Debian)
 - graphics/graphviz (FreeBSD)

--misc requirements---------------------------------------------------------

Exuberant Ctags are required to build the tagfile. Install:

 - devel/ctags (FreeBSD)
 - exuberant-ctags (Debian)

----------------------------------------------------------------libtorque-----
-=+ III. Building libtorque +=-
----------------------------------------------------------------libtorque-----

If you have downloaded a release tarball, "configure" will already be present.
If you're building from a source checkout, you'll need the GNU Autotools. Run
"autoreconf -fi" to (re)generate "configure".

Run "./configure" and "make" to build the library, and "make install" to
install it.

Environment variables can affect the build by overriding defaults:

 DESTDIR (Installation prefix. Default: /usr/local)
 DOCPREFIX (Doc installation prefix. Default: /usr/local/share (Linux),
                                                /usr/local (FreeBSD))
 CC (C compiler executable. Default: gcc-4.4 (Linux), gcc44 (FreeBSD))
 TAGBIN (Source tag generator. Default: exctags if on path, otherwise ctags)
 XSLTPROC (XSL processor. Default: xsltproc)
 MARCH/MTUNE (Code generation settings. See below)

Build policy can be modified by defining certain variables:

 LIBTORQUE_WITHOUT_ADNS (do not build in GNU adns support)
 LIBTORQUE_WITHOUT_CUDA (do not build in NVIDIA CUDA support)
 LIBTORQUE_WITHOUT_OPENSSL (do not build in OpenSSL support)
 LIBTORQUE_WITHOUT_NUMA (do not build in libNUMA support)
 LIBTORQUE_WITHOUT_EV (do not build libev-based testing binaries)
 LIBTORQUE_WITHOUT_WERROR (do not compile with -Werror -- use is discouraged)

Changing environment variables ought be followed by the 'clean' target;
this is one of the very few times the 'clean' target must be used.

By default, libtorque is built optimizing for the buildhost's µ-architecture
and ISA, using gcc 4.3's "native" option to -march and -mtune. If you don't
have gcc 4.3 or greater, you'll need to define appropriate march and mtune
values for your system (see gcc's "Submodel Options" info page). Libraries
intended to be run on arbitrary x86 hardware must be built with MARCH
explicitly defined as "generic", and MTUNE unset. The resulting libraries will
be wretchedly suboptimal on the vast majority of x86 processors.

From the toplevel, invoke GNU make. On Linux, 'make' is almost always GNU make.
On FreeBSD, the devel/gmake Port supplies GNU make as 'gmake'. This will build
the libtorque library, and run the supplied unit tests. Unit test failures are
promoted to full build failures. The install target can then be run to install
the library.

Note: The 'install' target depends on unit testing targets, and thus will not
install a known-unsafe library. This might be undesirable when hacking on the
library, and testing with another application. The 'unsafe-install' target is
provided to facilitate such operation. Its use is not typically recommended.

The 'deinstall' target will remove the files installed by that version of
libtorque (it cannot remove files installed only by previous versions). Since
libtorque does not install any active configuration files, use of 'deinstall'
is thus recommended prior to updating and rebuilding libtorque. Non-existence
of files is not considered an error by the 'deinstall' target.

libtorque can be brought up to date via 'git pull'. The 'clean' target ought
never be necessary to run, save when hacking on the build process itself (or
changing build parameters, as noted above), or (re)moving source files.

----------------------------------------------------------------libtorque-----
-=+ IV. Design Issues +=-
----------------------------------------------------------------libtorque-----

- Execution unit detection, differentiation, and effective use. This might
  have to deal with symmetric multiprocessing, one or many multicore packages,
  simultaneous multithreading (ie HyperThreading), heterogenous cores, limited
  cpusets, and processors which are removed from or added to the workset at
  runtime. Power management capabilities, functional units, memory and I/O
  paths and interconnection properties all play roles in data placement and
  event scheduling. Instruction set details ought not matter so much.

  libtorque will initially operate as the sole user of any processing units it
  is allocated; consideration of other processes, if it exists, is incidental.
  Later, this might change. We might support prioritizing within a cpuset, so
  that for instance two libtorque programs can share the entirety of a cpuset,
  but stomp on each other minimally. It would of course generally be best to
  combine these various components into a single libtorque program.

- Memory detection, differentiation and effective use. This might have to deal
  with unified vs split caches, n-way associativities, line sizes, total store
  sizes, page sizes and types, prefetching, eviction policies, DMA into DRAM
  or even cache SRAM, multiprocessor coherence and sharing, inclusive and
  exclusive levels, bank count, and TLB sizes. It is unexpected that libtorque
  will take into consideration memory pipelining, writethrough vs writeback,
  memory bandwidth, or absolute latency.

  libtorque will, for instance, want to generally schedule two functionally
  pipelined gyres on a shared die, whereas functionally parallel codes might be
  usually scheduled irrespective of die-sharing. Stacks can freely alias one
  another across exclusive, independent caches, but ought not relative to a
  shared cache. Meanwhile, multiple states scheduled on a given thread ought
  not be aliasing. These issues combine in complex, interesting ways as the
  eventspace becomes irregular, and states must be moved among processors (for
  instance, select a processor serving no aliasing states if one's available).

- Not only event-handling, but also event receipt must be scheduled. Any given
  set of threads can invoke event discovery, on shared or distinct sets of
  events, where shared events could employ shared or distinct kernel-side event
  sets. Multiple listeners on an event means more flexibility, but also more
  communication and wasted work; it is likely better to move the event.
  If no more than one thread can wait for an event, and either one-shot
  handling or edge-triggering is used, a majority of locking and possible
  contention can be excised from the core.

--design docs---------------------------------------------------------------

Various design documents can be found in the doc/ subdirectory. Included among
them are:

 doc/mteventqueues - "Event Queues and Threads"
 doc/termination - "Termination"

----------------------------------------------------------------libtorque-----
-=+ V. Writing libtorque applications +=-
----------------------------------------------------------------libtorque-----

--overview------------------------------------------------------------------

The only interfaces available to users of libtorque are those in libtorque.h,
which attempts to be authoritative and current regarding technical details.
Numerous example applications live in tools/testing/ and various src/
directories. That having been said:

 - A torque_ctx is required to use any libtorque functionality. A program
   may use more than one torque_ctx, although this constrains event handling
   and is thus not generally optimal. This support exists because:

    - multiple libraries used by an app might each use libtorque
    - multiple-architecture processes might one day need it
    - it seems unlikely that refusing to support multiple contexts would lead
       to any bugs being discovered more quickly
 
   This is not primarily a security- or billing-related issue; to effect
   QoS and accounting, multiple libtorque applications ought be run in distinct
   operating system containers. Alternatively, use libtorque's priority system
   in conjunction with handrolled stats.

 - A torque_ctx can be created only via torque_init(). It cannot be used
   after passing it to torque_stop(). Side-effects of torque_init()
   include:

   - (re-)detection of system topology and processor details
   - populating allocated processors with an event thread each (note that
      N libtorque contexts in a process lead to N threads per processor,
      assuming the process's cpuset doesn't change between initializations)
   - SIGPIPE is ignored if it was previously handled via the default action
   - SIGTERM will be intercepted by some instance of libtorque subject to
      operating system-specific rules. See kqueue(7) or signalfd(2) (this also
      applies to any signals registered via torque_addsignal())
   - allocation of moderate amounts of memory and a handful of file descriptors

 - Add event sources to the libtorque context via torque_add*(). Fundamental
   event sources include:

    - file descriptors (rx / tx)
    - signals (rx)
    - timers (absolute or relative; see timerfd(2) or kqueue(7))
    - filesystem events (see inotify(7) or kqueue(7))
    - pthread condition variables

   Synthesized atop these are numerous derived sources (event systems)...

    - SSLv3/TLSv1 servers and clients
    - DNS queries
    - Network events (via netlink/PF_ROUTE sockets)

   ...and also stream transforms:

    - SSLv3/TLSv1
    - gzip/bzip2
    - architecture-adaptive buffering

   Event sources may be registered with more than one libtorque context; the
   events will be repeated to (and thus handled by) all associated contexts.

 - Once registered, libtorque will immediately begin facilitating callbacks for
   the specified event. No more than one libtorque thread will dispatch a given
   event's handler at once (though subsequent events may be handled by any
   thread). Locking is thus only necessary for mutable data referenced by
   multiple events' handlers (including, for instance, thread-unsafe libraries
   called by potentially-concurrent handlers).

   This is efficiently implemented via exclusive use of edge-triggered I/O
   notification (we'd otherwise need locks in the event dispatching).
   Edge-triggered I/O is covered in epoll_wait(2) and kqueue(2); most important
   to note is that all available data must be read in each callback, or events
   will cease to be generated. This means every dequeuing operation (read(2),
   accept(2), etc) must be repeated until either:

    - An attempt to dequeue returns with EAGAIN or EWOULDBLOCK. Further
       read-type events will be processed and dispatched as they occur.
    - Further handling would block on some other resource, a mutex for instance
       or perhaps buffer space. Ensure appropriate related continuations are
       registered, and compose a read-type callback across them (this is the
       most general definition of an event queue as mentioned in epoll_wait(2)).
    - EOF is reached (read() returns 0). Either close(2) the descriptor or, if
       still writing, ensure appropriate related continuations are registered,
       as no more read-type events will be dispatched.
    - The connection is invalidated, in which case it must be close(2)d lest it
       possibly be leaked (there is no assurance of further read-type events).

 - It is (currently) critical that handlers not block. Only non-blocking or
   asynchronous I/O operations ought be used, and preferably only file
   descriptors explicitly marked non-blocking. Rather than sleeping on a
   contended mutex, update the continuation and yield the processing context.
   Remember that non-blocking operation is typically meaningless in the context
   of a disk-backed read(2); asynchronous I/O is thus preferable for disk reads
   (especially since a "non-blocking" read retried or failed at the block layer
   can block arbitrarily). Major computations upon which handling is dependent
   ought be implemented via libtorque's opportunistic or dedicated
   compute-thread infrastructure and more fine-grained continuations.

 - Handlers can themselves call libtorque functions (even torque_stop()),
   even on their own contexts (this is of course necessary for any kind of
   accept(2)ing socket).

--common mistakes-----------------------------------------------------------

 - Using torque_addfd() for a listen(2)ing socket. The proper activation is
   torque_addfd_unbuffered(). The accept(2)ing callback will never be
   invoked from a default (buffered) fd.

 - Failing to account for EINTR or short returns. Whether a system call
   interrupted by signal delivery is automatically restarted depends on the
   operating system and libc, the capability operand, the system call, and the
   current signal-handling state. Since libtorque intercepts signals prior to
   uninterested threads' receipt (see signalfd(2) and kqueue(2)), applications
   needn't worry about yielding on EINTR, nor unbounded looping thereon. Also,
   EINTR is indicated only when no data had been moved upon signal delivery; a
   short result is returned otherwise.

 - Failure to mask signals registered with libtorque in all other threads.

----------------------------------------------------------------libtorque-----
-=+ VI. FAQ +=-
----------------------------------------------------------------libtorque-----

--building------------------------------------------------------------------

Q: I get errors about NUMA-related functionality.
A: See the NUMA requirements in section II.6. If you can't provide the
   required minimum support, build with LIBTORQUE_WITHOUT_NUMA.

--general use---------------------------------------------------------------

Q: Can an (unthreaded) program use libtorque even if it doesn't use -pthread
   during compilation?
A: Yes. Most of the binaries built as part of libtorque don't use -pthread;
   see CFLAGS vs MT_CFLAGS in the GNUmakefile.

--file descriptors----------------------------------------------------------

Q: Why is torque_addfd() failing on very high fds?
A: Did your file descriptor rlimit change after the relevant torque_ctx was
   created? torque_init() detects and uses the file descriptor rlimit to shape
   some internal arrays, and will reject file descriptors outside this range.

--signals-------------------------------------------------------------------

Q: Why can't I listen to SIGTERM, SIGKILL or SIGSTOP?
A: SIGKILL and SIGSTOP can't be caught or used through signalfd/kqueue, so any
   attempt to use them will be rejected. Libtorque uses SIGTERM internally, so
   attempts to use it will also be rejected (technically, it uses
   EVTHREAD_TERM, which is #defined on current platforms to SIGTERM).

Q: Will externally-generated signals be delivered to libtorque threads, or
   other threads, or both?
A: libtorque uses POSIX threads, and reflects those semantics. An IPC signal
   will be delivered to an arbitrary thread which is not masking that signal.
   By default, libtorque threads mask all possible signals (all save KILL,
   STOP, and TERM -- see above), and thus signals will prefer other threads.
   When a signal is registered with libtorque, that signal will be unmasked in
   at least one libtorque thread.

   In summary: to ensure delivery to non-libtorque threads, don't register the
   relevant signals with libtorque, and mask them prior to calling
   torque_init(). To ensure handling within libtorque, mask the relevant
   signals (in all threads) prior to calling torque_init(), and register
   them for handling. Always be sure to keep SIGTERM blocked.

Q: Will signal handlers be called if libtorque is listening for that signal?
A: Signal delivered to libtorque threads will be consumed. It doesn't matter
   anyway, since you ought have the relevant signals blocked in other threads.
   Furthermore, libtorque might modify the (process-wide) signal handler.