-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpaper.tex
899 lines (741 loc) · 38.5 KB
/
paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
% JuliaCon proceedings template
\documentclass{juliacon}
\setcounter{page}{1}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{cleveref}
\usepackage{microtype}
% \usepackage{pifont}
%% extra links:
%% - GPU Julia: https://cuda.juliagpu.org/stable/tutorials/introduction/
%% - GPU x2: https://nextjournal.com/sdanisch/julia-gpu-programming
%% - NaN vs missing https://discourse.julialang.org/t/is-there-any-reason-to-use-nan-instead-of-missing/84396
\begin{document}
% Misc typographical tweaks.
\setlength{\parindent}{10pt}
% \setmonofont{JuliaMono}[Extension = .ttf, Path = ./, UprightFont = *-Regular, ItalicFont = *-Italic]
\input{header}
\maketitle
\begin{abstract}
Reliable numerical computations are central to scientific computing,
but the floating-point arithmetic that enables large-scale models
is error-prone.
Numeric exceptions are a common occurrence and can propagate through
code, leading to flawed results.
This paper presents FlowFPX, a toolkit for systematically debugging
floating-point exceptions by recording their flow,
coalescing exception contexts, and fuzzing in select locations.
These tools help scientists discover when exceptions happen
and track down their origin, smoothing the way to a reliable codebase.
\end{abstract}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\FlowFPX}{FlowFPX}
\newcommand{\GPUFPX}{GPU-FPX}
\newcommand{\juliapkg}[1]{\texttt{#1.jl}}
\newcommand{\TrackedFloats}{\juliapkg{TrackedFloats}}
\newcommand{\TF}{\TrackedFloats}
\newcommand{\ShallowWaters}{\juliapkg{ShallowWaters}}
\newcommand{\OrdinaryDiffEq}{\juliapkg{OrdinaryDiffEq}}
\newcommand{\RxInfer}{\juliapkg{RxInfer}}
\newcommand{\Sherlogs}{\juliapkg{Sherlogs}}
\newcommand{\NBodySimulator}{\juliapkg{NBodySimulator}}
\newcommand{\Oceananigans}{\juliapkg{Oceananigans}}
\newcommand{\Fp}{Floating-point}
\newcommand{\fp}{floating-point}
\newcommand{\CSTG}{stack graph}
\newcommand{\CPP}{\code{C++}}
\newcommand{\Dendro}{\textsc{Dendro}}
\newcommand{\urlaccess}[2]{\url{#1}}
\newcommand{\Nan}{\code{NaN}}
\newcommand{\NaN}{\Nan}
\newcommand{\Inf}{\code{Inf}}
\newcommand{\zerowidth}[1]{\makebox[0pt][l]{#1}}
\newcommand{\zerocode}[1]{\zerowidth{\code{#1}}}
\newcommand{\genpropkill}{\emph{gen}-\emph{prop}-\emph{kill}}
\newcommand{\bigcheckmark}{\ding{51}}
% \newcommand{\bigxmark}{\ding{53}}
\newcommand{\tblnext}{\(\Rightarrow\)}
% \newcommand{\tblY}{\mbox{\bigcheckmark}}
% \newcommand{\tblN}{\scalebox{0.8}{\bigxmark}}
\newcommand{\tblY}{\checkmark}
\newcommand{\tblN}{\scalebox{1.2}{$\times$}}
\section{Introduction}
Reliable numeric computations are central to high-performance computing,
machine learning, and scientific applications.
Yet the \fp{} arithmetic that powers these computations is fundamentally
unreliable~(\cref{s:background}).
Exceptional values, such as Not a Number (\Nan{}) and infinity (\Inf{}),
can and often do arise thanks to culprits such as roundoff error,
catastrophic cancellation, singular matrices, and vanishing
derivatives~\cite{bllmg-xloop-2022,ddghlllprr-correctness-2022,fpchecker-reports,gllprt-correctness-2021,llg-soap-2022,sdjmrstp-pc-2022}.
Developers are responsible for guarding against exceptions, but this task
is difficult because many operations can generate and propagate exceptions.
Worst of all, operations such as less-than (\code{<}) can kill an exceptional value,
leaving no trace of the problem.
There is little tool support to assist in exception debugging,
thus (unsurprisingly) a quick GitHub search reports over 4,000 open issues
related to \NaN{} or \Inf{} exceptions~\cite{github-issues}.
This paper introduces \FlowFPX{}, a toolkit for debugging
\fp{} exceptions~(\cref{s:flowfpx})
that has helped improve a variety of applications,
from ocean simulations to heat modes~(\cref{s:casestudies}).
\FlowFPX{} consists of three tools that can work together to debug \fp{} computations:
\begin{enumerate}
\item
The centerpiece of \FlowFPX{} is \TF{}, a dynamic analysis tool written in Julia
for Julia code that monitors exceptions and fuzzes for vulnerabilities.
\item
To visualize results, \FlowFPX{} uses coalesced stack trace graphs (CSTGs, or
\CSTG{}s)~\cite{humphreySystematicDebuggingMethods2014}.
Our CSTG library is written in \CPP{} and accepts text-format stack traces as input.
\item
For programs that offload work to CUDA GPUs,
the binary instrumentation tool \GPUFPX{}~\cite{llsflg-hpdc-2023}
adds logging to kernels.
\end{enumerate}
\noindent%
The toolkit is online: \url{https://utahplt.github.io/flowfpx}
\section{Floating-Point Exception Primer}
\label{s:background}
\begin{figure}[t]\centering
\includegraphics[trim=10 0 10 0,clip,width=0.95\columnwidth]{fig/real_vs_fp.pdf}
\caption{Floats are spread across the real number line}
\label{f:real-vs-fp}
\end{figure}
\Fp{} numbers use a finite number of bits to represent a spectrum of points along the real number line~(\cref{f:real-vs-fp}).
The implementation strategy is essentially that followed in scientific notation.
A \fp{} number packs a sign bit, an exponent, and a fraction part (also called the ``significand'' or ``mantissa'') into a bit string.
Typical strings are 64 or 32 bits long, but 16-bit and 8-bit formats are on the rise~\cite{klowerLowprecisionClimateComputing2021,fp8,p3109}.
This representation supports very small and very large numbers in a narrow range of bits:
\begin{lstlisting}[language = Julia]
julia> floatmax(Float64)
1.7976931348623157e308
julia> floatmin(Float64)
2.2250738585072014e-308
\end{lstlisting}
The flip side is that most real numbers fall into the gaps between \fp{} numbers and must be rounded, which introduces error
and can lead to surprising results.
For example, adding the tiny Planck constant to the large Avogadro constant results in
the Avogadro constant after rounding:
\begin{lstlisting}[language = Julia]
julia> planck = 6.62607015e-34
6.62607015e-34
julia> avogadro = 6.02214076e23
6.02214076e23
julia> avogadro + planck == avogadro
true
\end{lstlisting}
This is an extreme example, but many operations on \fp{} numbers closer in magnitude induce a loss of accuracy.
Refer to the literature for more details, e.g.,~\cite{knuthArtComputerProgramming1997,torontoPracticallyAccurateFloatingPoint2014,mullerHandbookFloatingPointArithmetic2018}.
\subsection{Exceptions and Exceptional Values}
\label{s:exnvalue}
The IEEE~754 \fp{} standard~\cite{IEEEStandardBinary1985}
defines exceptions and exceptional values as the outcome of
operations that have ``no single universally acceptable result''~\cite{p-draft-1997}.
For example, dividing by zero and exceeding the \code{Float64} range
both lead to exceptions:
\bigskip
% julia> 2 / 0
% Inf
\begin{lstlisting}[language = Julia]
julia> 0 / 0
NaN
julia> avogadro^avogadro
Inf
julia> log(0)
-Inf
julia> Inf + NaN
NaN
\end{lstlisting}
IEEE~754 defers the question of how to handle such exceptions to application code.
It is up to developers to watch for \Nan{}, \Inf{}, and subnormal numbers (underflow)
and implement an appropriate repair.
This is a difficult task because of the approximations inherent to
floating point.
Even an apparently-safe division could result in a \Nan{} if its
denominator gets truncated to zero.
Some \Nan{}s might be spurious while others might be fatal, but in any event
anticipating the various exceptions is a burden.
All too often, exceptional values go unhandled and flow through the code.
\subsection{Lifetime of an Unhandled Exceptions}
\label{s:to-kill-a-fp}
\begin{figure}[t]
\includegraphics[width=\columnwidth]{fig/genpropkill-outline_vector.pdf}
\caption{Gen, Prop, Kill: Lifetime of an exceptional value}
\label{f:gpk}
\end{figure}
Unhandled exceptional values have a lifetime: they are born, or \emph{generated}, by some operation; they \emph{propagate} through other operations; and they either appear in the program output, go out of scope, or get \emph{killed} by a numeric operation.
\Cref{f:gpk} summarizes this \genpropkill{} process.
The gens and props are straightforward; see above for examples~(\cref{s:exnvalue}).
The kills often arise from numeric comparisons (\code{<}, \code{=}, etc.),
but exponents (\code{1{\string^}NaN}) and over-eager matrix
optimizations~\cite{ddghlllprr-correctness-2022} can kill exceptions as well.
To illustrate the perils of killed exceptions, consider the following two ways
of finding the maximum value in a list.
The first compares numbers with \code{<=}
while the second uses the built-in \code{max} function:
\begin{lstlisting}[language = Julia]
function max1(lst)
max_seen = 0.0
for x in lst
# swap if x is not too small
if ! (x <= max_seen)
max_seen = x
end
end
max_seen
end
function max2(lst)
foldl(max, lst)
end
\end{lstlisting}
For lists with a \Nan{} inside, the functions can give different
results because \code{<=} kills \Nan{}s whereas \code{max}
propagates them:
\begin{lstlisting}[language = Julia]
julia> max1([1, 5, NaN, 4])
4.0
julia> max2([1, 5, NaN, 4])
NaN
\end{lstlisting}
Not only is the result from \code{max1} problematic for obscuring the fact that there was a \NaN{} in the list, the result is arguably wrong!
The result from \code{max2} at least shows that a \Nan{} was in the works, though in a realistic setting it may not be clear where the \Nan{} came from.
Both versions would thus benefit from tools that track exceptions across their lifetime.
FlowFPX can help.
\section{\FlowFPX{}}
\label{s:flowfpx}
\FlowFPX{} (FPX: Floating Point eXception) is a toolkit for tracking down \fp{} exceptions.
The primary tools in the \FlowFPX{} toolkit are \TF{}, which records lifetimes
and enables fuzzing, and \CSTG{}s (more precisely, CSTGs), which visualize
the flow of exceptions.
\GPUFPX{} is a third component of \FlowFPX{} that tracks \fp{} exceptions inside GPU kernels~\cite{llsflg-hpdc-2023}.
This section covers each tool in detail.
\subsection{\TF{}}
\label{s:floattracker}
\TF{} tracks exceptional values in Julia programs across
their \genpropkill{} lifetime and can fuzz code for vulnerabilities by
injecting a \Nan{} or \Inf{} as the result of an operation. \TF{} is
implemented in Julia as a library and is available on JuliaHub:
\noindent\begin{center}\noindent\!\!\url{https://juliahub.com/ui/Packages/TrackedFloats}\!\!
\end{center}
\subsubsection{Tracking Exceptional Values}
\label{s:trackingexceptionalvalues}
\begin{figure}[t]\centering
\begin{tabular}{ccccc}
Exn. Input? & \tblnext & Exn. Output? & = & Event \\ \hline
\tblN & \tblnext & \tblY{} & = & {gen} \\
\tblY & \tblnext & \tblY{} & = & {prop} \\
\tblY & \tblnext & \tblN{} & = & {kill} \\
\end{tabular}
\caption{How to classify operations that see exceptions}
\label{fig:lifetime-class}
\end{figure}
\TF{} monitors \Nan{} and \Inf{} exceptions by overloading arithmetic
operations and logging key events.
When the input to an operation is exception-free but the output is
exceptional, the operation is a {gen} event~(\cref{fig:lifetime-class}).
When the input and output contain exceptions, the operation is a {prop}
event.
And when the input contains an exception but the output does not,
the operation is a {kill} event.
Put together, the log of all {gen}s, {prop}s, and {kill}s
sheds light of how various exceptions traveled across the program.
The logs also record the call context (stack trace) and the arguments to the operation as a starting point for debugging efforts.
\TF{} writes logs to three files: one for {gen} events, one for {prop}s, and one for {kill}s.
This way, discovering where a \Nan{} came from is a matter of sifting through the {gen} file.
Users can also lower the overhead of logging by turning it off for prop events.
The instrumentation works through custom \fp{} types: \code{TrackedFloat64},
\code{TrackedFloat32}, and \code{TrackedFloat16}.
Developers must opt in to \TF{} by wrapping numbers in a custom type.
From then on, tracking is automatic and extends transitively to all outputs
of tracked operations.
Multiplying a \code{Float64} value with a \code{TrackedFloat64}, for example, yields
a \code{TrackedFloat64} to continue the logging trail.
Crucially,
\TF{} does nothing to modify the bit-level representation of floats or exceptions.
The tracked version of, say, a \NaN{}, points to the original exceptional
value.
Thus \NaN{}-packing and other creative uses of the payload continue to
work---provided the code does not use some other Julia operation
that does not preserve \NaN{}s bit-for-bit.\footnote{\url{https://github.com/JuliaLang/julia/issues/48523}}
\subsubsection{Fuzzing}
Julia's operator overloading lets us fuzz code from the inside
out.
Each overloaded function serves as a hook where \TF{} can decide whether to observe
the operation or replace the original result with an exceptional value.
Injecting faults in a random way~\cite{hamlet1994random}, also known as fuzzing,
can discover vulnerabilities in a large codebase.
Demmel et~al. propose essentially the same idea for BLAS and
LAPACK~\cite{ddghlllprr-correctness-2022}.
When fuzzing is enabled, every time \TF{} intercepts a \fp{}
operation, it decides whether or not it should return a \Nan{} instead
of the actual computed value.
This decision is based on several parameters:
\begin{itemize}
\item $\,$ \texttt{active::Bool} fuzz only when set to true.
\item $\,$ \texttt{odds::Int64} inject at each candidate location with probability $1 / \texttt{odds}$.
\item $\,$ \texttt{n\_inject::Int64} upper bound on the number of \Nan{}s to inject.
\item $\,$ \texttt{functions::Array{String}} limit fuzzing to the dynamic extent of the listed functions.
\item $\,$ \texttt{libraries::Array{String}} limit fuzzing to functions from the following libraries
\end{itemize}
\TF{} looks at these parameters in conjunction with stack trace at the
point where the operation was intercepted to make the decision.
When fuzzing is active (\texttt{active == true})
and there are \Nan{}s remaining to be injected (\texttt{n\_inject > 0})
and \texttt{rand(1:odds)} returns \texttt{1}, then \TF{} inspects the
current stack trace to see if the point is in the dynamic extent of
the declared \texttt{functions} or \texttt{libraries} (if any).
When all these conditions are met,
\TF{} returns a \NaN{} instead of calling the overloaded operation.
The value \texttt{n\_inject} is decremented before the next operation
gets intercepted.
Every intercepted operation is a candidate for injection,
which means that injected \Nan{}s can appear deep
inside of an algorithm instead of merely at its toplevel inputs.
This can reveal vulnerabilities that are difficult to catch
with input-based testing.
The parameters \texttt{functions} and \texttt{libraries} let
users control how deep to fuzz; for example, it is often helpful
to test specific libraries and avoid trusted standard ones.
When fuzzing reveals an error, the next step is to craft a regression test to
guide repairs.
\TF{} therefore records the sequence of injections that it makes during a fuzzing
run and enables a replay of any recording after the fact.
Replay runs proceed deterministically so that developers can harden
their code and check that the fixes remove the error.
When running a replay, each intercepted operation consults the reply
file (instead of the stack trace and fuzzing parameters) to decide
whether or not to return a \Nan{} from the computation.
\subsubsection{Internals}
\TF{} takes advantage of Julia's operator overloading to track exceptions and
fuzz for vulnerabilities.
For example, below is a variant of \code{+} that is overloaded
for \code{TrackedFloat64}.
It calls the basic \code{+} (or injects a \Nan{}) and checks for exceptions
before returning:
This lets \TF{} intercept all \fp{} operations involving \texttt{TrackedFloat} types.
\begin{lstlisting}[language = Julia]
function +(x::TrackedFloat64, y::TrackedFloat64)
result = run_or_inject(+, x.val, y.val)
check_error(+, result, x.val, y.val)
TrackedFloat64(result)
end
\end{lstlisting}
Every arithmetic operation requires a similar overloading.
The implementation of \TF{} uses a metaprogramming technique
adapted from \texttt{Sherlogs}~\cite{kMilanklSherlogsJl2021}
to abstract over the common patterns.
For every binary operation, the library creates an overloading similar
to the one for \code{+} above.
Unary operations and others work analogously.
This approach saved thousands of lines of code.
The implementation weighs in at 218 lines and defines 645 overloaded function;
assuming 5 lines of code per function, a handwritten version would require over 3,000 lines.
\subsection{Stack Graphs}
\label{s:cstg}
\TF{} can produce copious amounts of log files which can be challenging to sift through manually.
Coalesced stack trace graphs (CSTGs or \CSTG{}s for short) provide a way to visualize large amounts of stack traces in a compact form~\cite{humphreySystematicDebuggingMethods2014}.
This technique pairs well with \TF{} and eases the task of analyzing log files.
\Cref{fig:cstg_demo} illustrates the construction of a \CSTG{} from a collection of stack traces.
Each trace on the left contributes nodes and edges to the graph on the right.
Repeated edges get emphasized with darker lines and larger counts.
\begin{figure}[t]
\centering
\includegraphics[width=0.9\columnwidth]{./fig/cstg_static_diagram.pdf}
\caption{From stack traces (left) to \CSTG{} (right)}
\label{fig:cstg_demo}
\end{figure}
Reading bottom-up, a \CSTG{} based on the \emph{gen} events in a program
highlights the contexts that frequently produced exceptional values.
Using~\cref{fig:cstg_demo} as an example, \code{Frame D} produced
three exceptions, two of which arose under \code{Frame C}.
In a large program with many exceptions, \CSTG{}s offer a way to prioritize
debugging efforts: go for the heavily-trodden paths first.
\subsection{\GPUFPX{}}
\label{s:gpufpx}
Many programs offload work onto GPUs, which are no less susceptible to \fp{} exceptions than CPUs.
In fact, GPU programs are worse off because they lack exception-handling mechanisms from the CPU world~\cite{llg-soap-2022}.
Since \TF{} instruments Julia programs, it cannot help directly;
however, the companion tool \GPUFPX{} instruments GPU kernels
(specifically CUDA) to detect and report
\fp{} exceptions~\cite{llsflg-hpdc-2023}.
There is no formal connection between \TF{} and \GPUFPX{}.
Developers who choose to apply both tools to the respective Julia and CUDA
components of their code can gain insights for accelerated programs.
\section{Case Studies}
\label{s:casestudies}
\FlowFPX{} has helped to debug exceptions and fuzz for issues in a variety
of settings, some synthetic and some realistic.
The case studies include a shallow water simulation, the
\OrdinaryDiffEq{} solver, and a Bayesian inference library.
\subsection{\ShallowWaters{}}
\label{s:sw}
\ShallowWaters{} simulates the flow of water over a
seabed~\cite{klowerNumberFormatsError2020,klowerPositsAlternativeFloats2019}.
The library has dozens of parameters that a scientist can experiment with.
One notable parameter is the Courant-Friedrichs-Lewy (CFL) number, which roughly describes the size of the time step to take in running the simulation.
A small CFL number makes the simulation run slowly, but accurately; a large number speeds it up but loses precision
because the system does not get enough time to propagate information.
Normal values for the CFL number range between zero and one.
With a CFL of 0.9, the shallow water simulation produces the graph on the left half
of~\cref{fig:sw_nans} showing current speed in a rectangular basin.
Raising the CFL too high, to 1.6, causes trouble
(\cref{fig:sw_nans}, right) with large, white regions where the
current speed is \NaN{} instead of a normal value.
Running \TF{} on the high-CFL simulation reveals where the simulation drifted
from numbers to \NaN{} exceptions.
The first step in applying \TF{} is to convert relevant floats to tracked
floats, e.g., \code{Float32} to \code{TrackedFloat32}.
For \ShallowWaters{}, this step is easy because the simulation is parameterized by
a \fp{} type for internal use.
Swapping in a tracked type is enough:
\begin{lstlisting}[language = Julia]
run_model(T=TrackedFloat32,
cfl=10,
nx=100,
Ndays=100,
L_ratio=1,
bc="nonperiodic",
wind_forcing_x="double_gyre",
topography="seamount")
\end{lstlisting}
With tracking, the simulation runs as before, producing an ugly graph.
It additionally outputs logs for all gen, prop, and kill events
as they happen.
Below is one event from the gen logs:
\begin{lstlisting}
-([-Inf, -Inf]) TF/TrackedFloat.jl:106
momentum_u! SW/rhs.jl:246
rhs_nonlinear! SW/rhs.jl:50
rhs! SW/rhs.jl:14 [inlined]
time_integration SW/time_integration.jl:77
run_model SW/run_model.jl:37
top-level
\end{lstlisting}
We can see that the \NaN{} appeared as the result of subtracting two infinities (\code{-Inf - -Inf}).
The trace further shows that this subtraction happens inside the function \texttt{momentum\_u!} on line 246 of the \texttt{rhs.jl} file.
This solves the mystery of where one \NaN{} came from, but raises a new question
about the source of the \Inf{} value.
The logs for \Inf{} gens have an answer:
\begin{lstlisting}
^([-1.515f31, 2]) TF/TrackedFloat.jl:138
literal_pow() intfuncs.jl:325
...
materialize(^) broadcast.jl:860
top-level getproperty(...) examples/sw_nan_tf.jl:14
\end{lstlisting}
%
This \Inf{} came from an exponent that overflowed the float type (\code{-1.515e31\string^2}).
\TF{} has shown exactly which numeric values in which operation
caused exceptions to occur.
\begin{figure}[t]
\centering
\includegraphics[width=3in]{./fig/shallow_waters_cfl_diff.pdf}
\caption{Raising the CFL number creates white gaps due to \NaN{}s}
\label{fig:sw_nans}
\end{figure}
\subsubsection{Stack Graphs for a Bigger Picture}
While the logs for \ShallowWaters{} contain useful information,
there is an overwhelming amount of it.
There are over ten thousand lines in the gen file alone.
Converting these logs to a stack graph gives a quick overview
of the most common paths to exceptional values.
\Cref{fig:sw_nan_cstg} presents the stack graph for \NaN{} gen events
in \ShallowWaters{} with the high CFL number.
Reading bottom-up, every \NaN{} came from calls to the `\code{-}' and `\code{+}' operations.
Calls to `\code{+}' account for most of the \NaN{}s.
These \NaN{}s arose in two different contexts:
a momentum calculation (30~\NaN{}s) and
a continuity step (162~\NaN{}s).
Moving to the top of the graph, it shows that the function \code{run\_model} drove
the entire simulation.
With this overview of the program, a promising next step is to guard against \NaN{}s
in the momentum and continuity functions.
\begin{figure}[t]
\centering
\includegraphics[width=0.96\columnwidth]{fig/sw_nan_cstg_clean.pdf}
\caption{Stack graph for \NaN{} gens in \ShallowWaters{}}
\label{fig:sw_nan_cstg}
\end{figure}
% \begin{figure}[t]
% \centering
% \includegraphics[width=3in]{fig/sw_inf_cstg_clean.png}
% \caption{CSTG of \Inf{} generation logs from the same run as~\cref{fig:sw_nan_cstg}.}
% \label{fig:sw_inf_cstg}
% \end{figure}
\subsubsection{Stack Graph Differences}
Graph diffing works well for \CSTG{}s; it shows how flows in the program evolved
from one stage to another.
\Cref{fig:cstg_diff_demo} illustrates one diff in the context of \NaN{} gens for
\ShallowWaters{}:
it compares the first 10\% of gen events to the latter 90\% of gens.
The positive numbers and green lines indicate flows that are new in the latter part,
and the negative numbers and red lines show flows that disappeared in the latter part.
A domain expert might use these clues to find where an instability started in
the first part of the program.
In this case, the function \texttt{momentum\_u!} showed up only in the
beginning of the logs---it may be an effective point to check for \Nan{}s.
\begin{figure}[t]
\centering
\includegraphics[width=0.96\columnwidth]{./fig/cstg_diff_pretty.pdf}
\caption{Stack graph diff}
\label{fig:cstg_diff_demo}
\end{figure}
\subsection{Fuzzing from NBody to ODE}
\label{s:ode}
We applied the fuzzing abilities of \TF{} to the
\NBodySimulator{} package.\footnote{\url{https://github.com/SciML/NBodySimulator.jl}}
To avoid injecting in the standard library or other dependencies,
we initially configured the fuzzer to inject \NaN{}s {only} when inside of a
function from the simulator.
Surprisingly, fuzzing injected \emph{zero} exceptional values even
when the fuzzer's odds were configured to force injection at the
first available opportunity.
It turns out that all the \fp{} operations for the simulation happened
within the \OrdinaryDiffEq{}
solver.\footnote{\url{https://github.com/SciML/OrdinaryDiffEq.jl}}
Fuzzing on \OrdinaryDiffEq{} led to a curious situation.
The library itself reported a \NaN{} and printed a message stating that it would exit.
However, after printing that message, the program went into an infinite loop.
Using \CSTG{}s to guide our search, we found a \NaN{} kill that manifested repeatedly in the logs.
The stack traces for that kill originated from inside the file \texttt{solve.jl}:
\begin{lstlisting}
<([NaN, 3.0e6]) at TF/TrackedFloat.jl:193
solve! at ODE/solve.jl:515
...
\end{lstlisting}
The relevant part of \texttt{solve.jl} contains a pair of loops.
With injection, the variable \code{tdir} holds a \NaN{},
stopping all productivity:
% file here: ~/Research/ode_debug/dev/OrdinaryDiffEq/src/solve.jl
% see line 514
\begin{lstlisting}[language = Julia]
while !isempty(time_stops)
while tdir * t < first(time_stops)
# do integration work
# pop_off(time_stops)
end
end
\end{lstlisting}
In more detail, the \NaN{} for \code{tdir} propagates though the multiplication
and gets killed by the \code{<} comparison.
Hence the condition for inner \code{while} loop is always false,
which means the outer loop never ends up with an empty list.
This is a real-world example of how \NaN{} kills can affect control flow, and we filed
an issue for it.\footnote{\url{https://github.com/SciML/OrdinaryDiffEq.jl/issues/1939}}
Fortunately, the problem in this case is benign as the code was already trying to halt.
\subsection{Finch}
\label{s:finch}
Finch is a domain-specific language for specifying
PDEs~\cite{heislerFinchDomainSpecific2022}.\footnote{Not to be confused
with the loop optimizer \juliapkg{Finch}~\cite{adka-cgo-2023}.}
In the spirit of FEniCS~\cite{fenics} and related
tools~\cite{freefem,openfoam,dune,firedrake},
Finch helps scientists quickly convert math into code.
What sets Finch apart is its flexibility.
It supports multiple discretization methods (finite element and finite
volume) and multiple backends (Julia, \CPP{}, \Dendro{}~\cite{dendro}).
Furthermore it strives to output code that humans can easily fine-tune.
Fuzzing with \TF{} revealed two places where Finch needed
protection against user input.
The first was when reading an input mesh.\footnote{\urlaccess{https://github.com/paralab/Finch/issues/16}{2023-06-06}}
A \NaN{} injected in the mesh led to a crash further on:
\begin{lstlisting}
BoundsError: attempt to access 1-element
Vector{Int64} at index [2]
\end{lstlisting}
Before accessing the input, Finch needed to check for \NaN{}s.
The second place was in setting bounds for the
solver.\footnote{\urlaccess{https://github.com/paralab/Finch/issues/17}{2023-06-06}}
Here, a \NaN{} could leave bounds uninitialized, leading to a bounds error.
Additionally, \TF{} and \CSTG{}s have been useful for identifying \NaN{}s that
appear in unstable heat simulations written in Finch.
%% advection2d, high cfl and high T (time) led to NaN during simulation
\subsection{\Oceananigans{}}
\label{s:ocean}
\Oceananigans{}~\cite{OceananigansJOSS} is simulation package for incompressible
fluid dynamics that can generate code for Nvidia GPUs.
For example, the following program (from the project readme) simulates turbulence:
\begin{lstlisting}[language = Julia]
using Oceananigans
grid = RectilinearGrid(GPU(),
size=(128, 128), x=(0, 2π), y=(0, 2π),
topology=(Periodic, Periodic, Flat))
model = NonhydrostaticModel(; grid,
advection=WENO())
ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)
simulation = Simulation(model;
Δt=0.01, stop_time=4)
run!(simulation)
\end{lstlisting}
\GPUFPX{} provides detailed feedback on this program.
The output in~\cref{f:gpufpx} shows that 21 kernels
appear and generate six floating-point exceptions.
There are three \NaN{}s, one \Inf{}, and two division
by zero errors.
The report is a starting point for further investigation
of the reliability of the example.
%% TODO exactly how?
\begin{figure}[t]\centering
\begin{tabular}[t]{ll}
\begin{tabular}[t]{lr}
\zerocode{-{}- FP64 Operations -{}-} \\
\code{Total NaN:} & \code{2} \\
\code{Total INF:} & \code{1} \\
\code{Total subnormal:} & \code{0} \\
\code{Total div0:} & \code{2} \\
\end{tabular}
&
\begin{tabular}[t]{lr}
\zerocode{-{}- FP32 Operations -{}-} \\
\code{Total NaN:} & \code{1} \\
\code{Total INF:} & \code{0} \\
\code{Total subnormal:} & \code{0} \\
\code{Total div0:} & \code{0} \\
\end{tabular}
\end{tabular}
\begin{tabular}{lr}
\zerocode{-{}- Other Stats -{}-} \\
\code{Kernels:} & \code{21}
\end{tabular}
\caption{Example \GPUFPX{} output}
\label{f:gpufpx}
\end{figure}
\subsection{\RxInfer{}}
\label{s:safari}
We discovered an open issue in the \RxInfer{} library related to \NaN{} detection
and suggested \TF{} to the developers.\footnote{\url{https://github.com/biaslab/RxInfer.jl/issues/116}}
Within a day, they were able to track down the location of an important \NaN{} gen.
This success came with little input from our end; in fact, the application program was
closed-source.
We merely explained how to use \TF{} by wrapping inputs in a tracked type.
\section{Related Work}
\label{s:related}
\subsection{Error Analysis}
Demmel et~al.~\cite{ddghlllprr-correctness-2022} examine \fp{} exceptional value handling in BLAS and LAPACK, identify several inconsistencies, and propose an API for debugging and adding determinism.
The proposal includes an extension to the \texttt{INFO\_ARRAY} parameter with fields that record \genpropkill{} information.
It also includes fuzzing, which we realize in \TF{}.
Toronto and McCarthy~\cite{torontoPracticallyAccurateFloatingPoint2014} propose a test-driven method for detecting numeric error: plot the results of an expression on a range of inputs and look for sharp deviations, or \emph{badlands}.
They point out several ways to rewrite code to avoid badlands by rewriting code in a semantics-preserving way.
Herbie~\cite{panchekhaAutomaticallyImprovingAccuracy2015} automatically rewrites arithmetic expressions to reduce \fp{} error.
This would combine well with \TF{}: first identify the location of a \NaN{}, ask Herbie to find a repair.
The Odyssey~\cite{misbackOdysseyInteractiveWorkbench2023} workbench provides an interactive interface to Herbie.
\subsection{Diagnosing floating-point exceptions}
The Julia library \Sherlogs{}~\cite{kMilanklSherlogsJl2021} inspired our use of a custom number type to intercept operations on a number.
In contrast to \TF{} which monitors for exceptional values and logs stack traces at interesting points in their lifetime, \Sherlogs{} tracks and reports the range of values seen over the course of a computation.
This is intended to provide insight into whether or not a library could tolerate a lower-precision \fp{} format.
% TODO: make sure that what I said about what FPSpy can track is true.
FPSpy~\cite{dindaSpyingFloatingPoint2020} is an \texttt{LD\_PRELOAD} shared library that works on unmodified x86 binaries.
It monitors a program during execution for operations that generate an exception, such as division by zero, underflow, and overflow.
By contrast to \TF{}, it does not track prop or kill events.
FPSpy has the advantage of being lightweight enough to run on production code for certain loads.
\subsection{Stack Graphs}
Our stack graphs utilize the CSTG library for coalesced stack trace graphs~\cite{humphreySystematicDebuggingMethods2014}.
In turn, CSTGs build on the STAT tool from LLNL~\cite{arnoldStackTraceAnalysis2007}.
STAT collects, analyzes, and visualizes stack traces from concurrent processes
to highlight anomalies.
It produces visualizations similar to those of CSTG, thought CSTG offers more compact views and supports diffs.
\subsection{GPU Exception Tracking}
FPChecker~\cite{l-ase-2019} is a tool to report \fp{} exceptions occurring on the GPU.
FPChecker relies on LLVM-level instrumentation of GPU kernels, and so cannot run on the plethora of closed-source GPU kernels in usage today.
BinFPE~\cite{llg-soap-2022} is another tool in the same space;
BinFPE performs SASS-level analysis of GPU kernels, but is limited in that it is slow and does not catch errors that alter control flow.
The latter deficiency is particularly worrisome, as we have seen, silent \NaN{} kills can invalidate results without the user noticing.
\GPUFPX{}~\cite{llsflg-hpdc-2023} improves on the work of FPChecker and BinFPE by being more performant and catching a wider set of errors, including those that alter control flow.
\GPUFPX{} is a shared library that, like FPSpy uses \texttt{LD\_PRELOAD} to work on unmodified binaries.
\GPUFPX{} runs on CUDA cores from NVIDIA and reports total numbers of exceptional values.
Like \TF{}, \GPUFPX{} can catch \NaN{} gens and kills, but it does this on the GPU where \TF{} doesn't apply.
Despite these improvements, \GPUFPX{} is limited by the closed-source nature of common GPU cores, and cannot report at the rich level of source detail that \TF{} can.
% % Herbie, improving accuracy automatically.
% % ParetoHerbie
% %% http://eprints.maths.manchester.ac.uk/2873/1/fami22.pdf
% %% https://drive.google.com/drive/folders/1HkgMpxi6hsqLSZj9tSnuCxnFExPPjmvx?usp=sharing
% %% https://docs.google.com/presentation/d/1ueqzb5QY9YHmtY2DfY_hpN9oufSOcXErLWsJOZQxgzo/edit?usp=sharing
% %% https://ieeexplore.ieee.org/document/8916392
% %% https://link.springer.com/chapter/10.1007/978-3-319-73721-8_24
% FPChecker, " considers CPU OpenMP and MPI codes, again using LLVM
% instrumentation"~\cite{ltlg-iiswc-2022}
% % FPChecker has a nice table of tools; some already mentioned here
% 6-8,24,25
% Arnab Das scalable yet rigorous
% Marc Daumas Certification of Bounds Expression
% David Delmas Towards an Industrial Use of
% Alexey Solovyev FPTaylor Results table
% Titolo absint
% GPU-FPX state of art, companion project~\cite{llsflg-hpdc-2023}.
% Latest in line of work.
% BinFPE exn checking, slower than GPU-FPX,
% \cite{llg-soap-2022}.
% FPChecker, early pioneer, fpx in gpus, llvm-level instrumentation
% if compiled with Clang, but most gpus use closed NVCC compiler~\cite{l-ase-2019}.
\section{Discussion}
\label{s:discussion}
% future work
Lightweight tools for error analysis that can quickly identify
\fp{} problems and suggest repairs are an important topic.
The number of scale of scientific application has grown tremendously
over the years.
For small teams that cannot afford a full-time
analyst, tools like \FlowFPX{} fill a critical role.
\FlowFPX{} it itself an evolving toolkit.
Below we discuss some topics for future work.
\subsection{Performance}
\label{s:discussion-performance}
\TF{} incurs significant overhead on the order of 100x slower than a non-instrumented run of the same program.
It is a debugging tool, not a production tool.
To put this number into context, Valgrind runs with a similar level of slowdown.
In addition to the cost incurred by intercepting \fp{} operations, gathering stack traces is expensive.
We observe a 10x slowdown on \ShallowWaters{} with logging disabled.
We recommend that users of \TF{} make use of the \texttt{maxLogs} and \texttt{exclude\_stacktrace} configuration options to limit the number and kind of logs gathered, and thereby reduce the number of calls to \texttt{stacktrace()}.
Stack traces are essential to decide where to inject a fault when fuzzing,
but we defer them as late as possible to maximize performance.
\subsection{Enhanced Fuzzing}
While fuzzing is useful for discovering issues, its success rate
is low because {every} \fp{} operation is a candidate
for injection.
Even operations that are already well-defended against \NaN{}s are candidates.
\TF{} could use two sorts of tools for improving injection.
First, fine-grained control to let users decide where not to inject.
Second, tools for understanding the context of an injection point
after the fact.
Program slicing is especially relevant to the latter point and effectively
what we did by hand when fuzzing Finch (\cref{s:finch}).
For each operation, an expert needs to study the values that feed into it to
decide whether they are protected or not.
%% TODO example, start here at glbvertex https://github.com/paralab/Finch/blob/master/src/grid.jl#L360
\subsection{Tracking Exceptions in External Libraries}
\TF{} is limited to Julia code.
It cannot track the lifetime of exceptional values in external libraries, such as GPU kernels or C programs.
For GPUs it relies on \GPUFPX{}, through the connection between these tools is loose.
In the future, \TF{} would benefit from an API to plug in tools for external libraries, gather their output,
and present a comprehensive view of exceptions in a multi-language program.
\subsection{Interface Concerns, Multi-Threading}
\TF{} could be extended to monitor events and values that go beyond exceptions,
such as very-large or very-small numbers according to bounds supplied by the user.
On a similar note, the interface to \TF{} is primitive: users express interest in a number by wrapping it
in a constructor such as \code{TrackedFloat64}.
Helper functions that track data structures and provide a default float size would
make experimentation easier.
\TF{} stores its configuration and event logs in global data structures,
making it unsafe for a multithreaded environment.
Adding locks would restore safety, but perhaps a richer interface is in order.
Individual threads may wish to configure local logging for better
data organization and performance.
\section{Acknowledgments}
Thanks to the \Sherlogs{} developers for inspiring the architecture of \TF{}.
Thanks to the \RxInfer{} developers for testing \TF{} and for feedback on its user interface.
Thanks to Alex Larsen and Rob Durst for comments on an early draft.
This work is supported by
NSF grant \href{https://nsf.gov/awardsearch/showAward?AWD_ID=2030859&HistoricalAwards=false}{2030859}
to the CRA for the \href{https://cifellows2020.org}{CIFellows} project,
and DOE ASCR Award Number DE-SC0022252 and NSF CISE Awards 1956106 and 21241.
\input{bib.tex}
\end{document}
% Below is some Emacs-specific stuff to help me (Ashton) write this paper. If
% you're using Emacs, you should try using the new jinx package for
% spell-checking!
% Local Variables:
% jinx-local-words: "BinFPE CFL CSTG's CSTGs Courant Demmel FPChecker Friedrichs Herbie JuliaHub Klöwer Lewy OrdinaryDiffEq RxInfer ShallowWaters Sherlogs TrackedFloat isnan"
% citar-bibliography: nil
% End: