-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy pathmanual.html
1738 lines (1504 loc) · 79.9 KB
/
manual.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<html><head>
<title>CUP User's Manual</title></head><body>
<hr>
<img src="cup_logo.gif" alt="[CUP Logo Image]">
<hr>
<h1>CUP User's Manual</h1>
<h3><a href="http://www.cc.gatech.edu/gvu/people/Faculty/Scott.E.Hudson.html">
Scott E. Hudson</a><br>
<a href="http://www.cc.gatech.edu/gvu/gvutop.html">
Graphics Visualization and Usability Center</a><br>
<a href="http://www.gatech.edu/TechHome.html">
Georgia Institute of Technology</a></h3>
Modified by <a href="http://www.princeton.edu/%7Efrankf">Frank
Flannery</a>, <a href="http://www.pdos.lcs.mit.edu/%7Ecananian/">C. Scott Ananian</a>,
<a href="http://www.cs.princeton.edu/%7Edanwang">Dan Wang</a> with advice from
<a href="http://www.cs.princeton.edu/%7Eappel">Andrew W. Appel</a><br>
Now actualized by <a href="http://www2.cs.tum.edu/%7Epetter">Michael Petter</a><br>
and further by <a href="http://www.jochen-hoenicke.de/">Jochen Hoenicke</a>.
Last updated June 2011 (v0.12joho)
<hr>
<h3>CUP Parser Generator Copyright Notice, License, and Disclaimer</h3>
<p>
Copyright 2008-2011 by Jochen Hoenicke<br>
Copyright 2005-2006 by Michael Petter<br>
Copyright 1996-1999 by Scott Hudson, Frank Flannery, C. Scott Ananian
</p>
<p>
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted, provided that
the above copyright notice appear in all copies and that both the copyright
notice and this permission notice and warranty disclaimer appear in supporting
documentation, and that the names of the authors or their employers not be used
in advertising or publicity pertaining to distribution of the software without
specific, written prior permission.
</p>
<p>
The authors and their employers disclaim all warranties with regard to this
software, including all implied warranties of merchantability and fitness. In
no event shall the authors or their employers be liable for any special,
indirect or consequential damages or any damages whatsoever resulting from loss
of use, data or profits, whether in an action of contract, negligence or other
tortious action, arising out of or in connection with the use or performance of
this software.
</p>
<hr>
<h3>Table of Contents</h3>
<dl compact="compact">
<dt> i. </dt><dd> <a href="#about">About CUP Version 0.12</a></dd>
<dt> 1. </dt><dd> <a href="#intro">Introduction and Example</a></dd>
<dt> 2. </dt><dd> <a href="#spec">Specification Syntax</a></dd>
<dt> 2.1 </dt><dd> <a href="#package_spec">Package Specification</a></dd>
<dt> 2.2 </dt><dd> <a href="#code_part">Options and User Code Components</a></dd>
<dt> 2.3 </dt><dd> <a href="#symbol_list">symbol (terminal and non-terminal) lists</a></dd>
<dt> 2.4 </dt><dd> <a href="#precedence">Precedence</a></dd>
<dt> 2.5 </dt><dd> <a href="#production_list">The grammar</a></dd>
<dt> 2.6 </dt><dd> <a href="#wildcard">Optional and Iterated Symbols</a></dd>
<dt> 3.1 </dt><dd> <a href="#running">Running CUP</a></dd>
<dt> 3.2 </dt><dd> <a href="#ant">CUP and ANT</a></dd>
<dt> 4. </dt><dd> <a href="#parser">Customizing the Parser</a></dd>
<dt> 5. </dt><dd> <a href="#scanner">Scanner interface</a></dd>
<dt> 5.1 </dt><dd> <a href="#basic-symbols">Basic Symbol management</a></dd>
<dt> 5.2 </dt><dd> <a href="#advanced-symbols">Advanced Symbol management</a></dd>
<dt> 6. </dt><dd> <a href="#errors">Error Recovery</a></dd>
<dt> 7. </dt><dd> <a href="#conclusion">Conclusion</a></dd>
<dt> </dt><dd> <a href="#refs">References</a></dd>
<dt> A. </dt><dd> <a href="#appendixa">Grammar for CUP Specification Files</a></dd>
<dt> B. </dt><dd> <a href="#appendixb">A Very Simple Example Scanner</a></dd>
<dt> C. </dt><dd> <a href="#changes">Incompatibilites between CUP 0.9 and CUP 0.10</a></dd>
<dt> D. </dt><dd> <a href="#bugs">Bugs</a></dd>
<dt> E. </dt><dd> <a href="#version">Change log</a></dd>
</dl>
<a name="about">
<h3>i. About CUP Version 0.12</h3>
</a>
<p>Since version 0.9 there are many new changes and features. These changes
attempt to make CUP more like its
predecessor, YACC. As a result, the old 0.9 parser specifications for CUP are
not compatible and a reading of <a href="#changes">appendix C</a> of the new
manual will be necessary to write new specifications. The new version,
however, gives the user more power and options, making parser specifications
easier to write.</p>
<p>In Version 0.11 the TUM team tries to continue the success story of
CUP 0.10,
beginning with the introduction of generic data types for non-terminal symbols
as well as a modernisation of the user interface with a comfortable ANT plugin
structure.</p>
<p>In Version 0.12 Java 5 support is extended. Furthermore there are some
performance improvements in the CUP code and the generated parser. This
comes with changes to the internal structure of the parser so it may lead
to incompatibilities.</p>
<a name="intro">
<h3>1. Introduction and Example</h3></a>
<p>This manual describes the basic operation and use of the
Java<a href="#trademark">(tm)</a>
Based Constructor of Useful Parsers (CUP for short).
CUP is a system for generating LALR parsers from simple specifications.
It serves the same role as the widely used program YACC
<a href="#YACCref">[1]</a> and in fact offers most of the features of YACC.
However, CUP is written in Java, uses specifications including embedded
Java code, and produces parsers which are implemented in Java.</p>
<p>Although this manual covers all aspects of the CUP system, it is relatively
brief, and assumes you have at least a little bit of knowledge of LR
parsing. A working knowledge of YACC is also very helpful in
understanding how CUP specifications work.
A number of compiler construction textbooks (such as
<a href="#dragonbook">[2</a>,<a href="#crafting">3]</a>) cover this material,
and discuss the YACC system (which is quite similar to this one) as a
specific example. </p>
<p>Using CUP involves creating a simple specification based on the
grammar for which a parser is needed, along with construction of a
scanner capable of breaking characters up into meaningful tokens (such
as keywords, numbers, and special symbols).</p>
<p>As a simple example, consider a
system for evaluating simple arithmetic expressions over integers.
This system would read expressions from standard input (each terminated
with a semicolon), evaluate them, and print the result on standard output.
A grammar for the input to such a system might look like: </p>
<pre> expr_list ::= expr_list expr_part | expr_part
expr_part ::= expr ';'
expr ::= expr '+' expr | expr '-' expr | expr '*' expr
| expr '/' expr | expr '%' expr | '(' expr ')'
| '-' expr | number
</pre>
<p>To specify a parser based on this grammar, our first step is to identify and
name the set of terminal symbols that will appear on input, and the set of
non-terminal symbols. In this case, the non-terminals are:</p>
<pre><tt> expr_list, expr_part </tt> and <tt> expr </tt>.</pre>
<p>For terminal names we might choose:</p>
<pre><tt> SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, NUMBER, LPAREN,</tt>
and <tt>RPAREN</tt></pre>
<p>The experienced user will note a problem with the above grammar. It is
ambiguous. An ambiguous grammar is a grammar which, given a certain
input, can reduce the parts of the input in two different ways such as
to give two different answers. Take the above grammar, for
example. given the following input: <br>
<tt>3 + 4 * 6</tt><br>
The grammar can either evaluate the <tt>3 + 4</tt> and then multiply
seven by six, or it can evaluate <tt>4 * 6</tt> and then add three.
Older versions of CUP forced the user to write unambiguous grammars, but
now there is a construct allowing the user to specify precedences and
associativities for terminals. This means that the above ambiguous
grammar can be used, after specifying precedences and associativities.
There is more explanation later.
Based on these namings we can construct a small CUP specification
as follows:<br></p>
<hr>
<pre><tt>// CUP specification for a simple expression evaluator (no actions)
import com.github.jhoenicke.javacup.runtime.*;
/* Preliminaries to set up and use the scanner. */
init with {: scanner.init(); :};
scan with {: return scanner.next_token(); :};
/* Terminals (tokens returned by the scanner). */
terminal SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD;
terminal UMINUS, LPAREN, RPAREN;
terminal Integer NUMBER;
/* Non terminals */
non terminal expr_list, expr_part;
non terminal Integer expr, term, factor;
/* Precedences */
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
/* The grammar */
expr_list ::= expr_list expr_part |
expr_part;
expr_part ::= expr SEMI;
expr ::= expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr MOD expr
| MINUS expr %prec UMINUS
| LPAREN expr RPAREN
| NUMBER
;
</tt></pre>
<hr>
<p>We will consider each part of the specification syntax in detail later.
However, here we can quickly see that the specification contains four
main parts. The first part provides preliminary and miscellaneous declarations
to specify how the parser is to be generated, and supply parts of the
runtime code. In this case we indicate that the <tt>com.github.jhoenicke.javacup.runtime</tt>
classes should be imported, then supply a small bit of initialization code,
and some code for invoking the scanner to retrieve the next input token.
The second part of the specification declares terminals and non-terminals,
and associates object classes with each. In this case, the terminals
are declared as either with no type, or of type
<tt>Integer</tt>. The specified type of the
terminal or non-terminal is the type of the value of those terminals or
non-terminals. If no type is specified, the terminal or non-terminal
carries no value. Here, no type indicates that these
terminals and non-terminals hold no value.
The third part specifies the precedence and
associativity of terminals. The last precedence declaration give its
terminals the highest precedence. The final
part of the specification contains the grammar.</p>
<p>To produce a parser from this specification we use the CUP generator.
If this specification were stored in a file <tt>parser.cup</tt>, then
(on a Unix system at least) we might invoke CUP using a command like:
</p><pre><tt> java -jar java-cup-11a.jar parser.cup</tt> </pre>
In this case, the system will produce two Java source files containing
parts of the generated parser: <tt>sym.java</tt> and <tt>parser.java</tt>.
As you might expect, these two files contain declarations for the classes
<tt>sym</tt> and <tt>parser</tt>. The <tt>sym</tt> class contains a series of
constant declarations, one for each terminal symbol. This is typically used
by the scanner to refer to symbols (e.g. with code such as
"<tt>return new Symbol(sym.SEMI);</tt>" ). The <tt>parser</tt> class
implements the parser itself.</p>
<p>The specification above, while constructing a full parser, does not perform
any semantic actions &emdash; it will only indicate success or failure of a parse.
To calculate and print values of each expression, we must embed Java
code within the parser to carry out actions at various points. In CUP,
actions are contained in <i>code strings</i> which are surrounded by delimiters
of the form <tt>{:</tt> and <tt>:}</tt> (we can see examples of this in the
<tt>init with</tt> and <tt>scan with</tt> clauses above). In general, the
system records all characters within the delimiters, but does not try to check
that it contains valid Java code.</p>
<p>A more complete CUP specification for our example system (with actions
embedded at various points in the grammar) is shown below. Note that this
uses autoboxing for integers and thus needs Java 1.5 to run.
</p><hr>
<pre><tt>// CUP specification for a simple expression evaluator (w/ actions)
import com.github.jhoenicke.javacup.runtime.*;
/* Preliminaries to set up and use the scanner. */
init with {: scanner.init(); :};
scan with {: return scanner.next_token(); :};
/* Terminals (tokens returned by the scanner). */
terminal SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD;
terminal UMINUS, LPAREN, RPAREN;
terminal Integer NUMBER;
/* Non-terminals */
non terminal expr_list, expr_part;
non terminal Integer expr;
/* Precedences */
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
/* The grammar */
expr_list ::= expr_list expr_part
|
expr_part;
expr_part ::= expr:e
{: System.out.println("= " + e); :}
SEMI
;
expr ::= expr:e1 PLUS expr:e2
{: RESULT = e1 + e2; :}
|
expr:e1 MINUS expr:e2
{: RESULT = e1 - e2; :}
|
expr:e1 TIMES expr:e2
{: RESULT = e1 * e2; :}
|
expr:e1 DIVIDE expr:e2
{: RESULT = e1 / e2; :}
|
expr:e1 MOD expr:e2
{: RESULT = e1 % e2; :}
|
NUMBER:n
{: RESULT = n; :}
|
MINUS expr:e
{: RESULT = e; :}
%prec UMINUS
|
LPAREN expr:e RPAREN
{: RESULT = e; :}
;
</tt></pre>
<hr>
<p>
<a name="RES_part">Here</a> we can see several changes. Most importantly, code to be executed at
various points in the parse is included inside code strings delimited by
<tt>{:</tt> and <tt>:}</tt>. In addition, labels have been placed on various
symbols in the right hand side of productions. For example in:</p>
<pre> expr:e1 PLUS expr:e2
{: RESULT = e1 + e2; :}
</pre>
<p>the first non-terminal <tt>expr</tt> has been labeled with <tt>e1</tt>, and
the second with <tt>e2</tt>. The left hand side value
of each production is always implicitly labeled as <tt>RESULT</tt>.</a></p>
<p>Each symbol appearing in a production is represented at runtime by an
object of type <tt>Symbol</tt> on the parse stack. The labels refer to
the instance variable <tt>value</tt> in those objects. In the
expression <tt>expr:e1 PLUS expr:e2</tt>, <tt>e1</tt> and <tt>e2</tt>
refer to objects of type Integer. These objects are in the value fields
of the objects of type <tt>Symbol</tt> representing those non-terminals
on the parse stack. <tt>RESULT</tt> is of type <tt>Integer</tt> as
well, since the resulting non-terminal <tt>expr</tt> was declared as of
type <tt>Integer</tt>. This object becomes the <tt>value</tt> instance
variable of a new <tt>Symbol</tt> object.</a></p>
<p>For each label, two more variables accessible to the user are declared.
A left and right value labels are passed to the code string, so that the
user can find out where the left and right side of each terminal or
non-terminal is in the input stream. The name of these variables is the
label name, plus <tt>left</tt> or <tt>right</tt>. for example, given
the right hand side of a production <tt>expr:e1 PLUS expr:e2</tt> the
user could not only access variables <tt>e1</tt> and <tt>e2</tt>, but
also <tt>e1left, e1right, e2left</tt> and <tt>e2right</tt>. these
variables are of type <tt>int</tt>.</a></p>
<p><a name="lex_part">The final step in creating a working parser is to create a <i>scanner</i> (also
known as a <i>lexical analyzer</i> or simply a <i>lexer</i>). This routine is
responsible for reading individual characters, removing things things like
white space and comments, recognizing which terminal symbols from the
grammar each group of characters represents, then returning Symbol objects
representing these symbols to the parser.
The terminals will be retrieved with a call to the
scanner function. In the example, the parser will call
<tt>scanner.next_token()</tt>. The scanner should return objects of
type <tt>com.github.jhoenicke.javacup.runtime.Symbol</tt>. This type is very different than
older versions of CUP's <tt>com.github.jhoenicke.javacup.runtime.symbol</tt>. These Symbol
objects contains the instance variable <tt>value</tt> of type Object,
which should be
set by the lexer. This variable refers to the value of that symbol, and
the type of object in value should be of the same type as declared in
the <tt>terminal</tt> and <tt>non terminal</tt> declarations. In the
above example, if the lexer wished to pass a NUMBER token, it should
create a <tt>Symbol</tt> with the <tt>value</tt> instance variable
filled with an object of type <tt>Integer</tt>. <code>Symbol</code>
objects corresponding to terminals and non-terminals with no value
have a null value field.</a></p>
<p>The code contained in the <tt>init with</tt> clause of the specification
will be executed before any tokens are requested. Each token will be
requested using whatever code is found in the <tt>scan with</tt> clause.
Beyond this, the exact form the scanner takes is up to you; however
note that each call to the scanner function should return a new
instance of <code>com.github.jhoenicke.javacup.runtime.Symbol</code> (or a subclass).
These symbol objects are annotated with parser information and pushed
onto a stack; reusing objects will result in the parser annotations
being scrambled. As of CUP 0.10j, <code>Symbol</code> reuse should be
detected if it occurs; the parser will throw an <code>Error</code>
telling you to fix your scanner.</p>
<p>In the <a href="#spec">next section</a> a more detailed and formal
explanation of all parts of a CUP specification will be given.
<a href="#running">Section 3</a> describes options for running the
CUP system. <a href="#parser">Section 4</a> discusses the details
of how to customize a CUP parser, while <a href="#scanner">section 5</a>
discusses the scanner interface added in CUP 0.10j. <a href="#errors">Section
6</a> considers error recovery. Finally, <a href="#conclusion">Section 7</a>
provides a conclusion.</p>
<h3><a name="spec">2.</a> Specification Syntax</h3>
<p>Now that we have seen a small example, we present a complete description of all
parts of a CUP specification. A specification has four sections with
a total of eight specific parts (however, most of these are optional).
A specification consists of:
</p>
<ul>
<li> <a href="#package_spec">package and import specifications</a>,
</li><li> <a href="#code_part">options and user code components</a>,
</li><li> <a href="#symbol_list">symbol (terminal and non-terminal) lists</a>,
</li><li> <a href="#precedence">precedence declarations</a>, and
</li><li> <a href="#production_list">the grammar</a>.
</li></ul>
<p>Each of these parts must appear in the order presented here. (A complete
grammar for the specification language is given in
<a href="#appendixa">Appendix A</a>.) The particulars of each part of
the specification are described in the subsections below.
<h4><a name="package_spec">2.1.</a> Package and Import Specifications</h4>
<p>A specification begins with optional <tt>package</tt> and <tt>import</tt>
declarations. These have the same syntax, and play the same
role, as the package and import declarations found in a normal Java program.
A package declaration is of the form:</p>
<pre><tt> package <i>name</i>;</tt></pre>
<p>where name <tt><i>name</i></tt> is a Java package identifier, possibly in
several parts separated by ".". In general, CUP employs Java lexical
conventions. So for example, both styles of Java comments are supported,
and identifiers are constructed beginning with a letter, dollar
sign ($), or underscore (_), which can then be followed by zero or more
letters, numbers, dollar signs, and underscores.</p>
<p>After an optional <tt>package</tt> declaration, there can be zero or more
<tt>import</tt> declarations. As in a Java program these have the form:</p>
<pre><tt> import <i>package_name.class_name</i>;</tt>
</pre>
or
<pre><tt> import <i>package_name</i>.*;</tt>
</pre>
<p>The package declaration indicates what package the <tt>sym</tt> and
<tt>parser</tt> classes that are generated by the system will be in.
Any import declarations that appear in the specification will also appear
in the source file for the <tt>parser</tt> class allowing various names from
that package to be used directly in user supplied action code.</p>
<h4><a name="code_part">2.2</a> Options and User Code Components</h5>
<p>Following the optional <tt>package</tt> and <tt>import</tt>
declarations the user can configure the name of the parser and the
symbol class and can give some additional options. An option is
set using the keyword <tt>option</tt>. Here is an example.</p>
<pre><tt> parser ExprParser;</tt>
<tt> option symbols=ExprSymbols;</tt>
<tt> option interface, expect=2;</tt></pre>
<p>Some options take a parameter. The parameter should be added after
an <tt>=</tt> symbol. In principle you can specify any option that
you can also put on the command line. The following options are most
useful to specify in the grammar file:</p>
<dl>
<dt><tt>parser =</tt> <i>name</i>
</dt><dd>Output parser and action code into a file (and class) with the given
name instead of the default of "<tt>parser</tt>". This can also be
set without <tt>option</tt> using the parser directive:
<tt>parser name;</tt></dd>
<dt><tt>typearg =</tt> <i>type</i>
</dt><dd>Add template type arguments to the parser class. This is
deprecated. Instead use the <tt>parser</tt> directive which allows
to add template parameters in Java Syntax, e.g.,
<tt>parser ExprParser<Value></tt>.</dd>
<dt><tt>symbols =</tt> <i>name</i>
</dt><dd>Output the symbol constant code into a class with the given
name instead of the default of "<tt>sym</tt>".</dd>
<dt><tt>java15</tt>
</dt><dd>Produce code that can only be compiled by Java 1.5 and
later. This includes template types and other Java-1.5 specific
things. It is highly recommended that you turn this option on,
unless you have to support Java-1.4.</dd>
<dt><tt>interface</tt>
</dt><dd>Outputs the symbol constant code as an <code>interface</code>
rather than as a <code>class</code>.</dd>
<dt><tt>newpositions</tt>
</dt><dd>This option keeps CUP from generating old variables for the left
and right hand values of terminals to non-terminals, and then from
non-terminals to other terminals. You should turn it on unless for
backwards compatibility.</dd>
<dt><tt>nonterms</tt>
</dt><dd>Place constants for non-terminals into the symbol constant class.
The parser does not need these symbol constants, so they are not normally
output. However, it can be very helpful to refer to these constants
when debugging a generated parser.</dd>
<dt><tt>expect =</tt> <i>number</i>
</dt><dd>During parser construction the system may detect that an ambiguous
situation would occur at runtime. This is called a <i>conflict</i>.
In general, the parser may be unable to decide whether to <i>shift</i>
(read another symbol) or <i>reduce</i> (replace the recognized right
hand side of a production with its left hand side). This is called a
<i>shift/reduce conflict</i>. Similarly, the parser may not be able
to decide between reduction with two different productions. This is
called a <i>reduce/reduce conflict</i>. Normally, if one or more of
these conflicts occur, parser generation is aborted. However, in
certain carefully considered cases it may be advantageous to
arbitrarily break such a conflict. In this case CUP uses YACC
convention and resolves shift/reduce conflicts by shifting, and
reduce/reduce conflicts using the "highest priority" production (the
one declared first in the specification). In order to enable automatic
breaking of conflicts the <tt>expect</tt> option must be given
indicating exactly how many conflicts are expected. Conflicts
resolved by precedences and associativities are not reported.</dd>
<dt><tt>compact_red</tt>
</dt><dd>Including this option enables a table compaction optimization involving
reductions. In particular, it allows the most common reduce entry in
each row of the parse action table to be used as the default for that
row. This typically saves considerable room in the tables, which can
grow to be very large. This optimization has the effect of replacing
all error entries in a row with the default reduce entry. While this
may sound dangerous, if not down right incorrect, it turns out that this
does not affect the correctness of the parser. In particular, some
changes of this type are inherent in LALR parsers (when compared to
canonical LR parsers), and the resulting parsers will still never
read past the first token at which the error could be detected.
The parser can, however, make extra erroneous reduces before detecting
the error, so this can degrade the parser's ability to do
<a href="#errors">error recovery</a>.
(Refer to reference [2] pp. 244-247 or reference [3] pp. 190-194 for a
complete explanation of this compaction technique.) <br>
This option is typically used to work-around the java bytecode
limitations on table initialization code sizes. However, CUP
0.10h introduced a string-encoding for the parser tables which
is not subject to the standard method-size limitations.
Consequently, use of this option should no longer be required
for large grammars.</dd>
<dt><tt>nopositions</tt>
</dt><dd>This option keeps CUP from generating code to propagate the left
and right hand values of terminals to non-terminals, and then from
non-terminals to other terminals. If the left and right values aren't
going to be used by the parser, then it will save some runtime
computation to not generate these position propagations. This option
also keeps the left and right label variables from being generated, so
any reference to these will cause an error.</dd>
<dt><tt>noscanner</tt>
</dt><dd>CUP 0.10j introduced <a href="#scanner">improved scanner
integration</a> and a new interface,
<code>com.github.jhoenicke.javacup.runtime.Scanner</code>. By default, the
generated parser refers to this interface, which means you cannot
use these parsers with CUP runtimes older than 0.10j. If your
parser does not use the new scanner integration features, then you
may specify the <code>noscanner</code> option to suppress the
<code>com.github.jhoenicke.javacup.runtime.Scanner</code> references and allow
compatibility with old runtimes. Not many people should have reason
to do this.</dd>
</dl>
<p>After the options a series of optional declarations can be added
that allow user code to be included as part of the generated parser
(see <a href="#parser">Section 4</a> for a full description of how the
parser uses this code). As a part of the parser file, a separate
non-public class to contain all embedded user actions is produced.
The first <tt>action code</tt> declaration section allows code to be
included in this class. Routines and variables for use by the code
embedded in the grammar would normally be placed in this section (a
typical example might be symbol table manipulation routines). This
declaration takes the form:</p>
<pre><tt> action code {: ... :};</tt>
</pre>
<p>where <tt>{: ... :}</tt> is a code string whose contents will be
placed directly within the <tt>action class</tt> class declaration.
This contains auxiliary functions that can be called from the actions
in the productions (see <a href="#production_list">below</a>).
</p>
<p>After the <tt>action code</tt> declaration is an optional
<tt>parser code</tt> declaration. This declaration allows methods and
variable to be placed directly within the generated parser class.
Although this is less common, it can be helpful when customizing the
parser &emdash; it is possible for example, to include scanning methods inside
the parser and/or override the default error reporting routines. This
declaration is very similar to the <tt>action code</tt> declaration and
takes the form:
</p><pre><tt> parser code {: ... :};</tt>
</pre>
<p>Again, code from the code string is placed directly into the generated parser
class definition.</p>
<p>Next in the specification is the optional <tt>init</tt> declaration
which has the form:</p>
<pre><tt> init with {: ... :};</tt></pre>
<p>This declaration provides code that will be executed by the parser
before it asks for the first token. Typically, this is used to initialize
the scanner as well as various tables and other data structures that might
be needed by semantic actions. In this case, the code given in the code
string forms the body of a <tt>void</tt> method inside the <tt>parser</tt>
class.</p>
<p>
The final (optional) user code section of the specification indicates how
the parser should ask for the next token from the scanner. This has the
form:</p>
<pre><tt> scan with {: ... :};</tt></pre>
<p>As with the <tt>init</tt> clause, the contents of the code string forms
the body of a method in the generated parser. However, in this case
the method returns an object of type <tt>com.github.jhoenicke.javacup.runtime.Symbol</tt>.
Consequently the code found in the <tt>scan with</tt> clause should
return such a value. See <a href="#scanner">section 5</a> for
information on the default behavior if the <code>scan with</code>
section is omitted.<p>
As of CUP 0.10j the action code, parser code, init code, and scan with
sections may appear in any order. They must, however, precede the
symbol lists.</p>
<pre><tt> after reduce {: ... :};</tt></pre>
<p>Defines code that is executed whenever a production rule is
reduced. The arrays symbols contains all terminal and non-terminal
symbols of the current production rule. RESULT can be used to access
and modify the value of the nonterminal created by the production
rule. Example usage:
</p>
<pre>
after reduce {:
int lineNumber = symbols[0].left + 1;
if (RESULT instanceof AST_Node) {
((AST_Node) RESULT).lineNumber = lineNumber
}
:}
</pre>
<h4><a name="symbol_list">2.3.</a> Symbol Lists</h4>
<p>Following user supplied code comes the first required part of the
specification: the symbol lists. These declarations are responsible
for naming and supplying a type for each terminal and non-terminal
symbol that appears in the grammar. As indicated above, each terminal
and non-terminal symbol is represented at runtime with a
<tt>Symbol</tt> object. In the case of terminals, these are created
by the scanner. For non-terminals the generated code will create
these symbols. Each <tt>Symbol</tt> has a <tt>value</tt> field which
can store an optional value. In order to tell the parser which object
types should be used as value for which symbol, <tt>terminal</tt> and
<tt>non terminal</tt> declarations are used.
These take the forms:</p>
<pre><tt> terminal <i>classname</i> <i>name1, name2,</i> ...;</tt>
<tt> non terminal <i>classname</i> <i>name1, name2,</i> ...;</tt>
<tt> terminal <i>name1, name2,</i> ...;</tt>
<tt> non terminal <i>name1, name2,</i> ...;</tt>
</pre>
<p>Here <tt><i>classname</i></tt> is a Java class name that specifies
the type of the value of that is stored together with the terminal or
non-terminal. You can also use an array type, e. g., <tt>String[]</tt>,
or an instantiated template type, e.g., <tt>ArrayList<String></tt>.
In the action code in the grammar the values can
be accessed and they will have the declared type.
If no <tt><i>classname</i></tt> is given, then the
terminal or non-terminal holds no value and it may not be accessed
in the action code. As of CUP 0.10j, you may also use the
declaration "<code>nonterminal</code>" (note, no
space) instead of the original "<code>non terminal</code>" spelling.</p>
<p>The lexer is responsible for putting a value for each terminal in the
<tt>value</tt> instance variable. The type of the value must match
its declared type or an exception will occur at run-time. In the case of
non-terminals the value is generated by the action code of the
production and placed into the Symbol created by the generated parser.</p>
<p>Names of terminals and non-terminals cannot be CUP reserved words;
these include "code", "action", "parser", "terminal", "non",
"nonterminal", "init", "scan", "with", "start", "precedence", "left",
"right", "nonassoc", "import", and "package".</p>
<h4><a name="precedence">2.4.</a> Precedence and Associativity declarations</h4>
<p>The third section, which is optional, specifies the precedences and
associativity of terminals. This is useful for parsing ambiguous
grammars, as done in the example above. There are three type of
precedence/associativity declarations:</p>
<pre><tt>
precedence left <i>terminal</i>[, <i>terminal</i>...];
precedence right <i>terminal</i>[, <i>terminal</i>...];
precedence nonassoc <i>terminal</i>[, <i>terminal</i>...];
</tt></pre>
<p>The comma separated list indicates that those terminals should have the
associativity specified at that precedence level and the precedence of
that declaration. The order of precedence, from highest to lowest, is
bottom to top. Hence, this declares that multiplication and division have
higher precedence than addition and subtraction:</p>
<pre><tt>
precedence left ADD, SUBTRACT;
precedence left TIMES, DIVIDE;
</tt></pre>
<p>Internally, precedence is used to automatically resolve shift reduce
problems. For example, consider the input to the above example
parser <tt>3 + 4 * 8</tt>. Without precedence declaration the parser
doesn't know whether to reduce <tt>3 + 4</tt> to an expr non-terminal
or shift the '*' onto the stack in order to later reduce <tt>4 * 8</tt> to
expr. With the precedence declaration, since '*' has a higher precedence
than '+', it will be shifted
and the multiplication will be performed before the addition.</p>
<p>CUP assigns each one of its terminals a precedence according to these
declarations. Any terminals not in this declaration have no
precedence. CUP also assigns each of its productions a precedence.
That precedence is usually equal to the precedence of the last terminal in
that production but it can also be explicitly set in the grammar.
If the production has no terminals, then it has no
precedence. For example, <tt>expr ::= expr TIMES expr</tt> would have
the same precedence as <tt>TIMES</tt>. When there is a shift/reduce
conflict, the parser determines whether the terminal to be shifted has a
higher precedence, or if the production to reduce by does. If the
terminal has higher precedence, it is shifted, if the production has
higher precedence, a reduce is performed. If they have equal
precedence, associativity of the terminal determine what happens.
</p>
<p>An associativity is assigned to each terminal used in the
precedence/associativity declarations. The three associativities are
<tt>left, right</tt> and <tt>nonassoc</tt> Associativities are also
used to resolve shift/reduce conflicts, but only in the case of equal
precedences. If the associativity of the terminal that can be shifted
is <tt>left</tt>, then a reduce is performed. This means, if the input
is a string of additions, like <tt>3 + 4 + 5 + 6 + 7</tt>, the parser
will <i>always</i> reduce them from left to right, in this case,
starting with <tt>3 + 4</tt>. If the associativity of the terminal is
<tt>right</tt>, it is shifted onto the stack. hence, the reductions
will take place from right to left. So, if PLUS were declared with
associativity of <tt>right</tt>, the <tt>6 + 7</tt> would be reduced
first in the above string. If a terminal is declared as
<tt>nonassoc</tt>, then two consecutive occurrences of equal precedence
non-associative terminals generates an error. This is useful for
comparison operations. For example, if the input string
<tt>6 == 7 == 8 == 9</tt> should not be allowed by the grammar.
If '==' is declared as <tt>nonassoc</tt> then a shift-reduce error will
still be generated when such an expression is possible. </p>
<p>You should be careful when using precedences, as it can hide unexpected
shift-reduce conflicts. It is usually safe to use it for infix operators
and it is also safe to give a production a precedence explictly that is not
an infix production such as int the example with the unary minus.</p>
<p>All terminals not used in the precedence/associativity declarations are
treated as no precedence. If a shift/reduce error results,
involving a terminal or a production rules with no precedence, it cannot
be resolved. So it will be reported.</p>
<h4><a name="production_list">2.5.</a> The Grammar</h4>
<p>The final section of a CUP declaration provides the grammar. This
section optionally starts with a declaration of the form:</p>
<pre><tt> start with <i>non-terminal</i>;</tt>
</pre>
<p>This indicates which non-terminal is the <i>start</i> or <i>goal</i>
non-terminal for parsing. If a start non-terminal is not explicitly
declared, then the non-terminal on the left hand side of the first
production will be used. At the end of a successful parse, CUP returns
an object of type <tt>com.github.jhoenicke.javacup.runtime.Symbol</tt>. This
<tt>Symbol</tt>'s value instance variable contains the final reduction
result.</p>
<p>The grammar itself follows the optional <tt>start</tt> declaration. Each
production in the grammar has a left hand side non-terminal followed by
the symbol "<tt>::=</tt>", which is then followed by a series of zero or more
actions, terminal, or non-terminal
symbols, followed by an optional contextual precedence assignment,
and terminated with a semicolon (;).
If there are several productions for the same non-terminal they may be
declared together. In this case the productions start with the non-terminal
and "<tt>::=</tt>". This is followed by multiple right hand sides each
separated by a bar (|). The full set of productions is then terminated by a
semicolon.</p>
<pre>
expr ::= expr:e1 PLUS expr:e2 {: RESULT = e1 + e2; :}
| expr:e1 TIMES expr:e2 {: RESULT = e1 * e2; :}
| MINUS expr:e {: RESULT = -e; :} %prec UMINUS
| LPAREN expr:e RPAREN {: RESULT = e; :}
| NUMBER:n {: RESULT = n; :}
;
</pre>
<p><a name="label_part">Each</a> symbol on the right hand side can
optionally be labeled with a name. Label names appear after the symbol
name separated by a colon (:). Label names must be unique within the
production, and can be used within action code to refer to the value
of the symbol. Along with the label an additional variable for the
underlying symbol is created, which is the label plus a dollar ($)
symbol. This can be used to access other information, such as the left
and right position of that symbol in the source file.
Unless the option <tt>-nopositions</tt> or <tt>-newpositions</tt> is
given,two more variables are created, which are the label plus
<tt>left</tt> and the label plus <tt>right</tt>. These are
<tt>int</tt> values that contain the right and left locations.</p>
<p>Actions appear in the right hand side as code strings (e.g., Java
code inside <tt>{:</tt> ... <tt>:}</tt> delimiters). These are
executed by the parser at the point when the portion of the production
to the left of the action has been recognized. Note that the scanner
will have returned the token one past the point of the action since
the parser needs this extra
<i>lookahead</i> token for resolving ambiguities.
If no action is given and the right-hand-side consists of a single
(terminal or non-terminal) symbol, the default action is to return
the value of this symbol. This default action is handled very efficiently in
the parser.</p>
<p><a name="cpp">
Contextual</a> precedence assignments follow all the symbols and actions of
the right hand side of the production whose precedence it is assigning.
Contextual precedence assignment allows a production to be assigned a
precedence not based on the last terminal in it. A good example is
shown in the above sample parser specification:</p>
<pre><tt>
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
expr ::= MINUS expr:e
{: RESULT = -e; :}
%prec UMINUS
</tt></pre>
<p>Here, there production is declared as having the precedence of UMINUS.
Hence, the parser can give the MINUS sign two different precedences,
depending on whether it is a unary minus or a subtraction operation. </p>
<h4><a name="wildcard">2.6.</a> Optional and Iterated Symbols</h4>
For every terminal or non-terminal symbol, one can use the symbol
followed by a question mark (?), star (*), or plus (+) to denote that
this symbol is optional or may appear several times. E.g. for a symbol
declared as
<pre><tt>
terminal String sym
</tt></pre>
one can use <tt>sym?</tt>, <tt>sym*</tt> or <tt>sym+</tt>, as if the
following definitions and rules would have been added:
<pre><tt>
nonterminal String sym?
nonterminal String[] sym*, sym+
sym? ::= {: RESULT = null; :} | sym;
sym* ::= {: RESULT = new String[0]; :} | sym+;
sym+ ::= sym:s {: RESULT = new String[] { s }; :}
| sym+:list sym:s
{: RESULT = new String[list.length+1];
System.arraycopy(list, 0, RESULT, 0, nts.length);
RESULT[list.length] = s; :};
</tt></pre>
These definitions are only added when needed and they are implemented more
efficiently internally (e.g. the array is only created once).
<h3><a name="running">3. Running CUP</a></h3>
<h4>3.1 Command line interface</h4>
<p>As mentioned above, CUP is written in Java. To invoke it, one needs
to use the Java interpreter to invoke the static method
<tt>com.github.jhoenicke.javacup.Main()</tt>, passing an array of strings containing options.
Assuming a Unix machine, the simplest way to do this is typically to invoke it
directly from the command line with a command such as: </p>
<pre><tt> java -jar java-cup-11a.jar <i>options</i> <i>inputfile</i></tt></pre>
<p>Once running, CUP expects to find a specification file on standard input
and produces two Java source files as output.</p>
<p>In addition to the specification file, CUP's behavior can also be changed
by passing various options to it. Legal options are documented in
<code>Main.java</code> and include:
</p><dl>
<dt><tt>-package</tt> <i>name</i>
</dt><dd>Specify that the <tt>parser</tt> and <tt>sym</tt> classes are to be
placed in the named package.
This is deprecated; you should put a package directive in the
grammar file.</dd>
<dt><tt>-parser</tt> <i>name</i>
</dt><dd>Output parser and action code into a file (and class) with the given
name instead of the default of "<tt>parser</tt>".
This is deprecated; you should put a parser directive in the
grammar file.</dd>
<dt><tt>-symbols</tt> <i>name</i>
</dt><dd>Output the symbol constant code into a class with the given
name instead of the default of "<tt>sym</tt>".
This option should better be set in the grammar file.</dd>
<dt><tt>-interface</tt>
</dt><dd>Outputs the symbol constant code as an <code>interface</code>
rather than as a <code>class</code>.
This option should better be set in the grammar file.</dd>
<dt><tt>-nonterms</tt>
</dt><dd>Place constants for non-terminals into the symbol constant class.
The parser does not need these symbol constants, so they are not normally
output. However, it can be very helpful to refer to these constants
when debugging a generated parser.</dd>
<dt><tt>-expect</tt> <i>number</i>
</dt><dd>
This option should better be set in the grammar file.<br>
During parser construction the system may detect that an ambiguous
situation would occur at runtime. This is called a <i>conflict</i>.
In general, the parser may be unable to decide whether to <i>shift</i>
(read another symbol) or <i>reduce</i> (replace the recognized right
hand side of a production with its left hand side). This is called a
<i>shift/reduce conflict</i>. Similarly, the parser may not be able
to decide between reduction with two different productions. This is
called a <i>reduce/reduce conflict</i>. Normally, if one or more of
these conflicts occur, parser generation is aborted. However, in
certain carefully considered cases it may be advantageous to
arbitrarily break such a conflict. In this case CUP uses YACC
convention and resolves shift/reduce conflicts by shifting, and
reduce/reduce conflicts using the "highest priority" production (the
one declared first in the specification). In order to enable automatic
breaking of conflicts the <tt>-expect</tt> option must be given
indicating exactly how many conflicts are expected. Conflicts
resolved by precedences and associativities are not reported.</dd>
<dt><tt>-compact_red</tt>
</dt><dd>Including this option enables a table compaction optimization involving
reductions. In particular, it allows the most common reduce entry in
each row of the parse action table to be used as the default for that
row. This typically saves considerable room in the tables, which can
grow to be very large. This optimization has the effect of replacing
all error entries in a row with the default reduce entry. While this
may sound dangerous, if not down right incorrect, it turns out that this
does not affect the correctness of the parser. In particular, some
changes of this type are inherent in LALR parsers (when compared to
canonical LR parsers), and the resulting parsers will still never
read past the first token at which the error could be detected.
The parser can, however, make extra erroneous reduces before detecting
the error, so this can degrade the parser's ability to do
<a href="#errors">error recovery</a>.
(Refer to reference [2] pp. 244-247 or reference [3] pp. 190-194 for a
complete explanation of this compaction technique.) <br><br>
This option is typically used to work-around the java bytecode
limitations on table initialization code sizes. However, CUP
0.10h introduced a string-encoding for the parser tables which
is not subject to the standard method-size limitations.
Consequently, use of this option should no longer be required
for large grammars.</dd>
<dt><tt>-nowarn</tt>
</dt><dd>This options causes all warning messages (as opposed to error messages)
produced by the system to be suppressed.</dd>
<dt><tt>-nosummary</tt>
</dt><dd>Normally, the system prints a summary listing such things as the
number of terminals, non-terminals, parse states, etc. at the end of
its run. This option suppresses that summary.</dd>
<dt><tt>-progress</tt>
</dt><dd>This option causes the system to print short messages indicating its
progress through various parts of the parser generation process.</dd>
<dt><tt>-dump_grammar</tt>
</dt><dt><tt>-dump_states</tt>
</dt><dt><tt>-dump_tables</tt>
</dt><dt><tt>-dump</tt>
</dt><dd> These options cause the system to produce a human readable dump of
the grammar, the constructed parse states (often needed to resolve
parse conflicts), and the parse tables (rarely needed), respectively.
The <tt>-dump</tt> option can be used to produce all of these dumps.</dd>
<dt><tt>-time</tt>
</dt><dd>This option adds detailed timing statistics to the normal summary of
results. This is normally of great interest only to maintainers of
the system itself.</dd>
<dt><tt>-debug</tt>
</dt><dd>This option produces voluminous internal debugging information about
the system as it runs. This is normally of interest only to maintainers
of the system itself.</dd>
<dt><tt>-newpositions</tt>
</dt><dd>
This option should better be set in the grammar file.<br>
This option keeps CUP from generating old variables for the left
and right hand values of terminals to non-terminals, and then from