Smaller C Core Compiler Wiki

Alexey Frunze edited this page Apr 13, 2015 · 22 revisions

Smaller C Core Compiler

 

Table of contents

About this document

What is Smaller C?

Why yet another compiler?

What features are supported by Smaller C?

What features are NOT supported by Smaller C?

Calling conventions

Limitations and implementation details

How do I compile Smaller C on/for x86?

How do I compile Smaller C for MIPS?

How do I run Smaller C?

How do I compile Smaller C with itself?

Miscellaneous

TODO

About this document

This document reflects the current state of the Smaller C core compiler. It is a live document and it is updated regularly (typically, with every code change).

What is Smaller C?

Smaller C is a simple and small single-pass C compiler, currently supporting most of the C language common between C89/ANSI C and C99 (minus some C89 and plus some C99 features).

Currently it generates 16-bit and 32-bit 80386+ assembly code for NASM that can then be assembled and linked into DOS, Windows and Linux programs.

Code generation for MIPS CPUs is also supported.

The core compiler is capable of compiling its own source code.

Smaller C is not an optimizing compiler, but it does the job better than the typical small C implementation.

Why yet another compiler?

  • I always wanted to write one to see what it takes, and I now have enough knowledge and time to finally do it
  • It's a good programming exercise and the project is small enough to be manageable by one person and to be functional in just a few months (in comparison, I haven't yet been able to finish my OS project; I think it is just a little too big for me)
  • There are a number of old implementations of Small C by Ron Cain, James Hendrix and others (see also SubC by Nils Holm) which I could have used as a foundation, but I wanted something less limited: something that supports modern C syntax and both 16-bit and 32-bit x86 targets, still fits into ~64-128 KB of RAM for code and data, and is self-compilable (which is why no external lexers or parsers are used)
  • It may be useful as a simple and small (cross-) compiler for DOS, RetroBSD or other OSes, including hobby OSes
  • It's fun!

What features are supported by Smaller C?

  • decimal, octal and hexadecimal integer constants, e.g. 15, 017, 0xF, 32768u
  • character constants, e.g. 'a', '\'', '\n', '\0', '\x41'
  • string literals, e.g. "string" (with concatenation of adjacent string literals: "str" "ing" is equivalent to "string"; concatenation doesn't fully work when one of the adjacent strings is the result of expansion of a macro or of inclusion of a file)
  • void
  • char (signed by default, can be changed to unsigned via a command-line option)
  • signed char, unsigned char
  • short, unsigned short
  • int, unsigned int
  • long, unsigned long (only in 32-bit and huge mode(l)s)
  • struct, union (can only be passed to and returned from functions "by reference", not by value)
  • #pragma pack() (mostly useless for MIPS because it can't access misaligned memory locations and the MIPS code generator doesn't provide a solution for that)
  • enum
  • functions
  • arrays (including multidimensional) and pointers of/to each other and of/to the above
  • expressions
  • all operators
  • sizeof expression, sizeof( type )
  • type casts, e.g. int x = (int)&x;
  • global and local declarations and function prototypes
  • extern
  • static
  • typedef
  • __func__
  • initialization of variables
  • incomplete types/arrays, e.g. extern int array[][2]; or int main(int argc, char* argv[]);
  • declarations of variables after statements (C99/C++-style), e.g. { int a = 1; printf("a=%d\n", a); int b = 2; printf("b=%d\n", b); }
  • implicit function declarations, e.g. you can call printf() without first declaring it or including stdio.h
  • variadic functions with optional parameters represented as ... similar to printf() (unfortunately, the va_something macros aren't currently supported)
  • the compound statement, {}
  • if, else
  • goto
  • while, do, for (C99/C++-style, supporting variable declaration as in for(int i=0;;)), break, continue
  • switch, case, default
  • return
  • asm("assembly code"); The inline assembly code is output verbatim to the output file or console.
  • #line, e.g. #line 123, #line 234 "file1.c" (or gcc-style: #123, #234 "file1.c")
  • __interrupt functions (experimental; x86/huge only)
  • comments, e.g. /* comment */ and // comment
  • #define, e.g. #define TEN 10 (parametrized macros like #define DOUBLE(x) ((x)*2) aren't currently supported)
  • #undef
  • #ifdef, #ifndef, #else, #endif
  • #include, e.g. #include "somefile" or #include <somefile>
  • __FILE__
  • __LINE__

What features are NOT supported by Smaller C?

  • most of the preprocessor (but you can use preprocessors from gcc, pcc and Open Watcom C/C++)
  • wide characters and wide string literals, e.g. L'a', L"string"
  • long long
  • structure/union bit-fields and flexible array members
  • const, volatile (they are treated as white space; everything is de facto volatile-ish)
  • tentative declarations (if you don't know what they are, you probably shouldn't worry)
  • variable-length arrays (AKA VLAs)
  • K&R syntax, e.g. int add(a, b) int a; int b; { return a + b; }
  • Implicit int, e.g. static x = 1; const y = 2; main(void) {} must be static int x = 1; const int y = 2; int main(void) {}.
  • floating point arithmetic, float, double
  • complex-valued arithmetic, _Complex, _Imaginary
  • auto, register, restrict (treated as white space)
  • inline, _Bool
  • probably, whatever I've forgotten to mention
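Most of these gaps have a straightforward rewrite. The sketch below pairs a few unsupported constructs (shown in comments) with an equivalent that uses only supported features. The function names and the BUF_MAX bound are hypothetical, chosen just for this example.

```c
#include <assert.h>

/* Instead of a parametrized macro: #define DOUBLE(x) ((x)*2) */
int double_it(int x) {
    return x * 2;
}

/* Instead of K&R syntax: int add(a, b) int a; int b; { return a + b; } */
int add(int a, int b) {
    return a + b;
}

/* Instead of implicit int: static counter = 1; */
static int counter = 1;

int next_id(void) {
    return counter++;
}

/* Instead of a variable-length array (int buf[n];) use a fixed bound. */
enum { BUF_MAX = 8 };

int sum_first(int n) {
    int buf[BUF_MAX];
    int sum = 0;
    assert(n >= 0 && n <= BUF_MAX);
    for (int i = 0; i < n; i++) {
        buf[i] = i + 1;
        sum += buf[i];
    }
    return sum;
}

/* Instead of floating point (value * 0.15): integer math on whole percents. */
int percent_of(int value, int percent) {
    return value * percent / 100;
}
```

The integer version of percent_of() truncates toward zero and can overflow for large inputs, so it is only a stand-in for simple cases.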

Calling conventions

  • x86: stack-based calling convention: all arguments go onto the stack (order: the first argument is on top of the stack), the caller cleans up the stack after the call, the return value is in (e)ax.
  • MIPS: standard stack-based calling convention: the first 4 arguments go into a0-a3, the rest go onto the stack (order: the first argument, if it were pushed, would appear on top of the stack), the caller cleans up the stack after the call, the return value is in v0. Stack space is always reserved for the first 4 arguments, even if none are passed, as if they were passed on the stack.
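The argument order can be pictured with a toy model. The sketch below is not the compiler's code: ToyStack and the toy_* functions are hypothetical, the stack is just an array of ints, and the return address that a real call would place on top is ignored. It only shows that pushing arguments last-to-first leaves the first argument on top, and that the caller drops the arguments afterwards.

```c
#include <assert.h>
#include <stddef.h>

enum { STACK_WORDS = 16 };

typedef struct {
    int words[STACK_WORDS];
    size_t sp;                /* index of the top word; the stack grows downward */
} ToyStack;

void toy_init(ToyStack *s) {
    s->sp = STACK_WORDS;      /* empty stack */
}

void toy_push(ToyStack *s, int value) {
    assert(s->sp > 0);
    s->words[--s->sp] = value;
}

/* Caller side: push the last argument first so args[0] ends up on top. */
void toy_push_args(ToyStack *s, const int *args, size_t count) {
    size_t i;
    for (i = count; i > 0; i--)
        toy_push(s, args[i - 1]);
}

/* Callee side: argument n sits n words above the top of the stack. */
int toy_arg(const ToyStack *s, size_t n) {
    assert(s->sp + n < STACK_WORDS);
    return s->words[s->sp + n];
}

/* Caller cleanup: discard the arguments once the call has returned. */
void toy_pop_args(ToyStack *s, size_t count) {
    assert(s->sp + count <= STACK_WORDS);
    s->sp += count;
}
```

Because the caller does the cleanup and the first argument is always at a known position, a callee can locate its named parameters without knowing how many extra arguments were passed, which is what makes printf()-style variadic functions workable.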

Limitations and implementation details

  • Error and warning messages emitted by Smaller C indicate approximate error location in the source file. For example, an error at 7:15 would mean that the error is on line 7 in position 15 or before position 15, most probably somewhere in the declaration or expression ending at 7:15
  • All char types (char, signed char, unsigned char) are 8-bit
  • short and unsigned short are 16-bit
  • int and unsigned int are either 16-bit or 32-bit, depending on the mode(l)
  • sizeof(int) = sizeof(void*) = sizeof(void(*)()), IOW, ints and pointers are the same size; far pointers are not supported (except in the huge memory model, more on this later)
  • long and unsigned long are 32-bit (if supported by the mode(l))
  • enum types and constants are always of type and size of int
  • Type specifiers cannot occur in an arbitrary order. For example, you can write unsigned char, long int, but you can't write char unsigned, int long. The order is: signed/unsigned then short/long then char/int
  • structures and unions can only be passed to and returned from functions "by reference", not by value
  • Large decimal integer constants without a U suffix (such as 32768 in 16-bit mode(l)s and 2147483648 or 2147483648L in 32-bit mode(l)s) not fitting into type signed int may cause compilation errors. Long story short, you'll probably have to append the U suffix to the constant (e.g. 32768U, 2147483648U, 2147483648UL), recompile the code and forget about it.
    Now, the long story. If a decimal constant has no suffix, its type must be either int or long or, depending on the version of the C standard that the compiler implements, a) unsigned long (ANSI C/C89) or b) long long (C99). The compiler chooses the first type from this list that can represent the constant. If the decimal constant has an L suffix only, its type must be either long or, depending on the version of the C standard that the compiler implements, a) unsigned long (ANSI C/C89) or b) long long (C99). Again, the compiler chooses the first type that can represent the constant. If the decimal constant has the U suffix only, its type must be either unsigned int or unsigned long or unsigned long long (long long is C99 only). As usual, the compiler chooses the first type that can represent the constant. If the decimal constant has the U suffix and the L suffix (just one L), its type must be either unsigned long or unsigned long long (long long is C99 only). And the compiler still chooses the first type from this list that can represent the constant.
    As you can see, there's ambiguity as to what type a large decimal constant should be. In ANSI C/C89 it may be of type unsigned long, while in C99 it may be of type long long. In some cases this ambiguity can change the behavior of the code and thus break it. Smaller C does not support long long as of now, and while it might seem reasonable to adopt the ANSI C approach here, it would hide a portability issue that is easily detectable and correctable. I chose to turn this portability issue into a compilation error.
    Another reason why you can run into this compilation error is that in 16-bit modes Smaller C does not support type long (which, btw, has to have 32 bits or more per the language standard) and what could've become long simply can't because there's no type for it.
  • 16-bit versions of the compiler (except if compiled with the -huge option) will not compile your C code into 32-bit assembly
  • only fully-bracketed (sic) initialization of arrays is supported, hence you can write int a[2][2] = { { 1, 2 }, { 3, 4 } }; but not int a[2][2] = { 1, 2, 3, 4 };
  • there are no redeclaration checks currently, e.g. int i, i; is considered valid and the first i is obscured by the second i. If i is a global variable, you'll get an assembly- or link-time error. Likewise, if you define several local variables (or struct/union members) with the same name (which normally should result in a compile-time error), only the last one will be visible. Similarly, function prototypes should match those of function definitions or bad things may happen at run time.
  • there are no checks for the missing return statement at the end of the function
  • there are minimal type checks in the assignment operator = and in function argument passing. Values of types char, int, void*, struct someTag*, type*, etc can be freely assigned to one another or passed as one another or returned as one another
  • functions returning char, signed char, unsigned char, short and unsigned short have to zero- or sign-extend (depending on signedness of the particular char or short type) the return value to fill the entire ax or eax register (x86; ax for 16-bit code, eax for 32-bit code) or the v0 register (MIPS).
  • Smaller C's functions do not preserve x86 registers (other than (e)sp and (e)bp). Because of that they may not work when used as callbacks from standard library functions such as bsearch(), qsort(), atexit() if those expect (per the calling convention) register preservation in callback functions. This problem exists only with standard libraries borrowed as object/library files from other compilers. The problem does not exist if you're using Smaller C's standard library or if you recompile a borrowed library from its source with Smaller C. It is possible to preserve all or most registers, but there isn't a single widely supported calling convention, so currently this is left unimplemented. You may be able to work around this with asm("assembly code");, e.g. as the very first thing in a callback function you could do asm("push ebx\npush ecx\npush edx\npush esi\npush edi"); and then just before returning from it do the complementary asm("pop edi\npop esi\npop edx\npop ecx\npop ebx");.
  • Currently, Smaller C uses statically allocated arrays (instead of dynamically allocated memory buffers) to maintain tables of known macros, declarations, enums, identifiers, etc., so large C files may exhaust those arrays and may need to be split up into several smaller ones. Alternatively, Smaller C can be recompiled with larger arrays to accommodate large C files (there are a few macros in the code that define the sizes of the various arrays).
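Several of the limitations above come down to writing things a particular way. The following sketch collects a few of them in one place, assuming a 32-bit mode(l) where unsigned int is 32 bits wide. The identifiers (grid, mask, Point, point_add and so on) are invented for the example.

```c
#include <assert.h>

/* Fully bracketed initializer; int grid[2][2] = { 1, 2, 3, 4 }; is not accepted. */
int grid[2][2] = { { 1, 2 }, { 3, 4 } };

/* Specifier order: signed/unsigned, then short/long, then char/int. */
unsigned short int mask = 0xFFFFu;

/* 2147483648 does not fit into a 32-bit signed int, so it carries a U suffix. */
unsigned int high_bit(void) {
    return 2147483648U;
}

typedef struct Point {
    int x;
    int y;
} Point;

/* The result goes through an out-pointer because structures
   cannot be passed or returned by value. */
void point_add(Point *out, Point *a, Point *b) {
    assert(out != 0 && a != 0 && b != 0);
    out->x = a->x + b->x;
    out->y = a->y + b->y;
}

int grid_sum(void) {
    int sum = 0;
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            sum += grid[r][c];
    return sum;   /* a forgotten return here would not be diagnosed */
}
```

Since there are only minimal type checks on assignment and argument passing, keeping prototypes and definitions in exact agreement, as done here, is the programmer's responsibility.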

How do I compile Smaller C on/for x86?

With Turbo C++ 1.01 in DOS/DosBox:

tcc.exe -esmlrc.exe smlrc.c

With DJGPP (gcc 3.3.4 for DOS) in DOS/DosBox:

gcc.exe -Wall -O2 smlrc.c -o smlrc.exe

With Open Watcom C/C++ 1.9 in Windows:

wcl386.exe /q /we /wx /j smlrc.c /fe=smlrc.exe

With 32-bit MinGW (gcc 4.6.2) in Windows:

gcc.exe -Wall -Wextra -O2 smlrc.c -o smlrc.exe

With 32-bit gcc in Linux:

gcc -Wall -Wextra -O2 smlrc.c -o smlrc

How do I compile Smaller C for MIPS?

The compilation steps are pretty much the same as for x86. If you want to compile it with MIPS code generation instead of x86 code generation, you need to define the MIPS macro at compile time. For example, with gcc you'd do it like this:

gcc -Wall -Wextra -DMIPS -O2 smlrc.c -o smlrc

If you want to compile Smaller C for RetroBSD (MIPS), you can do it like so (you'll need the ice2aout tool):

pic32-gcc -nostdinc -nostartfiles -ffreestanding -nostdlib -nodefaultlibs -mno-peripheral-libs -X -T smlrcrb.ld -DNO_ANNOTATIONS -DNO_PREPROCESSOR -DNO_PPACK -D__SMALLER_C__ -D__SMALLER_C_SCHAR__ -D__SMALLER_C_32__ -D_RETROBSD -DMIPS smlrcrb.s lb.c smlrc.c -Os -o smlrcrb.elf

ice2aout smlrcrb.elf smlrc

chmod +x smlrc

How do I run Smaller C?

You invoke it like this (in DOS and Windows):

smlrc.exe [options] somefile.c somefile.asm

or like this (in Linux):

./smlrc [options] somefile.c somefile.asm

If you omit the second file name, the compiler will output the generated assembly code to the standard output (console).

Options:

  • -seg16t (x86 only) This chooses 16-bit output and wraps code and data into separate SEGMENT blocks. The segment declarations are compatible with 16-bit Borland/Turbo C/C++ compilers/linkers. This is the default. Note: does not support long. Predefines __SMALLER_C_16__.
  • -seg16 (x86 only) This chooses 16-bit output and wraps code and data into separate generic SECTION blocks. Note: does not support long. Predefines __SMALLER_C_16__.
  • -seg32 (x86 only) This chooses 32-bit output and wraps code and data into separate generic SECTION blocks. Predefines __SMALLER_C_32__.
  • -flat16 (x86 only) This chooses 16-bit output and does not separate code from data (except for global/static variables in function bodies, which are jumped over). This may be useful for producing flat 16-bit binaries (like .COM in DOS) directly with NASM using the -f bin option, or with assemblers not supporting multiple sections/segments, like FASM. Note: does not support long. Predefines __SMALLER_C_16__.
  • -flat32 (x86 only) This chooses 32-bit output and does not separate code from data (see -flat16 above). Predefines __SMALLER_C_32__.
  • -huge (x86 only) This is similar to -seg32 in that ints and pointers are 32-bit. However, the generated code is to be executed in the 16-bit real-address or virtual-8086 mode (i.e. targeting DOS). The pointers are 32-bit physical addresses and they get converted into far pointers (16-bit segment:16-bit offset pairs) completely transparently. This memory model is similar to the huge memory model encountered in 16-bit Borland C/C++ compilers. This model helps porting code to DOS and eases writing code for DOS, liberating the programmer from the need to deal with segments explicitly if more than 64KB of memory needs to be used. With this model you can have more than 64KB of code and more than 64KB of data. Individual data objects can be larger than 64KB as well (e.g. you can have large arrays). You're only limited by the amount of free "conventional" memory in DOS (e.g. up to some 500KB). Currently, in this model the stack is limited to 64KB and the cumulative size of local variables is limited to 32KB (this is a per-function limit). Individual functions are limited to 32KB of code size. These limitations should not pose problems normally. Predefines __SMALLER_C_32__ and __HUGE__.
  • -winstack (x86 only) This option is to be used together with -seg32. It causes proper stack growth on Windows by calling an equivalent of the Visual C++ _chkstk() function to touch/probe stack pages and move the process guard page whenever necessary (when a function allocates 4096 or more bytes for local variables). If the guard page isn't moved, stack accesses beyond it will cause the process to crash.
  • -no-externs This suppresses generation of the global and extern directives in the assembly output. This may be useful when compiling one or more C files into one or more assembly files, concatenating all the assembly files into one and assembling it directly into an executable binary without using a linker (e.g. using NASM's -f bin option).
  • -label <number> takes the initial number (non-negative integer; without the angle brackets) for numbered labels in the generated assembly code. Again, this may be useful when compiling one or more C files into assembly code and assembling it (possibly, after concatenation of multiple assembly files into one) as a single assembly file directly into an executable binary without using a linker (e.g. using NASM's -f bin option). The generated assembly has at its end the Next label number: string followed by the label number that can be safely used as an input into -label. This lets you get unique numbered labels in the output of multiple runs of the compiler. Beware, static variables and functions will collide in concatenated assembly files if the same name is declared more than once. This option is to be used together with -no-externs.
  • -use-gp (MIPS only) This generates shorter lb/lbu/lw/sb/sw instructions with gp-register-relative addressing when reading/writing global variables. Note, this option may not always work due to the simplicity of its implementation. Use with caution.
  • -signed-char This makes char signed. This is the default.
  • -unsigned-char This makes char unsigned.
  • -ctor-fxn function_name This inserts "call _function_name" at the beginning of the code generated for main(). Apparently, this is what the modern gcc (32-bit MinGW gcc 4.6.2) does to construct objects (C++?) and initialize some data prior to executing main(). It calls __main(). So you may need to use -ctor-fxn __main or -ctor-fxn _main if you're bootstrapping Smaller C with gcc. Note: this option only works together with -seg32.
  • -leading-underscore (x86 only) This prefixes global C identifiers with an underscore, so you get labels like _main for main() and _printf for printf() in the assembly code. This is the default.
  • -no-leading-underscore (x86 only; default for MIPS) This results in no underscore prefixing of global C identifiers and you get assembly labels main and printf for main() and printf() respectively. This is useful for compilation using the ELF format in Linux.
  • -I dir This adds a directory to the include file search path. Can be repeated multiple times. Header files in double quotes, e.g. #include "myfile.h", are first looked for in the current directory (note that the current directory isn't necessarily the same directory that contains the file that does #include "myfile.h"), then in the directories specified with the -I option, then in the directories specified with the -SI option.
  • -SI dir This adds a directory to the system include file search path. Can be repeated multiple times. Required for inclusion of system headers using angle brackets, e.g. #include <stdio.h>.
  • -D macro[=expansion text] This defines a macro. When the =expansion text part is omitted, the macro is defined as 1. Can be repeated multiple times.
  • -Wall This enables printing of warnings.
  • -verbose This prints the names of the functions being compiled. It also implies the -Wall option.

How do I compile Smaller C with itself?

First, compile Smaller C using your favorite C compiler into e.g. smlrc.exe.

Next, compile it to assembly code using

either (for DJGPP gcc)

smlrc.exe -seg32 smlrc.c smlrcdj.asm

or (for MinGW gcc)

smlrc.exe -ctor-fxn __main -seg32 smlrc.c smlrcmingw.asm

or (for Linux gcc):

./smlrc -no-leading-underscore -seg32 smlrc.c smlrclinux.asm

or (for Linux gcc, targeting MIPS):

./smlrc -D MIPS smlrc.c smlrclinuxmips.s

Note: If you have compiled Smaller C with a compiler that implements ints as 16-bit (or smaller than 32-bit), Smaller C won't be able to compile code with the -seg32, -flat32 and -huge options.
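Whether a given host compiler falls under this note can be checked from C itself. The following is a sketch of one way to do that with limits.h, not the check Smaller C performs internally; the function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if the host compiler's unsigned int can hold any 32-bit value. */
int host_int_is_32_bit_or_wider(void) {
    return UINT_MAX >= 0xFFFFFFFFUL;
}

/* Hypothetical helper: describes which targets such a build could serve. */
char *available_targets(void) {
    if (host_int_is_32_bit_or_wider())
        return "16-bit, 32-bit and huge";
    return "16-bit only";
}

/* Call early in a build that must be able to emit 32-bit or huge code. */
void require_32_bit_host(void) {
    assert(host_int_is_32_bit_or_wider());
}
```

With Turbo C++ 1.01, where int is 16 bits wide, the first function would be expected to return 0, which matches the restriction described in the note.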

Then assemble the resultant assembly code (you'll need NASM 2.03 or later) using

either (for DJGPP gcc)

nasm.exe -f coff smlrcdj.asm -o smlrcdj.o

or (for MinGW and other gcc's)

nasm.exe -f elf smlrcmingw.asm -o smlrcmingw.o

or

nasm -f elf smlrclinux.asm -o smlrclinux.o

or (for Linux gcc, targeting MIPS):

gcc -c smlrclinuxmips.s -o smlrclinuxmips.o

or alternatively (for Linux GNU as, targeting MIPS):

as smlrclinuxmips.s -o smlrclinuxmips.o

Finally, link the object file with your "favorite" compiler's standard library using

either (for DJGPP gcc)

gcc.exe smlrcdj.o -o smlrcdj.exe

or (for MinGW and other gcc's)

gcc.exe smlrcmingw.o -o smlrcmingw.exe

or

gcc smlrclinux.o -o smlrclinux

or (for Linux gcc, targeting MIPS):

gcc smlrclinuxmips.o -o smlrclinuxmips

Also, you can now (re)compile Smaller C into a 16-bit DOS .EXE with a precompiled/crosscompiled Smaller C, say smlrc.exe, and NASM, without any linker, like this:

smlrc.exe -seg16 -no-externs lb.c lb.asm

smlrc.exe -seg16 -no-externs -label 1001 -D NO_ANNOTATIONS -D NO_EXTRAS smlrc.c smlrc.asm

nasm -f bin smlrc16.asm -o smlrc16.exe

The above will produce a DOS executable not supporting the -seg32, -flat32 and -huge options, the long integer types, typedef, enum, #pragma pack() and __func__. If you need 32-bit support in Smaller C for DOS or those features, you can compile it like this instead:

smlrc.exe -huge -no-externs lb.c lb.asm

smlrc.exe -huge -no-externs -label 1001 smlrc.c smlrc.asm

nasm -f bin smlrchg.asm -o smlrchg.exe

Similarly, you can now (re)compile Smaller C into a 32-bit Windows .EXE with a precompiled/crosscompiled Smaller C, say smlrc.exe, and NASM, without any linker, like this:

smlrc -seg32 -no-externs -D _WIN32 lb.c lb.asm

smlrc -seg32 -no-externs -label 1001 smlrc.c smlrc.asm

nasm -f bin mzstub.asm -o mzstub.bin

nasm -f bin smlrcw.asm -o smlrcw.exe

Likewise, you can now (re)compile Smaller C into a 32-bit Linux ELF executable with a precompiled/crosscompiled Smaller C, say smlrc, and NASM, without any linker, like this:

smlrc -seg32 -no-externs -D _LINUX lb.c lb.asm

smlrc -seg32 -no-externs -label 1001 smlrc.c smlrc.asm

nasm -f bin smlrcl.asm -o smlrcl

chmod +x smlrcl

Note: If you can't (re)compile Smaller C because it gets too big (e.g. the code doesn't fit into a 64KB segment when compiling using -seg16 or -seg16t), you may want to compile it with the NO_ANNOTATIONS macro defined. To further reduce the code and data sizes, compile with the NO_PREPROCESSOR macro defined (you'll need an external preprocessor if you use this macro). Additionally you can define NO_TYPEDEF_ENUM and/or NO_FUNC_ to exclude support (and code) for typedef/enum and __func__. Defining NO_PPACK will exclude support for #pragma pack(). To reduce the data size further, compile with the SYNTAX_STACK_MAX macro defined with a smaller number than in the source code (this will limit the maximum number of declarations supported in a translation unit).

Miscellaneous

  • If you dislike the annotations that the compiler puts into the assembly code that it generates, you can suppress them. Compile the compiler with the NO_ANNOTATIONS macro defined. This will also reduce the size of the compiler.
  • You can disable the preprocessor by compiling Smaller C with the NO_PREPROCESSOR macro defined. This may be useful if you're using an external preprocessor. This will also reduce the size of the compiler. Additionally you can define NO_TYPEDEF_ENUM and/or NO_FUNC_ to exclude support (and code) for typedef/enum and __func__. Defining NO_PPACK will exclude support for #pragma pack().

TODO

  • Implement or borrow a preprocessor