Generate "Prelude" of standard C types #293

Open
edsko opened this issue Nov 20, 2024 · 27 comments

@edsko
Collaborator

edsko commented Nov 20, 2024

In the repository there is a file standard_headers.h which imports all of the C standard library. We also have a skeleton implementation of some code (bootstrapPrelude) that trawls through these definitions; currently this is only being used to check if the macro parser is failing to parse some macros that we should be parsing (just to have a source of examples). We should extend this to make a list of all standard types in the standard library, and ensure that for all of these standard C types we have a well-defined translation to standard Haskell types (either from base or from hs-bindgen-patterns).

@TravisCardwell
Collaborator

Directory hs-bindgen/musl-include is not searched unless we specify it as an option as follows:

$ cabal run hs-bindgen -- \
    --clang-option "-I$(pwd)/hs-bindgen/musl-include" \
    dev prelude

Do we want to always/only use that directory, or might we want to run the command using different include directories? Would it be worthwhile to either always inject that include directory or expose our own -I option that defaults to that directory accordingly?

@TravisCardwell
Collaborator

I get a fatal error that stdatomic.h is not found. It is included by standard_headers.h, in the C11 section. Support for atomic primitives and types is an optional feature, and it sounds like Musl does not support it (reference).

What should we do?

@TravisCardwell
Collaborator

I think that we should document the exact VERSION of musl-include that we vendor. We also need to include the COPYRIGHT file.

My suggestion is to vendor the following:

hs-bindgen/
  musl/
    COPYRIGHT
    VERSION
    include/
      ...

Sound good?

@TravisCardwell
Collaborator

TravisCardwell commented Dec 13, 2024

Assuming that we want to update the vendored Musl headers as new versions are released, I would like to make doing so trivial by using a script. Would it be acceptable to add a Bash script to do this named scripts/musl-update.sh?

I would like to confirm: our vendored Musl header files include the generic architecture headers? Perhaps this is problematic for supporting multiple architectures?

How were the current Musl headers vendored? Were they copied from a release tarball, for example?

@phadej
Collaborator

phadej commented Dec 13, 2024

> include the generic architecture headers?

No, they include x86_64 build.

I fail to understand how updating musl is related to generating prelude of standard C types.

@phadej
Collaborator

phadej commented Dec 13, 2024

> I get a fatal error that stdatomic.h is not found. It is included by standard_headers.h, in the C11 section. Support for atomic primitives and types is an optional feature, and it sounds like Musl does not support it (reference).
>
> What should we do?

IMHO, we should support musl. Using Alpine Linux (which uses musl) to produce fully statically linked Linux binaries is common practice. @edsko should clarify whether that is important, or whether always supporting stdatomic.h is more important; the answer determines whether we should test against musl at all.

@TravisCardwell
Collaborator

> I fail to understand how updating musl is related to generating prelude of standard C types.

I am working on development command hs-bindgen dev prelude with a goal of making it easy to determine which types in the standard headers we do not support yet (in addition to macros). I wanted to confirm which headers we are working with. In the future, we may want to continue to use this command to check for unsupported types in newer releases of Musl. For example, Musl does not yet support the C23 standard headers. If/when support is added, we may want to add support for them as well.

> No, they include x86_64 build.

Thank you!

I did a build of Musl 1.2.5 targeting x86_64 and confirmed that I get the exact same headers, so I put the appropriate VERSION and COPYRIGHT files in place.

Verbose notes:

Building Musl 1.2.5:

$ cd /tmp
$ wget https://musl.libc.org/releases/musl-1.2.5.tar.gz
$ tar -zxf musl-1.2.5.tar.gz
$ cd musl-1.2.5
$ ./configure \
    --prefix=/tmp/musl \
    --exec-prefix=/tmp/musl \
    --syslibdir=/tmp/musl/lib \
    --target=x86_64-pc-linux-gnu
$ make
$ make install

Comparing the built include directory with the source:

  • Files from the source include directory are copied without modification. Exception:
    • alltypes.h.in is not copied; file alltypes.h is not created.
  • Files from the source arch/x86_64/bits directory are copied without modification. Exceptions:
    • alltypes.h.in is not copied; file alltypes.h is created.
    • syscall.h.in is not copied; file syscall.h is created.
  • Files from the source arch/generic/bits directory that are not copied from arch/x86_64/bits are copied without modification.
  • Files from the source arch/x86_64 directory are not copied.

To install/upgrade our vendored Musl headers, it looks like we cannot just copy headers from the tarball. We should copy the include directory from a temporary build. The VERSION and COPYRIGHT files should be copied from the tarball.

To always target x86_64, perhaps it is best to keep things simple and just put an assertion in the upgrade script that checks that it is being run on an appropriate platform.

@TravisCardwell
Collaborator

Regarding compatibility, I found the compatibility wiki page with links to C99 API coverage and C11 API coverage. In addition to stdatomic.h not being supported, imaginary and _Imaginary_I are missing from complex.h.

@TravisCardwell
Collaborator

I would like to confirm how we will implement our "prelude" of C types.

There are a lot of types. In C, many types are available from multiple header files. In Musl, such types are defined in bits/alltypes.h. Standard headers include this header file using preprocessor definitions to determine which types are defined (if they are not already defined). In Haskell, perhaps we will define a single (huge) module defining all of the types? From a user perspective, I would prefer to have types organized into modules that correspond to the C header files. A top-level module could re-export all types, for folks who prefer a single import. In the implementation, modules can import from other type modules when necessary/possible. In cases where doing so would result in circular imports, problematic types could be defined in an Internal module. Example:

HsBindgen.StdLib.Types
...
HsBindgen.StdLib.Types.Internal
...
HsBindgen.StdLib.Types.Primitives
...
HsBindgen.StdLib.Types.StdBool
HsBindgen.StdLib.Types.StdDef
HsBindgen.StdLib.Types.StdInt
...

Perhaps we should use the C prefix for type names, like the types defined in Foreign.C.Types?

Many of the types are platform-specific. Should we generate Haskell definitions that match the current platform? If so, I wonder what is the best way to do this. If we also want to enable/disable features based on availability (when using GNU vs. Musl for example), perhaps we should use Autoconf?

Each type should have a newtype definition. Perhaps we should use deriving newtype to derive instances for all classes that the wrapped type has instances for? Example:

newtype CInt8 = CInt8 CSChar
  deriving newtype (
      Bits
    , Bounded
    , Enum
    , Eq
    , FiniteBits
    , Integral
    , Ix
    , Num
    , Ord
    , Read
    , Real
    , Show
    , Storable
    )
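For reference, here is a minimal, self-contained sketch of the same idea that can be loaded in GHCi. It assumes the `DerivingStrategies` and `GeneralizedNewtypeDeriving` extensions and derives only a subset of the classes listed above; the helper names `int8Size`, `int8Bits`, and `int8Range` are hypothetical and exist only to exercise the derived instances.

```haskell
{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Data.Bits (Bits, FiniteBits, finiteBitSize)
import Foreign.C.Types (CSChar)
import Foreign.Storable (Storable, sizeOf)

-- Wrapper for C's int8_t; every instance is reused from CSChar.
newtype CInt8 = CInt8 CSChar
  deriving newtype
    ( Bits, Bounded, Enum, Eq, FiniteBits, Integral
    , Num, Ord, Read, Real, Show, Storable
    )

-- Hypothetical helpers that exercise the derived instances.
int8Size :: Int
int8Size = sizeOf (0 :: CInt8)

int8Bits :: Int
int8Bits = finiteBitSize (0 :: CInt8)

int8Range :: (CInt8, CInt8)
int8Range = (minBound, maxBound)
```

Because the instances are derived with the newtype strategy, `show` on a `CInt8` prints the bare number, just as it does for `CSChar`.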

@TravisCardwell
Collaborator

TravisCardwell commented Dec 16, 2024

Foreign.C.Types defines three types like data CFile = CFile:

  • CFile: C type FILE is a typedef for a struct. Macros are used to optionally make it opaque depending on where it is included from.
  • CFpos: C type fpos_t is a typedef for a union.
  • CJmpBuf: C type jmp_buf is an array typedef for a struct.

These types have no instances. Comments indicate that Eq and Storable instances should be added.

What should we do about such types in our prelude? Perhaps we should define our own versions of these types and not use those in Foreign.C.Types?

For example, we could implement (our own version of) CFile as follows. By only exporting the type constructor, we can keep it opaque.

newtype CFile = CFile CChar
  deriving newtype (Eq, Storable)

Types CFpos and CJmpBuf are more complicated. FWIW, I am not yet sure how we might represent CFpos. The union includes a 16-byte array, yet the Musl implementation consistently casts pointers to const long long *, so perhaps __opaque is never used and we can just wrap CLLong as well...

typedef union _G_fpos64_t {
  char __opaque[16];
  long long __lldata;
  double __align;
} fpos_t;

Perhaps in practice we will only ever deal with pointers to such types anyway. This is probably why they are defined like that in Foreign.C.Types and the Eq and Storable instances were never implemented... If we use such types, however, we will not be able to translate any code that we run across that does not use a pointer.

We will need to test/use the types to gain confidence in the design.
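As a rough illustration of the "pointers only" usage, the sketch below uses the existing opaque `CFile` from `Foreign.C.Types` behind a `Ptr`. The helper `canOpen` and the names `c_fopen` and `c_fclose` are hypothetical; the imports assume the platform's libc exports `fopen` and `fclose` as ordinary symbols.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CFile, CInt (..))
import Foreign.Ptr (Ptr, nullPtr)

-- FILE is never inspected from Haskell; it only appears behind a pointer.
foreign import ccall unsafe "stdio.h fopen"
  c_fopen :: CString -> CString -> IO (Ptr CFile)

foreign import ccall unsafe "stdio.h fclose"
  c_fclose :: Ptr CFile -> IO CInt

-- Hypothetical helper: report whether a file can be opened for reading.
canOpen :: FilePath -> IO Bool
canOpen path =
  withCString path $ \cpath ->
    withCString "r" $ \mode -> do
      fp <- c_fopen cpath mode
      if fp == nullPtr
        then pure False
        else True <$ c_fclose fp
```

No `Eq` or `Storable` instance for `CFile` itself is needed here, which matches the observation that such types mostly show up behind pointers.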

@TravisCardwell
Collaborator

We are translating C typedef declarations to newtype wrappers, but perhaps array typedef declarations are an exception since they essentially bake in a pointer. Example:

typedef struct __jmp_buf_tag {...} jmp_buf[1];
typedef jmp_buf sigjmp_buf;

In such a case, perhaps both should be defined using data? Note that instances cannot be defined, as these types are only usable via pointers.

data CJmpBuf    = CJmpBuf
data CSigJmpBuf = CSigJmpBuf

@TravisCardwell
Collaborator

TravisCardwell commented Dec 17, 2024

Musl supports multiple standards, including POSIX and BSD. Different standards provide different declarations in the same header files, so the declarations exposed by a header file depend on which standards are enabled. For example, standard header time.h exposes POSIX functionality when POSIX is enabled.

Here are the relevant preprocessor definitions:

  • __STRICT_ANSI__: Adds nothing; only suppresses the default features. This macro is defined automatically by GCC and other major compilers in strict standards-conformance modes.
    • Set by the compiler
  • _POSIX_C_SOURCE (or deprecated _POSIX_SOURCE): As specified by POSIX 2008; adds POSIX base. If defined to a value less than 200809L, or if the deprecated version _POSIX_SOURCE is defined at all, interfaces which were removed from the standard but which are still in widespread use are also exposed.
  • _XOPEN_SOURCE: As specified by POSIX 2008; adds all interfaces in POSIX including the XSI option. If defined to a value less than 700, interfaces which were removed from the standard but which are still in widespread use are also exposed.
  • _BSD_SOURCE (or _DEFAULT_SOURCE): Adds everything above, plus a number of traditional and modern interfaces modeled after BSD systems, or supported on current BSD systems based on older standards such as SVID.
  • _GNU_SOURCE (or _ALL_SOURCE): Adds everything above, plus interfaces modeled after GNU libc extensions and interfaces for making use of Linux-specific features.
  • __STDC_VERSION__
    • Long value representing C standard version
    • Used to expose C extensions in standard C headers
    • Set by the compiler

When none are defined by the user, _BSD_SOURCE is set, and _XOPEN_SOURCE is set to 700 to enable POSIX 2017 interfaces. I confirmed that this is what we are currently doing.

What standards do we want to support in our prelude? In the implementation of the prelude command, we should probably initialize the above definitions according to what we want when we parse the Musl headers. If we do not, we may report lots of types for standards that we do not want to include support for.

@TravisCardwell
Collaborator

I tried configuring standard_headers.h as follows, but it did not work! I am still seeing _BSD_SOURCE and _XOPEN_SOURCE set in macros-recognized.log. The __STRICT_ANSI__ macro is logged on line 7303 out of 9602 lines, while I expected it to be logged near the top. It looks like the file is being processed out-of-order, but that does not make sense.

#define __STRICT_ANSI__ 1

#if defined(_ALL_SOURCE)
#undef _ALL_SOURCE
#endif

#if defined(_BSD_SOURCE)
#undef _BSD_SOURCE
#endif

#if defined(_DEFAULT_SOURCE)
#undef _DEFAULT_SOURCE
#endif

#if defined(_GNU_SOURCE)
#undef _GNU_SOURCE
#endif

#if defined(_POSIX_C_SOURCE)
#undef _POSIX_C_SOURCE
#endif

#if defined(_POSIX_SOURCE)
#undef _POSIX_SOURCE
#endif

#if defined(_XOPEN_SOURCE)
#undef _XOPEN_SOURCE
#endif

As a test, I added #include <features.h> after that __STRICT_ANSI__ line, but the results did not change. I tried moving the __STRICT_ANSI__ line to the very top of the file, but the results did not change.

Perhaps we can do this configuration via Clang arguments, but I am concerned that this is not working regardless.

@TravisCardwell
Collaborator

Clang option -std has many supported modes, and those that start with c automatically define __STRICT_ANSI__. I tried setting c17 using the following command, but the results did not change.

$ cabal run hs-bindgen -- \
    --clang-option "-I$(pwd)/hs-bindgen/musl/include" \
    --clang-option "-std=c17" \
    dev prelude \
  | sed "s|$(pwd)|...|g" \
  > hs-bindgen-dev-prelude.txt

I noticed that the option is stored in clangOtherArgs and is not parsed to clangCStandard, but I do not think that matters.

@TravisCardwell
Collaborator

The macros-recognized.log file is written using appendFile! The file is not cleared when the command starts, so each execution adds to the file, explaining why logs appeared to be out of order.

I will definitely change this behavior, but for now I am simply removing the macros-recognized.log file before each execution.
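One possible shape for that change, as a minimal sketch: truncate the log once when the command starts and keep appending afterwards. The function names `resetLog` and `logLine` are hypothetical and are not taken from the hs-bindgen source.

```haskell
-- Truncate (or create) the log file at the start of a run.
resetLog :: FilePath -> IO ()
resetLog path = writeFile path ""

-- Append one entry; safe to call many times within a single run.
logLine :: FilePath -> String -> IO ()
logLine path msg = appendFile path (msg ++ "\n")
```

Calling `resetLog` once per execution means that entries from earlier runs can no longer be mistaken for out-of-order output.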

Inserting #define __STRICT_ANSI__ 1 into standard_headers.h before any includes successfully disables BSD and POSIX support. The clearing of the other flags should not be necessary.

Alternatively, passing the -std=c17 Clang argument also successfully disables BSD and POSIX support.

My sanity is restored.

Considering the options, my proposal is to always specify a C standard when running the prelude command. We can make it default to C17 and provide a command-line option that allows selection of a different standard. With only standard modes available, Clang will always set __STRICT_ANSI__, so there is no need to do so in standard_headers.h.
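A minimal sketch of what such an option could look like, assuming a small enumeration of strict standard modes. The names `CStandard`, `defaultCStandard`, `standardName`, `clangStdArg`, and `parseCStandard` are hypothetical and are not the existing hs-bindgen API.

```haskell
-- Strict (non-GNU) standard modes only, so Clang defines __STRICT_ANSI__.
data CStandard = C89 | C99 | C11 | C17
  deriving (Bounded, Enum, Eq, Show)

defaultCStandard :: CStandard
defaultCStandard = C17

-- Name as accepted by Clang's -std option.
standardName :: CStandard -> String
standardName C89 = "c89"
standardName C99 = "c99"
standardName C11 = "c11"
standardName C17 = "c17"

-- Argument to pass through to Clang.
clangStdArg :: CStandard -> String
clangStdArg std = "-std=" ++ standardName std

-- Parse a command-line value; GNU modes such as "gnu17" are rejected.
parseCStandard :: String -> Maybe CStandard
parseCStandard name =
  lookup name [ (standardName std, std) | std <- [minBound .. maxBound] ]
```

Rejecting anything outside the enumeration is what keeps the "only standard modes" guarantee; supporting GNU modes later would mean extending the type rather than passing arbitrary strings.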

@TravisCardwell
Collaborator

TravisCardwell commented Dec 18, 2024

> What standards do we want to support in our prelude?

Talking with @edsko yesterday, it indeed sounds like we will only support C standards in our standard library (currently "patterns"). Based on a previous discussion, all C code that we generate should conform to C17.

I am now thinking about how to add a C standard option to the command-line executable. Should it be added to the top-level, alongside --clang-option, or should it be added under the dev or prelude commands?

In the context of development commands such as prelude, a motivation for making it an option is to enable us to experiment with adding support for later standards. If we support C17 to start with, we may eventually want to add support for C23.

What should the behavior be for the preprocess command, though? If we develop our prelude with C17 types and users use hs-bindgen with C23 code, perhaps we will use our standard library types when available and generate any C23 types within the user's module(s)? Should we allow users to use other modes such as gnu17? Note that gnu17 is the Clang default; it enables both BSD and POSIX support, and it also has some differences described in the C Language Features part of the user manual.

Should the option support older standards such as C11?

EDIT: I went ahead and implemented it at the top level for now, as we already parse ClangArgs there. With this change, the default C standard is now consistently C17 instead of gnu17.

@edsko
Collaborator Author

edsko commented Dec 18, 2024

Oops, I apologize for causing harm to your sanity 😬

@TravisCardwell
Collaborator

@sheaf mentioned that ptrdiff_t expands to long long on his system. That caught my attention because it is long in the Musl headers for the 64-bit architectures that I have looked at so far. I imagine that it is a case where sizeof(long) == sizeof(long long) and this "spelling" difference does not matter.

This is a good reminder, however, that one may get different results depending on the headers used. I was unable to find an elegant way to query the default Clang header search path, but it can be found in the output of the following command.

$ clang -E - -v </dev/null

The GCC defaults can be found in the output of the following command.

$ $(gcc -print-prog-name=cc1) -v </dev/null 2>&1 | awk -v RS= 'NR==1'

On my system using the default Clang installation, the headers are in /usr/lib/clang/18/include. Type ptrdiff_t is defined using predefined macro __PTRDIFF_TYPE__ when not overridden by a system macro.

On my system using the default GCC installation, the headers are in /usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/include. Type ptrdiff_t is defined as long int (same as long) when not overridden by a system macro.

When executing hs-bindgen on an experiment without specifying Clang options, I get an error that indicates that the default header search path is not being used. I reproduced this in the main branch, confirming that it is not due to Clang options changes that I have made in my current branch.

hs-bindgen: CErrors ["experiment.h:1:10: fatal error: 'stddef.h' file not found"]

It runs fine when specifying options as follows.

cabal run hs-bindgen -- \
  --clang-option '-nostdinc' \
  --clang-option '-isystem/usr/lib/clang/18/include' \
  preprocess \
    -i experiment.h \
    -o Generated.hs

Inspecting the generated code, it is long with Clang defaults on my system.

newtype Ptrdiff_t = Ptrdiff_t
  { unPtrdiff_t :: FC.CLong
  }

I am unable to investigate using clang-ast-dump because it always uses the default Clang options. CLI options for specifying Clang options need to be added for it to be useful with non-primitive types.

For the preprocess command, I wonder what would be most convenient for users. Some thoughts:

  • Perhaps we should always add -nostdinc so that we can fully control which include paths are searched.
  • We could provide a way to add the default Clang include directory, saving users from having to specify it manually. If an empty search path is the default, then we could use a --clang-headers option. Alternatively, we could do this by default, unless a --nostdinc option is specified.
  • When using pkg-config, directories to include depend on the results of dependency resolution on the system. I wonder if there is a way to incorporate those automatically, perhaps via an environment variable.
  • Include path(s) may depend on the target when cross-compiling.

@TravisCardwell
Collaborator

Foreign.C.Types does not define a long double type, perhaps because it is implementation-specific. The standard (since C89) just says that it needs to be at least as large as double.

How should we handle this type?

Note that it is used in the standard library. Example:

typedef struct { long long __ll; long double __ld; } max_align_t;

@TravisCardwell
Collaborator

TravisCardwell commented Dec 26, 2024

What naming conventions should we use for types defined in the prelude?

The module re-exports types defined in base, which use Haskell-style names with a C prefix (example: CInt). Any _t suffixes are dropped (example: CWchar). We could follow this convention, but there are some standard types that are problematic. For example, float_t cannot be named CFloat because that name is already used for the primitive float type. One way to resolve this is to use a T suffix when necessary (example: CFloatT).

Note that float_t is an architecture-dependent type. For example, it is a long double on i386 and a float on x86_64.

Alternatively, we can follow our own default naming conventions, though this results in non-Haskell-style names that are not consistent with the names of types exported from base (example: Float_t).
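To make the first option concrete, here is a rough sketch of a purely mechanical version of the base-style convention: drop a `_t` suffix, camel-case the rest, add the `C` prefix, and fall back to a `T` suffix on a clash. The function `baseStyleName` and its helpers are hypothetical, and the real names in base are not all derived this mechanically.

```haskell
import Data.Char (toUpper)
import Data.List (isSuffixOf)

-- Split a C identifier on underscores.
splitUnderscores :: String -> [String]
splitUnderscores s = case break (== '_') s of
  (w, [])       -> [w]
  (w, _ : rest) -> w : splitUnderscores rest

capitalize :: String -> String
capitalize []       = []
capitalize (c : cs) = toUpper c : cs

-- First argument: Haskell names that are already taken (for example by base).
baseStyleName :: [String] -> String -> String
baseStyleName taken cName
  | candidate `elem` taken = candidate ++ "T"
  | otherwise              = candidate
  where
    stem
      | "_t" `isSuffixOf` cName = take (length cName - 2) cName
      | otherwise               = cName
    candidate = 'C' : concatMap capitalize (splitUnderscores stem)
```

With `CFloat` marked as taken, `float_t` becomes `CFloatT`, while `wchar_t` still maps to `CWchar`.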

@TravisCardwell
Collaborator

As decided on in our last call, I am implementing the prelude in module HsBindgen.Patterns.LibC.

Architecture-dependent implementations are re-exported from internal module HsBindgen.Patterns.LibC.Arch. I initially named this module Bits (Musl terminology), but perhaps Arch is easier to understand.

There will be a separate Arch.hs file for each supported architecture. Cabal conditions are used to select the appropriate source as follows:

...
other-modules:
    ...
    HsBindgen.Patterns.LibC.Arch
hs-source-dirs:
    src
if arch(x86_64)
  hs-source-dirs:
    src-x86_64
...

I am currently exposing the HsBindgen.Patterns.LibC module. This is not in accordance with the library design, where everything should be re-exported from HsBindgen.Patterns according to the following comment:

-- | Design patterns for writing high-level FFI bindings
--
-- This is the only exported module in this library. It is intended to be
-- imported unqualified.

Should we continue to do this even as the library grows? By the way, I see that HsBindgen.ConstantArray is implemented outside of the library prefix and is exposed, perhaps indicating that re-exporting everything from a single module is already problematic.

FWIW, as a user, I would much prefer a separate LibC module.

One reason for exposing the LibC module is that it includes module-level documentation. It re-exports many types from base, and module section documentation allows us to document those types. If we were to make that module internal and re-export from HsBindgen.Patterns, we would need to move the documentation to that module. It is more convenient for developers as well as users who look at the source while trying to understand something to keep the documentation next to the relevant code.

@TravisCardwell
Collaborator

Here is a GHC issue regarding long double: #3353

The long-double package implements support for x86_64, aarch64, arm, and i386 architectures.

My understanding is that, because the standard does not specify the type exactly, the implementation may differ between libraries/compilers as well, not just between architectures. Perhaps implementations are consistent in practice...

The x86_64 implementation matches my test (using GCC and GNU libc):

sizeof(float):         4
alignof(float):        4
sizeof(double):        8
alignof(double):       8
sizeof(long double):  16
alignof(long double): 16
Source
#include <stdalign.h>
#include <stdio.h>

int main(void) {
  printf("sizeof(float):        % 2zu\n", sizeof(float));
  printf("alignof(float):       % 2zu\n", alignof(float));
  printf("sizeof(double):       % 2zu\n", sizeof(double));
  printf("alignof(double):      % 2zu\n", alignof(double));
  printf("sizeof(long double):  % 2zu\n", sizeof(long double));
  printf("alignof(long double): % 2zu\n", alignof(long double));
  return 0;
}

As mentioned in the GHC issue, an alternative approach would be to define an opaque type and only support pointers. That would limit what we can support, though, which may be problematic.
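A middle ground between those two approaches, sketched here under loud assumptions: treat the value as 16 raw bytes so that it can be stored and copied by value, without supporting any arithmetic. The size and alignment are hard-coded from the x86_64 measurement above and would be wrong on other platforms; the type layout and the helper `roundTrip` are hypothetical.

```haskell
import Data.Word (Word64)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (Storable (..))

-- Raw storage for a C long double (x86_64 assumption: 16 bytes, 16-aligned).
-- The two words are uninterpreted bytes, not a decoded floating-point value.
data CLongDouble = CLongDouble !Word64 !Word64
  deriving (Eq, Show)

instance Storable CLongDouble where
  sizeOf    _ = 16
  alignment _ = 16
  peek p = CLongDouble <$> peekByteOff p 0 <*> peekByteOff p 8
  poke p (CLongDouble w0 w1) = pokeByteOff p 0 w0 >> pokeByteOff p 8 w1

-- Hypothetical helper: write a value to temporary memory and read it back.
roundTrip :: CLongDouble -> IO CLongDouble
roundTrip x = alloca $ \p -> poke p x >> peek p
```

This would be enough to give structs such as max_align_t a `Storable` instance, at the cost of not being able to compute with the field.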

@TravisCardwell
Collaborator

Initial PR: #347

@TravisCardwell
Collaborator

struct lconv, defined in locale.h, seems to differ across C standards.

In C89 (4.4 Localization), the following fields are defined:

char *decimal_point;       /* "." */
char *thousands_sep;       /* "" */
char *grouping;            /* "" */
char *int_curr_symbol;     /* "" */
char *currency_symbol;     /* "" */
char *mon_decimal_point;   /* "" */
char *mon_thousands_sep;   /* "" */
char *mon_grouping;        /* "" */
char *positive_sign;       /* "" */
char *negative_sign;       /* "" */
char int_frac_digits;      /* CHAR_MAX */
char frac_digits;          /* CHAR_MAX */
char p_cs_precedes;        /* CHAR_MAX */
char p_sep_by_space;       /* CHAR_MAX */
char n_cs_precedes;        /* CHAR_MAX */
char n_sep_by_space;       /* CHAR_MAX */
char p_sign_posn;          /* CHAR_MAX */
char n_sign_posn;          /* CHAR_MAX */

In C99 (7.11 Localization), the following fields were added:

char int_p_cs_precedes;    // CHAR_MAX
char int_n_cs_precedes;    // CHAR_MAX
char int_p_sep_by_space;   // CHAR_MAX
char int_n_sep_by_space;   // CHAR_MAX
char int_p_sign_posn;      // CHAR_MAX
char int_n_sign_posn;      // CHAR_MAX

Musl defines the structure with all of these fields, compliant with C99 and later.

What should we do in this case? Perhaps one option is to implement the versions as separate types and make the C standard a factor when resolving types.
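As a small sketch of "making the C standard a factor", the field set can be modelled as a function of the standard. The names `CStandard` and `lconvFields` are hypothetical, and the lists only record field names as given above; they do not model the in-memory layout of any particular libc.

```haskell
data CStandard = C89 | C99 | C11 | C17
  deriving (Eq, Ord, Show)

-- Fields listed by C89 (4.4 Localization).
lconvFieldsC89 :: [String]
lconvFieldsC89 =
  [ "decimal_point", "thousands_sep", "grouping", "int_curr_symbol"
  , "currency_symbol", "mon_decimal_point", "mon_thousands_sep"
  , "mon_grouping", "positive_sign", "negative_sign", "int_frac_digits"
  , "frac_digits", "p_cs_precedes", "p_sep_by_space", "n_cs_precedes"
  , "n_sep_by_space", "p_sign_posn", "n_sign_posn"
  ]

-- Fields added by C99 (7.11 Localization).
lconvFieldsC99Added :: [String]
lconvFieldsC99Added =
  [ "int_p_cs_precedes", "int_n_cs_precedes", "int_p_sep_by_space"
  , "int_n_sep_by_space", "int_p_sign_posn", "int_n_sign_posn"
  ]

-- Field names expected for struct lconv under a given standard.
lconvFields :: CStandard -> [String]
lconvFields std
  | std >= C99 = lconvFieldsC89 ++ lconvFieldsC99Added
  | otherwise  = lconvFieldsC89
```

A resolver could compare the fields of a parsed struct lconv against `lconvFields` for the selected standard to decide which version of the type applies.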

@TravisCardwell
Collaborator

In general, I wonder how we will resolve LibC type usage during code generation.

Perhaps a first step is mapping C names to LibC types. Perhaps given a C name we can look up a corresponding LibC type. When one exists, perhaps we can compare the implementations to determine if the LibC type should be used or not?

We cannot simply compare the types. For example, we define newtype CInt64 = CInt64 Int64, not wrapping the same typedefs in the C definitions. Furthermore, the C type spellings may differ in different libraries/platforms, as witnessed in the ptrdiff_t discussion. Perhaps comparison of primitive types can be done based on size and alignment, and comparison of other types can (also) compare the types of each field?

We can generate code for types that differ. For example, a C89 version of struct lconv could be generated if we only provide a C99+ version. As long as we always qualify LibC types, we can avoid conflicts even when the types have the same name.
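The size-and-alignment comparison for primitive types could be sketched as follows, using `Storable` on the Haskell side. The names `layoutOf` and `sameLayout` are hypothetical. Note that this check alone ignores signedness and the integral/floating distinction, so it would need to be combined with other information.

```haskell
import Data.Int (Int64)
import Foreign.C.Types (CChar, CInt, CLLong)
import Foreign.Storable (Storable, alignment, sizeOf)

-- (size, alignment) of a Storable type; the argument is never evaluated.
layoutOf :: Storable a => a -> (Int, Int)
layoutOf x = (sizeOf x, alignment x)

-- Two types are layout-compatible here if size and alignment both agree.
sameLayout :: (Storable a, Storable b) => a -> b -> Bool
sameLayout x y = layoutOf x == layoutOf y

-- Example comparisons between Haskell and C-side types.
int64MatchesLongLong :: Bool
int64MatchesLongLong = sameLayout (0 :: Int64) (0 :: CLLong)

charMatchesInt :: Bool
charMatchesInt = sameLayout (0 :: CChar) (0 :: CInt)
```

A check like this sidesteps "spelling" differences such as long versus long long, since only the resulting layout is compared.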

@edsko
Collaborator Author

edsko commented Jan 8, 2025

Brief summary discussion with @TravisCardwell on naming conventions:

  • Where Haskell base types exist, we should re-use them (CInt etc.)
  • For additional types that we need to add, we then have a choice for naming: do we follow the base naming conventions, or the conventions of our default name mangler?
  • It would feel strange to use different conventions within the same module (LibC), so we prefer the former. Additional justification:
    • In an ideal world, these types could eventually be upstreamed to be part of base
    • The C prefix could be interpreted not as "defined in C" but rather "part of the C standard library", justifying why our default name mangler does not insert this prefix.

@TravisCardwell
Collaborator

Discussing in chat, there are changes to what we will put in the LibC module.

We decided to not re-export types from base, because the types defined may change over time. I went ahead and removed the re-exports. Note that we still provide documentation for these types.

We decided to not create newtype wrappers for stdint types. For example, we will use Haskell type Int64 directly for the int64_t C type. This is the opposite of our strategy of always creating wrappers for generated code, but doing so should not be needed for these standard types, and the CApiFFI already does this. I went ahead and removed these types. Note that we still provide documentation for them.
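A minimal sketch of the resulting mapping for the exact-width types, as a plain lookup table. Only the int64_t to Int64 pairing is stated above; the remaining entries are the analogous assumption, and the names `stdintTypes` and `haskellTypeFor` are hypothetical.

```haskell
-- C stdint.h exact-width type names and the Haskell types used directly.
stdintTypes :: [(String, String)]
stdintTypes =
  [ ("int8_t",   "Int8")
  , ("int16_t",  "Int16")
  , ("int32_t",  "Int32")
  , ("int64_t",  "Int64")
  , ("uint8_t",  "Word8")
  , ("uint16_t", "Word16")
  , ("uint32_t", "Word32")
  , ("uint64_t", "Word64")
  ]

-- Look up the Haskell type name for a C type name, if one is known.
haskellTypeFor :: String -> Maybe String
haskellTypeFor cName = lookup cName stdintTypes
```

Names that are not in the table return `Nothing`, which is where the usual strategy of generating a wrapper would take over.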

This was referenced Jan 9, 2025