Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to parse round-trip string objects from ruamel.yaml #36

Open
Erotemic opened this issue May 27, 2024 · 4 comments
Open

Fails to parse round-trip string objects from ruamel.yaml #36

Erotemic opened this issue May 27, 2024 · 4 comments
Labels
question Further information is requested

Comments

@Erotemic
Copy link
Collaborator

Erotemic commented May 27, 2024

Describe the bug

When parsing text from a YAML file in ruamel.YAML it it returns a SingleQuotedScalarString object, which does inherit from the str type. Sending this string to the pure-python lark parser seems to work fine, but when sending it to the cython variant it throws a TypeError.

To Reproduce

The following is a MWE that reproduces the issue:

"""
Requirements:
    pip install ruamel.yaml lark-cython lark
"""
import io
import ruamel.yaml


NEW_RUAMEL = 1


class _YamlRepresenter:

    @staticmethod
    def str_presenter(dumper, data):
        # https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        if len(data.splitlines()) > 1 or '\n' in data:
            text_list = [line.rstrip() for line in data.splitlines()]
            fixed_data = '\n'.join(text_list)
            return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style='|')
        return dumper.represent_scalar('tag:yaml.org,2002:str', data)


def _custom_new_ruaml_yaml_obj():
    """
    References:
        https://stackoverflow.com/questions/59635900/ruamel-yaml-custom-commentedmapping-for-custom-tags
        https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
        https://stackoverflow.com/questions/76870413/using-a-custom-loader-with-ruamel-yaml-0-15-0
    """

    # make a new instance, although you could get the YAML
    # instance from the constructor argument
    class CustomConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        ...

    class CustomRepresenter(ruamel.yaml.representer.RoundTripRepresenter):
        ...

    CustomRepresenter.add_representer(str, _YamlRepresenter.str_presenter)
    yaml_obj = ruamel.yaml.YAML()
    yaml_obj.Constructor = CustomConstructor
    yaml_obj.Representer = CustomRepresenter
    yaml_obj.preserve_quotes = True
    yaml_obj.width = float('inf')
    return yaml_obj


def codeblock(text):
    """
    Create a block of text that preserves all newlines and relative indentation
    """
    import textwrap
    return textwrap.dedent(text).strip('\n')


# For common constructs see:
# https://github.com/lark-parser/lark/blob/master/lark/grammars/common.lark
RESOLUTION_GRAMMAR_PARTS = codeblock(
    '''
    // Resolution parts of the grammar.
    magnitude: NUMBER

    unit: WORD

    numeric_unit: (magnitude WS* unit)
    implicit_unit: unit

    resolved_unit: numeric_unit | implicit_unit

    %import common.NUMBER
    %import common.WS
    %import common.WORD
    ''')

RESOLVED_UNIT_GRAMMAR = codeblock(
    r'''
    // RESOLVED WINDOW GRAMMAR. Eg. 2GSD
    ?start: resolved_unit
    ''') + '\n' + RESOLUTION_GRAMMAR_PARTS


def main():
    yaml_obj = _custom_new_ruaml_yaml_obj()
    file = io.StringIO("{key: '1mGSD'}")
    data = yaml_obj.load(file)
    text = data['key']

    # https://github.com/lark-parser/lark/blob/master/docs/_static/lark_cheatsheet.pdf
    import lark
    try:
        import lark_cython
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr', _plugins=lark_cython.plugins)
    except ImportError:
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr')

    print(f'{type(text)=}')
    print(f'{text.__class__.__mro__=}')

    parser.parse(text)


if __name__ == '__main__':
    """
    CommandLine:
        python ~/code/lark_cython/tests/test_yaml.py
    """
    main()

The type information it prints before it fails is:

type(text)=<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>
text.__class__.__mro__=(<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>, <class 'ruamel.yaml.scalarstring.ScalarString'>, <class 'str'>, <class 'object'>)

I've tested with versions:

3.11.9 (main, May 14 2024, 08:04:54) [GCC 12.2.0]
ruamel.yaml.__version__ = 0.18.6
lark.__version__ = 1.1.9
lark_cython.__version__ = 0.0.15

And

3.11.9 (main, May 13 2024, 14:03:39) [GCC 11.4.0]
ruamel.yaml.__version__ = 0.17.22
lark.__version__ = 1.1.7
lark_cython.__version__ = 0.0.15

My thought is that cython would handle a class that inherits from a str, but perhaps it doesn't I'm not sure if this can be fixed on the lark-cython side, but I figured it was worth reporting.

My current workarond is to do something like this:

        try:
            tree = parser.parse(text)
        except TypeError:
            if isinstance(text, str) and type(text) is not str:
                # We could be in a case where cython is failing to handle
                # overloaded string types. Try casting to a regular str.
                tree = parser.parse(str(text))
            else:
                raise
@erezsh
Copy link
Member

erezsh commented May 27, 2024

Can you include the exception?

Also, why not always use parser.parse(str(text)) ?

@Erotemic
Copy link
Collaborator Author

It's a TypeError from the pyx file:

File ~/.pyenv/versions/3.11.9/envs/pyenv3.11.9/lib/python3.11/site-packages/lark/parser_frontends.py:100, in ParsingFrontend.parse(self, text, start, on_error)
     98 chosen_start = self._verify_start(start)
     99 kw = {} if on_error is None else {'on_error': on_error}
--> 100 stream = self._make_lexer_thread(text)
    101 return self.parser.parse(stream, chosen_start, **kw)

File ~/.pyenv/versions/3.11.9/envs/pyenv3.11.9/lib/python3.11/site-packages/lark/parser_frontends.py:95, in ParsingFrontend._make_lexer_thread(self, text)
     93 def _make_lexer_thread(self, text: str):
     94     cls = (self.options and self.options._plugins.get('LexerThread')) or LexerThread
---> 95     return text if self.skip_lexer else cls.from_text(self.lexer, text)

File lark_cython/lark_cython.pyx:384, in lark_cython.lark_cython.LexerThread.from_text()

TypeError: Argument 'text' has incorrect type (expected str, got SingleQuotedScalarString)

Even though the type I'm passing through is an instance of a class that inherits from str, it is not a base str, and I'm thinking Cython doesn't like that. It could just be that this is a Cython issue, I didn't see anything in the pyx file that looked like it was the implementation checking for an exact str.

The reason not to blindly cast the input to str is because it should raise a TypeError if I give it a bad type. Chances are a bad object would result in a parse error, but there is also a chance it wouldn't, so doing that cast is undesirable.

Even my workaround is undesirable because a class that inherits from str could still define __str__ and mess things up:

class MyStr(str):
    def __str__(self):
        return 'foo'

text = MyStr('hello')
assert text == 'hello'
# Undesirable, but pathological
assert str(text) == 'foo'

Although that second case is fairly pathological.

@erezsh
Copy link
Member

erezsh commented May 27, 2024

Yes, the text is defined as str (for performance reasons), and I suppose Cython doesn't accept objects that inherit from str, since it wants it the actual type to be a simple wchar[] (or whatever).

@Erotemic
Copy link
Collaborator Author

That's my conclusion as well, but I haven't been able to find docs that explicitly state that is the case.

I'm not sure if this needs to be fixed in lark-cython or if it is better left to consumers. The workaround I'm using could be inserted either in lark.parser_frontends or by adding an additional pure-python layer in lark-cython. It's a fairly niche gotcha, and maybe just having this issue exist is good enough so the workaround is googleable.

@erezsh erezsh added the question Further information is requested label Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants