Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Combinator: discard_until / drop_until #1594

Open
Trequetrum opened this issue Dec 22, 2022 · 10 comments · May be fixed by #1790
Open

New Combinator: discard_until / drop_until #1594

Trequetrum opened this issue Dec 22, 2022 · 10 comments · May be fixed by #1790

Comments

@Trequetrum
Copy link

Trequetrum commented Dec 22, 2022

This is a combinator I use all the time, might be useful to see something like it in this crate.

It drops a byte at a time until the given parser matches, then returns the result.

I don't do parsing in any really performance sensative contexts, this can probably be better implemented. This impl demonstrates the idea.

fn drop_until<'a, T>(
    parser: fn(&'a str) -> IResult<&'a str, T>,
) -> impl FnMut(&'a str) -> IResult<&'a str, T> {
    map(many_till(take(1u8), parser), |(_, matched)| matched)
}
@sshine
Copy link

sshine commented Jan 4, 2023

Isn't this equivalent to discarding the output of take_while?

let (s, _) = take_while(p)(s)?;

@Trequetrum
Copy link
Author

Trequetrum commented Jan 4, 2023

Isn't this equivalent to discarding the output of take_while?

I don't fully understand understand how.


Lets say this is our input:

ahdHEahdkjbHELLOlkasjdLLadO

drop_until(tag("HELLO"))(input) 

returns:

OK(("lkasjdLLadO","HELLO"))

I suppose you could use

map(
    pair(
        take_while(not(tag("HELLO")),
        tag("HELLO")
    ),
    |(_, v)| v
)(input)

but is that better? Maybe... though it seems like this matches HELLO twice because of the not(tag("HELLO")) parser.

@sshine
Copy link

sshine commented Jan 5, 2023

I see that drop_until can more easily express some things.

Maybe defining your format in terms of the complement of something is more typical when parsing binary file formats, or when extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

For specifying language grammars, it makes more sense to positively define the thing you're skipping (comments, whitespace, etc.) even if you're just going to discard it. It was with this frame of mind that I assessed the usefulness of drop_until.

@epage
Copy link
Contributor

epage commented Jan 5, 2023

Note that this somewhat parallels the conversation in #1223 / #1566 regarding how much should nom provide common parsers and people drop the output as needed vs provide specialized no-output parsers. One specific case of interest is in the multi module where there are specialized non-allocating variants as the overhead of capturing the output in that case is a lot higher. Note that instead of providing O=() parser variants, they are _count variants, not throwing the data away which is a common pattern in rust APIs.

@Trequetrum
Copy link
Author

Note that this somewhat parallels the conversation in #1223 / #1566

Yeah, I can see that.

I think I'd address this less by the opportunity to tweak performance (as it seems like if your parser isn't allocating, but just returning a slice of the input, there's no performance hit) and more by an appeal to proving a gentle introduction to Nom.

Many users come to nom as a means to replace Regex (fully or in part) as Regex can quickly become unmaintainable as the complexity rises. Generally, regex was never a serious consideration when parcing language grammars, for example. Conceptually, regex is often used to match to some embedded pattern of tokens in a lager context. A way to pull desired information from an otherwise noisy document.

Having a few combinators that are a 1-1 match to this domain makes the first tepid steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

If that's true, they're likely going to try...

[...] extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

@epage
Copy link
Contributor

epage commented Jan 6, 2023

more by an appeal to proving a gentle introduction to Nom.

With clap, a common problem I find is the larger the API is, the more likely people are to not find functionality they need. I feel like nom is on the cusp of that and would hope that nom limits the convenience variants of parsers to help new users with nom.

Many users come to nom as a means to replace Regex (fully or in part) as Regex can quickly become unmaintainable as the complexity rises.
...
Having a few combinators that are a 1-1 match to this domain makes the first tepid steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

For some reason I don't see how this helps with aligning with regex. Maybe enumerating regex features and how you feel they line up with existing or potential parsers would help. That could also be a useful piece of documentation for nom.

clowder added a commit to clowder/pg_logfmt that referenced this issue Apr 21, 2023
Accommodates Rails's default tagged logger.

Nom doesn't support dropping the input[^1], resulting in a useless
vector being allocated for the duration of the parse.

[^1]: rust-bakery/nom#1594
clowder added a commit to clowder/pg_logfmt that referenced this issue Apr 21, 2023
Accommodates Rails's default tagged logger.

Nom doesn't support just dropping the garbage[^1], it's accumulated in a
newly allocated vector.

[^1]: rust-bakery/nom#1594
clowder added a commit to clowder/pg_logfmt that referenced this issue Apr 21, 2023
Accommodates Rails's default tagged logger.

Nom doesn't support just dropping the garbage[^1], it's accumulated in a
newly allocated vector.

[^1]: rust-bakery/nom#1594
@Laura7089
Copy link

I'm interesting in seeing this implemented. I think there's a gap here, as mentioned by @epage - many of the combinators in multi are allocating/_count pairs, but many_till does not have a _count equivalent. It's fairly non-trivial to implement this on one's own because the multiple-application part is terminated by another parser. I think this is worth implementing because of the performance implications of the allocation in many_till.

@Laura7089 Laura7089 linked a pull request Dec 3, 2024 that will close this issue
@adsick
Copy link

adsick commented Dec 10, 2024

When solving this years Advent of Code I decided to use nom for day 3. Essentially I wanted to skip invalid input completely and parse only valid elements. I've not found any easy and straightforward way of doing it in nom.

So I ended up using a terrible solution - iterate over all consecutive substrings one char at a time (parsers were non-overlapping). It worked, but generally it wouldn't work and (obviously) it's not optimal because parsed bytes should be skipped.

@adsick
Copy link

adsick commented Dec 10, 2024

I mean I could and tried to use preceded(not(pattern), pattern) or something similar with many combinator, but that didn't work.

@AlexanderStein
Copy link

AlexanderStein commented Dec 10, 2024

When solving this years Advent of Code I decided to use nom for day 3. Essentially I wanted to skip invalid input completely and parse only valid elements. I've not found any easy and straightforward way of doing it in nom.

So I ended up using a terrible solution - iterate over all consecutive substrings one char at a time (parsers were non-overlapping). It worked, but generally it wouldn't work and (obviously) it's not optimal because parsed bytes should be skipped.

Funny I stumbled accross your comment for the exact same "problem" ;-)
I found a solution only using nom's features. many_till is the important part:

#[derive(Debug, PartialEq)]
struct Operands {
    a: u32,
    b: u32,
}

fn decimal(input: &str) -> IResult<&str, &str> {
    recognize(many1(terminated(one_of("0123456789"), many0(char('_'))))).parse(input)
}

fn parse_operand(input: &str) -> IResult<&str, (&str, &str)> {
        separated_pair(decimal, tag(","), decimal)(input)
}

fn parse_mul(input: &str) -> IResult<&str, Vec<Operands>> {
    map(many1(
        map(many_till(
            take(1_usize),
            delimited(tag("mul("), parse_operand, tag(")"))
        ),|(_, result)| {
            result
        })
    ), |operands| {
        operands.iter().map(|oper| {
            let (left, right) = *oper;
            let a = left.parse::<u32>().unwrap();
            let b = right.parse::<u32>().unwrap();
            Operands{ a, b }
        }).collect()
    })(input)
}

Feel free to use. MIT licensed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants