New Combinator: discard_until / drop_until #1594

Trequetrum · 2022-12-22T23:19:22Z

This is a combinator I use all the time, might be useful to see something like it in this crate.

It drops a byte at a time until the given parser matches, then returns the result.

I don't do parsing in any really performance sensative contexts, this can probably be better implemented. This impl demonstrates the idea.

fn drop_until<'a, T>(
    parser: fn(&'a str) -> IResult<&'a str, T>,
) -> impl FnMut(&'a str) -> IResult<&'a str, T> {
    map(many_till(take(1u8), parser), |(_, matched)| matched)
}

sshine · 2023-01-04T08:31:15Z

Isn't this equivalent to discarding the output of take_while?

let (s, _) = take_while(p)(s)?;

Trequetrum · 2023-01-04T16:09:07Z

Isn't this equivalent to discarding the output of take_while?

I don't fully understand understand how.

Lets say this is our input:

ahdHEahdkjbHELLOlkasjdLLadO

drop_until(tag("HELLO"))(input)

returns:

OK(("lkasjdLLadO","HELLO"))

I suppose you could use

map(
    pair(
        take_while(not(tag("HELLO")),
        tag("HELLO")
    ),
    |(_, v)| v
)(input)

but is that better? Maybe... though it seems like this matches HELLO twice because of the not(tag("HELLO")) parser.

sshine · 2023-01-05T09:38:56Z

I see that drop_until can more easily express some things.

Maybe defining your format in terms of the complement of something is more typical when parsing binary file formats, or when extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

For specifying language grammars, it makes more sense to positively define the thing you're skipping (comments, whitespace, etc.) even if you're just going to discard it. It was with this frame of mind that I assessed the usefulness of drop_until.

epage · 2023-01-05T23:02:46Z

Note that this somewhat parallels the conversation in #1223 / #1566 regarding how much should nom provide common parsers and people drop the output as needed vs provide specialized no-output parsers. One specific case of interest is in the multi module where there are specialized non-allocating variants as the overhead of capturing the output in that case is a lot higher. Note that instead of providing O=() parser variants, they are _count variants, not throwing the data away which is a common pattern in rust APIs.

Trequetrum · 2023-01-06T19:51:52Z

Note that this somewhat parallels the conversation in #1223 / #1566

Yeah, I can see that.

I think I'd address this less by the opportunity to tweak performance (as it seems like if your parser isn't allocating, but just returning a slice of the input, there's no performance hit) and more by an appeal to proving a gentle introduction to Nom.

Many users come to nom as a means to replace Regex (fully or in part) as Regex can quickly become unmaintainable as the complexity rises. Generally, regex was never a serious consideration when parcing language grammars, for example. Conceptually, regex is often used to match to some embedded pattern of tokens in a lager context. A way to pull desired information from an otherwise noisy document.

Having a few combinators that are a 1-1 match to this domain makes the first tepid steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

If that's true, they're likely going to try...

[...] extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

epage · 2023-01-06T20:05:36Z

more by an appeal to proving a gentle introduction to Nom.

With clap, a common problem I find is the larger the API is, the more likely people are to not find functionality they need. I feel like nom is on the cusp of that and would hope that nom limits the convenience variants of parsers to help new users with nom.

Many users come to nom as a means to replace Regex (fully or in part) as Regex can quickly become unmaintainable as the complexity rises.
...
Having a few combinators that are a 1-1 match to this domain makes the first tepid steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

For some reason I don't see how this helps with aligning with regex. Maybe enumerating regex features and how you feel they line up with existing or potential parsers would help. That could also be a useful piece of documentation for nom.

Accommodates Rails's default tagged logger. Nom doesn't support dropping the input[^1], resulting in a useless vector being allocated for the duration of the parse. [^1]: rust-bakery/nom#1594

Accommodates Rails's default tagged logger. Nom doesn't support just dropping the garbage[^1], it's accumulated in a newly allocated vector. [^1]: rust-bakery/nom#1594

Laura7089 · 2024-12-03T14:55:53Z

I'm interesting in seeing this implemented. I think there's a gap here, as mentioned by @epage - many of the combinators in multi are allocating/_count pairs, but many_till does not have a _count equivalent. It's fairly non-trivial to implement this on one's own because the multiple-application part is terminated by another parser. I think this is worth implementing because of the performance implications of the allocation in many_till.

adsick · 2024-12-10T13:35:07Z

When solving this years Advent of Code I decided to use nom for day 3. Essentially I wanted to skip invalid input completely and parse only valid elements. I've not found any easy and straightforward way of doing it in nom.

So I ended up using a terrible solution - iterate over all consecutive substrings one char at a time (parsers were non-overlapping). It worked, but generally it wouldn't work and (obviously) it's not optimal because parsed bytes should be skipped.

adsick · 2024-12-10T13:37:04Z

I mean I could and tried to use preceded(not(pattern), pattern) or something similar with many combinator, but that didn't work.

AlexanderStein · 2024-12-10T16:50:45Z

When solving this years Advent of Code I decided to use nom for day 3. Essentially I wanted to skip invalid input completely and parse only valid elements. I've not found any easy and straightforward way of doing it in nom.

So I ended up using a terrible solution - iterate over all consecutive substrings one char at a time (parsers were non-overlapping). It worked, but generally it wouldn't work and (obviously) it's not optimal because parsed bytes should be skipped.

Funny I stumbled accross your comment for the exact same "problem" ;-)
I found a solution only using nom's features. many_till is the important part:

#[derive(Debug, PartialEq)]
struct Operands {
    a: u32,
    b: u32,
}

fn decimal(input: &str) -> IResult<&str, &str> {
    recognize(many1(terminated(one_of("0123456789"), many0(char('_'))))).parse(input)
}

fn parse_operand(input: &str) -> IResult<&str, (&str, &str)> {
        separated_pair(decimal, tag(","), decimal)(input)
}

fn parse_mul(input: &str) -> IResult<&str, Vec<Operands>> {
    map(many1(
        map(many_till(
            take(1_usize),
            delimited(tag("mul("), parse_operand, tag(")"))
        ),|(_, result)| {
            result
        })
    ), |operands| {
        operands.iter().map(|oper| {
            let (left, right) = *oper;
            let a = left.parse::<u32>().unwrap();
            let b = right.parse::<u32>().unwrap();
            Operands{ a, b }
        }).collect()
    })(input)
}

Feel free to use. MIT licensed

clowder added a commit to clowder/pg_logfmt that referenced this issue Apr 21, 2023

Tolerate garbage preceding logfmt

2a45b7e

Accommodates Rails's default tagged logger. Nom doesn't support just dropping the garbage[^1], it's accumulated in a newly allocated vector. [^1]: rust-bakery/nom#1594

clowder added a commit to clowder/pg_logfmt that referenced this issue Apr 21, 2023

Tolerate garbage preceding logfmt

bb18a60

Accommodates Rails's default tagged logger. Nom doesn't support just dropping the garbage[^1], it's accumulated in a newly allocated vector. [^1]: rust-bakery/nom#1594

Laura7089 linked a pull request Dec 3, 2024 that will close this issue

feat: multi::many_till_count combinator #1790

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New Combinator: discard_until / drop_until #1594

New Combinator: discard_until / drop_until #1594

Trequetrum commented Dec 22, 2022 •

edited

Loading

sshine commented Jan 4, 2023

Trequetrum commented Jan 4, 2023 •

edited

Loading

sshine commented Jan 5, 2023

epage commented Jan 5, 2023

Trequetrum commented Jan 6, 2023

epage commented Jan 6, 2023

Laura7089 commented Dec 3, 2024

adsick commented Dec 10, 2024

adsick commented Dec 10, 2024

AlexanderStein commented Dec 10, 2024 •

edited

Loading

New Combinator: discard_until / drop_until #1594

New Combinator: discard_until / drop_until #1594

Comments

Trequetrum commented Dec 22, 2022 • edited Loading

sshine commented Jan 4, 2023

Trequetrum commented Jan 4, 2023 • edited Loading

sshine commented Jan 5, 2023

epage commented Jan 5, 2023

Trequetrum commented Jan 6, 2023

epage commented Jan 6, 2023

Laura7089 commented Dec 3, 2024

adsick commented Dec 10, 2024

adsick commented Dec 10, 2024

AlexanderStein commented Dec 10, 2024 • edited Loading

Trequetrum commented Dec 22, 2022 •

edited

Loading

Trequetrum commented Jan 4, 2023 •

edited

Loading

AlexanderStein commented Dec 10, 2024 •

edited

Loading