Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): Add parse_auditd vrl function #1010

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

jorgehermo9
Copy link
Contributor

@jorgehermo9 jorgehermo9 commented Aug 30, 2024

Closes #66

I have a few things left (that I'm aware of):

I added a few TODOs in the code for some things I had doubts. I'll appreciate feedback on those.

Currently, logs are parsed into something similar to

{
    "body": {
        "auid": 1000,
        "format": "enriched",
        "kernel": "6.10.4-arch2-1",
        "op": "start",
        "pid": 1240242,
        "res": "success",
        "ses": 2,
        "uid": 0,
        "ver": "4.0.2"
    },
    "enrichment": {
        "AUID": "vrl",
        "UID": "root"
    },
    "sequence": 6439,
    "timestamp": "2024-08-23T14:27:54.618Z",
    "type": "DAEMON_START"
}

Which is a bit more structured from what has been proposed in #66, but I'm open to any suggestion about this format

@jorgehermo9 jorgehermo9 marked this pull request as ready for review August 30, 2024 18:10
@jszwedko jszwedko marked this pull request as draft September 5, 2024 23:46
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I'm happy to see this work started as it is an oft-requested feature. I left a few comments on the TODOs. I also marked this as draft for now until it is ready for final review. It seems like the biggest blocker at the moment is the licensing of the linux-audit-parser-rs crate.

fn parse_auditd(bytes: Value) -> Resolved {
let bytes = bytes.try_bytes()?;
// check if bytes ends with newline, otherwise append it
// TODO: make the parser accept bytes without newline in the linux_audit_parser crate (to-be-contributed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it seems like the parser should work without trailing newlines.

Copy link
Contributor Author

@jorgehermo9 jorgehermo9 Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +25 to +31
// Should we use UTC as the timezone? The logs are generated with the system's
// timezone... What is the correct behavior? Maybe the system where vector is running
// have a different timezone than the system that generated the logs... so it is
// not correct to assume that the current system's timezone is the correct one
// (with TimeZone::timestamp_millis_opt)
// TODO: we should document that the timestamp is parsed into UTC timezone.
// TODO: Maybe we should accept in a parameter a custom timezone to parse the timestamp?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rely on the timezone that is injected into the VRL runtime. See an example in the parse_syslog function:

pub(crate) fn parse_syslog(value: Value, ctx: &Context) -> Resolved {
let message = value.try_bytes_utf8_lossy()?;
let timezone = match ctx.timezone() {
TimeZone::Local => None,
TimeZone::Named(tz) => Some(*tz),
};
let parsed = syslog_loose::parse_message_with_year_exact_tz(
&message,
resolve_year,
timezone,
Variant::Either,
)?;
Ok(message_to_value(parsed))
}

This timezone is configurable in Vector (e.g. on the remap transform: https://vector.dev/docs/reference/configuration/transforms/remap/#timezone).

let (enrichment, body): (ObjectMap, ObjectMap) = parsed
.body
.into_iter()
// TODO: improve this auditd crate with a IntoIter implementation for Body and not only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed 👍

Copy link
Contributor Author

@jorgehermo9 jorgehermo9 Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +118 to +119
// TODO: should we store hexadecimals as its integer value or as an hexadecimal string?
// TODO: Uppercase hexa or lowercase hex format?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Do you happen to know how users would typically want to consume these values or what fields typically contain them? I think the answer to this question likely depends on what sorts of operations users need to do on these values (if any).

Copy link
Contributor Author

@jorgehermo9 jorgehermo9 Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know...In the raw log lines this is shown as a0=55b26d44a6a0 so maybe it makes sense to display it as "a0": "0x55b26d44a6a0" in the output value in my opinion. But I would love to know what requesters think about this...

@jorgehermo9
Copy link
Contributor Author

Hi @jszwedko, I'm planning to address the review's comments now that the licensing issue was discussed.

How can we address the feature-flag thing to comply with the license as you depicted in hillu/linux-audit-parser-rs#1 (comment)? I've added it as an optional dependency in this PR and added to the stdlib feature flag

vrl/Cargo.toml

Line 87 in 95d6951

"dep:linux-audit-parser",
but I don't think it is enough for what you said in that comment. What solution do you propose?

"grantors" => "pam_unix,pam_permit,pam_time",
"acct" => "vrl",
"exe" => "/usr/bin/sudo",
"hostname" => Value::Null,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fields that are missing in the audit log (for example, hostname="?") are parsed into Null values.

They are shown like this in vector:

{
    "body": {
        "auid": 4294967295,
        "msg": {
            "addr": null,
            "comm": "systemd",
            "exe": "/usr/lib/systemd/systemd",
            "hostname": null,
            "res": "success",
            "terminal": null,
            "unit": "wpa_supplicant"
        },
        "pid": 1,
        "ses": 4294967295,
        "uid": 0
    },
    "enrichment": {
        "AUID": "unset",
        "UID": "root"
    },
    "sequence": 165,
    "timestamp": "2024-08-30T18:14:22.240Z",
    "type": "SERVICE_STOP"
}

Should we filter out that fields, or leave the entry with a null value, so people know that it is present but with no value?

log.insert("body".into(), Value::from(body));

// TODO: should we downcase the keys of the enrichment so that they match its body counterparts?
if !enrichment.is_empty() {
Copy link
Contributor Author

@jorgehermo9 jorgehermo9 Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In auditd, the association between fields and its enriched value is done with capital letters (see this documentation).

But that limitation is because the log format is plain and it really seems like a workaround.

As we can have a more structured representation and the enrichment fields are separated in another map, maybe we can downcase the keys and have something like this instead:

(deleted some object keys to simplify the example)

{
    "body": {
        "auid": 4294967295,
        "uid": 0
    },
    "enrichment": {
        "auid": "unset",
        "uid": "root"
    }
}

so people can link the body fields and the enrichment fields easily, as this way, they enriched counterparts will exactly match the raw keys, instead of capitalizing the keys before looking them up in the enrichment.

What do you think? I don't really know how people would really consume this so I would appreciate some feedback from people who would actually use this (#66 requesters)

@jorgehermo9
Copy link
Contributor Author

jorgehermo9 commented Oct 21, 2024

hillu/linux-audit-parser-rs@8f39651 Added support for nested parsing the msg filed. Now the events are parsed into something similar to

{
    "body": {
        "auid": 4294967295,
        "msg": {
            "addr": null,
            "comm": "systemd",
            "exe": "/usr/lib/systemd/systemd",
            "hostname": null,
            "res": "success",
            "terminal": null,
            "unit": "systemd-update-utmp"
        },
        "pid": 1,
        "ses": 4294967295,
        "uid": 0
    },
    "enrichment": {
        "AUID": "unset",
        "UID": "root"
    },
    "sequence": 174,
    "timestamp": "2024-08-30T18:14:22.454Z",
    "type": "SERVICE_STOP"
}

note that the msg field is now more similar to what is requested in #66

Tested everything with this config

[sources.file]
type = "file"
include = ["/var/log/audit/audit.log"]
data_dir = "./data"

[transforms.auditd]
type = "remap"
inputs = ["file"]
source = """
. = parse_auditd!(.message)
"""

[sinks.console]
type = "console"
inputs = ["auditd"]
encoding.codec = "json"

And it parsed a bunch of auditd logs with no errors at all! (in order to test it, you would need to enable the auditd systemd service)

})
.collect::<Result<ObjectMap, _>>()?,
),
// There are a few values that `linux-audit-parser` does not return in its parsing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jszwedko
Copy link
Member

Hi @jszwedko, I'm planning to address the review's comments now that the licensing issue was discussed.

How can we address the feature-flag thing to comply with the license as you depicted in hillu/linux-audit-parser-rs#1 (comment)? I've added it as an optional dependency in this PR and added to the stdlib feature flag

vrl/Cargo.toml

Line 87 in 95d6951

"dep:linux-audit-parser",

but I don't think it is enough for what you said in that comment. What solution do you propose?

Hey! Thanks for raising this again. After thinking about it some more, I'd prefer to avoid including LGPL dependencies in Vector, even as optional, since they run the risk of someone accidentally including them without realizing it (or us expanding usage without realizing it). I'd (unfortunately 🙁) prefer to see us find another dependency to parse auditd logs. We could even just leverage the existing support for grok parsing by using specific patterns to parse auditd logs. Example: https://gist.github.com/artbikes/2313040

@jszwedko
Copy link
Member

Apologies, I know this has been a bit of a long road to get to this point 😓 I'm curious what you think.

@jorgehermo9
Copy link
Contributor Author

Hi @jszwedko !

I'd prefer to avoid including LGPL dependencies in Vector, even as optional, since they run the risk of someone accidentally including them without realizing it

I totally agree with that. It also was one of my main concerns about this...

prefer to see us find another dependency to parse auditd logs

I don't think there is something in rust yet. If I have enough time, I may end up writing my own parser with a MPL-compatible license. Maybe I can inspire from https://github.com/elastic/go-libaudit/blob/main/auparse/auparse.go as It is licensed as Apache2.0 and I think rewriting it (or inspiring from it) in rust could be feasible.

However, this may delay things a little bit, but I'll give it a try when I find enough time.

Apologies, I know this has been a bit of a long road to get to this point 😓 I'm curious what you think.
Don't worry! Almost all VRL-related work is already done and it won't change a lot just by changing the underlying parser library. I think the same as you and I totally understand it.

Thanks! I may give a bump about this in a few weeks/month if I manage to implement the parser, but pausing it the moment.

@jszwedko
Copy link
Member

@jorgehermo9 thanks for understanding. I do think we might be able to get by with grok parsing instead of a native Rust parser. It'll be less efficient, but might be easier to get going. Worth considering anyway :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New parser or source for Linux audit logs
2 participants