feat(stdlib): Add `parse_auditd` vrl function #1010
Conversation
Awesome! I'm happy to see this work started, as it is an oft-requested feature. I left a few comments on the TODOs. I also marked this as a draft for now until it is ready for final review. It seems like the biggest blocker at the moment is the licensing of the linux-audit-parser-rs crate.
```rust
fn parse_auditd(bytes: Value) -> Resolved {
    let bytes = bytes.try_bytes()?;
    // check if bytes ends with a newline, otherwise append it
    // TODO: make the parser accept bytes without a newline in the linux_audit_parser crate (to-be-contributed)
```
Agreed, it seems like the parser should work without trailing newlines.
depends on hillu/linux-audit-parser-rs#3
```rust
// Should we use UTC as the timezone? The logs are generated with the system's
// timezone... What is the correct behavior? Maybe the system where Vector is running
// has a different timezone than the system that generated the logs, so it is
// not correct to assume that the current system's timezone is the correct one
// (with TimeZone::timestamp_millis_opt).
// TODO: we should document that the timestamp is parsed into the UTC timezone.
// TODO: Maybe we should accept a custom timezone as a parameter to parse the timestamp?
```
I think we should rely on the timezone that is injected into the VRL runtime. See an example in the parse_syslog function:
vrl/src/stdlib/parse_syslog.rs
Lines 6 to 19 in c6f22a1
```rust
pub(crate) fn parse_syslog(value: Value, ctx: &Context) -> Resolved {
    let message = value.try_bytes_utf8_lossy()?;
    let timezone = match ctx.timezone() {
        TimeZone::Local => None,
        TimeZone::Named(tz) => Some(*tz),
    };
    let parsed = syslog_loose::parse_message_with_year_exact_tz(
        &message,
        resolve_year,
        timezone,
        Variant::Either,
    )?;
    Ok(message_to_value(parsed))
}
```
This timezone is configurable in Vector (e.g. on the remap transform: https://vector.dev/docs/reference/configuration/transforms/remap/#timezone).
```rust
let (enrichment, body): (ObjectMap, ObjectMap) = parsed
    .body
    .into_iter()
    // TODO: improve this auditd crate with an IntoIter implementation for Body and not only
```
Agreed 👍
depends on hillu/linux-audit-parser-rs#2
```rust
// TODO: should we store hexadecimals as their integer value or as a hexadecimal string?
// TODO: uppercase or lowercase hex format?
```
Good question. Do you happen to know how users would typically want to consume these values or what fields typically contain them? I think the answer to this question likely depends on what sorts of operations users need to do on these values (if any).
I don't really know... In the raw log lines this is shown as `a0=55b26d44a6a0`, so maybe it makes sense to display it as `"a0": "0x55b26d44a6a0"` in the output value, in my opinion. But I would love to know what requesters think about this...
Hi @jszwedko, I'm planning to address the review comments now that the licensing issue has been discussed. How can we address the feature-flag thing to comply with the license, as you depicted in hillu/linux-audit-parser-rs#1 (comment)? I've added it as an optional dependency in this PR and added it to the stdlib feature flag (Line 87 in 95d6951).
```rust
"grantors" => "pam_unix,pam_permit,pam_time",
"acct" => "vrl",
"exe" => "/usr/bin/sudo",
"hostname" => Value::Null,
```
Fields that are missing in the audit log (for example, `hostname="?"`) are parsed into `Null` values. They are shown like this in Vector:
```json
{
  "body": {
    "auid": 4294967295,
    "msg": {
      "addr": null,
      "comm": "systemd",
      "exe": "/usr/lib/systemd/systemd",
      "hostname": null,
      "res": "success",
      "terminal": null,
      "unit": "wpa_supplicant"
    },
    "pid": 1,
    "ses": 4294967295,
    "uid": 0
  },
  "enrichment": {
    "AUID": "unset",
    "UID": "root"
  },
  "sequence": 165,
  "timestamp": "2024-08-30T18:14:22.240Z",
  "type": "SERVICE_STOP"
}
```
Should we filter out those fields, or leave the entry with a null value so people know that it is present but with no value?
```rust
log.insert("body".into(), Value::from(body));

// TODO: should we downcase the keys of the enrichment so that they match their body counterparts?
if !enrichment.is_empty() {
```
In auditd, the association between fields and their enriched values is done with capital letters (see this documentation).
But that limitation exists because the log format is plain, and it really seems like a workaround.
As we can have a more structured representation and the enrichment fields are separated into another map, maybe we can downcase the keys and have something like this instead:
(deleted some object keys to simplify the example)
```json
{
  "body": {
    "auid": 4294967295,
    "uid": 0
  },
  "enrichment": {
    "auid": "unset",
    "uid": "root"
  }
}
```
so people can link the body fields and the enrichment fields easily; this way, the enriched counterparts exactly match the raw keys, instead of requiring users to capitalize the keys before looking them up in the enrichment.
What do you think? I don't really know how people would consume this, so I would appreciate some feedback from people who would actually use this (#66 requesters).
hillu/linux-audit-parser-rs@8f39651 added support for nested parsing. Logs are now parsed into:

```json
{
  "body": {
    "auid": 4294967295,
    "msg": {
      "addr": null,
      "comm": "systemd",
      "exe": "/usr/lib/systemd/systemd",
      "hostname": null,
      "res": "success",
      "terminal": null,
      "unit": "systemd-update-utmp"
    },
    "pid": 1,
    "ses": 4294967295,
    "uid": 0
  },
  "enrichment": {
    "AUID": "unset",
    "UID": "root"
  },
  "sequence": 174,
  "timestamp": "2024-08-30T18:14:22.454Z",
  "type": "SERVICE_STOP"
}
```

Note that the `msg` field is now more similar to what is requested in #66.

Tested everything with this config:

```toml
[sources.file]
type = "file"
include = ["/var/log/audit/audit.log"]
data_dir = "./data"

[transforms.auditd]
type = "remap"
inputs = ["file"]
source = """
. = parse_auditd!(.message)
"""

[sinks.console]
type = "console"
inputs = ["auditd"]
encoding.codec = "json"
```

And it parsed a bunch of auditd logs with no errors at all! (in order to test it, you would need to enable the
```rust
})
.collect::<Result<ObjectMap, _>>()?,
),
// There are a few values that `linux-audit-parser` does not return in its parsing
```
depends on hillu/linux-audit-parser-rs#4
Hey! Thanks for raising this again. After thinking about it some more, I'd prefer to avoid including LGPL dependencies in Vector, even as optional ones, since they run the risk of someone accidentally including them without realizing it (or us expanding usage without realizing it). I'd (unfortunately 🙁) prefer to see us find another dependency to parse auditd logs. We could even just leverage the existing support for grok parsing by using specific patterns to parse auditd logs. Example: https://gist.github.com/artbikes/2313040

Apologies, I know this has been a bit of a long road to get to this point 😓 I'm curious what you think.
Hi @jszwedko!

I totally agree with that. It also was one of my main concerns about this...

I don't think there is anything in Rust yet. If I have enough time, I may end up writing my own parser with an MPL-compatible license. Maybe I can draw inspiration from https://github.com/elastic/go-libaudit/blob/main/auparse/auparse.go, as it is licensed as Apache-2.0 and I think rewriting it (or taking inspiration from it) in Rust could be feasible. However, this may delay things a little bit, but I'll give it a try when I find enough time.

Thanks! I may give this a bump in a few weeks/months if I manage to implement the parser, but I'm pausing it for the moment.
@jorgehermo9 thanks for understanding. I do think we might be able to get by with grok parsing instead of a native Rust parser. It'll be less efficient, but might be easier to get going. Worth considering anyway :)
Closes #66

I have a few things left (that I'm aware of):

- The linux-auditd-parse crate should have a license compatible with Vector (see "Issues with the chosen license (LGPL3+)", hillu/linux-audit-parser-rs#1)
- Contribute to linux-auditd-parse for a few things that are left as TODOs

I added a few TODOs in the code for some things I had doubts about. I'd appreciate feedback on those.

Currently, logs are parsed into something similar to

Which is a bit more structured than what has been proposed in #66, but I'm open to any suggestion about this format.