dynamic: add "span of calls" scope #2532
CHANGELOG updated or no update needed, thanks! 😄
we also may want to update the vverbose render to only show each call event once, leaving the match details to a separate section, maybe like:
@jorik-utwente FYI
I realize I dropped this PR without much warning 😇 I went from "I wonder how this would work" to "huh, it seems to work OK" pretty quickly.
awesome, this looks very promising already!!
major things to discuss include the naming and potentially handling of loops
CHANGELOG.md
Outdated
@@ -4,6 +4,8 @@

### New Features

- add dynamic sequence scope for matching nearby calls within a thread #2532 @williballenthin
naming alternatives to sequence (matching occurs in any order): span, ngram, group/cluster
+1 cluster
"window", "slice", "range"
math: multiset (or bag, or mset) - https://en.wikipedia.org/wiki/Multiset
- multiple instances of same object
- order doesn't matter
optionally prefix with "call", e.g., callbag, callcluster?
To summarize: I don't think we should use the term "sequence" because it implies that the order of the events matters. capa doesn't match with any care for the order of API calls, so we don't want users to think they can rely on that.
Some reasonable alternatives:
- span
- group
- cluster
- window
- range
Other terms, which work, but are more technical/jargon:
- ngram
- multiset
- bag
As mentioned by @mr-tz, we can (should?) use a prefix, like "call span" or "call range".
I think I most prefer "range" and "span".
The candidates "call range" or "call span" make it seem like the range/span are characteristics of a particular call, rather than a collection of calls. Therefore, maybe we should use "range of calls" or "span of calls" within the rule text and documentation.
So I'd propose: "range of calls"
(in the future, if we supported configurable sequence sizes, we could make the name like: "range of 20 calls" which is fairly nice.)
I'm going to update the PR with the proposed new name here, but I would very much like feedback @mike-hunhoff @mr-tz @fariss @yelhamer and anyone else.
"range of calls" is a good name for this new scope. It makes the intention clear and, as mentioned, can be easily expanded in the future, e.g. "range of 20 calls".
so i lost this thread (i had a link below that stopped working and thought GH deleted it) and in the interim made a guess at what i had just concluded and renamed things "span of calls". does that work? or do you think it's worthwhile to swap over to "range"?
No worries, I meant "span" but it came out "range" because I had just finished reading your comments above and it was on my mind 😅
The definition of "span" works great for this scope:
the full extent of something from end to end; the amount of space that something covers.
"a warehouse with a clear span of 28 feet"
So no changes needed from my perspective
Good point. I think we'd want to see how this works in practice against a large number of samples and the rules we can translate to use this construct. In particular, loops (like you say) such as you'd see in ransomware.
Great work, I'm excited about where this is going for an initial implementation. I echo a few of @mr-tz's comments/concerns. Additionally, the value 5 comes close to being too small for some of our existing rules, e.g. https://github.com/mandiant/capa-rules/blob/e033410c8910f8b46718a5eefd9f0c7768be1b99/communication/c2/shell/create-reverse-shell.yml#L19-L23 so we'll need to do some additional work to find the sweet spot.
I spent a few moments focusing on the core extension here and added some places for additional documentation.
computing the features for the sequence, which involves merging features from many calls, seems to take quite a bit of time: i'll have to think on whether there's a creative way to optimize this
profile information
before: sequence length: 20
before: sequence length: 0
optimized, sequence length 1 and 20:
conclusion:
So, there's a bit of overhead to use this new algorithm, but it's independent of SEQUENCE_LENGTH, which is desirable.
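The comment above does not show the optimized algorithm, so the following is only a sketch of one way to make the per-call cost independent of the window length: keep a running count of how many calls in the window carry each feature, and adjust the counts as calls enter and leave. The class name `SpanFeatureTracker` and the representation of features as plain hashable values are assumptions for illustration, not capa's actual types, and this is not necessarily what the PR implements.

```python
from collections import Counter, deque


class SpanFeatureTracker:
    """Track which features are present in the trailing `span_length` calls.

    Hypothetical helper: the work per call depends on the number of features
    in the entering and leaving calls, not on `span_length`.
    """

    def __init__(self, span_length):
        self.span_length = span_length
        self.window = deque()    # feature sets of the calls currently in the span
        self.counts = Counter()  # feature -> number of calls in the span that have it

    def push(self, features):
        """Add the next call's features and return the features now in the span."""
        features = frozenset(features)
        self.window.append(features)
        self.counts.update(features)
        if len(self.window) > self.span_length:
            # the oldest call leaves the span: decrement its features
            for feature in self.window.popleft():
                self.counts[feature] -= 1
                if self.counts[feature] == 0:
                    del self.counts[feature]
        return self.counts.keys()
```

The trade-off in this sketch is extra bookkeeping per call in exchange for never re-merging the whole window, which would be consistent with the "bit of overhead" noted in the conclusion.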
TODO?!
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
to ensure it's not modified by reference after we expect it to be
addresses discussion in mandiant/capa-rules#951
pep8
sequence: add test showing multiple sequences overlapping a single event
also, for repeating behavior, match only the first instance.
sequence: add more tests
contains the call ids for all the calls within the sequence, so we know where to look for related matches.
sequence: refactor SequenceMatcher
sequence: don't use sequence addresses
sequence: remove sequence address
pep8 fix ref update submodules update testfiles submodule duplicate variable
def render_span_of_calls(layout: rd.DynamicLayout, addrs: list[frz.Address]) -> str:
    calls: list[capa.features.address.DynamicCallAddress] = [addr.to_capa() for addr in addrs]  # type: ignore
    for call in calls:
        assert isinstance(call, capa.features.address.DynamicCallAddress)

    pname = _get_process_name(layout, frz.Address.from_capa(calls[0].thread.process))
    call_ids = [str(call.id) for call in calls]
    return f"{pname}{{pid:{call.thread.process.pid},tid:{call.thread.tid},calls:{{{','.join(call_ids)}}}}}"
I'm seeing incorrect results for the call list, e.g. in the following output there is only one call displayed but four call ids are listed:
$ python -m capa.main tests/data/dynamic/vmray/2f8a79b12a7a989ac7e5f6ec65050036588a92e65aeb6841e08dc228ff0e21b4_min_archive.zip -vv
[...]
capture screenshot
namespace collection/screenshot
author [email protected], @_re_fox, [email protected]
scope span of calls
att&ck Collection::Screen Capture [T1113]
mbc Collection::Screen Capture::WinAPI [E1113.m01]
span of calls @ mulvpilibfy.exe (C:\Users\8qy2SK\Desktop\mulvpilibfy.exe){pid:7104,tid:7108,calls:{36462,36465,37084,37146}}
  or:
    call:
      and:
        api: BitBlt @ mulvpilibfy.exe (C:\Users\8qy2SK\Desktop\mulvpilibfy.exe){pid:7104,tid:7108,call:37146}
          BitBlt(
            hdc: 0x2a010781,
            x: 0,
            y: 0,
            cx: 1440,
            cy: 900,
            hdcSrc: 0x4d010784,
            x1: 0,
            y1: 0,
            rop: 0xcc0020,
          ) -> ret_val: 1
[...]
And I've encountered other instances where multiple call matches are displayed but only the call id of the last match displayed is listed.
we should first ensure the node evaluated to true before collecting from the children.
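To make the suggestion concrete, here is a minimal sketch of collecting call ids from a match result tree while skipping any subtree that did not evaluate to true. `ResultNode` and `collect_matched_call_ids` are hypothetical names invented for this example and do not mirror capa's result document types; the sketch also ignores `not` statements, whose children would need separate handling.

```python
from dataclasses import dataclass, field


@dataclass
class ResultNode:
    """Hypothetical stand-in for one node of a rule match result tree."""

    success: bool
    call_ids: list = field(default_factory=list)  # calls matched at this node
    children: list = field(default_factory=list)


def collect_matched_call_ids(node):
    """Gather call ids only from subtrees that evaluated to true."""
    if not node.success:
        # prune: a failed branch contributes nothing, even if some of its
        # children matched individual calls
        return set()
    ids = set(node.call_ids)
    for child in node.children:
        ids |= collect_matched_call_ids(child)
    return ids
```

With a tree shaped like the screenshot example, where one `or` branch failed after partially matching calls 36462, 36465 and 37084, and the other branch succeeded on call 37146, this would report only 37146.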
nice example - false negative for the more specific branch GetDC/BitBlt/CreateCompatibleDC.
maybe we need to:
- add DISPLAY* to CreateDC
- add Gdip routines (GdipCreateBitmapFromScan0, GdipGetImageGraphicsContext, GdipGetDC)
This PR implements the dynamic "span of calls" scope introduced here: mandiant/capa-rules#951
In summary, we want a way to match across calls (in dynamic mode) without resorting to the entire thread (which may be very long, like thousands of events). So, we add a new scope "span of calls" that represents the sliding 20-tuples of calls across each thread. Rules can match against any set of logic within each of these 20-tuples.
For example, consider the initial behavior of thread 3064 in our test CAPE file 0000a657:
This is a long thread with many calls, so until now it was tough to write a rule for any behavior that spans multiple calls without introducing false positives. Consider matching on the dynamic resolution and invocation of AddVectoredExceptionHandler. Now we can write a rule like:
So, within a region of 20 calls, match all this logic.
Here's what the output looks like:
The implementation is pretty easy: maintain a deque of the trailing 20 call events, merging and matching those features.
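As a rough illustration of that idea (not the PR's actual code), the sketch below keeps a `collections.deque` of the trailing calls for one thread, merges their features, and checks whether a set of required features is present anywhere in the window, in any order. The `(call_id, features)` input shape, the feature strings, and the function names are assumptions made for this example.

```python
from collections import deque

SPAN_LENGTH = 20  # window size from the description above; assumed tunable


def iter_span_features(calls, span_length=SPAN_LENGTH):
    """Yield (call_ids, merged_features) for each trailing window of calls.

    `calls` is an iterable of (call_id, features) pairs for a single thread,
    where `features` is a set of hashable values (hypothetical shape).
    """
    window = deque(maxlen=span_length)  # the oldest call is evicted automatically
    for call_id, features in calls:
        window.append((call_id, frozenset(features)))
        merged = set()
        for _, call_features in window:
            merged |= call_features
        yield [cid for cid, _ in window], merged


def span_matches(required, calls, span_length=SPAN_LENGTH):
    """Return the call ids of the first span containing every required feature."""
    required = set(required)
    for call_ids, merged in iter_span_features(calls, span_length):
        if required <= merged:  # order of calls within the span is irrelevant
            return call_ids
    return None
```

For example, with calls 0 to 3 invoking `GetModuleHandle`, `Sleep`, `GetProcAddress` and `AddVectoredExceptionHandler` in turn, a span of length 2 matches the last two features at calls `[2, 3]` but cannot match the first and last together, while the default length of 20 can.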
I picked 20 fairly randomly. I think we can tweak this number as necessary. Smaller and it's harder to match logic. Larger and the performance might decrease a bit, and then there's more FP possibility. But I don't think this is too risky.
I think this will affect runtime a bit, since we're matching features twice for each call event (once for the precise call event, once for the sliding window).
There are probably some edge cases to work out around overlapping windows. Consider a rule that matches a single call event within a sequence: that call event is contained by 20 sequences (some covering the events before, some covering the events after). So, we may have to do a little more work (TODO) to not emit those matches twice. I'm not precisely sure of the behavior at this moment. I'll write a test for it.