-
Notifications
You must be signed in to change notification settings - Fork 568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring '''get_extractor''' in capa/main.py #1842
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code fails the "code_style" test for these reasons:
"capa/main.py:546:16: F401 binaryninja
imported but unused; consider using importlib.util.find_spec
to test for availability
capa/main.py:547:33: F401 binaryninja.BinaryView
imported but unused; consider using importlib.util.find_spec
to test for availability
capa/main.py:561:13: F821 Undefined name BinaryView
capa/main.py:561:26: F821 Undefined name binaryninja
Found 4 errors."
These errors seem to refer to the "import_binja" function, which imports "binaryninja" and "BinaryView"; and the "handle_binja_backend" function, which uses "binaryninja" and "BinaryView."
The test seems to be intended to prevent the use of a function that is not for-sure imported. But the function that uses "binaryninja" and "BinaryView" is only called by the function that imports "binaryninja" and "BinaryView"; and the function that imports "binaryview" and "BinaryView" is always called by the function that uses those modules.
Therefore, even though this code fails the "code_style" test, it doesn't seem like this is necessarily an issue.
Some changes here look good. The issue was initially meant to simplify the invocation and handling of samples around
Do you see potential for those and would you also consider #1821 while taking a look into this? |
capa/main.py
Outdated
|
||
if os_ == OS_AUTO and not is_supported_os(path): | ||
raise UnsupportedOSError() | ||
check_supported_format(path, os_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reconsidering, I don't feel like these additional functions necessarily make the code easier to read
especially here where it's not obvious that this may raise an exception...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not something I thought about before, but this is helpful to consider. Thank you!
capa/main.py
Outdated
import capa.features.extractors.dnfile.extractor | ||
|
||
return capa.features.extractors.dnfile.extractor.DnfileFeatureExtractor(path) | ||
return handle_dotnet_format(format_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... or here where only a few lines are wrapped
So, I'd vote to leave this function as is.
Hi @mr-tz , Thanks for these comments, and for explaining what issue #1813 was meant to achieve. I think that refactoring some of the '''get_extractor''' format-identifying components could potentially reduce the function's complexity. Looking at the original code that specifically handles the binja and viv extractors, the code looks reasonably simple, although I could be wrong. Are there particular parts of the code for identifying the format and getting the extractor that you think could be simplified? Are you referring to the invocation of '''get_extractor''' in the '''main''' function? I agree with what @williballenthin indicated in his proposals in #1821, which is that '''get_extractor''' and many of the helper functions that it relies on should probably be moved to a different file. Since '''get_extractor''' and the helper functions it relies on have a relatively specific goal (to return FeatureExtractor objects), it may even be worth moving '''get_extractor''' and its helper functions to a feature_extractor.py file. Are there other things about #1821 you think may be worth considering? In addition, it may be a good idea to move the try-except statement in viv-extractor code to a new function. While the try-except clause is not especially complex, I found it a bit visually unwieldy, especially considering that the clause is within both a "with" and "if-else" statement. Also, thank you for pointing out that the names of error handling functions should make clear that they may raise exceptions. This is not something that I had considered before, but I certainly see the value. I have changed the name of the '''is_supported_format''' function to '''check_unsupported_raise_exception''' to reflect this. I have also changed the name of '''import_binja''' to '''attempt_binja_import''', and created '''attempt_save_workspace''', to reflect the error handling abilities of these functions. Please let me know if you have any more questions or comments about these edits, or if you think I should make further enhancements. Best, Aaron |
capa/main.py
Outdated
|
||
if os_ == OS_AUTO and not is_supported_os(path): | ||
raise UnsupportedOSError() | ||
check_unsupported_raise_exception(path, os_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd propose to just leave the code verbatim here instead of in a new function. Or do you see much benefit added by the function?
capa/main.py
Outdated
def handle_binja_backend(path: Path, disable_progress: bool) -> FeatureExtractor: | ||
import capa.features.extractors.binja.extractor | ||
|
||
attempt_binja_import() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can also stay in here and likely resolves the ruff error?
capa/main.py
Outdated
|
||
if should_save_workspace: | ||
logger.debug("saving workspace") | ||
attempt_save_workspace(vw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also simple enough to just leave here
Thanks for the details and code suggestions!!
Yes, that's what I meant originally. Can we wrap the calling and error handling to get an extractor?
I'd propose to start with what's listed in Overall, I still think we don't need to split the function in so many sub parts but would be very happy to see the refactor into a separate module here. |
Hi @mr-tz, Thank you for your comments and thoughts! Do you think that using a decorator would wrap the calling and error handling as you imagine? I have drafted a decorator to perform this error handling and submitted it for review. In addition to cleaning up the error handling for '''get_extractor''', a decorator may help to avoid some of the boilerplate code used for logging errors in capa/helpers.py. However, one benefit of the original code is that it is clear that the function returns if an exception is raised. This can allow someone to easily follow the code flow. With a decorator, the error handling is largely hidden, but this may make it more difficult for someone reading through main.py to get a good sense of what the program is doing - in addition to looking at main.py, they will have to look at where the decorator is defined. As a result, this may be a drawback of using a decorator to perform error handling for '''get_extractor'''. Please let me know if this is more similar to what you hoped to achieve with #1813. If not, I would be very happy to change it and make improvements! Would you also be able to say a little more about how #1821 connects to refactoring the error handling for '''get_extractor'''? In particular, is there a certain way you think #1821 could influence the refactoring of the error handling? Unfortunately, because I am fairly new to contributing to large codebases, I don't have a strong understanding yet of some of these matters. I really appreciate your patience and help thus far! |
Nice idea, I think a decorator could work well here. I also like the potential simplification around the Let's focus this PR on these changes (rollback the first 5 commits - for now) and then work on changes related to #1821 separately as they're not directly related. Sorry for the confusion :) |
capa/helpers.py
Outdated
e_list, return_values = [(UnsupportedFormatError E_INVALID_FILE_TYPE), | ||
(UnsupportedArchError, E_INVALID_FILE_ARCH), | ||
(UnsupportedOSError, E_INVALID_FILE_OS)] | ||
|
||
messsage_list = [ # UnsupportedFormatError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can these go into the same list of tuples?
) | ||
logger.error(" If you don't know the input file type, you can try using the `file` utility to guess it.") | ||
logger.error("-" * 80) | ||
def exceptUnsupportedError(func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use snake case names and maybe rename this to catch_log_return_errors
or similar?
then let's use via @<decorator_name>
(see, e.g., https://rinaarts.com/declutter-python-code-with-error-handling-decorators/)
Hi @mr-tz thanks for this feedback! I have made the changes you suggested. Please let me know if you think the list unpacking looks good. I am unsure if the list unpacking is very readable with the message tuples included. I have put the decorator above the function definition. I don't know if it's possible to place the decorator above the function invocation and also assign the function output to a variable - it seems like this would be good because the potential error handling would be visible in the '''main''' function. Also, no worries regarding #1821! I will rollback the first 5 commits. Thank you for your feedback so far! Please let me know if you have any more comments or suggestions! |
Hi @mr-tz I reverted the first 5 or so commits - is this how I should roll back those commits? I am not sure if my reverting worked. If not, I was considering deleting this branch and creating a fresh one with just the most recent commits. Do you think that would be a sensible option? |
Yes, I think creating a new branch may be the easier option at this point. |
@mr-tz great will do, thank you. |
Thanks, #1845 supersedes this so feel free to close it here. |
@mr-tz sounds good, will do! Thank you! |
Checklist
These edits address issue #1813.
The edits are meant to ensure that '''get_extractor''' does not perform more than one significant function. In addition, I hope these edits make the logic of the '''get_extractor''' function more clear. With the edits, it is apparent that the function: 1) performs some preliminary error checking; 2) checks whether the '''format_''' or '''backend''' match any existing extractor handling routines; and, 3) if not, raises an error.
Unfortunately, I pushed these changes too early and am now unable to make further edits. In lines 597, "format" should be changed to "format_." In addition, there are three blank lines between some functions - this is inconsistent with the other code in the capa/main.py file, and because of PEP8, these three blank lines should be replaced with two blank lines. I believe these edits are otherwise compliant with the CONTRIBUTING.md guide. Please let me know if I have overlooked something.
I do not believe I have a PR address.
If there are any other questions or concerns about these changes, I would be happy to enhance these edits further.