Feature: Map Extracted Files to Artifact Definitions in image_export.py #4949

Open
wants to merge 31 commits into
base: main
from

Conversation

sa3eed3ed

@sa3eed3ed sa3eed3ed commented Dec 30, 2024

Feature: Map Extracted Files to Artifact Definitions in image_export.py

Description:

This PR adds an optional feature to Plaso's image_export.py tool to generate a JSON file mapping extracted files to the artifact definitions that led to their extraction. This mapping provides valuable context about the extracted files.

Functionality:

The new --enable_artifacts_map flag activates this feature. When enabled, the tool creates an artifacts_map.json file in the output directory. This file contains a dictionary where:

Keys: Artifact definition names (e.g., JupyterConfigFile, SshdConfigFile, WindowsEnvironmentVariableComSpec).
Values: Lists of extracted file paths (relative to the output directory) that matched the corresponding artifact definition.

plaso/scripts/image_export.py --artifact_filters JupyterConfigFile,SshdConfigFile \
--write /home/user/tmp --enable_artifacts_map --logfile /home/user/tmp/log.log \
--volumes all --partitions all /home/user/artifact_disk.dd

This command would produce an artifacts_map.json file similar to:

{
  "SshdConfigFile": ["etc/ssh/sshd_config"],
  "JupyterConfigFile": ["home/dummyuser/.jupyter/jupyter_notebook_config.py"]
}

This output indicates that the files etc/ssh/sshd_config and home/dummyuser/.jupyter/jupyter_notebook_config.py were extracted because they matched the SshdConfigFile and JupyterConfigFile artifact definitions, respectively.
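Because the mapped paths are relative, a consumer of the file would typically join them onto the directory passed to --write. A minimal sketch, assuming the JSON layout shown above; the `resolve_artifact_paths` helper and the inline sample are illustrative and not part of Plaso:

```python
import json
import os

# Sample content mirroring the artifacts_map.json example above. In practice
# this text would be read from <output directory>/artifacts_map.json.
SAMPLE_MAP = """
{
  "SshdConfigFile": ["etc/ssh/sshd_config"],
  "JupyterConfigFile": ["home/dummyuser/.jupyter/jupyter_notebook_config.py"]
}
"""


def resolve_artifact_paths(map_text, output_directory):
  """Returns {artifact name: [paths joined onto output_directory]}.

  Hypothetical helper: assumes the map stores '/'-separated relative paths.
  """
  artifacts_map = json.loads(map_text)
  return {
      name: [os.path.join(output_directory, *path.split('/'))
             for path in paths]
      for name, paths in artifacts_map.items()}


resolved = resolve_artifact_paths(SAMPLE_MAP, '/home/user/tmp')
for name, paths in sorted(resolved.items()):
  print(name, paths)
```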

Registry Artifacts:

For artifacts that rely on Windows Registry keys or values (e.g., WindowsEnvironmentVariableComSpec), the tool automatically extracts the relevant registry hive files (e.g., SYSTEM, SOFTWARE, NTUSER.DAT). The artifacts_map.json will map these hive files to both:

The artifact that directly triggered the hive's extraction (e.g., WindowsSystemRegistryFiles).
Any artifacts that rely on data within those hives (e.g., WindowsEnvironmentVariableComSpec).

Example with Registry Artifacts:
If you run image_export.py with --artifact_filters WindowsEnvironmentVariableComSpec, the artifacts_map.json might contain:

{
  "WindowsSystemRegistryFiles": [
    "System Volume Information/Syscache.hve",
    "Windows/System32/config/SAM",
    "Windows/System32/config/SECURITY",
    "Windows/System32/config/SOFTWARE",
    "Windows/System32/config/SYSTEM"
  ],
  "WindowsEnvironmentVariableComSpec": [
    "System Volume Information/Syscache.hve",
    "Users/Warren/AppData/Local/Microsoft/Windows/UsrClass.dat",
    "Users/Warren/NTUSER.DAT",
    "Windows/ServiceProfiles/LocalService/NTUSER.DAT",
    "Windows/ServiceProfiles/NetworkService/NTUSER.DAT",
    "Windows/System32/config/SAM",
    "Windows/System32/config/SECURITY",
    "Windows/System32/config/SOFTWARE",
    "Windows/System32/config/SYSTEM"
  ],
  "WindowsUserRegistryFiles": [
    "Users/Warren/AppData/Local/Microsoft/Windows/UsrClass.dat",
    "Users/Warren/NTUSER.DAT",
    "Windows/ServiceProfiles/LocalService/NTUSER.DAT",
    "Windows/ServiceProfiles/NetworkService/NTUSER.DAT"
  ]
}

This shows that the SYSTEM, SOFTWARE, and other hive files were extracted because of both WindowsSystemRegistryFiles and WindowsEnvironmentVariableComSpec. The mapped paths are relative to the output path provided with the --write argument.
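To see which artifacts a given hive file is attributed to, the map can be inverted. The sketch below uses a shortened version of the example above; `invert_artifacts_map` is a hypothetical helper name, not something this PR adds:

```python
import collections
import json

# Shortened version of the registry example above.
SAMPLE_MAP = """
{
  "WindowsSystemRegistryFiles": ["Windows/System32/config/SYSTEM"],
  "WindowsEnvironmentVariableComSpec": [
    "Users/Warren/NTUSER.DAT",
    "Windows/System32/config/SYSTEM"
  ],
  "WindowsUserRegistryFiles": ["Users/Warren/NTUSER.DAT"]
}
"""


def invert_artifacts_map(artifacts_map):
  """Returns {relative file path: sorted list of artifact names}."""
  files_to_artifacts = collections.defaultdict(list)
  for artifact_name, paths in artifacts_map.items():
    for path in paths:
      files_to_artifacts[path].append(artifact_name)
  return {path: sorted(names) for path, names in files_to_artifacts.items()}


by_file = invert_artifacts_map(json.loads(SAMPLE_MAP))
for path in sorted(by_file):
  print(path, by_file[path])
```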

Technical Details:

The core of this feature is the ArtifactsTrie class, which stores artifact definition paths in a Trie (prefix tree) data structure.

Artifacts Trie Structure
  • Root Node: A special node that doesn't represent a path segment but has children for each unique path separator in the definitions.
  • Path Separator Nodes: Children of the root, representing path separators (e.g., / and \).
  • Other Nodes: Each node represents a path segment from an artifact definition.
  • Glob Handling: Glob patterns (like * and **) are stored as literal node keys.
  • Artifact Names: Nodes corresponding to the end of a valid artifact path store a list of associated artifact names in their artifacts_names attribute.
Example Trie:
Root
├── / (path separator)
│   ├── Users
│   │   └── **
│   │       └── Downloads
│   │           └── *.pdf (artifacts_names: ["PDFDownloads"])
│   └── Windows
│       └── System32
│           └── config
│               └── SAM (artifacts_names: ["WindowsSAMRegistry"])
└── \ (path separator)
    └── Users
        └── *
            └── AppData
                └── Local
                    └── test.ini (artifacts_names: ["LocalAppDataFiles"])
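
The structure above could be built roughly as follows. This is a simplified sketch rather than the PR's ArtifactsTrie: the `TrieNode` and `SimpleArtifactsTrie` classes and the `add_path` method are hypothetical names, and only the `artifacts_names` attribute is taken from the description.

```python
class TrieNode:
  """A trie node keyed by path segment; globs are stored as literal keys."""

  def __init__(self):
    self.children = {}
    self.artifacts_names = []


class SimpleArtifactsTrie:
  """Simplified stand-in for the trie described above."""

  def __init__(self):
    self.root = TrieNode()

  def add_path(self, artifact_name, path, path_separator):
    """Inserts an artifact definition path under its separator node."""
    # The first level below the root is the path separator itself.
    node = self.root.children.setdefault(path_separator, TrieNode())
    for segment in path.split(path_separator):
      if not segment:
        continue  # skip empty segments, e.g. from a leading separator
      node = node.children.setdefault(segment, TrieNode())
    # The final node records which artifact definitions end here.
    if artifact_name not in node.artifacts_names:
      node.artifacts_names.append(artifact_name)


trie = SimpleArtifactsTrie()
trie.add_path('PDFDownloads', '/Users/**/Downloads/*.pdf', '/')
trie.add_path('WindowsSAMRegistry', '/Windows/System32/config/SAM', '/')
trie.add_path(
    'LocalAppDataFiles', '\\Users\\*\\AppData\\Local\\test.ini', '\\')
```

Keeping one subtree per separator means POSIX-style and Windows-style definitions never share nodes, even when their segment names coincide (both subtrees above contain a Users node).
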
Matching Logic

Paths are normalized to use os.sep as the separator.
The GetMatchingArtifacts method traverses the Trie based on input path segments, using fnmatch.fnmatch for glob matching. ** is handled recursively to match zero or more directory levels.
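
The glob handling can be illustrated without the trie itself. The sketch below matches a list of pattern segments against a list of path segments, using fnmatch.fnmatch per segment and recursion for **. The `segments_match` function is a hypothetical illustration of the idea, not the GetMatchingArtifacts implementation:

```python
import fnmatch


def segments_match(pattern_segments, path_segments):
  """Returns True if the path segments match the pattern segments."""
  if not pattern_segments:
    # The pattern is exhausted: match only if the path is too.
    return not path_segments
  head, rest = pattern_segments[0], pattern_segments[1:]
  if head == '**':
    # Let '**' consume 0, 1, 2, ... leading path segments.
    return any(
        segments_match(rest, path_segments[index:])
        for index in range(len(path_segments) + 1))
  if not path_segments:
    return False
  # Ordinary segment: glob-match it, then continue with the remainder.
  return (fnmatch.fnmatch(path_segments[0], head) and
          segments_match(rest, path_segments[1:]))


pattern = 'Users/**/Downloads/*.pdf'.split('/')
zero_levels = segments_match(pattern, 'Users/Downloads/a.pdf'.split('/'))
two_levels = segments_match(
    pattern, 'Users/alice/docs/Downloads/a.pdf'.split('/'))
wrong_extension = segments_match(
    pattern, 'Users/alice/Downloads/a.txt'.split('/'))
print(zero_levels, two_levels, wrong_extension)
```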

Source Type Handling

When the input to the tool is:

  • Directory: A dfvfs.FileSystem object of type OS is created, with a dfvfs.FileSystemSearcher using the input directory as the mount point. The tool extracts files matching the FindSpec's criteria within this directory.
  • File: ExtractPathSpecs yields the input file path directly without searching, as it's assumed that a user-provided file path should be extracted.

Added a safeguard check: if the input is a single file, the tool prints a message and exits. The tool handles images, block devices, and directory hierarchies from the evidence system.

@sa3eed3ed sa3eed3ed requested a review from joachimmetz January 20, 2025 10:03

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 94.52736% with 11 lines in your changes missing coverage. Please review.

Project coverage is 85.11%. Comparing base (9d4e13c) to head (42727b7).

Files with missing lines Patch % Lines
plaso/engine/artifact_filters.py 83.72% 7 Missing ⚠️
plaso/engine/artifacts_trie.py 96.36% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4949      +/-   ##
==========================================
+ Coverage   85.05%   85.11%   +0.06%     
==========================================
  Files         431      432       +1     
  Lines       38648    38822     +174     
==========================================
+ Hits        32873    33045     +172     
- Misses       5775     5777       +2     


@sa3eed3ed sa3eed3ed requested a review from joachimmetz January 22, 2025 12:32
"""
artifact_path_segments = self._GetNonEmptyPathSegments(
artifact_path, artifact_path_seperator)
sanitized_path_segments = path_helper.PathHelper.SanitizePathSegments(
Member

Why sanitize the path here? Doesn't that cause incorrect matches, given that the sanitization is lossy?

Author

@sa3eed3ed sa3eed3ed Jan 28, 2025

I am doing so because _CreateSanitizedDestination is called inside _ExtractDataStream in image_export.py while building the target_directory and target_filename of the output, and _CreateSanitizedDestination calls SanitizePathSegments under the hood. Sanitizing here catches cases where the sanitized output path would otherwise not match the path extracted from the trie.
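
The tradeoff in this thread can be illustrated with a small sketch. The `sanitize_segment` function below is a hypothetical stand-in that replaces every character outside a small safe set with an underscore; it is not Plaso's SanitizePathSegments, whose exact rules are not shown here.

```python
import re


def sanitize_segment(segment):
  """Hypothetical sanitizer: replaces unsafe characters with '_'."""
  return re.sub(r'[^A-Za-z0-9._-]', '_', segment)


def sanitize_segments(segments):
  """Applies sanitize_segment to every path segment."""
  return [sanitize_segment(segment) for segment in segments]


# Path as stored in the trie versus the path as written to the output.
trie_segments = ['Users', 'a:b', 'file?.txt']
extracted_segments = ['Users', 'a_b', 'file_.txt']

# Without sanitizing, the comparison misses the extracted file.
raw_match = trie_segments == extracted_segments
# Sanitizing the trie path first makes the comparison succeed.
sanitized_match = sanitize_segments(trie_segments) == extracted_segments

# The sanitization is lossy: two distinct names collapse to the same value.
collision = sanitize_segments(['a:b']) == sanitize_segments(['a?b'])
print(raw_match, sanitized_match, collision)
```

Under this assumption both points hold at once: the comparison needs the sanitized form to line up with what was written to disk, and distinct source names can collapse to the same sanitized name.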

Comment on lines +419 to +420
@classmethod
def SanitizePathSegments(cls, path_segments):
Member

I would recommend we move this to the cli submodule and only use it for CLI and log output.

Author

I will wait for your reply to the comment above about comparing the path of the extracted artifact to the sanitized version of the path extracted from the trie. If we remove that, I will move this back to the cli submodule. Moving it there now would introduce a cyclic import, because the cli module imports artifacts_trie, which needs access to this method.

@sa3eed3ed sa3eed3ed requested a review from joachimmetz January 28, 2025 14:44
2 participants