Commit

final update, enhancements and mp4 + gif
cak committed Jan 5, 2025
1 parent e63d4f4 commit 6492afe
Showing 5 changed files with 475 additions and 190 deletions.
24 changes: 12 additions & 12 deletions markdown/cve_data_stories/vendor_cve_trends/03_analysis.md
@@ -16,9 +16,9 @@ jupyter:



## Calculate Cumulative CVE Counts by Vendor (Starting from 1996)
## Calculate Cumulative CVE Counts by Vendor (Starting from 1999)

This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1996, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.
This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1999, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.

### Steps in the Script

@@ -29,8 +29,8 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
- Generates a range of dates from the earliest to the latest `Year` and `Month` in the dataset.
- Ensures no months are missing for any vendor by creating a complete time series for all vendors.

3. **Filter Data to Start at 1996**:
- After generating the complete date range, filters the data to include only years starting from 1996. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.
3. **Filter Data to Start at 1999**:
- After generating the complete date range, filters the data to include only years starting from 1999. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.

4. **Build a DataFrame for All Vendors and Dates**:
- Combines the list of unique vendors with the filtered date range using a multi-index.
@@ -55,7 +55,7 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
### Key Features

- **Filters Sparse Early Data**:
- Focuses on data from 1996 onwards for improved analysis and visualization.
- Focuses on data from 1999 onwards for improved analysis and visualization.

- **Handles Missing Data**:
- Ensures every month is accounted for, even if no CVEs were reported for a vendor in a given month.
@@ -71,11 +71,11 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
- The final output is a CSV file (`vendor_cumulative_counts.csv`) containing:
| Vendor | Year | Month | Count | Cumulative_Count |
|-----------|------|-------|-------|-------------------|
| freebsd | 1996 | 1 | 5 | 5 |
| freebsd | 1996 | 2 | 0 | 5 |
| freebsd | 1996 | 3 | 8 | 13 |
| redhat | 1996 | 1 | 0 | 0 |
| redhat | 1996 | 2 | 15 | 15 |
| freebsd | 1999 | 1 | 5 | 5 |
| freebsd | 1999 | 2 | 0 | 5 |
| freebsd | 1999 | 3 | 8 | 13 |
| redhat | 1999 | 1 | 0 | 0 |
| redhat | 1999 | 2 | 15 | 15 |
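
The steps above can be sketched end to end on a tiny in-memory sample. This is a minimal illustration, not the notebook's exact code: the sample rows stand in for the input CSV, and the variable names `dates`, `merged`, and `result` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample standing in for the monthly-counts CSV.
df = pd.DataFrame({
    "Vendor": ["freebsd", "freebsd", "redhat"],
    "Year": [1999, 1999, 1999],
    "Month": [1, 3, 2],
    "Count": [5, 8, 15],
})

# Build a complete monthly range from the earliest to the latest month in the data.
month_starts = pd.to_datetime(
    df[["Year", "Month"]].rename(columns=str.lower).assign(day=1)
)
dates = pd.date_range(month_starts.min(), month_starts.max(), freq="MS")

# Every vendor paired with every month, so no month is missing for any vendor.
full_index = pd.MultiIndex.from_product(
    [df["Vendor"].unique(), dates], names=["Vendor", "Date"]
)
df_full = pd.DataFrame(index=full_index).reset_index()
df_full["Year"] = df_full["Date"].dt.year
df_full["Month"] = df_full["Date"].dt.month

# Keep only years from 1999 onwards.
df_full = df_full[df_full["Year"] >= 1999]

# Left-merge the real counts; months without reports get a count of 0.
merged = pd.merge(df_full, df, on=["Vendor", "Year", "Month"], how="left")
merged = merged.fillna({"Count": 0})
merged["Count"] = merged["Count"].astype(int)

# Running total per vendor, in chronological order.
merged = merged.sort_values(["Vendor", "Date"])
merged["Cumulative_Count"] = merged.groupby("Vendor")["Count"].cumsum()

result = merged[["Vendor", "Year", "Month", "Count", "Cumulative_Count"]]
result = result.reset_index(drop=True)
print(result)
```

Months with no reported CVEs keep the previous cumulative value, which is the behavior the example table shows for `freebsd` in month 2.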


```python
@@ -108,8 +108,8 @@ df_full = pd.DataFrame(index=full_index).reset_index()
df_full["Year"] = df_full["Date"].dt.year
df_full["Month"] = df_full["Date"].dt.month

# Filter to include only years from 1996 onwards
df_full = df_full[df_full["Year"] >= 1996]
# Filter to include only years from 1999 onwards
df_full = df_full[df_full["Year"] >= 1999]

# Merge with the original data, filling missing counts with 0
df = pd.merge(df_full, df, on=["Vendor", "Year", "Month"], how="left").fillna({"Count": 0})
256 changes: 179 additions & 77 deletions markdown/cve_data_stories/vendor_cve_trends/05_visualizations.md
@@ -15,84 +15,90 @@ jupyter:
# CVE Data Stories: Vendor CVE Trends - Visualizations


```python
import warnings

## Bar Chart Race: Top 10 CVE Vendors (1996–2024)
import matplotlib.pyplot as plt
import pandas as pd
from bar_chart_race import bar_chart_race
from matplotlib.colors import to_hex
```

This script generates a dynamic bar chart race showcasing the top 10 vendors by cumulative CVE count over time (1996–2024). CVE data offers critical insights into vendor-specific trends in cybersecurity vulnerabilities, highlighting shifts in the security landscape across two decades.

---

### Steps in the Script

1. **Import Necessary Libraries**:
- `pandas`: For efficient data manipulation and preprocessing.
- `bar_chart_race`: To create the bar chart race animation.
- `matplotlib`: For additional visual customizations, including fonts and color palettes.
## Bar Chart Race: Top CVE Vendors (1999–2024)

2. **Load and Preprocess Data**:
- Reads a CSV file (`vendor_top_20.csv`) containing cumulative CVE counts for vendors by year and month.
- Normalizes vendor names for consistency.
- Ensures inclusion of all vendors that appeared in the top 20 during the analyzed period.
This script generates dynamic bar chart race visualizations that showcase the top vendors by cumulative CVE count over time, covering the years 1999–2024. The project provides insights into long-term trends in vendor-specific vulnerabilities, highlighting shifts in the cybersecurity landscape over two decades.

3. **Pivot and Format Data**:
- Prepares the dataset for visualization by transforming it into a pivot table:
- **Rows**: Time (`Year`, `Month`).
- **Columns**: Vendors.
- **Values**: Cumulative CVE counts.
- Combines `Year` and `Month` into a `Date` column (`YYYY-MM`) for a continuous time index.
---

4. **Assign Colors**:
- **Brand Colors**: Maps vendors to their official brand colors for easy recognition.
- **Fallback Colors**: Assigns visually distinct colors to vendors without defined brand colors.
### Purpose

5. **Generate the Bar Chart Race**:
- Animates the top 10 vendors dynamically over time:
- Bars update their positions and lengths based on cumulative CVE counts.
- Parameters enhance readability and visual storytelling.
- Saves the animation as an `.mp4` file for high-quality sharing.
- **Analyze Vulnerability Trends**: Understand which vendors have consistently had the most reported vulnerabilities and how rankings have evolved over time.
- **Engage Through Visualization**: Present data in a visually compelling way that draws attention to key trends in cybersecurity.
- **Inspire Data-Driven Discussions**: Encourage conversations about how this data can inform risk management strategies.

---

### Key Parameters
### Workflow

- **Top Vendors (`n_bars`)**: Displays the top 10 vendors based on cumulative CVE counts.
- **Dynamic Ordering (`fixed_order=False`)**: Updates the bar order dynamically to reflect changes in rankings.
- **Y-Axis Consistency (`fixed_max=True`)**: Maintains a consistent y-axis scale to enable meaningful visual comparisons.
- **Smooth Transitions (`steps_per_period=10`)**: Creates fluid animations between monthly time steps.
- **Frame Duration (`period_length=400`)**: Each time step lasts 400 milliseconds for optimal pacing.
1. **Setup and Data Loading**:
- Imports libraries for data manipulation (`pandas`), visualization (`bar_chart_race`, `matplotlib`), and warning control (`warnings`).
- Suppresses irrelevant warnings to streamline outputs.
- Reads a preprocessed CSV file (`vendor_top_20.csv`) containing cumulative CVE counts by vendor, year, and month.

---
2. **Vendor Name Normalization**:
- Ensures vendor names are clean and consistent using a mapping dictionary.
- Handles variations in vendor naming for accurate aggregation.

3. **Data Transformation**:
- Converts the `Year` and `Month` columns into a `datetime` format for proper sorting and animation.
- Pivots the dataset to create a table where:
- **Rows**: Time intervals (monthly or yearly).
- **Columns**: Vendors.
- **Values**: Cumulative CVE counts.
- Prepares both monthly and yearly datasets for separate animations.

### Customization
4. **Color Assignment**:
- Assigns official brand colors to vendors where available for consistent identification.
- Generates fallback colors for vendors without official brand palettes, ensuring a visually distinct output.

- **Visual Enhancements**:
- Clear labels with larger fonts (`bar_label_size=12`) improve readability.
- High resolution (`dpi=300`) ensures professional-quality visuals suitable for presentations and reports.
- **Colors**:
- Brand colors make it easy to identify key vendors.
- Fallback colors ensure distinction for all other vendors.
5. **Bar Chart Race Generation**:
- Creates animations for:
- **Monthly Data**: Top 10 vendors shown dynamically across monthly time steps, saved as an `.mp4` file.
- **Yearly Data**: Top 5 vendors aggregated by year, optimized as a `.gif` file for LinkedIn sharing.
- Configures parameters for animation smoothness, readability, and file size optimization.
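
The data transformation step can be sketched on a small sample as follows. The column name `Cumulative_Count` and the sample vendors are assumptions for illustration; the actual layout of `vendor_top_20.csv` is not shown in this diff.

```python
import pandas as pd

# Hypothetical long-format sample: one row per vendor per month.
df = pd.DataFrame({
    "Vendor": ["microsoft", "microsoft", "oracle", "oracle"],
    "Year": [1999, 1999, 1999, 1999],
    "Month": [1, 2, 1, 2],
    "Cumulative_Count": [3, 7, 1, 4],
})

# Combine Year and Month into a datetime (first day of each month).
df["Date"] = pd.to_datetime(
    df[["Year", "Month"]].rename(columns=str.lower).assign(day=1)
)

# Wide format expected by bar_chart_race: rows are dates, columns are vendors.
df_pivot = df.pivot(index="Date", columns="Vendor", values="Cumulative_Count")
df_pivot = df_pivot.fillna(0).sort_index()
print(df_pivot)
```

Each row of `df_pivot` then corresponds to one animation period and each column to one bar.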

---

### Output
### Parameters for Customization

- **Video File**:
- The animation is saved as `top_10_vendors_cve_trends_2002_2024.mp4`, ready for sharing and embedding.
- **Top Vendors (`n_bars`)**:
- Displays the top 10 vendors for monthly visualizations and top 5 for yearly GIFs.
- **Dynamic Rankings (`fixed_order=False`)**:
- Bar positions adjust dynamically based on rankings in each time interval.
- **Value-Axis Consistency (`fixed_max=True`)**:
- Maintains a fixed scale on the value axis (the x-axis for horizontal bars) across time intervals for meaningful comparisons.
- **Transition Smoothness (`steps_per_period`)**:
- Controls animation fluidity, with fewer steps used for smaller file sizes.
- **Animation Speed (`period_length`)**:
- Adjusted for LinkedIn-friendly GIFs with faster transitions.

- **Insights**:
- Tracks the dynamic evolution of CVE counts by vendor.
- Highlights key shifts and emerging trends in vulnerability disclosures across two decades, providing actionable insights into the cybersecurity landscape.
---

### Outputs

```python jupyter={"is_executing": true}
import os
import warnings
1. **Monthly Animation (`.mp4`)**:
- High-quality video highlighting the top 10 vendors month by month.
- Saved as `top_10_vendors_cve_trends_1999_2024.mp4`.

import matplotlib.pyplot as plt
import pandas as pd
from bar_chart_race import bar_chart_race
from matplotlib.colors import to_hex
2. **Yearly Animation (`.gif`)**:
- Lightweight GIF optimized for LinkedIn, showing top 5 vendors per year.
- Saved as `top_5_vendors_cve_trends_1999_2024.gif`.


```python
# Suppress font warnings
warnings.filterwarnings("ignore", category=UserWarning)

@@ -239,35 +245,131 @@ colors = [
brand_colors.get(vendor, fallback_colors[i % len(fallback_colors)])
for i, vendor in enumerate(df_pivot.columns)
]
```
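
The hunk above does not show how `brand_colors` and `fallback_colors` are built. One possible construction is sketched below. The hex values are illustrative placeholders rather than verified brand colors, and the choice of the `tab20` colormap is an assumption, not necessarily what the notebook uses.

```python
import matplotlib
from matplotlib.colors import to_hex

# Hypothetical vendor columns, in the order they appear in the pivot table.
vendors = ["microsoft", "oracle", "examplecorp"]

# Illustrative placeholder colors; not verified official brand values.
brand_colors = {"microsoft": "#00A4EF", "oracle": "#F80000"}

# Fallback palette: the 20 discrete colors of matplotlib's "tab20" colormap.
cmap = matplotlib.colormaps["tab20"]
fallback_colors = [to_hex(cmap(i)) for i in range(cmap.N)]

# Same lookup pattern as above: brand color if known, else cycle the fallbacks.
colors = [
    brand_colors.get(vendor, fallback_colors[i % len(fallback_colors)])
    for i, vendor in enumerate(vendors)
]
print(colors)
```

Indexing the fallback palette by column position keeps each vendor's color stable across frames, since the column order does not change during the animation.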

### Generate Monthly MP4 Bar Chart Race
In this step, we generate a bar chart race video in MP4 format that visualizes cumulative CVE counts by vendor over time, aggregated monthly.

- The output video will display the **top 10 vendors** ranked by their cumulative CVE counts for each month from 1999 to 2024.
- The `period_length` and `steps_per_period` control the animation speed and smoothness.
- The resolution (`dpi=300`) ensures high-quality output.

The resulting MP4 file will be saved to the specified path.


```python jupyter={"is_executing": true}
# Output file path
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_10_vendors_cve_trends_1999_2024.mp4"

# Generate bar chart race
bar_chart_race(
df=df_pivot, # Pivoted DataFrame with cumulative CVE counts by vendor over time
filename=output_file, # Path to save the output video (e.g., .mp4). Set to None to display inline
orientation="h", # Display bars horizontally to show vendor trends over time
sort="desc", # Sort vendors by descending CVE count for each time period
n_bars=10, # Display the top 10 vendors at any given time
fixed_order=False, # Allow dynamic changes in the order of vendors as CVE counts update
fixed_max=True, # Keep the maximum of the value axis (x-axis for horizontal bars) consistent across all time periods
steps_per_period=10, # Number of animation frames to transition between each month
period_length=400, # Duration (in milliseconds) of each month in the animation
interpolate_period=True, # Smoothly interpolate CVE counts between months for fluid animation
label_bars=True, # Display the CVE count as a label on each bar
bar_size=0.85, # Thickness of each bar as a fraction of the available space
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize date label size and position for each month
period_fmt="%Y-%m", # Format of the date label displayed for each time period (e.g., "2023-01")
title="Top Vendors by CVE", # Title of the bar chart animation
title_size=20, # Font size for the chart title
bar_label_size=12, # Font size for the CVE count labels displayed on each bar
tick_label_size=10, # Font size for axis tick labels (representing CVE counts)
cmap=colors, # Colors for each vendor's bar, using brand or fallback colors
dpi=300, # Resolution of the output video (higher DPI produces better quality)
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value)
)

print(f"Bar chart race mp4 saved to {output_file}.")
```

### Prepare Data for Yearly GIF
To simplify the visualization for LinkedIn, the CVE data is aggregated by year instead of monthly intervals. This reduces the size and complexity of the bar chart race while maintaining key trends.

#### Steps:
1. **Convert Index to Datetime**:
- The date index is converted to a datetime format for proper resampling.

2. **Resample by Year-End**:
- Using the `resample('YE').last()` method, we extract the **last value of each year**. Because the counts are cumulative, that value is each vendor's running total at the end of the year. The `'YE'` alias requires pandas 2.2 or later; earlier versions use `'A'` or `'Y'` instead.

3. **Format the Index**:
- The index is updated to show only the year as a string for clarity in the visualization.

4. **Handle Missing Data**:
- Any missing values (`NaN`) are filled with `0` to prevent gaps in the animation.

5. **Avoid Rendering Issues**:
- A small value (`1e-5`) is added to the data to avoid potential rendering artifacts during animation.

6. **Ensure Complete Year Range**:
- The data is reindexed to include all years in the range, filling any missing years with `0`.

```python
# Convert index to datetime and resample
df_pivot.index = pd.to_datetime(df_pivot.index)
df_yearly = df_pivot.resample('YE').last() # Use last value of each year for cumulative data

# Update index to show only the year
df_yearly.index = df_yearly.index.year.astype(str) # Convert years to strings for proper formatting

# Fill NaN values
df_yearly = df_yearly.fillna(0)

# Add a small value to avoid rendering issues
df_yearly += 1e-5

# Ensure all years are present
all_years = [str(year) for year in range(int(df_yearly.index[0]), int(df_yearly.index[-1]) + 1)]
df_yearly = df_yearly.reindex(all_years, fill_value=0)
```
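
A small worked example shows why `.last()` is the appropriate aggregation for cumulative data. The sample values are made up, and the fallback to the `"A"` alias is an assumption added for pandas versions older than 2.2, where `"YE"` is not recognized.

```python
import pandas as pd

# Hypothetical cumulative counts at four points in time.
monthly = pd.DataFrame(
    {"microsoft": [1, 3, 6, 10], "oracle": [0, 2, 2, 5]},
    index=pd.to_datetime(["1999-06-01", "1999-12-01", "2000-06-01", "2000-12-01"]),
)

# "YE" (year end) exists in pandas >= 2.2; older releases raise ValueError.
try:
    yearly = monthly.resample("YE").last()
except ValueError:
    yearly = monthly.resample("A").last()

# Show only the year, as a string, for display in the animation.
yearly.index = yearly.index.year.astype(str)
print(yearly)
```

Using `.sum()` here would add the 1999 values 1 and 3 to get 4 for `microsoft`, double counting vulnerabilities that the running total already includes; `.last()` returns 3, the true year-end total.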

### Generate Yearly GIF Bar Chart Race
Using the aggregated yearly data, we create a **GIF optimized for LinkedIn**.

- The GIF shows the **top 5 vendors** ranked by cumulative CVE counts for each year from 1999 to 2024.
- To ensure the file size is within LinkedIn's 8MB limit:
- Resolution is reduced (`dpi=150`).
- Animation transitions are faster (`period_length=200` milliseconds).
- Fewer steps per period (`steps_per_period=5`) reduce frame count.

The resulting GIF will be saved to the specified path.
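
As a rough check on how these settings reduce the output size, the frame count and running time can be estimated from the number of periods. This is a back-of-the-envelope sketch: `estimate_animation` is a hypothetical helper, it assumes roughly `steps_per_period` frames per period, and it assumes full coverage of 1999 through 2024 (26 years, 312 months).

```python
def estimate_animation(n_periods, steps_per_period, period_length_ms):
    """Approximate frame count and duration in seconds for a bar chart race."""
    frames = n_periods * steps_per_period
    duration_s = n_periods * period_length_ms / 1000
    return frames, duration_s


# Monthly MP4: 312 months, 10 steps per month, 400 ms per month.
monthly_frames, monthly_seconds = estimate_animation(26 * 12, 10, 400)

# Yearly GIF: 26 years, 5 steps per year, 200 ms per year.
yearly_frames, yearly_seconds = estimate_animation(26, 5, 200)

print(monthly_frames, monthly_seconds)
print(yearly_frames, yearly_seconds)
```

Under these assumptions the yearly GIF needs about 130 frames against roughly 3,120 for the monthly video, which, together with the lower `dpi`, accounts for most of the size reduction.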


```python
# Output file path
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_10_vendors_cve_trends_1996_2024.mp4"
os.makedirs(os.path.dirname(output_file), exist_ok=True)
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_5_vendors_cve_trends_1999_2024.gif"

# Generate bar chart race
bar_chart_race(
df=df_pivot, # The pivoted DataFrame containing cumulative CVE counts by vendor over time.
filename=output_file, # Path to save the output video (e.g., .mp4). Set to None to display inline in a notebook.
orientation="h", # Display bars horizontally to show vendor trends over time.
sort="desc", # Sort vendors by descending CVE count for each time period.
n_bars=10, # Number of top CVE vendors to display at any given time.
fixed_order=False, # Allow the order of vendors to change dynamically as CVE counts update over time.
fixed_max=True, # Keep the maximum CVE count consistent across all time periods for better comparison.
steps_per_period=10, # Number of animation frames to transition between each month.
period_length=400, # Duration (in milliseconds) for each month in the animation.
interpolate_period=True, # Smoothly interpolate CVE counts between months for fluid animation.
label_bars=True, # Display the CVE count as a label on each bar.
bar_size=0.85, # Thickness of each bar as a fraction of the available space for the month.
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize the date label for each month (size and position).
period_fmt="%Y-%m", # Format of the date label displayed for each time period (e.g., "2023-01").
title="Top Vendors by CVE", # Title of the bar chart animation.
title_size=20, # Font size for the chart title.
bar_label_size=12, # Font size for the CVE count labels displayed on each bar.
tick_label_size=10, # Font size for axis tick labels (representing CVE counts).
cmap=colors, # Colors for each vendor's bar, using brand colors or fallback colors if unspecified.
dpi=300, # Resolution of the output video (higher DPI produces better quality but larger files).
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value).
df=df_yearly, # Aggregated DataFrame with yearly cumulative CVE counts by vendor
filename=output_file, # Path to save the output GIF (optimized for LinkedIn)
orientation="h", # Display bars horizontally to show vendor trends over time
sort="desc", # Sort vendors by descending CVE count for each year
n_bars=5, # Display the top 5 vendors at any given time
fixed_order=False, # Allow dynamic changes in the order of vendors as CVE counts update
fixed_max=True, # Keep the maximum of the value axis (x-axis for horizontal bars) consistent across all time periods
steps_per_period=5, # Number of animation frames to transition between each year
period_length=200, # Duration (in milliseconds) of each year in the animation
interpolate_period=False, # Disable interpolation to avoid rendering artifacts
label_bars=True, # Display the CVE count as a label on each bar
bar_size=0.85, # Thickness of each bar as a fraction of the available space
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize date label size and position for each year
period_fmt="{x}", # Display the year as it appears in the DataFrame index
title="Top Vendors by CVE (Yearly)", # Title of the bar chart animation
title_size=18, # Font size for the chart title
bar_label_size=10, # Font size for the CVE count labels displayed on each bar
tick_label_size=8, # Font size for axis tick labels (representing CVE counts)
cmap=colors, # Colors for each vendor's bar, using brand or fallback colors
dpi=150, # Resolution of the output GIF (optimized for smaller file size)
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value)
)

print(f"Bar chart race saved to {output_file}.")
print(f"Bar chart race gif saved to {output_file}.")
```