Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3540] Check columns for One-Hot-Encoding before compression #2054

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions scripts/perftest/ohe_checks/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% end comment %}
-->

# Checking One Hot Encodedness before Compression tests

To run all tests for One Hot Encoding Checks:
* install systemds,
* make sure that the paths for SYSTEMDS_ROOT, JAVA_HOME, HADOOP_HOME are correctly set
* set the path for LOG4JPROP to `$SYSTEMDS_ROOT/scripts/perftest/ohe_checks/log4j-compression.properties`
* run experiments.sh

Alternatively, to run the experiment.dml script directly with OHE checks enabled, use this command:

`$SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml --config ohe.xml `

Note: You can use -nvargs to set the variables rows, cols, dummy, distinct, repeats (how many times you want to generate a random matrix, transform-encode it and compress it)

(Dummy is the array of column indexes that you would like to One Hot Encode, example: dummy="[1]" will One Hot Encode the first column)

To collect the metrics from the logs for easier comparison, you can run `parse_logs.py` and an excel file called `combined_metrics.xlsx` will be created in this directory.
---
# Documentation of Changes to codebase for Implementing OHE Checks

## Flag to enable/disable OHE checks (Disabled by default)
- Added ``COMPRESSED_ONEHOTDETECT = "sysds.compressed.onehotdetect"`` to ``DMLConfig`` and adjusted the relevant methods
- Added attribute to ``CompressionSettings`` ``public final boolean oneHotDetect`` and adjusted the methods
- Adjusted ``CompressionSettingsBuilder`` to check if ``COMPRESSED_ONEHOTDETECT`` has been set to true to enable the checks

## Changes in `CoCoderFactory`

### 1. Introduction of OHE Detection

**Condition Addition:**
- Added a condition to check for `cs.oneHotDetect` along with the existing condition `!containsEmptyConstOrIncompressable` in the `findCoCodesByPartitioning` method. This ensures that the process considers OHE detection only if it is enabled in the compression settings.
- Original code only checked for `containsEmptyConstOrIncompressable` and proceeded to cocode all columns if false. The updated code includes an additional check for `cs.oneHotDetect`.

### 2. New Data Structures for OHE Handling

**New Lists:** Introduced two new lists to manage the OHE detection process:
- `currentCandidates`: To store the current candidate columns that might form an OHE group.
- `oheGroups`: To store lists of columns that have been validated as OHE groups.

### 3. Filtering Logic Enhancements

**Column Filtering:** Enhanced the loop that iterates over columns to identify OHE candidates:
- Columns that are empty, constant, or incompressible are filtered into respective lists.
- For other columns, they are added to `currentCandidates` if they are deemed candidates (via `isCandidate` function).

### 4. Addition of `isHotEncoded` Function

**Function Creation:** Created a new `isHotEncoded` function to evaluate if the accumulated columns form a valid OHE group.
- **Parameters:** Takes a list of column groups (`colGroups`), a boolean flag (`isSample`), an array of non-zero counts (`nnzCols`), and the number of rows (`numRows`).
- **Return Type:** Returns a `String` indicating the status of the current candidates:
- `"POTENTIAL_OHE"`: When the current candidates could still form an OHE group.
- `"NOT_OHE"`: When the current candidates cannot form an OHE group.
- `"VALID_OHE"`: When the current candidates form a valid OHE group.
- **Logic:** The function calculates the total number of distinct values and offsets, and checks if they meet the criteria for forming an OHE group.

### 5. Enhanced Group Handling

**Candidate Processing:** Within the loop, after adding a column to `currentCandidates`:
- Calls `isHotEncoded` to check the status of the candidates.
- If `isHotEncoded` returns `"NOT_OHE"`, moves the candidates to regular groups and clears the candidates list.
- If `isHotEncoded` returns `"VALID_OHE"`, moves the candidates to `oheGroups` and clears the candidates list.
- If `isHotEncoded` returns `"POTENTIAL_OHE"`, continues accumulating candidates.

### 6. Final Candidate Check

**Post-loop Check:** After the loop, checks any remaining `currentCandidates`:
- If they form a valid OHE group, adds them to `oheGroups`.
- Otherwise, adds them to regular groups.

### 7. Overwrite and CoCode Groups

**Overwrite Groups:** Updates `colInfos.compressionInfo` with the processed `groups`.
**OHE Group Integration:** Combines indexes for validated OHE groups and adds them to the final `groups`.

## One Hot Encoded Columns Compression in `ColGroupFactory`

### Description

The `compressOHE` function is designed to compress columns that are one-hot encoded (OHE). It validates and processes the input data to ensure it meets the criteria for one-hot encoding, and if so, it compresses the data accordingly. If the data does not meet the OHE criteria, it falls back to a direct compression method (`directCompressDDC`).

### Implementation Details

1. **Validation of `numVals`**:
- Ensures the number of distinct values (`numVals`) in the column group is greater than 0.
- Throws a `DMLCompressionException` if `numVals` is less than or equal to 0.

2. **Handling Transposed Matrix**:
- If the matrix is transposed (`cs.transposed` is `true`):
- Creates a `MapToFactory` data structure with an additional unique value.
- Iterates through the sparse block of the matrix, checking for non-one values or multiple ones in the same row.
- If a column index in the sparse block is empty, or if non-one values or multiple ones are found, it falls back to `directCompressDDC`.

3. **Handling Non-Transposed Matrix**:
- If the matrix is not transposed (`cs.transposed` is `false`):
- Creates a `MapToFactory` data structure.
- Iterates through each row of the matrix:
- Checks for the presence of exactly one '1' in the columns specified by `colIndexes`.
- If multiple ones are found in the same row, or if no '1' is found in a sample row, it falls back to `directCompressDDC`.

4. **Return Value**:
- If the data meets the OHE criteria, returns a `ColGroupDDC` created with the column indexes, an `IdentityDictionary`, and the data.
- If the data does not meet the OHE criteria, returns the result of `directCompressDDC`.
47 changes: 47 additions & 0 deletions scripts/perftest/ohe_checks/experiment.dml
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

## This script generates a random matrix, transforms some columns to be One-Hot-Encoded, and then compresses

# Set default values
default_rows = 1000
default_cols = 10
default_dummy = "[1]"
default_repeats = 1
default_num_distinct = 10

#nvargs
rows = ifdef($rows, default_rows)
cols = ifdef($cols, default_cols)
dummy = ifdef($dummy, default_dummy)
repeats = ifdef($repeats, default_repeats)
num_distinct = ifdef($distinct, default_num_distinct)

# Generate random matrix and apply transformations
x = rand(rows=rows, cols=cols, min=0, max=num_distinct)
x = floor(x)
Fall = as.frame(x)
jspec = "{ids: true, dummycode: " + dummy + "}";
for(i in 1:repeats){
[T,M] = transformencode(target=Fall, spec=jspec)
xc = compress(T)
}
print(toString(xc))
61 changes: 61 additions & 0 deletions scripts/perftest/ohe_checks/experiments.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
#!/usr/bin/env bash
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

mkdir BaselineLogs
mkdir OHELogs
run_base() {
$SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml \
--seed 42 --debug -nvargs rows=$1 cols=$2 dummy="$3" distinct=$4 > BaselineLogs/${5}_${1}_rows_${2}_cols_${3}_encoded_base.txt 2>&1
}

run_ohe() {
$SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml \
--seed 42 --debug --config ohe.xml -nvargs rows=$1 cols=$2 dummy="$3" distinct=$4> OHELogs/${5}_${1}_rows_${2}_cols_${3}_encoded_ohe.txt 2>&1
}

# Run same experiments but checking One-Hot-Encoded columns first
run_ohe 1000 1 "[1]" 10 1
run_ohe 1000 5 "[2]" 10 2
run_ohe 1000 5 "[1,2]" 10 3
run_ohe 1000 5 "[1,2,3]" 10 4
run_ohe 1000 5 "[1,2,3,4,5]" 10 5
run_ohe 1000 10 "[1,3,5]" 10 6
run_ohe 1000 10 "[1,2,5,6]" 10 7
run_ohe 100000 1 "[1]" 100 8
run_ohe 100000 5 "[1,2]" 100 9
run_ohe 100000 5 "[1,2,3]" 100 10
run_ohe 100000 100 "[1,3,50,60,70,80]" 100 11
run_ohe 100000 100 "[1,2,24,25,50,51]" 100 12

# Run baseline experiments
run_base 1000 1 "[1]" 10 1
run_base 1000 5 "[2]" 10 2
run_base 1000 5 "[1,2]" 10 3
run_base 1000 5 "[1,2,3]" 10 4
run_base 1000 5 "[1,2,3,4,5]" 10 5
run_base 1000 10 "[1,3,5]" 10 6
run_base 1000 10 "[1,2,5,6]" 10 7
run_base 100000 1 "[1]" 100 8
run_base 100000 5 "[1,2]" 100 9
run_base 100000 5 "[1,2,3]" 100 10
run_base 100000 100 "[1,3,50,60,70,80]" 100 11
run_base 100000 100 "[1,2,24,25,50,51]" 100 12
33 changes: 33 additions & 0 deletions scripts/perftest/ohe_checks/log4j-compression.properties
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

log4j.rootLogger=ERROR, console

log4j.logger.org.apache.sysds=INFO
log4j.logger.org.apache.sysds.runtime.compress=DEBUG
log4j.logger.org.apache.spark=ERROR
log4j.logger.org.apache.spark.SparkContext=OFF
log4j.logger.org.apache.hadoop=ERROR

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
22 changes: 22 additions & 0 deletions scripts/perftest/ohe_checks/ohe.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
<!--
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
-->

<root>
<sysds.compressed.onehotdetect>true</sysds.compressed.onehotdetect>
</root>
90 changes: 90 additions & 0 deletions scripts/perftest/ohe_checks/parse_logs.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------


import os
import re
import pandas as pd

# Patterns to search for
patterns = {
"num_col_groups": re.compile(r"--num col groups:\s+(\d+)"),
"compressed_size": re.compile(r"--compressed size:\s+(\d+)"),
"compression_ratio": re.compile(r"--compression ratio:\s+([\d.]+)"),
"execution_time": re.compile(r"Total execution time:\s+([\d.]+) sec.")
}

# Function to extract metrics from the text files
def extract_metrics(file_path):
with open(file_path, 'r') as file:
content = file.read()
metrics = {}
for key, pattern in patterns.items():
match = pattern.search(content)
if match:
metrics[key] = match.group(1)
else:
metrics[key] = None
return metrics

# Directories for baseline and OHE
baseline_dir = "BaselineLogs"
ohe_dir = "OHELogs"

# Data storage
data_combined = []

# Process baseline and corresponding OHE files
for file_name in os.listdir(baseline_dir):
if file_name.endswith("_encoded_base.txt"):
experiment_name = file_name[:-4] # Remove the .txt extension
file_path_baseline = os.path.join(baseline_dir, file_name)
metrics_baseline = extract_metrics(file_path_baseline)

file_name_ohe = file_name.replace('_base.txt', '_ohe.txt').replace('Baseline', 'OHE')
file_path_ohe = os.path.join(ohe_dir, file_name_ohe)
if os.path.exists(file_path_ohe):
metrics_ohe = extract_metrics(file_path_ohe)
else:
metrics_ohe = {key: None for key in patterns.keys()}

combined_metrics = {
'experiment': experiment_name,
'baseline_num_col_groups': metrics_baseline.get('num_col_groups'),
'baseline_compressed_size': metrics_baseline.get('compressed_size'),
'baseline_compression_ratio': metrics_baseline.get('compression_ratio'),
'baseline_execution_time': metrics_baseline.get('execution_time'),
'ohe_num_col_groups': metrics_ohe.get('num_col_groups'),
'ohe_compressed_size': metrics_ohe.get('compressed_size'),
'ohe_compression_ratio': metrics_ohe.get('compression_ratio'),
'ohe_execution_time': metrics_ohe.get('execution_time')
}

data_combined.append(combined_metrics)

# Create DataFrame
df_combined = pd.DataFrame(data_combined)

# Write to Excel
output_file = 'combined_metrics.xlsx'
df_combined.to_excel(output_file, index=False)

print(f'Excel file "{output_file}" has been created successfully.')
Loading
Loading