Filter out dominion partial ballots #910

artoonie · 2024-12-31T00:08:59Z

Closes #909

Confirmed that I get the same results as expected (+2 to +9 relative to the certified results, as explained elsewhere)

Adds tests that confirm this behavior, as no previous test did. Used a random selection of CVR files from the Alaska election which contain partial ballots.

src/main/java/network/brightspots/rcv/DominionCvrReader.java

nurse-the-code · 2025-01-13T21:21:08Z

I have verified this PR using the following processes:

Reviewing all of the code changes
Tabulating the 2024 Alaska US House race both with the changes from this PR and without the changes from this PR.
- This race was chosen for testing the PR because it is a statewide race with a large data set of CVRs.
- 2024_alaska_us_house_race_with_pr_910_detailed_report.json
- 2024_alaska_us_house_race_without_pr_detailed_report.json
- Comparing the detailed_report.json files from both tabulations, I noticed the following differences:
  - A quick sanity check reveals that in each round the remaining candidate counts and invalid vote counts are lower with this PR than without it. That makes sense and is expected, because the goal of this PR is to exclude ballots marked as invalid contest for the race being tabulated.
  - We still want to make sure that these differences are entirely the result of excluding the invalid contest.
Creating two databases from the RCTab output files: one from the tabulation with this PR, one from the tabulation without it.
- The rctab_cvr.csv file from each tabulation was used to create a table with RCTab CVR data.
- The audit_i.log file(s) from each tabulation were used to create a table with round-by-round parsed audit log info.
- This was facilitated by:
  - creating a round-by-round debugging tool that parses the RCTab audit_i.log file(s) and generates a .csv file with round-by-round information on how each CVR was interpreted by RCTab during tabulation. (Expect a 404 error if you don't have access to the round-by-round debugging tool.)
  - updating RCTab to print a shared primary id to both the rctab_cvr.csv and the audit_i.log files.
  - updating the round-by-round debugging tool to generate a SQLite database when supplied an rctab_cvr.csv file and a list of the audit_i.log files. The rctab_cvr.csv data is in one table and the audit_i.log data is in another table, with both tables joined on a primary id of "RCTab CVR Id". (Expect a 404 error if you don't have access to the round-by-round debugging tool.)
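
For readers without access to the debugging tool, the database layout described above can be approximated with the Python standard library. This is only a rough sketch, not the tool itself: the helper `load_csv_table`, the table names, and the tiny CSV contents are made up for illustration, and only the "RCTab CVR Id", "Rank N", and "Round N" column names are taken from the queries in this comment.

```python
import csv
import io
import sqlite3


def load_csv_table(conn, table, csv_file):
    """Create `table` from a CSV file object, using the header row as TEXT columns."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join('"' + name.replace('"', '""') + '" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


# Made-up stand-ins for rctab_cvr.csv and the parsed audit log output.
CVR_CSV = """RCTab CVR Id,Rank 1,Rank 2
1,Candidate A,Candidate B
2,Candidate B,
"""
ROUNDS_CSV = """RCTab CVR Id,Round 1,Round 2
1,Candidate A,Candidate A
2,Candidate B,Ballots by Exhausted Choices
"""

conn = sqlite3.connect(":memory:")
load_csv_table(conn, "rctab_cvr", io.StringIO(CVR_CSV))
load_csv_table(conn, "round_by_round", io.StringIO(ROUNDS_CSV))

# Join the two tables on the shared id, as the verification queries do.
joined = conn.execute(
    'SELECT r."RCTab CVR Id", r."Rank 1", rbr."Round 2" '
    "FROM rctab_cvr r JOIN round_by_round rbr "
    'ON r."RCTab CVR Id" = rbr."RCTab CVR Id" '
    'ORDER BY r."RCTab CVR Id"'
).fetchall()
print(joined)
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")`, and `":memory:"` by a path to the .sqlite3 file.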

Running an SQL query that:

loads both databases,
in each database joins the RCTab CVR and round-by-round audit logging data,
and then compares the joined data in both databases to count the number of records that match on critical details:
1. "RCTab CVR Id",
2. candidate rankings,
3. and round-by-round interpretation (either candidate name or a reason the CVR has been invalidated).

SQL Query:

-- Attach both databases
ATTACH DATABASE './us-house-outstack-included.sqlite3' AS outstack_included_db;
ATTACH DATABASE './us-house-outstack-excluded.sqlite3' AS outstack_excluded_db;

-- Count matching records between both databases
SELECT COUNT(*)
FROM (
    -- Query from first database
    SELECT r1."RCTab CVR Id",
           r1."Rank 1", r1."Rank 2", r1."Rank 3", r1."Rank 4", r1."Rank 5",
           rbr1."Round 1", rbr1."Round 2", rbr1."Round 3"
    FROM outstack_included_db.rctab_cvr r1
    JOIN outstack_included_db.round_by_round rbr1 
    ON r1."RCTab CVR Id" = rbr1."RCTab CVR Id"
) AS db1
INNER JOIN (
    -- Query from second database
    SELECT r2."RCTab CVR Id",
           r2."Rank 1", r2."Rank 2", r2."Rank 3", r2."Rank 4", r2."Rank 5",
           rbr2."Round 1", rbr2."Round 2", rbr2."Round 3"
    FROM outstack_excluded_db.rctab_cvr r2
    JOIN outstack_excluded_db.round_by_round rbr2 
    ON r2."RCTab CVR Id" = rbr2."RCTab CVR Id"
) AS db2
ON db1."RCTab CVR Id" = db2."RCTab CVR Id"
   AND db1."Rank 1" = db2."Rank 1"
   AND db1."Rank 2" = db2."Rank 2"
   AND db1."Rank 3" = db2."Rank 3"
   AND db1."Rank 4" = db2."Rank 4"
   AND db1."Rank 5" = db2."Rank 5"
   AND db1."Round 1" = db2."Round 1"
   AND db1."Round 2" = db2."Round 2"
   AND db1."Round 3" = db2."Round 3";

-- Detach databases when done
DETACH DATABASE outstack_included_db;
DETACH DATABASE outstack_excluded_db;

Result: 335853

This is also the "Total Number of Ballots" reported in the output of the tabulation that excludes invalid contests. This means that every single record from the invalid-contest-excluded tabulation (i.e. with this PR) exactly matches a record in the invalid-contest-included tabulation (i.e. without this PR). This is the correct and expected behavior, and should be enough to confirm the PR.
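
One caveat worth keeping in mind when reusing this kind of check: in SQLite, `=` never matches when either side is NULL, whereas `IS` treats two NULLs as equal. Whether the real tables store empty rankings as NULL or as empty strings is not stated here, so the following is only a toy illustration on made-up data (the table name `joined` and the candidate names are hypothetical) of how the matching-count-equals-total check behaves under both operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS included_db")
conn.execute("ATTACH DATABASE ':memory:' AS excluded_db")

COLUMNS = '"RCTab CVR Id" TEXT, "Rank 1" TEXT, "Rank 2" TEXT, "Round 1" TEXT'
for db in ("included_db", "excluded_db"):
    conn.execute(f"CREATE TABLE {db}.joined ({COLUMNS})")

# Made-up ballots: id 3 stands in for a partial ballot present only without the PR.
included = [
    ("1", "Candidate A", "Candidate B", "Candidate A"),
    ("2", "Candidate B", None, "Candidate B"),
    ("3", None, None, "Did Not Rank"),
]
excluded = included[:2]
conn.executemany("INSERT INTO included_db.joined VALUES (?, ?, ?, ?)", included)
conn.executemany("INSERT INTO excluded_db.joined VALUES (?, ?, ?, ?)", excluded)


def count_matches(op):
    """Count records identical in both databases, comparing columns with `op`."""
    query = f"""
        SELECT COUNT(*)
        FROM included_db.joined a
        JOIN excluded_db.joined b
          ON a."RCTab CVR Id" = b."RCTab CVR Id"
         AND a."Rank 1" {op} b."Rank 1"
         AND a."Rank 2" {op} b."Rank 2"
         AND a."Round 1" {op} b."Round 1"
    """
    return conn.execute(query).fetchone()[0]


strict = count_matches("=")      # ballot 2 is dropped: NULL = NULL is not true
null_safe = count_matches("IS")  # ballot 2 is kept: NULL IS NULL is true
total_excluded = conn.execute("SELECT COUNT(*) FROM excluded_db.joined").fetchone()[0]
print(strict, null_safe, total_excluded)
```

If the matching count equals the excluded total under `=`, as it did above, NULLs were not hiding mismatches; a shortfall under `=` that disappears under `IS` would point to NULL handling rather than a real difference.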

To add an extra layer of comfort and security, I wrote a SQL query to count the unique round-by-round interpretations. The SQL query does this by:

Joining the RCTab CVR data and the round-by-round parsed audit log data on an "RCTab CVR Id" in each database (similar to what I did in the first query).
Finding the records in the outstack included database (created from tabulating without this PR) where "RCTab CVR Id" does not match any "RCTab CVR Id" in the outstack excluded database (created from tabulating with this PR).
In those found records, counting the number of values in each round that match the following comprehensive list of candidates and invalidation conditions found in the 2024 Alaska US House race:
1. "Begich"
2. "Peltola"
3. "Howe"
4. "Hafner"
5. "Ballots by Overvotes"
6. "Ballots by Skipped Rankings"
7. "Ballots by Exhausted Choices"
8. "Did Not Rank"
Printing those record counts in a table.

SQL query:

.mode box
.headers on

-- Attach both databases
ATTACH DATABASE './us-house-outstack-included.sqlite3' AS outstack_included_db;
ATTACH DATABASE './us-house-outstack-excluded.sqlite3' AS outstack_excluded_db;

-- Common table expression (CTE) selecting records that exist only in outstack_included_db
WITH unique_included_records AS (
    SELECT r1."RCTab CVR Id", rbr1."Round 1", rbr1."Round 2", rbr1."Round 3"
    FROM outstack_included_db.rctab_cvr r1
    JOIN outstack_included_db.round_by_round rbr1 
    ON r1."RCTab CVR Id" = rbr1."RCTab CVR Id"
    WHERE NOT EXISTS (
        SELECT 1
        FROM outstack_excluded_db.rctab_cvr r2
        WHERE r1."RCTab CVR Id" = r2."RCTab CVR Id"
    )
)

-- Count occurrences for each candidate/category across all rounds
SELECT 
    'Begich' as Category,
    COUNT(CASE WHEN "Round 1" LIKE '%Begich%' THEN 1 END) as "Round 1 Count",
    COUNT(CASE WHEN "Round 2" LIKE '%Begich%' THEN 1 END) as "Round 2 Count",
    COUNT(CASE WHEN "Round 3" LIKE '%Begich%' THEN 1 END) as "Round 3 Count"
FROM unique_included_records

UNION ALL
SELECT 
    'Peltola',
    COUNT(CASE WHEN "Round 1" LIKE '%Peltola%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Peltola%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Peltola%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Howe',
    COUNT(CASE WHEN "Round 1" LIKE '%Howe%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Howe%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Howe%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Hafner',
    COUNT(CASE WHEN "Round 1" LIKE '%Hafner%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Hafner%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Hafner%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Overvotes',
    COUNT(CASE WHEN "Round 1" LIKE '%Overvotes%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Overvotes%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Overvotes%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Skipped Rankings',
    COUNT(CASE WHEN "Round 1" LIKE '%Skipped Rankings%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Skipped Rankings%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Skipped Rankings%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Exhausted Choices',
    COUNT(CASE WHEN "Round 1" LIKE '%Exhausted Choices%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Exhausted Choices%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Exhausted Choices%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Did Not Rank',
    COUNT(CASE WHEN "Round 1" LIKE '%Did Not Rank%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Did Not Rank%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Did Not Rank%' THEN 1 END)
FROM unique_included_records;

-- Detach databases when done
DETACH DATABASE outstack_included_db;
DETACH DATABASE outstack_excluded_db;

Results:

I then did a little sanity check on that output by using the Python interpreter to:
1. make sure that the counts for each column in the output table match each other
2. make sure that the sum of one of those matching columns, when added to the number of outstack excluded ballots, equals the number of outstack included ballots.
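
The two checks can be written down in a few lines of Python. This is a sketch of the arithmetic only, reading "counts for each column" as per-round column totals (an assumption); the function name `sanity_check` is hypothetical, and the numbers are invented toy values, not the Alaska results.

```python
def sanity_check(counts_by_round, excluded_total, included_total):
    """Return (columns_match, adds_up) for a per-round table of category counts."""
    column_totals = [sum(per_category.values()) for per_category in counts_by_round.values()]
    # Check 1: every round column accounts for the same number of ballots.
    columns_match = len(set(column_totals)) == 1
    # Check 2: ballots only in the included run + excluded run == included run.
    adds_up = column_totals[0] + excluded_total == included_total
    return columns_match, adds_up


# Invented toy table: 10 ballots appear only in the outstack included tabulation.
toy_counts = {
    "Round 1": {"Candidate A": 3, "Candidate B": 2, "Did Not Rank": 5},
    "Round 2": {"Candidate A": 4, "Exhausted Choices": 1, "Did Not Rank": 5},
}
result = sanity_check(toy_counts, excluded_total=90, included_total=100)
print(result)
```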

yezr

LGTM!

Filter out dominion partial ballots

12e20f1

artoonie added the WIP label Dec 31, 2024

Add tests. Clean up.

6fce534

artoonie removed the WIP label Jan 1, 2025

artoonie changed the title ~~WIP Filter out dominion partial ballots~~ Filter out dominion partial ballots Jan 1, 2025

artoonie requested a review from yezr January 1, 2025 21:51

nurse-the-code reviewed Jan 3, 2025

src/main/java/network/brightspots/rcv/DominionCvrReader.java

yezr approved these changes Jan 16, 2025

yezr merged commit eaaa4b6 into develop Jan 16, 2025
1 check passed

yezr deleted the feature/issue-909_filter-dominion branch January 16, 2025 17:25
