Filter out dominion partial ballots #910

Merged
merged 2 commits into from
Jan 16, 2025

Conversation

artoonie
Collaborator

@artoonie artoonie commented Dec 31, 2024

Closes #909

Confirmed that I get the same results as expected (+2 to +9 relative to the certified results, as explained elsewhere)

Adds tests that confirm this behavior, as no previous test did. Used a random selection of CVR files from the Alaska election which contain partial ballots.

@artoonie artoonie added the WIP label Dec 31, 2024
@artoonie artoonie removed the WIP label Jan 1, 2025
@artoonie artoonie changed the title WIP Filter out dominion partial ballots Filter out dominion partial ballots Jan 1, 2025
@artoonie artoonie requested a review from yezr January 1, 2025 21:51
@nurse-the-code
Collaborator

I have verified this PR using the following processes:

  1. Reviewing all of the code changes

  2. Tabulating the 2024 Alaska US House race both with the changes from this PR and without the changes from this PR.

    • This race was chosen for testing the PR, because it is a statewide race with a large data set of CVRs.
    • 2024_alaska_us_house_race_with_pr_910_detailed_report.json
    • 2024_alaska_us_house_race_without_pr_detailed_report.json
    • Comparing the detailed_report.json files from both tabulations, I noticed the following differences:
      Screenshot 2025-01-13 at 12 33 45
    • A quick sanity check reveals that in each round the remaining candidate counts and invalid vote counts are lower with this PR than without it. That makes sense and is expected, because the goal of this PR is to exclude ballots marked as invalid contest for the race being tabulated.
    • We still want to make sure that these differences are entirely the result of excluding the invalid contest.
  3. Creating two databases from the RCTab output files: one using the PR to tabulate; one not using this PR to tabulate.

    • The rctab_cvr.csv file for each tabulation was used to create a table with RCTab CVR data.
    • The audit_i.log file(s) for each tabulation were used to create a table with round-by-round parsed audit log info.
    • This was facilitated by:
      • creating a round-by-round debugging tool that parses the RCTab audit_i.log file(s) and generates a .csv file with round-by-round information on how each CVR was interpreted by RCTab during tabulation. (Expect a 404 error if you don't have access to the round-by-round debugging tool.)
      • updating RCTab to print a shared primary id to both the rctab_cvr.csv and the audit_i.log files.
      • updating the round-by-round debugging tool to generate a SQLite database when supplied an rctab_cvr.csv file and a list of the audit_i.log files. The rctab_cvr.csv data is in one table and the audit_i.log data is in another table, with both tables joined on a primary id of "RCTab CVR Id". (Expect a 404 error if you don't have access to the round-by-round debugging tool.)
  4. Running an SQL query that:

    1. loads both databases,
    2. in each database joins the RCTab CVR and round-by-round audit logging data,
    3. and then compares the joined data in both databases to count the number of records that match on critical details:
      1. "RCTab CVR Id",
      2. candidate rankings,
      3. and round-by-round interpretation (either candidate name or a reason the CVR has been invalidated).

    SQL Query:

    -- Attach both databases
    ATTACH DATABASE './us-house-outstack-included.sqlite3' AS outstack_included_db;
    ATTACH DATABASE './us-house-outstack-excluded.sqlite3' AS outstack_excluded_db;
    
    -- Count matching records between both databases
    SELECT COUNT(*)
    FROM (
        -- Query from first database
        SELECT r1."RCTab CVR Id",
               r1."Rank 1", r1."Rank 2", r1."Rank 3", r1."Rank 4", r1."Rank 5",
               rbr1."Round 1", rbr1."Round 2", rbr1."Round 3"
        FROM outstack_included_db.rctab_cvr r1
        JOIN outstack_included_db.round_by_round rbr1 
        ON r1."RCTab CVR Id" = rbr1."RCTab CVR Id"
    ) AS db1
    INNER JOIN (
        -- Query from second database
        SELECT r2."RCTab CVR Id",
               r2."Rank 1", r2."Rank 2", r2."Rank 3", r2."Rank 4", r2."Rank 5",
               rbr2."Round 1", rbr2."Round 2", rbr2."Round 3"
        FROM outstack_excluded_db.rctab_cvr r2
        JOIN outstack_excluded_db.round_by_round rbr2 
        ON r2."RCTab CVR Id" = rbr2."RCTab CVR Id"
    ) AS db2
    ON db1."RCTab CVR Id" = db2."RCTab CVR Id"
       AND db1."Rank 1" = db2."Rank 1"
       AND db1."Rank 2" = db2."Rank 2"
       AND db1."Rank 3" = db2."Rank 3"
       AND db1."Rank 4" = db2."Rank 4"
       AND db1."Rank 5" = db2."Rank 5"
       AND db1."Round 1" = db2."Round 1"
       AND db1."Round 2" = db2."Round 2"
       AND db1."Round 3" = db2."Round 3";
    
    -- Detach databases when done
    DETACH DATABASE outstack_included_db;
    DETACH DATABASE outstack_excluded_db;

    Result: 335853

    This is also the "Total Number of Ballots" reported in the output of the tabulation that excludes invalid contests (i.e. the tabulation with this PR). This means that every single record from that tabulation was an exact match for a record in the tabulation that includes invalid contests (i.e. the tabulation without this PR). This is the correct and expected behavior, and should be enough to confirm the PR.
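
The comparison in step 4 can be tried on toy data. The following is only a minimal sketch using Python's built-in sqlite3 module, not the tooling used for this review: the table and column names follow the query above, but the rows are invented, the tables are cut down to one rank column and one round column, and both databases are attached in memory instead of from the .sqlite3 files.

```python
import sqlite3

def make_tables(conn, schema, cvr_rows, round_rows):
    """Create toy rctab_cvr and round_by_round tables inside an attached schema."""
    conn.execute(f'CREATE TABLE {schema}.rctab_cvr ("RCTab CVR Id" TEXT PRIMARY KEY, "Rank 1" TEXT)')
    conn.execute(f'CREATE TABLE {schema}.round_by_round ("RCTab CVR Id" TEXT PRIMARY KEY, "Round 1" TEXT)')
    conn.executemany(f"INSERT INTO {schema}.rctab_cvr VALUES (?, ?)", cvr_rows)
    conn.executemany(f"INSERT INTO {schema}.round_by_round VALUES (?, ?)", round_rows)

conn = sqlite3.connect(":memory:")
# Attach both databases up front (in memory here; the review used files on disk).
conn.execute("ATTACH DATABASE ':memory:' AS included_db")
conn.execute("ATTACH DATABASE ':memory:' AS excluded_db")

# Invented rows: "toy-3" stands in for a ballot that only the
# tabulation without the filter would keep.
make_tables(
    conn, "included_db",
    [("toy-1", "Begich"), ("toy-2", "Peltola"), ("toy-3", "Howe")],
    [("toy-1", "Begich"), ("toy-2", "Peltola"), ("toy-3", "Howe")],
)
make_tables(
    conn, "excluded_db",
    [("toy-1", "Begich"), ("toy-2", "Peltola")],
    [("toy-1", "Begich"), ("toy-2", "Peltola")],
)

MATCH_COUNT_SQL = '''
SELECT COUNT(*)
FROM (
    SELECT r."RCTab CVR Id" AS id, r."Rank 1" AS rank1, rbr."Round 1" AS round1
    FROM included_db.rctab_cvr r
    JOIN included_db.round_by_round rbr ON r."RCTab CVR Id" = rbr."RCTab CVR Id"
) AS db1
INNER JOIN (
    SELECT r."RCTab CVR Id" AS id, r."Rank 1" AS rank1, rbr."Round 1" AS round1
    FROM excluded_db.rctab_cvr r
    JOIN excluded_db.round_by_round rbr ON r."RCTab CVR Id" = rbr."RCTab CVR Id"
) AS db2
ON db1.id = db2.id AND db1.rank1 = db2.rank1 AND db1.round1 = db2.round1
'''

# Number of records that agree on id, ranking, and round interpretation.
matches = conn.execute(MATCH_COUNT_SQL).fetchone()[0]
# Number of records in the filtered tabulation.
excluded_total = conn.execute("SELECT COUNT(*) FROM excluded_db.rctab_cvr").fetchone()[0]
print(matches, excluded_total)
```

When the two printed numbers are equal, every record in the filtered tabulation has an identical counterpart in the unfiltered one, which is the property the result above relies on. One caveat about SQLite semantics: `=` never matches NULL against NULL, so if any compared column could hold NULL, the `IS` operator would be the NULL-safe comparison.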


To add an extra layer of comfort and security, I wrote a SQL query to count the unique round-by-round interpretations. The SQL query does this by:

  1. Joining the RCTab CVR data and the round-by-round parsed audit log data on an "RCTab CVR Id" in each database (similar to what I did in the first query).

  2. Finding the records in the outstack-included database (created from tabulating without this PR) where "RCTab CVR Id" does not match any "RCTab CVR Id" in the outstack-excluded database (created from tabulating with this PR).

  3. In those found records, counting the number of values in each round that match the following comprehensive list of candidates and invalidation conditions found in the 2024 Alaska US House race:

    1. "Begich"
    2. "Peltola"
    3. "Howe"
    4. "Hafner"
    5. "Ballots by Overvotes"
    6. "Ballots by Skipped Rankings"
    7. "Ballots by Exhausted Choices"
    8. "Did Not Rank"
  4. Printing those record counts in a table.

SQL query:

.mode box
.headers on

-- Attach both databases
ATTACH DATABASE './us-house-outstack-included.sqlite3' AS outstack_included_db;
ATTACH DATABASE './us-house-outstack-excluded.sqlite3' AS outstack_excluded_db;

-- Define a CTE holding the records found only in outstack_included_db
WITH unique_included_records AS (
    SELECT r1."RCTab CVR Id", rbr1."Round 1", rbr1."Round 2", rbr1."Round 3"
    FROM outstack_included_db.rctab_cvr r1
    JOIN outstack_included_db.round_by_round rbr1 
    ON r1."RCTab CVR Id" = rbr1."RCTab CVR Id"
    WHERE NOT EXISTS (
        SELECT 1
        FROM outstack_excluded_db.rctab_cvr r2
        WHERE r1."RCTab CVR Id" = r2."RCTab CVR Id"
    )
)

-- Count occurrences for each candidate/category across all rounds
SELECT 
    'Begich' as Category,
    COUNT(CASE WHEN "Round 1" LIKE '%Begich%' THEN 1 END) as "Round 1 Count",
    COUNT(CASE WHEN "Round 2" LIKE '%Begich%' THEN 1 END) as "Round 2 Count",
    COUNT(CASE WHEN "Round 3" LIKE '%Begich%' THEN 1 END) as "Round 3 Count"
FROM unique_included_records

UNION ALL
SELECT 
    'Peltola',
    COUNT(CASE WHEN "Round 1" LIKE '%Peltola%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Peltola%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Peltola%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Howe',
    COUNT(CASE WHEN "Round 1" LIKE '%Howe%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Howe%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Howe%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Hafner',
    COUNT(CASE WHEN "Round 1" LIKE '%Hafner%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Hafner%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Hafner%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Overvotes',
    COUNT(CASE WHEN "Round 1" LIKE '%Overvotes%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Overvotes%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Overvotes%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Skipped Rankings',
    COUNT(CASE WHEN "Round 1" LIKE '%Skipped Rankings%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Skipped Rankings%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Skipped Rankings%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Exhausted Choices',
    COUNT(CASE WHEN "Round 1" LIKE '%Exhausted Choices%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Exhausted Choices%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Exhausted Choices%' THEN 1 END)
FROM unique_included_records

UNION ALL
SELECT 
    'Did Not Rank',
    COUNT(CASE WHEN "Round 1" LIKE '%Did Not Rank%' THEN 1 END),
    COUNT(CASE WHEN "Round 2" LIKE '%Did Not Rank%' THEN 1 END),
    COUNT(CASE WHEN "Round 3" LIKE '%Did Not Rank%' THEN 1 END)
FROM unique_included_records;

-- Detach databases when done
DETACH DATABASE outstack_included_db;
DETACH DATABASE outstack_excluded_db;

Results:
Screenshot of SQL query results
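
The eight near-identical SELECT blocks in that query could also be generated in a loop. The following is a hedged sketch of that idea in Python's sqlite3 module, not the method used in this review: `count_by_category` is a hypothetical helper, the rows are invented, and `unique_included_records` is created here as a plain toy table rather than as the CTE from the query above.

```python
import sqlite3

CATEGORIES = [
    "Begich", "Peltola", "Howe", "Hafner",
    "Overvotes", "Skipped Rankings", "Exhausted Choices", "Did Not Rank",
]
ROUNDS = ["Round 1", "Round 2", "Round 3"]

def count_by_category(conn, table, categories=CATEGORIES, rounds=ROUNDS):
    """Return {category: [count per round]} using the same LIKE '%name%' test."""
    results = {}
    for category in categories:
        # One COUNT(CASE ...) expression per round column.
        columns = ", ".join(
            f'COUNT(CASE WHEN "{rnd}" LIKE ? THEN 1 END)' for rnd in rounds
        )
        params = [f"%{category}%"] * len(rounds)
        row = conn.execute(f"SELECT {columns} FROM {table}", params).fetchone()
        results[category] = list(row)
    return results

# Invented rows standing in for records found only in the unfiltered tabulation.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE unique_included_records '
    '("RCTab CVR Id" TEXT, "Round 1" TEXT, "Round 2" TEXT, "Round 3" TEXT)'
)
conn.executemany(
    "INSERT INTO unique_included_records VALUES (?, ?, ?, ?)",
    [
        ("toy-1", "Begich", "Begich", "Begich"),
        ("toy-2", "Howe", "Peltola", "Peltola"),
        ("toy-3", "Did Not Rank", "Did Not Rank", "Did Not Rank"),
    ],
)

counts = count_by_category(conn, "unique_included_records")
for category, per_round in counts.items():
    print(category, per_round)
```

Passing the category as a bound parameter keeps the quoting out of the SQL text; the table and column names are still interpolated, so they should only ever come from trusted constants like the ones above.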

I then did a little sanity check on that by using the Python interpreter to:
1. make sure that the totals of each column in the output table match each other
2. make sure that one of those matching column totals, when added to the number of outstack excluded ballots, equals the number of outstack included ballots.

Screenshot 2025-01-13 at 11 21 38
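
For readers without the screenshot, the two checks amount to simple arithmetic. The sketch below shows their shape only: `check_round_totals` is a hypothetical helper and every number in it is invented for illustration, not taken from the Alaska data.

```python
def check_round_totals(counts_by_category, included_total, excluded_total):
    """Check that per-round totals agree and account for the ballot difference.

    counts_by_category maps a category name to its list of per-round counts.
    Returns (column_totals, ok).
    """
    # Sum each round's column across all categories.
    column_totals = [sum(col) for col in zip(*counts_by_category.values())]
    # Check 1: every round accounts for the same number of records.
    all_equal = len(set(column_totals)) == 1
    # Check 2: that number plus the filtered total equals the unfiltered total.
    ok = all_equal and column_totals[0] + excluded_total == included_total
    return column_totals, ok

# Invented toy counts: three records appear only in the unfiltered tabulation.
toy_counts = {
    "Begich": [1, 1, 1],
    "Peltola": [0, 1, 1],
    "Howe": [1, 0, 0],
    "Did Not Rank": [1, 1, 1],
}
totals, ok = check_round_totals(toy_counts, included_total=5, excluded_total=2)
print(totals, ok)
```

If either check failed, it would suggest that some record found only in the unfiltered tabulation was not assigned to exactly one category in some round.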

Collaborator

@yezr yezr left a comment


LGTM!

@yezr yezr merged commit eaaa4b6 into develop Jan 16, 2025
1 check passed
@yezr yezr deleted the feature/issue-909_filter-dominion branch January 16, 2025 17:25
Development

Successfully merging this pull request may close these issues.

Dominion: Properly filter outstack = 7 "Invalid Contest" CVRs
3 participants