Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for repeat detIDs when converting clusters from SoA #47076

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mroguljic
Copy link
Contributor

@mroguljic mroguljic commented Jan 9, 2025

PR description:

Fix for HLT crashses reported here: #46783

  • No changes are expected in output. It fixes rare events where out-of-order pixel hits are recorded, crashing GPU clusterizing. "Normal" events should be unaffected

PR validation:

  • With this change, the problematic event from the reproducer in the linked issue is processed without crashing
  • runTheMatrix.py -l limited -i all --ibeos:
    • 39 38 36 31 20 1 1 1 1 1 1 tests passed, 7 0 0 0 0 0 0 0 0 0 0 failed.
    • 7 fail because of DAS, "Step0-DAS_ERROR", I assume this is not related to the PR

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14_1_X

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 9, 2025

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 9, 2025

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47076/43248

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @mroguljic for master.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dkotlins, @felicepantaleo, @ferencek, @gpetruc, @missirol, @mmusich, @mroguljic, @mtosi, @rovere, @threus, @tsusa, @tvami this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Jan 10, 2025

enable gpu

@mmusich
Copy link
Contributor

mmusich commented Jan 10, 2025

@mroguljic can you please squash the commits ?

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #47076 was updated. @cmsbuild, @jfernan2, @mandrenguyen can you please check and sign again.

@mmusich
Copy link
Contributor

mmusich commented Jan 10, 2025

@cmsbuild, please test

@mmusich
Copy link
Contributor

mmusich commented Jan 10, 2025

@mroguljic

I have been testing this with the reproducer at #46783 (comment) (modified to run over all the error stream files that we have for the two runs in question):

  • lxplus-gpu
  • CMSSW_15_0_X_2025-01-09-2300 + f62a42f
#!/bin/bash -ex

# List of run numbers
runs=(
    388769
    388770
)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="141X_dataRun3_HLT_v2"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration run:${run} \
    --globaltag ${global_tag} \
    --data \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --path HLT_HIUPC_DoubleEG5_BptxAND_SinglePixelTrack_MaxPixelTrack_v* \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

the processing fails at event 781 with the following assertion:

cmsRun: src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromSoAAlpaka.cc:138: void SiPixelRecHitFromSoAAlpaka<TrackerTraits>::produce(edm::StreamID, edm::Event&, const edm::EventSetup&) const [\
with TrackerTraits = pixelTopology::HIonPhase1]: Assertion `nhits <= dsv.size()' failed.

The same thing happens when I try the reproducer from the related issue: #47021 (comment).

This fix thus looks incomplete. Please take a look.

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6a745/43713/summary.html
COMMIT: f62a42f
CMSSW: CMSSW_15_0_X_2025-01-09-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47076/43713/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 866
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52205
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@jfernan2
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @mandrenguyen, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@mmusich
Copy link
Contributor

mmusich commented Jan 10, 2025

hold

@cmsbuild
Copy link
Contributor

Pull request has been put on hold by @mmusich
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants