KalmanFitterError:1 results in segmentation violation v37.4.0 and above #4028

Open
plariono opened this issue Jan 13, 2025 · 13 comments

@plariono
Contributor

This is to report that, at least in v37.4.0 and above, a KalmanFitterError:1 crashes the reconstruction with a segmentation violation.

This happens on the following systems:

  • Local, Mac OS 15.2, Command Line Tools 16.2, python 3.12.8
  • lxplus, configured with source CI/setup_cvmfs_lcg.sh: LCG_105/x86_64-el9-gcc13-opt

Crash log.

15:19:01    TrackFinding   WARNING   Track finding failed for seed 8348 with errorMagneticFieldError:1
15:19:01    TrackFinding   ERROR     Update step failed: KalmanFitterError:1
15:19:01    TrackFinding   ERROR     Processing of selected track states failed: KalmanFitterError:1
15:19:01    TrackFinding   ERROR     Error in filter: KalmanFitterError:1

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007fd642ed8afa in wait4 () from /lib64/libc.so.6
#1  0x00007fd642e4b283 in do_system () from /lib64/libc.so.6
#2  0x00007fd63fb12519 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc13-opt/lib/libCore.so
#3  0x00007fd63fb11ed4 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc13-opt/lib/libCore.so

This was not happening in v36 from 18th of September '24:

15:46:10    TrackFinding   ERROR     Extrapolation for seed 25829 and track 0 failed with error SurfaceError:1
15:46:11    TrackFinding   ERROR     Update step failed: KalmanFitterError:1
15:46:11    TrackFinding   ERROR     Processing of selected track states failed: KalmanFitterError:1
15:46:11    TrackFinding   ERROR     Error in filter: KalmanFitterError:1
15:46:11    TrackFinding   ERROR     CombinatorialKalmanFilter failed: KalmanFitterError:1 Kalman update failed with the initial parameters:  13.6416  82.8744  2.75151 0.330688  2.31717        0
15:46:11    TrackFinding   WARNING   Track finding failed for seed 27669 with errorKalmanFitterError:1
15:46:12    TrackFinding   ERROR     Step failed with MagneticFieldError:1: Interpolation out of bounds was requested
15:46:12    TrackFinding   ERROR     Propagation failed: MagneticFieldError:1 Interpolation out of bounds was requested with the initial parameters: 
 13.8485
-24.2444
 2.74833
0.425185
-3.31868
       0
15:46:12    TrackFinding   WARNING   Track finding failed for seed 27826 with errorMagneticFieldError:1
15:46:12    TrackFinding   ERROR     Step failed with MagneticFieldError:1: Interpolation out of bounds was requested
15:46:12    TrackFinding   ERROR     Propagation failed: MagneticFieldError:1 Interpolation out of bounds was requested with the initial parameters: 
 14.6321
-167.565
 2.89044
0.358598
 3.52027
       0
15:46:12    TrackFinding   WARNING   Track finding failed for seed 28189 with errorMagneticFieldError:1
15:46:24    Sequencer      INFO      finished event 0
@andiwand
Contributor

andiwand commented Jan 14, 2025

Which workflow did you run here, @plariono?

MagneticFieldError:1 usually means that the propagation left the tracking volume, which could be due to a navigation failure or an invalid geometry.

But the segfault should definitely not happen. It looks like we do an out-of-bounds array access.
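
For readers unfamiliar with the "Interpolation out of bounds" message, here is a minimal sketch of the general idea, assuming a one-dimensional field map on a regular grid. `FieldMap1D` and its members are hypothetical names for illustration and are not the ACTS field map implementation. A lookup outside the covered range is reported as an error value instead of being interpolated.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical 1D field map on a regular grid; NOT the ACTS implementation.
struct FieldMap1D {
  double zMin;                 // position of the first grid point
  double dz;                   // grid spacing, assumed > 0
  std::vector<double> values;  // field value at each grid point

  // Linear interpolation. Returns std::nullopt when z lies outside the map,
  // which is what a lookup hits once a track leaves the covered region.
  std::optional<double> at(double z) const {
    if (values.size() < 2) {
      return std::nullopt;
    }
    const double zMax = zMin + dz * static_cast<double>(values.size() - 1);
    if (z < zMin || z > zMax) {
      return std::nullopt;
    }
    const double t = (z - zMin) / dz;
    std::size_t i = static_cast<std::size_t>(t);
    if (i >= values.size() - 1) {
      i = values.size() - 2;  // z == zMax: use the last cell
    }
    const double frac = t - static_cast<double>(i);
    return values[i] * (1.0 - frac) + values[i + 1] * frac;
  }
};
```

With a map covering z in [0, 2], a query at 2.5 yields no value, and the caller has to decide how to fail; it must not continue as if the lookup had succeeded.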

@plariono
Contributor Author

Hi @andiwand, I'm running the full chain with high-multiplicity events: Pythia + Fatras + CKF + Greedy Solver for the ALICE 3 geometry, as usual.
Here a magnetic field map is used, which covers the detector with a small margin. I could also test it with an ideal field.

@paulgessinger
Member

The SEGFAULT indeed seems to come from inside the Track EDM, from an out-of-bounds vector access. I'm wondering if we produce incomplete track candidates or track states that are not handled correctly later on.

@paulgessinger
Member

paulgessinger commented Jan 15, 2025

@andiwand It seems like in the CKF we access the tip index from the branch state:

auto currentBranch = result.activeBranches.back();
TrackIndexType prevTip = currentBranch.tipIndex();

The .tipIndex() call is the one that ends up causing the SEGFAULT, because that index is already out of range for the track container.

I'm not sure if the SEGFAULT occurs in the same call to the track finding that exhibits the magnetic field error, or if the track container gets corrupted somehow and then fails on a subsequent run.
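
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the ACTS Track EDM: `TrackStore`, `BranchRef`, and both accessor names are hypothetical stand-ins. It shows how a handle that stores only an index can end up pointing past the end of its container, and how a checked lookup turns the would-be out-of-bounds read into a recoverable condition.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT ACTS types.
using IndexType = std::size_t;

struct TrackStore {
  // One tip index per track, addressed by track index.
  std::vector<IndexType> tipIndices;
};

// A lightweight handle that refers to a track by index. It does not own
// the data it points at, so it can become stale.
struct BranchRef {
  const TrackStore* store;
  IndexType trackIndex;

  // Unchecked access: undefined behaviour if trackIndex is out of range.
  IndexType tipIndexUnchecked() const { return store->tipIndices[trackIndex]; }

  // Checked access: reports a stale handle instead of reading out of bounds.
  std::optional<IndexType> tipIndexChecked() const {
    if (trackIndex >= store->tipIndices.size()) {
      return std::nullopt;
    }
    return store->tipIndices[trackIndex];
  }
};
```

A check of this kind would not fix the underlying bookkeeping problem, but it could help locate the point at which the branch index first becomes invalid.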

@andiwand
Contributor

But that means the precondition for the filter call is not met, right @paulgessinger?

@paulgessinger
Member

I guess so. I'm wondering: could it be that the first CKF invocation in the track finding algorithm fails, and then the second invocation produces the SEGFAULT?

@paulgessinger
Member

paulgessinger commented Jan 15, 2025

Do we carry on with errored branches in the CKF?
I was under the impression that the whole seed / CKF invocation gets aborted, not just the failed branch.

EDIT: Seems like the CKF actor consumes the errors and only sets a "last error" variable but, if I read it right, does not terminate the propagation. Even given that, it's unclear to me how this corrupts the track container, in the sense that the back of activeBranches points to an invalid track index.

@andiwand
Contributor

If we encounter an error in the propagation or in the CKF, the propagation should be stopped. In case of CKF errors, here:

return !result.lastError.ok() || result.finished;

If the first pass of the CKF does error we should not continue in the track finding because of

I suspect that this is something that only happens with branching because we have not encountered a segfault in Athena or Acts standalone where we do not branch. I had problems in the past with the branching logic because it is quite involved.
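
The stop condition quoted above (`!result.lastError.ok() || result.finished`) can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the ACTS propagator or CKF: `StepStatus`, `FilterResult`, `act`, `shouldAbort`, and `propagate` are hypothetical names. The actor records an error instead of throwing, and the abort check runs after every step, so the loop ends at the failing step.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified model; names do not correspond to ACTS classes.
enum class StepStatus { Ok, UpdateFailed };

struct FilterResult {
  StepStatus lastError = StepStatus::Ok;  // set by the actor on failure
  bool finished = false;                  // set when nothing is left to do
  std::size_t stepsProcessed = 0;
};

// Actor: processes one step and records, rather than throws, any error.
inline void act(FilterResult& result, StepStatus outcome) {
  ++result.stepsProcessed;
  if (outcome != StepStatus::Ok) {
    result.lastError = outcome;
  }
}

// Aborter: same shape as the condition quoted in the comment above.
inline bool shouldAbort(const FilterResult& result) {
  return result.lastError != StepStatus::Ok || result.finished;
}

// Drive the loop: actor first, then the abort check, for every step.
inline FilterResult propagate(const std::vector<StepStatus>& outcomes) {
  FilterResult result;
  for (StepStatus outcome : outcomes) {
    act(result, outcome);
    if (shouldAbort(result)) {
      return result;
    }
  }
  result.finished = true;
  return result;
}
```

In this model a failure on the second of three steps stops the loop after two steps. Whether any state written before the abort (for example a partially created branch) is still consistent is a separate question, and that is the part the branching logic has to get right.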

@paulgessinger
Member

@plariono Could you run the same crashing job with the track finding verbosity set to VERBOSE and attach the full log here?

@andiwand
Contributor

I would recommend running only the offending seed; otherwise the log file will be quite overwhelming.

@plariono
Contributor Author

@paulgessinger @andiwand thanks for the suggestions. How can I run only the problematic seed?

@plariono
Contributor Author

Also, I ran the setup without the field map and the crash still happened.

19:47:52    TrackFinding   WARNING   Second track finding failed for seed 3601 with errorPropagatorError:3
19:47:52    TrackFinding   ERROR     failed to extrapolate track: SurfaceError:1
19:47:52    TrackFinding   ERROR     Extrapolation for seed 3601 and track 0 failed with error SurfaceError:1
19:47:56    TrackFinding   ERROR     Propagation reached the step count limit of 100000 (did 100000 steps)
19:47:56    TrackFinding   ERROR     Propagation failed: PropagatorError:3 Propagation reached the configured maximum number of steps with the initial parameters: 
-15.3692
-129.544
0.765524
 2.57382
 1110.91
  102694
19:47:56    TrackFinding   WARNING   Second track finding failed for seed 1699 with errorPropagatorError:3
19:47:56    TrackFinding   ERROR     Update step failed: KalmanFitterError:1
19:47:56    TrackFinding   ERROR     Processing of selected track states failed: KalmanFitterError:1
19:47:56    TrackFinding   ERROR     Error in filter: KalmanFitterError:1

 *** Break *** segmentation violation
19:49:10    TrackFinding   ERROR     Propagation reached the step count limit of 100000 (did 100000 steps)
19:49:10    TrackFinding   ERROR     Propagation failed: PropagatorError:3 Propagation reached the configured maximum number of steps with the initial parameters: 
-36.9747
-132.231
-3.05455
0.418232
-3.11863
       0
19:49:10    TrackFinding   WARNING   Track finding failed for seed 1110 with errorPropagatorError:3



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Thread 16 (Thread 0x7f984effb640 (LWP 2321027) "python3"):

@paulgessinger
Member

@plariono I think the fact that it also crashes without magnetic field is just a consequence of the fact that the underlying issue is a navigation failure, which ultimately leads to the SEGFAULT.

I think the exact seed is a bit tricky, but you should probably at least be able to restrict it to the failing event in question by using the skip parameter of the sequencer.
