Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rare nullptr in get_single_signal #320

Open
rheiland opened this issue Nov 10, 2024 · 7 comments
Open

rare nullptr in get_single_signal #320

rheiland opened this issue Nov 10, 2024 · 7 comments

Comments

@rheiland
Copy link
Collaborator

In a particular model which @drbergman and @heberlr have been running replicates on, there seem to be occasional runtime errors. In the current development branch, at this line:
https://github.com/MathCancer/PhysiCell/blob/development/core/PhysiCell_signal_behavior.cpp#L1049
if we insert this code (after dead_cells++;), it seems to sometimes detect a nullptr and at least avoid a fatal error:

                Phase* myPhase = &(pC->phenotype.cycle.current_phase());
                if ( myPhase == nullptr)
                {
                      std::cout << __FUNCTION__ << "  -------- myPhase = nullptr\n";
		      return -9e9; 
                }

Still unsure of the underlying cause, but maybe others have a clue. In my initial tests, it also seems to be more common to happen after the final call to update_all_cells in main.cpp, before leaving the while loop and saving the final.* simulation outputs.

@rheiland
Copy link
Collaborator Author

This error seems to be so rare and weird and not reproducible that I'm closing this issue for now. But here's the latest error that I saw while running lldb with a clang-compiled executable.

total agents: 1388
interval wall time: 0 days, 0 hours, 0 minutes, and 14.5646 seconds 
total wall time: 0 days, 0 hours, 5 minutes, and 28.0556 seconds 

error: project_clang_debug debug map object file "/Users/heiland/dev/PhysiCell-development/PhysiCell_signal_behavior.o" changed (actual: 0x6733df39, debug map: 0x672f7fda) since this executable was linked, debug info will not be loaded
error: project_clang_debug debug map object file "/Users/heiland/dev/PhysiCell-development/PhysiCell_rules.o" changed (actual: 0x6733df38, debug map: 0x672f7fdb) since this executable was linked, debug info will not be loaded
Process 56308 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x4)
    frame #0: 0x00000001000c86c4 project_clang_debug`PhysiCell::get_single_signal(PhysiCell::Cell*, int) + 2396
project_clang_debug`PhysiCell::get_single_signal:
->  0x1000c86c4 <+2396>: ldr    w8, [x8, #0x4]
    0x1000c86c8 <+2400>: adrp   x9, 110
    0x1000c86cc <+2404>: add    x9, x9, #0x668            ; PhysiCell::PhysiCell_constants::apoptotic
    0x1000c86d0 <+2408>: ldr    w9, [x9]
Target 0: (project_clang_debug) stopped.
error: project_clang_debug debug map object file "/Users/heiland/dev/PhysiCell-development/PhysiCell_constants.o" changed (actual: 0x6733df35, debug map: 0x672f7fd9) since this executable was linked, debug info will not be loaded
(lldb) bt
error: project_clang_debug debug map object file "/Users/heiland/dev/PhysiCell-development/PhysiCell_cell.o" changed (actual: 0x6733df35, debug map: 0x672f7fd8) since this executable was linked, debug info will not be loaded
error: project_clang_debug debug map object file "/Users/heiland/dev/PhysiCell-development/PhysiCell_cell_container.o" changed (actual: 0x6733df31, debug map: 0x672f7fd7) since this executable was linked, debug info will not be loaded
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x4)
  * frame #0: 0x00000001000c86c4 project_clang_debug`PhysiCell::get_single_signal(PhysiCell::Cell*, int) + 2396
    frame #1: 0x00000001000c9ce4 project_clang_debug`PhysiCell::get_single_signal(PhysiCell::Cell*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 68
    frame #2: 0x00000001000dee8c project_clang_debug`PhysiCell::Hypothesis_Rule::evaluate(PhysiCell::Cell*) + 160
    frame #3: 0x00000001000df07c project_clang_debug`PhysiCell::Hypothesis_Rule::apply(PhysiCell::Cell*) + 32
    frame #4: 0x00000001000e48c8 project_clang_debug`PhysiCell::Hypothesis_Ruleset::apply(PhysiCell::Cell*) + 100
    frame #5: 0x00000001000eb100 project_clang_debug`PhysiCell::apply_ruleset(PhysiCell::Cell*) + 88
    frame #6: 0x00000001000a1cf8 project_clang_debug`PhysiCell::Cell::advance_bundled_phenotype_functions(double) + 76
    frame #7: 0x000000010008ceec project_clang_debug`.omp_outlined._debug__.8 + 348
    frame #8: 0x000000010008cf5c project_clang_debug`.omp_outlined..9 + 36
    frame #9: 0x00000001008f9dbc libomp.dylib`__kmp_invoke_microtask + 156
    frame #10: 0x00000001008a7c08 libomp.dylib`__kmp_serial_fork_call(ident*, int, fork_context_e, int, void (*)(int*, int*, ...), int (*)(int), kmp_info*, kmp_team*, ompt_data_t*, void**, ompt_data_t**, char*) + 828
    frame #11: 0x00000001008a6f90 libomp.dylib`__kmp_fork_call + 3696
    frame #12: 0x000000010089bc4c libomp.dylib`__kmpc_fork_call + 172
    frame #13: 0x000000010008c360 project_clang_debug`PhysiCell::Cell_Container::update_all_cells(double, double, double, double) + 484
    frame #14: 0x000000010008c170 project_clang_debug`PhysiCell::Cell_Container::update_all_cells(double) + 64
    frame #15: 0x00000001001237d0 project_clang_debug`main + 1940
    frame #16: 0x0000000194e660e0 dyld`start + 2360
(lldb)

@drbergman
Copy link
Collaborator

The issue has been tracked to standard_cell_cell_interactions causing some cells to be flagged for removal (due to either phagocytosis or fusion). However, the neighbors lists are not updated after this. Thus, when a rule uses contact as a signal, the neighbors list could have pointers to cells that have been removed. An upcoming PR will address this.

@rheiland
Copy link
Collaborator Author

Is there a model (with rules) that faithfully reproduces the bug? If so, using which version of the code?

@rheiland
Copy link
Collaborator Author

rheiland commented Dec 1, 2024

I've convinced myself that the fundamental bug is the following (as I've been discussing with @drbergman): die() currently just deletes the pointer to that dead cell ("D"). However, it is possible that one or more of D's state.neighbors (cells) contain D as one of its state.neighbors. And in PhysiCell_signal_behavior.cpp get_selected_signals, in the "contact with" block, we can encounter a neighbor cell which is one that has been deleted. Therefore we get a bogus pointer (memory address) which leads to erratic (not reproducible) segfaults.

One fix would be the following:

void Cell::die()
{
    for( int i=0; i < this->state.neighbors.size(); i++ )
    {
        Cell* pC = this->state.neighbors[i];
        for( int jdx=0; jdx < pC->state.neighbors.size(); jdx++ )
        {
            Cell* pC2 = pC->state.neighbors[jdx];
            if (pC2 == this)
            {
                pC->state.neighbors.erase(find(pC->state.neighbors.begin(), pC->state.neighbors.end(), this));
            }
        }
    }
    delete_cell(this);
    return; 
} 

Rf. https://github.com/rheiland/PhysiCell/blob/fix-die-check-nbrs/core/PhysiCell_cell.cpp#L837
I ran the manual test_* scripts (build/run/diffs in /beta dir) and the diffs showed there were "only" two sample projects (not intracellular) that showed diffs (un-reproducible): interaction and rules (and the diffs appeared in snapshot00000001.svg )

This repo also contains run1rule_v3.xml, cell_rule1.csv (which needs to be copied into /config before running the model), and run_multi.sh which simply runs the model N times (to check for segfaults).

@drbergman
Copy link
Collaborator

I like this solution a lot. It helps ensure that any time a cell is removed that the neighbors list remains accurate. I get why it makes sense to put it within this die() method, but it would also feel at home within the void delete_cell( int index ) function as that does similar things with springs.

The main reason I did not go for this solution was its reliance on the assumption that neighbors are always symmetrical. The add_potentials function is currently symmetric. I don't know if it will always be or if others will implement update_velocity functions that produce asymmetric neighbor relationships. I think if we go with this, we may want to also consider refactoring standard_update_cell_velocity so that the neighbors list is updated outside of that as a standalone function/method that enforces this desired symmetry.

The second reason was to avoid extra compute time, opting to rewrite the neighbors list each mechanics step (as we currently do) to maintain this list.

@rheiland
Copy link
Collaborator Author

rheiland commented Dec 2, 2024

Yeah, I'm not really suggesting this die() "fix" to be a serious fix to the overall problem (attempting to use a cell pointer that's been deleted). I did this primarily to convince myself that it would very directly fix the erratic segfault bug that I'd tracked down to https://github.com/rheiland/PhysiCell/blob/fix-die-check-nbrs/core/PhysiCell_signal_behavior.cpp#L783 .

Also, since this "fix" didn't involve rearranging code in Cell_Container::update_all_cells, I could run the manual reproducibility tests to see how the sample projects' results compared to 1.14.0. Recall our discussion of epsilon diffs in the template project with #338.

Excellent point on calling out concerns in delete_cell. And yes, I agree on the "extra compute time". I don't really get the "reliance on ...symmetrical" comment, as I don't see it as an assumption, but honestly, it's irrelevant at this point :-)

So long as we all agree that #338 will fix the issue of having invalid cell pointers in state.neighbors and we accept that this will redefine new baseline results for future reproducibility tests, I'm on board. This has consumed a lot of our time, but we've learned a lot.

@drbergman
Copy link
Collaborator

Gotcha. I'm all in on resolving this issue sooner than later. So, I'm happy to agree to #338 and move towards merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants