Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setenv() is not thread safe #46002

Open
makortel opened this issue Sep 13, 2024 · 35 comments
Open

setenv() is not thread safe #46002

makortel opened this issue Sep 13, 2024 · 35 comments

Comments

@makortel
Copy link
Contributor

makortel commented Sep 13, 2024

This issue spins off from #44659 after the discovery that our code is calling setenv() in the parallel part of cmsRun, and concurrently to std::getenv() calls (#44659 (comment)).

We need to first analyze the existing cases to call setenv(), and remove those for which there is a better way.

If there are any legitimate use cases left, we need to figure out a mechanism for setting environment variables during the serial part of cmsRun (Service constructors would satisfy the requirement, but we might want something else than moving all the setenv() code to new Services).

Then we need to migrate the existing code calling setenv().

@makortel
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 13, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

Here are some examples of the present setenv() calls

void setLogging(const std::string& level) {
/*
* 0 = all messages are logged (default behavior)
* 1 = INFO messages are not printed
* 2 = INFO and WARNING messages are not printed
* 3 = INFO, WARNING, and ERROR messages are not printed
*/
setenv("TF_CPP_MIN_LOG_LEVEL", level.c_str(), 0);
}

that is called from e.g.


tensorflow::setLogging("3");


setenv("RIVET_REF_PATH", rivetrefCstr, 1);


setenv("OMPI_MCA_pmix_server_uri", uri.c_str(), false);

@makortel
Copy link
Contributor Author

I tested that modifications to the os.environ in the cfg.py file are visible to the C++ side of cmsRun.

One option would be to implement the environment variable setting purely in the python side. I think the solution would have to

  • be something that survives the pickling+unpickling, and gets executed in the worker node
  • probably be good to be associated to the module configuration somehow, and applied only if the associated module gets used in the job
  • probably be good to have access to the final module parameters
    • if implemented in python, the parameters injected by the configuration validation (in C++) would not be available

@makortel
Copy link
Contributor Author

Actually the next step should be a more detailed analysis of the existing use cases of setting the environment variables in CMSSW code.

We should also find a way to get an understanding if any 3rd party code calls setenv() (in the parallel section of cmsRun). Possible ways could be e.g. a gdb script or LD_PRELOADable library to override setenv().

@makortel
Copy link
Contributor Author

It seems to me the most likely setenv() call to cause problems is the one in tensorflow::setLogging()

void setLogging(const std::string& level) {
/*
* 0 = all messages are logged (default behavior)
* 1 = INFO messages are not printed
* 2 = INFO and WARNING messages are not printed
* 3 = INFO, WARNING, and ERROR messages are not printed
*/
setenv("TF_CPP_MIN_LOG_LEVEL", level.c_str(), 0);
}

Unsafe calls to tensorflow::setLogging():

Possibly unsafe call:

  • testGraphLoadingCUDA::test()
    • The test program sets the Tensorflow to use a TBB thread pool, but calls the setLogging() before any inference calls.

Acceptable calls (in single-threaded test code in PhysicsTools/TensorFlow)

Problems with this pattern

  • tensorflow::setLogging() makes it very easy to add innocent-looking but unsafe code
  • The setting is global rather than per module
  • Different modules set different value (module construction order is unspecified, and thus the final outcome of these calls could differ from expectation)

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

@makortel
Copy link
Contributor Author

makortel commented Sep 19, 2024

RivetAnalyzer::beginJob() sets RIVET_REF_PATH and RIVET_INFO_PATH if they are not already set.

void RivetAnalyzer::beginJob() {
//set the environment, very ugly but rivet is monolithic when it comes to paths
char* cmsswbase = std::getenv("CMSSW_BASE");
char* cmsswrelease = std::getenv("CMSSW_RELEASE_BASE");
if (!std::getenv("RIVET_REF_PATH")) {
const std::string rivetref = string(cmsswbase) +
"/src/GeneratorInterface/RivetInterface/data:" + string(cmsswrelease) +
"/src/GeneratorInterface/RivetInterface/data:.";
char* rivetrefCstr = strdup(rivetref.c_str());
setenv("RIVET_REF_PATH", rivetrefCstr, 1);
free(rivetrefCstr);
}
if (!std::getenv("RIVET_INFO_PATH")) {
const std::string rivetinfo = string(cmsswbase) +
"/src/GeneratorInterface/RivetInterface/data:" + string(cmsswrelease) +
"/src/GeneratorInterface/RivetInterface/data:.";
char* rivetinfoCstr = strdup(rivetinfo.c_str());
setenv("RIVET_INFO_PATH", rivetinfoCstr, 1);
free(rivetinfoCstr);
}
}

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

@makortel
Copy link
Contributor Author

spu::Unzip resets the TMPDIR if it is longer than 50 characters

int Unzip(std::string infile, std::string outfile) {
/////////////////////////////////////////////
/////////////// BUG FIX FOR MPI /////////////
/////////////////////////////////////////////
const char *tmpdir = std::getenv("TMPDIR");
if (tmpdir && (strlen(tmpdir) > 50)) {
setenv("TMPDIR", "/tmp", true);
}

Seems that this code was added in #21682 to work around a "feature" of OpenMPI 2.0 and 2.1 (more information in #21419, https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3807/1/1/1.html, https://www.open-mpi.org/faq/?category=osx#startup-errors-with-open-mpi-2.0.x).

Our OpenMPI version is now 4.1.6, maybe this workaround is no longer needed? @cms-sw/generators-l2

For future reference, the spu::Unzip is called via

@makortel
Copy link
Contributor Author

MPIService constructor sets the OMPI_MCA_pmix_server_uri from a configuration parameter, if such parameter exists.

// set the pmix_server_uri MCA parameter if specified in the configuration and not already set in the environment
if (config.existsAs<std::string>("pmix_server_uri", false)) {
std::string uri = config.getUntrackedParameter<std::string>("pmix_server_uri");
// do not overwrite the environment variable if it is already set
setenv("OMPI_MCA_pmix_server_uri", uri.c_str(), false);
}

As Services are constructed in the serial part of cmsRun, this setenv() call is ok-ish.

@makortel
Copy link
Contributor Author

SiStripConfigDb::usingDatabase() resets TNS_ADMIN

// Check TNS_ADMIN environmental variable
std::string pattern = "TNS_ADMIN";
std::string tns_admin = "/afs/cern.ch/project/oracle/admin";
if (std::getenv(pattern.c_str()) != nullptr) {
tns_admin = std::getenv(pattern.c_str());
edm::LogVerbatim(mlConfigDb_) << "[SiStripConfigDb::" << __func__ << "]"
<< " TNS_ADMIN is set to: \"" << tns_admin << "\"";
} else {
edm::LogWarning(mlConfigDb_) << "[SiStripConfigDb::" << __func__ << "]"
<< " TNS_ADMIN is not set!"
<< " Trying to use /afs and setting to: \"" << tns_admin << "\"";
}
// Retrieve TNS_ADMIN from .cfg file and override
if (!dbParams_.tnsAdmin().empty()) {
std::stringstream ss;
ss << "[SiStripConfigDb::" << __func__ << "]"
<< " Overriding TNS_ADMIN value using cfg file!" << std::endl
<< " Original value : \"" << tns_admin << "\"!" << std::endl
<< " New value : \"" << dbParams_.tnsAdmin() << "\"!";
tns_admin = dbParams_.tnsAdmin();
edm::LogVerbatim(mlConfigDb_) << ss.str();
}
// Remove trailing slash and set TNS_ADMIN
if (tns_admin.empty()) {
tns_admin = ".";
}
std::string slash = tns_admin.substr(tns_admin.size() - 1, 1);
if (slash == sistrip::dir_) {
tns_admin = tns_admin.substr(0, tns_admin.size() - 1);
}
setenv(pattern.c_str(), tns_admin.c_str(), 1);

potentially modifying an existing value or a value from configuration.

The usingDatabase() is called via

SiStripConfigDb is defined as a Service, and since Services are constructed in the serial part of cmsRun, this setenv() call is ok-ish.

@makortel
Copy link
Contributor Author

main() in OnlineDB/EcalCondDB/test/LaserSeqToDB.cpp sets MESTORE

setenv("MESTORE", mestore.Data(), 1);

This program looks serial up to that point (unless ROOT would spawn threads), so is safe.

@makortel
Copy link
Contributor Author

CastorShowerLibraryMaker::update() has a commented out call

/*
for (int i=emSLHolder.SLEnergyBins.size()-1;i>0;i--) {
if (emSLHolder.nEvtInBinE.at(i)==(int)emSLHolder.nEvtPerBinE) {
std::ostringstream out;
out << emSLHolder.SLEnergyBins.at(i);
edm::LogVerbatim("HcalSim") << "Bin Limit: " << out.str();
setenv("CASTOR_SL_PG_MAXE",out.str().c_str(),1);
}
break;
}
*/

to set the CASTOR_SL_PG_MAXE environment variable. As being commented out, it is "safe", but maybe this commented out code could be removed? @cms-sw/simulation-l2

@makortel
Copy link
Contributor Author

assign ml, generators, simulation

@cmsbuild
Copy link
Contributor

New categories assigned: ml,generators,simulation

@bbilin,@civanch,@mdhildreth,@mkirsano,@menglu21,@lviliani,@kpedro88,@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor Author

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

I made a PR removing tensorflow::setLogging() #46065

@makortel
Copy link
Contributor Author

@smuzaffar Building on top of the cmsTraceExceptions, I crafted the following script to trace the calls to setenv() in the frameworks' parallel section (approximately)

#!/bin/bash

cat << EOF | gdb --args $@
set pagination off
set breakpoint pending on
# breakpoint between Service construction and first parallel section
break ScheduleItems::initMisc
run
# we don't need that breakpoint anymore
clear ScheduleItems::initMisc
# add additional breakpoint
break setenv
command
where
continue
end
continue
quit
EOF

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

@smuzaffar
Copy link
Contributor

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?

@smuzaffar
Copy link
Contributor

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel

right, scram can set there variables properly and we should do it. I will open cmsdist PR

@smuzaffar
Copy link
Contributor

smuzaffar commented Sep 19, 2024

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

Sure we can do that. So how will it working ? Should we just use --command ' --prefix catch-setenv.sh' option for runTheMatrix ?

@makortel
Copy link
Contributor Author

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

Sure we can do that. So how will it working ? Should we just use --command ' --prefix catch-setenv.sh' option for runTheMatrix ?

Maybe, I didn't really look into details. Where should we place the script? cmsTraceExceptions is outside of CMSSW, on the other hand this script would have mild dependence on the framework details to figure out an approximate point to start the tracing.

(in principle the script could also be extended to catch other unwanted functions, but maybe it's better to not overgeneralize)

@civanch
Copy link
Contributor

civanch commented Sep 20, 2024

@makortel , in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv. So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.

@smuzaffar
Copy link
Contributor

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

@makortel
Copy link
Contributor Author

makortel commented Sep 20, 2024

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

I'm fine with either way.

@makortel
Copy link
Contributor Author

@civanch

in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv.

Just to clear up confusion, do you mean getenv() or setenv()? I'm asking because there is no std::setenv() (it is not part of C++ or C standards, but POSIX).

So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.

Thanks!

@makortel
Copy link
Contributor Author

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?

Assuming from #46065 (comment) that @cms-sw/ml-l2 is ok with the change, I opened a PR to cmsdist to set the environment variable cms-sw/cmsdist#9418

@mseidel
Copy link

mseidel commented Sep 20, 2024

please stop tagging me, I'm the wrong guy

@makortel
Copy link
Contributor Author

Apologies!

@makortel
Copy link
Contributor Author

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

Tagging the right person this time @mseidel42 ...

@smuzaffar
Copy link
Contributor

smuzaffar commented Sep 20, 2024

I have added cms-sw/cms-common@0471f57 which should do what

void RivetAnalyzer::beginJob() {
//set the environment, very ugly but rivet is monolithic when it comes to paths
char* cmsswbase = std::getenv("CMSSW_BASE");
char* cmsswrelease = std::getenv("CMSSW_RELEASE_BASE");
if (!std::getenv("RIVET_REF_PATH")) {
const std::string rivetref = string(cmsswbase) +
"/src/GeneratorInterface/RivetInterface/data:" + string(cmsswrelease) +
"/src/GeneratorInterface/RivetInterface/data:.";
char* rivetrefCstr = strdup(rivetref.c_str());
setenv("RIVET_REF_PATH", rivetrefCstr, 1);
free(rivetrefCstr);
}
if (!std::getenv("RIVET_INFO_PATH")) {
const std::string rivetinfo = string(cmsswbase) +
"/src/GeneratorInterface/RivetInterface/data:" + string(cmsswrelease) +
"/src/GeneratorInterface/RivetInterface/data:.";
char* rivetinfoCstr = strdup(rivetinfo.c_str());
setenv("RIVET_INFO_PATH", rivetinfoCstr, 1);
free(rivetinfoCstr);
}
}
is doing i.e. if RIVET_REF_PATH or RIVET_INFO_PATH are not set then set these pointing to src/GeneratorInterface/RivetInterface/data under LOCALTOP and RELEASETOP.

When deployed then this script will run for all cmssw releases and set RIVET_REF_PATH or RIVET_INFO_PATH when one will run cmsenv

@civanch
Copy link
Contributor

civanch commented Sep 21, 2024

@makortel , sorry, of course, getenv.

@civanch
Copy link
Contributor

civanch commented Sep 24, 2024

+simulation

@smuzaffar
Copy link
Contributor

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2025

We also have putenv in few places https://github.com/search?q=repo%3Acms-sw%2Fcmssw+putenv+language%3AC%2B%2B&type=code

I went through them, and all the ones needing attention seemed to be related to CondDB access. I opened a separate issue for them #47038.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants