From 1d628b9e4b97d50b6a76fdd201121df7a8a63999 Mon Sep 17 00:00:00 2001
From: Thomas Vuillaume
Date: Fri, 26 Jan 2024 19:22:33 +0100
Subject: [PATCH] exhaustive user documentation (#441)

* fix headers
* add cli to doc
* modify scripts to build cli doc
* use book theme
* order most recent prods first
* fix doc requirements issue
* add autoprogram to requirements
* add myst-parser
* restructure user doc
* reorder doc menus
---
 README.rst                 | 182 +------------
 docs/conf.py               |   1 +
 docs/index.rst             |   5 +-
 docs/lstmcpipe_user_doc.md | 531 +++++++++++++++++++++++++++++++++++++
 docs/requirements.txt      |   1 +
 5 files changed, 541 insertions(+), 179 deletions(-)
 create mode 100644 docs/lstmcpipe_user_doc.md

diff --git a/README.rst b/README.rst
index 72516e12..1a3e273d 100644
--- a/README.rst
+++ b/README.rst
@@ -1,7 +1,7 @@
lstMCpipe
=========

|code| |documentation| |slack| |CI| |coverage| |conda| |pypi| |zenodo| |fair|

.. |code| image:: https://img.shields.io/badge/lstmcpipe-code-green
   :target: https://github.com/cta-observatory/lstmcpipe/
@@ -23,8 +23,8 @@
   :alt: Static Badge

Scripts to ease the reduction of MC data on the LST cluster at La Palma.
With this package, the analysis/creation of R1/DL0/DL1/DL2/IRFs can be orchestrated.

Contact:
@@ -41,7 +41,7 @@
If lstMCpipe was used for your analysis, please cite:

.. code-block::

    @misc{garcia2022lstmcpipe,
        title={The lstMCpipe library},
        author={Enrique Garcia and Thomas Vuillaume and Lukas Nickel},
        year={2022},
        eprint={2212.00120},
@@ -51,7 +51,7 @@

in addition to the exact lstMCpipe version used from
https://doi.org/10.5281/zenodo.6460727

You may also want to include the config file with your published code for reproducibility.
@@ -82,8 +82,6 @@
This will set up a new environment with lstchain and other needed tools available.

If you already have your lstchain conda environment, you may simply activate it and install
lstmcpipe there using `pip install lstmcpipe`.

HIPERTA (referred to as rta in the following) support is builtin, but no installation instructions can be provided as of now.

Alternatively, you can install `lstmcpipe` in your own environment to use different versions of the analysis pipelines.

WARNING: Due to changing APIs and data models, we cannot support other versions than the ones specified in
@@ -114,7 +112,7 @@ As a LST member, you may require a MC analysis with a specific configuration, fo
To do so, please:

#. Make sure to be part of the `github cta-observatory/lst-dev team `__. If not, ask one of the admins.
   > note that you can also fork the repository and open the pull request from your fork, but the tests will fail because they need the private LST test data
#. Clone the repository in the cluster at La Palma.
#. Create a new branch named with your ``prodID``
@@ -163,171 +161,3 @@ but **please note** that it still requires a lot of resources to process a full
production. Think about other LP-IT cluster users.

Stages ⚙️
--------

After the pipeline is launched, all selected tasks are performed in order.
These are referred to as *stages* and are collected in ``lstmcpipe/stages``.
The following is a short overview of each stage that can be specified in the configuration.
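For instance, a configuration that runs only the first DL1 production stages could declare the fragment below (a sketch based on the config example in the user documentation, not a complete config):

.. code-block:: yaml

    stages_to_run:
    - r0_to_dl1
    - merge_dl1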
**r0_to_dl1**

In this stage, simtel files are processed up to data level 1 and separated into files for training
and for testing.
For efficiency reasons, files are processed in batches: N files (depending on particle type,
as that influences the average duration of the processing) are submitted as one job in a job array.
To group the files together, the paths are saved in files that are passed to
python scripts in ``lstmcpipe/scripts``, which then call the selected pipeline's
processing tool. These are:

- lstchain: lstchain_mc_r0_to_dl1
- ctapipe: ctapipe-stage1
- rta: lstmcpipe_hiperta_r0_to_dl1lstchain (``lstmcpipe/hiperta/hiperta_r0_to_dl1lstchain.py``)


**dl1ab**

As an alternative to the processing of simtel r0 files, existing dl1 files can be reprocessed.
This can be useful to apply different cleanings or to alter the images by adding noise, etc.
For this to work, the old files have to contain images, i.e. they need to have been processed
using the ``no_image: False`` flag in the config.
The config key ``dl1_reference_id`` is used to determine the input files.
Its value needs to be the full prod_id including software versions (i.e. the name of the
directories directly above the dl1 files).
For lstchain, the dl1ab script is used; ctapipe can use the same script as for simtel
processing. There is no support for hiperta!


**merge_dl1**

In this stage, the previously created dl1 files are merged so that you end up with
train and test datasets for the next stages.


**train_test_split**

Splits the dataset into training and testing datasets, performing a random selection of files with the specified ratio
(default=0.5).

**train_pipe**

IMPORTANT: From here on out, only ``lstchain`` tools are available. More about that at the end.

In this stage, the models to reconstruct the primary particle's properties are trained
on the gamma-diffuse and proton train data.
At present this means that random forests are created using lstchain's
``lstchain_mc_trainpipe``.
Models will be stored in the ``models`` directory.


**dl1_to_dl2**

The previously trained models are evaluated on the merged dl1 files using ``lstchain_dl1_to_dl2`` from
the lstchain package.
DL2 data can be found in the ``DL2`` directory.

**dl2_to_irfs**

Point-like IRFs are produced for each set of offset gammas.
The processing is performed by calling ``lstchain_create_irf_files``.


**dl2_to_sensitivity**

A sensitivity curve is estimated using a script based on pyirf, which performs a cut optimisation
similar to EventDisplay's.
The script can be found in ``lstmcpipe/scripts/script_dl2_to_sensitivity.py``.
This does not use the IRFs and cuts computed in dl2_to_irfs, so it cannot be compared to observed data.
It is a mere benchmark for the pipeline.


Logs and data output 📈
-----------------------
**NOTE**: ``lstmcpipe`` expects the data to be located in a specific structure on the cluster.
Output will be written in a standardized way next to the input data to make sure everyone can access it.
Analysing a custom dataset requires replicating parts of the directory structure and is not the
intended use case for this package.

All the ``r0_to_dl1`` stage job logs are stored in ``/fefs/aswg/data/mc/running_analysis/.../job_logs`` and later
moved to ``/fefs/aswg/data/mc/analysis_logs/.../``.
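While a production is running, the scheduled jobs can be followed with standard Slurm commands (generic Slurm tooling, not part of lstmcpipe), for example:

.. code-block:: bash

    # list your queued and running jobs on the cluster
    squeue -u $USER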
Every time a full MC production is launched, two files with logging information are created:

- ``log_reduced_Prod{3,5}_{PROD_ID}.yml``
- ``log_onsite_mc_r0_to_dl3_Prod{3,5}_{PROD_ID}.yml``

The first one contains a reduced summary of all the scheduled ``job ids`` (to which particle each job corresponds),
while the second one contains the same plus all the commands passed to slurm.

Steps explanation 🔍
--------------------

The directory structure and the stages to run are determined by the stages declared in the config.
Job dependencies between the stages are then handled automatically.

- If the full workflow is launched, directories are not verified as containing data. Overwriting will only happen when an MC prod sharing the same ``prod_id`` is run and analysed on the same day
- If each step is launched independently (advanced users), no directory will be overwritten without prior confirmation from the user

Example of default directory structure for a prod5 MC prod:

.. code-block::

    /fefs/aswg/data/
    ├── mc/
    | ├── DL0/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
    | | └── simtel files
    | |
    | ├── running_analysis/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
    | | └── YYYYMMDD_v{lstchain}_{prod_id}/
    | |     └── temporary dir for r0_to_dl1 + merging stages
    | |
    | ├── analysis_logs/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
    | | └── YYYYMMDD_v{lstchain}_{prod_id}/
    | |     ├── file_lists_training/
    | |     ├── file_lists_testing/
    | |     └── job_logs/
    | |
    | ├── DL1/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
    | | └── YYYYMMDD_v{lstchain}_{prod_id}/
    | |     ├── dl1 files
    | |     ├── training/
    | |     └── testing/
    | |
    | ├── DL2/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
    | | └── YYYYMMDD_v{lstchain}_{prod_id}/
    | |     └── dl2 files
    | |
    | └── IRF/20200629_prod5_trans_80/zenith_20deg/south_pointing/
    |     └── YYYYMMDD_v{lstchain}_{prod_id}/
    |         ├── off0.0deg/
    |         ├── off0.4deg/
    |         └── diffuse/
    |
    └── models/
        └── 20200629_prod5_trans_80/zenith_20deg/south_pointing/
            └── YYYYMMDD_v{lstchain}_{prod_id}/
                ├── reg_energy.sav
                ├── reg_disp_vector.sav
                └── cls_gh.sav


Real Data analysis 💀
---------------------

Real data analysis is not meant to be supported by these scripts. Use at your own risk.


Pipeline Support 🛠️
-------------------

So far, the reference pipeline is ``lstchain``, and only with it is a full analysis possible.
There is, however, support for ``ctapipe`` and ``hiperta`` as well.
The processing up to dl1 is relatively agnostic of the pipeline; working implementations exist for all of them.

In the case of ``hiperta``, a custom script converts the dl1 output to ``lstchain``-compatible files, and the later stages
run using ``lstchain`` scripts.

In the case of ``ctapipe``, dl1 files can be produced using ``ctapipe-stage1``. Once the dependency issues are solved and
ctapipe 0.12 is released, this will most likely switch to using ``ctapipe-process``. We currently do not plan to keep supporting older
versions longer than necessary.
Because the files are not compatible with ``lstchain`` and there is no support for higher data levels in ``ctapipe`` yet, it is not possible
to use any of the following stages. This might change in the future.
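The pipeline to use is selected through the ``workflow_kind`` key of the lstmcpipe configuration, as in the sketch below; ``lstchain`` is the reference value shown in the user documentation, and the exact spelling of the alternative values should be checked against the lstmcpipe documentation before use.

.. code-block:: yaml

    workflow_kind: lstchain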
diff --git a/docs/conf.py b/docs/conf.py
index 28e9cc97..3fb1ced8 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -48,6 +48,7 @@
     "sphinxcontrib.mermaid",
     "nbsphinx",
     "sphinxcontrib.autoprogram",
+    "myst_parser",
 ]

 autosummary_generate = True
diff --git a/docs/index.rst b/docs/index.rst
index 75da2021..f9806517 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -4,7 +4,6 @@
    contain the root `toctree` directive.

-
 .. include:: ../README.rst

 lstmcpipe API documentation
@@ -15,10 +14,10 @@ lstmcpipe API documentation

    productions
    pipeline
+   lstmcpipe_user_doc.md
    examples/configs_pointings
-   api/lstmcpipe
    cli
-
+   api/lstmcpipe

 Indices and tables
diff --git a/docs/lstmcpipe_user_doc.md b/docs/lstmcpipe_user_doc.md
new file mode 100644
index 00000000..7fa5249d
--- /dev/null
+++ b/docs/lstmcpipe_user_doc.md
@@ -0,0 +1,531 @@
# lstmcpipe user documentation


## Context: LST data analysis

An example of workflow:

[![](https://mermaid.ink/img/pako:eNptkk2LwjAQhv9KyGkXLOh662EvusKCXuqxKTLbjlrIR0lSliL-9500DXTVHNJ8PO87M-nceG0a5Dk_S_NbX8F6ti-EFprRKJYnVb8d-5-Lhe76zrJMOq_qru0wyz6ZIqV0ZbGbVtVM9chu9x-nw6akDztsqhTATdZRUxbLeMmmcQGloBxnIXRnjTeaFiix9tboiUTdJL8YJcamYlodQn8XO5eAmOgDEFQNeBizswiShV31rHgsZ_ZOQREh44CI7X6VPFfJLuW4mtFPScxLeYmtJyxAoTQCxhfqhnAPGuTgWpdM1snkJcMXXKFV0DbUAbegEdxfUaHgOS0bPEMvveBC3wmF3pvjoGuee9vjgvcdeeO2BfqH6v_hV9N6Y3l-BunoEMftIXba2HD3P0O7zCs?type=png)](https://cta-observatory.github.io/cta-lstchain/lst_analysis_workflow.html)

Each step of the analysis is implemented in lstchain as a script.
For example, if you want to train a model, you may run `lstchain_mc_trainpipe`.
See https://cta-observatory.github.io/cta-lstchain/introduction.html#analysis-steps


### Problems

- lstchain provides individual steps for each stage of the analysis but no complete pipeline
- these steps need to be orchestrated (dependencies, inputs/outputs)
- data-specific analyses
  - MC is tuned from the DL1 level on to match the real data to be analyzed
  - therefore, the analysis of MC data and the production of models/IRFs are in the hands of analyzers
- job handling can be harder than it seems (logic between jobs, directory organization, job requirements for different configs...)
- everybody makes mistakes, even Abelardo (yes I know, hard to believe)


### How do we allow custom MC productions for specific analyses in an easy-to-use and centralized manner?
+ + +- a common implementation of the pipeline +- a centralized production library +- a way to request for specific productions + +**=> lstmcpipe** + + + + +## General Idea + +- common implementation of a pipeline such as + +![an example of pipeline](https://mermaid.ink/img/pako:eNqdVV1rwjAU_SshD0NBh_row57cYOD2oI9tKZmNGkiTkqQbYv3vy4dN01YtW0C4vfec9Nx7LvYMdzzDcAmBPnvKf3ZHJBRYb2IGrkeWXweBiiPYzJqkOQeU5yjazFyQtIuF4IozU3VRp4wp3inhAHUcQDDLYtY82heA6fQFVGKWKp5mdF65rAmj1XreE-Fe2yO5tGf1xNViesS64KmB7BvD0oh2xxkRGk_0zeFwfXfm2lsT7OdDIb2J3VYyVVgqK9oEyZ-FpZ5_3-o27qHpbeiw_e1WBCLM9WKi5K5kjxvS7IH9ZegO1V9v18MybSOpLChRo1E3Mx53F_IuuQF2KxbuWx_AtT0DT113Grq_MZBTkALXXZg41N9kLSHX_xtUNmX3DJ51SWBEpxlSKBxdaL69oLcNdyxf1Nu7-N_2LlLPf7gJAW5oewPo8PZeB-NMdM10TDBGtcshYNEAKvtqnhKxr8D75i3SP5kMgCVmkijyTdSpAtvXz220bTIB2ZQssaBcRQU72EAmcAJzLHJEMv2ZOBt8DNUR5ziGSx1meI9KqmIYs4uGloU2Hr9mRHEBl3tEJZ5AVCq-PbEdXCpR4hq0Ikh7nNdJbDkf7nNkv0qXX-IaAyQ?type=png) + +- customizable through config file +- allowing complete reproducibility of the steps + + + + + +### Advantages for analyzers + +- easy to use (no or minimal code to write) +- centralized production library +- reproducibility +- easy to share with other analyzers and LST Collaboration +- less time, errors and frustration + + + +### Advantages for the LST Collaboration + +- centralized production library +- common implementation +- increased trust +- results are easier to compare / reproduce +- less computing resources + + + +## Implementation + + + +### lstmcpipe config file example + + +```yaml [3|5-7|12-15|17-37|38-42] +workflow_kind: lstchain # should not be modified for LST analyses + +prod_id: 20230517_v0.9.13_large_offset # you define this ! + +source_environment: + source_file: /fefs/aswg/software/conda/etc/profile.d/conda.sh # do not modify unless you have your own conda installation (and have a good reason for it) + conda_env: lstchain-v0.9.13 # name of the conda environment to use for the analysis (must exist on the cluster) + +slurm_config: + user_account: dpps # slurm user account to use. 
stages_to_run: # list of stages to run
- r0_to_dl1
- merge_dl1
- train_pipe

# all stages have a list of input/output (mandatory), lstchain options and slurm options
stages:
  r0_to_dl1:
  - input: /fefs/aswg/data/mc/DL0/LSTProd2/TestDataset/Crab_large_offset/sim_telarray/node_corsika_theta_52.374_az_240.004_/output_v1.4
    output: /fefs/aswg/data/mc/DL1/AllSky/20230517_v0.9.13_large_offset/TestDataset/Crab_large_offset/node_corsika_theta_52.374_az_240.004_
  merge_dl1:
    input: /fefs/aswg/data/mc/DL1/AllSky/20230517_v0.9.13_large_offset/TestDataset/Crab_large_offset/node_corsika_theta_52.374_az_240.004_
    output: /fefs/aswg/data/mc/DL1/AllSky/20230517_v0.9.13_large_offset/TestDataset/Crab_large_offset/dl1_20230517_v0.9.13_large_offset_node_theta_52.374_az_240.004_merged.h5
    options: --no-image
  train_pipe:
  - input:
      gamma: /fefs/aswg/data/mc/DL1/AllSky/20231024_v0.10.4_base_dec_min_2924_min_1802_6166/TrainingDataset/dec_min_2924/GammaDiffuse/dl1_20231024_v0.10.4_base_dec_min_2924_min_1802_6166_dec_min_2924_GammaDiffuse_merged.h5
      proton: /fefs/aswg/data/mc/DL1/AllSky/20231024_v0.10.4_base_dec_min_2924_min_1802_6166/TrainingDataset/dec_min_2924/Protons/dl1_20231024_v0.10.4_base_dec_min_2924_min_1802_6166_dec_min_2924_Protons_merged.h5
    output: /fefs/aswg/data/models/AllSky/20231024_v0.10.4_base_dec_min_2924_min_1802_6166/dec_min_2924
    extra_slurm_options:
      partition: xxl
      mem: 100G
      cpus-per-task: 16
# the following stages, even if defined, will not be run as they are not in stages_to_run
  dl1_to_dl2:
    ...
  dl2_to_irfs:
    ...
```


### How it works

- Each stage is a step of the pipeline and runs an lstchain script
  - e.g. `r0_to_dl1` runs `lstchain_mc_r0_to_dl1`
- Each stage takes a list of inputs/outputs (mandatory)
- Other options can be passed to the lstchain script through the config file
- Slurm job options can be passed to each stage using `extra_slurm_options`
- lstmcpipe implements the logic between the stages and the corresponding slurm rules
  - e.g. waiting for all jobs from `r0_to_dl1` to be over before running the `merge_dl1` stage


### Generate a config file

The config file can be **created or modified manually**.
So you can define your own pipeline quite easily, use your own conda environment and target your own directories...

But it can be more conveniently **generated** with the command-line tool `lstmcpipe_generate_config`, which must be run **on the cluster**.


### Pipelines and paths handling

- When generating a config, lstmcpipe builds the directory structure for you.
  - For example, it knows that `R0` data for the allsky prod are stored in `/fefs/aswg/data/mc/DL0/LSTProd2/TrainingDataset/Protons/dec_*` and will create the subsequent directory structure, producing DL1 data in `/fefs/aswg/data/mc/DL1/AllSky/$PROD_ID/TrainingDataset/dec_*`

- That knowledge is implemented in the `lstmcpipe.config.paths_config.PathConfig` child classes (one child class per pipeline).
  - The main pipelines are implemented.
  - You can implement your own if you have specific use cases (see the sketch below).

- The class name is passed to the `lstmcpipe_generate_config` command-line tool, along with options specific to that class.
  - e.g. `lstmcpipe_generate_config PathConfigAllSkyFull --prod_id whatagreatprod --dec_list dec_2276`
  - you may find the supported pipelines and their generation command lines in the [lstmcpipe documentation](https://cta-observatory.github.io/lstmcpipe/pipeline)
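For specific use cases, a custom paths class could be sketched as below. This is illustrative only: the constructor signature and the attributes added here are assumptions, so check the actual `PathConfig` API in `lstmcpipe.config.paths_config` before implementing a real subclass.

```python
# Illustrative sketch only: the base-class constructor signature and the
# attributes below are assumptions to verify against lstmcpipe.config.paths_config.
from lstmcpipe.config.paths_config import PathConfig


class PathConfigMyDataset(PathConfig):
    """Hypothetical directory layout for a custom dataset."""

    def __init__(self, prod_id):
        super().__init__(prod_id)
        # hypothetical base directories, following the cluster conventions
        # described in the bullet points above
        self.dl0_dir = "/fefs/aswg/data/mc/DL0/MyDataset"
        self.dl1_dir = f"/fefs/aswg/data/mc/DL1/AllSky/{prod_id}/MyDataset"
```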
### lstchain config

The command-line tool `lstmcpipe_generate_config` also generates an lstchain config file for you.

It actually uses `lstchain_dump_config --mc` to dump the lstchain config from the version installed in your current environment. The config file is:
- complete
- tailored to MC analysis

You should modify it to your needs, e.g. adding the parameters provided by `lstchain_tune_nsb`.

Even though lstchain does not strictly require an exhaustive config, please provide one. It will help others and provide more explicit provenance information.


## 📊 Requesting an MC analysis

As an LST member, you may require an MC analysis with a specific configuration, for example to later analyse a specific source with tuned MC parameters.

### Production list

You may find the list of existing productions in [the lstmcpipe documentation](https://cta-observatory.github.io/lstmcpipe/productions.html).
Please check in this list that a request similar to the one you are about to make does not already exist!


### Determine your needs

Depending on the real data you want to analyse:

=> determine the corresponding MC data (e.g. which declination line)

=> determine whether you need a tuned MC production


### Generate your config

For most analyzers, the easiest way is to use a conda environment on the cluster and run the `lstmcpipe_generate_config` command-line tool.

For the allsky prod, you may use the following command line:

```bash
lstmcpipe_generate_config PathConfigAllSkyFull --prod_id whatagreatprod --dec_list dec_2276
```

- your prod_id should be unique and explicit. It will be used to name the directories and files of your production. Add the date and the lstchain version to it, e.g. `20240101_v0.10.4_dec_123_crab_tuned`
- then edit the lstmcpipe config file, especially the conda environment that you want to use for the analysis
- check that the rest of the config is OK for you (stages, directories...)
- edit the lstchain config file, especially to add any NSB tuning parameters. Please provide an exhaustive config: it will help others and provide more explicit provenance information.
  - see `lstchain_tune_nsb` for more information


### Prepare your pull-request

1. If you are not familiar with git, we recommend you follow the git course: https://escape2020.github.io/school2021/posts/clase04/
2. Make sure you are in the [GitHub cta-observatory organization](https://github.com/orgs/cta-observatory/people) - if not, ask Karl Kosack or Max Linhoff to add you.
3. Make sure you are in the [GitHub lst-dev group](https://github.com/orgs/cta-observatory/teams/lst-dev) - if not, ask Thomas Vuillaume, Ruben Lopez-Coto, Abelardo Moralejo or Max Linhoff to add you.
4. Clone the lstmcpipe repository:
```
git clone git@github.com:cta-observatory/lstmcpipe.git
```
   (`git clone https://github.com/cta-observatory/lstmcpipe.git` if you did not set up an ssh key in your github account)
5. Create a directory named after your `prod_id` in `lstmcpipe/production_configs/`
6. Add your lstmcpipe config, lstchain config and a readme.md file in this directory (see the sketch below)
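   Your production directory might then look like this (a sketch; the config file names are illustrative and match the run command shown in the next section):

   ```
   production_configs/20240101_my_prod_id/
   ├── lstmcpipe_config.yml
   ├── lstchain_config.json
   └── readme.md
   ```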
7. Commit and push your changes in a new branch:
```
git switch -c my_new_branch
git add production_configs/my_prod_id
git commit -m "my new production"
git push origin my_new_branch
```
8. Go to the [lstmcpipe repository](https://github.com/cta-observatory/lstmcpipe/) and create a pull-request from your branch
9. Wait for the CI to run and check that everything is OK
10. Your pull-request will be reviewed and merged if everything is OK
11. You will get notified in the pull-request when the production is ready


### And then?

We will run the production on the cluster using

```
lstmcpipe -c lstmcpipe_config.yml -conf_lst lstchain_config.json
```

and we will notify you in the GitHub pull-request when it is done.


## 🚀 TL;DR - Summary for analyzers in a hurry

1. Search the production library for an existing production that suits your needs.
2. If you find one, you can use it directly (models and DL2 paths are in the config file).
3. If not, you can generate a config file for your own analysis. See the [lstmcpipe documentation](https://cta-observatory.github.io/lstmcpipe/pipeline) for the list of supported pipelines.
Example:
```
ssh cp02
```
```
source /fefs/aswg/software/conda/etc/profile.d/conda.sh
conda activate lstchain-v0.10.5
cd lstmcpipe/production_configs
mkdir 20240101_my_prod_id; cd 20240101_my_prod_id
lstmcpipe_generate_config PathConfigAllSkyFull --prod_id 20240101_my_prod_id --dec_list dec_2276
```
4. Check the generated lstmcpipe config, in particular the conda environment that you want to use for the analysis.
5. Edit the generated lstchain config, in particular to add any NSB tuning parameters. Please provide an exhaustive config: it will help others and provide more explicit provenance information.
6. Create and edit a `readme.md` file in the same directory to describe your production. Don't add sensitive or private information in this file; refer to the LST wiki if needed.
7. Submit your configs + readme through a pull-request (see the dedicated section) in the lstmcpipe repository.
8. Enjoy! 🎉 😎


## Need help?

Find this documentation again at https://cta-observatory.github.io/lstmcpipe/

Join the CTA North slack and ask for help in the lstmcpipe_prods channel

[![Slack](https://img.shields.io/badge/CTA_North_slack-lstmcpipe_prods_channel-darkgreen?logo=slack)](https://cta-north.slack.com/archives/C035H3C2HAS)


## Cite lstmcpipe

If you publish results / analysis, please consider citing lstmcpipe:

https://cta-observatory.github.io/lstmcpipe/index.html#cite-us

```bibtex
@misc{garcia2022lstmcpipe,
      title={The lstMCpipe library},
      author={Enrique Garcia and Thomas Vuillaume and Lukas Nickel},
      year={2022},
      eprint={2212.00120},
      archivePrefix={arXiv},
      primaryClass={astro-ph.IM}
}
```

in addition to the exact lstmcpipe version used from

[![Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.6460727.svg)](https://doi.org/10.5281/zenodo.6460727)

You may also want to include the config file with your published code for reproducibility 🔄


## Appendix


### Steps explanation 🔍

The directory structure and the stages to run are determined by the stages declared in the config.
Job dependencies between the stages are then handled automatically.

- If the full workflow is launched, directories are not verified as containing data. Overwriting will only happen when an MC prod sharing the same `prod_id` is run and analysed on the same day.
- If each step is launched independently (advanced users), no directory will be overwritten without prior confirmation from the user.
Example of default directory structure for a prod5 MC prod:

```
 /fefs/aswg/data/
 ├── mc/
 | ├── DL0/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
 | | └── simtel files
 | |
 | ├── running_analysis/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
 | | └── YYYYMMDD_v{lstchain}_{prod_id}/
 | |     └── temporary dir for r0_to_dl1 + merging stages
 | |
 | ├── analysis_logs/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
 | | └── YYYYMMDD_v{lstchain}_{prod_id}/
 | |     ├── file_lists_training/
 | |     ├── file_lists_testing/
 | |     └── job_logs/
 | |
 | ├── DL1/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
 | | └── YYYYMMDD_v{lstchain}_{prod_id}/
 | |     ├── dl1 files
 | |     ├── training/
 | |     └── testing/
 | |
 | ├── DL2/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
 | | └── YYYYMMDD_v{lstchain}_{prod_id}/
 | |     └── dl2 files
 | |
 | └── IRF/20200629_prod5_trans_80/zenith_20deg/south_pointing/
 |     └── YYYYMMDD_v{lstchain}_{prod_id}/
 |         ├── off0.0deg/
 |         ├── off0.4deg/
 |         └── diffuse/
 |
 └── models/
     └── 20200629_prod5_trans_80/zenith_20deg/south_pointing/
         └── YYYYMMDD_v{lstchain}_{prod_id}/
             ├── reg_energy.sav
             ├── reg_disp_vector.sav
             └── cls_gh.sav
```


### Real Data analysis 💀

Real data analysis is not meant to be supported by these scripts. Use at your own risk.


### Pipeline Support 🛠️

So far, the reference pipeline is `lstchain`, and only with it is a full analysis possible.
There is, however, support for `ctapipe` and `hiperta` as well.
The processing up to dl1 is relatively agnostic of the pipeline; working implementations exist for all of them.

In the case of `hiperta`, a custom script converts the dl1 output to `lstchain`-compatible files, and the later stages
run using `lstchain` scripts.

In the case of `ctapipe`, dl1 files can be produced using `ctapipe-stage1`. Once the dependency issues are solved and
ctapipe 0.12 is released, this will most likely switch to using `ctapipe-process`. We currently do not plan to keep supporting older
versions longer than necessary.
Because the files are not compatible with `lstchain` and there is no support for higher data levels in `ctapipe` yet, it is not possible
to use any of the following stages. This might change in the future.


### Stages ⚙️

After the pipeline is launched, all selected tasks are performed in order.
These are referred to as *stages* and are collected in `lstmcpipe/stages`.
The following is a short overview of each stage that can be specified in the configuration.


**r0_to_dl1**

In this stage, simtel files are processed up to data level 1 and separated into files for training
and for testing.
For efficiency reasons, files are processed in batches: N files (depending on particle type,
as that influences the average duration of the processing) are submitted as one job in a job array.
To group the files together, the paths are saved in files that are passed to
python scripts in `lstmcpipe/scripts`, which then call the selected pipeline's
processing tool.
These are:

- lstchain: lstchain_mc_r0_to_dl1
- ctapipe: ctapipe-stage1
- rta: lstmcpipe_hiperta_r0_to_dl1lstchain (`lstmcpipe/hiperta/hiperta_r0_to_dl1lstchain.py`)


**dl1ab**

As an alternative to the processing of simtel r0 files, existing dl1 files can be reprocessed.
This can be useful to apply different cleanings or to alter the images by adding noise, etc.
For this to work, the old files have to contain images, i.e. they need to have been processed
using the `no_image: False` flag in the config.
The config key `dl1_reference_id` is used to determine the input files.
Its value needs to be the full prod_id including software versions (i.e. the name of the
directories directly above the dl1 files).
For lstchain, the dl1ab script is used; ctapipe can use the same script as for simtel
processing. There is no support for hiperta!


**merge_dl1**

In this stage, the previously created dl1 files are merged so that you end up with
train and test datasets for the next stages.


**train_test_split**

Splits the dataset into training and testing datasets, performing a random selection of files with the specified ratio
(default=0.5).


**train_pipe**

IMPORTANT: From here on out, only `lstchain` tools are available. More about that at the end.

In this stage, the models to reconstruct the primary particle's properties are trained
on the gamma-diffuse and proton train data.
At present this means that random forests are created using lstchain's
`lstchain_mc_trainpipe`.
Models will be stored in the `models` directory.


**dl1_to_dl2**

The previously trained models are evaluated on the merged dl1 files using `lstchain_dl1_to_dl2` from
the lstchain package.
DL2 data can be found in the `DL2` directory.


**dl2_to_irfs**

Point-like IRFs are produced for each set of offset gammas.
The processing is performed by calling `lstchain_create_irf_files`.


**dl2_to_sensitivity**

A sensitivity curve is estimated using a script based on pyirf, which performs a cut optimisation
similar to EventDisplay's.
The script can be found in `lstmcpipe/scripts/script_dl2_to_sensitivity.py`.
This does not use the IRFs and cuts computed in dl2_to_irfs, so it cannot be compared to observed data.
It is a mere benchmark for the pipeline.


### 📈 Logs and data output

Output will be written in a standardized way next to the input data to make sure everyone can access it.

By default, job logs are stored in `/fefs/aswg/data/mc/running_analysis/.../job_logs` and later moved to `/fefs/aswg/data/mc/analysis_logs/.../`.

Every time a full MC production is launched, two files with logging information are created:

- `log_reduced_Prod{3,5}_{PROD_ID}.yml`
- `log_onsite_mc_r0_to_dl3_Prod{3,5}_{PROD_ID}.yml`

The first one contains a reduced summary of all the scheduled `job ids` (to which particle each job corresponds),
while the second one contains the same plus all the commands passed to slurm.
\ No newline at end of file
diff --git a/docs/requirements.txt b/docs/requirements.txt
index b2c49c9c..12e18d94 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,4 +1,5 @@
 GitPython
+myst_parser
 nbsphinx
 sphinx
 sphinx-argparse