Faroe Islands case (09 March 2021)
- Period (09-10 March 2021):
- Status 27 Jan 2023: Runs submitted
- Status 31 Jan 2023: Stalled in climate generation (Prepare_pgd)
- Status 03 Feb 2023: Climate generation now succeeds for FAROE500, but I had to make some changes to get it to run (see fix below)
- Status 07 Feb 2023: Crash in Canari, which was related to the number of pools used (see fix below)
- Status 09 Feb 2023: So far I had only been running with LL_LIST="24,6," (see the sketch after this list) and first got a crash in the Forecast at 2021031000. When I tried longer forecasts I also got crashes at longer FC lengths in other cycles (at 42 h for 2021030900 and 18 h for 2021031000). See problem below.
- Output data available at: ec:/nhac/harmonie/cy46_faroe500/YYYY/MM/DD/HH/
- Local Exp setup: /home/nhac/hm_home/cy46_faroe500/
- Source code: /perm/nhac/git/github/DEODE-NWP/Harmonie/
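For reference, the forecast length per cycle is controlled by LL_LIST in config_exp.h, paired with the cycle list in HH_LIST. The snippet below is only a sketch; the HH_LIST value is an assumed example and should be checked against the actual experiment configuration.

```bash
# config_exp.h (sketch): HH_LIST gives the cycles to run, LL_LIST the forecast
# length in hours for each cycle; the lists are recycled if LL_LIST is shorter.
HH_LIST="00-21:3"     # assumed example: a cycle every 3 hours
LL_LIST="24,6,"       # as used here: 24 h forecast at the first cycle, 6 h at the others
# For full 48 h forecasts one would instead use e.g. LL_LIST="48,6,"
```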
- Period (09-10 March 2021): 8-day spin-up
- Status 27 Jan 2023: Runs submitted with CY46-DEODE
- Status 31 Jan 2023: Stalled in climate generation (Prepare_pgd) - memory issues
- Status 03 Feb 2023: I am currently requesting the maximum memory for one node (128 GB, see the sketch after this list), but I am still having memory issues:

        Operating system error: Cannot allocate memory
        Allocation would exceed memory limit
- Status 09 Feb 2023: No change in status, still having memory issues. I will try running with Bolli's cut-out climate files when they are ready. Another option would be to test the parallelization, which supposedly makes use of the SFX_MPI macro. Since I have not gotten anywhere with the Prepare_pgd memory issue, I am now also trying the UWC-W version 1.2 of CY43 instead; with this version I can use pre-created climate files from UWC-W and therefore will not hit the same memory issue. The run has been submitted and is currently at MARS_stage_bd.
- Status 27 Feb 2023: Run resubmitted as it crashed while I was out of office. Currently running without problems, although it is a bit slow. No output generated yet as it is still in the warm-up period.
- Status 07 Mar 2023: The UWC-W cy43 run has finished
- Status 08 Mar 2023: Will try to run with the new version of the DEODE branch of cy46, which includes updates to make Prepare_pgd parallelizable and hopefully allows climate generation without memory issues.
- Output data available at: ec:/nhac/harmonie/cy43_uwcw_dini20a/YYYY/MM/DD/HH/
- Local Exp setup: /home/nhac/hm_home/cy46_dini20a/ and /home/nhac/hm_home/cy43_uwcw_dini20a/
- Source code: /perm/nhac/git/github/DEODE-NWP/Harmonie/ and /ec/res4/hpcperm/dui/harmonie_git/ieew/tags/uwcw_v1.2/
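For the memory issue above, "requesting the maximum memory for one node" amounts to a batch directive along the lines of the sketch below (Atos uses SLURM; how the directive is actually injected for the Prepare_pgd task via Env_submit is not shown and is an assumption):

```bash
# Sketch: SLURM directive requesting the full node memory for the Prepare_pgd task.
#SBATCH --mem=128G
```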
The FAROE500 run crashed in Prepare_pgd with an error related to the LAI, triggered at the following line in interpol_field2d.F90:

    CALL ABOR1_SFX('Some points lack data and are too far away from other points. &
                  & Please define a higher halo value in NAM_IO_OFFLINE.')
I was in contact with both Patrick and Bolli about this. The crash was caused by a lack of LAI and/or albedo data within the domain. This is apparently a known problem in SURFEX related to TEB, where e.g. water-dominated domains crash in PGD due to missing urban vegetation information. The cure is to update XDATA_VEGTYPE for the urban vegetation, and the new ALB_SAT and LAI_SAT parameter data files are needed as well.
The fix includes the following:
- Update XDATA_VEGTYPE in src/surfex/SURFEX/init_types_param.F90 to the following:

        XDATA_VEGTYPE(NUT_*,NVT_GRAS) = 0.3
        XDATA_VEGTYPE(NUT_*,NVT_NO)   = 0.7
        XDATA_VEGTYPE(NUT_*,NVT_TEBD) = 0.0
- Use the new LAI_SAT and ALB_SAT parameter files. These data files are available on Atos here:
- /ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/ALB_SAT
- /ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/LAI_SAT
Also, an update to the script scr/Prepare_pgd is needed to make use of them; see this commit, or the latest version of the dev-CY46h1 branch.
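As a rough illustration of what the Prepare_pgd update has to do (a sketch only; it assumes the ECOCLIMAP-SG data come as SURFEX direct-access .dir/.hdr file pairs and that the PGD namelist is then pointed at them by scr/Prepare_pgd, so the commit mentioned above remains the authoritative reference):

```bash
# Sketch: make the satellite-based ALB/LAI parameter files visible to the PGD step.
ALB_SAT_DIR=/ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/ALB_SAT
LAI_SAT_DIR=/ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/LAI_SAT

# Link the data files into the PGD working directory (file layout assumed).
for f in "$ALB_SAT_DIR"/* "$LAI_SAT_DIR"/*; do
  ln -sf "$f" .
done

# The PGD namelist written by scr/Prepare_pgd must then reference these files
# instead of the old albedo/LAI sources; see the commit above for the exact entries.
```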
The run then crashed in Canari, with the following error message reported:

    Signal#11 was caused by address not mapped to object [memaddr=0x6000000030], nsigs = 1
This is apparently related to MPI and the parallelization, and it seems to be the same problem that Erik and others have had. In my case, I had to lower the number of pools used for Canari as follows:

    $nnode_canari = 1;
    $tpn_canari = 1;
    $npools_canari = $nnode_canari * $tpn_canari ;
    ...
    $job_list{'Canari'}{'ZNPROC'}  = 'export NPROC='.$npools_canari;
    $job_list{'Canari'}{'ZNPROCX'} = 'export NPROCX=1';
    $job_list{'Canari'}{'ZNPROCY'} = 'export NPROCY='.$npools_canari ;
    $job_list{'Canari'}{'ZNPOOLS'} = 'export NPOOLS='.$npools_canari ;
    $job_list{'Canari'}{'NTASKS'}  = $submit_type.' --nodes='.$nnode_canari;
    $job_list{'Canari'}{'TASKS_PER_NODE'} = $submit_type.' --ntasks-per-node='.$tpn_canari;
The crash in the Forecast (see status 09 Feb 2023 above) gave the following message:

    Signal#8 was caused by floating-point divide by zero [memaddr=0x3769e4b], nsigs = 1
with the following backtrace:

    ==== backtrace (tid:3999048) ====
     0 0x0000000000b49970 signal_drhook() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/ifsaux/support/drhook.c:1809
     1 0x0000000000012ce0 __funlockfile() :0
     2 0x0000000003769e4b calc_hgpl() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_locpat.F90:335
     3 0x0000000003769e4b rttov_locpat_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_locpat.F90:171
     4 0x00000000036679ab rttov_setgeometry_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_setgeometry.F90:135
     5 0x0000000003631de0 rttov_direct_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_direct.F90:451
     6 0x0000000003684fa9 rttov_ec_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/ifs/rttov_ec.F90:629
     7 0x0000000001e2fab7 mts_phys_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/phys_dmn/mts_phys.F90:586
     8 0x0000000001e0968e phymfpos_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/phymfpos.F90:653
     9 0x0000000001be71f2 vpos_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/vpos.F90:306
    10 0x0000000001be27c1 scan2m_vpos_._omp_fn.0() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/scan2m_vpos.F90:132
    11 0x0000000000011706 GOMP_parallel() ???:0
    12 0x0000000001be2e43 scan2m_vpos_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/scan2m_vpos.F90:132
    13 0x00000000017c705a stepo_fpos_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/stepo_fpos.F90:212
    14 0x000000000127959f allfpos_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/allfpos.F90:701
    15 0x000000000116442e fullpos_drv_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/fullpos_drv.F90:211
    16 0x0000000001299449 scan2m_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/scan2m.F90:576
    17 0x0000000000c231db stepo_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/stepo.F90:389
    18 0x0000000000b96d93 cnt4_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt4.F90:932
    19 0x0000000000b91d33 cnt3_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt3.F90:200
    20 0x0000000000b91a17 cnt2_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt2.F90:138
    21 0x0000000000b90962 cnt1_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt1.F90:161
    22 0x0000000000b8869f cnt0_() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt0.F90:217
    23 0x000000000063d266 master() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/master.F90:189
    24 0x000000000063d266 main() /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/master.F90:7
    25 0x000000000003acf3 __libc_start_main() ???:0
    26 0x000000000063d40e _start() ???:0
    =================================
At the moment I have no clue why this happens. I have tried lowering the time step from 20 to 15; the run was just submitted (2023-02-09).
Lowering the time step to 15 did not work and neither did lowering it even further. Instead, I adjusted some settings in config_exp.h:
- Increased MAKEGRIB_LISTENERS from 2 to 3
and in Env_submit:
- Reduced ntasks_mars to 3
- Set $nprocx=34 and $nprocy=41 for Forecast and Listen2file, and $nprocx=16 and $nprocy=10 otherwise (see the sketch after this list)
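A rough sketch of the Env_submit part of these changes, in the same style as the Canari snippet above (the surrounding per-task logic and the $task variable are assumptions; the actual structure of Env_submit may differ):

```perl
# Sketch only: per-task MPI decomposition in Env_submit (variable names assumed).
$ntasks_mars = 3;                  # reduced number of MARS staging tasks

if ( $task eq 'Forecast' || $task eq 'Listen2file' ) {
    $nprocx = 34;                  # 34 x 41 decomposition for the forecast tasks
    $nprocy = 41;
} else {
    $nprocx = 16;                  # smaller 16 x 10 decomposition elsewhere
    $nprocy = 10;
}
```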
Furthermore, I had previously been trying to run with 90 vertical levels but changed this so that I am now running with 65 levels instead.
I am not entirely sure whether this was all I did, or which change actually made it work, but it now runs like a charm for FC lengths of 48 h.