
Faroe Islands case (09 March 2021)

Emy Alerskans edited this page Mar 8, 2023 · 7 revisions

cy46-deode branch (Emy Alerskans/Xiaohua Yang): DINI 2 km and FAROE500 500 m

FAROE500 500 m

  • Period (09-10 March 2021):
    • Status 27 Jan 2023: Runs submitted
    • Status 31 Jan 2023: Stalled in climate generation (Prepare_pgd)
    • Status 03 Feb 2023: Climate generation now succeeds for FAROE500, but I had to make some changes to get it to run (see fix below)
    • Status 07 Feb 2023: Crash in Canari, which was related to the number of pools used (see fix below)
    • Status 09 Feb 2023: So far I had only been running with LL_LIST="24,6," and first got a crash in Forecast for cycle 2021031000. When I tried running longer forecasts, I also got crashes at longer FC lengths in other cycles (at 42 h for 2021030900 and at 18 h for 2021031000). See problem below.
  • Output data available at: ec:/nhac/harmonie/cy46_faroe500/YYYY/MM/DD/HH/
  • Local Exp setup: /home/nhac/hm_home/cy46_faroe500/
  • Source code: /perm/nhac/git/github/DEODE-NWP/Harmonie/
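
The YYYY/MM/DD/HH placeholders in the output path above are filled in from the cycle time. As a minimal sketch (the helper name `cycle_output_path` is made up for illustration and is not part of Harmonie), the archive directory for a given cycle could be built like this:

```python
from datetime import datetime

# Archive root as listed on this page for the FAROE500 experiment.
ARCHIVE_ROOT = "ec:/nhac/harmonie/cy46_faroe500"


def cycle_output_path(cycle: str, root: str = ARCHIVE_ROOT) -> str:
    """Return the archive directory for a cycle given as YYYYMMDDHH.

    Hypothetical helper: it only expands the YYYY/MM/DD/HH template.
    """
    t = datetime.strptime(cycle, "%Y%m%d%H")
    return f"{root}/{t:%Y}/{t:%m}/{t:%d}/{t:%H}/"


if __name__ == "__main__":
    # The two cycles mentioned in the status notes.
    for cycle in ("2021030900", "2021031000"):
        print(cycle_output_path(cycle))
```

Passing a different `root` (for example the cy43_uwcw_dini20a path listed under DINI 2 km) gives the corresponding DINI directories.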

DINI 2 km

  • Period (09-10 March 2021): 8 day spin-up
    • Status 27 Jan 2023: Runs submitted with CY46-DEODE
    • Status 31 Jan 2023: Stalled in climate generation (Prepare_pgd) - memory issues
    • Status 03 Feb 2023: I am currently requesting the maximum memory for one node (128 GB), but I am still having memory issues:
      Operating system error: Cannot allocate memory
      Allocation would exceed memory limit
    • Status 09 Feb 2023: No change in status; I am still having memory issues. I will try running with Bolli's cut-out climate files when they are ready. Another option would be to test parallelization, which supposedly makes use of the SFX_MPI macro. I am now also trying out UWC-W version 1.2 of CY43 instead, as I haven't gotten anywhere with the Prepare_pgd memory issue. With this version I can use pre-created climate files from UWC-W and therefore won't have the same memory issue. The run has been submitted and is currently at MARS_stage_bd.
    • Status 27 Feb 2023: Run resubmitted as it crashed while I was out of office. Currently running without problems, although it is a bit slow. No output generated yet as it is still in the warm-up period.
    • Status 07 Mar 2023: The UWC-W CY43 run has finished.
    • Status 08 Mar 2023: I will try to run with the new version of the DEODE branch of CY46, which includes updates that make Prepare_pgd parallelizable. Hopefully this will allow climate generation without memory issues.
  • Output data available at: ec:/nhac/harmonie/cy43_uwcw_dini20a/YYYY/MM/DD/HH/
  • Local Exp setup: /home/nhac/hm_home/cy46_dini20a/ /home/nhac/hm_home/cy43_uwcw_dini20a/
  • Source code: /perm/nhac/git/github/DEODE-NWP/Harmonie/ /ec/res4/hpcperm/dui/harmonie_git/ieew/tags/uwcw_v1.2/

FAROE500 - Climate generation

The run crashed in Prepare_pgd with an error related to the LAI. The abort came from the following call in interpol_field2d.F90:

CALL ABOR1_SFX('Some points lack data and are too far away from other points. &
& Please define a higher halo value in NAM_IO_OFFLINE.')

I was in contact with both Patrick and Bolli about this. The crash was related to a lack of LAI and/or albedo data within the domain. This is apparently a known problem in SURFEX related to TEB, where e.g. water-dominated domains crash in PGD due to a lack of urban vegetation information. The cure here is to update XDATA_VEGTYPE for urban vegetation. Using the new ALB_SAT and LAI_SAT parameter data files was necessary as well.

The fix includes the following:

  • Update XDATA_VEGTYPE in src/surfex/SURFEX/init_types_param.F90 to the following:

  • Use the new LAI_SAT and ALB_SAT parameter files. These data files are available at Atos here:

    • /ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/ALB_SAT
    • /ec/res4/hpcperm/hlam/data/climate/ECOCLIMAP-SG/LAI_SAT

    Also, an update to the script scr/Prepare_pgd is needed to make use of them. See this commit:

    Or latest version of branch dev-CY46h1.

FAROE500 - Crash in Canari

The run crashed in Canari with the following error message reported:

Signal#11 was caused by address not mapped to object [memaddr=0x6000000030], nsigs = 1

This is apparently related to MPI and the parallelization, and it seems to be the same problem that Erik and others had. In my case, I had to lower the number of pools used for Canari, using the following settings:

$npools_canari = $nnode_canari * $tpn_canari ;
$job_list{'Canari'}{'ZNPROC'}          = 'export NPROC='.$npools_canari;
$job_list{'Canari'}{'ZNPROCX'}         = 'export NPROCX=1';
$job_list{'Canari'}{'ZNPROCY'}         = 'export NPROCY='.$npools_canari ;
$job_list{'Canari'}{'ZNPOOLS'}         = 'export NPOOLS='.$npools_canari ;
$job_list{'Canari'}{'NTASKS'}          = $submit_type.' --nodes='.$nnode_canari;
$job_list{'Canari'}{'TASKS_PER_NODE'}  = $submit_type.' --ntasks-per-node='.$tpn_canari;
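
The arithmetic behind these settings can be sketched in Python. This is only an illustration and not part of the submission scripts: the function name is made up, and the node and task counts in the example are arbitrary values, not the ones used in the experiment.

```python
def canari_decomposition(nnode: int, tpn: int) -> dict:
    """Mirror the Perl settings above for a given node and task count.

    nnode: number of nodes (corresponds to $nnode_canari)
    tpn:   tasks per node  (corresponds to $tpn_canari)
    """
    if nnode < 1 or tpn < 1:
        raise ValueError("nnode and tpn must both be at least 1")
    npools = nnode * tpn  # $npools_canari = $nnode_canari * $tpn_canari
    return {
        "NPROC": npools,   # total number of MPI tasks
        "NPROCX": 1,       # fixed to 1 in the snippet above
        "NPROCY": npools,  # all tasks along the other direction
        "NPOOLS": npools,  # one pool per task
    }


if __name__ == "__main__":
    # Arbitrary example values, chosen only to show the relationship.
    settings = canari_decomposition(nnode=1, tpn=8)
    for name, value in settings.items():
        print(f"export {name}={value}")
    # With this layout NPROCX * NPROCY always equals NPROC.
    assert settings["NPROCX"] * settings["NPROCY"] == settings["NPROC"]
```

The point of the layout is that NPROC, NPROCY and NPOOLS are all tied to the single product of nodes and tasks per node, so lowering either of those two numbers lowers the number of pools.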

FAROE500 - Crash in Forecast for longer FC

Crash message:

Signal#8 was caused by floating-point divide by zero [memaddr=0x3769e4b], nsigs = 1

with the following backtrace:

==== backtrace (tid:3999048) ====
 0 0x0000000000b49970 signal_drhook()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/ifsaux/support/drhook.c:1809
 1 0x0000000000012ce0 __funlockfile()  :0
 2 0x0000000003769e4b calc_hgpl()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_locpat.F90:335
 3 0x0000000003769e4b rttov_locpat_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_locpat.F90:171
 4 0x00000000036679ab rttov_setgeometry_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_setgeometry.F90:135
 5 0x0000000003631de0 rttov_direct_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_direct.F90:451
 6 0x0000000003684fa9 rttov_ec_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/ifs/rttov_ec.F90:629
 7 0x0000000001e2fab7 mts_phys_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/phys_dmn/mts_phys.F90:586
 8 0x0000000001e0968e phymfpos_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/phymfpos.F90:653
 9 0x0000000001be71f2 vpos_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/vpos.F90:306
10 0x0000000001be27c1 scan2m_vpos_._omp_fn.0()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/scan2m_vpos.F90:132
11 0x0000000000011706 GOMP_parallel()  ???:0
12 0x0000000001be2e43 scan2m_vpos_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/scan2m_vpos.F90:132
13 0x00000000017c705a stepo_fpos_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/stepo_fpos.F90:212
14 0x000000000127959f allfpos_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/allfpos.F90:701
15 0x000000000116442e fullpos_drv_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/fullpos/fullpos_drv.F90:211
16 0x0000000001299449 scan2m_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/scan2m.F90:576
17 0x0000000000c231db stepo_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/stepo.F90:389
18 0x0000000000b96d93 cnt4_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt4.F90:932
19 0x0000000000b91d33 cnt3_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt3.F90:200
20 0x0000000000b91a17 cnt2_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt2.F90:138
21 0x0000000000b90962 cnt1_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt1.F90:161
22 0x0000000000b8869f cnt0_()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/arpifs/control/cnt0.F90:217
23 0x000000000063d266 master()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/master.F90:189
24 0x000000000063d266 main()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/master.F90:7
25 0x000000000003acf3 __libc_start_main()  ???:0
26 0x000000000063d40e _start()  ???:0
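
When reading a backtrace like the one above, the frame of interest is usually the first one that points into model source code rather than into the signal handler. The following is a minimal sketch of how that frame could be picked out automatically. It assumes the line layout shown above (index, address, function, file:line); the function name and the choice to skip drhook.c are my own assumptions.

```python
import re

# One backtrace frame: index, hex address, function name, then file:line.
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+0x[0-9a-fA-F]+\s+(?P<func>\S+)\(\)"
    r"\s+(?P<file>\S*):(?P<line>\d+)\s*$"
)

# Files that belong to the signal handling itself (assumption).
SKIP_FILES = ("drhook.c",)

# First three frames copied from the backtrace on this page.
SAMPLE = """\
 0 0x0000000000b49970 signal_drhook()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/ifsaux/support/drhook.c:1809
 1 0x0000000000012ce0 __funlockfile()  :0
 2 0x0000000003769e4b calc_hgpl()  /lus/h2resw01/scratch/nhac/hm_home/cy46_faroe500/lib/R32/src/satrad/rttov/main/rttov_locpat.F90:335
"""


def first_model_frame(backtrace: str):
    """Return (function, file name, line) of the first frame with a known
    source file outside SKIP_FILES, or None if there is no such frame."""
    for raw in backtrace.splitlines():
        match = FRAME_RE.match(raw)
        if match is None:
            continue  # header lines or anything that is not a frame
        path = match.group("file")
        if not path or path == "???" or path.endswith(SKIP_FILES):
            continue  # no source information, or the handler itself
        file_name = path.rsplit("/", 1)[-1]
        return match.group("func"), file_name, int(match.group("line"))
    return None


if __name__ == "__main__":
    print(first_model_frame(SAMPLE))
```

For the backtrace above this points at calc_hgpl in rttov_locpat.F90, line 335, which is where the divide by zero is reported.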

At the moment I have no clue why this happens. I have tried lowering the time step from 20 to 15; the run was just submitted (2023-02-09).

Lowering the time step to 15 did not work and neither did lowering it even further. Instead, I adjusted some settings in config_exp.h:

  • Increased MAKEGRIB_LISTENERS from 2 to 3

and in Env_submit:

  • Reduced ntasks_mars to 3
  • Set $nprocx=34 and $nprocy=41 for Forecast and Listen2file, and $nprocx=16 and $nprocy=10 for everything else

Furthermore, I had previously been trying to run with 90 levels, but I am now running with 65 levels instead.

I am not really sure whether this was all I did, or which change made it work, but it now runs like a charm for FC lengths of 48 h.
