No overwriting of nc4 files! #2638

Closed
rtodling opened this issue Mar 7, 2024 · 15 comments
Labels
❓ Question Further information is requested ❄️ Stale This issue has been marked stale

Comments

@rtodling
Contributor

rtodling commented Mar 7, 2024

At some point in the past a change was made so that MAPL (or CFIO) would crash if the model tried to overwrite an output file.

I remember that when I first stumbled on this, I wasn't so convinced of the need for it. In response to that, a knob was put in (or the default was changed) to allow overwriting.

I believe the latest version of MAPL now has it such that an overwrite causes the model to crash. Can we revisit this again, please? This is a very inconvenient feature, especially for debugging purposes.

Perhaps there is a flag I can add to AGCM.rc or HISTORY to tell MAPL/CFIO not to bother, is there?

@tclune
Collaborator

tclune commented Mar 8, 2024

Hi Ricardo. Our GCHM (GEOS-CHEM) colleagues have asked for something similar. I thought we had enabled this, but I have forgotten the details. I know there is a switch at the low levels, but I do not know how, or whether, it propagates from History.rc.

In the worst case a kludge to the interface should not be too hard.

I'm hoping that by "crash" you mean that the model trapped the exception and gave an informative message that it would not overwrite the file? For me that is level-0 and very high priority.

@mathomp4 mathomp4 added the ❓ Question Further information is requested label Mar 8, 2024
@tclune
Collaborator

tclune commented Mar 8, 2024

@rtodling See #2391. We have a global setting in History that allows noclobber. It is an open issue to do it per-collection.

Currently, by setting Allow_Overwrite: .true. you can allow every collection to clobber.

Please let us know if you need this per-collection, in which case we can raise the priority of the other ticket. Closing this one.

@tclune tclune closed this as completed Mar 8, 2024
@mathomp4
Member

mathomp4 commented Mar 8, 2024

Indeed, as @tclune says, you'd add to the top of history where the other global variables are:

VERSION: 1
EXPID:  f5295_fp
EXPDSC: f5295_fp__GEOSadas-5_29_5__agrid_C720__ogrid_C
EXPSRC: GEOSadas-5_29_5
Allow_Overwrite: .true.

That should allow overwriting of history output. Note, however, that the setting is global, so every collection will be allowed to overwrite.
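For readers unfamiliar with clobber/noclobber semantics, here is a minimal Python sketch of what a flag like Allow_Overwrite conceptually toggles. It is not MAPL code: `open_output` and its `allow_overwrite` argument are hypothetical names used only for illustration, and the sketch says nothing about how MAPL actually passes the setting down to its I/O layer.

```python
import os
import tempfile


def open_output(path, allow_overwrite=False):
    """Open an output file for writing (hypothetical helper, not MAPL).

    allow_overwrite=True  -> clobber: truncate any existing file ("w" mode).
    allow_overwrite=False -> noclobber: fail if the path already exists ("x" mode).
    """
    mode = "wb" if allow_overwrite else "xb"
    return open(path, mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "collection.nc4")

        # First write: the file does not exist yet, so noclobber succeeds.
        with open_output(path) as f:
            f.write(b"first run")

        # Second write without the flag: the open is refused.
        try:
            open_output(path)
        except FileExistsError:
            print("noclobber: refusing to overwrite", os.path.basename(path))

        # Second write with the flag: the old contents are replaced.
        with open_output(path, allow_overwrite=True) as f:
            f.write(b"rerun")
```

With the flag off, a rerun in the same directory stops at the first existing file, which is the inconvenience described at the top of this issue.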

@rtodling
Contributor Author

rtodling commented Mar 8, 2024

Hi Guys, thanks for the reply on this. I will add the opt to the history. Many thanks.

Ricardo

@mathomp4
Member

@rtodling Warning. This might not be working. I'm reopening this issue.

@mathomp4 mathomp4 reopened this Mar 15, 2024
@mathomp4
Member

@rtodling Can you tell us what version of MAPL you are using? We might need to go back in time and patch this once we know the fix

@tclune
Collaborator

tclune commented Mar 15, 2024

Looks like the global option is currently broken. NOAA noticed a problem ...

@tclune
Collaborator

tclune commented Mar 15, 2024

@lizziel We found that this capability is broken. Those with better memory than me assert that it did work at one time. Raising the priority - might be a 1st where we have 3 different "customers" complaining about the same thing.

@junwang-noaa

@tclune, @weiyuan-jiang, here are some details. The error message we got from the UFS weather model:

  0: pe=00000 FAIL at line=00187    NetCDF4_FileFormatter.F90                <status=13>
  0: pe=00000 FAIL at line=00062    HistoryCollection.F90                    <status=13>
  0: pe=00000 FAIL at line=00811    ServerThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00138    BaseServer.F90                           <status=13>
  0: pe=00000 FAIL at line=01002    ServerThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00097    MessageVisitor.F90                       <status=13>
  0: pe=00000 FAIL at line=00115    AbstractMessage.F90                      <status=13>
  0: pe=00000 FAIL at line=00107    SimpleSocket.F90                         <status=13>
  0: pe=00000 FAIL at line=00449    ClientThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00399    ClientManager.F90                        <status=13>
  0: pe=00000 FAIL at line=03560    MAPL_HistoryGridComp.F90                 <status=13>
  0: pe=00000 FAIL at line=01901    MAPL_Generic.F90                         <status=13>
  0: pe=00000 FAIL at line=01291    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=01220    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=01166    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=00834    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=00974    MAPL_CapGridComp.F90                     <status=13>

Please let me know if you want to reproduce the case. So far, the atmosphere can write out files through a symbolic link.
Before the run:

[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l sfcf000.nc
lrwxrwxrwx 1 Jun.Wang stmp 17 Mar 15 16:45 sfcf000.nc -> output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/sfcf000.nc
ls: cannot access output/sfcf000.nc: No such file or directory
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l gocart.inst_aod.20210322_1200z.nc4
lrwxrwxrwx 1 Jun.Wang stmp 41 Mar 15 16:44 gocart.inst_aod.20210322_1200z.nc4 -> output/gocart.inst_aod.20210322_1200z.nc4
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/gocart.inst_aod.20210322_1200z.nc4
ls: cannot access output/gocart.inst_aod.20210322_1200z.nc4: No such file or directory

Then I saw the error message when running the test, and the following in the run directory:

[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l sfcf000.nc
lrwxrwxrwx 1 Jun.Wang stmp 17 Mar 15 16:45 sfcf000.nc -> output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/sfcf000.nc
-rw-r--r-- 1 Jun.Wang stmp 85452865 Mar 15 17:49 output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l gocart.inst_aod.20210322_1200z.nc4
lrwxrwxrwx 1 Jun.Wang stmp 41 Mar 15 16:44 gocart.inst_aod.20210322_1200z.nc4 -> output/gocart.inst_aod.20210322_1200z.nc4
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/gocart.inst_aod.20210322_1200z.nc4
ls: cannot access output/gocart.inst_aod.20210322_1200z.nc4: No such file or directory

@mathomp4
Member

Actually, now that I think about it, I think @rtodling is safe. The oddity is occurring because of the broken-symlink style. We are pondering this...
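To make the broken-symlink case concrete, here is a small Python sketch that reproduces the directory layout from the listings above: a link in the run directory pointing at a file under output/ that does not exist yet. One possible explanation for the behavior, not confirmed anywhere in this thread, is that an exclusive (noclobber) create on POSIX systems treats the link itself as an existing path, while a plain truncating open follows the link and creates the target. The helper names and the file name `example.nc4` are hypothetical.

```python
import os
import tempfile


def is_dangling_symlink(path):
    """True if path is a symlink whose target does not exist."""
    # islink() inspects the link itself; exists() follows it to the target.
    return os.path.islink(path) and not os.path.exists(path)


def describe(path):
    """Classify a path as a dangling symlink, an existing entry, or missing."""
    if is_dangling_symlink(path):
        return "dangling symlink"
    if os.path.lexists(path):
        return "exists"
    return "missing"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        os.mkdir(os.path.join(run_dir, "output"))
        link = os.path.join(run_dir, "example.nc4")
        # Relative target, resolved against the directory holding the link.
        os.symlink(os.path.join("output", "example.nc4"), link)
        print("before:", describe(link))

        # Exclusive create: on POSIX the link name already exists, so this fails.
        try:
            open(link, "xb").close()
        except FileExistsError:
            print("exclusive create refused: link name already exists")

        # Truncating open: follows the link and creates output/example.nc4.
        open(link, "wb").close()
        print("after:", describe(link))
```

A pre-run check along the lines of `is_dangling_symlink` would at least let a run script report which output links point at files that do not exist yet.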

@mathomp4
Member

Related issue: #1620

@tclune
Collaborator

tclune commented Mar 15, 2024

@junwang-noaa We can only replicate that particular error when the symlink itself is broken. But ... there is a different problem that you will hit once you fix that one.

The history option Allow_Overwrite does not currently propagate to the server side, and fixing that is more subtle than you might have thought. We have scenarios in which a previous segment of a simulation has already written a time slice to a history output file, and the file then needs to be appended to rather than overwritten.

This late on Friday this is making my head hurt. On Monday I will work with @bena-nasa to diagram the various cases, what should happen and how to even detect when it should clobber vs append. Sigh.
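The clobber-versus-append cases mentioned in this comment can be laid out as a small decision table. The Python sketch below is only one way to enumerate them and is not MAPL's logic: `Action`, `choose_action`, and the `from_earlier_segment` input are all hypothetical, and how to actually detect that a file was written by an earlier segment of the same simulation is exactly the open question here, so the sketch takes it as a given input.

```python
from enum import Enum


class Action(Enum):
    CREATE = "create"    # no file yet: create it
    APPEND = "append"    # file from an earlier segment: add a time slice
    CLOBBER = "clobber"  # leftover file and overwriting allowed: replace it
    FAIL = "fail"        # leftover file and overwriting not allowed: stop


def choose_action(file_exists, from_earlier_segment, allow_overwrite):
    """Pick what to do with a history output file (illustrative only)."""
    if not file_exists:
        return Action.CREATE
    if from_earlier_segment:
        # Appending takes precedence over the overwrite flag, so a restart
        # never discards time slices written by a previous segment.
        return Action.APPEND
    return Action.CLOBBER if allow_overwrite else Action.FAIL


if __name__ == "__main__":
    for exists in (False, True):
        for earlier in (False, True):
            for allow in (False, True):
                action = choose_action(exists, earlier, allow)
                print(exists, earlier, allow, action.value)
```

Giving append priority over the flag is a design assumption of this sketch; the real resolution is discussed in the follow-up issue.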

@bena-nasa
Collaborator

All, I made a new issue which summarizes what is going on in much more detail:
#2653


This issue has been automatically marked as stale because it has not had activity in the last 60 days. If there are no updates within 7 days, it will be closed. You can add the ":hourglass: Long Term" label to prevent the stale action from closing this issue.

@github-actions github-actions bot added the ❄️ Stale This issue has been marked stale label Sep 17, 2024
@mathomp4
Member

Closing in favor of #2653 (which might be fixed? → @bena-nasa )


6 participants