Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

U/jrbogart/provenance v2 #111

Merged
merged 8 commits into from
Jul 11, 2024
Merged

U/jrbogart/provenance v2 #111

merged 8 commits into from
Jul 11, 2024

Conversation

JoanneBogart
Copy link
Collaborator

@JoanneBogart JoanneBogart commented Jun 21, 2024

Add provenance information to any created parquet files as metadata.
Include run options as part of provenance (both in parquet files and in output text config).
Also includes

  • bug fix so that, when SSO is not available, code responds gracefully
  • a change to handling of cosmology parameters to stay compatible with different versions of GCRCatalogs

Partially addresses issue #64

@JoanneBogart JoanneBogart requested a review from jchiang87 July 3, 2024 18:50
Copy link
Collaborator

@jchiang87 jchiang87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions and change requests.

@@ -220,13 +220,20 @@ def __init__(self, sed_path, fmt='db'):
assumptions: that the table name is "SED" and that the columns are
"wavelength" (units angstroms) and "flux" (units flambda)
'''
if not sed_path:
# print a warning message?
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is returning None really the right thing here? From the line above in skyCatalogs.py, if this object is returned as None, the sso object type simply isn't registered, but it seems like it should raise an exception if the needed SED info isn't available instead of silently omitting SSOs.

Copy link
Collaborator Author

@JoanneBogart JoanneBogart Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the typical config file includes all object types supported by the code, whether or not they're actually available at the installation. This will change with the stuff I'm currently working on (which, among other things, should make it easier to include only those object types which are available), but for now the fact that the SSO SED isn't there is often the first sign the code sees that there are no SSOs. I could add a warning message but would rather not raise an exception. That would force people to meddle with the configs by hand (to remove the sso section) more than I'd like.

object_class=SsoObject,
collection_class=SsoCollection)
if self._sso_sed_factory:
self.cat_cxt.register_source_type('sso',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below in sed_tools.py. Silently omitting SSOs here seems wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll come up with a suitable place to log a warning message.

inputs=inputs,
run_options=self._run_options)

arrow_schema = make_star_schema(metadata_input=file_metadata)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment below,

    #  Need a way to indicate which object types to include; deal with that
    #  later.  For now, default is stars + sn

still relevant? I would think not, since sncosmo support has been removed. As it stands, it's hard to see how it relates to the code around it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, no longer relevant. Will delete.

@@ -1165,5 +1193,6 @@ def write_provenance_file(self, datafile_path):
'''
outpath = datafile_path.rsplit('.', 1)[0] + '_provenance.yaml'

prov = assemble_provenance(self._pkg_root, inputs=None)
prov = assemble_provenance(self._pkg_root, inputs=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems inconsistent that we have inputs=None for this provenance and an inputs value that's not None above. Shouldn't the same provenance info be used in both places?

Copy link
Collaborator Author

@JoanneBogart JoanneBogart Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. In the first case, a global config file is being written. The value for inputs is all the possible inputs catalog_creator knows about. In the second, it's only associated with a particular binary file. Probably there should have been an inputs argument to write_provenance_file so that inputs appropriate to the object type of that file could be included. (That's what I've done in the code I'm working on where there is no global provenance, but only provenance associated with an object type) But, now that provenance is being included in the binary file (which includes only relevant inputs), I don't think we need this routine altogether. My inclination is to get rid of it and the option in the creation script which causes it to be called.

Comment on lines 50 to 61
def callinfo_to_dict(args):
"""
Make a dict out of program arguments. Each option value is
either a simple atomic type or a list
"""
a_dict = dict()
for e in dir(args):
if not e.startswith('_'):
v = eval('args.' + e)
a_dict[e] = v
return a_dict

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that

dict(args._get_kwargs())

does the same thing as this function.

Copy link
Collaborator Author

@JoanneBogart JoanneBogart Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it does. I could also use _get_kwargs to simplify log_callinfo but the results aren't alphabetized, which is useful in that case. Another similar routine,print_callinfo, isn't used so I will remove it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To alphabetize by keyword,

dict(sorted(args._get_kwargs()))

would work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll make that change.

@@ -237,7 +238,8 @@ def __init__(self, parts, area_partition=None, skycatalog_root=None,
galaxy_stride=1000000, provenance=None,
dc2=False, sn_object_type='sncosmo', galaxy_type='cosmodc2',
include_roman_flux=False, star_input_fmt='sqlite',
sso_truth=None, sso_sed=None, sso_partition='healpixel'):
sso_truth=None, sso_sed=None, sso_partition='healpixel',
run_options=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring entry for run_options please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, certainly. Just an oversight.

self._run_options = run_options

# Do we need this?
self._sed_bins = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this attribute were named something like self._cosmodc2_sed_bins or self._tophat_sed_bins, it would clearer in which contexts it's used. Based on the code below, setting to None here isn't strictly needed, but setting a private attribute in the __init__ seems like a good practice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll change the name.

@JoanneBogart JoanneBogart requested a review from jchiang87 July 10, 2024 21:20
@JoanneBogart
Copy link
Collaborator Author

JoanneBogart commented Jul 10, 2024

CI check is failing because sdf is down, but nothing I did in this last commit (all in "create" part of skycatalogs) would affect it. This PR is ready for re-review.

Copy link
Collaborator

@jchiang87 jchiang87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, but I would recommend making the warning more explicit about the outcome of the missing SSO files.

@@ -334,6 +334,9 @@ def __init__(self, config, mp=False, skycatalog_root=None, verbose=False,
'sso_sed.db')

self._sso_sed_factory = SsoSedFactory(self._sso_sed_path)
if not self._sso_sed_factory:
self._logger.warning('SSO appear in the list of available object types but supporting files do not exist')
self._logger.warning('SSOs will not be present in object lists')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that a warning that explicitly says that "SSOs will not be simulated" would be more helpful to users.

@JoanneBogart JoanneBogart merged commit da8e100 into main Jul 11, 2024
1 check failed
@JoanneBogart JoanneBogart deleted the u/jrbogart/provenance_v2 branch July 11, 2024 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants