Add support for the Croissant metadata specification #328

Reikyo · 2024-12-17T18:12:49Z

Added namespace, schema and profile for the Croissant metadata specification (https://docs.mlcommons.org/croissant/docs/croissant-spec.html)

Reikyo · 2024-12-17T18:30:06Z

Hi @amercader - I'm working with MLCommons on a project being run by @benjelloun at Google, aiming to integrate capability for the Croissant metadata specification as widely as possible.

Over the last few weeks we've been exploring CKAN and its extension ecosystem, and found that the existing ckanext-dcat extension already provides the needed base functionality. I've created a new branch with a single commit showing all required adjustments. To existing files there are only changes to the list of namespaces, and to the list and import of profiles. The new schema and profile are in new separate files isolated from the existing logic, and have been made according to the documentation (schemas, profiles) and using the existing schemaorg.py profile as a basis.

You can find some further conversation on this topic here. I'd be grateful if you can review the PR and consider it for integration into master. Please let me and @benjelloun know if you have any questions.

Many thanks for your help.

amercader · 2024-12-17T21:07:15Z

@Reikyo this is great and extremely timely as I had started working myself on a croissant profile these last few days. I think this is an extremely valuable feature for sites that we can get ready relatively soon. I'll review asap and get back to you.

Regarding where this functionality should live I think it wouldn't be a big task to separate the schema.org/croissant profiles to their own extension but let's focus on getting this ready for now.

Thanks, this is really exciting

Reikyo · 2024-12-18T10:36:33Z

@amercader Thanks for the quick reply, much appreciated.

You can find some further info about the work done on this here. In particular, see the spreadsheet for indication of the properties and values being considered, and how they are present in the Croissant RDF graph when no scheming schema is applied to modify the UI (column O) and when the new Croissant scheming schema is applied (column S). In the latter case there are, of course, many more output properties, which go up to (but not including) the RecordSet section of the Croissant specification. You'll find a number of notes throughout the new schema (schemas/croissant.yaml) and profile (profiles/croissant.py) to clarify the details and reasoning. Also note that for the schema, I allowed users to define their own "@id"s where applicable, and you can see these used in the profile where "id_given" is present.

In the above folder I've also included images and files that show what I'm thinking of as a three step process. There are files showing all three steps when no scheming schema is applied, and when the Croissant scheming schema is applied.

Step-1: Data is input via edit forms (dataset and resource)
Step-2: Data is represented in a Python dictionary (dataset_dict)
Step-3: Data is output in RDF representation, which can be serialised in various ways (no/default profile, schemaorg profile, Croissant profile)

Hopefully this makes it clear exactly what the changes do.
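
As a rough illustration of Step-2 and Step-3, here is a minimal standard-library sketch. It is not the profile in profiles/croissant.py: the dictionary values, the helper name `to_croissant_like_jsonld`, the site URL and the simplified `@context` are assumptions made for the example, and the real profile builds an RDF graph rather than a plain dict.

```python
import json

# Hypothetical, heavily simplified dataset_dict (Step-2). The values are made up.
dataset_dict = {
    "id": "254a04dd-1c91-4d92-bfb1-487e00f99f2a",
    "title": "Example dataset",
    "notes": "A short description.",
    "issued": "2024-12-17",
    "resources": [
        {"name": "data.csv", "url": "https://example.org/data.csv", "mimetype": "text/csv"},
    ],
}


def to_croissant_like_jsonld(dataset, site_url):
    """Step-3 (sketch): map the dict to a small Croissant-style JSON-LD document."""
    return {
        # Simplified context; the real Croissant context defines many more terms.
        "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
        "@type": "Dataset",
        "@id": f"{site_url}/dataset/{dataset['id']}",
        "name": dataset["title"],
        "description": dataset.get("notes", ""),
        "datePublished": dataset.get("issued"),
        "distribution": [
            {
                "@type": "cr:FileObject",
                "name": res["name"],
                "contentUrl": res["url"],
                "encodingFormat": res.get("mimetype", ""),
            }
            for res in dataset.get("resources", [])
        ],
    }


doc = to_croissant_like_jsonld(dataset_dict, "https://ckan.example.org")
print(json.dumps(doc, indent=2))
```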

amercader · 2024-12-19T10:35:39Z

@Reikyo I had a first pass and it looks like all this is going in the right direction.

Some points in no particular order:

The new profile will translate the internal CKAN dict representation to a Croissant one but we still need a mechanism to expose that to users. From a user point of view, they should just enable a croissant plugin in their ckan.plugins config option and forget about it. The plugin will take care of embedding the JSON-LD code in the dataset page. This is how the schema.org markup for Google Dataset Search works now (with the structured_data plugin). So we need a plugin similar to this one (registered in pyproject.toml) that adds a new template helper that embeds the code in the package/read_base.html template.
(Aside: the current code could be improved because structured_data relies on the dcat plugin being loaded, but I'll take care of that)
Is the croissant JSON-LD representation meant to only be embedded in the HTML source of the dataset page, or is there value in having it available at a dedicated endpoint (e.g. /dataset/<id>/croissant.jsonld)? I see Hugging Face does both so I imagine there is value. In any case it should be easy to expose an endpoint using IBlueprint.
It's great that you have mapped the properties between the standard CKAN model properties (i.e. no custom schema, column O) and added a new schema that adds the ones not present. In that case I think it's important that the properties follow the existing names in the DCAT profiles, so sites can get metadata exposed in both formats using a single schema. Going through the list I believe this is already the case but flagging just in case.
At first I was wondering if id_given might be comparable to a URI, and so if we could keep uri as the field name for consistency with the DCAT profiles, but looking at the spec and your comments in the schema it seems like there are clear conventions and recommendations for defining ids in the context of Croissant, so it makes sense to keep these separate.
I'll take care of putting these properly in the docs, but here's a summary of what a new profile needs in order to be included in this extension (basically to ensure that it is properly tested and documented). We have the profile class and the schema, so what's missing is:
- Examples
- Tests, in this case we only need serialization tests (e.g. given a certain CKAN dataset in the examples folder we get the expected croissant properties, see the schema.org ones) and we could also validate the output with the python croissant package to make sure there are no warnings.
- Docs: brief intro of what the Croissant spec is, what the plugin does and how to enable it.
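
For the first point above (embedding the JSON-LD in the dataset page), here is a minimal sketch of what a template helper could return. This is not the actual structured_data or croissant helper; `jsonld_script_tag` is a hypothetical name, the example document is made up, and only the standard library is used.

```python
import json


def jsonld_script_tag(document):
    """Return an HTML <script> element embedding `document` as JSON-LD."""
    payload = json.dumps(document, ensure_ascii=False)
    # Escape "</" so a string value such as "</script>" cannot close the element early.
    # "\/" is a valid JSON escape for "/", so the payload still parses to the same data.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


example = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example </script> dataset",
}
markup = jsonld_script_tag(example)
print(markup)
```

In a CKAN template the returned string would then be marked as safe, in the same way the existing `h.structured_data(...)|safe` call is.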

Let me know if all this makes sense. I can help with any of the points above, in fact I just pushed a new croissant branch on this repo. If you change the target of this PR to that branch I can push stuff there as well to help move forward quicker if it helps.

I'll be mostly off the next couple of weeks though, but can pick it up again in January.

Reikyo · 2025-01-06T13:47:33Z

@amercader Firstly happy new year, I hope you had a good break. Similarly I've been off over the last couple of weeks, but looking forward to progressing this issue now.

Many thanks for the quick evaluation of the proposed changes, I appreciate the detailed comments. Addressing the comments in turn:

Yes, maintainers will have to make some further adjustment to ensure that the Croissant profile output is embedded in their dataset page. As you know, this is accessible via an RDF endpoint at https://{ckan-instance-host}/dataset/{dataset-id}.{format}?profiles=croissant, but that's not useful for e.g. Google Dataset Search. As per the information here, we see that a maintainer will have to adjust their read_base.html from {{ h.structured_data(pkg.id)|safe }} to {{ h.structured_data(pkg.id, ['croissant'])|safe }}. I didn't hard-code this in the proposed changes in order not to mess with the default behaviour of the structured_data plugin, which, as you said, currently embeds the schemaorg profile output in the dataset page. Looking into the code that you linked to here, and the associated code here, I see that the structured_data function accepts an argument profiles, which could then be specified as ['croissant'] to override the default ['schemaorg'] behaviour. Could this be done via an environment variable, rather than requiring a whole new plugin?
There has at least been value in having the Croissant profile output available at its own endpoint for testing purposes, as if changes are made to the code then it's easy to refresh the browser page and check the output without having to first scroll down to it, as needed when viewing it embedded in the dataset page. I can't anticipate all use cases, but this may also be useful for other users as well. I thought this was default behaviour of the ckanext-dcat extension anyway, for any new profile in the profiles/ folder providing that it's registered in profiles/__init__.py and pyproject.toml, as in the proposed changes?
Yes, I followed existing nomenclature where present, and only introduced Croissant specific nomenclature where I couldn't see an existing field in e.g. schemas/dcat_ap_full.yaml. For example, I used the default issued for the publication date field name where I otherwise would have used date_published in order to match the front-end label, which in turn is chosen to best reflect the Croissant specification. As another example, there was no default field for providing synonymous links, so I created same_as just for this purpose without having anything else to fall back on. One question I do have is whether my field cite_as would better be alternate_identifier, following schemas/dcat_ap_full.yaml, what do you think?
I think some input from @benjelloun would be useful regarding the IDs of the various graph nodes. By default, without anything specified by the user in the id_given fields, these are auto-generated by CKAN like https://{ckan-instance-host}/dataset/254a04dd-1c91-4d92-bfb1-487e00f99f2a for the main dataset node, and like _:Neb80621429684583b9a86cd54595016b for other internal nodes. With the Croissant schema, the user can change an ID to anything they want, including removal of the link aspect where present, so MY-DATASET-ID would be an acceptable and complete ID if specified by the user.
Thanks for the link here to other info required for these changes to be fully accepted, such as examples and tests. I'll have a look at what's needed as soon as I can.
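
To make the ID behaviour described above concrete, here is a small sketch of the fallback logic. The function names are hypothetical and this is not the code in profiles/croissant.py; the blank-node format only imitates the shape of the auto-generated IDs quoted above.

```python
import uuid


def dataset_node_id(dataset, site_url):
    """Use the user-supplied id_given if present, else the default dataset URI."""
    given = (dataset.get("id_given") or "").strip()
    if given:
        return given
    return f"{site_url.rstrip('/')}/dataset/{dataset['id']}"


def inner_node_id(node):
    """Use id_given if present, else generate a blank-node style identifier."""
    given = (node.get("id_given") or "").strip()
    if given:
        return given
    return f"_:N{uuid.uuid4().hex}"


print(dataset_node_id({"id": "abc", "id_given": "MY-DATASET-ID"}, "https://ckan.example.org"))
print(dataset_node_id({"id": "abc"}, "https://ckan.example.org"))
print(inner_node_id({}))
```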

Finally, I've changed the target branch to ckanext-dcat:croissant. Many thanks for providing an official home for this work.

amercader · 2025-01-08T15:02:28Z

@Reikyo Happy new year!

I messed up: I pushed some commits to the croissant branch to make things easier for you, but that meant that I closed this PR. My apologies. If you pull the latest croissant branch from this repo and create a new PR, I'll be more careful in the future 🙏

See comments below:

Plugins

This is a bit of confusing CKAN terminology so bear with me :) Extensions (e.g. ckanext-dcat) are Python packages that contain one or more plugins (e.g. dcat, structured_data and croissant). The plugins are the ones that add functionality to CKAN and in most cases, maintainers should only concern themselves with loading plugins (i.e. adding them to ckan.plugins and maybe configuring them). Profiles are just an internal thing used by the ckanext-dcat extension that we need to map properties to CKAN fields. We need the croissant one that you created but most maintainers won't be interacting with them.

In 4be90d3 I added a new croissant plugin that wraps all the functionality required. So most maintainers should just enable the croissant plugin and the plugin will take care of adding the relevant template snippets, helpers etc. The structured_data plugin used to rely on the dcat one but I changed that in #329 and now users can enable any of the dcat, structured_data and croissant plugins as they prefer.

You will need to re-run pip install -e . when pulling the latest changes to register the new plugin (and add it to ckan.plugins of course)
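
One way to check that the reinstall registered the plugin is to list the installed entry points. CKAN discovers plugins through the `ckan.plugins` entry point group; the helper name below is hypothetical, and the output depends entirely on what is installed in the current environment.

```python
from importlib.metadata import entry_points


def registered_ckan_plugins():
    """Return {plugin name: import target} for every installed CKAN plugin."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group="ckan.plugins")
    else:
        # Older Python versions return a plain dict keyed by group name.
        selected = eps.get("ckan.plugins", [])
    return {ep.name: ep.value for ep in selected}


plugins = registered_ckan_plugins()
print("croissant registered:", "croissant" in plugins)
```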

Endpoint

Commit 104b1ae adds a new endpoint at /dataset/<id>/croissant.jsonld that returns the Croissant JSON-LD serialization when the croissant plugin is loaded. It was possible to get this representation using ?profiles= params but this gives it a dedicated endpoint, that could even be advertised with a <link rel="alternate" > tag in the source code.
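
A short sketch of the two URL shapes mentioned in this thread and of the `<link rel="alternate">` tag. The function names, the `jsonld` default format and the `type="application/ld+json"` attribute are assumptions made for the example, not taken from the commits.

```python
import html
from urllib.parse import quote, urlencode


def rdf_endpoint_url(site_url, dataset_id, fmt="jsonld", profiles=("croissant",)):
    """Generic RDF endpoint, selecting the profile via the ?profiles= parameter."""
    base = f"{site_url.rstrip('/')}/dataset/{quote(dataset_id, safe='')}.{fmt}"
    return f"{base}?{urlencode({'profiles': ','.join(profiles)}, safe=',')}"


def croissant_endpoint_url(site_url, dataset_id):
    """Dedicated endpoint of the form /dataset/<id>/croissant.jsonld."""
    return f"{site_url.rstrip('/')}/dataset/{quote(dataset_id, safe='')}/croissant.jsonld"


def alternate_link_tag(href):
    """HTML tag that could advertise the endpoint in the page source."""
    return f'<link rel="alternate" type="application/ld+json" href="{html.escape(href, quote=True)}">'


site = "https://ckan.example.org"  # placeholder host
print(rdf_endpoint_url(site, "my-dataset"))
print(croissant_endpoint_url(site, "my-dataset"))
print(alternate_link_tag(croissant_endpoint_url(site, "my-dataset")))
```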

Property names

Whenever possible I'd try to follow existing names for internal field names in the DCAT profiles, e.g. issued rather than date_published. That doesn't mean that you can't change the UI labels in the schema file (croissant.yml) to better align with what users would expect.

I'm not an expert but citeAs seems semantically different to DCAT's alternate identifiers (e.g. a bibtex citation vs an id for the same dataset in another catalog) so I'm +1 on keeping them separate

Docs, etc

I created a stub for where the docs should live here: 675c484, I can do the same for tests, etc

Let me know if all this makes sense and sorry again for merging the PR

Reikyo · 2025-01-13T19:34:42Z

Moving the conversation from here to #330.

Added namespace, schema and profile for the Croissant metadata specification (https://docs.mlcommons.org/croissant/docs/croissant-spec.html) · 955abf8

Reikyo changed the base branch from master to croissant January 6, 2025 14:00

amercader merged commit c69c639 into ckan:croissant Jan 8, 2025
4 checks passed

Reikyo mentioned this pull request Jan 13, 2025

Updated docs with Croissant information. #330

Open
