Change regular expression to allow both upper and lower case letters in language code #780

Contributor

@sfisher sfisher commented Nov 1, 2024

Previously, the part after the dash (as in en-US) could only contain uppercase letters.

It can now be either uppercase or lowercase.

I looked at the DataCite docs, which mention IETF BCP 47 and ISO 639-1 language codes. I looked over these codes, and it appears that this regex will cover them (I don't see any non-alphabetic characters in either part).

It looks like the error message is ok: ERR_LANGUAGE = _("Must be a valid language code (IETF BCP 47 or ISO 639-1)")
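As a rough illustration of the change described above, here is a minimal sketch of a validator that accepts either case on both sides of the dash. The pattern and the function name are hypothetical, since the project's actual constant is not shown in this thread, and the sketch follows the PR's assumption that both parts are purely alphabetic (so a tag containing digits, such as es-419, would be rejected).

```python
import re

# Hypothetical pattern; the project's real constant is not shown in this thread.
# One alphabetic part, optionally followed by a dash and a second alphabetic part.
LANGUAGE_CODE_RE = re.compile(r"[A-Za-z]+(-[A-Za-z]+)?")

# Message text copied from the PR description; the _() translation wrapper is omitted.
ERR_LANGUAGE = "Must be a valid language code (IETF BCP 47 or ISO 639-1)"


def validate_language_code(value):
    """Return None when the value is accepted, otherwise the error message."""
    if LANGUAGE_CODE_RE.fullmatch(value):
        return None
    return ERR_LANGUAGE
```

With this sketch, `validate_language_code("en-US")` and `validate_language_code("en-us")` both return `None`, while `validate_language_code("en_US")` returns the error message.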

I didn't see it come up in the documentation elsewhere when I searched.

@sfisher sfisher requested a review from jsjiang November 1, 2024 00:30
@sfisher sfisher changed the base branch from main to develop November 1, 2024 00:31
Contributor

@jsjiang jsjiang left a comment

Change looks good to me. We can deploy this to Dev or Stg to test.

Jing

Contributor

jsjiang commented Nov 7, 2024

@sfisher Hi Scott, there are two language fields in the Advanced DOI registration form:

  1. Title Language
  2. Language
    I think the fix applied to the Title Language field but not the Language field.

@adambuttrick Hi Adam, do we need to apply this change to both fields or only one (and if so, which one)? Also, should we make the check case insensitive, or allow only certain combinations such as:

  • lower-lower: en-us
  • lower-Upper: en-US

Jing
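The two policies raised in this question can be compared with a minimal sketch. Both patterns and the helper name are hypothetical, not taken from the EZID code, and they assume purely alphabetic parts.

```python
import re

# Option A (hypothetical): only lower-lower (en-us) or lower-Upper (en-US).
RESTRICTED_RE = re.compile(r"[a-z]+(-([a-z]+|[A-Z]+))?")

# Option B (hypothetical): any mix of upper and lower case in both parts.
CASE_INSENSITIVE_RE = re.compile(r"[a-z]+(-[a-z]+)?", re.IGNORECASE)


def accepts(pattern, code):
    """True when the whole string matches the given compiled pattern."""
    return pattern.fullmatch(code) is not None


for code in ("en-us", "en-US", "EN-us", "en-Us"):
    print(code, accepts(RESTRICTED_RE, code), accepts(CASE_INSENSITIVE_RE, code))
```

Under these assumptions, en-us and en-US pass both options, while EN-us and en-Us pass only Option B.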

adambuttrick commented Nov 7, 2024

DataCite staff confirmed that use of mixed case across language tagging in fields should not result in validation errors. The change should be applied to all fields that make use of language tagging, as either is valid.

Contributor Author

sfisher commented Nov 9, 2024

I believe the language is using the correct constraint everywhere we allow it.

There is one constant for this, and it is reused wherever there is an xml:lang attribute.

DataCite 4.5, section 9, covers the other Language field that you mention. Its values are only recommended, and as far as I can see in the code we are currently not enforcing any value for it at all (whereas in the other places the enforcement is clear).

As I understand it, DataCite doesn't enforce that it has to meet this controlled format and neither do we in our code, so there is no need to change this unless we begin enforcing it when DataCite doesn't.

From the spec for DataCite 4.5, these are the places where the xml:lang attribute may be used:

XML provides an xml:lang attribute that can be used on the following properties and sub-properties:
• 3. Title
• 4. Publisher
• 6. Subject
• 16. Rights
• 17. Description
• 20.3 title
• 2.1 creatorName when 2.1.a nameType is “Organizational”
• 7.1 contributorName when 7.1.a nameType is “Organizational” 

About 9. Language:

9. Language 
Obligation: Optional 
Occurrences: 0-1 
Definition: The primary language of the resource. 
Allowed values, examples, other constraints: 
Recommended values are taken from IETF BCP 47, ISO 639-1 language codes. Examples: en, de, fr 
Example XML 
<language>en</language> 
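To make the distinction concrete, here is a minimal sketch that builds a record fragment with both an xml:lang attribute on a title and a separate language element. It uses Python's standard `xml.etree.ElementTree`; the element layout is simplified, the DataCite namespace is omitted, and the title text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute in ElementTree's {uri}name notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

resource = ET.Element("resource")

# 3. Title: carries an xml:lang attribute (the value the regex constant checks).
titles = ET.SubElement(resource, "titles")
title = ET.SubElement(titles, "title", {XML_LANG: "en-us"})
title.text = "Example title"  # made-up value for illustration

# 9. Language: a plain element whose content is only a recommended value.
language = ET.SubElement(resource, "language")
language.text = "en"

xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)
```

The title's language tag lives in an attribute, while property 9 is an element of its own, which is why the two can be validated differently.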

I also wasn't sure if somehow this was required, even though it is marked "recommended." ChatGPT also seemed to agree with me that it's not mandatory to fit in one of these code schemes (though it may be wrong).

As I understand it, if we want to enforce that this language fits the schemes, then we would be stricter than DataCite is. If that's a feature we want, then we can change it, but it would be a new feature and not a correction to the problem that this ticket is about.

@adambuttrick

@sfisher Thanks for this! The scope of the ticket is simply aligning with DataCite, where language is recommended, but not required. There is a typology for the levels of obligation with specific meanings:

  • Mandatory (M) properties must be provided;
  • Recommended (R) properties are optional, but strongly recommended for interoperability; and
  • Optional (O) (but not specifically recommended) properties provide richer description.

More details here:

https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/overview/
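The obligation levels listed above could be modelled roughly as follows. This is only a sketch: the enum and the `blocks_submission` helper are hypothetical names, and the rule that only a missing Mandatory property is an error is an assumption drawn from the wording of the list, not from the EZID code.

```python
from enum import Enum


class Obligation(Enum):
    """Levels of obligation as named in the list above."""
    MANDATORY = "M"
    RECOMMENDED = "R"
    OPTIONAL = "O"


def blocks_submission(obligation, value):
    """Hypothetical rule: only a missing Mandatory property is an error."""
    is_missing = value is None or value.strip() == ""
    return obligation is Obligation.MANDATORY and is_missing
```

Under this assumption, an empty Recommended or Optional property never produces an error, which matches the point that a non-mandatory property need not be enforced.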

@sfisher sfisher merged commit f9c1ecc into develop Nov 12, 2024
1 check passed

Successfully merging this pull request may close these issues.

[MAINTENANCE] Update language code validation in the EZID UI for title, subject, and description fields
3 participants