Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL download for multi-taxonomy - should we add syntactic sugar ? #365

Open
djtfmartin opened this issue Dec 3, 2024 · 0 comments
Open
Assignees

Comments

@djtfmartin
Copy link
Contributor

djtfmartin commented Dec 3, 2024

SQL downloads for multi-taxonomy are working in the dev2 environment, but the required SQL is not very user friendly.
Given the importance of querying taxonomically for GBIF, it may serve well to add some syntax support which is easier to understand. This isn't something done in SQL downloads code currently, so would serve as a special case.

Example SQL

SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude
FROM occurrence 
WHERE 
array_contains(classifications["2d59e5db-57ad-41ff-97d6-11f5fb264527"], "urn:lsid:marinespecies.org:taxname:158970")

Example with curl

curl -X POST https://api.gbif-dev2.org/v1/occurrence/download/request \
     -u usr:pwd \
     -H "Content-Type: application/json" \
     -d '{
           "sendNotification": true,
           "notificationAddresses": ["[email protected]"],
           "format": "SQL_TSV_ZIP",
           "sql": "SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude FROM occurrence WHERE array_contains(classifications['\''2d59e5db-57ad-41ff-97d6-11f5fb264527'\''], '\''urn:lsid:marinespecies.org:taxname:158970'\'')"
         }'

Some proposals:

  1. Add support for simple names for classifications
SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude
FROM occurrence 
WHERE 
array_contains(classifications["worms"], "urn:lsid:marinespecies.org:taxname:158970")

Simple to implement, but with a cost of maintaining a mapping.

  1. Allow checklistKey to be specified separately, and then provide backend logic to construct the array_contains statement.
SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude
FROM occurrence 
WHERE 
checklistKey = "2d59e5db-57ad-41ff-97d6-11f5fb264527" 
AND
taxonKey = "urn:lsid:marinespecies.org:taxname:158970"

Similar to the current behaviour with HTTP GET services for the Occurrence API. See #342
The addition of the checklistKey assumes that any taxon key queries will use that checklist for matching the taxon key.
The absence of a checklistKey would result in using the default taxonomy, which would still entail a array_contains statement being constructed in the backend. This would require some manipulation of operands.

  1. Custom SQL function

We could support a custom SQL function:

SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude
FROM occurrence 
WHERE taxonLookup('2d59e5db-57ad-41ff-97d6-11f5fb264527', 'urn:lsid:marinespecies.org:taxname:158970')

or with default taxonomy (xcol in this example)

SELECT datasetKey, scientificName, decimalLatitude, decimalLongitude
FROM occurrence 
WHERE taxonLookup('58VJH')
  1. Support with JOINS

We could populate a taxonomy table in Hive and provide support for taxonomy via joins like so:

SELECT o.datasetKey, o.scientificName, o.decimalLatitude, o.decimalLongitude...
FROM occurrence o
  JOIN occurrence_taxonomy ot ON ot.occurrenceKey = o.gbifKey 
  JOIN taxonomy t ON ot.taxonKey =  t.key
WHERE t.datasetKey = '2d59e5db-57ad-41ff-97d6-11f5fb264527' AND t.scientificName LIKE 'Gadus%'

Similarly

WITH taxa_keys AS (
  SELECT key 
  FROM taxonomy 
  WHERE datasetKey = '2d59e5db-57ad-41ff-97d6-11f5fb264527' AND scientificName LIKE ´Gadus%´   
)
SELECT * FROM occurrence WHERE taxa CONTAINS(taxa_keys.key);
@djtfmartin djtfmartin self-assigned this Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant