add facets to collection search #166

Open
MortenHofft opened this issue Aug 5, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@MortenHofft
Member

We could add facets to collection search

2 types of metrics would be possible

  • collection facets - counting collections with a given filter - this is the standard behaviour for facets
  • specimen facets - counting specimens with a given filter.

Specimen facets are the only kind that makes sense within a single collection.
Both kinds make sense for institutions and for GRSciColl generally, but currently there isn't any data for it.

examples of collection facet questions:

how many collections have data in spain
how many collections have data about taxon x
how many collections have type specimens of taxon x
which are the most prevalent preservation types for this collection
breakdowns across collections: how many collections per kingdom, preservation type, country, type specimens, types/country, types/kingdom

examples of specimen facet questions:

Which orders does this collection mainly deal with
Breakdown of phyla per country for a collection/institution/total
breakdowns for all: specimens per kingdom, preservation type, country, type specimens, types/country, types/kingdom

We could start with collection facets?

e.g. ?country=ES&country=FR&facet=kingdomKey, with the same behaviour as facets normally have

These are the collection facets I'm guessing would be useful: descriptorCountry, country, kingdomKey, phylumKey, ...other taxonGroupKeys..., typeStatus, preservationType, contentType, personalCollection, institutionKey, active
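
A minimal sketch of how a client might assemble such a query string, assuming repeated parameters for multi-value filters as in the example above. The helper name `build_facet_query` is hypothetical and not part of any existing API client; `facetLimit` is the parameter mentioned later in this thread.

```python
from urllib.parse import urlencode


def build_facet_query(filters, facets, facet_limit=None):
    """Build a query string with repeated filter and facet parameters.

    filters: dict mapping a filter name to a list of values.
    facets: list of field names to facet on.
    """
    params = []
    for name, values in filters.items():
        for value in values:
            params.append((name, value))
    for facet in facets:
        params.append(("facet", facet))
    if facet_limit is not None:
        params.append(("facetLimit", facet_limit))
    return "?" + urlencode(params)


query = build_facet_query({"country": ["ES", "FR"]}, ["kingdomKey"])
print(query)  # ?country=ES&country=FR&facet=kingdomKey
```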

Ideally we would add something new to the API, namely the cardinality of those facets: an option to get not only the top 10 orders, but also the number of unique orders. This makes it easier to build the UI.
Examples where cardinality is used:
https://grscicoll.hp.gbif-staging.org/specimen/search?layout=W1t7ImlkIjoiYm1tNW8iLCJwIjp7fSwidHJhbnNsYXRpb24iOiJkYXNoYm9hcmQuc3RhdGlzdGljcyIsInQiOiJvY2N1cnJlbmNlU3VtbWFyeSJ9XSxbeyJpZCI6IjE4NGhxIiwicCI6eyJ2aWV3IjoiVEFCTEUifSwidHJhbnNsYXRpb24iOiJmaWx0ZXJzLmNvbGxlY3Rpb25LZXkubmFtZSIsInQiOiJjb2xsZWN0aW9uS2V5In1dXQ%3D%3D&view=DASHBOARD
distinct species, distinct taxa in statistics chart + number of results in collection chart

MortenHofft added the enhancement label on Aug 5, 2024
@MortenHofft
Member Author

If we go for cardinality, we might want to discuss with the rest of the team what the API should look like.

Ideas:
just include it in the normal facet response
?facet=type&facetLimit=2

"facets": [
  {
    "field": "TYPE",
    "cardinality": 4, <==== new field: the total number of distinct facet values, not just those returned in the response
    "counts": [
      {
        "name": "CHECKLIST",
        "count": 53833
      },
      {
        "name": "OCCURRENCE",
        "count": 49485
      }
    ]
  }
]
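
To make the first idea concrete, here is a rough in-memory sketch of a facet that reports its total cardinality alongside the truncated counts. It only illustrates the response shape, not how the search backend would compute it; the function name `facet_with_cardinality` and the sample values are made up.

```python
from collections import Counter


def facet_with_cardinality(values, field, facet_limit=10):
    """Return the top `facet_limit` counts plus the total number of distinct values."""
    counts = Counter(values)
    return {
        "field": field,
        # Distinct values in total, independent of facet_limit.
        "cardinality": len(counts),
        "counts": [
            {"name": name, "count": count}
            for name, count in counts.most_common(facet_limit)
        ],
    }


# Made-up sample data: four distinct types, only the top two are returned.
values = (
    ["CHECKLIST"] * 5
    + ["OCCURRENCE"] * 3
    + ["METADATA"] * 2
    + ["SAMPLING_EVENT"]
)
facet = facet_with_cardinality(values, "TYPE", facet_limit=2)
print(facet)
```

With `facet_limit=2` the response lists two counts while `cardinality` still reports all four distinct values, which is the number the UI needs.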

Another approach: use ?facet=something&cardinality=publisherKey&limit=0&offset=0
and then return a distinct section in the response for it

{
  "count": 1000,
  "limit": 0,
  "offset": 0,
  "results": [],
  "facets": ...,
  "cardinality": {
    "PUBLISHER_KEY": 1234 <==== distinct publisherKeys within the given search filter
  }
}
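
A small sketch of the second idea, again over in-memory records rather than a real index. It assumes a separate `cardinality` request parameter and an upper snake case key in the response, as in the example above; the `search` function, the filter representation, and the sample records are hypothetical, and the `facets` part of the response is left out.

```python
import re


def to_enum_name(field):
    """Convert a camelCase field name to UPPER_SNAKE_CASE."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", field).upper()


def search(records, filters, cardinality_fields, limit=0, offset=0):
    """Filter records and report distinct-value counts for the requested fields."""
    matched = [
        r for r in records
        if all(r.get(name) in allowed for name, allowed in filters.items())
    ]
    return {
        "count": len(matched),
        "limit": limit,
        "offset": offset,
        "results": matched[offset:offset + limit],
        "cardinality": {
            # Distinct non-null values within the given search filter.
            to_enum_name(f): len({r[f] for r in matched if r.get(f) is not None})
            for f in cardinality_fields
        },
    }


# Made-up sample records.
records = [
    {"country": "ES", "publisherKey": "p1"},
    {"country": "ES", "publisherKey": "p2"},
    {"country": "FR", "publisherKey": "p1"},
    {"country": "DK", "publisherKey": "p3"},
]
response = search(records, {"country": {"ES", "FR"}}, ["publisherKey"])
print(response)
```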

@MortenHofft
Member Author

Specimen facets seem more difficult

E.g. counting the number of specimens per kingdom.
Quick thoughts on the subject: it could be nice within collections once some collections are richly described. But it seems more difficult, both to implement in the API and to present in a meaningful and fair way.

facets: [
  {kingdomKey: 1, count: 123456} // (from 2 csv rows. one with 123000 individuals and another with 456 individuals)
]

The count is the sum of individualCount across all descriptors that have kingdomKey=1,
so you would have to get the distinct kingdoms within the filter
and, for each of them, sum the individual counts of all the matching descriptors.
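
The aggregation described above could be sketched roughly like this, assuming each descriptor row carries a `kingdomKey` and an optional `individualCount`. The function name `specimen_facet` and the handling of missing counts (treated as zero) are assumptions made for illustration.

```python
from collections import defaultdict


def specimen_facet(descriptors, field="kingdomKey"):
    """Sum individualCount per distinct value of `field`, largest total first."""
    totals = defaultdict(int)
    for descriptor in descriptors:
        key = descriptor.get(field)
        if key is None:
            continue  # Descriptor has no value for this field.
        # Assumption: a missing individualCount contributes nothing.
        totals[key] += descriptor.get("individualCount") or 0
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{field: key, "count": total} for key, total in ranked]


# Made-up rows mirroring the example: two rows for kingdomKey=1.
descriptors = [
    {"kingdomKey": 1, "individualCount": 123000},
    {"kingdomKey": 1, "individualCount": 456},
    {"kingdomKey": 6, "individualCount": 50},
    {"kingdomKey": 6, "individualCount": None},
]
facets = specimen_facet(descriptors)
print(facets)
```

Note that the row without a count silently drops out of the total, which is one reason the caveats about coverage matter.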

Presentation-wise, the UI would probably have to show caveats like these for e.g. a kingdom breakdown:

  • Only 10% of collections have collection descriptors (CDs)
  • Only 50% of those with matching CDs have marked them as "no double counting" [?]
  • Only 20% of collections that provide CDs have marked them as exhaustive (describes their entire collection)
  • Only 40% of the provided CDs have a scientific name
  • Only 30% of the matched CDs have a specimen count
  • That means that this chart is based on 130 descriptors from 4 collections.
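
As a rough illustration of where such caveat numbers could come from, here is a sketch that computes a few coverage percentages over the matched descriptors. All field names (`scientificName`, `individualCount`, `collectionKey`) and the output keys are assumptions; flags for double counting or exhaustiveness do not exist yet and are left out.

```python
def share(items, predicate):
    """Percentage of items satisfying predicate, or 0.0 for an empty list."""
    if not items:
        return 0.0
    return 100.0 * sum(1 for item in items if predicate(item)) / len(items)


def descriptor_caveats(descriptors):
    """Summarise how complete the descriptors behind a chart are."""
    return {
        "withScientificName": share(
            descriptors, lambda d: bool(d.get("scientificName"))
        ),
        "withSpecimenCount": share(
            descriptors, lambda d: d.get("individualCount") is not None
        ),
        "descriptorCount": len(descriptors),
        "collectionCount": len({d["collectionKey"] for d in descriptors}),
    }


# Made-up descriptors from two hypothetical collections.
matched = [
    {"collectionKey": "a", "scientificName": "Aves", "individualCount": 10},
    {"collectionKey": "a", "scientificName": "Plantae"},
    {"collectionKey": "b", "scientificName": ""},
    {"collectionKey": "b"},
]
caveats = descriptor_caveats(matched)
print(caveats)
```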

@ManonGros
Collaborator

Thanks Morten! The collection facets and proposed implementation make a lot of sense.

The specimen facets are much more complicated and, yes, we would have to display a lot of caveats. We would also have to add some other fields to know whether the people uploading records have double counted, whether their descriptors are exhaustive, etc.


2 participants