Epic: A stand-alone MetaStore microservice

STATUS: IMPLEMENTED 👍
MetaStore: storage for dataset metadata; it does not store the data itself or the raw data blobs (files). Think CKAN Classic's DB tables, or the datapackage.json format.
The service should provide the following main capabilities:
Provide an API to store, access and manage versioned dataset and resource metadata
Be deployable as a stand-alone micro-service
Integrate with an external authentication / authorization service (JWT tokens)
Can be integrated with CKAN Classic via a wrapper client extension (and ckanext-authz-service)
Will typically use Git as a backend, but this is not a hard requirement: theoretically, other storage backends that support versioned key-value storage could be used
Initially, we will provide a GitHub API based storage backend, but backends should be pluggable (a possible interface is sketched below)
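To make the pluggable-backend idea concrete, here is a minimal sketch of what such an interface could look like. This is illustrative only; the class and method names are assumptions, not the actual metastore-lib API:

```python
from abc import ABC, abstractmethod
from typing import Optional


class MetaStoreBackend(ABC):
    """Hypothetical interface a pluggable storage backend could implement.

    Any backend offering versioned key-value storage (Git, the GitHub API,
    a database with a revisions table, ...) should be able to satisfy it.
    """

    @abstractmethod
    def create(self, dataset_id: str, metadata: dict) -> str:
        """Store a new datapackage.json; return the new revision ref."""

    @abstractmethod
    def fetch(self, dataset_id: str, revision_ref: Optional[str] = None) -> dict:
        """Fetch metadata at a given revision (default: latest)."""

    @abstractmethod
    def update(self, dataset_id: str, metadata: dict) -> str:
        """Write updated metadata as a new revision; return its ref."""

    @abstractmethod
    def delete(self, dataset_id: str) -> None:
        """Delete the dataset's metadata."""
```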
Questions:
JS vs Python (vs Go) as the server language => use Python
JS: useful if we ever wanted this as a client lib (would we ever?)
Go: the best, but no support at ...
Tag vs version => choose tag
dataset identifiers: TODO (org, name), uuid, ...
resource identifiers
GraphQL vs REST => REST (REST is a good fit here; a route sketch follows)
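Since REST is the choice, here is a minimal sketch of how the public actions could map onto routes. Flask and the in-memory store are stand-ins for illustration only, not the actual service:

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
DATASETS = {}  # stand-in for the real metastore backend


@app.route('/datasets/<dataset_id>', methods=['GET'])
def dataset_read(dataset_id):
    # a real backend would also honour ?revision=<tag|branch|sha>
    if dataset_id not in DATASETS:
        abort(404)
    return jsonify(DATASETS[dataset_id])


@app.route('/datasets', methods=['POST'])
def dataset_create():
    datapackage = request.get_json()
    DATASETS[datapackage['name']] = datapackage
    return jsonify(datapackage), 201


@app.route('/datasets/<dataset_id>', methods=['DELETE'])
def dataset_delete(dataset_id):
    DATASETS.pop(dataset_id, None)
    return '', 204
```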
Acceptance
A library for git(hub) based metastore (no auth required):
APIs for:
Dataset - read, create, update, (patch), delete, (purge), list (?) - listing is arguably not part of metastore-lib and belongs somewhere else (search backend)
Resource - part of datapackage.json, so same APIs as dataset
make html-docs; need to upload these to readthedocs or similar and provide a link in README
Wrap library into a microservice:
An HTTP micro-service exposing a Web API (most likely RESTful)
JWT based authorization (see implementation in Giftless; a minimal sketch follows this list)
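A minimal sketch of what the JWT-based authorization check could look like, in the spirit of Giftless. The secret, algorithm and scope naming here are assumptions:

```python
import jwt  # PyJWT

SECRET = 'shared-secret-with-the-auth-service'


def authorize(auth_header: str, required_scope: str) -> dict:
    """Verify a Bearer token and require a scope; return the token claims."""
    if not auth_header.startswith('Bearer '):
        raise PermissionError('missing bearer token')
    token = auth_header[len('Bearer '):]
    # raises jwt.InvalidTokenError on bad signature / expiry
    claims = jwt.decode(token, SECRET, algorithms=['HS256'])
    if required_scope not in claims.get('scopes', []):
        raise PermissionError(f'missing scope: {required_scope}')
    return claims
```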
Tasks
8d + 2d for mocks
Write a simple README with an example client flow in curl / JS or Python (0.5d)
OPTIONAL: Could extend current SDK or README like ... (?)
Just for read, create, update ...
Write the GitHub wrapper that implements that (7d)
Mock backend for git (mock out API calls) (2d)
dataset CRUD (1.5d)
Create: create .lfsinfo, create datapackage.json and associated LFS files (one for each resource)
Read: read the datapackage.json
Update: check which resources (if any) have changed - pull the old datapackage.json and compare against the incoming resources - and write those [OR: cheaper - just write all of them, but that may be painful for datasets with lots of resources], THEN update datapackage.json (order matters only if using the git CLI tool, where you'll want the datapackage.json update to come last); see the sketch after this list
Delete: delete the datapackage.json and all associated resource files
tag CRUDL (1d)
Create: create a tag with name, description, author (date?)
Update: update a tag with a new name, description ...
Write the API service around this ... (2d)
Add authentication (1.5d)
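A sketch of the dataset update flow described above. The `repo` handle and its methods are hypothetical placeholders for whatever backend ends up underneath:

```python
def update_dataset(repo, dataset_id: str, new_package: dict) -> None:
    """Write only changed resources, then update datapackage.json last."""
    old_package = repo.read_file(dataset_id, 'datapackage.json')
    old_hashes = {r['name']: r.get('sha256') for r in old_package['resources']}
    for resource in new_package['resources']:
        if old_hashes.get(resource['name']) != resource.get('sha256'):
            # changed or new resource: write its LFS / metadata file
            repo.write_file(dataset_id, resource['path'], resource)
    # order matters with the git CLI: the datapackage.json update comes last
    repo.write_file(dataset_id, 'datapackage.json', new_package)
```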
Analysis

Beginning of README-driven development

Design Public API

The following actions are exposed via Web API:

```python
from typing import List, Optional

# Dataset, Tag and Revision are design-level types, not yet defined.


# TODO: what would be the difference between this and the dataset ...
def project_read(project_id: str):
    return {
        'owner_org_or_user': ...,
        'dataset': {
            # datapackage object ...
        },
        'issues': ...,
        'flows': ...,  # future
    }


def dataset_read(dataset_id: str, revision: Optional[str] = None) -> Dataset:
    """Get dataset metadata given a dataset ID and an optional revision reference.

    Would be nice if ``revision_ref`` could be a tag name, branch name,
    commit sha etc., like with Git.

    The return value is essentially the datapackage.json file from the right
    revision; it includes metadata for all resources.

    dataset_id: tuple (xxx, yyy) or unique identifier
    """
    return {
        # datapackage.json ...
    }


def dataset_create(dataset):
    """dataset: a valid data package object, e.g.:

    {
        "resources": [
            {
                "name": ...,
                "path": "mydata.csv",  # we assume this is in git lfs ...
                "sha256": "...",  # need ...
                "bytes": "..."
            }
        ]
    }
    """
    # Code here will extract ckanext-gitdatahub code


def dataset_update(dataset_id, dataset):
    """dataset: a full data package object"""
    # Code here will extract ckanext-gitdatahub code


def dataset_delete():
    """TODO: semantics - at least for github. I think rather than archive we
    simply mark this in datapackage.json or do nothing at all - state is
    something managed at the HubStore level (?)
    """
    # Code here will extract ckanext-gitdatahub code


def dataset_move():
    """Move a dataset between organizations (do we need this?)"""


def dataset_purge(dataset_id: str):
    """Purge a deleted dataset. This should delete the git repo."""


def revision_list(dataset_id: str) -> List[Revision]:
    """Get the list of revisions for a dataset.

    TODO: is this all changes to the repo, or only to datapackage.json?
    ANS: for now, all the commits in the repo, b/c e.g. a file might change
    but not datapackage.json.
    """
    return [
        {
            "id": ...,
            "timestamp": ...,
        }
    ]


def tag_list(dataset_id: str) -> List[Tag]:
    """Get the list of tags for a dataset"""


def tag_create(dataset_id: str, tag_name: str, **kwargs) -> Tag:
    """Create a tag (named revision, or "version" in the old
    ckanext-versions terminology)
    """


def tag_update(dataset_id: str, tag: str, **kwargs) -> Tag:
    """Allows actions like changing the name, the description, etc.
    (tag metadata)
    """


def tag_read(dataset_id: str, tag: str) -> Tag:
    """Get tag metadata"""


def tag_delete(dataset_id: str, tag: str) -> None:
    """Delete a tag"""
```
Porcelain API:
```python
def dataset_revert(dataset_id, to_revision_ref: str) -> Dataset:
    """Revert a dataset to an older revision / tag.

    Under the hood this is a `git revert` like operation, and is somewhat
    equivalent to ckanext-versions' `dataset_version_promote` action.
    """


def revision_diff(dataset_id, revision_ref_a: str, revision_ref_b: str) -> DatasetDiff:
    """Compare two revisions of a dataset and return a 'diff' object.

    Maybe this is best handled as a client-side operation and doesn't need
    an API.
    """
```
For Gates this is a requirement:
dataset_revert
revision_diff
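If revision_diff does end up client-side, as the docstring above suggests, a naive version could simply compare the two datapackage.json payloads (using dataset_read as defined above):

```python
def revision_diff(dataset_id: str, ref_a: str, ref_b: str) -> dict:
    """Naive client-side diff: keys whose values differ between revisions."""
    a = dataset_read(dataset_id, ref_a)
    b = dataset_read(dataset_id, ref_b)
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```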
Stuff we only need if we're doing CKAN actions (vs. an independent microservice):
```python
def get_resource(dataset_id: str, resource_id: str,
                 revision_ref: Optional[str] = None) -> Resource:
    """Get resource metadata at a revision, similar to ``get_dataset``."""
    return filter(..., get_dataset(dataset_id, revision_ref))
```
Internal API
API for extensions to hook into
Github
Repo
GET /repos/:owner/:repo
https://developer.github.com/v3/repos/#get-a-repository
DELETE /repos/:owner/:repo
https://developer.github.com/v3/repos/#delete-a-repository
Contents
https://developer.github.com/v3/repos/contents/
Tags
https://developer.github.com/v3/git/tags/
Commits
https://developer.github.com/v3/repos/commits/
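For illustration, reading datapackage.json through the contents API listed above might look like this; the token handling and error behaviour are simplified assumptions:

```python
import base64
import json
from typing import Optional

import requests


def read_datapackage(owner: str, repo: str, token: str,
                     ref: Optional[str] = None) -> dict:
    """Fetch and decode datapackage.json via the GitHub v3 contents API."""
    url = f'https://api.github.com/repos/{owner}/{repo}/contents/datapackage.json'
    params = {'ref': ref} if ref else {}
    resp = requests.get(url, params=params,
                        headers={'Authorization': f'token {token}'})
    resp.raise_for_status()
    content = base64.b64decode(resp.json()['content'])  # API returns base64
    return json.loads(content)
```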
Gitlab
:::info
Actually, I think GitLab may have the cleaner API - e.g. having projects as first class and repos as distinct.
:::
https://docs.gitlab.com/ee/api/README.html
https://docs.gitlab.com/ee/api/projects.html