[epic] Git-based MetaStore Service #13

Closed
25 of 34 tasks
rufuspollock opened this issue Jun 14, 2020 · 1 comment
rufuspollock commented Jun 14, 2020

Epic: A stand-alone MetaStore microservice

STATUS: IMPLEMENTED 👍

MetaStore: storage for dataset metadata, not storing the data itself or the raw data blobs (files). Think CKAN Classic's DB tables, or the datapackage.json format.

The service should provide the following main capabilities:

  • Provide API to store, access and manage versioned dataset and resource metadata
  • Deployed as a stand-alone micro-service
  • Integrate with external authentication / authorization service (JWT tokens)
  • Can be integrated with CKAN Classic via a wrapper client extension (and ckanext-authz-service)
  • Will typically use Git as a backend, but this is not a hard requirement: in theory, any storage backend that supports versioned key-value storage could be used
    • Initially, we will provide a GitHub API based storage backend, but this should be pluggable
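
One way to picture the "pluggable backend" requirement is an abstract interface with interchangeable implementations. The following is a minimal sketch only: the names `StorageBackend`, `InMemoryBackend`, `write` and `read` are hypothetical and are not taken from the implemented library.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class StorageBackend(ABC):
    """Hypothetical interface for a versioned key-value metadata store."""

    @abstractmethod
    def write(self, dataset_id: str, metadata: dict) -> str:
        """Store a new revision and return its revision id."""

    @abstractmethod
    def read(self, dataset_id: str, revision: Optional[str] = None) -> dict:
        """Return metadata at the given revision (latest if None)."""


class InMemoryBackend(StorageBackend):
    """Toy backend that keeps every revision in a list; useful for tests."""

    def __init__(self) -> None:
        self._revisions: Dict[str, List[dict]] = {}

    def write(self, dataset_id: str, metadata: dict) -> str:
        history = self._revisions.setdefault(dataset_id, [])
        history.append(dict(metadata))
        # The revision id is simply the position in the history list.
        return str(len(history) - 1)

    def read(self, dataset_id: str, revision: Optional[str] = None) -> dict:
        history = self._revisions[dataset_id]
        index = len(history) - 1 if revision is None else int(revision)
        return dict(history[index])
```

A GitHub API based backend would implement the same two methods, mapping revision ids to commit SHAs instead of list positions.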

Questions:

  • JS vs Python (vs Go) as server language code => use python
    • JS: if we wanted this as a client lib (would we ever) then useful
    • Go: the best but no support at ...
  • Tag vs version => choose tag
  • dataset identifiers: TODO (org, name), uuid, ...
  • resource identifiers
  • GraphQL vs REST => REST (REST is a good fit here)

Acceptance

A library for git(hub) based metastore (no auth required):

Wrap library into a microservice

  • An HTTP micro-service exposing Web API (most likely RESTful)
  • JWT based authorization (see implementation in Giftless)
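
To illustrate what JWT based authorization involves, here is a minimal sketch of signing and verifying an HS256 token using only the standard library. It is not the Giftless implementation: the function names are hypothetical, only the `exp` claim is checked, and a real service would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact JWT (header.payload.signature) signed with HS256."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry are valid, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The service would call something like `verify_hs256` on the bearer token of each request and then check the returned claims against the requested dataset and action.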

Tasks

8d + 2d for mocks

  • Write a simple README with example client flow in curl / JS or python (0.5d)
    • OPTIONAL: Could extend current SDK or README like ... (?)
    • Just for read, create, update ...
  • Write the github wrapper that implements that (7d)
    • Mock backend for git (mock out api calls) (2d)
    • dataset CRUD (1.5d)
      • Create: create .lfsinfo, create datapackage.json and associated lfs files (one for each resource)
      • Read: read the datapackage.json
      • Update: check which resources, if any, have changed (pull the old datapackage.json and compare against those resources) and write those [OR: cheaper - just write all of them, but that may be painful for datasets with lots of resources] THEN update datapackage.json (order matters only if using the git cli tool, where you'll want the datapackage.json update to come last)
      • Delete: delete the datapackage.json and all associated resource files
    • tag CRUDL (1d)
      • Create: create a tag with name, description, author (date?)
      • Update: update a tag with new name, description ...
      • Delete: delete a tag
      • List: list tags
    • revisions RL (0.5d) - crude is just get all commits (prob easiest). Alternative would be commits to datapackage.json because we can assume all file changes will also touch datapackage.json https://pygithub.readthedocs.io/en/latest/github_objects/Repository.html#github.Repository.Repository.get_commits
    • (?) resource R (0.5d - just layer on top of dataset atm)
    • dataset @ revision (0.5d) - get the datapackage.json file at revision X
    • resource @ revision (0.25d) - get datapackage.json at revision X and pull out resource Y
  • (Optional - Probably won't do atm) Write the API service around this ... (2d)
    • Stub flask
    • Wire up endpoints
  • Add authentication (1.5d)
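
The dataset Update task above (compare the old datapackage.json against the new one and write only what changed) could be sketched as follows. This is an illustration under assumptions: resources are matched by `name`, and a resource counts as changed when its `sha256` or `path` differs. The function name `changed_resources` is hypothetical.

```python
from typing import Dict, List


def changed_resources(old_package: dict, new_package: dict) -> Dict[str, List[str]]:
    """Classify resource names as added, removed or changed between two data packages."""
    old = {r["name"]: r for r in old_package.get("resources", [])}
    new = {r["name"]: r for r in new_package.get("resources", [])}
    return {
        "added": sorted(name for name in new if name not in old),
        "removed": sorted(name for name in old if name not in new),
        "changed": sorted(
            name
            for name in new
            if name in old
            and (
                new[name].get("sha256") != old[name].get("sha256")
                or new[name].get("path") != old[name].get("path")
            )
        ),
    }
```

Only the files for "added" and "changed" resources would then need writing, those for "removed" resources deleting, and datapackage.json would be written last.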

Analysis

Beginning of README-driven development

# creates the project implicitly
$ curl -X POST https://metastore/dataset/create -H 'Content-Type: application/json' -d '{"owner": "xxx", "name": "yyy"}'
201 CREATED - { id: ... }

# check stuff
curl https://metastore/dataset/:id
curl https://metastore/project/:id
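
The same client flow could be sketched in Python with the standard library. The host `https://metastore` is the placeholder used above, and the helper names are hypothetical; the requests are only constructed here, not sent.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://metastore"  # placeholder host, as in the curl sketch


def build_create_request(owner: str, name: str) -> urllib.request.Request:
    """Build the POST request that creates a dataset (and its project implicitly)."""
    body = json.dumps({"owner": owner, "name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/dataset/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_read_request(dataset_id: str) -> urllib.request.Request:
    """Build the GET request that reads a dataset by id."""
    return urllib.request.Request(
        f"{BASE_URL}/dataset/{quote(dataset_id, safe='')}", method="GET"
    )
```

Passing either request to `urllib.request.urlopen` would perform the call against a running service.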

Design Public API

The following actions are exposed via Web API:

# TODO: what would be difference between this and the dataset ...
def project_read(project_id: str):
    return {
      'owner_org_or_user': ...,
      'dataset': {
        # data package object ...
      },
      'issues': ...,
      'flows': ...,  # future
    }

def dataset_read(dataset_id: str, revision: Optional[str] = None) -> Dataset:
    """Get dataset metadata given a dataset ID and optional revision
    reference. It would be nice if ``revision`` could be a tag name,
    branch name, commit sha etc., as with Git.
    
    The return value is essentially the datapackage.json file from the
    right revision; It includes metadata for all resources. 
    
    dataset_id: tuple (xxx, yyy) or unique identifier
    """
    return {
      # datapackage.json ...
    }

def dataset_create(dataset):
    """
    dataset: is a valid data package object.
    
    {
      resources: [
        {
          'name': ...,
          'path': 'mydata.csv', # we assume this is in git lfs ...
          'sha256': '...',  # need ...
          'bytes': '...'
        }
      ]
    }
    """
    # Code here will extract ckanext-gitdatahub code
    
def dataset_update(dataset_id, dataset):
    """
    dataset: a full data package object
    """
    # Code here will extract ckanext-gitdatahub code

    
def dataset_delete():
    """
    TODO: semantics - at least for github. I think rather than archive we simply mark this in datapackage.json or do nothing at all - state is something managed at HubStore level (?)
    """
    # Code here will extract ckanext-gitdatahub code

def dataset_move():
    """Move a dataset between organizations (do we need this?)
    """

def dataset_purge(dataset_id: str):
    """Purge a deleted dataset
    
    This should delete the git repo
    """

def revision_list(dataset_id: str) -> List[Revision]:
    """Get list of revisions for a dataset
    
    TODO: is this all changes to the repo, or only those to datapackage.json? ANS: for now, all the commits in the repo, because e.g. a file might change without datapackage.json changing
    """
    return [
      {
        "id": ...,
        "timestamp": ...,
      }
    ]
    
def tag_list(dataset_id: str) -> List[Tag]:
    """Get list of tags for a dataset
    """

def tag_create(dataset_id: str, tag_name: str, **kwargs) -> Tag:
    """Create a tag (named revision, or "version" in the old 
    ckanext-versions terminology)
    """

def tag_update(dataset_id: str, tag: str, **kwargs) -> Tag:
    """Allow actions like changing the name, the description, etc.
    (tag metadata)
    """
  
def tag_read(dataset_id: str, tag: str) -> Tag:
    """Get tag metadata
    """
    
def tag_delete(dataset_id: str, tag: str) -> None:
    """Delete a tag
    """
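
The `dataset_create` docstring above assumes each resource `path` lives in Git LFS and carries `sha256` and `bytes`. As an illustration, a Git LFS pointer file for a resource could be generated as below. The pointer layout follows the Git LFS pointer format; the function name `lfs_pointer` is hypothetical, and whether the library writes pointers exactly this way is an assumption.

```python
def lfs_pointer(resource: dict) -> str:
    """Render the Git LFS pointer file content for one resource descriptor."""
    sha256 = resource["sha256"]
    size = int(resource["bytes"])
    # Git LFS object ids are lowercase hex SHA-256 digests (64 characters).
    if len(sha256) != 64 or any(c not in "0123456789abcdef" for c in sha256):
        raise ValueError("sha256 must be 64 lowercase hex characters")
    if size < 0:
        raise ValueError("bytes must be non-negative")
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256}\n"
        f"size {size}\n"
    )
```

One such pointer would be written at each resource's `path`, alongside datapackage.json.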

Porcelain API:

def dataset_revert(dataset_id, to_revision_ref: str) -> Dataset:
    """Revert a dataset to an older revision / tag
    
    Under the hood this is a `git revert` like operation, 
    and is somewhat equivalent to ckanext-versions' 
    `dataset_version_promote` action.
    """
    
def revision_diff(dataset_id, revision_ref_a: str, revision_ref_b: str) -> DatasetDiff:
    """Compare two revisions of a dataset and return a 'diff' object.
    
    Maybe this is best handled as a client-side operation and doesn't
    need an API
    """
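
If `revision_diff` ends up as a client-side operation, a simple version only needs the two metadata objects returned by `dataset_read`. The sketch below compares top-level keys only; the function name `diff_metadata` and the shape of the returned object are assumptions, not the `DatasetDiff` type referred to above.

```python
from typing import Any, Dict


def diff_metadata(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return top-level keys added, removed or changed going from revision a to b."""
    added = {key: b[key] for key in b if key not in a}
    removed = {key: a[key] for key in a if key not in b}
    changed = {
        key: {"from": a[key], "to": b[key]}
        for key in a
        if key in b and a[key] != b[key]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

A fuller diff would recurse into `resources`, matching them by name.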

The following are a requirement for gates:

  • dataset_revert
  • revision_diff

Stuff we only need if we're doing CKAN actions (vs. an independent microservice):

def get_resource(dataset_id: str, resource_id: str, revision_ref: Optional[str] = None) -> Resource:
    """Get resource metadata in revision, similar to ``get_dataset``
    """
    return filter(..., get_dataset(dataset_id, revision_ref))

Internal API

API for extensions to hook into

Github

Repo

https://developer.github.com/v3/repos/#get-a-repository

GET /repos/:owner/:repo

https://developer.github.com/v3/repos/#delete-a-repository

DELETE /repos/:owner/:repo

Contents

Parameters:

  • ref -> string: The name of the commit/branch/tag. Default: the repository’s default branch (usually master)

https://developer.github.com/v3/repos/contents/

GET /repos/:owner/:repo/readme
GET /repos/:owner/:repo/contents/:path
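
As a sketch of how "dataset @ revision" could map onto the contents endpoint, the helpers below build the URL with the `ref` parameter and decode the base64 `content` field that the GitHub contents API returns for a file. The helper names are hypothetical and no HTTP call is made here.

```python
import base64
import json
from typing import Optional
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def contents_url(owner: str, repo: str, path: str, ref: Optional[str] = None) -> str:
    """Build the contents URL for a file, optionally pinned to a commit/branch/tag."""
    url = (
        f"{API_ROOT}/repos/{quote(owner, safe='')}/{quote(repo, safe='')}"
        f"/contents/{quote(path)}"
    )
    if ref is not None:
        url += "?" + urlencode({"ref": ref})
    return url


def decode_contents(response: dict) -> dict:
    """Decode the JSON file carried in a contents API response body."""
    if response.get("encoding") != "base64":
        raise ValueError("expected base64-encoded content")
    return json.loads(base64.b64decode(response["content"]))
```

Fetching `contents_url(owner, repo, "datapackage.json", ref=revision)` and passing the parsed JSON body to `decode_contents` would give the dataset metadata at that revision.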

Tags

https://developer.github.com/v3/git/tags/

GET /repos/:owner/:repo/git/tags/:tag_sha
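
For tag creation, the GitHub Git data API takes two steps: create an annotated tag object (POST /repos/:owner/:repo/git/tags), then create a ref pointing at it (POST /repos/:owner/:repo/git/refs). The sketch below only builds the two request bodies; the helper names are hypothetical, and mapping the tag "description" to the tag message is an assumption.

```python
def tag_object_payload(
    tag_name: str,
    description: str,
    commit_sha: str,
    author_name: str,
    author_email: str,
    date_iso: str,
) -> dict:
    """Body for creating an annotated tag object that points at a commit."""
    return {
        "tag": tag_name,
        "message": description,
        "object": commit_sha,
        "type": "commit",
        "tagger": {"name": author_name, "email": author_email, "date": date_iso},
    }


def tag_ref_payload(tag_name: str, tag_object_sha: str) -> dict:
    """Body for creating the refs/tags/<name> reference to the new tag object."""
    return {"ref": f"refs/tags/{tag_name}", "sha": tag_object_sha}
```

The `sha` passed to `tag_ref_payload` is the one returned in the response to the first request.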

Commits

https://developer.github.com/v3/repos/commits/

GET /repos/:owner/:repo/commits
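
A sketch of how `revision_list` could be layered on this endpoint: each item in the commits response carries a `sha` and a nested `commit.committer.date`, which map onto the `id` and `timestamp` fields of a revision. The optional `path` query parameter restricts the listing to commits touching one file, such as datapackage.json. The helper names are hypothetical.

```python
from typing import List, Optional
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def commits_url(owner: str, repo: str, path: Optional[str] = None) -> str:
    """Build the commits listing URL, optionally limited to one file path."""
    url = f"{API_ROOT}/repos/{quote(owner, safe='')}/{quote(repo, safe='')}/commits"
    if path is not None:
        url += "?" + urlencode({"path": path})
    return url


def commits_to_revisions(commits: List[dict]) -> List[dict]:
    """Map GitHub commit objects onto {id, timestamp} revision records."""
    return [
        {"id": item["sha"], "timestamp": item["commit"]["committer"]["date"]}
        for item in commits
    ]
```

Using the committer date rather than the author date is a choice made for this sketch; either is available in the response.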

Gitlab

:::info
Actually think gitlab may have the cleaner API. E.g. having projects as first class and repos as distinct.
:::

https://docs.gitlab.com/ee/api/README.html

https://docs.gitlab.com/ee/api/projects.html

shevron commented Jun 14, 2020

This is all done, excluding wrapping with a service which will be done separately as needed.
