Productizing bot deployment #1541

Open
8 of 11 tasks
dpaiton opened this issue Jun 18, 2024 · 3 comments

@dpaiton
Contributor

dpaiton commented Jun 18, 2024

Tasks

  • write a readme for deploying bots (@slundqui)
    • include a section on using AWS to deploy bots
  • "one-click" solution to deploying checkpoint & invariant bots (tbd after readme)
  • set up rollbar monitoring & notifications
    • add a flag or label so that these errors can be filtered (@slundqui)
    • discord channel that ingests notifs (@wakamex)
  • write up a playbook for when things fail (@jrhea)
  • set up credential storage (@mcclurejt)
  • set up & distribute lastpass credentials for pauser (@jalextowle)
  • make sure everyone has access (@wakamex)
  • review invariant checks (@jalextowle @slundqui)

responsibility

Everyone listed above should

  1. know how to (& have credentials to) restart and/or deploy bots
  2. monitor bot-related rollbar notifications; check that any critical bugs are being addressed
  3. understand error prioritization and know the failure playbook

importance (priority)

  1. invariant fails (page @jalextowle @jrhea @mcclurejt)
  2. checkpoint bot tries to checkpoint & fails
  3. checkpoint bot goes down
  4. invariant goes down

top priorities for mainnet

  • checkpoint bot & invariance check bot
    • runs
    • reporting system for when it goes down
    • secure credential management
    • documentation on how to (re)deploy bots

bots to consider

  • checkpoint
  • invariance check
  • lpandarb
    • this should be added after the other two are working well

documentation

  • README.md in infra

uptime monitoring

  • easily-accessible location for cloud machine address & status
  • easily-accessible portal to view all deployed bot wallets

error reporting & notifications

  • notifications to critical team when bots go down (rollbar?)
  • system in place to assign responsibility for who should handle errors

easy start & restart

  • minimal steps to deploy new bots on a pool
  • ideally we would be able to run this on a mainnet fork on an AWS instance

containerized deployment

  • set up a flag for "service bots"

invariant checks

  • rollbar filters for each check type

credentials storage

  • privileged access to private keys for bots
  • whoever sets this up is free to make the calls -- let's prioritize "easy" and "safe"
    • ideally use a free service, but a paid one is fine if needed
  • easiest to use env vars
  • lastpass credentials for pauser

continuous deployment

  • nice to have
  • when infra pushes a release we deploy bots on a mainnet fork in AWS?
  • almost-continuous deployment -- make it easy for a dev to manually test deployment

current status -- checkpoint bot:

  • running in docker container
    • docker can restart automatically on failure (easily set up)
  • passes credentials via env variables set in infra repo
    • registry address, rpc uri (points to anvil node), private key, rollbar api key
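
For reference, a minimal sketch of how a bot process might read those four values at startup and fail fast when one is unset. The variable names (`REGISTRY_ADDRESS`, `RPC_URI`, `PRIVATE_KEY`, `ROLLBAR_API_KEY`) and the `BotConfig` / `load_config` helpers are assumptions for illustration; the infra repo may use different names.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; the infra repo may use different ones.
REQUIRED_VARS = ("REGISTRY_ADDRESS", "RPC_URI", "PRIVATE_KEY", "ROLLBAR_API_KEY")


@dataclass(frozen=True, repr=False)
class BotConfig:
    registry_address: str
    rpc_uri: str
    private_key: str
    rollbar_api_key: str

    def __repr__(self) -> str:
        # Mask secrets so they do not leak into logs or error reports.
        return (
            f"BotConfig(registry_address={self.registry_address!r}, "
            f"rpc_uri={self.rpc_uri!r}, private_key='***', rollbar_api_key='***')"
        )


def load_config(env=None) -> BotConfig:
    """Read the four settings from env (defaults to os.environ); raise if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return BotConfig(
        registry_address=env["REGISTRY_ADDRESS"],
        rpc_uri=env["RPC_URI"],
        private_key=env["PRIVATE_KEY"],
        rollbar_api_key=env["ROLLBAR_API_KEY"],
    )
```

Failing at startup with the list of missing names makes a misconfigured container obvious right away, and the masked `repr` keeps the private key out of anything that gets logged or reported.
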
@slundqui

Readme on deploying bots within delvtech/hyperdrive-infra#119

@slundqui

slundqui commented Jun 18, 2024

Something to note is that rollbar doesn't have a great way to log "this process is dead". We may need a separate "monitoring" container that logs errors if the service bot containers are stopped, or we allow docker to always restart. Even then, if the aws machine goes down, there's no way of logging a "this is down" error.
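
One possible shape for that monitoring piece is a heartbeat check: each bot records a timestamp periodically, and a separate watcher reports any bot whose last heartbeat is too old. A minimal sketch of the staleness check only, with a hypothetical `stale_bots` helper and an arbitrary 300-second threshold; where heartbeats are stored and how the result is reported are left open. To catch the whole machine going down, the watcher would have to run somewhere other than that machine.

```python
import time


def stale_bots(last_heartbeat, now=None, max_age_seconds=300.0):
    """Return the names of bots whose last heartbeat is older than max_age_seconds.

    last_heartbeat maps a bot name to a Unix timestamp (seconds).
    The 300-second default is an arbitrary placeholder.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, ts in last_heartbeat.items() if now - ts > max_age_seconds
    )
```

A watcher could call this on a timer and send an error report for each name returned, which turns "the process is dead" into an event that can be logged.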

@wakamex
Contributor

wakamex commented Jun 20, 2024

I changed the second-last bullet from "document machine details (ip, port) and make sure everyone has ssh access" to "make sure everyone has access". Originally we envisioned using AWS, but @mcclurejt convinced me fly.io is way easier. We won't need individual ssh keys. But I'll still go through and make sure everyone has access, so I tagged myself to the bullet.
