Weekly Meeting 2016 10 06
Kristen Carlson Accardi edited this page Oct 13, 2016 · 3 revisions
- roll call (1 minute)
- opens (5 minutes)
- bugs (10 minutes): pre-seed a list of new or priority up/down candidates into the agenda for meeting focus
  - triage
  - scrub
- prior meeting actions (5 minutes): check ACTION items from prior meetings' minutes for progress and resolution
  - Tim had an AR from the prior meeting to discuss Identity abstractions in this meeting
Meeting started by kristenc at 16:03:19 UTC. The full logs are available at ciao-project/2016/ciao-project.2016-10-06-16.03.log.html .
- role call (kristenc, 16:03:30)
- opens (kristenc, 16:05:37)
- bug triage (kristenc, 16:06:45)
  - LINK: https://github.com/01org/ciao/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20created%3A%3E%3D2016-09-29 (kristenc, 16:06:51)
  - LINK: https://github.com/01org/ciao/issues/620 (kristenc, 16:21:27)
  - LINK: https://github.com/01org/ciao/issues/636 might also be a dup (of a different issue)...would need to search (tcpepper, 16:51:24)
Meeting ended at 17:00:17 UTC.
UNASSIGNED
- (none)
People present (lines said)
- kristenc (119)
- tcpepper (94)
- markusry (32)
- mcastelino (9)
- rbradford (3)
- ciaomtgbot (3)
- david-lyle (1)
- albertom (1)
- leoswaldo (1)
Generated by MeetBot 0.1.4 (http://wiki.debian.org/MeetBot)
### Full IRC Log
16:03:19 <kristenc> #startmeeting weekly_meeting
16:03:19 <ciaomtgbot> Meeting started Thu Oct 6 16:03:19 2016 UTC. The chair is kristenc. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:03:19 <ciaomtgbot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:03:19 <ciaomtgbot> The meeting name has been set to 'weekly_meeting'
16:03:30 <kristenc> #topic role call
16:03:39 <kristenc> o/
16:03:55 <rbradford> o/
16:03:59 <david-lyle> o/
16:04:21 <tcpepper> o/
16:04:30 <markusry> o/
16:04:36 <leoswaldo> o/
16:04:49 <albertom> o/
16:05:37 <kristenc> #topic opens
16:05:42 <kristenc> does anyone have any opens?
16:06:41 <kristenc> ok.
16:06:45 <kristenc> #topic bug triage
16:06:51 <kristenc> #link https://github.com/01org/ciao/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20created%3A%3E%3D2016-09-29
16:07:05 <kristenc> these are all the issues created in the last week.
16:07:30 <kristenc> 15 new ones.
16:07:42 <kristenc> shall we go oldest to newest and prioritize
16:07:58 <tcpepper> works for me
16:08:16 <markusry> I'll do 611 tomorrow
16:08:36 <kristenc> 611 and 613 seem to already have priorities. so unless someone disagrees, we can skip those.
16:08:45 <markusry> Sounds good.
16:08:54 <markusry> rbradford has a patch for 612
16:08:56 <markusry> 613
16:08:59 <tcpepper> 613: I think we need to have an architectural discussion at some point
16:09:19 <tcpepper> my fear is we automate letting a cluster fall into a degenerate state
16:09:24 <tcpepper> w/o a means to notice and correct that
16:09:53 <markusry> rbradford's patch won't do this by default
16:10:00 <tcpepper> I think a sufficient solution is proposed in the issue
16:10:05 <tcpepper> but would be nice to hear more folks thoughts
16:10:34 <markusry> You need to explicitly force launcher on the command line to autodetect kvm
16:10:51 <tcpepper> as long as we never document this option I'm happy :)
16:11:02 <markusry> It's not even compiled in by default
16:11:04 * tcpepper fears the user who actually reads documentation
16:11:09 <markusry> It's like --with-ui
16:11:14 <tcpepper> perfect
16:12:00 <kristenc> ok, 614?
16:12:11 <kristenc> seems to me like markusry is already working on this.
16:12:21 <kristenc> I think that makes it a p1 :).
16:13:02 <tcpepper> no argument from me
16:13:34 <markusry> Actually, I haven't started on 614
16:13:43 <markusry> I decided to fix the identity service instead
16:14:00 <markusry> I thought it would be the quickest route for testing this week
16:14:01 <kristenc> markusry, so using fake identity is working for enabling ceph in single vm?
16:14:13 <markusry> Yes, it works fine.
16:14:21 <kristenc> ok - then it's definitely not a p1.
16:14:31 <markusry> once 637 and 640 are merged
16:14:51 <markusry> agreed.
16:15:01 <kristenc> so p2 or p3? my feeling is p3.
16:15:13 <kristenc> since we don't explicitly need it for anything.
16:15:22 <kristenc> it seems more like a "good idea, nice to have" type thing.
16:15:33 <markusry> The other thing we thought it might be useful for was Single VM in travis
16:15:46 <markusry> as there's a race condition in the fake identity service
16:15:52 <markusry> But I can fix this easily another way
16:16:13 <markusry> So I agree p3 is fine for now
16:16:21 <tcpepper> the closer we make the singlevm to a real production preferred architecture...the better
16:16:26 <tcpepper> but there's a lot to do there
16:16:33 <tcpepper> p3's ok by me
16:16:40 <markusry> I'd need to learn something about keystone for a start :-)
16:17:11 <kristenc> heh.
16:17:18 <kristenc> ok - 615.
16:17:43 <kristenc> there's a gap in our swagger stuff. it doesn't cover any of the new apis we've created for storage.
16:17:52 <kristenc> and doesn't handle the compute refactor.
16:18:03 <kristenc> it's documentation only.
16:18:06 <kristenc> so - priority?
16:18:13 <tcpepper> maybe a dumb question, but...: who do we imagine consumes this docu?
16:18:26 <kristenc> UI developers.
16:18:49 <tcpepper> ah ok...sorry jvillalo ;0
16:19:17 <tcpepper> if we're focused on ciao-cli first (which I think is the case), I'd call this a P4
16:19:34 <kristenc> yes - and there's a work around for the lack of documentation.
16:21:02 <kristenc> I updated 616 to the same priority.
16:21:27 <kristenc> https://github.com/01org/ciao/issues/620
16:21:47 <kristenc> I'd like to be able to rebuild the volume database just be querying the ceph cluster.
16:21:53 <tcpepper> this one begs for a bit of arch discussion too imho:
16:22:04 <tcpepper> I reaaaaallly like the idea of storing data in the component it relates to
16:22:16 <kristenc> the issue right now is just one of ignorance on my part.
16:22:16 <tcpepper> but also in the meantime we need a key:value store for various things
16:22:39 <kristenc> In july I tried to get this working - but failed due to not know how to do it right.
16:22:42 <tcpepper> we pollute the controller over time dumping stuff in its datastore. and complicate the security threat model too.
16:22:44 <kristenc> I think it's supported.
16:22:55 <kristenc> It'd be nice to not use a datastore at all for volumes.
16:23:02 <kristenc> and just keep all metadata in ceph.
16:23:10 <kristenc> they support adding key/value pairs to image.
16:23:23 <kristenc> I just was not smart enough to make it work.
16:23:31 <tcpepper> and similarly for things like keys (https://github.com/01org/ciao/issues/631) or cloud-init's (https://github.com/01org/ciao/issues/631)
16:24:03 <tcpepper> user-specific and project-specific key:value pairs are going to come up again and again
16:24:17 <tcpepper> and we saw the Nova answer...meh just keep a copy of it in our database
16:24:26 <tcpepper> and we all know how we reacted to hearing that
16:24:30 <kristenc> we won't be able to use a one size fits all solution.
16:24:42 <kristenc> but for storage, ceph seems to already have a way to store key value.
16:24:46 <kristenc> so I'd like to use it.
16:24:50 <tcpepper> yep
16:24:53 <kristenc> and eliminate the sql database.
16:25:03 <markusry> Ooh, sounds good.
16:25:03 <kristenc> I think this is a p2 or p3 though.
16:25:21 <kristenc> mainly because we do have another way to do it right now.
16:25:29 <tcpepper> maybe even a janitorial?
16:26:52 <kristenc> ok with me. it'd be a nice project for someone who wanted to learn.
16:26:59 <kristenc> and contribute something useful.
16:27:01 <tcpepper> exactly
16:27:33 <kristenc> 622 is just a question.
16:28:18 * tcpepper looks for the issue on scheduler queueing jobs
16:28:26 <kristenc> this was a problem I ran into while testing on a cluster behind a bastion with no proxy.
16:28:27 <tcpepper> I think in the short term reject the job
16:28:51 <tcpepper> but revisit when scheduler can queue and I can implement a "waiting for cnci" flag and flush those jobs when their cnci arrives
16:29:24 <kristenc> i think i worded the problem poorly.
16:29:33 <kristenc> the problem is there are no network nodes ready to launch cncis.
16:30:24 <kristenc> and so controller will happily try to launch a cnci.
16:30:29 <tcpepper> that's what I understood you to be saying
16:30:30 <tcpepper> aaah
16:30:35 <kristenc> but get no "no network nodes" message.
16:30:35 <tcpepper> ok so there's too layers
16:30:44 <tcpepper> a workload before a cnci and a cnci before a nn
16:30:58 <tcpepper> either way scheduler has the knowledge to get the right thing done
16:31:03 <mcastelino> o/
16:31:04 <tcpepper> (once I implement a queue)
16:31:20 <kristenc> also wondered if the os-prepare code could run asynchronously.
16:31:35 <kristenc> because in this particular case, its actually doing something useless.
16:31:44 <kristenc> (searching for storage bundles on a network node)
16:32:00 <kristenc> and I'd like to be able to just have it start.
16:32:05 <tcpepper> that sounds like a separate bug
16:32:12 <tcpepper> but architecturally we have an open on that
16:32:21 <tcpepper> do we run a cluster in a degraded state?
16:32:22 <kristenc> but still - even if you take os-prepare out of the picture, the question remains - what to do when no network nodes.
16:32:30 <kristenc> yes.
16:32:52 <tcpepper> we don't even have a definition yet for preferred cluster implementations vs degraded
16:32:59 <tcpepper> so I say for now reject it at controller
16:33:06 <tcpepper> and we can do better over time at the scheduler
16:33:13 <kristenc> I'd have to add a lot of code to monitor the cluster.
16:33:19 <kristenc> and make sure it was always "healthy"
16:33:27 <kristenc> and then the question is - how often do you check it.
16:33:38 <tcpepper> hmm. maybe that's a enhancement ticket of its own
16:33:48 <tcpepper> in a way this is a "cloud full"
16:33:58 <tcpepper> in as much as the required resources are not available in the cloud
16:34:06 <tcpepper> so scheduler rejects it then
16:34:14 <tcpepper> w/o controller having to monitor now
16:34:40 <kristenc> scheduler should currently reject a cnci start request if there are no network nodes, right?
16:34:47 <tcpepper> yep
16:34:53 <kristenc> so I wonder why it wasn't.
16:35:08 <kristenc> do you think it's because of where the os-prepare delay was?
16:35:15 <kristenc> it checked into the scheduler
16:35:18 <kristenc> perhaps
16:35:24 <kristenc> but wasn't ready to do anything?
16:35:52 <kristenc> this might be a bug.
16:36:00 <tcpepper> could well be
16:36:22 <tcpepper> a race between checked in and actually ready to receive
16:37:23 <kristenc> lets skip it - I think it needs more investigation.
16:37:42 <tcpepper> btw I like the question label
16:37:48 <kristenc> I'm going to mark it as a p3 bug though.
16:37:49 <tcpepper> really useful to label things we need to talk about
16:38:41 <kristenc> 623 was a result of me having 622 issues?
16:38:43 <kristenc> :)
16:38:52 <kristenc> it's just a nice to have.
16:38:53 <kristenc> p3?
16:39:06 <tcpepper> yeah
16:39:10 <tcpepper> super easy to implement
16:39:12 <kristenc> ciao-cli will give you status of the cluster.
16:39:12 <tcpepper> janitorial
16:39:23 <kristenc> but doesn't tell you about network vs. compute nodes.
16:39:41 <tcpepper> I almost recall markusry having mentioned this gap once
16:39:59 <tcpepper> or we noticed it trying to debug one of our clusters early on when something wasn't checking in
16:40:12 <tcpepper> maybe we previously even listed the nn as a node in the ciao-cli node list, but not denoted as an nn
16:40:29 <kristenc> 626 is a p1 bug as it's blocking completion of boot from volume. plus I'm already working on it.
16:40:31 * tcpepper has sad memories of lists of 101 uuid's and trying to reason about what was what
16:41:47 <kristenc> 627 is a request to implement a new compute api endpoint. I would vote p2 - we do implement a different api for attaching volumes.
16:42:14 <kristenc> it is however, not the same api as what the openstack uis use. and jorge implemented the ciao ui to be compatible with the openstack uis.
16:42:34 <kristenc> so he'd like us to support the nova api for attaching volumes rather than the cinder one.
16:42:35 <tcpepper> I can see it as a P2, and also tag OpenStack Compat
16:43:22 <kristenc> ciao-cli uses the cinder api.
16:44:18 <kristenc> 631 probably needs an in person discussion.
16:44:42 <kristenc> it might be work that would not be needed once the workloads feature is implemented.
16:45:22 <kristenc> for now I am going to suggest a p3.
16:45:25 <tcpepper> yep
16:45:36 <tcpepper> it's interesting to me b/c a user asked for it based on comps to existing clouds
16:45:46 <tcpepper> and we saw Nova's pain with this feature
16:45:51 <tcpepper> would be nice to implement and correctly
16:46:29 <kristenc> 633
16:46:57 <kristenc> sounds like a nice to have feature - p3?
16:47:02 <kristenc> mcastelino, ??
16:47:02 <tcpepper> yeah
16:47:05 <markusry> It's needed for getting SingleVmworking in travis
16:47:08 <tcpepper> this was to enable a test gate
16:47:20 <tcpepper> if Clearlinux builds a daily base image for the cnci, we need a QA gate
16:47:20 <markusry> So we can run our bat tests in travis
16:47:22 <mcastelino> I am working on it for two reasons
16:47:41 <mcastelino> 1. Single VM on Travis. 2. As a validate new images coming out of dev ops
16:47:53 <tcpepper> then https://download.clearlinux.org/demos/ciao/ can move to https://download.clearlinux.org/images and always be up to date in its OS internals
16:47:57 <kristenc> ah - well, that is different (Single VM).
16:48:18 <kristenc> I guess if you *need* it to get single VM working, then it should be a higher priority.
16:48:22 <tcpepper> I'd go P2 on it
16:48:25 <mcastelino> yes
16:48:30 <markusry> Yep. I think so
16:48:37 <markusry> We're not that far away
16:48:57 <mcastelino> The plan is to support clear a based CNCIs and CNCI based on another systemd based distro... debian or centos
16:49:20 <kristenc> rbradford, I assume 635 and 636 are intermittant test failures on travis?
16:49:27 <rbradford> kristenc, yes
16:49:28 <markusry> mcastelino: I'm sure I've asked you this before but why can't we run the CNCI inside a container?
16:49:45 <tcpepper> tenant separation
16:49:53 <tcpepper> fear of tenant escaping the container
16:50:06 <markusry> Okay, but for singleVM it might be okay?
16:50:09 <mcastelino> It has not been designed to be run as a container.. It does DHCP. Also it uses macvtap.. And it needs a supervisor inside the container to restart the CNCI agent
16:50:20 <kristenc> ok - I wish we had a master bug for all the travis races so that we can easily close them all / retest everything once the root cause is fixed.
16:50:20 <mcastelino> if we add all of that to a container.. it looks like a VM..
16:50:30 <kristenc> or a label for them.
16:50:33 <markusry> mcastelino: Got it. Thanks for the answer
16:50:53 <markusry> kristenc: I think there are different causes
16:50:55 <tcpepper> I think https://github.com/01org/ciao/issues/635 is a dup. there's a prior bug tracking the instance id race
16:51:24 <tcpepper> https://github.com/01org/ciao/issues/636 might also be a dup (of a different issue)...would need to search
16:51:34 <tcpepper> I think for that panic nil map one we have a better trace in another bug
16:51:45 <markusry> Yes, I've seen that one before
16:51:58 <markusry> But maybe in a different test failure
16:52:00 <kristenc> tcpepper, they are all symptoms of the same issue likely.
16:52:27 <kristenc> I suppose you don't know till you fix the issue that you think is causing it.
16:52:32 <markusry> Right
16:52:39 <kristenc> which I do already have another issue filed for.
16:52:45 <tcpepper> my impression has been they're different
16:52:50 <tcpepper> but..yeah hard to say
16:52:53 <kristenc> but that's why I wish we could tie them all together.
16:53:01 <kristenc> to make it easier on ourselves.
16:53:07 <kristenc> should we declare a new label?
16:53:10 <kristenc> travis-failures
16:53:12 <kristenc> ?
16:53:25 <kristenc> or intermittent-travis-fail?
16:53:39 <rbradford> travis-flakes ?
16:53:45 <kristenc> fine.
16:54:03 <kristenc> let's start tagging these suckers so that when we think we've solved them we can keep an eye on them.
16:55:07 <kristenc> tcpepper, I think 638 might be a duplicate of another work item in here. let me see if I've got that captured somewhere else.
16:55:51 <kristenc> although - I've not been great about putting planned features for controller into github.
16:56:05 <tcpepper> I did some searching and didn't see cloud-init anywhere
16:56:27 <tcpepper> it's plausible we never did it b/c this is such an obvious need to have done early that surely we'd just do it
16:56:36 <kristenc> tcpepper, it would have been called "workloads"
16:56:41 <kristenc> but I don't see it either.
16:56:55 <kristenc> i don't think i've put any of the planned work around workloads into github yet.
16:57:09 <tcpepper> there's https://github.com/01org/ciao/issues/510
16:57:13 <tcpepper> which is slightly different imho
16:57:24 <tcpepper> we could enable ciao-cli -> ciao-controller passing of a yaml
16:57:24 <kristenc> no - that's different.
16:57:31 <tcpepper> w/o addressing the security/isolation aspect
16:57:38 <kristenc> although - part of the work that needs to be done in order to do "workloads" right.
16:58:23 <tcpepper> I feel like something can be done quickly/simply to allow the cli to optionally pass the yaml and override controller reading test.yaml
16:58:29 <tcpepper> I can take it on
16:58:31 <mcastelino> tcpepper, tcpepper I think we should only allow the user to specify the cloud-init part of the workload.. not the whole yaml
16:58:38 <tcpepper> yes
16:58:43 <kristenc> tcpepper, I think we want to resist the urge to do it quick and dirty.
16:58:55 <tcpepper> I didn't say dirty did I?
16:58:56 <kristenc> there's a lot of work to do workloads correctly.
16:59:06 <kristenc> lets talk offline.
16:59:09 <tcpepper> ok
16:59:22 <tcpepper> so P4 it then?
16:59:34 <mcastelino> also the way mark launches containers... we fake cloud-init for containers.. so that is nice
16:59:38 <tcpepper> implicitly it is a p4
16:59:42 <kristenc> I would call it a p2.
16:59:44 <tcpepper> since we haven't worked on it in a year
16:59:58 <kristenc> because to me the workloads feature is something that should be done soon.
17:00:06 <kristenc> but - you are right, we haven't done it yet.
17:00:10 <kristenc> ok.
17:00:12 <tcpepper> oh well
17:00:13 <tcpepper> it's 10am
17:00:14 <kristenc> and this concludes our meeting.
17:00:16 <tcpepper> and we're pumpkins
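
### Sketch: issue 622 scheduler behaviour

The handling tcpepper describes for issue 622 (in the short term, reject the job when no network node is available; later, queue jobs behind a "waiting for cnci" flag and flush them when their CNCI arrives) could look roughly like the following. This is a minimal Python sketch for illustration only, not ciao's Go scheduler. The `Scheduler` class, its fields, its method names, and the status strings are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Hypothetical stand-in for a scheduler; not ciao's actual code."""

    # Network nodes that have checked in and are ready to host a CNCI.
    network_nodes: set = field(default_factory=set)
    # Tenants whose CNCI is already up.
    ready_cncis: set = field(default_factory=set)
    # Per-tenant jobs parked with a "waiting for cnci" flag.
    waiting: dict = field(default_factory=lambda: defaultdict(deque))

    def start_cnci(self, tenant: str) -> str:
        # No network node can host the CNCI: treat it like "cloud full"
        # and reject, so the controller does not have to monitor the cluster.
        if not self.network_nodes:
            return "rejected: no network nodes"
        return "cnci launching"

    def start_workload(self, tenant: str, job: str) -> str:
        if tenant in self.ready_cncis:
            return "dispatched"
        if not self.network_nodes:
            return "rejected: no network nodes"
        # A CNCI can be launched but is not up yet: park the job.
        self.waiting[tenant].append(job)
        return "queued: waiting for cnci"

    def cnci_arrived(self, tenant: str) -> list:
        # Mark the tenant's CNCI as ready and flush its parked jobs in order.
        self.ready_cncis.add(tenant)
        return list(self.waiting.pop(tenant, ()))
```

The two layers from the discussion (a workload needs a CNCI, and a CNCI needs a network node) show up as the two checks in `start_workload`. The race between "checked in" and "actually ready to receive" raised at 16:36:22 is not modelled here.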