Feat: Use the best available model #6

Closed
3 of 4 tasks
nelsonic opened this issue Nov 11, 2023 · 6 comments · Fixed by #15
Labels
discuss: Share your constructive thoughts on how to make progress with this issue
documentation: Improvements or additions to documentation
enhancement: New feature or enhancement of existing functionality
feedback: Feedback from people using the App or any other repo
priority-1: Highest priority issue. This is costing us money every minute that passes.
question: A question needs to be answered before progress can be made on this issue
T1h: Time Estimate 1 Hour
technical: A technical issue that requires understanding of the code, infrastructure or dependencies

Comments

@nelsonic
Member

nelsonic commented Nov 11, 2023

As noted in #5 (comment), the mini model currently deployed returns only basic classifications. 💭
We want to use the best available pre-trained model to maximise the effectiveness of classification. ✨

The Fly.io instance (https://fly.io/dashboard/dwyl-img-class) is currently a shared-cpu-1x@256MB machine:
imgai-machine-256

When the VM is not in use, it gets paused: https://fly.io/blog/fly-now-with-power-pause/

imgai-paused

So we can easily bump this to 2 GB or 4 GB RAM without fear of it costing us a fortune.
https://fly.io/docs/about/pricing/#compute

fly-perf-core

Please:

  • Update the code to use the best (biggest?) available model
  • Increase the Fly.io instance RAM size and CPU to comfortably match the requirement of the best model.

    If the model is 3 GB, use a 4 GB instance. If the model is 15 GB, use a 16 GB instance.

  • Document which model you're using and why and the steps (fly CLI commands) you used to scale up.
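As a starting point for the scale-up steps requested above, here is a minimal sketch of the sizing rule (3 GB model, 4 GB instance; 15 GB model, 16 GB instance) together with the `fly scale` commands it leads to. The helper names `next_ram_gb` and `scale_plan` are made up for illustration, the `performance-1x` preset is an assumption based on the performance CPU mentioned in this issue, and the script only prints the commands so they can be reviewed before anyone runs them.

```shell
#!/bin/sh
# Sizing rule from this issue: pick the next power-of-two RAM size (in GB)
# strictly above the model size, e.g. 3 GB -> 4 GB, 15 GB -> 16 GB.
# Works on whole GB only (shell integer arithmetic).
next_ram_gb() {
  model_gb=$1
  ram_gb=1
  while [ "$ram_gb" -le "$model_gb" ]; do
    ram_gb=$((ram_gb * 2))
  done
  echo "$ram_gb"
}

# Print (do not run) the flyctl commands for a given app and model size.
# "performance-1x" is an assumption; list real presets with
# `fly platform vm-sizes`. Larger memory sizes may need a larger CPU preset.
scale_plan() {
  app=$1
  ram_mb=$(( $(next_ram_gb "$2") * 1024 ))
  echo "fly scale show -a $app"
  echo "fly scale vm performance-1x -a $app"
  echo "fly scale memory $ram_mb -a $app"
}

# Example: the imgai app with a (hypothetical) 3 GB model.
scale_plan imgai 3
```

Printing the plan first keeps the exact commands easy to paste into the documentation this issue asks for.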

Thanks.

Note: Boss Level: if there's a significant advantage to using an A100 GPU please play with it:
https://fly.io/docs/about/pricing/#gpus-and-fly-machines

image

Just don't leave it running over a weekend ... 💰 🔥 😉
and document everything you learn along the way. 👌
Knowing your way around A100 will be a good talking point when you interview at Anthropic. 😜

Chore

  • Fix auto-deployment ...
image
@nelsonic added the documentation, enhancement, T1h, priority-1 and technical labels on Nov 11, 2023
@nelsonic moved this to 🔖 Ready for Development in dwyl app kanban on Nov 11, 2023
@nelsonic added the feedback and discuss labels on Nov 12, 2023
@LuchoTurtle
Member

I was using a lightweight model because I initially didn't want this machine to "waste money".
It also takes a while to start up because the app is suspended whenever there has been no activity for an hour, so it has to cold start every time someone loads the app after that.

I've gotten this e-mail from fly.io.

image

So perhaps we'll bump up the specs a bit.

@nelsonic
Member Author

Right ... so we/you have to scale up the builder ... which currently has 4GB RAM:
https://fly.io/dashboard/dwyl-img-class
image

I don't have any issues with using a performance CPU and 4 GB of RAM as noted above in my screenshot. 👆
$41/month (ceiling) won't break the bank.
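Because the machine is paused when idle, the real bill should sit well below that ceiling. A rough sketch of the arithmetic, assuming billing is proportional to running time, that a month has about 730 hours, and ignoring anything charged while the machine is stopped (for example storage); `estimated_cost` is a hypothetical helper name:

```shell
#!/bin/sh
# estimated_cost CEILING_USD ACTIVE_HOURS
# Scales the always-on monthly ceiling by the fraction of the month
# the machine is actually running.
estimated_cost() {
  awk -v ceiling="$1" -v hours="$2" 'BEGIN {
    hours_per_month = 730   # assumption: average number of hours in a month
    printf "%.2f\n", ceiling * hours / hours_per_month
  }'
}

# Example: $41 ceiling, machine awake 73 hours in the month.
estimated_cost 41 73   # prints 4.10
```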

The question is: will using a bigger pre-trained model significantly increase the cold-boot time for the app? 💭 ⏳

@nelsonic added the question label on Nov 13, 2023
@LuchoTurtle
Member

I think so.
Even though I tried to cache the model locally while deploying (it's described in the README file), it seems that the model is fetched every time the machine boots up again (judging by the logs at https://fly.io/apps/imgai/monitoring).

So if I just switch to a bigger model, booting will probably take much, much longer. I'll check whether caching is actually working and whether I can get it to work.
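One quick way to check whether something is downloaded on every boot is to count boots and completed download progress bars in a saved copy of the logs. This is only a sketch: it assumes each boot logs a `Starting init` line and each finished download logs a progress bar ending in `100% (`, which is what the log captured later in this thread shows; `boot_download_report` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `fly logs` output on stdin and reports how many boots and how many
# completed downloads it contains. Equal counts suggest a fetch on every boot.
boot_download_report() {
  awk '
    /Starting init/ { boots++ }       # one line per machine boot
    /100% \(/       { downloads++ }   # one line per finished progress bar
    END { printf "boots=%d downloads=%d\n", boots, downloads }
  '
}

# Usage, with imgai.log being a saved copy of the log output:
#   boot_download_report < imgai.log
```

Note that the download in the captured log is only 69.55 KB, so it is worth confirming what that file actually is before concluding that the whole model is re-fetched.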

@nelsonic
Member Author

Ok. Sounds like you might want to cache the model and any other useful stuff on an attached volume: https://fly.io/docs/apps/volume-storage/
The volume will be stored along with the machine config during pause.
The App will need to know how to check the volume at boot time ... 💭
Feel free to open a separate issue (sub-task) for this. 👌
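A minimal sketch of what "check the volume at boot time" could look like as a startup step: use the file if it is already on the volume, otherwise fetch it once and keep it. Everything named here is an assumption for illustration (the `ensure_cached` helper, the `/data/models` mount path, the `MODEL_URL` variable and the `model_cache` volume name); the app's real model loading may work differently.

```shell
#!/bin/sh
# ensure_cached CACHE_DIR NAME FETCH_CMD...
# Runs FETCH_CMD (which must write the file to stdout) only when
# CACHE_DIR/NAME is missing or empty, so later boots reuse the stored copy.
ensure_cached() {
  cache_dir=$1
  name=$2
  shift 2
  target="$cache_dir/$name"
  if [ -s "$target" ]; then
    echo "cache hit: $target"
    return 0
  fi
  mkdir -p "$cache_dir"
  tmp="$target.partial"          # download to a temp name first
  if "$@" > "$tmp"; then
    mv "$tmp" "$target"          # only a complete download becomes the cache entry
    echo "cache miss: fetched $target"
  else
    rm -f "$tmp"
    echo "fetch failed: $name" >&2
    return 1
  fi
}

# Hypothetical usage at boot, with a volume mounted at /data:
#   ensure_cached /data/models model.bin curl -fsSL "$MODEL_URL"
# Creating the volume (name and size are assumptions):
#   fly volumes create model_cache --region mad --size 5 -a imgai
```

Writing to a `.partial` file first means a machine stopped mid-download does not leave a truncated file that the next boot would treat as a valid cache entry. Each Fly volume attaches to a single machine, so an app running two machines would need one volume per machine.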

Oh, and while you're looking at the repo, remember to ⭐ it yourself .... 😉

@nelsonic
Member Author

Definitely open a forum topic on https://community.fly.io to discuss how best to run this on Fly.io.
Explain the context and link back to this repo so people know what it is about.
You will get expert help very fast given the content of the app.

image

@nelsonic
Member Author

Just captured the logs on imgai https://fly.io/apps/imgai/monitoring for reference.

fly logs -a imgai

Waiting for logs...

2023-11-13T06:50:07.922 app[683d527a1d7d08] mad [info] INFO Starting clean up.

2023-11-13T06:50:07.923 app[683d527a1d7d08] mad [info] WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)

2023-11-13T06:50:07.931 app[683d527a1d7d08] mad [info] 2023/11/13 06:50:07 listening on [fdaa:3:7f9d:a7b:1be:16eb:4bb0:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T06:50:08.924 app[683d527a1d7d08] mad [info] [ 344.646832] reboot: Restarting system

2023-11-13T06:55:24.827 proxy[e82d629c315628] mad [info] Downscaling app imgai from 1 machines to 0 machines, stopping machine e82d629c315628 (region=mad, process group=app)

2023-11-13T06:55:24.835 app[e82d629c315628] mad [info] INFO Sending signal SIGTERM to main child process w/ PID 305

2023-11-13T06:55:24.838 app[e82d629c315628] mad [info] 06:55:24.836 [notice] SIGTERM received - shutting down

2023-11-13T06:55:26.428 app[e82d629c315628] mad [info] INFO Main child exited normally with code: 0

2023-11-13T06:55:26.429 app[e82d629c315628] mad [info] INFO Starting clean up.

2023-11-13T06:55:26.431 app[e82d629c315628] mad [info] WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)

2023-11-13T06:55:26.440 app[e82d629c315628] mad [info] 2023/11/13 06:55:26 listening on [fdaa:3:7f9d:a7b:1bf:355c:df71:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T06:55:27.431 app[e82d629c315628] mad [info] [ 661.012916] reboot: Restarting system

2023-11-13T06:57:33.582 proxy[683d527a1d7d08] mad [info] Starting machine

2023-11-13T06:57:33.754 app[683d527a1d7d08] mad [info] [ 0.041896] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!

2023-11-13T06:57:33.787 app[683d527a1d7d08] mad [info] [ 0.043270] PCI: Fatal: No config space access function found

2023-11-13T06:57:33.984 app[683d527a1d7d08] mad [info] INFO Starting init (commit: 15238e9)...

2023-11-13T06:57:34.002 app[683d527a1d7d08] mad [info] INFO Preparing to run: `/app/bin/server` as nobody

2023-11-13T06:57:34.006 app[683d527a1d7d08] mad [info] INFO [fly api proxy] listening at /.fly/api

2023-11-13T06:57:34.014 app[683d527a1d7d08] mad [info] 2023/11/13 06:57:34 listening on [fdaa:3:7f9d:a7b:1be:16eb:4bb0:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T06:57:34.046 proxy[683d527a1d7d08] mad [info] machine started in 464.263689ms

2023-11-13T06:57:37.011 app[683d527a1d7d08] mad [info] WARN Reaped child process with pid: 362 and signal: SIGUSR1, core dumped? false

2023-11-13T06:57:37.196 app[683d527a1d7d08] mad [info] |===================== | 37% (25.88/69.55 KB) |================================== | 61% (42.27/69.55 KB) |=============================================== | 84% (58.65/69.55 KB) |==============================================================| 100% (69.55 KB)

2023-11-13T06:57:39.150 proxy[683d527a1d7d08] mad [info] waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.103998901s so far)

2023-11-13T06:57:40.394 app[683d527a1d7d08] mad [info] 06:57:40.387 [info] TfrtCpuClient created.

2023-11-13T06:57:42.156 proxy[683d527a1d7d08] mad [error] failed to connect to machine: gave up after 15 attempts (in 8.110079546s)

2023-11-13T06:57:42.211 proxy[e82d629c315628] mad [info] Starting machine

2023-11-13T06:57:42.382 app[e82d629c315628] mad [info] [ 0.041090] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!

2023-11-13T06:57:42.415 app[e82d629c315628] mad [info] [ 0.042371] PCI: Fatal: No config space access function found

2023-11-13T06:57:42.602 app[e82d629c315628] mad [info] INFO Starting init (commit: 15238e9)...

2023-11-13T06:57:42.619 app[e82d629c315628] mad [info] INFO Preparing to run: `/app/bin/server` as nobody

2023-11-13T06:57:42.625 app[e82d629c315628] mad [info] INFO [fly api proxy] listening at /.fly/api

2023-11-13T06:57:42.634 app[e82d629c315628] mad [info] 2023/11/13 06:57:42 listening on [fdaa:3:7f9d:a7b:1bf:355c:df71:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T06:57:42.664 proxy[e82d629c315628] mad [info] machine started in 453.004364ms

2023-11-13T06:57:45.611 app[e82d629c315628] mad [info] |==================== | 35% (24.37/69.55 KB) |================================= | 59% (40.75/69.55 KB) |========================================= | 73% (50.66/69.55 KB) |====================================================== | 96% (67.05/69.55 KB) |==============================================================| 100% (69.55 KB)

2023-11-13T06:57:46.910 app[683d527a1d7d08] mad [info] 06:57:46.903 [info] Running AppWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2023-11-13T06:57:46.918 app[683d527a1d7d08] mad [info] 06:57:46.914 [info] Access AppWeb.Endpoint at https://imgai.fly.dev

2023-11-13T06:57:47.932 proxy[e82d629c315628] mad [info] waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.267697053s so far)

2023-11-13T06:57:48.753 app[e82d629c315628] mad [info] 06:57:48.744 [info] TfrtCpuClient created.

2023-11-13T06:57:50.898 proxy[e82d629c315628] mad [error] failed to connect to machine: gave up after 15 attempts (in 8.233731986s)

2023-11-13T06:57:51.096 app[683d527a1d7d08] mad [info] 06:57:51.081 request_id=F5ccdIIcjcdeMV8AAAER [info] HEAD /

2023-11-13T06:57:51.097 app[683d527a1d7d08] mad [info] 06:57:51.095 request_id=F5ccdIIcjcdeMV8AAAER [info] Sent 200 in 14ms

2023-11-13T06:57:55.292 app[e82d629c315628] mad [info] 06:57:55.286 [info] Running AppWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2023-11-13T06:57:55.299 app[e82d629c315628] mad [info] 06:57:55.296 [info] Access AppWeb.Endpoint at https://imgai.fly.dev

2023-11-13T07:05:07.194 proxy[683d527a1d7d08] mad [info] Downscaling app imgai from 2 machines to 1 machines, stopping machine 683d527a1d7d08 (region=mad, process group=app)

2023-11-13T07:05:07.203 app[683d527a1d7d08] mad [info] INFO Sending signal SIGTERM to main child process w/ PID 305

2023-11-13T07:05:07.207 app[683d527a1d7d08] mad [info] 07:05:07.204 [notice] SIGTERM received - shutting down

2023-11-13T07:05:08.539 app[683d527a1d7d08] mad [info] INFO Main child exited normally with code: 0

2023-11-13T07:05:08.540 app[683d527a1d7d08] mad [info] INFO Starting clean up.

2023-11-13T07:05:08.542 app[683d527a1d7d08] mad [info] WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)

2023-11-13T07:05:08.549 app[683d527a1d7d08] mad [info] 2023/11/13 07:05:08 listening on [fdaa:3:7f9d:a7b:1be:16eb:4bb0:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T07:05:09.542 app[683d527a1d7d08] mad [info] [ 455.793730] reboot: Restarting system

2023-11-13T07:07:25.358 proxy[e82d629c315628] mad [info] Downscaling app imgai from 1 machines to 0 machines, stopping machine e82d629c315628 (region=mad, process group=app)

2023-11-13T07:07:25.362 app[e82d629c315628] mad [info] INFO Sending signal SIGTERM to main child process w/ PID 305

2023-11-13T07:07:25.365 app[e82d629c315628] mad [info] 07:07:25.363 [notice] SIGTERM received - shutting down

2023-11-13T07:07:27.298 app[e82d629c315628] mad [info] INFO Main child exited normally with code: 0

2023-11-13T07:07:27.299 app[e82d629c315628] mad [info] INFO Starting clean up.

2023-11-13T07:07:27.301 app[e82d629c315628] mad [info] WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)

2023-11-13T07:07:27.310 app[e82d629c315628] mad [info] 2023/11/13 07:07:27 listening on [fdaa:3:7f9d:a7b:1bf:355c:df71:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T07:07:28.302 app[e82d629c315628] mad [info] [ 585.924182] reboot: Restarting system

2023-11-13T11:07:49.847 proxy[683d527a1d7d08] mad [info] Starting machine

2023-11-13T11:07:50.029 app[683d527a1d7d08] mad [info] [ 0.050000] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!

2023-11-13T11:07:50.062 app[683d527a1d7d08] mad [info] [ 0.051452] PCI: Fatal: No config space access function found

2023-11-13T11:07:50.253 app[683d527a1d7d08] mad [info] INFO Starting init (commit: 15238e9)...

2023-11-13T11:07:50.272 app[683d527a1d7d08] mad [info] INFO Preparing to run: `/app/bin/server` as nobody

2023-11-13T11:07:50.277 app[683d527a1d7d08] mad [info] INFO [fly api proxy] listening at /.fly/api

2023-11-13T11:07:50.286 app[683d527a1d7d08] mad [info] 2023/11/13 11:07:50 listening on [fdaa:3:7f9d:a7b:1be:16eb:4bb0:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T11:07:50.312 proxy[683d527a1d7d08] mad [info] machine started in 465.151197ms

2023-11-13T11:07:53.432 app[683d527a1d7d08] mad [info] |==================== | 36% (24.70/69.55 KB) |========================== | 47% (32.76/69.55 KB) |======================================== | 71% (49.15/69.55 KB) |===================================================== | 94% (65.53/69.55 KB) |==============================================================| 100% (69.55 KB)

2023-11-13T11:07:55.621 proxy[683d527a1d7d08] mad [info] waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.308249053s so far)

2023-11-13T11:07:56.938 app[683d527a1d7d08] mad [info] 11:07:56.933 [info] TfrtCpuClient created.

2023-11-13T11:07:58.611 proxy[683d527a1d7d08] mad [error] failed to connect to machine: gave up after 15 attempts (in 8.299130722s)

2023-11-13T11:07:58.684 proxy[e82d629c315628] mad [info] Starting machine

2023-11-13T11:07:58.866 app[e82d629c315628] mad [info] [ 0.041512] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!

2023-11-13T11:07:58.900 app[e82d629c315628] mad [info] [ 0.042921] PCI: Fatal: No config space access function found

2023-11-13T11:07:59.087 app[e82d629c315628] mad [info] INFO Starting init (commit: 15238e9)...

2023-11-13T11:07:59.103 app[e82d629c315628] mad [info] INFO Preparing to run: `/app/bin/server` as nobody

2023-11-13T11:07:59.108 app[e82d629c315628] mad [info] INFO [fly api proxy] listening at /.fly/api

2023-11-13T11:07:59.116 app[e82d629c315628] mad [info] 2023/11/13 11:07:59 listening on [fdaa:3:7f9d:a7b:1bf:355c:df71:2]:22 (DNS: [fdaa::3]:53)

2023-11-13T11:07:59.152 proxy[e82d629c315628] mad [info] machine started in 467.879812ms

2023-11-13T11:08:02.112 app[e82d629c315628] mad [info] WARN Reaped child process with pid: 362 and signal: SIGUSR1, core dumped? false

2023-11-13T11:08:02.170 app[e82d629c315628] mad [info] |========================== | 46% (31.78/69.55 KB) |========================== | 47% (32.76/69.55 KB) |======================================== | 71% (49.15/69.55 KB) |===================================================== | 94% (65.53/69.55 KB) |==============================================================| 100% (69.55 KB)

2023-11-13T11:08:03.373 app[683d527a1d7d08] mad [info] 11:08:03.368 [info] Running AppWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2023-11-13T11:08:03.381 app[683d527a1d7d08] mad [info] 11:08:03.376 [info] Access AppWeb.Endpoint at https://imgai.fly.dev

2023-11-13T11:08:04.737 proxy[e82d629c315628] mad [info] waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.584947666s so far)

2023-11-13T11:08:05.474 app[e82d629c315628] mad [info] 11:08:05.468 [info] TfrtCpuClient created.

2023-11-13T11:08:07.716 proxy[e82d629c315628] mad [error] failed to connect to machine: gave up after 15 attempts (in 8.563973471s)

2023-11-13T11:08:07.917 app[683d527a1d7d08] mad [info] 11:08:07.899 request_id=F5cqHOI8CQpeMV8AAAER [info] GET /

2023-11-13T11:08:07.918 app[683d527a1d7d08] mad [info] 11:08:07.917 request_id=F5cqHOI8CQpeMV8AAAER [info] Sent 200 in 17ms

2023-11-13T11:08:08.156 app[683d527a1d7d08] mad [info] 11:08:08.153 [info] CONNECTED TO Phoenix.LiveView.Socket in 548µs

2023-11-13T11:08:08.156 app[683d527a1d7d08] mad [info] Transport: :websocket

2023-11-13T11:08:08.156 app[683d527a1d7d08] mad [info] Serializer: Phoenix.Socket.V2.JSONSerializer

2023-11-13T11:08:08.156 app[683d527a1d7d08] mad [info] Parameters: %{"_csrf_token" => "EHUiW3kcMwlTIQMJU3cvCHI_PlsBRCAcV3vi7CIP8fmHc5Vp0uR2rwOJ", "_live_referer" => "undefined", "_mounts" => "0", "_track_static" => %{"0" => "https://imgai.fly.dev/assets/app-25b5c2cfcdb3041cab0cdbc67bcf91d9.css?vsn=d", "1" => "https://imgai.fly.dev/assets/app-956037311f0a3945ed53c93e204e141c.js?vsn=d"}, "vsn" => "2.0.0"}

2023-11-13T11:08:12.106 app[e82d629c315628] mad [info] 11:08:12.103 [info] Running AppWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2023-11-13T11:08:12.112 app[e82d629c315628] mad [info] 11:08:12.110 [info] Access AppWeb.Endpoint at https://imgai.fly.dev

2023-11-13T11:08:12.498 app[683d527a1d7d08] mad [info] 11:08:12.497 request_id=F5cqHfRUC17kCj8AAAFR [info] GET /

2023-11-13T11:08:12.498 app[683d527a1d7d08] mad [info] 11:08:12.498 request_id=F5cqHfRUC17kCj8AAAFR [info] Sent 200 in 842µs

2023-11-13T11:08:12.721 app[683d527a1d7d08] mad [info] 11:08:12.720 [info] CONNECTED TO Phoenix.LiveView.Socket in 32µs

2023-11-13T11:08:12.721 app[683d527a1d7d08] mad [info] Transport: :websocket

2023-11-13T11:08:12.721 app[683d527a1d7d08] mad [info] Serializer: Phoenix.Socket.V2.JSONSerializer

2023-11-13T11:08:12.721 app[683d527a1d7d08] mad [info] Parameters: %{"_csrf_token" => "Kx8yW3klDi8jEggjWSQWKxp8FB8VYDY1mYfi7ztvHUfbifoSX6xvfSYc", "_live_referer" => "undefined", "_mounts" => "0", "_track_static" => %{"0" => "https://imgai.fly.dev/assets/app-25b5c2cfcdb3041cab0cdbc67bcf91d9.css?vsn=d", "1" => "https://imgai.fly.dev/assets/app-956037311f0a3945ed53c93e204e141c.js?vsn=d"}, "vsn" => "2.0.0"}

We need to address

Projects
Status: Done
2 participants