Establish what metrics are needed for production monitoring #18
A CDN will have its own metrics, which are independent of tilekiln, and we can't really control what those metrics are. Behind the CDN, we need metrics on serving, DB health, tiles, and replication.

Because in production tilekiln will be behind nginx or something else that handles SSL termination, we can rely on that layer for serving metrics like number of requests, average request size, and tile serving time. This avoids duplicating existing metrics, and those metrics would be needed for any server, not just tilekiln.

There are existing PostgreSQL metrics for table size, index size, bloat, etc. Gathering these per table will give us the per-zoom sizes, because the storage tables are partitioned. Replication metrics are covered by existing osm2pgsql metrics, and tilekiln is not specific to osm2pgsql.

Where we do need metrics is for tiles. This would cover tile size, number of tiles, and new tiles being generated. The dimensions would be host, DB, schema/tileset, and zoom. These can be calculated from the DB.
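As a rough sketch of what calculating these from the DB could look like, the snippet below builds per-zoom gauge lines in Prometheus exposition format. The table naming pattern, column names, and metric names are illustrative assumptions, not tilekiln's actual schema:

```python
# Sketch of per-zoom tile storage metrics. The partition naming
# pattern ("{tileset}_z{zoom}") and the column "tile_data" are
# assumptions for illustration, not tilekiln's actual schema.

# Hypothetical query run once per zoom-level partition.
STORAGE_METRICS_SQL = """
SELECT count(*) AS num_tiles,
       coalesce(sum(length(tile_data)), 0) AS total_bytes
FROM "{schema}"."{tileset}_z{zoom}"
"""

def format_prometheus(rows, host, db, tileset):
    """Render (zoom, num_tiles, total_bytes) rows as Prometheus
    exposition-format gauge lines with host/db/tileset/zoom labels."""
    lines = []
    for zoom, num_tiles, total_bytes in rows:
        labels = (f'host="{host}",db="{db}",'
                  f'tileset="{tileset}",zoom="{zoom}"')
        lines.append(f'tilekiln_stored_tiles{{{labels}}} {num_tiles}')
        lines.append(f'tilekiln_stored_bytes{{{labels}}} {total_bytes}')
    return '\n'.join(lines)
```

Running the query per partition rather than per parent table is what yields the zoom dimension for free.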
Notes:
For tiles being generated we need a counter. The obvious choice is a table with the number of tiles generated per zoom, incremented by one after each tile is generated.
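A minimal sketch of that counter, assuming hypothetical table and column names (generate_stats, zoom, num_generated), with an in-memory stand-in showing the same +1-per-tile semantics:

```python
# Hypothetical UPSERT run once per generated tile; the table and
# column names are illustrative assumptions, not tilekiln's schema.
INCREMENT_SQL = """
INSERT INTO generate_stats (zoom, num_generated)
VALUES (%s, 1)
ON CONFLICT (zoom)
DO UPDATE SET num_generated = generate_stats.num_generated + 1
"""

class GenerationCounter:
    """In-memory stand-in with the same +1-per-tile behaviour."""
    def __init__(self):
        self.per_zoom = {}

    def tile_generated(self, zoom):
        self.per_zoom[zoom] = self.per_zoom.get(zoom, 0) + 1
```

The UPSERT form avoids needing to pre-populate a row for every zoom level.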
Error metrics will also be needed.
I was thinking, based on https://prometheus.github.io/client_python/getting-started/three-step-demo/, that I should use the Prometheus client to track tile generation information and upload it at the end of generation, but this would work poorly in case of an error. Instead I should track it manually, or perhaps use the client only for timing each layer. Thinking through this has revealed the need to store configuration information in the DB, both so I don't have to specify the config files and so I can support many config files for different schemas.
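One way the "track it manually" idea could look: accumulate per-layer render times in plain Python so that partial results survive an error mid-run, instead of uploading everything in one batch at the end. All names here are illustrative assumptions, not existing tilekiln code:

```python
import time

# Sketch of manual per-layer timing. Unlike a batch upload at the
# end of generation, the accumulated state is always consistent, so
# an error mid-run loses at most the tile being rendered.

class LayerTimings:
    def __init__(self):
        self.totals = {}   # layer name -> cumulative seconds
        self.counts = {}   # layer name -> number of observations

    def observe(self, layer, seconds):
        self.totals[layer] = self.totals.get(layer, 0.0) + seconds
        self.counts[layer] = self.counts.get(layer, 0) + 1

    def time_layer(self, layer, render):
        """Run a layer's render function and record its duration."""
        start = time.monotonic()
        result = render()
        self.observe(layer, time.monotonic() - start)
        return result
```

The totals/counts pair is the same shape as a Prometheus summary, so it could later be exported through the client library if desired.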
Tileset storage metrics have been implemented. Additionally, there are metrics monitoring the time taken to calculate the storage metrics. Generation metrics still need implementing.
Generation metrics are deferred until after parallelism is implemented, because reworking the generation code to be parallel might change things around.
What do we need for production monitoring, and what do we already have?