Establish what metrics are needed for production monitoring #18
A CDN will have its own metrics, which are independent of tilekiln, and we can't really control what those metrics are. Behind the CDN, we need metrics on serving, DB health, tiles, and replication.

Because in production tilekiln will be behind nginx or something else that handles SSL termination, we can rely on that layer for serving metrics like number of requests, average request size, and tile serving time. This avoids duplicating existing metrics, and those metrics would be needed for any server, not just tilekiln.

There are existing PostgreSQL metrics for table size, index size, bloat, etc. Gathering these per table will give us the per-zoom sizes, because the storage tables are partitioned. Replication metrics are covered by existing osm2pgsql metrics, and tilekiln is not specific to osm2pgsql.

Where we do need metrics is for tiles. This would cover tile size, number of tiles, and new tiles being generated. The dimensions would be host, DB, schema/tileset, and zoom. These can be calculated from the DB.
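As a rough sketch of what calculating these from the DB could look like, the snippet below builds per-zoom gauge lines in Prometheus exposition format. The table naming pattern, column names, and metric names are illustrative assumptions, not tilekiln's actual schema:

```python
# Sketch of per-zoom tile storage metrics. The partition naming
# pattern ("{tileset}_z{zoom}") and the column "tile_data" are
# assumptions for illustration, not tilekiln's actual schema.

# Hypothetical query run once per zoom-level partition.
STORAGE_METRICS_SQL = """
SELECT count(*) AS num_tiles,
       coalesce(sum(length(tile_data)), 0) AS total_bytes
FROM "{schema}"."{tileset}_z{zoom}"
"""

def format_prometheus(rows, host, db, tileset):
    """Render (zoom, num_tiles, total_bytes) rows as Prometheus
    exposition-format gauge lines with host/db/tileset/zoom labels."""
    lines = []
    for zoom, num_tiles, total_bytes in rows:
        labels = (f'host="{host}",db="{db}",'
                  f'tileset="{tileset}",zoom="{zoom}"')
        lines.append(f'tilekiln_stored_tiles{{{labels}}} {num_tiles}')
        lines.append(f'tilekiln_stored_bytes{{{labels}}} {total_bytes}')
    return '\n'.join(lines)
```

Running the query per partition rather than per parent table is what yields the zoom dimension for free.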
Notes:
For tiles being generated we need a counter. The obvious choice is a table with the number of tiles generated per zoom, incremented by one after each tile is generated.
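A minimal sketch of that counter, assuming hypothetical table and column names (generate_stats, zoom, num_generated), with an in-memory stand-in showing the same +1-per-tile semantics:

```python
# Hypothetical UPSERT run once per generated tile; the table and
# column names are illustrative assumptions, not tilekiln's schema.
INCREMENT_SQL = """
INSERT INTO generate_stats (zoom, num_generated)
VALUES (%s, 1)
ON CONFLICT (zoom)
DO UPDATE SET num_generated = generate_stats.num_generated + 1
"""

class GenerationCounter:
    """In-memory stand-in with the same +1-per-tile behaviour."""
    def __init__(self):
        self.per_zoom = {}

    def tile_generated(self, zoom):
        self.per_zoom[zoom] = self.per_zoom.get(zoom, 0) + 1
```

The UPSERT form avoids needing to pre-populate a row for every zoom level.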
Error metrics will also be needed.
I was thinking, based on https://prometheus.github.io/client_python/getting-started/three-step-demo/, that I should use the Prometheus client to track tile generation information and upload it at the end of generation, but this would work poorly in case of an error. Instead I should track it manually, or perhaps use the client only for timing each layer. Thinking through this has revealed the need to store configuration information in the DB, both so I don't have to specify the config files and so I can support many config files for different schemas.
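One way the "track it manually" idea could look: accumulate per-layer render times in plain Python so that partial results survive an error mid-run, instead of uploading everything in one batch at the end. All names here are illustrative assumptions, not existing tilekiln code:

```python
import time

# Sketch of manual per-layer timing. Unlike a batch upload at the
# end of generation, the accumulated state is always consistent, so
# an error mid-run loses at most the tile being rendered.

class LayerTimings:
    def __init__(self):
        self.totals = {}   # layer name -> cumulative seconds
        self.counts = {}   # layer name -> number of observations

    def observe(self, layer, seconds):
        self.totals[layer] = self.totals.get(layer, 0.0) + seconds
        self.counts[layer] = self.counts.get(layer, 0) + 1

    def time_layer(self, layer, render):
        """Run a layer's render function and record its duration."""
        start = time.monotonic()
        result = render()
        self.observe(layer, time.monotonic() - start)
        return result
```

The totals/counts pair is the same shape as a Prometheus summary, so it could later be exported through the client library if desired.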
Tileset storage metrics have been implemented. Additionally, there are metrics monitoring the time taken to calculate the storage metrics. Generation metrics still need implementing.
Generation metrics are deferred until after parallelism is implemented, because reworking the generation code to be parallel might change things around.
What do we need for production monitoring, and what do we already have?