feat: implement boundary usage tracker and telemetry collection#21716
feat: implement boundary usage tracker and telemetry collection#21716
Conversation
f149d4f to
dace6b5
Compare
Documentation CheckNo Changes NeededThis PR implements internal telemetry collection for boundary usage statistics. The changes are entirely internal:
The telemetry data structure is properly defined in Automated review via Coder Tasks |
There was a problem hiding this comment.
I have a more comprehensive test I will add after this PR merges that will stitch in BoundaryLogsAPI to ensure Track() is called properly when the logs are received from the workspace agent.
07eb313 to
778ad4f
Compare
Implements telemetry for boundary usage tracking across all Coder replicas and reports them via telemetry. Changes: - Implement Tracker with Track() and FlushToDB() methods - Add telemetry integration via collectBoundaryUsageSummary() - Use telemetry lock to ensure only one replica collects per period The tracker accumulates unique workspaces, unique users, and request counts (allowed/denied) in memory, then flushes to the database periodically. During telemetry collection, stats are aggregated across all replicas and reset for the next period.
778ad4f to
0cf7cf9
Compare
f0ssel
left a comment
There was a problem hiding this comment.
If possible would like another engineer's eyes here but everything here looks clean to me
| return nil | ||
| } | ||
|
|
||
| //nolint:gocritic // This is the actual package doing boundary usage tracking. |
There was a problem hiding this comment.
What is the lint complaining here? Is there a more "proper" way for dbauthz?
There was a problem hiding this comment.
A lint rule is flagging this as dangerous because there are some dbauthz.As<...> functions that are very powerful.
Lines 23 to 30 in 3eeeabf
This particular usage only gives access to boundary usage resources, and it's only being used for boundary usage tracking. I think the lint rule is pretty aggressive given some of the deprecated powerful functions, but there's not much risk here.
| r.options.Logger.Debug(ctx, "boundary usage telemetry lock already claimed by another replica, skipping", slog.F("period_ending_at", periodEndingAt)) | ||
| return nil, nil //nolint:nilnil // This is simple to handle when dealing with telemetry. | ||
| } | ||
| if err != nil { |
There was a problem hiding this comment.
I'm a bit concerned that if any error, other than unique_violation will be returned or unique_violation will be incorrectly wrapped - collectBoundaryUsageSummary will return error, which can break all telemetry process?
but I see that aibridge uses same approach, so probably okay.
There was a problem hiding this comment.
Yeah, everything in the snapshot seems to be all-or-nothing. I debated making the snapshot proceed if the boundary usage fails for whatever reason, but I couldn't really come up with a good reason to deviate from the prior art because the boundary telemetry should always work assuming there's no unique violation.
|
For future readers: I tested this by launching the develop.sh script and pointed Coder at a local telemetry server with |
Implements telemetry for boundary usage tracking across all Coder replicas and reports them via telemetry.
Changes:
The tracker accumulates unique workspaces, unique users, and request counts (allowed/denied) in memory, then flushes to the database periodically. During telemetry collection, stats are aggregated across all replicas and reset for the next period.
Relates to coder/boundary#138