
feat: implement boundary usage tracker and telemetry collection #21716

Merged
zedkipp merged 4 commits into main from
zedkipp/boundary-usage-telemetry-snapshot
Jan 28, 2026

@zedkipp zedkipp commented Jan 27, 2026

Implements boundary usage tracking across all Coder replicas and reports the aggregated statistics via telemetry.

Changes:

  • Implement Tracker with Track() and FlushToDB() methods
  • Add telemetry integration via collectBoundaryUsageSummary()
  • Use telemetry lock to ensure only one replica collects per period

The tracker accumulates unique workspaces, unique users, and request counts (allowed/denied) in memory, then flushes to the database periodically. During telemetry collection, stats are aggregated across all replicas and reset for the next period.
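The accumulate-then-flush pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Summary` type, the `Flush(persist)` callback signature, the separate `Reset()` method, and the string IDs are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Summary is a hypothetical snapshot of the counters the tracker holds.
type Summary struct {
	UniqueWorkspaces int
	UniqueUsers      int
	AllowedRequests  int64
	DeniedRequests   int64
}

// Tracker is an illustrative in-memory accumulator. It is not the real
// boundaryusage.Tracker; names and signatures here are assumptions.
type Tracker struct {
	mu         sync.Mutex
	workspaces map[string]struct{}
	users      map[string]struct{}
	allowed    int64
	denied     int64
}

func NewTracker() *Tracker {
	return &Tracker{
		workspaces: make(map[string]struct{}),
		users:      make(map[string]struct{}),
	}
}

// Track records request counts for one workspace/user pair. Sets keep the
// workspace and user counts unique no matter how often a pair reports.
func (t *Tracker) Track(workspaceID, userID string, allowed, denied int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.workspaces[workspaceID] = struct{}{}
	t.users[userID] = struct{}{}
	t.allowed += allowed
	t.denied += denied
}

// Flush copies the current totals under the lock, then hands them to
// persist (a stand-in for a database write) without holding the lock.
func (t *Tracker) Flush(persist func(Summary) error) error {
	t.mu.Lock()
	s := Summary{
		UniqueWorkspaces: len(t.workspaces),
		UniqueUsers:      len(t.users),
		AllowedRequests:  t.allowed,
		DeniedRequests:   t.denied,
	}
	t.mu.Unlock()
	return persist(s)
}

// Reset clears all state so the next period starts from zero.
func (t *Tracker) Reset() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.workspaces = make(map[string]struct{})
	t.users = make(map[string]struct{})
	t.allowed, t.denied = 0, 0
}

func main() {
	tr := NewTracker()
	tr.Track("ws-1", "user-1", 3, 40)
	tr.Track("ws-1", "user-1", 4, 50)
	_ = tr.Flush(func(s Summary) error {
		fmt.Printf("%+v\n", s)
		return nil
	})
	tr.Reset()
}
```

Copying the totals before calling `persist` keeps `Track` from blocking on a slow write; whether the real tracker makes the same trade-off is not stated in this thread.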

Relates to coder/boundary#138

@zedkipp zedkipp force-pushed the zedkipp/boundary-usage-telemetry-snapshot branch 4 times, most recently from f149d4f to dace6b5 Compare January 27, 2026 22:21
@zedkipp zedkipp marked this pull request as ready for review January 27, 2026 22:23
coder-tasks bot commented Jan 27, 2026

Documentation Check

No Changes Needed

This PR implements internal telemetry collection for boundary usage statistics. The changes are entirely internal:

  1. Internal tracking: The boundaryusage.Tracker accumulates statistics (unique workspaces, users, allowed/denied requests) in memory and flushes them periodically to the database.

  2. Telemetry integration: The BoundaryUsageSummary is collected during telemetry snapshots and aggregates data across all replicas. This follows the same pattern as other internal telemetry metrics already documented in the code.

  3. No user-facing changes:

    • No new CLI flags, API endpoints, or configuration options
    • No changes to existing Agent Boundary behavior or configuration
    • The telemetry collection is automatic and uses existing CODER_TELEMETRY_ENABLE controls

  4. Existing documentation covers the feature:

    • docs/admin/setup/telemetry.md explains telemetry collection and references the source code for details
    • docs/ai-coder/boundary/agent-boundary.md documents Agent Boundaries and audit logs (which are distinct from this telemetry)

The telemetry data structure is properly defined in coderd/telemetry/telemetry.go (the BoundaryUsageSummary struct), which is the source of truth referenced by the telemetry documentation.


Automated review via Coder Tasks

Contributor Author

I have a more comprehensive test I will add after this PR merges that will stitch in BoundaryLogsAPI to ensure Track() is called properly when the logs are received from the workspace agent.

@zedkipp zedkipp force-pushed the zedkipp/boundary-usage-telemetry-snapshot branch 2 times, most recently from 07eb313 to 778ad4f Compare January 27, 2026 22:44
@zedkipp zedkipp force-pushed the zedkipp/boundary-usage-telemetry-snapshot branch from 778ad4f to 0cf7cf9 Compare January 27, 2026 22:45
@f0ssel f0ssel left a comment

If possible would like another engineer's eyes here but everything here looks clean to me

return nil
}

//nolint:gocritic // This is the actual package doing boundary usage tracking.
Contributor

What is the lint complaining about here? Is there a more "proper" way to do this with dbauthz?

Contributor Author

A lint rule is flagging this as dangerous because there are some dbauthz.As<...> functions that are very powerful.

coder/scripts/rules.go

Lines 23 to 30 in 3eeeabf

// dbauthzAuthorizationContext is a lint rule that protects the usage of
// system contexts. This is a dangerous pattern that can lead to
// leaking database information as a system context can be essentially
// "sudo".
//
// Anytime a function like "AsSystem" is used, it should be accompanied by a comment
// explaining why it's ok and a nolint.
func dbauthzAuthorizationContext(m dsl.Matcher) {

This particular usage only gives access to boundary usage resources, and it's only being used for boundary usage tracking. I think the lint rule is pretty aggressive given some of the deprecated powerful functions, but there's not much risk here.

r.options.Logger.Debug(ctx, "boundary usage telemetry lock already claimed by another replica, skipping", slog.F("period_ending_at", periodEndingAt))
return nil, nil //nolint:nilnil // This is simple to handle when dealing with telemetry.
}
if err != nil {
Contributor

I'm a bit concerned that if any error other than unique_violation is returned, or if unique_violation is incorrectly wrapped, collectBoundaryUsageSummary will return an error, which could break the whole telemetry process?

But I see that aibridge uses the same approach, so it's probably okay.

Contributor Author

Yeah, everything in the snapshot seems to be all-or-nothing. I debated making the snapshot proceed if the boundary usage fails for whatever reason, but I couldn't really come up with a good reason to deviate from the prior art because the boundary telemetry should always work assuming there's no unique violation.

zedkipp commented Jan 28, 2026

For future readers: I tested this by launching the develop.sh script with Coder pointed at a local telemetry server: CODER_TELEMETRY=true CODER_TELEMETRY_URL=http://localhost:3001 ./scripts/develop.sh. Here’s an example of the telemetry snapshot the local telemetry server received:

{
  "aibridge_interceptions_summaries": null,
  <snip>
  "boundary_usage_summary": {
    "allowed_requests": 7,
    "denied_requests": 90,
    "unique_users": 1,
    "unique_workspaces": 1
  },
  "cli_invocations": null,
  "deployment_id": "bc7037e3-51f4-4ef8-9270-85535b93c3f8",
  "external_provisioners": null,
  <snip>
}
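For anyone writing a local telemetry receiver to repeat this test, the snapshot fragment above could be decoded along these lines. This is a sketch under assumptions: the JSON keys come from the example payload, but the Go types, the `Snapshot` wrapper, and `parseSnapshot` are made up for illustration; the authoritative struct lives in coderd/telemetry/telemetry.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BoundaryUsageSummary mirrors the keys seen in the example payload.
// The field types are a guess, not copied from the Coder source.
type BoundaryUsageSummary struct {
	AllowedRequests  int64 `json:"allowed_requests"`
	DeniedRequests   int64 `json:"denied_requests"`
	UniqueUsers      int64 `json:"unique_users"`
	UniqueWorkspaces int64 `json:"unique_workspaces"`
}

// Snapshot is a hypothetical, heavily trimmed view of the payload.
// The summary is a pointer so that a null or absent field decodes to nil.
type Snapshot struct {
	DeploymentID         string                `json:"deployment_id"`
	BoundaryUsageSummary *BoundaryUsageSummary `json:"boundary_usage_summary"`
}

// parseSnapshot decodes the fields this example cares about and ignores
// every other key in the payload.
func parseSnapshot(data []byte) (*Snapshot, error) {
	var s Snapshot
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("decode snapshot: %w", err)
	}
	return &s, nil
}

func main() {
	payload := []byte(`{
	  "boundary_usage_summary": {
	    "allowed_requests": 7,
	    "denied_requests": 90,
	    "unique_users": 1,
	    "unique_workspaces": 1
	  },
	  "deployment_id": "bc7037e3-51f4-4ef8-9270-85535b93c3f8"
	}`)

	snap, err := parseSnapshot(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if snap.BoundaryUsageSummary == nil {
		fmt.Println("no boundary usage in this snapshot")
		return
	}
	fmt.Printf("%+v\n", *snap.BoundaryUsageSummary)
}
```

The nil check assumes that a snapshot taken while another replica holds the lock would carry a null summary; that behavior is inferred from the `nil, nil` return discussed in the review thread, not confirmed by it.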

@zedkipp zedkipp merged commit 2204731 into main Jan 28, 2026
57 checks passed
@zedkipp zedkipp deleted the zedkipp/boundary-usage-telemetry-snapshot branch January 28, 2026 02:11
@github-actions github-actions bot locked and limited conversation to collaborators Jan 28, 2026