fix(coderd): ensure agent WebSocket conn is cleaned up #19711

Merged
DanielleMaywood merged 4 commits into main from danielle/devcontainers/handle-websocket-conn-cleanup on Sep 5, 2025

Conversation


@DanielleMaywood commented Sep 5, 2025

Relates to #19449

This PR fixes an unfortunate bug in the watch workspace agent containers endpoint.

The /containers/watch endpoint works a little like this:

```mermaid
sequenceDiagram
  Client->>coderd: `GET /containers/watch`
  coderd->>Agent: `GET /containers/watch`
  Agent->>coderd: `ACCEPT`
  Agent<<->>coderd: WebSocket Connection
  coderd->>Client: `ACCEPT`
  coderd<<->>Client: WebSocket Connection
```

Unfortunately, when the connection between the Client and coderd was closed, the connection between coderd and the Agent was kept alive. Leaving this WebSocket connection alive meant that every 15 seconds a Heartbeat ping was sent between coderd and the agent. As this heartbeat traffic happened over the tailnet, the agent recorded this network traffic as workspace activity, which resulted in coderd bumping the lifetime of the workspace every minute.

The changes introduced in this PR ensure that when the heartbeat between the Client and coderd stops, a context is cancelled, which triggers cleanup of the WebSocket connection between coderd and the Agent.
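
To make the mechanism concrete, here is a minimal sketch of the handler wiring. This is an illustration, not the actual diff: the dial and relay helpers are hypothetical, and heartbeatClose stands in for httpapi.HeartbeatClose, whose exact signature may differ:

```go
import (
	"context"
	"net/http"

	"cdr.dev/slog"
	"github.com/coder/websocket"
)

// heartbeatClose pings the client periodically; when a ping fails or ctx
// is done, it runs exit() and closes the socket. Hypothetical stand-in
// for httpapi.HeartbeatClose.
func heartbeatClose(ctx context.Context, logger slog.Logger, exit func(), conn *websocket.Conn) {
	// ping loop elided
}

func watchContainersHandler(logger slog.Logger, dialAgent func(context.Context) (*websocket.Conn, error)) http.HandlerFunc {
	return func(rw http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithCancel(r.Context())
		defer cancel()

		// The coderd→Agent connection must not outlive Client→coderd.
		agentConn, err := dialAgent(ctx)
		if err != nil {
			http.Error(rw, err.Error(), http.StatusBadGateway)
			return
		}
		defer agentConn.Close(websocket.StatusGoingAway, "client disconnected")

		clientConn, err := websocket.Accept(rw, r, nil)
		if err != nil {
			return
		}
		defer clientConn.Close(websocket.StatusNormalClosure, "")

		// Before this PR: go httpapi.Heartbeat(ctx, clientConn) — a failed
		// ping only made the heartbeat goroutine return, leaving agentConn
		// open. Passing cancel as the exit callback means a dead client
		// connection cancels ctx, so the deferred cleanup actually runs.
		go heartbeatClose(ctx, logger, cancel, clientConn)

		relayContainerUpdates(ctx, agentConn, clientConn) // copy loop, elided
	}
}

func relayContainerUpdates(ctx context.Context, agent, client *websocket.Conn) {}
```

The key point is that cancel is handed to the heartbeat goroutine, so a dead client connection propagates to the agent side instead of being silently swallowed.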

@DanielleMaywood force-pushed the danielle/devcontainers/handle-websocket-conn-cleanup branch from 647ffa1 to 894e6ca on September 5, 2025 10:01
@DanielleMaywood marked this pull request as ready for review September 5, 2025 10:17
@DanielleMaywood marked this pull request as draft September 5, 2025 10:17
@DanielleMaywood force-pushed the danielle/devcontainers/handle-websocket-conn-cleanup branch from 894e6ca to d3c1dd5 on September 5, 2025 11:17
@DanielleMaywood force-pushed the danielle/devcontainers/handle-websocket-conn-cleanup branch from d3c1dd5 to 35a7440 on September 5, 2025 11:18
```go
// response to this issue: https://github.com/coder/coder/issues/19449

var (
	ctx = testutil.Context(t, testutil.WaitLong)
```
@DanielleMaywood

Unfortunately this test uses testutil.WaitLong because the WebSocket heartbeat is every 15 seconds. I'm not sure if we have any way to decrease this interval for the test, so we need to have a long enough timeout that the test can run for 15 seconds to trigger this heartbeat (which then closes the channel).

Member

That's unfortunate 😔. Perhaps we should have a WebsocketHeartbeatInterval on coderd options (requires updating the functions too of course).

@DanielleMaywood

I definitely agree although I think that is out-of-scope for this PR. Should we create an issue to track that?

Member

We could do something like

```go
type HeartbeatOption func(...)
func WithHeartbeatInterval(d time.Duration) HeartbeatOption
func Heartbeat(ctx context.Context, conn *websocket.Conn, opts ...HeartbeatOption) {...}
```

It might be worthwhile also unexporting httpapi.HeartbeatInterval if we do this.
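
Fleshed out, that suggestion might look something like the following sketch; none of this exists in coderd yet, and the 15-second default mirrors the current httpapi.HeartbeatInterval:

```go
import (
	"context"
	"time"

	"github.com/coder/websocket"
)

const defaultHeartbeatInterval = 15 * time.Second

type heartbeatOptions struct {
	interval time.Duration
}

type HeartbeatOption func(*heartbeatOptions)

// WithHeartbeatInterval overrides the ping interval so tests can use
// something far shorter than the 15-second production default.
func WithHeartbeatInterval(d time.Duration) HeartbeatOption {
	return func(o *heartbeatOptions) { o.interval = d }
}

func Heartbeat(ctx context.Context, conn *websocket.Conn, opts ...HeartbeatOption) {
	o := heartbeatOptions{interval: defaultHeartbeatInterval}
	for _, opt := range opts {
		opt(&o)
	}
	ticker := time.NewTicker(o.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		// Ping waits for the peer's pong; an error means the connection
		// is dead and the heartbeat loop should stop.
		pingCtx, cancel := context.WithTimeout(ctx, o.interval)
		err := conn.Ping(pingCtx)
		cancel()
		if err != nil {
			return
		}
	}
}
```

A test could then call Heartbeat(ctx, conn, WithHeartbeatInterval(100*time.Millisecond)) and avoid waiting out the 15-second default.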

@DanielleMaywood marked this pull request as ready for review September 5, 2025 12:18
@mafredri left a comment (Member)

How does this change help the underlying issue?

To me, this issue seems to potentially originate from wsjson, where it reads indefinitely:

```go
// we don't use d.ctx here because it only gets canceled after closing the connection
// and a "connection closed" type error is more clear than context canceled.
typ, b, err := d.conn.Read(context.Background())
```

Edit: To clarify, I don't see how this change (addition of context) changes the behavior. Previously we already did WatchContainer and defer closer.Close(). This calls Close on the underlying websocket.Conn which should disconnect and stop the reads.

The new context does not change how, e.g., the wsjson channel behaves, since it will be cancelled only after returning from the handler. But Close should already give the same behavior.

@DanielleMaywood

> How does this change help the underlying issue?

I want to preface this with the understanding that my networking fundamentals are a bit poor but:

go httpapi.Heartbeat spins up a heartbeat goroutine, but when that heartbeat fails due to a closed connection, it just returns. We have no knowledge that the connection has closed. So, for as long as we don't attempt to write anything down the pipe, we're completely unaware of the broken connection. By switching to go httpapi.HeartbeatClose, we can explicitly cancel our context and force a cleanup when the heartbeat fails.

> To me, this issue seems to potentially originate from wsjson, where it reads indefinitely:

```go
// we don't use d.ctx here because it only gets canceled after closing the connection
// and a "connection closed" type error is more clear than context canceled.
typ, b, err := d.conn.Read(context.Background())
```

Forgive me, but I don't think we're using wsjson anywhere here? We also never read from the client connection.
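
The distinction described above, as a minimal sketch (simplified: the real httpapi helpers also manage ping intervals and close the socket, and ping here is a hypothetical wrapper):

```go
import (
	"context"

	"cdr.dev/slog"
	"github.com/coder/websocket"
)

// ping is a hypothetical helper wrapping conn.Ping with a timeout.
func ping(ctx context.Context, conn *websocket.Conn) error { return conn.Ping(ctx) }

// Heartbeat: when a ping fails, the goroutine just returns. The handler
// never learns the client is gone unless it later writes to the socket.
func Heartbeat(ctx context.Context, conn *websocket.Conn) {
	for {
		if err := ping(ctx, conn); err != nil {
			return // silent: nothing else is torn down
		}
	}
}

// HeartbeatClose: the same loop, but exit() always runs on the way out.
// Handing it the handler's cancel func means a dead Client→coderd
// connection cancels the context, which in turn cleans up coderd→Agent.
func HeartbeatClose(ctx context.Context, logger slog.Logger, exit func(), conn *websocket.Conn) {
	defer exit()
	for {
		if err := ping(ctx, conn); err != nil {
			logger.Debug(ctx, "heartbeat ping failed", slog.Error(err))
			return
		}
	}
}
```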

@mafredri commented Sep 5, 2025

> Forgive me, but I don't think we're using wsjson anywhere here? We also never read from the client connection.

We are in the WatchContainers routine. But I understand now that it's not related to the issue (based on your other comment).

@mafredri left a comment (Member)

Initially the mechanism by which we fix the issue was a bit lost on me, but I get it now. Small suggestion but otherwise LGTM.


@johnstcn left a comment (Member)

I wonder how many other usages of httpapi.Heartbeat need to be replaced by httpapi.HeartbeatClose? It seems like any API-initiated endpoint may be a candidate?

@mafredri commented Sep 5, 2025

> I wonder how many other usages of httpapi.Heartbeat need to be replaced by httpapi.HeartbeatClose? It seems like any API-initiated endpoint may be a candidate?

@johnstcn Potentially any endpoint that is only rarely writing or reading (or not at all) from the websocket established between browser and coderd. Not sure how common those are.

@DanielleMaywood

> I wonder how many other usages of httpapi.Heartbeat need to be replaced by httpapi.HeartbeatClose? It seems like any API-initiated endpoint may be a candidate?

Even if all of them are fine as they are, it is definitely quite easy for this to happen again. I think we should create a follow-up ticket to track this.

@DanielleMaywood merged commit e12b621 into main Sep 5, 2025
26 checks passed
@DanielleMaywood deleted the danielle/devcontainers/handle-websocket-conn-cleanup branch September 5, 2025 13:26
@github-actions bot locked and limited conversation to collaborators Sep 5, 2025
@DanielleMaywood added the cherry-pick/v2.25 and cherry-pick/v2.26 labels (needs cherry-pick to the 2.25 and 2.26 release branches) Sep 8, 2025