fix(coderd): ensure agent WebSocket conn is cleaned up#19711
fix(coderd): ensure agent WebSocket conn is cleaned up#19711DanielleMaywood merged 4 commits intomainfrom
Conversation
647ffa1 to
894e6ca
Compare
894e6ca to
d3c1dd5
Compare
d3c1dd5 to
35a7440
Compare
| // response to this issue: https://github.com/coder/coder/issues/19449 | ||
|
|
||
| var ( | ||
| ctx = testutil.Context(t, testutil.WaitLong) |
There was a problem hiding this comment.
Unfortunately this test uses testutil.WaitLong because the WebSocket heartbeat is every 15 seconds. I'm not sure if we have any way to decrease this interval for the test, so we need to have a long enough timeout that the test can run for 15 seconds to trigger this heartbeat (which then closes the channel).
There was a problem hiding this comment.
That's unfortunate 😔. Perhaps we should have a WebsocketHeartbeatInterval on coderd options (requires updating the functions too of course).
There was a problem hiding this comment.
I definitely agree although I think that is out-of-scope for this PR. Should we create an issue to track that?
There was a problem hiding this comment.
We could do something like
type HeartbeatOption func(...)
func WithHeartbeatInterval(d time.Duration) HeartbeatOption
func Heartbeat(ctx context.Context, conn *websocket.Conn, opts ...HeartbeatOption) {...}It might be worthwhile also unexporting httpapi.HeartbeatInterval if we do this.
There was a problem hiding this comment.
How does this change help the underlying issue?
To me this issue seems potentially originating from wsjson where it reads indefinitely:
// we don't use d.ctx here because it only gets canceled after closing the connection
// and a "connection closed" type error is more clear than context canceled.
typ, b, err := d.conn.Read(context.Background())
Edit: To clarify, I don't see how this change (addition of context) changes the behavior. Previously we already did WatchContainer and defer closer.Close(). This calls Close on the underlying websocket.Conn which should disconnect and stop the reads.
The new context does not change how e.g. wsjson channel behaves since it will be cancelled after returning from the handler. But Close should already give the same behavior.
I want to preface this with the understanding that my networking fundamentals are a bit poor but:
Forgive me but I don't think we're using |
We are in the |
mafredri
left a comment
There was a problem hiding this comment.
Initially the mechanism by which we fix the issue was a bit lost on me, but I get it now. Small suggestion but otherwise LGTM.
| // response to this issue: https://github.com/coder/coder/issues/19449 | ||
|
|
||
| var ( | ||
| ctx = testutil.Context(t, testutil.WaitLong) |
There was a problem hiding this comment.
That's unfortunate 😔. Perhaps we should have a WebsocketHeartbeatInterval on coderd options (requires updating the functions too of course).
johnstcn
left a comment
There was a problem hiding this comment.
I wonder how many other usages of httpapi.Heartbeat need to be replaced by httpapi.HeartbeatClose? It seems like any API-initiated endpoint may be a candidate?
@johnstcn Potentially any endpoint that is only rarely writing or reading (or not at all) from the websocket established between browser and coderd. Not sure how common those are. |
Even if all of them are fine as they are, it is definitely quite easy for this to happen again. I think we should create a follow-up ticket to track this |
Relates to #19449
This PR fixes an unfortunate bug in the watch workspace agent containers endpoint.
The
/containers/watchendpoint works a little like this:Unfortunately, when the connection between the Client and coderd was closed, the connection between coderd and the Agent was kept alive. Leaving this WebSocket connection alive meant that every 15 seconds a Heartbeat ping was sent between coderd and the agent. As this heartbeat traffic happened over the tailnet, the agent recorded this network traffic as workspace activity, which resulted in coderd bumping the lifetime of the workspace every minute.
The changes introduced in this PR ensure that when the heartbeat between the Client and coderd stop, it cancels a context which causes a cleanup for the WebSocket between coderd and the Agent.