[client] Close giveUpChan in connectWithRetryRuns defer

The trailing close(giveUpChan) at the bottom of the function only ran on
the backoff.Retry path. The DisableAutoConnect path returned early via
the if-block, skipping the close entirely. That branch is hit whenever
the active profile has auto-connect disabled — so every Down for those
profiles waited the full 5s timeout in the Down RPC select (and paid it
twice when two Downs queued up, since both captured the same
never-closing chan).

Move close(giveUpChan) into the existing defer so it fires on every
exit path: DisableAutoConnect return, backoff.Retry return, or panic.
The close happens after clientRunning=false is committed under the
mutex, so a Down/Up that wakes on the chan-close doesn't observe a
half-state where the chan is closed but clientRunning is still true.
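The ordering guarantee can be sketched as a self-contained Go program (the `runner` type is a hypothetical stand-in; only the clientRunning and giveUpChan names come from the diff):

```go
package main

import (
	"fmt"
	"sync"
)

// runner sketches the deferred-cleanup ordering from the commit:
// clientRunning is flipped to false under the mutex BEFORE giveUpChan
// is closed, so a waiter woken by the close always sees the final state.
type runner struct {
	mu            sync.Mutex
	clientRunning bool
}

func (r *runner) run(giveUpChan chan struct{}) {
	defer func() {
		r.mu.Lock()
		r.clientRunning = false
		r.mu.Unlock()
		if giveUpChan != nil {
			close(giveUpChan) // signal fires only after the state commit above
		}
	}()
	// ... connect/retry work would happen here ...
}

func main() {
	r := &runner{clientRunning: true}
	giveUp := make(chan struct{})
	go r.run(giveUp)

	<-giveUp // woken by the deferred close
	r.mu.Lock()
	fmt.Println(r.clientRunning) // always false: no half-state is visible
	r.mu.Unlock()
}
```

The Go memory model makes this sound: the store to clientRunning happens before the close (program order plus the mutex release), and the close happens before the receive completes, so the waiter can never observe the chan closed while clientRunning is still true.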

Updates the Down RPC comment to point at the deferred close as the
signal source, and reframes the 5s timeout warning as "the goroutine
is wedged in a slow teardown step" rather than the expected case.
Author: Zoltan Papp
Date:   2026-05-14 15:44:15 +02:00
parent e4eedbe18f
commit c0e7c61c4b


@@ -229,10 +229,20 @@ func (s *Server) Start() error {
 // mechanism to keep the client connected even when the connection is lost.
 // we cancel retry if the client receive a stop or down command, or if disable auto connect is configured.
 func (s *Server) connectWithRetryRuns(ctx context.Context, profileConfig *profilemanager.Config, statusRecorder *peer.Status, runningChan chan struct{}, giveUpChan chan struct{}) {
+	// close(giveUpChan) MUST run on every exit path (DisableAutoConnect
+	// return, backoff.Retry return, panic) — Down() blocks for up to 5s
+	// waiting on this signal before flipping the state to Idle, and a
+	// missed close leaves Down() always hitting the timeout. The signal
+	// fires AFTER clientRunning=false is committed under the mutex so a
+	// Down/Up racing with the goroutine exit never observes a half-state
+	// (chan closed but clientRunning still true).
 	defer func() {
 		s.mutex.Lock()
 		s.clientRunning = false
 		s.mutex.Unlock()
+		if giveUpChan != nil {
+			close(giveUpChan)
+		}
 	}()
 	if s.config.DisableAutoConnect {
@@ -287,10 +297,6 @@ func (s *Server) connectWithRetryRuns(ctx context.Context, profileConfig *profil
 	if err := backoff.Retry(runOperation, backOff); err != nil {
 		log.Errorf("operation failed: %v", err)
 	}
-
-	if giveUpChan != nil {
-		close(giveUpChan)
-	}
 }

 // loginAttempt attempts to login using the provided information. it returns a status in case something fails
@@ -903,11 +909,13 @@ func (s *Server) Down(ctx context.Context, _ *proto.DownRequest) (*proto.DownRes
 	// Wait for the connectWithRetryRuns goroutine to finish with a short timeout.
 	// This prevents the goroutine from setting ErrResetConnection after Down() returns.
-	// The giveUpChan is closed at the end of connectWithRetryRuns.
+	// The giveUpChan is closed by the goroutine's deferred cleanup (see
+	// connectWithRetryRuns) on every exit path. A timeout here typically
+	// means the goroutine is still wedged inside a slow teardown step.
 	if giveUpChan != nil {
 		select {
 		case <-giveUpChan:
-			log.Debugf("client goroutine finished successfully")
+			log.Debugf("client goroutine finished, giveUpChan closed")
 		case <-time.After(5 * time.Second):
 			log.Warnf("timeout waiting for client goroutine to finish, proceeding anyway")
 		}