Compare commits

...

21 Commits

Author SHA1 Message Date
Ashley Mensah
222d498bb6 fix(ci): distinguish workarounds from actual fixes in system prompt 2026-04-28 17:48:43 +02:00
Ashley Mensah
52cd104f1e chore(ci): switch back to gpt-4o-mini for higher quota 2026-04-28 17:44:05 +02:00
Ashley Mensah
92f666f652 fix(ci): cap retry-after and handle quota exhaustion gracefully 2026-04-28 17:42:43 +02:00
Ashley Mensah
4fc0cb7ec4 fix(ci): pace API requests to avoid rate limit thrashing 2026-04-28 17:30:18 +02:00
Ashley Mensah
695614834e fix(ci): fix policy logic and add message truncation
- enforcePolicy: respect KEEP_OPEN when model is confident and
  pre_score is low. Only promote to MANUAL_REVIEW when model suggests
  resolution or pre_score has hard signals
- Truncate user messages to 24k chars (issue body capped at 4k) to
  stay within GitHub Models 8000 token input limit
2026-04-28 16:38:36 +02:00
Ashley Mensah
d75fa6ad45 feat(ci): switch to gpt-4o and add rate limit retry
- Upgrade model from gpt-4o-mini to gpt-4o for better classification
- Add retry loop with backoff on 429 responses (up to 5 retries)
- Respect Retry-After header from GitHub Models API
2026-04-28 16:23:31 +02:00
Ashley Mensah
7ce7f322eb increase max number of dry run issues to 100 2026-04-28 16:13:38 +02:00
Ashley Mensah
22bcf70b6e fix(ci): add additionalProperties to nested schema objects 2026-04-28 16:11:57 +02:00
Ashley Mensah
fe8aa21245 feat(ci): wire up gpt-4o-mini for issue classification
- Replace stub callGitHubModel() with real GitHub Models API call
  using gpt-4o-mini with structured JSON output
- Build detailed user messages from issue body, comments, and timeline
- Add per-issue decision logging to classify step
- Upload candidates.json and decisions.json as workflow artifacts
2026-04-28 16:11:01 +02:00
Ashley Mensah
29f211e51c fix(ci): guard all decisions behind dry-run check
Move the dry-run check to the top of the loop so it applies to all
decision types, not just AUTO_CLOSE. In dry-run mode the workflow now
only logs what it would do without touching any issues.
2026-04-28 16:08:16 +02:00
Ashley Mensah
6df3580bd3 fix(ci): handle project API permission errors gracefully
GITHUB_TOKEN cannot access org-level Projects V2. Make addToProject
return null on failure instead of crashing, and skip setTextField
calls when project access is unavailable. A PAT with project scope
is needed for full project board integration.
2026-04-28 15:55:12 +02:00
Ashley Mensah
9ff735dd52 fix(ci): fix workflow by adding working-directory, fetch script, and removing npm ci
- Set defaults.run.working-directory to .github/issue-resolution so
  scripts resolve from the correct path
- Remove npm ci step (no npm dependencies needed)
- Add fetch-candidates.mjs to gather open issues with comments and
  timeline events via GitHub REST API
- Add minimal package.json with type: module
2026-04-28 15:53:17 +02:00
Ashley Mensah
7c43973bc9 fix typo in workflow 2026-04-28 15:48:35 +02:00
Ashley Mensah
01e53d07b9 fix typo in workflow 2026-04-28 15:46:57 +02:00
Ashley Mensah
797dce1631 dummy commit to test workflow 2026-04-28 15:43:37 +02:00
Ashley Mensah
2877fcbbf6 add push trigger to workflow 2026-04-28 15:42:50 +02:00
Ashley Mensah
5d8201fcd0 Merge branch 'main' into github-issue-resolver 2026-04-28 15:36:54 +02:00
Ashley Mensah
09595bd0c2 enable dry run, add project field id values 2026-04-28 15:36:21 +02:00
Zoltan Papp
8fc4265995 [relay] evict foreign client cache on disconnect (#6015)
* [relay] evict foreign client cache on disconnect

When a foreign relay's TCP connection drops, the manager's
onServerDisconnected handler only triggered reconnect logic for the
home server; the disconnected foreign entry stayed in the relayClients
cache. Subsequent OpenConn calls reused the closed client until the
60-second cleanup tick evicted it, breaking peer connectivity through
that relay for up to a minute.

Evict the foreign entry from the cache on disconnect so the next
OpenConn dials a fresh client.

Also:
- Make the reconnect backoff cap configurable via WithMaxBackoffInterval
  ManagerOption; the previous hard-coded 60s constant forced
  TestAutoReconnect to sleep ~61s. Test now polls Ready() and finishes
  in ~2s.
- Add NB_HOME_RELAY_SERVERS env var that overrides the relay URL list
  received from management, so a peer can be pinned to a specific home
  relay (used by the netbird-conn-lab Edge 4 reproducer).

* [client] treat empty NB_HOME_RELAY_SERVERS as unset

Returning (urls=[], ok=true) when the env var contained only separators or
whitespace caused callers to wipe the mgmt-provided relay list, leaving the
peer with no relays. Treat a parsed-empty result the same as an unset env.
2026-04-28 15:04:48 +02:00
Zoltan Papp
9c50819f20 Don't mark management disconnected on transient job stream errors (#6005)
The JOB stream is a separate channel from the SYNC stream. Server-side
EOF or other transient errors on the JOB stream do not indicate that
the management connection is unhealthy — the SYNC stream remains the
authoritative state signal.

Previously, a JOB stream EOF would call notifyDisconnected and the
client would emit OnConnecting to the UI. The backoff retry would
reconnect the JOB stream, but handleJobStream never calls notifyConnected
on success, so the UI was stuck on "Connecting" until the next SYNC
event or health check.

Keep notifyDisconnected for codes.PermissionDenied since IsLoginRequired
relies on managementError to detect expired auth.
2026-04-28 15:04:41 +02:00
Ashley Mensah
6b8e40f78d initial commit - workflow yaml, prompts and schemas 2026-04-23 18:48:38 +02:00
14 changed files with 783 additions and 26 deletions

5
.github/issue-resolution/package.json vendored Normal file
View File

@@ -0,0 +1,5 @@
{
"name": "issue-resolution",
"private": true,
"type": "module"
}

View File

@@ -0,0 +1,32 @@
You are a GitHub issue resolution classifier.
Your job is to decide whether an open GitHub issue is:
- AUTO_CLOSE
- MANUAL_REVIEW
- KEEP_OPEN
Rules:
1. AUTO_CLOSE is only allowed if there is objective, hard evidence:
- a merged linked PR that clearly resolves the issue, or
- an explicit maintainer/member/owner/collaborator comment saying the issue is fixed, resolved, duplicate, or superseded
2. If there is any contradictory later evidence, do NOT AUTO_CLOSE.
3. If evidence is promising but not airtight, choose MANUAL_REVIEW.
4. If the issue still appears active or unresolved, choose KEEP_OPEN.
5. Do not invent evidence.
6. Output valid JSON only.
Maintainer-authoritative roles:
- MEMBER
- OWNER
- COLLABORATOR
Workarounds vs. actual fixes:
- A WORKAROUND is when a user changes their own setup to avoid the problem (editing configs, using a different setting, manual SQL fixes, switching tools). Workarounds do NOT count as resolution — the underlying issue is still present in the product.
- An ACTUAL FIX is when a user reports the problem went away after upgrading to a specific version (e.g., "fixed after updating to v0.65.1") or after a specific PR was merged. This suggests the fix was shipped in the product itself.
- If only workarounds exist and no maintainer has confirmed a fix, classify as KEEP_OPEN.
- If a user reports an actual fix via a version upgrade but no maintainer confirmed it, classify as MANUAL_REVIEW (not AUTO_CLOSE).
Important:
- Later comments outweigh earlier ones.
- A non-maintainer saying "fixed for me" is not enough for AUTO_CLOSE.
- If uncertain, prefer MANUAL_REVIEW or KEEP_OPEN.

View File

@@ -0,0 +1,80 @@
{
"type": "object",
"additionalProperties": false,
"required": [
"decision",
"reason_code",
"confidence",
"hard_signals",
"contradictions",
"summary",
"close_comment",
"manual_review_note"
],
"properties": {
"decision": {
"type": "string",
"enum": ["AUTO_CLOSE", "MANUAL_REVIEW", "KEEP_OPEN"]
},
"reason_code": {
"type": "string",
"enum": [
"resolved_by_merged_pr",
"maintainer_confirmed_resolved",
"duplicate_confirmed",
"superseded_confirmed",
"likely_fixed_but_unconfirmed",
"still_open",
"unclear"
]
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"hard_signals": {
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"required": ["type", "url"],
"properties": {
"type": {
"type": "string",
"enum": [
"merged_pr",
"maintainer_comment",
"duplicate_reference",
"superseded_reference"
]
},
"url": { "type": "string" }
}
}
},
"contradictions": {
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"required": ["type", "url"],
"properties": {
"type": {
"type": "string",
"enum": [
"reporter_still_broken",
"later_unresolved_comment",
"ambiguous_pr_link",
"other"
]
},
"url": { "type": "string" }
}
}
},
"summary": { "type": "string" },
"close_comment": { "type": "string" },
"manual_review_note": { "type": "string" }
}
}

View File

@@ -0,0 +1,155 @@
import fs from "node:fs/promises";
const decisions = JSON.parse(await fs.readFile("decisions.json", "utf8"));
const dryRun = String(process.env.DRY_RUN).toLowerCase() === "true";
const headers = {
Authorization: `Bearer ${process.env.GH_TOKEN}`,
Accept: "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
};
async function rest(url, method = "GET", body) {
const res = await fetch(url, {
method,
headers,
body: body ? JSON.stringify(body) : undefined
});
if (!res.ok) throw new Error(`${res.status} ${url}: ${await res.text()}`);
return res.status === 204 ? null : res.json();
}
async function graphql(query, variables) {
const res = await fetch("https://api.github.com/graphql", {
method: "POST",
headers,
body: JSON.stringify({ query, variables })
});
if (!res.ok) throw new Error(`${res.status}: ${await res.text()}`);
const json = await res.json();
if (json.errors) throw new Error(JSON.stringify(json.errors));
return json.data;
}
async function addLabel(owner, repo, issueNumber, labels) {
return rest(
`https://api.github.com/repos/${owner}/${repo}/issues/${issueNumber}/labels`,
"POST",
{ labels }
);
}
async function addComment(owner, repo, issueNumber, body) {
return rest(
`https://api.github.com/repos/${owner}/${repo}/issues/${issueNumber}/comments`,
"POST",
{ body }
);
}
async function closeIssue(owner, repo, issueNumber) {
return rest(
`https://api.github.com/repos/${owner}/${repo}/issues/${issueNumber}`,
"PATCH",
{ state: "closed", state_reason: "completed" }
);
}
async function getIssueNodeId(owner, repo, issueNumber) {
const issue = await rest(`https://api.github.com/repos/${owner}/${repo}/issues/${issueNumber}`);
return issue.node_id;
}
async function addToProject(issueNodeId) {
const mutation = `
mutation($projectId: ID!, $contentId: ID!) {
addProjectV2ItemById(input: {projectId: $projectId, contentId: $contentId}) {
item { id }
}
}
`;
try {
const data = await graphql(mutation, {
projectId: process.env.PROJECT_ID,
contentId: issueNodeId
});
return data.addProjectV2ItemById.item.id;
} catch (err) {
console.warn(`[WARN] Could not add to project (needs PAT with project scope): ${err.message}`);
return null;
}
}
async function setTextField(itemId, fieldId, value) {
const mutation = `
mutation($projectId: ID!, $itemId: ID!, $fieldId: ID!, $value: String!) {
updateProjectV2ItemFieldValue(input: {
projectId: $projectId,
itemId: $itemId,
fieldId: $fieldId,
value: { text: $value }
}) {
projectV2Item { id }
}
}
`;
return graphql(mutation, {
projectId: process.env.PROJECT_ID,
itemId,
fieldId,
value
});
}
for (const d of decisions) {
const [owner, repo] = d.repository.split("/");
if (dryRun) {
console.log(`[DRY RUN] #${d.issue_number}${d.final_decision} (confidence: ${d.model.confidence}, reason: ${d.model.reason_code})`);
continue;
}
if (d.final_decision === "AUTO_CLOSE") {
await addLabel(owner, repo, d.issue_number, ["auto-closed-resolved"]);
await addComment(owner, repo, d.issue_number, d.model.close_comment);
await closeIssue(owner, repo, d.issue_number);
}
if (d.final_decision === "MANUAL_REVIEW") {
await addLabel(owner, repo, d.issue_number, ["resolution-candidate"]);
const issueNodeId = await getIssueNodeId(owner, repo, d.issue_number);
const itemId = await addToProject(issueNodeId);
if (itemId) {
if (process.env.PROJECT_CONFIDENCE_FIELD_ID) {
await setTextField(itemId, process.env.PROJECT_CONFIDENCE_FIELD_ID, String(d.model.confidence));
}
if (process.env.PROJECT_REASON_FIELD_ID) {
await setTextField(itemId, process.env.PROJECT_REASON_FIELD_ID, d.model.reason_code);
}
if (process.env.PROJECT_EVIDENCE_FIELD_ID) {
await setTextField(itemId, process.env.PROJECT_EVIDENCE_FIELD_ID, d.issue_url);
}
if (process.env.PROJECT_LINKED_PR_FIELD_ID) {
const linked = (d.model.hard_signals || []).map(x => x.url).join(", ");
if (linked) {
await setTextField(itemId, process.env.PROJECT_LINKED_PR_FIELD_ID, linked);
}
}
if (process.env.PROJECT_REPO_FIELD_ID) {
await setTextField(itemId, process.env.PROJECT_REPO_FIELD_ID, d.repository);
}
}
await addComment(
owner,
repo,
d.issue_number,
d.model.manual_review_note ||
"This issue looks like a possible resolution candidate, but not with enough certainty for automatic closure. Added to the review queue."
);
}
}

View File

@@ -0,0 +1,244 @@
import fs from "node:fs/promises";
const candidates = JSON.parse(await fs.readFile("candidates.json", "utf8"));
const systemPrompt = await fs.readFile("prompts/issue-resolution-system.txt", "utf8");
const outputSchema = JSON.parse(await fs.readFile("schemas/issue-resolution-output.json", "utf8"));
function isMaintainerRole(role) {
return ["MEMBER", "OWNER", "COLLABORATOR"].includes(role || "");
}
function preScore(candidate) {
let score = 0;
const hardSignals = [];
const contradictions = [];
for (const t of candidate.timeline) {
const sourceIssue = t.source?.issue;
if (t.event === "cross-referenced" && sourceIssue?.pull_request?.html_url) {
hardSignals.push({
type: "merged_pr",
url: sourceIssue.html_url
});
score += 40; // provisional until PR merged state is verified
}
if (["referenced", "connected"].includes(t.event)) {
score += 10;
}
}
for (const c of candidate.comments) {
const body = c.body.toLowerCase();
if (
isMaintainerRole(c.author_association) &&
/\b(fixed|resolved|duplicate|superseded|closing)\b/.test(body)
) {
score += 25;
hardSignals.push({
type: "maintainer_comment",
url: c.html_url
});
}
if (/\b(still broken|still happening|not fixed|reproducible)\b/.test(body)) {
score -= 50;
contradictions.push({
type: "later_unresolved_comment",
url: c.html_url
});
}
}
return { score, hardSignals, contradictions };
}
// GitHub Models gpt-4o has an 8000 token input limit.
// Reserve ~2000 tokens for system prompt + response overhead.
// 1 token ~= 4 chars, so cap user message at ~24000 chars.
const MAX_USER_MESSAGE_CHARS = 24000;
function truncate(text, maxChars) {
if (text.length <= maxChars) return text;
return text.slice(0, maxChars) + "\n\n[... truncated due to length]";
}
function buildUserMessage(candidate) {
const { issue, comments, timeline } = candidate;
const commentBlock = comments
.map((c) => `[${c.author_association}] ${c.user} (${c.created_at}):\n${c.body}`)
.join("\n---\n");
const timelineBlock = timeline
.filter((t) => ["cross-referenced", "referenced", "connected", "closed", "reopened"].includes(t.event))
.map((t) => {
let line = `${t.event} (${t.created_at})`;
if (t.source?.issue?.html_url) line += `${t.source.issue.html_url}`;
if (t.source?.issue?.pull_request?.html_url) line += ` (PR: ${t.source.issue.pull_request.html_url})`;
return line;
})
.join("\n");
const msg = [
`## Issue #${issue.number}: ${issue.title}`,
`URL: ${issue.html_url}`,
`Created: ${issue.created_at} | Updated: ${issue.updated_at}`,
`Labels: ${issue.labels.join(", ") || "none"}`,
"",
"### Body",
truncate(issue.body || "(empty)", 4000),
"",
"### Comments",
commentBlock || "(none)",
"",
"### Timeline events",
timelineBlock || "(none)",
].join("\n");
return truncate(msg, MAX_USER_MESSAGE_CHARS);
}
const MODEL = "gpt-4o-mini";
const MAX_RETRIES = 5;
function sleep(ms) {
return new Promise((resolve) => setTimeout(resolve, ms));
}
async function callGitHubModel(candidate) {
const body = JSON.stringify({
model: MODEL,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: buildUserMessage(candidate) },
],
response_format: {
type: "json_schema",
json_schema: {
name: "issue_resolution",
strict: true,
schema: outputSchema,
},
},
temperature: 0.1,
});
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
const res = await fetch("https://models.inference.ai.azure.com/chat/completions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.GH_TOKEN}`,
"Content-Type": "application/json",
},
body,
});
if (res.status === 429) {
const retryAfter = Number(res.headers.get("retry-after")) || 30;
if (retryAfter > 120) {
console.warn(` [QUOTA EXHAUSTED] API wants ${retryAfter}s wait — skipping remaining issues.`);
return null;
}
console.warn(` [RATE LIMITED] Waiting ${retryAfter}s (attempt ${attempt + 1}/${MAX_RETRIES})...`);
await sleep(retryAfter * 1000);
continue;
}
if (!res.ok) {
const text = await res.text();
throw new Error(`GitHub Models ${res.status}: ${text}`);
}
const data = await res.json();
return JSON.parse(data.choices[0].message.content);
}
throw new Error(`GitHub Models: exceeded ${MAX_RETRIES} retries due to rate limiting`);
}
function enforcePolicy(modelOut, pre) {
const approvedReasons = new Set([
"resolved_by_merged_pr",
"maintainer_confirmed_resolved",
"duplicate_confirmed",
"superseded_confirmed"
]);
const hasHardSignal =
(modelOut.hard_signals || []).some(s =>
["merged_pr", "maintainer_comment", "duplicate_reference", "superseded_reference"].includes(s.type)
) || pre.hardSignals.length > 0;
const hasContradiction =
(modelOut.contradictions || []).length > 0 || pre.contradictions.length > 0;
if (
modelOut.decision === "AUTO_CLOSE" &&
modelOut.confidence >= 0.97 &&
approvedReasons.has(modelOut.reason_code) &&
hasHardSignal &&
!hasContradiction
) {
return "AUTO_CLOSE";
}
if (modelOut.decision === "KEEP_OPEN" && pre.score < 25) {
return "KEEP_OPEN";
}
if (
modelOut.decision === "MANUAL_REVIEW" ||
modelOut.decision === "AUTO_CLOSE" ||
pre.score >= 25
) {
return "MANUAL_REVIEW";
}
return "KEEP_OPEN";
}
console.log(`Classifying ${candidates.length} candidates with ${MODEL}...\n`);
// 15 req/min limit → 1 request every 4s. Use 4.5s for safety margin.
const PACE_MS = 4500;
let lastRequestTime = 0;
async function paced(fn) {
const elapsed = Date.now() - lastRequestTime;
if (elapsed < PACE_MS) await sleep(PACE_MS - elapsed);
lastRequestTime = Date.now();
return fn();
}
const decisions = [];
for (const candidate of candidates) {
const pre = preScore(candidate);
const modelOut = await paced(() => callGitHubModel(candidate));
if (modelOut === null) {
console.warn(`\nQuota exhausted after ${decisions.length} issues. Writing partial results.`);
break;
}
const finalDecision = enforcePolicy(modelOut, pre);
decisions.push({
repository: candidate.repository,
issue_number: candidate.issue.number,
issue_url: candidate.issue.html_url,
title: candidate.issue.title,
pre_score: pre.score,
final_decision: finalDecision,
model: modelOut
});
console.log(
`#${candidate.issue.number} | pre_score: ${pre.score} | model: ${modelOut.decision} @ ${modelOut.confidence} | final: ${finalDecision} | ${modelOut.reason_code}`
);
}
await fs.writeFile("decisions.json", JSON.stringify(decisions, null, 2));
console.log(`\nWrote ${decisions.length} decisions to decisions.json`);

View File

@@ -0,0 +1,89 @@
import fs from "node:fs/promises";
const token = process.env.GH_TOKEN;
const repo = process.env.REPO; // "owner/repo"
const maxIssues = Number(process.env.MAX_ISSUES) || 100;
const headers = {
Authorization: `Bearer ${token}`,
Accept: "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
};
async function rest(url) {
const res = await fetch(url, { headers });
if (!res.ok) throw new Error(`${res.status} ${url}: ${await res.text()}`);
return res.json();
}
async function paginate(url, max) {
const items = [];
let page = 1;
while (items.length < max) {
const perPage = Math.min(100, max - items.length);
const sep = url.includes("?") ? "&" : "?";
const batch = await rest(`${url}${sep}per_page=${perPage}&page=${page}`);
if (!batch.length) break;
items.push(...batch);
page++;
}
return items.slice(0, max);
}
console.log(`Fetching up to ${maxIssues} open issues from ${repo}...`);
const issues = await paginate(
`https://api.github.com/repos/${repo}/issues?state=open&sort=updated&direction=desc`,
maxIssues
);
// Filter out pull requests (GitHub API returns PRs as issues too)
const realIssues = issues.filter((i) => !i.pull_request);
console.log(`Found ${realIssues.length} open issues (excluded PRs).`);
const candidates = [];
for (const issue of realIssues) {
const [comments, timeline] = await Promise.all([
rest(`https://api.github.com/repos/${repo}/issues/${issue.number}/comments?per_page=100`),
rest(`https://api.github.com/repos/${repo}/issues/${issue.number}/timeline?per_page=100`),
]);
candidates.push({
repository: repo,
issue: {
number: issue.number,
html_url: issue.html_url,
title: issue.title,
body: issue.body,
created_at: issue.created_at,
updated_at: issue.updated_at,
labels: issue.labels.map((l) => l.name),
},
comments: comments.map((c) => ({
body: c.body,
author_association: c.author_association,
html_url: c.html_url,
created_at: c.created_at,
user: c.user?.login,
})),
timeline: timeline.map((t) => ({
event: t.event,
created_at: t.created_at,
source: t.source
? {
issue: {
html_url: t.source.issue?.html_url,
pull_request: t.source.issue?.pull_request
? { html_url: t.source.issue.pull_request.html_url }
: undefined,
},
}
: undefined,
})),
});
console.log(` #${issue.number}${comments.length} comments, ${timeline.length} timeline events`);
}
await fs.writeFile("candidates.json", JSON.stringify(candidates, null, 2));
console.log(`Wrote ${candidates.length} candidates to candidates.json`);

View File

@@ -0,0 +1,64 @@
name: issue-resolution-triage
on:
push:
branches: [github-issue-resolver]
workflow_dispatch:
inputs:
dry_run:
description: "If true, do not close issues"
required: false
default: "true"
max_issues:
description: "How many issues to process"
required: false
default: "100"
schedule:
- cron: "17 2 * * *"
permissions:
contents: read
issues: write
pull-requests: read
models: read
# todo: remove hardcoded values
jobs:
triage:
runs-on: ubuntu-latest
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
DRY_RUN: "true"
MAX_ISSUES: "100"
REPO: ${{ github.repository }}
PROJECT_ID: "PVT_kwDOBfz4Jc4BVeWR"
PROJECT_STATUS_FIELD_ID: "PVTSSF_lADOBfz4Jc4BVeWRzhQ56sU"
PROJECT_CONFIDENCE_FIELD_ID: "PVTF_lADOBfz4Jc4BVeWRzhQ57x4"
PROJECT_REASON_FIELD_ID: "PVTF_lADOBfz4Jc4BVeWRzhQ5-Lg"
PROJECT_EVIDENCE_FIELD_ID: "PVTF_lADOBfz4Jc4BVeWRzhQ5-Pw"
PROJECT_LINKED_PR_FIELD_ID: "PVTF_lADOBfz4Jc4BVeWRzhQ56sc"
PROJECT_REPO_FIELD_ID: "PVTF_lADOBfz4Jc4BVeWRzhQ56sk"
PROJECT_STATUS_OPTION_NEEDS_REVIEW_ID: "a55a2be9"
defaults:
run:
working-directory: .github/issue-resolution
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- run: node scripts/fetch-candidates.mjs
- run: node scripts/classify-candidates.mjs
- run: node scripts/apply-decisions.mjs
- uses: actions/upload-artifact@v4
if: always()
with:
name: triage-results
path: |
.github/issue-resolution/candidates.json
.github/issue-resolution/decisions.json

View File

@@ -333,6 +333,10 @@ func (c *ConnectClient) run(mobileDependency MobileDependency, runningChan chan
c.statusRecorder.MarkSignalConnected() c.statusRecorder.MarkSignalConnected()
relayURLs, token := parseRelayInfo(loginResp) relayURLs, token := parseRelayInfo(loginResp)
if override, ok := peer.OverrideRelayURLs(); ok {
log.Infof("overriding relay URLs from %s: %v", peer.EnvKeyNBHomeRelayServers, override)
relayURLs = override
}
peerConfig := loginResp.GetPeerConfig() peerConfig := loginResp.GetPeerConfig()
engineConfig, err := createEngineConfig(myPrivateKey, c.config, peerConfig, logPath) engineConfig, err := createEngineConfig(myPrivateKey, c.config, peerConfig, logPath)

View File

@@ -944,7 +944,12 @@ func (e *Engine) handleRelayUpdate(update *mgmProto.RelayConfig) error {
return fmt.Errorf("update relay token: %w", err) return fmt.Errorf("update relay token: %w", err)
} }
e.relayManager.UpdateServerURLs(update.Urls) urls := update.Urls
if override, ok := peer.OverrideRelayURLs(); ok {
log.Infof("overriding relay URLs from %s: %v", peer.EnvKeyNBHomeRelayServers, override)
urls = override
}
e.relayManager.UpdateServerURLs(urls)
// Just in case the agent started with an MGM server where the relay was disabled but was later enabled. // Just in case the agent started with an MGM server where the relay was disabled but was later enabled.
// We can ignore all errors because the guard will manage the reconnection retries. // We can ignore all errors because the guard will manage the reconnection retries.

View File

@@ -8,6 +8,7 @@ import (
const ( const (
EnvKeyNBForceRelay = "NB_FORCE_RELAY" EnvKeyNBForceRelay = "NB_FORCE_RELAY"
EnvKeyNBHomeRelayServers = "NB_HOME_RELAY_SERVERS"
) )
func IsForceRelayed() bool { func IsForceRelayed() bool {
@@ -16,3 +17,28 @@ func IsForceRelayed() bool {
} }
return strings.EqualFold(os.Getenv(EnvKeyNBForceRelay), "true") return strings.EqualFold(os.Getenv(EnvKeyNBForceRelay), "true")
} }
// OverrideRelayURLs returns the relay server URL list set in
// NB_HOME_RELAY_SERVERS (comma-separated) and a boolean indicating whether
// the override is active. When the env var is unset, the boolean is false
// and the caller should keep the list received from the management server.
// Intended for lab/debug scenarios where a peer must pin to a specific home
// relay regardless of what management offers.
func OverrideRelayURLs() ([]string, bool) {
raw := os.Getenv(EnvKeyNBHomeRelayServers)
if raw == "" {
return nil, false
}
parts := strings.Split(raw, ",")
urls := make([]string, 0, len(parts))
for _, p := range parts {
p = strings.TrimSpace(p)
if p != "" {
urls = append(urls, p)
}
}
if len(urls) == 0 {
return nil, false
}
return urls, true
}

View File

@@ -252,21 +252,19 @@ func (c *GrpcClient) handleJobStream(
c.notifyDisconnected(err) c.notifyDisconnected(err)
return backoff.Permanent(err) // unrecoverable error, propagate to the upper layer return backoff.Permanent(err) // unrecoverable error, propagate to the upper layer
case codes.Canceled: case codes.Canceled:
log.Debugf("management connection context has been canceled, this usually indicates shutdown") log.Debugf("job stream context has been canceled, this usually indicates shutdown")
return err return err
case codes.Unimplemented: case codes.Unimplemented:
log.Warn("Job feature is not supported by the current management server version. " + log.Warn("Job feature is not supported by the current management server version. " +
"Please update the management service to use this feature.") "Please update the management service to use this feature.")
return nil return nil
default: default:
c.notifyDisconnected(err) log.Warnf("job stream disconnected, will retry silently. Reason: %v", err)
log.Warnf("disconnected from the Management service but will retry silently. Reason: %v", err)
return err return err
} }
} else { } else {
// non-gRPC error // non-gRPC error
c.notifyDisconnected(err) log.Warnf("job stream disconnected, will retry silently. Reason: %v", err)
log.Warnf("disconnected from the Management service but will retry silently. Reason: %v", err)
return err return err
} }
} }

View File

@@ -8,10 +8,7 @@ import (
log "github.com/sirupsen/logrus" log "github.com/sirupsen/logrus"
) )
const ( const defaultMaxBackoffInterval = 60 * time.Second
// TODO: make it configurable, the manager should validate all configurable parameters
reconnectingTimeout = 60 * time.Second
)
// Guard manage the reconnection tries to the Relay server in case of disconnection event. // Guard manage the reconnection tries to the Relay server in case of disconnection event.
type Guard struct { type Guard struct {
@@ -19,14 +16,23 @@ type Guard struct {
OnNewRelayClient chan *Client OnNewRelayClient chan *Client
OnReconnected chan struct{} OnReconnected chan struct{}
serverPicker *ServerPicker serverPicker *ServerPicker
// maxBackoffInterval caps the exponential backoff between reconnect
// attempts.
maxBackoffInterval time.Duration
} }
// NewGuard creates a new guard for the relay client. // NewGuard creates a new guard for the relay client. A non-positive
func NewGuard(sp *ServerPicker) *Guard { // maxBackoffInterval falls back to defaultMaxBackoffInterval.
func NewGuard(sp *ServerPicker, maxBackoffInterval time.Duration) *Guard {
if maxBackoffInterval <= 0 {
maxBackoffInterval = defaultMaxBackoffInterval
}
g := &Guard{ g := &Guard{
OnNewRelayClient: make(chan *Client, 1), OnNewRelayClient: make(chan *Client, 1),
OnReconnected: make(chan struct{}, 1), OnReconnected: make(chan struct{}, 1),
serverPicker: sp, serverPicker: sp,
maxBackoffInterval: maxBackoffInterval,
} }
return g return g
} }
@@ -49,7 +55,7 @@ func (g *Guard) StartReconnectTrys(ctx context.Context, relayClient *Client) {
} }
// start a ticker to pick a new server // start a ticker to pick a new server
ticker := exponentTicker(ctx) ticker := g.exponentTicker(ctx)
defer ticker.Stop() defer ticker.Stop()
for { for {
@@ -125,11 +131,11 @@ func (g *Guard) notifyReconnected() {
} }
} }
func exponentTicker(ctx context.Context) *backoff.Ticker { func (g *Guard) exponentTicker(ctx context.Context) *backoff.Ticker {
bo := backoff.WithContext(&backoff.ExponentialBackOff{ bo := backoff.WithContext(&backoff.ExponentialBackOff{
InitialInterval: 2 * time.Second, InitialInterval: 2 * time.Second,
Multiplier: 2, Multiplier: 2,
MaxInterval: reconnectingTimeout, MaxInterval: g.maxBackoffInterval,
Clock: backoff.SystemClock, Clock: backoff.SystemClock,
}, ctx) }, ctx)

View File

@@ -39,6 +39,15 @@ func NewRelayTrack() *RelayTrack {
type OnServerCloseListener func() type OnServerCloseListener func()
// ManagerOption configures a Manager at construction time.
type ManagerOption func(*Manager)
// WithMaxBackoffInterval caps the exponential backoff between reconnect
// attempts to the home relay. A non-positive value keeps the default.
func WithMaxBackoffInterval(d time.Duration) ManagerOption {
return func(m *Manager) { m.maxBackoffInterval = d }
}
// Manager is a manager for the relay client instances. It establishes one persistent connection to the given relay URL // Manager is a manager for the relay client instances. It establishes one persistent connection to the given relay URL
// and automatically reconnect to them in case disconnection. // and automatically reconnect to them in case disconnection.
// The manager also manage temporary relay connection. If a client wants to communicate with a client on a // The manager also manage temporary relay connection. If a client wants to communicate with a client on a
@@ -65,11 +74,12 @@ type Manager struct {
listenerLock sync.Mutex listenerLock sync.Mutex
mtu uint16 mtu uint16
maxBackoffInterval time.Duration
} }
// NewManager creates a new manager instance. // NewManager creates a new manager instance.
// The serverURL address can be empty. In this case, the manager will not serve. // The serverURL address can be empty. In this case, the manager will not serve.
func NewManager(ctx context.Context, serverURLs []string, peerID string, mtu uint16) *Manager { func NewManager(ctx context.Context, serverURLs []string, peerID string, mtu uint16, opts ...ManagerOption) *Manager {
tokenStore := &relayAuth.TokenStore{} tokenStore := &relayAuth.TokenStore{}
m := &Manager{ m := &Manager{
@@ -86,8 +96,11 @@ func NewManager(ctx context.Context, serverURLs []string, peerID string, mtu uin
relayClients: make(map[string]*RelayTrack), relayClients: make(map[string]*RelayTrack),
onDisconnectedListeners: make(map[string]*list.List), onDisconnectedListeners: make(map[string]*list.List),
} }
for _, opt := range opts {
opt(m)
}
m.serverPicker.ServerURLs.Store(serverURLs) m.serverPicker.ServerURLs.Store(serverURLs)
m.reconnectGuard = NewGuard(m.serverPicker) m.reconnectGuard = NewGuard(m.serverPicker, m.maxBackoffInterval)
return m return m
} }
@@ -290,19 +303,36 @@ func (m *Manager) onServerConnected() {
go m.onReconnectedListenerFn() go m.onReconnectedListenerFn()
} }
// onServerDisconnected start to reconnection for home server only // onServerDisconnected handles relay disconnect events. For the home server it
// starts the reconnect guard. For foreign servers it evicts the now-dead client
// from the cache so the next OpenConn builds a fresh one instead of reusing a
// closed client.
func (m *Manager) onServerDisconnected(serverAddress string) { func (m *Manager) onServerDisconnected(serverAddress string) {
m.relayClientMu.Lock() m.relayClientMu.Lock()
if serverAddress == m.relayClient.connectionURL { isHome := m.relayClient != nil && serverAddress == m.relayClient.connectionURL
if isHome {
go func(client *Client) { go func(client *Client) {
m.reconnectGuard.StartReconnectTrys(m.ctx, client) m.reconnectGuard.StartReconnectTrys(m.ctx, client)
}(m.relayClient) }(m.relayClient)
} }
m.relayClientMu.Unlock() m.relayClientMu.Unlock()
if !isHome {
m.evictForeignRelay(serverAddress)
}
m.notifyOnDisconnectListeners(serverAddress) m.notifyOnDisconnectListeners(serverAddress)
} }
func (m *Manager) evictForeignRelay(serverAddress string) {
m.relayClientsMutex.Lock()
defer m.relayClientsMutex.Unlock()
if _, ok := m.relayClients[serverAddress]; ok {
delete(m.relayClients, serverAddress)
log.Debugf("evicted disconnected foreign relay client: %s", serverAddress)
}
}
func (m *Manager) listenGuardEvent(ctx context.Context) { func (m *Manager) listenGuardEvent(ctx context.Context) {
for { for {
select { select {

View File

@@ -2,6 +2,7 @@ package client
import ( import (
"context" "context"
"fmt"
"testing" "testing"
"time" "time"
@@ -360,7 +361,8 @@ func TestAutoReconnect(t *testing.T) {
t.Fatalf("failed to serve manager: %s", err) t.Fatalf("failed to serve manager: %s", err)
} }
clientAlice := NewManager(mCtx, toURL(srvCfg), "alice", iface.DefaultMTU) clientAlice := NewManager(mCtx, toURL(srvCfg), "alice", iface.DefaultMTU,
WithMaxBackoffInterval(2*time.Second))
err = clientAlice.Serve() err = clientAlice.Serve()
if err != nil { if err != nil {
t.Fatalf("failed to serve manager: %s", err) t.Fatalf("failed to serve manager: %s", err)
@@ -384,7 +386,9 @@ func TestAutoReconnect(t *testing.T) {
} }
log.Infof("waiting for reconnection") log.Infof("waiting for reconnection")
time.Sleep(reconnectingTimeout + 1*time.Second) if err := waitForReady(ctx, clientAlice, 15*time.Second); err != nil {
t.Fatalf("manager did not reconnect: %s", err)
}
log.Infof("reopent the connection") log.Infof("reopent the connection")
_, err = clientAlice.OpenConn(ctx, ra, "bob") _, err = clientAlice.OpenConn(ctx, ra, "bob")
@@ -393,6 +397,21 @@ func TestAutoReconnect(t *testing.T) {
} }
} }
func waitForReady(ctx context.Context, m *Manager, timeout time.Duration) error {
deadline := time.Now().Add(timeout)
for time.Now().Before(deadline) {
if m.Ready() {
return nil
}
select {
case <-time.After(100 * time.Millisecond):
case <-ctx.Done():
return ctx.Err()
}
}
return fmt.Errorf("manager not ready within %s", timeout)
}
func TestNotifierDoubleAdd(t *testing.T) { func TestNotifierDoubleAdd(t *testing.T) {
ctx := context.Background() ctx := context.Background()