Live-Site AI Agent¶
Purpose¶
This document defines the official direction for an operational AI agent that uses telemetry, logs, deployment metadata, and known issue history to assist with support diagnostics and balance recommendations.
This is not a gameplay agent.
It is an operational analysis system that sits beside the game and helps the team understand what is going wrong or what should be adjusted.
Primary Roles¶
The live-site agent has two initial modes:
- incident triage and diagnosis
- balance analysis and recommendation generation
These may share infrastructure, but they should remain separate workflows because their goals, inputs, and review processes differ.
Core Rule¶
The live-site agent should be retrieval-first and tool-using.
It must not rely on prompt-only reasoning over partial text excerpts.
The agent should gather evidence from structured systems first, then generate a diagnosis or recommendation.
Required Inputs¶
The operational platform should make the following data available to the agent.
Incident Inputs¶
- correlation id
- run id
- phase id when available
- client version
- content snapshot version
- effect module version set
- operating environment and deployment ring
- exception traces
- structured logs
- gameplay telemetry timeline
- user ticket text or support notes
- device and platform metadata where available
Correlation Contract¶
The debugging agent only works if correlation is explicit and consistent across every client and service hop.
The required contract is:
- `RunId` is a GUID created once per UX launch or operator session and reused for every downstream request from that session
- `CorrelationId` is a GUID created at the top of each call-stack entry point such as a button click, page action, startup workflow, or API-triggered operation
- outbound HTTP calls propagate both values in `X-EchoSpire-Run-Id` and `X-EchoSpire-Correlation-Id`
- the receiving service keeps those values if valid, generates replacements only when missing, and echoes them on the response for support diagnostics
- API request telemetry, error telemetry, and gameplay telemetry must all store the same `RunId` and the request-scoped `CorrelationId` so KQL can reconstruct a full flow
This means the agent can start from either identifier:
- use `CorrelationId` to inspect one top-level failing interaction
- use `RunId` to inspect the entire UX session across multiple requests and gameplay events
Balance Inputs¶
- simulation batch results
- policy ids used in simulation
- live telemetry aggregates
- win rates by class, faction, and hero progression state
- quest and rift failure rates
- protection and escort success rates
- card, relic, and event performance bands
- boss and rival hero lethality data
- sample-size and confidence metadata
Tooling Model¶
The live-site agent should be given explicit tools rather than raw database dumps.
Recommended tools are:
- fetch session by correlation id
- fetch telemetry timeline by run id
- fetch exception clusters by version
- fetch deployed module and content manifest for a session
- compare incident against known issue history
- search similar historical sessions
- retrieve simulation outlier report
- retrieve class and faction balance report
- retrieve objective failure analysis
For the first debugging implementation, the minimum viable toolset should be:
- fetch recent request telemetry by `CorrelationId`
- fetch full session timeline by `RunId`
- fetch recent errors joined on `RunId` or `CorrelationId`
- fetch API health overview for the current deployment window
- execute constrained KQL for admin-reviewed investigations
Each tool should return structured records suitable for later summarization.
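One way to keep tool outputs uniformly structured is a shared result envelope. The sketch below is an assumption about shape, not the project's actual contract; the `ToolResult` name, fields, and the stubbed tool are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ToolResult:
    """Structured record every tool returns, kept machine-readable
    so summarization happens later, not inside the tool."""
    tool: str                       # which tool produced this evidence
    source: str                     # backing system, e.g. a telemetry table
    records: list = field(default_factory=list)
    truncated: bool = False         # set when the tool capped its result set

def fetch_recent_errors(run_id: str, limit: int = 50) -> ToolResult:
    """Illustrative stub: a real implementation would query telemetry
    storage for error rows joined on RunId."""
    rows: list = []  # placeholder for rows joined on the given RunId
    return ToolResult(
        tool="fetch_recent_errors",
        source="error_telemetry",
        records=rows[:limit],
        truncated=len(rows) > limit,
    )
```

Returning records plus a `truncated` flag lets the agent report when evidence is incomplete rather than silently summarizing a partial set.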
Incident Triage Agent¶
Goal¶
Determine the most likely cause of a reported problem and recommend the next action.
Recommended Workflow¶
- accept correlation id, ticket, or exception reference
- retrieve session timeline and failure evidence
- retrieve deployed code version and content snapshot version
- retrieve relevant module manifest information
- compare the issue against known incidents and recent regressions
- classify the problem
- produce a diagnosis summary with confidence and evidence
Recommended Output Shape¶
The incident agent should return:
- issue classification
- likely root cause
- confidence level
- impacted subsystem
- evidence summary
- likely reproduction path
- recommended next action
- whether the issue appears to be code, content, configuration, deployment, or user-behavior related
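The output shape above can be rendered as a structured artifact like the hypothetical example below. Every field value here is invented for illustration; only the field names mirror the list above.

```python
import json

# Hypothetical diagnosis artifact; all values are invented examples.
diagnosis = {
    "issue_classification": "version mismatch",
    "likely_root_cause": "client content snapshot lags the deployed effect module set",
    "confidence": "medium",
    "impacted_subsystem": "content loading",
    "evidence_summary": "manifest comparison shows a stale snapshot against the ring's module set",
    "likely_reproduction_path": "launch a client pinned to the stale snapshot",
    "recommended_next_action": "route to engineering for manifest review",
    "origin": "deployment",  # code, content, configuration, deployment, or user-behavior
}

print(json.dumps(diagnosis, indent=2))
```

A fixed shape like this is what makes the later review queue and known-issue retrieval steps mechanical rather than ad hoc.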
Recommended Issue Classes¶
- code defect
- bad authored content
- version mismatch
- module compatibility problem
- telemetry gap or incomplete evidence
- user misunderstanding or UX clarity issue
- infrastructure failure
- unknown, requiring engineer review
Balance Recommendation Agent¶
Goal¶
Analyze telemetry and simulation output to produce ranked balance recommendations for designer review.
Recommended Workflow¶
- retrieve selected live and simulated balance datasets
- evaluate sample quality and confidence
- identify outlier cards, relics, classes, quests, and encounters
- compare simulator findings to live player behavior
- generate ranked hypotheses
- propose recommendation candidates
- send the output to a human review queue
Recommended Output Shape¶
The balance agent should return:
- ranked issues by severity and confidence
- suspected root-cause drivers
- impacted content ids and gameplay spaces
- recommendation candidates
- sample-size warning flags
- whether the evidence is simulator-only, live-only, or corroborated by both
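Ranking by severity weighted by confidence, with an explicit weak-sample flag, might look like the sketch below. The threshold, field names, and confidence cap are assumptions chosen to illustrate the "surface weak evidence as weak" rule, not tuned values.

```python
MIN_SAMPLES = 500  # illustrative threshold; below it, evidence is flagged, not trusted

def rank_issues(issues: list) -> list:
    """Rank balance issues by severity * confidence, capping confidence
    for weak samples so thin data cannot climb the ranking."""
    for issue in issues:
        issue["weak_sample"] = issue["samples"] < MIN_SAMPLES
        # Weak evidence is surfaced, never padded into false confidence.
        confidence = min(issue["confidence"], 0.5) if issue["weak_sample"] else issue["confidence"]
        issue["score"] = issue["severity"] * confidence
    return sorted(issues, key=lambda i: i["score"], reverse=True)
```

A high-severity issue backed by forty runs should rank below a moderate issue backed by thousands, and the flag tells the designer why.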
Example Recommendation Types¶
- reduce or increase a card parameter range
- shift reward weight or encounter appearance rate
- improve early protection tools for a class
- weaken a rival hero loadout package
- adjust boss phase pacing
- mark a metric as inconclusive due to weak data
Safety Boundaries¶
The live-site agent must operate under the following rules:
- no direct mutation of live balance data in the first implementation
- no autonomous hotfix deployment
- no silent modification of player saves or progression state
- recommendations must be reviewed by a human before publication
- weak evidence must be surfaced as weak evidence, not padded into false confidence
Deployment And Privacy Model¶
The agent should run as an internal operational service.
Recommended characteristics are:
- internal-only access to telemetry and logs
- role-gated access through admin or support tooling
- audit logging for every agent request and output
- evidence links back to source systems where possible
- redaction and privacy rules applied before model prompting when needed
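A redaction pass before prompting could start as simply as the sketch below. The patterns are assumptions, not the project's actual privacy rules; notably, correlation GUIDs are deliberately left intact because the whole workflow depends on them.

```python
import re

# Illustrative redaction patterns; a real policy would be broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text: str) -> str:
    """Scrub obvious personal identifiers from ticket or log text
    before it reaches a model prompt. GUIDs are preserved on purpose."""
    text = EMAIL_RE.sub("[email]", text)
    text = IPV4_RE.sub("[ip]", text)
    return text
```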
Data Correlation Direction¶
The current telemetry structure based on run id, phase id, and event id is the correct foundation.
The next operational step is to ensure support workflows can join:
- API request telemetry
- gameplay run telemetry
- deployment and version metadata
- exception logging
- support tickets
The agent becomes powerful only if those joins are reliable.
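Those joins are exactly what the constrained-KQL tool would express. The sketch below assembles such a query in Python; the table and column names (`ApiRequests`, `ErrorLogs`, `Timestamp`) are assumptions for illustration, while `RunId` and `CorrelationId` come from the correlation contract.

```python
def build_run_flow_query(run_id: str, window_minutes: int = 30) -> str:
    """Assemble an illustrative KQL query that reconstructs a run's flow by
    joining request telemetry to errors on the request-scoped CorrelationId."""
    return f"""
ApiRequests
| where RunId == '{run_id}' and Timestamp > ago({window_minutes}m)
| join kind=leftouter (ErrorLogs | where RunId == '{run_id}') on CorrelationId
| order by Timestamp asc
""".strip()
```

In practice the constrained-KQL tool would validate and parameterize queries like this rather than interpolating raw strings.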
Review Workflow¶
Recommended review path for incidents:
- agent performs analysis
- support or engineer reviews evidence
- issue is classified and routed
- if accepted, a known-issue record can be created for future retrieval
Recommended review path for balance:
- agent analyzes telemetry and simulation
- designer reviews ranked recommendations
- approved recommendations become draft content changes
- simulation and validation rerun before publish
Recommended Storage Outputs¶
Agent outputs should be stored as structured analysis artifacts.
Recommended fields are:
- analysis id
- requested by
- request type
- source identifiers used
- evidence references
- summary
- recommended actions
- confidence
- status such as draft, reviewed, accepted, dismissed
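A record type mirroring those fields could look like the sketch below. The class and field names are assumptions; only the field list itself comes from the recommendation above.

```python
from dataclasses import dataclass, field
import uuid

# Illustrative analysis-artifact record; names and defaults are assumptions.
@dataclass
class AnalysisArtifact:
    requested_by: str
    request_type: str               # e.g. "incident" or "balance"
    source_identifiers: list        # RunId / CorrelationId values used
    evidence_refs: list             # links back to source systems
    summary: str
    recommended_actions: list
    confidence: str                 # e.g. "low" / "medium" / "high"
    status: str = "draft"           # draft, reviewed, accepted, dismissed
    analysis_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```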
Implementation Phasing¶
Phase 1: Retrieval And Summaries¶
- correlation-id driven investigation
- run-id session reconstruction across UX and API telemetry
- telemetry and exception summaries
- no recommendation automation yet
Current Workspace Implementation¶
The first working operator-facing implementation now exists as a VS Code workspace custom agent rather than an MCP server.
The repo now contains:
- `.github/agents/echospire-live-site-debugger.agent.md`
- `.github/prompts/investigate-live-site-incident.prompt.md`
This is the fastest path to a real, usable debugging workflow because it can use the existing API telemetry endpoints immediately without adding another service boundary first.
The current agent is intentionally read-only in practice:
- it uses the API health and telemetry endpoints
- it authenticates through the existing auth API when needed
- it prefers structured telemetry endpoints before raw KQL
- it is meant to investigate, not mutate systems
How To Use It In VS Code¶
There are two supported interaction paths.
Agent Picker¶
Select EchoSpire Live-Site Debugger from the chat agent picker, then ask it to investigate an incident.
Provide:
- API base URL
- `CorrelationId` or `RunId`
- symptom text or ticket summary
- optional timeframe
- optional admin credentials if you do not want it to use local seeded dev credentials
Example requests:
- `Investigate http://localhost:5231 correlation 6f1d4d47-8dc1-4ff4-bf7f-3b9f5475c903. The WPF client failed during startup.`
- `Investigate http://localhost:5231 run 5b87d04d-2d66-4f88-9bb3-2c2892a44db7 for the last 30 minutes. Find the first failing request and summarize likely cause.`
Slash Prompt¶
Run `/Investigate Live-Site Incident` and provide the same inputs.
This is the easier path for repeat use because it routes directly to the custom agent with the expected workflow.
Local Dev Authentication¶
Telemetry endpoints are admin-gated.
For local development, the current API seeder provides these admin accounts:
- `admin` / `admin`
- `test-admin` / `test`
These come from `src/EchoSpire.API/Auth/AuthDataSeeder.cs` and are appropriate for local-only debugging use.
Current Telemetry Endpoint Contract¶
The agent should treat the following as the primary operational surface:
- `GET /api/v1/health`
- `GET /api/v1/telemetry/health/overview`
- `GET /api/v1/telemetry/errors`
- `GET /api/v1/telemetry/requests`
- `GET /api/v1/telemetry/runs/{runId}`
- `POST /api/v1/telemetry/query`
The raw query endpoint remains important, but it should stay a last resort after the structured endpoints have narrowed the incident.
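A minimal request-shaping sketch for that surface is shown below. It only builds URLs, leaving transport and admin authentication to the caller; the `minutes` query parameter is an assumption, not a confirmed part of the contract.

```python
from urllib.parse import urlencode, urljoin

class TelemetryClient:
    """Illustrative URL builder for the telemetry endpoint contract above."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/") + "/"

    def url(self, path: str, **params) -> str:
        full = urljoin(self.base_url, path.lstrip("/"))
        return f"{full}?{urlencode(params)}" if params else full

    def run_timeline(self, run_id: str) -> str:
        return self.url(f"api/v1/telemetry/runs/{run_id}")

    def recent_errors(self, minutes: int = 30) -> str:
        # 'minutes' is an assumed query parameter for illustration.
        return self.url("api/v1/telemetry/errors", minutes=minutes)
```

Keeping the structured endpoints as first-class methods, and leaving `POST /api/v1/telemetry/query` out of the convenience surface, reinforces the raw-query-as-last-resort rule.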
MCP Direction¶
Yes, this can be promoted into an MCP-backed workflow.
That should be the next hardening step, not the first one.
The current custom agent is the right first implementation because:
- it works immediately with the telemetry endpoints that already exist
- it avoids introducing another service while the evidence contract is still stabilizing
- it proves which operator tools are actually useful before we freeze an MCP surface
The recommended MCP follow-up is a small internal telemetry server that exposes narrowly scoped tools such as:
- `login_admin`
- `get_health_overview`
- `get_recent_errors`
- `get_recent_requests`
- `get_run_timeline`
- `execute_kql`
Once that exists, the custom agent should switch from shell-driven HTTP calls to MCP-only tools and drop terminal access for tighter control and better auditability.
Phase 2: Incident Classification¶
- classify likely root cause categories
- link to similar prior issues
- improve support response speed
Phase 3: Balance Recommendations¶
- simulation plus live telemetry comparisons
- ranked recommendation output
- human review and draft-only publishing path
Phase 4: Mature Operational Assistant¶
- better trend analysis
- known-issue memory
- module compatibility diagnosis
- stronger confidence modeling
Official Recommendation Summary¶
The canonical direction is:
- build the live-site AI agent as a retrieval-first operational service
- split incident triage and balance analysis into separate workflows
- use telemetry, logs, and version metadata as first-class evidence
- require human review before any live-impacting change is accepted
- treat this as an operational assistant, not a gameplay decision-maker