Live-Site AI Agent¶
Purpose¶
This document defines the official direction for an operational AI agent that uses telemetry, logs, deployment metadata, and known issue history to assist with support diagnostics and balance recommendations.
This is not a gameplay agent.
It is an operational analysis system that sits beside the game and helps the team understand what is going wrong or what should be adjusted.
Primary Roles¶
The live-site agent has two initial modes:
- incident triage and diagnosis
- balance analysis and recommendation generation
These may share infrastructure, but they should remain separate workflows because their goals, inputs, and review processes differ.
Core Rule¶
The live-site agent should be retrieval-first and tool-using.
It must not rely on prompt-only reasoning over partial text excerpts.
The agent should gather evidence from structured systems first, then generate a diagnosis or recommendation.
Required Inputs¶
The operational platform should make the following data available to the agent.
Incident Inputs¶
- correlation id
- run id
- phase id when available
- client version
- content snapshot version
- effect module version set
- operating environment and deployment ring
- exception traces
- structured logs
- gameplay telemetry timeline
- user ticket text or support notes
- device and platform metadata where available
Correlation Contract¶
The debugging agent only works if correlation is explicit and consistent across every client and service hop.
The required contract is:
- `RunId` is a GUID created once per UX launch or operator session and reused for every downstream request from that session
- `CorrelationId` is a GUID created at the top of each call-stack entry point such as a button click, page action, startup workflow, or API-triggered operation
- outbound HTTP calls propagate both values in `X-EchoSpire-Run-Id` and `X-EchoSpire-Correlation-Id`
- the receiving service keeps those values if valid, generates replacements only when missing, and echoes them on the response for support diagnostics
- API request telemetry, error telemetry, and gameplay telemetry must all store the same `RunId` and the request-scoped `CorrelationId` so KQL can reconstruct a full flow
This means the agent can start from either identifier:
- use `CorrelationId` to inspect one top-level failing interaction
- use `RunId` to inspect the entire UX session across multiple requests and gameplay events
Balance Inputs¶
- simulation batch results
- policy ids used in simulation
- live telemetry aggregates
- win rates by class, faction, and hero progression state
- quest and rift failure rates
- protection and escort success rates
- card, relic, and event performance bands
- boss and rival hero lethality data
- sample-size and confidence metadata
Tooling Model¶
The live-site agent should be given explicit tools rather than raw database dumps.
Recommended tools are:
- fetch session by correlation id
- fetch telemetry timeline by run id
- fetch exception clusters by version
- fetch deployed module and content manifest for a session
- compare incident against known issue history
- search similar historical sessions
- retrieve simulation outlier report
- retrieve class and faction balance report
- retrieve objective failure analysis
For the first debugging implementation, the minimum viable toolset should be:
- fetch recent request telemetry by `CorrelationId`
- fetch full session timeline by `RunId`
- fetch recent errors joined on `RunId` or `CorrelationId`
- fetch API health overview for the current deployment window
- execute constrained KQL for admin-reviewed investigations
Each tool should return structured records suitable for later summarization.
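One way to keep tool outputs uniformly structured is a shared result envelope. The sketch below is an assumption about shape, not the project's actual contract; the `ToolResult` name, fields, and the stubbed tool are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ToolResult:
    """Structured record every tool returns, kept machine-readable
    so summarization happens later, not inside the tool."""
    tool: str                       # which tool produced this evidence
    source: str                     # backing system, e.g. a telemetry table
    records: list = field(default_factory=list)
    truncated: bool = False         # set when the tool capped its result set

def fetch_recent_errors(run_id: str, limit: int = 50) -> ToolResult:
    """Illustrative stub: a real implementation would query telemetry
    storage for error rows joined on RunId."""
    rows: list = []  # placeholder for rows joined on the given RunId
    return ToolResult(
        tool="fetch_recent_errors",
        source="error_telemetry",
        records=rows[:limit],
        truncated=len(rows) > limit,
    )
```

Returning records plus a `truncated` flag lets the agent report when evidence is incomplete rather than silently summarizing a partial set.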
Incident Triage Agent¶
Goal¶
Determine the most likely cause of a reported problem and recommend the next action.
Recommended Workflow¶
- accept correlation id, ticket, or exception reference
- retrieve session timeline and failure evidence
- retrieve deployed code version and content snapshot version
- retrieve relevant module manifest information
- compare the issue against known incidents and recent regressions
- classify the problem
- produce a diagnosis summary with confidence and evidence
Recommended Output Shape¶
The incident agent should return:
- issue classification
- likely root cause
- confidence level
- impacted subsystem
- evidence summary
- likely reproduction path
- recommended next action
- whether the issue appears to be code, content, configuration, deployment, or user-behavior related
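The output shape above can be rendered as a structured artifact like the hypothetical example below. Every field value here is invented for illustration; only the field names mirror the list above.

```python
import json

# Hypothetical diagnosis artifact; all values are invented examples.
diagnosis = {
    "issue_classification": "version mismatch",
    "likely_root_cause": "client content snapshot lags the deployed effect module set",
    "confidence": "medium",
    "impacted_subsystem": "content loading",
    "evidence_summary": "manifest comparison shows a stale snapshot against the ring's module set",
    "likely_reproduction_path": "launch a client pinned to the stale snapshot",
    "recommended_next_action": "route to engineering for manifest review",
    "origin": "deployment",  # code, content, configuration, deployment, or user-behavior
}

print(json.dumps(diagnosis, indent=2))
```

A fixed shape like this is what makes the later review queue and known-issue retrieval steps mechanical rather than ad hoc.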
Recommended Issue Classes¶
- code defect
- bad authored content
- version mismatch
- module compatibility problem
- telemetry gap or incomplete evidence
- user misunderstanding or UX clarity issue
- infrastructure failure
- unknown, requiring engineer review
Balance Recommendation Agent¶
Goal¶
Analyze telemetry and simulation output to produce ranked balance recommendations for designer review.
Recommended Workflow¶
- retrieve selected live and simulated balance datasets
- evaluate sample quality and confidence
- identify outlier cards, relics, classes, quests, and encounters
- compare simulator findings to live player behavior
- generate ranked hypotheses
- propose recommendation candidates
- send the output to a human review queue
Recommended Output Shape¶
The balance agent should return:
- ranked issues by severity and confidence
- suspected root-cause drivers
- impacted content ids and gameplay spaces
- recommendation candidates
- sample-size warning flags
- whether the evidence is simulator-only, live-only, or corroborated by both
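Ranking by severity weighted by confidence, with an explicit weak-sample flag, might look like the sketch below. The threshold, field names, and confidence cap are assumptions chosen to illustrate the "surface weak evidence as weak" rule, not tuned values.

```python
MIN_SAMPLES = 500  # illustrative threshold; below it, evidence is flagged, not trusted

def rank_issues(issues: list) -> list:
    """Rank balance issues by severity * confidence, capping confidence
    for weak samples so thin data cannot climb the ranking."""
    for issue in issues:
        issue["weak_sample"] = issue["samples"] < MIN_SAMPLES
        # Weak evidence is surfaced, never padded into false confidence.
        confidence = min(issue["confidence"], 0.5) if issue["weak_sample"] else issue["confidence"]
        issue["score"] = issue["severity"] * confidence
    return sorted(issues, key=lambda i: i["score"], reverse=True)
```

A high-severity issue backed by forty runs should rank below a moderate issue backed by thousands, and the flag tells the designer why.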
Example Recommendation Types¶
- reduce or increase a card parameter range
- shift reward weight or encounter appearance rate
- improve early protection tools for a class
- weaken a rival hero loadout package
- adjust boss phase pacing
- mark a metric as inconclusive due to weak data
Safety Boundaries¶
The live-site agent must operate under the following rules:
- no direct mutation of live balance data in the first implementation
- no autonomous hotfix deployment
- no silent modification of player saves or progression state
- recommendations must be reviewed by a human before publication
- weak evidence must be surfaced as weak evidence, not padded into false confidence
Deployment And Privacy Model¶
The agent should run as an internal operational service.
Recommended characteristics are:
- internal-only access to telemetry and logs
- role-gated access through admin or support tooling
- audit logging for every agent request and output
- evidence links back to source systems where possible
- redaction and privacy rules applied before model prompting when needed
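A redaction pass before prompting could start as simply as the sketch below. The patterns are assumptions, not the project's actual privacy rules; notably, correlation GUIDs are deliberately left intact because the whole workflow depends on them.

```python
import re

# Illustrative redaction patterns; a real policy would be broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text: str) -> str:
    """Scrub obvious personal identifiers from ticket or log text
    before it reaches a model prompt. GUIDs are preserved on purpose."""
    text = EMAIL_RE.sub("[email]", text)
    text = IPV4_RE.sub("[ip]", text)
    return text
```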
Data Correlation Direction¶
The current telemetry structure based on run id, phase id, and event id is the correct foundation.
The next operational step is to ensure support workflows can join:
- API request telemetry
- gameplay run telemetry
- deployment and version metadata
- exception logging
- support tickets
The agent becomes powerful only if those joins are reliable.
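Those joins are exactly what the constrained-KQL tool would express. The sketch below assembles such a query in Python; the table and column names (`ApiRequests`, `ErrorLogs`, `Timestamp`) are assumptions for illustration, while `RunId` and `CorrelationId` come from the correlation contract.

```python
def build_run_flow_query(run_id: str, window_minutes: int = 30) -> str:
    """Assemble an illustrative KQL query that reconstructs a run's flow by
    joining request telemetry to errors on the request-scoped CorrelationId."""
    return f"""
ApiRequests
| where RunId == '{run_id}' and Timestamp > ago({window_minutes}m)
| join kind=leftouter (ErrorLogs | where RunId == '{run_id}') on CorrelationId
| order by Timestamp asc
""".strip()
```

In practice the constrained-KQL tool would validate and parameterize queries like this rather than interpolating raw strings.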
Review Workflow¶
Recommended review path for incidents:
- agent performs analysis
- support or engineer reviews evidence
- issue is classified and routed
- if accepted, a known-issue record can be created for future retrieval
Recommended review path for balance:
- agent analyzes telemetry and simulation
- designer reviews ranked recommendations
- approved recommendations become draft content changes
- simulation and validation rerun before publish
Recommended Storage Outputs¶
Agent outputs should be stored as structured analysis artifacts.
Recommended fields are:
- analysis id
- requested by
- request type
- source identifiers used
- evidence references
- summary
- recommended actions
- confidence
- status such as draft, reviewed, accepted, dismissed
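A record type mirroring those fields could look like the sketch below. The class and field names are assumptions; only the field list itself comes from the recommendation above.

```python
from dataclasses import dataclass, field
import uuid

# Illustrative analysis-artifact record; names and defaults are assumptions.
@dataclass
class AnalysisArtifact:
    requested_by: str
    request_type: str               # e.g. "incident" or "balance"
    source_identifiers: list        # RunId / CorrelationId values used
    evidence_refs: list             # links back to source systems
    summary: str
    recommended_actions: list
    confidence: str                 # e.g. "low" / "medium" / "high"
    status: str = "draft"           # draft, reviewed, accepted, dismissed
    analysis_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```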
Implementation Phasing¶
Phase 1: Retrieval And Summaries¶
- correlation-id driven investigation
- run-id session reconstruction across UX and API telemetry
- telemetry and exception summaries
- no recommendation automation yet
Current Workspace Implementation¶
The first working operator-facing implementation now exists as a VS Code workspace custom agent rather than an MCP server.
The repo now contains:
- `.github/agents/echospire-live-site-debugger.agent.md`
- `.github/prompts/investigate-live-site-incident.prompt.md`
This is the fastest path to a real, usable debugging workflow because it can use the existing API telemetry endpoints immediately without adding another service boundary first.
The current agent is intentionally read-only in practice:
- it uses the API health and telemetry endpoints
- it authenticates through the existing auth API when needed
- it prefers structured telemetry endpoints before raw KQL
- it is meant to investigate, not mutate systems
How To Use It In VS Code¶
There are two supported interaction paths.
Agent Picker¶
Select EchoSpire Live-Site Debugger from the chat agent picker, then ask it to investigate an incident.
Provide:
- API base URL
- `CorrelationId` or `RunId`
- symptom text or ticket summary
- optional timeframe
- optional admin credentials if you do not want it to use local seeded dev credentials
Example requests:
- `Investigate http://localhost:5231 correlation 6f1d4d47-8dc1-4ff4-bf7f-3b9f5475c903. The WPF client failed during startup.`
- `Investigate http://localhost:5231 run 5b87d04d-2d66-4f88-9bb3-2c2892a44db7 for the last 30 minutes. Find the first failing request and summarize likely cause.`
Slash Prompt¶
Run `/Investigate Live-Site Incident` and provide the same inputs.
This is the easier path for repeat use because it routes directly to the custom agent with the expected workflow.
Local Dev Authentication¶
Telemetry endpoints are admin-gated.
For local development, the current API seeder provides these admin accounts:
- `admin` / `admin`
- `test-admin` / `test`
These come from `src/EchoSpire.API/Auth/AuthDataSeeder.cs` and are appropriate for local-only debugging use.
Current Telemetry Endpoint Contract¶
The agent should treat the following as the primary operational surface:
- `GET /api/v1/health`
- `GET /api/v1/telemetry/health/overview`
- `GET /api/v1/telemetry/errors`
- `GET /api/v1/telemetry/requests`
- `GET /api/v1/telemetry/runs/{runId}`
- `POST /api/v1/telemetry/query`
The raw query endpoint remains important, but it should stay a last resort after the structured endpoints have narrowed the incident.
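A minimal request-shaping sketch for that surface is shown below. It only builds URLs, leaving transport and admin authentication to the caller; the `minutes` query parameter is an assumption, not a confirmed part of the contract.

```python
from urllib.parse import urlencode, urljoin

class TelemetryClient:
    """Illustrative URL builder for the telemetry endpoint contract above."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/") + "/"

    def url(self, path: str, **params) -> str:
        full = urljoin(self.base_url, path.lstrip("/"))
        return f"{full}?{urlencode(params)}" if params else full

    def run_timeline(self, run_id: str) -> str:
        return self.url(f"api/v1/telemetry/runs/{run_id}")

    def recent_errors(self, minutes: int = 30) -> str:
        # 'minutes' is an assumed query parameter for illustration.
        return self.url("api/v1/telemetry/errors", minutes=minutes)
```

Keeping the structured endpoints as first-class methods, and leaving `POST /api/v1/telemetry/query` out of the convenience surface, reinforces the raw-query-as-last-resort rule.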
MCP Direction¶
Yes, this can be promoted into an MCP-backed workflow.
That should be the next hardening step, not the first one.
The current custom agent is the right first implementation because:
- it works immediately with the telemetry endpoints that already exist
- it avoids introducing another service while the evidence contract is still stabilizing
- it proves which operator tools are actually useful before we freeze an MCP surface
The recommended MCP follow-up is a small internal telemetry server that exposes narrowly scoped tools such as:
- `login_admin`
- `get_health_overview`
- `get_recent_errors`
- `get_recent_requests`
- `get_run_timeline`
- `execute_kql`
Once that exists, the custom agent should switch from shell-driven HTTP calls to MCP-only tools and drop terminal access for tighter control and better auditability.
Phase 2: Incident Classification¶
- classify likely root cause categories
- link to similar prior issues
- improve support response speed
Phase 3: Balance Recommendations¶
- simulation plus live telemetry comparisons
- ranked recommendation output
- human review and draft-only publishing path
Phase 4: Mature Operational Assistant¶
- better trend analysis
- known-issue memory
- module compatibility diagnosis
- stronger confidence modeling
Official Recommendation Summary¶
The canonical direction is:
- build the live-site AI agent as a retrieval-first operational service
- split incident triage and balance analysis into separate workflows
- use telemetry, logs, and version metadata as first-class evidence
- require human review before any live-impacting change is accepted
- treat this as an operational assistant, not a gameplay decision-maker