Live-Site AI Agent

Purpose

This document defines the official direction for an operational AI agent that uses telemetry, logs, deployment metadata, and known issue history to assist with support diagnostics and balance recommendations.

This is not a gameplay agent.

It is an operational analysis system that sits beside the game and helps the team understand what is going wrong or what should be adjusted.

Primary Roles

The live-site agent has two initial modes:

  • incident triage and diagnosis
  • balance analysis and recommendation generation

These may share infrastructure, but they should remain separate workflows because their goals, inputs, and review processes differ.

Core Rule

The live-site agent should be retrieval-first and tool-using.

It must not rely on prompt-only reasoning over partial text excerpts.

The agent should gather evidence from structured systems first, then generate a diagnosis or recommendation.

Required Inputs

The operational platform should make the following data available to the agent.

Incident Inputs

  • correlation id
  • run id
  • phase id when available
  • client version
  • content snapshot version
  • effect module version set
  • operating environment and deployment ring
  • exception traces
  • structured logs
  • gameplay telemetry timeline
  • user ticket text or support notes
  • device and platform metadata where available

Correlation Contract

The debugging agent only works if correlation is explicit and consistent across every client and service hop.

The required contract is:

  • RunId is a GUID created once per UX launch or operator session and reused for every downstream request from that session
  • CorrelationId is a GUID created at the top of each call-stack entry point such as a button click, page action, startup workflow, or API-triggered operation
  • outbound HTTP calls propagate both values in X-EchoSpire-Run-Id and X-EchoSpire-Correlation-Id
  • the receiving service keeps those values if valid, generates replacements only when missing, and echoes them on the response for support diagnostics
  • API request telemetry, error telemetry, and gameplay telemetry must all store the same RunId and the request-scoped CorrelationId so KQL can reconstruct a full flow

This means the agent can start from either identifier:

  • use CorrelationId to inspect one top-level failing interaction
  • use RunId to inspect the entire UX session across multiple requests and gameplay events

Balance Inputs

  • simulation batch results
  • policy ids used in simulation
  • live telemetry aggregates
  • win rates by class, faction, and hero progression state
  • quest and rift failure rates
  • protection and escort success rates
  • card, relic, and event performance bands
  • boss and rival hero lethality data
  • sample-size and confidence metadata

Tooling Model

The live-site agent should be given explicit tools rather than raw database dumps.

Recommended tools are:

  • fetch session by correlation id
  • fetch telemetry timeline by run id
  • fetch exception clusters by version
  • fetch deployed module and content manifest for a session
  • compare incident against known issue history
  • search similar historical sessions
  • retrieve simulation outlier report
  • retrieve class and faction balance report
  • retrieve objective failure analysis

For the first debugging implementation, the minimum viable toolset should be:

  • fetch recent request telemetry by CorrelationId
  • fetch full session timeline by RunId
  • fetch recent errors joined on RunId or CorrelationId
  • fetch API health overview for the current deployment window
  • execute constrained KQL for admin-reviewed investigations

Each tool should return structured records suitable for later summarization.
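One way to realize "explicit tools rather than raw database dumps" is a registry of named, typed tool functions whose results are structured records. This Python sketch is an assumption about shape, not the actual implementation; the tool name comes from the list above, while `ToolResult` and its fields are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ToolResult:
    tool: str            # which tool produced the records
    records: List[dict]  # structured rows, ready for later summarization
    truncated: bool      # whether the result was capped at a row limit

TOOLS: Dict[str, Callable[..., ToolResult]] = {}

def tool(name):
    """Register a function as an agent-callable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("fetch_session_by_correlation_id")
def fetch_session(correlation_id: str) -> ToolResult:
    # A real implementation would query telemetry storage;
    # this stub only demonstrates the structured return shape.
    rows = [{"correlationId": correlation_id, "event": "RequestStarted"}]
    return ToolResult("fetch_session_by_correlation_id", rows, truncated=False)
```

A fixed registry like this also gives the audit layer a natural interception point: every agent call names a tool and its arguments.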

Incident Triage Agent

Goal

Determine the most likely cause of a reported problem and recommend the next action.

  1. accept correlation id, ticket, or exception reference
  2. retrieve session timeline and failure evidence
  3. retrieve deployed code version and content snapshot version
  4. retrieve relevant module manifest information
  5. compare the issue against known incidents and recent regressions
  6. classify the problem
  7. produce a diagnosis summary with confidence and evidence

The incident agent should return:

  • issue classification
  • likely root cause
  • confidence level
  • impacted subsystem
  • evidence summary
  • likely reproduction path
  • recommended next action
  • whether the issue appears code, content, configuration, deployment, or user-behavior related

Classification categories are:

  • code defect
  • bad authored content
  • version mismatch
  • module compatibility problem
  • telemetry gap or incomplete evidence
  • user misunderstanding or UX clarity issue
  • infrastructure failure
  • unknown, requires engineer review
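The diagnosis summary lends itself to a fixed output shape. This Python sketch encodes the classification categories as an enum and the return fields as a record; the type names are hypothetical, and only the field list comes from this document:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class IssueClass(Enum):
    CODE_DEFECT = "code defect"
    BAD_CONTENT = "bad authored content"
    VERSION_MISMATCH = "version mismatch"
    MODULE_COMPATIBILITY = "module compatibility problem"
    TELEMETRY_GAP = "telemetry gap or incomplete evidence"
    UX_CLARITY = "user misunderstanding or UX clarity issue"
    INFRASTRUCTURE = "infrastructure failure"
    UNKNOWN = "unknown, requires engineer review"

@dataclass
class Diagnosis:
    classification: IssueClass
    likely_root_cause: str
    confidence: float              # 0.0 .. 1.0, surfaced as-is
    impacted_subsystem: str
    evidence_summary: List[str] = field(default_factory=list)
    reproduction_path: str = ""
    recommended_next_action: str = ""
```

Keeping the category set closed forces the agent to emit UNKNOWN rather than inventing a plausible-sounding cause.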

Balance Recommendation Agent

Goal

Analyze telemetry and simulation output to produce ranked balance recommendations for designer review.

  1. retrieve selected live and simulated balance datasets
  2. evaluate sample quality and confidence
  3. identify outlier cards, relics, classes, quests, and encounters
  4. compare simulator findings to live player behavior
  5. generate ranked hypotheses
  6. propose recommendation candidates
  7. send the output to a human review queue

The balance agent should return:

  • ranked issues by severity and confidence
  • suspected root-cause drivers
  • impacted content ids and gameplay spaces
  • recommendation candidates
  • sample-size warning flags
  • whether the evidence is simulator-only, live-only, or corroborated by both
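The last two return items are mechanical enough to sketch directly. Assuming each finding carries numeric severity and confidence scores (an assumption; field names are illustrative), the evidence-source label and the ranking could look like:

```python
def evidence_source(in_simulation: bool, in_live: bool) -> str:
    """Label where the evidence for a balance finding comes from."""
    if in_simulation and in_live:
        return "corroborated"
    if in_simulation:
        return "simulator-only"
    if in_live:
        return "live-only"
    return "no-evidence"

def rank_findings(findings):
    """Sort balance findings by severity, then confidence, descending.
    Each finding is a dict with 'severity' and 'confidence' floats."""
    return sorted(findings,
                  key=lambda f: (f["severity"], f["confidence"]),
                  reverse=True)
```

Corroborated findings should generally outrank simulator-only or live-only ones at equal severity, but that weighting is a design decision left to the review process.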

Example Recommendation Types

  • reduce or increase a card parameter range
  • shift reward weight or encounter appearance rate
  • improve early protection tools for a class
  • weaken a rival hero loadout package
  • adjust boss phase pacing
  • mark a metric as inconclusive due to weak data

Safety Boundaries

The live-site agent must operate under the following rules:

  • no direct mutation of live balance data in the first implementation
  • no autonomous hotfix deployment
  • no silent modification of player saves or progression state
  • recommendations must be reviewed by a human before publication
  • weak evidence must be surfaced as weak evidence, not padded into false confidence
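The last rule can be made concrete: rather than letting weak samples inflate confidence, cap the reported confidence and attach an explicit warning. The threshold and cap below are illustrative placeholders, not tuned values from this document:

```python
MIN_SAMPLE = 500   # illustrative threshold, not a tuned value
WEAK_CAP = 0.3     # illustrative ceiling for under-sampled findings

def qualify_confidence(confidence: float, sample_size: int):
    """Surface weak evidence as weak evidence: when the sample backing
    a recommendation is too small, cap the reported confidence and
    return a warning flag alongside it."""
    if sample_size < MIN_SAMPLE:
        warning = "weak-evidence: sample_size=%d" % sample_size
        return min(confidence, WEAK_CAP), [warning]
    return confidence, []
```

The important property is that the downgrade is visible in the output, so a reviewer can see both the raw finding and why it was demoted.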

Deployment And Privacy Model

The agent should run as an internal operational service.

Recommended characteristics are:

  • internal-only access to telemetry and logs
  • role-gated access through admin or support tooling
  • audit logging for every agent request and output
  • evidence links back to source systems where possible
  • redaction and privacy rules applied before model prompting when needed

Data Correlation Direction

The current telemetry structure, based on run id, phase id, and event id, is the correct foundation.

The next operational step is to ensure support workflows can join:

  • API request telemetry
  • gameplay run telemetry
  • deployment and version metadata
  • exception logging
  • support tickets

The agent becomes powerful only if those joins are reliable.

Review Workflow

Recommended review path for incidents:

  • agent performs analysis
  • support or engineer reviews evidence
  • issue is classified and routed
  • if accepted, a known-issue record can be created for future retrieval

Recommended review path for balance:

  • agent analyzes telemetry and simulation
  • designer reviews ranked recommendations
  • approved recommendations become draft content changes
  • simulation and validation rerun before publish

Agent outputs should be stored as structured analysis artifacts.

Recommended fields are:

  • analysis id
  • requested by
  • request type
  • source identifiers used
  • evidence references
  • summary
  • recommended actions
  • confidence
  • status such as draft, reviewed, accepted, dismissed
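The recommended fields above map directly onto a stored record. This Python sketch is one possible shape (type names and defaults are assumptions; only the field list comes from this document):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List
import uuid

class AnalysisStatus(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    ACCEPTED = "accepted"
    DISMISSED = "dismissed"

@dataclass
class AnalysisArtifact:
    requested_by: str
    request_type: str               # e.g. "incident" or "balance"
    source_identifiers: List[str]   # RunId / CorrelationId values used
    evidence_references: List[str]  # links back to source systems
    summary: str
    recommended_actions: List[str]
    confidence: float
    analysis_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: AnalysisStatus = AnalysisStatus.DRAFT
    created_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```

Defaulting new artifacts to DRAFT matches the review workflow above: nothing leaves draft status without a human transition.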

Implementation Phasing

Phase 1: Retrieval And Summaries

  • correlation-id driven investigation
  • run-id session reconstruction across UX and API telemetry
  • telemetry and exception summaries
  • no recommendation automation yet

Current Workspace Implementation

The first working operator-facing implementation now exists as a VS Code workspace custom agent rather than an MCP server.

The repo contains:

  • .github/agents/echospire-live-site-debugger.agent.md
  • .github/prompts/investigate-live-site-incident.prompt.md

This is the fastest path to a real, usable debugging workflow because it can use the existing API telemetry endpoints immediately without adding another service boundary first.

The current agent is intentionally read-only in practice:

  • it uses the API health and telemetry endpoints
  • it authenticates through the existing auth API when needed
  • it prefers structured telemetry endpoints before raw KQL
  • it is meant to investigate, not mutate systems

How To Use It In VS Code

There are two supported interaction paths.

Agent Picker

Select EchoSpire Live-Site Debugger from the chat agent picker, then ask it to investigate an incident.

Provide:

  • API base URL
  • CorrelationId or RunId
  • symptom text or ticket summary
  • optional timeframe
  • optional admin credentials if you do not want it to use local seeded dev credentials

Example requests:

  • Investigate http://localhost:5231 correlation 6f1d4d47-8dc1-4ff4-bf7f-3b9f5475c903. The WPF client failed during startup.
  • Investigate http://localhost:5231 run 5b87d04d-2d66-4f88-9bb3-2c2892a44db7 for the last 30 minutes. Find the first failing request and summarize likely cause.

Slash Prompt

Run /Investigate Live-Site Incident and provide the same inputs.

This is the easier path for repeat use because it routes directly to the custom agent with the expected workflow.

Local Dev Authentication

Telemetry endpoints are admin-gated.

For local development, the current API seeder provides these admin accounts:

  • admin / admin
  • test-admin / test

These come from src/EchoSpire.API/Auth/AuthDataSeeder.cs and are appropriate for local-only debugging use.

Current Telemetry Endpoint Contract

The agent should treat the following as the primary operational surface:

  • GET /api/v1/health
  • GET /api/v1/telemetry/health/overview
  • GET /api/v1/telemetry/errors
  • GET /api/v1/telemetry/requests
  • GET /api/v1/telemetry/runs/{runId}
  • POST /api/v1/telemetry/query

The raw query endpoint remains important, but it should stay a last resort after the structured endpoints have narrowed the incident.
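The structured-first, KQL-last discipline can be sketched as a thin client over the documented endpoints. Token acquisition through the auth API is out of scope here (an already-issued admin bearer token is assumed), and the request-body shape for the query endpoint is an assumption, not a documented contract:

```python
import json
import urllib.request

class TelemetryClient:
    """Minimal sketch of a client for the documented telemetry surface.
    Structured endpoints come first; raw KQL is a deliberate last resort."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _url(self, path: str) -> str:
        return self.base_url + path

    def _request(self, method: str, path: str, body=None):
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            self._url(path), data=data, method=method,
            headers={"Authorization": "Bearer " + self.token,
                     "Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Structured endpoints: always try these first.
    def run_timeline(self, run_id: str):
        return self._request("GET", f"/api/v1/telemetry/runs/{run_id}")

    def recent_errors(self):
        return self._request("GET", "/api/v1/telemetry/errors")

    # Raw KQL: only after the structured endpoints have narrowed scope.
    # The {"query": ...} body shape is assumed, not documented here.
    def query(self, kql: str):
        return self._request("POST", "/api/v1/telemetry/query", {"query": kql})
```

Putting the KQL path behind its own method also makes it easy to gate or audit separately from the structured calls.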

MCP Direction

Yes, this can be promoted into an MCP-backed workflow.

That should be the next hardening step, not the first one.

The current custom agent is the right first implementation because:

  • it works immediately with the telemetry endpoints that already exist
  • it avoids introducing another service while the evidence contract is still stabilizing
  • it proves which operator tools are actually useful before we freeze an MCP surface

The recommended MCP follow-up is a small internal telemetry server that exposes narrowly scoped tools such as:

  • login_admin
  • get_health_overview
  • get_recent_errors
  • get_recent_requests
  • get_run_timeline
  • execute_kql

Once that exists, the custom agent should switch from shell-driven HTTP calls to MCP-only tools and drop terminal access for tighter control and better auditability.

Phase 2: Incident Classification

  • classify likely root cause categories
  • link to similar prior issues
  • improve support response speed

Phase 3: Balance Recommendations

  • simulation plus live telemetry comparisons
  • ranked recommendation output
  • human review and draft-only publishing path

Phase 4: Mature Operational Assistant

  • better trend analysis
  • known-issue memory
  • module compatibility diagnosis
  • stronger confidence modeling

Official Recommendation Summary

The canonical direction is:

  • build the live-site AI agent as a retrieval-first operational service
  • split incident triage and balance analysis into separate workflows
  • use telemetry, logs, and version metadata as first-class evidence
  • require human review before any live-impacting change is accepted
  • treat this as an operational assistant, not a gameplay decision-maker