analyzing-mlflow-session

Analyzes an MLflow session — a sequence of traces from a multi-turn chat conversation or interaction. Use when the user asks to debug a chat conversation, review session or chat history, find where a multi-turn chat went wrong, or analyze patterns across turns. Triggers on "analyze this session", "what happened in this conversation", "debug session", "review chat history", "where did this chat go wrong", "session traces", "analyze chat", "debug this chat".

38 stars

bymsbaek

View on GitHub Installation ↓

Best use case

analyzing-mlflow-session is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using analyzing-mlflow-session should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/analyze-mlflow-chat-session/SKILL.md --create-dirs "https://raw.githubusercontent.com/msbaek/dotfiles/main/.claude/skills/analyze-mlflow-chat-session/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/analyze-mlflow-chat-session/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How analyzing-mlflow-session Compares

Feature / Agent	analyzing-mlflow-session	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Analyzing an MLflow Chat Session

## What is a Session?

A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.

The session ID is stored in trace metadata under the key `mlflow.trace.session`. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.

## Reconstructing the Conversation

Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. **Do NOT fetch full traces for every turn** — use `--extract-fields` on the search command instead.

**Step 1: Discover the schema.** First, find a trace ID from the session, then fetch its full JSON to inspect the schema:

```bash
# Get the first trace in the session
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id' \
  --output json \
  --max-results 1 > /tmp/first_trace.json

# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
  --trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json
```

Find the **root span** — the span with `parent_span_id` equal to `null` (i.e., it has no parent). This is the top-level operation in the trace:

```bash
# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json
```

Examine its `attributes` dict to identify which keys hold the user input and system output. These could be:

- **MLflow standard attributes**: `mlflow.spanInputs` and `mlflow.spanOutputs` (set by the MLflow Python client)
- **Custom attributes**: Application-specific keys set via `@mlflow.trace` or `mlflow.start_span()` with custom attribute logging
- **Third-party OTel attributes**: Keys following GenAI Semantic Conventions, OpenInference, or other instrumentation conventions

The structure of these values also varies by application (e.g., a `query` string, a `messages` array, a dict with multiple fields). Inspect the actual attribute values to understand the format.

**If the root span has empty or missing inputs/outputs**, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children — find the closest span to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn:

The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in `mlflow.spanInputs`/`mlflow.spanOutputs`) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:

```bash
# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)

# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json
```

Also check the first trace's assessments. **Session-level assessments are attached to the first trace in the session** — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.

Both types appear in `.info.assessments`. Session-level assessments are identified by the presence of `mlflow.trace.session` in their `metadata` field:

```bash
# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
```

**Assessment errors are not trace errors.** If an assessment has a `feedback.error` field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.

**Always consult the rationale when interpreting assessment values.** The `value` alone can be misleading — for example, a `user_frustration` assessment with `value: "no"` could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration *is* present), depending on how the scorer was configured. The `.rationale` field (a top-level assessment field, **not** nested under `.feedback`) explains what the value means in context. Include rationale when extracting assessments:

```bash
jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json
```

**Step 2: Extract across all session traces.** Once you know which attribute keys hold inputs and outputs, search for all traces in the session using `--extract-fields` to pull those fields along with assessments (see [Handling CLI Output](#handling-cli-output) for why output is written to a file):

```bash
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
  --output json \
  --max-results 100 > /tmp/session_traces.json
```

Then use bash commands (e.g., `jq`, `wc`, `head`) on the file to analyze it.

The `--extract-fields` example above uses `mlflow.traceInputs`/`mlflow.traceOutputs` from trace metadata — adjust the field paths based on what you discovered in step 1.

Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):

```bash
# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
  trace_id: .info.trace_id,
  time: .info.request_time,
  state: .info.state,
  assessments: [.info.assessments[]? | select(.feedback.error == null) | {
    name: .assessment_name,
    value: .feedback.value
  }]
}' /tmp/session_traces.json
```

**CLI syntax notes:**

- **`--experiment-id` is required** for all `mlflow traces search` commands. The command will fail without it.
- Metadata keys containing dots **must** be escaped with backticks in filter strings and extract-fields: `` metadata.`mlflow.trace.session` ``
- **Shell quoting**: Backticks inside **double quotes** are interpreted by bash as command substitution (e.g., bash will try to run `` `mlflow.trace.session` `` as a command). Always use **single quotes** for the outer string when the value contains backticks. For example: `--filter-string 'metadata.\`mlflow.trace.session\` = "value"'`
- `--max-results` defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.

## Handling CLI Output

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.

**Safe approach (always works):**
```bash
# Step 1: Save to file
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  [...] \
  --output json > /tmp/output.json

# Step 2: Process the file
cat /tmp/output.json | jq '.traces[0].info.trace_id'
head -50 /tmp/output.json
wc -l /tmp/output.json
```

**Never pipe MLflow CLI output directly** (e.g., `mlflow traces search ... | jq '.'`). This can silently produce no output. Always redirect to a file first, then run commands on the file.

To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:

```bash
mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json
```


## Codebase Correlation

- **Session ID assignment**: Search the codebase for where `mlflow.trace.session` is set to understand how sessions are created — per user login, per browser tab, per explicit "new conversation" action, etc.
- **Context window management**: Look for how the application constructs the message history passed to the LLM at each turn. Common patterns include sliding window (last N messages), summarization of older turns, or full history. This implementation determines what context the model sees and is a frequent source of multi-turn failures.
- **Memory and state**: Some applications maintain state across turns beyond message history (e.g., extracted entities, user preferences, accumulated tool results). Search for how this state is stored and passed between turns.

## Reference Scripts

The `scripts/` subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).

- **`scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID>`** — Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
- **`scripts/inspect_turn.sh <TRACE_ID>`** — Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

## Example: Wrong Answer on Chat Turn 5

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.

**1. Discover the schema and reconstruct the conversation.**

Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case, `mlflow.spanInputs` contains the user query and `mlflow.spanOutputs` contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.

**2. Check if the error originated in an earlier turn.**

Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.

**3. Analyze the root-cause turn as a single trace.**

Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.

**4. Recommendations.**

- Fix the retriever's data source to exclude or update outdated documents.
- Add per-turn assessments to detect errors before they propagate across the conversation.
- Consider implementing conversation-level error detection (e.g., checking consistency of answers across turns).

Related Skills

session-handoff

from msbaek/dotfiles

세션 종료 시 plan/INDEX/메모리/저널을 업데이트하고 다음 세션 재개 프롬프트 제공

searching-mlflow-docs

from msbaek/dotfiles

Searches and retrieves MLflow documentation from the official docs site. Use when the user asks about MLflow features, APIs, integrations (LangGraph, LangChain, OpenAI, etc.), tracing, tracking, or requests to look up MLflow documentation. Triggers on "how do I use MLflow with X", "find MLflow docs for Y", "MLflow API for Z".

retrieving-mlflow-traces

from msbaek/dotfiles

Retrieves MLflow traces using CLI or Python API. Use when the user asks to get a trace by ID, find traces, filter traces by status/tags/metadata/execution time, query traces, or debug failed traces. Triggers on "get trace", "search traces", "find failed traces", "filter traces by", "traces slower than", "query MLflow traces".

querying-mlflow-metrics

from msbaek/dotfiles

Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.

instrumenting-with-mlflow-tracing

from msbaek/dotfiles

Instruments Python and TypeScript code with MLflow Tracing for observability. Must be loaded when setting up tracing as part of any workflow including agent evaluation. Triggers on adding tracing, instrumenting agents/LLM apps, getting started with MLflow tracing, tracing specific frameworks (LangGraph, LangChain, OpenAI, DSPy, CrewAI, AutoGen), or when another skill references tracing setup. Examples - "How do I add tracing?", "Instrument my agent", "Trace my LangChain app", "Set up tracing for evaluation"

find-session

from msbaek/dotfiles

자연어로 이전 Claude Code 세션을 검색하고 요약하는 스킬. agf 키워드 검색과 qmd 의미 검색을 병행하여 정확도를 높이고, git 커밋 조사와 AI 요약까지 제공한다. "세션 찾아줘", "지난번 작업 찾기", "이전에 했던 ~~ 세션", "~~ 작업한 세션 보여줘", "세션 검색", "작업 이력", "resume session", "이전 대화 찾기", "그때 세션", "그 세션" 등의 요청 시 반드시 이 스킬을 사용할 것. /agf나 qmd-search를 직접 호출하기 전에 이 스킬을 먼저 고려할 것.

databricks-mlflow-evaluation

from msbaek/dotfiles

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

analyzing-mlflow-trace

from msbaek/dotfiles

Analyzes a single MLflow trace to answer a user query about it. Use when the user provides a trace ID and asks to debug, investigate, find issues, root-cause errors, understand behavior, or analyze quality. Triggers on "analyze this trace", "what went wrong with this trace", "debug trace", "investigate trace", "why did this trace fail", "root cause this trace".

weekly-newsletter

from msbaek/dotfiles

Obsidian vault에서 이번 주(토~금) 작성/수정된 글들을 모아 뉴스레터 생성. 서브 에이전트 기반 병렬 처리로 메인 컨텍스트 절약. 기술적, 리더십적으로 외부에 공유할 만한 내용을 선별하여 정리. "뉴스레터 만들어줘", "이번 주 글 정리해줘", "weekly digest" 등의 요청 시 자동 적용.

vis

from msbaek/dotfiles

Vault Intelligence System (vis) CLI를 활용한 Obsidian vault 시맨틱 검색, 자동 태깅, MOC 생성, 관련 문서 연결, 주제별 문서 연결, 주제 수집, 태그 통계, 지식 공백 분석, 중복 감지, 학습 리뷰 등 vault 지식 관리 전반을 지원하는 skill. vault 검색, 문서 정리, 태그, MOC, 관련 문서, 주제 수집, 중복 검사, 학습 리뷰, 지식 공백, 클러스터링, 인덱싱, 주제별 문서 연결, 태그 통계 관련 작업 시 자동 적용.

spark-python-data-source

from msbaek/dotfiles

Build custom Python data sources for Apache Spark using the PySpark DataSource API — batch and streaming readers/writers for external systems. Use this skill whenever someone wants to connect Spark to an external system (database, API, message queue, custom protocol), build a Spark connector or plugin in Python, implement a DataSourceReader or DataSourceWriter, pull data from or push data to a system via Spark, or work with the PySpark DataSource API in any way. Even if they just say "read from X in Spark" or "write DataFrame to Y" and there's no native connector, this skill applies.

recall

from msbaek/dotfiles

Load context from vault memory. Temporal queries (yesterday, last week, session history) use agf (history.jsonl) for fast session lookup. Topic queries use vis semantic search. "recall graph" generates interactive temporal graph of sessions and files. Every recall ends with "One Thing" - the single highest-leverage next action synthesized from results. Use when user says "recall", "what did we work on", "load context about", "remember when we", "prime context", "yesterday", "what was I doing", "last week", "session history", "recall graph", "session graph".