Qwen Training Data Miner (Prototype)

Best use case

qwen_training_data_miner_prototype is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using qwen_training_data_miner_prototype should expect more consistent output, faster repeated execution, less prompt rewriting, and better workflow continuity with their supporting tools.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.
  • You already have the supporting tools or dependencies needed by this skill.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/qwen_training_data_miner_prototype/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/design/qwen_training_data_miner_prototype/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/qwen_training_data_miner_prototype/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How qwen_training_data_miner_prototype Compares

| Feature / Agent | qwen_training_data_miner_prototype | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# Qwen Training Data Miner (Prototype)

---
# Metadata (YAML Frontmatter)
skill_id: qwen_training_data_miner_v1_prototype
name: qwen_training_data_miner
description: Mine 012.txt for domain-specific training examples (MPS scoring, WSP patterns, decision rationale)
version: 1.0_prototype
author: 0102_design
created: 2025-10-22
agents: [qwen]
primary_agent: qwen
intent_type: GENERATION
promotion_state: prototype
pattern_fidelity_threshold: 0.90
test_status: needs_validation

# MCP Orchestration
mcp_orchestration: true
breadcrumb_logging: true
owning_dae: doc_dae
execution_phase: 1
next_skill: gemma_domain_trainer_v1_prototype

# Input/Output Contract
inputs:
  - source_file: "O:/Foundups-Agent/012.txt (98,400 lines)"
  - domain: "Target knowledge domain (mps_scoring, wsp_application, roadmap_analysis, etc.)"
  - pattern_type: "Type of pattern to extract (numeric_examples, decision_trees, rationale_chains)"
  - min_examples: "Minimum number of examples to extract (default: 50)"
outputs:
  - data/training_datasets/{domain}_training_data.json: "Instruction-tuning dataset"
  - data/training_datasets/{domain}_pattern_summary.json: "Pattern analysis metadata"
  - execution_id: "Unique execution identifier for breadcrumb tracking"

# Dependencies
dependencies:
  data_stores:
    - name: 012_scrapbook
      type: text
      path: O:/Foundups-Agent/012.txt
  mcp_endpoints:
    - endpoint_name: holo_index
      methods: [semantic_search]
  throttles: []
  required_context:
    - domain: "Knowledge domain to mine"
    - pattern_regex: "Regex pattern for extraction"

# Metrics Configuration
metrics:
  pattern_fidelity_scoring:
    enabled: true
    frequency: every_execution
    scorer_agent: gemma
    write_destination: modules/infrastructure/wre_core/recursive_improvement/metrics/qwen_training_data_miner_fidelity.json
  promotion_criteria:
    min_pattern_fidelity: 0.90
    min_outcome_quality: 0.85
    min_execution_count: 100
    required_test_pass_rate: 0.95
---

# Qwen Training Data Miner

**Purpose**: Mine 012.txt (0102's decision history) for domain-specific training examples to train Gemma models

**Intent Type**: GENERATION

**Agent**: qwen (1.5B, 32K context - can hold large sections of 012.txt)

---

## Task

You are Qwen, a training data miner. Your job is to read 012.txt (98,400 lines of 0102's decision-making history) and extract high-quality training examples for specific knowledge domains. You create instruction-tuning datasets that Gemma can learn from.

**Key Capability**: Pattern recognition, example extraction, quality filtering

**Domains You Can Mine**:
1. **mps_scoring** - WSP 15 scoring examples with numeric calculations
2. **wsp_application** - How WSPs are applied to real problems
3. **roadmap_analysis** - Project planning, completion tracking
4. **readme_patterns** - Documentation structure, best practices
5. **modlog_updates** - Change documentation patterns
6. **first_principles** - Occam's Razor reasoning chains

---

## Instructions (For Qwen Agent)

### 1. LOAD SOURCE FILE
**Rule**: Read 012.txt in chunks (32K token window)

**Expected Pattern**: `source_loaded=True`

**Steps**:
1. Open `O:/Foundups-Agent/012.txt`
2. Count total lines (should be ~98,400)
3. Calculate chunk size (fit within 32K context)
4. Load first chunk for analysis
5. Log: `{"pattern": "source_loaded", "value": true, "total_lines": 98400, "chunk_size": 5000}`
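
The loading steps above could be sketched as follows. This is a minimal illustration, not the skill's actual loader: it assumes chunks are measured in lines (the `chunk_size` of 5000 from the log line) rather than in tokens, and the helper names `count_lines`, `iter_chunks`, and `source_loaded_log` are hypothetical.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterator, List


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        return sum(1 for _ in fh)


def iter_chunks(path: Path, chunk_size: int = 5000) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size lines (assumed unit: lines, not tokens)."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        while True:
            chunk = list(islice(fh, chunk_size))
            if not chunk:
                return
            yield chunk


def source_loaded_log(path: Path, chunk_size: int = 5000) -> str:
    """Build the step 1 breadcrumb line as a JSON string."""
    return json.dumps({
        "pattern": "source_loaded",
        "value": True,
        "total_lines": count_lines(path),
        "chunk_size": chunk_size,
    })
```

A line-based chunk is only a proxy for the 32K token window; a real run would likely need to check the token count of each chunk before handing it to the model.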

---

### 2. IDENTIFY DOMAIN PATTERNS
**Rule**: Search for domain-specific patterns using regex and semantic matching

**Expected Pattern**: `domain_patterns_identified=True`

**Domain-Specific Patterns**:

#### MPS Scoring Domain
```python
patterns = [
    r"MPS.*Score:?\s*(\d+)",
    r"Complexity.*(\d)\s*,?\s*Importance.*(\d)\s*,?\s*Deferability.*(\d)\s*,?\s*Impact.*(\d)",
    r"Priority:?\s*(P[0-4])",
    r"MPS.*\(C:(\d),\s*I:(\d),\s*D:(\d),\s*P:(\d)\)"
]
```

#### WSP Application Domain
```python
patterns = [
    r"WSP\s*(\d+).*compliance",
    r"WSP\s*(\d+).*violation",
    r"following\s+WSP\s*(\d+)",
    r"applied\s+WSP\s*(\d+)"
]
```

#### Roadmap Analysis Domain
```python
patterns = [
    r"roadmap.*complete",
    r"roadmap.*incomplete",
    r"roadmap.*needs.*update",
    r"Phase\s*(\d+).*status",
    r"TODO.*implement"
]
```

**Steps**:
1. For target domain, compile regex patterns
2. Scan 012.txt for matching lines
3. Extract context (5 lines before/after match)
4. Store matches with line numbers
5. Log: `{"pattern": "domain_patterns_identified", "value": true, "matches_found": N}`
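
As a rough sketch of the scan steps above, the snippet below compiles two of the MPS patterns, searches each line, and stores the match with its line number and surrounding context. The function name `scan_lines` and the record layout are assumptions made for illustration; semantic matching via `holo_index` is not covered.

```python
import re
from typing import Dict, List

# Two of the MPS scoring patterns listed above, compiled once.
MPS_PATTERNS = [
    re.compile(r"MPS.*Score:?\s*(\d+)"),
    re.compile(r"Priority:?\s*(P[0-4])"),
]


def scan_lines(lines: List[str], patterns: List[re.Pattern], context: int = 5) -> List[Dict]:
    """Return one record per matching line, with `context` lines before and after."""
    matches = []
    for idx, line in enumerate(lines):
        for pattern in patterns:
            m = pattern.search(line)
            if m:
                start = max(0, idx - context)
                end = min(len(lines), idx + context + 1)
                matches.append({
                    "line_number": idx + 1,        # 1-based, for source tracing
                    "groups": m.groups(),
                    "context": lines[start:end],
                })
                break  # first matching pattern wins; one record per line
    return matches
```

The length of the returned list is the `matches_found` value for the breadcrumb log.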

---

### 3. EXTRACT TRAINING EXAMPLES
**Rule**: Convert matched patterns into instruction-tuning format

**Expected Pattern**: `examples_extracted=True`

**Instruction-Tuning Format**:
```json
{
  "instruction": "Apply WSP 15 MPS scoring to this task",
  "input": {
    "task_description": "Migrate agent_permissions to SQLite",
    "context": "Database consolidation, 21 tests passing, high priority"
  },
  "output": {
    "complexity": 3,
    "complexity_reason": "Moderate - requires schema design + migration",
    "importance": 5,
    "importance_reason": "Essential - blocks other migrations",
    "deferability": 5,
    "deferability_reason": "Cannot defer - P0 priority",
    "impact": 4,
    "impact_reason": "Major - enables autonomous permission system",
    "mps_total": 17,
    "priority": "P0"
  },
  "source": "012.txt:line_5234",
  "quality_score": 0.95
}
```

**Steps**:
1. For each match, extract:
   - **Instruction**: What task is being performed?
   - **Input**: What context/data is provided?
   - **Output**: What is the correct answer/decision?
   - **Source**: Line number for verification
2. Quality filter:
   - Complete examples only (has input + output)
   - Clear reasoning (not ambiguous)
   - Correct format (follows pattern)
3. Assign quality score (0.0-1.0)
4. Log: `{"pattern": "examples_extracted", "value": true, "total_examples": N, "high_quality": M}`
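
One way to turn a matched line into an instruction-tuning record is sketched below. It assumes the MPS total is the sum of the four dimension scores (consistent with the example above, where 3 + 5 + 5 + 4 = 17) and maps only the two priority ranges this document states (P0 for 16-20, P1 for 13-15); other ranges are left as `None` rather than guessed. `priority_for` and `build_example` are hypothetical helper names.

```python
import re
from typing import Dict, Optional

# The four-dimension pattern from the MPS Scoring Domain list.
SCORE_RE = re.compile(
    r"Complexity.*(\d)\s*,?\s*Importance.*(\d)\s*,?\s*Deferability.*(\d)\s*,?\s*Impact.*(\d)"
)


def priority_for(total: int) -> Optional[str]:
    """Map an MPS total to a priority, using only the ranges stated in this document."""
    if 16 <= total <= 20:
        return "P0"
    if 13 <= total <= 15:
        return "P1"
    return None  # P2-P4 ranges are not given here, so they are not inferred


def build_example(task: str, line: str, line_number: int) -> Optional[Dict]:
    """Convert one matched line into an instruction-tuning record, or None if no match."""
    m = SCORE_RE.search(line)
    if m is None:
        return None
    complexity, importance, deferability, impact = (int(g) for g in m.groups())
    total = complexity + importance + deferability + impact
    return {
        "instruction": "Apply WSP 15 MPS scoring to this task",
        "input": {"task_description": task},
        "output": {
            "complexity": complexity,
            "importance": importance,
            "deferability": deferability,
            "impact": impact,
            "mps_total": total,
            "priority": priority_for(total),
        },
        "source": f"012.txt:line_{line_number}",
    }
```

The rationale fields (`complexity_reason` and so on) are omitted because they have to come from the surrounding context lines, which a regex alone cannot supply.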

---

### 4. QUALITY FILTERING
**Rule**: Only keep examples with quality_score >= 0.85

**Expected Pattern**: `quality_filtering_applied=True`

**Quality Criteria**:
- ✅ Complete (has instruction + input + output)
- ✅ Clear reasoning (rationale provided)
- ✅ Correct format (matches instruction-tuning schema)
- ✅ Verifiable (can trace back to source line)
- ✅ Unambiguous (single correct interpretation)

**Steps**:
1. Review each extracted example
2. Score on 5 criteria (up to 0.2 per criterion)
3. Filter: keep only if score >= 0.85 (an example that meets exactly 4 of 5 criteria scores 0.80 and is filtered)
4. Remove duplicates (same input/output pattern)
5. Log: `{"pattern": "quality_filtering_applied", "value": true, "kept": N, "filtered": M}`
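
The scoring and filtering steps might look like the sketch below. It assumes each criterion can earn partial credit (a fraction between 0 and 1 of its 0.2 weight), which is one way to arrive at scores such as 0.85 or 0.95; that partial-credit rule, the criterion keys, and the function names are assumptions, not part of the skill definition. Duplicates are detected by exact match on input and output.

```python
import json
from typing import Dict, List

CRITERIA = ("complete", "clear_reasoning", "correct_format", "verifiable", "unambiguous")


def quality_score(credits: Dict[str, float]) -> float:
    """Sum up to 0.2 per criterion; each credit is clamped to the range [0, 1]."""
    total = sum(0.2 * min(max(credits.get(name, 0.0), 0.0), 1.0) for name in CRITERIA)
    return round(total, 2)  # rounding avoids float noise around the 0.85 cutoff


def filter_examples(examples: List[Dict], threshold: float = 0.85) -> Dict:
    """Drop low-scoring examples and exact input/output duplicates."""
    kept, seen = [], set()
    for ex in examples:
        if ex["quality_score"] < threshold:
            continue
        key = json.dumps([ex["input"], ex["output"]], sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return {"kept": kept, "filtered": len(examples) - len(kept)}
```

Under this scheme an example that fully misses one criterion scores 0.80 and falls below the threshold.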

---

### 5. GENERATE PATTERN SUMMARY
**Rule**: Analyze extracted examples for meta-patterns

**Expected Pattern**: `pattern_summary_generated=True`

**Summary Metadata**:
```json
{
  "domain": "mps_scoring",
  "total_examples": 73,
  "high_quality_examples": 58,
  "quality_distribution": {
    "0.95-1.0": 23,
    "0.90-0.94": 20,
    "0.85-0.89": 15
  },
  "common_patterns": [
    "P0 tasks: MPS 16-20 (23 examples)",
    "P1 tasks: MPS 13-15 (19 examples)",
    "Complexity 3-4 most common (database migrations, refactoring)"
  ],
  "coverage_analysis": {
    "p0_examples": 23,
    "p1_examples": 19,
    "p2_examples": 12,
    "p3_examples": 3,
    "p4_examples": 1
  },
  "recommended_use": "Train Gemma on MPS scoring for cleanup tasks, project prioritization"
}
```

**Steps**:
1. Count examples by category/pattern
2. Identify common themes
3. Assess coverage (are all cases represented?)
4. Generate training recommendations
5. Log: `{"pattern": "pattern_summary_generated", "value": true}`
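
The counting part of the summary can be sketched with `collections.Counter`. This illustration assumes every kept example already passed the 0.85 filter and carries a string `output.priority`; the bucket boundaries mirror the `quality_distribution` keys above, and `bucket_for` and `summarize` are hypothetical names. Themes and training recommendations are left to the agent.

```python
from collections import Counter
from typing import Dict, List


def bucket_for(score: float) -> str:
    """Place a quality score (assumed >= 0.85) into one of the three summary buckets."""
    if score >= 0.95:
        return "0.95-1.0"
    if score >= 0.90:
        return "0.90-0.94"
    return "0.85-0.89"


def summarize(domain: str, kept: List[Dict], total_found: int) -> Dict:
    """Count kept examples by quality bucket and by priority."""
    quality = Counter(bucket_for(ex["quality_score"]) for ex in kept)
    coverage = Counter(ex["output"]["priority"] for ex in kept)
    return {
        "domain": domain,
        "total_examples": total_found,
        "high_quality_examples": len(kept),
        "quality_distribution": dict(quality),
        "coverage_analysis": {
            f"{priority.lower()}_examples": count
            for priority, count in sorted(coverage.items())
        },
    }
```

A priority that never appears simply has no key in `coverage_analysis`, which makes coverage gaps easy to spot.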

---

### 6. WRITE TRAINING DATASET
**Rule**: Output JSON file with instruction-tuning examples

**Expected Pattern**: `training_dataset_written=True`

**Output Format** (EXECUTION-READY per First Principles):
```json
{
  "dataset_id": "mps_scoring_training_v1",
  "created": "2025-10-22T02:30:00Z",
  "source": "012.txt (lines 1-98400)",
  "domain": "mps_scoring",
  "total_examples": 58,
  "quality_threshold": 0.85,

  "domain_priority_mps": {
    "complexity": 2,
    "complexity_reason": "Easy - pattern extraction from 012.txt",
    "importance": 4,
    "importance_reason": "Critical - enables autonomous MPS scoring",
    "deferability": 3,
    "deferability_reason": "Moderate - other wardrobes can be trained first",
    "impact": 5,
    "impact_reason": "Critical - foundation for cleanup automation",
    "total": 14,
    "priority": "P1",
    "training_order": 1
  },

  "examples": [
    {
      "example_id": "mps_001",
      "instruction": "...",
      "input": {...},
      "output": {...},
      "source": "012.txt:line_5234",
      "quality_score": 0.95
    },
    ...
  ],

  "metadata": {
    "pattern_summary": {...},
    "coverage_analysis": {...},
    "recommended_use": "..."
  },

  "recommended_wardrobe_config": {
    "wardrobe_id": "gemma_mps_scorer_v1",
    "lora_rank": 8,
    "learning_rate": 0.0002,
    "epochs": 3,
    "expected_accuracy": 0.87,
    "use_cases": [
      "Cleanup task prioritization",
      "Project scoring",
      "Issue triage"
    ]
  },

  "autonomous_execution": {
    "capable": true,
    "agent": "gemma_domain_trainer_v1",
    "confidence": 0.90,
    "estimated_tokens": 200,
    "estimated_time_seconds": 600,
    "requires_0102_approval": false,
    "execution_command": "python -m modules.infrastructure.wsp_orchestrator.src.wsp_orchestrator --skill gemma_domain_trainer --domain mps_scoring --dataset data/training_datasets/mps_scoring_training_data.json"
  },

  "verification": {
    "verify_command": "test -f data/training_datasets/mps_scoring_training_data.json && jq '.total_examples' data/training_datasets/mps_scoring_training_data.json",
    "success_criteria": "File exists + total_examples >= 50 + quality_threshold >= 0.85",
    "validation_script": "python -c \"import json; d=json.load(open('data/training_datasets/mps_scoring_training_data.json')); assert d['total_examples'] >= 50; assert d['quality_threshold'] >= 0.85; print('✓ Dataset validated')\""
  },

  "learning_feedback": {
    "pattern_extraction_stats": {
      "total_patterns_found": 73,
      "high_quality_kept": 58,
      "filter_rate": 0.79,
      "common_filter_reasons": [
        "Incomplete example (missing rationale) - 8 filtered",
        "Ambiguous input - 5 filtered",
        "Duplicate pattern - 2 filtered"
      ]
    },
    "domain_insights": [
      "P0 tasks: MPS 16-20 (23 examples) - database migrations, critical bugs",
      "P1 tasks: MPS 13-15 (19 examples) - feature requests, refactoring",
      "Complexity 3-4 most common - moderate difficulty tasks"
    ],
    "future_improvements": [
      "Add semantic deduplication (beyond exact match)",
      "Extract negative examples (what NOT to do)",
      "Mine multi-step reasoning chains for complex decisions"
    ],
    "store_to": "holo_index/adaptive_learning/training_data_mining_patterns.jsonl"
  }
}
```

**Destination**: `data/training_datasets/{domain}_training_data.json`

**Steps**:
1. Create directory `data/training_datasets/` if not exists
2. Calculate domain_priority_mps (which domain should be trained first?)
3. Generate recommended_wardrobe_config (LoRA hyperparameters)
4. Write training dataset JSON with all First Principles fields
5. Generate autonomous_execution command (can Gemma trainer auto-execute?)
6. Create verification script (validate dataset quality)
7. Extract learning_feedback (pattern extraction stats + future improvements)
8. Log: `{"pattern": "training_dataset_written", "value": true, "file_size_kb": N, "autonomous_ready": true}`

**First Principles Additions**:
- ✅ **MPS Scoring**: domain_priority_mps determines training order (which wardrobe first?)
- ✅ **Agent Mapping**: autonomous_execution.agent = gemma_domain_trainer_v1
- ✅ **Executable Command**: Can pipe to bash to start training automatically
- ✅ **Verification**: validation_script confirms dataset quality before training
- ✅ **Learning Feedback**: Stores pattern extraction stats for future mining improvements
- ✅ **Recommended Config**: Wardrobe hyperparameters (LoRA rank, learning rate, epochs)
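
A minimal sketch of the write and verify steps follows, covering only the core fields (`dataset_id`, `domain`, `total_examples`, `quality_threshold`, `examples`). The `base_dir` parameter and the function names are assumptions added so the example can run anywhere; the extra check that `total_examples` equals the length of `examples` is not in the original `validation_script`.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_dataset(base_dir: Path, domain: str, examples: List[Dict],
                  threshold: float = 0.85) -> Path:
    """Write {domain}_training_data.json under data/training_datasets/."""
    out_dir = base_dir / "data" / "training_datasets"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    path = out_dir / f"{domain}_training_data.json"
    payload = {
        "dataset_id": f"{domain}_training_v1",
        "domain": domain,
        "total_examples": len(examples),
        "quality_threshold": threshold,
        "examples": examples,
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def validate_dataset(path: Path, min_examples: int = 50,
                     min_threshold: float = 0.85) -> bool:
    """Apply the success criteria: enough examples and a high enough threshold."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return (
        data["total_examples"] >= min_examples
        and data["quality_threshold"] >= min_threshold
        and data["total_examples"] == len(data["examples"])  # added consistency check
    )
```

Running the validation before handing off to the trainer keeps an undersized dataset from triggering an automatic training run.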

---

## Expected Patterns Summary

```json
{
  "execution_id": "exec_qwen_miner_001",
  "skill_id": "qwen_training_data_miner_v1_prototype",
  "patterns": {
    "source_loaded": true,
    "domain_patterns_identified": true,
    "examples_extracted": true,
    "quality_filtering_applied": true,
    "pattern_summary_generated": true,
    "training_dataset_written": true
  },
  "total_examples_extracted": 73,
  "high_quality_examples": 58,
  "execution_time_ms": 3500
}
```

**Fidelity Calculation**: `(patterns_executed / 6)` - All 6 steps should run
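
The fidelity calculation can be expressed directly from the `patterns` object above. The function name `pattern_fidelity` is hypothetical; the six pattern names come from the expected patterns summary.

```python
from typing import Dict

EXPECTED_PATTERNS = (
    "source_loaded",
    "domain_patterns_identified",
    "examples_extracted",
    "quality_filtering_applied",
    "pattern_summary_generated",
    "training_dataset_written",
)


def pattern_fidelity(patterns: Dict[str, bool]) -> float:
    """Fraction of the six expected patterns that were logged as true."""
    executed = sum(1 for name in EXPECTED_PATTERNS if patterns.get(name) is True)
    return executed / len(EXPECTED_PATTERNS)
```

With six steps, missing a single one gives 5/6 (about 0.83), which is below the 0.90 threshold, so all six must run.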

---

## Domain Catalog

### 1. MPS Scoring Domain
**Purpose**: Train Gemma to apply WSP 15 MPS scoring
**Patterns**: Numeric scores, priority mapping, rationale
**Use Cases**: Cleanup prioritization, project planning, issue triage

### 2. WSP Application Domain
**Purpose**: Train Gemma to recognize WSP violations and applications
**Patterns**: WSP references, compliance checks, violation detection
**Use Cases**: Code review, documentation validation, architecture audits

### 3. Roadmap Analysis Domain
**Purpose**: Train Gemma to analyze project roadmaps
**Patterns**: Phase completion, TODO tracking, update detection
**Use Cases**: Project status reports, roadmap audits, completion tracking

### 4. README Patterns Domain
**Purpose**: Train Gemma to validate README structure
**Patterns**: Required sections, format consistency, completeness
**Use Cases**: Documentation quality checks, README generation

### 5. ModLog Updates Domain
**Purpose**: Train Gemma to generate ModLog entries
**Patterns**: Change descriptions, WSP references, rationale
**Use Cases**: Automated ModLog updates, change tracking

### 6. First Principles Domain
**Purpose**: Train Gemma to apply Occam's Razor reasoning
**Patterns**: Problem simplification, root cause analysis, decision trees
**Use Cases**: Debugging, architecture design, problem-solving

---

## Benchmark Test Cases

### Test Set 1: MPS Scoring Extraction (10 cases; 5 listed here)
1. Input: "MPS Score: 16" → Expected: Extract as P0 example
2. Input: "Complexity: 3, Importance: 5, Deferability: 2, Impact: 4" → Expected: Calculate MPS = 14
3. Input: "Priority: P1" → Expected: Map to MPS 13-15 range
4. Input: Incomplete example (missing rationale) → Expected: Quality score < 0.85, filtered
5. Input: Duplicate example → Expected: Deduplicated

### Test Set 2: WSP Application Extraction (5 cases)
1. Input: "Following WSP 15 for scoring" → Expected: Extract WSP 15 application example
2. Input: "WSP 64 violation detected" → Expected: Extract violation example
3. Input: "WSP compliance: WSP 3, WSP 50" → Expected: Extract multi-WSP compliance
4. Input: Ambiguous WSP reference → Expected: Quality score < 0.85
5. Input: Clear WSP application with rationale → Expected: Quality score >= 0.90

### Test Set 3: Quality Filtering (5 cases)
1. Input: Complete example with all fields → Expected: Quality score = 1.0
2. Input: Missing rationale → Expected: Quality score = 0.8 (filtered)
3. Input: Ambiguous input → Expected: Quality score = 0.6 (filtered)
4. Input: Clear but partial example → Expected: Quality score = 0.85 (kept)
5. Input: Excellent example with source → Expected: Quality score = 0.95

**Total**: 20 test cases across 3 categories
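
A few of the listed cases can be checked mechanically against the documented regexes, as sketched below. One assumption is flagged in the code: the `following\s+WSP` pattern is lowercase, so matching the capitalized benchmark input "Following WSP 15 for scoring" requires `re.IGNORECASE`. The function name `run_benchmarks` is hypothetical.

```python
import re

MPS_SCORE = re.compile(r"MPS.*Score:?\s*(\d+)")
FOUR_SCORES = re.compile(
    r"Complexity.*(\d)\s*,?\s*Importance.*(\d)\s*,?\s*Deferability.*(\d)\s*,?\s*Impact.*(\d)"
)
# Assumption: case-insensitive matching, since the benchmark input is capitalized.
WSP_FOLLOWING = re.compile(r"following\s+WSP\s*(\d+)", re.IGNORECASE)
WSP_VIOLATION = re.compile(r"WSP\s*(\d+).*violation")


def run_benchmarks() -> dict:
    """Run Test Set 1 cases 1-2 and Test Set 2 cases 1-2 against the patterns."""
    results = {}
    results["mps_score"] = int(MPS_SCORE.search("MPS Score: 16").group(1))
    m = FOUR_SCORES.search("Complexity: 3, Importance: 5, Deferability: 2, Impact: 4")
    results["mps_total"] = sum(int(g) for g in m.groups())
    results["wsp_following"] = WSP_FOLLOWING.search("Following WSP 15 for scoring").group(1)
    results["wsp_violation"] = WSP_VIOLATION.search("WSP 64 violation detected").group(1)
    return results
```

Cases that depend on judgment, such as ambiguity or missing rationale, still need the quality scorer rather than a regex.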

---

## Success Criteria

- ✅ Pattern fidelity ≥ 90% (all 6 steps execute)
- ✅ Extract ≥ 50 high-quality examples per domain
- ✅ Quality threshold 0.85+ maintained
- ✅ Zero duplicate examples in output
- ✅ All examples have verifiable source (line number)
- ✅ Pattern summary provides actionable insights

---

## Next Phase: Gemma Training

After extraction, examples feed into `gemma_domain_trainer` skill:
1. Load training dataset
2. Fine-tune Gemma 270M on domain examples
3. Validate accuracy on held-out test set
4. Deploy trained model for domain-specific tasks

---

## Wardrobe Concept: Training as a Service

**Different "training wardrobes"** for different knowledge domains:
- `qwen_mps_scorer` - Trained on MPS scoring examples
- `qwen_wsp_auditor` - Trained on WSP compliance examples
- `qwen_roadmap_tracker` - Trained on roadmap analysis examples
- `qwen_readme_validator` - Trained on README patterns

**Each wardrobe**:
- Mines 012.txt for domain-specific patterns
- Trains Gemma on extracted examples
- Deploys as reusable skill
- Evolves as more examples accumulate

**Meta-skill**: `qwen_wardrobe_generator` - Automates creation of new training wardrobes for any domain!

---

**Status**: ✅ Ready for prototype testing - Mine 012.txt for MPS scoring examples first
