e2e-medallion-architecture

Implement end-to-end Medallion Architecture (Bronze/Silver/Gold) lakehouse patterns in Microsoft Fabric using PySpark, Delta Lake, and Fabric Pipelines. Use when the user wants to: (1) design a Bronze/Silver/Gold data lakehouse, (2) set up multi-layer workspace with lakehouses for each tier, (3) build ingestion-to-analytics pipelines with data quality enforcement, (4) optimize Spark configurations per medallion layer, (5) orchestrate Bronze-to-Silver-to-Gold flows via notebooks. Triggers: "medallion architecture", "bronze silver gold", "lakehouse layers", "e2e data pipeline", "end-to-end lakehouse", "data lakehouse pattern", "multi-layer lakehouse", "build medallion", "setup medallion".

245 stars

Best use case

e2e-medallion-architecture is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using e2e-medallion-architecture should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/e2e-medallion-architecture/SKILL.md --create-dirs "https://raw.githubusercontent.com/microsoft/skills-for-fabric/main/skills/e2e-medallion-architecture/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/e2e-medallion-architecture/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How e2e-medallion-architecture Compares

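| Feature / Agent | e2e-medallion-architecture | Standard Approach |
|-----------------|----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |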

Frequently Asked Questions

What does this skill do?

It guides an AI agent through implementing an end-to-end Medallion Architecture (Bronze/Silver/Gold) lakehouse in Microsoft Fabric using PySpark, Delta Lake, and Fabric Pipelines: designing the layers, setting up a lakehouse per tier, building ingestion-to-analytics pipelines with data quality enforcement, tuning Spark configurations per layer, and orchestrating Bronze-to-Silver-to-Gold flows via notebooks.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

> **Update Check — ONCE PER SESSION (mandatory)**
> The first time this skill is used in a session, run the **check-updates** skill before proceeding.
> - **GitHub Copilot CLI / VS Code**: invoke the `check-updates` skill.
> - **Claude Code / Cowork / Cursor / Windsurf / Codex**: compare local vs remote package.json version.
> - Skip if the check was already performed earlier in this session.

> **CRITICAL NOTES**
> 1. To find workspace details (including the ID) from a workspace name: list all workspaces, then filter the result with JMESPath
> 2. To find item details (including the ID) from a workspace ID, item type, and item name: list all items of that type in that workspace, then filter the result with JMESPath
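
The lookup described in the notes above can be sketched in plain Python. This is a minimal illustration, assuming the list call returns a body with a `value` array whose entries carry `displayName` and `id`; the helper name `find_id_by_name` and the sample IDs are hypothetical. On the CLI, a JMESPath expression such as `value[?displayName=='sales-silver-dev'].id | [0]` expresses the same filter.

```python
from typing import Optional


def find_id_by_name(items: list, display_name: str) -> Optional[str]:
    """Return the id of the first item whose displayName matches, else None."""
    for item in items:
        if item.get("displayName") == display_name:
            return item.get("id")
    return None


# Hypothetical, trimmed "list workspaces" response body (IDs are placeholders).
response = {
    "value": [
        {"id": "ws-001", "displayName": "sales-bronze-dev"},
        {"id": "ws-002", "displayName": "sales-silver-dev"},
    ]
}

silver_workspace_id = find_id_by_name(response["value"], "sales-silver-dev")
```

The same helper works for items once the list is restricted to one item type in one workspace.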

# End-to-End Medallion Architecture

## Prerequisite Knowledge

Read these companion documents — they contain the foundational context this skill depends on:

- [COMMON-CORE.md](../../common/COMMON-CORE.md) — Fabric REST API patterns, authentication, token audiences, item discovery
- [COMMON-CLI.md](../../common/COMMON-CLI.md) — `az rest`, `az login`, token acquisition, Fabric REST via CLI
- [SPARK-AUTHORING-CORE.md](../../common/SPARK-AUTHORING-CORE.md) — Notebook deployment, lakehouse creation, job execution
- [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md) — **Required for notebook creation** — `.ipynb` structure requirements, cell format, `getDefinition`/`updateDefinition` workflow

For Spark-specific optimization details, see [data-engineering-patterns.md](../spark-authoring-cli/resources/data-engineering-patterns.md).

---

## Architecture Overview

**Medallion Architecture** is a data lakehouse pattern with three progressive layers:

| Layer | Purpose | Optimization Profile | Use Case |
|-------|---------|---------------------|----------|
| **Bronze** (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| **Silver** (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| **Gold** (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |

- **Bronze**: Schema-on-read — flexible schema, Delta time travel supports audit and rollback
- **Silver**: Schema enforcement — reject non-conforming writes; handle schema evolution with `mergeSchema` when sources change
- **Gold**: Strict schema governance — curated, business-approved datasets only

---

## Must/Prefer/Avoid

### MUST DO
- Create a **separate lakehouse** for each medallion layer (Bronze, Silver, Gold)
- Add **metadata columns** in Bronze: ingestion timestamp, source file, batch ID
- Apply **data quality rules** in the Bronze-to-Silver transformation (deduplication, null handling, range validation)
- Use **Delta Lake format** for all medallion layer tables
- Use **partition-aware overwrite** in Silver/Gold writes to avoid reprocessing unchanged data
- Include **validation steps** after each layer (row counts, schema checks, anomaly detection)
- Follow the **`.ipynb` validation + Fabric nuances** in [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md#ipynb-validation--fabric-nuances) when creating notebooks via REST API — every code cell must include `"outputs": []` and `"execution_count": null`
- **Default to separate workspaces per layer** for governance and access control: one workspace each for Bronze, Silver, and Gold
- **Complete the full end-to-end flow** — do not stop after creating notebooks; always bind lakehouses, execute notebooks sequentially (Bronze → Silver → Gold), verify results, and connect Power BI to the Gold layer unless the user explicitly requests a partial setup

### PREFER
- Incremental processing (watermark pattern) over full refresh
- Separate notebooks per layer for independent testing and debugging
- ZORDER on frequently filtered columns in Gold tables
- Running OPTIMIZE after writes in Silver and Gold layers
- Environment-specific Spark configs (write-heavy for Bronze, balanced for Silver, read-heavy for Gold)
- OneLake shortcuts to expose Gold data to consumer workspaces without duplication
- Clear layer ownership: engineers own Bronze/Silver, analysts own Gold
- Fabric Variable Libraries to centralize paths and configuration across layers
- Multi-workspace deployment patterns for medium/high governance requirements (Bronze/Silver/Gold in separate workspaces)
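
The watermark pattern preferred above can be illustrated without Spark. The sketch below is an assumption-laden outline, not the skill's implementation: `select_incremental` and the sample rows are hypothetical, and in a notebook the same comparison would be a DataFrame filter on the tracking column.

```python
from datetime import datetime


def select_incremental(rows, last_watermark, ts_key="ingestion_timestamp"):
    """Return (rows newer than the watermark, new watermark value)."""
    new_rows = [r for r in rows if r[ts_key] > last_watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r[ts_key] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark


# Hypothetical Bronze rows, reduced to the columns that matter here.
rows = [
    {"id": 1, "ingestion_timestamp": datetime(2024, 1, 1, 6, 0)},
    {"id": 2, "ingestion_timestamp": datetime(2024, 1, 2, 6, 0)},
    {"id": 3, "ingestion_timestamp": datetime(2024, 1, 3, 6, 0)},
]

batch, watermark = select_incremental(rows, datetime(2024, 1, 1, 12, 0))
```

Persisting `watermark` after a successful write lets the next run skip everything already processed.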

### AVOID
- Storing all layers in a single lakehouse — this defeats isolation and independent optimization
- Skipping the Silver layer and going directly from Bronze to Gold
- Hardcoded workspace IDs, lakehouse IDs, or FQDNs — discover via REST API
- SELECT * without LIMIT on Bronze tables (they grow unboundedly)
- Running VACUUM without checking downstream dependencies
- Chaining OneLake shortcuts between medallion layers (Bronze→Silver→Gold) — each layer must be physically materialized for lineage and governance
- Copying complete implementation code into skills — guide the LLM to generate instead
- Reading from **external HTTP/HTTPS URLs** directly in Spark — Fabric Spark cannot access arbitrary external URLs; land data in lakehouse `Files/` first (via `curl`, OneLake API, or Fabric pipeline Copy activity), then read from the lakehouse path
- Creating notebooks via REST API **without validating `.ipynb` structure** — missing `execution_count: null` or `outputs: []` on code cells causes silent failures or "Job instance failed without detail error"

---

## Workspace Setup Guidance

When setting up a medallion workspace, guide the LLM to generate commands for:

1. **Default architecture: create three workspaces** (recommended):
   - `{project}-bronze-{env}`
   - `{project}-silver-{env}`
   - `{project}-gold-{env}`
2. **Create one lakehouse per workspace**:
   - Bronze workspace → `{project}_bronze` lakehouse
   - Silver workspace → `{project}_silver` lakehouse
   - Gold workspace → `{project}_gold` lakehouse
3. **Assign RBAC per layer workspace**:
   - Bronze: ingestion/engineering write permissions
   - Silver: engineering/data quality permissions
   - Gold: analytics/BI consumer access with stricter curation controls
4. **Create notebooks** for each layer (one per transformation stage) — follow `.ipynb` validation + Fabric nuances
5. **Bind each notebook to its lakehouse** — set `metadata.dependencies.lakehouse` with the correct lakehouse ID (see [notebook-api-operations.md § Default Lakehouse Binding](../spark-authoring-cli/resources/notebook-api-operations.md#default-lakehouse-binding)):
   - Bronze notebook → Bronze workspace/lakehouse
   - Silver notebook → Silver workspace/lakehouse (reads Bronze via cross-workspace OneLake access / fully qualified references)
   - Gold notebook → Gold workspace/lakehouse (reads Silver via cross-workspace access)
6. **Confirm notebook deployment** — check that `updateDefinition` returned `Succeeded`; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
7. **Execute notebooks** sequentially — Bronze first, then Silver, then Gold — using `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in execution config (both `id` and `name` required)
8. **Connect Power BI to Gold layer** — discover the Gold lakehouse SQL endpoint, create a Direct Lake semantic model, create a report with visuals on the Gold summary table (see [Gold Layer → Power BI Consumption](#gold-layer--power-bi-consumption))
9. **Create pipeline** to orchestrate the Bronze → Silver → Gold flow for recurring execution
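
The naming convention in steps 1 and 2 can be captured in a small helper so every script derives the same names. This is a sketch under the assumption that names follow the `{project}-{layer}-{env}` and `{project}_{layer}` templates exactly; `layer_names` is a hypothetical helper.

```python
def layer_names(project: str, env: str) -> dict:
    """Map each medallion layer to its workspace and lakehouse display names."""
    layers = ("bronze", "silver", "gold")
    return {
        layer: {
            "workspace": f"{project}-{layer}-{env}",
            "lakehouse": f"{project}_{layer}",
        }
        for layer in layers
    }


names = layer_names("sales", "dev")
```

Iterating over `names` in order gives the Bronze, Silver, Gold sequence used throughout the setup.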

### Explicit Override: Single Workspace

If the user explicitly asks for a single workspace deployment (for example, POC/small team/monolithic pattern), use the single-workspace layout:

- One workspace with separate Bronze/Silver/Gold lakehouses
- Preserve layer separation logically even when workspace is shared
- Call out governance trade-offs versus multi-workspace design

Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.

---

## Bronze Layer — Ingestion Patterns

When a user requests data ingestion into the Bronze layer, guide the LLM to:

1. **Land data in lakehouse first**: External data must be staged into the lakehouse `Files/` folder before Spark can read it — use one of:
   - **Fabric Pipeline Copy activity** (preferred for recurring loads) — connects to external sources (HTTP, FTP, databases, cloud storage) and writes to OneLake
   - **OneLake API / `curl`** — upload files via REST API using `storage.azure.com` token (see COMMON-CLI.md § OneLake Data Access)
   - **OneLake Shortcut** — for data already in Azure ADLS Gen2, S3, or another OneLake location
   - **`notebookutils.fs`** — copy from mounted storage paths within a notebook
   - ⚠️ **Fabric Spark cannot read from arbitrary HTTP/HTTPS URLs** — `spark.read.format("csv").load("https://...")` will fail
2. **Read from lakehouse path**: Once data is in `Files/`, read using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`)
3. **Add metadata and write**: Tracking columns (ingestion timestamp, source file, batch ID), Delta table with descriptive name, partition by ingestion date, append mode
4. **Validate**: Log row counts, validate schema structure, flag anomalies vs historical patterns

---

## Silver Layer — Transformation Patterns

When a user requests Bronze-to-Silver transformation, guide the LLM to:

- **Quality rules**: Deduplicate on natural/composite key, filter invalid ranges, handle nulls (drop required, fill optional), validate logical constraints
- **Schema conformance**: snake_case column names, standardized data types, derived columns (durations, percentages, categories)
- **Schema evolution**: Use `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets
- **Write strategy**: Partition by business date, partition-aware overwrite, run OPTIMIZE after write, log before/after metrics
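
The snake_case rule under schema conformance is easy to get subtly wrong with acronyms and punctuation, so a small pure-Python helper may be useful. `to_snake_case` is a hypothetical name and the regular expressions are one reasonable choice, not a prescribed one.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a source column name such as 'PickupDateTime' to 'pickup_date_time'."""
    # Split at a lowercase letter or digit followed by an uppercase letter.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    # Split an acronym from the capitalised word that follows it (HTTPStatus -> HTTP_Status).
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    # Collapse spaces, punctuation, and repeated separators into single underscores.
    s = re.sub(r"[^0-9a-zA-Z]+", "_", s)
    return s.strip("_").lower()


renamed = [to_snake_case(c) for c in ["PickupDateTime", "Trip Distance (mi)", "vendorID"]]
```

In a notebook the result could be applied in one pass, for example `df.toDF(*[to_snake_case(c) for c in df.columns])`.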

---

## Gold Layer — Aggregation Patterns

When a user requests Gold analytics tables, guide the LLM to generate:

- **Common aggregates**: Daily/weekly/monthly summaries, dimensional analysis (by location, category, type), trend breakdowns over time, demand patterns (hour-of-day, day-of-week)
- **Spark session config** — set these properties in the Gold notebook **before** any write operations:
  ```python
  spark.conf.set("spark.sql.parquet.vorder.default", "true")
  spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
  spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
  ```
  - **V-Order** (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
  - **Optimize Write** (`optimizeWrite.enabled`) — coalesces small partitions into optimally-sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
- **Optimization**: ZORDER on filter columns, run OPTIMIZE after writes, pre-aggregate metrics to avoid runtime computation

---

## End-to-End Execution Flow

When setting up medallion architecture end-to-end, the LLM **must not stop** after creating notebooks and deploying code. The complete lifecycle is:

```
Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results
```

### Step-by-Step

1. **Create layer workspaces and lakehouses (default)** — one workspace and one lakehouse per layer (Bronze, Silver, Gold); capture workspace IDs and lakehouse IDs
2. **Create notebooks** — one per layer, with valid `.ipynb` structure (see [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md))
3. **Bind lakehouse to each notebook** — include `metadata.dependencies.lakehouse` in the `.ipynb` payload with:
   - `default_lakehouse`: the target lakehouse GUID
   - `default_lakehouse_name`: the lakehouse display name
   - `default_lakehouse_workspace_id`: the workspace GUID
4. **Deploy notebook content** — `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
5. **Confirm deployment** — check that each `updateDefinition` LRO returned `Succeeded`; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
6. **Execute notebooks sequentially** — use `POST .../jobs/instances?jobType=RunNotebook`:
   - Pass `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
   - Run Bronze first → poll until `Completed` → run Silver → poll → run Gold → poll
   - Check for recent jobs before submitting (prevent duplicates — see SPARK-AUTHORING-CORE.md)
7. **Verify results** — after each notebook completes, confirm expected tables exist and row counts are reasonable
8. **Connect Power BI to Gold** — create semantic model + report on Gold summary tables (see [Gold Layer → Power BI Consumption](#gold-layer--power-bi-consumption))
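
Steps 2 to 4 can be sketched as a function that assembles a minimal `.ipynb` document with the lakehouse binding and Base64-encodes it. The binding keys and the `outputs` / `execution_count` requirements come from this skill; the `nbformat` values, the `language_info` entry, the helper name, and the placeholder GUIDs are assumptions, and the surrounding `updateDefinition` request body is deliberately left out (see notebook-api-operations.md for that).

```python
import base64
import json


def build_notebook_payload(source_lines, lakehouse_id, lakehouse_name, workspace_id):
    """Return a Base64 string of a one-cell .ipynb bound to the given lakehouse."""
    notebook = {
        "nbformat": 4,  # assumed notebook format version
        "nbformat_minor": 5,
        "metadata": {
            "language_info": {"name": "python"},
            "dependencies": {
                "lakehouse": {
                    "default_lakehouse": lakehouse_id,
                    "default_lakehouse_name": lakehouse_name,
                    "default_lakehouse_workspace_id": workspace_id,
                }
            },
        },
        "cells": [
            {
                "cell_type": "code",
                "source": source_lines,
                "metadata": {},
                "outputs": [],  # required on every code cell
                "execution_count": None,  # serialised as null
            }
        ],
    }
    raw = json.dumps(notebook).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder identifiers; real values are discovered via the REST API.
payload = build_notebook_payload(
    ["print('bronze')"], "<lakehouse-guid>", "sales_bronze", "<workspace-guid>"
)
```

Building content and binding in one payload matches step 4, where both are deployed together.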

### Common Failure: Stopping After Notebook Creation

If the flow stops after deploying notebook code without binding or executing:
- Notebooks will have no lakehouse context → `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime
- The user sees no output or results — the architecture is set up but never tested
- **Always complete through step 8** unless the user explicitly asks to stop at a specific step
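
The run-then-poll sequence from step 6 can be outlined with the REST calls injected as callables, which keeps the control flow testable. Everything named here is hypothetical: `submit` and `get_status` stand in for the job submission and status requests, and apart from `Completed` the status strings (`InProgress`, `Failed`, `Cancelled`) are assumptions about what the job API reports.

```python
import time


def run_layers_sequentially(layers, submit, get_status, poll_seconds=10.0, max_polls=100):
    """Run each layer in order; stop at the first layer that does not reach Completed."""
    results = {}
    for layer in layers:
        job_id = submit(layer)
        status = get_status(job_id)
        polls = 1
        while status not in ("Completed", "Failed", "Cancelled") and polls < max_polls:
            time.sleep(poll_seconds)
            status = get_status(job_id)
            polls += 1
        results[layer] = status
        if status != "Completed":
            break  # downstream layers must not run on a failed upstream
    return results


# Stand-in callables so the sketch runs without calling Fabric.
_poll_counts = {}


def fake_submit(layer):
    return f"job-{layer}"


def fake_get_status(job_id):
    _poll_counts[job_id] = _poll_counts.get(job_id, 0) + 1
    return "Completed" if _poll_counts[job_id] >= 2 else "InProgress"


outcome = run_layers_sequentially(
    ["bronze", "silver", "gold"], fake_submit, fake_get_status, poll_seconds=0
)
```

A layer missing from the returned dictionary was never started, which makes a partial run easy to report back to the user.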

---

## Gold Layer → Power BI Consumption

After Gold tables are populated, connect Power BI to surface the analytics. Build a semantic model on top of the Gold lakehouse using Direct Lake mode.


### Step-by-Step

1. **Discover the Gold lakehouse SQL endpoint** — call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString` and `provisioningStatus`; wait until status is `Success`
2. **Verify Gold tables via SQL** — connect to the SQL endpoint using `sqlcmd` (see [COMMON-CLI.md § SQL / TDS Data-Plane Access](../../common/COMMON-CLI.md#sql--tds-data-plane-access)) and confirm the target table exists:
   ```sql
   SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'
   ```
3. **Create a semantic model** — use the [powerbi-authoring-cli](../powerbi-authoring-cli/SKILL.md) skill for semantic model creation and TMDL deployment. Create via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"` then deploy definition via `updateDefinition` using TMDL format (see [ITEM-DEFINITIONS-CORE.md § SemanticModel](../../common/ITEM-DEFINITIONS-CORE.md#semanticmodel)):
   - The model must reference the Gold lakehouse SQL endpoint as its data source
   - Define a table mapping to the Gold summary table (e.g., `nyc_taxi_daily_summary`)
   - Use **Direct Lake** mode — this connects directly to Delta tables in OneLake without data import
   - Include measures for key aggregations you find interesting (e.g., `Total Trips`, `Avg Fare`, `Total Revenue`, `Month over Month Growth`)
4. **Create a Power BI report** — `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"` then deploy definition via `updateDefinition` using PBIR format (see [ITEM-DEFINITIONS-CORE.md § Report](../../common/ITEM-DEFINITIONS-CORE.md#report)):
   - Reference the semantic model created in step 3 via `definition.pbir`
   - Define at least one page with visuals on the Gold summary table
   - Suggested visuals: line chart (daily trend), card (KPI totals), bar chart (by category), table (detail view)
5. **Verify end-to-end** — use the `powerbi-consumption-cli` skill to run DAX queries against the semantic model and confirm data flows from Gold tables through to the report

### Principles

- **Discover SQL endpoint dynamically** — the connection string is in `properties.sqlEndpointProperties.connectionString` on the lakehouse response; never hardcode it
- **Wait for SQL endpoint provisioning** — status must be `Success` before connecting; newly created lakehouses may take minutes to provision
- **Prefer Direct Lake mode** — avoids data duplication; semantic model reads directly from OneLake Delta tables
- **Match table/column names exactly** — the semantic model table definition must use the exact Delta table and column names from the Gold lakehouse
- **For semantic model authoring** (TMDL, refresh, permissions), cross-reference the [powerbi-authoring-cli](../powerbi-authoring-cli/SKILL.md) skill
- **For DAX query validation**, cross-reference the [powerbi-consumption-cli](../powerbi-consumption-cli/SKILL.md) skill
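
The first two principles (discover dynamically, wait for provisioning) reduce to one check on the lakehouse response. The sketch below assumes the response nests `provisioningStatus` and `connectionString` under `properties.sqlEndpointProperties` as described above; the helper name and the sample values are hypothetical.

```python
from typing import Optional


def sql_endpoint_if_ready(lakehouse: dict) -> Optional[str]:
    """Return the SQL endpoint connection string, or None while it is still provisioning."""
    props = (lakehouse.get("properties") or {}).get("sqlEndpointProperties") or {}
    if props.get("provisioningStatus") != "Success":
        return None
    return props.get("connectionString")


# Hypothetical, trimmed lakehouse responses (connection string is a placeholder).
provisioning = {"properties": {"sqlEndpointProperties": {"provisioningStatus": "InProgress"}}}
ready = {
    "properties": {
        "sqlEndpointProperties": {
            "provisioningStatus": "Success",
            "connectionString": "<endpoint-host-placeholder>",
        }
    }
}

not_yet = sql_endpoint_if_ready(provisioning)
endpoint = sql_endpoint_if_ready(ready)
```

A caller would re-fetch the lakehouse and retry while the function returns `None`.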

---

## Pipeline Orchestration

When a user requests a pipeline for the medallion flow, guide the LLM to design with:

- **Structure**: Sequential activities (Bronze → Silver → Gold), each waiting for previous success; independent Gold aggregations can run in parallel; include validation and notification activities
- **Parameterization**: Pipeline-level processing date (defaults to yesterday), passed to all notebooks; dynamic date expressions
- **Scheduling**: Daily aligned with source refresh, watermark-based incremental processing, periodic full refresh for corrections
- **Error handling**: Retry with backoff for transient failures, alerting for persistent failures, graceful degradation (downstream uses previous data if upstream fails)
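
Two of the points above, the processing date that defaults to yesterday and retry with backoff, can be shown in plain Python. In a Fabric pipeline these would normally be expressed as pipeline parameters and activity retry settings; the function names, the exponential schedule, and the default of three attempts are illustrative assumptions.

```python
import time
from datetime import date, timedelta


def default_processing_date(today=None) -> str:
    """Return yesterday's date in ISO format, the default processing date."""
    today = today or date.today()
    return (today - timedelta(days=1)).isoformat()


def retry_with_backoff(action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `action`; on failure wait base_delay * 2**(attempt-1) seconds and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # persistent failure: surface it for alerting
            sleep(base_delay * 2 ** (attempt - 1))


processing_date = default_processing_date(date(2024, 3, 1))
```

Passing `sleep` as a parameter keeps the backoff schedule observable in tests without real waiting.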

---

## Environment Optimization

**For detailed Spark configurations and optimization strategies, see [data-engineering-patterns.md](../spark-authoring-cli/resources/data-engineering-patterns.md).**

| Layer | Profile | Key Settings |
|-------|---------|-------------|
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (`spark.sql.parquet.vorder.default=true`), Optimize Write (`optimizeWrite.enabled=true`, `binSize=1g`), vectorized readers, adaptive execution, ZORDER on all filter columns, pre-aggregate metrics |

---

## Examples

### Example 1: Set Up Medallion Workspaces (Default)

**Prompt**: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"

**What the LLM should generate**: REST API calls to:
1. Create workspaces: `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
2. Create one lakehouse in each workspace: `sales_bronze`, `sales_silver`, `sales_gold`
3. Assign RBAC roles per workspace/layer

```bash
# Bronze workspace creation (see COMMON-CLI.md for full patterns);
# repeat for sales-silver-dev and sales-gold-dev
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-bronze-dev"}
EOF
bronze_workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

# Create the Bronze lakehouse inside the Bronze workspace
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$bronze_workspace_id/items" \
  --body @/tmp/body.json
```

### Example 2: Design Bronze Ingestion

**Prompt**: "Ingest daily CSV files into bronze lakehouse with metadata columns"

**What the LLM should generate**: PySpark notebook that:
1. Reads source files with schema inference or explicit schema
2. Adds `ingestion_timestamp`, `source_file`, `batch_id` columns
3. Writes to Delta table partitioned by ingestion date
4. Logs row count and validation metrics

```python
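# Bronze ingestion pattern (guide LLM to generate full implementation)
from pyspark.sql.functions import current_date, current_timestamp, input_file_name, lit
import uuid

# One batch ID per run so every row from this load can be traced together
batch_id = str(uuid.uuid4())
# Lakehouse-relative path; the files must already be landed in Files/
df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("ingestion_date", current_date())  # partition column used below
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id)))
# Append-only write, partitioned by ingestion date
df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")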
```

### Example 3: Bronze-to-Silver Transformation

**Prompt**: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"

**What the LLM should generate**: PySpark notebook applying quality rules, schema conformance, and partitioned write with optimization.

### Example 4: End-to-End Pipeline

**Prompt**: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"

**What the LLM should generate**: Pipeline JSON definition with sequential notebook activities, date parameter, retry logic, and schedule trigger.

Related Skills

sqldw-consumption-cli

245
from microsoft/skills-for-fabric

Execute read-only T-SQL queries against Fabric Data Warehouse, Lakehouse SQL Endpoints, and Mirrored Databases via CLI. Default skill for any lakehouse data query (row counts, SELECT, filtering, aggregation) unless the user explicitly requests PySpark or Spark DataFrames. Use when the user wants to: (1) query warehouse/lakehouse data, (2) count rows or explore lakehouse tables, (3) discover schemas/columns, (4) generate T-SQL scripts, (5) monitor SQL performance, (6) export results to CSV/JSON. Triggers: "warehouse", "SQL query", "T-SQL", "query warehouse", "show warehouse tables", "show lakehouse tables", "query lakehouse", "lakehouse table", "how many rows", "count rows", "SQL endpoint", "describe warehouse schema", "generate T-SQL script", "warehouse performance", "export SQL data", "connect to warehouse", "lakehouse data", "explore lakehouse".

sqldw-authoring-cli

245
from microsoft/skills-for-fabric

Execute authoring T-SQL (DDL, DML, data ingestion, transactions, schema changes) against Microsoft Fabric Data Warehouse and SQL endpoints from agentic CLI environments. Use when the user wants to: (1) create/alter/drop tables from terminal, (2) insert/update/delete/merge data via CLI, (3) run COPY INTO or OPENROWSET ingestion, (4) manage transactions or stored procedures, (5) perform schema evolution, (6) use time travel or snapshots, (7) generate ETL/ELT shell scripts, (8) create views/functions/procedures on Lakehouse SQLEP. Triggers: "create table in warehouse", "insert data via T-SQL", "load from ADLS", "COPY INTO", "run ETL with T-SQL", "alter warehouse table", "upsert with T-SQL", "merge into warehouse", "create T-SQL procedure", "warehouse time travel", "recover deleted warehouse data", "create warehouse schema", "deploy warehouse", "transaction conflict", "snapshot isolation error".

spark-consumption-cli

245
from microsoft/skills-for-fabric

Analyze lakehouse data interactively using Fabric Livy sessions and PySpark/Spark SQL for advanced analytics, DataFrames, cross-lakehouse joins, Delta time-travel, and unstructured/JSON data. Use when the user explicitly asks for PySpark, Spark DataFrames, Livy sessions, or Python-based analysis — NOT for simple SQL queries. Triggers: "PySpark", "Spark SQL", "analyze with PySpark", "Spark DataFrame", "Livy session", "lakehouse with Python", "PySpark analysis", "PySpark data quality", "Delta time-travel with Spark".

spark-authoring-cli

245
from microsoft/skills-for-fabric

Develop Microsoft Fabric Spark/data engineering workflows with intelligent routing to specialized resources. Provides core workspace/lakehouse management and routes to: data engineering patterns, development workflow, or infrastructure orchestration. Use when the user wants to: (1) manage Fabric workspaces and resources, (2) develop notebooks and PySpark applications, (3) design data pipelines and orchestration, (4) provision infrastructure as code. Triggers: "develop notebook", "data engineering", "workspace setup", "pipeline design", "infrastructure provisioning", "Delta Lake patterns", "Spark development", "lakehouse configuration", "organize lakehouse tables", "create Livy session", "notebook deployment".

powerbi-consumption-cli

245
from microsoft/skills-for-fabric

The ONLY supported path for read-only Microsoft Fabric Power BI semantic model (formerly "Power BI dataset") query interactions. Execute DAX queries via the MCP server ExecuteQuery tool to: (1) discover semantic model metadata (tables, columns, measures, relationships, hierarchies, etc.) and their properties, (2) retrieve data from a semantic model. Triggers: "DAX query", "semantic model metadata", "list semantic model tables", "run EVALUATE", "get measure expression".

powerbi-authoring-cli

245
from microsoft/skills-for-fabric

Create, manage, and deploy Power BI semantic models inside Microsoft Fabric workspaces via `az rest` CLI against Fabric and Power BI REST APIs. Use when the user wants to: (1) create a semantic model from TMDL definition files, (2) retrieve or download semantic model definitions, (3) update a semantic model definition with modified TMDL, (4) trigger or manage dataset refresh operations, (5) configure data sources, parameters, or permissions, (6) deploy semantic models between pipeline stages. Covers Fabric Items API (CRUD) and Power BI Datasets API (refresh, data sources, permissions). For read-only DAX queries, use `powerbi-consumption-cli`. For fine-grained modeling changes, route to `powerbi-modeling-mcp`. Triggers: "create semantic model", "upload TMDL", "download semantic model TMDL", "refresh dataset", "semantic model deployment pipeline", "dataset permissions", "list dataset users", "semantic model authoring".

eventhouse-consumption-cli

245
from microsoft/skills-for-fabric

Run KQL queries against Fabric Eventhouse for real-time intelligence and time-series analytics using `az rest` against the Kusto REST API. Covers KQL operators (where, summarize, join, render), Eventhouse schema discovery (.show tables), time-series patterns with bin(), and ingestion monitoring. Use when the user wants to: (1) run read-only KQL queries against an Eventhouse or KQL Database, (2) discover Eventhouse table schema and metadata, (3) analyse real-time or time-series data with KQL operators, (4) monitor ingestion health and active KQL queries, (5) export KQL results to JSON. Triggers: "kql query", "kusto query", "eventhouse query", "kql database", "real-time intelligence", "time-series kql", "query eventhouse", "explore eventhouse", "show tables kql".

eventhouse-authoring-cli

245
from microsoft/skills-for-fabric

Execute KQL management commands (table management, ingestion, policies, functions, materialized views) against Fabric Eventhouse and KQL Databases via CLI. Use when the user wants to: (1) create or alter KQL tables, columns, or functions, (2) ingest data into an Eventhouse (inline, from storage, streaming), (3) configure retention, caching, or partitioning policies, (4) create or manage materialized views and update policies, (5) manage data mappings for ingestion pipelines, (6) deploy KQL schema via scripts. Triggers: "create kql table", "kql ingestion", "ingest into eventhouse", "kql function", "materialized view", "kql retention policy", "eventhouse schema", "kql authoring", "create eventhouse table", "kql mapping".

check-updates

245
from microsoft/skills-for-fabric

Check for skills-for-fabric marketplace updates at session start. Compares local version against GitHub releases and shows changelog if updates are available. Use when the user wants to: (1) check for skill updates, (2) see what's new in skills-for-fabric, (3) verify current version. Triggers: "check for updates", "am I up to date", "what version", "update skills", "show changelog".

skill-test

245
from microsoft/skills-for-fabric

Manage the skills-for-fabric evaluation framework: add eval plans for new or existing skills, list available tests and their results, generate eval datasets, review metrics, and check test coverage. Directs test execution to the tests/ folder. Triggers: "add tests", "add evals", "list tests", "show eval results", "run tests", "generate eval data", "eval metrics", "test coverage", "missing tests", "show tests".

quality-check

245
from microsoft/skills-for-fabric

Run local quality checks on skills-for-fabric before committing. Validates all skills in the skills/ folder for structural compliance, semantic disambiguation, broken references, and content quality. Use before submitting a PR to catch issues early. Triggers: "check my skills", "run quality check", "validate skills", "pre-commit check", "lint skills".

best-practices-check

245
from microsoft/skills-for-fabric

Verify skills-for-fabric against Microsoft Fabric best practices from the internet. Searches for current best practices, compares them against skill content, and identifies gaps or improvements. Use when the user wants to: (1) validate a skill covers industry best practices, (2) find missing guidance, (3) improve skill quality with current recommendations. Triggers: "check best practices", "validate best practices", "best practices for", "compare against best practices", "skill coverage".