database-architect

Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

database-architect is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.

Teams using database-architect should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/database-architect/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/database-architect/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/database-architect/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How database-architect Compares

Feature / Agent	database-architect	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Database Architect Skill

<identity>
Database Architect Skill - Designs efficient database schemas, optimizes queries, plans indexes, and creates migration strategies for both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases.
</identity>

<capabilities>
- Designing normalized and denormalized schemas
- Query optimization and execution plan analysis
- Index strategy planning
- Data modeling (ER diagrams, relationships)
- Migration planning and versioning
- Performance troubleshooting
</capabilities>

<instructions>
<execution_process>

### Step 1: Understand Data Requirements

Gather requirements:

1. **Entities**: What data needs to be stored?
2. **Relationships**: How do entities relate (1:1, 1:N, N:M)?
3. **Access Patterns**: How will data be queried?
4. **Volume**: Expected data size and growth rate
5. **Consistency**: ACID requirements vs eventual consistency

### Step 2: Design Schema

**For Relational Databases**:

1. **Normalize**: Start with 3NF to reduce redundancy
2. **Define Primary Keys**: Use surrogate keys (UUID/SERIAL) or natural keys
3. **Define Foreign Keys**: Establish referential integrity
4. **Consider Denormalization**: Only for proven performance needs

**For NoSQL Databases**:

1. **Model for Queries**: Design documents/collections around access patterns
2. **Embed vs Reference**: Embed for 1:1/1:few, reference for 1:many
3. **Shard Key Selection**: Choose keys that distribute evenly

### Step 3: Plan Indexes

Index strategy based on query patterns:

```sql
-- Example: Users table with common queries
CREATE INDEX idx_users_email ON users(email);           -- Exact match
CREATE INDEX idx_users_name ON users(last_name, first_name);  -- Range/sort
CREATE INDEX idx_users_created ON users(created_at DESC);     -- Ordering
```

**Index Guidelines**:

- Index columns used in WHERE, JOIN, ORDER BY
- Consider composite indexes for multi-column queries
- Avoid over-indexing (slows writes)
- Use covering indexes for read-heavy queries

### Step 4: Plan Migrations

Create versioned migrations:

```
migrations/
  001_create_users.sql
  002_add_email_index.sql
  003_create_orders.sql
```

**Migration Best Practices**:

- Always include up and down migrations
- Test migrations on production-like data
- Plan for zero-downtime migrations
- Backup before running migrations

### Step 5: Optimize Queries

Analyze and improve slow queries:

1. **Use EXPLAIN ANALYZE**: Understand execution plans
2. **Identify Table Scans**: Replace with index scans
3. **Optimize JOINs**: Ensure indexes on join columns
4. **Batch Operations**: Use bulk inserts/updates
5. **Connection Pooling**: Reduce connection overhead

### Step 6: PostgreSQL 17 Features (2024–2026)

Leverage PostgreSQL 17 capabilities where applicable:

**Performance improvements:**

- New `VACUUM` memory management — up to 20x lower memory footprint; vacuum now runs faster on busy systems
- Streaming I/O interface accelerates sequential scans on large datasets
- `BRIN` indexes support parallel builds
- B-tree indexes are more efficient for `IN` clause queries
- Optimized CTE (Common Table Expression) planning

**SQL/JSON enhancements (PG 17):**

- `JSON_TABLE()` — converts JSON data into relational table representation
- JSON constructors and identity functions (`JSON()`, `JSON_SCALAR()`, `JSON_ARRAY()`, `JSON_OBJECT()`)
- Use `jsonpath` for expressive path-based queries over JSONB columns

**Incremental backups:**

- `pg_basebackup` supports incremental backup; combine with `pg_upgrade` for zero-data-loss major version upgrades

**Logical replication improvements:**

- Failover control for logical replication slots
- `pg_createsubscriber` creates logical replicas from physical standbys
- `pg_upgrade` now preserves logical replication slots across major version upgrades

**Security:**

- New `MAINTAIN` privilege — grants targeted maintenance rights without full superuser access
- `sslnegotiation=direct` client option for direct TLS handshake (avoids round-trip)

**COPY improvements:**

- `COPY ... ON_ERROR ignore` — continues import on row-level errors instead of aborting

### Step 7: pgvector for AI Embeddings

Store and query vector embeddings alongside relational data to avoid a separate vector database:

```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column (1536 dims for OpenAI text-embedding-3-small)
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor (ANN) search
-- lists = sqrt(row_count) is a good starting value
CREATE INDEX idx_documents_embedding ON documents
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW index (faster queries, more memory; preferred for < 1M vectors)
CREATE INDEX idx_documents_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Similarity search (cosine distance)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

**When to use pgvector vs. dedicated vector DB:**

- Up to ~10M vectors: pgvector is sufficient (sub-50ms queries with HNSW index)
- Above 10M vectors or requiring specialized ANN algorithms: consider Pinecone, Weaviate, or Qdrant
- pgvector advantage: same backups, replication, and connection pooling as the rest of PostgreSQL

### Step 8: Table Partitioning Strategies

Use declarative partitioning for tables expected to exceed available RAM:

```sql
-- Range partitioning by date (common for time-series / logs)
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Hash partitioning for even distribution (e.g., multi-tenant)
CREATE TABLE orders (
    id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    total DECIMAL(12,2)
) PARTITION BY HASH (tenant_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 3);
```

**Partition pruning:** PostgreSQL automatically skips irrelevant partitions when the partition key appears in `WHERE`. Always include the partition key in queries.

**Index on partitioned tables:** Indexes created on the parent table are automatically created on all child partitions.

### Step 9: JSONB Patterns at Scale

```sql
-- Generated columns promote hot JSONB fields to indexed native columns
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    data JSONB NOT NULL,
    -- Promote frequently filtered fields to B-tree indexed generated columns
    country TEXT GENERATED ALWAYS AS (data->>'country') STORED,
    signup_date DATE GENERATED ALWAYS AS ((data->>'signup_date')::DATE) STORED
);
CREATE INDEX idx_customers_country ON customers (country);
CREATE INDEX idx_customers_signup ON customers (signup_date);

-- GIN index for containment / key-existence queries
CREATE INDEX idx_customers_data_gin ON customers USING GIN (data);

-- Partial GIN index for large tables (index only active records)
CREATE INDEX idx_customers_data_active ON customers
    USING GIN (data) WHERE (data->>'status') = 'active';

-- jsonpath query example (PG 17)
SELECT * FROM customers
WHERE data @? '$.tags[*] ? (@ == "premium")';
```

### Step 10: Connection Pooling

Use a connection pooler in front of PostgreSQL for all production deployments:

**PgBouncer** (lightweight, battle-tested):

```ini
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction       ; transaction pooling for stateless apps
max_client_conn = 1000
default_pool_size = 25
server_pool_size = 5
```

**Supavisor** (cloud-native, multi-tenant, Elixir-based):

- Designed for serverless / edge functions with thousands of short-lived connections
- Supports both session and transaction pooling modes
- Used by default in Supabase deployments; available as self-hosted

**Pooling modes:**
| Mode | Use Case | Notes |
|------|----------|-------|
| Session | Long-running connections, `LISTEN/NOTIFY` | 1 client = 1 server connection |
| Transaction | Stateless APIs (recommended default) | Most efficient; breaks `SET` / prepared statements |
| Statement | Rarely needed | Each statement can use a different server connection |

</execution_process>

<best_practices>

1. **Normalize First**: Optimize later based on data
2. **Index Thoughtfully**: Based on actual query patterns
3. **Use Migrations**: Never modify schema directly
4. **Monitor Performance**: Use database profiling tools
5. **Plan for Scale**: Consider partitioning for large tables
6. **Upgrade to PostgreSQL 17**: Benefit from new VACUUM memory manager, SQL/JSON functions, and incremental backup support
7. **Use pgvector for AI**: Store embeddings in PostgreSQL with HNSW or IVFFlat indexes before reaching for a dedicated vector database
8. **JSONB at Scale**: Promote hot JSONB fields to generated columns with B-tree indexes; use GIN for containment queries
9. **Always Pool Connections**: Use PgBouncer (transaction mode) or Supavisor for all production PostgreSQL deployments
10. **Partition Large Tables**: Apply range (time-series), list (category), or hash (multi-tenant) partitioning for tables projected to exceed server RAM

</best_practices>
</instructions>

<examples>
<usage_example>
**Schema Design Request**:

```
Design a schema for an e-commerce orders system
```

**Example Response**:

```sql
-- Users table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Orders table
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Order items (1:N relationship)
CREATE TABLE order_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
    product_id UUID NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2)
);

-- Indexes for common queries
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
```

</usage_example>
</examples>

## Rules

- Always justify denormalization with performance data
- Include rollback strategy for all migrations
- Document relationships and constraints

## Related Workflow

This skill has a corresponding workflow for complex multi-agent scenarios:

- **Workflow**: `.claude/workflows/database-architect-skill-workflow.md`
- **When to use workflow**: For comprehensive database design including requirements analysis, schema design, query optimization, migration planning, and testing (multi-phase, multi-agent)
- **When to use skill directly**: For quick schema reviews or single-agent database tasks

## Iron Laws

1. **NEVER** make schema changes without versioned migrations that include both UP and DOWN scripts — manual DDL in production is not recoverable.
2. **ALWAYS** normalize to at least 3NF before considering denormalization — never prematurely optimize without measured performance evidence.
3. **ALWAYS** plan indexes based on actual query patterns from EXPLAIN ANALYZE — never add indexes speculatively before profiling real workloads.
4. **NEVER** test or deploy a migration without running it against production-like data first — schema issues surface under realistic volume, not on empty tables.
5. **ALWAYS** use connection pooling (Supavisor or PgBouncer) in production — direct connections from serverless functions exhaust the database connection limit under load.

## Anti-Patterns

| Anti-Pattern                                 | Why It Fails                                               | Correct Approach                                                           |
| -------------------------------------------- | ---------------------------------------------------------- | -------------------------------------------------------------------------- |
| Manual DDL directly on production            | No rollback path; breaks migration history                 | Always use versioned migrations with DOWN scripts                          |
| Premature denormalization                    | Adds complexity before profiling; often no measurable gain | Normalize first, denormalize only after EXPLAIN ANALYZE reveals bottleneck |
| Indexing every column                        | Slows writes; wastes storage; misleads query planner       | Index only columns that appear in WHERE, JOIN, and ORDER BY clauses        |
| Adding NOT NULL column without default       | Locks entire table during migration on large datasets      | Add nullable column, backfill in batches, then add NOT NULL constraint     |
| Direct connections from serverless functions | Connection limit exhausted under load spike                | Use PgBouncer or Supavisor in transaction-pooling mode                     |

## Memory Protocol (MANDATORY)

**Before starting:**

```bash
cat .claude/context/memory/learnings.md
```

**After completing:**

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

> ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

Related Skills

docker-database

from diegosouzapw/awesome-omni-skill

Configure database containers with security, persistence, and health checks

Database Sync

from diegosouzapw/awesome-omni-skill

Automate database synchronization, replication, migration, and cross-platform data integration

database-skill

from diegosouzapw/awesome-omni-skill

Design and manage relational databases including table creation, migrations, and schema design. Use for database modeling and maintenance.

architecture-paradigm-pipeline

from diegosouzapw/awesome-omni-skill

Consult this skill when designing data pipelines or transformation workflows. Use when data flows through fixed sequence of transformations, stages can be independently developed and tested, parallel processing of stages is beneficial. Do not use when selecting from multiple paradigms - use architecture-paradigms first. DO NOT use when: data flow is not sequential or predictable. DO NOT use when: complex branching/merging logic dominates.

architecture-advisor

from diegosouzapw/awesome-omni-skill

Helps solo developers with AI agents choose optimal architecture (monolithic/microservices/hybrid)

architecting-data

from diegosouzapw/awesome-omni-skill

Strategic guidance for designing modern data platforms, covering storage paradigms (data lake, warehouse, lakehouse), modeling approaches (dimensional, normalized, data vault, wide tables), data mesh principles, and medallion architecture patterns. Use when architecting data platforms, choosing between centralized vs decentralized patterns, selecting table formats (Iceberg, Delta Lake), or designing data governance frameworks.

architect-reviewer

from diegosouzapw/awesome-omni-skill

Use this agent when you need to evaluate system design decisions, architectural patterns, and technology choices at the macro level.

architect-agent

from diegosouzapw/awesome-omni-skill

Coordinates planning, delegation, and evaluation across architect and code agent workspaces. Use when asked to "write instructions for code agent", "initialize architect workspace", "grade code agent work", "send instructions", or "verify code agent setup".

arch-database

from diegosouzapw/awesome-omni-skill

DB architecture: relational vs document vs graph vs vector, schema design, indexing, replication, sharding

aidf-architect

from diegosouzapw/awesome-omni-skill

Software architect focused on system design, patterns, and long-term maintainability. Designs and plans without implementing code directly.

agent-native-architecture

from diegosouzapw/awesome-omni-skill

Build applications where agents are first-class citizens. Use this skill when designing autonomous agents, creating MCP tools, implementing self-modifying systems, or building apps where features are outcomes achieved by agents operating in a loop.

agent-architecture

from diegosouzapw/awesome-omni-skill

Use when designing or implementing AI agent systems. Covers tool-using agents with mandatory guardrails, SSE streaming (FastAPI → Next.js via Vercel AI SDK v6), LangGraph stateful multi-agent graphs, episodic memory via pgvector, MCP overview, and production failure modes with anti-pattern/fix code pairs.