apify-generate-output-schema

Generate output schemas (dataset_schema.json, output_schema.json, key_value_store_schema.json) for an Apify Actor by analyzing its source code. Use when creating or updating Actor output schemas.

25 stars

byComeOnOliver

View on GitHub Installation ↓

Best use case

apify-generate-output-schema is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Generate output schemas (dataset_schema.json, output_schema.json, key_value_store_schema.json) for an Apify Actor by analyzing its source code. Use when creating or updating Actor output schemas.

Teams using apify-generate-output-schema should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/apify-generate-output-schema/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/apify/agent-skills/apify-generate-output-schema/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/apify-generate-output-schema/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How apify-generate-output-schema Compares

Feature / Agent	apify-generate-output-schema	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Generate output schemas (dataset_schema.json, output_schema.json, key_value_store_schema.json) for an Apify Actor by analyzing its source code. Use when creating or updating Actor output schemas.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Generate Actor Output Schema

You are generating output schema files for an Apify Actor. The output schema tells Apify Console how to display run results. You will analyze the Actor's source code, create `dataset_schema.json`, `output_schema.json`, and `key_value_store_schema.json` (if the Actor uses key-value store), and update `actor.json`.

## Core Principles

- **Analyze code first**: Read the Actor's source to understand what data it actually pushes to the dataset — never guess
- **Every field is nullable**: APIs and websites are unpredictable — always set `"nullable": true`
- **Anonymize examples**: Never use real user IDs, usernames, or personal data in examples
- **Verify against code**: If TypeScript types exist, cross-check the schema against both the type definition AND the code that produces the values
- **Reuse existing patterns**: Before generating schemas, check if other Actors in the same repository already have output schemas — match their structure, naming conventions, description style, and formatting
- **Don't reinvent the wheel**: Reuse existing type definitions, interfaces, and utilities from the codebase instead of creating duplicate definitions

---

## Phase 1: Discover Actor Structure

**Goal**: Locate the Actor and understand its output

Initial request: $ARGUMENTS

**Actions**:
1. Create todo list with all phases
2. Find the `.actor/` directory containing `actor.json`
3. Read `actor.json` to understand the Actor's configuration
4. Check if `dataset_schema.json`, `output_schema.json`, and `key_value_store_schema.json` already exist
5. **Search for existing schemas in the repository**: Look for other `.actor/` directories or schema files (e.g., `**/dataset_schema.json`, `**/output_schema.json`, `**/key_value_store_schema.json`) to learn the repo's conventions — match their description style, field naming, example formatting, and overall structure
6. Find all places where data is pushed to the dataset:
   - **JavaScript/TypeScript**: Search for `Actor.pushData(`, `dataset.pushData(`, `Dataset.pushData(`
   - **Python**: Search for `Actor.push_data(`, `dataset.push_data(`, `Dataset.push_data(`
7. Find all places where data is stored in the key-value store:
   - **JavaScript/TypeScript**: Search for `Actor.setValue(`, `keyValueStore.setValue(`, `KeyValueStore.setValue(`
   - **Python**: Search for `Actor.set_value(`, `key_value_store.set_value(`, `KeyValueStore.set_value(`
8. Find output type definitions — **reuse them directly** instead of recreating from scratch:
   - **TypeScript**: Look for output type interfaces/types (e.g., in `src/types/`, `src/types/output.ts`). If an interface or type already defines the output shape, derive the schema fields from it — do not create a parallel definition
   - **Python**: Look for TypedDict, dataclass, or Pydantic model definitions. Use the existing field names, types, and docstrings as the source of truth
9. Check for existing shared schema utilities or helper functions in the codebase that handle schema generation or validation — reuse them rather than creating new logic
10. If inline `storages.dataset` or `storages.keyValueStore` config exists in `actor.json`, note it for migration

Present findings to user: list all discovered dataset output fields, key-value store keys, their types, and where they come from.

---

## Phase 2: Generate `dataset_schema.json`

**Goal**: Create a complete dataset schema with field definitions and display views

### File structure

```json
{
    "actorSpecification": 1,
    "fields": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            // ALL output fields here — every field the Actor can produce,
            // not just the ones shown in the overview view
        },
        "required": [],
        "additionalProperties": true
    },
    "views": {
        "overview": {
            "title": "Overview",
            "description": "Most important fields at a glance",
            "transformation": {
                "fields": [
                    // 8-12 most important field names
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    // Display config for each overview field
                }
            }
        }
    }
}
```

### Consistency with existing schemas

If existing output schemas were found in the repository during Phase 1 (step 5), follow their conventions:
- Match the **description writing style** (sentence case vs. lowercase, period vs. no period, etc.)
- Match the **field naming convention** (camelCase vs. snake_case) — this must also match the actual keys produced by the Actor code
- Match the **example value style** (e.g., date formats, URL patterns, placeholder names)
- Match the **view structure** (number of fields in overview, display format choices)
- Match the **JSON formatting** (indentation, property ordering, spacing) — all schemas in the same repository must use identical formatting, including standalone Actors

When the Actor code already has well-defined TypeScript interfaces or Python type classes, derive fields directly from those types rather than re-analyzing pushData/push_data calls from scratch. The type definition is the canonical source.

### Hard rules (no exceptions)

| Rule | Detail |
|------|--------|
| **All fields in `properties`** | The `fields.properties` object must contain **every** field the Actor can output, not just the fields shown in the overview view. The views section selects a subset for display — the `properties` section must be the complete superset |
| `"nullable": true` | On **every** field — APIs are unpredictable |
| `"additionalProperties": true` | On the **top-level `fields` object** AND on **every nested object** within `properties`. This is the most commonly missed rule — it must appear at both levels |
| `"required": []` | Always empty array — on the **top-level `fields` object** AND on **every nested object** within `properties` |
| Anonymized examples | No real user IDs, usernames, or content |
| `"type"` required with `"nullable"` | AJV rejects `nullable` without a `type` on the same field |

> **Warning — most common mistakes**:
> 1. Only including fields that appear in the overview view. The `fields.properties` must list ALL output fields, even if they are not in the `views` section.
> 2. Only adding `"required": []` and `"additionalProperties": true` on nested object-type properties but forgetting them on the top-level `fields` object. Both levels need them.

> **Note**: `nullable` is an Apify-specific extension to JSON Schema draft-07. It is intentional and correct.

### Field type patterns

**String field:**
```json
"title": {
    "type": "string",
    "description": "Title of the scraped item",
    "nullable": true,
    "example": "Example Item Title"
}
```

**Number field:**
```json
"viewCount": {
    "type": "number",
    "description": "Number of views",
    "nullable": true,
    "example": 15000
}
```

**Boolean field:**
```json
"isVerified": {
    "type": "boolean",
    "description": "Whether the account is verified",
    "nullable": true,
    "example": true
}
```

**Array field:**
```json
"hashtags": {
    "type": "array",
    "description": "Hashtags associated with the item",
    "items": { "type": "string" },
    "nullable": true,
    "example": ["#example", "#demo"]
}
```

**Nested object field:**
```json
"authorInfo": {
    "type": "object",
    "description": "Information about the author",
    "properties": {
        "name": { "type": "string", "nullable": true },
        "url": { "type": "string", "nullable": true }
    },
    "required": [],
    "additionalProperties": true,
    "nullable": true,
    "example": { "name": "Example Author", "url": "https://example.com/author" }
}
```

**Enum field:**
```json
"contentType": {
    "type": "string",
    "description": "Type of content",
    "enum": ["article", "video", "image"],
    "nullable": true,
    "example": "article"
}
```

**Union type (e.g., TypeScript `ObjectType | string`):**
```json
"metadata": {
    "type": ["object", "string"],
    "description": "Structured metadata object, or error string if unavailable",
    "nullable": true,
    "example": { "key": "value" }
}
```

### Anonymized example values

Use realistic but generic values. Follow platform ID format conventions:

| Field type | Example approach |
|---|---|
| IDs | Match platform format and length (e.g., 11 chars for YouTube video IDs) |
| Usernames | `"exampleuser"`, `"sampleuser123"` |
| Display names | `"Example Channel"`, `"Sample Author"` |
| URLs | Use platform's standard URL format with fake IDs |
| Dates | `"2025-01-15T12:00:00.000Z"` (ISO 8601) |
| Text content | Generic descriptive text, e.g., `"This is an example description."` |

### Views section

- `transformation.fields`: List 8–12 most important field names (order = column order in UI)
- `display.properties`: One entry per overview field with `label` and `format`
- Available formats: `"text"`, `"number"`, `"date"`, `"link"`, `"boolean"`, `"image"`, `"array"`, `"object"`

Pick fields that give users the most useful at-a-glance summary of the data.

---

## Phase 3: Generate `key_value_store_schema.json` (if applicable)

**Goal**: Define key-value store collections if the Actor stores data in the key-value store

> **Skip this phase** if no `Actor.setValue()` / `Actor.set_value()` calls were found in Phase 1 (beyond the default `INPUT` key).

### File structure

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "<Descriptive title — what the key-value store contains>",
    "description": "<One sentence describing the stored data>",
    "collections": {
        "<collectionName>": {
            "title": "<Human-readable title>",
            "description": "<What this collection contains>",
            "keyPrefix": "<prefix->"
        }
    }
}
```

### How to identify collections

Group the discovered `setValue` / `set_value` calls by key pattern:

1. **Fixed keys** (e.g., `"RESULTS"`, `"summary"`) — use `"key"` (exact match)
2. **Dynamic keys with a prefix** (e.g., `"screenshot-${id}"`, `f"image-{name}"`) — use `"keyPrefix"`

Each group becomes a collection.

### Collection properties

| Property | Required | Description |
|----------|----------|-------------|
| `title` | Yes | Shown in UI tabs |
| `description` | No | Shown in UI tooltips |
| `key` | Conditional | Exact key for single-key collections (use `key` OR `keyPrefix`, not both) |
| `keyPrefix` | Conditional | Prefix for multi-key collections (use `key` OR `keyPrefix`, not both) |
| `contentTypes` | No | Restrict allowed MIME types (e.g., `["image/jpeg"]`, `["application/json"]`) |
| `jsonSchema` | No | JSON Schema draft-07 for validating `application/json` content |

### Examples

**Single file output (e.g., a report):**
```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Analysis Results",
    "description": "Key-value store containing analysis output",
    "collections": {
        "report": {
            "title": "Report",
            "description": "Final analysis report",
            "key": "REPORT",
            "contentTypes": ["application/json"]
        }
    }
}
```

**Multiple files with prefix (e.g., screenshots):**
```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Scraped Files",
    "description": "Key-value store containing downloaded files and screenshots",
    "collections": {
        "screenshots": {
            "title": "Screenshots",
            "description": "Page screenshots captured during scraping",
            "keyPrefix": "screenshot-",
            "contentTypes": ["image/png", "image/jpeg"]
        },
        "documents": {
            "title": "Documents",
            "description": "Downloaded document files",
            "keyPrefix": "doc-",
            "contentTypes": ["application/pdf", "text/html"]
        }
    }
}
```

---

## Phase 4: Generate `output_schema.json`

**Goal**: Create the output schema that tells Apify Console where to find results

For most Actors that push data to a dataset, this is a minimal file:

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "<Descriptive title — what the Actor returns>",
    "description": "<One sentence describing the output data>",
    "properties": {
        "dataset": {
            "type": "string",
            "title": "Results",
            "description": "Dataset containing all scraped data",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}
```

> **Critical**: Each property entry **must** include `"type": "string"` — this is an Apify-specific convention. The Apify meta-validator rejects properties without it (and rejects `"type": "object"` — only `"string"` is valid here).

If `key_value_store_schema.json` was generated in Phase 3, add a second property:
```json
"files": {
    "type": "string",
    "title": "Files",
    "description": "Key-value store containing downloaded files",
    "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
}
```

### Available template variables

- `{{links.apiDefaultDatasetUrl}}` — API URL of default dataset
- `{{links.apiDefaultKeyValueStoreUrl}}` — API URL of default key-value store
- `{{links.publicRunUrl}}` — Public run URL
- `{{links.consoleRunUrl}}` — Console run URL
- `{{links.apiRunUrl}}` — API run URL
- `{{links.containerRunUrl}}` — URL of webserver running inside the run
- `{{run.defaultDatasetId}}` — ID of the default dataset
- `{{run.defaultKeyValueStoreId}}` — ID of the default key-value store

---

## Phase 5: Update `actor.json`

**Goal**: Wire the schema files into the Actor configuration

**Actions**:
1. Read the current `actor.json`
2. Add or update the `storages.dataset` reference:
   ```json
   "storages": {
       "dataset": "./dataset_schema.json"
   }
   ```
3. If `key_value_store_schema.json` was generated, add the reference:
   ```json
   "storages": {
       "dataset": "./dataset_schema.json",
       "keyValueStore": "./key_value_store_schema.json"
   }
   ```
4. Add or update the `output` reference:
   ```json
   "output": "./output_schema.json"
   ```
5. If `actor.json` had inline `storages.dataset` or `storages.keyValueStore` objects (not string paths), migrate their content into the respective schema files and replace the inline objects with file path strings

---

## Phase 6: Review and Validate

**Goal**: Ensure correctness and completeness

**Checklist**:
- [ ] **Every** output field from the source code is in `dataset_schema.json` `fields.properties` — not just the overview view fields but ALL fields the Actor can produce
- [ ] Every field has `"nullable": true`
- [ ] The **top-level `fields` object** has both `"additionalProperties": true` and `"required": []`
- [ ] Every **nested object** within `properties` also has `"additionalProperties": true` and `"required": []`
- [ ] Every field has a `"description"` and an `"example"`
- [ ] All example values are anonymized
- [ ] `"type"` is present on every field that has `"nullable"`
- [ ] Views list 8–12 most useful fields with correct display formats
- [ ] `output_schema.json` has `"type": "string"` on every property
- [ ] If key-value store is used: `key_value_store_schema.json` has collections matching all `setValue`/`set_value` calls
- [ ] If key-value store is used: each collection uses either `key` or `keyPrefix` (not both)
- [ ] `actor.json` references all generated schema files
- [ ] Schema field names match the actual keys in the code (camelCase/snake_case consistency)
- [ ] If existing schemas were found in the repo, the new schema follows their conventions (description style, example format, view structure)
- [ ] Schema fields are derived from existing type definitions (interfaces, TypedDicts, dataclasses) where available — no duplicated or divergent field definitions

Present the generated schemas to the user for review before writing them.

---

## Phase 7: Summary

**Goal**: Document what was created

Report:
- Files created or updated
- Number of fields in the dataset schema
- Number of collections in the key-value store schema (if generated)
- Fields selected for the overview view
- Any fields that need user clarification (ambiguous types, unclear nullability)
- Suggested next steps (test locally with `apify run`, verify output tab in Console)

Related Skills

validating-api-schemas

from ComeOnOliver/skillshub

Validate API schemas against OpenAPI, JSON Schema, and GraphQL specifications. Use when validating API schemas and contracts. Trigger with phrases like "validate API schema", "check OpenAPI spec", or "verify schema".

schema-validator

from ComeOnOliver/skillshub

Schema Validator - Auto-activating skill for Data Pipelines. Triggers on: schema validator, schema validator Part of the Data Pipelines skill category.

response-schema-generator

from ComeOnOliver/skillshub

Response Schema Generator - Auto-activating skill for API Development. Triggers on: response schema generator, response schema generator Part of the API Development skill category.

prisma-schema-helper

from ComeOnOliver/skillshub

Prisma Schema Helper - Auto-activating skill for Backend Development. Triggers on: prisma schema helper, prisma schema helper Part of the Backend Development skill category.

graphql-schema-generator

from ComeOnOliver/skillshub

Graphql Schema Generator - Auto-activating skill for API Development. Triggers on: graphql schema generator, graphql schema generator Part of the API Development skill category.

designing-database-schemas

from ComeOnOliver/skillshub

Process use when you need to work with database schema design. This skill provides schema design and migrations with comprehensive guidance and automation. Trigger with phrases like "design schema", "create migration", or "model database".

database-schema-visualizer

from ComeOnOliver/skillshub

Database Schema Visualizer - Auto-activating skill for Visual Content. Triggers on: database schema visualizer, database schema visualizer Part of the Visual Content skill category.

comparing-database-schemas

from ComeOnOliver/skillshub

This skill leverages the database-diff-tool plugin to compare database schemas, generate migration scripts, and provide rollback procedures. It is triggered when the user requests database schema comparisons, migration script generation, or database synchronization. Use this skill when asked to identify differences between database schemas (PostgreSQL or MySQL), create safe migration scripts with transaction safety, validate changes before deployment, or generate rollback procedures. The skill is activated by requests involving terms like "database diff", "schema comparison", "generate migration script", "database synchronization", or `/db-diff`.

apify-webhooks-events

from ComeOnOliver/skillshub

Implement Apify webhooks for Actor run notifications and event-driven pipelines. Use when setting up run completion alerts, building event-driven scraping pipelines, or configuring ad-hoc webhooks for individual runs. Trigger: "apify webhook", "apify events", "actor run notification", "apify run succeeded webhook", "apify ad-hoc webhook".

apify-upgrade-migration

from ComeOnOliver/skillshub

Upgrade Apify SDK, apify-client, and Crawlee versions safely. Use when migrating between SDK versions, handling breaking changes, or updating from Apify SDK v2 to v3 (Crawlee split). Trigger: "upgrade apify", "apify migration", "apify breaking changes", "update apify SDK", "crawlee upgrade", "apify v2 to v3".

apify-security-basics

from ComeOnOliver/skillshub

Secure Apify API tokens, configure proxy access, and protect Actor data. Use when hardening API key management, setting up environment-specific tokens, or auditing Apify security configuration. Trigger: "apify security", "apify secrets", "secure apify token", "apify API key security", "rotate apify token".

apify-sdk-patterns

from ComeOnOliver/skillshub

Production-ready patterns for Apify SDK and apify-client in TypeScript. Use when building Actors with Crawlee, managing datasets/KV stores, or implementing robust client wrappers with retry and validation. Trigger: "apify SDK patterns", "apify best practices", "apify client wrapper", "crawlee patterns", "idiomatic apify".