salesforce-data-pipeline-etl

Export large Salesforce datasets to a lakehouse via Bulk API 2.0, CDC streams, or Salesforce Data Pipelines. NOT for ad-hoc exports.

8 stars

byPranavNagrecha

View on GitHub Installation ↓

Best use case

salesforce-data-pipeline-etl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Export large Salesforce datasets to a lakehouse via Bulk API 2.0, CDC streams, or Salesforce Data Pipelines. NOT for ad-hoc exports.

Teams using salesforce-data-pipeline-etl should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/salesforce-data-pipeline-etl/SKILL.md --create-dirs "https://raw.githubusercontent.com/PranavNagrecha/AwesomeSalesforceSkills/main/skills/integration/salesforce-data-pipeline-etl/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/salesforce-data-pipeline-etl/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How salesforce-data-pipeline-etl Compares

Feature / Agent	salesforce-data-pipeline-etl	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Export large Salesforce datasets to a lakehouse via Bulk API 2.0, CDC streams, or Salesforce Data Pipelines. NOT for ad-hoc exports.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Salesforce Data Pipeline / ETL

Production Salesforce → lake pipelines combine a one-time Bulk API 2.0 snapshot with an ongoing CDC or Platform Event stream. Naive incremental loads on LastModifiedDate lose updates during the query window; CDC guarantees ordered delta capture.

## Adoption Signals

Analytics workloads that need <1h freshness on Salesforce data in a warehouse.

- Bulk API 2.0 export when the source is full-table and the warehouse handles deduplication.
- Change Data Capture (CDC) export when downstream consumers need ordered events with primary-key delta semantics.

## Recommended Workflow

1. Initial snapshot: Bulk API 2.0 query job per object; store in staging.
2. Subscribe to CDC for each object via Pub/Sub API (or Change Data Capture Stream).
3. Apply deltas onto snapshot using event ChangeEventHeader.commitTimestamp + changeType (CREATE/UPDATE/DELETE/GAP_FILL).
4. Handle GAP_FILL events with a re-query of affected Ids (CDC may gap on heavy load).
5. Monitor replay lag; auto-refresh from Bulk snapshot if lag > threshold.

## Key Considerations

- CDC event retention is 3 days — downtime >3 days requires full re-snapshot.
- Field-level deletions of custom fields trigger schema migrations downstream.
- Big Object data cannot be streamed via CDC.
- Avoid SOQL polling at scale — you hit API limits.

## Worked Examples (see `references/examples.md`)

- *Snowflake mirror* — Finance analytics
- *Gap-fill handler* — CDC gap after outage

## Common Gotchas (see `references/gotchas.md`)

- **CDC retention exceeded** — Consumer offline >3 days; events lost.
- **Missing GAP_FILL** — Silently lose records.
- **SOQL polling fallback** — Eats API allocation.

## Top LLM Anti-Patterns (full list in `references/llm-anti-patterns.md`)

- LastModifiedDate polling as primary path
- Ignoring GAP_FILL events
- No replay-id checkpointing

## Official Sources Used

- Apex REST & Callouts — https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_callouts.htm
- Named Credentials — https://help.salesforce.com/s/articleView?id=sf.named_credentials_about.htm
- Connect REST API — https://developer.salesforce.com/docs/atlas.en-us.chatterapi.meta/chatterapi/
- Private Connect — https://help.salesforce.com/s/articleView?id=sf.private_connect_overview.htm
- Bulk API 2.0 — https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/
- Pub/Sub API — https://developer.salesforce.com/docs/platform/pub-sub-api/guide/intro.html

Related Skills

sandbox-data-masking

from PranavNagrecha/AwesomeSalesforceSkills

Use this skill when configuring or reviewing Salesforce Data Mask to protect PII/PHI in partial or full copy sandboxes after a refresh. Trigger keywords: data mask, sandbox masking, PII in sandbox, GDPR sandbox, HIPAA non-production, mask contacts, obfuscate fields non-production. NOT for sandbox refresh mechanics (use sandbox-refresh-and-templates), NOT for production data anonymization, NOT for Shield Platform Encryption at rest.

salesforce-shield-deployment

from PranavNagrecha/AwesomeSalesforceSkills

Roll out Shield (Platform Encryption + Event Monitoring + Field Audit Trail) end-to-end, sequencing feature enablement to avoid data lockout. NOT for Classic Encryption or general PE design.

gdpr-data-privacy

from PranavNagrecha/AwesomeSalesforceSkills

Use this skill when implementing GDPR or CCPA data privacy controls in Salesforce: Individual sObject linkage, consent tracking, Right to Be Forgotten (RTBF) requests, data subject request handling, and Privacy Center configuration. Trigger keywords: GDPR, data privacy, consent management, right to erasure, Individual object, ContactPointConsent, ShouldForget, data subject request, Privacy Center, data portability. NOT for general data quality cleanup, duplicate management, field-level encryption (see platform-encryption skill), or sandbox data masking (see sandbox-data-masking skill).

ferpa-compliance-in-salesforce

from PranavNagrecha/AwesomeSalesforceSkills

Use this skill when implementing FERPA (Family Educational Rights and Privacy Act) compliance controls in Salesforce Education Cloud or Education Data Architecture (EDA): LearnerProfile FERPA boolean fields, directory information opt-out via FLS and Individual data privacy flags, ContactPointTypeConsent for parental and third-party disclosure, 45-day student records response window tracking, and consent workflow automation. Trigger keywords: FERPA, student records privacy, LearnerProfile, parental disclosure, directory information opt-out, education data privacy, student consent, education cloud compliance. NOT for GDPR/CCPA general data privacy (see gdpr-data-privacy skill), platform encryption at rest (see platform-encryption skill), or HIPAA health-data compliance.

data-classification-labels

from PranavNagrecha/AwesomeSalesforceSkills

Classify Salesforce fields by data sensitivity and compliance category using the four built-in classification attributes (SecurityClassification, ComplianceGroup, BusinessOwnerId, BusinessStatus). Covers Metadata API deployment, Tooling API querying, and Einstein Data Detect recommendations. NOT for data masking, Shield Platform Encryption, or runtime access control enforcement.

customer-data-request-workflow

from PranavNagrecha/AwesomeSalesforceSkills

Implement GDPR/CCPA data subject rights (access, deletion, rectification) using Salesforce Privacy Center and/or custom workflow. NOT for general backup or org-level data retention policy.

omnistudio-deployment-datapacks

from PranavNagrecha/AwesomeSalesforceSkills

Use when exporting, importing, or version-controlling OmniStudio components using DataPacks via the OmniStudio DataPacks tool or vlocity CLI. Covers DataPack export/import, Git version control integration, CI/CD for OmniStudio. NOT for SFDX-based metadata deployment of non-OmniStudio components.

omnistudio-asynchronous-data-operations

from PranavNagrecha/AwesomeSalesforceSkills

Use Integration Procedures queues, DataRaptor Chain, and Remote Actions with async patterns for long-running OmniStudio flows. NOT for simple DataRaptor reads.

industries-cpq-vs-salesforce-cpq

from PranavNagrecha/AwesomeSalesforceSkills

Use this skill when comparing Industries CPQ (formerly Vlocity CPQ) with Salesforce CPQ (Revenue Cloud managed package) — covering feature parity, decision criteria, migration paths, and coexistence patterns. Trigger keywords: Vlocity CPQ, Industries CPQ, Salesforce CPQ comparison, Revenue Cloud migration, CPQ selection, which CPQ to use. NOT for implementing, configuring, or debugging either CPQ product.

dataraptor-transform-optimization

from PranavNagrecha/AwesomeSalesforceSkills

Use when DataRaptor Transform operations are slow, hit governor limits, or use Apex where formula fields would suffice. Covers formula vs Apex expressions, bulk transform sizing, and chained transform composition. Triggers: 'dataraptor transform slow', 'dataraptor formula vs apex', 'dataraptor bulk transform', 'dr governor limit'. NOT for DataRaptor Extract or Load performance.

dataraptor-patterns

from PranavNagrecha/AwesomeSalesforceSkills

Use when designing or reviewing OmniStudio DataRaptors, especially Extract versus Turbo Extract versus Transform versus Load, field mapping strategy, performance tradeoffs, and when to move work into Integration Procedures or Apex. Triggers: 'DataRaptor Extract', 'Turbo Extract', 'DataRaptor Load', 'DataRaptor Transform', 'OmniStudio data mapping'. NOT for overall OmniScript journey design or Integration Procedure sequencing when the main question is not the DataRaptor shape itself.

lwc-datatable-advanced

from PranavNagrecha/AwesomeSalesforceSkills

Advanced lightning-datatable patterns — inline edit + draftValues, custom cell types via extending LightningDatatable, sortable columns, infinite scroll with onloadmore, row-level errors, and the cost of large data sets. NOT for read-only display of small lists (plain lightning-datatable suffices) or fully custom grids (use a third-party library).