data-cloud-integration-strategy

Use this skill when designing or troubleshooting the data pipeline strategy for connecting source systems to Data Cloud — including ingestion API pattern selection (streaming vs. batch), connector type decisions, DSO-to-DLO-to-DMO pipeline lag, and lakehouse federation patterns. Triggers on: Data Cloud ingestion API setup, streaming vs batch connector decision, Data Cloud connector types, MuleSoft Direct for Data Cloud, data pipeline lag for segmentation. NOT for standard Salesforce integration patterns (use integration-patterns skill), not for querying Data Cloud once data is ingested (use data-cloud-query-api), not for configuring standard admin connectors through the UI only.

by PranavNagrecha

Best use case

data-cloud-integration-strategy is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-cloud-integration-strategy can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-cloud-integration-strategy/SKILL.md --create-dirs "https://raw.githubusercontent.com/PranavNagrecha/AwesomeSalesforceSkills/main/skills/integration/data-cloud-integration-strategy/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/data-cloud-integration-strategy/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How data-cloud-integration-strategy Compares

Feature / Agent	data-cloud-integration-strategy	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Cloud Integration Strategy

This skill activates when a practitioner is designing or troubleshooting how source systems connect to Data Cloud. It covers ingestion API pattern selection (streaming vs. bulk), connector type decisions, multi-hop pipeline lag (DSO → DLO → DMO), schema constraints, and lakehouse federation options. It does NOT cover post-ingestion querying (use data-cloud-query-api) or standard Salesforce-to-Salesforce integration.

---

## Before Starting

Gather this context before working on anything in this domain:

- Data Cloud Ingestion API has two mutually exclusive patterns on a single connector: **streaming** (fire-and-forget micro-batches, ~3 minutes async, hard 200 KB per request limit) and **bulk** (CSV files, 150 MB max per file, 100 files max per job). One connector cannot mix both patterns.
- Streaming ingestion is NOT real-time. The ~3 minute processing interval is async and carries no sub-3-minute SLA. Practitioners who assume sub-minute latency will be surprised.
- Every connector writes to a Data Stream Object (DSO), which flows DSO → DLO (Data Lake Object) → DMO (Data Model Object). This multi-hop pipeline introduces lag before data is available for segmentation or activation.

---

## Core Concepts

### Connector Types

Data Cloud supports four categories of connectors:

| Connector Type | Use Case | Examples |
|---|---|---|
| Built-in (CRM Connector) | Salesforce CRM objects (near-real-time via Change Data Capture) | Salesforce org objects |
| Cloud Storage | Files from S3, GCS, Azure Data Lake | CSV/Parquet files on schedule |
| Ingestion API | Custom source systems via REST API | App databases, custom events |
| MuleSoft Direct | Unstructured sources, SharePoint, Confluence | Content repositories, legacy systems |

MuleSoft Direct requires separate MuleSoft licensing. It is the only connector type that handles unstructured content ingestion.

### Streaming vs. Bulk Ingestion API

**Streaming Ingestion:**
- Fire-and-forget micro-batches processed approximately every 3 minutes
- Hard limit: 200 KB per request
- No sub-3-minute latency guarantee — async processing
- Suitable for event-driven or high-frequency low-payload data
- A synchronous validation endpoint exists for dev-mode schema pre-flight

**Bulk Ingestion:**
- CSV files only (UTF-8, comma-delimited, RFC 4180 compliant)
- Up to 150 MB per file, maximum 100 files per job
- Full-replace semantics — partial updates (patch) are NOT supported
- Suitable for daily/nightly loads of large datasets

A single Ingestion API connector cannot use both modes — choose at connector creation time.

### DSO → DLO → DMO Pipeline Lag

Every connector writes data into a Data Stream Object (DSO). The platform then processes DSO records into a Data Lake Object (DLO) and subsequently maps to a Data Model Object (DMO) for segmentation and identity resolution. This multi-hop pipeline introduces cumulative lag. Data is typically not available for segmentation within minutes of ingestion — practitioners must account for this lag in SLA commitments.

Identity resolution (Unified Profile creation) runs as frequently as every 15 minutes but is independent of connector ingestion lag.
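
To make the cumulative lag concrete for an SLA conversation, the hops can be added up as a simple timeline. The sketch below is illustrative only: the 3-minute and 15-minute figures come from this document, while the DSO → DLO and DLO → DMO durations are hypothetical placeholders that must be replaced with values measured in your own org.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Stage:
    name: str
    minutes: float  # worst-case wait contributed by this hop


def availability_timeline(stages):
    """Return (stage name, cumulative minutes) pairs for a chain of hops."""
    totals = accumulate(stage.minutes for stage in stages)
    return [(stage.name, total) for stage, total in zip(stages, totals)]


stages = [
    Stage("streaming micro-batch into DSO", 3),  # ~3 min, from this document
    Stage("DSO -> DLO processing", 5),           # hypothetical placeholder
    Stage("DLO -> DMO mapping", 10),             # hypothetical placeholder
    Stage("identity resolution cycle", 15),      # runs as often as every 15 min
]

if __name__ == "__main__":
    for name, total in availability_timeline(stages):
        print(f"{total:>5} min  {name}")
```

With these placeholder numbers the worst case before a unified profile reflects a new event is 33 minutes, which is the kind of figure to quote to stakeholders instead of "about 3 minutes".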

### Schema Constraints for Ingestion API

Ingestion API schemas are defined in OpenAPI 3.0.x YAML format. After a schema is deployed, changes are largely irreversible: fields cannot be removed, field types cannot be changed, and objects cannot be deleted. Engagement-category DSOs require a mandatory `DateTime` field. Plan the schema carefully before deploying to production.
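
Because these changes cannot be undone, a pre-deployment check that compares the deployed schema with a proposed revision can catch breaking edits early. The sketch below is a hypothetical helper, not part of any Salesforce tooling. It assumes each schema has already been reduced to a plain `{object: {field: type}}` mapping; parsing the OpenAPI YAML itself is left out.

```python
def find_breaking_changes(deployed, proposed):
    """List changes the platform would reject after deployment.

    Both arguments are simplified {object_name: {field_name: type}} mappings.
    Adding new objects or new fields is not reported.
    """
    problems = []
    for obj, fields in deployed.items():
        if obj not in proposed:
            problems.append(f"object removed: {obj}")
            continue
        for field, field_type in fields.items():
            if field not in proposed[obj]:
                problems.append(f"field removed: {obj}.{field}")
            elif proposed[obj][field] != field_type:
                new_type = proposed[obj][field]
                problems.append(
                    f"type changed: {obj}.{field} ({field_type} -> {new_type})"
                )
    return problems


if __name__ == "__main__":
    deployed = {"web_event": {"event_id": "string", "event_time": "date-time"}}
    proposed = {"web_event": {"event_id": "integer", "page_url": "string"}}
    for problem in find_breaking_changes(deployed, proposed):
        print(problem)
```

Running a check like this in CI before each schema upload keeps additive changes flowing while blocking removals and type changes.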

### Lakehouse Federation

Data Cloud supports zero-copy federation to external lakehouse platforms (Snowflake, Databricks, BigQuery, Redshift) via Data Federation. This allows querying external data in Data Cloud without physical ingestion. Batch ingestion caps at 100M rows or 50 GB per object — for datasets above this threshold, federation is the recommended approach.

---

## Common Patterns

### Pattern 1: Streaming Ingestion for Application Event Data

**When to use:** Source system generates frequent, small payloads (e.g., behavioral events, clickstream, IoT sensor readings) and near-real-time availability (within minutes) is acceptable.

**How it works:**

1. Create an Ingestion API connector in Data Cloud Setup, select Streaming mode.
2. Define the OpenAPI 3.0.x schema for the event payload, including a DateTime field for engagement-category objects.
3. Obtain a Data Cloud-specific OAuth token for the connected app.
4. POST events to the streaming endpoint (200 KB max per request).
5. Events batch approximately every 3 minutes into DSO, then flow to DLO and DMO.

**Why not use bulk:** Bulk is file-based (CSV), batched, and designed for high-volume periodic loads. Streaming suits event-driven, low-payload, frequent patterns.
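
Step 4 of the pattern above can be sketched with the Python standard library. This is a hedged illustration, not the official client. The URL path `/api/v1/ingest/sources/{source}/{object}`, the `{"data": [...]}` envelope, and all names (`build_streaming_request`, `tenant_url`, `source_api_name`) are assumptions to verify against the Ingestion API reference. The function only builds the request; obtaining the token and sending the request are left to the caller.

```python
import json
import urllib.request

MAX_REQUEST_BYTES = 200 * 1024  # assumption: 200 KB read as 200 * 1024 bytes


def build_streaming_request(tenant_url, source_api_name, object_name,
                            access_token, records):
    """Build (but do not send) a streaming ingestion POST request."""
    body = json.dumps({"data": records}).encode("utf-8")  # assumed envelope
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError("payload exceeds the 200 KB per-request limit")
    # Assumed endpoint layout; confirm against the current API reference.
    url = (f"{tenant_url.rstrip('/')}/api/v1/ingest/sources/"
           f"{source_api_name}/{object_name}")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_streaming_request(
        "https://example.invalid",  # placeholder tenant endpoint
        "my_source", "web_event", "TOKEN",
        [{"event_id": "1", "event_time": "2024-01-01T00:00:00Z"}],
    )
    print(request.get_method(), request.full_url)
    # To send: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the size check and URL easy to unit test without network access.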

### Pattern 2: Bulk Ingestion for Nightly Data Warehouse Sync

**When to use:** A large relational dataset (e.g., order history, historical customer records) needs to be loaded from an external data warehouse on a nightly schedule.

**How it works:**

1. Export source data as UTF-8 CSV (max 150 MB per file, 100 files per job).
2. Create an Ingestion API connector in Bulk mode.
3. POST the CSV files to the bulk upload endpoint in a single job.
4. Monitor job status until processing completes.
5. Data flows DSO → DLO → DMO with multi-hop pipeline lag.

**Why not use streaming:** Large files cannot be streamed due to the 200 KB per-request limit. Bulk handles full dataset replacement efficiently.
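
Step 1 of the pattern above often requires splitting a large export into files that respect both bulk limits. The sketch below shows one way to do that in memory. It assumes "150 MB" means 150 * 1024 * 1024 bytes and that every file repeats the header row; the function name `split_csv` is hypothetical, and a real export would write to disk rather than hold files in memory.

```python
import csv
import io

MAX_FILE_BYTES = 150 * 1024 * 1024  # assumption: 150 MB read as 150 MiB
MAX_FILES_PER_JOB = 100


def _encode_row(row):
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\r\n").writerow(row)  # RFC 4180 CRLF
    return buffer.getvalue().encode("utf-8")


def split_csv(header, rows, max_file_bytes=MAX_FILE_BYTES,
              max_files=MAX_FILES_PER_JOB):
    """Split rows into UTF-8 CSV files, each starting with the header row."""
    header_bytes = _encode_row(header)
    files, current, size = [], [header_bytes], len(header_bytes)
    for row in rows:
        line = _encode_row(row)
        if len(header_bytes) + len(line) > max_file_bytes:
            raise ValueError("a single row does not fit in one file")
        if size + len(line) > max_file_bytes:
            files.append(b"".join(current))  # close the full file
            current, size = [header_bytes], len(header_bytes)
        current.append(line)
        size += len(line)
    if len(current) > 1:  # skip a trailing file that holds only the header
        files.append(b"".join(current))
    if len(files) > max_files:
        raise ValueError("export needs more files than one job allows")
    return files


if __name__ == "__main__":
    rows = [[str(i), f"name,{i}"] for i in range(50)]
    for index, data in enumerate(split_csv(["id", "name"], rows, 200)):
        print(f"file {index}: {len(data)} bytes")
```

If the export exceeds the per-job file count, the error is a signal to split the load across jobs or to evaluate Data Federation.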

---

## Decision Guidance

| Situation | Recommended Approach | Reason |
|---|---|---|
| High-frequency low-payload events from custom app | Streaming Ingestion API | Fire-and-forget micro-batches, event-driven |
| Nightly full-dataset load from data warehouse | Bulk Ingestion API (CSV) | Handles large files, 150 MB/file max |
| Salesforce CRM object sync | CRM Connector (built-in) | Native, near-real-time via CDC, no custom code |
| SharePoint or Confluence content | MuleSoft Direct | Only connector type for unstructured sources |
| Dataset > 50 GB or 100M rows | Data Federation (zero-copy) | Exceeds physical ingestion limits |
| Partial record updates (patch semantics) | Avoid bulk Ingestion API; evaluate streaming or redesign | Bulk is full-replace only; no patch support |
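
The table can also be read as a small rule chain, which is handy when the decision needs to be recorded in a design document or tested. The sketch below is one possible encoding. The `Source` fields, the `recommend` function, and the order in which rules are checked are illustrative choices, not something the platform prescribes.

```python
from dataclasses import dataclass


@dataclass
class Source:
    kind: str  # "salesforce_crm", "unstructured", or "custom"
    size_gb: float = 0.0
    row_count: int = 0
    high_frequency_small_payloads: bool = False
    needs_partial_updates: bool = False


def recommend(source):
    """Map a source description to an approach from the decision table."""
    if source.size_gb > 50 or source.row_count > 100_000_000:
        return "Data Federation (zero-copy)"
    if source.kind == "salesforce_crm":
        return "CRM Connector (built-in)"
    if source.kind == "unstructured":
        return "MuleSoft Direct"
    if source.needs_partial_updates:
        return "Redesign (bulk is full-replace only)"
    if source.high_frequency_small_payloads:
        return "Streaming Ingestion API"
    return "Bulk Ingestion API (CSV)"


if __name__ == "__main__":
    print(recommend(Source("custom", high_frequency_small_payloads=True)))
    print(recommend(Source("custom", size_gb=80)))
```

Checking the size threshold first reflects the idea that a dataset above the ingestion caps rules out physical ingestion regardless of source type; reorder the rules if your priorities differ.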

---

## Recommended Workflow

1. Identify the source system type, data volume, and latency requirements before selecting a connector type.
2. For custom sources, decide streaming vs. bulk based on payload size and frequency — a single connector cannot mix both.
3. Design the OpenAPI 3.0.x schema carefully before deploying — field removal and type changes are not supported post-deployment.
4. For engagement-category DSOs, include a mandatory DateTime field in the schema.
5. Implement OAuth token flow for the Ingestion API connected app; use the streaming validation endpoint for schema pre-flight during development.
6. Account for the DSO → DLO → DMO multi-hop lag when communicating data availability SLAs to stakeholders.
7. For large datasets (>50 GB), evaluate Data Federation instead of physical ingestion.

---

## Review Checklist

- [ ] Connector type matches source system (CRM, cloud storage, Ingestion API, MuleSoft Direct)
- [ ] Streaming vs. bulk decision documented with rationale
- [ ] Schema designed and reviewed before deployment — irreversible after deploy
- [ ] 200 KB streaming limit and 150 MB / 100-file bulk limits verified against volume
- [ ] DateTime field included for engagement-category DSOs
- [ ] Multi-hop pipeline lag (DSO → DLO → DMO) communicated to stakeholders
- [ ] For datasets > 50 GB: Data Federation considered as alternative
- [ ] MuleSoft Direct licensing confirmed if selected

---

## Salesforce-Specific Gotchas

1. **Streaming Is Not Real-Time.** The Ingestion API streaming mode processes batches asynchronously, approximately every 3 minutes, and carries no sub-minute SLA. An integration that requires sub-minute latency in Data Cloud cannot rely on the Ingestion API.

2. **Single Connector Cannot Mix Streaming and Bulk.** A connector is created in either streaming or bulk mode. Switching modes requires creating a new connector, and changing the schema after deployment is largely impossible (no field removal, no type change). Choose the mode and schema carefully at inception.

3. **Bulk Ingestion Is Full-Replace Only.** Bulk ingestion does not support partial updates (PATCH semantics). Every bulk job replaces the full dataset for the objects in scope. Partial-update use cases must use streaming ingestion or a different approach.

4. **Multi-Hop Lag Before Segmentation.** Data ingested via the Ingestion API is NOT immediately available for segmentation or activation. It must traverse DSO → DLO → DMO first, and identity resolution adds a further processing step. Plans that assume immediate availability will fail.

5. **Irreversible Schema After Deployment.** Ingestion API schemas are defined in OpenAPI 3.0.x YAML. Once a schema is deployed, you cannot remove a field, change its type, or delete an object. Retiring a schema requires creating a new connector with a new schema and migrating historical data.

---

## Output Artifacts

| Artifact | Description |
|---|---|
| Connector type selection | Documented decision: CRM/cloud storage/Ingestion API/MuleSoft Direct with rationale |
| Ingestion API schema | OpenAPI 3.0.x YAML for custom source DSO |
| Pipeline lag estimate | DSO → DLO → DMO latency projection for stakeholder SLA communication |
| Ingestion client code | OAuth token flow + streaming/bulk POST implementation |

---

## Related Skills

- data-cloud-query-api — for querying DMO data once ingestion is complete
- data-cloud-activation-development — for event-driven actions on ingested DMO data
- rest-api-patterns — for standard Salesforce REST/SOQL integration patterns
- mulesoft-anypoint-architecture — for MuleSoft Direct integration architecture
