error-handling-in-integrations

Use this skill to design orchestration-layer error handling for Salesforce integrations — covering Platform Event replay recovery, dead-letter queue routing, cross-channel error notification patterns, circuit breaker design, and trigger suspension recovery. Trigger keywords: integration error handling, Platform Event retry, integration dead letter queue, EventBus RetryableException, integration circuit breaker, event bus trigger suspended. NOT for Apex exception handling (use apex-exception-handling skill), HTTP error response contracts (use api-error-handling-design), or retry backoff patterns (use retry-and-backoff-patterns).

8 stars

by PranavNagrecha


Best use case

error-handling-in-integrations is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using error-handling-in-integrations should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/error-handling-in-integrations/SKILL.md --create-dirs "https://raw.githubusercontent.com/PranavNagrecha/AwesomeSalesforceSkills/main/skills/integration/error-handling-in-integrations/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/error-handling-in-integrations/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How error-handling-in-integrations Compares

Feature / Agent	error-handling-in-integrations	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

SKILL.md Source

# Error Handling in Integrations

This skill activates when a developer or integration architect needs to design orchestration-layer error handling for Salesforce integrations. It covers Platform Event trigger suspension recovery, dead-letter queue patterns, circuit breaker design, and cross-channel failure notifications — distinct from single-transaction Apex exception handling and HTTP error response contracts.

---

## Before Starting

Gather this context before working on anything in this domain:

- Platform Events retain messages for 72 hours (the replay window). The stable deduplication key is the event message's `EventUuid` field, not the Replay ID, which is not guaranteed to stay unique across Salesforce maintenance activity such as an org migration.
- Throwing `EventBus.RetryableException` makes the platform retry the trigger, up to 9 times. After the ninth retry the trigger is suspended (its subscription moves to the Error state) and stops processing ALL new events on that channel.
- Existing skills cover related but distinct topics: `integration/retry-and-backoff-patterns` covers HTTP retry backoff; `integration/api-error-handling-design` covers HTTP error response contracts. This skill covers orchestration-layer routing and recovery.
- The most critical mistake: throwing RetryableException for permanent errors (bad data, invalid config) causes 9 retries of a record that will never succeed, then suspends the trigger — blocking all event processing.

---

## Core Concepts

### Platform Event Trigger Suspension and Recovery

Failure escalation path:
1. EventBus.RetryableException thrown → up to 9 automatic retries
2. After 9 failures: trigger suspended — all new events stop processing
3. Re-enable: Setup > Platform Events > [Event] > Subscriptions > Manage (next to the trigger) > Resume
4. Set Replay ID on re-enable to replay missed events from the 72-hour window

Recovery requires: knowing the last successfully processed Replay ID (must be stored by the subscriber), fixing the root cause, then resuming with replay from the correct position.
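
One way to keep a trigger from ever reaching step 2 is to cap retries in the trigger itself. The following is a minimal sketch, not a definitive implementation. It assumes the `OrderEvent__e` event and the `OrderIntegrationService.TransientException` type used in the pattern later in this skill, and the cap of 4 is an illustrative value. `EventBus.TriggerContext.currentContext().retries` reports how many times the trigger has already been retried.

```apex
trigger OrderEventRetryCapTrigger on OrderEvent__e (after insert) {
    // Assumption: 4 is an illustrative cap chosen to stay well below the 9-retry limit.
    final Integer MAX_RETRIES = 4;

    try {
        for (OrderEvent__e evt : Trigger.new) {
            OrderIntegrationService.processEvent(evt);
        }
    } catch (OrderIntegrationService.TransientException e) {
        if (EventBus.TriggerContext.currentContext().retries < MAX_RETRIES) {
            // Still under the cap: ask the platform to redeliver the batch.
            throw new EventBus.RetryableException('Transient: ' + e.getMessage());
        }
        // Cap reached: stop retrying so the trigger is not suspended.
        // In this sketch the failing event and any events after it in the batch
        // are not processed; a real handler would route them to a fallback
        // such as a dead-letter record.
        System.debug(LoggingLevel.ERROR, 'Retry cap reached: ' + e.getMessage());
    }
}
```

Giving up before the platform limit trades a few retry attempts for the guarantee that one stubborn failure cannot halt the whole channel.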

### Dead-Letter Queue Pattern

Salesforce has no built-in DLQ. Implement explicitly:
- Custom object `Integration_DLQ__c` with: Source_System__c, Event_Type__c, Payload__c (JSON text), Error_Message__c, Retry_Count__c, Status__c (Pending / Failed_Max_Retries / Resolved)
- Scheduled Apex retries Pending DLQ entries periodically
- After max retries: mark as Failed_Max_Retries and trigger ops notification
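
The retry job described above might look like the following sketch. It uses the `Integration_DLQ__c` fields listed in this section. `DlqReprocessor.reprocess` is a hypothetical helper that re-runs the business logic for one payload, the limit of 5 retries is illustrative, and the sketch assumes reprocessing makes no synchronous callout (Scheduled Apex cannot make one directly).

```apex
public with sharing class IntegrationDlqRetryJob implements Schedulable {
    // Assumption: illustrative limit; in practice read it from configuration.
    private static final Integer MAX_RETRIES = 5;

    public void execute(SchedulableContext sc) {
        List<Integration_DLQ__c> entries = [
            SELECT Id, Event_Type__c, Payload__c, Retry_Count__c, Status__c
            FROM Integration_DLQ__c
            WHERE Status__c = 'Pending'
            ORDER BY CreatedDate
            LIMIT 200
        ];

        for (Integration_DLQ__c entry : entries) {
            Decimal attempts = entry.Retry_Count__c;
            if (attempts == null) {
                attempts = 0;
            }
            attempts += 1;
            entry.Retry_Count__c = attempts;

            try {
                // Hypothetical helper, not part of the original skill.
                DlqReprocessor.reprocess(entry.Event_Type__c, entry.Payload__c);
                entry.Status__c = 'Resolved';
            } catch (Exception e) {
                entry.Error_Message__c = e.getMessage();
                if (attempts >= MAX_RETRIES) {
                    // Retries exhausted: park the entry; ops notification would hook in here.
                    entry.Status__c = 'Failed_Max_Retries';
                }
            }
        }
        update entries;
    }
}
```

It could be scheduled hourly with `System.schedule('Integration DLQ retry', '0 0 * * * ?', new IntegrationDlqRetryJob());`. If the reprocessing logic performs DML, consider wrapping each attempt in a savepoint so a failed attempt does not leave partial writes behind.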

### Circuit Breaker

For unstable external systems:
- Track consecutive failures in a Custom Setting (failure count + timestamp)
- CLOSED (normal): calls go through
- OPEN (threshold exceeded): skip external call, log OPEN state, notify ops
- HALF_OPEN (after cooldown): attempt one test call; success → CLOSE; failure → OPEN again
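
The three states can be derived from two stored values, as in the sketch below. Everything named here is an assumption for illustration: `Integration_Circuit__c` is a hypothetical hierarchy Custom Setting with `Failure_Count__c` (Number) and `Opened_At__c` (DateTime), and the threshold and cooldown values are arbitrary.

```apex
public with sharing class IntegrationCircuitBreaker {
    // Assumptions: illustrative values, not recommendations.
    private static final Integer FAILURE_THRESHOLD = 5;
    private static final Integer COOLDOWN_MINUTES = 15;

    public enum CircuitState { CLOSED, OPEN, HALF_OPEN }

    // Derives the state from the stored failure data.
    public static CircuitState currentState() {
        Integration_Circuit__c c = Integration_Circuit__c.getOrgDefaults();
        if (c.Opened_At__c == null) {
            return CircuitState.CLOSED;
        }
        Datetime retryAt = c.Opened_At__c.addMinutes(COOLDOWN_MINUTES);
        return Datetime.now() >= retryAt ? CircuitState.HALF_OPEN : CircuitState.OPEN;
    }

    // Call after a successful external call: resets the breaker to CLOSED.
    public static void recordSuccess() {
        Integration_Circuit__c c = Integration_Circuit__c.getOrgDefaults();
        c.Failure_Count__c = 0;
        c.Opened_At__c = null;
        upsert c;
    }

    // Call after a failed external call: counts the failure and opens
    // (or, from HALF_OPEN, re-opens) the breaker once the threshold is met.
    public static void recordFailure() {
        Integration_Circuit__c c = Integration_Circuit__c.getOrgDefaults();
        Decimal failures = c.Failure_Count__c;
        if (failures == null) {
            failures = 0;
        }
        failures += 1;
        c.Failure_Count__c = failures;
        if (failures >= FAILURE_THRESHOLD) {
            c.Opened_At__c = Datetime.now();
        }
        upsert c;
    }
}
```

A caller would check `currentState()` first, skip the external call when it returns `OPEN`, and call `recordSuccess()` or `recordFailure()` only after the callout completes, since Apex does not allow a callout after uncommitted DML in the same transaction. This sketch does not limit HALF_OPEN to a single test call, so concurrent transactions could each attempt one.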

### Cross-Channel Notification

- Platform Event `Integration_Error__e` → Flow → Email Alert for standard failures
- Platform Event → Named Credential callout to Slack webhook for high-severity
- Case creation for SLA-impacting failures requiring human resolution
- CRM Analytics or Lightning report on DLQ volume for operations dashboards
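
Each of these channels starts with publishing the error event. A minimal sketch of the publisher follows; the field names `Source_System__c`, `Severity__c`, and `Message__c` on `Integration_Error__e` are assumptions, since the skill does not define the event's schema.

```apex
public with sharing class IntegrationErrorNotifier {
    // Assumption: Integration_Error__e has the hypothetical fields
    // Source_System__c, Severity__c, and Message__c.
    public static Boolean publishError(String sourceSystem, String severity, String message) {
        Integration_Error__e errorEvent = new Integration_Error__e(
            Source_System__c = sourceSystem,
            Severity__c = severity,
            Message__c = message
        );

        // EventBus.publish returns a SaveResult; a failed publish does not throw.
        Database.SaveResult result = EventBus.publish(errorEvent);
        if (!result.isSuccess()) {
            for (Database.Error err : result.getErrors()) {
                System.debug(LoggingLevel.ERROR,
                    'Publish failed: ' + err.getStatusCode() + ' - ' + err.getMessage());
            }
        }
        return result.isSuccess();
    }
}
```

For error events it is usually worth setting the event's publish behavior to Publish Immediately, so the notification is still delivered when the failing transaction rolls back. A subscribing Flow can then branch on the severity value to choose between email, Slack, and Case creation.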

---

## Common Patterns

### Pattern: RetryableException for Transient, DLQ for Permanent

```apex
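trigger OrderEventTrigger on OrderEvent__e (after insert) {
    // Collect writes and commit them once after the loop to avoid DML inside the loop.
    List<Integration_DLQ__c> deadLetters = new List<Integration_DLQ__c>();
    String lastReplayId;

    for (OrderEvent__e event : Trigger.new) {
        try {
            OrderIntegrationService.processEvent(event);
            lastReplayId = event.ReplayId;
        } catch (OrderIntegrationService.TransientException e) {
            // Transient: throw RetryableException for auto-retry.
            // The throw rolls back any DML from this execution and the batch is redelivered.
            throw new EventBus.RetryableException('Transient: ' + e.getMessage());
        } catch (Exception e) {
            // Permanent: queue a DLQ record, do NOT throw RetryableException
            deadLetters.add(new Integration_DLQ__c(
                Event_Type__c = 'OrderEvent',
                Payload__c = JSON.serialize(event),
                Error_Message__c = e.getMessage(),
                Retry_Count__c = 0,
                Status__c = 'Pending'
            ));
        }
    }

    if (!deadLetters.isEmpty()) {
        insert deadLetters;
    }
    if (lastReplayId != null) {
        // Store the Replay ID of the last successfully processed event.
        // getOrgDefaults() plus upsert also works when no settings record exists yet.
        Integration_State__c state = Integration_State__c.getOrgDefaults();
        state.Last_Replay_Id__c = lastReplayId;
        upsert state;
    }
}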
```

---

## Decision Guidance

| Failure Scenario | Recommended Pattern | Reason |
|---|---|---|
| Transient external error (timeout, 503) | RetryableException | Platform auto-retry handles transient failures |
| Permanent data error (invalid payload) | Write to DLQ; no RetryableException | RetryableException on permanent errors suspends trigger |
| External system down > 1 hour | Circuit breaker OPEN + DLQ accumulate | Prevent cascade failures and API limit exhaustion |
| Trigger suspended | Recovery runbook: fix the root cause, resume the trigger, replay from the stored Replay ID | 72-hour window enables recovery |
| Silent failures not visible to ops | Cross-channel notification via a Platform Event | Ops must know about failures immediately |

---

## Recommended Workflow

1. Identify the integration pattern (Platform Events, REST, CDC, Bulk API) — each has different failure modes.
2. For Platform Event subscribers: implement RetryableException for transient errors only; DLQ for permanent failures.
3. Implement Replay ID tracking: store last successful Replay ID in a Custom Setting on every successful event.
4. Design DLQ object schema and Scheduled Apex retry job with configurable max retry count.
5. Design cross-channel notification: Platform Event for failures → Flow Email + Slack + Case creation based on severity.
6. For unstable external systems: implement circuit breaker using Custom Setting for failure count and circuit state.
7. Document the trigger suspension recovery runbook in team operations documentation.

---

## Review Checklist

- [ ] RetryableException used only for transient errors (not permanent)
- [ ] DLQ pattern implemented for permanent failures
- [ ] Replay ID tracking implemented on every successful Platform Event
- [ ] Trigger suspension recovery runbook documented
- [ ] Cross-channel error notification designed
- [ ] Circuit breaker designed for unstable external systems
- [ ] DLQ retry job with max retry limit and ops alert threshold

---

## Salesforce-Specific Gotchas

Non-obvious platform behaviors that cause real production problems:

1. **Trigger suspension affects ALL events — not just the failing ones** — Suspending a Platform Event trigger blocks all new events on that channel until manually re-enabled. One bad payload can halt all integration processing.
2. **Replay ID is unstable after Salesforce maintenance** — Replay IDs can become stale after maintenance. Store the event message ID for deduplication; use Replay ID only for starting the replay position.
3. **RetryableException on permanent errors suspends the trigger faster** — Throwing RetryableException on a permanent failure wastes all 9 retries on a record that will never succeed, then suspends the trigger. Only use RetryableException for genuinely transient errors.
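
The deduplication advice in gotcha 2 can be sketched as follows, using the standard `EventUuid` field as the event message ID. `Processed_Event__c` and its unique text field `Event_Uuid__c` are hypothetical, introduced only for this illustration.

```apex
trigger OrderEventDedupTrigger on OrderEvent__e (after insert) {
    Set<String> incoming = new Set<String>();
    for (OrderEvent__e evt : Trigger.new) {
        incoming.add(evt.EventUuid);
    }

    // UUIDs already recorded by an earlier delivery or replay.
    Set<String> seen = new Set<String>();
    for (Processed_Event__c p : [
        SELECT Event_Uuid__c
        FROM Processed_Event__c
        WHERE Event_Uuid__c IN :incoming
    ]) {
        seen.add(p.Event_Uuid__c);
    }

    List<Processed_Event__c> markers = new List<Processed_Event__c>();
    for (OrderEvent__e evt : Trigger.new) {
        if (seen.contains(evt.EventUuid)) {
            continue; // Duplicate delivery: already processed, skip it.
        }
        OrderIntegrationService.processEvent(evt);
        seen.add(evt.EventUuid);
        markers.add(new Processed_Event__c(Event_Uuid__c = evt.EventUuid));
    }
    // Record what was processed so a later replay can skip it.
    insert markers;
}
```

Because replays only reach back 72 hours, marker records older than that can be purged by a scheduled job.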

---

## Output Artifacts

| Artifact | Description |
|---|---|
| DLQ schema and retry job | Integration_DLQ__c design and Scheduled Apex pattern |
| Trigger suspension recovery runbook | Steps to re-enable and replay after trigger suspension |
| Circuit breaker design | Custom Setting schema and state-transition logic |
| Cross-channel notification design | Error event → notification channel mapping |

---

## Related Skills

- `integration/retry-and-backoff-patterns` — HTTP retry backoff for external API calls
- `integration/api-error-handling-design` — HTTP error response contracts
- `integration/event-driven-architecture-patterns` — Platform Event architecture
- `admin/integration-pattern-selection` — upstream pattern selection

Related Skills

omnistudio-error-handling-patterns

from PranavNagrecha/AwesomeSalesforceSkills

Use when designing fault behavior across Integration Procedures, DataRaptors, OmniScripts, and FlexCards — error routing, user-facing messaging, retry semantics, and idempotency. Triggers: 'omnistudio error', 'integration procedure fault', 'dataraptor error handling', 'omniscript retry', 'flexcard action failure'. NOT for general Apex exception design or Flow fault paths.

lwc-error-boundaries

from PranavNagrecha/AwesomeSalesforceSkills

Isolate component errors so one failure does not blank an entire page using errorCallback and graceful fallbacks. NOT for server-side Apex exception design.

lightning-navigation-dead-link-handling

from PranavNagrecha/AwesomeSalesforceSkills

Use when an LWC navigates via NavigationMixin to records or pages that may no longer exist, lack the user's access, or be permanently moved. Triggers: 'lightning navigation 404', 'navigate to deleted record', 'NavigationMixin error toast', 'graceful fallback when target page missing', 'permission denied on navigation'. NOT for general routing within an SPA or for Experience Cloud public-facing routing.

common-lwc-runtime-errors

from PranavNagrecha/AwesomeSalesforceSkills

Diagnose and fix runtime errors in Lightning Web Components including wire adapter failures, shadow DOM boundary violations, event propagation mistakes, async rendering timing bugs, NavigationMixin errors, Lightning Locker vs Lightning Web Security conflicts, and slot projection problems. Triggers: 'wire adapter returns undefined', 'querySelector returns null in LWC', 'custom event not received by parent', 'LWC component not rendering after connected callback', 'NavigationMixin page reference error'. NOT for LWC fundamentals, build/deployment errors, or Aura component debugging.

api-error-handling-design

from PranavNagrecha/AwesomeSalesforceSkills

Designing HTTP error classification, RFC 7807-style error payload structure, and client-side error parsing for Salesforce REST/SOAP integrations and custom Apex REST endpoints. Use when deciding which HTTP status codes to return from custom Apex REST services, how to structure error response bodies, how to classify inbound API errors as retry-safe vs non-retry-safe, or how to parse Salesforce error responses on the consumer side. NOT for retry execution mechanics or circuit breaker implementation (use retry-and-backoff-patterns). NOT for Apex exception class design (use apex-error-handling-framework). NOT for OAuth error flows (use oauth-flows-and-connected-apps).

flow-runtime-error-diagnosis

from PranavNagrecha/AwesomeSalesforceSkills

Use when a Salesforce Flow throws a runtime error, sends an unhandled fault email, or produces unexpected results in production or sandbox. Triggers: 'Flow error email', 'Flow failed at element', 'null reference in Flow', 'Flow SOQL limit error', 'Flow DML in loop error'. NOT for Flow design or building new flows (use record-triggered-flow-patterns or other flow/* skills), NOT for Flow debug log setup (use flow-debugging).

flow-error-monitoring

from PranavNagrecha/AwesomeSalesforceSkills

Set up monitoring + alerting for Flow runtime errors at org scale: routing fault emails, Flow runtime error reports, custom centralized logging (Integration_Log__c), escalation thresholds, and trend detection. NOT for diagnosing a specific flow error (use flow-runtime-error-diagnosis). NOT for debug-mode setup (use flow-debugging).

fault-handling

from PranavNagrecha/AwesomeSalesforceSkills

Use when designing, reviewing, or troubleshooting Salesforce Flow fault handling, error logging, and bulk-safe automation paths. Triggers: 'fault connector', '$Flow.FaultMessage', 'flow failed', 'record-triggered flow rollback', 'screen flow error'. NOT for generic Flow type selection unless the main risk is failure handling; NOT for Apex exception handling (see apex/exception-handling-patterns).

deployment-error-troubleshooting

from PranavNagrecha/AwesomeSalesforceSkills

Use when a Salesforce metadata deployment fails and you need to diagnose and fix the error. Trigger keywords: 'deployment failed', 'component failure', 'dependent class is invalid', 'code coverage failed', 'UNSUPPORTED_API_VERSION', 'deploy error', 'test failure blocking deploy', 'rollbackOnError', 'missing dependency deploy'. NOT for authoring destructive changes manifests (use destructive-changes-deployment). NOT for CI/CD pipeline setup (use github-actions-for-salesforce or gitlab-ci-for-salesforce). NOT for change set mechanics (use change-set-deployment).

deployment-error-diagnosis

from PranavNagrecha/AwesomeSalesforceSkills

Pattern catalog of common Salesforce metadata-deploy errors and their fixes — `Cannot change type` (field type already in use), dependent-metadata ordering (deploy field before profile that references it), profile / permission set delta issues (deactivated permissions blocking deploy), missing-reference errors, test class coverage failures, and the package.xml-shape mistakes that produce confusing first-line errors. Covers the SFDX / Metadata API error message shapes and how to translate them into the actual fix. NOT for designing the deployment pipeline (use devops/sfdx-cicd-pipeline), NOT for change set orchestration (use admin/changeset-builder).

exception-handling

from PranavNagrecha/AwesomeSalesforceSkills

Use when writing, reviewing, or debugging Apex exception handling, DmlException behavior, custom exception hierarchies, or user-safe error messages. Triggers: 'DmlException', 'swallowed exception', 'AuraHandledException', 'trigger rollback', 'try catch'. NOT for choosing async execution models or general governor-limit tuning.

error-handling-framework

from PranavNagrecha/AwesomeSalesforceSkills

Use when designing or implementing a cross-cutting Apex error handling framework: custom exception hierarchies, rollback-safe logging via Platform Events, BatchApexErrorEvent processing, correlation ID threading, or a unified catch/log/rethrow utility class. Trigger keywords: 'error framework', 'centralized logging', 'rollback-safe log', 'BatchApexErrorEvent', 'correlation ID async', 'AuraHandledException boundary', 'Error_Log__c design'. NOT for individual try/catch block syntax help, basic DmlException handling, or choosing between synchronous and asynchronous execution models.