scheduled-apex-failure-detection-and-monitoring

Use when nightly batch / scheduled Apex jobs are failing without anyone noticing — covers why uncaught exceptions in `execute()` go to the debug log instead of email, how to query `AsyncApexJob` for `Status`, `NumberOfErrors`, and `ExtendedStatus`, when to implement `Database.RaisesPlatformEvents` so the platform publishes `BatchApexErrorEvent` on uncaught failures, how to subscribe to that event with an Apex trigger and notify operators, and how to layer a custom watcher schedule on top so silent-failure modes (job that never started, scheduled class deleted, queue stuck on `Queued`) still surface. Triggers: 'nightly batch failed at 2am with no notification', 'how do we know if a scheduled apex job is failing', 'BatchApexErrorEvent vs custom retry logic', 'Setup Apex Jobs only shows last 7 days, where else can I look', 'job is stuck in queued status nobody noticed for a week'. NOT for general Apex exception handling patterns (use apex/apex-exception-handling-and-logging), NOT for Batch Apex authoring or chunking strategy (use apex/batch-apex-design), NOT for Setup → Apex Jobs UI walkthrough as an admin task (use admin/batch-job-scheduling-and-monitoring), NOT for retry logic itself (use apex/scheduled-apex-retry-patterns once authored).

by PranavNagrecha

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/scheduled-apex-failure-detection-and-monitoring/SKILL.md --create-dirs "https://raw.githubusercontent.com/PranavNagrecha/AwesomeSalesforceSkills/main/skills/apex/scheduled-apex-failure-detection-and-monitoring/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/scheduled-apex-failure-detection-and-monitoring/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

SKILL.md Source

# Scheduled Apex Failure Detection And Monitoring

Activate this skill when a Salesforce team needs to *know* that their scheduled and batch Apex jobs failed, rather than assume they succeeded. By default, an uncaught exception in async execution surfaces only in the debug log, the job's `AsyncApexJob` row, and an exception email that operators rarely see: no durable log record, no operator alert, and no signal to the system of record. This skill closes that gap with three layered mechanisms: in-job try-catch logging, the `BatchApexErrorEvent` platform event for batch-class crashes, and a custom `AsyncApexJob` watcher schedule for everything else (stuck jobs, never-fired schedules, deleted classes).

---

## Before Starting

Gather this context before designing failure detection:

- **What kind of async work is failing.** Three different things commonly get called "scheduled Apex" and each has a different failure surface:
  - A `Schedulable` class invoked by `System.schedule(...)` directly. Uncaught exception in `execute(SchedulableContext)` is logged to the debug log only.
  - A `Schedulable` whose `execute()` calls `Database.executeBatch(new MyBatch(), N)` to enqueue a batch. The schedule fires fine, but the call to `executeBatch` itself can fail (e.g. concurrency limit, governor in the scheduler context), and the batch never runs.
  - A `Database.Batchable` job in flight. An uncaught exception in `execute()` rolls back the *current chunk* while later chunks still run; one in `start()` or `finish()` fails that phase of the job. Either way the failure is recorded on `AsyncApexJob`, but only if the exception actually escapes your code. With `Database.RaisesPlatformEvents` the platform also publishes `BatchApexErrorEvent`.
- **Who owns "the system noticed."** Most orgs assume Setup → Apex Jobs is the source of truth. It only shows the last 7 days, only shows what made it onto the queue, and is rarely checked. If the failure path depends on a human opening that page, the failure path is broken.
- **Whether Event Monitoring is licensed.** With Event Monitoring, the `ApexExecution` event log file (`EVENT_TYPE = ApexExecution`) records every Apex execution including async, with `STATUS` and `MESSAGE` fields useful for reconciliation. Without it, `AsyncApexJob` SOQL is the primary signal.
- **Existing logging substrate.** If the org already has a custom log object (commonly `Application_Log__c` or similar) populated by an `ApplicationLogger`, the failure notifications should write there too. If not, decide *now* whether you're shipping a logger as part of this work or piggybacking on Custom Notifications + email.

---

## Core Concepts

### Concept 1 — Why scheduled Apex fails silently by default

Three behaviors compound into the silent-failure mode:

1. **Uncaught exceptions in async `execute()` rarely reach the people who operate the job.** The platform's unhandled-exception email is addressed to the developer in the failing class's `LastModifiedBy` field and to any addresses configured in Setup → Apex Exception Email, which is often an inbox nobody on the operations side watches. Beyond that email, the exception for a batch job is recorded on the `AsyncApexJob` row (in `ExtendedStatus`) and written to the debug log if one is being captured, but nothing alerts an operator unless you handle the exception yourself or subscribe to `BatchApexErrorEvent`. This is the single biggest gap teams discover after a quarter-end batch fails on a Sunday.
2. **Setup → Apex Jobs is a 7-day rolling window.** The page renders from `AsyncApexJob`, which is retained for 7 days for completed entries, longer for failed/aborted, but the UI filter typically defaults to recent entries and does not surface "schedule fired but enqueue failed" cleanly.
3. **A scheduled job whose underlying Apex class is deleted continues to occupy a `CronTrigger` slot but cannot fire.** No email, no warning. The job appears in `CronTrigger` but produces no `AsyncApexJob` rows. The first symptom is almost always "we noticed last quarter's data wasn't being refreshed."

The design implication: any failure-detection design must cover (a) caught and logged exceptions, (b) uncaught exceptions during batch execution, and (c) jobs that should have run but didn't.

### Concept 2 — `AsyncApexJob` is the queryable record for every async execution

Every async Apex execution (Future, Queueable, Batch, Scheduled) creates an `AsyncApexJob` row. The relevant fields for monitoring are:

- **`Status`**: values are `Queued`, `Preparing`, `Processing`, `Aborted`, `Completed`, `Failed`, `Holding`. `Failed` is the primary failure signal; `Aborted` is operator-initiated; `Holding` means the job is waiting in the Flex Queue.
- **`NumberOfErrors`**: count of chunks that errored within a Batch job. A batch can end with `Status = 'Completed'` and `NumberOfErrors > 0`, so completion does not imply success.
- **`ExtendedStatus`**: short description of the first error in the job (truncated to ~255 chars). This is what surfaces in Setup → Apex Jobs.
- **`JobType`**: `BatchApex`, `BatchApexWorker`, `Queueable`, `Future`, `ScheduledApex`, `ApexToken`, `SharingRecalculation`, etc. Filter the watcher to job types you care about.
- **`ApexClassId`**: FK to `ApexClass.Id`. Select `ApexClass.Name` through the relationship, or resolve via `[SELECT Name FROM ApexClass WHERE Id = :id]`, to get the class name in the alert.
- **`MethodName`**, **`CompletedDate`**, **`CreatedDate`**: timing context for "stuck job" detection (e.g. `Status = 'Queued' AND CreatedDate < :Datetime.now().addHours(-2)`).

A baseline failure-detection SOQL is therefore:

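```apex
// Bind variable computed up front so the query text stays readable.
Datetime windowStart = Datetime.now().addHours(-25);

List<AsyncApexJob> failedJobs = [
    SELECT Id, JobType, ApexClassId, ApexClass.Name, Status, NumberOfErrors,
           ExtendedStatus, CompletedDate, CreatedDate
    FROM AsyncApexJob
    WHERE Status IN ('Failed', 'Aborted')
      AND CompletedDate >= :windowStart
    ORDER BY CompletedDate DESC
];
```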

The 25-hour window covers a daily cadence with overlap for clock skew; tighten or widen based on cadence.
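The same object can answer the two questions the baseline query leaves open: which jobs are stuck, and which batches finished with failed chunks. The following is a minimal sketch as anonymous Apex. The two-hour "stuck" threshold is an illustrative assumption, not a platform default, and the `ScheduledApex` exclusion reflects the common observation that those rows sit in `Queued` for as long as the schedule is active; verify that in your own org before relying on it.

```apex
// Assumption: two hours in Queued/Holding counts as "stuck"; tune to your SLA.
Datetime stuckCutoff = Datetime.now().addHours(-2);
Datetime windowStart = Datetime.now().addHours(-25);

// Jobs that were enqueued but have not started processing.
// ScheduledApex rows are excluded because they are expected to look Queued.
List<AsyncApexJob> stuckJobs = [
    SELECT Id, JobType, ApexClass.Name, Status, CreatedDate
    FROM AsyncApexJob
    WHERE Status IN ('Queued', 'Holding')
      AND JobType != 'ScheduledApex'
      AND CreatedDate < :stuckCutoff
    ORDER BY CreatedDate ASC
];

// Batch jobs that reached Completed but had at least one failed chunk.
List<AsyncApexJob> partialFailures = [
    SELECT Id, ApexClass.Name, NumberOfErrors, JobItemsProcessed,
           TotalJobItems, ExtendedStatus, CompletedDate
    FROM AsyncApexJob
    WHERE JobType = 'BatchApex'
      AND Status = 'Completed'
      AND NumberOfErrors > 0
      AND CompletedDate >= :windowStart
    ORDER BY CompletedDate DESC
];

System.debug(stuckJobs.size() + ' stuck job(s), '
    + partialFailures.size() + ' batch(es) completed with errors');
```

Keeping the three conditions as separate queries makes each alert easy to label; a watcher that is tight on SOQL limits could fold them into one query with `OR` groups instead.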

### Concept 3 — `BatchApexErrorEvent` is the platform's signal for uncaught batch failures

`BatchApexErrorEvent` is a standard Platform Event that the platform publishes automatically whenever a Batch Apex execution throws an uncaught exception, but only if the batch class implements the `Database.RaisesPlatformEvents` marker interface. The event carries the failing job's `AsyncApexJobId`, the exception type and message, the stack trace, the `JobScope` (the record IDs of the failed chunk as a comma-separated string, so a subscriber can re-enqueue just those records or a smaller scope), and `Phase` (which lifecycle method threw: `START`, `EXECUTE`, or `FINISH`).

To activate it:

1. Implement `Database.RaisesPlatformEvents` on your batch class:
   ```apex
   public class NightlyAccountRollup
     implements Database.Batchable<SObject>, Database.RaisesPlatformEvents {
       // start, execute, finish
   }
   ```
2. Subscribe with an Apex trigger on `BatchApexErrorEvent`:
   ```apex
   trigger BatchApexErrorEventTrigger on BatchApexErrorEvent (after insert) {
     for (BatchApexErrorEvent evt : Trigger.new) {
       // log, notify, optionally re-enqueue
     }
   }
   ```

Two important constraints:
- The event covers **uncaught** exceptions only. Anything you `try { ... } catch(Exception e) { }` away never publishes.
- It covers **Batch Apex only**. Queueable, Schedulable, `@future` failures do not produce `BatchApexErrorEvent`. Those need either in-job try-catch logging or the `AsyncApexJob` watcher.
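To make the subscriber body concrete, here is one possible shape for the "log, notify" step: a helper that turns each event into a single alert line with the class name resolved from `AsyncApexJob`. The class name `BatchErrorAlertBuilder` and the alert format are hypothetical, and delivery is left to whatever logger or notification channel the org already uses.

```apex
// Hypothetical helper. Call it from the BatchApexErrorEvent trigger:
//   for (String alert : BatchErrorAlertBuilder.buildAlerts(Trigger.new)) { /* log or notify */ }
public with sharing class BatchErrorAlertBuilder {

    // Returns one human-readable alert line per event.
    public static List<String> buildAlerts(List<BatchApexErrorEvent> events) {
        // Collect job IDs first so class names are resolved in one query.
        Set<Id> jobIds = new Set<Id>();
        for (BatchApexErrorEvent evt : events) {
            if (evt.AsyncApexJobId != null) {
                jobIds.add((Id) evt.AsyncApexJobId);
            }
        }
        Map<Id, AsyncApexJob> jobsById = new Map<Id, AsyncApexJob>(
            [SELECT Id, ApexClass.Name FROM AsyncApexJob WHERE Id IN :jobIds]
        );

        List<String> alerts = new List<String>();
        for (BatchApexErrorEvent evt : events) {
            AsyncApexJob job = evt.AsyncApexJobId == null
                ? null
                : jobsById.get((Id) evt.AsyncApexJobId);
            String className = (job != null && job.ApexClass != null)
                ? job.ApexClass.Name
                : 'unknown class';

            // JobScope is empty for failures outside execute(); count IDs when present.
            Integer scopeSize = String.isBlank(evt.JobScope)
                ? 0
                : evt.JobScope.split(',').size();

            alerts.add(
                'Batch failure in ' + className
                + ' | job ' + evt.AsyncApexJobId
                + ' | phase ' + evt.Phase
                + ' | ' + evt.ExceptionType + ': ' + evt.Message
                + ' | records in failed scope: ' + scopeSize
            );
        }
        return alerts;
    }
}
```

Keeping the formatting in a class rather than in the trigger body keeps the trigger thin and lets the same lines feed a log object, an email, or a Custom Notification without duplication.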

### Concept 4 — The watcher schedule pattern fills the gaps

A *watcher* is a separate scheduled Apex class — typically running every 15 minutes or hourly — whose only job is to query `AsyncApexJob` and `CronTrigger` for failure or stuck conditions and notify operators. This catches the failure modes neither try-catch nor `BatchApexErrorEvent` cover:

- **Job that never enqueued.** A `Schedulable` whose `execute()` threw before `Database.executeBatch` was called. The schedule fires (`CronTrigger.PreviousFireTime` advances), but no Batch `AsyncApexJob` row appears for that window.
- **Job stuck in `Queued`.** Concurrency limits or Flex Queue saturation can leave a job in `Status = 'Queued'` for hours. With no `Failed` status, neither try-catch nor `BatchApexErrorEvent` triggers.
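- **Job that completed with failed chunks.** When `execute()` throws in some chunks of a Batch Apex job, those chunks roll back but the job carries on, so `Status` ends as `Completed` with `NumberOfErrors > 0`. Watchers should treat these as failures. A batch that catches its own exceptions leaves `NumberOfErrors` at 0, so that case shows up only in the in-job log.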
- **Schedule that no longer maps to a class.** `CronTrigger` row exists, but the underlying class was deleted. The watcher reads `CronTrigger.CronJobDetail.Name` and reconciles against expected schedules.

The watcher is itself scheduled Apex, so the same considerations apply: it must `try-catch` its own execution, write to a log, and ideally have a *second*, lighter-weight watcher (e.g. an external uptime ping) confirming the watcher itself ran. This is the limit of in-org monitoring — at the boundary you need an external observer.
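A watcher along these lines might look like the sketch below. The class name `AsyncJobWatcher`, the thresholds, and the recipient address are all hypothetical; email is used only because it needs no custom metadata, and deduplication by job ID is omitted for brevity.

```apex
// Hypothetical watcher; names, thresholds, and the recipient are illustrative.
public with sharing class AsyncJobWatcher implements Schedulable {

    private static final Integer LOOKBACK_HOURS = 1;  // assumption: runs at least hourly
    private static final Integer STUCK_HOURS = 2;     // assumption: SLA for Queued/Holding
    private static final String ALERT_ADDRESS = 'ops@example.com'; // placeholder address

    public void execute(SchedulableContext ctx) {
        try {
            List<String> findings = collectFindings(Datetime.now());
            if (!findings.isEmpty()) {
                notifyOperators(findings);
            }
        } catch (Exception e) {
            // The watcher is scheduled Apex too, so it records its own failure.
            System.debug(LoggingLevel.ERROR, 'AsyncJobWatcher failed: ' + e.getMessage());
        }
    }

    // Returns one line per failed, partially failed, or stuck job.
    @TestVisible
    private static List<String> collectFindings(Datetime now) {
        Datetime since = now.addHours(-LOOKBACK_HOURS);
        Datetime stuckBefore = now.addHours(-STUCK_HOURS);
        List<String> findings = new List<String>();
        for (AsyncApexJob job : [
            SELECT Id, JobType, ApexClass.Name, Status, NumberOfErrors, ExtendedStatus
            FROM AsyncApexJob
            WHERE (Status IN ('Failed', 'Aborted') AND CompletedDate >= :since)
               OR (Status = 'Completed' AND NumberOfErrors > 0 AND CompletedDate >= :since)
               OR (Status IN ('Queued', 'Holding') AND JobType != 'ScheduledApex'
                   AND CreatedDate < :stuckBefore)
            ORDER BY CreatedDate DESC
            LIMIT 200
        ]) {
            String className = job.ApexClass == null ? 'n/a' : job.ApexClass.Name;
            findings.add(job.Status + ' | ' + job.JobType + ' | ' + className
                + ' | errors=' + job.NumberOfErrors
                + ' | ' + job.ExtendedStatus + ' | ' + job.Id);
        }
        return findings;
    }

    private static void notifyOperators(List<String> findings) {
        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        mail.setToAddresses(new List<String>{ ALERT_ADDRESS });
        mail.setSubject('Async Apex watcher: ' + findings.size() + ' finding(s)');
        mail.setPlainTextBody(String.join(findings, '\n'));
        Messaging.sendEmail(new List<Messaging.SingleEmailMessage>{ mail });
    }
}
```

Apex cron expressions take a single minute value rather than an interval, so a 15-minute cadence is usually registered as four schedules, for example `System.schedule('AsyncJobWatcher :15', '0 15 * * * ?', new AsyncJobWatcher());` repeated for minutes 0, 30, and 45.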

---

## Recommended Workflow

1. **Inventory existing scheduled jobs and classify them.** Query `CronTrigger` for active schedules, cross-reference against `ApexClass`, and bucket each into Schedulable-only, Schedulable-launching-Batch, or pure Batch invoked elsewhere. Run `scripts/check_scheduled_apex_failure_detection_and_monitoring.py` against the SFDX project to flag classes that schedule themselves without exception handling.
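2. **Add try-catch + structured logging to every `execute()` body.** Whether `Schedulable` or `Batchable`, wrap the body in try-catch and write to a log object on exception. Re-throw from the catch only if you *want* the platform to mark the job `Failed` (and, for batch classes implementing `Database.RaisesPlatformEvents`, publish `BatchApexErrorEvent`). Bear in mind that a re-throw rolls back the transaction, including any log record inserted in the catch block, so a job that must both log and fail needs a log path that survives rollback (for example a platform event configured to publish immediately, or the `BatchApexErrorEvent` subscriber). For Schedulable classes that launch batches, log a "schedule fired" entry before `executeBatch` so a missing entry signals a pre-enqueue failure.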
3. **Add `Database.RaisesPlatformEvents` to every business-critical batch class.** Audit each `Database.Batchable` class. For ones whose failure has user impact, add the marker interface and ship a `BatchApexErrorEventTrigger` that logs + notifies. See `references/examples.md` Example 2.
4. **Build the `AsyncApexJob` watcher schedule.** A separate Schedulable class that queries `AsyncApexJob` for `Status IN ('Failed','Aborted')` in the last hour, plus stuck `Queued`/`Holding` jobs older than your SLA, plus `Completed` jobs with `NumberOfErrors > 0`. For each finding, log + notify. See Example 3.
5. **Pick a notification channel and make it idempotent.** Email Alerts work for low volume but are easy to ignore. Custom Notifications (bell icon) are durable but have no Slack/external surface. For ops, a Platform Event subscribed by a middleware bridge to Slack/PagerDuty is the most reliable. Whichever you pick, deduplicate by `AsyncApexJobId` so a watcher running every 15 minutes does not page operators five times for the same failure.
6. **Document the failure runbook alongside the alert.** Each alert payload should include the `AsyncApexJob.Id`, class name, `ExtendedStatus`, and a link to a runbook describing how to re-run, what side effects to expect, and who to escalate to. Notifications without runbooks become noise within two weeks.
7. **Verify by injecting a controlled failure.** In a sandbox, deploy a batch that throws on a known input, schedule it, and confirm the alert fires through every channel (log, `BatchApexErrorEvent` subscriber, watcher) within the expected window. Fail open — assume the watcher itself can fail and have at least one external check.
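For the verification step, a sandbox-only pair of classes such as the following can exercise every alert path at once. Both class names are hypothetical, `System.debug` stands in for the org's real logger, and the classes are shown in one block only for brevity (each top-level class lives in its own file).

```apex
// File 1 (sandbox only): a batch that always fails in execute().
public with sharing class FailureInjectionBatch
    implements Database.Batchable<SObject>, Database.RaisesPlatformEvents {

    public class InjectedFailureException extends Exception {}

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // A single row yields a single chunk, so one chunk failure is expected.
        return Database.getQueryLocator([SELECT Id FROM User LIMIT 1]);
    }

    public void execute(Database.BatchableContext bc, List<SObject> scope) {
        // Deliberately uncaught so the platform records the error and,
        // because of RaisesPlatformEvents, publishes BatchApexErrorEvent.
        throw new InjectedFailureException('Injected failure for alert verification');
    }

    public void finish(Database.BatchableContext bc) {
        // Nothing to clean up.
    }
}

// File 2: launcher that leaves a breadcrumb before enqueueing the batch.
public with sharing class FailureInjectionLauncher implements Schedulable {

    public void execute(SchedulableContext ctx) {
        // Breadcrumb first: if this entry exists but no batch job appears,
        // the failure happened before or during the enqueue.
        System.debug(LoggingLevel.INFO, 'Schedule fired: ' + ctx.getTriggerId());
        try {
            Id jobId = Database.executeBatch(new FailureInjectionBatch(), 1);
            System.debug(LoggingLevel.INFO, 'Enqueued batch job ' + jobId);
        } catch (Exception e) {
            // Logged rather than re-thrown so the log entry is not rolled back.
            System.debug(LoggingLevel.ERROR,
                'Enqueue failed: ' + e.getTypeName() + ': ' + e.getMessage());
        }
    }
}
```

Registering the launcher with `System.schedule('Failure injection', '0 0 2 * * ?', new FailureInjectionLauncher());` sets up a 2 a.m. run. After it fires, the job should show up in the watcher's "completed with errors" query, and the `BatchApexErrorEvent` subscriber should receive one event with `Phase` set to `EXECUTE`.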

---

## Related Skills

- `apex/apex-exception-handling-and-logging` — general Apex exception patterns; the structured logger you call from `execute()` lives there
- `apex/batch-apex-design` — chunking, scope size, and stateful design for Batch Apex
- `admin/batch-job-scheduling-and-monitoring` — admin-facing monitoring via Setup → Apex Jobs and Scheduled Jobs UI
- `architect/org-limits-monitoring` — broader org-level limits monitoring (Flex Queue saturation, async limits) which also surfaces in this domain
- `apex/apex-custom-notifications-from-apex` — how to publish a Custom Notification from an Apex trigger or class, for the bell-icon alert path
