
Retries & ordering

Delivery contract

Property              Guarantee
Delivery              At-least-once.
Ordering              Strict FIFO per aggregate_id. Different aggregates may interleave.
Latency (p50 / p95)   200 ms / 2 s from originating commit, in steady state.
Retry window          24 hours.
Timeout per attempt   10 seconds.

Your handler must be idempotent on event.id. Plan for retries; they will happen.

When we retry

A delivery is retried when the response is:

  • A non-2xx status (3xx, 4xx, 5xx).
  • A connection failure (DNS, TCP refused, TLS handshake failure).
  • No response within 10 seconds (the per-attempt timeout).

A 2xx with any body counts as success — the body is logged but not inspected.
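The rule above reduces to a single predicate. Here is a minimal Python sketch. It assumes you model an attempt's outcome as the HTTP status code, or None when no response arrived (a connection failure or the 10-second timeout). The function name is hypothetical and is not part of any Paylera SDK.

```python
from typing import Optional


def should_retry(status: Optional[int]) -> bool:
    """Return True if an attempt with this outcome would be retried.

    status is the HTTP status code of the response, or None when no response
    arrived: a DNS, TCP, or TLS failure, or no response within 10 seconds.
    The response body is never consulted.
    """
    if status is None:
        return True
    # Only 2xx counts as success; 3xx, 4xx, and 5xx are all retried.
    return not 200 <= status <= 299
```

For example, should_retry(204) is False, while should_retry(301) is True. A redirect is retried just like an error, so register the endpoint's final URL.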

Backoff schedule

attempt   delay (from first attempt)
1         immediate
2         15 s
3         1 min
4         5 min
5         30 min
6         2 h
7         6 h
8         12 h
9         24 h (final)

If attempt 9 also fails, the delivery is marked failed and Paylera emits webhook_endpoint.delivery_failed to all other enabled endpoints (the failing endpoint cannot receive it).

Total retry budget: ~24 hours.
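To work out when a given delivery will next be attempted, you can turn the schedule into offsets. Here is a minimal Python sketch. It assumes each delay is measured from the first attempt, because that reading puts attempt 9 at the 24-hour mark. If you instead treat the delays as gaps between consecutive failures, they add up to roughly 44.6 hours, which falls outside the 24-hour window. ATTEMPT_OFFSETS, attempt_times, and next_attempt_at are hypothetical names. The sketch also ignores the time each attempt itself takes (up to 10 s).

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Offset of each attempt from the first attempt (index 0 = attempt 1).
# Assumption: the table's delays are offsets from the first attempt.
ATTEMPT_OFFSETS: List[timedelta] = [
    timedelta(0),
    timedelta(seconds=15),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=12),
    timedelta(hours=24),
]
MAX_ATTEMPTS = len(ATTEMPT_OFFSETS)  # 9


def attempt_times(first_attempt_at: datetime) -> List[datetime]:
    """All scheduled attempt times for one delivery."""
    return [first_attempt_at + offset for offset in ATTEMPT_OFFSETS]


def next_attempt_at(first_attempt_at: datetime, failed_attempt: int) -> Optional[datetime]:
    """When the attempt after `failed_attempt` (1-based) runs, or None if it was the last."""
    if not 1 <= failed_attempt <= MAX_ATTEMPTS:
        raise ValueError(f"attempt must be between 1 and {MAX_ATTEMPTS}")
    if failed_attempt == MAX_ATTEMPTS:
        return None  # delivery is marked failed; webhook_endpoint.delivery_failed fires
    return first_attempt_at + ATTEMPT_OFFSETS[failed_attempt]


# Example: a delivery first attempted at 08:00 UTC whose 2nd attempt just failed.
first = datetime(2026, 5, 6, 8, 0, tzinfo=timezone.utc)
retry_at = next_attempt_at(first, 2)  # attempt 3, 1 min after the first attempt
```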

Per-aggregate ordering

Events for the same (aggregate_type, aggregate_id) are delivered in strict commit order. If an attempt for event N fails, event N+1 for the same aggregate is held until N succeeds or exhausts retries.

Example: if invoice.finalized for inv_123 is failing your handler, the subsequent invoice.paid for the same invoice waits. This is what makes processing safe: you’ll never see “paid” before “finalized.”

Different aggregates are independent. inv_123 getting stuck does not delay deliveries for inv_456 or sub_….
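You can model the holding rule as one FIFO queue per (aggregate_type, aggregate_id), where only the head event is ever attempted. The following is a toy Python sketch of the behavior described above, not Paylera's implementation. The class and method names are hypothetical. Backoff delays are ignored, so each call to deliver_round stands for one retry slot. MAX_ATTEMPTS = 9 comes from the backoff schedule.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, DefaultDict, Deque, Dict, List, Tuple

MAX_ATTEMPTS = 9  # from the backoff schedule

AggregateKey = Tuple[str, str]  # (aggregate_type, aggregate_id)


@dataclass
class Event:
    id: str
    type: str
    aggregate_type: str
    aggregate_id: str


class PerAggregateDispatcher:
    """Toy model of per-aggregate FIFO delivery (backoff timing is ignored)."""

    def __init__(self) -> None:
        self._queues: DefaultDict[AggregateKey, Deque[Event]] = defaultdict(deque)
        self._attempts: Dict[str, int] = defaultdict(int)
        self.failed: List[str] = []

    def enqueue(self, event: Event) -> None:
        # The model assumes events are enqueued in commit order.
        self._queues[(event.aggregate_type, event.aggregate_id)].append(event)

    def deliver_round(self, send: Callable[[Event], bool]) -> List[str]:
        """Attempt the head event of every aggregate once; return the ids delivered."""
        delivered: List[str] = []
        for queue in self._queues.values():
            if not queue:
                continue
            head = queue[0]  # later events for this aggregate wait behind it
            if send(head):
                queue.popleft()
                delivered.append(head.id)
                continue
            self._attempts[head.id] += 1
            if self._attempts[head.id] >= MAX_ATTEMPTS:
                queue.popleft()  # retries exhausted: marked failed, aggregate unblocked
                self.failed.append(head.id)
        return delivered
```

If invoice.finalized for inv_123 keeps failing, a round still delivers inv_456's events. inv_123's invoice.paid moves only after its predecessor succeeds or exhausts its nine attempts.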

Disabled endpoints

If an endpoint returns 4xx for 100 consecutive deliveries (regardless of event type), Paylera disables it automatically:

  • The endpoint moves to status: disabled.
  • A webhook_endpoint.disabled event fires (to other endpoints).
  • Pending deliveries for the disabled endpoint are dropped (not held forever).
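Here is a minimal Python sketch of the threshold as a per-endpoint counter. Two readings in it are assumptions. First, each automatic attempt counts as one delivery. Second, any non-4xx outcome (a 2xx, a 5xx, or a timeout) resets the streak. Manual retries are skipped, because the inspecting and replaying section states that they don't count. DisableTracker and its fields are hypothetical names.

```python
from typing import Optional

DISABLE_AFTER = 100  # consecutive 4xx responses


class DisableTracker:
    """Toy model of the auto-disable rule for one endpoint."""

    def __init__(self) -> None:
        self.consecutive_4xx = 0
        self.status = "enabled"

    def record(self, http_status: Optional[int], manual_retry: bool = False) -> None:
        """Record one delivery attempt.

        http_status is None when no response arrived (connection failure or timeout).
        """
        if self.status == "disabled" or manual_retry:
            return  # manual retries don't count toward the threshold
        if http_status is not None and 400 <= http_status <= 499:
            self.consecutive_4xx += 1
            if self.consecutive_4xx >= DISABLE_AFTER:
                self.status = "disabled"
        else:
            # Assumption: any non-4xx outcome (2xx, 5xx, timeout) breaks the streak.
            self.consecutive_4xx = 0
```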

Re-enable explicitly:

PATCH /v1/admin/webhook-endpoints/{id}
{ "status": "enabled" }

Events that were pending when the endpoint was disabled are not replayed when you re-enable it. Use the deliveries API to inspect what was missed and manually replay the relevant ones.

Inspecting & replaying

GET /v1/admin/webhook-endpoints/{id}/deliveries?status=failed&limit=50
POST /v1/admin/webhook-endpoints/{id}/deliveries/{delivery_id}/retry

Manual retries don’t count against the disable-after-100 threshold.
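A typical recovery after an auto-disable has three steps: re-enable the endpoint, list the failed deliveries, and retry the ones that matter. Here is a minimal Python sketch that builds (but does not send) those three requests with the standard library, using the paths documented above. BASE_URL and the function names are hypothetical. Authentication is omitted, so add whatever credentials your account requires before sending each request with urllib.request.urlopen.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical host; substitute your Paylera API base URL.
BASE_URL = "https://api.example.com"


def _endpoint_path(endpoint_id: str) -> str:
    return f"/v1/admin/webhook-endpoints/{urllib.parse.quote(endpoint_id, safe='')}"


def reenable_endpoint(endpoint_id: str) -> urllib.request.Request:
    """PATCH the endpoint back to status "enabled"."""
    body = json.dumps({"status": "enabled"}).encode()
    return urllib.request.Request(
        BASE_URL + _endpoint_path(endpoint_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def list_failed_deliveries(endpoint_id: str, limit: int = 50) -> urllib.request.Request:
    """GET the endpoint's failed deliveries."""
    query = urllib.parse.urlencode({"status": "failed", "limit": limit})
    return urllib.request.Request(
        f"{BASE_URL}{_endpoint_path(endpoint_id)}/deliveries?{query}", method="GET"
    )


def retry_delivery(endpoint_id: str, delivery_id: str) -> urllib.request.Request:
    """POST a manual retry for one delivery (not counted toward the disable threshold)."""
    delivery = urllib.parse.quote(delivery_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}{_endpoint_path(endpoint_id)}/deliveries/{delivery}/retry",
        method="POST",
    )
```

The response format of the deliveries list isn't described on this page, so inspect a real response before scripting retries over it.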

Bulk replay

For larger-scale remediation (for example, you fixed a bug and want to replay everything from the last 6 hours), schedule a bulk replay:

POST /v1/admin/webhook-endpoints/{id}/replay
{
  "from": "2026-05-06T08:00:00Z",
  "to": "2026-05-06T14:00:00Z",
  "event_types": ["invoice.paid"]
}

Bulk replays are scheduled as background jobs and respect the same ordering guarantees.
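For the "last 6 hours" case, you can compute the from/to bounds instead of typing them. Here is a minimal Python sketch that builds (but does not send) the replay request shown above. BASE_URL and bulk_replay_request are hypothetical names, and authentication is omitted. The sketch always sends event_types, because that is the only body shape shown here.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import List, Optional

BASE_URL = "https://api.example.com"  # hypothetical host; authentication omitted


def _iso_utc(moment: datetime) -> str:
    """Format like the example body: 2026-05-06T08:00:00Z (pass aware datetimes)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def bulk_replay_request(
    endpoint_id: str,
    event_types: List[str],
    hours: int = 6,
    now: Optional[datetime] = None,
) -> urllib.request.Request:
    """POST a replay of the last `hours` hours of the given event types."""
    end = now if now is not None else datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    body = {"from": _iso_utc(start), "to": _iso_utc(end), "event_types": event_types}
    path = f"/v1/admin/webhook-endpoints/{urllib.parse.quote(endpoint_id, safe='')}/replay"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With now set to 2026-05-06T14:00:00Z, the request body matches the example above exactly.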

Handling duplicates correctly

The standard pattern:

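-- one-time setup: ON CONFLICT (event_id) requires a unique constraint on event_id
CREATE TABLE processed_webhook_events (event_id text PRIMARY KEY);

-- once per delivery, in the same transaction as the event's side effects
INSERT INTO processed_webhook_events (event_id) VALUES ($1)
ON CONFLICT (event_id) DO NOTHING
RETURNING true;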

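Run the insert in the same transaction as the event's side effects. If it returned no row (conflict), you’ve seen this event before: return 2xx without doing the work. The conflict is what makes your handler idempotent. The shared transaction is what makes the marker atomic with the work: if the work fails, the rollback removes the marker and the retry processes the event again.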

Common failures and fixes

  • Symptom: many 408s in the deliveries log.
    Likely cause: handler taking more than 10 s.
    Fix: move work off the request thread and ack quickly.
  • Symptom: 5xx bursts, then a recovery.
    Likely cause: a deploy or a downstream outage.
    Fix: inspect the deliveries that retried and confirm the events were processed correctly after recovery.
  • Symptom: the same event is delivered repeatedly to a healthy handler.
    Likely cause: you’re returning a non-2xx status (for example, a 3xx redirect).
    Fix: return a clean 2xx (200, or 204 with an empty body).
  • Symptom: one stuck aggregate’s events pile up.
    Likely cause: a bug in the handler for that aggregate type.
    Fix: fix the bug and ack the stuck event on its next retry (or trigger a manual retry); the held events then drain in seconds.
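The first fix, acking quickly, is sketched below in Python using only the standard library. The request handler enqueues the event and returns 200 at once, and a background thread does the work. AckQuicklyReceiver is a hypothetical name, and the in-memory queue.Queue is for illustration only.

```python
import logging
import queue
import threading
from typing import Callable


class AckQuicklyReceiver:
    """Request path only enqueues; one background worker does the slow work."""

    def __init__(self, process: Callable[[dict], None]) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue()
        self._process = process
        # A single worker handles events in arrival order, so per-aggregate
        # ordering from Paylera is preserved on your side as well.
        threading.Thread(target=self._run, daemon=True).start()

    def handle_request(self, event: dict) -> int:
        """Called by your web framework; returns the HTTP status immediately."""
        self._queue.put(event)
        return 200

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._process(event)
            except Exception:
                # An acked event is never retried, so failures must be handled here.
                logging.exception("failed to process event %s", event.get("id"))
            finally:
                self._queue.task_done()

    def wait_until_idle(self) -> None:
        self._queue.join()
```

Once you return 2xx, the delivery counts as successful and is not retried. In production, the queue should therefore be durable, for example rows written to your database in the same transaction as the processed_webhook_events marker. Otherwise, an event lost in a crash has to be found and replayed by hand.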

SLA

Webhook ingress availability: 99.95%, measured over a 28-day rolling window. Delivery latency p95: 2 s in steady state. Burn-rate alerts and the public status page reflect both. The full SLO contract is in Trust at Paylera.
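To make the availability target concrete: 99.95% over 28 days leaves an error budget of about 20 minutes (0.0005 × 40,320 minutes). Below is a minimal Python sketch of that arithmetic and of the usual burn-rate ratio, which divides the observed error ratio by the allowed error ratio. Paylera's actual alert thresholds aren't given here, and both function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of unavailability the SLO allows over the rolling window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast errors consume the budget; 1.0 spends it exactly over the window."""
    return observed_error_ratio / (1.0 - slo)
```

For example, error_budget_minutes(0.9995) is about 20.16. A sustained 1% error ratio gives burn_rate(0.01, 0.9995) of about 20, which would exhaust the 28-day budget in roughly 1.4 days.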