Retries & ordering

Delivery contract

Property	Guarantee
Delivery	At-least-once.
Ordering	Strict FIFO per `aggregate_id`. Different aggregates may interleave.
Latency (p50 / p95)	200 ms / 2 s from originating commit, in steady state.
Retry window	24 hours.
Timeout per attempt	10 seconds.

Your handler must be idempotent on event.id. Plan for retries; they will happen.

When we retry

A delivery is retried when an attempt ends in any of the following:

A non-2xx status (3xx, 4xx, 5xx).
A connection failure (DNS, TCP refused, TLS handshake failure).
A timeout (no complete response within 10 seconds).

A 2xx with any body counts as success — the body is logged but not inspected.
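The rules above can be restated as a small predicate, which is handy when writing tests for your own handler. This is a sketch of the documented rules, not Paylera's implementation; the function name and the convention of passing `None` for "no HTTP response" are assumptions made for illustration.

```python
from typing import Optional

TIMEOUT_SECONDS = 10.0  # per-attempt timeout from the delivery contract


def is_retried(status: Optional[int], elapsed_seconds: float) -> bool:
    """Return True if an attempt with this outcome would be retried.

    status is None when no HTTP response arrived at all
    (DNS failure, TCP refused, TLS handshake failure).
    """
    if elapsed_seconds > TIMEOUT_SECONDS:
        return True  # the attempt timed out
    if status is None:
        return True  # connection failure
    # Any non-2xx is retried, including 3xx redirects.
    return not (200 <= status <= 299)
```

Note that a redirect counts as a failure here, so the registered URL should answer with a 2xx itself.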

Backoff schedule

attempt   delay (from previous failure)
1         immediate
2         15 s
3         1 min
4         5 min
5         30 min
6         2 h
7         6 h
8         12 h
9         24 h  (final)

If attempt 9 also fails, the delivery is marked failed and Paylera emits webhook_endpoint.delivery_failed. That event is sent to all other enabled endpoints, since the failing one can’t receive it.

Total retry budget: ~24 hours.

Per-aggregate ordering

Events for the same (aggregate_type, aggregate_id) are delivered in strict commit order. If an attempt for event N fails, event N+1 for the same aggregate is held until N succeeds or exhausts retries.

Example: if invoice.finalized for inv_123 is failing your handler, the subsequent invoice.paid for the same invoice waits. This is what makes processing safe: you’ll never see “paid” before “finalized.”

Different aggregates are independent. inv_123 getting stuck does not delay deliveries for inv_456 or sub_….
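To make the blocking behavior concrete, here is a toy model of one delivery pass. It only illustrates the ordering rule described above; the `Event` tuple shape and the `deliver_round` function are hypothetical and do not reflect how Paylera schedules deliveries internally.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

# (aggregate_type, aggregate_id, event_type): a hypothetical shape for illustration
Event = Tuple[str, str, str]


def deliver_round(
    events: List[Event], handler_ok: Callable[[Event], bool]
) -> Tuple[List[Event], List[Event]]:
    """Run one delivery pass over events given in commit order.

    Returns (delivered, held).
    """
    # One FIFO queue per (aggregate_type, aggregate_id).
    queues: Dict[Tuple[str, str], Deque[Event]] = defaultdict(deque)
    for ev in events:
        queues[(ev[0], ev[1])].append(ev)

    delivered: List[Event] = []
    held: List[Event] = []
    for q in queues.values():
        # Deliver from the head until the handler rejects an event.
        while q and handler_ok(q[0]):
            delivered.append(q.popleft())
        # The failing head and everything behind it wait.
        held.extend(q)
    return delivered, held


events = [
    ("invoice", "inv_123", "invoice.finalized"),
    ("invoice", "inv_456", "invoice.finalized"),
    ("invoice", "inv_123", "invoice.paid"),
    ("invoice", "inv_456", "invoice.paid"),
]
# Simulate a handler that fails only on inv_123's finalized event.
failing = ("invoice", "inv_123", "invoice.finalized")
delivered, held = deliver_round(events, lambda ev: ev != failing)
```

In this run both inv_456 events are delivered, while both inv_123 events are held, with invoice.finalized still ahead of invoice.paid.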

Disabled endpoints

If an endpoint returns 4xx for 100 consecutive deliveries (regardless of event type), Paylera disables it automatically:

The endpoint moves to status: disabled.
A webhook_endpoint.disabled event fires (to other endpoints).
Pending deliveries for the disabled endpoint are dropped (not held forever).
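If you want to alert before an endpoint gets disabled, you can track the 4xx streak on your side. The sketch below models the rule under two labeled assumptions: that any non-4xx outcome resets the streak (the text only says "consecutive"), and that manual retries are ignored, as stated under Inspecting & replaying. The class and method names are hypothetical.

```python
DISABLE_THRESHOLD = 100  # consecutive 4xx deliveries before auto-disable


class EndpointHealth:
    """Tracks a consecutive-4xx streak for one endpoint (illustrative model)."""

    def __init__(self) -> None:
        self.consecutive_4xx = 0
        self.status = "enabled"

    def record(self, status_code: int, manual_retry: bool = False) -> str:
        """Record one delivery outcome and return the resulting status."""
        if manual_retry or self.status == "disabled":
            return self.status  # manual retries do not count toward the threshold
        if 400 <= status_code <= 499:
            self.consecutive_4xx += 1
            if self.consecutive_4xx >= DISABLE_THRESHOLD:
                self.status = "disabled"
        else:
            # Assumption: any non-4xx outcome breaks the streak.
            self.consecutive_4xx = 0
        return self.status
```

A monitoring job could raise an alert once `consecutive_4xx` passes some lower bound of your choosing, well before the endpoint is disabled.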

Re-enable explicitly:

PATCH /v1/admin/webhook-endpoints/{id}
{ "status": "enabled" }

Deliveries dropped while the endpoint was disabled are not replayed automatically when you re-enable it. Use the deliveries API to inspect what was missed and manually replay the relevant ones.

Inspecting & replaying

GET /v1/admin/webhook-endpoints/{id}/deliveries?status=failed&limit=50

POST /v1/admin/webhook-endpoints/{id}/deliveries/{delivery_id}/retry

Manual retries don’t count against the disable-after-100 threshold.
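As a rough sketch, the two calls above could be built with Python's standard library as follows. The host name and the bearer-token header are assumptions (this page does not specify the base URL or the auth scheme); only the paths and query parameters come from the examples above. The functions build the requests without sending them.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.paylera.example"  # hypothetical host, replace with the real one


def _auth(token: str) -> dict:
    # Assumption: bearer-token auth; check your account's API settings.
    return {"Authorization": f"Bearer {token}"}


def list_failed_request(endpoint_id: str, token: str, limit: int = 50) -> urllib.request.Request:
    """Build the GET request that lists failed deliveries for an endpoint."""
    query = urllib.parse.urlencode({"status": "failed", "limit": limit})
    eid = urllib.parse.quote(endpoint_id, safe="")
    url = f"{BASE_URL}/v1/admin/webhook-endpoints/{eid}/deliveries?{query}"
    return urllib.request.Request(url, method="GET", headers=_auth(token))


def retry_request(endpoint_id: str, delivery_id: str, token: str) -> urllib.request.Request:
    """Build the POST request that retries a single delivery."""
    eid = urllib.parse.quote(endpoint_id, safe="")
    did = urllib.parse.quote(delivery_id, safe="")
    url = f"{BASE_URL}/v1/admin/webhook-endpoints/{eid}/deliveries/{did}/retry"
    return urllib.request.Request(url, method="POST", headers=_auth(token))
```

Pass the returned object to `urllib.request.urlopen` to actually send it.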

Bulk replay

For larger remediation (you fixed a bug; you want to replay everything from the last 6 hours):

POST /v1/admin/webhook-endpoints/{id}/replay
{
  "from": "2026-05-06T08:00:00Z",
  "to":   "2026-05-06T14:00:00Z",
  "event_types": ["invoice.paid"]
}

Bulk replays are scheduled as background jobs and respect the same ordering guarantees.
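For the "last 6 hours" case, the request body can be computed rather than typed by hand. A minimal sketch, assuming the timestamps are UTC in the format shown in the example above; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches the example body above


def replay_body(now: datetime, hours: int, event_types: List[str]) -> Dict[str, object]:
    """Build a bulk-replay body covering the `hours` before `now`.

    `now` must be timezone-aware; it is converted to UTC before formatting.
    """
    end = now.astimezone(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "from": start.strftime(TIMESTAMP_FORMAT),
        "to": end.strftime(TIMESTAMP_FORMAT),
        "event_types": list(event_types),
    }


body = replay_body(
    datetime(2026, 5, 6, 14, 0, tzinfo=timezone.utc), 6, ["invoice.paid"]
)
```

Serialize the result with `json.dumps(body)` and send it as the POST payload.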

Handling duplicates correctly

The standard pattern:

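-- Run once per incoming event.
-- Requires a UNIQUE or PRIMARY KEY constraint on event_id.
INSERT INTO processed_webhook_events (event_id) VALUES ($1)
ON CONFLICT (event_id) DO NOTHING
RETURNING true;  -- one row on first delivery, no row on a duplicate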

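If the insert returns no row, it hit the conflict: you’ve already seen this event, so return 2xx without redoing the work. The unique constraint on event_id is what makes the handler idempotent. Run the insert in the same transaction as the work, so that a failure rolls back both and the retried event is processed as if it were new.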
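Here is a minimal runnable sketch of that pattern using Python's built-in sqlite3 module. It swaps in SQLite for whatever database you use and checks `rowcount` instead of `RETURNING`, so it assumes SQLite 3.24 or later for the `ON CONFLICT` clause. The `handle_event` and `apply_work` names are hypothetical.

```python
import sqlite3
from typing import Callable


def handle_event(
    conn: sqlite3.Connection,
    event_id: str,
    apply_work: Callable[[sqlite3.Connection], None],
) -> bool:
    """Process an event at most once.

    Returns True if the work ran, False if the event was a duplicate.
    """
    # `with conn` wraps the block in one transaction:
    # commit on success, rollback if anything raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO processed_webhook_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed: ack with 2xx and skip the work
        apply_work(conn)  # must use the same connection to share the transaction
        return True
```

If `apply_work` raises, the dedupe row is rolled back along with the partial work, so the next retry of the same event is treated as new.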

Common failures and fixes

Symptom	Likely cause	Fix
Many 408s in the deliveries log	Handler taking >10 s	Move work off the request thread; ack quickly.
5xx bursts then a recovery	Deploy or downstream outage	Inspect the deliveries that were retried and confirm the events were processed correctly after recovery.
Same event delivered repeatedly to a healthy handler	You’re returning a non-2xx (e.g. a 3xx redirect in front of the handler)	Return a plain 2xx (200 or 204) directly from the registered URL.
One stuck aggregate’s events pile up	Bug in handler for that aggregate type	Fix the bug and ack the stuck event; the queue then drains in seconds.
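For the first row, "ack quickly" usually means handing the event to a background worker and returning immediately. The sketch below uses an in-memory queue purely for illustration; events acknowledged but not yet processed are lost if the process crashes, so a durable queue or table is the safer choice in production. All names here are hypothetical.

```python
import queue
import threading
from typing import Callable, Optional


def start_worker(
    jobs: "queue.Queue[Optional[dict]]", process: Callable[[dict], None]
) -> threading.Thread:
    """Start a background thread that processes queued events one by one."""

    def run() -> None:
        while True:
            event = jobs.get()
            if event is None:  # sentinel: shut the worker down
                return
            try:
                process(event)
            except Exception:
                # A real worker would log the error and requeue or dead-letter.
                pass

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def receive(jobs: "queue.Queue[Optional[dict]]", event: dict) -> int:
    """Request-thread handler: enqueue the event and ack right away."""
    jobs.put(event)
    return 200  # responds well inside the 10-second attempt timeout
```

Because the ack now precedes the work, the worker should still deduplicate on event.id before applying any side effects.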

SLA

Webhook ingress availability: 99.95%, measured over a rolling 28-day window. Delivery latency p99: 2 s in steady state. Burn-rate alerts and the public status page reflect both. The full SLO contract is in Trust at Paylera.