Retries & ordering
Delivery contract
| Property | Guarantee |
|---|---|
| Delivery | At-least-once. |
| Ordering | Strict FIFO per aggregate_id. Different aggregates may interleave. |
| Latency (p50 / p95) | 200 ms / 2 s from originating commit, in steady state. |
| Retry window | 24 hours. |
| Timeout per attempt | 10 seconds. |
Your handler must be idempotent on event.id. Plan for retries; they
will happen.
When we retry
A delivery is retried when the response is:
- A non-2xx status (3xx, 4xx, 5xx).
- A connection failure (DNS, TCP refused, TLS handshake failure).
- No response within the 10-second per-attempt timeout.
A 2xx with any body counts as success — the body is logged but not inspected.
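
If you want to model these rules in your own handler tests, the sketch below is one way to express them. It is illustrative only: the function name and the convention of passing `None` for "no HTTP response" are assumptions, not part of the Paylera API.

```python
from typing import Optional

TIMEOUT_SECONDS = 10  # per-attempt timeout from the delivery contract


def should_retry(status_code: Optional[int], elapsed_seconds: float) -> bool:
    """Return True if an attempt with this outcome would be retried.

    status_code is None when no HTTP response arrived at all
    (DNS failure, TCP refused, TLS handshake failure).
    """
    if status_code is None:
        return True  # connection failure
    if elapsed_seconds > TIMEOUT_SECONDS:
        return True  # the response came too late to count
    # Only 2xx is success; 3xx, 4xx, and 5xx are all retried.
    return not (200 <= status_code <= 299)
```

Note that a 3xx is treated as a failure here, so a handler sitting behind a redirect would see every delivery retried.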
Backoff schedule
| Attempt | Delay (from previous failure) |
|---|---|
| 1 | immediate |
| 2 | 15 s |
| 3 | 1 min |
| 4 | 5 min |
| 5 | 30 min |
| 6 | 2 h |
| 7 | 6 h |
| 8 | 12 h |
| 9 | 24 h (final) |

After attempt 9, the delivery is marked failed and emits
webhook_endpoint.delivery_failed (sent to all other enabled
endpoints, since the failing one obviously can’t receive it).
Total retry budget: ~24 hours.
Per-aggregate ordering
Events for the same (aggregate_type, aggregate_id) are delivered in
strict commit order. If an attempt for event N fails, event N+1 for
the same aggregate is held until N succeeds or exhausts retries.
Example: if invoice.finalized for inv_123 is failing your handler,
the subsequent invoice.paid for the same invoice waits. This is what
makes processing safe: you’ll never see “paid” before “finalized.”
Different aggregates are independent. inv_123 getting stuck does
not delay deliveries for inv_456 or sub_….
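
The head-of-line behavior described above can be pictured as one FIFO queue per aggregate, where only the head of each queue is eligible for delivery. The toy model below is a sketch of that idea, not Paylera's implementation; the class name and the `"invoice"` aggregate type are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

AggregateKey = Tuple[str, str]  # (aggregate_type, aggregate_id)


class PerAggregateQueues:
    """Toy model: one FIFO queue per aggregate; only the head is deliverable."""

    def __init__(self) -> None:
        self._queues: Dict[AggregateKey, Deque[str]] = {}

    def enqueue(self, key: AggregateKey, event_type: str) -> None:
        self._queues.setdefault(key, deque()).append(event_type)

    def deliverable(self) -> List[Tuple[AggregateKey, str]]:
        """Head event of every non-empty queue; later events stay held."""
        return [(key, q[0]) for key, q in self._queues.items() if q]

    def ack(self, key: AggregateKey) -> None:
        """The head succeeded (or exhausted retries): release the next event."""
        self._queues[key].popleft()


queues = PerAggregateQueues()
queues.enqueue(("invoice", "inv_123"), "invoice.finalized")
queues.enqueue(("invoice", "inv_123"), "invoice.paid")
queues.enqueue(("invoice", "inv_456"), "invoice.finalized")
```

While invoice.finalized for inv_123 is unacknowledged, `deliverable()` keeps offering it and never offers invoice.paid for that invoice, yet inv_456 remains deliverable throughout.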
Disabled endpoints
If an endpoint returns 4xx for 100 consecutive deliveries (regardless of event type), Paylera disables it automatically:
- The endpoint moves to status: disabled.
- A webhook_endpoint.disabled event fires (to other endpoints).
- Pending deliveries for the disabled endpoint are dropped (not held forever).
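
To reason about when an endpoint is at risk, it can help to model the threshold as a streak counter. The sketch below is a guess at the bookkeeping, not a description of Paylera internals. It assumes that any non-4xx outcome resets the streak, and it ignores manual retries entirely, in line with the note under Inspecting & replaying that they don't count against the threshold.

```python
DISABLE_THRESHOLD = 100  # consecutive 4xx deliveries before auto-disable


class EndpointHealth:
    """Hypothetical streak counter for the disable-after-100 rule."""

    def __init__(self) -> None:
        self.consecutive_4xx = 0
        self.status = "enabled"

    def record(self, status_code: int, manual_retry: bool = False) -> None:
        if manual_retry or self.status == "disabled":
            return  # manual retries are not counted; disabled endpoints get no deliveries
        if 400 <= status_code <= 499:
            self.consecutive_4xx += 1
            if self.consecutive_4xx >= DISABLE_THRESHOLD:
                self.status = "disabled"
        else:
            # Assumption: a 2xx, 3xx, or 5xx breaks the run of consecutive 4xx.
            self.consecutive_4xx = 0
```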
Re-enable explicitly:

    PATCH /v1/admin/webhook-endpoints/{id}
    { "status": "enabled" }

Pending events from the time the endpoint was disabled are not replayed. Use the deliveries API to inspect what was missed and manually replay relevant ones.
Inspecting & replaying

    GET /v1/admin/webhook-endpoints/{id}/deliveries?status=failed&limit=50
    POST /v1/admin/webhook-endpoints/{id}/deliveries/{delivery_id}/retry

Manual retries don’t count against the disable-after-100 threshold.
Bulk replay
For larger remediation (you fixed a bug; you want to replay everything from the last 6 hours):

    POST /v1/admin/webhook-endpoints/{id}/replay
    {
      "from": "2026-05-06T08:00:00Z",
      "to": "2026-05-06T14:00:00Z",
      "event_types": ["invoice.paid"]
    }

Bulk replays are scheduled as background jobs and respect the same ordering guarantees.
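
As a rough sketch, the replay request above could be built with Python's standard library as follows. The base URL and endpoint id are placeholders, and authentication headers are left out because this page does not describe them. The example only constructs the request; it does not send it.

```python
import json
import urllib.request
from typing import List


def build_replay_request(base_url: str, endpoint_id: str, start: str, end: str,
                         event_types: List[str]) -> urllib.request.Request:
    """Build (but do not send) a bulk replay request for one endpoint."""
    body = {"from": start, "to": end, "event_types": event_types}
    return urllib.request.Request(
        url=f"{base_url}/v1/admin/webhook-endpoints/{endpoint_id}/replay",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_replay_request(
    "https://api.example.com",  # hypothetical base URL
    "we_123",                   # hypothetical endpoint id
    "2026-05-06T08:00:00Z",
    "2026-05-06T14:00:00Z",
    ["invoice.paid"],
)
```

Passing `req` to `urllib.request.urlopen` would send it, once you have added whatever credentials your account requires.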
Handling duplicates correctly
The standard pattern:

    -- once per processed event
    INSERT INTO processed_webhook_events (event_id)
    VALUES ($1)
    ON CONFLICT (event_id) DO NOTHING
    RETURNING true;

If the insert returns no row (conflict), you’ve seen this event before: return 2xx without doing the work. The conflict is what makes your handler idempotent. Running the insert in the same transaction as the work is what makes it atomic with the work: either both commit, or both roll back and the retry starts clean.
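
Here is a minimal, runnable sketch of that pattern. It uses Python's built-in `sqlite3` as a stand-in for your production database, so the placeholder is `?` rather than `$1` and the duplicate check uses `rowcount` instead of `RETURNING`. The `handle_once` and `mark_paid` names and the `invoices_paid` table are made up for the example.

```python
import sqlite3
from typing import Callable


def handle_once(conn: sqlite3.Connection, event_id: str,
                work: Callable[[sqlite3.Connection], None]) -> bool:
    """Run work(conn) at most once per event_id. Returns True if work ran."""
    with conn:  # one transaction: commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO processed_webhook_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: respond 2xx, skip the work
        work(conn)  # runs in the same transaction as the insert
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_webhook_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE invoices_paid (invoice_id TEXT)")
conn.commit()


def mark_paid(c: sqlite3.Connection) -> None:
    c.execute("INSERT INTO invoices_paid VALUES ('inv_123')")
```

If `work` raises, the marker row is rolled back along with any partial work, so the next retry of the same event is processed from scratch rather than skipped.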
Common failures and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Many 408s in deliveries log | Handler taking >10 s | Move work off the request thread; ack quickly. |
| 5xx bursts then a recovery | Deploy or downstream outage | Inspect the deliveries that retried — confirm the events processed correctly after recovery. |
| Same event delivered repeatedly to a healthy handler | You’re returning a non-2xx after doing the work (e.g. a 3xx redirect) | Return a 2xx directly (200 or 204); any body is accepted. |
| One stuck aggregate’s events pile up | Bug in handler for that aggregate type | Fix the bug and ack the stuck event; the queue then drains in seconds. |
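
For the first row in the table ("move work off the request thread; ack quickly"), one possible shape is to enqueue the event and return 2xx immediately, with a background worker doing the slow part. The sketch below uses an in-memory queue purely for illustration. All names are hypothetical, and a real deployment would likely want a durable queue so that acknowledged events survive a crash.

```python
import queue
import threading
from typing import Dict, List

work_queue: "queue.Queue[dict]" = queue.Queue()
processed: List[str] = []


def handle_webhook(event: Dict) -> int:
    """Request-thread handler: enqueue the event and acknowledge right away."""
    work_queue.put(event)
    return 200  # well inside the 10-second per-attempt timeout


def worker() -> None:
    """Background thread: performs the slow work outside the request path."""
    while True:
        event = work_queue.get()
        try:
            processed.append(event["id"])  # placeholder for the real work
        finally:
            work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
```

Because deliveries are at-least-once, the worker should still deduplicate on event.id before doing the real work.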
SLA
Webhook ingress availability: 99.95%, measured over a 28-day rolling window. Delivery latency p99: 2 s in steady state. Burn-rate alerts and the public status page reflect both. The full SLO contract is in Trust at Paylera.