# SOP — Incident Response

**Tag**: Cross-Cutting — All Engagements
**Owner**: Alfred (Orchestrator) coordinates · Devin (Delivery Ops) resolves · Rick (Revenue) communicates

---

## Purpose

When something breaks — an API goes down, emails bounce, a CRM sync fails, or a client reports an issue — we respond fast, fix it, and communicate clearly. This SOP defines severity levels, response times, resolution procedures, and communication templates.

---

## Severity Levels

### SEV-1: Critical — Service Down
**Definition**: Client's core automation is completely non-functional. No leads flowing, no outreach sending, no data syncing.
**Examples**: OpenRouter API down, outreach tool suspended, CRM integration broken, client's entire pipeline offline.
**Response time**: 1 hour (acknowledge) / 4 hours (resolve or provide workaround)
**Communication**: Immediate notification to client + Clayton escalation via Telegram

### SEV-2: High — Degraded Service
**Definition**: Automation is partially working but a key component is broken or producing errors.
**Examples**: Email bounce rate >10%, enrichment API returning errors, CRM sync dropping records, one channel of multi-channel outreach down.
**Response time**: 4 hours (acknowledge) / 24 hours (resolve)
**Communication**: Same-day notification to client

### SEV-3: Medium — Minor Issue
**Definition**: Non-critical component is broken or performing below expectations. Core service is unaffected.
**Examples**: Reporting dashboard not updating, one email in a sequence has formatting issues, duplicate records appearing in CRM.
**Response time**: 24 hours (acknowledge) / 72 hours (resolve)
**Communication**: Include in next scheduled update (or sooner if client reports it)

### SEV-4: Low — Cosmetic / Enhancement
**Definition**: Minor visual or UX issues. No functional impact.
**Examples**: Typo in email template, dashboard layout misaligned, non-critical field missing from CRM view.
**Response time**: 72 hours (acknowledge) / 1 week (resolve)
**Communication**: Acknowledge and schedule fix. No urgent notification needed.
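
The four levels above can be encoded as a lookup table so that acknowledgment and resolution deadlines are computed rather than remembered. The Python sketch below is illustrative only: `Sla`, `SLAS`, and `deadlines` are hypothetical names, and it assumes the SLA clock starts at detection time, which this SOP does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sla:
    acknowledge: timedelta  # time allowed to acknowledge the incident
    resolve: timedelta      # time allowed to resolve (SEV-1: resolve or provide a workaround)


# Values copied from the severity definitions above.
SLAS = {
    "SEV-1": Sla(timedelta(hours=1), timedelta(hours=4)),
    "SEV-2": Sla(timedelta(hours=4), timedelta(hours=24)),
    "SEV-3": Sla(timedelta(hours=24), timedelta(hours=72)),
    "SEV-4": Sla(timedelta(hours=72), timedelta(weeks=1)),
}


def deadlines(severity: str, detected_at: datetime) -> tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by), assuming the clock starts at detection."""
    sla = SLAS[severity]
    return detected_at + sla.acknowledge, detected_at + sla.resolve
```

For example, `deadlines("SEV-2", detected_at)` returns the time by which Rick should have acknowledged and the time by which the incident should be resolved.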

---

## Incident Response Procedure

### Step 1: Detection
Incidents are detected via:
- **Automated monitoring**: Cron job failures, API error logs, bounce rate alerts
- **Client report**: Client emails/messages about an issue
- **Agent detection**: Agent notices anomaly during routine work
- **Scheduled review**: Rick's 6pm CRM review cron catches data issues

**Who detects → Who acts:**
- Alfred detects (cron/monitoring) → Alfred triages and assigns
- Client reports → Rick acknowledges, Alfred triages
- Devin detects (during build) → Devin fixes immediately if in scope

### Step 2: Triage (Alfred)
1. **Classify severity** (SEV-1 through SEV-4)
2. **Identify scope**: Which client(s) affected? Which system(s)?
3. **Assign resolver**: Usually Devin for technical work and Rick for client communication
4. **Log incident** in workspace memory:
   - Timestamp, severity, description, affected client(s), assigned resolver
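
One possible shape for the log entry is sketched below in Python. The fields mirror the list above plus the affected systems identified during triage; the `IncidentRecord` class, the `to_log_line` helper, and the JSON-lines format are assumptions for illustration, not a description of how workspace memory actually stores entries.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

VALID_SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


@dataclass
class IncidentRecord:
    severity: str
    description: str
    affected_clients: list[str]
    affected_systems: list[str]
    resolver: str
    # Defaults to the current UTC time in ISO 8601 format.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject anything that is not one of the four defined levels.
        if self.severity not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def to_log_line(record: IncidentRecord) -> str:
    """Serialize the record as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Validating the severity at creation time keeps malformed entries out of the log that the monthly pattern review depends on.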

### Step 3: Acknowledge (Rick)
Within the response time SLA for the severity level:
1. Send acknowledgment to the client using `incident-notification` template
2. Include what we know so far, confirmation that we're investigating, and the expected time of the next update
3. Do NOT speculate about cause or promise a fix time unless you're certain

**Template: incident-notification**
> Subject: [LeadsPanther] Issue Detected — [Brief Description]
>
> Hi [Name],
>
> We've identified an issue with [specific system/automation].
>
> **What's happening**: [1-2 sentence plain-language description]
> **What we're doing**: Our team is investigating and working on a fix.
> **Next update**: We'll update you by [specific time].
>
> If you have questions, reply to this email.
>
> — LeadsPanther Team

### Step 4: Investigate & Fix (Devin)
1. **Diagnose**: Check logs, API status pages, integration configs
2. **Isolate**: Is this our system, the client's system, or a third-party service?
3. **Fix or workaround**:
   - If fixable: apply fix, test, verify
   - If third-party outage: implement workaround if possible, monitor for resolution
   - If client-side: guide client through the fix
4. **Document**: What was broken, what caused it, what was done to fix it

### Step 5: Verify
1. **Devin** confirms the fix works end-to-end
2. Test the specific failure scenario
3. Monitor for 1 hour (SEV-1) or 24 hours (SEV-2 through SEV-4) to confirm stability

### Step 6: Resolve & Communicate (Rick)
1. Send resolution notification using `incident-resolved` template
2. Update HubSpot: add note to client deal with incident summary

**Template: incident-resolved**
> Subject: [LeadsPanther] Resolved — [Brief Description]
>
> Hi [Name],
>
> The issue with [specific system] has been resolved.
>
> **What happened**: [1-2 sentence plain-language root cause]
> **What we did**: [1-2 sentence fix description]
> **Current status**: Everything is back to normal and running as expected.
>
> If you notice anything else, just reply to this email.
>
> — LeadsPanther Team

### Step 7: Post-Incident (SEV-1 and SEV-2 only)
1. **Alfred** creates post-incident entry in workspace memory:
   - Timeline of events
   - Root cause
   - Resolution
   - Prevention measures
2. For SEV-1 incidents affecting Premium clients, send an `incident-postmortem` covering:
   - Full timeline
   - Root cause analysis
   - What we've changed to prevent recurrence

---

## Common Incident Playbooks

### API Down (OpenRouter, Apollo, HubSpot)
1. Check provider status page
2. If provider outage: switch to backup (model cascade fallback for OpenRouter)
3. If our config: verify API keys, check rate limits, review recent changes
4. If rate limited: reduce request volume, implement backoff
5. Notify affected clients if outage exceeds 1 hour
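
Steps 2 and 4 combine two techniques: exponential backoff when rate limited, and falling back to the next provider when one is down. The Python sketch below shows one way to combine them. The `RateLimited` and `ProviderDown` exceptions and both function names are hypothetical; a real integration would map each provider's own error responses onto them.

```python
import random
import time
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Hypothetical signal: the provider rejected the call due to rate limiting."""


class ProviderDown(Exception):
    """Hypothetical signal: the provider is unavailable."""


def call_with_backoff(
    call: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry call() on RateLimited, doubling the wait (plus jitter) each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


def call_with_fallback(
    providers: Sequence[Callable[[], str]],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, moving on when one is down or stays rate limited."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return call_with_backoff(provider, sleep=sleep)
        except (ProviderDown, RateLimited) as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting.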

### Email Bounce Spike (>5% bounce rate)
1. Pause outreach immediately
2. Check: Is the sending domain warmed up? Is SPF/DKIM/DMARC configured?
3. Review bounced addresses: hard bounces (bad addresses) vs. soft bounces (temporary issues)
4. If hard bounces: verify email verification pipeline is working
5. If domain issue: check domain reputation, reduce send volume, contact ESP support
6. Do NOT resume sending until bounce rate is under 2%
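
The pause and resume thresholds differ (pause above 5%, resume only under 2%), so outreach does not flap on and off around a single cutoff. A minimal Python sketch of that rule follows. The function names are hypothetical, and the window over which the rate is measured is an assumption left to the caller, since this playbook does not specify it.

```python
PAUSE_ABOVE = 0.05   # playbook trigger: bounce rate above 5%
RESUME_BELOW = 0.02  # playbook rule: do not resume until the rate is under 2%


def bounce_rate(sent: int, bounced: int) -> float:
    """Bounced messages as a fraction of messages sent (0.0 when nothing was sent)."""
    return bounced / sent if sent else 0.0


def should_be_paused(currently_paused: bool, rate: float) -> bool:
    """Return True if outreach should be paused after observing this bounce rate."""
    if currently_paused:
        # Stay paused until the rate drops under the resume threshold.
        return rate >= RESUME_BELOW
    # Not paused yet: pause once the rate exceeds the spike threshold.
    return rate > PAUSE_ABOVE
```

With these thresholds, a rate of 3% keeps a paused campaign paused but does not pause a running one.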

### CRM Sync Break
1. Check integration connection status
2. Verify API credentials haven't expired
3. Check for schema changes (new fields, deleted fields)
4. Run test sync with a single record
5. If data loss suspected: check last successful sync timestamp, identify gap
6. Backfill missing records
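
Steps 5 and 6 amount to comparing the source of truth against the CRM for the period since the last successful sync. The Python sketch below is one way to identify that gap. The record shape (dicts with `id` and `updated_at` keys) and the `find_sync_gap` name are hypothetical, and it only finds records missing from the CRM, not records that are present but stale.

```python
from datetime import datetime
from typing import Iterable


def find_sync_gap(
    source_records: Iterable[dict],
    last_successful_sync: datetime,
    crm_ids: set,
) -> list:
    """Return source records updated after the last good sync that the CRM lacks."""
    return [
        record
        for record in source_records
        if record["updated_at"] > last_successful_sync
        and record["id"] not in crm_ids
    ]
```

The returned list is the backfill candidate set; running the single-record test sync from step 4 on one of them first is a cheap way to confirm the fix before backfilling the rest.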

### Client Reports "It's Not Working"
1. **Rick** acknowledges within SLA
2. Ask clarifying questions: What specifically isn't working? When did it stop? Any changes on their end?
3. **Devin** investigates the specific system the client references
4. Often the issue is on the client's side (changed a password, modified a setting) — guide them through fixing it
5. Document resolution for future reference

### Outreach Tool Suspension (Instantly, Smartlead)
1. This is a SEV-1 — all outreach stops
2. Check platform for suspension reason (usually volume or spam complaints)
3. Do NOT create a new account — this makes it worse
4. Contact platform support with appeal
5. If suspended permanently: migrate to backup tool
6. Notify affected clients immediately

---

## Escalation Matrix

| Situation | Escalate To | Method |
|-----------|-------------|--------|
| SEV-1 incident | Clayton | Telegram (immediate) |
| Client threatening to leave | Clayton | Telegram (immediate) |
| Payment/refund dispute | Clayton | Telegram + email summary |
| Data breach suspicion | Clayton | Telegram (immediate) |
| Third-party outage >4 hours | Clayton | Telegram (update) |
| SEV-2 not resolved in 24 hours | Clayton | Telegram |
| Any legal threat | Clayton (also notify Laura) | Telegram (immediate) |
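
Three rows of the matrix depend only on severity and elapsed time, so they can be checked mechanically. The Python sketch below covers just those rows; the remaining rows (client threatening to leave, disputes, breach suspicion, legal threats) need human judgment. The `needs_escalation` function and its parameters are hypothetical names.

```python
def needs_escalation(
    severity: str,
    hours_open: float,
    resolved: bool = False,
    third_party_outage: bool = False,
) -> bool:
    """Return True if a time- or severity-based row of the matrix applies."""
    if severity == "SEV-1":
        return True  # SEV-1 always escalates immediately
    if resolved:
        return False
    if third_party_outage and hours_open > 4:
        return True  # third-party outage lasting more than 4 hours
    if severity == "SEV-2" and hours_open >= 24:
        return True  # SEV-2 not resolved within 24 hours
    return False
```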

---

## Monitoring Checklist (Proactive)

### Daily (Automated via Cron)
- [ ] All cron jobs fired successfully
- [ ] API error rate < 1%
- [ ] Email bounce rate < 2%
- [ ] Last successful CRM sync < 24 hours ago
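
The four daily checks map directly onto threshold comparisons. The Python sketch below shows one way a cron job could evaluate them. `DailySnapshot` and `failed_checks` are hypothetical names, and how the counts are collected is outside the scope of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DailySnapshot:
    cron_jobs_failed: int
    api_calls: int
    api_errors: int
    emails_sent: int
    emails_bounced: int
    last_crm_sync: datetime


def failed_checks(snapshot: DailySnapshot, now: datetime) -> list[str]:
    """Return the names of the daily checks that did not pass."""
    failures = []
    if snapshot.cron_jobs_failed > 0:
        failures.append("cron jobs")
    # Error rate must be strictly below 1%.
    if snapshot.api_calls and snapshot.api_errors / snapshot.api_calls >= 0.01:
        failures.append("API error rate")
    # Bounce rate must be strictly below 2%.
    if snapshot.emails_sent and snapshot.emails_bounced / snapshot.emails_sent >= 0.02:
        failures.append("email bounce rate")
    # Last successful sync must be less than 24 hours old.
    if now - snapshot.last_crm_sync >= timedelta(hours=24):
        failures.append("CRM sync age")
    return failures
```

An empty list means all four checks passed; any non-empty result is a candidate for triage under Step 2.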

### Weekly (Manual Review)
- [ ] Review all client-reported issues from the week
- [ ] Check outreach tool health (send limits, reputation score)
- [ ] Verify all integrations are connected
- [ ] Review Apollo/enrichment credit usage

### Monthly
- [ ] Domain reputation check (all sending domains)
- [ ] Full integration audit (disconnect and reconnect if flaky)
- [ ] Review incident log for patterns
- [ ] Update playbooks if new incident types have emerged

---

## Agent Responsibilities

| Agent | Incident Role |
|-------|--------------|
| Alfred | Detection (cron monitoring), triage, severity classification, assignment, post-incident logging |
| Devin | Technical investigation, diagnosis, fix implementation, verification |
| Rick | Client communication (acknowledge, update, resolve notifications), HubSpot logging |
| Clayton | Escalation target for SEV-1 incidents, client and payment disputes, suspected data breaches, legal threats, and stalled incidents (see Escalation Matrix) |

---

## Anti-Patterns (Do NOT)

- Do NOT ignore an incident hoping it resolves itself. Acknowledge within SLA, always.
- Do NOT blame third parties in client communication. Say "we're working on it," not "HubSpot broke."
- Do NOT promise fix times unless you're certain. Use "we'll update you by [time]" instead.
- Do NOT make changes to production systems without testing the fix first.
- Do NOT skip the post-incident log. Every SEV-1 and SEV-2 must be documented for prevention.