SOP: Incident Response
**Tag**: Cross-Cutting - All Engagements

**Owner**: Alfred (Orchestrator) coordinates · Devin (Delivery Ops) resolves · Rick (Revenue) communicates

---
Purpose
When something breaks (an API goes down, emails bounce, a CRM sync fails, or a client reports an issue), we respond fast, fix it, and communicate clearly. This SOP defines severity levels, response times, resolution procedures, and communication templates.

---
Severity Levels
SEV-1: Critical - Service Down
**Definition**: Client's core automation is completely non-functional. No leads flowing, no outreach sending, no data syncing.

**Examples**: OpenRouter API down, outreach tool suspended, CRM integration broken, client's entire pipeline offline.

**Response time**: 1 hour (acknowledge) / 4 hours (resolve or provide workaround)

**Communication**: Immediate notification to client + Clayton escalation via Telegram

SEV-2: High - Degraded Service
**Definition**: Automation is partially working but a key component is broken or producing errors.

**Examples**: Email bounce rate >10%, enrichment API returning errors, CRM sync dropping records, one channel of multi-channel outreach down.

**Response time**: 4 hours (acknowledge) / 24 hours (resolve)

**Communication**: Same-day notification to client

SEV-3: Medium - Minor Issue
**Definition**: Non-critical component is broken or performing below expectations. Core service is unaffected.

**Examples**: Reporting dashboard not updating, one email in a sequence has formatting issues, duplicate records appearing in CRM.

**Response time**: 24 hours (acknowledge) / 72 hours (resolve)

**Communication**: Include in next scheduled update (or sooner if client reports it)

SEV-4: Low - Cosmetic / Enhancement
**Definition**: Minor visual or UX issues. No functional impact.

**Examples**: Typo in email template, dashboard layout misaligned, non-critical field missing from CRM view.

**Response time**: 72 hours (acknowledge) / 1 week (resolve)

**Communication**: Acknowledge and schedule fix. No urgent notification needed.

---
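The response times in the severity levels can be turned into concrete deadlines. The sketch below is illustrative only: the names `Sla`, `SLA_BY_SEVERITY`, and `sla_deadlines` are hypothetical, and it assumes the SLA clock starts at detection time, which this SOP does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Sla(NamedTuple):
    acknowledge: timedelta
    resolve: timedelta


# Targets copied from the severity levels in this SOP.
SLA_BY_SEVERITY = {
    1: Sla(timedelta(hours=1), timedelta(hours=4)),
    2: Sla(timedelta(hours=4), timedelta(hours=24)),
    3: Sla(timedelta(hours=24), timedelta(hours=72)),
    4: Sla(timedelta(hours=72), timedelta(weeks=1)),
}


def sla_deadlines(severity: int, detected_at: datetime) -> tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by).

    Assumption: the clock starts when the incident is detected.
    """
    sla = SLA_BY_SEVERITY[severity]
    return detected_at + sla.acknowledge, detected_at + sla.resolve


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    ack_by, resolve_by = sla_deadlines(1, detected)
    print(ack_by.isoformat(), resolve_by.isoformat())
```

Keeping the targets in one table means the acknowledgment step and the escalation checks can read from the same source instead of repeating the numbers.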
Incident Response Procedure
Step 1: Detection
Incidents are detected via:

- **Automated monitoring**: Cron job failures, API error logs, bounce rate alerts
- **Client report**: Client emails/messages about an issue
- **Agent detection**: Agent notices anomaly during routine work
- **Scheduled review**: Rick's 6pm CRM review cron catches data issues

**Who detects → Who acts:**

- Alfred detects (cron/monitoring) → Alfred triages and assigns
- Client reports → Rick acknowledges, Alfred triages
- Devin detects (during build) → Devin fixes immediately if in scope
Step 2: Triage (Alfred)
1. **Classify severity** (SEV-1 through SEV-4)
2. **Identify scope**: Which client(s) affected? Which system(s)?
3. **Assign resolver**: Usually Devin for technical, Rick for communication
4. **Log incident** in workspace memory:
   - Timestamp, severity, description, affected client(s), assigned resolver
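The log entry from the triage step could be represented as a small record type. This is a sketch, not the actual workspace-memory format, which this SOP does not specify: `IncidentRecord` and its JSON serialization are hypothetical, and the client name in the usage example is made up.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """One triage log entry with the fields listed in the triage step."""

    severity: int                  # 1-4, matching SEV-1 through SEV-4
    description: str
    affected_clients: list[str]
    assigned_resolver: str
    # Assumption: timestamps are stored as ISO 8601 strings in UTC.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if self.severity not in (1, 2, 3, 4):
            raise ValueError(f"severity must be 1-4, got {self.severity}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = IncidentRecord(2, "CRM sync dropping records", ["Acme"], "Devin")
    print(record.to_json())
```

Rejecting an out-of-range severity at creation time keeps unclassified incidents out of the log.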
Step 3: Acknowledge (Rick)
Within the response time SLA for the severity level:

1. Send acknowledgment to the client using the `incident-notification` template
2. Include: what we know so far, that we're investigating, expected next update time
3. Do NOT speculate about cause or promise a fix time unless you're certain

**Template: incident-notification**

> Subject: [LeadsPanther] Issue Detected - [Brief Description]
>
> Hi [Name],
>
> We've identified an issue with [specific system/automation].
>
> **What's happening**: [1-2 sentence plain-language description]
> **What we're doing**: Our team is investigating and working on a fix.
> **Next update**: We'll update you by [specific time].
>
> If you have questions, reply to this email.
>
> -- LeadsPanther Team
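One way to fill the `incident-notification` template without leaving a bracketed placeholder in a client email is Python's standard `string.Template`. The sketch below is an assumption about tooling, not a description of how templates are actually stored. The placeholder names (`$brief`, `$name`, and so on) and the function name are hypothetical, and the example values are made up.

```python
from string import Template

# Body mirrors the incident-notification template in this SOP,
# with the bracketed placeholders turned into $-style fields.
INCIDENT_NOTIFICATION = Template(
    "Subject: [LeadsPanther] Issue Detected - $brief\n"
    "\n"
    "Hi $name,\n"
    "\n"
    "We've identified an issue with $system.\n"
    "\n"
    "What's happening: $description\n"
    "What we're doing: Our team is investigating and working on a fix.\n"
    "Next update: We'll update you by $next_update.\n"
    "\n"
    "If you have questions, reply to this email.\n"
    "\n"
    "-- LeadsPanther Team\n"
)


def render_incident_notification(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return INCIDENT_NOTIFICATION.substitute(fields)


if __name__ == "__main__":
    print(render_incident_notification(
        brief="CRM sync delay",
        name="Dana",
        system="the HubSpot sync",
        description="New leads are not reaching your CRM.",
        next_update="3pm ET",
    ))
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field fails loudly before the email is sent.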
Step 4: Investigate & Fix (Devin)
1. **Diagnose**: Check logs, API status pages, integration configs
2. **Isolate**: Is this our system, the client's system, or a third-party service?
3. **Fix or workaround**:
   - If fixable: apply fix, test, verify
   - If third-party outage: implement workaround if possible, monitor for resolution
   - If client-side: guide client through the fix
4. **Document**: What was broken, what caused it, what was done to fix it
Step 5: Verify
1. **Devin** confirms the fix works end-to-end
2. Test the specific failure scenario
3. Monitor for 1 hour (SEV-1) or 24 hours (SEV-2 through SEV-4) to confirm stability
Step 6: Resolve & Communicate (Rick)
1. Send resolution notification using the `incident-resolved` template
2. Update HubSpot: add note to client deal with incident summary

**Template: incident-resolved**

> Subject: [LeadsPanther] Resolved - [Brief Description]
>
> Hi [Name],
>
> The issue with [specific system] has been resolved.
>
> **What happened**: [1-2 sentence plain-language root cause]
> **What we did**: [1-2 sentence fix description]
> **Current status**: Everything is back to normal and running as expected.
>
> If you notice anything else, just reply to this email.
>
> -- LeadsPanther Team
Step 7: Post-Incident (SEV-1 and SEV-2 only)
1. **Alfred** creates post-incident entry in workspace memory:
   - Timeline of events
   - Root cause
   - Resolution
   - Prevention measures
2. For Premium clients on SEV-1 incidents, send `incident-postmortem`:
   - Full timeline
   - Root cause analysis
   - What we've changed to prevent recurrence

---
Common Incident Playbooks
API Down (OpenRouter, Apollo, HubSpot)
1. Check provider status page
2. If provider outage: switch to backup (model cascade fallback for OpenRouter)
3. If our config: verify API keys, check rate limits, review recent changes
4. If rate limited: reduce request volume, implement backoff
5. Notify affected clients if outage exceeds 1 hour
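The "implement backoff" step in the API playbook is commonly done as exponential backoff with jitter. The sketch below shows the idea under stated assumptions: `RateLimited` is a hypothetical exception that your own API wrapper would raise on a rate-limit response, and the default delays and attempt count are illustrative, not values taken from any provider's documentation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller's API wrapper when rate limited."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry request() on RateLimited, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: random wait up to the capped exponential delay,
            # so parallel workers do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real waiting.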
Email Bounce Spike (>5% bounce rate)
1. Pause outreach immediately
2. Check: Is the sending domain warmed up? Is SPF/DKIM/DMARC configured?
3. Review bounced addresses: hard bounces (bad addresses) vs. soft bounces (temp issues)
4. If hard bounces: verify email verification pipeline is working
5. If domain issue: check domain reputation, reduce send volume, contact ESP support
6. Do NOT resume sending until bounce rate is under 2%
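The bounce playbook uses two different thresholds: outreach pauses above 5% and may not resume until the rate is under 2%. A minimal sketch of that two-threshold rule follows. The function names are hypothetical. Dropping below 2% only makes a campaign eligible to resume, and the other checks in the playbook still apply.

```python
PAUSE_ABOVE = 0.05   # playbook trigger: bounce rate above 5%
RESUME_BELOW = 0.02  # sending may not resume until the rate is under 2%


def bounce_rate(bounced: int, sent: int) -> float:
    """Fraction of sent emails that bounced; 0.0 when nothing was sent."""
    return bounced / sent if sent else 0.0


def should_be_paused(currently_paused: bool, rate: float) -> bool:
    """Apply the two-threshold rule to the latest observed bounce rate."""
    if currently_paused:
        # Stay paused in the 2%-5% band; only a rate under 2% clears the pause.
        return rate >= RESUME_BELOW
    return rate > PAUSE_ABOVE


if __name__ == "__main__":
    rate = bounce_rate(bounced=6, sent=100)
    print(rate, should_be_paused(currently_paused=False, rate=rate))
```

The gap between the two thresholds prevents a campaign from flapping between paused and active when the rate hovers near a single cutoff.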
CRM Sync Break
1. Check integration connection status
2. Verify API credentials haven't expired
3. Check for schema changes (new fields, deleted fields)
4. Run test sync with a single record
5. If data loss suspected: check last successful sync timestamp, identify gap
6. Backfill missing records
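Identifying the gap and the records to backfill can be sketched as a comparison between the source system and the CRM. The data shapes here are assumptions for illustration: a mapping of record ID to last-updated time on the source side and a set of IDs already present in the CRM. The record IDs in the usage example are made up.

```python
from datetime import datetime


def records_to_backfill(
    source_records: dict[str, datetime],
    crm_ids: set[str],
    last_successful_sync: datetime,
) -> list[str]:
    """IDs changed after the last successful sync that are missing from the CRM."""
    return sorted(
        record_id
        for record_id, updated_at in source_records.items()
        if updated_at > last_successful_sync and record_id not in crm_ids
    )


if __name__ == "__main__":
    last_sync = datetime(2024, 1, 1, 12, 0)
    source = {
        "lead-1": datetime(2024, 1, 1, 9, 0),   # before the gap
        "lead-2": datetime(2024, 1, 1, 13, 0),  # in the gap, missing from CRM
        "lead-3": datetime(2024, 1, 1, 14, 0),  # in the gap, already in CRM
    }
    print(records_to_backfill(source, {"lead-1", "lead-3"}, last_sync))
```

This only finds records absent from the CRM. Records that exist but hold stale field values would need a separate comparison.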
Client Reports "It's Not Working"
1. **Rick** acknowledges within SLA
2. Ask clarifying questions: What specifically isn't working? When did it stop? Any changes on their end?
3. **Devin** investigates the specific system the client references
4. Often the issue is on the client's side (changed a password, modified a setting); guide them through fixing it
5. Document resolution for future reference
Outreach Tool Suspension (Instantly, Smartlead)
1. This is a SEV-1: all outreach stops
2. Check platform for suspension reason (usually volume or spam complaints)
3. Do NOT create a new account; this makes it worse
4. Contact platform support with appeal
5. If suspended permanently: migrate to backup tool
6. Notify affected clients immediately

---
Escalation Matrix
| Situation | Escalate To | Method |
|-----------|-------------|--------|
| SEV-1 incident | Clayton | Telegram (immediate) |
| Client threatening to leave | Clayton | Telegram (immediate) |
| Payment/refund dispute | Clayton | Telegram + email summary |
| Data breach suspicion | Clayton | Telegram (immediate) |
| Third-party outage >4 hours | Clayton | Telegram (update) |
| SEV-2 not resolved in 24 hours | Clayton | Telegram |
| Any legal threat | Clayton | Telegram (immediate) + Laura |

---
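Three rows of the escalation matrix depend only on severity and elapsed time, so they can be checked mechanically. The sketch below covers just those rows. The function name and the precedence between rules are assumptions, and the remaining rows (disputes, breach suspicion, legal threats) still need a judgment call.

```python
from typing import Optional


def escalation_method(
    severity: int,
    hours_open: float,
    third_party_outage: bool = False,
) -> Optional[str]:
    """Return how to escalate to Clayton, or None if no time-based rule applies."""
    if severity == 1:
        return "Telegram (immediate)"
    if third_party_outage and hours_open > 4:
        return "Telegram (update)"
    if severity == 2 and hours_open >= 24:
        return "Telegram"
    return None


if __name__ == "__main__":
    print(escalation_method(severity=2, hours_open=25))
```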
Monitoring Checklist (Proactive)
Daily (Automated via Cron)
- [ ] All cron jobs fired successfully
- [ ] API error rate < 1%
- [ ] Email bounce rate < 2%
- [ ] CRM sync last successful < 24 hours ago
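The daily checklist maps directly onto a single check function that a cron job could call. This is a sketch with a hypothetical function name and hypothetical inputs. How the error rate, bounce rate, and sync timestamp are actually collected is outside what this SOP describes.

```python
from datetime import datetime, timedelta


def daily_check_failures(
    cron_jobs_ok: bool,
    api_error_rate: float,
    bounce_rate: float,
    last_crm_sync: datetime,
    now: datetime,
) -> list[str]:
    """Return the names of daily checks that failed; an empty list means healthy."""
    failures = []
    if not cron_jobs_ok:
        failures.append("cron jobs")
    if api_error_rate >= 0.01:  # checklist target: error rate < 1%
        failures.append("API error rate")
    if bounce_rate >= 0.02:  # checklist target: bounce rate < 2%
        failures.append("email bounce rate")
    if now - last_crm_sync >= timedelta(hours=24):  # target: synced < 24h ago
        failures.append("CRM sync age")
    return failures


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 18, 0)
    print(daily_check_failures(True, 0.004, 0.031, datetime(2024, 1, 2, 6, 0), now))
```

Returning the list of failed checks, rather than a single pass/fail flag, gives triage the affected system up front.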
Weekly (Manual Review)
- [ ] Review all client-reported issues from the week
- [ ] Check outreach tool health (send limits, reputation score)
- [ ] Verify all integrations are connected
- [ ] Review Apollo/enrichment credit usage
Monthly
- [ ] Domain reputation check (all sending domains)
- [ ] Full integration audit (disconnect and reconnect if flaky)
- [ ] Review incident log for patterns
- [ ] Update playbooks if new incident types emerged

---
Agent Responsibilities
| Agent | Incident Role |
|-------|---------------|
| Alfred | Detection (cron monitoring), triage, severity classification, assignment, post-incident logging |
| Devin | Technical investigation, diagnosis, fix implementation, verification |
| Rick | Client communication (acknowledge, update, resolve notifications), HubSpot logging |
| Clayton | Escalation target for SEV-1, client disputes, data/security issues |

---
Anti-Patterns (Do NOT)
- Do NOT ignore an incident hoping it resolves itself. Acknowledge within SLA, always.
- Do NOT blame third parties in client communication. Say "we're working on it," not "HubSpot broke."
- Do NOT promise fix times unless you're certain. Use "we'll update you by [time]" instead.
- Do NOT make changes to production systems without testing the fix first.
- Do NOT skip the post-incident log. Every SEV-1 and SEV-2 must be documented for prevention.