Self-Healing Agent
Autonomous database healing that detects and resolves performance issues automatically, keeping your systems running smoothly 24/7.
How Autonomous Healing Works
DB24x7's self-healing agent continuously monitors your database health and automatically takes corrective actions when issues are detected. The agent uses machine learning to understand your database patterns and make intelligent decisions.
Healing Workflow
1. Detection: The agent detects anomalies in real-time metrics such as slow queries, connection spikes, or high resource usage.
2. Analysis: Root cause analysis identifies the underlying issue and its severity using historical patterns and ML models.
3. Action Selection: The agent selects the most appropriate healing action based on the issue type and configured policies.
4. Approval Check: Depending on action severity, the agent either executes the action automatically or requests approval from designated team members.
5. Execution: The healing action is executed with safety guardrails and rollback capabilities.
6. Verification: Post-action monitoring confirms that the issue is resolved and that no new issues were introduced.
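The six steps above can be sketched as a single control loop. This is a minimal illustration, not DB24x7's implementation: the `Issue` type, the `detect` and `heal` functions, the `ACTIONS` and `AUTO_APPROVED` tables, and the `query_duration_s` metric name are all hypothetical, and the analysis step is reduced to a comment.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Issue:
    kind: str       # e.g. "long_running_query"
    severity: str   # "low", "medium", or "high"


def detect(metrics: dict) -> Optional[Issue]:
    """Step 1: flag an anomaly. The 300 s limit mirrors the 5-minute trigger."""
    if metrics.get("query_duration_s", 0) > 300:
        return Issue(kind="long_running_query", severity="high")
    return None


# Step 3: hypothetical mapping from issue type to healing action.
ACTIONS = {"long_running_query": "kill_query"}
# Step 4: actions assumed safe enough to run without approval.
AUTO_APPROVED = {"kill_query"}


def heal(metrics: dict,
         execute: Callable[[str], None],
         read_metrics: Callable[[], dict],
         approve: Callable[[str], bool] = lambda action: False) -> str:
    issue = detect(metrics)                        # 1. Detection
    if issue is None:
        return "healthy"
    # 2. Analysis is elided; a real agent would run root cause analysis here.
    action = ACTIONS.get(issue.kind)               # 3. Action selection
    if action is None:
        return "no_action_available"
    if action not in AUTO_APPROVED and not approve(action):
        return "rejected"                          # 4. Approval check
    execute(action)                                # 5. Execution
    # 6. Verification: re-read metrics and confirm the anomaly is gone.
    return "resolved" if detect(read_metrics()) is None else "unresolved"
```

Passing `execute`, `read_metrics`, and `approve` as callables keeps the sketch independent of any particular database driver or notification channel.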
Configurable Healing Actions
Kill Long-Running Queries
Automatically terminate queries that exceed configured time thresholds and are causing system slowdowns.
Trigger: Query runtime > 5 minutes
Action: KILL QUERY <query_id>
Safety: Exclude admin sessions
Clear Connection Pool
Reset connection pools when connection leaks or exhausted connection limits are detected.
Trigger: Connections > 90% max
Action: Clear idle connections
Safety: Grace period for active txns
Apply Missing Indexes
Automatically create indexes that the query optimizer identifies as missing to improve performance.
Trigger: Repeated full table scans
Action: CREATE INDEX (concurrent)
Safety: Off-peak hours only
Restart Stalled Replication
Detect replication lag and automatically restart stalled replication to maintain data consistency.
Trigger: Replication lag > 60s
Action: Restart replication thread
Safety: Alert on repeated failures
Vacuum and Analyze
Run maintenance operations when table bloat or stale statistics are detected.
Trigger: Table bloat > 30%
Action: VACUUM ANALYZE
Safety: Throttled by I/O usage
Scale Resources
Automatically scale compute resources when sustained high utilization is detected.
Trigger: CPU > 85% for 10 minutes
Action: Scale up instance size
Safety: Requires approval
Action Policies and Approvals
Control which actions can run automatically and which require human approval based on risk levels and business requirements.
Policy Configuration Example
{
  "policies": [
    {
      "name": "Low-Risk Auto-Heal",
      "actions": ["kill_query", "clear_connections"],
      "approval_required": false,
      "conditions": {
        "environment": ["staging", "production"],
        "time_windows": ["00:00-23:59"],
        "max_frequency": "10 per hour"
      }
    },
    {
      "name": "Medium-Risk with Approval",
      "actions": ["apply_index", "vacuum_analyze"],
      "approval_required": true,
      "approvers": ["@dba-team", "@ops-lead"],
      "timeout": "15 minutes",
      "conditions": {
        "environment": ["production"],
        "time_windows": ["02:00-06:00"]
      }
    },
    {
      "name": "High-Risk Manual Only",
      "actions": ["scale_resources", "failover"],
      "approval_required": true,
      "approvers": ["@senior-dba", "@engineering-director"],
      "require_all_approvers": true,
      "conditions": {
        "environment": ["production"]
      }
    }
  ]
}
Approval Workflow
- Approval requests are sent via Slack, email, or mobile push
- One-click approval or rejection from notification
- Auto-reject if no response within timeout period
- Approval history tracked for compliance and auditing
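To make the policy configuration and approval workflow above concrete, here is a small sketch of how a policy might be looked up and an approval decision reached. The function names `find_policy` and `is_approved` are hypothetical, and two behaviors are assumptions rather than documented facts: that any single listed approver suffices unless `require_all_approvers` is set, and that the first matching policy wins. The `time_windows`, `max_frequency`, and `timeout` fields are left out for brevity.

```python
from typing import Optional

# Trimmed copy of the policy configuration example.
POLICIES = [
    {"name": "Low-Risk Auto-Heal",
     "actions": ["kill_query", "clear_connections"],
     "approval_required": False,
     "conditions": {"environment": ["staging", "production"]}},
    {"name": "Medium-Risk with Approval",
     "actions": ["apply_index", "vacuum_analyze"],
     "approval_required": True,
     "approvers": ["@dba-team", "@ops-lead"],
     "conditions": {"environment": ["production"]}},
    {"name": "High-Risk Manual Only",
     "actions": ["scale_resources", "failover"],
     "approval_required": True,
     "approvers": ["@senior-dba", "@engineering-director"],
     "require_all_approvers": True,
     "conditions": {"environment": ["production"]}},
]


def find_policy(action: str, environment: str) -> Optional[dict]:
    """Return the first policy covering this action in this environment."""
    for policy in POLICIES:
        if (action in policy["actions"]
                and environment in policy["conditions"]["environment"]):
            return policy
    return None  # no policy matches, so the action should not run


def is_approved(policy: dict, approvals: set, timed_out: bool = False) -> bool:
    """Decide whether an action may proceed under the given policy."""
    if not policy["approval_required"]:
        return True
    if timed_out:
        return False  # auto-reject when nobody responds within the timeout
    approvers = set(policy.get("approvers", []))
    if policy.get("require_all_approvers", False):
        return approvers <= approvals      # every approver must sign off
    return bool(approvers & approvals)     # assumed: any one approver suffices
```

For example, `find_policy("apply_index", "staging")` returns `None` here because the medium-risk policy only lists `production`, which a caller would treat as "do not act".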
Audit Logging
Every self-healing action is logged with complete context for compliance, troubleshooting, and continuous improvement.
Audit Log Entry Structure
{
  "audit_id": "heal_20260207_143052_a7f9d2",
  "timestamp": "2026-02-07T14:30:52.123Z",
  "database": "prod-api-db-01",
  "issue_detected": {
    "type": "long_running_query",
    "severity": "high",
    "metric_values": {
      "query_duration": "342s",
      "cpu_usage": "95%",
      "blocked_connections": 23
    }
  },
  "action_taken": {
    "type": "kill_query",
    "query_id": "1234567",
    "query_text": "SELECT * FROM orders JOIN ...",
    "user": "app_readonly",
    "execution_time": "2026-02-07T14:30:53.456Z"
  },
  "approval": {
    "required": false,
    "policy": "Low-Risk Auto-Heal"
  },
  "outcome": {
    "status": "success",
    "verification": {
      "cpu_after": "45%",
      "blocked_connections_after": 0,
      "recovery_time": "3.2s"
    }
  },
  "metadata": {
    "agent_version": "2.4.1",
    "confidence_score": 0.94,
    "similar_past_actions": 47
  }
}
Searchable History
Query audit logs by database, action type, date range, or outcome status.
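As an illustration of the filters listed above, the following sketch searches exported audit entries in memory. It assumes entries follow the audit log entry structure shown earlier; `search_audit_log` and `parse_timestamp` are hypothetical helpers, not part of a documented DB24x7 API, and the product's own search may work quite differently.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def parse_timestamp(value: str) -> datetime:
    """Parse the ISO 8601 UTC timestamps used in audit entries."""
    # Older Python versions reject a trailing "Z", so spell out the offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def search_audit_log(entries: Iterable[dict],
                     database: Optional[str] = None,
                     action_type: Optional[str] = None,
                     status: Optional[str] = None,
                     start: Optional[datetime] = None,
                     end: Optional[datetime] = None) -> List[dict]:
    """Return entries matching every filter that was supplied."""
    matches = []
    for entry in entries:
        when = parse_timestamp(entry["timestamp"])
        if database is not None and entry["database"] != database:
            continue
        if action_type is not None and entry["action_taken"]["type"] != action_type:
            continue
        if status is not None and entry["outcome"]["status"] != status:
            continue
        if start is not None and when < start:
            continue
        if end is not None and when >= end:   # end of range is exclusive
            continue
        matches.append(entry)
    return matches


# Trimmed version of the example audit entry.
SAMPLE_ENTRY = {
    "audit_id": "heal_20260207_143052_a7f9d2",
    "timestamp": "2026-02-07T14:30:52.123Z",
    "database": "prod-api-db-01",
    "action_taken": {"type": "kill_query"},
    "outcome": {"status": "success"},
}

FEB_7 = datetime(2026, 2, 7, tzinfo=timezone.utc)
FEB_8 = datetime(2026, 2, 8, tzinfo=timezone.utc)
hits = search_audit_log([SAMPLE_ENTRY], database="prod-api-db-01",
                        start=FEB_7, end=FEB_8)
```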
Timeline View
Visualize healing actions over time with issue context and resolution metrics.
Export Reports
Generate compliance reports in JSON, CSV, or PDF format for auditing purposes.
Safety Guardrails
Multiple layers of protection are designed to keep self-healing actions from causing more harm than good.
Rate Limiting
Limits how often automated actions can run against each database, which prevents action storms and cascading failures. Default: maximum 10 actions per hour per database.
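One common way to enforce a per-database limit like this is a sliding-window counter. The sketch below assumes that approach; the `ActionRateLimiter` class is hypothetical and only the default of 10 actions per hour comes from the text above.

```python
import time
from collections import defaultdict, deque


class ActionRateLimiter:
    """Sliding-window limiter: at most max_actions per window per database."""

    def __init__(self, max_actions: int = 10, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock                      # injectable for testing
        self._history = defaultdict(deque)      # database -> action timestamps

    def allow(self, database: str) -> bool:
        """Record and permit an action, or refuse it if the limit is reached."""
        now = self.clock()
        recent = self._history[database]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True
```

Because each database has its own history, a burst of actions on one database does not consume the budget of another.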
Confidence Threshold
ML models must reach a minimum confidence score (default: 85%) before recommending actions. Low-confidence scenarios always require human review.
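The gating rule can be expressed in a few lines. In this sketch the function name and return labels are hypothetical; only the 85% default is taken from the text. An "eligible" result would still be subject to policies and the other guardrails.

```python
def route_recommendation(confidence: float, threshold: float = 0.85) -> str:
    """Send low-confidence recommendations to a human instead of acting."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        return "eligible_for_auto_action"
    return "human_review"
```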
Rollback Capabilities
Actions that modify database state (indexes, configuration changes) include automatic rollback if post-action verification detects degraded performance.
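A rollback guard of this kind might look like the sketch below. It assumes a single "lower is better" metric such as query latency and a hypothetical 10% tolerance; `apply_with_rollback` and its parameters are illustrative names, and how DB24x7 actually measures degradation is not specified here.

```python
from typing import Callable


def apply_with_rollback(apply: Callable[[], None],
                        rollback: Callable[[], None],
                        measure: Callable[[], float],
                        max_degradation: float = 0.10) -> str:
    """Apply a change, then undo it if the measured metric got worse.

    measure() returns a value where lower is better, e.g. latency in ms.
    """
    baseline = measure()
    apply()
    after = measure()
    if after > baseline * (1 + max_degradation):
        rollback()                # verification failed: restore prior state
        return "rolled_back"
    return "kept"
```

For an index change, `apply` could create the index and `rollback` could drop it again, with `measure` sampling latency for the affected queries.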
Circuit Breaker
If repeated actions fail or cause issues, the agent enters a "safe mode" and requires manual re-enablement after investigation. Prevents automated trial-and-error loops.
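The behavior described above differs from a classic circuit breaker in that it never reopens on its own. A minimal sketch, assuming a hypothetical threshold of three consecutive failures (the actual threshold is not stated) and a hypothetical `HealingCircuitBreaker` class:

```python
class HealingCircuitBreaker:
    """Enter safe mode after repeated failures; only a human can reset it."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.safe_mode = False

    def allows_actions(self) -> bool:
        return not self.safe_mode

    def record(self, success: bool) -> None:
        """Record the outcome of one healing action."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.safe_mode = True   # stop automated trial-and-error

    def re_enable(self) -> None:
        """Manual reset, intended to be called after investigation."""
        self.safe_mode = False
        self.consecutive_failures = 0
```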
Environment Protection
Production databases can have stricter policies than staging environments. Critical production systems can require approval for all actions.
Maintenance Window Awareness
Respects defined maintenance windows and avoids disruptive actions during peak business hours. Can be configured per database and action type.
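A window check can reuse the `"HH:MM-HH:MM"` strings from the `time_windows` policy field. The sketch below is one possible interpretation: it assumes both endpoints are inclusive at minute resolution and that a window whose start is later than its end wraps past midnight. The function names are hypothetical.

```python
from datetime import time
from typing import List, Tuple


def parse_window(window: str) -> Tuple[time, time]:
    """Split "02:00-06:00" into a (start, end) pair of times."""
    start_text, end_text = window.split("-")
    return time.fromisoformat(start_text), time.fromisoformat(end_text)


def in_window(now: time, window: str) -> bool:
    """True if now falls inside the window, endpoints included."""
    start, end = parse_window(window)
    now = now.replace(second=0, microsecond=0)   # compare at minute resolution
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end            # window wraps past midnight


def action_allowed(now: time, windows: List[str]) -> bool:
    """An action may run if any configured window contains the current time."""
    return any(in_window(now, window) for window in windows)
```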
Enabling Self-Healing
Quick Start Configuration
1. Navigate to Settings > Automation > Self-Healing
2. Select your database and click Enable Self-Healing
3. Choose a preset policy (Conservative, Balanced, or Aggressive) or create custom policies
4. Configure approval workflows and notification channels
5. Start with "Learning Mode" to observe recommendations without auto-execution