Self-Healing Agent

Autonomous database healing that detects and resolves performance issues automatically, keeping your systems running smoothly 24/7.

How Autonomous Healing Works

DB24x7's self-healing agent continuously monitors your database health and automatically takes corrective actions when issues are detected. The agent uses machine learning to understand your database patterns and make intelligent decisions.

Healing Workflow

1. Detection: The agent detects anomalies in real-time metrics such as slow queries, connection spikes, or high resource usage.
2. Analysis: Root cause analysis determines the exact issue and impact severity using historical patterns and ML models.
3. Action Selection: The agent selects the most appropriate healing action based on the issue type and configured policies.
4. Approval Check: Depending on action severity, the agent either executes the action automatically or requests approval from designated team members.
5. Execution: The healing action is executed with safety guardrails and rollback capabilities.
6. Verification: Post-action monitoring confirms the issue is resolved and no new issues were introduced.
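
The workflow can be sketched as a small control function. This is a minimal illustration, not DB24x7's actual implementation: the `Issue` class, the `ACTION_FOR_ISSUE` mapping, and every callback parameter are hypothetical names introduced here. Detection and analysis (steps 1 and 2) are assumed to have already produced the `Issue`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Issue:
    kind: str      # e.g. "long_running_query" (hypothetical issue type)
    severity: str  # e.g. "low", "medium", "high"


# Hypothetical mapping from issue type to healing action (step 3).
ACTION_FOR_ISSUE = {
    "long_running_query": "kill_query",
    "connection_exhaustion": "clear_connections",
    "table_bloat": "vacuum_analyze",
}


def heal(
    issue: Issue,
    needs_approval: Callable[[str], bool],
    approve: Callable[[str], bool],
    execute: Callable[[str], None],
    verify: Callable[[Issue], bool],
    rollback: Callable[[str], None],
) -> str:
    """Run steps 3-6 for an issue that was already detected and analyzed."""
    action = ACTION_FOR_ISSUE.get(issue.kind)           # step 3: action selection
    if action is None:
        return "no_action"
    if needs_approval(action) and not approve(action):  # step 4: approval check
        return "rejected"
    execute(action)                                     # step 5: execution
    if verify(issue):                                   # step 6: verification
        return "resolved"
    rollback(action)                                    # undo if verification fails
    return "rolled_back"
```

Passing the approval, execution, and verification steps in as callbacks keeps the sketch independent of any particular database or notification channel.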

Configurable Healing Actions

Kill Long-Running Queries

Automatically terminate queries that exceed configured time thresholds and are causing system slowdowns.

Trigger: Query runtime > 5 minutes
Action: KILL QUERY <query_id>
Safety: Exclude admin sessions
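
As a rough sketch of the trigger and safety rule above, the function below picks the query IDs that would be terminated. The `RunningQuery` class, the `queries_to_kill` function, and the default admin user set are assumptions made for illustration; the 300-second default mirrors the 5-minute trigger.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class RunningQuery:
    query_id: str
    user: str
    runtime_seconds: float


def queries_to_kill(
    queries: List[RunningQuery],
    threshold_seconds: float = 300,  # 5 minutes, matching the trigger above
    admin_users: FrozenSet[str] = frozenset({"admin"}),  # hypothetical admin list
) -> List[str]:
    """Return IDs of queries over the threshold, skipping admin sessions."""
    return [
        q.query_id
        for q in queries
        if q.runtime_seconds > threshold_seconds and q.user not in admin_users
    ]
```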

Clear Connection Pool

Reset connection pools when detecting connection leaks or exhausted connection limits.

Trigger: Connections > 90% max
Action: Clear idle connections
Safety: Grace period for active txns
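
One way to picture this rule: once usage passes 90% of the limit, only idle connections become candidates for clearing. The `Connection` class and the state names below are assumptions, and the grace period is simplified to never touching connections that are active or inside a transaction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Connection:
    conn_id: int
    state: str  # assumed values: "active", "idle", "idle in transaction"


def connections_to_clear(
    connections: List[Connection],
    max_connections: int,
    threshold: float = 0.90,  # matches the 90% trigger above
) -> List[int]:
    """Return IDs of idle connections, but only when usage exceeds the threshold."""
    if len(connections) <= threshold * max_connections:
        return []
    # Connections that are active or inside a transaction are left alone.
    return [c.conn_id for c in connections if c.state == "idle"]
```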

Apply Missing Indexes

Automatically create indexes recommended by the query optimizer to improve performance.

Trigger: Repeated full table scans
Action: CREATE INDEX (concurrent)
Safety: Off-peak hours only

Restart Stalled Replication

Detect replication lag and automatically restart stalled replication threads to maintain data consistency.

Trigger: Replication lag > 60s
Action: Restart replication thread
Safety: Alert on repeated failures

Vacuum and Analyze

Trigger maintenance operations when table bloat or stale statistics are detected.

Trigger: Table bloat > 30%
Action: VACUUM ANALYZE
Safety: Throttled by I/O usage

Scale Resources

Scale up compute resources when sustained high utilization is detected. This action requires approval before it runs.

Trigger: CPU > 85% for 10 minutes
Action: Scale up instance size
Safety: Requires approval
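
A "sustained" trigger differs from a simple threshold because every sample in the window has to be above the limit. The sketch below shows one possible check; the function name, the sample format, and the requirement that history cover the full window are assumptions, not documented agent behavior.

```python
from typing import List, Tuple


def sustained_high_cpu(
    samples: List[Tuple[float, float]],  # (timestamp_seconds, cpu_percent), oldest first
    threshold: float = 85.0,             # matches the 85% trigger above
    window_seconds: float = 600,         # 10 minutes
) -> bool:
    """True only if every sample in the last window is above the threshold."""
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    if samples[0][0] > window_start:
        return False  # not enough history to cover the whole window
    recent = [cpu for ts, cpu in samples if ts >= window_start]
    return all(cpu > threshold for cpu in recent)
```

A single dip below the threshold resets the condition, which keeps short spikes from triggering a resize.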

Action Policies and Approvals

Control which actions can run automatically and which require human approval based on risk levels and business requirements.

Policy Configuration Example

{
  "policies": [
    {
      "name": "Low-Risk Auto-Heal",
      "actions": ["kill_query", "clear_connections"],
      "approval_required": false,
      "conditions": {
        "environment": ["staging", "production"],
        "time_windows": ["00:00-23:59"],
        "max_frequency": "10 per hour"
      }
    },
    {
      "name": "Medium-Risk with Approval",
      "actions": ["apply_index", "vacuum_analyze"],
      "approval_required": true,
      "approvers": ["@dba-team", "@ops-lead"],
      "timeout": "15 minutes",
      "conditions": {
        "environment": ["production"],
        "time_windows": ["02:00-06:00"]
      }
    },
    {
      "name": "High-Risk Manual Only",
      "actions": ["scale_resources", "failover"],
      "approval_required": true,
      "approvers": ["@senior-dba", "@engineering-director"],
      "require_all_approvers": true,
      "conditions": {
        "environment": ["production"]
      }
    }
  ]
}
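
To illustrate how such a configuration might be evaluated, the sketch below looks up the first policy that covers an action in a given environment and time. The matching semantics are an assumption (first match wins, a policy without `time_windows` applies at any time, and windows do not cross midnight), and `find_policy` and `in_window` are hypothetical helpers, not part of DB24x7.

```python
from datetime import time
from typing import Optional

# Abbreviated copy of the first two policies from the example above.
POLICIES = [
    {
        "name": "Low-Risk Auto-Heal",
        "actions": ["kill_query", "clear_connections"],
        "approval_required": False,
        "conditions": {
            "environment": ["staging", "production"],
            "time_windows": ["00:00-23:59"],
        },
    },
    {
        "name": "Medium-Risk with Approval",
        "actions": ["apply_index", "vacuum_analyze"],
        "approval_required": True,
        "conditions": {
            "environment": ["production"],
            "time_windows": ["02:00-06:00"],
        },
    },
]


def in_window(window: str, now: time) -> bool:
    """Check an "HH:MM-HH:MM" window; windows crossing midnight are not handled."""
    start_text, end_text = window.split("-")
    return time.fromisoformat(start_text) <= now <= time.fromisoformat(end_text)


def find_policy(policies: list, action: str, environment: str, now: time) -> Optional[dict]:
    """Return the first policy covering this action, environment, and time."""
    for policy in policies:
        conditions = policy.get("conditions", {})
        if action not in policy["actions"]:
            continue
        if environment not in conditions.get("environment", []):
            continue
        windows = conditions.get("time_windows")
        if windows and not any(in_window(w, now) for w in windows):
            continue
        return policy
    return None  # assumed behavior: no matching policy means the action does not run
```

For example, `find_policy(POLICIES, "apply_index", "production", time(3, 0))` returns the medium-risk policy, while the same call at `time(12, 0)` returns `None` because noon falls outside the 02:00-06:00 window.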

Approval Workflow

Approval requests are sent via Slack, email, or mobile push
One-click approval or rejection directly from the notification
Automatic rejection if no one responds within the timeout period
Approval history tracked for compliance and auditing
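
The timeout rule can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and the status strings are invented here, and the 15-minute default is taken from the `timeout` value in the policy example.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_approval(
    requested_at: datetime,
    now: datetime,
    response: Optional[str] = None,             # "approve", "reject", or None
    timeout: timedelta = timedelta(minutes=15),
) -> str:
    """Return "approved", "rejected", "auto_rejected", or "pending"."""
    if response == "approve":
        return "approved"
    if response == "reject":
        return "rejected"
    if now - requested_at >= timeout:
        return "auto_rejected"  # no response within the timeout period
    return "pending"
```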

Audit Logging

Every self-healing action is logged with complete context for compliance, troubleshooting, and continuous improvement.

Audit Log Entry Structure

{
  "audit_id": "heal_20260207_143052_a7f9d2",
  "timestamp": "2026-02-07T14:30:52.123Z",
  "database": "prod-api-db-01",
  "issue_detected": {
    "type": "long_running_query",
    "severity": "high",
    "metric_values": {
      "query_duration": "342s",
      "cpu_usage": "95%",
      "blocked_connections": 23
    }
  },
  "action_taken": {
    "type": "kill_query",
    "query_id": "1234567",
    "query_text": "SELECT * FROM orders JOIN ...",
    "user": "app_readonly",
    "execution_time": "2026-02-07T14:30:53.456Z"
  },
  "approval": {
    "required": false,
    "policy": "Low-Risk Auto-Heal"
  },
  "outcome": {
    "status": "success",
    "verification": {
      "cpu_after": "45%",
      "blocked_connections_after": 0,
      "recovery_time": "3.2s"
    }
  },
  "metadata": {
    "agent_version": "2.4.1",
    "confidence_score": 0.94,
    "similar_past_actions": 47
  }
}
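
Because entries are plain JSON, they are straightforward to post-process. The sketch below computes the delay between the entry's `timestamp` and the action's `execution_time`, assuming (the structure above does not say so explicitly) that `timestamp` records when the issue was detected. Both helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing "Z" (UTC)."""
    # Replacing "Z" keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def detection_to_action_seconds(entry: dict) -> float:
    """Seconds between the entry timestamp and the action's execution time."""
    detected = parse_ts(entry["timestamp"])
    executed = parse_ts(entry["action_taken"]["execution_time"])
    return (executed - detected).total_seconds()
```

Applied to the sample entry above, the two timestamps are 1.333 seconds apart.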

Searchable History

Query audit logs by database, action type, date range, or outcome status.
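
For exported logs, the same kind of filtering can be reproduced offline. The function below is an illustrative sketch that works on entries shaped like the audit log structure shown earlier; `search_audit_log` and its parameters are assumptions and do not represent a DB24x7 API.

```python
from datetime import datetime
from typing import List, Optional


def _parse(value: str) -> datetime:
    # Accept a trailing "Z" on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def search_audit_log(
    entries: List[dict],
    database: Optional[str] = None,
    action_type: Optional[str] = None,
    status: Optional[str] = None,
    start: Optional[str] = None,  # ISO 8601 timestamps, inclusive bounds
    end: Optional[str] = None,
) -> List[dict]:
    """Filter entries by database, action type, outcome status, and date range."""
    results = []
    for entry in entries:
        ts = _parse(entry["timestamp"])
        if database is not None and entry["database"] != database:
            continue
        if action_type is not None and entry["action_taken"]["type"] != action_type:
            continue
        if status is not None and entry["outcome"]["status"] != status:
            continue
        if start is not None and ts < _parse(start):
            continue
        if end is not None and ts > _parse(end):
            continue
        results.append(entry)
    return results
```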

Timeline View

Visualize healing actions over time with issue context and resolution metrics.

Export Reports

Generate compliance reports in JSON, CSV, or PDF format for auditing purposes.

Safety Guardrails

Multiple layers of protection are designed to keep self-healing actions from causing more harm than good.

Rate Limiting

Limits how often automated actions can run against each database, which prevents action storms and cascading failures. Default: Maximum 10 actions per hour per database.
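
A per-database limit like this is commonly implemented as a sliding window. The class below is a minimal sketch of that technique with the default of 10 actions per hour; the class name and interface are assumptions, and the agent's actual algorithm is not documented here.

```python
from collections import defaultdict, deque


class ActionRateLimiter:
    """Sliding-window limiter: at most max_actions per window per database."""

    def __init__(self, max_actions: int = 10, window_seconds: float = 3600):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # database name -> action timestamps

    def allow(self, database: str, now: float) -> bool:
        """Record and allow the action if the database is under its limit."""
        history = self._history[database]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_actions:
            return False
        history.append(now)
        return True
```

Each database gets its own history, so a burst of actions on one database does not block healing on another.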

Confidence Threshold

ML models must reach a minimum confidence score (default: 85%) before recommending actions. Low-confidence scenarios always require human review.

Rollback Capabilities

Actions that modify database state (indexes, configuration changes) include automatic rollback if post-action verification detects degraded performance.

Circuit Breaker

If repeated actions fail or cause issues, the agent enters a "safe mode" and requires manual re-enablement after investigation. Prevents automated trial-and-error loops.
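
The safe-mode behavior follows the circuit breaker pattern. The sketch below assumes that three consecutive failures trip the breaker; that number, the class name, and the method names are illustrative choices, since the text above does not state the actual threshold.

```python
class HealingCircuitBreaker:
    """Enters safe mode after repeated failures; only a manual reset re-enables it."""

    def __init__(self, max_consecutive_failures: int = 3):  # assumed threshold
        self.max_consecutive_failures = max_consecutive_failures
        self.safe_mode = False
        self._failures = 0

    def can_act(self) -> bool:
        return not self.safe_mode

    def record(self, success: bool) -> None:
        """Record an action outcome and trip the breaker on repeated failures."""
        if success:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.max_consecutive_failures:
            self.safe_mode = True

    def reset(self) -> None:
        """Manual re-enablement after investigation."""
        self.safe_mode = False
        self._failures = 0
```

Note that a success does not clear safe mode once it is set; only `reset()` does, which mirrors the manual re-enablement requirement.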

Environment Protection

Production databases can have stricter policies than staging environments. Critical production systems can require approval for all actions.

Maintenance Window Awareness

The agent respects defined maintenance windows and avoids disruptive actions during peak business hours. Windows can be configured per database and action type.

Enabling Self-Healing

Quick Start Configuration

1. Navigate to Settings > Automation > Self-Healing
2. Select your database and click Enable Self-Healing
3. Choose a preset policy (Conservative, Balanced, or Aggressive) or create custom policies
4. Configure approval workflows and notification channels
5. Start with "Learning Mode" to observe recommendations without auto-execution

Best Practice: Start with Learning Mode for 1-2 weeks to build confidence and tune policies before enabling full auto-execution.