Health Checks
Implement health checks to ensure saga workers are running correctly and can handle traffic.
Liveness vs Readiness​
Liveness Probe​
Answers: "Is the process alive?"
- Used by orchestrators to restart unhealthy containers
- Should be fast and simple
- Fails only if the process is deadlocked or crashed
Readiness Probe​
Answers: "Can the service accept traffic?"
- Used by load balancers to route traffic
- Checks if dependencies are available
- Fails during startup or degraded states
Basic Implementation​
import express from 'express';
import { createBus } from '@saga-bus/core';
const app = express();
const bus = createBus({ ... });
// Liveness - always returns OK if the process is running
app.get('/health/live', (req, res) => {
res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
// Readiness - checks if bus is ready to process messages
app.get('/health/ready', async (req, res) => {
const isReady = bus.isRunning();
if (isReady) {
res.json({
status: 'ready',
timestamp: new Date().toISOString(),
});
} else {
res.status(503).json({
status: 'not_ready',
timestamp: new Date().toISOString(),
});
}
});
Comprehensive Health Checks​
Checking Dependencies​
interface HealthCheckResult {
status: 'healthy' | 'degraded' | 'unhealthy';
checks: Record<string, ComponentHealth>;
timestamp: string;
}
interface ComponentHealth {
status: 'healthy' | 'unhealthy';
latency?: number;
message?: string;
}
async function checkHealth(): Promise<HealthCheckResult> {
const checks: Record<string, ComponentHealth> = {};
// Check bus status
checks.bus = {
status: bus.isRunning() ? 'healthy' : 'unhealthy',
};
// Check database
const dbStart = Date.now();
try {
await store.ping();
checks.database = {
status: 'healthy',
latency: Date.now() - dbStart,
};
} catch (error) {
checks.database = {
status: 'unhealthy',
message: error.message,
};
}
// Check message transport
const transportStart = Date.now();
try {
await transport.ping();
checks.transport = {
status: 'healthy',
latency: Date.now() - transportStart,
};
} catch (error) {
checks.transport = {
status: 'unhealthy',
message: error.message,
};
}
// Determine overall status
const unhealthyChecks = Object.values(checks).filter(
(c) => c.status === 'unhealthy'
);
let status: HealthCheckResult['status'];
if (unhealthyChecks.length === 0) {
status = 'healthy';
} else if (unhealthyChecks.length < Object.keys(checks).length) {
status = 'degraded';
} else {
status = 'unhealthy';
}
return {
status,
checks,
timestamp: new Date().toISOString(),
};
}
app.get('/health', async (req, res) => {
const health = await checkHealth();
const statusCode = {
healthy: 200,
degraded: 200, // Still accepts traffic
unhealthy: 503,
}[health.status];
res.status(statusCode).json(health);
});
Response Example​
{
"status": "healthy",
"checks": {
"bus": { "status": "healthy" },
"database": { "status": "healthy", "latency": 5 },
"transport": { "status": "healthy", "latency": 12 }
},
"timestamp": "2024-01-15T10:30:00Z"
}
Built-in Health Routes​
Express Integration​
import { createHealthRoutes } from '@saga-bus/express';
const healthRoutes = createHealthRoutes(bus, {
store,
transport,
additionalChecks: {
'external-api': async () => {
const response = await fetch('https://api.example.com/health');
return response.ok;
},
},
});
app.use('/health', healthRoutes);
Fastify Integration​
import { createHealthPlugin } from '@saga-bus/fastify';
await fastify.register(createHealthPlugin, {
bus,
store,
transport,
});
// Routes:
// GET /health/live
// GET /health/ready
// GET /health (detailed)
Kubernetes Configuration​
Pod Spec​
spec:
containers:
- name: saga-worker
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 0
periodSeconds: 2
failureThreshold: 30
Startup Probe​
Use startup probes for slow-starting applications:
startupProbe:
httpGet:
path: /health/live
port: 3000
failureThreshold: 30
periodSeconds: 2
Docker Health Checks​
Dockerfile​
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -f http://localhost:3000/health/live || exit 1
Docker Compose​
services:
saga-worker:
build: .
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health/live"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
AWS Health Checks​
ECS​
{
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost:3000/health/live || exit 1"
],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
Application Load Balancer​
resource "aws_lb_target_group" "saga_worker" {
name = "saga-worker"
port = 3000
protocol = "HTTP"
vpc_id = var.vpc_id
health_check {
enabled = true
path = "/health/ready"
port = "traffic-port"
protocol = "HTTP"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 30
matcher = "200"
}
}
Deep Health Checks​
Database Query Test​
async function checkDatabaseHealth(): Promise<ComponentHealth> {
const start = Date.now();
try {
// Test actual query capability
await pool.query('SELECT 1 as health_check');
// Check connection pool
const poolInfo = {
total: pool.totalCount,
idle: pool.idleCount,
waiting: pool.waitingCount,
};
return {
status: 'healthy',
latency: Date.now() - start,
details: poolInfo,
};
} catch (error) {
return {
status: 'unhealthy',
message: error.message,
};
}
}
Transport Connection Test​
async function checkTransportHealth(): Promise<ComponentHealth> {
const start = Date.now();
try {
// Verify connection is established
const isConnected = await transport.isConnected();
if (!isConnected) {
return {
status: 'unhealthy',
message: 'Transport disconnected',
};
}
// Check queue depth (optional)
const queueDepth = await transport.getQueueDepth();
return {
status: 'healthy',
latency: Date.now() - start,
details: { queueDepth },
};
} catch (error) {
return {
status: 'unhealthy',
message: error.message,
};
}
}
Circuit Breaker Pattern​
import CircuitBreaker from 'opossum';
const healthCheckBreaker = new CircuitBreaker(checkHealth, {
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
});
app.get('/health', async (req, res) => {
try {
const health = await healthCheckBreaker.fire();
res.json(health);
} catch (error) {
// Circuit is open
res.status(503).json({
status: 'unhealthy',
message: 'Health check circuit open',
timestamp: new Date().toISOString(),
});
}
});
Metrics from Health Checks​
import { Counter, Histogram } from 'prom-client';
const healthCheckDuration = new Histogram({
name: 'health_check_duration_seconds',
help: 'Duration of health checks',
labelNames: ['check'],
});
const healthCheckStatus = new Counter({
name: 'health_check_total',
help: 'Health check results',
labelNames: ['check', 'status'],
});
async function instrumentedHealthCheck() {
const checks = ['database', 'transport', 'bus'];
for (const check of checks) {
const end = healthCheckDuration.startTimer({ check });
const result = await performCheck(check);
end();
healthCheckStatus.inc({ check, status: result.status });
}
}
Best Practices​
- Keep liveness probes simple - They should never fail unless the process is truly dead
- Use appropriate timeouts - Don't let health checks hang
- Cache expensive checks - Don't overwhelm dependencies
- Include latency metrics - Track check performance
- Use startup probes - Allow time for initialization
- Return meaningful status codes - 200 for healthy, 503 for unhealthy