Container Policies
This guide explains how to use policies to monitor container health and implement custom failure handling strategies.
Motivation
Containers restart failing child processes automatically, but sometimes you need more intelligent behavior:
- Detect failure patterns: Repeated segfaults indicate serious bugs that won't fix themselves by restarting.
- Prevent resource waste: Stop trying to restart processes that will never succeed.
- Monitor health: Track failure rates and alert when thresholds are exceeded.
- Custom responses: Implement application-specific logic for different failure types.
Use policies when you need:
- Segfault detection: Stop the container after multiple segfaults indicate memory corruption.
- Failure rate monitoring: Alert or stop when children fail too frequently.
- Custom logging: Track specific failure types for debugging.
- Graceful degradation: Give unhealthy children extra time before killing them.
Default Behavior
Containers use DEFAULT = self.new.freeze unless you specify otherwise. The default policy:
- Allows children to restart indefinitely if configured with
restart: true. - Kills children immediately when health checks fail.
- Kills children immediately when startup timeouts are exceeded.
This is appropriate for most applications, but you can customize it.
Creating Custom Policies
Policies are Ruby classes that inherit from class Async::Container::Policy and override callback methods:
require "async/container"
class SegfaultDetectionPolicy < Async::Container::Policy
def initialize(max_segfaults: 3, window: 60)
@max_segfaults = max_segfaults
@segfault_rate = Async::Container::Rate.new(window: window)
end
def child_exit(container, child, status, name:, key:, **options)
if segfault?(status)
@segfault_rate.add(1)
segfault_count = @segfault_rate.total
Console.warn(self, "Segfault detected",
name: name,
count: segfault_count,
rate: @segfault_rate.per_minute
)
# Stop container if too many segfaults
if segfault_count >= @max_segfaults
Console.error(self, "Too many segfaults, stopping container",
count: segfault_count,
rate: @segfault_rate.per_second
)
container.stop(false)
end
end
end
end
controller = Async::Container::Controller.new
# Use the custom policy:
def controller.make_policy
SegfaultDetectionPolicy.new(max_segfaults: 5, window: 120)
end
# ...
controller.run
Policy Callbacks
Policies can implement these callbacks:
child_spawn
Called when a child process starts:
def child_spawn(container, child, name:, key:, **options)
Console.info(self, "Child started", name: name)
end
Use this for tracking which children are running or initializing per-child state.
child_exit
Called when a child process exits (success or failure):
def child_exit(container, child, status, name:, key:, **options)
if success?(status)
Console.info(self, "Child succeeded", name: name)
else
Console.warn(self, "Child failed",
name: name,
exit_code: exit_code(status),
signal: signal(status)
)
end
end
This is the main callback for implementing failure detection logic. You can:
- Track failure rates.
- Detect specific failure types (segfaults, aborts).
- Call
container.stopto stop the entire container.
health_check_failed
Called when a health check timeout is exceeded. The default implementation logs and kills the child:
def health_check_failed(container, child, age:, timeout:, **options)
Console.warn(self, "Health check failed", child: child, age: age)
# Send alert before killing
send_alert("Health check failed for child")
# Call default behavior
super
end
Override this to:
- Send alerts before killing.
- Give children more time (don't kill immediately).
- Implement custom recovery logic.
startup_failed
Called when a startup timeout is exceeded. The default implementation logs and kills the child:
def startup_failed(container, child, age:, timeout:, **options)
# Custom logging with more context
Console.error(self, "Child never became ready",
child: child,
age: age,
timeout: timeout
)
# Kill the child (default behavior)
super
end
Helper Methods
The class Async::Container::Policy class provides helper methods for detecting specific failure conditions:
segfault?(status)- Returns true if the process was terminated by SIGSEGV.abort?(status)- Returns true if the process was terminated by SIGABRT.killed?(status)- Returns true if the process was terminated by SIGKILL.success?(status)- Returns true if the process exited successfully.signal(status)- Returns the signal number that terminated the process.exit_code(status)- Returns the exit code.