Path Failure Detection and Recovery
Model: Operators are given wrapper streams that encapsulate the real I/O streams; all I/O errors are detected at this wrapper level.
Detection: If there is an error when reading or writing, then one of the adjacent operators must be down.
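A minimal sketch of the wrapper-stream idea, in Python. The names `MonitoredStream` and `PathStreamError` are hypothetical and do not come from the original system; the sketch only assumes that the real stream raises `OSError` or `ValueError` when the peer is gone.

```python
import io


class PathStreamError(Exception):
    """Hypothetical error type: the wrapped stream failed, so an adjacent operator is presumed down."""

    def __init__(self, direction, cause):
        super().__init__(f"{direction} failed: {cause}")
        self.direction = direction  # "read" or "write"
        self.cause = cause


class MonitoredStream:
    """Hypothetical wrapper that encapsulates a real I/O stream."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, n=-1):
        try:
            return self._raw.read(n)
        except (OSError, ValueError) as exc:
            # Detection happens here, not inside the operator's own logic.
            raise PathStreamError("read", exc) from exc

    def write(self, data):
        try:
            return self._raw.write(data)
        except (OSError, ValueError) as exc:
            raise PathStreamError("write", exc) from exc


# Usage: a closed in-memory stream stands in for a dead neighbor.
dead = io.BytesIO(b"payload")
dead.close()
try:
    MonitoredStream(dead).read()
except PathStreamError as err:
    print("adjacent operator presumed down:", err.direction)
```

Because the operator only ever sees the wrapper, the operator code itself needs no failure-detection logic.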
Policy (Both Implemented):
- Propagation: Detect error and kill self; error will propagate back to the source.
- Notification: Detect error, notify the source operator via control path and wait until streams are reset.
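The two policies can be contrasted with a small Python sketch. Everything here is illustrative: `Operator`, `Policy`, and the list used as a stand-in for the control path are assumptions, not the original implementation.

```python
from enum import Enum


class Policy(Enum):
    PROPAGATION = "propagation"
    NOTIFICATION = "notification"


class Operator:
    """Hypothetical operator; `control` is a plain list standing in for the control path."""

    def __init__(self, name, policy, control):
        self.name = name
        self.policy = policy
        self.control = control
        self.state = "running"

    def on_stream_error(self, error):
        if self.policy is Policy.PROPAGATION:
            # Kill self: the neighbors then see stream errors of their own,
            # so the failure travels hop by hop back to the source.
            self.state = "dead"
        else:
            # Notify the source over the control path, then wait for a reset.
            self.control.append((self.name, str(error)))
            self.state = "waiting"

    def reset_streams(self):
        # Assumed to be triggered by the source once the path is rebuilt.
        if self.state == "waiting":
            self.state = "running"


control_path = []
op = Operator("op2", Policy.NOTIFICATION, control_path)
op.on_stream_error(OSError("broken pipe"))
print(op.state, control_path)
```

Propagation needs no control path but tears down every operator between the failure and the source; notification keeps the surviving operators alive at the cost of a separate signaling channel.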
Recovery:
- Handled by the source operator.
- Intermediate operators that are still running are killed (soft state + thread); the source and destination operators are kept alive.
- Reinstantiate the path and, if successful, reimplement it; otherwise look for a new logical path and repeat the process of instantiation and implementation.
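The recovery steps above can be sketched as a loop run by the source operator. This is a sketch under stated assumptions: `recover` and its callable parameters (`find_logical_path`, `instantiate`, `implement`, `kill`) are hypothetical names, and a path is modeled as a simple list with the source first and the destination last.

```python
def recover(current_path, find_logical_path, instantiate, implement, kill):
    """Hypothetical source-side recovery loop; all parameters are assumed callables."""
    # Kill the intermediate operators; the source (first) and
    # destination (last) are kept alive. `kill` is assumed to be
    # harmless for operators that are already down.
    for op in current_path[1:-1]:
        kill(op)

    path = current_path
    while path is not None:
        # Instantiate the path and implement it only if that succeeded.
        if instantiate(path):
            implement(path)
            return path
        # Otherwise look for a new logical path and repeat.
        path = find_logical_path()
    return None  # no candidate path could be instantiated


# Usage with stand-in callables: only paths through "y" can be instantiated.
candidates = iter([["src", "x", "dst"], ["src", "y", "dst"]])
new_path = recover(
    ["src", "a", "b", "dst"],
    find_logical_path=lambda: next(candidates, None),
    instantiate=lambda p: "y" in p,
    implement=lambda p: None,
    kill=lambda op: None,
)
print(new_path)
```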
End-to-End Argument:
- Recovery does not guarantee that no data is lost. A fully transactional queuing system could provide that guarantee, but some streaming architectures prefer dropped packets to delays.
- The service endpoint provides reliability, if needed.
- Paths notify endpoints of potential data loss.