
Error Handling

This document describes how the SDK handles and surfaces errors.

The SDK treats errors as observable events, not silent failures.


Internal SDK errors

The SDK logs its own errors using the same logging system exposed to users.

Internal trace IDs

SDK internal logs use predictable trace IDs:

f"datanadhi-internal-{module_name}"
f"datanadhi-async-worker"
f"datanadhi-drain-worker"
f"datanadhi-health-monitor"

Because the prefixes are fixed, SDK log entries can be filtered out of application logs with a simple prefix check.
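A minimal filtering sketch; it assumes log records are available as dictionaries with a trace_id field, which is an illustrative record shape, not an SDK API:

SDK_TRACE_PREFIXES = (
    "datanadhi-internal-",
    "datanadhi-async-worker",
    "datanadhi-drain-worker",
    "datanadhi-health-monitor",
)

def is_sdk_log(record: dict) -> bool:
    # str.startswith accepts a tuple of candidate prefixes
    return record.get("trace_id", "").startswith(SDK_TRACE_PREFIXES)

logs = [
    {"trace_id": "datanadhi-async-worker", "msg": "Worker error"},
    {"trace_id": "req-42", "msg": "User login"},
]
app_logs = [r for r in logs if not is_sdk_log(r)]  # keeps only the "req-42" entry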

Log level filtering

The SDK uses a separate log level threshold, datanadhi_log_level (default: INFO):

def can_log(self, incoming_level: int) -> bool:
    return incoming_level > self.config["datanadhi_log_level"]

# Only log SDK errors if they exceed the threshold
if _datanadhi_internal and self.can_log(logging.WARNING):
    self.logger.warning("SDK warning", ...)
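For concreteness, this is how the strict greater-than comparison behaves with Python's standard level values. A self-contained sketch; how the config dict is populated in practice is an assumption:

import logging

config = {"datanadhi_log_level": logging.INFO}  # the documented default

def can_log(incoming_level: int) -> bool:
    # Mirrors the SDK check above: strictly greater than the threshold
    return incoming_level > config["datanadhi_log_level"]

assert can_log(logging.WARNING)   # 30 > 20: SDK warnings pass
assert can_log(logging.ERROR)     # 40 > 20: SDK errors pass
assert not can_log(logging.INFO)  # 20 > 20 is False: INFO-level SDK logs are suppressed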

Example SDK logs

No rules configured:

logger.warning(
    "No rules set! Defaulting to stdout True",
    context={
        "reason": "Rules Empty",
        "datanadhi_dir": str(datanadhi_dir),
    },
    trace_id="datanadhi-internal-my_app",
    _datanadhi_internal=True,
)

EchoPost binary download failed:

logger.warning(
    "Unable to download EchoPost binary",
    context={"type": "network_error", "detail": "..."},
    trace_id="datanadhi-internal-my_app",
    _datanadhi_internal=True,
)

Worker errors:

logger.error(
    "Worker error",
    context={"error": str(e)},
    trace_id="datanadhi-async-worker",
    _datanadhi_internal=True,
)

Properties

  • SDK errors never interfere with user logs
  • Always include structured context
  • Respect dedicated log level threshold
  • Use predictable trace IDs for filtering

SDK log level

The SDK's internal log level (datanadhi_log_level) is independent of the application's log level.

This allows users to:

  • See SDK errors and warnings
  • Filter SDK noise independently
  • Debug delivery and configuration issues

SDK logs are never hidden silently.
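For example, SDK output can be quieted without touching application logging. A sketch; wiring the two levels together like this is an assumption about the configuration surface, not documented behavior:

import logging

logging.getLogger("my_app").setLevel(logging.DEBUG)  # application stays verbose

# With the strict > check shown earlier, a WARNING threshold means only
# SDK logs above WARNING (i.e. ERROR and CRITICAL) are emitted.
config = {"datanadhi_log_level": logging.WARNING}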


Error propagation

Errors inside the SDK:

  • Are logged to the user's logging system
  • Do not raise exceptions in user code
  • Do not break application execution

Delivery failures are isolated from logging calls.
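A sketch of what this boundary looks like in practice; the method names log and _enqueue are hypothetical, chosen only to illustrate the pattern:

def log(self, message: str, **kwargs) -> None:
    """Public entry point: exceptions never escape to the caller."""
    try:
        self._enqueue(message, kwargs)  # hypothetical internal submission step
    except Exception as e:
        # Report through the SDK's own channel instead of raising
        self.logger.error(
            "Log submission failed",
            context={"error": str(e)},
            _datanadhi_internal=True,
        )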

Exception handling patterns

Rule evaluation:

try:
    pipelines, stdout = evaluate_rules(log_dict, rules)
except Exception:
    # Silently return empty on evaluation error
    return [], False

Worker loops:

try:
    while not shutdown:
        item = queue.get()
        process_item(item)
except Exception as e:
    # Surface the failure through the SDK's own logging channel
    logger.error(
        "Worker error",
        context={"error": str(e)},
        _datanadhi_internal=True,
    )
finally:
    # Always release the session, even after a failure
    session.close()

EchoPost communication:

try:
    sent = send_log_over_unix_grpc(...)
    if sent:
        queue.task_done()
    else:
        # Delivery failed: return the item to the queue for retry
        queue.writeback_batch([item])
except Exception as e:
    logger.error("Echopost error", ...)
    queue.writeback_batch([item])

Key principles

  • Exceptions caught at boundaries
  • Errors logged, never raised to user
  • Failed items requeued or dropped
  • Application continues unaffected

Failure isolation and data persistence

The SDK is designed so that:

  • Logging continues even during partial failure
  • Delivery failures do not cascade
  • Recovery is automatic when possible

Data loss is explicit and treated as a last resort.

Dropped data handling

When items are permanently dropped, they are stored to disk for later recovery:

from datanadhi.utils.files import store_dropped_data

file_path = store_dropped_data(
    datanadhi_dir,
    items,   # list of (pipelines, payload) tuples
    reason,  # "primary_failed", "fallback_failed", or "drain_worker_failed"
)

Storage format

<datanadhi_dir>/.dropped/
├── primary_failed_20250129_143022_abc123.json
├── fallback_failed_20250129_143025_def456.json
└── drain_worker_failed_20250129_143030_ghi789.json

Each file contains:

{
  "timestamp": "2025-01-29T14:30:22Z",
  "reason": "primary_failed",
  "items": [
    {
      "pipelines": ["pipeline-1"],
      "log_data": { ... }
    }
  ]
}
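Because each file is self-describing, dropped data is easy to audit. A short sketch that tallies dropped items per reason; the directory path is illustrative:

import json
from collections import Counter
from pathlib import Path

datanadhi_dir = Path("/var/lib/my_app/datanadhi")  # illustrative location

reasons = Counter()
for path in (datanadhi_dir / ".dropped").glob("*.json"):
    data = json.loads(path.read_text())
    reasons[data["reason"]] += len(data["items"])

print(reasons)  # e.g. Counter({'primary_failed': 12, 'fallback_failed': 3})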

When data is dropped

Scenario                         Action
Primary server 4xx/5xx error     Drop + store to disk
Fallback server 4xx/5xx error    Drop + store to disk
Drain worker send failed         Drop + store to disk
Queue full on submit             Return False, log not submitted
Exception in worker              Drop + mark done (prevents blocking)
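Of these, the queue-full case is the only one the caller observes directly. A sketch, assuming the submission call returns a boolean as described above; client.submit and handle_backpressure are hypothetical names:

accepted = client.submit(log_dict)  # hypothetical: returns False when the queue is full
if not accepted:
    # The log was not enqueued; the application decides whether to retry or drop
    handle_backpressure(log_dict)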

Recovery

Dropped data files can be:

  • Manually replayed via scripts (see the sketch after this list)
  • Ingested via batch upload endpoints
  • Used for debugging and auditing
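As referenced in the first item above, a minimal replay sketch. It assumes a user-supplied resubmit(pipelines, log_data) callable, for example a batch-upload client; the file layout follows the storage format shown earlier:

import json
from pathlib import Path

def replay_dropped(datanadhi_dir: str, resubmit) -> int:
    """Re-send every item found under .dropped/; returns the count replayed."""
    replayed = 0
    for path in sorted(Path(datanadhi_dir, ".dropped").glob("*.json")):
        payload = json.loads(path.read_text())
        for item in payload["items"]:
            resubmit(item["pipelines"], item["log_data"])
            replayed += 1
    return replayed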

Design intent

Errors should be:

  • Visible
  • Structured
  • Non-blocking
  • Recoverable

The SDK favors correctness and observability over aggressive retries.