
Lessons Learned

SQLite Concurrent Access with Encrypted Databases

Problem: Backend service and CLI commands accessing the same encrypted SQLite database caused "file is not a database" errors and backend crashes.

Root Cause: Concurrent access conflicts between multiple processes with persistent connections to encrypted database files.

Failed Approaches: Connection health checks, fresh connections per operation, and database proxy patterns all failed due to performance or architectural constraints.

Solution: Multi-layered approach combining WAL mode (PRAGMA journal_mode=WAL), busy timeout (PRAGMA busy_timeout=10000), comprehensive retry logic for lock/corruption errors, and exponential backoff with forced reconnection.
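
A minimal sketch of the combined mitigations, using the standard sqlite3 module for illustration (an encrypted build such as SQLCipher adds a keying step but honors the same PRAGMAs); function names and thresholds here are assumptions, not the actual implementation:

import sqlite3
import time

def open_connection(path):
    conn = sqlite3.connect(path, timeout=10)
    conn.execute("PRAGMA journal_mode=WAL")    # concurrent readers during writes
    conn.execute("PRAGMA busy_timeout=10000")  # wait up to 10 s on a locked database
    return conn

def execute_with_retry(path, sql, params=(), max_attempts=5):
    delay = 0.1
    conn = open_connection(path)
    try:
        for attempt in range(max_attempts):
            try:
                return conn.execute(sql, params).fetchall()
            except sqlite3.DatabaseError:
                # Covers both "database is locked" and "file is not a database"
                conn.close()
                if attempt == max_attempts - 1:
                    raise
                time.sleep(delay)
                delay *= 2                     # exponential backoff
                conn = open_connection(path)   # forced reconnection
    finally:
        conn.close()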

Key Lessons: Encrypted SQLite requires special handling beyond standard concurrency patterns. No single mitigation works on its own; WAL mode, timeouts, retry logic, and connection recovery are all needed together. CLI independence is non-negotiable. Integration testing with concurrent processes is essential.

Windows Unicode Encoding and Graceful Shutdown

Problem: Backend server crashed immediately on startup with UnicodeEncodeError when using emoji characters in console output on Windows.

Root Cause: Windows Command Prompt uses CP1252 encoding by default, which cannot handle Unicode emoji characters (πŸš€, πŸ“‘, etc.).

Failed Approaches: Complex signal handler coordination, asyncio signal handlers, and Uvicorn signal override attempts all failed because the server never started successfully.

Solution: Remove emoji characters from console output and implement file-based shutdown mechanism (gateway.shutdown file in runtime directory) instead of signal-based approaches.
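
A sketch of the file-based shutdown watcher, assuming an asyncio context and a uvicorn.Server instance; only the gateway.shutdown filename comes from the actual fix, the path and polling interval are illustrative:

import asyncio
from pathlib import Path

SHUTDOWN_FILE = Path("runtime") / "gateway.shutdown"  # assumed runtime dir location

async def watch_for_shutdown(server) -> None:
    # The CLI requests a graceful stop by creating the file: SHUTDOWN_FILE.touch()
    while not SHUTDOWN_FILE.exists():
        await asyncio.sleep(0.5)
    SHUTDOWN_FILE.unlink(missing_ok=True)  # consume the request
    server.should_exit = True              # uvicorn's cooperative shutdown flag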

Key Lessons: Simple encoding issues can masquerade as complex architectural problems. File-based IPC is more reliable than signals for cross-platform graceful shutdown. Always test with actual Windows console environments, not just IDE terminals.

Windows Background Processes and Console Windows

Problem: Background processes spawned via subprocess still showed CMD windows on Windows despite various creation flags and STARTUPINFO configurations.

Root Cause: Using python.exe always creates a console window on Windows, regardless of subprocess creation flags.

Solution: Use pythonw.exe (windowless Python interpreter) instead of python.exe for background processes.
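
A hedged sketch of launching a background service this way; the module name is hypothetical, and the pythonw.exe lookup assumes a standard Windows install where it sits next to python.exe:

import subprocess
import sys
from pathlib import Path

# Swap in the windowless interpreter that ships alongside python.exe
pythonw = Path(sys.executable).with_name("pythonw.exe")

# The executable itself prevents the console window; no creation flags needed
subprocess.Popen([str(pythonw), "-m", "backend.server"])  # hypothetical module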

Key Lessons: The executable itself determines console behavior on Windows. Subprocess flags cannot override the fundamental nature of console vs. windowless executables.

FastAPI Decorator Signature Preservation

Problem: Admin API endpoints generated incorrect args and kwargs parameters in OpenAPI spec, causing "Field required" errors despite correct authentication.

Root Cause: Custom exception handling decorator not using @functools.wraps(func), causing FastAPI to lose original function signatures.

Solution: Add @functools.wraps(func) to preserve function metadata in decorators.
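
A minimal sketch of the pattern; the decorator name, route, and error mapping are illustrative:

import functools

from fastapi import FastAPI, HTTPException

app = FastAPI()

def handle_errors(func):
    # functools.wraps copies __name__/__doc__ and sets __wrapped__, which
    # inspect.signature() (and hence FastAPI's OpenAPI generation) follows,
    # so the endpoint keeps its real parameters instead of *args/**kwargs.
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except ValueError as exc:
            raise HTTPException(status_code=400, detail=str(exc))
    return wrapper

@app.get("/admin/status")
@handle_errors
async def admin_status(verbose: bool = False):
    return {"ok": True, "verbose": verbose}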

Key Lessons: FastAPI relies on function signatures for OpenAPI generation. Always use @functools.wraps in decorators to preserve metadata.

ZMQ Log Transport Connection Management

Problem: ZMQ logging transport never sent logs to database despite broker being available and transport initialized.

Root Cause: mark_broker_ready() only set _broker_available flag but never connected the MessageBusClient. Transport readiness check required both broker availability AND client connection.

Solution: Modified mark_broker_ready() to immediately schedule client connection when broker becomes available.
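
A sketch of the fix, assuming mark_broker_ready() is called from inside the running event loop; everything except that method name is assumed for illustration:

import asyncio

class ZMQLogTransport:
    def __init__(self, client):
        self._broker_available = False
        self._client = client  # MessageBusClient, constructed but not connected

    def mark_broker_ready(self) -> None:
        self._broker_available = True
        # Setting the flag alone was the bug: also schedule the connection now,
        # so the readiness check (broker available AND client connected) can pass.
        asyncio.get_running_loop().create_task(self._client.connect())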

Key Lessons: Broker availability β‰  Client connection. Both conditions must be true for transport readiness. Always verify end-to-end message flow, not just individual component states.

Circular Import Dependencies in Logging Systems

Problem: ZMQ logging transport silently failed to initialize, causing all logs to fall back to direct database writes (which also failed silently).

Root Cause: Circular import dependency prevented MessageBusClient from being imported at module load time. The logging module tried to import from bus.py, which imported from config.py, creating a circular reference back to logging components.

Debugging Challenges:
- Silent import failures wrapped in try/except blocks
- Multiple fallback mechanisms masked the core issue
- Async connection timing made it hard to trace when connections weren't happening
- No obvious error messages pointing to import failures

Solution: Replace module-level imports with lazy imports inside initialization methods to break the circular dependency chain.
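
A sketch of the lazy-import pattern; the import path follows the bus.py location mentioned later in this document but is an assumption:

# Eager, module-level import recreates the cycle
# (logging module -> bus.py -> config.py -> logging module):
# from aico.core.bus import MessageBusClient  # fails: module partially initialized

class ZMQLogTransport:
    def initialize(self) -> None:
        # Lazy import inside the method: by the time initialize() runs,
        # every module in the cycle has finished loading.
        from aico.core.bus import MessageBusClient  # assumed import path
        self._client = MessageBusClient("zmq_log_transport")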

Key Lessons:
- Circular imports cause silent failures that cascade through dependent systems
- Always add debug output to import failures, even in try/except blocks
- Lazy imports can resolve circular dependencies when modules need each other
- Test import paths independently before testing full system integration
- Silent fallbacks should log warnings, not fail completely silently

LogBuffer Flush Timing for Startup Log Capture

Problem: LogBuffer successfully captured all startup logs but they weren't persisting to database. Buffered logs were being flushed immediately when ZMQ broker became available, but LogConsumer hadn't subscribed to the logs/ topic yet.

Root Cause: Timing mismatch between broker availability and consumer readiness. The buffer flush occurred before the LogConsumer service was fully connected and subscribed to receive messages.

Failed Approach: Flushing buffer immediately when mark_broker_ready() was called, assuming broker availability meant the entire logging pipeline was ready.

Solution: Implement two-phase buffer flush timing:
1. Mark ZMQ transport broker as ready (enable direct logging for new messages)
2. Wait for LogConsumer to connect and subscribe to the logs/ topic
3. Add a brief delay (100ms) for the subscription to be fully established
4. Then flush all buffered startup logs

Implementation: Modified _notify_log_transport_broker_ready() to only set _broker_available = True without flushing, then added _connect_log_consumer_and_flush_buffer() to coordinate timing.
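
A hedged sketch of the coordination step; the two method names and the logs/ topic come from the text above, while the consumer, buffer, and transport APIs are assumptions:

import asyncio

async def connect_log_consumer_and_flush_buffer(log_consumer, log_buffer, transport):
    # Phase 1 already happened in _notify_log_transport_broker_ready():
    # the broker-available flag is set, so new records flow directly.
    await log_consumer.connect()            # assumed MessageBusClient-style API
    await log_consumer.subscribe("logs/")   # topic name from the text
    await asyncio.sleep(0.1)                # let the subscription fully establish
    for record in log_buffer.drain():       # assumed LogBuffer API
        transport.send(record)              # phase 2: flush buffered startup logs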

Key Lessons:
- Broker availability ≠ Consumer readiness: both must be verified before buffer flush
- LogBuffer timing is critical: too early = lost messages, too late = delayed visibility
- Always coordinate the buffer flush with the actual message consumer, not just the broker
- Brief delays after subscription establishment prevent race conditions
- End-to-end message flow testing is essential for distributed logging systems

Logging Recursion Loops and Infrastructure Safety

Problem: Logging systems can create infinite recursion loops when logging code attempts to log its own operations, causing stack overflow crashes.

Root Cause: When ZMQ transport, message bus operations, database writes, or log consumers try to use standard logging (logger.info(), logger.error()), they create recursive loops where each log attempt generates another log message.

Critical Pattern: Use print() statements instead of loggers within logging infrastructure code.

Solution:

# CORRECT - debugging logging system itself
def _send_log_to_zmq(self, log_data):
    try:
        self.socket.send_json(log_data)
        print(f"[ZMQ TRANSPORT] Log sent successfully")  # βœ… Safe
    except Exception as e:
        print(f"[ZMQ TRANSPORT] Failed to send: {e}")    # βœ… Safe
        # self.logger.error(f"Send failed: {e}")         # ❌ RECURSION!

Key Lessons:
- NEVER log within logging transport, message bus, database writes, or log handler code
- Use print() statements for debugging logging infrastructure itself
- Recursion: Logger → ZMQ → Logger → ZMQ → Stack Overflow
- The infrastructure logger pattern helps but doesn't eliminate recursion risk
- Test logging infrastructure separately from application logging

CurveZMQ Multi-Process Authentication Keys

Problem: After enabling CurveZMQ encryption, modelservice logs stopped reaching the database despite successful connection messages. Backend logs worked fine, but all modelservice logs were silently dropped.

Root Cause: Each Python process generated different shared broker keys in memory. Backend process generated one set of CurveZMQ keys, modelservice process generated different keys. When modelservice tried to authenticate with the broker using mismatched keys, authentication failed silently and messages were dropped.

Debugging Challenge: CurveZMQ authentication failures are silent - no error messages, connections appear successful, but messages are dropped at the protocol level.

Failed Approach: Initializing _SHARED_BROKER_KEYS = None at module level and generating the keypair at runtime, assuming the result would be shared across processes. Module-level state is only shared within a single process.

Solution: Replace runtime key generation with fixed static keys that all processes can use:

# Before (broken across processes)
_SHARED_BROKER_KEYS = None

def _get_shared_broker_keys():
    global _SHARED_BROKER_KEYS
    if _SHARED_BROKER_KEYS is None:
        # zmq.curve_keypair() yields a different keypair in every process!
        _SHARED_BROKER_KEYS = zmq.curve_keypair()
    return _SHARED_BROKER_KEYS

# After (works across processes): fixed Z85-encoded static keys
_SHARED_BROKER_PUBLIC = "Yne@$w-vo<fVvi]a<NY6T1ed:M$fCG*[IaLV{hID"
_SHARED_BROKER_SECRET = "D:)Q[IlAW!ahhC2ac:9*A}h:p?([4%wOTJ%JR%cs"

Key Lessons:
- CurveZMQ authentication failures are completely silent: no errors, just dropped messages
- Shared keys must be truly shared across processes, not generated per process
- Multi-process systems need persistent/static keys for CurveZMQ authentication
- Always test encryption with multiple separate processes, not just single-process tests
- CurveZMQ debugging requires message-level tracing; connection success doesn't guarantee message delivery

ZMQ Client Authorization and Encrypted Message Bus

Problem: ZMQ log transport showed "LOG SENT successfully" messages but logs never reached the database. Transport appeared to connect successfully but messages were silently dropped.

Root Cause: The "zmq_log_transport" client was not included in the message bus broker's authorized client list. Without authorization, the client couldn't authenticate with CurveZMQ encryption, causing all messages to be silently rejected by the broker.

Debugging Challenge:
- Client appeared to connect successfully (connected=True)
- No error messages or authentication failures reported
- Messages showed "sent successfully" but never reached the log consumer
- Connection logs looked normal, masking the authorization failure

Solution: Add the ZMQ log transport client to the authorized client list:

# In shared/aico/core/bus.py
AUTHORIZED_CLIENTS = [
    "backend.api_gateway",
    "conversation_engine", 
    "message_bus_client_modelservice",
    "zmq_log_transport",  # ← Must be explicitly authorized
    "log_consumer"
]

Also ensure the client is included in security keypair generation:

# In cli/commands/security.py  
COMPONENTS = [
    "backend.api_gateway",
    "conversation_engine",
    "message_bus_client_modelservice", 
    "zmq_log_transport",  # ← Must have CurveZMQ keypair
    "log_consumer"
]

Key Lessons:
- Every ZMQ client must be explicitly authorized in the broker's client list
- CurveZMQ encryption requires both shared broker keys AND individual client authorization
- Authorization failures are completely silent: no errors, just dropped messages
- Client connection success != message delivery authorization
- Always verify end-to-end message flow when adding new ZMQ clients
- Authorization is separate from authentication; both must be configured correctly