1 files changed, 117 insertions, 0 deletions
diff --git a/doc/c10m.adoc b/doc/c10m.adoc
new file mode 100644
index 0000000..081d4a8
--- /dev/null
+++ b/doc/c10m.adoc
@@ -0,0 +1,117 @@
+    = C10M: 10 Million Concurrent Connections
+    :author: papod
+    :revdate: 2026-04-24
+
+    == Goal
+
+    Scale a single papod process to 10 million concurrent connections
+    across multiple networks. This is a C10M target.
+
+    == Architecture: Virtual Threads
+
+    Each client connection runs in its own Java virtual thread
+    (`Thread/ofVirtual`). Virtual threads are cheap (~1KB stack) and
+    multiplexed onto a small pool of carrier threads (default: CPU core
+    count).
+
+    This works well *if* virtual threads never pin their carrier thread.
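+
+    A minimal sketch of the thread-per-connection model, assuming JDK 21+.
+    `handle-connection!` and `serve-all` are hypothetical names standing in
+    for papod's real per-client loop:
+
+    [source,clojure]
+    ----
+    (import '(java.util.concurrent Executors ExecutorService TimeUnit))
+
+    ;; Hypothetical stand-in for the per-client read/dispatch loop.
+    (defn handle-connection! [served id]
+      (Thread/sleep 10)          ; blocking here parks the virtual thread
+      (swap! served conj id))
+
+    (defn serve-all
+      "Runs one virtual thread per id and returns the set of ids served."
+      [ids]
+      (let [served (atom #{})
+            ^ExecutorService exec (Executors/newVirtualThreadPerTaskExecutor)]
+        (doseq [id ids]
+          (.execute exec (fn [] (handle-connection! served id))))
+        (.shutdown exec)
+        (.awaitTermination exec 30 TimeUnit/SECONDS)
+        @served))
+
+    (count (serve-all (range 10000)))
+    ----
+
+    The sleeping threads share the carrier pool because `Thread/sleep`
+    unmounts the virtual thread instead of blocking its carrier.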
+
+    == Pinning Problem
+
+    A virtual thread *pins* its carrier thread when it blocks inside a
+    `synchronized` block (on JDK 21 through 23) or while a native (JNI)
+    frame is on its stack. JDK 24 removed the `synchronized` case
+    (JEP 491); native frames still pin. The carrier thread cannot run
+    other virtual threads while pinned. With ~8–16 carrier threads and
+    millions of virtual threads, even a small percentage pinning
+    simultaneously exhausts the carrier pool and stalls the entire server.
+
+    === Datomic
+
+    Datomic Free/Pro was written before Project Loom. Its internals almost
+    certainly use `synchronized` rather than `java.util.concurrent.locks`.
+    Every `@(d/transact ...)` and `(d/q ...)` call may pin.
+
+    At IRC-hobby scale (hundreds of connections) this is harmless. At C10M
+    it is a showstopper.
+
+    === Diagnosis
+
+    Run with `-Djdk.tracePinnedThreads=short` under load to identify
+    pinning sites (`full` prints the complete stack trace instead). The
+    flag is available on JDK 21 through 23.
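+
+    Pinning is also reported through the JDK Flight Recorder event
+    `jdk.VirtualThreadPinned`, which can be consumed in-process. A hedged
+    sketch, assuming JDK 21+; `start-pin-monitor!` is a hypothetical
+    helper name, not part of papod:
+
+    [source,clojure]
+    ----
+    (import '(jdk.jfr.consumer RecordingStream RecordedEvent)
+            '(java.time Duration)
+            '(java.util.function Consumer))
+
+    (defn start-pin-monitor!
+      "Streams jdk.VirtualThreadPinned events to on-pin. Returns the
+      RecordingStream; call .close on it to stop."
+      [on-pin]
+      (let [rs (RecordingStream.)]
+        (-> (.enable rs "jdk.VirtualThreadPinned")
+            (.withThreshold (Duration/ofMillis 0))   ; report every pin
+            (.withStackTrace))
+        (.onEvent rs "jdk.VirtualThreadPinned"
+                  (reify Consumer
+                    (accept [_ e] (on-pin e))))
+        (.startAsync rs)
+        rs))
+
+    ;; Usage: print how long each pin lasted and where it happened.
+    (comment
+      (start-pin-monitor!
+        (fn [^RecordedEvent e]
+          (println "pinned for" (.getDuration e))
+          (println (.getStackTrace e)))))
+    ----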
+
+    == Mitigation Strategies
+
+    === 1. Bounded Datomic thread pool
+
+
+    Offload all `d/transact` and `d/q` calls to a fixed-size pool of
+    platform threads. Virtual threads submit work and wait on a
+    `CompletableFuture` with `.get` or `.join`, which parks the virtual
+    thread without pinning.
+
+    [source,clojure]
+    ----
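+    ;; Assumes (require '[datomic.api :as d]) in the enclosing ns.
+    (import '(java.util.concurrent CompletableFuture Executors ExecutorService))
+
+    ;; Fixed pool of platform threads; only these threads call into Datomic.
+    (def ^:private ^ExecutorService datomic-pool
+      (Executors/newFixedThreadPool 64))
+
+    (defn transact-async
+      "Runs d/transact on datomic-pool; returns a CompletableFuture of the result."
+      [conn tx-data]
+      (let [cf (CompletableFuture.)]
+        ;; .execute accepts only a Runnable, which avoids the
+        ;; Runnable/Callable overload ambiguity of .submit.
+        (.execute datomic-pool
+          (fn []
+            (try
+              (.complete cf @(d/transact conn tx-data))
+              ;; Catch Throwable so cf always completes, even on Errors.
+              (catch Throwable t
+                (.completeExceptionally cf t)))))
+        cf))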
+    ----
+
+    Pros:: Simple, contained. Virtual threads never touch Datomic directly.
+    Cons:: Adds a queue. Pool size must be tuned (too small = backpressure,
+    too large = Datomic contention).
+
+    === 2. Use `d/transact-async` natively
+
+    Datomic's `d/transact-async` returns a future without blocking the
+    caller. Restructure the reply pipeline to be async: `handle-privmsg`
+    returns a deferred result, `send-replies!` awaits it.
+
+    Pros:: No extra pool. Uses Datomic's own async path.
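+    Cons:: Requires restructuring the entire request→reply pipeline from
+    synchronous to async. Large change. Covers writes only: the peer API
+    has no async counterpart to `d/q`, so queries still run on the caller.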
+
+    === 3. Replace Datomic Free
+
+    Use a database client that is virtual-thread-safe:
+
+    - XTDB v2 (built on Arrow/Kafka, Java 21 aware)
+    - Raw JDBC + HikariCP (virtual-thread-friendly since 5.1.0)
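+    - SQLite via JDBC (single-writer; JNI-based drivers such as xerial
+      `sqlite-jdbc` occupy the carrier thread for the length of each native
+      call, so measure before adopting)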
+
+
+    Pros:: Eliminates the root cause.
+    Cons:: Migration cost. Loss of Datomic's immutable history model (which
+    papod relies on for CHATHISTORY, edit history, audit).
+
+    == Recommendation
+
+    Start with *Strategy 1* (bounded thread pool). It is the smallest
+    change, keeps the existing synchronous handler signatures, and can be
+    implemented incrementally:
+
+    1. Wrap `d/transact` and `d/q` calls behind helper functions.
+    2. Run those helpers on a fixed platform-thread pool.
+    3. Virtual threads call `(.get future)` on the result. This parks the
+       virtual thread and releases its carrier, so nothing is pinned.
+
+    Revisit if the pool becomes a bottleneck under load testing.
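+
+    The three steps can be sketched as one generic helper. This is a
+    sketch under assumptions: `db-pool`, `on-db-pool`, and the pool size
+    of 64 are illustrative names and values, and the Datomic calls appear
+    only in comments.
+
+    [source,clojure]
+    ----
+    (import '(java.util.concurrent Callable ExecutionException
+                                   ExecutorService Executors))
+
+    ;; Step 2: a fixed pool of platform threads reserved for database calls.
+    (def ^ExecutorService db-pool (Executors/newFixedThreadPool 64))
+
+    ;; Steps 1 and 3: run thunk f on the pool and wait for its result.
+    ;; The caller keeps a synchronous signature; a virtual-thread caller
+    ;; parks in .get and frees its carrier.
+    (defn on-db-pool [f]
+      (try
+        (.get (.submit db-pool ^Callable f))
+        (catch ExecutionException e
+          ;; Rethrow the original failure rather than the wrapper.
+          (throw (or (.getCause e) e)))))
+
+    ;; Hypothetical call sites, assuming (require '[datomic.api :as d]):
+    ;;   (on-db-pool #(d/q query db))
+    ;;   (on-db-pool #(deref (d/transact conn tx-data)))
+    (on-db-pool #(+ 1 2))
+    ----
+
+    The call to `on-db-pool` must itself sit outside any `synchronized`
+    block, otherwise the wait in `.get` can pin on older JDKs.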
+
+    == Open Questions
+
+    - What is the actual pinning profile of Datomic Free under load?
+      (Measure before optimizing.)
+    - Should the persistence layer be swappable (protocol/interface) to
+      allow future migration?
+    - Is the in-memory `clients` atom (`atom {}` with 10M entries)
+      performant enough, or does it need a concurrent map?
+    - At C10M, `doseq` over channel members for PRIVMSG fan-out is O(n).
+      Channels with 100K members need a different broadcast strategy.
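+
+    For the `clients` question, one option to benchmark against the atom
+    is a `ConcurrentHashMap`. A sketch only; `register!`, `unregister!`,
+    and `lookup` are hypothetical names, not papod's API:
+
+    [source,clojure]
+    ----
+    (import '(java.util.concurrent ConcurrentHashMap))
+
+    ;; Updates lock one bin each instead of retrying a compare-and-set on
+    ;; a single root, as swap! on an atom does under contention.
+    (def ^ConcurrentHashMap clients (ConcurrentHashMap.))
+
+    (defn register!   [id conn] (.put clients id conn))
+    (defn unregister! [id]      (.remove clients id))
+    (defn lookup      [id]      (.get clients id))
+
+    (register! 42 :conn-42)
+    (lookup 42)
+    ----
+
+    The trade-off is that a mutable map gives up the atom's consistent
+    point-in-time snapshot, which matters if any code iterates over all
+    clients.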
+