More documentation

author: Ivar Refsdal <ivar.refsdal@nsd.no> 2021-09-15 20:44:19 +0200
committer: Ivar Refsdal <ivar.refsdal@nsd.no> 2021-09-15 20:44:19 +0200
commit: 0e67ff75e079aa601375b25346293a53385623d9 (patch)
tree: a5f95975cb0263a8f31873f8669a53feed0d30a8
parent: Only keep handlers from old config (diff)
download: fiinha-0e67ff75e079aa601375b25346293a53385623d9.tar.gz fiinha-0e67ff75e079aa601375b25346293a53385623d9.tar.xz
1 files changed, 90 insertions, 2 deletions
diff --git a/README.md b/README.md
index cdf8af3..e747b97 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ On-prem only.
                      ; This includes nil, false and everything else.
                      (log/info "got payload" payload))))
 
-; Start threadpool
+; Start threadpool that picks up queue jobs
 (yq/start!)
 
 ; Queue a job
@@ -85,7 +85,6 @@ The queue way to solve this would be:
   (let [{:some/keys [id] :as db-item} (process user-input)
     @(d/transact conn [db-item
                        (yq/put :get-ext-ref {:id id})])))
-
 ```
 
 Here `post-handler` will always succeed as long as the transaction commits.
@@ -103,3 +102,92 @@ the database function [:db/cas (compare-and-swap)](https://docs.datomic.com/on-p
 to achieve a write-once behaviour.
 The yoltq system treats cas failures as job successes
 when a consumer has `:allow-cas-failure?` set to `true` in its options.
+
+## How it works
+
+### Queue jobs
+
+Creating queue jobs is done by `@(d/transact conn [...other data... (yq/put :q {:work 123})])`.
+Inspecting `(yq/put :q {:work 123})` you will see something like this:
+
+```clojure
+#:com.github.ivarref.yoltq{:id #uuid"614232a8-e031-45bb-8660-be146eaa32a2", ; Queue job id 
+                           :queue-name :q, ; Destination queue                                 
+                           :status :init, ; Status
+                           :payload "{:work 123}", ; Payload persisted to the database with pr-str
+                           :bindings "{}", 
+                           :lock #uuid"037d7da1-5158-4243-8f72-feb1e47e15ca", ; Lock to protect from multiple consumers
+                           :tries 0, ; How many times the job has been executed
+                           :init-time 4305758012289 ; Time of initialization (System/nanoTime)
+                           }
+```
+
+This is the queue job as it will be stored into the database. 
+You can see that the payload, i.e. the second argument of `yq/put`,
+is persisted into the database. Thus the payload must be `pr-str`-able.
+
+
+A queue job will initially have status `:init`.
+It will then transition to the following statuses:
+
+* `:processing`: When the queue job begins processing in the queue consumer function.
+* `:done`: If the queue consumer function returns normally.
+* `:error`: If the queue consumer function throws an exception.
+
+### Queue consumers
+
+...
+
+### Listening for queue jobs
+
+When `(yq/start!)` is invoked, a threadpool is started.
+
+One thread is permanently allocated for listening to the 
+[tx-report-queue](https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/tx-report-queue)
+and responding to changes. This means that yoltq will respond 
+and process newly created queue jobs fairly quickly.
+This also means that queue jobs in status `:init` will almost always* be processed without
+any type of backoff.
+
+This pool also schedules polling jobs that will regularly check for various statuses:
+
+* Jobs in status `:error` that have waited for at least `:error-backoff-time` (default: 5 seconds) will be retried.
+* Jobs that have been in `:processing` for at least `:hung-backoff-time` (default: 30 minutes) will be considered hung and retried.
+* Jobs that have been in status `:init` for at least `:init-backoff-time` (default: 1 minute) without being processed will be picked up. *Queue jobs can be left in status `:init` during application restart/upgrade, hence the need for this strategy.
+
+
+### Retry and backoff strategy
+
+Yoltq assumes that if a queue consumer throws an exception for one item, it
+will also do the same for another item in the immediate future, 
+assuming the remote system that the queue consumer represents is still down.
+Thus if there are ten failures for queue `:q`, it does not make sense to
+retry all of them at once.
+
+The retry polling job that runs regularly (`:poll-delay`, default: every 10 seconds)
+thus stops at the first failure.
+Each queue has its own polling job, so a queue whose consumer is failing
+does not block retries for the other queues.
+
+The retry polling job will continue to eagerly process queue jobs as long as it 
+encounters only successes.
+
+While the default `:error-backoff-time` of 5 seconds may seem short, in practice
+if there are many failed items and the external system is still down,
+the actual backoff time per item will be longer, since each polling run stops at the first failure.
+
+
+### Ordering
+
+There is no attempt at ordering the execution of queue jobs.
+In fact the opposite is done to guard against the case that a single failing queue job
+could effectively take down the entire retry polling job.
+
+### Stuck threads
+
+A single thread is dedicated to monitoring how much time a queue consumer 
+spends on a single job. If this exceeds `:hung-backoff-time` (default: 30 minutes),
+the queue job will be marked as failed and the stack trace of the offending
+consumer will be logged.
+
+### Total health and system sanity 
+\ No newline at end of file
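
The retry behaviour described in the README text added by this patch (jobs in `:error` become eligible after `:error-backoff-time`, and a polling run stops at the first failure) can be illustrated with a minimal, self-contained sketch. It does not use yoltq or Datomic. The names `make-job`, `eligible-for-retry?`, `retry-run`, `flaky-consumer` and the `:error-time` key are hypothetical and chosen for illustration only; they are not part of the yoltq API, and the real implementation may differ.

```clojure
(ns retry-sketch
  (:require [clojure.edn :as edn]))

;; Hypothetical in-memory stand-in for a stored queue job.
;; As in the README, the payload is kept as a string produced by pr-str.
(defn make-job [id payload error-time-ms]
  {:id         id
   :status     :error
   :payload    (pr-str payload)
   :error-time error-time-ms}) ; assumed key, for illustration only

(defn eligible-for-retry?
  "True if the job is in status :error and has waited at least backoff-ms."
  [now-ms backoff-ms {:keys [status error-time]}]
  (and (= :error status)
       (>= (- now-ms error-time) backoff-ms)))

(defn retry-run
  "Runs consumer on the eligible jobs in turn and stops at the first failure.
  Returns the ids of the jobs that were attempted."
  [consumer now-ms backoff-ms jobs]
  (loop [[job & more] (filter #(eligible-for-retry? now-ms backoff-ms %) jobs)
         attempted    []]
    (if (nil? job)
      attempted
      (let [ok?       (try
                        ;; Read the payload back from its pr-str form.
                        (consumer (edn/read-string (:payload job)))
                        true
                        (catch Exception _ false))
            attempted (conj attempted (:id job))]
        (if ok?
          (recur more attempted)
          attempted)))))

;; Usage: jobs 1-3 have waited 10 seconds, job 4 only 1 second.
(def jobs [(make-job 1 {:work 1} 0)
           (make-job 2 {:work 2} 0)
           (make-job 3 {:work 3} 0)
           (make-job 4 {:work 4} 9000)])

;; Simulates a remote system that fails for one particular item.
(defn flaky-consumer [{:keys [work]}]
  (when (= work 2)
    (throw (ex-info "remote system down" {:work work}))))

(retry-run flaky-consumer 10000 5000 jobs)
;; => [1 2]
```

Job 3 is eligible but is not attempted in this run because job 2 failed first, and job 4 has not yet waited for the 5 second backoff. This is why the effective backoff per item can be longer than `:error-backoff-time` when many items are failing.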