Near-frontier AI for mortals.
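helexa is a self-hosted LLM serving stack, written in Rust, for people who run open-weight models on their own consumer GPUs. It has two components: cortex, the gateway that exposes OpenAI- and Anthropic-compatible APIs and routes requests across the fleet, and neuron, the per-host daemon that discovers GPUs and runs inference in-process on candle.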
Two principles constrain everything in this repository:
GPU acquisition is harder than it was a year ago, and the gap between what cloud providers charge and what your own silicon costs keeps widening. The intersection of those two principles — near-frontier models, squeezed onto hardware you own — is helexa's entire niche.
The secondary objective is predictable consumption. If you own the hardware, your tooling shouldn't break because a cloud provider changed billing, deprecated a model, or reshaped an API. cortex's OpenAI and Anthropic surfaces are a stability contract: point your editor, agent, or CLI at it once, and it keeps working.
This is an intentionally different path from vLLM, SGLang, and peers — not a smaller version of them. Out of scope, permanently:
One thing that is not a principle: CUDA exclusivity. All high-end consumer hardware is in scope. helexa is CUDA-only today because that's the hardware on the bench — nothing ships untested — and ROCm or other consumer accelerators join as soon as there's real hardware to build against.
In scope, and where the engineering effort goes: aggressive quantization (GGUF Q4_K_M / Q6_K / Q8_0), NCCL tensor parallelism across heterogeneous consumer GPUs, careful CUDA failure handling, and single-request latency — the performance that one operator at a keyboard actually feels.
┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │  OpenAI + Anthropic APIs
                    ┌──────────▼──────────┐
                    │       cortex        │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐ ┌▼──────────┐
            │  neuron   │  │ neuron │ │  neuron   │
            │  :13131   │  │ :13131 │ │  :13131   │
            │  candle   │  │ candle │ │  candle   │
            └───────────┘  └────────┘ └───────────┘
                  private network (.internal)
cortex discovers each neuron's hardware (devices, VRAM, compute
capability) at runtime and matches it against a model catalogue
(models.toml) to decide placement: which models fit where, what to
evict when VRAM is tight, where to route a request right now. Adding a
GPU host to the fleet is one [[neurons]] entry — no device specs in
config.
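A minimal sketch of the "which models fit where" step, under stated assumptions: `NeuronState`, `ModelProfile`, and `feasible_neurons` are hypothetical names, not cortex's actual types, and feasibility is reduced to "enough devices and enough total free VRAM". Real placement would also have to consider compute capability, quantization, and how a model shards across devices.

```rust
/// Hypothetical view of one neuron as reported at discovery time.
#[derive(Debug, Clone)]
pub struct NeuronState {
    pub name: String,
    /// Free VRAM per device, in MiB.
    pub free_vram_mib: Vec<u64>,
}

/// Hypothetical catalogue entry (illustrative fields, not the models.toml schema).
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub id: String,
    /// Total VRAM the model needs, in MiB.
    pub vram_mib: u64,
    /// Minimum number of devices the model must be spread across.
    pub min_devices: usize,
}

/// Names of neurons that could host `model` right now, ordered by how much
/// free VRAM would be left over (most spare first, ties broken by name).
pub fn feasible_neurons<'a>(model: &ModelProfile, fleet: &'a [NeuronState]) -> Vec<&'a str> {
    let mut fits: Vec<(&'a str, u64)> = fleet
        .iter()
        .filter(|n| n.free_vram_mib.len() >= model.min_devices)
        .filter_map(|n| {
            // Simplification: sum free VRAM across devices.
            let total: u64 = n.free_vram_mib.iter().sum();
            total
                .checked_sub(model.vram_mib)
                .map(|spare| (n.name.as_str(), spare))
        })
        .collect();
    fits.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    fits.into_iter().map(|(name, _)| name).collect()
}
```

An empty result is the case where the gateway would have to evict something, or report that the model does not fit anywhere in the fleet.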
| Crate | Purpose |
|---|---|
| cortex-core | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| cortex-gateway | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| neuron | Per-host daemon: GPU discovery, in-process candle inference, NCCL tensor parallelism, model lifecycle API |
| cortex-cli | CLI entrypoint (cortex serve, cortex status, etc.) |
| helexa-acp | Agent Client Protocol bridge: connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default |
neuron runs inference in-process on candle — there is no external inference server to babysit. The parts that earn their keep:
Drop contract is structurally safe, and a driver error poisons one worker, visibly, instead of hanging the whole process.

See CLAUDE.md for design rationale and crates/neuron/src/harness/device_worker/ for the worker narrative.
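The idea that a fault poisons one worker rather than the whole process can be sketched with the standard library alone. This is an illustration, not neuron's implementation: `DeviceWorker`, `JobError`, and `run` are hypothetical names, jobs return a plain `u64`, and every job error is treated as a device fault for brevity (a real harness would distinguish device faults from ordinary request errors).

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

#[derive(Debug, PartialEq)]
pub enum JobError {
    /// The job itself hit a (simulated) driver error.
    DeviceFault(String),
    /// An earlier fault poisoned this worker; the job was refused, not run.
    Poisoned,
    /// The worker thread is no longer running.
    WorkerGone,
}

type Job = Box<dyn FnOnce() -> Result<u64, String> + Send>;
type Reply = Sender<Result<u64, JobError>>;

/// One thread per device; all work for that device is funnelled through it.
pub struct DeviceWorker {
    tx: Sender<(Job, Reply)>,
}

impl DeviceWorker {
    pub fn spawn() -> Self {
        let (tx, rx) = mpsc::channel::<(Job, Reply)>();
        thread::spawn(move || {
            let mut poisoned = false;
            for (job, reply) in rx {
                let result = if poisoned {
                    // Fail fast and visibly instead of touching a bad context.
                    Err(JobError::Poisoned)
                } else {
                    match job() {
                        Ok(v) => Ok(v),
                        Err(e) => {
                            poisoned = true;
                            Err(JobError::DeviceFault(e))
                        }
                    }
                };
                // The caller may have given up waiting; ignore a closed reply channel.
                let _ = reply.send(result);
            }
        });
        DeviceWorker { tx }
    }

    /// Run `f` on the worker thread and block until it replies.
    pub fn run<F>(&self, f: F) -> Result<u64, JobError>
    where
        F: FnOnce() -> Result<u64, String> + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        let job: Job = Box::new(f);
        self.tx.send((job, reply_tx)).map_err(|_| JobError::WorkerGone)?;
        reply_rx.recv().map_err(|_| JobError::WorkerGone)?
    }
}
```

After one failing job, later calls on the same worker return `JobError::Poisoned` immediately, while a second `DeviceWorker` keeps serving requests.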
Pre-built RPMs for Fedora:
dnf copr enable helexa/helexa
dnf install cortex # on the gateway host
dnf install helexa-neuron # on each GPU host
systemctl enable --now cortex # or neuron, respectively
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles (VRAM requirements, quant, device minimums,
pinning) live in models.toml — see models.example.toml.
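To make the `strategy = "lru"` setting and the pinning mentioned above concrete, here is a hedged sketch of least-recently-used victim selection. `Loaded` and `lru_victims` are hypothetical names, `last_accessed` is an arbitrary monotonic tick, and the gateway's real evictor may weigh more than recency.

```rust
/// Hypothetical record of a model currently resident on a neuron.
#[derive(Debug, Clone)]
pub struct Loaded {
    pub model_id: String,
    /// VRAM released if this model is unloaded, in MiB.
    pub vram_mib: u64,
    /// Monotonic tick of the last request served (larger = more recent).
    pub last_accessed: u64,
    /// Pinned models are never eviction candidates.
    pub pinned: bool,
}

/// Choose models to unload, least recently used first, until at least
/// `needed_mib` would be freed. Returns `None` when evicting every
/// unpinned model would still not free enough.
pub fn lru_victims(loaded: &[Loaded], needed_mib: u64) -> Option<Vec<String>> {
    let mut candidates: Vec<&Loaded> = loaded.iter().filter(|m| !m.pinned).collect();
    candidates.sort_by_key(|m| m.last_accessed);

    let mut freed = 0u64;
    let mut victims = Vec::new();
    for m in candidates {
        if freed >= needed_mib {
            break;
        }
        freed += m.vram_mib;
        victims.push(m.model_id.clone());
    }
    if freed >= needed_mib {
        Some(victims)
    } else {
        None
    }
}
```

Returning `None` rather than a partial list keeps the caller from unloading models when the new model still would not fit.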
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# one catalogue across every node
curl http://localhost:31313/v1/models
cargo build --release
CI runs on every push; keep it green locally:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) build SRPMs for cortex and helexa-neuron
and publish to COPR.
Pre-1.0 and moving fast. The gateway path (routing, eviction, translation, metrics) is stable and tested; the candle-native engine is under active development — expect the supported-model list to track the open-weight frontier, deliberately narrowly.
Development happens at https://git.lair.cafe/helexa/helexa; https://github.com/helexa-ai/helexa is a read-only mirror.
GPL-3.0
Closing in favour of a model-agnostic reframe: see #8 (strip reasoning content by default on chat completions) and #9 (chat_template_kwargs passthrough).
The original proposal — "route Qwen3 <think> to ReasoningDelta" — assumed Qwen3-specific tag parsing. Investigating the actual leak (Zed's commit-message generator showing <think> blocks in the field) surfaced two problems:
crates/open_ai/src/completion.rs: no chat_template_kwargs, no Responses-API capability detection). The wire format has no slot for reasoning, so anything inside <think> arrives as plain content. #5's proposed fix wouldn't help that path because there's no reasoning-event family in chat completions to route to.

The leak fixes cleanly with a model-agnostic seam: at model load time, probe the tokenizer's added_tokens for any token whose content matches a known reasoning-marker convention. Store the open/close token IDs on LoadedModel (or None for non-reasoning models). The inference loop's token-level state machine routes between TextDelta and ReasoningDelta without any hardcoded model knowledge.
The chat-completions projector then drops ReasoningDelta by default (matching the wire format's lack of a reasoning slot), with an opt-in header for callers like helexa-acp that want the markers back.
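A rough sketch of that seam, with several simplifications that are assumptions rather than the project's code: events carry raw token IDs instead of decoded text, `ReasoningMarkers`, `Router`, and `project_chat` are hypothetical names, and the marker tokens themselves are swallowed rather than re-emitted on opt-in.

```rust
/// Open/close marker token IDs found at load time (hypothetical struct;
/// the note above stores these on LoadedModel, or None for non-reasoning models).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ReasoningMarkers {
    pub open: u32,
    pub close: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    TextDelta(u32),
    ReasoningDelta(u32),
}

/// Token-level state machine: no model names, only marker IDs.
pub struct Router {
    markers: Option<ReasoningMarkers>,
    in_reasoning: bool,
}

impl Router {
    pub fn new(markers: Option<ReasoningMarkers>) -> Self {
        Router { markers, in_reasoning: false }
    }

    /// Classify one sampled token. Marker tokens flip the state and emit nothing.
    pub fn route(&mut self, token: u32) -> Option<InferenceEvent> {
        if let Some(m) = self.markers {
            if token == m.open {
                self.in_reasoning = true;
                return None;
            }
            if token == m.close {
                self.in_reasoning = false;
                return None;
            }
        }
        Some(if self.in_reasoning {
            InferenceEvent::ReasoningDelta(token)
        } else {
            InferenceEvent::TextDelta(token)
        })
    }
}

/// Chat-completions projection: reasoning is dropped unless the caller opted in.
pub fn project_chat(events: Vec<InferenceEvent>, include_reasoning: bool) -> Vec<u32> {
    events
        .into_iter()
        .filter_map(|e| match e {
            InferenceEvent::TextDelta(t) => Some(t),
            InferenceEvent::ReasoningDelta(t) if include_reasoning => Some(t),
            InferenceEvent::ReasoningDelta(_) => None,
        })
        .collect()
}
```

With `markers` set to `None` every token is routed as `TextDelta`, which is the non-reasoning-model case.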
That's #8. Companion is #9 (pass chat_template_kwargs through to the chat template at tokenisation), which gives clients a request-side lever to suppress thinking at generation time — also model-agnostic since neuron doesn't interpret the kwarg, just forwards it.
The Responses-API mapping (ReasoningDelta → response.reasoning_summary_text.delta) is still worth doing eventually but only matters once a Responses-API consumer of cortex exists; tracking under #7's reasoning sub-bullet rather than as a separate issue today.
Pinning down the deployed versions when this was first reported, so future debugging can establish whether intervening changes affect the repro:
- 0.1.16-0.1.20260527185748.git249b2e5.fc43 on hanzalova
- 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on beast (commit df0abfe)
- 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on benjy
- 0.1.16-0.1.20260527185748.git249b2e5.fc43 on quadbrat (note: older build, since upgraded to gitdf0abfe)

For helexa-acp callers: this bug predates the InferenceEvent refactor on main (commit 302ccfb, pushed 2026-05-29) and is unrelated to it. The refactor only touched the streaming output path, not the request-parsing / image-ingest path. The next deploy off a build that includes 302ccfb (or later) should reproduce identically until we actually wire image ingest through to the candle harness.
Relevant code paths to inspect during triage:
- crates/cortex-gateway/src/handlers.rs::chat_completions: does the gateway preserve content arrays verbatim when proxying, or flatten them?
- crates/neuron/src/harness/candle.rs::format_qwen3_prompt: currently does MessageContent::Parts(parts) => parts.iter().filter(text-only).join(...), which would silently drop image_url parts.
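The suspected behaviour of format_qwen3_prompt in the triage notes (keep text parts, join them, ignore everything else) can be reproduced in isolation. The types below are simplified stand-ins, not the cortex-core definitions, and the newline separator is an assumption; `dropped_images` is a hypothetical helper showing how the loss could be made visible.

```rust
/// Simplified stand-in for one part of an OpenAI-style content array.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageUrl(String),
}

/// Simplified stand-in for a message's content field.
#[derive(Debug, Clone, PartialEq)]
pub enum MessageContent {
    Text(String),
    Parts(Vec<ContentPart>),
}

/// Text-only flattening: image parts vanish without any error or log line.
pub fn flatten_text_only(content: &MessageContent) -> String {
    match content {
        MessageContent::Text(t) => t.clone(),
        MessageContent::Parts(parts) => parts
            .iter()
            .filter_map(|p| match p {
                ContentPart::Text(t) => Some(t.as_str()),
                ContentPart::ImageUrl(_) => None,
            })
            .collect::<Vec<_>>()
            .join("\n"),
    }
}

/// Count the parts the flattening above would discard, so a caller
/// could warn or reject instead of dropping them silently.
pub fn dropped_images(content: &MessageContent) -> usize {
    match content {
        MessageContent::Text(_) => 0,
        MessageContent::Parts(parts) => parts
            .iter()
            .filter(|p| matches!(p, ContentPart::ImageUrl(_)))
            .count(),
    }
}
```

A request with one text part and one image part flattens to just the text, which matches the reported symptom of the model never seeing the image.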