hi, i'm rob

helexa/cortex

Rust 96.9% · Cuda 1.7% · Shell 1.2% · Python 0.3%

helexa

Near-frontier AI for mortals.

helexa is a self-hosted LLM serving stack, written in Rust, for people who run open-weight models on their own consumer GPUs. It has two components:

cortex — the per-operator control plane and LLM proxy. It sits in front of your GPU fleet and presents a unified OpenAI + Anthropic compatible API surface, handling model routing, lifecycle management (load / unload / evict), request translation, and metrics.
neuron — the per-host LLM harness. One instance runs on every GPU host, serving candle-based in-process inference and managing local hardware discovery and model lifecycle.
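Here is a minimal Rust sketch of one direction of "request translation": reshaping an Anthropic Messages request into an OpenAI Chat Completions request. The struct and function names are hypothetical and much simpler than the real envelopes in `cortex-core`. The sketch captures one real structural difference between the two APIs. Anthropic carries the system prompt as a top-level `system` field and requires `max_tokens`. OpenAI expects the system prompt as a leading message with role `system`.

```rust
/// Hypothetical, simplified message shape shared by both APIs in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Anthropic Messages: system prompt is a top-level field, max_tokens is required.
#[derive(Debug, Clone)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<ChatMessage>,
    pub max_tokens: u32,
}

/// OpenAI Chat Completions: system prompt is the first message; max_tokens is optional.
#[derive(Debug, Clone)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
    pub max_tokens: Option<u32>,
}

/// Translate a text-only Anthropic request into the OpenAI shape.
pub fn anthropic_to_openai(req: &AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    // Hoist the top-level system prompt into a leading "system" message.
    if let Some(system) = &req.system {
        messages.push(ChatMessage {
            role: "system".to_string(),
            content: system.clone(),
        });
    }
    // "user" and "assistant" turns carry over unchanged for plain text.
    messages.extend(req.messages.iter().cloned());
    OpenAiRequest {
        model: req.model.clone(),
        messages,
        max_tokens: Some(req.max_tokens),
    }
}
```

A full translator also has to handle content blocks, tool calls and streaming events in both directions. This sketch covers plain text turns only.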

Why

Two principles constrain everything in this repository:

Frontier or close to it. helexa serves the open-weight models that get nearest to frontier capability — not every architecture ever published.
Consumer hardware. Everything must run on the cards mortals can actually buy: a 3060 here, a 4090 there, a 5090 if you got lucky. Mixed VRAM tiers across mismatched boxes are the expected topology, not a degraded case.

GPU acquisition is harder than it was a year ago, and the gap between what cloud providers charge and what your own silicon costs keeps widening. The intersection of those two principles — near-frontier models, squeezed onto hardware you own — is helexa's entire niche.

The secondary objective is predictable consumption. If you own the hardware, your tooling shouldn't break because a cloud provider changed billing, deprecated a model, or reshaped an API. cortex's OpenAI and Anthropic surfaces are a stability contract: point your editor, agent, or CLI at it once, and it keeps working.

What helexa is not

This is an intentionally different path from vLLM, SGLang, and peers — not a smaller version of them. Out of scope, permanently:

Any-model breadth. Architectures are ported because they're at or near the frontier, not to complete a compatibility matrix.
Datacenter-class scheduling. No sophisticated continuous-batching / paged-attention machinery — the workload is a handful of operators and their agents, not 200 QPS.
Wrapping external inference engines. neuron builds directly on candle; every model architecture it serves is implemented in this repository, ported against the HuggingFace reference.

One thing that is not a principle: CUDA exclusivity. All high-end consumer hardware is in scope. helexa is CUDA-only today because that's the hardware on the bench — nothing ships untested — and ROCm or other consumer accelerators join as soon as there's real hardware to build against.

In scope, and where the engineering effort goes: aggressive quantization (GGUF Q4_K_M / Q6_K / Q8_0), NCCL tensor parallelism across heterogeneous consumer GPUs, careful CUDA failure handling, and single-request latency — the performance that one operator at a keyboard actually feels.
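For a reference point on the simplest of those formats, the sketch below shows Q8_0-style block quantization in Rust. Weights are grouped into blocks of 32. Each block stores one scale `d = max|x| / 127` plus 32 signed 8-bit values, and dequantization is `x ≈ d · q`. It follows ggml's reference quantizer in spirit but is not neuron's code. GGUF stores `d` as f16 (34 bytes per block, 8.5 bits per weight), whereas this sketch keeps f32 to stay dependency-free. Q4_K_M and Q6_K use more elaborate super-block layouts that are not shown here.

```rust
/// Values per Q8_0 block (ggml's QK8_0).
pub const QK8_0: usize = 32;

/// One quantized block. GGUF stores `d` as f16; f32 here for simplicity.
#[derive(Debug, Clone)]
pub struct BlockQ8_0 {
    pub d: f32,
    pub qs: [i8; QK8_0],
}

/// Symmetric absmax quantization of a row whose length is a multiple of 32.
pub fn quantize_q8_0(x: &[f32]) -> Vec<BlockQ8_0> {
    assert!(x.len() % QK8_0 == 0, "row length must be a multiple of {QK8_0}");
    x.chunks_exact(QK8_0)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let d = amax / 127.0;
            // An all-zero block gets d = 0 and all-zero quants.
            let id = if d != 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; QK8_0];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * id).round() as i8;
            }
            BlockQ8_0 { d, qs }
        })
        .collect()
}

/// Reconstruct approximate f32 weights: x ≈ d * q.
pub fn dequantize_q8_0(blocks: &[BlockQ8_0]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.qs.iter().map(move |&q| q as f32 * b.d))
        .collect()
}
```

Because each value is rounded to the nearest multiple of its block's `d`, the per-weight error is at most `d / 2`. One outlier therefore only coarsens its own block of 32.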

Architecture

┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │  OpenAI + Anthropic APIs
                    ┌──────────▼──────────┐
                    │      cortex         │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
            │  neuron   │  │ neuron │  │  neuron   │
            │  :13131   │  │ :13131 │  │  :13131   │
            │  candle   │  │ candle │  │  candle   │
            └───────────┘  └────────┘  └───────────┘
                  private network (.internal)

cortex discovers each neuron's hardware (devices, VRAM, compute capability) at runtime and matches it against a model catalogue (models.toml) to decide placement: which models fit where, what to evict when VRAM is tight, where to route a request right now. Adding a GPU host to the fleet is one [[neurons]] entry — no device specs in config.
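Here is a minimal sketch of the "which models fit where" step. It assumes hypothetical `Node` and `ModelProfile` types (the real catalogue fields live in models.toml) and an even tensor-parallel split, so every participating device must hold `vram_mib / min_devices`. Among nodes that fit, it picks the one with the least free VRAM (best fit), which keeps roomier hosts open for larger models. cortex's actual placement policy may differ.

```rust
/// Hypothetical runtime view of one neuron.
#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    /// Free VRAM per CUDA device, in MiB.
    pub free_vram_mib: Vec<u64>,
}

/// Hypothetical slice of a placement profile.
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub name: String,
    pub vram_mib: u64,
    /// Tensor-parallel degree: weights are split evenly over this many devices.
    pub min_devices: usize,
}

/// True if `node` has `min_devices` devices that can each hold an even shard.
pub fn fits(node: &Node, model: &ModelProfile) -> bool {
    if model.min_devices == 0 || node.free_vram_mib.len() < model.min_devices {
        return false;
    }
    let shard = model.vram_mib.div_ceil(model.min_devices as u64);
    node.free_vram_mib.iter().filter(|&&free| free >= shard).count() >= model.min_devices
}

/// Best fit: among feasible nodes, choose the one with the least total free VRAM.
pub fn place<'a>(nodes: &'a [Node], model: &ModelProfile) -> Option<&'a Node> {
    nodes
        .iter()
        .filter(|n| fits(n, model))
        .min_by_key(|n| n.free_vram_mib.iter().sum::<u64>())
}
```

With an even split, the smallest participating card sets the ceiling for each shard.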

Crates

| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-host daemon: GPU discovery, in-process candle inference, NCCL tensor parallelism, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
| `helexa-acp` | Agent Client Protocol bridge: connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default |

The engine

neuron runs inference in-process on candle — there is no external inference server to babysit. The parts that earn their keep:

Per-device worker threads. Every CUDA device gets one dedicated OS thread that owns its CUDA context for the daemon's lifetime. All loads, forward passes, KV-cache resets, NCCL collectives, VRAM queries, and unloads route through it; tensors never escape it alive. Context binding is pinned to a known thread, the CUDA Drop contract is structurally safe, and a driver error poisons one worker — visibly — instead of hanging the whole process.
Tensor parallelism on consumer cards. Megatron-style row/column parallel layers with NCCL all-reduce, spanning the mismatched GPUs you actually have. A step watchdog aborts wedged collectives instead of letting a request hang forever.
Current model focus: the Qwen3 family — dense and GGUF-quantized, including the hybrid linear-attention (Gated DeltaNet) generation. Vision support is in progress. Each architecture is ported against its HuggingFace reference implementation.
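The per-device worker pattern can be sketched with nothing but `std`. One named OS thread owns all device state. Every operation is a message sent over a channel with a reply channel attached. A fault marks that worker as poisoned, so later calls fail loudly instead of blocking. `DeviceWorker`, `Command` and the placeholder VRAM figure are hypothetical; a real worker would create and bind its CUDA context on that thread and keep its tensors there.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical command set (the real one covers loads, forward passes,
/// KV-cache resets, NCCL collectives, VRAM queries and unloads).
pub enum Command {
    QueryVram { reply: Sender<Result<u64, String>> },
    /// Stand-in for a forward pass: replies with the number of tokens processed.
    Forward { tokens: Vec<u32>, reply: Sender<Result<usize, String>> },
    /// Simulates a driver error so the poisoning path can be exercised.
    InjectFault,
}

/// Handle to one device's dedicated OS thread.
pub struct DeviceWorker {
    tx: Option<Sender<Command>>,
    handle: Option<JoinHandle<()>>,
}

impl DeviceWorker {
    pub fn spawn(ordinal: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Command>();
        let handle = thread::Builder::new()
            .name(format!("device-worker-{ordinal}"))
            .spawn(move || {
                // Everything device-owned lives on this thread only.
                let free_vram_mib: u64 = 24_000; // placeholder, not a real query
                let mut poisoned = false;
                // Commands run strictly in order; the loop ends when the Sender drops.
                for cmd in rx {
                    match cmd {
                        Command::InjectFault => poisoned = true,
                        Command::QueryVram { reply } => {
                            let _ = reply.send(if poisoned {
                                Err("worker poisoned".to_string())
                            } else {
                                Ok(free_vram_mib)
                            });
                        }
                        Command::Forward { tokens, reply } => {
                            let _ = reply.send(if poisoned {
                                Err("worker poisoned".to_string())
                            } else {
                                Ok(tokens.len())
                            });
                        }
                    }
                }
            })
            .expect("failed to spawn device worker thread");
        Self { tx: Some(tx), handle: Some(handle) }
    }

    /// Send a command carrying a reply channel, then block for the answer.
    fn call<T>(&self, make: impl FnOnce(Sender<Result<T, String>>) -> Command) -> Result<T, String> {
        let (reply, rx) = mpsc::channel();
        let tx = self.tx.as_ref().ok_or("worker shut down")?;
        tx.send(make(reply)).map_err(|_| "worker thread exited".to_string())?;
        rx.recv().map_err(|_| "worker dropped the reply".to_string())?
    }

    pub fn query_vram(&self) -> Result<u64, String> {
        self.call(|reply| Command::QueryVram { reply })
    }

    pub fn forward(&self, tokens: Vec<u32>) -> Result<usize, String> {
        self.call(|reply| Command::Forward { tokens, reply })
    }

    pub fn inject_fault(&self) {
        if let Some(tx) = &self.tx {
            let _ = tx.send(Command::InjectFault);
        }
    }
}

impl Drop for DeviceWorker {
    fn drop(&mut self) {
        // Dropping the Sender closes the channel, which ends the worker loop.
        self.tx.take();
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

Only plain values cross the channel back to callers. A fault in one worker leaves other workers untouched.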

See CLAUDE.md for design rationale and crates/neuron/src/harness/device_worker/ for the worker narrative.
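The Megatron-style split described under tensor parallelism can be checked on the CPU. In this sketch, a two-layer MLP `y = relu(x·W1)·W2` is sharded over `world` ranks. `W1` is split by columns, so each rank computes its own slice of the hidden activations and applies the elementwise `relu` without communicating. `W2` is split by rows to match, so each rank produces a partial output. A single sum all-reduce then recovers the full result. `Mat`, `tp_mlp` and `all_reduce_sum` are hypothetical names; the last stands in for an NCCL all-reduce.

```rust
/// Row-major dense matrix, just enough for the sketch.
#[derive(Debug, Clone)]
pub struct Mat {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

impl Mat {
    pub fn new(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols);
        Self { rows, cols, data }
    }

    fn at(&self, r: usize, c: usize) -> f32 {
        self.data[r * self.cols + c]
    }

    pub fn matmul(&self, rhs: &Mat) -> Mat {
        assert_eq!(self.cols, rhs.rows);
        let mut out = vec![0.0; self.rows * rhs.cols];
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.at(i, k);
                for j in 0..rhs.cols {
                    out[i * rhs.cols + j] += a * rhs.at(k, j);
                }
            }
        }
        Mat::new(self.rows, rhs.cols, out)
    }

    /// Columns lo..hi: a column-parallel shard.
    fn cols_range(&self, lo: usize, hi: usize) -> Mat {
        let data = (0..self.rows)
            .flat_map(|r| (lo..hi).map(move |c| self.at(r, c)))
            .collect();
        Mat::new(self.rows, hi - lo, data)
    }

    /// Rows lo..hi: a row-parallel shard.
    fn rows_range(&self, lo: usize, hi: usize) -> Mat {
        Mat::new(hi - lo, self.cols, self.data[lo * self.cols..hi * self.cols].to_vec())
    }

    pub fn relu(mut self) -> Mat {
        for v in &mut self.data {
            *v = v.max(0.0);
        }
        self
    }
}

/// Stand-in for an NCCL sum all-reduce: the elementwise sum of all partials.
fn all_reduce_sum(partials: &[Mat]) -> Mat {
    let mut acc = partials[0].clone();
    for p in &partials[1..] {
        for (a, b) in acc.data.iter_mut().zip(&p.data) {
            *a += *b;
        }
    }
    acc
}

/// y = relu(x·W1)·W2 sharded over `world` ranks: W1 column-parallel, W2 row-parallel.
pub fn tp_mlp(x: &Mat, w1: &Mat, w2: &Mat, world: usize) -> Mat {
    assert_eq!(w1.cols, w2.rows);
    assert_eq!(w1.cols % world, 0, "hidden size must divide evenly across ranks");
    let shard = w1.cols / world;
    let partials: Vec<Mat> = (0..world)
        .map(|rank| {
            let (lo, hi) = (rank * shard, (rank + 1) * shard);
            // Each rank owns a slice of hidden units; relu is elementwise, so no sync here.
            let h = x.matmul(&w1.cols_range(lo, hi)).relu();
            // Matching rows of W2 turn that slice into a partial sum of y.
            h.matmul(&w2.rows_range(lo, hi))
        })
        .collect();
    // One all-reduce per column/row pair recovers the unsharded output.
    all_reduce_sum(&partials)
}
```

This layout needs one all-reduce per column/row pair. That matters on consumer cards linked over PCIe, where every collective is comparatively expensive.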

Install

Pre-built RPMs for Fedora:

dnf copr enable helexa/helexa
dnf install cortex            # on the gateway host
dnf install helexa-neuron     # on each GPU host
systemctl enable --now cortex   # or neuron, respectively

Configure

# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"

[eviction]
strategy = "lru"        # lru | priority
defrag_after_cycles = 50

[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"

[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
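The `strategy` key selects how eviction victims are chosen when VRAM is tight. Below is a minimal sketch of the two options. It assumes hypothetical `Loaded` records with a last-used tick, a priority and a pinned flag; pinning is mentioned for models.toml, but these field names are illustrative. Under `lru`, the least recently used models go first. Under `priority`, the lowest priority goes first, with recency breaking ties. Pinned models are never touched. The sketch does not model `defrag_after_cycles`.

```rust
/// Mirrors the `strategy = "lru" | "priority"` choice in cortex.toml.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Lru,
    Priority,
}

/// Hypothetical record for one loaded model.
#[derive(Debug, Clone)]
pub struct Loaded {
    pub name: String,
    pub vram_mib: u64,
    /// Monotonic tick of the last request served; higher means more recent.
    pub last_used: u64,
    /// Higher values survive longer under `Strategy::Priority`.
    pub priority: u32,
    /// Pinned models are never evicted.
    pub pinned: bool,
}

/// Pick models to unload until free VRAM reaches `need_mib`.
/// Returns `None` if evicting every unpinned model would still not be enough.
pub fn pick_victims(
    loaded: &[Loaded],
    free_mib: u64,
    need_mib: u64,
    strategy: Strategy,
) -> Option<Vec<String>> {
    let mut candidates: Vec<&Loaded> = loaded.iter().filter(|m| !m.pinned).collect();
    match strategy {
        // Oldest last_used first.
        Strategy::Lru => candidates.sort_by_key(|m| m.last_used),
        // Lowest priority first; recency breaks ties.
        Strategy::Priority => candidates.sort_by_key(|m| (m.priority, m.last_used)),
    }
    let mut free = free_mib;
    let mut victims = Vec::new();
    for m in candidates {
        if free >= need_mib {
            break;
        }
        free += m.vram_mib;
        victims.push(m.name.clone());
    }
    (free >= need_mib).then_some(victims)
}
```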

Model placement profiles (VRAM requirements, quant, device minimums, pinning) live in models.toml — see models.example.toml.

Run

# start the gateway
cortex serve --config /etc/cortex/cortex.toml

# check fleet status
cortex status

# one catalogue across every node
curl http://localhost:31313/v1/models

Build from source

cargo build --release

CI runs on every push; keep it green locally:

cargo fmt --check --all                    # must be clean
cargo clippy --workspace -- -D warnings   # warnings are errors
cargo test --workspace                     # all tests must pass

Tagged releases (v*) build SRPMs for cortex and helexa-neuron and publish to COPR.

Status

Pre-1.0 and moving fast. The gateway path (routing, eviction, translation, metrics) is stable and tested; the candle-native engine is under active development — expect the supported-model list to track the open-weight frontier, deliberately narrowly.

Development happens at https://git.lair.cafe/helexa/helexa; https://github.com/helexa-ai/helexa is a read-only mirror.

License

GPL-3.0

189 activities

grenade pushed 1 commit to helexa/cortex:main

24968e9 feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6

tuesday, june 2, 2026 — 12:33:04 utc

grenade pushed 2 commits to helexa/cortex:main

7df84fe feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6
5c520c7 feat(deploy): gitea workflow for rolling RPM deploys + host bootstrap

tuesday, june 2, 2026 — 08:40:50 utc

grenade opened issue #15 in helexa/cortex

Vision: numerical validation against transformers reference

monday, june 1, 2026 — 13:18:45 utc

grenade opened issue #14 in helexa/cortex

Vision: dynamic image resolution (Qwen-VL min/max pixels)

monday, june 1, 2026 — 13:18:32 utc

grenade opened issue #13 in helexa/cortex

Vision: deploy on Qwen3.6-27B (production validation)

monday, june 1, 2026 — 13:18:18 utc

grenade opened issue #12 in helexa/cortex

Vision: tensor-parallel implementation (Stage E)

monday, june 1, 2026 — 13:18:04 utc

grenade pushed 1 commit to helexa/cortex:main

d0292ed feat(cortex): catalogue source field + scheme-qualified /models/load

monday, june 1, 2026 — 11:54:01 utc

grenade pushed 1 commit to helexa/cortex:main

d4e1b05 feat(neuron,cortex-core): source-aware loader (scheme:org/name)

monday, june 1, 2026 — 10:42:16 utc

grenade pushed 2 commits to helexa/cortex:main

61adff3 feat(neuron): preflight placement check with structured errors
0af8c8d chore(ci): enable colored logs for readability

monday, june 1, 2026 — 10:31:44 utc

grenade pushed 2 commits to helexa/cortex:phase-2-preflight

61adff3 feat(neuron): preflight placement check with structured errors
0af8c8d chore(ci): enable colored logs for readability

monday, june 1, 2026 — 10:24:37 utc

grenade pushed 1 commit to helexa/cortex:main

435fd10 fix(neuron): macro-ify CUDA single-GPU route_token so DecodeStream type stays inferred

monday, june 1, 2026 — 05:59:59 utc

grenade opened issue #11 in helexa/cortex

feat(neuron): prefix KV caching across requests

monday, june 1, 2026 — 05:55:57 utc

grenade opened issue #10 in helexa/cortex

feat(cortex-gateway): Rust-native context compressor for prompt token reduction

monday, june 1, 2026 — 05:55:45 utc

grenade pushed 1 commit to helexa/cortex:main

cb30383 feat(neuron): render the model's chat_template with chat_template_kwargs

sunday, may 31, 2026 — 20:43:16 utc

grenade closed issue #9 in helexa/cortex

sunday, may 31, 2026 — 20:43:16 utc

grenade pushed 1 commit to helexa/cortex:main

4400835 feat(neuron): emit response.in_progress between created and output_item.added

sunday, may 31, 2026 — 20:30:39 utc

grenade pushed 1 commit to helexa/cortex:main

2f387f3 ci: export CUDA paths in cuda-check so cudarc build.rs finds nvcc

sunday, may 31, 2026 — 20:28:12 utc

grenade closed issue #6 in helexa/cortex

sunday, may 31, 2026 — 20:26:34 utc

grenade pushed 1 commit to helexa/cortex:main

fc9a8c4 feat(neuron): extract `<tool_call>` blocks to structured tool_calls deltas

sunday, may 31, 2026 — 20:26:34 utc

grenade pushed 1 commit to helexa/cortex:main

7733eec feat(neuron): strip reasoning from chat completions by default

sunday, may 31, 2026 — 14:55:07 utc

grenade closed issue #8 in helexa/cortex

sunday, may 31, 2026 — 14:55:07 utc

grenade opened issue #9 in helexa/cortex

Pass through `chat_template_kwargs` to the chat template at tokenization

sunday, may 31, 2026 — 14:43:38 utc

grenade opened issue #8 in helexa/cortex

Strip reasoning content from chat-completions output by default; opt-in via header

sunday, may 31, 2026 — 14:43:14 utc

grenade closed issue #5 in helexa/cortex

sunday, may 31, 2026 — 14:42:44 utc

grenade commented on #5 in helexa/cortex


Closing in favour of a model-agnostic reframe — see #8 (strip reasoning content by default on chat completions) and #9 (chat_template_kwargs passthrough).

Why this issue is wrong as written

The original proposal — "route Qwen3 <think> to ReasoningDelta" — assumed Qwen3-specific tag parsing. Investigating the actual leak (Zed's commit-message generator showing <think> blocks in the field) surfaced two problems:

Zed's chat-completions client doesn't know about reasoning at all on the chat-completions surface (confirmed against their crates/open_ai/src/completion.rs — no chat_template_kwargs, no Responses-API capability detection). The wire format has no slot for reasoning, so anything inside <think> arrives as plain content. #5's proposed fix wouldn't help that path because there's no reasoning-event family in chat completions to route to.
A model-specific tag parser in the candle harness's hot loops creates a coupling we don't want — DeepSeek-R1, Mistral Magistral, gpt-oss, and future reasoning models all use different markers. Per-model parser config is the wrong shape.

What replaces it

The leak fixes cleanly with a model-agnostic seam: at model load time, probe the tokenizer's added_tokens for any token whose content matches a known reasoning-marker convention. Store the open/close token IDs on LoadedModel (or None for non-reasoning models). The inference loop's token-level state machine routes between TextDelta and ReasoningDelta without any hardcoded model knowledge.

The chat-completions projector then drops ReasoningDelta by default (matching the wire format's lack of a reasoning slot), opt-in via header for callers like helexa-acp that want the markers back.
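The seam described above can be sketched in Rust; every name below is hypothetical. At load time, `probe_markers` scans the tokenizer's added tokens for a known open/close pair. `Router` is the token-level state machine: it swallows marker tokens and tags everything between them as reasoning. `project_chat` then drops reasoning unless the caller opted in. The marker table is illustrative: `<think>`/`</think>` is the Qwen3 convention, and the `[THINK]` pair is an assumed second entry, not a verified list of what other models use.

```rust
/// Illustrative (open, close) conventions; the second pair is an assumption.
const KNOWN_MARKERS: &[(&str, &str)] = &[("<think>", "</think>"), ("[THINK]", "[/THINK]")];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReasoningMarkers {
    pub open: u32,
    pub close: u32,
}

/// Run once at load over (content, id) added tokens; `None` = non-reasoning model.
pub fn probe_markers(added_tokens: &[(&str, u32)]) -> Option<ReasoningMarkers> {
    let id_of = |content: &str| {
        added_tokens
            .iter()
            .find(|(c, _)| *c == content)
            .map(|&(_, id)| id)
    };
    KNOWN_MARKERS
        .iter()
        .find_map(|&(open, close)| Some(ReasoningMarkers { open: id_of(open)?, close: id_of(close)? }))
}

/// Hypothetical events mirroring the TextDelta / ReasoningDelta split.
#[derive(Debug, Clone, PartialEq)]
pub enum Delta {
    Text(String),
    Reasoning(String),
}

/// Token-level state machine with no model-specific knowledge.
pub struct Router {
    markers: Option<ReasoningMarkers>,
    in_reasoning: bool,
}

impl Router {
    pub fn new(markers: Option<ReasoningMarkers>) -> Self {
        Self { markers, in_reasoning: false }
    }

    /// Marker tokens only flip state; every other token becomes a delta.
    pub fn route(&mut self, token_id: u32, text: &str) -> Option<Delta> {
        if let Some(m) = self.markers {
            if token_id == m.open {
                self.in_reasoning = true;
                return None;
            }
            if token_id == m.close {
                self.in_reasoning = false;
                return None;
            }
        }
        let text = text.to_string();
        Some(if self.in_reasoning { Delta::Reasoning(text) } else { Delta::Text(text) })
    }
}

/// Chat-completions projection: no reasoning slot on the wire, so drop it
/// unless the caller opted in.
pub fn project_chat(delta: Delta, include_reasoning: bool) -> Option<String> {
    match delta {
        Delta::Text(t) => Some(t),
        Delta::Reasoning(t) if include_reasoning => Some(t),
        Delta::Reasoning(_) => None,
    }
}
```

Nothing in the routing loop names a model. Supporting another convention is one more row in the table.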

That's #8. Companion is #9 (pass chat_template_kwargs through to the chat template at tokenisation), which gives clients a request-side lever to suppress thinking at generation time — also model-agnostic since neuron doesn't interpret the kwarg, just forwards it.

The Responses-API mapping (ReasoningDelta → response.reasoning_summary_text.delta) is still worth doing eventually but only matters once a Responses-API consumer of cortex exists; tracking under #7's reasoning sub-bullet rather than as a separate issue today.

sunday, may 31, 2026 — 14:42:28 utc

grenade pushed 1 commit to helexa/cortex:main

fdc0adb docs(helexa-acp): README + example config for end-user onboarding

sunday, may 31, 2026 — 11:26:00 utc

grenade pushed 1 commit to helexa/cortex:main

8fa1d19 feat(helexa-acp): anthropic-messages provider

sunday, may 31, 2026 — 11:02:03 utc

grenade pushed 1 commit to helexa/cortex:main

cad7552 ci: clear sccache env on cuda-check so cargo doesn't try to wrap rustc

sunday, may 31, 2026 — 10:55:22 utc

grenade pushed 1 commit to helexa/cortex:main

1818dfb feat(helexa-acp): openai-responses provider

sunday, may 31, 2026 — 08:30:28 utc

grenade pushed 1 commit to helexa/cortex:main

5ed1140 feat(cortex-gateway): proxy /v1/responses to neuron

sunday, may 31, 2026 — 08:21:47 utc

grenade opened issue #7 in helexa/cortex

Responses API: emit `response.in_progress` and built-in-tool event families

sunday, may 31, 2026 — 08:19:37 utc

grenade opened issue #6 in helexa/cortex

Responses API: synthesise `function_call` output items from tool-call deltas

sunday, may 31, 2026 — 08:19:16 utc

grenade opened issue #5 in helexa/cortex

Responses API: surface Qwen3 `<think>` blocks as reasoning items

sunday, may 31, 2026 — 08:18:54 utc

grenade opened issue #4 in helexa/cortex

Responses API: implement `previous_response_id` chained conversations

sunday, may 31, 2026 — 08:18:29 utc

grenade pushed 1 commit to helexa/cortex:main

957f704 feat(neuron): OpenAI Responses API + ci cuda-check runner label

sunday, may 31, 2026 — 08:13:47 utc

grenade pushed 1 commit to helexa/cortex:main

1859777 ci: add cuda type-check job so CUDA-only borrowck errors fail fast

sunday, may 31, 2026 — 06:49:55 utc

grenade commented on #3 in helexa/cortex


Versions observed against

Pinning down the deployed versions when this was first reported, so future debugging can establish whether intervening changes affect the repro:

cortex 0.1.16-0.1.20260527185748.git249b2e5.fc43 on hanzalova
helexa-neuron-blackwell 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on beast (commit df0abfe)
helexa-neuron-ada 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on benjy
helexa-neuron-ampere 0.1.16-0.1.20260527185748.git249b2e5.fc43 on quadbrat (note: older build, since upgraded to gitdf0abfe)

For helexa-acp callers: this bug predates the InferenceEvent refactor on main (commit 302ccfb, pushed 2026-05-29) and is unrelated to it — the refactor only touched the streaming output path, not the request-parsing / image-ingest path. The next deploy off a build that includes 302ccfb (or later) should reproduce identically until we actually wire image ingest through to the candle harness.

Relevant code paths to inspect during triage:

crates/cortex-gateway/src/handlers.rs::chat_completions — does the gateway preserve content arrays verbatim when proxying, or flatten them?
crates/neuron/src/harness/candle.rs::format_qwen3_prompt — currently does MessageContent::Parts(parts) => parts.iter().filter(text-only).join(...), which would silently drop image_url parts.
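This sketch illustrates the second triage point: a text-only flatten over content parts loses `image_url` parts without any signal. It uses hypothetical `MessageContent` and `Part` types that only loosely mirror the names above. The join separator in the original is elided, so a newline is assumed here. It also adds a checked variant that counts the discarded parts, so a caller could log or reject the request instead of quietly answering a text-only prompt.

```rust
/// Hypothetical, simplified request content types.
#[derive(Debug, Clone)]
pub enum Part {
    Text(String),
    ImageUrl(String),
}

#[derive(Debug, Clone)]
pub enum MessageContent {
    Text(String),
    Parts(Vec<Part>),
}

/// Text-only flattening: any non-text part silently disappears.
/// ("\n" as separator is an arbitrary choice for this sketch.)
pub fn flatten_lossy(content: &MessageContent) -> String {
    match content {
        MessageContent::Text(t) => t.clone(),
        MessageContent::Parts(parts) => parts
            .iter()
            .filter_map(|p| match p {
                Part::Text(t) => Some(t.as_str()),
                Part::ImageUrl(_) => None,
            })
            .collect::<Vec<_>>()
            .join("\n"),
    }
}

/// Same text, plus a count of discarded parts so the loss is observable.
pub fn flatten_checked(content: &MessageContent) -> (String, usize) {
    let dropped = match content {
        MessageContent::Text(_) => 0,
        MessageContent::Parts(parts) => parts.iter().filter(|p| !matches!(p, Part::Text(_))).count(),
    };
    (flatten_lossy(content), dropped)
}
```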

sunday, may 31, 2026 — 06:49:41 utc

grenade pushed 1 commit to helexa/cortex:main

6927286 fix(neuron): clone id/model_id before TP spawn so wire projector can use them

sunday, may 31, 2026 — 06:37:13 utc

grenade opened issue #3 in helexa/cortex

Image content (`image_url`) is dropped — multimodal chat requests are processed as text-only

sunday, may 31, 2026 — 06:30:38 utc

grenade pushed 1 commit to helexa/cortex:main

302ccfb refactor(neuron): introduce InferenceEvent + wire projection layer

friday, may 29, 2026 — 08:30:21 utc

grenade pushed 1 commit to helexa/cortex:main

df0abfe feat(helexa-acp): image input for vision-capable models

friday, may 29, 2026 — 06:43:03 utc

grenade pushed 1 commit to helexa/cortex:main

b901657 feat(helexa-acp): expand ~ / $HOME and fall back to local fs on ACP read errors

friday, may 29, 2026 — 06:29:02 utc

grenade pushed 1 commit to helexa/cortex:main

adbc52b feat(helexa-acp): model picker + session/set_model handler

friday, may 29, 2026 — 06:10:20 utc

grenade pushed 1 commit to helexa/cortex:main

537a0fe feat(helexa-acp): context compaction for small-context local models

friday, may 29, 2026 — 05:22:06 utc

grenade pushed 1 commit to helexa/cortex:main

cbadfcf feat(helexa-acp): plan mode — third session mode for read-and-plan-only flows

friday, may 29, 2026 — 05:06:32 utc

grenade pushed 1 commit to helexa/cortex:main

3ecbb21 fix(helexa-acp): persist per round, cancel previous prompt, log loop

thursday, may 28, 2026 — 13:29:26 utc

grenade pushed 1 commit to helexa/cortex:main

0d841a4 feat(helexa-acp): replay session history on session/load

thursday, may 28, 2026 — 13:02:05 utc

grenade pushed 1 commit to helexa/cortex:main

0bbb9b7 feat(helexa-acp): session/list so Zed can discover sessions to resume

thursday, may 28, 2026 — 11:34:46 utc

grenade pushed 1 commit to helexa/cortex:main

5aac1ff feat(helexa-acp): session resume via session/load

thursday, may 28, 2026 — 10:34:46 utc

grenade pushed 1 commit to helexa/cortex:main

ec2b645 feat(helexa-acp): infer tool name from arg shape when model omits it

thursday, may 28, 2026 — 10:14:53 utc

grenade pushed 1 commit to helexa/cortex:main

a494c8d feat(helexa-acp): repair malformed tool calls and render failures as cards

thursday, may 28, 2026 — 09:58:55 utc

grenade pushed 1 commit to helexa/cortex:main

abbedf8 chore(neuron): bump default max_tokens from 512 to 8192

thursday, may 28, 2026 — 09:38:32 utc

grenade pushed 1 commit to helexa/cortex:main

6cc14e9 feat(helexa-acp): per-endpoint max_tokens config

thursday, may 28, 2026 — 09:34:27 utc

grenade pushed 1 commit to helexa/cortex:main

1c16732 feat(helexa-acp): route Qwen3 inline <think> blocks to reasoning

thursday, may 28, 2026 — 09:30:29 utc

grenade pushed 1 commit to helexa/cortex:main

5a0861d fix(helexa-acp): forward Dispatch::Response to its awaiting router

thursday, may 28, 2026 — 09:16:26 utc

grenade pushed 1 commit to helexa/cortex:main

33652ac feat(helexa-acp): HELEXA_ACP_LOG_FILE env for editor-host logging

thursday, may 28, 2026 — 08:47:32 utc

grenade pushed 1 commit to helexa/cortex:main

c297a54 chore(helexa-acp): log raw bash output and tool result snippets

thursday, may 28, 2026 — 08:15:29 utc

grenade pushed 1 commit to helexa/cortex:main

0121a19 feat(helexa-acp): inject and parse Qwen3 Hermes tool format

thursday, may 28, 2026 — 08:06:41 utc

grenade pushed 1 commit to helexa/cortex:main

13f4c36 chore(helexa-acp): log outgoing chat-completion body at debug level

thursday, may 28, 2026 — 07:38:14 utc

grenade pushed 1 commit to helexa/cortex:main

4a51a54 fix(helexa-acp): describe Stage 3 tools in the default system prompt

thursday, may 28, 2026 — 07:33:20 utc

grenade pushed 1 commit to helexa/cortex:main

0609f1a feat(helexa-acp): add tools, session modes, and permission gating

thursday, may 28, 2026 — 07:01:41 utc

grenade pushed 1 commit to helexa/cortex:main

96fc379 feat(helexa-acp): wire ACP agent loop for text-only conversations

thursday, may 28, 2026 — 06:46:35 utc

grenade pushed 2 commits to helexa/cortex:main

e267f58 chore(neuron): rustfmt drift in is_device_fault test
e23d501 feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat

thursday, may 28, 2026 — 05:13:59 utc

grenade pushed 2 commits to helexa/cortex:main

249b2e5 fix(neuron): only poison the model on actual device faults
c59da83 fix(neuron): serialise single-GPU inference per loaded model

wednesday, may 27, 2026 — 15:57:51 utc

grenade opened issue #2 in helexa/cortex

Research: Prototype a quantization kernel in pure Rust via cuda-oxide

wednesday, may 27, 2026 — 15:30:16 utc

grenade pushed 2 commits to helexa/cortex:main

f058823 fix(neuron): don't poison the model on tokio JoinError panics
bd04d7f fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic

wednesday, may 27, 2026 — 15:02:55 utc

grenade pushed 1 commit to helexa/cortex:main

1e13889 feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks

wednesday, may 27, 2026 — 13:49:57 utc

grenade opened issue #1 in helexa/cortex

Reduce TP=2 Q6K cold-load time for Qwen3.6-27B (~5 min today)

wednesday, may 27, 2026 — 10:31:42 utc

grenade pushed 1 commit to helexa/cortex:main

6e1c1dd ci: retry clippy + test up to 3 times on spurious sccache failures

wednesday, may 27, 2026 — 10:13:18 utc

grenade pushed 1 commit to helexa/cortex:main

3587695 chore(neuron): default tracing filter to info (was info,neuron=debug)

wednesday, may 27, 2026 — 09:47:34 utc

grenade pushed 2 commits to helexa/cortex:main

740299b chore(neuron/beast): switch default-model quant from q5k to q6k
cdf0f4e fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM

wednesday, may 27, 2026 — 09:36:25 utc

grenade pushed 1 commit to helexa/cortex:main

c4954e0 docs: per-device worker thread architecture (phase 5 of refactor)

wednesday, may 27, 2026 — 08:15:48 utc

grenade pushed 1 commit to helexa/cortex:main

b4f3576 refactor(neuron): phase 4 — model loads move onto the device worker

wednesday, may 27, 2026 — 07:24:42 utc

grenade pushed 1 commit to helexa/cortex:main

76ab24d refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker

wednesday, may 27, 2026 — 07:16:07 utc

grenade pushed 1 commit to helexa/cortex:main

b179204 refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker

wednesday, may 27, 2026 — 06:55:11 utc

grenade pushed 1 commit to helexa/cortex:main

081b532 refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it

wednesday, may 27, 2026 — 06:40:38 utc

grenade pushed 1 commit to helexa/cortex:main

7c19da9 feat(neuron): construction-complete vram/config dump + logits health + per-step vram

wednesday, may 27, 2026 — 06:05:00 utc

grenade pushed 1 commit to helexa/cortex:main

24e20dc feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)

tuesday, may 26, 2026 — 13:10:45 utc

grenade pushed 2 commits to helexa/cortex:main

becf61b feat(script): validate-neuron.sh waits for /health activation=ready
b9e7a76 feat(gateway): surface mid-prewarm models as Loading on /v1/models

tuesday, may 26, 2026 — 12:26:25 utc

grenade pushed 1 commit to helexa/cortex:main

800498f feat(neuron): bind listener before pre-warm, surface activation in /health

tuesday, may 26, 2026 — 12:18:07 utc

grenade pushed 1 commit to helexa/cortex:main

d3f2d50 feat(deploy): per-host neuron config + pre-warm headline models

tuesday, may 26, 2026 — 11:05:57 utc

grenade pushed 1 commit to helexa/cortex:main

2740e61 fix(neuron,candle): name lifetime on acquire_pool_lock

tuesday, may 26, 2026 — 09:37:36 utc

grenade pushed 5 commits to helexa/cortex:main

67f79c8 fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain
fc6ef0e feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups
1385979 feat(neuron,candle): log per-device VRAM at chat_completion start
0a1cfcd feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings
ea0e0f7 fix(neuron,tp): log leader forward errors with full context

tuesday, may 26, 2026 — 09:30:17 utc

grenade pushed 1 commit to helexa/cortex:main

aa88d37 fix(gateway): full observability + stop leaking upstream bodies

friday, may 22, 2026 — 04:17:29 utc

grenade pushed 1 commit to helexa/cortex:main

0f00f72 fix(router,handlers): strip trailing slash from rewritten URL + log upstream failures

friday, may 22, 2026 — 04:10:44 utc

grenade pushed 1 commit to helexa/cortex:main

9b0ed0b fix(router): rewrite loopback inference URLs to use neuron's host

friday, may 22, 2026 — 03:23:52 utc

grenade pushed 1 commit to helexa/cortex:main

dc2a803 fix(rpm): migrate legacy helexa-cortex firewalld service to `cortex`

friday, may 22, 2026 — 03:12:54 utc

grenade pushed 1 commit to helexa/cortex:main

e711814 feat(stage-8e-3): quantize lm_head in TP Qwen3-Next

thursday, may 21, 2026 — 18:53:18 utc

grenade pushed 1 commit to helexa/cortex:main

ee663e5 fix(stage-8e-2e): bump quant prefill threshold to M > 64

thursday, may 21, 2026 — 18:50:49 utc

grenade pushed 1 commit to helexa/cortex:main

34f9b77 feat(stage-8e-2d): route quantized matmul by M (prefill vs decode)

thursday, may 21, 2026 — 18:15:36 utc

grenade pushed 1 commit to helexa/cortex:main

f084aaa fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul

thursday, may 21, 2026 — 17:05:23 utc

grenade pushed 1 commit to helexa/cortex:main

68a606a fix(stage-8e-2b): allow quant on the TP load path

thursday, may 21, 2026 — 16:17:17 utc

grenade pushed 1 commit to helexa/cortex:main

4aa7190 feat(stage-8e-2): plumb quant config from ModelSpec to TP load path

thursday, may 21, 2026 — 15:03:41 utc

grenade pushed 1 commit to helexa/cortex:main

bef159b feat(stage-8e-1): MaybeQuantLinear primitive + parallel-linear quant variants

thursday, may 21, 2026 — 14:55:29 utc

grenade pushed 1 commit to helexa/cortex:main

8d7b099 feat(stage-8d-7): direct safetensors fused-region loader

thursday, may 21, 2026 — 14:49:38 utc

grenade pushed 1 commit to helexa/cortex:main

89d98d1 diag(stage-8d-6): per-layer VRAM logging in TP load path

thursday, may 21, 2026 — 09:54:08 utc

grenade pushed 1 commit to helexa/cortex:main

cc95fe2 feat(stage-8d-5b): wire fused_gdn_gating CUDA kernel

thursday, may 21, 2026 — 08:52:41 utc

grenade pushed 1 commit to helexa/cortex:main

09c945f feat(stage-8d-4): dispatch chunked_gated_delta_rule_recurrence at prefill

thursday, may 21, 2026 — 08:50:34 utc

grenade pushed 1 commit to helexa/cortex:main

05dc0ba feat(stage-8d-3): wire causal_conv1d_update/full CUDA kernels

thursday, may 21, 2026 — 08:49:45 utc

grenade pushed 1 commit to helexa/cortex:main

10c151e feat(stage-8d-5): wire gated_delta_rule_recurrence kernel into tp_qwen3_5

thursday, may 21, 2026 — 08:44:18 utc

grenade pushed 1 commit to helexa/cortex:main

44ae927 feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5

thursday, may 21, 2026 — 08:39:35 utc

grenade pushed 1 commit to helexa/cortex:main

1ebbe87 feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only

thursday, may 21, 2026 — 08:34:15 utc

grenade pushed 1 commit to helexa/cortex:main

70eb6af feat(tp): cancellation-safe inference + structured tracing

thursday, may 21, 2026 — 05:22:05 utc

grenade pushed 1 commit to helexa/cortex:main

d1a4aad fix(tp): always drain worker responses on leader failure

thursday, may 21, 2026 — 04:39:41 utc

grenade pushed 1 commit to helexa/cortex:main

95dc874 feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)

wednesday, may 20, 2026 — 19:02:45 utc

grenade pushed 1 commit to helexa/cortex:main

495d3f7 fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule

wednesday, may 20, 2026 — 18:13:23 utc

grenade pushed 1 commit to helexa/cortex:main

5c4c8e0 fix(qwen3_5): tensor names are under `model.language_model.*`, not `model.*`

wednesday, may 20, 2026 — 13:48:19 utc

grenade pushed 1 commit to helexa/cortex:main

a77f196 fix(qwen3_5): tensor names are under `model.language_model.*`, not `model.*`

wednesday, may 20, 2026 — 13:47:54 utc

grenade pushed 1 commit to helexa/cortex:main

07c44d5 fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25

wednesday, may 20, 2026 — 13:18:55 utc

grenade pushed 1 commit to helexa/cortex:main

e7eb3da feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5

wednesday, may 20, 2026 — 12:52:38 utc

grenade pushed 1 commit to helexa/cortex:main

1802745 feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet)

wednesday, may 20, 2026 — 06:29:56 utc

grenade pushed 1 commit to helexa/cortex:main

a70f317 feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate

wednesday, may 20, 2026 — 05:58:04 utc

grenade pushed 1 commit to helexa/cortex:main

c6022aa feat(stage-8b): Llama + Qwen3 MoE families on the candle harness

wednesday, may 20, 2026 — 05:36:26 utc

grenade pushed 1 commit to helexa/cortex:main

9e31d8d feat(stage-8a): pre-flight architecture check for dense model loads

wednesday, may 20, 2026 — 05:27:33 utc

grenade pushed 1 commit to helexa/cortex:main

b400e8b feat(neuron): honour HF_HUB_CACHE / HF_HOME for the candle harness cache

wednesday, may 20, 2026 — 04:52:54 utc

grenade pushed 1 commit to helexa/cortex:main

62ca125 chore: keep models.example.toml generic; deploy.sh sync's local models.toml

wednesday, may 20, 2026 — 04:47:13 utc

grenade pushed 1 commit to helexa/cortex:main

735945e feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load

wednesday, may 20, 2026 — 04:39:08 utc

grenade pushed 1 commit to helexa/cortex:main

f72dee0 feat(tp): Stage 7c-i — streaming SSE through TP

wednesday, may 20, 2026 — 04:32:50 utc

grenade pushed 1 commit to helexa/cortex:main

d46d8d4 feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference

wednesday, may 20, 2026 — 03:38:36 utc

grenade pushed 1 commit to helexa/cortex:main

9b8bd14 feat(tp): --tp-smoke CLI subcommand + remote validation script

tuesday, may 19, 2026 — 16:40:29 utc

grenade pushed 1 commit to helexa/cortex:main

96d8755 fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc

tuesday, may 19, 2026 — 16:12:03 utc

grenade pushed 2 commits to helexa/cortex:main

12549c9 fix(tp): import BackendStorage trait for CudaStorage methods
46527d7 feat(tp): TP-aware Qwen3 dense model (Stage 7b-iii 2/2)

tuesday, may 19, 2026 — 15:32:11 utc

grenade pushed 1 commit to helexa/cortex:main

8d3194f Stage 7b-iii (1/2): AllReduce CustomOp + ShardedVarBuilder-backed TP linears

tuesday, may 19, 2026 — 15:14:58 utc

grenade pushed 1 commit to helexa/cortex:main

5436af9 fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze

tuesday, may 19, 2026 — 14:49:46 utc

grenade pushed 1 commit to helexa/cortex:main

8e882c0 fix(neuron/tp): NcclError {e:?} + cudarc 0.19 deprecation cleanup

tuesday, may 19, 2026 — 14:24:17 utc

grenade pushed 1 commit to helexa/cortex:main

93421f4 Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives

tuesday, may 19, 2026 — 14:07:23 utc

grenade pushed 1 commit to helexa/cortex:main

05e15f3 Stage 7b-i: dense safetensors Qwen3 load path

tuesday, may 19, 2026 — 14:04:04 utc

grenade pushed 1 commit to helexa/cortex:main

da068de Stage 7a-ii: real NCCL handshake behind the worker pool

tuesday, may 19, 2026 — 13:40:04 utc

grenade pushed 1 commit to helexa/cortex:main

2a7ede0 Stage 7a-i: TP worker lifecycle scaffolding

tuesday, may 19, 2026 — 12:53:04 utc

grenade pushed 1 commit to helexa/cortex:main

18ae3c3 post-validation cleanup: cuDNN runtime + repetition penalty

tuesday, may 19, 2026 — 11:48:12 utc

grenade pushed 1 commit to helexa/cortex:main

1a04001 fix(deploy): use dnf upgrade for stale installs, install only when absent

tuesday, may 19, 2026 — 11:10:52 utc

grenade pushed 2 commits to helexa/cortex:main

1866b99 fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens
60176e7 ci: monotonic prerelease versions + serialize CI on shared runner

tuesday, may 19, 2026 — 10:43:07 utc

grenade pushed 1 commit to helexa/cortex:main

602e8e1 fix(neuron/candle): source tokenizer.json from base repo when GGUF

tuesday, may 19, 2026 — 10:16:43 utc

grenade pushed 2 commits to helexa/cortex:main

e9d0a75 ci(prerelease): auto-build on every push to main
6cf87e3 chore(neuron): log load_model failures server-side with full chain

tuesday, may 19, 2026 — 10:13:40 utc

grenade pushed 3 commits to helexa/cortex:main

f9f5fa4 fix(neuron): surface full anyhow chain + ensure $HOME exists at start
ed4d71d fix(validate-neuron): default to unsloth GGUF + capture curl errors
39010c7 add script/validate-neuron.sh — end-to-end candle harness smoke test

tuesday, may 19, 2026 — 05:20:46 utc

grenade pushed 1 commit to helexa/cortex:main

57d7ef8 chore: revert dnf. runner user has no system privs

tuesday, may 19, 2026 — 04:16:44 utc

grenade pushed 4 commits to helexa/cortex:main

0e9671d fix(ci): drop sudo from dnf install (runner runs as root, no sudo)
e29c9e3 fix(ci): ensure rust toolchain present on cuda-13.0 runner
8a2334e deploy: dnf-native version check + lair.cafe repo bootstrap
aad314c feat(neuron): graceful unload-on-shutdown via SIGTERM/SIGINT

tuesday, may 19, 2026 — 04:08:33 utc

grenade pushed 2 commits to helexa/cortex:main

6779b75 feat(neuron): load default_models on service activation
84f5662 feat(neuron): OpenAI-compatible SSE streaming chat completions

monday, may 18, 2026 — 14:56:43 utc

grenade pushed 2 commits to helexa/cortex:main

249c944 chore: track deployment script
5e17081 ci(prerelease): drop redundant rustup install step

monday, may 18, 2026 — 14:50:40 utc

grenade pushed 3 commits to helexa/cortex:main

03bed93 add asset/manifest.yml describing fleet hosts and neuron flavours
4a5211d ci(prerelease): add ampere flavour alongside ada and blackwell
6d2dc5f fix(ci): give fmt/clippy/test distinct CARGO_TARGET_DIR to avoid races

monday, may 18, 2026 — 14:40:59 utc

grenade pushed 1 commit to helexa/cortex:main

b713dbe fix(ci): pass GPG secrets via env to avoid Gitea log leakage

monday, may 18, 2026 — 14:20:43 utc

grenade pushed 4 commits to helexa/cortex:main

5c957d0 ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe
729317d feat(neuron): OpenAI-compatible non-streaming chat completion
5c2bd1a feat(neuron): wire candle harness load/unload via GGUF
3cccc2c refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness

monday, may 18, 2026 — 14:07:10 utc

grenade pushed 1 commit to helexa/cortex:main

7f797b0 ci: parallelise fmt/clippy/test and drop sccache install step

monday, may 11, 2026 — 10:55:22 utc

grenade pushed 1 commit to helexa/cortex:main

5a0360c ci: use container runner labels for CI jobs

monday, may 11, 2026 — 10:29:46 utc

grenade pushed 1 commit to helexa/cortex:main

472c0e8 fix(rpm): ship firewalld service definitions with correct ports

monday, may 11, 2026 — 09:54:04 utc

grenade pushed 3 commits to helexa/cortex:main

25f75fe chore: ignore local deploy script
3f94c50 chore: move default ports out of common-collision ranges
3e1fb60 ci: drop actions/cache for cargo registry and target

thursday, april 16, 2026 — 14:45:43 utc

grenade tagged v0.1.16 in helexa/cortex

thursday, april 16, 2026 — 14:45:43 utc

grenade tagged v0.1.15 in helexa/cortex

thursday, april 16, 2026 — 14:44:30 utc

grenade tagged v0.1.14 in helexa/cortex

thursday, april 16, 2026 — 13:38:15 utc

grenade pushed 3 commits to helexa/cortex:main

abe4ff7 ci: publish both packages to a single helexa/helexa COPR project
7c3390a fix(rpm): rename neuron package to helexa-neuron
2ff062d ci: commit generated %changelog entries back to main

thursday, april 16, 2026 — 13:38:14 utc

grenade tagged v0.1.12 in helexa/cortex

thursday, april 16, 2026 — 12:33:05 utc

grenade pushed 4 commits to helexa/cortex:main

556e529 fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires
1d90238 ci: migrate rpm changelog generation to reusable action
d99b25f ci: auto-generate rpm changelog entry per release
034da31 fix(rpm): correct weekday in changelog entry

thursday, april 16, 2026 — 12:33:04 utc

grenade tagged v0.1.11 in helexa/cortex

thursday, april 16, 2026 — 12:32:25 utc

grenade tagged v0.1.10 in helexa/cortex

thursday, april 16, 2026 — 11:51:59 utc

grenade pushed 1 commit to helexa/cortex:main

3bb5b3c fix(rpm): drop %attr(,,user) on config files to avoid dnf silent filter

thursday, april 16, 2026 — 11:51:59 utc

grenade tagged v0.1.9 in helexa/cortex

thursday, april 16, 2026 — 11:50:09 utc

grenade tagged v0.1.8 in helexa/cortex

thursday, april 16, 2026 — 10:41:03 utc

grenade pushed 1 commit to helexa/cortex:main

9697fba fix(neuron): run service as neuron user, not cortex

thursday, april 16, 2026 — 10:41:02 utc

grenade tagged v0.1.7 in helexa/cortex

thursday, april 16, 2026 — 10:08:11 utc

grenade pushed 1 commit to helexa/cortex:main

142e91c fix(neuron): install config at /etc/neuron/, not /etc/cortex/

thursday, april 16, 2026 — 10:07:18 utc

grenade tagged v0.1.6 in helexa/cortex

thursday, april 16, 2026 — 10:05:17 utc

grenade tagged v0.1.5 in helexa/cortex

thursday, april 16, 2026 — 09:48:29 utc

grenade pushed 1 commit to helexa/cortex:main

4a9a4fc ci: migrate copr publish to reusable action

thursday, april 16, 2026 — 09:48:28 utc

grenade pushed 2 commits to helexa/cortex:main

53a3c1e fix(rpm): explicitly Provides user(cortex)/group(cortex)
5c7d63c ci: dump COPR per-chroot build logs to CI output

thursday, april 16, 2026 — 09:06:12 utc

grenade tagged v0.1.4 in helexa/cortex

thursday, april 16, 2026 — 09:05:09 utc

grenade tagged v0.1.3 in helexa/cortex

thursday, april 16, 2026 — 08:21:26 utc

grenade pushed 2 commits to helexa/cortex:main

ba50201 fix(rpm): rename sysusers files to match package names
2091507 fix(rpm): use sysusers.d for cortex user/group creation

thursday, april 16, 2026 — 08:21:26 utc

grenade tagged v0.1.2 in helexa/cortex

thursday, april 16, 2026 — 07:50:26 utc

grenade pushed 2 commits to helexa/cortex:main

ada76b0 fix(rpm): add missing native build dependencies
15ded3a ci: cache target/, disable incremental, drop redundant build

thursday, april 16, 2026 — 07:50:25 utc

grenade tagged v0.1.1 in helexa/cortex

thursday, april 16, 2026 — 06:26:55 utc

grenade pushed 1 commit to helexa/cortex:main

7befa88 fix: yaml syntax

thursday, april 16, 2026 — 06:26:55 utc

grenade pushed 1 commit to helexa/cortex:main

d03fae9 fix(ci): unset RUSTC_WRAPPER during sccache install

thursday, april 16, 2026 — 05:31:34 utc

grenade pushed 1 commit to helexa/cortex:main

7b2235d fix(ci): install sccache with S3 feature if missing

wednesday, april 15, 2026 — 14:44:33 utc

grenade pushed 1 commit to helexa/cortex:main

54f9f3d ci: add sccache with MinIO backend for build caching

wednesday, april 15, 2026 — 14:39:03 utc

grenade tagged v0.1.0 in helexa/cortex

wednesday, april 15, 2026 — 13:36:13 utc

grenade pushed 3 commits to helexa/cortex:main

caee8bb fix(ci): use GITEA_TOKEN env var for push, not checkout
324dfa0 ci: add RPM packaging for cortex and neuron
c85d500 ci: add RPM packaging for cortex and neuron

wednesday, april 15, 2026 — 13:36:13 utc

grenade pushed 1 commit to helexa/cortex:main

6c238f4 refactor: rename cortex-neuron binary and crate to neuron

wednesday, april 15, 2026 — 13:01:45 utc

grenade pushed 1 commit to helexa/cortex:main

e42e8ee refactor: cortex talks to neurons instead of mistral.rs directly

wednesday, april 15, 2026 — 11:46:08 utc

grenade pushed 1 commit to helexa/cortex:main

26e5e7e feat: implement mistral.rs harness and neuron model API

wednesday, april 15, 2026 — 11:29:47 utc

grenade pushed 1 commit to helexa/cortex:main

6dc717e feat: add neuron daemon with GPU discovery and health endpoints

wednesday, april 15, 2026 — 11:25:11 utc

grenade pushed 2 commits to helexa/cortex:main

67b9b04 feat: add per-request Prometheus metrics instrumentation
29c8f10 feat: implement non-streaming Anthropic response translation

tuesday, april 14, 2026 — 16:42:43 utc

grenade pushed 1 commit to helexa/cortex:main

24c5e1e feat: add LRU eviction tests and last_accessed tracking

tuesday, april 14, 2026 — 16:36:10 utc

grenade pushed 1 commit to helexa/cortex:main

d5f19b9 test: add Phase 3 poller integration tests

tuesday, april 14, 2026 — 16:32:03 utc

grenade pushed 1 commit to helexa/cortex:main

c2118aa test: add Phase 2 streaming SSE passthrough tests

tuesday, april 14, 2026 — 16:29:05 utc

grenade pushed 1 commit to helexa/cortex:main

1b339b1 test: add Phase 1 integration tests for basic proxy

tuesday, april 14, 2026 — 16:27:38 utc

grenade pushed 1 commit to helexa/cortex:main

3ad8c72 docs: add CI expectations to CLAUDE.md and README.md

tuesday, april 14, 2026 — 15:27:52 utc

grenade pushed 1 commit to helexa/cortex:main

6bb3004 ci: add Gitea CI, RPM spec, license, and repo hygiene

tuesday, april 14, 2026 — 15:24:11 utc

grenade pushed to helexa/cortex:main

tuesday, april 14, 2026 — 15:13:49 utc

grenade created helexa/cortex

tuesday, april 14, 2026 — 14:51:44 utc