When we announced that Neureus would always price LLM tokens 10% below OpenRouter, the most common question was: how? OpenRouter is already routing at near-cost margins. Where does the margin come from?
The answer is a 4-pass prompt preprocessor that runs on every /ai/chat request before it hits the provider.
The billing invariant
Here’s the deal: providers bill us for the tokens they process. We bill users for 90% of their original (uncompressed) token count. The difference — what the preprocessor saves — is our margin on token costs.
In practice, the preprocessor consistently reduces token count by 10–30%. On a 10K token request, that’s 1,000–3,000 tokens we never send to the provider. At $5/1M tokens (GPT-4o), that’s $0.005–$0.015 saved per request. At scale, those numbers add up.
This means:
- Users pay 10% below OpenRouter on every request, regardless of what the preprocessor saves
- Neureus’s margin scales with preprocessing effectiveness — the more it saves, the better our margin
- Providers get cleaner, more structured prompts (which often improves response quality too)
The 4 passes
Pass 1: Normalize
Trim leading/trailing whitespace, normalize Windows line endings (CRLF → LF), collapse multiple consecutive blank lines into one, remove zero-width characters (common in copy-pasted content).
This is purely cosmetic cleanup. Impact: small (1–3% typically), but it’s free savings with zero quality impact.
Input: "Hello \r\n\r\n\r\nWorld" (zero-width chars, CRLF, triple blank line)
Output: "Hello\n\nWorld"
Pass 2: Structure
Move system messages to index 0. Providers require system messages to come first in the messages array; sending them elsewhere is either an error or silently ignored. Many clients get this wrong, especially when building dynamic conversation history.
Beyond correctness, some providers (Anthropic especially) behave better with a well-structured system prompt than with instructions buried in user turns.
Impact: zero direct token savings, but prevents malformed requests that cause expensive retries.
Pass 3: Trim
When the estimated token count exceeds 6,000: keep the system message and the last 4 non-system turns; replace the middle of the conversation with a compact summary message.
Before trim (12K tokens):
[system]: You are a helpful assistant...
[user]: (turn 1)
[assistant]: (turn 1)
[user]: (turn 2)
[assistant]: (turn 2)
... (turns 3–8)
[user]: (turn 9 — current)
After trim (4K tokens):
[system]: You are a helpful assistant...
[user]: [Earlier conversation summary: The user asked about X, Y, Z. Key decisions made: A, B, C.]
[user]: (turn 7)
[assistant]: (turn 7)
[user]: (turn 8)
[assistant]: (turn 8)
[user]: (turn 9 — current)
Impact: 30–60% reduction on long conversations. This is the biggest single saver.
The summary message is generated inline (no additional LLM call — we use a lightweight rule-based summary). The last 4 turns are kept in full to preserve recent context.
Pass 4: Compress (opt-in)
LLM compression via Llama 3.1 8B: send your message content to a small, fast model and ask it to compress the text while preserving meaning. Only activates when you send x-neureus-options: compress=true.
Input (user message): "I am writing to inquire about the possibility of scheduling a meeting at
your earliest convenience to discuss the matter of the proposed changes to the software
architecture that were outlined in the document that was shared with the team last Tuesday."
Output: "Can we meet to discuss the architecture changes from last Tuesday's doc?"
200ms timeout — if compression takes longer, we fall back to the original message silently.
Impact: 20–40% reduction on verbose inputs. Pairs well with Pass 3 for long conversations with wordy prompts.
The billing math
Say you send a 10,000 token request (common for RAG context + history):
- Pass 1 (normalize): saves ~200 tokens → 9,800 tokens
- Pass 2 (structure): no savings, better structure
- Pass 3 (trim, if >6K): saves ~3,000 tokens → 6,800 tokens
- Pass 4 (compress, opt-in): saves ~1,200 tokens → 5,600 tokens
Provider billing: 5,600 tokens at $5/1M = $0.028 Your billing: 9,000 tokens (90% of original 10K) at $4.50/1M = $0.0405
Our margin on this request: $0.0405 - $0.028 = $0.0125
At 10M such requests per month: ~$125K margin from preprocessing alone. That funds model costs, infrastructure, and the engineering team.
Why this works better than just marking up
The alternative would be to simply charge 10% above provider cost and route directly. That works, but:
- It offers no value beyond convenience. Users are paying for routing, not savings.
- Margins are fully correlated with provider price changes — no buffer.
- There’s no compounding benefit as our models improve.
The preprocessor approach means our margin improves as we improve the preprocessor. Every quality improvement in Pass 4’s compression model is directly accretive to margin. It aligns our engineering incentives with user cost savings.
What this means for you
If you’re using Neureus, you’re getting:
- Guaranteed 10% below OpenRouter on token pricing, on every request
- Cleaner prompts to your provider (often better responses)
- Automatic context management — conversations don’t hit token limits as fast
- Optional aggressive compression for verbose inputs (add the header)
And you can see exactly what the preprocessor did: add x-neureus-debug: true to any request and the response body includes _preprocessing.tokensSaved, _preprocessing.passesRun, and _preprocessing.originalTokens vs _preprocessing.sentTokens.
We think transparent pricing mechanisms build more trust than opaque discounts. This is ours.
Start at app.neureus.ai/onboard — free tier includes the preprocessor.