RAG in 50 Lines — Build a Document Q&A App

Most RAG tutorials start by asking you to spin up Pinecone, configure OpenAI embeddings, and write chunking logic. By the time setup is done, you’ve written 200 lines and still haven’t ingested your first document.

Here’s the same thing in 50 lines — using a managed RAG API.

What we’re building

A CLI app that:

Ingests a URL (auto-chunks, auto-embeds, stores in a vector DB)
Accepts questions on the command line
Returns answers with source attribution

The full app

import { NeureuClient } from '@neureus/sdk';
import * as readline from 'readline';

const client = new NeureuClient({
  apiKey: process.env.NEUREUS_API_KEY!,
});

async function ingest(url: string) {
  console.log(`Ingesting ${url}...`);
  await client.rag.ingest({ url });
  console.log('Done. Documents indexed and ready to query.');
}

async function ask(question: string): Promise<void> {
  const result = await client.rag.query({
    query: question,
    model: '@wai/llama-3.3-70b',  // free — or claude-haiku-4-5, gpt-4o-mini, etc.
  });

  console.log('\nAnswer:', result.answer);

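  // Truthiness check: an undefined or empty sources array is skipped, and it type-checks under strict mode.
  if (result.sources?.length) {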
    console.log('\nSources:');
    result.sources.forEach((s, i) => {
      console.log(`  ${i + 1}. ${s.title || s.documentId}`);
      if (s.excerpt) console.log(`     "${s.excerpt.slice(0, 100)}..."`);
    });
  }
}

async function main() {
  const url = process.argv[2];

  if (!url) {
    console.error('Usage: npx ts-node rag.ts <url>');
    process.exit(1);
  }

  await ingest(url);

  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

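  // Prompt repeatedly until the user types "exit".
  const askLoop = () => {
    rl.question('\nQuestion (or "exit"): ', async (input) => {
      if (input.trim() === 'exit') { rl.close(); return; }
      try {
        await ask(input);
      } catch (err) {
        // Report the failure and keep the session alive.
        console.error('Query failed:', err);
      }
      askLoop();
    });
  };

  askLoop();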
}

main().catch(console.error);

That’s it. Save as rag.ts, run npx ts-node rag.ts https://your-docs.com/guide, and start asking questions.

What’s happening under the hood

When you call client.rag.ingest({ url }):

Neureus fetches the URL
Splits the content into overlapping chunks at sentence boundaries (no hardcoded chunk_size)
Generates dense vector embeddings via Cloudflare Workers AI
Stores chunks and vectors in Cloudflare Vectorize
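
To make the chunking step concrete, here is a minimal sketch of sentence-boundary chunking with overlap. It is not Neureus's actual implementation: the function name chunkBySentence and the maxChars and overlapSentences parameters are hypothetical, and the managed service picks chunk boundaries itself without a hardcoded size. The size budget is there only so the example has a stopping rule.

```typescript
// Illustrative only: a naive sentence-boundary chunker with overlap.
// chunkBySentence, maxChars, and overlapSentences are hypothetical names.
function chunkBySentence(text: string, maxChars = 500, overlapSentences = 1): string[] {
  // Naive sentence split: a run of text ending in ., !, or ? (or the trailing remainder).
  const sentences = (text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentLen = 0;

  for (const sentence of sentences) {
    // Close the chunk when adding this sentence would exceed the budget.
    if (current.length > 0 && currentLen + sentence.length + 1 > maxChars) {
      chunks.push(current.join(' '));
      // Carry the last sentence(s) forward so adjacent chunks overlap.
      current = overlapSentences > 0 ? current.slice(-overlapSentences) : [];
      currentLen = current.join(' ').length;
    }
    current.push(sentence);
    currentLen += sentence.length + 1;
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}
```

Each chunk ends on a sentence boundary, and the sentence shared between neighbors keeps context from being cut off at the seam.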

When you call client.rag.query({ query, model }):

Generates an embedding for your question
Finds the top-k most semantically similar chunks (default k=5)
Formats the chunks as context in a prompt
Calls the LLM to generate an answer grounded in the retrieved context
Returns the answer with source chunk metadata
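
The retrieval and prompt-formatting steps above can be sketched in a few lines. This is an in-memory illustration, not what the service runs: the Chunk shape, cosine, topK, and buildPrompt are all hypothetical names, and a real vector index such as Vectorize would typically use an approximate nearest-neighbor search rather than scoring every chunk.

```typescript
// Illustrative only: brute-force top-k retrieval and prompt assembly.
// Chunk, cosine, topK, and buildPrompt are hypothetical, not SDK types.
interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query vector and keep the k best.
function topK(queryVector: number[], chunks: Chunk[], k = 5): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Number the retrieved chunks so the model can refer back to them.
function buildPrompt(question: string, context: Chunk[]): string {
  const numbered = context.map((c, i) => `[${i + 1}] ${c.text}`).join('\n');
  return `Answer using only the context below.\n\nContext:\n${numbered}\n\nQuestion: ${question}`;
}
```

Numbering the chunks in the prompt is one simple way to make source attribution possible: the answer can be mapped back to the chunk IDs that were retrieved.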

Things you didn’t need

Pinecone, Chroma, Weaviate, or any vector database
OpenAI Embeddings API key
LangChain, LlamaIndex, or any orchestration framework
A server to run any of the above

Variant: web app

For a web UI, swap the CLI loop for an API route and a client-side fetch. The route below returns the complete answer as JSON rather than streaming it:

// Next.js API route — app/api/rag/route.ts
import { NeureuClient } from '@neureus/sdk';

// Same client setup as the CLI app; the route file needs its own instance.
const client = new NeureuClient({
  apiKey: process.env.NEUREUS_API_KEY!,
});

export async function POST(req: Request) {
  const { question } = await req.json();

  const result = await client.rag.query({
    query: question,
    model: 'claude-haiku-4-5',
    stream: false,  // RAG query returns complete answer
  });

  return Response.json(result);
}

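// React component (QuestionInput and Answer are your own components)
import { useState } from 'react';

function RagChat() {
  const [answer, setAnswer] = useState('');
  const [sources, setSources] = useState([]);
  const [loading, setLoading] = useState(false);

  async function handleSubmit(question: string) {
    setLoading(true);
    try {
      const res = await fetch('/api/rag', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ question }),
      });
      if (!res.ok) throw new Error(`Request failed: ${res.status}`);
      const data = await res.json();
      setAnswer(data.answer);
      setSources(data.sources ?? []);
    } finally {
      // Clear the loading state even if the request throws.
      setLoading(false);
    }
  }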

  return (
    <div>
      <QuestionInput onSubmit={handleSubmit} />
      {loading && <p>Thinking...</p>}
      {answer && <Answer text={answer} sources={sources} />}
    </div>
  );
}

Ingesting multiple documents

import * as fs from 'fs';

const urls = [
  'https://docs.stripe.com/api',
  'https://your-runbook.notion.site',
  'https://your-wiki.com/engineering',
];

// Sequential — avoids rate limits
for (const url of urls) {
  await client.rag.ingest({ url });
  console.log(`Ingested: ${url}`);
}

// Or raw content (for databases, private files, etc.)
await client.rag.ingest({
  content: fs.readFileSync('./data/faq.txt', 'utf-8'),
  title: 'FAQ Document',
});
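
If sequential ingestion still hits a rate limit now and then, a retry with exponential backoff is a common pattern. The sketch below is generic and does not rely on any particular SDK behavior: ingestAll, IngestFn, maxAttempts, and baseDelayMs are hypothetical names, and ingestOne stands in for a call such as client.rag.ingest({ url }). How the real API signals a rate limit is not covered here, so the sketch simply retries on any error.

```typescript
// Illustrative only: sequential ingestion with retry and exponential backoff.
// ingestOne is a stand-in for something like (url) => client.rag.ingest({ url }).
type IngestFn = (url: string) => Promise<void>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function ingestAll(
  urls: string[],
  ingestOne: IngestFn,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<{ ok: string[]; failed: string[] }> {
  const ok: string[] = [];
  const failed: string[] = [];

  for (const url of urls) {
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        await ingestOne(url);
        done = true;
      } catch {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
        if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
    // Record the outcome instead of aborting the whole batch on one bad URL.
    (done ? ok : failed).push(url);
  }
  return { ok, failed };
}
```

A call might look like ingestAll(urls, (url) => client.rag.ingest({ url }).then(() => undefined)), after which the failed list tells you which URLs to retry later.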

Choosing the right model

The model in rag.query controls answer generation, not retrieval. Retrieval always uses Vectorize + Workers AI embeddings. For the generation step:

Model	Cost (per 1M tokens)	When to use
`@wai/llama-3.3-70b`	Free	Simple Q&A, high volume
`claude-haiku-4-5`	$0.27	Better instruction following
`gpt-4o-mini`	$0.15	OpenAI ecosystem consistency
`claude-sonnet-4-6`	$2.70	Complex synthesis, long answers

For most Q&A use cases, the free @wai/llama-3.3-70b is surprisingly capable, so start there and move to a paid model only if answer quality falls short.
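
To compare models on cost, a rough estimate is enough. The sketch below copies the prices from the table and reads them as a flat USD rate per 1M tokens, which is an assumption: real billing may price input and output tokens differently. The function name estimateMonthlyCost and the traffic figures in the usage note are hypothetical.

```typescript
// Illustrative only: back-of-the-envelope generation cost per month.
// Prices are copied from the table above and assumed to be USD per 1M tokens.
const PRICE_PER_MILLION_TOKENS: Record<string, number> = {
  '@wai/llama-3.3-70b': 0,
  'claude-haiku-4-5': 0.27,
  'gpt-4o-mini': 0.15,
  'claude-sonnet-4-6': 2.7,
};

function estimateMonthlyCost(
  model: string,
  queriesPerMonth: number,
  tokensPerQuery: number,
): number {
  const price = PRICE_PER_MILLION_TOKENS[model];
  if (price === undefined) throw new Error(`Unknown model: ${model}`);
  // Total tokens, converted to millions, times the per-million rate.
  return ((queriesPerMonth * tokensPerQuery) / 1_000_000) * price;
}
```

For example, 10,000 queries a month at an assumed 2,000 tokens each is 20M tokens: roughly $5.40 on claude-haiku-4-5 and $54 on claude-sonnet-4-6 under these assumptions, versus nothing on the Workers AI model.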

Limits and pricing

Free tier: 50 documents
Builder ($29/mo): unlimited documents
Workers AI models for generation: always free
Paid models for generation: 10% below OpenRouter

That’s the full story. 50 lines, functional RAG, no infra.

Get started at app.neureus.ai/onboard.