RAG in 50 Lines — Build a Document Q&A App

Retrieval-Augmented Generation doesn't require you to run your own vector database, embedding service, or LLM orchestration library. Here's a complete RAG app in about 50 lines of TypeScript.

Most RAG tutorials start by asking you to spin up Pinecone, configure OpenAI embeddings, and write chunking logic. By the time you finish setup, you’ve written 200 lines before ingesting your first document.

Here’s the same thing in 50 lines — using a managed RAG API.

What we’re building

A CLI app that:

  1. Ingests a URL (auto-chunks, auto-embeds, stores in a vector DB)
  2. Accepts questions on the command line
  3. Returns answers with source attribution

The full app

import { NeureuClient } from '@neureus/sdk';
import * as readline from 'readline';

const client = new NeureuClient({
  apiKey: process.env.NEUREUS_API_KEY!,
});

async function ingest(url: string) {
  console.log(`Ingesting ${url}...`);
  await client.rag.ingest({ url });
  console.log('Done. Documents indexed and ready to query.');
}

async function ask(question: string): Promise<void> {
  const result = await client.rag.query({
    query: question,
    model: '@wai/llama-3.3-70b',  // free — or claude-haiku-4-5, gpt-4o-mini, etc.
  });

  console.log('\nAnswer:', result.answer);

  // Truthiness check: safe when sources is undefined, and narrows the type below.
  if (result.sources?.length) {
    console.log('\nSources:');
    result.sources.forEach((s, i) => {
      console.log(`  ${i + 1}. ${s.title || s.documentId}`);
      if (s.excerpt) console.log(`     "${s.excerpt.slice(0, 100)}..."`);
    });
  }
}

async function main() {
  const url = process.argv[2];

  if (!url) {
    console.error('Usage: npx ts-node rag.ts <url>');
    process.exit(1);
  }

  await ingest(url);

  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

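  const ask_loop = () => {
    rl.question('\nQuestion (or "exit"): ', async (input) => {
      if (input.trim() === 'exit') { rl.close(); return; }
      try {
        await ask(input);
      } catch (err) {
        // main().catch() does not see errors thrown inside this callback,
        // so report them here and keep the prompt alive.
        console.error('Query failed:', err);
      }
      ask_loop();
    });
  };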

  ask_loop();
}

main().catch(console.error);

That’s it. Save as rag.ts, set NEUREUS_API_KEY in your environment, run npx ts-node rag.ts https://your-docs.com/guide, and start asking questions.

What’s happening under the hood

When you call client.rag.ingest({ url }):

  1. Neureus fetches the URL
  2. Splits the content into overlapping chunks at sentence boundaries (no hardcoded chunk_size)
  3. Generates dense vector embeddings via Cloudflare Workers AI
  4. Stores chunks and vectors in Cloudflare Vectorize
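
To make step 2 concrete, here is a minimal sketch of overlapping, sentence-boundary chunking. It is not Neureus's actual chunker: the function name chunkBySentence, the maxChars budget, and the one-sentence overlap are all assumptions chosen for illustration, and the sentence splitter is deliberately naive.

```typescript
// Illustrative only: a naive stand-in for a managed chunker.
// Packs whole sentences into chunks of roughly `maxChars` characters and
// repeats the last `overlapSentences` sentences at the start of the next chunk.
function chunkBySentence(
  text: string,
  maxChars = 500,       // hypothetical size budget per chunk
  overlapSentences = 1, // hypothetical overlap between neighbours
): string[] {
  // Naive split: a run of non-terminators followed by ., ! or ?
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];

  for (const sentence of sentences) {
    const candidate = [...current, sentence].join(' ');
    if (current.length > 0 && candidate.length > maxChars) {
      chunks.push(current.join(' '));
      // Carry the tail of the finished chunk forward as overlap.
      current = overlapSentences > 0 ? current.slice(-overlapSentences) : [];
    }
    current.push(sentence);
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}
```

Because chunks only ever break between sentences, a chunk can run slightly over the budget when a single sentence is long. The overlap means a fact that straddles a boundary still appears whole in at least one chunk.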

When you call client.rag.query({ query, model }):

  1. Generates an embedding for your question
  2. Finds the top-k most semantically similar chunks (default k=5)
  3. Formats the chunks as context in a prompt
  4. Calls the LLM to generate an answer grounded in the retrieved context
  5. Returns the answer with source chunk metadata
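
Steps 2 and 3 can be sketched in a few lines. This is an in-memory illustration, not the service's implementation: it assumes cosine similarity as the metric (the post only says "semantically similar"), and the Chunk type, topK, and buildPrompt names are hypothetical. A real vector index such as Vectorize avoids the brute-force scan shown here.

```typescript
// Hypothetical shape of a stored chunk.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors, assumed to have equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k: score every chunk, sort descending, keep the first k.
function topK(queryEmbedding: number[], chunks: Chunk[], k = 5): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Format the retrieved chunks as numbered context ahead of the question.
function buildPrompt(question: string, context: Chunk[]): string {
  const numbered = context.map((c, i) => `[${i + 1}] ${c.text}`).join('\n');
  return `Answer using only the context below.\n\nContext:\n${numbered}\n\nQuestion: ${question}`;
}
```

Numbering the chunks in the prompt is one simple way to let the answer refer back to its sources, which is what makes source attribution possible in the final step.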

Things you didn’t need

  • Pinecone, Chroma, Weaviate, or any vector database you provision yourself
  • OpenAI Embeddings API key
  • LangChain, LlamaIndex, or any orchestration framework
  • A server to run any of the above

Variant: web app

For a web UI, swap the CLI loop for an API route and a client-side fetch. The route sets stream: false, so it returns the complete answer in a single JSON response:

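// Next.js API route: app/api/rag/route.ts
import { NeureuClient } from '@neureus/sdk';

const client = new NeureuClient({
  apiKey: process.env.NEUREUS_API_KEY!,
});

export async function POST(req: Request) {
  const { question } = await req.json();

  const result = await client.rag.query({
    query: question,
    model: 'claude-haiku-4-5',
    stream: false,  // RAG query returns complete answer
  });

  return Response.json(result);
}

// React component (separate file; client component because it uses state)
'use client';
import { useState } from 'react';

// Fields used by the CLI example's source listing.
type Source = { title?: string; documentId?: string; excerpt?: string };

// QuestionInput and Answer are your own presentational components.
function RagChat() {
  const [answer, setAnswer] = useState('');
  const [sources, setSources] = useState<Source[]>([]);
  const [loading, setLoading] = useState(false);

  async function handleSubmit(question: string) {
    setLoading(true);
    try {
      const res = await fetch('/api/rag', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ question }),
      });
      const data = await res.json();
      setAnswer(data.answer);
      setSources(data.sources ?? []);
    } finally {
      // Clear the loading state even if the request fails.
      setLoading(false);
    }
  }

  return (
    <div>
      <QuestionInput onSubmit={handleSubmit} />
      {loading && <p>Thinking...</p>}
      {answer && <Answer text={answer} sources={sources} />}
    </div>
  );
}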

Ingesting multiple documents

import * as fs from 'fs';

const urls = [
  'https://docs.stripe.com/api',
  'https://your-runbook.notion.site',
  'https://your-wiki.com/engineering',
];

// Sequential — avoids rate limits
for (const url of urls) {
  await client.rag.ingest({ url });
  console.log(`Ingested: ${url}`);
}

// Or raw content (for databases, private files, etc.)
await client.rag.ingest({
  content: fs.readFileSync('./data/faq.txt', 'utf-8'),
  title: 'FAQ Document',
});
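
Sequential ingestion reduces the chance of hitting a rate limit but does not rule it out. One common pattern is to wrap each call in a retry with exponential backoff. The sketch below is a generic helper, not part of the Neureus SDK: withRetry, backoffDelayMs, and the default values are hypothetical, and because the SDK's rate-limit error shape is not documented here, it retries on any error.

```typescript
// Hypothetical helper, not part of the Neureus SDK.
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt;
}

// Runs `task`, retrying up to `maxRetries` times with exponential backoff.
// Retries on ANY error; narrow this once you know how rate limits are reported.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= maxRetries) throw err;
      const delay = backoffDelayMs(attempt, baseMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage inside the loop above (client comes from the main example):
//   await withRetry(() => client.rag.ingest({ url }));
```

With the defaults, a failing call waits 500 ms, 1 s, then 2 s before giving up, which keeps a long ingestion run from aborting on a single transient failure.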

Choosing the right model

The model in rag.query controls answer generation, not retrieval. Retrieval always uses Vectorize + Workers AI embeddings. For the generation step:

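| Model              | Cost     | When to use                     |
|--------------------|----------|---------------------------------|
| @wai/llama-3.3-70b | Free     | Simple Q&A, high volume         |
| claude-haiku-4-5   | $0.27/1M | Better instruction following    |
| gpt-4o-mini        | $0.15/1M | OpenAI ecosystem consistency    |
| claude-sonnet-4-6  | $2.70/1M | Complex synthesis, long answers |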

For most Q&A use cases, start with @wai/llama-3.3-70b: it is free and surprisingly capable.
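
If you want a rough feel for what a paid model would cost at your volume, the arithmetic is simple. The sketch below copies the prices listed above and makes two assumptions that the post does not confirm: that "1M" means one million tokens, and that input and output tokens are priced the same. The names PRICE_PER_MILLION_USD and estimateCostUsd are hypothetical.

```typescript
// Prices copied from the model list above.
// ASSUMPTION: "/1M" means per one million tokens, input and output alike.
const PRICE_PER_MILLION_USD: Record<string, number> = {
  '@wai/llama-3.3-70b': 0,
  'claude-haiku-4-5': 0.27,
  'gpt-4o-mini': 0.15,
  'claude-sonnet-4-6': 2.7,
};

// Estimated generation cost in USD for a given number of tokens.
function estimateCostUsd(model: string, tokens: number): number {
  const price = PRICE_PER_MILLION_USD[model];
  if (price === undefined) throw new Error(`Unknown model: ${model}`);
  return (price * tokens) / 1000000;
}
```

Under those assumptions, two million tokens on claude-haiku-4-5 comes to 0.27 × 2 = $0.54, while the same traffic on claude-sonnet-4-6 comes to 2.70 × 2 = $5.40. Check the current pricing page before budgeting on these numbers.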

Limits and pricing

  • Free tier: 50 documents
  • Builder ($29/mo): unlimited documents
  • Workers AI models for generation: always free
  • Paid models for generation: 10% below OpenRouter

That’s the full story. 50 lines, functional RAG, no infra.

Get started at app.neureus.ai/onboard.
