Building a Retrieval-Augmented Generation (RAG) System in .NET

Introduction

Retrieval-Augmented Generation (RAG) has become one of the most powerful architectures for combining structured data retrieval with natural language reasoning. While most RAG implementations live in the Python ecosystem, recent advances in .NET’s AI stack have made it entirely possible to build an end-to-end RAG pipeline—embedding, retrieval, and generation—within .NET.

In this article, I’ll walk you through how we designed and built a local RAG architecture in C#, starting as a proof of concept (POC) and later evolving into a production-grade Background Verification System.

Our goal was to create a fully local and customizable solution, leveraging:

  • LLamaSharp for on-device language model inference
  • ONNX Runtime for high-performance embedding generation
  • PgVector for similarity search in PostgreSQL

Along the way, you’ll learn:

  • How to design a local RAG pipeline using .NET
  • Which libraries and NuGet packages enable tokenization, vector search, and GPU acceleration
  • How we integrated PostgreSQL + PgVector for efficient similarity search
  • The key lessons we learned about performance, scalability, and maintainability in a .NET-based AI system

By the end, you’ll have a clear understanding of how to assemble a fully local RAG architecture in .NET, and how the concepts translate from experimentation to production-ready design.

Components of the RAG Implementation

Our Retrieval-Augmented Generation (RAG) system was designed around four main components, each responsible for a distinct stage in the pipeline:

  1. Embedding Service – Generates vector embeddings for both documents and queries.
  2. Document / Knowledge Store – Manages structured data such as user profiles, company affiliations, and insurance information inside PostgreSQL with vector search capabilities.
  3. LLM Engine – Performs local inference to generate human-readable answers based on retrieved context.
  4. API Layer / Retriever – Handles user requests and orchestrates the entire RAG flow, exposing endpoints for chat, search, and data lookups.

The Embedding Service

The first component we implemented was the Embedding Service.
This stage lies at the heart of semantic search and RAG pipelines. Its job is to take raw text—documents or user queries—and transform it into vector representations (embeddings) that capture meaning and context.

Before we can generate embeddings, we need to tokenize the input text, breaking it into smaller, model-friendly units such as words or subwords.
Once tokenized, the text is passed to an embedding model (such as a BERT-based ONNX model) that produces a high-dimensional numerical vector.

NuGet Packages Used

  • Microsoft.ML.Tokenizers
    A fully managed tokenization library for .NET that includes a BERT-compatible (WordPiece) tokenizer. It supports modern NLP tokenization directly inside .NET applications—without Python dependencies—and stays compatible with common transformer models.
  • Microsoft.ML.OnnxRuntime
    A high-performance inference engine for running ONNX models in .NET. It allows you to perform embedding generation, NLP inference, and other model tasks efficiently on both CPU and GPU.

Together, these two libraries enable fully local embedding generation within the .NET runtime, with no Python or external service dependencies.

Embedding Terminology

Tokenize

The process of splitting text into smaller parts called tokens (words, subwords, or characters).
Tokenization prepares text so machine-learning models can process it correctly.
Modern transformer-based models such as BERT rely on specialized tokenizers that match their vocabulary and structure.

Embedding

An embedding is a numerical vector representation of text that captures its semantic meaning.
It enables computers to measure similarity between different pieces of text, perform semantic search, and feed downstream AI models such as RAG systems.
In practice, embeddings form the bridge between text and understanding.

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model developed by Google.
It reads text in both directions (left-to-right and right-to-left), allowing it to understand contextual meaning more accurately.
BERT and similar models are often exported to ONNX format to enable efficient inference in .NET via Microsoft.ML.OnnxRuntime.

ONNX

Open Neural Network Exchange (ONNX) is an open standard for machine learning model interoperability.
It allows models trained in frameworks like PyTorch or TensorFlow to be exported and executed anywhere—including inside .NET applications using Microsoft.ML.OnnxRuntime.
This means you can leverage powerful transformer models directly in C#, without requiring Python or the original training environment.

Inside the Embedding Service

Once the tokenizer and ONNX model were configured, we structured the embedding generation workflow in four main steps:

Input Validation & Tokenization

Before anything else, the service ensures that the input text is valid, and then converts it into token IDs using the BERT tokenizer.
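As an illustration only (the vocabulary path, model choice, and sample text are assumptions, and the snippet assumes a Microsoft.ML.Tokenizers release that exposes BertTokenizer), this step can look roughly like the following:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Load a BERT/WordPiece tokenizer from the embedding model's vocabulary file.
// The path is illustrative; it must match the ONNX model used for inference later.
var tokenizer = BertTokenizer.Create("models/all-MiniLM-L6-v2/vocab.txt");

string text = "Find people who worked at Contoso and traveled to Norway.";
if (string.IsNullOrWhiteSpace(text))
    throw new ArgumentException("Input text must not be empty.");

// Convert the raw text into the token IDs the embedding model expects.
IReadOnlyList<int> tokenIds = tokenizer.EncodeToIds(text);
```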

Sequence Preparation (Padding, Truncation & Attention Mask)

Transformers require fixed-length input sequences.
We pad or truncate tokens to match the maximum sequence length and create an attention mask to tell the model which tokens are real and which are padding.

Tensor Creation & Model Inference

We convert our arrays into tensors, feed them into the ONNX model, and run inference using the ONNX Runtime session.

Extracting the Embedding Vector

After inference, we extract the model’s output tensor that contains the embedding, then convert it to a float array.
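The sketch below covers steps 2 through 4, continuing from the tokenIds produced during tokenization. The model path, the 256-token sequence length, and the tensor names (input_ids, attention_mask, token_type_ids, plus a token-level hidden-state output) are assumptions for a typical BERT/MiniLM ONNX export; check your model's actual metadata before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

const int MaxSeqLen = 256;
using var session = new InferenceSession("models/all-MiniLM-L6-v2/model.onnx");

// Step 2: pad or truncate to a fixed length and build the attention mask.
var inputIds = new long[MaxSeqLen];
var attentionMask = new long[MaxSeqLen];
var tokenTypeIds = new long[MaxSeqLen]; // all zeros for single-sentence input
for (int i = 0; i < Math.Min(tokenIds.Count, MaxSeqLen); i++)
{
    inputIds[i] = tokenIds[i];
    attentionMask[i] = 1; // 1 = real token, 0 = padding
}

// Step 3: wrap the arrays in [batch, sequence] tensors and run inference.
var shape = new[] { 1, MaxSeqLen };
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(inputIds, shape)),
    NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(attentionMask, shape)),
    NamedOnnxValue.CreateFromTensor("token_type_ids", new DenseTensor<long>(tokenTypeIds, shape)),
};
using var results = session.Run(inputs);

// Step 4: read the output tensor ([1, sequence, hidden]) and mean-pool the
// real (non-padding) token vectors into a single embedding.
var output = results.First().AsTensor<float>();
int hiddenSize = output.Dimensions[2];
int realTokens = (int)attentionMask.Sum();
var embedding = new float[hiddenSize];
for (int t = 0; t < realTokens; t++)
    for (int h = 0; h < hiddenSize; h++)
        embedding[h] += output[0, t, h] / realTokens;
```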

Document / Knowledge Store

One of the key differences between a Retrieval-Augmented Generation (RAG) system and a standard LLM is how the model retrieves knowledge.
While a traditional LLM generates answers solely from its internal parameters, a RAG system enhances this process by retrieving external context—relevant facts, documents, or structured data—from a knowledge base before generating a response.

This allows the model to produce more accurate, grounded, and up-to-date answers without retraining or fine-tuning the model itself.

To achieve this, we needed a database capable of storing and searching through vector embeddings—numerical representations of documents and user queries generated by our Embedding Service.
Each vector represents the semantic meaning of a piece of text, which allows us to find contextually similar documents even if the exact words don’t match.

For this purpose, we used PostgreSQL enhanced with PgVector, a powerful extension that adds native support for vector similarity search.
This combination gave us the scalability and reliability of a relational database, along with the semantic power of a vector store.

Why PostgreSQL with PgVector?

  • Familiarity for .NET developers: PostgreSQL integrates smoothly with Entity Framework Core.
  • Native vector operations: PgVector supports cosine similarity, Euclidean distance, and inner product operations directly in SQL queries.
  • Single unified storage: No need for a separate vector database—embeddings, metadata, and documents all live in one place.
  • Scalability: PostgreSQL handles large datasets well and benefits from mature indexing and transaction mechanisms.

NuGet Packages Used by the Knowledge Store

  • Pgvector
    Adds the vector data type and similarity operators to PostgreSQL, allowing you to store, index, and query embeddings directly within the database.
    This extension provides native vector search capabilities, supporting operations like cosine similarity and Euclidean distance—essential for semantic search.
  • Pgvector.EntityFrameworkCore
    Provides a smooth integration of pgvector with Entity Framework Core, allowing developers to work with embeddings in C# as if they were standard entity fields.
    It enables you to perform vector operations, run similarity searches, and manage persistence using familiar EF Core patterns—no raw SQL needed.

Vector Terminology

Understanding the vector concepts behind your Knowledge Store is key to grasping how semantic search works in RAG systems.

Vector Data Type

A vector data type allows you to store a list of floating-point numbers directly in a database column—these numbers typically represent embeddings generated by a machine learning model.

For example, an embedding for a piece of text might be stored as a vector of 384 or 768 float values, all contained in a single row and column.
This data type enables PostgreSQL (with pgvector) to efficiently store, index, and query these high-dimensional numerical representations.

Vector Search

Vector search (or similarity search) is the process of finding records in your database whose vector embeddings are most similar to a given query vector.

Instead of using traditional keyword or full-text search, vector search measures semantic similarity—how close two pieces of text are in meaning—by using mathematical distance functions such as:

  • Cosine similarity
  • Euclidean distance
  • Inner product

This approach allows your system to perform semantic retrieval, meaning you can find relevant documents, profiles, or knowledge base entries even when the query doesn’t contain the same words as the stored text.


In short, traditional keyword search finds documents that share the same words.
Vector search finds documents that mean similar things — a cornerstone of any effective RAG implementation.

Inside the Postgres Gateway

Vector-Enabled Database with Entity Framework Core

In our RAG system, we used Entity Framework Core (EF Core) as the data layer for PostgreSQL.
The integration was straightforward thanks to the Pgvector.EntityFrameworkCore extension, which enables the vector data type directly in EF Core models.

To activate vector support, we simply chained .UseVector() in the UseNpgsql() configuration method:
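A minimal sketch of that registration (AppDbContext and the connection-string name are placeholders):

```csharp
using Microsoft.EntityFrameworkCore;
using Pgvector.EntityFrameworkCore;

// Program.cs: register the DbContext and enable pgvector support on the Npgsql provider.
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(
        builder.Configuration.GetConnectionString("DefaultConnection"),
        npgsqlOptions => npgsqlOptions.UseVector()));

// In AppDbContext.OnModelCreating, the database extension itself is enabled with:
//   modelBuilder.HasPostgresExtension("vector");
```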

This setup tells EF Core to recognize and handle PostgreSQL’s vector type, allowing us to seamlessly store and query embeddings as part of our entities.

Populating the Database with Embeddings

When processing person data, we needed to aggregate contextual information — such as company affiliations, travel history, and insurance details — into a single text representation before generating its embedding.

Here’s a simplified example of how we built and stored embeddings for each record:
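The snippet below is an illustrative reconstruction rather than the exact production code; the PersonProfile entity, its properties, the EmbeddingService.GenerateEmbedding helper, and the 384-dimension embedding column are assumptions:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Pgvector;

// Entity with a pgvector column (mapped in OnModelCreating with
// builder.Property(p => p.Embedding).HasColumnType("vector(384)")).
public class PersonProfile
{
    public int Id { get; set; }
    public string FullName { get; set; } = "";
    public List<string> Companies { get; set; } = new();
    public List<string> Trips { get; set; } = new();
    public string InsuranceSummary { get; set; } = "";
    public string SearchableText { get; set; } = "";
    public Vector? Embedding { get; set; }
}

public class PersonIndexer(AppDbContext db, EmbeddingService embeddings)
{
    public async Task IndexAsync(PersonProfile person)
    {
        // Aggregate the person's contextual data into a single text representation.
        var text = $"{person.FullName}. Companies: {string.Join(", ", person.Companies)}. " +
                   $"Travel history: {string.Join(", ", person.Trips)}. Insurance: {person.InsuranceSummary}.";

        // Generate the embedding and persist it alongside the entity.
        float[] vector = embeddings.GenerateEmbedding(text);
        person.SearchableText = text;
        person.Embedding = new Vector(vector);

        db.People.Update(person);
        await db.SaveChangesAsync();
    }
}
```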


LLM Engine

The LLM Engine is responsible for generating final answers in the RAG workflow.
Once the relevant documents or context have been retrieved from the Knowledge Store, the LLM Engine takes the original user query together with the retrieved context and produces a coherent, context-aware response.

This component acts as the “reasoning” layer of the RAG pipeline.
While the embedding and retrieval stages handle what information to bring in, the LLM Engine focuses on how to express it meaningfully.

In our project, we implemented a local LLM inference engine to ensure data privacy and reduce dependency on cloud APIs. The engine runs directly inside the .NET application using LLamaSharp (which wraps llama.cpp), allowing us to run transformer-based models efficiently on local hardware.

NuGet Packages Used by the LLM Engine

  • LLamaSharp
    A powerful .NET library that allows you to run Llama-family LLMs natively within .NET applications.
    It provides APIs for:
    • Loading and initializing models
    • Running inference
    • Interacting with local LLMs
  • LLamaSharp.Backend.Cuda12.Windows
    An optional GPU-accelerated backend for LLamaSharp, enabling inference using CUDA 12 on Windows.
    This backend takes advantage of NVIDIA GPUs for faster processing, which is crucial when dealing with large models and real-time generation tasks.

Is LLamaSharp Developed by Meta?

No — LLamaSharp is an independent open-source project.

LLamaSharp is an open-source .NET library developed and maintained by independent contributors (primarily on GitHub), not by Meta (Facebook).

Its goal is to make it easy for C# and .NET developers to load and run Llama-family models (and other compatible LLMs) directly in .NET applications, without requiring Python or external runtimes.

Under the hood, LLamaSharp uses the excellent llama.cpp project — a community-driven C/C++ implementation of the Llama model inference engine.
This design allows it to support a wide range of models that follow the llama.cpp backend format.

Supported Models

LLamaSharp is compatible not only with:

  • Llama 1, 2, 3, and 4

…but also with many other transformer-based models converted to GGUF or GGML formats, including:

  • Mistral
  • Phi
  • Vicuna
  • Alpaca

In short: LLamaSharp isn’t made by Meta—it’s a .NET gateway to the open Llama ecosystem, empowering developers to run modern LLMs locally and efficiently.

About Model Formats: GGML and GGUF

When Meta releases the Llama models, they are distributed as PyTorch weight files — these are the “raw” model checkpoints.
To make them runnable on different platforms (especially outside of Python), the open-source community converts these weights into lighter, portable formats such as GGML and GGUF.

  • GGML (named after the ggml tensor library created by Georgi Gerganov) and GGUF are binary model formats optimized for fast inference on both CPUs and GPUs.
  • GGUF is the modern evolution of GGML — more flexible, better structured, and designed to handle model metadata and quantization more efficiently.
  • These formats are directly supported by llama.cpp, and by extension, LLamaSharp, making it possible to run Llama-family and other compatible models locally in C# applications.

In practice, when you download a .gguf model file, it’s already optimized for CPU/GPU use with libraries like LLamaSharp or Ollama.

Understanding Tokens in LLMs

What is a Token?

A token is a unit of text that an LLM processes — it could be:

  • A whole word
  • A subword (part of a word)
  • Or even a single character

Example:
"WeAreDeveloper!"[ "We", "Are", "Dev", "eloper", "!" ]5 tokens

Why Are Tokens Important?

  • Billing:
    If you’re using cloud-based models (like OpenAI or Azure OpenAI), you’re charged based on the number of tokens processed — both the prompt and the model’s response.
    More tokens = higher cost.
  • Model Limits:
    Each model has a maximum token limit per request — e.g., 512, 4,096, or even 128,000+ tokens.
    If your combined prompt and response exceed this limit, the model will truncate or reject the input.

For local inference with LLamaSharp, token management is still essential — it determines how much context your model can “see” and directly affects coherence and performance.

Token Performance and Limits

The number of tokens used in a prompt directly affects model performance, speed, and resource consumption. Understanding this balance is key to building efficient RAG pipelines.

Shorter Prompts (Fewer Tokens)

  • Faster inference
  • Lower memory and compute usage
  • Ideal for short, direct queries
  • Limited context awareness

Longer Prompts (More Tokens)

  • Allow richer context and more detailed conversations
  • Enable summarization or question answering over large documents
  • Slower inference and higher resource usage

Impact of Token Limits (e.g., 128 vs 256 vs 4,096 tokens)

  • Smaller limits (128–256 tokens):
    The model retains less context and may “forget” earlier parts of a conversation.
    Faster and cheaper, but not suitable for large documents or multi-turn reasoning.
  • Larger limits (4,096 tokens or more):
    Allow the model to process larger context windows—useful for summarizing or reasoning over long text.
    Slower and more resource-intensive, especially on local GPUs.

Tip: When designing your RAG system, balance context size with inference speed. For most local LLM setups, 1,024–2,048 tokens offer a good trade-off between performance and context retention.

LLM Terminology

Model

Model files (such as those for Llama, GPT, or any LLM) contain the learned parameters of the neural network after training. These include:

  • Weights:
    Numerical values representing every connection in the model’s layers. They determine how input tokens are transformed into outputs.
  • Biases:
    Additional parameters within each neuron or layer that help adjust activations and improve learning accuracy.

Prompt

The prompt is the input or instruction given to an LLM, telling the model what to generate or which task to perform.
A prompt can be a question, a statement, or a structured instruction set.

Temperature

A temperature parameter controls the level of randomness or creativity in the model’s responses during inference.

  • Low temperature (e.g., 0.1): Deterministic, focused, and consistent output.
  • High temperature (e.g., 1.0): More creative, varied, and less predictable responses.

Instruction-Tuned (Instruct) Models

An instruction-tuned model is an LLM fine-tuned specifically to follow user instructions and perform the tasks given in prompts.
Models like GPT-3.5-turbo-instruct or Llama-2-Chat are designed to better understand, follow, and execute instructions, resulting in outputs that are more aligned, helpful, and contextually relevant.

Inside the LLM Engine Service

Method Signature

The core inference method is asynchronous and returns a stream of generated text.
It accepts several parameters that control the model’s behavior during inference.
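Reconstructed here as a sketch (the method name, default values, and parameter types are illustrative rather than the exact production signature):

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;

public async IAsyncEnumerable<string> InferAsync(
    string prompt,
    IReadOnlyList<string> antiPrompts,
    int maxTokens = 1024,
    float temperature = 0.6f,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    // The sampling pipeline, inference parameters, and streaming loop that make up
    // the body are shown step by step in the sections below.
    yield break;
}
```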

Description:

  • prompt: The user’s input or instruction.
  • antiPrompts: A list of phrases that, if generated, stop the model early to prevent unwanted completions.
  • maxTokens: The maximum number of tokens to generate.
  • temperature: Controls output randomness and creativity.

Sampling Pipeline Configuration

The sampling pipeline defines how the model chooses its next token at each generation step.
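A sketch using LLamaSharp's DefaultSamplingPipeline (from the LLama.Sampling namespace); the TopK and TopP values are illustrative, and the exact property set varies slightly between LLamaSharp releases:

```csharp
using LLama.Sampling;

// Inside InferAsync: configure how the next token is sampled at each generation step.
var samplingPipeline = new DefaultSamplingPipeline
{
    Temperature = temperature, // creativity / randomness of the output
    TopK = 40,                 // consider only the 40 most probable tokens...
    TopP = 0.9f,               // ...within a cumulative probability mass of 0.9
    Seed = 42                  // fixed seed makes the output reproducible
};
```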

Explanation:

  • Temperature: Adjusts creativity and randomness in generation.
  • TopK & TopP: Control the diversity of token sampling (from top K most probable tokens or from a probability mass P).
  • Seed: Makes the output deterministic and reproducible.

Inference Parameters Setup

All configuration values are gathered into a single object for cleaner management.
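Sketched with LLamaSharp's InferenceParams type (LLama.Common namespace; property names as in recent LLamaSharp versions):

```csharp
using LLama.Common;

// Consolidate everything the executor needs for this request.
var inferenceParams = new InferenceParams
{
    MaxTokens = maxTokens,              // upper bound on the number of generated tokens
    AntiPrompts = antiPrompts,          // stop sequences that end generation early
    SamplingPipeline = samplingPipeline // temperature / TopK / TopP / seed from above
};
```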

This consolidates all inference-related parameters before execution.

Streaming Inference

The model generates text incrementally and returns it as an asynchronous stream.
Each chunk of text is yielded immediately to the caller, allowing live updates in UI or console applications.
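A minimal sketch of that loop, assuming a StatelessExecutor (or InteractiveExecutor) named _executor created when the model was loaded:

```csharp
// Stream each generated chunk back to the caller as soon as the model produces it.
await foreach (var textChunk in _executor.InferAsync(prompt, inferenceParams, cancellationToken))
{
    yield return textChunk;
}
```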

This streaming approach provides real-time feedback similar to how OpenAI’s chat APIs work, enhancing interactivity in user-facing applications.

API Layer / RAG

Once the three core components—Knowledge Store, LLM Engine, and Embedding Service—are in place, implementing Retrieval-Augmented Generation (RAG) becomes straightforward.

At its core, a RAG system connects these components in a logical sequence to create an intelligent, context-aware response workflow.

What Is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a hybrid architecture that combines two main AI capabilities:

Retrieval

The system searches external databases or knowledge stores for the most relevant information based on a user query.
This is usually done using vector search or semantic similarity matching against precomputed embeddings.

Generation

The retrieved information is passed to a Large Language Model (LLM), which combines it with the user’s query to generate a final, contextually accurate response.

Why RAG?

Traditional LLMs rely solely on their internal training data. This means they cannot access fresh, domain-specific, or private information.
RAG solves this limitation by allowing the model to “look up” relevant data from external sources—making responses more accurate, up-to-date, and trustworthy.

Inside the RAG Service

Here’s how the process flows inside your .NET-based RAG Service:

  1. Receive Query – The API accepts a user’s query.
  2. Generate Embedding – The query is transformed into a vector embedding.
  3. Vector Search – The system searches the database (e.g., PostgreSQL with pgvector) for semantically similar records.
  4. Context Assembly – The relevant documents or records are fetched to build the contextual input.
  5. LLM Generation – The combined context and query are passed to the LLM Engine, which streams back the final coherent answer.

This architecture allows your API to serve as a bridge between knowledge retrieval and language generation, creating an intelligent workflow that produces factually grounded, dynamic answers.
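As a rough end-to-end sketch of that flow (every service and method name here, such as _embeddingService, _knowledgeStore, and _llmEngine, is a placeholder rather than the actual implementation):

```csharp
public async IAsyncEnumerable<string> AskAsync(string query)
{
    // 1-2. Receive the query and turn it into a vector embedding.
    float[] queryEmbedding = _embeddingService.GenerateEmbedding(query);

    // 3. Vector search: fetch the most semantically similar records from PostgreSQL/pgvector.
    var matches = await _knowledgeStore.FindSimilarAsync(queryEmbedding, limit: 5);

    // 4. Context assembly: concatenate the retrieved records into a context block.
    var context = string.Join("\n", matches.Select(m => m.SearchableText));

    // 5. LLM generation: combine context and query, then stream the answer back.
    var prompt = $"Context:\n{context}\n\nQuestion: {query}\nAnswer:";
    await foreach (var chunk in _llmEngine.InferAsync(prompt, antiPrompts: new[] { "Question:" }))
        yield return chunk;
}
```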


Finding the Most Relevant Person

At the heart of the RAG process is retrieval — identifying the most relevant data entries for a user’s query.
In this case, the system performs a hybrid vector and LLM-based search to find the most contextually relevant person record.

Process Overview:

  1. Embedding Generation:
    The query is converted into a vector representation using the Embedding Service.
  2. Vector Search:
    Using the generated embedding, a similarity search is performed in PostgreSQL (via pgvector) to find the top candidates.
  3. Result Ranking:
    Candidates are sorted based on cosine similarity scores.
  4. Context Preparation:
    The most relevant results are passed to the LLM along with contextual data for deeper evaluation and summarization.

This ensures that only the most semantically relevant records are considered before generating the final output.
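Sketched with the pgvector EF Core extensions (AppDbContext, People, and the Embedding property are placeholders; CosineDistance translates to the <=> operator in SQL):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Pgvector;
using Pgvector.EntityFrameworkCore;

public class PersonRetriever(AppDbContext db)
{
    public async Task<List<PersonProfile>> FindMostRelevantAsync(float[] queryEmbedding, int top = 5)
    {
        var queryVector = new Vector(queryEmbedding);

        // Smaller cosine distance means higher semantic similarity, so order ascending.
        return await db.People
            .OrderBy(p => p.Embedding!.CosineDistance(queryVector))
            .Take(top)
            .ToListAsync();
    }
}
```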

Generating the Final Answer with the LLM

Once the relevant context has been retrieved, the next step is generation — using the LLM to create a coherent, detailed, and context-aware response.

Process Overview:

  1. Prompt Engineering:
    The system constructs a prompt that includes the query and the retrieved contextual information.
  2. Streaming Inference:
    The LLM Engine is called using an asynchronous stream, sending partial results as they are generated.
  3. Responsive Output:
    This streaming approach allows for real-time UI updates or live API responses.

Example:
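The snippet below is illustrative; the prompt wording, parameter values, and the personContext and userQuery variables are assumptions, and llmEngine refers to the LLM Engine service described earlier:

```csharp
// Build a grounded prompt from the retrieved context and the user's question.
var prompt = $"""
    You are a background-verification assistant.
    Answer the question using only the context below.

    Context:
    {personContext}

    Question: {userQuery}
    Answer:
    """;

// Stream the answer chunk by chunk, e.g. into an HTTP response or a console UI.
await foreach (var chunk in llmEngine.InferAsync(
                   prompt,
                   antiPrompts: new[] { "Question:" },
                   maxTokens: 512,
                   temperature: 0.2f))
{
    Console.Write(chunk);
}
```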

This approach merges efficient retrieval with dynamic, real-time text generation — forming a complete Retrieval-Augmented Generation loop within your .NET architecture.

Bringing It All Together

By combining vector databases, local LLM inference, and a retrieval-augmented architecture, we’ve built a complete, end-to-end RAG system entirely within the .NET ecosystem.

At its core, the system follows a simple but powerful pipeline:

  1. Data Storage and Embeddings
    Entities such as people, companies, or documents are stored in PostgreSQL using pgvector to enable semantic search based on vector similarity.
  2. LLM Engine
    Using LLamaSharp, large language models can run locally without Python dependencies, while GPU acceleration (via CUDA backends) enables near real-time inference performance.
  3. Retrieval-Augmented Generation (RAG) Layer
    The API layer brings everything together — generating query embeddings, retrieving the most relevant information, and passing it to the LLM to craft a coherent, context-aware answer.

This architecture demonstrates how modern AI systems can be built natively in C#, leveraging open-source tooling, local inference, and efficient vector search to deliver intelligent, self-contained solutions.

In essence, the RAG workflow turns a static database into an interactive knowledge engine, capable of understanding queries, finding relevant context, and generating precise, human-like responses — all powered by the strength and flexibility of the .NET platform.


If you’d like to explore the implementation in more detail, the complete source code is available on GitHub:
dotnet-talk.ai.profile.finder

The repository includes all major components of the .NET-based RAG pipeline—embedding generation, vector search using PostgreSQL + PgVector, and local LLM inference with LLamaSharp—along with practical examples you can run and extend for your own projects.
