Large Language Model Architecture

12/5/2025

For software engineers accustomed to deterministic systems, where inputs produce predictable outputs through explicit logic in languages like Node.js, the rise of generative AI requires a shift in how we think about computation. Large Language Models (LLMs) are not knowledge bases in the traditional sense; they are complex probabilistic engines built upon deep neural networks.

This document provides a technical dissection of the operational mechanisms of LLMs, moving beyond high-level abstractions to examine the underlying stack, data processes, and emerging architectures.

1. The Fundamental Translation: Tokenization and Embeddings

The most critical initial concept is that neural networks cannot inherently process textual data. They require numerical input. The bridge between human language and machine-readable formats involves two key steps: tokenization and embedding.

Tokenization

Before processing, raw text is segmented into discrete units called tokens. A token does not strictly correspond to a word; it can be part of a word, a suffix, or punctuation. Modern LLMs frequently utilize subword tokenization algorithms, such as Byte-Pair Encoding (BPE) or WordPiece.

This approach balances vocabulary size against sequence length. A character-level model has a small vocabulary but produces long sequences; a word-level model has a massive, sparse vocabulary with many "unknown" tokens. Subword tokenization finds a middle ground, allowing the model to handle rare words by composing them from common subword units.
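To make this concrete, here is a minimal sketch of subword tokenization at inference time: a greedy longest-match lookup against a small hand-written vocabulary. Real tokenizers such as BPE and WordPiece learn their vocabularies from data and map every token to an integer ID; the vocabulary and fallback characters below are purely illustrative.

```typescript
// Minimal sketch: greedy longest-match subword tokenization over a toy vocabulary.
// Real tokenizers learn vocabularies of tens of thousands of subwords from data.
const VOCAB = new Set([
  "token", "ization", "un", "predict", "able",
  // Single characters guarantee every input can be tokenized somehow.
  "t", "o", "k", "e", "n", "i", "z", "a", "b", "l", "u", "p", "r", "d", "c", "s",
]);

function tokenize(word: string): string[] {
  const tokens: string[] = [];
  let start = 0;
  while (start < word.length) {
    // Try the longest possible substring first, shrinking until a vocab entry matches.
    let end = word.length;
    while (end > start && !VOCAB.has(word.slice(start, end))) {
      end--;
    }
    if (end === start) throw new Error(`No subword found for "${word.slice(start)}"`);
    tokens.push(word.slice(start, end));
    start = end;
  }
  return tokens;
}

console.log(tokenize("tokenization"));  // [ "token", "ization" ]
console.log(tokenize("unpredictable")); // [ "un", "predict", "able" ]
// A real tokenizer then maps each token string to an integer ID for the model.
```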

Vector Embeddings

Once text is tokenized, each token is mapped to an integer ID, and these integers must be converted into a format suitable for mathematical operations within a neural network. We use embeddings: dense vectors in a high-dimensional continuous space (often ranging from 768 to several thousand dimensions).

During the training phase, the model learns to map tokens to points in this vector space based on the contexts in which they appear. The crucial outcome is that tokens with similar semantic meanings end up geometrically close to each other. The classic "King - Man + Woman ≈ Queen" example shows vector arithmetic capturing semantic relationships within this space.

Below is a visualization of this transformation process:

[Figure: 1. Token to Vector: "King", "Man", "Woman", and "Queen" are each mapped to an example embedding vector (real vectors have 768+ dimensions). 2. Semantic Vector Space (2D Projection): "Man" and "Woman" sit apart along a gender dimension, and adding a "royalty" offset carries "Man" to "King" and "Woman" to "Queen".]
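To make the geometry concrete, here is a tiny sketch of the "King - Man + Woman ≈ Queen" arithmetic using made-up 3-dimensional vectors and cosine similarity. Real embeddings are learned and have hundreds or thousands of dimensions; every number below is invented purely to illustrate the idea.

```typescript
// Toy embedding arithmetic: the values are hand-picked, not learned.
type Vec = number[];

const embeddings: Record<string, Vec> = {
  king:  [0.9, 0.8, 0.1], // loosely: [royalty, masculinity, femininity]
  man:   [0.1, 0.8, 0.1],
  woman: [0.1, 0.1, 0.9],
  queen: [0.9, 0.1, 0.9],
};

const add = (a: Vec, b: Vec): Vec => a.map((x, i) => x + b[i]);
const sub = (a: Vec, b: Vec): Vec => a.map((x, i) => x - b[i]);
const dot = (a: Vec, b: Vec): number => a.reduce((s, x, i) => s + x * b[i], 0);
const norm = (a: Vec): number => Math.sqrt(dot(a, a));
const cosine = (a: Vec, b: Vec): number => dot(a, b) / (norm(a) * norm(b));

// king - man + woman should land closest to queen in this toy space.
const result = add(sub(embeddings.king, embeddings.man), embeddings.woman);

for (const [word, vec] of Object.entries(embeddings)) {
  console.log(word, cosine(result, vec).toFixed(3)); // "queen" gets the highest score
}
```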

2. The Core Engine: The Transformer Architecture

Prior to 2017, sequential data was primarily handled by Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These architectures processed data sequentially, token by token, which limited parallelization and made it difficult for the model to retain context over long sequences (due in part to the "vanishing gradient" problem).

The paradigm shifted with the introduction of the Transformer architecture by Vaswani et al. in the paper "Attention Is All You Need."

The Attention Mechanism

The primary innovation of the Transformer is self-attention. Instead of processing tokens sequentially, a Transformer processes the entire input sequence simultaneously.

For every token in a sequence, the attention mechanism calculates a relevance score against every other token in the same sequence. This allows the model to dynamically weigh the importance of context. For example, in the sentence "The server crashed because it ran out of memory," the model learns that "it" strongly relates to "server," even if several words separate them.
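Here is a minimal sketch of that computation: scaled dot-product attention for a single head and a single query token. In a real Transformer the query, key, and value vectors come from learned projection matrices applied to the embeddings; the 4-dimensional vectors below are made up just to show the mechanics.

```typescript
// Scaled dot-product attention for one query token against a set of keys/values.
type Vec = number[];

const dot = (a: Vec, b: Vec): number => a.reduce((s, x, i) => s + x * b[i], 0);

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract the max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

// attention(q, K, V) = softmax(q . K^T / sqrt(d)) . V
function attend(query: Vec, keys: Vec[], values: Vec[]): Vec {
  const d = query.length;
  const scores = keys.map((k) => dot(query, k) / Math.sqrt(d)); // relevance per token
  const weights = softmax(scores);                              // weights sum to 1
  // Weighted sum of value vectors: a context-aware representation of the query token.
  return values[0].map((_, dim) =>
    values.reduce((s, v, i) => s + weights[i] * v[dim], 0)
  );
}

// Toy example: the query for "it" scores higher against the key for "server"
// than against "because", so "server"'s value dominates the output.
const qIt = [1, 0, 1, 0];
const keys = [[1, 0, 1, 0], [0, 1, 0, 1]];          // "server", "because"
const values = [[0.9, 0.1, 0, 0], [0, 0, 0.2, 0.8]];
console.log(attend(qIt, keys, values));
```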

A standard Transformer block consists of:

  1. Multi-Head Attention: Running several attention mechanisms in parallel to capture different types of relationships.
  2. Normalization and Residual Connections: Techniques that stabilize the training of deep networks.
  3. Feed-Forward Neural Networks: Independent layers that process each token's attended representation further.

The final output is a probability distribution over the entire vocabulary, predicting the most likely next token. The generation process is iterative: predict the next token, append it to the input, and repeat.
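The decoding loop itself is simple enough to sketch. `modelForward` below is a hypothetical stand-in for a full Transformer forward pass; the point is the shape of the loop: score the vocabulary, turn scores into probabilities, sample a token, append it, and repeat.

```typescript
// Autoregressive decoding loop (temperature = 1, plain sampling).
type Logits = number[]; // one raw score per token in the vocabulary

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

// Sample a token ID from a probability distribution.
function sample(probs: number[]): number {
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

function generate(
  modelForward: (tokens: number[]) => Logits, // hypothetical Transformer forward pass
  prompt: number[],
  maxNewTokens: number,
  eosId: number
): number[] {
  const tokens = [...prompt];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = modelForward(tokens);  // distribution over the entire vocabulary
    const next = sample(softmax(logits)); // pick the next token
    tokens.push(next);                    // append it and feed the sequence back in
    if (next === eosId) break;            // stop at end-of-sequence
  }
  return tokens;
}
```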

[Figure: Simplified Transformer Block. Input embeddings plus positional encoding pass through N stacked layers (multi-head self-attention, Add & Norm, feed-forward network, Add & Norm), producing output probabilities for the next token.]

3. The Training Paradigm: Pre-training and Alignment

The impressive capabilities of LLMs arise from a two-stage training process.

Pre-training (Self-Supervised Learning)

This phase requires massive compute infrastructure (thousands of GPUs). The model is fed enormous datasets of text, and the training objective is simple: given a sequence of tokens, predict the next token.

By minimizing the error in prediction over trillions of tokens, the model inherently learns syntax, grammar, factual knowledge, and reasoning patterns present in the source data. It is "compressing" the internet into its weights.
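At its core, the quantity being minimized is just the cross-entropy between the model's predicted distribution and the token that actually came next. A stripped-down sketch, with a hypothetical `predictNext` standing in for the model and ignoring batching, GPUs, and numerical details:

```typescript
// Average next-token cross-entropy over one sequence of token IDs.
function nextTokenLoss(
  predictNext: (context: number[]) => number[], // hypothetical: probabilities over the vocabulary
  tokenIds: number[]
): number {
  let totalLoss = 0;
  // At each position, the "label" is simply the token that actually came next.
  for (let t = 0; t < tokenIds.length - 1; t++) {
    const probs = predictNext(tokenIds.slice(0, t + 1));
    totalLoss += -Math.log(probs[tokenIds[t + 1]]); // cross-entropy at this position
  }
  return totalLoss / (tokenIds.length - 1);
}
// In practice this is computed from logits with a numerically stable log-softmax,
// batched over millions of sequences on GPU clusters.
```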

Datasets: These models are trained on massive corpora such as Common Crawl (snapshots of the web), The Pile, Wikipedia, Project Gutenberg (books), and large repositories of code (such as GitHub).

Fine-Tuning and Alignment

A pre-trained base model is merely a highly advanced text completion engine. It is not inherently helpful or safe. If asked "How do I build a bomb?", a base model might simply complete the text with a plausible, dangerous continuation based on its training data.

To create models like ChatGPT, Reinforcement Learning from Human Feedback (RLHF) is employed.

  1. Supervised Fine-Tuning (SFT): The model is trained on high-quality, human-curated instruction-response pairs.
  2. Reward Modeling: A separate model is trained to grade different outputs based on human preference data (a sketch of its training loss follows this list).
  3. PPO (Proximal Policy Optimization): The main LLM uses the reward model to optimize its policy, learning to generate responses that maximize the reward score (i.e., responses that are more helpful, honest, and harmless).
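To make step 2 concrete, here is a minimal sketch of the pairwise preference loss commonly used to train reward models: for each human comparison, the preferred response should receive a higher score than the rejected one. `rewardOf` is a hypothetical stand-in for the reward model itself.

```typescript
// Pairwise preference loss: push the reward of the chosen response above the rejected one.
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function preferenceLoss(
  rewardOf: (prompt: string, response: string) => number, // hypothetical reward model
  prompt: string,
  chosen: string,   // the response human labelers preferred
  rejected: string  // the response they ranked lower
): number {
  const margin = rewardOf(prompt, chosen) - rewardOf(prompt, rejected);
  // The loss shrinks as the preferred response's reward pulls ahead of the rejected one.
  return -Math.log(sigmoid(margin));
}
```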

4. Addressing Limitations: Retrieval-Augmented Generation (RAG)

LLMs have significant limitations: their knowledge is cut off at the date their pre-training data was collected, and they are prone to "hallucination"—generating plausible-sounding but factually incorrect information.

For enterprise applications requiring up-to-date or proprietary information, re-training the model continuously is computationally infeasible. The solution is Retrieval-Augmented Generation (RAG).

This is where experience with technologies like OpenSearch and Node.js becomes highly relevant. RAG is an architectural pattern, not a new model type.

The RAG Workflow

RAG separates the knowledge base from the reasoning engine.

  1. Ingestion: Proprietary data (documents, databases) is chunked into smaller segments.
  2. Embedding: Each chunk is passed through an embedding model (like the one described in section 1) to create a vector representation.
  3. Indexing: These vectors are stored in a vector database (e.g., OpenSearch with k-NN capabilities, Pinecone, Weaviate).
  4. Retrieval: When a user query arrives, the query is embedded into the same vector space. A semantic search (using cosine similarity or dot product) identifies the nearest neighbors—the most relevant chunks of data.
  5. Generation: The retrieved context blocks are injected into the system prompt sent to the LLM. The prompt effectively becomes: "Using only the following context data: [retrieved chunks], answer the user query: [user query]."

This forces the LLM to act as a reasoning engine over provided facts, rather than relying on its possibly outdated internal parametric memory.
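Here is a self-contained sketch of that workflow. The character-trigram "embedding" and the in-memory array below are toy stand-ins for a real embedding model and a vector database such as OpenSearch with k-NN enabled; only the shape of ingest, embed, retrieve, and prompt construction is the point.

```typescript
// Toy end-to-end RAG retrieval: ingest chunks, embed them, retrieve by cosine
// similarity, and build the grounded prompt for the LLM.
type Chunk = { text: string; vector: number[] };

// Toy "embedding": hash character trigrams into a fixed-size vector.
// A real system would call a learned embedding model here.
function embed(text: string, dims = 64): number[] {
  const v = new Array(dims).fill(0);
  const s = text.toLowerCase();
  for (let i = 0; i < s.length - 2; i++) {
    let h = 0;
    for (const c of s.slice(i, i + 3)) h = (h * 31 + c.charCodeAt(0)) % dims;
    v[h] += 1;
  }
  return v;
}

const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);
const cosine = (a: number[], b: number[]) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)) || 1);

// Ingestion + indexing: chunk documents and store their vectors (toy example data).
const index: Chunk[] = [
  "Invoices are archived for seven years in the finance S3 bucket.",
  "The on-call rotation is managed in PagerDuty and rotates weekly.",
  "Production deploys require two approvals and a green CI run.",
].map((text) => ({ text, vector: embed(text) }));

// Retrieval: embed the query and take the top-k nearest chunks.
function retrieve(query: string, k = 2): Chunk[] {
  const q = embed(query);
  return [...index]
    .sort((a, b) => cosine(q, b.vector) - cosine(q, a.vector))
    .slice(0, k);
}

// Generation: inject the retrieved context into the prompt sent to the LLM.
const query = "How long do we keep invoices?";
const context = retrieve(query).map((c) => c.text).join("\n");
const prompt = `Using only the following context:\n${context}\n\nAnswer the user query: ${query}`;
console.log(prompt);
```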

[Figure: Retrieval-Augmented Generation (RAG) Flow. A user query enters the application layer (e.g., Node.js), which performs a semantic search against a vector database (e.g., OpenSearch), injects the retrieved context into the prompt for the LLM service, and returns a grounded answer.]

5. Emerging Architectures and Philosophical Implications

Beyond the Transformer

While dominant, Transformers have a weakness: the computational cost of attention scales quadratically with the sequence length (doubling the input length quadruples the compute needed).

New research aims to address this efficiency bottleneck. Mixture of Experts (MoE), used in Mixtral and reportedly in GPT-4, routes each token to specific sub-networks, meaning only a fraction of the model is active for any given inference, saving compute.
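The routing idea is easy to sketch: a small gating function scores every expert for a given token, only the top-k experts actually run, and their outputs are combined using renormalized gate weights. The experts and gate below are hypothetical stand-ins for learned feed-forward networks.

```typescript
// Minimal top-k Mixture-of-Experts routing for a single token.
type Vec = number[];

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

function moeLayer(
  token: Vec,
  experts: Array<(x: Vec) => Vec>, // each expert is a small feed-forward network
  gate: (x: Vec) => number[],      // returns one logit per expert
  k = 2
): Vec {
  const probs = softmax(gate(token));
  // Indices and probabilities of the k highest-scoring experts; the rest stay inactive.
  const topK = probs
    .map((p, i) => [p, i] as const)
    .sort((a, b) => b[0] - a[0])
    .slice(0, k);
  const total = topK.reduce((s, [p]) => s + p, 0);
  // Run only the selected experts and combine their outputs by renormalized weight.
  const selected = topK.map(([p, i]) => ({ weight: p / total, output: experts[i](token) }));
  return token.map((_, dim) =>
    selected.reduce((s, { weight, output }) => s + weight * output[dim], 0)
  );
}
```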

Furthermore, architectures like Mamba, a State Space Model (SSM), are emerging. These attempt to match Transformer performance while scaling linearly with sequence length, potentially replacing attention in future iterations.

The Chinese Room and AGI

Understanding LLMs requires grappling with philosophical questions about the nature of intelligence.

The Chinese Room Analogy, proposed by philosopher John Searle in 1980, is highly relevant. Imagine a person in a closed room who does not understand Chinese. They are given a rulebook (the LLM's weights/code) that instructs them on how to manipulate Chinese characters presented to them through a slot. To an outside observer, the room appears to understand Chinese perfectly, but the person inside is merely manipulating symbols syntactically without understanding their semantic meaning.

Many argue current LLMs are sophisticated "Chinese Rooms"—stochastic parrots that mimic understanding through complex statistical correlations but lack genuine comprehension or sentience.

Artificial General Intelligence (AGI) refers to a hypothetical future AI that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a level equal to or exceeding human capabilities. While current LLMs show impressive performance in narrow domains, they are generally considered distinct from AGI due to their lack of autonomous agency, genuine world models, and ability to reason outside their training distribution.

6. Further Research Areas for Engineers

For engineers looking to deepen their expertise in this domain, I recommend focusing on these technical areas:

  • Vector Database Optimization: Deep dive into HNSW (Hierarchical Navigable Small World) indexing algorithms used in OpenSearch and others for efficient nearest-neighbor search.
  • Quantization and Model Compression: Studying techniques like GPTQ or AWQ to run large models on consumer-grade hardware by reducing precision from 16-bit floats to 4-bit integers (a simplified sketch follows this list).
  • Parameter-Efficient Fine-Tuning (PEFT): Researching LoRA (Low-Rank Adaptation), which allows fine-tuning massive models by training only a tiny fraction of the weights.
  • LLM Ops (MLOps for LLMs): The emerging field of managing the lifecycle, monitoring, and evaluation of LLMs in production environments.
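As a taste of the quantization item above, here is a deliberately stripped-down round-to-nearest symmetric int4 scheme. GPTQ and AWQ are far more sophisticated (they choose quantized values to minimize the resulting output error), but the storage idea is the same: a small integer per weight plus a shared scale, instead of a 16-bit float per weight.

```typescript
// Naive symmetric int4 quantization of a group of weights (round-to-nearest).
function quantizeInt4(weights: number[]): { scale: number; q: number[] } {
  const maxAbs = Math.max(...weights.map(Math.abs)) || 1;
  const scale = maxAbs / 7; // symmetric int4 range used here: -7..7
  const q = weights.map((w) => Math.max(-8, Math.min(7, Math.round(w / scale))));
  return { scale, q };
}

function dequantize(scale: number, q: number[]): number[] {
  return q.map((x) => x * scale); // approximate reconstruction of the original weights
}

const original = [0.42, -1.3, 0.07, 0.9, -0.55];
const { scale, q } = quantizeInt4(original);
console.log(q);                    // small integers, 4 bits each in storage
console.log(dequantize(scale, q)); // close to, but not exactly, the originals
```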