Layer 1: Surface
An LLM is a stateless function that takes text in and returns text out.
That’s it. No memory between calls. No internal database. No understanding. It is a mathematical function, trained on enormous amounts of text, that predicts what text should come next given the text it received. Every decision you make about LLM architecture flows from this single fact.
Think of it like a pure function in a microservice: same input, roughly same output, no side effects, no persistent state. The model was trained once; the weights are frozen. At runtime it applies those weights to whatever input you give it.
It does not:
- Store your previous conversations (unless you send them back in)
- Query a database or look anything up
- “Think” in the way humans do: it produces text that looks like reasoning because reasoning-shaped text was in its training data
- Learn from your usage
Why it matters
If someone on your team says “the LLM will remember that from last time” or “it knows our codebase,” they are wrong, unless your system explicitly passes that context in on every single call. There is no implicit memory. This is the constraint that shapes your entire retrieval and context-management architecture.
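Concretely, “memory” is just the caller resending prior turns. A minimal sketch of what that looks like on the wire (no API call is made here; the message shapes match the SDK examples later in this module):

```python
def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """The only way the model 'remembers' turn 1 during turn 2 is if
    turn 1 is physically present in the request payload for turn 2."""
    return history + [{"role": "user", "content": new_user_message}]

# Turn 1: the model sees one message.
turn1 = build_messages([], "My name is Ada.")

# Turn 2 WITHOUT resending history: the model has no idea who Ada is.
turn2_amnesiac = build_messages([], "What's my name?")

# Turn 2 WITH history: the name is in the request, so the model can use it.
history = turn1 + [{"role": "assistant", "content": "Nice to meet you, Ada."}]
turn2_with_memory = build_messages(history, "What's my name?")

print(len(turn2_amnesiac))     # 1 message: no context
print(len(turn2_with_memory))  # 3 messages: context carried by the caller
```

The model itself is identical in both calls; all the “memory” lives in the request your system constructs.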
Production Gotcha
Token counts include both your input AND the model’s output. A 128K context window isn’t 128K tokens of input: your prompt, system message, conversation history, and retrieved documents all count against the same limit.
Teams hit this when a feature works fine in testing (small inputs) but silently degrades or throws context-limit errors in production as conversation history accumulates. The fix is to track token usage from day one and implement context-trimming or summarisation before you need it.
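A minimal trimming sketch along those lines, using a rough 4-characters-per-token heuristic (the exact ratio is an assumption; use your provider’s tokenizer for real budgets):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Real budgeting should use the provider's tokenizer.
    return max(1, len(text) // 4)

def trim_history(history: list[dict], budget: int) -> list[dict]:
    """Drop the oldest turns until the estimated total fits the budget.
    Always keeps the most recent message."""
    trimmed = list(history)
    while len(trimmed) > 1 and sum(estimate_tokens(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)  # drop the oldest turn first
    return trimmed

history = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 400},       # ~100 tokens
]
print(len(trim_history(history, budget=250)))  # 2: the oldest turn was dropped
```

Summarisation-based trimming replaces the dropped turns with a short model-written summary instead of discarding them outright; the budget arithmetic is the same.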
Layer 2: Guided
The four concepts that drive every decision
Tokens, the unit of work. LLMs don’t read words; they read tokens, chunks of text averaging roughly three-quarters of an English word. “Unbelievable” is three tokens. A 10-page document is ~3,000 tokens. You are billed per token, both input and output. Token efficiency is cost efficiency.
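The three-quarters rule gives a quick back-of-envelope estimator. A sketch (heuristic only; real counts come from the provider’s tokenizer):

```python
def rough_token_count(text: str) -> int:
    # Rule of thumb from above: 1 token ~ 3/4 of an English word,
    # so tokens ~ words / 0.75. Only an estimate.
    words = len(text.split())
    return round(words / 0.75)

doc = "the quick brown fox jumps over the lazy dog " * 250  # ~2,250 words
print(rough_token_count(doc))  # ~3,000 tokens: roughly the 10-page document above
```

Good enough for budgeting dashboards; not good enough for deciding whether a request will fit the context window.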
Context window: the working memory constraint. The context window is the total number of tokens the model can see in a single call. Modern models range from 128K to 1M+ tokens. Anything outside the window might as well not exist.
Temperature, the creativity dial. Temperature (0.0–2.0) controls randomness. At 0, the model picks the most probable next token. At higher values it explores less-probable options: more creative, but more likely to hallucinate. Use low temperature for code and data extraction; higher for brainstorming.
Inference, the runtime cost model. Training is the one-time cost of creating the model. Inference is the ongoing cost every time someone uses it. Each API call bills you for tokens in + tokens out. Unlike per-seat SaaS, costs scale with usage and request size, so budget and monitor from day one.
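The billing arithmetic is simple enough to sketch. The per-million-token prices below are illustrative placeholders, not any provider’s real rates:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Per-request cost: input and output tokens are billed at different
    per-million-token rates (output is typically several times pricier)."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical rates: $3 / 1M input tokens, $15 / 1M output tokens.
cost = request_cost_usd(50_000, 1_000, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
print(f"${cost:.4f}")  # $0.1650
```

Note the asymmetry: a retrieval-heavy request with 50K input tokens costs far more in input than in output, which is why trimming context pays off directly in dollars.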
Step-by-step: your first API call
Here’s the pattern in provider-neutral pseudocode, followed by a concrete implementation. Every major provider follows the same basic shape (a messages array, a system instruction, a response text field) with slightly different field names.
```python
# --- pseudocode: works with any provider ---
response = llm.chat(
    model="frontier",  # pick your model — see provider table below
    messages=[
        {"role": "user", "content": "Explain what a context window is in one sentence."}
    ],
    max_tokens=1024,
)
print(response.text)  # field name varies by SDK — see provider table
print(f"Tokens used: {response.usage.total}")
```
In practice, with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain what a context window is in one sentence."}
    ],
)
print(response.content[0].text)

# Always log token usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total: {response.usage.input_tokens + response.usage.output_tokens}")
```
Tracking token budget in a multi-turn conversation
```python
# --- pseudocode ---
history = []

def chat(user_message: str, max_context_tokens: int = 100_000) -> str:
    history.append({"role": "user", "content": user_message})
    response = llm.chat(model="frontier", messages=history, max_tokens=1024)
    reply = response.text
    history.append({"role": "assistant", "content": reply})
    used = response.usage.total
    remaining = max_context_tokens - used
    print(f"[tokens used: {used:,} | remaining budget: {remaining:,}]")
    return reply
```
In practice, with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

def chat(history: list[dict], user_message: str, max_context_tokens: int = 100_000):
    """
    Naive multi-turn chat — demonstrates the context accumulation problem.
    In production you'd trim or summarise history before it hits the limit.
    """
    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=history,
    )
    assistant_reply = response.content[0].text
    # OpenAI: response.choices[0].message.content
    # Gemini: response.text
    history.append({"role": "assistant", "content": assistant_reply})
    used = response.usage.input_tokens + response.usage.output_tokens
    # OpenAI: response.usage.prompt_tokens + response.usage.completion_tokens
    remaining = max_context_tokens - used
    print(f"[tokens used: {used:,} | remaining budget: {remaining:,}]")
    return assistant_reply, history

history = []
reply, history = chat(history, "What is temperature in an LLM?")
print(reply)
reply, history = chat(history, "And what's a good value for code generation?")
print(reply)
```
Before vs. After
Naive approach (context blindness):
```python
# BAD: dumps the entire codebase into every request.
# Silently hits context limits; expensive; slow.
context = read_entire_codebase()  # could be millions of tokens
response = llm.call(f"{context}\n\nAnswer: {user_question}")
```
Better approach (selective retrieval):
```python
# GOOD: retrieve only the relevant chunks.
relevant_chunks = vector_search(user_question, top_k=5)
context = "\n\n".join(relevant_chunks)  # ~2,000 tokens vs. millions
response = llm.call(f"{context}\n\nAnswer: {user_question}")
```
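vector_search does the heavy lifting there. As a stand-in, a toy keyword-overlap ranker shows the same architectural shape (real systems use embeddings; the point is the same: send the model a few relevant chunks, not everything):

```python
import re

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(question: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Toy stand-in for vector_search: rank chunks by word overlap
    with the question and keep the top_k best matches."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "The billing service retries failed charges three times.",
    "Our logo uses the corporate blue palette.",
    "Charges are retried with exponential backoff in billing.",
]
top = keyword_search("How does billing retry failed charges?", chunks, top_k=2)
print(top)  # the two billing chunks; the logo chunk is filtered out
```

Swapping this for an embedding-based search changes the ranking function, not the architecture: the LLM still only ever sees the few chunks you choose to send.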
Common mistakes
- Assuming memory: Calling the API twice and expecting the second call to know about the first.
- Ignoring output tokens: Budgeting only for input tokens and getting surprised by cost.
- Single temperature everywhere: Using the same temperature for both creative writing and data extraction.
- No usage monitoring: Discovering costs and context limits the hard way in production.
Provider SDK quick reference
The pseudocode above uses a simplified interface. Here’s how those concepts map to the major SDKs; pick whichever matches your stack.
| | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Package | pip install anthropic | pip install openai | pip install google-genai |
| Client | anthropic.Anthropic() | OpenAI() | genai.Client() |
| API call | client.messages.create(...) | client.chat.completions.create(...) | client.models.generate_content(...) |
| System prompt | system="..." param | {"role": "system"} in messages | config=types.GenerateContentConfig(system_instruction="...") |
| Response text | response.content[0].text | response.choices[0].message.content | response.text |
| Input tokens | response.usage.input_tokens | response.usage.prompt_tokens | response.usage_metadata.prompt_token_count |
| Output tokens | response.usage.output_tokens | response.usage.completion_tokens | response.usage_metadata.candidates_token_count |
| Stop signal | response.stop_reason | response.choices[0].finish_reason | response.candidates[0].finish_reason |
Anthropic and OpenAI share a messages-style conversation format: [{"role": "user"/"assistant", "content": "..."}]. Gemini’s SDK uses a different shape: the parameter is contents, roles are "user" and "model" (not "assistant"), and content lives in parts. The concepts are the same; the field names aren’t. Later modules will reference this table rather than repeating it.
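One practical consequence: a thin adapter can hide the field-name differences from the rest of your code. A duck-typed sketch based on the table above (the fake response objects are for illustration only; real SDK objects have more fields):

```python
from types import SimpleNamespace

def extract_text(response) -> str:
    """Normalise response-text access across the three SDK shapes
    in the table above. Adapt to the real SDK objects you use."""
    if hasattr(response, "content"):   # Anthropic: response.content[0].text
        return response.content[0].text
    if hasattr(response, "choices"):   # OpenAI: response.choices[0].message.content
        return response.choices[0].message.content
    return response.text               # Gemini: response.text

# Fake response objects shaped like the table rows, for illustration.
anthropic_like = SimpleNamespace(content=[SimpleNamespace(text="ok")])
openai_like = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))])
gemini_like = SimpleNamespace(text="ok")

print(extract_text(anthropic_like), extract_text(openai_like), extract_text(gemini_like))
```

The same pattern works for usage fields; keeping it in one module means a provider switch touches one file, not every call site.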
Layer 3: Deep Dive
Architecture analysis
The transformer architecture that underlies LLMs (Vaswani et al., 2017) uses self-attention to build relationships between all tokens in the input simultaneously, rather than processing them sequentially. This is why context window size matters structurally: the attention computation is quadratic in sequence length.
The “stateless function” property is not a limitation of current models: it is a deliberate design constraint that makes LLMs horizontally scalable. A stateful model would require session affinity, distributed state management, and complex coordination between replicas. Statelessness means any instance can handle any request, enabling the elastic infrastructure that makes cloud-hosted inference economically viable.
Temperature controls the sharpness of the probability distribution over next tokens. At temperature 0 it becomes a hard argmax (deterministic). At temperature 1 it samples from the raw distribution. Above 1 it flattens the distribution, making low-probability tokens more likely: useful for creative tasks but harmful for factual retrieval.
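In code, temperature is a divisor applied to the logits before the softmax. A sketch with made-up logits showing the sharpening/flattening effect:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Temperature divides the logits before softmax: T < 1 sharpens the
    distribution, T > 1 flattens it. (T = 0 is handled separately as argmax.)"""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)  # top token dominates
flat = softmax_with_temperature(logits, 2.0)   # probabilities even out
print(round(sharp[0], 2), round(flat[0], 2))
```

Running this shows the top token’s probability climbing toward 1 at low temperature and sinking toward uniform at high temperature, which is exactly the hallucination trade-off described above.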
Production failure modes
Silent context overflow. As conversations grow, inputs silently exceed the window and the model either errors or starts ignoring earlier context. Teams often discover this through degraded output quality rather than explicit failures. Fix: instrument token usage per call; apply sliding-window trimming or summarisation when approaching 80% of the context limit.
Temperature misconfiguration. A single global temperature setting causes subtle quality problems: hallucinations in structured extraction tasks; overly conservative outputs in creative tasks. Fix: configure temperature per use case; default to 0.0 for anything requiring accuracy.
Model-as-oracle assumption. Teams build systems that trust LLM output without validation: particularly dangerous for arithmetic, dates, and citations. Fix: layer structured output parsing (JSON schema validation), domain-specific checks, and retrieval-backed verification on top of LLM generation.
Cost spikes from unbounded output. Not setting max_tokens appropriately means a single malformed prompt can trigger a very long (and expensive) completion. Fix: always set a max_tokens ceiling appropriate to the task; monitor p99 output token counts.
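Monitoring p99 output token counts is a few lines once you log usage per request. A nearest-rank sketch (assumes you are recording output token counts somewhere; real stacks usually get percentiles from their metrics system):

```python
import math

def p99(values: list[int]) -> int:
    """Nearest-rank 99th percentile: the value below which ~99% of
    observations fall. Useful for sizing a max_tokens ceiling."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]

output_token_counts = list(range(1, 101))  # hypothetical counts from 100 requests
print(p99(output_token_counts))  # 99
```

A sensible max_tokens ceiling sits a little above the observed p99: high enough not to truncate legitimate answers, low enough that one malformed prompt can’t run up the bill.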
Further reading
- Attention Is All You Need (Vaswani et al., 2017). The paper that introduced the transformer architecture all modern LLMs are built on.
- Language Models are Few-Shot Learners (Brown et al., 2020). The GPT-3 paper that demonstrated emergent few-shot capabilities at scale.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020). Explains how model capability scales with parameters, data, and compute, useful for understanding why bigger models cost more and what you get for it.
- OpenAI, Anthropic, and Google Gemini API references. Each provider’s reference for the messages call, usage fields, and model parameters. Check your provider’s docs for current field names.