AI Concepts

Foundations: working with LLMs

11 topics

Tokens and the context window: the unit and the budget

Models do not see characters or words. They see tokens, and the context window is the budget you have to spend on them.

The model is not the chat product

ChatGPT and Claude.ai do a lot of work the raw model does not. Knowing the difference saves you from reinventing it badly.

#3 Live

System, user, assistant: the three roles in every chat call

Every API call is a list of role-tagged messages. The roles are not decoration; the model treats each one differently.

#4 Live

Temperature, top-p, top-k: three knobs people keep confusing

Three sampling parameters with overlapping effects. Only two of them earn their place in production.
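To make the overlap concrete, here is a minimal sketch of how the three knobs reshape a next-token distribution. The function name `filter_distribution` and the toy logits are illustrative assumptions, and the order of operations (temperature, then top-k, then top-p) is one common arrangement; real providers may apply or combine them differently.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p.

    `logits` maps each candidate token to its raw score (toy input here).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # Temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k keeps a fixed number of the most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p keeps the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalise so the surviving tokens sum to 1 before sampling.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(filter_distribution(toy_logits, top_k=2))
print(filter_distribution(toy_logits, top_p=0.5))
```

Both top-k and top-p cut off the tail of the distribution, which is why their effects overlap: one cuts by count, the other by cumulative probability.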

#5 Live

Streaming vs blocking: the UX trick that changes nothing about cost

Streaming makes your AI feature feel twice as fast without changing the work the model does.

#6 Live

Token cost math: estimating the bill before you ship

Input tokens and output tokens cost different amounts. Five minutes with a calculator avoids most cost surprises.
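The calculator step can be sketched in a few lines. The function name, the traffic figures, and the prices below are hypothetical placeholders, not any provider's actual rates; substitute the numbers from your provider's pricing page.

```python
def estimate_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    days: int = 30,
) -> float:
    """Estimate spend in dollars over `days` of steady traffic.

    Prices are in dollars per million tokens, with input and output
    priced separately.
    """
    per_request = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 10,000 requests a day, 1,500 input tokens and
# 300 output tokens each, at $3 and $15 per million tokens.
monthly = estimate_cost(10_000, 1_500, 300, 3.0, 15.0)
print(f"${monthly:,.2f}")  # prints $2,700.00
```

In this made-up workload the 300 output tokens cost as much as the 1,500 input tokens, which is the kind of asymmetry the estimate is meant to surface.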

#7 Live

TTFT vs total latency: two numbers, two different problems

Total latency scales with how many tokens you generate, which is also what the bill tracks. Time-to-first-token is what the user feels.

#8 Live

Embeddings and cosine similarity: turning text into a number you can compare

An embedding is a vector. Cosine similarity asks 'do these two vectors point the same way?' That is the whole story behind half of modern AI search.
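The "point the same way" question is the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the tiny two-dimensional vectors are toy values, whereas real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors `a` and `b`.

    1.0 means same direction, 0.0 means orthogonal, -1.0 means opposite.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because the result depends only on direction, scaling either vector by a positive constant leaves the similarity unchanged.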

#9 Live

Picking a model: the honest map of the big four

Claude, GPT, Gemini, Llama. Each one is strong somewhere and weak somewhere else. There is no universally right pick.

#10 Live

Rate limits, retries, and backoff: the boring layer that keeps you online

Every provider has limits. The difference between a flaky feature and a reliable one is a hundred lines of retry logic.
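The core of that retry logic is exponential backoff with jitter. The sketch below is a minimal version under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever exception your provider's SDK raises on a rate limit, and `call_with_backoff` is an illustrative name, not a library function.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call `fn`, retrying on RateLimitError with exponential backoff.

    The wait before each retry is drawn uniformly from zero up to
    min(max_delay, base_delay * 2**attempt), the "full jitter" variant,
    so that many clients do not all retry at the same instant.
    `sleep` is injectable so tests can run without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A production version would typically also honour any retry-after hint the provider returns and retry only on errors known to be transient.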

#11 Live

Playground vs production: why the prompt that worked breaks in code

Provider playgrounds quietly do five things your code does not. Knowing which ones saves a day of debugging.

Prompting as engineering

12 topics

#12 Live

System prompts that earn their tokens

A good system prompt costs 200 tokens and changes the whole feature. A bad one costs 3000 tokens and still does not work.

#13 Live

Few-shot examples: when 3 beats 0 and when 0 beats 3

Showing the model 3 to 5 examples can lift quality more than a bigger model. Sometimes it does nothing and costs you tokens.

#14 Live

Chain of thought: when reasoning out loud helps and when it just doubles the cost

Asking the model to think step by step can lift accuracy on hard tasks. It can also double your bill on easy ones.

#15 Live

Structured outputs: stop parsing model text yourself

Asking for JSON in the prompt is the old way. JSON mode guarantees syntactically valid JSON, and structured outputs go further by constraining the model to your schema.

#16 Live

Schemas and validation: shapes that protect both sides

A good schema is more than 'object with three fields'. It is the contract that tells the model and your code what is allowed.

#17 Live

Multi-turn conversation: keeping state without going broke

Every turn ships the whole history. By turn 50, you are sending 30k tokens of context. Three patterns keep it sane.

#18 Live

Prompt versioning: prompts are code

If your prompt lives in a Google Doc, the next change will break a feature and no one will know which one.

#19 Live

Hallucination: what it really is and how to fight it

The model is not lying. It is doing exactly what it was trained to do: predict the next plausible token. Knowing why helps you stop it.

#20 Live

Refusal and over-refusal: when the model says 'I cannot help with that'

Models refuse for two reasons: real safety policy and an overcautious guess. The second one breaks features more often than the first.

#21 Live

Truncated and malformed JSON: when the brace never closes

Even with structured outputs, JSON can fail. Token limits, network drops, edge cases. Three patterns make it survivable.

#22 Live

Off-topic drift: when the model answers the wrong question

A user asks about taxes. Three turns later the assistant is recommending stretching exercises. Drift has a cause and a fix.

#23 Live

Cost-aware prompt design: writing prompts that do not blow the budget

Prompt design is half quality, half cost. Most teams optimize for quality and ignore cost until the bill arrives.

RAG and retrieval

10 topics

#24 Live

Agents and tool use

5 topics

#34 Live

Evaluation

8 topics

#39 Live

Production AI systems

14 topics

#47 Live

Interview craft

10 topics

#61 Live

Plain-English answers to the questions every AI engineer keeps getting asked.

Picking an embedding model: the choice that shapes the whole RAG

Chunking text: the size and shape problem

Sliding window chunking: small chunks with neighbour context

Hierarchical chunking: match the child, send the parent

Picking a vector database: the five real options

pgvector: when 'we already have Postgres' is the right answer

Hybrid search: vectors plus keywords

Reciprocal rank fusion: combining lists without tuning

Reranking with cross-encoders: the second pass that fixes retrieval

Top-K and recall: how to pick K and measure if you got it right

Agent memory and state: short-term, session, persistent

Multi-agent patterns: why most should be a router, not a debate

Frameworks vs bare-metal: LangChain, LangGraph, LlamaIndex, custom

Tool design principles: idempotent, validating, actionable errors

When NOT to use an agent: the prompt chain that solves it instead

The non-determinism problem: why assert breaks LLM tests

Golden datasets: the most valuable asset on the team

Rule-based evals: regex, schema, exact match, cheap and deterministic

LLM-as-judge: picking a judge, calibrating it, the judge as its own product

Reference-free evals: groundedness, refusal correctness, JSON validity

RAG-specific evals: recall@K, faithfulness, context precision

Eval tooling: Ragas, Promptfoo, Braintrust, LangSmith, Phoenix

CI regression and A/B in production: fail the build, shadow the traffic

Streaming UX: perceived latency, partial rendering, when to start the stream

Provider-side prefix caching: free wins on long system prompts

Semantic caching: embed the query, look up similar past answers

Model routing: cheap-and-fast vs smart-and-expensive with a classifier

Output length caps and token trimming: keeping bills bounded

Open vs closed models: when self-hosting actually pays off

Self-hosted LLM serving: vLLM, TGI, Ollama, the honest comparison

Prompt injection defences: layered, never trust user input

Output validation before side effects: the 'never let the model delete' rule

PII redaction and data residency: keeping the LLM out of the audit log

LLM tracing and observability: end-to-end spans across model calls

Failover and circuit breakers: routing around a down provider

Fine-tuning, only when needed: LoRA, QLoRA, the synthetic-data trap

Cost attribution per request, per user, per feature

Reading the AI Engineer interview loop: four shapes, no standard

System design with an AI lens: model and cost first, not last

Designing a RAG in 45 minutes: the senior outline

Designing an eval pipeline: golden set, judge, CI, the metric that decides

Designing an agent: tools, guards, observability, the loop

Cost and latency as one conversation, not three

Handling 'what if the model is wrong?' with layered defence

The AI engineer take-home: eval set included, cost numbers included

Common interview traps: over-using LangChain, ignoring eval, expensive defaults

Senior signal: when not to use AI at all
