LLMs: A Hacker's Guide
A practitioner's guide to working with LLMs, drawn from real shipping-industry AI deployments. Covers the iterative workflow for building with LLMs, input/output best practices, debugging frameworks, and honest assessments of embeddings and long-context models.
Key Concepts
| Concept | Definition |
|---|---|
| CPLN loop | Chat → Playground → Loop (add data/test cases) → Nest (break into subtasks) — the core iterative workflow for LLM development |
| Modalities | Text, vision, audio/speech, and code — all available and underused |
| Structured output | Using type specs (e.g., Zod, TypeScript) to constrain model output and reduce hallucinations |
| Nested prompts | Breaking large prompts into smaller, composable sub-tasks, in the same way you would refuse to accept a 700-line code file |
| Error taxonomy | Three root causes — app-level orchestration issues, factuality issues, instruction-following issues |
| Embeddings as fuzzy search | Best used at the *end* of a pipeline on an already-reduced search space, not as the primary retrieval mechanism |
Notes
The Iterative Loop (CPLN)
- **Chat** (~50%+ of time): Explore freely, try radically different approaches; don't commit to a single prompt early
  - Most teams stop at ~40% working and go to production too soon
  - Unlike code, prompts *should* be rewritten many times before settling
- **Playground** (~20%): Use any provider's playground; key feature needed is retroactive editing of prompts and conversation history ("surfing the latent space")
- **Loop**: Add more data and test cases; stress-test your hypothesis
- **Nest**: Break prompts into smaller and smaller subtasks
  - If you wouldn't accept a 700-line code file, don't accept a 100-line prompt
  - Every prompt can be decomposed further
  - Subtasks then re-enter the same loop when new problems or customers arrive
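One way to picture the Nest step is as a chain of small prompt functions, each of which can be tested and rewritten on its own. The sketch below is illustrative only: the `Model` callable is a hypothetical stand-in for whichever provider client you use, and the two subtasks (`extract_claims`, `summarize_claims`) are invented for the example rather than taken from the talk.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns the model's text.
Model = Callable[[str], str]


def extract_claims(model: Model, text: str) -> list[str]:
    """Subtask 1: ask only for claims, one per line, and parse the lines."""
    prompt = f"List each factual claim in the text, one per line.\n\nText:\n{text}"
    reply = model(prompt)
    # Drop blank lines and any leading "- " bullet markers.
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def summarize_claims(model: Model, claims: list[str]) -> str:
    """Subtask 2: summarize a list of claims that was already extracted."""
    bullet_list = "\n".join(f"- {c}" for c in claims)
    prompt = f"Summarize these claims in one sentence.\n\nClaims:\n{bullet_list}"
    return model(prompt).strip()


def summarize_document(model: Model, text: str) -> str:
    """Parent task: composed from the two subtasks instead of one large prompt."""
    return summarize_claims(model, extract_claims(model, text))
```

Because each subtask takes the model as an argument, a failing step can be swapped to a different model or prompt without touching the others.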
Input Modality Best Practices
- **Speech**: Users give ~200 words when asked to speak vs. ~5 words in a text box — use it to gather richer context
- **Vision**: Captures relationships that text cannot; useful as expensive-but-dense OCR; visual diagrams are often more token-efficient than their text equivalents
- **Code/structured input**: Almost always prefer structured input *and* structured output
  - Tools: TypeScript + Zod for type specs; SQL to express search logic (even if never executed)
  - Structured output reduces hallucinations by constraining the token probability space
- **Lean on the model's training**: Express problems as supersets of known languages (Python, TypeScript, English) rather than inventing custom DSLs
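As a rough illustration of the structured-output idea, the sketch below validates a model's JSON reply against a small type spec. It uses only the Python standard library as a stand-in for a Zod-style schema; the `Ticket` spec and its fields are hypothetical. Note that this checks output after generation. Constraining the tokens during decoding is a separate, provider-side feature that this sketch does not cover.

```python
import json
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical output spec for the example; field names are illustrative.
    title: str
    priority: int  # expected range: 1 to 3
    tags: list[str]


def parse_ticket(raw: str) -> Ticket:
    """Parse model output and reject anything that does not match the spec."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"title", "priority", "tags"}:
        raise ValueError(f"unexpected or missing keys: {sorted(data)}")
    if not isinstance(data["title"], str):
        raise ValueError("title must be a string")
    priority = data["priority"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 3:
        raise ValueError("priority must be an integer from 1 to 3")
    tags = data["tags"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    return Ticket(**data)
```

Every failure surfaces as a `ValueError` with a specific message, which also helps locate where a prompt is going wrong.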
General Dos and Don'ts
- **Do**
  - Use all available modalities
  - Try multiple models; they have diverged significantly and have different "personalities"
  - Keep input/output size ratios roughly proportional; 5 words in → 20 paragraphs out = poor output
  - Add structure at both input and output stages
- **Don't**
  - Add heavy abstractions (frameworks, libraries) between yourself and the model early on; you learn less and get trapped
  - Stick to one model provider
  - Expect massive output from minimal input
What LLMs Enable (4 Capability Classes)
Debugging Framework
- Drop to prompt level; remove abstractions and work back up
- Try a smarter model — if that fixes it, the problem is prompt/input
- Try a dumber model — reveals where reasoning is actually needed
- Transform the input (verbosity, structure, chunking)
- Add more structure to the output to expose where failures occur
- Find what separates failing cases from working cases — that difference points to a prompt fix
- Add more validation
- If the system is doing more than one of the four capability classes, separate them into distinct prompts/models
- Classify errors into: app-level orchestration, factuality, or instruction-following
- You are almost always too verbose — cut prompts down, then cut again
- Lower task complexity per prompt = easier to debug, easier to swap out broken pieces
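Two of the steps above, classifying errors and finding what separates failing cases from working ones, can be sketched as a tiny test harness. Everything here is an assumption made for illustration: the `run` and `check` callables, the use of input length as the feature being compared, and the hand-assigned labels are not from the talk.

```python
from collections import Counter
from enum import Enum


class ErrorKind(Enum):
    # The three root causes named in the error taxonomy.
    ORCHESTRATION = "app-level orchestration"
    FACTUALITY = "factuality"
    INSTRUCTION_FOLLOWING = "instruction-following"


def split_cases(cases, run, check):
    """Run every (input, expected) case and separate passing inputs from failing ones."""
    passing, failing = [], []
    for case_input, expected in cases:
        bucket = passing if check(run(case_input), expected) else failing
        bucket.append(case_input)
    return passing, failing


def average_length(inputs):
    """One cheap feature to compare between the two groups: mean input length."""
    return sum(len(s) for s in inputs) / len(inputs) if inputs else 0.0


def tally_errors(labelled_failures):
    """Count hand-labelled failures per root cause, given (input, ErrorKind) pairs."""
    return Counter(kind for _, kind in labelled_failures)
```

If failing inputs turn out to be much longer on average than passing ones, for example, that points toward chunking or restructuring the input rather than rewording the prompt.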
Project Example: Transcript → Docs Tool
Cost & Performance Trajectory
- Expect 10–50x cost reduction and speed increase in the near term
- Drivers: hardware-level optimization (Nvidia), memory optimizations, quantization — all incremental engineering, not research breakthroughs
- Design systems for where costs and latency will be in 6 months, not today
Long-Context Windows (Q&A)
- Attention is quadratic in context length: doubling the context requires roughly 4x the attention memory/compute
- Practical long-context implementations "cheat": a pre-pass selects which tokens to attend to, so you don't truly get full-context reasoning
- Still an open research problem with no known clean solution
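The quadratic claim is easy to check with arithmetic. The sketch below counts the entries in a single full attention score matrix (one score per pair of tokens), assuming a naive implementation with no sparsity or pre-selection; real systems differ in how much of this they materialize in memory.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Entries in one full attention score matrix: one score per token pair."""
    return context_tokens * context_tokens


# Doubling the context quadruples the number of pairwise scores.
for n in (8_000, 16_000, 32_000):
    ratio = attention_score_entries(2 * n) / attention_score_entries(n)
    print(f"{n} tokens -> {attention_score_entries(n)} scores; doubling multiplies by {ratio:.0f}")
```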
Input Transformation (Q&A)
- Use AI for complex transformations; use deterministic/structural methods where possible
- Simple structural transformations (split by sentence, extract title/sections from markdown) are already valuable
- Titles are the most compressed information in a document, and the first section is usually the intro; both are free signals
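The deterministic transformations mentioned above need very little code. The sketch below is a deliberately naive version: it assumes `#` marks the title and `##` marks sections, and it splits sentences on terminal punctuation only. The function names are made up for the example.

```python
import re


def outline_markdown(doc: str) -> dict:
    """Extract the title, the sections, and the intro (first section body)."""
    title = ""
    sections: list[tuple[str, str]] = []  # (heading, body) pairs
    for line in doc.splitlines():
        if line.startswith("# "):
            # Keep the first level-1 heading as the title; ignore later ones.
            title = title or line[2:].strip()
        elif line.startswith("## "):
            sections.append((line[3:].strip(), ""))
        elif sections and line.strip():
            # Append non-blank text to the body of the most recent section.
            heading, body = sections[-1]
            sections[-1] = (heading, (body + " " + line.strip()).strip())
    intro = sections[0][1] if sections else ""
    return {"title": title, "sections": sections, "intro": intro}


def split_sentences(text: str) -> list[str]:
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

The title, intro, and sentence list can then be passed to the model as labelled fields instead of one undifferentiated blob of text.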
Embeddings & Vector Search (Q&A)
- Embedding models are small — limited understanding of underlying text; long-context embeddings in particular are unreliable
- Cosine similarity is the only usable signal once embedded, and the model internals are opaque
- **Correct pattern**: reduce search space first using structured search (BM25, keyword filters, metadata, LLM-based pre-filtering), *then* apply embeddings on the reduced set
- Embeddings should be the **last step**, not the first
- For ranking/highlighting within retrieved results (e.g., finding the most relevant sentence within a page), embeddings are appropriate
- Increasingly viable alternative: use an LLM directly for retrieval tasks where embeddings were previously used
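The "embeddings last" pattern can be sketched in two steps: a structured filter shrinks the candidate set, and similarity ranking runs only on what survives. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the document shape (`id`, `tags`, `text`) and the tag filter are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(docs: list[dict], query: str, required_tag: str, top_k: int = 2) -> list[str]:
    # Step 1: structured reduction (here a metadata filter) shrinks the search space.
    candidates = [d for d in docs if required_tag in d["tags"]]
    # Step 2: embeddings rank only the surviving candidates.
    q = toy_embed(query)
    ranked = sorted(candidates, key=lambda d: cosine(q, toy_embed(d["text"])), reverse=True)
    return [d["id"] for d in ranked[:top_k]]
```

A document that scores highest on similarity alone is still excluded if it fails the structured filter, which is the point of doing the reduction first.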
Actionable Takeaways
- Spend the majority of your time in the **Chat** phase trying genuinely different approaches — not incrementally tweaking one prompt
- **Nest every prompt**: if a task can be broken into subtasks, it should be
- Always use **structured output** (type specs, JSON schemas) — it reduces hallucinations and makes debugging easier
- **Transform your input** before passing to the model: extract structure, separate by sentence, label sections
- Use **multiple models** for different subtasks; don't assume one model is the ceiling
- When debugging, **classify the error** (app-level, factuality, instruction-following) before changing anything
- Design for **10–50x cheaper/faster** compute — build what becomes viable in 6 months, not just what's practical today
- For RAG/vector search: **reduce search space structurally first**, then apply embeddings only on the remainder
Quotes Worth Keeping
If you're not going to accept a 700-line code file as good, you shouldn't accept a 100-line prompt as good.
You might try something and genuinely be the first person in the world to have thought of that particular way of solving a problem with a model.
If you present users with a text box they'll give you five words. If you ask them to press a button and talk, they'll give you 200 words.
Embeddings should always be the last step in your pipeline. You should never be searching your full search space with embeddings.