LLMs: A Hacker's Guide

Hrishi · 2026-05-22 · via captions

A practitioner's guide to working with LLMs, drawn from real shipping-industry AI deployments. Covers the iterative workflow for building with LLMs, input/output best practices, debugging frameworks, and honest assessments of embeddings and long-context models.

Key Concepts

| Concept | Definition |
| --- | --- |
| CPLN loop | Chat → Playground → Loop (add data/test cases) → Nest (break into subtasks): the core iterative workflow for LLM development |
| Modalities | Text, vision, audio/speech, and code; all available and underused |
| Structured output | Using type specs (e.g., Zod, TypeScript) to constrain model output and reduce hallucinations |
| Nested prompts | Breaking large prompts into smaller, composable sub-tasks, analogous to refusing 700-line code files |
| Error taxonomy | Three root causes: app-level orchestration issues, factuality issues, instruction-following issues |
| Embeddings as fuzzy search | Best used at the *end* of a pipeline on an already-reduced search space, not as the primary retrieval mechanism |

Notes

The Iterative Loop (CPLN)

  • **Chat** (~50%+ of time): Explore freely and try radically different approaches; don't commit to a single prompt early
      • Most teams stop at ~40% working and go to production too soon
      • Unlike code, prompts *should* be rewritten many times before settling
  • **Playground** (~20%): Use any provider's playground; the key feature needed is retroactive editing of prompts and conversation history ("surfing the latent space")
  • **Loop**: Add more data and test cases; stress-test your hypothesis
  • **Nest**: Break prompts into smaller and smaller subtasks
      • If you wouldn't accept a 700-line code file, don't accept a 100-line prompt
      • Every prompt can be decomposed further
      • Subtasks then re-enter the same loop when new problems or customers arrive
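
The Nest step can be sketched in code. This is a minimal illustration, not the speaker's implementation: `call_model` is a hypothetical placeholder for whatever provider client you use, and the three sub-task prompts are invented for the example. The point is only that each sub-task is a small function that can be tested, swapped, or sent back through the loop on its own.

```python
from typing import Callable

Model = Callable[[str], str]


def call_model(prompt: str) -> str:
    """Hypothetical placeholder; replace with a real provider call."""
    return f"[model reply to {len(prompt)} chars of prompt]"


# Each sub-task is one short prompt with a single responsibility.
def extract_topics(transcript: str, model: Model) -> str:
    return model(f"List the main topics covered in this transcript:\n{transcript}")


def build_outline(topics: str, model: Model) -> str:
    return model(f"Turn these topics into a short section outline:\n{topics}")


def draft_document(outline: str, model: Model) -> str:
    return model(f"Write one paragraph for each outline entry:\n{outline}")


def transcript_to_doc(transcript: str, model: Model = call_model) -> str:
    """Chain the sub-tasks; any one of them can be debugged or replaced alone."""
    topics = extract_topics(transcript, model)
    outline = build_outline(topics, model)
    return draft_document(outline, model)
```

Passing the model in as a parameter also makes it easy to run one sub-task against a different model without touching the others.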

Input Modality Best Practices

  • **Speech**: Users give ~200 words when asked to speak vs. ~5 words in a text box; use it to gather richer context
  • **Vision**: Captures relationships that text cannot; useful as expensive-but-dense OCR; visual diagrams are often more token-efficient than their text equivalents
  • **Code/structured input**: Almost always prefer structured input *and* structured output
      • Tools: TypeScript + Zod for type specs; SQL to express search logic (even if never executed)
      • Structured output reduces hallucinations by constraining the token probability space
  • **Lean on the model's training**: Express problems as supersets of known languages (Python, TypeScript, English) rather than inventing custom DSLs
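
As a rough sketch of the structured-output idea, the snippet below checks a model's JSON reply against a small type spec. The notes name TypeScript + Zod; this uses plain Python dataclasses instead so it runs with the standard library only. The `ActionItem` schema and `parse_action_item` helper are hypothetical. Note that this validates output after the fact; it does not constrain decoding the way a provider's structured-output mode might.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:  # hypothetical schema, for illustration only
    owner: str
    task: str
    tags: List[str]


def parse_action_item(raw: str) -> ActionItem:
    """Parse a model reply and reject anything that does not match the spec."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"owner", "task", "tags"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    if not isinstance(data["owner"], str) or not isinstance(data["task"], str):
        raise ValueError("owner and task must be strings")
    tags = data["tags"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    return ActionItem(owner=data["owner"], task=data["task"], tags=tags)
```

A failed parse tells you exactly which field the model got wrong, which is usually more useful for debugging than a free-text reply that merely looks off.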

General Dos and Don'ts

  • **Do**
      • Use all available modalities
      • Try multiple models; they have diverged significantly and have different "personalities"
      • Keep input/output size ratios roughly proportional; 5 words in → 20 paragraphs out = poor output
      • Add structure at both input and output stages
  • **Don't**
      • Add heavy abstractions (frameworks, libraries) between yourself and the model early on; you learn less and get trapped
      • Stick to one model provider
      • Expect massive output from minimal input
What LLMs Enable (4 Capability Classes)

Debugging Framework

  • Drop to prompt level; remove abstractions and work back up
  • Try a smarter model: if that fixes it, the problem is prompt/input
  • Try a dumber model: this reveals where reasoning is actually needed
  • Transform the input (verbosity, structure, chunking)
  • Add more structure to the output to expose where failures occur
  • Find what separates failing cases from working cases; that difference points to a prompt fix
  • Add more validation
  • If the system is doing more than one of the four capability classes, separate them into distinct prompts/models
  • Classify errors into: app-level orchestration, factuality, or instruction-following
  • You are almost always too verbose; cut prompts down, then cut again
  • Lower task complexity per prompt = easier to debug, easier to swap out broken pieces
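
One way to make the error-classification step concrete is to label each failing case by hand with one of the three root causes and count them before changing any prompt. The sketch below assumes a hypothetical list of `(case_id, kind)` pairs; the case names and labels are invented for illustration.

```python
from collections import Counter
from enum import Enum
from typing import List, Tuple


class ErrorKind(Enum):
    ORCHESTRATION = "app-level orchestration"
    FACTUALITY = "factuality"
    INSTRUCTION_FOLLOWING = "instruction-following"


def tally_errors(failures: List[Tuple[str, ErrorKind]]) -> Counter:
    """Count hand-labelled failures per root cause."""
    return Counter(kind for _case_id, kind in failures)


# Hypothetical labelled failures from a test run.
failures = [
    ("case-01", ErrorKind.INSTRUCTION_FOLLOWING),
    ("case-02", ErrorKind.FACTUALITY),
    ("case-03", ErrorKind.INSTRUCTION_FOLLOWING),
]

counts = tally_errors(failures)
dominant, n = counts.most_common(1)[0]
print(f"{dominant.value}: {n} of {len(failures)} failures")
```

If one category dominates, that suggests where to look first: orchestration code, the facts supplied in the input, or the wording of the instructions.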

Project Example: Transcript → Docs Tool

Cost & Performance Trajectory

  • Expect 10–50x cost reduction and speed increase in the near term
  • Drivers: hardware-level optimization (Nvidia), memory optimizations, and quantization; all incremental engineering, not research breakthroughs
  • Design systems for where costs and latency will be in 6 months, not today

Long-Context Windows (Q&A)

  • Attention cost grows quadratically with context length, so doubling the context requires roughly 4x the memory/compute
  • Practical long-context implementations "cheat": a pre-pass selects which tokens to attend to, so you don't truly get full-context reasoning
  • Still an open research problem with no known clean solution
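
The quadratic claim can be checked with simple arithmetic. The sketch below counts the entries in one full n × n attention score matrix (a single head in a single layer) for naive full attention; it is an illustration of the scaling only and says nothing about how any particular model or kernel is implemented.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Entries in one full n x n attention score matrix (one head, one layer)."""
    return context_tokens * context_tokens


baseline = attention_score_entries(8_000)
for n in (8_000, 16_000, 32_000):
    entries = attention_score_entries(n)
    print(f"{n:>6} tokens -> {entries:>13,} entries ({entries / baseline:.0f}x the 8k baseline)")
```

Each doubling of the context multiplies the count by four, which is why long-context systems look for ways to avoid attending to every token pair.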

Input Transformation (Q&A)

  • Use AI for complex transformations; use deterministic/structural methods where possible
  • Simple structural transformations (split by sentence, extract title/sections from markdown) are already valuable
  • Titles carry the most compressed information in a document, and the first section is usually the intro; both are free signals
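
The structural transformations mentioned above need no model at all. Here is a minimal sketch, assuming ATX-style markdown headings (`#`, `##`, ...) and a deliberately naive sentence splitter; `split_markdown` and `split_sentences` are hypothetical helper names, and real documents will need more careful handling (code fences, abbreviations, and so on).

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown(doc: str):
    """Return (title, sections); sections is a list of (heading, body) pairs.

    Text between the title and the first section heading is kept under
    the heading None, since it is usually the intro.
    """
    title = None
    sections = []
    heading, body = None, []
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        level, text = len(match.group(1)), match.group(2)
        if level == 1 and title is None:
            title = text  # first top-level heading is treated as the title
            continue
        text_so_far = "\n".join(body).strip()
        if heading is not None or text_so_far:
            sections.append((heading, text_so_far))
        heading, body = text, []
    text_so_far = "\n".join(body).strip()
    if heading is not None or text_so_far:
        sections.append((heading, text_so_far))
    return title, sections


def split_sentences(text: str):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

The title and the intro section can then be passed to the model as labelled fields rather than buried in one long string.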

Embeddings & Vector Search (Q&A)

  • Embedding models are small, with limited understanding of the underlying text; long-context embeddings in particular are unreliable
  • Cosine similarity is the only usable signal once embedded, and the model internals are opaque
  • **Correct pattern**: reduce search space first using structured search (BM25, keyword filters, metadata, LLM-based pre-filtering), *then* apply embeddings on the reduced set
  • Embeddings should be the **last step**, not the first
  • For ranking/highlighting within retrieved results (e.g., finding the most relevant sentence within a page), embeddings are appropriate
  • Increasingly viable alternative: use an LLM directly for retrieval tasks where embeddings were previously used
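
The "embeddings last" pattern can be sketched as a two-stage search: a cheap structural filter first, then cosine similarity over only the survivors. Everything here is illustrative: the document records, the `search` helper, and the two-dimensional vectors are hand-made stand-ins, not output from a real embedding model, and a plain keyword match stands in for BM25.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    docs: List[Dict],
    query_terms: List[str],
    query_vec: Sequence[float],
    metadata_filter: Callable[[Dict], bool],
    top_k: int = 3,
) -> List[Dict]:
    # Stage 1: shrink the search space with metadata and keyword filters.
    candidates = [
        d for d in docs
        if metadata_filter(d)
        and any(term.lower() in d["text"].lower() for term in query_terms)
    ]
    # Stage 2: rank only the reduced set by embedding similarity.
    ranked = sorted(candidates, key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return ranked[:top_k]


# Toy data with hand-made vectors (assumption: real vectors come from your embedding model).
docs = [
    {"id": "a", "year": 2024, "text": "Port fees for tankers", "vec": [1.0, 0.0]},
    {"id": "b", "year": 2024, "text": "Port schedule update", "vec": [0.6, 0.8]},
    {"id": "c", "year": 2019, "text": "Port fees archive", "vec": [1.0, 0.0]},
    {"id": "d", "year": 2024, "text": "Crew rota", "vec": [1.0, 0.0]},
]

results = search(docs, ["port"], [1.0, 0.1], lambda d: d["year"] >= 2023)
print([d["id"] for d in results])
```

Documents "c" and "d" have vectors identical to the best match, yet they never reach the similarity step because the metadata and keyword filters remove them first.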

Actionable Takeaways

  1. Spend the majority of your time in the **Chat** phase trying genuinely different approaches, not incrementally tweaking one prompt
  2. **Nest every prompt**: if a task can be broken into subtasks, it should be
  3. Always use **structured output** (type specs, JSON schemas); it reduces hallucinations and makes debugging easier
  4. **Transform your input** before passing to the model: extract structure, separate by sentence, label sections
  5. Use **multiple models** for different subtasks; don't assume one model is the ceiling
  6. When debugging, **classify the error** (app-level, factuality, instruction-following) before changing anything
  7. Design for **10–50x cheaper/faster** compute: build what becomes viable in 6 months, not just what's practical today
  8. For RAG/vector search: **reduce search space structurally first**, then apply embeddings only on the remainder

Quotes Worth Keeping

  • "If you're not going to accept a 700-line code file as good, you shouldn't accept a 100-line prompt as good."
  • "You might try something and genuinely be the first person in the world to have thought of that particular way of solving a problem with a model."
  • "If you present users with a text box they'll give you five words. If you ask them to press a button and talk, they'll give you 200 words."
  • "Embeddings should always be the last step in your pipeline. You should never be searching your full search space with embeddings."