LLMs: A Hackers Guide

Hrishi · 2026-05-22

TL;DR

A practitioner's guide to working with LLMs, drawn from real shipping-industry AI deployments. Covers the iterative workflow for building with LLMs, input/output best practices, debugging frameworks, and honest assessments of embeddings and long-context models.

Key Concepts

CPLN loop

Chat → Playground → Loop (add data/test cases) → Nest (break into subtasks) — the core iterative workflow for LLM development

Modalities

Text, vision, audio/speech, and code — all available and underused

Structured output

Using type specs (e.g., Zod, TypeScript) to constrain model output and reduce hallucinations

Nested prompts

Breaking large prompts into smaller, composable sub-tasks — analogous to refusing 700-line code files

Error taxonomy

Three root causes — app-level orchestration issues, factuality issues, instruction-following issues

Embeddings as fuzzy search

Best used at the end of a pipeline on an already-reduced search space, not as the primary retrieval mechanism

Notes

§The Iterative Loop (CPLN)

Chat (~50%+ of time): Explore freely and try radically different approaches; don't commit to a single prompt early
  Most teams stop at ~40% working and go to production too soon
  Unlike code, prompts should be rewritten many times before settling
Playground (~20%): Use any provider's playground; the key feature is retroactive editing of prompts and conversation history ("surfing the latent space")
Loop: Add more data and test cases; stress-test your hypothesis
Nest: Break prompts into smaller and smaller subtasks
  If you wouldn't accept a 700-line code file, don't accept a 100-line prompt
  Every prompt can be decomposed further
  Subtasks then re-enter the same loop when new problems or customers arrive
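
The Loop step can be sketched as a small harness that runs one prompt over a growing list of test cases and reports the pass rate. This is a minimal sketch, not tooling from the talk: `call_model` is a hypothetical stand-in for a real provider call, and `PromptCase`, `run_loop`, and the sample inputs are illustrative names and data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def call_model(prompt: str, input_text: str) -> str:
    """Hypothetical stand-in for a real LLM call; swap in a provider client."""
    # Toy behaviour so the sketch runs offline: return the first sentence.
    return input_text.split(".")[0].strip()


def run_loop(prompt: str, cases: list[PromptCase]) -> dict:
    """Run one prompt over every test case and collect the failures."""
    failures = []
    for case in cases:
        output = call_model(prompt, case.input_text)
        if not case.check(output):
            failures.append((case.name, output))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


cases = [
    PromptCase("short", "Ship arrives Monday. Crew change follows.", lambda o: "Monday" in o),
    PromptCase("no_period", "Ship arrives Tuesday", lambda o: "Tuesday" in o),
    PromptCase("second_sentence", "Hello. Ship arrives Friday.", lambda o: "Friday" in o),
]

report = run_loop("Summarise the arrival day.", cases)
print(report["pass_rate"], [name for name, _ in report["failures"]])
```

Each failing case is a candidate for the Nest step: if one class of input keeps failing, it may deserve its own smaller prompt and its own list of cases.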

§Input Modality Best Practices

Speech: Users give ~200 words when asked to speak vs. ~5 words in a text box — use it to gather richer context
Vision: Captures relationships that text cannot; useful as expensive-but-dense OCR; visual diagrams are often more token-efficient than their text equivalents
Code/structured input: Almost always prefer structured input and structured output
  Tools: TypeScript + Zod for type specs; SQL to express search logic (even if never executed)
  Structured output reduces hallucinations by constraining the token probability space
Lean on the model's training: Express problems as supersets of known languages (Python, TypeScript, English) rather than inventing custom DSLs
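
The notes name TypeScript + Zod for type specs. As a dependency-free illustration of the same idea, the sketch below uses a Python dataclass as the spec, renders it into prompt text, and rejects model output that does not match. The `PortCall` schema, its fields, and the helper names are assumptions made up for this example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PortCall:
    # Hypothetical schema, used only for illustration.
    vessel: str
    port: str
    eta_days: int


def schema_prompt(cls) -> str:
    """Render the type spec as text that can be pasted into a prompt."""
    lines = [f"  {f.name}: {f.type.__name__}" for f in fields(cls)]
    return "Reply with JSON matching:\n{\n" + "\n".join(lines) + "\n}"


def parse_port_call(raw: str) -> PortCall:
    """Parse model output and reject anything that does not match the spec."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name: f.type for f in fields(PortCall)}
    if set(data) != set(expected):
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ set(expected))}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"{name} should be {typ.__name__}")
    return PortCall(**data)


print(schema_prompt(PortCall))
good = parse_port_call('{"vessel": "Aurora", "port": "Singapore", "eta_days": 3}')
print(good)
try:
    parse_port_call('{"vessel": "Aurora", "port": "Singapore", "eta_days": "soon"}')
except ValueError as err:
    print("rejected:", err)
```

A rejected output is also a debugging signal: the field that fails validation shows where the model went wrong.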

§General Dos and Don'ts

Do:
Use all available modalities
Try multiple models; they have diverged significantly and have different "personalities"
Keep input/output size ratios roughly proportional; 5 words in → 20 paragraphs out = poor output
Add structure at both input and output stages
Don't:
Add heavy abstractions (frameworks, libraries) between yourself and the model early on; you learn less and get trapped
Stick to one model provider
Expect massive output from minimal input
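
Two of these points, trying several models and watching the input/output size ratio, can be checked with a few lines of plain code and no framework in between. This is a rough sketch: `model_a` and `model_b` are hypothetical stand-ins for calls to different providers, and the word-count ratio is just one possible proxy for "proportional".

```python
from typing import Callable


# Hypothetical stand-ins; in practice each would wrap a different provider's API.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return " ".join(prompt.split()[:3])


def compare_models(prompt: str, models: dict[str, Callable[[str], str]]) -> dict[str, dict]:
    """Send one prompt to every model and record the output plus a size ratio."""
    words_in = max(len(prompt.split()), 1)
    results = {}
    for name, fn in models.items():
        output = fn(prompt)
        results[name] = {
            "output": output,
            "out_in_ratio": len(output.split()) / words_in,
        }
    return results


results = compare_models(
    "List the ports on this route", {"model_a": model_a, "model_b": model_b}
)
for name, row in results.items():
    print(name, round(row["out_in_ratio"], 2), repr(row["output"]))
```

A very large ratio is a hint that the prompt asks for far more than the input can support.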

§What LLMs Enable (4 Capability Classes)

§Debugging Framework

Drop to prompt level; remove abstractions and work back up
Try a smarter model — if that fixes it, the problem is prompt/input
Try a dumber model — reveals where reasoning is actually needed
Transform the input (verbosity, structure, chunking)
Add more structure to the output to expose where failures occur
Find what separates failing cases from working cases — that difference points to a prompt fix
Add more validation
If the system is doing more than one of the four capability classes, separate them into distinct prompts/models
Classify errors into: app-level orchestration, factuality, or instruction-following
You are almost always too verbose — cut prompts down, then cut again
Lower task complexity per prompt = easier to debug, easier to swap out broken pieces
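
The step "find what separates failing cases from working cases" can be partly mechanised. The sketch below tags each input with a few cheap features and ranks them by how much more often they appear among failures. The features in `describe` and the sample data are assumptions chosen for illustration; a real project would use features relevant to its own inputs.

```python
from collections import Counter


def describe(input_text: str) -> set[str]:
    """Cheap, deterministic features of an input (illustrative choices)."""
    features = set()
    if len(input_text.split()) > 50:
        features.add("long")
    if any(ch.isdigit() for ch in input_text):
        features.add("has_digits")
    if "\n" in input_text:
        features.add("multiline")
    return features


def separating_features(results: list[tuple[str, bool]]) -> list[tuple[str, float]]:
    """Rank features by how much more often they appear in failing cases."""
    fail_counts, pass_counts = Counter(), Counter()
    n_fail = n_pass = 0
    for input_text, passed in results:
        feats = describe(input_text)
        if passed:
            n_pass += 1
            pass_counts.update(feats)
        else:
            n_fail += 1
            fail_counts.update(feats)
    gaps = []
    for feat in set(fail_counts) | set(pass_counts):
        fail_rate = fail_counts[feat] / n_fail if n_fail else 0.0
        pass_rate = pass_counts[feat] / n_pass if n_pass else 0.0
        gaps.append((feat, fail_rate - pass_rate))
    return sorted(gaps, key=lambda g: g[1], reverse=True)


# (input, passed) pairs collected from earlier runs; sample data only.
results = [
    ("ETA 3 days", False),
    ("Arrives in 12 hours\nBerth 4", False),
    ("Arrives soon", True),
    ("Delayed by weather\nNo new date", True),
]
print(separating_features(results))
```

In this sample, digits appear in every failing input and in no passing one, which suggests looking at how the prompt handles numbers.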

§Project Example: Transcript → Docs Tool

§Cost & Performance Trajectory

Expect 10–50x cost reduction and speed increase in the near term
Drivers: hardware-level optimization (Nvidia), memory optimizations, quantization — all incremental engineering, not research breakthroughs
Design systems for where costs and latency will be in 6 months, not today

§Long-Context Windows (Q&A)

Attention is quadratic in sequence length: doubling the context roughly quadruples attention compute (and memory, if the full score matrix is stored)
Practical long-context implementations "cheat": a pre-pass selects which tokens to attend to, so you don't truly get full-context reasoning
Still an open research problem with no known clean solution
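
The quadratic growth is easy to see with a little arithmetic. The sketch below assumes a naive implementation that stores one full n × n score matrix per head and layer at 2 bytes per value. Both assumptions are for illustration only. Implementations that avoid materialising the matrix use less memory, but the number of score computations still grows with n².

```python
def attention_score_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for one full n x n attention score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_value


for n in (8_192, 16_384, 32_768):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB")
```

Under these assumptions the three context sizes need 128, 512, and 2,048 MiB per head per layer: each doubling multiplies the figure by four.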

§Input Transformation (Q&A)

Use AI for complex transformations; use deterministic/structural methods where possible
Simple structural transformations (split by sentence, extract title/sections from markdown) are already valuable
Titles = highest compressed information in a document; first section = intro — these are free signals
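
The simple structural transformations mentioned above need no model at all. The sketch below pulls the title and sections out of a markdown document and splits a section into sentences with a deliberately naive regular expression. The function names and sample document are illustrative, and the sentence splitter will mis-split abbreviations such as "e.g.".

```python
import re


def split_markdown(doc: str) -> dict:
    """Extract the title and sections from a markdown document without a model."""
    title = None
    sections = []  # list of (heading, body) pairs
    heading, body = "intro", []
    for line in doc.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) == 1 and title is None:
            # The first level-1 heading is treated as the document title.
            title = match.group(2).strip()
        elif match:
            sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    # Drop empty sections (for example, nothing before the first heading).
    return {"title": title, "sections": [(h, b) for h, b in sections if b]}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


doc = """# Port Guide

Overview of the port. It has two berths.

## Access

Use gate 3. Bring ID!
"""
parsed = split_markdown(doc)
print(parsed["title"])
print(parsed["sections"])
print(split_sentences(parsed["sections"][0][1]))
```

The title and the first section can then be passed to the model as separate labelled fields rather than as one undifferentiated blob.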

§Embeddings & Vector Search (Q&A)

Embedding models are small — limited understanding of underlying text; long-context embeddings in particular are unreliable
Cosine similarity is the only usable signal once embedded, and the model internals are opaque
Correct pattern: reduce search space first using structured search (BM25, keyword filters, metadata, LLM-based pre-filtering), then apply embeddings on the reduced set
Embeddings should be the last step, not the first
For ranking/highlighting within retrieved results (e.g., finding the most relevant sentence within a page), embeddings are appropriate
Increasingly viable alternative: use an LLM directly for retrieval tasks where embeddings were previously used
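
The "reduce first, embed last" pattern can be sketched as a two-step search. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the structured step is a plain metadata and keyword filter rather than BM25, and the documents are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, docs: list[dict], required_tag: str, keyword: str, top_k: int = 2) -> list[str]:
    # Step 1: structured reduction using metadata and a keyword filter.
    candidates = [
        d for d in docs
        if required_tag in d["tags"] and keyword.lower() in d["text"].lower()
    ]
    # Step 2: fuzzy ranking with embeddings, only on the reduced set.
    q = toy_embed(query)
    ranked = sorted(candidates, key=lambda d: cosine(q, toy_embed(d["text"])), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


docs = [
    {"id": "a", "tags": {"ports"}, "text": "Berth schedule for Singapore port this week"},
    {"id": "b", "tags": {"ports"}, "text": "Singapore port closed for berth maintenance"},
    {"id": "c", "tags": {"crew"}, "text": "Singapore crew change rules"},
    {"id": "d", "tags": {"ports"}, "text": "Rotterdam berth schedule"},
]
print(search("berth schedule this week", docs, required_tag="ports", keyword="singapore"))
```

Documents "c" and "d" never reach the similarity step, so the fuzzy ranking only has to order candidates that already satisfy the hard constraints.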

Actionable Takeaways

1. Spend the majority of your time in the Chat phase trying genuinely different approaches, not incrementally tweaking one prompt
2. Nest every prompt: if a task can be broken into subtasks, it should be
3. Always use structured output (type specs, JSON schemas): it reduces hallucinations and makes debugging easier
4. Transform your input before passing to the model: extract structure, separate by sentence, label sections
5. Use multiple models for different subtasks; don't assume one model is the ceiling
6. When debugging, classify the error (app-level, factuality, instruction-following) before changing anything
7. Design for 10–50x cheaper/faster compute: build what becomes viable in 6 months, not just what's practical today
8. For RAG/vector search: reduce search space structurally first, then apply embeddings only on the remainder

Quotes Worth Keeping

“If you're not going to accept a 700-line code file as good, you shouldn't accept a 100-line prompt as good.”

“You might try something and genuinely be the first person in the world to have thought of that particular way of solving a problem with a model.”

“If you present users with a text box they'll give you five words. If you ask them to press a button and talk, they'll give you 200 words.”

“Embeddings should always be the last step in your pipeline. You should never be searching your full search space with embeddings.”
