<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[DataJourney: Generative AI]]></title><description><![CDATA[Lets deep dive into this new era of Generative AI]]></description><link>https://datajourney24.substack.com/s/generative-ai</link><image><url>https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png</url><title>DataJourney: Generative AI</title><link>https://datajourney24.substack.com/s/generative-ai</link></image><generator>Substack</generator><lastBuildDate>Sat, 02 May 2026 12:24:39 GMT</lastBuildDate><atom:link href="https://datajourney24.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Pooja Palod]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datajourney24@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datajourney24@substack.com]]></itunes:email><itunes:name><![CDATA[Pooja Palod]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pooja Palod]]></itunes:author><googleplay:owner><![CDATA[datajourney24@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datajourney24@substack.com]]></googleplay:email><googleplay:author><![CDATA[Pooja Palod]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Token Economics: Why LLM Cost Is an Architecture Problem, Not a Finance Problem]]></title><description><![CDATA[This is the second post in a long-form series on building production-grade GenAI systems.]]></description><link>https://datajourney24.substack.com/p/token-economics-why-llm-cost-is-an</link><guid isPermaLink="false">https://datajourney24.substack.com/p/token-economics-why-llm-cost-is-an</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sat, 25 Apr 2026 04:46:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the second post in a long-form series on building production-grade GenAI systems. The first post covers observability- why the standard monitoring playbook doesn't transfer to GenAI pipelines, and what you need to instrument across Cost, Quality, and Latency before any of the architecture decisions in this series become actionable. This post goes deep on the first pillar: Token Economics, and why LLM cost is an architecture problem, not a finance one.</p><p>Most teams discover they have a token economics problem the same way they discover they have a technical debt problem gradually, then all at once.</p><p>The AWS bill climbs. Someone schedules a cost review. A few prompts get trimmed. The bill drops slightly, then climbs again. 
The cycle repeats until the system is either unprofitable at scale or someone decides to treat cost as an engineering constraint rather than a line item to manage after the fact.</p><p>This post is about building systems where that cycle never starts where cost is instrumented, controlled, and architecturally contained from the beginning. It&#8217;s the second post in a series on production GenAI systems. If you haven&#8217;t read the observability post, the instrumentation concepts here build on that foundation.</p><div><hr></div><h3>Why Token Economics Is Different From Traditional Infrastructure Cost</h3><p>In traditional software, cost scales with compute and storage. Both are relatively predictable, both respond well to standard optimization patterns, and both have decades of tooling built around them.</p><p>Token costs are different in three important ways.</p><p><strong>They scale with behavior, not just traffic.</strong> A user who asks a simple question costs a fraction of what a user who triggers a multi-step agent workflow costs. Traffic volume is only half the story, the nature of the requests matters as much as the number of them. A system that looks economical at 10 users can become expensive at 1,000 not because traffic increased 100x but because usage patterns shifted.</p><p><strong>They&#8217;re invisible without deliberate instrumentation.</strong> A slow database query shows up in your APM. A prompt that&#8217;s quietly grown to 8,000 tokens because someone kept patching in edge cases doesn&#8217;t at least not until it shows up in your monthly bill with no clear attribution.</p><p><strong>They compound across the pipeline.</strong> In a RAG system, you&#8217;re paying for embedding generation, retrieval, context assembly, and inference often across multiple model calls. Each step has its own token footprint, and inefficiencies at any stage compound into the final cost. Most cost optimization work focuses on the inference call and ignores everything upstream.</p><p>Understanding these three dynamics is the prerequisite for building systems that control cost effectively.</p><div><hr></div><h3>The Metric That Actually Matters: Cost Per Successful Task</h3><p>Token count is a useful operational metric. It&#8217;s not the right lens for understanding whether your system is economically sound.</p><p>The metric that matters is <strong>cost per successful task</strong> - what does it actually cost to deliver a correct, complete response for a given task type? This number tells you things that aggregate token counts never will:</p><ul><li><p>Whether your caching layer is working (cost per task should drop as cache hit rate rises)</p></li><li><p>Whether model routing is calibrated correctly (cost per task for simple requests should be significantly lower than for complex ones)</p></li><li><p>Whether quality and cost are moving in opposite directions (a cost optimization that degrades task success rate isn&#8217;t an optimization)</p></li><li><p>Whether your system is economically viable at your target scale (project cost per task against expected volume and you have a unit economics model)</p></li></ul><p>Getting to cost per successful task requires two things: per-request cost attribution and a definition of &#8220;successful&#8221; that your system can evaluate automatically. The first is an instrumentation problem. 
The second is an evaluation problem which is why cost and quality observability have to be built together, not separately.</p><div><hr></div><h3>The Three Architectural Levers</h3><h4>1. Semantic Caching</h4><p>The highest-leverage cost optimization in most production GenAI systems isn&#8217;t prompt compression or model selection  it&#8217;s not calling the LLM at all.</p><p>Semantic caching works by storing responses against vector representations of queries, then retrieving cached responses when a new query is sufficiently similar to one that&#8217;s already been answered. The threshold for &#8220;sufficiently similar&#8221; is configurable typically a cosine similarity score above 0.92-0.95 depending on how much variance you can tolerate in responses.</p><p>In systems with high query repetition customer support, internal knowledge bases, FAQ-style interfaces cache hit rates of 30-50% are achievable. At those rates, the cost reduction is substantial and the latency improvement is dramatic: a cache hit returns in milliseconds rather than seconds.</p><p>The implementation requires a vector database for similarity search and a fast key-value store (Redis is the standard choice) for response retrieval. The operational complexity is real you need cache invalidation logic, staleness handling, and monitoring for cache hit rates by query type. But for most high-volume systems the ROI justifies it quickly.</p><p>Where semantic caching breaks down: low-repetition query patterns, high variance tolerance requirements, and use cases where response freshness is critical. Don&#8217;t implement it uniformly instrument your query distribution first and apply caching selectively to the query types where repetition is actually high.</p><h4>2. Model Routing</h4><p>Not every request in your system requires the same model. This sounds obvious. Most production systems ignore it anyway defaulting to a single frontier model for everything because it&#8217;s simpler to implement and the cost problem isn&#8217;t yet acute enough to justify the routing infrastructure.</p><p>By the time the cost problem is acute, you&#8217;re refactoring a system that was never designed for routing. Build it in early.</p><p>A practical routing architecture has two tiers at minimum:</p><p><strong>Tier 1: Lightweight models for deterministic tasks</strong> - formatting, classification, extraction, summarization, structured output generation. These tasks don&#8217;t require deep reasoning. A $0.15/1M token model handles them as well as a $15/1M frontier model in most cases. The cost difference is 100x. Routing 60-70% of your requests to Tier 1 based on task type reduces your blended inference cost dramatically.</p><p><strong>Tier 2: Frontier models for complex reasoning</strong> - multi-step reasoning, ambiguous queries, tasks that require broad world knowledge or nuanced judgment. This is where frontier model capability actually matters. Reserve it for the requests that need it.</p><p>The routing layer itself can be a lightweight classifier - a small model or even a rules-based system that categorizes incoming requests by task type and routes accordingly. The classifier&#8217;s cost is negligible relative to the savings from routing correctly.</p><p>The failure mode to watch for: routing based on request complexity signals that are easy to game or misread. A short query isn&#8217;t necessarily a simple one. 
Build in a fallback path that escalates to Tier 2 when Tier 1 responses fall below a quality threshold and instrument escalation rates so you can tune the routing logic over time.</p><h4>3. Context Pruning</h4><p>Token bloat is the cost problem that accumulates invisibly. It doesn&#8217;t cause errors. It doesn&#8217;t trigger alerts. It just makes every request progressively more expensive and slower as the system matures.</p><p>The most common sources:</p><p><strong>Unbounded chat history</strong> - systems that pass the full conversation history to the model on every turn. At turn 3 this is fine. At turn 30, you&#8217;re sending thousands of tokens of context for a request that might need two turns of history at most. Summarize older history, prune beyond a rolling window, and track average context length per session as an operational metric.</p><p><strong>Oversized RAG retrieval</strong> - retrieving more chunks than the model can usefully attend to. Most RAG systems retrieve 5-10 chunks by default. In practice, well-ranked retrieval with 3-4 highly relevant chunks outperforms poorly-ranked retrieval with 10 chunks &#8212; and costs significantly less. Measure chunk utilization: if the model is consistently ignoring the bottom half of your retrieved context, you&#8217;re retrieving too much.</p><p><strong>Prompt template bloat</strong> - system prompts and few-shot examples that have grown over time as edge cases got patched in. Audit your prompt templates periodically. Every sentence that&#8217;s in there to handle a rare edge case is a tax on every request. Consider whether those edge cases are better handled in post-processing than in the prompt.</p><p><strong>Redundant tool definitions</strong> - in agent systems, passing the full tool schema for every available tool on every request. Pass only the tools relevant to the current task type. The token cost of unused tool definitions adds up faster than most teams expect.</p><p>Context pruning isn&#8217;t a one-time optimization &#8212; it&#8217;s an ongoing practice. Instrument context length by pipeline stage and task type, set alerts for context length growth, and treat prompt bloat as technical debt that gets addressed on a regular cadence.</p><div><hr></div><h3>Building a Cost-Aware Inference Path</h3><p>The three levers above work best when they&#8217;re integrated into a coherent inference path rather than implemented as independent optimizations. Here&#8217;s what that looks like in practice:</p><p><strong>Request intake</strong> - classify the incoming request by task type. This classification drives routing, caching lookup, and context assembly decisions downstream.</p><p><strong>Cache check</strong> - before any model call, check semantic cache. On a hit, return the cached response and log the cache hit with task type attribution. On a miss, proceed.</p><p><strong>Context assembly</strong> - assemble context with pruning applied: rolling history window, relevance-ranked RAG with chunk count capped, prompt template audit. Log assembled context length.</p><p><strong>Model routing</strong> - route to Tier 1 or Tier 2 based on task type classification. Log the routing decision.</p><p><strong>Inference</strong> &#8212; make the model call. Log token counts (input and output separately), model used, and latency.</p><p><strong>Quality check</strong> - run a lightweight quality signal on the response (format validation, output scoring for task-critical requests). 
Log pass/fail.</p><p><strong>Cost attribution</strong> - compute request cost from token counts and model pricing. Attribute to task type. Update cost per successful task metrics.</p><p>This path adds minimal latency overhead when implemented correctly  cache checks and context pruning are fast, routing classification is cheap, and cost attribution is a simple calculation. The instrumentation overhead is real but small relative to the cost visibility it provides.</p><div><hr></div><h3>What Good Looks Like at Scale</h3><p>A production system with mature token economics has a few properties that distinguish it from one that&#8217;s just been optimized ad hoc:</p><p><strong>Cost per successful task is stable or declining as volume grows.</strong> Caching effects improve with scale, routing gets better calibrated, and context pruning compounds. If cost per task is rising with volume, the architecture is failing.</p><p><strong>Cost is attributable by task type, pipeline stage, and time period.</strong> When the bill goes up, you can identify the cause in minutes rather than hours. You know which task type is responsible, which stage in the pipeline the cost is coming from, and when it started.</p><p><strong>Cost and quality move together, not in opposite directions.</strong> Optimizations that reduce cost while maintaining or improving task success rates are the goal. Cost reductions that degrade quality are false savings they show up in churn and support costs instead.</p><p><strong>The system degrades gracefully under cost pressure.</strong> When token budgets are constrained, the system routes more aggressively to lighter models, retrieves fewer chunks, and summarizes more aggressively rather than failing or producing expensive low-quality responses.</p><div><hr></div><h3>The Underlying Principle</h3><p>Token economics is ultimately about building systems where cost is a first-class engineering constraint rather than an afterthought. That means instrumenting it at the right granularity, designing the inference path with cost control built in, and treating cost per successful task as a metric that matters as much as latency or quality.</p><p>The teams that get this right don&#8217;t spend less time thinking about cost they spend less time being surprised by it.</p><div><hr></div><p><em>Next in the series: Evaluation -why quality instrumentation in GenAI is a system design problem, and how to build eval pipelines that catch degradation before your users do.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[You Can’t Debug What You Can’t See: Observability for Production GenAI Systems]]></title><description><![CDATA[Part 1 of a 4-part series on production GenAI systems covering Observability, Token Economics, Evaluation, and Latency & Reliability.]]></description><link>https://datajourney24.substack.com/p/you-cant-debug-what-you-cant-see</link><guid isPermaLink="false">https://datajourney24.substack.com/p/you-cant-debug-what-you-cant-see</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Tue, 14 Apr 2026 17:37:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Part 1 of a 4-part series on production GenAI systems covering Observability, Token Economics, Evaluation, and Latency &amp; Reliability.</em></p><p><em>8 min read</em></p><div><hr></div><p>Production GenAI systems fail in ways that are hard to see coming. Not because the models are bad but because the infrastructure around them isn&#8217;t built to surface the right signals. This is the first post in a long-form series on building production-grade GenAI systems: the architecture decisions, instrumentation practices, and failure patterns that separate demos from systems that hold up at scale. We&#8217;ll go deep on Token Economics, Evaluation, and Latency &amp; Reliability in the posts that follow. But observability comes first because without it, none of the rest is actionable.</p><p>Most GenAI systems are flying blind.</p><p>Not because engineers don&#8217;t care about visibility  but because the observability playbook from traditional software doesn&#8217;t transfer cleanly. You can&#8217;t just drop Datadog on an LLM pipeline and call it done. The failure modes are different, the signals are different, and the thing you&#8217;re actually trying to understand model behavior &#8212; doesn&#8217;t fit neatly into metrics, logs, or traces.</p><p>This is the gap between teams that catch problems early and teams that find out from users.</p><div><hr></div><h2>Monitoring vs. Observability: Why GenAI Needs Both</h2><p>In traditional systems, monitoring tells you something is wrong. Observability tells you why.</p><p>In GenAI systems, that distinction matters more than anywhere else &#8212; because the failure modes are probabilistic, not deterministic. A service going down is binary. A model that&#8217;s gradually drifting toward lower-quality outputs, or a retrieval pipeline that&#8217;s quietly returning less relevant chunks, isn&#8217;t. Those failures are invisible to standard monitoring until they&#8217;ve already done damage.</p><p>Monitoring covers the signals you already know to watch: latency, error rates, token usage, API availability. These are necessary but not sufficient. 
They&#8217;ll tell you when something is obviously broken.</p><p>Observability covers the harder question: <em>why is my system behaving this way?</em> That requires capturing enough context at each step of your pipeline inputs, outputs, intermediate states, model decisions &#8212; that you can reconstruct what happened after the fact. Not just that a request failed, but what the model received, what it returned, and where in the chain things went wrong.</p><p>The teams that get this right treat their GenAI pipeline the same way a good backend engineer treats a distributed system: every hop is a potential failure point, and every failure point needs a trace.</p><div><hr></div><h2>The Three Pillars and What Observability Looks Like for Each</h2><p>The rest of this series goes deep on Cost, Quality, and Latency individually. But observability cuts across all three and each pillar has a distinct instrumentation problem worth understanding before you get into the architecture details.</p><h3>Pillar 1: Cost (Token Economics)</h3><p>Token costs are easy to monitor in aggregate. They&#8217;re hard to observe at the request level  which is where the real problems live.</p><p>Aggregate cost metrics tell you your bill is going up. They don&#8217;t tell you which pipeline stage is responsible, which task type is burning disproportionate tokens, or whether your caching layer is actually working. For that you need per-request instrumentation: token counts broken down by input and output, cost attributed by task type, cache hit and miss rates tracked explicitly.</p><p>The failure mode to watch for: token bloat that accumulates invisibly. Chat histories that grow unchecked, RAG pipelines that retrieve far more context than the model uses, prompt templates that balloon over time as edge cases get patched in. None of these show up as errors. They show up as a cost curve that keeps climbing without a clear cause.</p><p>Good cost observability means you can answer: what did this specific request cost, why, and which part of the pipeline was responsible?</p><h3>Pillar 2: Quality (Evaluation)</h3><p>Quality is the hardest pillar to instrument because there&#8217;s no ground truth signal that arrives in real time. A slow response is immediately measurable. A response that&#8217;s subtly wrong, unhelpful, or drifting from your intended behavior isn&#8217;t at least not without deliberate instrumentation.</p><p>This is why quality observability has to be designed in, not bolted on. The core requirement: capture enough of what happened at inference time that you can evaluate it later. The full prompt, the retrieved context, the model output, and any user feedback signals that come back. Without that, you&#8217;re evaluating samples in a vacuum rather than understanding your system&#8217;s actual behavior in production.</p><p>Beyond capture, you need a lightweight async evaluation layer running against sampled live traffic an LLM judge scoring responses on relevance, accuracy, and task completion, with results feeding into a quality trend dashboard. 
Not real-time, not every request, but consistent enough that you&#8217;d catch a drift in quality scores over days, not weeks.</p><p>The failure mode to watch for: quality that degrades gradually across a model update, a retrieval index refresh, or a prompt change none of which trigger an alert in a standard monitoring setup.</p><p>Good quality observability means you can answer: is my system&#8217;s output quality stable over time, and if it changed, what changed first?</p><h3>Pillar 3: Latency &amp; Reliability</h3><p>Latency is the most instrumented of the three pillars in most systemsand still frequently misread. The common mistake is treating it as a single number when it&#8217;s actually a profile across pipeline stages, request types, and load levels.</p><p>A RAG pipeline, a multi-step agent, and a simple classification call have completely different latency characteristics. Averaging them together hides the outliers. And in GenAI systems, the outliers are usually where the interesting failures live a retrieval call that&#8217;s occasionally timing out, an LLM call that spikes under concurrent load, a post-processing step that quietly adds 800ms to certain request types.</p><p>The signals that matter most: TTFT (time to first token) for streaming systems, end-to-end latency broken down by pipeline stage and task type, P95 and P99 rather than averages, and retry and fallback rates tracked explicitly. Silent retries are one of the most common sources of unexpected latency spikes if your system is retrying failed LLM calls without surfacing that to your observability layer, you&#8217;re flying blind on a significant failure mode.</p><p>The failure mode to watch for: latency that looks acceptable in averages but has a long tail that&#8217;s quietly degrading user experience &#8212; and retry behavior that&#8217;s masking upstream reliability problems.</p><p>Good latency observability means you can answer: where in my pipeline is time being spent, and is my system degrading gracefully or failing silently under load?</p><div><hr></div><h2>Where Observability Breaks Down in Practice</h2><p>Even teams that build good observability infrastructure run into the same problems. Worth naming them directly:</p><p><strong>Volume vs. depth tradeoff</strong> - you can&#8217;t store full prompt/response pairs for every request at scale. Use tiered logging: full capture for errors and edge cases, sampled capture for normal traffic, aggregate metrics for everything else.</p><p><strong>LLM judge drift</strong> - if you&#8217;re using an LLM to evaluate your LLM&#8217;s outputs, your judge model can drift too. Calibrate it periodically against human review. A small weekly sample is enough to catch systematic bias before it corrupts your quality metrics.</p><p><strong>Instrumentation latency overhead</strong> - adding tracing to every pipeline step adds latency. In streaming systems this is especially sensitive. Instrument asynchronously where possible and be deliberate about what runs in the hot path.</p><p><strong>Correlation without causation</strong> - observability gives you data, not answers. A spike in latency correlated with a quality score drop doesn&#8217;t tell you which caused which. Build dashboards that surface hypotheses, not conclusions.</p><div><hr></div><h2>What a Minimal Viable Observability Stack Looks Like</h2><p>You don&#8217;t need to instrument everything on day one:</p><p><strong>Tracing</strong> - OpenTelemetry with your existing APM (Datadog, Honeycomb, Grafana). 
Instrument pipeline boundaries first: retrieval in/out, LLM in/out.</p><p><strong>Logging</strong> -Structured logs with trace IDs for every request. Full prompt/response capture for errors, 10-20% sample for normal traffic.</p><p><strong>Cost monitoring</strong> -Per-request token tracking with task-type attribution. Cache hit/miss rates tracked explicitly.</p><p><strong>Quality monitoring</strong> - Async LLM-as-judge eval on sampled live traffic. Quality score trend over time, not just point-in-time snapshots.</p><p><strong>Latency monitoring</strong> - P95/P99 by pipeline stage and task type. TTFT tracked separately from end-to-end latency. Retry and fallback rates surfaced explicitly.</p><p><strong>Alerting</strong> - Hard failures (error spikes, latency P95 breaches) in real time. Soft failures (quality drift, cost curve changes) on a daily digest.</p><div><hr></div><h2>The Underlying Principle</h2><p>Traditional software observability is about understanding system state. GenAI observability is about understanding system <em>behavior</em> which is harder, more ambiguous, and more consequential.</p><p>The teams building reliable GenAI systems aren&#8217;t the ones with the best models. They&#8217;re the ones who&#8217;ve built enough visibility into their pipelines that they can tell the difference between a model problem, a retrieval problem, a prompt problem, and a data problem and fix the right thing.</p><p>Instrumentation isn&#8217;t glamorous. But it&#8217;s the difference between a system you operate and a system that operates you.</p><div><hr></div><p><em>Next up: Token Economics  why LLM cost isn&#8217;t a finance problem, it&#8217;s an architecture problem, and how to build inference paths that don&#8217;t bleed margin at scale.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[From LLMs to Products: Alignment & Production]]></title><description><![CDATA[How GPT-3 became ChatGPT and how to deploy your own]]></description><link>https://datajourney24.substack.com/p/from-llms-to-products-alignment-and</link><guid isPermaLink="false">https://datajourney24.substack.com/p/from-llms-to-products-alignment-and</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sat, 27 Dec 2025 12:56:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Sfr_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Series Navigation:</strong></p><ul><li><p><a href="https://datajourney24.substack.com/p/the-need-for-transformers?r=25b2f4">Post 1: The Need for Transformers</a></p></li><li><p><a href="https://datajourney24.substack.com/p/inside-the-transformer-attention?r=25b2f4">Post 2: Inside the Transformer</a></p></li><li><p><a href="https://datajourney24.substack.com/p/scaling-to-llms-why-bigger-models?r=25b2f4">Post 3: Scaling to LLMs</a></p></li><li><p><strong>Post 4: From LLMs to Products</strong> &#8592; You are here</p></li></ul><div><hr></div><h3>What We&#8217;ll Cover</h3><p>You&#8217;ve learned how to build a massive LLM, but the real challenge is making it truly useful and reliable in real-world applications.</p><p>Base models like GPT-3 are impressive, yet they have limitations:</p><ul><li><p>Completes text but often ignores explicit instructions</p></li><li><p>Can produce toxic or harmful content</p></li><li><p>Hallucinates facts with confidence</p></li><li><p>Expensive to run at scale (initial ChatGPT ~$700K/day)</p></li></ul><p>This article walks through the journey of transforming a base LLM into a <strong>production-ready system</strong>, covering both alignment and deployment.</p><h4><strong>Part 1: Alignment - Making Models Helpful</strong></h4><ul><li><p>Instruction tuning (teaching models to follow instructions)</p></li><li><p>RLHF (Reinforcement Learning from Human Feedback)</p></li><li><p>Constitutional AI (Anthropic&#8217;s approach)</p></li><li><p>Safety and guardrails</p></li></ul><h4><strong>Part 2: Production - Deploying at Scale</strong></h4><ul><li><p>Inference optimization (quantization, KV cache)</p></li><li><p>RAG (Retrieval-Augmented Generation)</p></li><li><p>Prompt engineering patterns</p></li><li><p>Real-world architectures</p></li><li><p>Cost analysis and optimization</p></li></ul><p><strong>By the end, you&#8217;ll understand:</strong></p><ul><li><p>How ChatGPT was created from GPT-3.5</p></li><li><p>The three-phase RLHF process</p></li><li><p>When to use RAG vs fine-tuning</p></li><li><p>How to deploy LLMs cost-effectively</p></li><li><p>Production architecture patterns</p></li></ul><p>Let&#8217;s bridge the gap from research to reality.</p><div><hr></div><h3>Part 1: Alignment - Making Models Helpful</h3><h4>1. 
The Base Model Problem</h4><p><strong>1.1 What&#8217;s Wrong with Base Models?</strong></p><p><strong>You:</strong> &#8220;Write a Python function to sort a list&#8221;</p><p><strong>Base GPT-3:</strong></p><pre><code><code>of numbers in ascending order. The function should use the bubble sort algorithm.

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] &gt; arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
</code></code></pre><p><strong>Observation:</strong> The model continues the text instead of directly following the instruction.</p><p><strong>Reason:</strong> Base LLMs are trained on <strong>next-token prediction</strong> from internet text. They excel at continuation, not instruction execution.</p><h4><strong>1.2 The Four Key Problems</strong></h4><p><strong>1. Instruction Following</strong></p><ul><li><p>Doesn&#8217;t distinguish between &#8220;write code&#8221; vs &#8220;explain code&#8221;</p></li><li><p>Completes text instead of executing commands</p></li></ul><p><strong>2. Harmful Content</strong></p><ul><li><p>No concept of &#8220;should I say this?&#8221;</p></li><li><p>Can generate hate speech, violence, illegal content</p></li></ul><p><strong>3. Hallucinations</strong></p><ul><li><p>Makes up facts confidently</p></li><li><p>No &#8220;I don&#8217;t know&#8221; response</p></li></ul><p><strong>4. Inconsistency</strong></p><ul><li><p>Same question &#8594; different answers</p></li><li><p>No clear &#8220;personality&#8221; or behavior</p></li></ul><p><strong>Solution:</strong> Alignment techniques that teach models to be helpful, harmless, and honest.</p><div><hr></div><h4>2. Instruction Tuning: The First Step</h4><p><strong>2.1 What Is Instruction Tuning?</strong></p><p><strong>Simple idea:</strong> Fine-tune the base model on examples of instructions + desired responses.</p><p><strong>Training data format:</strong></p><pre><code><code>Instruction: Translate "Hello" to French
Response: Bonjour

Instruction: Explain photosynthesis to a 10-year-old
Response: Photosynthesis is how plants make their own food using sunlight...

Instruction: Write a haiku about coding
Response: Fingers on keyboard
Logic flows through lines of code
Bug-free poetry
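</code></code></pre><p>A minimal sketch of how pairs like these are packed into a fine-tuning file (one JSON object per line; the field names and prompt template below are illustrative, not any specific vendor&#8217;s format):</p><pre><code><code>import json

# Illustrative examples in the instruction/response format shown above
examples = [
    {"instruction": "Translate \"Hello\" to French", "response": "Bonjour"},
    {"instruction": "Write a haiku about coding",
     "response": "Fingers on keyboard\nLogic flows through lines of code\nBug-free poetry"},
]

# Pack each pair into a single training string and write one JSON object per line
with open("instruction_tuning.jsonl", "w") as f:
    for ex in examples:
        text = f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"
        f.write(json.dumps({"text": text}) + "\n")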
</code></code></pre><p><strong>2.2 Key Datasets</strong></p><p><strong>FLAN (Google, 2021)</strong></p><ul><li><p>Fine-tuned Language Net</p></li><li><p>60+ NLP tasks reformulated as instructions</p></li><li><p>T5 model &#8594; FLAN-T5</p></li></ul><p><strong>T0 (BigScience, 2021)</strong></p><ul><li><p>Multi-task prompted training</p></li><li><p>Diverse prompt templates per task</p></li></ul><p><strong>Alpaca (Stanford, 2023)</strong></p><ul><li><p>52K instruction-following examples</p></li><li><p>Generated using GPT-3.5</p></li><li><p>Open-source alternative</p></li></ul><p><strong>Dolly (Databricks, 2023)</strong></p><ul><li><p>15K human-generated examples</p></li><li><p>Fully open, commercial-friendly</p></li></ul><p><strong>2.3 What Changes?</strong></p><p><strong>Before instruction tuning (Base GPT-3):</strong></p><pre><code><code>Prompt: Summarize this article in 3 sentences:
[article text]

Output: The article discusses... [continues for 20 sentences]
</code></code></pre><p><strong>After instruction tuning:</strong></p><pre><code><code>Prompt: Summarize this article in 3 sentences:
[article text]

Output: [Exactly 3 sentence summary]
</code></code></pre><p><strong>The model learned:</strong></p><ul><li><p>Instructions are commands, not text to continue</p></li><li><p>Format matters (bullet points when asked, code blocks for code)</p></li><li><p>Task boundaries (stop when done)</p></li></ul><p><strong>2.4 Limitations</strong></p><p>Instruction tuning helps, but:</p><ul><li><p>Still generates harmful content if instructed</p></li><li><p>Still hallucinates</p></li><li><p>No nuanced understanding of &#8220;helpful&#8221;</p></li><li><p>Can&#8217;t handle conflicting instructions well</p></li></ul><p><strong>We need something more sophisticated: RLHF.</strong></p><div><hr></div><h4>3. RLHF: The ChatGPT Secret</h4><p><strong>3.1 What Is RLHF?</strong></p><p><strong>Reinforcement Learning from Human Feedback</strong></p><p>The technique that transformed GPT-3.5 into ChatGPT.</p><p><strong>Core insight:</strong></p><blockquote><p>&#8220;We can&#8217;t write down all the rules for being helpful. But we can show examples and let humans rank outputs.&#8221;</p></blockquote><p><strong>3.2 The Three-Phase Process</strong></p><p><strong>Phase 1: Supervised Fine-Tuning (SFT)</strong></p><p><strong>Goal:</strong> Create initial instruction-following model</p><p><strong>How:</strong></p><ol><li><p>Hire human labelers (contractors, often)</p></li><li><p>Give them prompts: &#8220;Explain quantum computing&#8221;</p></li><li><p>They write high-quality responses</p></li><li><p>Fine-tune base model on these examples</p></li></ol><p><strong>Dataset size:</strong> 10K-100K high-quality examples</p><p><strong>Output:</strong> SFT model (decent, but not great)</p><div><hr></div><p><strong>Phase 2: Reward Model Training</strong></p><p><strong>Goal:</strong> Train a model to score responses (good vs bad)</p><p><strong>How:</strong></p><ol><li><p>Take same prompts</p></li><li><p>Generate 4-9 responses using SFT model</p></li><li><p>Humans rank them: Best &#8594; Worst</p></li><li><p>Train a <strong>reward model</strong> (RM) to predict these rankings</p></li></ol><p><strong>Example:</strong></p><pre><code><code>Prompt: "How do I make pizza?"

Response A: "Mix flour, water, yeast. Let rise. Add toppings. Bake at 450&#176;F."
Response B: "Pizza is made from dough, sauce, and cheese."
Response C: "Use a microwave and frozen pizza."
Response D: [Generates pizza-related joke instead]

Human ranking: A &gt; C &gt; B &gt; D

Reward model learns: A gets score 0.9, B gets 0.4, etc.
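</code></code></pre><p>Under the hood, the reward model is usually trained with a pairwise ranking loss over these preferences. A PyTorch-style sketch (reward_model here is a hypothetical scalar-scoring model; the objective is the standard Bradley-Terry-style ranking loss):</p><pre><code><code>import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_batch, rejected_batch):
    # reward_model maps each (prompt + response) example to a single scalar score
    r_chosen = reward_model(chosen_batch)      # shape: [batch]
    r_rejected = reward_model(rejected_batch)  # shape: [batch]
    # Push preferred responses to score higher than rejected ones:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(r_chosen - r_rejected).mean()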
</code></code></pre><p><strong>Key insight:</strong> The RM learns <em>human preferences</em> without humans needing to articulate rules.</p><div><hr></div><p><strong>Phase 3: Reinforcement Learning (PPO)</strong></p><p><strong>Goal:</strong> Optimize the model to maximize reward</p><p><strong>How:</strong></p><ol><li><p>Start with SFT model</p></li><li><p>Generate responses to prompts</p></li><li><p>Score them with reward model</p></li><li><p>Use PPO (Proximal Policy Optimization) to update model</p></li><li><p>Repeat for thousands of iterations</p></li></ol><p><strong>The update rule (simplified):</strong></p><pre><code><code>If reward model scores output highly &#8594; reinforce this behavior
If reward model scores output poorly &#8594; discourage this behavior
</code></code></pre><p><strong>Critical detail: KL penalty</strong></p><p>Without constraint, the model could &#8220;hack&#8221; the reward model by generating nonsense that scores high.</p><p><strong>Solution:</strong> Add penalty for diverging too much from the SFT model:</p><pre><code><code>Total reward = RM_score - &#946; * KL_divergence(new_policy, SFT_policy)
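</code></code></pre><p>In code, the per-sample reward used during PPO looks roughly like this (a sketch; the per-token log-probabilities would come from the current policy and the frozen SFT model, and the beta value is illustrative):</p><pre><code><code>def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    # Estimate KL(new_policy || SFT_policy) from per-token log-probs of the sampled response
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl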
</code></code></pre><p>This keeps the model grounded while improving.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sfr_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sfr_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 424w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 848w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sfr_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png" width="1456" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RLHF: Reinforcement Learning from Human Feedback&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RLHF: Reinforcement Learning from Human Feedback" title="RLHF: Reinforcement Learning from Human Feedback" srcset="https://substackcdn.com/image/fetch/$s_!Sfr_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 424w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 848w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!Sfr_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c7a334b-3420-4b89-a7ad-fef6fbc49a52_1952x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">RLHF..</figcaption></figure></div><div><hr></div><p><strong>3.3 What RLHF Actually Does</strong></p><p><strong>Before RLHF (Base GPT-3.5):</strong></p><ul><li><p>Can do tasks, but needs perfect prompts</p></li><li><p>Sometimes verbose, sometimes terse</p></li><li><p>No consistent &#8220;personality&#8221;</p></li><li><p>Will do harmful things if asked</p></li></ul><p><strong>After RLHF (ChatGPT):</strong></p><ul><li><p>Follows instructions naturally</p></li><li><p>Consistent helpfulness</p></li><li><p>Refuses harmful requests</p></li><li><p>Admits uncertainty (&#8221;I don&#8217;t know&#8221;)</p></li><li><p>Stays on-task</p></li></ul><p><strong>The magic:</strong> RLHF taught <strong>alignment</strong> the model&#8217;s goals align with user intent and safety.</p><p><strong>3.4 Challenges with RLHF</strong></p><p><strong>1. Reward Hacking</strong> Model finds shortcuts to maximize reward that aren&#8217;t actually better outputs.</p><p><strong>Example:</strong> Model learns to be overly apologetic (&#8221;I&#8217;m sorry, but...&#8221;) because humans rated polite responses higher.</p><p><strong>2. Reward Model Limitations</strong> RM is trained on limited data. It&#8217;s not perfect. Model can exploit its blind spots.</p><p><strong>3. Distribution Shift</strong> As the model improves, it generates outputs unlike anything in training. RM becomes unreliable.</p><p><strong>4. Expensive</strong></p><ul><li><p>Requires thousands of human ratings</p></li><li><p>Multiple training phases</p></li><li><p>Iterative process (PPO is slow)</p></li></ul><p><strong>5. Difficult to Control</strong> Hard to specify exactly what you want. &#8220;Be helpful&#8221; is vague.</p><div><hr></div><h4>4. 
Constitutional AI: Anthropic&#8217;s Approach</h4><p><strong>4.1 The Problem with RLHF</strong></p><p>RLHF requires massive human feedback at scale.</p><p><strong>Anthropic&#8217;s question:</strong></p><blockquote><p>&#8220;Can we use AI to provide the feedback instead of humans?&#8221;</p></blockquote><p><strong>4.2 How Constitutional AI Works</strong></p><p><strong>Phase 1: Supervised Learning (Self-Critique)</strong></p><ol><li><p>Model generates response</p></li><li><p>Model critiques its own response against &#8220;constitution&#8221; (principles)</p></li><li><p>Model revises response</p></li><li><p>Train on (prompt, revised response) pairs</p></li></ol><p><strong>Example Constitution principles:</strong></p><ul><li><p>&#8220;Avoid helping users harm themselves or others&#8221;</p></li><li><p>&#8220;Be honest about uncertainty&#8221;</p></li><li><p>&#8220;Respect user privacy&#8221;</p></li><li><p>&#8220;Avoid stereotypes and bias&#8221;</p></li></ul><p><strong>Phase 2: RL from AI Feedback (RLAIF)</strong></p><p>Instead of human rankings:</p><ol><li><p>Generate multiple responses</p></li><li><p>AI model ranks them based on constitution</p></li><li><p>Train reward model on AI preferences</p></li><li><p>Use PPO like standard RLHF</p></li></ol><p><strong>4.3 Benefits</strong></p><p><strong>1. Scalability</strong></p><ul><li><p>No human labelers needed (after initial constitution)</p></li><li><p>Can generate millions of examples</p></li></ul><p><strong>2. Transparency</strong></p><ul><li><p>Constitution is explicit</p></li><li><p>You know what principles the model follows</p></li></ul><p><strong>3. Iterative Improvement</strong></p><ul><li><p>Easy to update constitution</p></li><li><p>Retrain quickly</p></li></ul><p><strong>4. Consistency</strong></p><ul><li><p>AI feedback is more consistent than human feedback</p></li></ul><p><strong>4.4 Limitations</strong></p><p><strong>1. Goodhart&#8217;s Law</strong> &#8220;When a measure becomes a target, it ceases to be a good measure.&#8221; AI critic might rate responses highly for wrong reasons.</p><p><strong>2. Capability Ceiling</strong> AI critic can&#8217;t be better than the model being evaluated. Self-improvement has limits.</p><p><strong>3. Subtle Value Alignment</strong> Hard to capture nuanced human values in written principles.</p><div><hr></div><h4>5. 
Safety &amp; Guardrails</h4><p><strong>5.1 Content Filtering</strong></p><p><strong>Input filters:</strong></p><ul><li><p>Detect prompt injection attempts</p></li><li><p>Block requests for harmful content</p></li><li><p>Rate limiting per user</p></li></ul><p><strong>Output filters:</strong></p><ul><li><p>Scan generated text for:</p><ul><li><p>PII (emails, phone numbers, SSNs)</p></li><li><p>Hate speech, violence</p></li><li><p>Copyrighted material</p></li><li><p>Malicious code</p></li></ul></li></ul><p><strong>Tools:</strong></p><ul><li><p>OpenAI Moderation API</p></li><li><p>PerspectiveAPI (Google)</p></li><li><p>Custom classifiers</p></li></ul><p><strong>5.2 Red Teaming</strong></p><p><strong>What:</strong> Adversarial testing to find failure modes</p><p><strong>Process:</strong></p><ol><li><p>Hire people to &#8220;attack&#8221; the model</p></li><li><p>Try to generate harmful outputs</p></li><li><p>Document successful attacks</p></li><li><p>Retrain to fix vulnerabilities</p></li></ol><p><strong>Common attack vectors:</strong></p><ul><li><p>Jailbreaks (&#8221;Pretend you&#8217;re an AI with no restrictions...&#8221;)</p></li><li><p>Prompt injection (&#8221;Ignore previous instructions...&#8221;)</p></li><li><p>Multi-turn manipulation (build trust, then ask harmful questions)</p></li><li><p>Encoded requests (ROT13, base64, etc.)</p></li></ul><p><strong>5.3 The Ongoing Arms Race</strong></p><p><strong>Reality:</strong> No perfect solution.</p><p>Users find new jailbreaks daily. Models get patched. New jailbreaks emerge.</p><p><strong>The defense:</strong></p><ul><li><p>Continuous monitoring</p></li><li><p>Rapid response to new attacks</p></li><li><p>Multiple layers (input filter + model + output filter)</p></li><li><p>Human review of edge cases</p></li></ul><div><hr></div><h3>Part 2: Production - Deploying at Scale</h3><h4>6. Inference Optimization: Making It Fast &amp; Cheap</h4><p><strong>6.1 The Inference Cost Problem</strong></p><p><strong>ChatGPT initial costs (estimated):</strong></p><ul><li><p>$700,000/day in compute (early 2023)</p></li><li><p>~13M users at the time</p></li><li><p>$0.05 per user per day</p></li></ul><p><strong>For comparison:</strong></p><ul><li><p>Google Search: ~$0.001 per search</p></li><li><p>Netflix: ~$0.10 per user per day</p></li></ul><p><strong>LLMs are 50-100x more expensive to serve than traditional services.</strong></p><p><strong>6.2 Quantization: Reducing Model Size</strong></p><p><strong>Problem:</strong> GPT-3 in FP16 = 350GB Can&#8217;t fit on single GPU, slow inference.</p><p><strong>Solution:</strong> Reduce precision</p><p><strong>FP16 &#8594; INT8 (8-bit quantization)</strong></p><ul><li><p>2x smaller model</p></li><li><p>2x faster inference</p></li><li><p>Minimal accuracy loss (~1%)</p></li></ul><p><strong>FP16 &#8594; INT4 (4-bit quantization)</strong></p><ul><li><p>4x smaller model</p></li><li><p>3-4x faster inference</p></li><li><p>Some accuracy loss (~3-5%)</p></li></ul><p><strong>Techniques:</strong></p><ul><li><p><strong>Post-training quantization:</strong> GPTQ, AWQ</p></li><li><p><strong>Quantization-aware training:</strong> Train with quantization in mind</p></li></ul><p><strong>Example:</strong> LLaMA-70B in FP16: 140GB LLaMA-70B in 4-bit: 35GB &#8594; Fits on single A100 (80GB)</p><p><strong>6.3 KV Cache Optimization</strong></p><p><strong>Problem:</strong> For long contexts, KV cache dominates memory</p><p><strong>Solutions:</strong></p><p><strong>1. 
Multi-Query Attention (MQA)</strong></p><ul><li><p>Share K, V across all heads</p></li><li><p>Only Q is per-head</p></li><li><p>2-3x less KV cache memory</p></li></ul><p><strong>2. Grouped-Query Attention (GQA)</strong></p><ul><li><p>Share K, V across groups of heads</p></li><li><p>Balance between MHA and MQA</p></li><li><p>Used in LLaMA 2</p></li></ul><p><strong>3. PagedAttention (vLLM)</strong></p><ul><li><p>Manage KV cache like OS manages memory</p></li><li><p>Non-contiguous storage</p></li><li><p>Reduces memory waste by 40%</p></li></ul><p><strong>6.4 Batching Strategies</strong></p><p><strong>Problem:</strong> Serving one request at a time wastes GPU</p><p><strong>Naive batching:</strong> Wait until batch is full &#8594; high latency</p><p><strong>Continuous batching (ORCA, vLLM):</strong></p><ul><li><p>Add requests to batch as they arrive</p></li><li><p>Remove completed sequences</p></li><li><p>Add new sequences mid-batch</p></li><li><p>10-20x higher throughput</p></li></ul><p><strong>6.5 Model Serving Frameworks</strong></p><p><strong>vLLM</strong></p><ul><li><p>PagedAttention for memory efficiency</p></li><li><p>Continuous batching</p></li><li><p>14x-24x higher throughput than naive</p></li></ul><p><strong>TensorRT-LLM (NVIDIA)</strong></p><ul><li><p>Optimized kernels</p></li><li><p>INT8/INT4 quantization</p></li><li><p>Multi-GPU inference</p></li></ul><p><strong>Text Generation Inference (HuggingFace)</strong></p><ul><li><p>Production-ready</p></li><li><p>Flash Attention</p></li><li><p>Tensor parallelism</p></li></ul><p><strong>Triton (NVIDIA)</strong></p><ul><li><p>Model server for production</p></li><li><p>Multiple models, multiple GPUs</p></li><li><p>Load balancing, auto-scaling</p></li></ul><div><hr></div><h4>7. RAG: Retrieval-Augmented Generation</h4><p><strong>7.1 The Problem RAG Solves</strong></p><p><strong>Base LLM issues:</strong></p><ul><li><p>Knowledge cutoff (can&#8217;t know events after training)</p></li><li><p>Hallucinations (makes up facts)</p></li><li><p>No access to private/proprietary data</p></li><li><p>Expensive to update knowledge (requires retraining)</p></li></ul><p><strong>RAG solution:</strong></p><blockquote><p>&#8220;Don&#8217;t store all knowledge in parameters. Retrieve relevant information and include it in the prompt.&#8221;</p></blockquote><p><strong>7.2 How RAG Works</strong></p><p><strong>Architecture:</strong></p><pre><code><code>User Query
    &#8595;
[1. Retrieve] &#8594; Search knowledge base
    &#8595;
Relevant documents/chunks
    &#8595;
[2. Augment] &#8594; Construct prompt with context
    &#8595;
Prompt: "Given the following information: [docs]
        Answer the question: [query]"
    &#8595;
[3. Generate] &#8594; LLM produces answer
    &#8595;
Response (grounded in retrieved docs)
</code></code></pre><p><strong>7.3 Building a RAG System</strong></p><p><strong>Step 1: Document Processing</strong></p><pre><code><code>1. Load documents (PDFs, web pages, databases)
2. Chunk into passages (200-500 tokens each)
3. Embed each chunk using embedding model
4. Store embeddings in vector database
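</code></code></pre><p>A minimal sketch of that indexing step, assuming a sentence-transformers embedder and a FAISS index (illustrative choices; any of the embedding models and vector stores listed in 7.4 would slot in):</p><pre><code><code>import faiss
from sentence_transformers import SentenceTransformer

def chunk(text, size=400):
    # Naive fixed-size chunking by words; see 7.4 for smarter strategies
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["...document text 1...", "...document text 2..."]
chunks = [c for doc in documents for c in chunk(doc)]

embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)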
</code></code></pre><p><strong>Step 2: Query Time</strong></p><pre><code><code>1. User asks question
2. Embed question
3. Find top-k most similar chunks (cosine similarity)
4. Construct prompt with chunks + question
5. LLM generates answer
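</code></code></pre><p>Continuing the sketch above, query time looks like this (embedder, index, and chunks are the objects built in Step 1; llm is a placeholder for whatever completion client you use):</p><pre><code><code>def answer(question, k=4):
    # 2. Embed the question with the same model used for the chunks
    q_emb = embedder.encode([question], normalize_embeddings=True)
    # 3. Top-k most similar chunks (inner product on normalized vectors = cosine similarity)
    scores, ids = index.search(q_emb, k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    # 4. Construct prompt with chunks + question
    prompt = (
        f"Given the following information:\n{context}\n\n"
        f"Answer the question: {question}"
    )
    # 5. Generate the answer (llm() is a stand-in for your model call)
    return llm(prompt)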
</code></code></pre><p><strong>Step 3: Post-Processing</strong></p><pre><code><code>1. Extract citations from response
2. Verify facts against retrieved docs
3. Return answer + sources
</code></code></pre><p><strong>7.4 Key Components</strong></p><p><strong>Embedding Models:</strong></p><ul><li><p><strong>OpenAI ada-002:</strong> 1536 dimensions, good quality</p></li><li><p><strong>Sentence Transformers:</strong> Open-source, various sizes</p></li><li><p><strong>Cohere Embed:</strong> Multilingual, strong performance</p></li><li><p><strong>E5, BGE:</strong> State-of-the-art open models</p></li></ul><p><strong>Vector Databases:</strong></p><ul><li><p><strong>Pinecone:</strong> Managed, scalable</p></li><li><p><strong>Weaviate:</strong> Open-source, GraphQL API</p></li><li><p><strong>Qdrant:</strong> Rust-based, fast</p></li><li><p><strong>Chroma:</strong> Simple, embedded</p></li><li><p><strong>FAISS:</strong> Library (not database), very fast</p></li></ul><p><strong>Chunking Strategies:</strong></p><ul><li><p><strong>Fixed-size:</strong> Simple, 200-500 tokens</p></li><li><p><strong>Sentence-based:</strong> Split on sentences</p></li><li><p><strong>Semantic:</strong> Split on topic boundaries</p></li><li><p><strong>Sliding window:</strong> Overlapping chunks for context</p></li></ul><p><strong>7.5 Hybrid Search</strong></p><p><strong>Problem:</strong> Keyword search and vector search each have strengths</p><p><strong>Solution:</strong> Combine both</p><p><strong>BM25 (keyword) + Dense retrieval (semantic)</strong></p><pre><code><code># Retrieve using both methods
keyword_results = bm25_search(query)  # Good for exact matches
semantic_results = vector_search(query)  # Good for concepts

# Combine with Reciprocal Rank Fusion (RRF)
combined_results = rrf(keyword_results, semantic_results)
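
# For reference, one possible rrf() implementation (illustrative sketch; k=60 is the
# commonly used constant, and each input is a list of document IDs ranked best-first):
def rrf(keyword_results, semantic_results, k=60):
    scores = {}
    for results in (keyword_results, semantic_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)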
</code></code></pre><p><strong>When to use:</strong></p><ul><li><p>Keyword: Exact terms, names, technical jargon</p></li><li><p>Semantic: Concepts, paraphrases, &#8220;similar meaning&#8221;</p></li><li><p>Hybrid: Best of both</p></li></ul><p><strong>7.6 RAG vs Fine-tuning</strong></p><p><strong>Rule of thumb:</strong></p><ul><li><p><strong>RAG:</strong> For knowledge-heavy tasks, changing info</p></li><li><p><strong>Fine-tuning:</strong> For specialized tasks, writing style, consistent behavior</p></li><li><p><strong>Both:</strong> Use fine-tuned model + RAG for best results</p></li></ul><div><hr></div><h4>8. Prompt Engineering: The Meta-Skill</h4><p><strong>8.1 Why Prompting Matters</strong></p><p><strong>Same model, different prompts:</strong></p><p><strong>Bad prompt:</strong></p><pre><code><code>Tell me about machine learning
</code></code></pre><p><strong>Good prompt:</strong></p><pre><code><code>You are an expert machine learning engineer. Explain the difference 
between supervised and unsupervised learning to a software engineer 
with no ML background. Use concrete examples and avoid jargon.
</code></code></pre><p><strong>Prompt engineering can 10x your results</strong> without changing the model.</p><p><strong>8.2 Core Patterns</strong></p><p><strong>1. Role Prompting</strong></p><pre><code><code>You are an expert Python programmer.
You are a helpful teaching assistant.
You are a technical documentation writer.
</code></code></pre><p><strong>2. Few-Shot Learning</strong></p><pre><code><code>Classify sentiment:

Text: "I love this product!"
Sentiment: Positive

Text: "This is terrible."
Sentiment: Negative

Text: "It's okay, nothing special."
Sentiment: Neutral

Text: "Best purchase ever!"
Sentiment: [LLM completes]
</code></code></pre><p><strong>3. Chain-of-Thought (CoT)</strong></p><pre><code><code>Problem: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many balls does he have?

Let's think step by step:
1. Roger starts with 5 balls
2. He buys 2 cans
3. Each can has 3 balls
4. So he gets 2 * 3 = 6 new balls
5. Total: 5 + 6 = 11 balls
</code></code></pre><p>Adding &#8220;Let&#8217;s think step by step&#8221; increases reasoning accuracy dramatically.</p><p><strong>4. Self-Consistency</strong></p><pre><code><code>Generate 5 different reasoning paths.
Take majority vote on final answer.
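</code></code></pre><p>In code, self-consistency is just sampling plus voting. Here <code>sample_answer()</code> is a placeholder for one chain-of-thought call at non-zero temperature that returns only the extracted final answer.</p><pre><code><code>from collections import Counter

def self_consistency(prompt, sample_answer, n=5):
    # Sample n independent reasoning paths, then majority-vote the final answers
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]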
</code></code></pre><p>Improves accuracy on complex reasoning tasks.</p><p><strong>5. ReAct (Reason + Act)</strong></p><pre><code><code>Thought: I need current weather data
Action: call_weather_api("San Francisco")
Observation: 72&#176;F, sunny
Thought: Now I can answer
Answer: It's 72&#176;F and sunny in SF today
</code></code></pre><p>Interleaving reasoning and tool use.</p><p><strong>8.3 System Prompts (ChatGPT-style)</strong></p><p><strong>Structure:</strong></p><pre><code><code>System: [Instructions on behavior, constraints]
User: [User's input]
Assistant: [Model's response]
</code></code></pre><p><strong>Example system prompt:</strong></p><pre><code><code>You are a helpful AI assistant. You should:
- Be concise but thorough
- Admit when you don't know something
- Avoid harmful or biased content
- Cite sources when possible
- Ask clarifying questions if the request is ambiguous
</code></code></pre><p><strong>8.4 Prompt Optimization Tools</strong></p><p><strong>Manual:</strong></p><ul><li><p>Test variations</p></li><li><p>A/B test with users</p></li><li><p>Iterate based on feedback</p></li></ul><p><strong>Automated:</strong></p><ul><li><p><strong>DSPy:</strong> Compile prompts automatically</p></li><li><p><strong>Prompt flow:</strong> Visual prompt engineering (Microsoft)</p></li><li><p><strong>LangChain:</strong> Framework for prompt templates</p></li></ul><div><hr></div><h4>9. Real-World Architecture Patterns</h4><p><strong>9.1 Pattern 1: Simple API Wrapper</strong></p><pre><code><code>User Request
    &#8595;
Load Balancer
    &#8595;
API Server (FastAPI/Flask)
    &#8595;
LLM API (OpenAI, Anthropic, etc.)
    &#8595;
Response
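</code></code></pre><p>A minimal sketch of this pattern using FastAPI and the OpenAI Python client. The endpoint path, model name, and system prompt are illustrative, and it assumes <code>OPENAI_API_KEY</code> is set in the environment.</p><pre><code><code>from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()                       # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    # Single upstream call; the hosted LLM API does all the heavy lifting
    completion = client.chat.completions.create(
        model="gpt-4o-mini",            # illustrative model name
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": req.message},
        ],
        max_tokens=200,                 # cap output length (see section 10 on cost)
    )
    return {"reply": completion.choices[0].message.content}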
</code></code></pre><p><strong>Use case:</strong> Prototypes, low-volume applications</p><p><strong>Pros:</strong> Simple, fast to build </p><p><strong>Cons:</strong> Expensive, vendor lock-in</p><div><hr></div><p><strong>9.2 Pattern 2: Self-Hosted Model</strong></p><pre><code><code>User Request
    &#8595;
API Gateway
    &#8595;
Model Server (vLLM, TGI)
    &#9500;&#9472; GPU 1 (model shard 1)
    &#9500;&#9472; GPU 2 (model shard 2)
    &#9492;&#9472; GPU N (model shard N)
    &#8595;
Response
</code></code></pre><p><strong>Use case:</strong> High volume, cost optimization, data privacy</p><p><strong>Pros:</strong> Control, cheaper at scale </p><p><strong>Cons:</strong> Infrastructure complexity, GPU costs</p><div><hr></div><p><strong>9.3 Pattern 3: RAG System</strong></p><pre><code><code>User Query
    &#8595;
[Query Processing]
    &#8595;
Vector Database (semantic search)
    +
Keyword Search (BM25)
    &#8595;
[Reranking]
    &#8595;
Top-K documents
    &#8595;
[Prompt Construction]
    &#8595;
LLM
    &#8595;
[Response + Citations]
    &#8595;
User
</code></code></pre><p><strong>Use case:</strong> Q&amp;A, knowledge bases, customer support</p><p><strong>Components:</strong></p><ul><li><p>Embedding model for encoding</p></li><li><p>Vector DB for storage</p></li><li><p>Reranker for quality</p></li><li><p>LLM for generation</p></li></ul><div><hr></div><p><strong>9.4 Pattern 4: Agent System</strong></p><pre><code><code>User Request
    &#8595;
Agent (LLM)
    &#9500;&#9472; Tool 1: Web Search
    &#9500;&#9472; Tool 2: Calculator
    &#9500;&#9472; Tool 3: Code Execution
    &#9500;&#9472; Tool 4: Database Query
    &#9492;&#9472; Tool N: Custom API
    &#8595;
[Agent Loop: Reason &#8594; Act &#8594; Observe]
    &#8595;
Final Answer
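</code></code></pre><p>A stripped-down sketch of the loop. The <code>llm()</code> callable, the two toy tools, and the &#8220;Action:&#8221; / &#8220;Answer:&#8221; text protocol are all illustrative; agent frameworks wrap this same Reason &#8594; Act &#8594; Observe cycle with more robust parsing, retries, and tool schemas.</p><pre><code><code>TOOLS = {
    "calculator": lambda expr: str(eval(expr)),      # toy only; eval is unsafe in production
    "web_search": lambda q: f"(top results for {q!r})",
}

def run_agent(llm, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                       # model emits Thought + Action, or Answer
        transcript += step + "\n"
        if "Answer:" in step:                        # model decided it is done
            return step.split("Answer:", 1)[1].strip()
        if "Action:" in step:                        # e.g. "Action: calculator(5 + 3 * 2)"
            call = step.split("Action:", 1)[1].strip()
            tool, _, arg = call.partition("(")
            observation = TOOLS[tool.strip()](arg.rstrip(")"))
            transcript += f"Observation: {observation}\n"
    return "Agent stopped without a final answer."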
</code></code></pre><p><strong>Use case:</strong> Complex workflows, multi-step tasks</p><p><strong>Frameworks:</strong></p><ul><li><p>LangChain</p></li><li><p>LlamaIndex</p></li><li><p>AutoGPT</p></li><li><p>BabyAGI</p></li></ul><p><strong>Challenges:</strong></p><ul><li><p>Reliability (agents can fail)</p></li><li><p>Cost (multiple LLM calls)</p></li><li><p>Latency (sequential operations)</p></li></ul><div><hr></div><p><strong>9.5 Pattern 5: Multi-Model Pipeline</strong></p><pre><code><code>User Request
    &#8595;
[Router LLM] &#8594; Classify intent
    &#8595;
    &#9500;&#9472; Simple query &#8594; Small fast model (7B)
    &#9500;&#9472; Complex query &#8594; Large model (70B)
    &#9500;&#9472; Code task &#8594; Code-specialized model
    &#9492;&#9472; Creative task &#8594; Creative model
    &#8595;
Response
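</code></code></pre><p>A sketch of the routing idea. <code>classify_intent()</code> and <code>generate()</code> are placeholder callables (the classifier can itself be a small, cheap model), and the model names are illustrative.</p><pre><code><code># Route each request to the cheapest model that can handle it
MODEL_FOR_INTENT = {
    "simple":   "small-7b",
    "complex":  "large-70b",
    "code":     "code-specialist",
    "creative": "creative-model",
}

def route(query, classify_intent, generate):
    intent = classify_intent(query)                    # cheap call: small model or classifier
    model = MODEL_FOR_INTENT.get(intent, "small-7b")   # default to the cheapest option
    return generate(model=model, prompt=query)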
</code></code></pre><p><strong>Use case:</strong> Cost optimization, task-specific quality</p><p><strong>Benefit:</strong> Use expensive models only when needed</p><div><hr></div><h4>10. Cost Optimization Strategies</h4><p>Running large language models at scale is expensive. Serving millions of users quickly adds up: even a model like GPT-3.5 can cost thousands of dollars per day, while GPT-4 can easily reach hundreds of thousands. Efficient deployment requires careful strategies to reduce compute, memory, and token usage without sacrificing quality.</p><p><strong>Techniques for Reducing Costs</strong></p><ol><li><p><strong>Prompt Compression</strong></p><ul><li><p>Remove unnecessary words and redundancies</p></li><li><p>Use concise phrasing (&#8220;Explain X briefly&#8221; instead of &#8220;Could you please explain X in detail&#8221;)</p></li><li><p>Reduces token consumption without affecting output quality</p></li></ul></li><li><p><strong>Caching</strong></p><ul><li><p>Store responses to common queries for reuse</p></li><li><p>Cache intermediate results for multi-step prompts</p></li><li><p>Semantic caching allows similar queries to reuse prior outputs, saving both compute and tokens</p></li></ul></li><li><p><strong>Streaming</strong></p><ul><li><p>Deliver partial outputs as soon as they are generated</p></li><li><p>Users get faster feedback</p></li><li><p>Responses can be interrupted if no longer needed, saving computation</p></li></ul></li><li><p><strong>Model Routing</strong></p><ul><li><p>Route simple queries to smaller, faster models</p></li><li><p>Reserve larger models for complex tasks</p></li><li><p>Up to 70&#8211;80% of requests can be served by smaller models, reducing overall cost</p></li></ul></li><li><p><strong>Output Length Limits</strong></p><ul><li><p>Enforce maximum token limits per request to prevent runaway generation</p></li><li><p>Example: <code>max_tokens=200</code> in API calls</p></li></ul></li><li><p><strong>Batch Processing</strong></p><ul><li><p>Process multiple requests together to maximize GPU utilization</p></li><li><p>Reduces per-request compute cost</p></li><li><p>Trade-off: slight increase in latency for higher throughput</p></li></ul></li><li><p><strong>Self-Hosting</strong></p><ul><li><p>Deploy models on owned infrastructure if token usage is high (~1M&#8211;10M tokens/day)</p></li><li><p>Fixed GPU costs are amortized across all requests, reducing long-term expenses</p></li></ul></li><li><p><strong>Quantization</strong></p><ul><li><p>Convert models to lower precision (e.g., 4-bit) to reduce memory and compute requirements</p></li><li><p>Achieves 3&#8211;4x cost reduction with minimal impact on output quality</p></li></ul></li></ol><div><hr></div><h4>11. Production Checklist</h4><p>Deploying a large language model isn&#8217;t just about serving predictions&#8212;it requires rigorous preparation, monitoring, and continuous improvement. 
Here&#8217;s a structured approach to ensure reliability, safety, and efficiency.</p><p><strong>11.1 Before Deployment</strong></p><p><strong>Model Selection</strong></p><ul><li><p>Choose the appropriate model size based on your use case.</p></li><li><p>Benchmark against real-world inputs to verify performance.</p></li><li><p>Test edge cases to ensure robustness under unusual or unexpected queries.</p></li></ul><p><strong>Safety Measures</strong></p><ul><li><p>Implement input filters to catch malicious or harmful prompts.</p></li><li><p>Apply output filters to detect sensitive information, toxic content, or code injection.</p></li><li><p>Set up rate limiting per user to prevent abuse.</p></li><li><p>Complete red-teaming exercises to discover vulnerabilities proactively.</p></li><li><p>Integrate a content moderation system for ongoing safety enforcement.</p></li></ul><p><strong>Performance</strong></p><ul><li><p>Verify latency meets targets (p95, p99) for a smooth user experience.</p></li><li><p>Ensure throughput meets expected request volume.</p></li><li><p>Conduct load testing to validate system stability under peak demand.</p></li><li><p>Configure auto-scaling to handle fluctuations in traffic.</p></li></ul><p><strong>Cost Management</strong></p><ul><li><p>Calculate cost per request and ensure it aligns with your budget.</p></li><li><p>Set budget alerts to catch unexpected spikes in usage.</p></li><li><p>Implement cost optimization strategies such as batching, caching, or model routing.</p></li></ul><p><strong>Monitoring &amp; Observability</strong></p><ul><li><p>Log every request and response, including timestamps, latency, tokens, and costs.</p></li><li><p>Track errors and anomalies in real time.</p></li><li><p>Monitor latency and throughput to catch performance regressions early.</p></li><li><p>Collect user feedback for insights on model behavior and satisfaction.</p></li></ul><div><hr></div><p><strong>11.2 Day-One Operations</strong></p><p><strong>Observability</strong></p><ul><li><p>Log all interactions in detail: requests, responses, errors, and resource usage.</p></li><li><p>Monitor critical metrics such as latency, error rates, and token usage to spot anomalies immediately.</p></li></ul><p><strong>Alerts</strong></p><ul><li><p>Configure alerts for latency spikes, error surges, cost anomalies, and API failures.</p></li></ul><p><strong>Fallback Strategies</strong></p><ul><li><p>Use a secondary model if the primary model fails.</p></li><li><p>Queue or retry requests when rate limits are exceeded.</p></li><li><p>Serve cached responses when timeouts occur to maintain continuity.</p></li></ul><div><hr></div><p><strong>11.3 Continuous Improvement</strong></p><p><strong>User Feedback Loop</strong></p><ul><li><p>Collect user ratings (thumbs up/down) for every response.</p></li><li><p>Log prompts, responses, and feedback for analysis.</p></li><li><p>Identify failure patterns and adjust prompts, fine-tune models, or retrain as necessary.</p></li></ul><p><strong>A/B Testing</strong></p><ul><li><p>Split users between prompt or model variations to measure impact.</p></li><li><p>Compare metrics such as quality, latency, and cost.</p></li><li><p>Deploy the winning configuration to the full user base.</p></li></ul><p><strong>Regular Updates</strong></p><ul><li><p>Incorporate new model versions and optimizations.</p></li><li><p>Continuously refine prompts for clarity and efficiency.</p></li><li><p>Update safety measures and moderation systems as new risks emerge.</p></li><li><p>Optimize deployment 
strategies to reduce cost without sacrificing performance.</p></li></ul><div><hr></div><h4>12. The Future of LLM Deployment</h4><p>The landscape of LLM deployment is evolving rapidly. As models become more capable, practical considerations like cost, latency, and safety drive innovation. Let&#8217;s explore emerging trends and the challenges that lie ahead.</p><p><strong>12.1 Emerging Trends</strong></p><p><strong>1. Smaller, Specialized Models</strong></p><ul><li><p>Models like Phi-2 (2.7B parameters) can match GPT-3.5 on specific tasks, demonstrating that bigger isn&#8217;t always better.</p></li><li><p>Task-specific fine-tuning enables models to excel at narrow domains without massive compute.</p></li><li><p>Using a mixture of smaller, specialized models can outperform a single monolithic model while reducing inference costs.</p></li></ul><p><strong>2. On-Device LLMs</strong></p><ul><li><p>Quantized models running directly on phones or laptops are becoming feasible.</p></li><li><p>On-device deployment offers privacy benefits by keeping user data local.</p></li><li><p>Zero-latency inference becomes possible, enabling instant responses for interactive applications.</p></li></ul><p><strong>3. Multimodal Integration</strong></p><ul><li><p>Future LLMs will seamlessly combine text, images, and audio in one model.</p></li><li><p>Examples include GPT-4V, Gemini, and Claude 3, opening new possibilities for richer and more interactive AI experiences.</p></li></ul><p><strong>4. Agent Ecosystems</strong></p><ul><li><p>LLMs will increasingly act as orchestrators, coordinating multiple tools like web search, code execution, and database queries.</p></li><li><p>This enables complex multi-step workflows and more autonomous AI assistants capable of reasoning, acting, and observing iteratively.</p></li></ul><p><strong>5. Continuous Learning</strong></p><ul><li><p>Models will adapt and improve without full retraining.</p></li><li><p>Personalization will allow AI to adjust to individual user preferences.</p></li><li><p>Continuous learning ensures models stay up-to-date with new information while remaining aligned with desired behaviors.</p></li></ul><div><hr></div><p><strong>12.2 Open Challenges</strong></p><p><strong>1. Reliability</strong></p><ul><li><p>LLMs still hallucinate and can generate factually incorrect responses.</p></li><li><p>Ensuring correctness remains difficult, and better verification mechanisms are needed.</p></li></ul><p><strong>2. Cost</strong></p><ul><li><p>Large-scale deployment remains expensive.</p></li><li><p>Achieving 10x&#8211;100x reductions in inference cost is essential for widespread adoption.</p></li></ul><p><strong>3. Latency</strong></p><ul><li><p>Users expect sub-second response times, but large models are inherently slower.</p></li><li><p>Optimizing inference pipelines and leveraging smaller or hybrid models will be critical.</p></li></ul><p><strong>4. Safety</strong></p><ul><li><p>New jailbreaks and adversarial attacks emerge constantly.</p></li><li><p>Subtle biases are hard to detect, and misuse of powerful models is inevitable.</p></li><li><p>Ongoing vigilance and layered safety mechanisms are required.</p></li></ul><p><strong>5. 
Evaluation</strong></p><ul><li><p>Measuring LLM quality is challenging.</p></li><li><p>Standard benchmarks often fail to capture real-world performance.</p></li><li><p>Improved metrics and evaluation frameworks are needed to assess usefulness, alignment, and reliability effectively.</p></li></ul><div><hr></div><h3> Closing Thoughts</h3><p>Thanks for sticking with the series and exploring the world of Transformers and LLMs with me. We started with why Transformers came to be, dove into how they work, saw how scaling unlocks new capabilities, and finally covered how to bring them safely and efficiently into production.</p><p>The hope is that this series gives you a clear roadmap not just the theory, but how to think about building and deploying AI responsibly. From alignment and RLHF to RAG, prompting, and optimization, these are the tools and lessons that turn a powerful model into a useful system.</p><p>AI is evolving fast, and there&#8217;s still so much to explore. Keep experimenting, keep questioning, and always prioritize safety and usability.</p><p>Thank you for going through the series , I hope it was as enlightening for you as it was fun to put together. Here&#8217;s to building the next generation of AI thoughtfully and responsibly.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[🚀 Scaling to LLMs: Why Bigger Models Get Smarter]]></title><description><![CDATA[From BERT to GPT-3: Understanding the Scaling Breakthrough]]></description><link>https://datajourney24.substack.com/p/scaling-to-llms-why-bigger-models</link><guid isPermaLink="false">https://datajourney24.substack.com/p/scaling-to-llms-why-bigger-models</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sat, 06 Dec 2025 07:20:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RoqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;ll Cover</h2><p>In Posts 1 &amp; 2, we understood <strong>how</strong> Transformers work.</p><p>Now comes the most surprising discovery in modern AI:</p><blockquote><p><strong>Making models bigger doesn&#8217;t just make them better at existing tasks ,it makes them capable of entirely new tasks they were never trained for.</strong></p></blockquote><p>This post covers:</p><ul><li><p>The shocking discovery of scaling laws</p></li><li><p>Why bigger models exhibit &#8220;emergent abilities&#8221;</p></li><li><p>Chinchilla laws and compute-optimal training</p></li><li><p>How LLMs are actually trained</p></li><li><p>Infrastructure requirements and costs</p></li><li><p>What happens during 
pre-training</p></li></ul><p><strong>By the end, you&#8217;ll understand:</strong></p><ul><li><p>Why GPT-3 (175B params) can do things GPT-2 (1.5B) can&#8217;t</p></li><li><p>How to calculate optimal model size for your compute budget</p></li><li><p>The real cost of training frontier models</p></li><li><p>Why &#8220;more data&#8221; became as important as &#8220;more parameters&#8221;</p></li></ul><p>Let&#8217;s dive into the scaling breakthrough that changed everything.</p><div><hr></div><h2>1. The Accidental Discovery: Scaling Laws</h2><h3>1.1 The 2020 Breakthrough</h3><p>In January 2020, OpenAI researchers published a paper that would change AI forever: &#8220;Scaling Laws for Neural Language Models.&#8221;</p><p><strong>What they found:</strong></p><p>Performance improves <strong>predictably</strong> as you scale:</p><ul><li><p>Model size (parameters)</p></li><li><p>Dataset size (tokens)</p></li><li><p>Compute budget (FLOPs)</p></li></ul><p>This wasn&#8217;t just &#8220;bigger is better.&#8221; It was <strong>&#8220;bigger is predictably better in a mathematically precise way.&#8221;</strong></p><h3>1.2 The Three Scaling Axes</h3><p><strong>1. Model Size (N parameters)</strong></p><pre><code><code>10M &#8594; 100M &#8594; 1B &#8594; 10B &#8594; 100B parameters
</code></code></pre><p><strong>2. Dataset Size (D tokens)</strong></p><pre><code><code>1B &#8594; 10B &#8594; 100B &#8594; 1T tokens
</code></code></pre><p><strong>3. Compute Budget (C FLOPs)</strong></p><pre><code><code>10^18 &#8594; 10^21 &#8594; 10^24 FLOPs
</code></code></pre><p><strong>The key insight:</strong> Performance (measured by loss) follows a power law:</p><pre><code><code>Loss &#8733; N^(-&#945;)  where &#945; &#8776; 0.076
Loss &#8733; D^(-&#946;)  where &#946; &#8776; 0.095
Loss &#8733; C^(-&#947;)  where &#947; &#8776; 0.050
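</code></code></pre><p>Because these are power laws, the constants cancel when you compare two scales: growing a factor by s multiplies the loss by s^(-exponent). A quick sanity check with the exponents above:</p><pre><code><code>def loss_ratio(scale, exponent):
    # Predicted loss ratio when one scaling factor grows by `scale`
    return scale ** (-exponent)

print(loss_ratio(10, 0.076))   # 10x parameters gives ~0.84x the loss (~16% lower)
print(loss_ratio(10, 0.095))   # 10x data gives ~0.80x the loss (~20% lower)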
</code></code></pre><h3>1.3 What This Means in Practice</h3><p><strong>Example:</strong></p><p>If you have 10x more compute, you should expect:</p><ul><li><p>~40% reduction in loss</p></li><li><p>Significantly better performance on downstream tasks</p></li><li><p><strong>Entirely new capabilities</strong> that weren&#8217;t present before</p></li></ul><p><strong>This was revolutionary</strong> because:</p><ol><li><p>It&#8217;s <strong>predictable</strong> - you can forecast performance before training</p></li><li><p>It&#8217;s <strong>reliable</strong> - holds across architectures and domains</p></li><li><p>It&#8217;s <strong>actionable</strong> - tells you how to allocate resources</p></li></ol><div><hr></div><h2>2. The Chinchilla Correction: We Were Training Wrong</h2><h3>2.1 The 2022 Plot Twist</h3><p>In March 2022, DeepMind dropped a bombshell: &#8220;Training Compute-Optimal Large Language Models&#8221; (Chinchilla paper).</p><p><strong>Their finding:</strong></p><blockquote><p><strong>Most large models were undertrained.</strong></p></blockquote><p><strong>The old approach (GPT-3 era):</strong></p><ul><li><p>Focus on making models HUGE (175B params)</p></li><li><p>Train on relatively little data (300B tokens)</p></li><li><p>&#8220;Bigger model = better model&#8221;</p></li></ul><p><strong>The Chinchilla insight:</strong></p><ul><li><p>You should scale <strong>parameters and data equally</strong></p></li><li><p>GPT-3 should have been trained on 3.7 TRILLION tokens, not 300B</p></li><li><p>Or use a smaller model with the same compute</p></li></ul><h3>2.2 The Compute-Optimal Formula</h3><p>For a given compute budget C:</p><pre><code><code>N_optimal &#8733; C^0.50  (model parameters)
D_optimal &#8733; C^0.50  (training tokens)
</code></code></pre><p><strong>Rule of thumb:</strong></p><p>For every doubling of model size, you should roughly double the training data.</p><h3>2.3 Why This Matters</h3><p><strong>Before Chinchilla:</strong></p><ul><li><p>GPT-3: 175B params, 300B tokens &#8594; Undertrained</p></li><li><p>Gopher: 280B params, 300B tokens &#8594; Severely undertrained</p></li></ul><p><strong>After Chinchilla:</strong></p><ul><li><p>Chinchilla: 70B params, 1.4T tokens &#8594; Compute-optimal, outperformed Gopher</p></li><li><p>LLaMA: 7B-65B params, 1T-1.4T tokens &#8594; Compute-optimal</p></li><li><p>LLaMA 2: 7B-70B params, 2T tokens &#8594; Even more data</p></li></ul><p><strong>The lesson:</strong></p><p>Throwing all your compute into model size is inefficient. You need to balance parameters and training data.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RoqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RoqB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RoqB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RoqB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RoqB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RoqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg" width="1080" height="355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:355,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RoqB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RoqB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!RoqB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RoqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c7fe33-ae51-4cde-a945-adacc22648bc_1080x355.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>3. Emergent Abilities: The Most Surprising Discovery</h2><h3>3.1 What Are Emergent Abilities?</h3><p><strong>Definition:</strong></p><p>Abilities that are <strong>not present in smaller models</strong> but <strong>suddenly appear</strong> when models cross a certain scale threshold.</p><p><strong>Examples:</strong></p><p><strong>Arithmetic:</strong></p><ul><li><p>GPT-2 (1.5B): Can&#8217;t do 3-digit addition</p></li><li><p>GPT-3 (175B): Can do multi-digit arithmetic</p></li></ul><p><strong>Few-shot learning:</strong></p><ul><li><p>BERT (340M): Needs fine-tuning for new tasks</p></li><li><p>GPT-3 (175B): Can learn from 5-10 examples in context</p></li></ul><p><strong>Chain-of-thought reasoning:</strong></p><ul><li><p>Models &lt;10B: Can&#8217;t break down complex problems</p></li><li><p>Models &gt;60B: Can show step-by-step reasoning</p></li></ul><p><strong>Code generation:</strong></p><ul><li><p>GPT-2: Can&#8217;t write functional code</p></li><li><p>Codex/GPT-3.5: Can write complex programs</p></li></ul><h3>3.2 The Emergence Curve</h3><p>Performance on many tasks follows a <strong>sharp phase transition</strong>:</p><pre><code><code>Model Size:   1B    10B   50B   100B  175B
Performance:  0%    5%    15%   65%   85%
</code></code></pre><p>Notice the jump between 50B and 100B , this is emergence.</p><p><strong>It&#8217;s not gradual improvement. It&#8217;s a sudden unlock.</strong></p><h3>3.3 Why Does Emergence Happen?</h3><p><strong>Three theories:</strong></p><p><strong>Theory 1: Capacity Threshold</strong> Some tasks require a minimum amount of &#8220;reasoning space.&#8221; Below that threshold, the model can&#8217;t represent the solution. Above it, it can.</p><p><strong>Theory 2: Data Coverage</strong> Larger models train longer, seeing more examples. At some point, they&#8217;ve seen enough to generalize.</p><p><strong>Theory 3: Measurement Artifact</strong> Maybe performance improves smoothly, but our metrics (like &#8220;% correct&#8221;) create artificial thresholds.</p><p><strong>The truth:</strong> Probably a combination of all three.</p><h3>3.4 Notable Emergent Abilities</h3><p><strong>1. Multi-step reasoning</strong></p><ul><li><p>&#8220;If John is taller than Mary, and Mary is taller than Sue, who&#8217;s tallest?&#8221;</p></li><li><p>Requires chaining facts , emerges around 50B+ params</p></li></ul><p><strong>2. Instruction following</strong></p><ul><li><p>&#8220;Translate this, but make it formal and use British spelling&#8221;</p></li><li><p>Emerges with scale + instruction tuning</p></li></ul><p><strong>3. Self-correction</strong></p><ul><li><p>&#8220;Actually, let me reconsider...&#8221;</p></li><li><p>Models can critique their own outputs (100B+)</p></li></ul><p><strong>4. In-context learning with many examples</strong></p><ul><li><p>GPT-2: ~3 examples max</p></li><li><p>GPT-3: Can learn from 50+ examples in context</p></li></ul><p><strong>5. Code debugging</strong></p><ul><li><p>Not just writing code, but identifying and fixing bugs</p></li><li><p>Strong emergence around 100B+</p></li></ul><div><hr></div><h2>4. Pre-training: How LLMs Actually Learn</h2><h3>4.1 The Training Objective</h3><p>LLMs are trained with a simple objective:</p><p><strong>Next token prediction</strong> (autoregressive language modeling)</p><pre><code><code>Input:  &#8220;The cat sat on the&#8221;
Target: &#8220;mat&#8221;

Loss = -log P(mat | The cat sat on the)
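</code></code></pre><p>In code, the objective is just cross-entropy between the model&#8217;s logits over the vocabulary and the index of the true next token. A toy version (the vocabulary size, random logits, and the token ID for &#8220;mat&#8221; are made up):</p><pre><code><code>import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(1, vocab_size)        # model output for "The cat sat on the"
target = torch.tensor([17])                # hypothetical token ID for "mat"

loss = F.cross_entropy(logits, target)     # equals -log P(mat | context)
print(loss.item())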
</code></code></pre><p>That&#8217;s it. No labels. No supervision. Just predict the next token.</p><h3>4.2 Why This Works</h3><p><strong>Intuition:</strong></p><p>To predict the next word well, the model must:</p><ul><li><p>Understand syntax (grammar rules)</p></li><li><p>Learn semantics (word meanings)</p></li><li><p>Build world knowledge (facts about the world)</p></li><li><p>Model reasoning (cause and effect)</p></li></ul><p><strong>Compression = Understanding</strong></p><blockquote><p>&#8220;The better you can compress text, the more you understand it.&#8221;</p></blockquote><p>Next-token prediction is optimal text compression. So models are forced to learn rich representations.</p><h3>4.3 What Models Learn During Pre-training</h3><p><strong>Phase 1: Tokens &amp; Patterns (Epochs 1-10)</strong></p><ul><li><p>Word boundaries</p></li><li><p>Common n-grams</p></li><li><p>Basic syntax</p></li></ul><p><strong>Phase 2: Structure &amp; Grammar (Epochs 10-50)</strong></p><ul><li><p>Parts of speech</p></li><li><p>Sentence structure</p></li><li><p>Subject-verb agreement</p></li></ul><p><strong>Phase 3: Semantics &amp; Facts (Epochs 50-200)</strong></p><ul><li><p>Word meanings in context</p></li><li><p>Factual knowledge</p></li><li><p>Relationships between entities</p></li></ul><p><strong>Phase 4: Reasoning &amp; Abstraction (Epochs 200+)</strong></p><ul><li><p>Logical inference</p></li><li><p>Analogical reasoning</p></li><li><p>Complex pattern recognition</p></li></ul><p><strong>The deeper the training, the more abstract the representations.</strong></p><h3>4.4 Training Data: What Goes In</h3><p><strong>Common Sources:</strong></p><p><strong>1. Common Crawl</strong></p><ul><li><p>Web scrapes (petabytes of text)</p></li><li><p>Noisy, diverse, multilingual</p></li><li><p>Contains everything from blog posts to academic papers</p></li></ul><p><strong>2. Books</strong></p><ul><li><p>Fiction and non-fiction</p></li><li><p>Long-form coherent text</p></li><li><p>Narrative structure</p></li></ul><p><strong>3. Wikipedia</strong></p><ul><li><p>Factual, encyclopedic knowledge</p></li><li><p>Well-structured</p></li><li><p>Regularly updated</p></li></ul><p><strong>4. Academic Papers (ArXiv, PubMed)</strong></p><ul><li><p>Technical knowledge</p></li><li><p>Scientific reasoning</p></li><li><p>Formal writing</p></li></ul><p><strong>5. Code Repositories (GitHub)</strong></p><ul><li><p>For models like Codex</p></li><li><p>Programming logic</p></li><li><p>Documentation</p></li></ul><p><strong>6. 
Curated Datasets</strong></p><ul><li><p>The Pile (EleutherAI): 825GB, diverse sources</p></li><li><p>C4 (Colossal Clean Crawled Corpus): cleaned Common Crawl</p></li><li><p>RedPajama: Open replication of LLaMA&#8217;s training data</p></li></ul><p><strong>Typical mix for LLMs:</strong></p><ul><li><p>60% Web data (Common Crawl)</p></li><li><p>16% Books</p></li><li><p>10% Wikipedia</p></li><li><p>7% Code</p></li><li><p>7% Academic papers</p></li></ul><h3>4.5 Data Preparation Pipeline</h3><p><strong>Step 1: Collection</strong></p><ul><li><p>Scrape/download massive datasets</p></li><li><p>GPT-3: 570GB compressed &#8594; ~400B tokens</p></li></ul><p><strong>Step 2: Filtering</strong></p><ul><li><p>Remove duplicates (exact and near-duplicates)</p></li><li><p>Filter by quality (perplexity, heuristics)</p></li><li><p>Remove toxic/harmful content</p></li><li><p>Language detection</p></li></ul><p><strong>Step 3: Tokenization</strong></p><ul><li><p>BPE (Byte Pair Encoding) or SentencePiece</p></li><li><p>Build vocabulary (typically 32K-100K tokens)</p></li><li><p>Convert text to token IDs</p></li></ul><p><strong>Step 4: Formatting</strong></p><ul><li><p>Pack sequences to context length (2048, 4096 tokens)</p></li><li><p>Add special tokens ([BOS], [EOS])</p></li><li><p>Shuffle documents</p></li></ul><p><strong>Data quality matters MORE than you think.</strong></p><p>Poor data &#8594; Poor model, regardless of size.</p><div><hr></div><h2>5. Training Infrastructure: The Reality of Scale</h2><h3>5.1 Hardware Requirements</h3><p><strong>Training GPT-3 (175B parameters):</strong></p><p><strong>Hardware:</strong></p><ul><li><p>10,000+ NVIDIA V100 GPUs</p></li><li><p>High-bandwidth interconnects (NVLink, InfiniBand)</p></li><li><p>Petabytes of storage</p></li><li><p>Massive cooling infrastructure</p></li></ul><p><strong>Duration:</strong></p><ul><li><p>Several weeks to months</p></li><li><p>One training run</p></li></ul><p><strong>Cost:</strong></p><ul><li><p>Estimated $4-12 million in compute</p></li><li><p>Plus engineering, power, cooling</p></li></ul><h3>5.2 Distributed Training Strategies</h3><p>Training 175B parameters on one GPU? Impossible.</p><p><strong>Solution: Parallel training</strong></p><p><strong>1. Data Parallelism</strong></p><ul><li><p>Split data across GPUs</p></li><li><p>Each GPU has full model copy</p></li><li><p>Synchronize gradients</p></li></ul><p><strong>Good for:</strong> Small-medium models, lots of data</p><p><strong>2. Model Parallelism</strong></p><ul><li><p>Split model across GPUs</p></li><li><p>Each GPU has part of the model</p></li><li><p>Forward/backward pass requires communication</p></li></ul><p><strong>Good for:</strong> Models that don&#8217;t fit on one GPU</p><p><strong>3. Pipeline Parallelism</strong></p><ul><li><p>Split model into stages</p></li><li><p>Different GPUs handle different layers</p></li><li><p>Micro-batches flow through pipeline</p></li></ul><p><strong>Good for:</strong> Very deep models, reducing idle time</p><p><strong>4. 
Tensor Parallelism</strong></p><ul><li><p>Split individual tensors (weight matrices) across GPUs</p></li><li><p>Operations computed in parallel, then combined</p></li><li><p>Used in Megatron-LM</p></li></ul><p><strong>Good for:</strong> Largest models (100B+)</p><p><strong>Real implementations use combinations:</strong></p><p>GPT-3 likely used:</p><ul><li><p>Tensor parallelism within nodes</p></li><li><p>Pipeline parallelism across nodes</p></li><li><p>Data parallelism for batch processing</p></li></ul><h3>5.3 Training Stability Tricks</h3><p><strong>Problem:</strong> Training 175B parameter models is fragile.</p><p><strong>Solutions:</strong></p><p><strong>1. Gradient Clipping</strong></p><pre><code><code>torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
</code></code></pre><p>Prevents exploding gradients.</p><p><strong>2. Learning Rate Warmup</strong></p><pre><code><code>Start: lr = 0
Warmup (10K steps): lr increases linearly to max_lr
Decay: lr decreases (cosine or polynomial)
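</code></code></pre><p>A minimal sketch of that schedule (linear warmup, then cosine decay); the peak learning rate and step counts are illustrative.</p><pre><code><code>import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=10_000, total_steps=300_000):
    # Linear warmup from 0 to max_lr, then cosine decay back toward 0
    if step &lt; warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))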
</code></code></pre><p>Prevents early instability.</p><p><strong>3. Mixed Precision Training (FP16 + FP32)</strong></p><ul><li><p>Compute in FP16 (faster, less memory)</p></li><li><p>Keep master weights in FP32 (stability)</p></li><li><p>Loss scaling to prevent underflow</p></li></ul><p><strong>4. Activation Checkpointing</strong></p><ul><li><p>Don&#8217;t store all activations (memory)</p></li><li><p>Recompute during backward pass (compute)</p></li><li><p>Trade-off: 33% slower, 3x less memory</p></li></ul><p><strong>5. Careful Initialization</strong></p><ul><li><p>Scale initial weights by depth</p></li><li><p>Residual connections help gradient flow</p></li></ul><p><strong>6. Batch Size Scaling</strong></p><ul><li><p>Larger batches &#8594; more stable gradients</p></li><li><p>But need to adjust learning rate accordingly</p></li></ul><h3>5.4 The Cost Reality</h3><p><strong>Training costs for frontier models</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EVNP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EVNP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 424w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 848w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 1272w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EVNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png" width="1390" height="514" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:514,&quot;width&quot;:1390,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92978,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/180864817?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!EVNP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 424w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 848w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 1272w, https://substackcdn.com/image/fetch/$s_!EVNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e1e704-fb36-4fd7-ac36-ac26c28e64b6_1390x514.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Inference costs are also massive:</strong></p><p>Running ChatGPT for millions of users:</p><ul><li><p>Estimated $700,000/day in compute (early estimates)</p></li><li><p>Need aggressive optimization (quantization, batching)</p></li></ul><p><strong>This is why:</strong></p><ul><li><p>Only a few companies can train frontier models</p></li><li><p>Open-source models lag behind closed ones</p></li><li><p>Efficient inference matters enormously</p></li></ul><div><hr></div><h2>6. Training Dynamics: What Actually Happens</h2><h3>6.1 The Loss Curve</h3><p>Typical loss curve during pre-training:</p><pre><code><code>Epoch:  0     100    200    300    400
Loss:   8.0   3.5    2.1    1.8    1.6
        &#9474;     &#9474;      &#9474;      &#9474;      &#9474;
        &#9474;     &#9474;      &#9474;      &#9474;      &#9492;&#9472; Refinement
        &#9474;     &#9474;      &#9474;      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Reasoning emerges
        &#9474;     &#9474;      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Factual knowledge
        &#9474;     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Grammar learned
        &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Random noise
</code></code></pre><p><strong>Key observations:</strong></p><ol><li><p><strong>Fast initial drop</strong> (epochs 0-50): Learning basic patterns</p></li><li><p><strong>Slower improvement</strong> (epochs 50-200): Building knowledge</p></li><li><p><strong>Diminishing returns</strong> (epochs 200+): Refinement, reasoning</p></li></ol><h3>6.2 Scaling Prevents Overfitting (Usually)</h3><p><strong>Surprising fact:</strong></p><p>Large models trained on massive data <strong>rarely overfit</strong>.</p><p><strong>Why?</strong></p><p><strong>1. Underparameterization paradox</strong> Even 175B parameters is &#8220;small&#8221; relative to the complexity of language.</p><p><strong>2. Implicit regularization</strong> SGD has regularization properties.</p><p><strong>3. Data diversity</strong> Training data is so diverse that memorization is difficult.</p><p><strong>But watch out for:</strong></p><ul><li><p>Repeated data (train on same text multiple times)</p></li><li><p>Contamination (test data in training set)</p></li></ul><h3>6.3 Perplexity: The Standard Metric</h3><p><strong>Perplexity = exp(loss)</strong></p><pre><code><code>Loss = 2.0  &#8594;  Perplexity = 7.4
Loss = 1.5  &#8594;  Perplexity = 4.5
Loss = 1.0  &#8594;  Perplexity = 2.7
</code></code></pre><p><strong>Intuition:</strong></p><p>Perplexity of 7.4 means: &#8220;On average, the model is as uncertain as if it were choosing uniformly among 7.4 options.&#8221;</p><p>Lower perplexity = better language modeling.</p><p><strong>Benchmarks:</strong></p><ul><li><p>GPT-2: Perplexity ~30 on test set</p></li><li><p>GPT-3: Perplexity ~20</p></li><li><p>GPT-4: Perplexity ~15 (estimated)</p></li></ul><p>Human-level: ~10-12 perplexity (roughly)</p><div><hr></div><h2>7. Compute-Optimal Training: The Practical Guide</h2><h3>7.1 The Budget Constraint</h3><p><strong>You have: Fixed compute budget C (in FLOPs)</strong></p><p><strong>Question: How should you allocate C?</strong></p><p><strong>Options:</strong></p><ul><li><p>Big model, little data</p></li><li><p>Small model, lots of data</p></li><li><p>Balanced (compute-optimal)</p></li></ul><h3>7.2 The Formula</h3><p>From Chinchilla paper:</p><pre><code><code>Given C compute:
N_optimal = 0.43 &#215; C^0.50  parameters
D_optimal = 0.27 &#215; C^0.50  tokens
</code></code></pre><p><strong>Example:</strong></p><p>You have 10^23 FLOPs (rough GPT-3 budget).</p><pre><code><code>N = 0.43 &#215; (10^23)^0.50 = 43B parameters
D = 0.27 &#215; (10^23)^0.50 = 270B tokens
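</code></code></pre><p>As a rough cross-check, the common approximation that training compute C &#8776; 6 &#215; N &#215; D FLOPs, combined with the 20-tokens-per-parameter rule of thumb (see Q5 below), gives N = sqrt(C / 120). This is an assumption layered on top of the coefficients above, not a result taken from the Chinchilla paper itself:</p><pre><code><code>import math

# Assumes C = 6 * N * D and D = 20 * N, so N = sqrt(C / 120)
C = 3.14e23                      # approximate GPT-3 training compute in FLOPs
N = math.sqrt(C / 120)           # ~5.1e10  (roughly 50B parameters)
D = 20 * N                       # ~1.0e12  (roughly 1T tokens)
print(f"{N:.2e} parameters, {D:.2e} tokens")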
</code></code></pre><p>GPT-3 used 175B params, 300B tokens &#8594; overparameterized, undertrained.</p><p>Optimal: ~70B params, ~1T tokens.</p><h3>7.3 Real-World Examples</h3><p><strong>LLaMA (Meta, 2023):</strong></p><ul><li><p>Followed Chinchilla scaling</p></li><li><p>7B model: 1T tokens</p></li><li><p>65B model: 1.4T tokens</p></li><li><p><strong>Result:</strong> Outperformed GPT-3 with fewer parameters</p></li></ul><p><strong>LLaMA 2:</strong></p><ul><li><p>Even more training data (2T tokens)</p></li><li><p>Same parameters (7B, 13B, 70B)</p></li><li><p>Better performance</p></li></ul><p><strong>The trend:</strong> More data, compute-optimal sizing.</p><div><hr></div><h2>8. Beyond Scale: What Else Matters?</h2><h3>8.1 Data Quality &gt; Data Quantity (Sometimes)</h3><p><strong>Example: Phi-1 (Microsoft, 2023)</strong></p><ul><li><p>Only 1.3B parameters</p></li><li><p>Trained on <strong>high-quality, curated</strong> code/text</p></li><li><p>Outperformed models 10x larger on code tasks</p></li></ul><p><strong>Lesson:</strong> Clean, high-quality data can partially compensate for size.</p><h3>8.2 Architecture Choices</h3><p><strong>Improvements since original Transformer:</strong></p><p><strong>1. Pre-norm (instead of post-norm)</strong></p><ul><li><p>Better training stability</p></li><li><p>Used in GPT-3, LLaMA</p></li></ul><p><strong>2. SwiGLU (instead of ReLU)</strong></p><ul><li><p>Better activation function</p></li><li><p>Used in PaLM, LLaMA</p></li></ul><p><strong>3. RoPE (instead of sinusoidal PE)</strong></p><ul><li><p>Better positional encoding</p></li><li><p>Used in LLaMA, GPT-NeoX</p></li></ul><p><strong>4. Grouped-Query Attention</strong></p><ul><li><p>Faster inference (less memory)</p></li><li><p>Used in LLaMA 2</p></li></ul><p><strong>These improvements are incremental (5-15% better), not revolutionary.</strong></p><p>Scaling still dominates.</p><h3>8.3 Training Duration</h3><p><strong>Question:</strong> Should you train longer?</p><p><strong>Answer:</strong> It depends on your goal.</p><p><strong>For pre-training:</strong></p><ul><li><p>Chinchilla: Train for exactly 1 epoch (20 tokens per parameter)</p></li><li><p>More epochs &#8594; overfitting risk</p></li></ul><p><strong>For fine-tuning:</strong></p><ul><li><p>Multiple epochs on small datasets is fine</p></li><li><p>Need regularization (dropout, weight decay)</p></li></ul><div><hr></div><h2>9. The Future of Scaling</h2><h3>9.1 Are We Hitting Limits?</h3><p><strong>Data wall:</strong></p><ul><li><p>We&#8217;ve used most of the internet (~1-2T tokens)</p></li><li><p>High-quality data is finite</p></li><li><p>Solution: Synthetic data, multimodal data</p></li></ul><p><strong>Compute wall:</strong></p><ul><li><p>Training GPT-5 might cost $1B+</p></li><li><p>Only a few orgs can afford this</p></li><li><p>Solution: Efficiency, sparsity, better algorithms</p></li></ul><p><strong>Returns diminishing:</strong></p><ul><li><p>Going from 10B &#8594; 100B: Huge gains</p></li><li><p>Going from 100B &#8594; 1T: Smaller gains (per parameter)</p></li><li><p>Solution: Focus on data quality, alignment</p></li></ul><h3>9.2 Alternatives to Pure Scaling</h3><p><strong>1. Mixture of Experts (MoE)</strong></p><ul><li><p>1T total parameters, but only 50B active per input</p></li><li><p>Example: Switch Transformer, GPT-4 (rumored)</p></li></ul><p><strong>2. Retrieval-Augmented Generation (RAG)</strong></p><ul><li><p>Smaller model + external knowledge base</p></li><li><p>More efficient than scaling parameters</p></li></ul><p><strong>3. 
Distillation</strong></p><ul><li><p>Train small model to mimic large one</p></li><li><p>Retain most performance, fraction of cost</p></li></ul><p><strong>4. Sparse Models</strong></p><ul><li><p>Most weights are zero</p></li><li><p>Activate relevant parts per input</p></li></ul><h3>9.3 The Next Frontier</h3><p><strong>Current paradigm:</strong></p><ul><li><p>Pre-train on massive unlabeled data</p></li><li><p>Fine-tune for specific tasks</p></li><li><p>Scale parameters and data together</p></li></ul><p><strong>Emerging paradigm:</strong></p><ul><li><p>Multimodal pre-training (text + images + audio)</p></li><li><p>Continuous learning (update without full retraining)</p></li><li><p>Agent-based systems (LLMs + tools + memory)</p></li><li><p>Smaller, specialized models (task-specific)</p></li></ul><p><strong>The scaling era isn&#8217;t over, but it&#8217;s evolving.</strong></p><div><hr></div><h2>10. Interview Deep-Dive: Scaling Questions</h2><h3>Q1: What are scaling laws and why do they matter?</h3><p><strong>Answer:</strong> Scaling laws describe the relationship between model performance and three factors: parameters, data, and compute. They follow power laws, meaning performance improves predictably as you scale. This matters because: (1) you can forecast performance before expensive training, (2) you can optimize resource allocation, and (3) it reveals that scale itself unlocks new capabilities, not just better performance.</p><div><hr></div><h3>Q2: What did the Chinchilla paper change?</h3><p><strong>Answer:</strong> Chinchilla showed that most large models were <strong>undertrained</strong>. The optimal strategy is to scale parameters and training data equally (both proportional to compute^0.5). GPT-3 had 175B parameters trained on 300B tokens,it should have been trained on 3.5T tokens, or been smaller. LLaMA followed this: 7B params trained on 1T tokens, outperforming GPT-3 despite being 25x smaller.</p><div><hr></div><h3>Q3: What are emergent abilities?</h3><p><strong>Answer:</strong> Abilities that appear suddenly when models cross a size threshold, not present in smaller models. Examples: multi-step reasoning (emerges ~50B+ params), in-context learning with many examples, code generation, chain-of-thought reasoning. Not gradual improvement sharp phase transitions. Suggests some tasks require minimum &#8220;reasoning capacity&#8221; to solve at all.</p><div><hr></div><h3>Q4: Why does next-token prediction work so well for learning?</h3><p><strong>Answer:</strong> To predict the next token well, a model must learn:</p><ul><li><p>Syntax (grammar rules)</p></li><li><p>Semantics (word meanings)</p></li><li><p>World knowledge (facts)</p></li><li><p>Reasoning (causality, logic)</p></li></ul><p>Next-token prediction is equivalent to optimal text compression. The better you compress, the more you must understand. This unsupervised objective forces the model to learn rich, general representations.</p><div><hr></div><h3>Q5: What&#8217;s the optimal allocation of compute between parameters and data?</h3><p><strong>Answer:</strong> Chinchilla scaling: For compute budget C, optimal is N &#8733; C^0.5 parameters and D &#8733; C^0.5 tokens. Rule of thumb: 20 tokens per parameter. So a 7B model should train on ~140B tokens, a 70B model on ~1.4T tokens. 
Overparameterized models waste compute.</p><div><hr></div><h3>Q6: How is distributed training done for 100B+ parameter models?</h3><p><strong>Answer:</strong> Combination of:</p><ul><li><p><strong>Tensor parallelism</strong>: Split weight matrices across GPUs</p></li><li><p><strong>Pipeline parallelism</strong>: Split layers across GPUs, micro-batching</p></li><li><p><strong>Data parallelism</strong>: Different batches on different GPUs</p></li><li><p><strong>Mixed precision</strong>: FP16 compute, FP32 master weights</p></li><li><p><strong>Gradient checkpointing</strong>: Recompute activations to save memory</p></li></ul><p>GPT-3 likely used tensor + pipeline + data parallelism across 10,000+ GPUs.</p><div><hr></div><h3>Q7: What&#8217;s the biggest bottleneck in training large models?</h3><p><strong>Answer:</strong> <strong>Communication overhead</strong>. With model/pipeline parallelism, GPUs must constantly exchange activations and gradients. At scale:</p><ul><li><p>GPU-GPU bandwidth matters more than GPU compute</p></li><li><p>Interconnect topology is critical (NVLink, InfiniBand)</p></li><li><p>Communication can dominate total time (50%+ of wall-clock)</p></li></ul><p>This is why specialized AI clusters with high-bandwidth interconnects are essential.</p><div><hr></div><h3>Q8: Why don&#8217;t large models overfit despite having billions of parameters?</h3><p><strong>Answer:</strong> Three reasons:</p><ol><li><p><strong>Underparameterization</strong>: Even 175B params is small relative to language complexity</p></li><li><p><strong>Data diversity</strong>: Training data is so varied that memorization is hard</p></li><li><p><strong>Implicit regularization</strong>: SGD has regularization properties</p></li></ol><p>BUT: Repeated data (multiple epochs on same data) or contamination (test data in training) can cause overfitting.</p><div><hr></div><h3>Q9: What&#8217;s the estimated cost of training GPT-3?</h3><p><strong>Answer:</strong> Estimated $4-12M in compute:</p><ul><li><p>~3.14 &#215; 10^23 FLOPs</p></li><li><p>10,000+ V100 GPUs</p></li><li><p>Several weeks</p></li><li><p>Plus engineering, power, infrastructure</p></li></ul><p>GPT-4 likely cost $100M+. This is why only a few companies (OpenAI, Google, Meta, Anthropic) can train frontier models.</p><div><hr></div><h3>Q10: Are we hitting scaling limits?</h3><p><strong>Answer:</strong> Partially. Three walls:</p><ul><li><p><strong>Data wall</strong>: We&#8217;ve used most high-quality internet text (~1-2T tokens)</p></li><li><p><strong>Compute wall</strong>: Training GPT-5+ might cost $1B+</p></li><li><p><strong>Diminishing returns</strong>: 100B &#8594; 1T gives smaller gains per parameter than 10B &#8594; 100B</p></li></ul><p>Solutions: Better data curation, multimodal training, sparse models (MoE), retrieval augmentation, distillation. 
Scaling isn&#8217;t over, but pure parameter scaling alone is slowing.</p><div><hr></div><h2>&#10024; The Bigger Picture</h2><p>The scaling breakthrough revealed something profound:</p><p><strong>Intelligence scales with compute.</strong></p><p>Not linearly, not perfectly, but reliably and predictably.</p><p>This changes everything:</p><ul><li><p><strong>For research:</strong> Forecasting capabilities becomes possible</p></li><li><p><strong>For engineering:</strong> Resource allocation becomes scientific</p></li><li><p><strong>For strategy:</strong> Whoever has most compute has an advantage</p></li></ul><p>But scaling isn&#8217;t the only path forward.</p><p><strong>The next era:</strong></p><ul><li><p>Compute-optimal training (Chinchilla paradigm)</p></li><li><p>High-quality data curation</p></li><li><p>Efficient architectures</p></li><li><p>Multimodal models</p></li><li><p>Retrieval + reasoning</p></li><li><p>Smaller, specialized models</p></li></ul><p><strong>The lesson isn&#8217;t &#8220;just make it bigger.&#8221;</strong></p><p>It&#8217;s: <strong>&#8220;Scale intelligently, allocate compute optimally, and focus on data quality as much as model size.&#8221;</strong></p><div><hr></div><h2>&#128218; References &amp; Key Papers</h2><h3><strong>Foundational Scaling Papers</strong></h3><ol><li><p><strong>Kaplan, J., et al. (2020).</strong> &#8220;Scaling Laws for Neural Language Models&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2001.08361">Paper</a><br>&#128273; <em>The original scaling laws discovery - essential reading</em></p></li><li><p><strong>Hoffmann, J., et al. (2022).</strong> &#8220;Training Compute-Optimal Large Language Models&#8221; (Chinchilla)<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2203.15556">Paper</a><br>&#128273; <em>Revised scaling laws - showed models were undertrained</em></p></li><li><p><strong>Wei, J., et al. (2022).</strong> &#8220;Emergent Abilities of Large Language Models&#8221;<br><em>TMLR 2022</em><br><a href="https://arxiv.org/abs/2206.07682">Paper</a><br>&#128273; <em>Documents abilities that emerge only at scale</em></p></li></ol><h3><strong>Major LLM Papers</strong></h3><ol start="4"><li><p><strong>Brown, T., et al. (2020).</strong> &#8220;Language Models are Few-Shot Learners&#8221; (GPT-3)<br><em>NeurIPS 2020</em><br><a href="https://arxiv.org/abs/2005.14165">Paper</a><br><em>175B parameters - demonstrated scaling potential</em></p></li><li><p><strong>Touvron, H., et al. (2023).</strong> &#8220;LLaMA: Open and Efficient Foundation Language Models&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2302.13971">Paper</a><br><em>Followed Chinchilla scaling - compute-optimal approach</em></p></li><li><p><strong>Touvron, H., et al. (2023).</strong> &#8220;Llama 2: Open Foundation and Fine-Tuned Chat Models&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2307.09288">Paper</a><br><em>Extended training data to 2T tokens</em></p></li><li><p><strong>Chowdhery, A., et al. (2022).</strong> &#8220;PaLM: Scaling Language Modeling with Pathways&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2204.02311">Paper</a><br><em>Google&#8217;s 540B parameter model</em></p></li><li><p><strong>Rae, J.W., et al. 
(2021).</strong> &#8220;Scaling Language Models: Methods, Analysis &amp; Insights from Training Gopher&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2112.11446">Paper</a><br><em>280B model - pre-Chinchilla approach</em></p></li></ol><h3><strong>Training &amp; Infrastructure</strong></h3><ol start="9"><li><p><strong>Shoeybi, M., et al. (2019).</strong> &#8220;Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/1909.08053">Paper</a><br><em>Tensor parallelism for large-scale training</em></p></li><li><p><strong>Narayanan, D., et al. (2021).</strong> &#8220;Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM&#8221;<br><em>SC &#8216;21</em><br><a href="https://arxiv.org/abs/2104.04473">Paper</a><br><em>Pipeline parallelism strategies</em></p></li><li><p><strong>Rajbhandari, S., et al. (2020).</strong> &#8220;ZeRO: Memory Optimizations Toward Training Trillion Parameter Models&#8221;<br><em>SC &#8216;20</em><br><a href="https://arxiv.org/abs/1910.02054">Paper</a><br><em>Memory-efficient training - used in DeepSpeed</em></p></li></ol><h3><strong>Data &amp; Tokenization</strong></h3><ol start="12"><li><p><strong>Gao, L., et al. (2020).</strong> &#8220;The Pile: An 800GB Dataset of Diverse Text for Language Modeling&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2101.00027">Paper</a><br><em>Open pre-training dataset</em></p></li><li><p><strong>Raffel, C., et al. (2020).</strong> &#8220;Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer&#8221; (T5)<br><em>JMLR 2020</em><br><a href="https://arxiv.org/abs/1910.10683">Paper</a><br><em>C4 dataset (cleaned Common Crawl)</em></p></li><li><p><strong>Sennrich, R., Haddow, B., &amp; Birch, A. (2016).</strong> &#8220;Neural Machine Translation of Rare Words with Subword Units&#8221;<br><em>ACL 2016</em><br><a href="https://arxiv.org/abs/1508.07909">Paper</a><br><em>Byte Pair Encoding (BPE) - subword tokenization</em></p></li></ol><h3><strong>Emergent Abilities &amp; Reasoning</strong></h3><ol start="15"><li><p><strong>Wei, J., et al. (2022).</strong> &#8220;Chain-of-Thought Prompting Elicits Reasoning in Large Language Models&#8221;<br><em>NeurIPS 2022</em><br><a href="https://arxiv.org/abs/2201.11903">Paper</a><br><em>CoT reasoning - emerges with scale</em></p></li><li><p><strong>Kojima, T., et al. (2022).</strong> &#8220;Large Language Models are Zero-Shot Reasoners&#8221;<br><em>NeurIPS 2022</em><br><a href="https://arxiv.org/abs/2205.11916">Paper</a><br><em>Zero-shot CoT with &#8220;Let&#8217;s think step by step&#8221;</em></p></li></ol><h3><strong>Efficient Alternatives</strong></h3><ol start="17"><li><p><strong>Gunasekar, S., et al. (2023).</strong> &#8220;Textbooks Are All You Need&#8221; (Phi-1)<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2306.11644">Paper</a><br><em>1.3B model with high-quality data outperforms larger models</em></p></li><li><p><strong>Fedus, W., et al. (2021).</strong> &#8220;Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity&#8221;<br><em>JMLR 2021</em><br><a href="https://arxiv.org/abs/2101.03961">Paper</a><br><em>Mixture of Experts - sparse scaling</em></p></li></ol><h3><strong>Analysis &amp; Interpretability</strong></h3><ol start="19"><li><p><strong>Olsson, C., et al. 
(2022).</strong> &#8220;In-context Learning and Induction Heads&#8221;<br><em>Transformer Circuits Thread</em><br><a href="https://arxiv.org/abs/2209.11895">Paper</a><br><em>Mechanistic analysis of how models learn in-context</em></p></li><li><p><strong>Schaeffer, R., Miranda, B., &amp; Koyejo, S. (2023).</strong> &#8220;Are Emergent Abilities of Large Language Models a Mirage?&#8221;<br><em>arXiv preprint</em><br><a href="https://arxiv.org/abs/2304.15004">Paper</a><br><em>Questions whether emergence is measurement artifact</em></p></li></ol><h2>What&#8217;s Next?</h2><p>This post covered <strong>why bigger models work</strong> and <strong>how they&#8217;re trained</strong>.</p><p><strong>Next in the series:</strong></p><ul><li><p><strong>Post 4:</strong> From LLMs to Products alignment (instruction tuning, RLHF), inference optimization, and building production systems</p></li></ul><div><hr></div><p><strong>Question for you:</strong> What surprised you most about scaling laws, the predictability, the emergent abilities, or the compute requirements?</p><p>Drop a comment, I read every one.</p><div><hr></div><p><em>If this deep-dive was valuable, share it with someone learning about LLMs. This series documents the full journey from Transformers to production-ready AI systems.</em></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Inside the Transformer: Attention Mechanisms Deep Dive]]></title><description><![CDATA[Understanding What Happens Inside Each Layer]]></description><link>https://datajourney24.substack.com/p/inside-the-transformer-attention</link><guid isPermaLink="false">https://datajourney24.substack.com/p/inside-the-transformer-attention</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sun, 16 Nov 2025 17:40:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LwVs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;ll Cover</h2><p>In Post 1, we understood <strong>why</strong> Transformers emerged and the basic attention formula.</p><p>Now we&#8217;re going deeper:</p><ul><li><p>What actually happens inside a single Transformer layer?</p></li><li><p>How do attention patterns evolve across layers?</p></li><li><p>What&#8217;s the role of feed-forward networks?</p></li><li><p>How does information flow through the entire architecture?</p></li><li><p>What are the practical engineering choices that matter?</p></li></ul><p><strong>By the end, you&#8217;ll understand:</strong></p><ul><li><p>Why Transformers have residual connections everywhere</p></li><li><p>What layer normalization actually 
does</p></li><li><p>How positional information propagates</p></li><li><p>The difference between encoder and decoder attention patterns</p></li><li><p>Why certain architectural choices (like pre-norm vs post-norm) matter</p></li></ul><p>Let&#8217;s dive in.</p><div><hr></div><h2>1. Anatomy of a Transformer Layer</h2><p>Here&#8217;s what most tutorials show you:</p><pre><code><code>Input &#8594; Self-Attention &#8594; Add &amp; Norm &#8594; Feed-Forward &#8594; Add &amp; Norm &#8594; Output
</code></code></pre><p>Here&#8217;s what actually happens (and why each piece matters):</p><h3>1.1 The Complete Picture</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LwVs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LwVs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 424w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 848w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LwVs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png" width="1026" height="1158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1158,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:272848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/179064850?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LwVs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 424w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 848w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!LwVs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc85f58-2ff0-4091-b8c2-2171cb0ee7ef_1026x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>A single Transformer layer has <strong>six distinct operations</strong>:</p><pre><code><code>1. Input (from previous layer or embeddings)
2. Multi-Head Self-Attention
3. Residual Connection + Dropout
4. Layer Normalization
5. Position-wise Feed-Forward Network
6. Residual Connection + Dropout + Layer Normalization
</code></code></pre><p>Let&#8217;s break down each component and understand <strong>why it exists</strong>.</p><div><hr></div><h2>2. Self-Attention: Beyond the Formula</h2><p>In Post 1, we covered the math. Now let&#8217;s understand what it&#8217;s <strong>actually computing</strong>.</p><h3>2.1 The Three Projections: Why QKV?</h3><p>Every token starts as an embedding vector (say, 768 dimensions for BERT).</p><p>We project it into three different spaces:</p><pre><code><code>Q = input @ W_Q  # Query: &#8220;What am I searching for?&#8221;
K = input @ W_K  # Key: &#8220;What am I advertising?&#8221;
V = input @ W_V  # Value: &#8220;What content do I provide?&#8221;
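# Illustrative continuation (not from the original post): how Q, K, V combine
scores = Q @ K.T / d_k ** 0.5        # relevance of every token to every other token
weights = softmax(scores, dim=-1)    # each row sums to 1
output = weights @ V                 # weighted mix of the value vectors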
</code></code></pre><p><strong>Why separate projections?</strong></p><p>Think of it like a search engine:</p><ul><li><p><strong>Query (Q):</strong> Your search terms</p></li><li><p><strong>Key (K):</strong> Document titles/metadata</p></li><li><p><strong>Value (V):</strong> Document content</p></li></ul><p>You match Q with K (relevance), then retrieve V (content).</p><p><strong>The non-obvious insight:</strong> Q and K live in the same space (for dot product), but V can be in a completely different space. This separation is crucial for learning.</p><h3>2.2 What Attention Scores Actually Represent</h3><p>When we compute <code>score = Q &#183; K^T / &#8730;d_k</code>, we&#8217;re asking:</p><blockquote><p>&#8220;How much should token i care about token j?&#8221;</p></blockquote><p>But here&#8217;s what&#8217;s not obvious: <strong>these scores are relative, not absolute</strong>.</p><p>After softmax, the attention distribution <strong>must sum to 1</strong>. This means:</p><ul><li><p>High attention to one token &#8594; necessarily lower attention to others</p></li><li><p>Attention is a <strong>resource allocation</strong> problem</p></li><li><p>The model learns what to ignore as much as what to attend to</p></li></ul><p><strong>Example:</strong></p><pre><code><code>Sentence: &#8220;The cat sat on the mat&#8221;
Token &#8220;sat&#8221; attention: [0.05, 0.42, 0.15, 0.18, 0.08, 0.12]
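(the six weights sum to 1.0, as the softmax guarantees)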
</code></code></pre><p>The 0.42 to &#8220;cat&#8221; isn&#8217;t meaningful in isolation; it&#8217;s meaningful because it&#8217;s <strong>much higher</strong> than 0.05 to &#8220;The&#8221; and 0.08 to &#8220;the&#8221;.</p><h3>2.3 Attention Patterns Across Layers</h3><p>Here&#8217;s something researchers discovered by visualizing attention in trained models:</p><p><strong>Early layers (1-4):</strong></p><ul><li><p>Focus on local, syntactic patterns</p></li><li><p>Adjacent token attention is high</p></li><li><p>Learn basic grammar (noun-verb, determiner-noun)</p></li></ul><p><strong>Middle layers (5-8):</strong></p><ul><li><p>Learn semantic relationships</p></li><li><p>Longer-range dependencies emerge</p></li><li><p>Capture coreference, entity relationships</p></li></ul><p><strong>Late layers (9-12):</strong></p><ul><li><p>Task-specific patterns</p></li><li><p>Very focused attention (sparse patterns)</p></li><li><p>Often just propagating information</p></li></ul><p><strong>This hierarchical learning wasn&#8217;t explicitly programmed; it emerged from training.</strong></p><h3>2.4 The Mystery of Attention Heads</h3><p>In an 8-head attention setup, here&#8217;s what researchers found heads learn:</p><p><strong>Head 1:</strong> Might attend to the next token (positional)</p><p><strong>Head 2:</strong> Might attend to the previous token (positional)</p><p><strong>Head 3:</strong> Might attend to sentence boundaries</p><p><strong>Head 4:</strong> Might focus on verbs when processing subjects</p><p><strong>Head 5:</strong> Might track coreference (&#8220;it&#8221; &#8594; &#8220;cat&#8221;)</p><p><strong>Heads 6-8:</strong> Often less interpretable, learning complex patterns</p><p><strong>The controversial part:</strong> Not all heads are equally important. Some heads can be <strong>pruned</strong> with minimal performance loss.</p><p>Why keep 8 heads then? <strong>Redundancy and specialization.</strong></p><p>During training, different heads explore different patterns. By the end, some become critical, others provide insurance.</p><div><hr></div><h2>3. Layer Normalization: The Unsung Hero</h2><p>Layer normalization is often treated as a boring implementation detail. It&#8217;s not. It&#8217;s <strong>critical</strong> to making Transformers trainable.</p><h3>3.1 What It Does</h3><p>For each token, independently:</p><pre><code><code>mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
x_norm = (x - mean) / (std + epsilon)
output = gamma * x_norm + beta  # Learnable parameters
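# With x of shape [batch, seq_len, 768], mean and std come out [batch, seq_len, 1]:
# each token is normalized over its own 768 features, never across the batch or sequence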
</code></code></pre><p>This normalizes across the embedding dimension (not across the batch or sequence).</p><h3>3.2 Why It Matters</h3><p><strong>Problem without LayerNorm:</strong></p><p>As you stack layers, activations can grow or shrink dramatically. By layer 12, some dimensions might be 100x larger than others. This creates:</p><ul><li><p>Gradient instability</p></li><li><p>Difficulty in learning</p></li><li><p>Slow convergence</p></li></ul><p><strong>LayerNorm fixes this</strong> by keeping activations in a stable range.</p><h3>3.3 Pre-Norm vs Post-Norm</h3><p>This is one of those details that matters more than you&#8217;d think.</p><p><strong>Post-Norm (Original Transformer):</strong></p><pre><code><code>x = LayerNorm(x + SelfAttention(x))
x = LayerNorm(x + FFN(x))
</code></code></pre><p><strong>Pre-Norm (Modern LLMs like GPT-3):</strong></p><pre><code><code>x = x + SelfAttention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
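# Pre-norm stacks typically add one final LayerNorm after the last block (GPT-2 calls it ln_f)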
</code></code></pre><p><strong>Why Pre-Norm won:</strong></p><ol><li><p><strong>Gradient flow:</strong> Cleaner gradient path through residual connections</p></li><li><p><strong>Stability:</strong> Easier to train very deep models (100+ layers)</p></li><li><p><strong>No warm-up needed:</strong> Can use higher learning rates from the start</p></li></ol><p>GPT-3, LLaMA, and most modern LLMs use Pre-Norm.</p><div><hr></div><h2>4. Residual Connections: Why They&#8217;re Everywhere</h2><p>Every Transformer layer has <strong>two</strong> residual connections:</p><pre><code><code>x = x + SelfAttention(x)
x = x + FeedForward(x)
</code></code></pre><h3>4.1 The Gradient Superhighway</h3><p>Without residual connections, the gradient for layer 1 would need to flow through:</p><ul><li><p>12 self-attention blocks</p></li><li><p>12 feed-forward blocks</p></li><li><p>24 normalizations</p></li></ul><p>That&#8217;s 48+ operations. Gradients would vanish.</p><p><strong>With residual connections:</strong> The gradient can flow directly from output to input, bypassing all intermediate operations.</p><p>Think of it as:</p><ul><li><p><strong>Residual path:</strong> Gradient superhighway (direct route)</p></li><li><p><strong>Attention/FFN path:</strong> Side roads (optional detours)</p></li></ul><p>The model learns <strong>deltas</strong> (changes) rather than full transformations.</p><h3>4.2 What Residual Streams Actually Learn</h3><p>Here&#8217;s a mental model that helps:</p><p>Each layer adds a small update:</p><pre><code><code>Layer 1: base_representation + small_update_1
Layer 2: base_representation + small_update_1 + small_update_2
...
Layer 12: base_representation + &#931;(all updates)
</code></code></pre><p>Early layers can learn low-level features, later layers refine them, and all information is preserved through the residual stream.</p><p><strong>This is why Transformers can be so deep</strong>: each layer makes a small, additive contribution.</p><div><hr></div><h2>5. Feed-Forward Networks: The Hidden Workhorse</h2><p>After attention, every layer has a position-wise feed-forward network:</p><pre><code><code>FFN(x) = max(0, x @ W1 + b1) @ W2 + b2
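# For d_model = 768 (BERT-base): W1 is [768, 3072] and W2 is [3072, 768], the 4x expansion described below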
</code></code></pre><p>Two linear layers with a ReLU in between.</p><h3>5.1 Why Do We Need FFN After Attention?</h3><p>Attention is great at <strong>routing information</strong> between tokens. But it&#8217;s terrible at <strong>transforming</strong> that information.</p><p><strong>Attention:</strong> &#8220;Gather relevant info from other tokens&#8221; <strong>FFN:</strong> &#8220;Process and transform that gathered info&#8221;</p><p>Think of it as:</p><ul><li><p><strong>Attention:</strong> Communication between tokens</p></li><li><p><strong>FFN:</strong> Computation within each token</p></li></ul><h3>5.2 The Hidden Dimension Expansion</h3><p>Here&#8217;s a key detail: the FFN has a hidden dimension that&#8217;s <strong>4x larger</strong> than the model dimension.</p><p>For a model with d=768:</p><ul><li><p>Input: 768 dimensions</p></li><li><p>Hidden layer: 3072 dimensions (4x expansion)</p></li><li><p>Output: 768 dimensions</p></li></ul><p><strong>Why expand then compress?</strong></p><p>The expansion gives the model <strong>expressive capacity</strong>. It can compute complex, non-linear transformations in that higher-dimensional space.</p><p><strong>Analogy:</strong> It&#8217;s like spreading out your work on a large table (3072-dim space) to do complex operations, then neatly packing it back into a small box (768-dim).</p><h3>5.3 Where Parameters Live</h3><p>Here&#8217;s a surprise: <strong>Most parameters are in the FFN, not attention.</strong></p><p>For BERT-base (110M parameters):</p><ul><li><p><strong>Attention:</strong> ~25M parameters (22%)</p></li><li><p><strong>FFN:</strong> ~75M parameters (68%)</p></li><li><p><strong>Embeddings + other:</strong> ~10M parameters (10%)</p></li></ul><p>The FFN is doing most of the heavy lifting in terms of parameter count.</p><div><hr></div><h2>6. Complete Layer Flow: Putting It All Together</h2><p>Let&#8217;s trace a single token through one Transformer layer:</p><pre><code><code>1. Input: [768-dim vector]

2. Multi-Head Attention:
   - Split into 8 heads (96-dim each)
   - Each head: Q, K, V projections &#8594; attention &#8594; weighted sum
   - Concatenate 8 heads back to 768-dim
   - Output projection

3. Residual + Dropout:
   - Add input to attention output
   - Apply dropout (random zero out during training)

4. Layer Norm:
   - Normalize across 768 dimensions

5. Feed-Forward:
   - Project to 3072-dim
   - ReLU activation
   - Project back to 768-dim

6. Residual + Dropout + Layer Norm:
   - Add previous output to FFN output
   - Apply dropout
   - Normalize

7. Output: [768-dim vector] &#8594; fed into next layer
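</code></code></pre><p>As a minimal PyTorch-style sketch of the same post-norm flow (illustrative only; the class and variable names below are assumptions, not taken from the post or any particular library):</p><pre><code><code>import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                             # 1. x: [batch, seq_len, d_model]
        attn_out, _ = self.attn(x, x, x)              # 2. multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))       # 3-4. residual + dropout, then layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))    # 5-6. FFN, residual + dropout, layer norm
        return x                                      # 7. output, fed into the next layer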
</code></code></pre><p><strong>Key insight:</strong> The vector stays 768-dimensional throughout. It&#8217;s continuously being:</p><ul><li><p>Mixed with other tokens (attention)</p></li><li><p>Transformed (FFN)</p></li><li><p>Refined (layer norm)</p></li><li><p>Preserved (residual connections)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DqId!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DqId!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 424w, https://substackcdn.com/image/fetch/$s_!DqId!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 848w, https://substackcdn.com/image/fetch/$s_!DqId!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 1272w, https://substackcdn.com/image/fetch/$s_!DqId!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DqId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png" width="1270" height="5218" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:5218,&quot;width&quot;:1270,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2098128,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/179064850?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DqId!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 424w, https://substackcdn.com/image/fetch/$s_!DqId!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 848w, https://substackcdn.com/image/fetch/$s_!DqId!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DqId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d58173e-117b-49d8-9658-64dba3b23c97_1270x5218.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>7. Positional Information: How It Propagates</h2><p>In Post 1, we added positional encodings at the input. But here&#8217;s the question: <strong>how does position information survive through 12 layers?</strong></p><h3>7.1 Positional Encodings Don&#8217;t Disappear</h3><p>Once added at the input, positional information flows through:</p><ul><li><p><strong>Residual connections:</strong> Preserve the original positional signal</p></li><li><p><strong>Attention:</strong> Can learn position-dependent patterns (e.g., &#8220;pay more attention to nearby tokens&#8221;)</p></li><li><p><strong>FFN:</strong> Can condition transformations on position</p></li></ul><p><strong>The model learns to use positional information, but it&#8217;s not forced to.</strong></p><h3>7.2 Modern Alternatives: RoPE (Rotary Position Embeddings)</h3><p>Models like LLaMA use RoPE instead of sinusoidal encodings.</p><p><strong>Key difference:</strong></p><ul><li><p>Sinusoidal: Add position info to embeddings</p></li><li><p>RoPE: Rotate Q and K vectors based on position</p></li></ul><p><strong>Why RoPE is better:</strong></p><ol><li><p>Position info is <strong>baked into the attention mechanism</strong> itself</p></li><li><p>Better extrapolation to longer sequences</p></li><li><p>Relative position is more naturally represented</p></li></ol><p><strong>Formula (simplified):</strong></p><pre><code><code>Q_rotated = rotate(Q, position_m)
K_rotated = rotate(K, position_n)
attention_score = Q_rotated &#183; K_rotated^T
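# Because the rotations compose, this score depends only on the offset (m - n), not on absolute positions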
</code></code></pre><p>The dot product automatically captures relative position (m - n).</p><div><hr></div><h2>8. Encoder vs Decoder: Attention Pattern Differences</h2><h3>8.1 Encoder (BERT-style): Bidirectional Attention</h3><p><strong>Every token can attend to every other token</strong>, including future tokens.</p><pre><code><code>&#8220;The cat sat on the mat&#8221;

&#8220;cat&#8221; can attend to: [The, cat, sat, on, the, mat]
</code></code></pre><p><strong>Use case:</strong> Understanding tasks (classification, NER, Q&amp;A). You need full context to understand meaning.</p><h3>8.2 Decoder (GPT-style): Causal Attention</h3><p><strong>Token i can only attend to tokens 1...i</strong> (no peeking at future).</p><p>This is enforced via an <strong>attention mask</strong>:</p><pre><code><code>Attention mask (lower triangular):
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
1 1 1 1 0 0
1 1 1 1 1 0
1 1 1 1 1 1
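(row i = query position, column j = key position; 1 = may attend, 0 = masked)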
</code></code></pre><p>Before softmax, we set masked positions to -&#8734;, so they get zero attention.</p><p><strong>Why causal?</strong> For autoregressive generation (predicting next token), the model shouldn&#8217;t cheat by looking ahead.</p><h3>8.3 Encoder-Decoder (T5-style): Cross-Attention</h3><p><strong>Decoder attends to encoder outputs:</strong></p><pre><code><code>Encoder: Processes input bidirectionally
Decoder: 
  - Self-attention (causal) on output tokens
  - Cross-attention to encoder outputs
  - Generates output autoregressively
</code></code></pre><p><strong>Cross-attention mechanism:</strong></p><ul><li><p><strong>Q:</strong> From decoder</p></li><li><p><strong>K, V:</strong> From encoder outputs</p></li></ul><p>This allows the decoder to &#8220;look at&#8221; the input while generating output.</p><div><hr></div><h2>9. What Makes Attention &#8220;Learn&#8221;?</h2><h3>9.1 Attention is Learned, Not Programmed</h3><p>The matrices W^Q, W^K, W^V are <strong>learned through backpropagation</strong>.</p><p>Initially (random initialization):</p><ul><li><p>Attention is nearly uniform</p></li><li><p>All tokens attend equally to all others</p></li><li><p>Model is useless</p></li></ul><p>During training:</p><ul><li><p>Gradients flow through attention scores</p></li><li><p>Model learns: &#8220;When I see X, attend strongly to Y&#8221;</p></li><li><p>Useful patterns emerge</p></li></ul><p><strong>The model discovers</strong> that:</p><ul><li><p>Verbs should attend to subjects</p></li><li><p>Pronouns should attend to their referents</p></li><li><p>Adjectives should attend to nouns</p></li><li><p>etc.</p></li></ul><p>None of this is hardcoded.</p><h3>9.2 The Softmax Bottleneck</h3><p>Here&#8217;s a limitation not often discussed:</p><p>Softmax forces attention to be a <strong>probability distribution</strong> (sums to 1).</p><p>This creates a bottleneck:</p><ul><li><p>If you need to attend strongly to 5 tokens, each gets ~0.2 attention</p></li><li><p>If you need to attend to 1 token, it gets ~1.0 attention</p></li></ul><p>For very long sequences, this becomes problematic. You might need information from 10 different tokens, but softmax forces you to distribute attention thinly.</p><p><strong>Solutions in research:</strong></p><ul><li><p>Sparse attention (attend to subsets)</p></li><li><p>Multi-query attention (share K, V across heads)</p></li><li><p>Attention alternatives (Mamba, RWKV)</p></li></ul><div><hr></div><h2>10. Engineering Choices That Matter</h2><h3>10.1 Dropout Placement</h3><p>Dropout is applied in <strong>three places</strong>:</p><ol><li><p>After attention output projection</p></li><li><p>After FFN output projection</p></li><li><p>Sometimes on attention weights themselves</p></li></ol><p><strong>Why?</strong> Regularization. Prevents overfitting by randomly dropping connections during training.</p><p><strong>Typical values:</strong> 0.1 (drop 10% of activations)</p><h3>10.2 Activation Functions</h3><p><strong>Original Transformer:</strong> ReLU in FFN <strong>Modern LLMs:</strong> GELU (Gaussian Error Linear Unit) or SwiGLU</p><p><strong>Why GELU?</strong></p><ul><li><p>Smoother gradients</p></li><li><p>Better empirical performance</p></li><li><p>Used in BERT, GPT-3, etc.</p></li></ul><p><strong>Formula:</strong></p><pre><code><code>GELU(x) = x * &#934;(x)  where &#934; is Gaussian CDF
</code></code></pre><p>Approximately: <code>0.5 * x * (1 + tanh(&#8730;(2/&#960;) * (x + 0.044715 * x&#179;)))</code></p><h3>10.3 Initialization</h3><p>Getting initialization right is crucial:</p><p><strong>Xavier/Glorot initialization:</strong></p><pre><code><code>W ~ N(0, 2/(d_in + d_out))
</code></code></pre><p><strong>Why it matters:</strong></p><ul><li><p>Too small &#8594; vanishing activations</p></li><li><p>Too large &#8594; exploding activations</p></li></ul><p>Modern Transformers often use scaled initialization where deeper layers get smaller initial weights.</p><h3>10.4 Learning Rate Schedules</h3><p><strong>Warmup + Decay:</strong></p><pre><code><code>1. Linear warmup: 0 &#8594; max_lr (first 4000-10000 steps)
2. Inverse square root decay: lr &#8733; 1/&#8730;step
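Combined (the schedule from the original Transformer paper): lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)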
</code></code></pre><p><strong>Why warmup?</strong> Early in training, large gradients can destabilize the model. Warmup lets the model &#8220;settle&#8221; before full-speed training.</p><div><hr></div><h2>11. Visualizing Attention: What Works, What Doesn&#8217;t</h2><h3>11.1 Attention Heatmaps</h3><p>Common visualization: plot attention weights as a matrix.</p><p><strong>What it shows:</strong> Which tokens attend to which <strong>What it doesn&#8217;t show:</strong> What information is actually extracted</p><p><strong>Limitation:</strong> High attention &#8800; high importance for the final prediction</p><h3>11.2 Better Interpretability Methods</h3><p><strong>1. Attention Rollout</strong> Combine attention across layers to see end-to-end paths</p><p><strong>2. Gradient-based Attribution</strong> Which tokens, when changed, most affect the output?</p><p><strong>3. Probing Classifiers</strong> Train simple classifiers on layer outputs to see what information is encoded</p><p><strong>4. Causal Interventions</strong> Ablate specific attention heads and measure impact</p><div><hr></div><h2>12. Common Misconceptions Revisited</h2><h3>Misconception #1: &#8220;Each layer builds higher-level features&#8221;</h3><p><strong>Reality:</strong> Not always hierarchical. Later layers sometimes undo earlier work or route around it via residual connections.</p><h3>Misconception #2: &#8220;More heads = better&#8221;</h3><p><strong>Reality:</strong> Diminishing returns. 16 heads isn&#8217;t 2x better than 8. Some research shows 4-8 heads is a sweet spot.</p><h3>Misconception #3: &#8220;Attention does all the work&#8221;</h3><p><strong>Reality:</strong> FFN has 3x more parameters and is equally critical. Attention routes information; FFN processes it.</p><h3>Misconception #4: &#8220;Layer norm is just a regularization trick&#8221;</h3><p><strong>Reality:</strong> It&#8217;s fundamental to training stability. Without it, deep Transformers are nearly untrainable.</p><div><hr></div><h2>13. Interview Deep-Dive: Architecture Questions</h2><h3>Q1: Walk me through one forward pass of a Transformer layer.</h3><p><strong>Answer:</strong></p><ol><li><p>Input (d-dim) &#8594; Multi-head attention</p></li><li><p>Add input back (residual) &#8594; Layer norm</p></li><li><p>FFN: d &#8594; 4d &#8594; d with ReLU</p></li><li><p>Add previous output (residual) &#8594; Layer norm</p></li><li><p>Output passed to next layer</p></li></ol><p>Key: Residual connections provide gradient paths; layer norm stabilizes training.</p><div><hr></div><h3>Q2: Why do we need separate Q, K, V projections?</h3><p><strong>Answer:</strong> Attention is computing a weighted sum. Q and K determine weights (via dot product), V provides content. Separating them gives the model flexibility: relevance (Q&#183;K) and content (V) can be learned independently. If we used the same projection, attention would be symmetric and less expressive.</p><div><hr></div><h3>Q3: What&#8217;s the purpose of the FFN after attention?</h3><p><strong>Answer:</strong> Attention is linear in content (weighted sum). FFN adds non-linearity and transformation capacity. Attention routes information between tokens; FFN processes information within each token. 
Without FFN, the model would be limited to linear combinations.</p><div><hr></div><h3>Q4: Pre-norm vs post-norm, which is better and why?</h3><p><strong>Answer:</strong> Pre-norm is better for deep models:</p><ul><li><p>Cleaner gradient flow through residuals</p></li><li><p>More stable training (no warmup needed)</p></li><li><p>Used in GPT-3, LLaMA, modern LLMs</p></li></ul><p>Post-norm was the original design but struggles with very deep models (&gt;24 layers).</p><div><hr></div><h3>Q5: How does positional information propagate through layers?</h3><p><strong>Answer:</strong> Added at input, then:</p><ol><li><p>Residual connections preserve original positional encodings</p></li><li><p>Attention can learn position-dependent patterns</p></li><li><p>Model learns to use or ignore position as needed per layer</p></li></ol><p>Modern approach (RoPE): Rotate Q/K based on position, baking positional info into the attention mechanism directly.</p><div><hr></div><h3>Q6: What happens during causal masking in decoder attention?</h3><p><strong>Answer:</strong> Before softmax, set future positions to -&#8734;:</p><pre><code><code>scores = QK^T / &#8730;d_k
scores[i, j] = -&#8734; where j &gt; i  # Mask future
attention = softmax(scores)  # Future positions &#8594; 0
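# exp(-&#8734;) = 0, so masked positions get exactly zero weight while each row still sums to 1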
</code></code></pre><p>This prevents token i from attending to tokens after position i, enforcing autoregressive property.</p><div><hr></div><h3>Q7: Why is &#8730;d_k important in scaled dot-product attention?</h3><p><strong>Answer:</strong> Dot product magnitude grows with dimension. For d_k = 512, unscaled dot products can be large (&#177;50), pushing softmax into saturation (extreme outputs like 0.0001, 0.9998). This kills gradients.</p><p>Dividing by &#8730;d_k normalizes variance to ~1, keeping softmax in its &#8220;soft&#8221; regime where gradients are healthy. Critical for trainability.</p><div><hr></div><h3>Q8: How much compute does self-attention use vs FFN?</h3><p><strong>Answer:</strong> Per layer for sequence length n, model dim d:</p><ul><li><p><strong>Self-attention:</strong> O(n&#178; &#183; d) for attention matrix + O(n &#183; d&#178;) for projections</p></li><li><p><strong>FFN:</strong> O(n &#183; d&#178;) typically (d &#8594; 4d &#8594; d)</p></li></ul><p>For short sequences (n &lt; d), FFN dominates compute. For long sequences (n &gt; d), attention dominates.</p><p>In practice: FFN has 3x more parameters but attention has quadratic complexity in n.</p><div><hr></div><h3>Q9: Can you remove attention heads without hurting performance?</h3><p><strong>Answer:</strong> Yes, to some extent. Research shows:</p><ul><li><p>Some heads are redundant (10-20% can be pruned)</p></li><li><p>But most heads contribute something unique</p></li><li><p>Pruning requires careful analysis (can&#8217;t just randomly remove)</p></li><li><p>Some tasks more sensitive than others</p></li></ul><p>Suggests multi-head attention has useful redundancy but isn&#8217;t wasteful.</p><div><hr></div><h3>Q10: What&#8217;s the memory bottleneck during inference?</h3><p><strong>Answer:</strong> <strong>KV cache.</strong> For autoregressive generation:</p><ul><li><p>Store K, V for all previous tokens</p></li><li><p>At each step, attend to cached K, V</p></li></ul><p>Memory: O(n &#183; layers &#183; d) per sequence For 2K context, 32 layers, d=4096: ~1GB per request</p><p>This is why context length is expensive&#8212;it&#8217;s primarily a memory problem, not compute.</p><div><hr></div><h2>14. 
Practical Takeaways</h2><h3>For Building Systems:</h3><ol><li><p><strong>Pre-norm architecture</strong> for new models (better training stability)</p></li><li><p><strong>GELU/SwiGLU activations</strong> over ReLU (better performance)</p></li><li><p><strong>RoPE positional encoding</strong> for better extrapolation (used in LLaMA)</p></li><li><p><strong>FlashAttention</strong> for memory-efficient training (3x faster, 10x less memory)</p></li><li><p><strong>Gradient checkpointing</strong> to trade compute for memory</p></li></ol><h3>For Understanding Models:</h3><ol><li><p><strong>Attention patterns evolve</strong> across layers (syntactic &#8594; semantic &#8594; task-specific)</p></li><li><p><strong>FFN does most computation</strong> (3x more parameters than attention)</p></li><li><p><strong>Residual connections are critical</strong> for gradient flow</p></li><li><p><strong>Not all attention heads are equal</strong> (some can be pruned)</p></li><li><p><strong>Position information propagates</strong> via residuals and attention</p></li></ol><h3>For Debugging:</h3><ol><li><p><strong>Check attention entropy</strong> (low = too focused, high = too uniform)</p></li><li><p><strong>Visualize attention rollout</strong> for multi-layer paths</p></li><li><p><strong>Monitor gradient norms</strong> (residuals help, but explosions still happen)</p></li><li><p><strong>Probe intermediate layers</strong> to see what&#8217;s learned where</p></li><li><p><strong>Ablate heads/layers</strong> to find critical components</p></li></ol><div><hr></div><h2>&#10024; The Bigger Picture</h2><p>Understanding Transformer internals isn&#8217;t just academic ,it&#8217;s practical:</p><p><strong>For research:</strong></p><ul><li><p>Know what to modify (attention alternatives, FFN variants)</p></li><li><p>Understand scaling properties</p></li><li><p>Debug training issues</p></li></ul><p><strong>For engineering:</strong></p><ul><li><p>Optimize inference (KV cache, attention kernels)</p></li><li><p>Choose architectures (encoder vs decoder)</p></li><li><p>Tune hyperparameters meaningfully</p></li></ul><p><strong>For product:</strong></p><ul><li><p>Understand capabilities and limitations</p></li><li><p>Make informed model selection</p></li><li><p>Predict behavior on edge cases</p></li></ul><p>Every layer refines the representation a bit more. Every attention head captures a different pattern. Every residual connection preserves information flow.</p><p>The beauty is in how simple components compose into powerful systems.</p><div><hr></div><h2>&#128218; References &amp; Further Reading</h2><h3>&#128313; <strong>Foundational &amp; Core Attention Papers</strong></h3><ul><li><p><strong>Bahdanau et al. (2014)</strong> &#8211; <em>Neural Machine Translation by Jointly Learning to Align and Translate</em><br><a href="https://arxiv.org/abs/1409.0473">https://arxiv.org/abs/1409.0473</a></p></li><li><p><strong>Luong et al. (2015)</strong> &#8211; <em>Effective Approaches to Attention-based Neural Machine Translation</em><br><a href="https://arxiv.org/abs/1508.04025">https://arxiv.org/abs/1508.04025</a></p></li><li><p><strong>Vaswani et al. 
(2017)</strong> &#8211; <em>Attention Is All You Need</em> (for multi-head attention formalization)<br><a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p></li></ul><div><hr></div><h3>&#128313; <strong>Technical Deep Dives &amp; Visual Guides</strong></h3><ul><li><p><strong>Jay Alammar &#8211; The Illustrated Attention</strong><br><a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanisms-and-attention/">https://jalammar.github.io/visualizing-neural-machine-translation-mechanisms-and-attention/</a></p></li><li><p><strong>The Illustrated Transformer (Attention section)</strong><br><a href="https://jalammar.github.io/illustrated-transformer/">https://jalammar.github.io/illustrated-transformer/</a></p></li><li><p><strong>Lilian Weng &#8211; Attention? Attention!</strong><br><a href="https://lilianweng.github.io/posts/2018-06-24-attention/">https://lilianweng.github.io/posts/2018-06-24-attention/</a></p></li><li><p><strong>Harvard NLP &#8211; Annotated Transformer (Attention code walkthrough)</strong><br><a href="http://nlp.seas.harvard.edu/annotated-transformer/">http://nlp.seas.harvard.edu/annotated-transformer/</a></p></li><li><p><strong>Peter Bloem &#8211; Transformers from Scratch (detailed math on attention)</strong><br><a href="https://peterbloem.nl/blog/transformers">https://peterbloem.nl/blog/transformers</a></p></li></ul><div><hr></div><h3>&#128313; <strong>Research &amp; Variants of Attention</strong></h3><ul><li><p><strong>Sparse Transformers (OpenAI, 2019)</strong><br><a href="https://arxiv.org/abs/1904.10509">https://arxiv.org/abs/1904.10509</a></p></li><li><p><strong>Performer: Linear Attention (Choromanski et al., 2020)</strong><br><a href="https://arxiv.org/abs/2009.14794">https://arxiv.org/abs/2009.14794</a></p></li><li><p><strong>Longformer (Beltagy et al., 2020)</strong> &#8211; Local + Global attention pattern<br><a href="https://arxiv.org/abs/2004.05150">https://arxiv.org/abs/2004.05150</a></p></li><li><p><strong>Linformer (Wang et al., 2020)</strong> &#8211; Low-rank self-attention<br><a href="https://arxiv.org/abs/2006.04768">https://arxiv.org/abs/2006.04768</a></p></li></ul><div><hr></div><h3>&#128313; <strong>Videos &amp; Talks</strong></h3><ul><li><p><strong>Yannic Kilcher &#8211; Attention Mechanisms Explained</strong></p></li></ul><ul><li><p><strong>Andrew Ng &#8211; Self-Attention Explanation (DeepLearning.AI)</strong></p></li></ul><ul><li><p><strong>MIT 6.S191 &#8211; Lecture on Attention Mechanisms</strong></p></li></ul><ul><li><p><strong>Karpathy &#8211; &#8220;Let&#8217;s Build Attention From Scratch&#8221; (implicit in GPT lecture)</strong></p></li></ul><div><hr></div><h1>What&#8217;s Next?</h1><p>This post covered <strong>what happens inside a Transformer</strong>. </p><p>Next in the series:</p><ul><li><p><strong>Post 3:</strong> Scaling Laws &amp; Training LLMs</p></li><li><p><strong>Post 4:</strong> Alignment &amp; Production</p></li></ul><div><hr></div><p><em>If this deep-dive was valuable, share it with someone learning ML. This series documents everything I wish I understood when building with Transformers.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[🧠 The Need for Transformers]]></title><description><![CDATA[How Attention Revolutionized Deep Learning]]></description><link>https://datajourney24.substack.com/p/the-need-for-transformers</link><guid isPermaLink="false">https://datajourney24.substack.com/p/the-need-for-transformers</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sun, 02 Nov 2025 07:52:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LkAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. The Breaking Point: When RNNs Hit the Wall</h2><p>For years, sequence modeling was ruled by <strong>RNNs</strong> and <strong>LSTMs</strong>. They were the go-to models for text, speech, and time-series data, anything where order mattered.</p><p>The idea behind them was simple but clever: process data <strong>one step at a time</strong>, and pass information forward through a hidden state. This way, the model could &#8220;remember&#8221; previous inputs as it read new ones.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>It worked well for short sequences. But the cracks appeared quickly.</p><h3>The Real Problems </h3><p><strong>1. Vanishing/Exploding Gradients</strong> - the famous one everyone talks about. But here&#8217;s what matters practically: Even with gradient clipping and LSTMs, you&#8217;re still fighting an uphill battle. Information from token 1 has to survive 100+ sequential transformations to influence token 100. That&#8217;s a game of telephone with exponential decay.</p><p><strong>2. Sequential Bottleneck</strong> - this is the killer. Every step waits for the previous one. Your GPU sits there, mostly idle, processing one token at a time. It&#8217;s like having a 100-lane highway but being forced to drive single-file.</p><p><strong>3. The Hidden State Compression Problem</strong>- here&#8217;s the intuition nobody tells you:</p><blockquote><p>Imagine I tell you a story and ask: &#8220;Now summarize everything important in exactly 512 numbers.&#8221; Then I add more story. &#8220;Okay, still 512 numbers. 
Don&#8217;t forget the beginning!&#8221;</p><p>That&#8217;s what we asked RNNs to do.</p></blockquote><p>LSTMs added &#8220;gates&#8221; - like giving you permission to forget certain things. Better, but still fundamentally a lossy compression game.</p><h3>The Insight That Changed Everything</h3><p>In 2014, Bahdanau introduced attention for neural machine translation. The key insight wasn&#8217;t the math - it was the <strong>question</strong>:</p><blockquote><p>&#8220;Why compress the entire source sentence into one vector when the decoder can just look back and grab what it needs?&#8221;</p></blockquote><p>It&#8217;s the difference between:</p><ul><li><p>Taking notes on a book, then writing an essay from memory (RNN)</p></li><li><p>Writing an essay with the book open, referencing specific passages (Attention)</p></li></ul><p>But they still used RNNs to process the sequence sequentially.</p><p>In 2017, Vaswani et al. asked the radical question:</p><blockquote><p>&#8220;What if we throw out recurrence entirely and use <em>only</em> attention?&#8221;</p></blockquote><p>That paper  &#8220;Attention Is All You Need&#8221; became the most cited AI paper of the decade.</p><div><hr></div><h2>2. Architecture: Self-Attention Under the Hood</h2><p>Let me show you what actually happens inside a Transformer, with the intuition first, math second.</p><h3>2.1 The Core Idea: Attention as Database Lookup</h3><p>Think of self-attention as a <strong>differentiable database query</strong>.</p><p>Every token in your sequence is simultaneously:</p><ul><li><p><strong>A query</strong> asking: &#8220;What information do I need?&#8221;</p></li><li><p><strong>A key</strong> announcing: &#8220;I contain this type of information&#8221;</p></li><li><p><strong>A value</strong> holding: &#8220;Here&#8217;s my actual content&#8221;</p></li></ul><p>When processing the word &#8220;bank&#8221; in &#8220;I withdrew money from the bank&#8221;, the token:</p><ul><li><p><strong>Queries</strong> for context about transactions, finance</p></li><li><p><strong>Keys</strong> from nearby tokens like &#8220;money&#8221; and &#8220;withdrew&#8221; light up</p></li><li><p><strong>Values</strong> from those tokens flow into &#8220;bank&#8221;&#8217;s new representation</p></li></ul><p>The genius: <strong>every token queries every other token simultaneously</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LkAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LkAO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 424w, https://substackcdn.com/image/fetch/$s_!LkAO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 848w, https://substackcdn.com/image/fetch/$s_!LkAO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LkAO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LkAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png" width="728" height="1528.8" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2898,&quot;width&quot;:1380,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:639563,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/177778383?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LkAO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 424w, https://substackcdn.com/image/fetch/$s_!LkAO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 848w, https://substackcdn.com/image/fetch/$s_!LkAO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 1272w, https://substackcdn.com/image/fetch/$s_!LkAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57018508-6623-4523-9a8e-9a398744dc6e_1380x2898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 
15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>2.2 The Math (Now That You Get It)</h3><p>For each token, we create three vectors via learned projections:</p><p><strong>Query (Q):</strong> What am I looking for? <strong>Key (K):</strong> What do I contain?<br><strong>Value (V):</strong> What information do I carry?</p><p>Compute relevance scores between all query-key pairs:</p><pre><code><code>Score(Q_i, K_j) = Q_i &#183; K_j
</code></code></pre><p>Scale to prevent saturation (critical for training stability):</p><pre><code><code>Scaled Score = (Q_i K_j^T) / &#8730;d_k
</code></code></pre><p>Why divide by &#8730;d_k? Because dot products grow with dimensionality. Without scaling, softmax gets extreme values (0.00001, 0.00001, 0.99998) instead of smooth distributions. This kills gradient flow.</p><p>Apply softmax to get attention distribution:</p><pre><code><code>Attention Weights = softmax(QK^T / &#8730;d_k)
</code></code></pre><p>Compute weighted sum of values:</p><pre><code><code>Self-Attention(Q, K, V) = softmax(QK^T / &#8730;d_k)V
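</code></code></pre><p>A minimal NumPy sketch of this computation (illustrative only; the sequence length, head dimension, and random inputs below are assumptions, not values from the paper):</p><pre><code><code>import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n_tokens, d_k) matrices of query / key / value vectors
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n, n) relevance scores
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (n, d_k) updated token vectors

n, d_k = 6, 8                                      # e.g. the six tokens of "The cat sat on the mat"
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d_k))             # stand-ins for learned Q/K/V projections
out = scaled_dot_product_attention(Q, K, V)        # every token updated in one shot
</code></code></pre>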
<p>All tokens are processed in parallel, in one massive matrix multiplication.</p><h3>2.3 Visual: What Attention Actually Looks Like</h3><pre><code><code>Input: &#8220;The cat sat on the mat&#8221;

Token: &#8220;sat&#8221;
&#9500;&#9472; High attention to: &#8220;cat&#8221; (subject), &#8220;mat&#8221; (location)
&#9500;&#9472; Medium attention to: &#8220;on&#8221;, &#8220;the&#8221;
&#9492;&#9472; Low attention to: &#8220;The&#8221; (first token)

Token: &#8220;mat&#8221;  
&#9500;&#9472; High attention to: &#8220;sat&#8221; (action), &#8220;on&#8221; (relation)
&#9500;&#9472; Medium attention to: &#8220;the&#8221; (determiner)
&#9492;&#9472; Low attention to: &#8220;The&#8221;, &#8220;cat&#8221;
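
(Each profile above is one row of the softmax(QK^T / &#8730;d_k) matrix; the exact
numbers depend on the trained weights, so this is an illustrative pattern.)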
</code></code></pre><p>Each token builds a new representation by <strong>pulling information</strong> from relevant tokens, weighted by attention scores.</p><h3>2.4 Multi-Head Attention: Why One Attention Isn&#8217;t Enough</h3><p>Here&#8217;s the non-obvious insight: <strong>different types of relationships matter simultaneously</strong>.</p><p>Consider: &#8220;The chef who runs the restaurant cooked the meal&#8221;</p><p>You need to track:</p><ul><li><p><strong>Syntactic structure</strong>: &#8220;who&#8221; refers to &#8220;chef&#8221;, not &#8220;restaurant&#8221;</p></li><li><p><strong>Semantic roles</strong>: &#8220;chef&#8221; is the agent, &#8220;meal&#8221; is the object</p></li><li><p><strong>Long-range dependencies</strong>: &#8220;cooked&#8221; connects to &#8220;chef&#8221; across 5 words</p></li><li><p><strong>Local context</strong>: &#8220;the restaurant&#8221; is a noun phrase unit</p></li></ul><p>Single attention can&#8217;t capture all these patterns optimally.</p><p><strong>Solution:</strong> Run <strong>h</strong> attention operations in parallel (typically 8-16 heads).</p><pre><code><code>MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
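
Shape bookkeeping (illustrative numbers from the original paper): with
d_model = 512 and h = 8 heads, each W_i^Q, W_i^K, W_i^V projects 512 -> 64,
so every head attends within its own 64-dimensional subspace; Concat
restores 512 dimensions before the final output projection W^O (512 x 512).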
</code></code></pre><p>Each head learns different relationship patterns:</p><ul><li><p>Head 1: Subject-verb relationships</p></li><li><p>Head 2: Noun-modifier pairs</p></li><li><p>Head 3: Long-range dependencies</p></li><li><p>Head 4: Positional/sequential patterns</p></li><li><p>...and so on</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOWq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOWq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 424w, https://substackcdn.com/image/fetch/$s_!xOWq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 848w, https://substackcdn.com/image/fetch/$s_!xOWq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 1272w, https://substackcdn.com/image/fetch/$s_!xOWq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xOWq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png" width="1380" height="2851" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2851,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1946976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/177778383?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8f37c1d-5ecf-4660-ac57-3ca09bf0ff5d_1380x3036.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOWq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 424w, https://substackcdn.com/image/fetch/$s_!xOWq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 848w, https://substackcdn.com/image/fetch/$s_!xOWq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xOWq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b73ac06-9086-4ac7-bcad-4bcf169a65e3_1380x2851.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>2.5 Positional Encoding: Teaching Order Without Recurrence</h3><p><strong>Problem:</strong> Self-attention is permutation-invariant. &#8220;Dog bites man&#8221; and &#8220;Man bites dog&#8221; produce identical attention patterns.</p><p><strong>Solution:</strong> Inject position information directly into embeddings.</p><p>The original paper used sinusoidal encodings:</p><pre><code><code>PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
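</code></code></pre><p>A short NumPy sketch of these sinusoids (illustrative sizes; the max_len and d_model values here are assumptions, and d_model is taken to be even):</p><pre><code><code>import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # one row per position; even dimensions use sin, odd dimensions use cos
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)
# pe is simply added to the token embeddings before the first layer
</code></code></pre>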
<p>Why sinusoids? Two clever properties:</p><ol><li><p><strong>Relative positions</strong>: PE(pos+k) can be expressed as a linear function of PE(pos)</p></li><li><p><strong>Unbounded length</strong>: Works for any sequence length, no training needed</p></li></ol><p>Modern models often use <strong>learned positional embeddings</strong> (GPT) or <strong>rotary embeddings</strong> (RoPE in LLaMA), which have better extrapolation properties.</p><div><hr></div><h2>3. Why This Architecture Won</h2><p>Let me tell you what actually mattered for Transformers&#8217; success, and it&#8217;s not what most people emphasize.</p><h3>Parallelization: The GPU Unlock</h3><p><strong>RNN/LSTM:</strong></p><pre><code><code>Step 1: Process token 1  [GPU: 5% utilized]
Step 2: Process token 2  [GPU: 5% utilized]  
Step 3: Process token 3  [GPU: 5% utilized]
...
Step 512: Process token 512 [GPU: 5% utilized]
</code></code></pre><p><strong>Transformer:</strong></p><pre><code><code>Step 1: Process ALL 512 tokens simultaneously [GPU: 95% utilized]
</code></code></pre><p>This isn&#8217;t just faster  it&#8217;s <strong>2-3 orders of magnitude faster</strong> for long sequences. This is what made GPT-3 (175B parameters) feasible to train.</p><h3> Global Context: See Everything, Attend to What Matters</h3><p>RNNs forced information through a bottleneck. Transformers let every token <strong>directly access</strong> every other token.</p><p>In &#8220;The trophy doesn&#8217;t fit in the suitcase because it&#8217;s too big&#8221;:</p><ul><li><p>LSTM struggles to connect &#8220;it&#8221; &#8594; &#8220;trophy&#8221; across 7 tokens</p></li><li><p>Transformer directly computes attention between &#8220;it&#8221; and both &#8220;trophy&#8221; and &#8220;suitcase&#8221;</p></li></ul><p>The model learns &#8220;big&#8221; + &#8220;doesn&#8217;t fit&#8221; &#8594; probably referring to trophy, not suitcase.</p><h3>Engineering Beauty: Why Systems Engineers Love Transformers</h3><ol><li><p><strong>Stateless:</strong> No hidden state to serialize/deserialize between steps</p></li><li><p><strong>Cacheable:</strong> In autoregressive generation, previous token representations are cached (KV cache)</p></li><li><p><strong>Analyzable:</strong> Attention weights are interpretable- you can visualize what the model &#8220;looks at&#8221;</p></li><li><p><strong>Modular:</strong> Easy to swap encoders/decoders, add/remove layers, change attention patterns</p></li></ol><div><hr></div><h2>4. The Complexity Trade-off (And Why We Accept It)</h2><h3>The O(n&#178;) Elephant in the Room</h3><p>Self-attention computes interactions between <strong>all pairs of tokens</strong>:</p><ul><li><p>Sequence length 512: 262,144 interactions</p></li><li><p>Sequence length 2048: 4,194,304 interactions</p></li><li><p>Sequence length 8192: 67,108,864 interactions</p></li></ul><p><strong>Complexity:</strong> O(n&#178; &#183; d) time, O(n&#178;) memory</p><p>For context: RNN is O(n &#183; d&#178;) - linear in sequence length, quadratic in dimension.</p><p>So why did we accept quadratic complexity?</p><p><strong>Three reasons:</strong></p><ol><li><p><strong>GPUs love matrix multiplication</strong> : O(n&#178;) on a GPU is often faster than O(n) on a CPU</p></li><li><p><strong>Most NLP tasks used short sequences</strong> (&#8804;512 tokens) where n&#178; wasn&#8217;t prohibitive</p></li><li><p><strong>The performance gain was massive</strong> - quadratic cost, 10x better accuracy</p></li></ol><h3>Modern Solutions</h3><p>When quadratic became a problem (long documents, DNA sequences, code):</p><p><strong>Sparse Attention</strong> (Longformer, BigBird): Only attend to local neighbors + global tokens + random samples</p><ul><li><p>Reduces complexity to O(n &#183; k) where k &lt;&lt; n</p></li><li><p>Loses some global context</p></li></ul><p><strong>Linear Attention</strong> (Performer, Linformer):<br>Approximate softmax(QK^T)V with lower-rank operations</p><ul><li><p>O(n) complexity</p></li><li><p>Slight accuracy drop</p></li></ul><p><strong>FlashAttention</strong> (2022): Don&#8217;t change the algorithm , optimize GPU memory access patterns</p><ul><li><p>Same O(n&#178;) complexity</p></li><li><p>3x faster, 10x less memory</p></li><li><p>This is what powers 100K+ context windows today</p></li></ul><div><hr></div><h2>5. Interview Deep-Dive: Questions That Matter</h2><h3>Q1. 
Why did RNNs struggle with long-term dependencies?</h3><p><strong>Surface answer:</strong> Vanishing gradients.</p><p><strong>Deep answer:</strong> Sequential processing creates a <strong>gradient path</strong> of length n. Even with careful initialization and gating (LSTM), each step multiplies by a matrix. After 100+ steps, either:</p><ul><li><p>Products converge to zero (vanishing)</p></li><li><p>Products explode (unbounded)</p></li></ul><p>The gradient w.r.t. token 1 has to flow through 100+ matrix multiplications. Attention creates <strong>direct paths</strong> - gradient flows in O(1) steps regardless of distance.</p><div><hr></div><h3>Q2. What&#8217;s the intuition behind Q, K, V?</h3><p><strong>Analogy:</strong> Search engine.</p><ul><li><p><strong>Query (Q):</strong> Your search terms , what you&#8217;re looking for</p></li><li><p><strong>Key (K):</strong> Document titles/metadata , what each document is about</p></li><li><p><strong>Value (V):</strong> Document content , actual information you retrieve</p></li></ul><p>You compute relevance (Q&#183;K), rank results (softmax), and retrieve content (weighted V).</p><p>Every token is simultaneously searching and being searched.</p><div><hr></div><h3>Q3. Why divide by &#8730;d_k in scaled dot-product attention?</h3><p><strong>Surface answer:</strong> To prevent large dot products.</p><p><strong>The real reason:</strong> Dot product magnitude grows with dimensionality.</p><p>If Q and K are unit-variance, Q&#183;K has variance d_k. For d_k = 512, typical dot products are in range [-50, 50]. After softmax, you get extreme distributions: (0.00001, 0.99998, 0.00001)</p><p>This creates two problems:</p><ol><li><p><strong>Saturation:</strong> Softmax derivatives &#8594; 0, killing gradients</p></li><li><p><strong>Instability:</strong> Small input changes cause massive output swings</p></li></ol><p>Dividing by &#8730;d_k normalizes variance back to 1, keeping softmax in the &#8220;soft&#8221; regime where gradients are healthy.</p><div><hr></div><h3>Q4. How do Transformers enable parallel computation?</h3><p><strong>Key insight:</strong> Attention is a <strong>three-matrix multiplication</strong> problem.</p><pre><code><code>Attention = softmax(QK^T / &#8730;d_k) &#183; V
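
(With batch size b and h heads, the scores form one (b, h, n, n) tensor,
still a single batched matmul; shapes shown here are illustrative.)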
</code></code></pre><ul><li><p>QK^T: (n &#215; d) &#183; (d &#215; n) &#8594; (n &#215; n) attention matrix</p></li><li><p>softmax: element-wise, fully parallelizable</p></li><li><p>Attention &#183; V: (n &#215; n) &#183; (n &#215; d) &#8594; (n &#215; d) output</p></li></ul><p>All token interactions computed in <strong>one batched operation</strong>. RNNs required n sequential steps.</p><p>Modern GPUs do matrix multiplication at 200+ TFLOPS . Transformers exploit this perfectly.</p><div><hr></div><h3>Q5. What&#8217;s the difference between encoder-only and decoder-only Transformers?</h3><p><strong>Encoder-only (BERT):</strong></p><ul><li><p>Bidirectional attention - each token sees past AND future</p></li><li><p>Good for: classification, NER, Q&amp;A (understanding tasks)</p></li><li><p>Training: Masked language modeling (predict random masked tokens)</p></li></ul><p><strong>Decoder-only (GPT):</strong></p><ul><li><p>Causal attention - token i can only see tokens 1...i (via attention mask)</p></li><li><p>Good for: text generation, completion (generative tasks)</p></li><li><p>Training: Next token prediction (autoregressive language modeling)</p></li></ul><p><strong>Encoder-Decoder (T5, BART):</strong></p><ul><li><p>Encoder: bidirectional on input</p></li><li><p>Decoder: causal, cross-attends to encoder outputs</p></li><li><p>Good for: translation, summarization (seq2seq tasks)</p></li></ul><div><hr></div><h3>Q6. What&#8217;s the main bottleneck of Transformers?</h3><p><strong>Training:</strong> Compute (O(n&#178; &#183; d) attention + O(n &#183; d&#178;) FFN) <strong>Inference:</strong> Memory for KV cache</p><p>At inference, we cache K and V for all previous tokens. For 8K context, 32 layers, d=4096: ~2GB per request. This is why &#8220;context length&#8221; is expensive - it&#8217;s mostly a memory problem.</p><div><hr></div><h3>Q7. Why do we need positional encoding?</h3><p>Self-attention is a <strong>set operation</strong> - order-invariant.</p><p>Without positional info:</p><ul><li><p>&#8220;Dog bites man&#8221; = &#8220;Man bites dog&#8221;</p></li><li><p>&#8220;Not bad&#8221; = &#8220;Bad not&#8221;</p></li></ul><p>Positional encoding adds <strong>order signal</strong> directly to embeddings, so the model can learn position-dependent patterns.</p><p>Why not just use token position as a feature? Because:</p><ol><li><p>Absolute position isn&#8217;t what matters - &#8220;third word&#8221; means nothing</p></li><li><p>Relative position matters more distance and direction between tokens</p></li><li><p>Sinusoidal encoding captures relative position implicitly via phase relationships</p></li></ol><div><hr></div><h3>Q8. 
How do you handle sequences longer than training length?</h3><p><strong>Problem:</strong> Train on 512 tokens, inference on 2048 tokens.</p><p><strong>Solutions:</strong></p><ol><li><p><strong>Sinusoidal PE:</strong> Extrapolates naturally (original Transformer)</p></li><li><p><strong>Learned PE:</strong> Interpolate embeddings (okay but degraded)</p></li><li><p><strong>ALiBi:</strong> Bias attention by relative distance (no explicit encoding)</p></li><li><p><strong>RoPE:</strong> Rotate Q,K based on position (used in LLaMA, best extrapolation)</p></li></ol><p>Modern long-context models (32K, 100K+) use RoPE + careful finetuning on longer sequences.</p><div><hr></div><h2>The Bigger Picture</h2><p>Transformers didn&#8217;t just improve NLP - they <strong>unified sequence modeling</strong> across domains.</p><p><strong>Same architecture</strong>, different data:</p><ul><li><p>Text &#8594; GPT, BERT, T5</p></li><li><p>Images &#8594; Vision Transformer (ViT)</p></li><li><p>Audio &#8594; Whisper, AudioLM</p></li><li><p>Video &#8594; VideoGPT, Phenaki</p></li><li><p>Molecules &#8594; AlphaFold (protein structures)</p></li><li><p>Code &#8594; Codex, GitHub Copilot</p></li><li><p>Multimodal &#8594; CLIP, Flamingo, GPT-4</p></li></ul><p>The insight: <strong>Everything can be tokenized into sequences</strong>. And attention is a universal way to model relationships.</p><div><hr></div><h2>&#128218; <strong>References &amp; Further Reading</strong></h2><p>Here are some high-quality papers, articles, and visual guides to explore if you want to go deeper:</p><h3>&#128313; <strong>Foundational Papers</strong></h3><ul><li><p><strong>Vaswani et al. (2017)</strong> &#8211; <em>&#8220;<a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need&#8221;</a></em><a href="https://arxiv.org/abs/1706.03762">, NeurIPS 2017</a></p></li><li><p><strong>Bahdanau et al. (2014)</strong> &#8211; <em>&#8220;<a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate&#8221;</a></em></p></li><li><p><strong>Hochreiter &amp; Schmidhuber (1997)</strong> &#8211; <em>&#8220;Long Short-Term Memory&#8221;</em><br><a href="https://www.bioinf.jku.at/publications/older/2604.pdf">https://www.bioinf.jku.at/publications/older/2604.pdf</a></p></li></ul><h3>&#128313; <strong>Technical Deep Dives</strong></h3><ul><li><p><a href="https://jalammar.github.io/illustrated-transformer/">Jay Alammar &#8211; </a><em><a href="https://jalammar.github.io/illustrated-transformer/">&#8220;The Illustrated Transformer&#8221;</a></em></p></li><li><p><a href="https://lilianweng.github.io/posts/2018-06-24-attention/">Lilian Weng &#8211; </a><em><a href="https://lilianweng.github.io/posts/2018-06-24-attention/">&#8220;Attention? 
Attention!&#8221;</a></em></p></li><li><p><a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">Harvard NLP &#8211; </a><em><a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">&#8220;Annotated Transformer (Tensor2Tensor Implementation)&#8221;</a></em></p></li></ul><h3>&#128313; <strong>Videos &amp; Talks</strong></h3><ul><li><p>Yannic Kilcher &#8211; <em>&#8220;Attention Is All You Need &#8211; Paper Explained&#8221;</em> (YouTube)</p></li><li><p>Andrej Karpathy &#8211; <em>&#8220;Let&#8217;s build GPT from scratch&#8221;</em> (YouTube, 2023)</p></li><li><p>DeepLearning.AI &#8211; <em>&#8220;Transformers Explained&#8221;</em> short course by Andrew Ng</p></li></ul><div><hr></div><h2>What&#8217;s Next?</h2><p>This post covered <strong>why</strong> Transformers emerged and <strong>what</strong> makes them tick.</p><p><strong>Next in the series:</strong></p><ul><li><p><strong>Post 2:</strong> Deep dive into attention mechanisms  visualizing heads, understanding learned patterns</p></li><li><p><strong>Post 3:</strong> Scaling laws and emergent abilities why bigger models suddenly get qualitatively smarter</p></li><li><p><strong>Post 4:</strong> From Transformers to LLMs  training objectives, instruction tuning, RLHF</p></li></ul><p><strong>Question for you:</strong> What was the &#8220;aha!&#8221; moment that made Transformers click for you? Drop a comment . I read every one.</p><p><em>If you found this valuable, share it with someone learning ML. This series is my attempt to document everything I wish I knew when I started building with Transformers.</em></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[🪆 Matryoshka Embeddings: Russian Dolls for AI]]></title><description><![CDATA[When we think of embeddings, one trade-off always comes up:]]></description><link>https://datajourney24.substack.com/p/matryoshka-embeddings-russian-dolls</link><guid isPermaLink="false">https://datajourney24.substack.com/p/matryoshka-embeddings-russian-dolls</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Tue, 19 Aug 2025 10:55:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7c56accb-7642-4d5b-9f4f-203d026f7a35_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we think of embeddings, one trade-off always comes up:</p><ul><li><p>High-dimensional embeddings (like 768-d vectors from BERT) capture a lot of nuance, but they&#8217;re expensive to store, index, and search.</p></li><li><p>Low-dimensional embeddings (say 64-d) are fast and lightweight, but they lose critical meaning.</p></li></ul><p>In large-scale systems like recommendation engines, semantic search, and retrieval-augmented generation (RAG) this trade-off becomes painful. 
You either <strong>pay for accuracy</strong> or <strong>settle for efficiency</strong>.</p><p>But what if you didn&#8217;t have to choose?</p><p>That&#8217;s the promise of <strong>Matryoshka embeddings</strong>.</p><div><hr></div><h2>The Core Idea</h2><p>The concept comes from the 2022 paper <em>Matryoshka Representation Learning</em> (Kusupati et al.), and Hugging Face recently popularized it with blogs and open-source models.</p><p>The key insight: <strong>train embeddings so that any prefix (first N dimensions) of the vector remains useful.</strong></p><p>That means:</p><ul><li><p>A 64-d slice can already capture meaningful structure.</p></li><li><p>Expanding to 128-d improves accuracy further.</p></li><li><p>The full 768-d captures the richest semantics.</p></li></ul><p>Each smaller embedding is <em>nested</em> inside the larger one - just like Russian dolls &#129670;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Set-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Set-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Set-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Set-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!Set-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Set-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png" width="1456" height="1269" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1269,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172957,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/171359257?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Set-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Set-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Set-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!Set-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecf256e-d721-4f8d-aaec-0b3cc0d16a4c_1739x1516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div><hr></div><h2>Why It Matters</h2><p>Matryoshka embeddings unlock some powerful practical benefits:</p><ol><li><p><strong>Scalable Search</strong></p><ul><li><p>Billions of embeddings can be stored and searched faster using only 64-d vectors for the first-pass retrieval.</p></li></ul></li><li><p><strong>Flexible Trade-offs</strong></p><ul><li><p>Edge devices can work with 64-d or 128-d slices (smaller memory footprint).</p></li><li><p>Cloud servers can afford the full 768-d reranking.</p></li></ul></li><li><p><strong>Unified Pipeline</strong></p><ul><li><p>You don&#8217;t need to train multiple embedding models for different dimensional needs.</p></li><li><p>One model serves all scenarios.</p></li></ul></li></ol><div><hr></div><h2>System Design Perspective</h2><p>Let&#8217;s imagine we&#8217;re building a <strong>semantic search engine</strong>.</p><ul><li><p><strong>Step 1:</strong> Generate a query embedding. 
Use the <strong>64-d slice</strong> to quickly retrieve top-100 candidates from a huge database using approximate nearest neighbor (ANN) search.</p></li><li><p><strong>Step 2:</strong> For this shortlist, expand the embeddings to <strong>768-d</strong>.</p></li><li><p><strong>Step 3:</strong> Rerank candidates with maximum semantic accuracy.</p></li></ul><p>This gives the <strong>best of both worlds</strong>: speed at scale + accuracy where it matters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Omis!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Omis!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 424w, https://substackcdn.com/image/fetch/$s_!Omis!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 848w, https://substackcdn.com/image/fetch/$s_!Omis!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 1272w, https://substackcdn.com/image/fetch/$s_!Omis!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Omis!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png" width="1456" height="784" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77065,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/171359257?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Omis!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 424w, https://substackcdn.com/image/fetch/$s_!Omis!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 848w, 
https://substackcdn.com/image/fetch/$s_!Omis!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 1272w, https://substackcdn.com/image/fetch/$s_!Omis!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2a0f2d4-64db-45d4-a5b2-46217c5acd5e_2385x1285.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>How Is This Different From PCA?</h2><p>You might wonder: <em>&#8220;Couldn&#8217;t we just do PCA on a 768-d embedding and truncate?&#8221;</em></p><p>Here&#8217;s the difference:</p><ul><li><p>PCA reduces dimensions <strong>after training</strong>, often losing semantic power.</p></li><li><p>Matryoshka embeddings are trained <strong>end-to-end</strong> so that <em>every slice is semantically meaningful</em>.</p></li></ul><p>That makes a huge difference in downstream tasks.</p><div><hr></div><h2>Russian Dolls in AI&#8230; and in LeetCode</h2><p>The name &#8220;Matryoshka&#8221; comes from Russian dolls - smaller dolls neatly fitting inside larger ones.</p><p>This analogy isn&#8217;t just cute; it&#8217;s actually accurate. Each smaller embedding &#8220;fits&#8221; inside the larger one, without losing identity.</p><p>Fun fact: there&#8217;s even a <strong>LeetCode problem (#354, Russian Doll Envelopes)</strong> where envelopes must nest inside each other. 
In a way, Matryoshka embeddings are the <em>vector-space cousin</em> of that puzzle.</p><div><hr></div><h2>Hugging Face&#8217;s Role</h2><p>While the paper came out in 2022, Hugging Face helped bring Matryoshka embeddings into the mainstream by:</p><ul><li><p>Publishing a detailed blog post</p></li><li><p>Releasing open-source implementations</p></li><li><p>Hosting pretrained models on the Hub</p></li></ul><p>This combination of <strong>research + tooling + accessibility</strong> is what often pushes ideas into practical adoption.</p><div><hr></div><h2>Closing Thoughts</h2><p>Matryoshka embeddings are a simple yet powerful idea:</p><ul><li><p>Train vectors so that smaller prefixes still hold semantic meaning.</p></li><li><p>Use them to balance speed and accuracy flexibly.</p></li><li><p>Apply them in search, recommendations, and retrieval-augmented generation.</p></li></ul><p>It&#8217;s one of those elegant ideas where a metaphor (Russian dolls &#129670;) really matches the math.</p><p>I expect we&#8217;ll see these embeddings widely used in <strong>large-scale AI systems</strong>, especially where <strong>cost-efficiency matters</strong>.</p><div><hr></div><h3>Further Reading</h3><ul><li><p><em><a href="https://arxiv.org/abs/2205.13147?utm_source=chatgpt.com">Matryoshka Representation Learning</a></em><a href="https://arxiv.org/abs/2205.13147?utm_source=chatgpt.com"> (Kusupati et al., 2022)</a></p></li><li><p><a href="https://huggingface.co/blog/matryoshka">Hugging Face blog: </a><em><a href="https://huggingface.co/blog/matryoshka">Matryoshka Representation Learning for Efficient Embeddings</a></em></p></li></ul><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Beyond the Layers: Your Guide to Generative AI Skills & Job Roles]]></title><description><![CDATA[Remember how in my last post we peeled back the layers of the Generative AI tech stack? We saw how everything from powerful computers to cool apps makes GenAI work. 
Well, understanding what makes it tick is great, but it naturally leads to the next big question: "What does this mean for]]></description><link>https://datajourney24.substack.com/p/beyond-the-layers-your-guide-to-generative</link><guid isPermaLink="false">https://datajourney24.substack.com/p/beyond-the-layers-your-guide-to-generative</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Sat, 07 Jun 2025 12:29:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Remember how in my last post we peeled back the layers of the <strong>Generative AI tech stack</strong>? We saw how everything from powerful computers to cool apps makes GenAI work. Well, understanding <em>what</em> makes it tick is great, but it naturally leads to the next big question: "What does this mean for <em>my</em> career?" or "What skills do I actually need to get involved?"</p><p>That's exactly what we're diving into today. This post will go layer by layer, breaking down the <strong>key skills and knowledge</strong> you'll typically need, and showing you how these line up with <strong>common job roles</strong> in the Generative AI world. Whether you're just starting out, a seasoned pro, or a leader looking to adapt, this guide should help light up your path.</p><div><hr></div><h3>Diving Deep: Skills &amp; Roles for Each Layer of the GenAI Stack</h3><p>Let's break down the essential stuff you'll need to know for each part of the Generative AI pyramid:</p><h4>Layer 1: Infrastructure Layer</h4><ul><li><p><strong>What it does:</strong> This is the base &#8211; building and keeping the powerful computers and cloud systems running.</p></li><li><p><strong>Skills you'll need to learn:</strong></p><ul><li><p><strong>Cloud Platforms (really know them):</strong> Think AWS, GCP, Azure, and how they handle big AI tasks.</p></li><li><p><strong>Containers &amp; Orchestration:</strong> Getting good with Docker and Kubernetes, especially for managing those powerful GPU containers.</p></li><li><p><strong>Operating Systems:</strong> Knowing your way around Linux and basic command-line stuff.</p></li><li><p><strong>Networking:</strong> Understanding how everything connects, like setting up virtual networks and making sure data flows super fast.</p></li><li><p><strong>Hardware Know-how:</strong> A grasp of how GPUs (like NVIDIA's), CPUs, and different types of memory and storage work.</p></li><li><p><strong>Infrastructure as Code (IaC):</strong> Using tools like Terraform to automate setting up computer systems.</p></li><li><p><strong>Monitoring &amp; Logging:</strong> Tools like Prometheus and Grafana to keep an eye on how everything's running.</p></li></ul></li><li><p><strong>Jobs that fit here:</strong></p><ul><li><p>Cloud Engineer / Cloud Architect</p></li><li><p>DevOps Engineer (especially for AI systems)</p></li><li><p>Site Reliability Engineer (SRE)</p></li><li><p>ML Infrastructure Engineer</p></li><li><p>Data Center Engineer</p></li></ul></li></ul><h4>Layer 2: Data Layer</h4><ul><li><p><strong>What it does:</strong> This is the fuel! 
It's all about finding, cleaning, storing, and managing the huge amounts of data GenAI needs.</p></li><li><p><strong>Skills you'll need to learn:</strong></p><ul><li><p><strong>Big Data Tech:</strong> Tools like Apache Spark for handling massive datasets.</p></li><li><p><strong>Database Management:</strong> Knowing SQL and different kinds of NoSQL databases.</p></li><li><p><strong>Vector Databases (super important for GenAI):</strong> Getting familiar with Pinecone, Weaviate, Milvus &#8211; how they store and search for AI information.</p></li><li><p><strong>Data Warehousing/Lakes:</strong> Working with systems like Snowflake or Databricks for storing and analyzing data.</p></li><li><p><strong>ETL/ELT Tools:</strong> Using things like Airflow to build pipelines that move and transform data.</p></li><li><p><strong>Data Governance &amp; Security:</strong> Understanding privacy rules (like GDPR) and how to keep data safe.</p></li><li><p><strong>Data Modeling:</strong> Designing how data is structured.</p></li><li><p><strong>Python (for Data Engineering):</strong> Key libraries like Pandas and PySpark.</p></li><li><p><strong>Data Quality:</strong> Making sure the data is accurate and consistent.</p></li></ul></li><li><p><strong>Jobs that fit here:</strong></p><ul><li><p>Data Engineer</p></li><li><p>ML Data Engineer</p></li><li><p>Data Architect</p></li><li><p>Database Administrator (DBA) (especially for vector databases)</p></li><li><p>Data Governance Specialist</p></li></ul></li></ul><h4>Layer 3: Model Layer</h4><ul><li><p><strong>What it does:</strong> This is the "brain" of GenAI &#8211; building, training, and fine-tuning the actual AI models.</p></li><li><p><strong>Skills you'll need to learn:</strong></p><ul><li><p><strong>Deep Learning Frameworks (master them):</strong> PyTorch, TensorFlow, JAX.</p></li><li><p><strong>Generative Model Architectures:</strong> Really understanding <strong>Transformers</strong> (what makes LLMs work), <strong>Diffusion Models</strong> (for images), and others like GANs.</p></li><li><p><strong>Math for ML:</strong> Linear Algebra, Calculus, Probability, Statistics (the fundamentals!).</p></li><li><p><strong>Python (for ML):</strong> Core libraries like NumPy and SciPy.</p></li><li><p><strong>Specialized Libraries:</strong> Hugging Face Transformers and Diffusers.</p></li><li><p><strong>Model Training Techniques:</strong> How to train models efficiently, including fine-tuning (like <strong>LoRA</strong>).</p></li><li><p><strong>Model Evaluation:</strong> How to measure if a generated text or image is good, and how to spot biases.</p></li></ul></li><li><p><strong>Jobs that fit here:</strong></p><ul><li><p>ML Scientist / Research Scientist</p></li><li><p>Generative AI Engineer (focused on building models)</p></li><li><p>Deep Learning Engineer</p></li><li><p>Applied Scientist (ML)</p></li><li><p>NLP Engineer</p></li><li><p>Computer Vision Engineer</p></li></ul></li></ul><h4>Layer 4: LLMOps &amp; Orchestration Layer</h4><ul><li><p><strong>What it does:</strong> This is the "nervous system" &#8211; getting those big AI models ready for prime time, making them work together, and keeping them running smoothly.</p></li><li><p><strong>Skills you'll need to learn:</strong></p><ul><li><p><strong>MLOps Best Practices:</strong> How to manage the whole lifecycle of AI models, from development to deployment and monitoring.</p></li><li><p><strong>LLM Serving Frameworks:</strong> Knowing tools like <strong>vLLM</strong> and Hugging Face TGI to run LLMs 
efficiently.</p></li><li><p><strong>Prompt Engineering:</strong> Advanced ways to talk to AI models to get the best results, and how to manage those prompts.</p></li><li><p><strong>RAG Architectures:</strong> Building systems that help AI models use outside knowledge to give better answers.</p></li><li><p><strong>AI Agent Frameworks:</strong> Working with <strong>LangChain</strong>, <strong>LlamaIndex</strong>, and AutoGen to build AI that can plan and use tools.</p></li><li><p><strong>API Design &amp; Integration:</strong> How to connect different software parts.</p></li><li><p><strong>Cloud ML Services:</strong> Using services like Vertex AI or SageMaker to manage AI pipelines.</p></li><li><p><strong>Distributed Systems:</strong> Understanding how to build and scale complex connected systems.</p></li><li><p><strong>Cost Optimization:</strong> Keeping an eye on token usage and other costs.</p></li></ul></li><li><p><strong>Jobs that fit here:</strong></p><ul><li><p>MLOps Engineer (specialized in LLMs/GenAI)</p></li><li><p>Generative AI Engineer (focused on deployment &amp; orchestration)</p></li><li><p>AI Platform Engineer</p></li><li><p>Prompt Engineer</p></li><li><p>Solutions Architect (AI/ML)</p></li></ul></li></ul><h4>Layer 5: Application Layer</h4><ul><li><p><strong>What it does:</strong> This is what users actually see and touch &#8211; the apps and services powered by GenAI.</p></li><li><p><strong>Skills you'll need to learn:</strong></p><ul><li><p><strong>Frontend Development:</strong> Building the user interface (web apps with React, mobile apps with Swift/Kotlin).</p></li><li><p><strong>Backend Development:</strong> Building the "behind-the-scenes" logic for apps (with Python, Node.js, Java).</p></li><li><p><strong>Database Integration:</strong> Connecting apps to databases.</p></li><li><p><strong>API Integration:</strong> Using APIs to link your app to the AI models.</p></li><li><p><strong>UX/UI Principles:</strong> Designing apps that are easy and enjoyable to use, especially with AI interactions.</p></li><li><p><strong>Security:</strong> Keeping user data and your app safe.</p></li><li><p><strong>Understanding GenAI Limits:</strong> Knowing what AI can and can't do to build realistic features.</p></li><li><p><strong>Product Thinking:</strong> Turning user needs into actual app features.</p></li></ul></li><li><p><strong>Jobs that fit here:</strong></p><ul><li><p>Full-stack Developer (with GenAI interest)</p></li><li><p>Frontend Developer</p></li><li><p>Backend Developer</p></li><li><p>Software Engineer (generalist, but building GenAI apps)</p></li><li><p>Product Manager (AI/ML)</p></li><li><p>UX Designer (focused on GenAI interaction)</p></li></ul></li></ul><div><hr></div><h3>Finding Your Place in the GenAI Ecosystem</h3><p>Understanding this detailed breakdown of skills and job roles across the 5-layer Generative AI Tech Stack is your roadmap to professional growth in this exciting field. 
It helps you:</p><ul><li><p><strong>Figure out your current strengths</strong> and how they fit into GenAI roles.</p></li><li><p><strong>Spot any skill gaps</strong> for the career path you want.</p></li><li><p><strong>Understand how to work with</strong> other specialized teams.</p></li><li><p><strong>Plan your learning journey</strong> more effectively.</p></li></ul><p>The Generative AI world is huge and still growing, but with this clearer picture, you can confidently navigate its complexities and find your perfect spot.</p><div><hr></div><h3>What's Next?</h3><p>The Generative AI journey is just beginning, and with a clear understanding of its underlying architecture, you're now better equipped to shape its future.</p><p>I'll be sharing more insights into the practical side of AI and ML in upcoming posts.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Beyond the Hype: Unpacking the 5-Layer Generative AI Tech Stack ]]></title><description><![CDATA[Discover the foundational technologies powering the Generative AI revolution, and the critical skills needed at each level to build, manage, and leverage AI.]]></description><link>https://datajourney24.substack.com/p/beyond-the-hype-unpacking-the-5-layer</link><guid isPermaLink="false">https://datajourney24.substack.com/p/beyond-the-hype-unpacking-the-5-layer</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Wed, 04 Jun 2025 17:22:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!O2yK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Welcome to the World of Generative AI!</strong></h3><p>Generative AI is no longer a futuristic concept; it's here, transforming industries from creative arts and content creation to software development and scientific research. Tools like ChatGPT, Midjourney, and Sora are captivating the world, hinting at a vast, underlying technological infrastructure that makes this magic possible.</p><p>But for many, the 'how' behind this revolution remains a black box. What are the fundamental components that enable AI to create, write, and innovate? And more importantly, what skills do you need to truly engage with this groundbreaking technology and shape its future?</p><p>This post aims to demystify the Generative AI ecosystem by breaking it down into a clear, 5-layer tech stack. We'll explore each layer, highlighting its purpose, key components, and the essential skills you'll need to master it. 
Understanding this stack is the first crucial step towards building expertise in GenAI.</p><div><hr></div><h3><strong>The Generative AI Pyramid: A 5-Layer Tech Stack</strong></h3><p>Think of Generative AI as a powerful edifice, built layer by layer, from the raw computing power at its base to the user-friendly applications at its peak. Each layer is dependent on the one below it, and each requires a distinct set of technologies and skills.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O2yK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O2yK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 424w, https://substackcdn.com/image/fetch/$s_!O2yK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 848w, https://substackcdn.com/image/fetch/$s_!O2yK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 1272w, https://substackcdn.com/image/fetch/$s_!O2yK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O2yK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png" width="1456" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c39883c6-8548-4449-9d82-e17933d46cb5_1737x961.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://datajourney24.substack.com/i/165206089?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc39883c6-8548-4449-9d82-e17933d46cb5_1737x961.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O2yK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 424w, https://substackcdn.com/image/fetch/$s_!O2yK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 848w, 
https://substackcdn.com/image/fetch/$s_!O2yK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 1272w, https://substackcdn.com/image/fetch/$s_!O2yK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a1e87aa-299e-44de-9fc5-f42a7694ec8d_1737x961.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GenAI stack</figcaption></figure></div><p>The Generative AI pyramid illustrates the hierarchical dependency of the tech stack, with foundational components at the base supporting increasingly abstract and user-facing capabilities towards the apex.</p><p>Let's explore each layer from the top down:</p><div><hr></div><h3><strong>Layer 5: Application Layer</strong></h3><ul><li><p><strong>Purpose</strong>: This is the most user-facing layer, comprising the actual products and services that deliver Generative AI capabilities to end-users. It focuses on user experience, specific business logic, and presenting AI-generated content in a meaningful way.</p></li><li><p><strong>Key Components &amp; Responsibilities:</strong></p><ul><li><p><strong>User Interface (UI) / User Experience (UX)</strong>: Web applications (React, Angular, Vue.js), mobile apps (React Native, Flutter, Swift/Kotlin), desktop applications. This is what the user directly sees and interacts with.</p></li><li><p><strong>Application-Specific Business Logic:</strong> Code that defines the unique features and workflows of the particular GenAI product. 
This includes user authentication, payment processing, integration with existing enterprise systems (CRM, ERP), and managing the overall application state.</p></li><li><p><strong>User-Facing Prompt Logic</strong>: While core prompt engineering is lower down, the application might include logic for how user input is captured, how it's formatted for a prompt, and how the LLM's response is parsed and displayed to the user.</p></li><li><p><strong>Agent Execution &amp; Presentation</strong>: If AI agents are part of the application, this layer manages how the user interacts with the agent, triggers its actions, and how the agent's progress and final results are communicated back to the user.</p></li><li><p>User-Centric RAG Display: How the application presents retrieved context to the user (e.g., citing sources, showing retrieved documents) to enhance transparency and trust.</p></li></ul></li><li><p><strong>Examples:</strong> ChatGPT, Midjourney, GitHub Copilot, Jasper, enterprise chatbots, AI-powered content creation tools, intelligent virtual assistants.</p></li></ul><div><hr></div><h3><strong>Layer 4: LLMOps &amp; Orchestration Layer</strong></h3><ul><li><p><strong>Purpose:</strong> This layer acts as the "nervous system" connecting the Application Layer to the core AI models and data. It handles the specific operational challenges of LLMs (LLMOps) and orchestrates complex AI workflows, including prompt management, RAG pipelines, and multi-agent systems.</p></li><li><p><strong>Key Components &amp; Responsibilities:</strong></p><ul><li><p><strong>Prompt Engineering &amp; Management:</strong></p><ul><li><p>Developing, testing, and optimizing prompt templates for various tasks.</p></li><li><p>Implementing prompting strategies (e.g., few-shot learning, chain-of-thought, self-consistency).</p></li><li><p>Versioning and managing prompts across different model versions and application features.</p></li></ul></li><li><p><strong>Retrieval-Augmented Generation (RAG) Pipelines:</strong></p><ul><li><p>Managing the entire workflow: processing user queries, retrieving relevant information from external knowledge bases (via the Data Layer), augmenting the prompt with retrieved context, and sending the combined input to the LLM.</p></li><li><p>Tools like LangChain and LlamaIndex are prominent here.</p></li></ul></li><li><p><strong>AI Agent Frameworks:</strong></p><ul><li><p>Implementing the core logic for AI agents: planning, tool use (e.g., via Model Context Protocol - MCP), memory management, and inter-agent communication (Agent-to-Agent - A2A).</p></li><li><p>Frameworks like AutoGen, CrewAI, and advanced capabilities of LangChain fall into this category.</p></li></ul></li><li><p><strong>LLM Serving &amp; Inference Optimization:</strong></p><ul><li><p>Deploying and scaling LLMs for real-time inference.</p></li><li><p>Using specialized inference engines (e.g., vLLM, NVIDIA TensorRT-LLM, Hugging Face TGI) for high throughput, low latency, and efficient GPU utilization.</p></li><li><p>Handling request batching, quantization, and distributed serving.</p></li></ul></li><li><p><strong>LLMOps (Operational Aspects):</strong></p><ul><li><p>Experiment Tracking: Logging and managing LLM training, fine-tuning, and inference experiments (e.g., MLflow, Weights &amp; Biases).</p></li><li><p>Model Deployment &amp; Management: Versioning, rolling out, and rolling back LLM models and fine-tuned adaptations.</p></li><li><p>Monitoring &amp; Observability: Tracking LLM performance (latency, throughput, token usage, quality 
metrics), detecting model drift, hallucination rates, and cost analytics.</p></li><li><p>Fine-tuning &amp; LoRA Management: Orchestrating the fine-tuning process of base models with custom data, and managing different LoRA adapters.</p></li><li><p>A/B Testing: For different prompts, models, or RAG configurations.</p></li></ul></li></ul></li><li><p><strong>Examples:</strong> LangChain, vLLM, AutoGen, MLflow, Weights &amp; Biases, OpenAI/Anthropic/Google APIs (when used as part of an orchestrated flow), custom API gateways.</p></li></ul><div><hr></div><h3><strong>Layer 3: Model Layer</strong></h3><ul><li><p><strong>Purpose</strong>: This layer contains the core generative AI models themselves &#8211; the "brains" that perform the actual content generation, understanding, and embedding.</p></li><li><p><strong>Key Components &amp; Responsibilities:</strong></p><ul><li><p><strong>Foundation Models (FMs) / Large Language Models (LLMs):</strong></p><ul><li><p>Pre-trained, general-purpose models on massive datasets that form the base for most GenAI applications.</p></li><li><p>Examples: GPT series (OpenAI), Gemini (Google), Claude (Anthropic), Llama (Meta), Mistral, Stable Diffusion (for images).</p></li></ul></li><li><p><strong>Fine-tuned Models</strong>: Specialized versions of foundation models that have been further trained on smaller, task-specific datasets to improve performance for particular use cases or domains.</p></li><li><p><strong>Embedding Models:</strong> Models specifically designed to convert text, images, or other data into numerical vector representations (embeddings). These are crucial for RAG, semantic search, and other AI tasks.</p></li><li><p><strong>Deep Learning Frameworks</strong>: Fundamental software libraries for building, training, and deploying neural networks.</p><ul><li><p>PyTorch (flexible, research-oriented).</p></li><li><p>TensorFlow (robust, production-oriented).</p></li><li><p>JAX (for high-performance numerical computation).</p></li></ul></li><li><p><strong>Model Hubs &amp; Repositories:</strong> Platforms for discovering, sharing, and versioning pre-trained models (e.g., Hugging Face Hub).</p></li></ul></li><li><p>Examples: GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Stable Diffusion XL, OpenAI text-embedding-ada-002, PyTorch, TensorFlow.</p></li></ul><div><hr></div><h3><strong>Layer 2: Data Layer</strong></h3><ul><li><p><strong>Purpose:</strong> This layer provides and manages the massive datasets that are the lifeblood of Generative AI. 
It encompasses data collection, processing, storage, and organization for both model training and real-time inference contexts (like RAG).</p></li><li><p><strong>Key Components &amp; Responsibilities:</strong></p><ul><li><p><strong>Data Collection &amp; Acquisition</strong>: Sourcing raw data from diverse origins (web scraping, public datasets like Common Crawl, enterprise data lakes, user-generated content).</p></li><li><p><strong>Data Preprocessing Tools:</strong></p><ul><li><p>Big Data frameworks (Apache Spark, Apache Hadoop) for cleaning, transforming, normalizing, augmenting, and chunking raw data into suitable formats for model consumption.</p></li><li><p>ETL (Extract, Transform, Load) pipelines.</p></li></ul></li><li><p><strong>Data Storage:</strong></p><ul><li><p>Object Storage: Scalable, cost-effective storage for large volumes of unstructured data (AWS S3, Google Cloud Storage, Azure Blob Storage).</p></li><li><p>Data Warehouses/Lakes: For structured and semi-structured data, enabling analytics and complex queries (Snowflake, Databricks Lakehouse, Google BigQuery).</p></li></ul></li><li><p><strong>Vector Databases</strong>: Highly specialized databases designed to efficiently store and query high-dimensional vector embeddings. Critical for fast similarity searches in RAG and semantic search applications (Pinecone, Weaviate, Milvus, Qdrant).</p></li><li><p><strong>Knowledge Bases &amp; Document Stores</strong>: The structured and unstructured data repositories that RAG systems retrieve information from (e.g., internal company wikis, documentation, CRM data).</p></li><li><p><strong>Data Labeling Platforms</strong>: Services and tools for human annotation and labeling of data, crucial for supervised fine-tuning.</p></li><li><p><strong>Data Governance &amp; Security:</strong> Implementing policies, tools, and processes for data quality, privacy (e.g., GDPR, HIPAA compliance), access control, and lineage.</p></li></ul></li><li><p>Examples: AWS S3, Google Cloud Storage, Pinecone, Apache Spark, Snowflake, custom document stores, vast web datasets.</p></li></ul><div><hr></div><h3><strong>Layer 1: Infrastructure Layer</strong></h3><ul><li><p><strong>Purpose:</strong> This is the foundational layer providing the raw compute, storage, and networking resources required to power all layers above it. 
It's the physical and virtual backbone of the entire GenAI tech stack.</p></li><li><p><strong>Key Components &amp; Responsibilities:</strong></p><ul><li><p><strong>Compute Hardware:</strong></p><ul><li><p>GPUs (Graphics Processing Units): Essential for the parallel processing capabilities needed for deep learning model training and high-performance inference (e.g., NVIDIA A100s, H100s, L40S).</p></li><li><p>TPUs (Tensor Processing Units): Google's custom ASICs optimized specifically for machine learning workloads.</p></li><li><p>CPUs: For general-purpose computation, data preprocessing, and orchestrating workloads.</p></li></ul></li><li><p><strong>Cloud Platforms:</strong> Provide scalable, on-demand access to compute, storage, and managed services.</p><ul><li><p>Amazon Web Services (AWS)</p></li><li><p>Google Cloud Platform (GCP)</p></li><li><p>Microsoft Azure</p></li><li><p>(Potentially on-premise data centers for specific enterprise needs).</p></li></ul></li><li><p><strong>Networking:</strong> High-bandwidth, low-latency network infrastructure for efficient data transfer between compute instances and storage.</p></li><li><p><strong>Operating Systems:</strong> Typically Linux distributions (Ubuntu, CentOS, etc.) running on servers.</p></li><li><p><strong>Virtualization / Container Orchestration</strong>:</p><ul><li><p>Docker: For packaging applications and their dependencies into portable containers.</p></li><li><p>Kubernetes: For orchestrating, automating deployment, scaling, and managing containerized applications across clusters of machines. This is vital for managing distributed training and inference workloads.</p></li></ul></li></ul></li><li><p>Examples: NVIDIA GPUs, AWS EC2 instances, Google Compute Engine, Azure Kubernetes Service (AKS), Docker.</p></li></ul><div><hr></div><div><hr></div><h3><strong>Why Understanding This Stack is Your Superpower</strong></h3><p>Knowing this <strong>5-layer Generative AI Tech Stack</strong> isn't just for textbooks; it's your personal blueprint for success in the AI era. Here's why getting a handle on it is so important:</p><ul><li><p><strong>For Tech Professionals (like ML Engineers, Data Scientists, and Developers):</strong> This structure helps you pinpoint exactly what skills you need to learn. You can specialize in hot areas like <strong>LLMOps</strong> or <strong>Vector Databases</strong>, and clearly see how your work fits into the bigger GenAI picture. It truly empowers you to build, fine-tune, and launch cutting-edge AI systems.</p></li><li><p><strong>For Product &amp; Business Leaders:</strong> This clear view gives you the insights to make smart decisions&#8212;like whether to build AI features in-house or buy them. You'll better understand what's technically possible, how to budget effectively, and how to spot truly game-changing AI product ideas that hit market needs.</p></li><li><p><strong>For Anyone in Tech:</strong> It turns Generative AI from a mysterious "black box" into a clear, understandable landscape. 
This knowledge lets you engage with GenAI strategically, whether you're hands-on building, managing projects, or simply figuring out how to use its incredible power.</p></li></ul><div><hr></div><h3>What's Next?</h3><p>The Generative AI journey is just beginning, and with a clear understanding of its underlying architecture, you're now better equipped to shape its future.</p><p>I'll be sharing more insights into the practical side of GenAI in upcoming posts.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Unlocking Transformers: 4 Resources to Demystify LLMs]]></title><description><![CDATA[Transformers are the foundation of powerful LLMs like GPT, yet understanding how they work can feel overwhelming.]]></description><link>https://datajourney24.substack.com/p/unlocking-transformers-4-resources</link><guid isPermaLink="false">https://datajourney24.substack.com/p/unlocking-transformers-4-resources</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Thu, 13 Mar 2025 05:26:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Transformers are the foundation of powerful LLMs like GPT, yet understanding how they work can feel overwhelming. These resources break down the complexity and provide insights that make transformers more accessible.</p><div><hr></div><h3>1. <strong>Jay Alammar's Illustrated Transformer</strong></h3><p>If you're a visual learner, this is the perfect starting point. Jay Alammar&#8217;s guide beautifully simplifies the transformer architecture using clear diagrams and intuitive explanations.</p><p>In this guide, Jay explains:</p><ul><li><p><strong>Self-attention:</strong> How each word in a sequence relates to every other word, improving context understanding.</p></li><li><p><strong>Encoder-decoder architecture:</strong> The core structure behind many transformer models.</p></li><li><p><strong>Detailed visual walkthroughs:</strong> Step-by-step illustrations that simplify even the most complex concepts.</p></li></ul><p><strong>Why it&#8217;s great:</strong> The visuals help you build strong intuition, making complex ideas easier to grasp. Jay&#8217;s clear narrative makes it engaging for both beginners and experienced practitioners.<br>&#128279; <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a> | <a href="https://www.linkedin.com/in/jayalammar/">Jay Alammar</a></p><div><hr></div><h3>2. 
<strong>How Transformer LLMs Work</strong></h3><p><strong>Created by Jay Alammar and Maarten Grootendorst in collaboration with DeepLearning.AI</strong>, this course offers a comprehensive breakdown of the transformer architecture that powers LLMs.</p><p>Key concepts covered in this course include:</p><ul><li><p><strong>Tokenization and embeddings:</strong> How text is converted into numerical representations for model input.</p></li><li><p><strong>The attention mechanism:</strong> Understanding how models decide which words deserve more focus.</p></li><li><p><strong>The transformer block:</strong> Detailed insights into each component like multi-head attention, feedforward layers, and layer normalization.</p></li><li><p><strong>Practical coding examples:</strong> Build your intuition and skills by implementing key transformer components in code.</p></li></ul><p><strong>Why it&#8217;s great:</strong> This course not only builds theoretical understanding but also equips you with hands-on skills essential for applying transformers in real-world projects.<br>&#128279; <a href="https://www.deeplearning.ai/short-courses/how-transformer-llms-work/">How Transformer LLMs Work</a> | <a href="https://www.linkedin.com/school/deeplearning-ai/">DeepLearning.AI</a></p><div><hr></div><h3>3. <strong>Attention in Transformers: Concepts and Code in PyTorch</strong></h3><p>This course, created in collaboration with <strong>StatQuest</strong> and taught by its Founder and CEO, <strong>Josh Starmer</strong>, explains attention mechanisms with clarity and precision.</p><p>The course covers:</p><ul><li><p><strong>Attention mechanism fundamentals:</strong> Step-by-step breakdown of how attention scores are calculated.</p></li><li><p><strong>Coding attention in PyTorch:</strong> Practical guidance on implementing key transformer elements from scratch.</p></li><li><p><strong>Intuitive examples:</strong> Josh&#8217;s clear explanations simplify complex ideas, making them accessible to all learners.</p></li></ul><p><strong>Why it&#8217;s great:</strong> Combining theory with practical implementation helps you move from understanding concepts to applying them in real-world models.<br>&#128279; <a href="https://www.deeplearning.ai/short-courses/attention-in-transformers-concepts-and-code-in-pytorch/">Attention in Transformers</a> | <a href="https://www.linkedin.com/in/joshstarmer/">Josh Starmer</a></p><div><hr></div><h3>4. <strong>Luis Serrano&#8217;s Explanation of Key, Query &amp; Value Matrices</strong></h3><p>Luis Serrano offers a unique analogy for understanding the attention mechanism. He describes:</p><ul><li><p><strong>Word embeddings as planets and stars:</strong> Visualizing words floating in a &#8220;language universe.&#8221;</p></li><li><p><strong>The role of Keys, Queries, and Values:</strong> Acting like gravitational forces that determine which words attract the model&#8217;s attention.</p></li><li><p><strong>Step-by-step insights:</strong> Breaking down the mathematics behind attention in a simple yet powerful way.</p></li></ul><p><strong>Why it&#8217;s great:</strong> This creative analogy turns complex math into an engaging story, making it easier to understand how attention works. 
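</p><p>For readers who want to connect the analogy back to the actual computation, here is a tiny, generic NumPy sketch of scaled dot-product attention. It is an illustrative example only, not code from the video, and the variable names are assumptions made for this sketch:</p><pre><code># Scaled dot-product attention: the computation behind the Key/Query/Value analogy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays of query, key, and value vectors.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how strongly each query "attracts" each key
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                       # each output is a weighted mix of value vectors

# Example: 3 tokens with 4-dimensional embeddings and random projection matrices.
x = np.random.randn(3, 4)
Wq, Wk, Wv = np.random.randn(4, 4), np.random.randn(4, 4), np.random.randn(4, 4)
output = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(output.shape)  # (3, 4)
</code></pre><p>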
Luis's intuitive style is perfect for learners who prefer storytelling over technical jargon.<br>&#128279; <a href="https://www.youtube.com/watch?v=RFdb2rKAqFw">Luis Serrano's Video</a> | <a href="https://www.linkedin.com/in/luisgserrano/">Luis Serrano</a></p><div><hr></div><h3>Why These Resources?</h3><p>Each resource offers a unique perspective:</p><ul><li><p><strong>Visual learning</strong> (Jay Alammar)</p></li><li><p><strong>Conceptual insights with hands-on practice</strong> (How Transformer LLMs Work)</p></li><li><p><strong>Step-by-step coding guidance</strong> (StatQuest)</p></li><li><p><strong>Intuitive analogies for deeper understanding</strong> (Luis Serrano)</p></li></ul><p>Combining these resources gives you a well-rounded understanding of transformers &#8212; from theory to practice.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Understanding Agentic Design Patterns ]]></title><description><![CDATA[Artificial Intelligence (AI) is evolving rapidly, moving from simple tasks to more complex, autonomous operations.]]></description><link>https://datajourney24.substack.com/p/understanding-agentic-design-patterns</link><guid isPermaLink="false">https://datajourney24.substack.com/p/understanding-agentic-design-patterns</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Thu, 02 Jan 2025 13:54:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uy5R!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe99bfe70-ad63-4822-a55f-3dd10d018800_826x826.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Artificial Intelligence (AI) is evolving rapidly, moving from simple tasks to more complex, autonomous operations. A key factor in this advancement is the use of <strong>agentic design patterns</strong>. These patterns enable AI systems to make decisions, assess their performance, and improve over time, much like humans do.</p><p><strong>What Are Agentic Design Patterns?</strong></p><p>Agentic design patterns are structured methods that guide AI systems in becoming more independent and effective. 
They allow AI systems to perform tasks, make decisions, and interact with other systems on their own, much as humans solve problems and think.</p><p><strong>Common Agentic Design Patterns</strong></p><p>Here are some of the most common agentic design patterns:</p><ol><li><p><strong>Reflection</strong>: This pattern lets AI systems look at and assess their own outputs, helping them improve and fix mistakes.</p></li><li><p><strong>Tool Use</strong>: AI systems can use external tools or resources to boost their abilities.</p></li><li><p><strong>Planning</strong>: This pattern involves AI figuring out the steps needed to reach a bigger goal.</p></li><li><p><strong>Multiagent Collaboration</strong>: In this approach, multiple AI agents work together to solve complex problems.</p></li></ol><p><strong>Reflection Pattern</strong></p><p>The <strong>Reflection</strong> pattern allows AI systems to examine and evaluate their own outputs, leading to self-improvement and error correction.</p><ul><li><p><strong>Implementation</strong>: To integrate reflection, AI systems can create feedback loops where they assess their outputs against predefined criteria or benchmarks. This process enables the system to recognize discrepancies and refine its approach (a minimal sketch of such a loop appears after the pattern descriptions below).</p></li><li><p><strong>Benefits</strong>: Reflection enhances the reliability and accuracy of AI systems, allowing them to learn from past experiences and adapt to new challenges.</p></li></ul><p><strong>Tool Use Pattern</strong></p><p>The <strong>Tool Use</strong> pattern enables AI systems to extend their capabilities by utilizing external tools or resources.</p><ul><li><p><strong>Implementation</strong>: AI can integrate with various tools, such as web search engines, databases, or specialized software, to augment its knowledge base and functionality.</p></li><li><p><strong>Benefits</strong>: By leveraging external tools, AI systems can access a broader range of information and perform complex tasks more effectively.</p></li></ul><p><strong>Planning Pattern</strong></p><p>The <strong>Planning</strong> pattern involves AI autonomously determining the sequence of steps required to achieve a larger objective.</p><ul><li><p><strong>Implementation</strong>: AI can deconstruct complex tasks into manageable subtasks, such as conducting research, synthesizing findings, and compiling reports. This structured approach enables AI to tackle multifaceted problems systematically.</p></li><li><p><strong>Benefits</strong>: Planning improves task efficiency and effectiveness, allowing AI to handle complex, multi-step challenges in a structured way.</p></li></ul><p><strong>Multiagent Collaboration Pattern</strong></p><p>The <strong>Multiagent Collaboration</strong> pattern involves multiple AI agents working together to tackle complex challenges.</p><ul><li><p><strong>Implementation</strong>: AI agents can collaborate by dividing tasks, sharing information, and coordinating actions to achieve common goals.</p></li><li><p><strong>Benefits</strong>: Collaboration leverages the strengths of each individual agent, leading to more robust and efficient problem-solving than any single agent could deliver alone.</p></li></ul>
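<p>To make the Reflection pattern concrete, here is a minimal, illustrative sketch of a generate-critique-revise loop. The <code>llm</code> callable is a hypothetical stand-in for a call to any language model; it is not tied to a specific framework or API:</p><pre><code># Minimal Reflection loop: generate a draft, critique it, revise (illustrative sketch).
# `llm` is a hypothetical callable that sends a prompt to a language model and returns text.
def reflect(llm, task: str, max_rounds: int = 3) -> str:
    draft = llm(f"Complete the following task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nDraft answer: {draft}\n"
            "List concrete problems with this draft, or reply with just OK if it is acceptable."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the draft passed its own review
        draft = llm(
            f"Task: {task}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the draft so it addresses every point in the critique."
        )
    return draft

# Usage (with a real client, `llm` would wrap an API call):
# answer = reflect(llm=my_model, task="Summarize this report in three bullet points.")
</code></pre><p>In practice, the stopping condition would compare the draft against the predefined criteria or benchmarks mentioned above rather than relying on a simple &#8220;OK&#8221; reply, but the feedback-loop structure stays the same.</p><p><strong>Conclusion</strong></p><p>Agentic design patterns are essential in advancing AI capabilities, enabling systems to operate with greater autonomy and intelligence. 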
By incorporating Reflection, Tool Use, Planning, and Multiagent Collaboration, AI can tackle complex tasks more effectively, paving the way for more sophisticated and adaptable intelligent systems.</p><p>Stay tuned to learn more about agentic design patterns.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Exploring the Agentic Framework in AI]]></title><description><![CDATA[Artificial Intelligence (AI) has come a long way from being a tool for specific tasks like language translation or image recognition.]]></description><link>https://datajourney24.substack.com/p/exploring-the-agentic-framework-in</link><guid isPermaLink="false">https://datajourney24.substack.com/p/exploring-the-agentic-framework-in</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Mon, 09 Dec 2024 16:14:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bd5m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Artificial Intelligence (AI) has come a long way from being a tool for specific tasks like language translation or image recognition. Today, AI systems are evolving to become more autonomous, capable of learning, adapting, and making decisions without constant human intervention. A concept at the forefront of this evolution is the Agentic Framework in AI. But what does this framework entail, and why should you care? Let&#8217;s unpack it in simple terms.</p><h4><strong> What is the Agentic Framework?</strong></h4><p>The Agentic Framework is a design philosophy that frames AI systems as agents. These agents operate with a higher degree of autonomy and intelligence than traditional AI systems. 
Here are the key characteristics:</p><ul><li><p>Goal-driven: The agent works toward achieving specific objectives.</p></li><li><p>Environment-aware: It perceives and interacts with its surroundings.</p></li><li><p>Autonomous: It makes decisions independently, without relying on constant human input.</p></li><li><p>Learning-oriented: It improves over time by learning from its interactions and experiences.</p></li></ul><p>In short, an agentic AI isn&#8217;t just a passive tool; it&#8217;s an active, decision-making entity that collaborates with humans or other systems to achieve goals.</p><h4><strong>Why is the Agentic Framework Important?</strong></h4><p>Here&#8217;s why this framework is shaping the future of AI:</p><ul><li><p>Dynamic Decision-Making: Unlike traditional AI systems that follow static rules, agentic systems adapt and respond to real-time changes.</p></li><li><p>Scalability: Agentic AI can handle complex environments like robotics, autonomous vehicles, or large-scale simulations, where adaptability is crucial.</p></li><li><p>Human-like Interaction: These agents can emulate reasoning and decision-making patterns akin to humans, making them ideal for applications like customer service or personal assistants.</p></li><li><p>Reduced Supervision: Agentic systems free up human resources by requiring minimal oversight, allowing humans to focus on strategic tasks.</p></li></ul><h4><strong>Breaking Down an Agentic AI System</strong></h4><p>An agentic AI system typically consists of the following core components:</p><p><strong>1. Agent Core (LLM):</strong></p><p>At the heart of the system, the Agent Core acts as the decision-making engine. It employs large language models (LLMs) like GPT-4 to handle high-level reasoning, dynamic task management, and goal updates.</p><p>The core includes the following components:</p><ul><li><p>Decision-Making Engine for analyzing inputs and generating responses.</p></li><li><p>Goal Management System to adapt objectives based on task progress.</p></li><li><p>An Integration Bus for seamless data flow between modules.</p></li></ul><p><strong>2. Memory Modules:</strong></p><p>Memory ensures context-awareness and task relevance. There are two types of memory:</p><ul><li><p>Short-term Memory (STM): Temporary storage for immediate tasks, optimized for quick access.</p></li><li><p>Long-term Memory (LTM): Persistent storage using vector databases (e.g., Pinecone, Weaviate) to recall historical interactions, with retrieval based on semantic similarity.</p></li></ul><p><strong>3. Tools:</strong></p><p>These are specialized capabilities for executing tasks, such as APIs or executable workflows. Frameworks like LangChain provide dynamic interaction and middleware support for secure and accurate data exchange.</p>
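<p>Before turning to the last component, here is a rough, framework-agnostic sketch of how the Agent Core, memory modules, and tools can fit together in code. All class, function, and prompt-format names below are illustrative assumptions, not the API of LangChain or any other library:</p><pre><code># Illustrative skeleton of an agentic loop: Agent Core (LLM), memory modules, and tools.
# Every name here is a hypothetical stand-in, not a specific framework's API.
from typing import Callable, Dict, List

class Agent:
    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]]):
        self.llm = llm                    # decision-making engine
        self.tools = tools                # named capabilities, e.g. a search or calculator API
        self.short_term: List[str] = []   # STM: recent context for the current task
        self.long_term: List[str] = []    # LTM: stand-in for a vector store of past interactions

    def run(self, goal: str, max_steps: int = 5) -> str:
        for _ in range(max_steps):
            context = "\n".join(self.long_term[-3:] + self.short_term[-5:])
            decision = self.llm(
                f"Goal: {goal}\nContext:\n{context}\n"
                f"Tools: {', '.join(self.tools)}\n"
                "Answer either 'TOOL tool_name tool_input' or 'FINAL your_answer'."
            )
            if decision.startswith("TOOL"):
                _, name, tool_input = decision.split(" ", 2)
                observation = self.tools[name](tool_input)         # execute the chosen tool
                self.short_term.append(f"{name}: {observation}")   # remember the result
            else:
                self.long_term.append(f"goal={goal} answer={decision}")
                return decision.removeprefix("FINAL ").strip()
        return "Stopped after max_steps without a final answer."
</code></pre><p>A production system would add the planning module described next, a persistent vector store behind the long-term memory, and guardrails around tool execution.</p><p><strong>4. 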
Planning Module:</strong></p><p>Planning modules handles complex problem through task decomposition and prioritization.Task Management System generates and adjusts task priorities in real-time, ensuring smooth progress toward goals.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bd5m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bd5m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 424w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 848w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 1272w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bd5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png" width="881" height="551" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:551,&quot;width&quot;:881,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bd5m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 424w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 848w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 1272w, https://substackcdn.com/image/fetch/$s_!bd5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21ea524-0a11-47d6-8fe2-4dc141ed567c_881x551.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The image above <a href="https://developer.nvidia.com/blog/introduction-to-llm-agents/">(source)</a> illustrates the architecture of a typical end-to-end agent pipeline.</p><h4><strong>Real-World Applications of the Agentic Framework</strong></h4><p>1. Healthcare: AI systems that autonomously create personalized treatment plans based on patient data and outcomes.</p><p>2. Autonomous Vehicles: Cars that navigate traffic, avoid obstacles, and adapt to unforeseen events like roadblocks.</p><p>3. Virtual Assistants: AI tutors that customize learning experiences based on the pace and preferences of individual students.</p><h4><strong>Challenges in Implementing Agentic AI</strong></h4><p>1. Ethical Concerns: Ensuring that these systems act in alignment with human values to avoid unintended consequences.</p><p>2. Complexity: Building and integrating multi-component systems is no small feat.</p><p>3. Trust: Users need assurance that AI&#8217;s decisions are explainable, reliable, and safe.</p><p>4. Regulatory Oversight:Sensitive applications, like healthcare or law enforcement, require strict compliance with regulations.</p><h4><strong>The Future of Agentic AI</strong></h4><p>The Agentic Framework is reshaping AI systems to be more like collaborators than tools. It offers a glimpse into a future where AI enhances daily life and tackles complex global challenges. However, this progress brings responsibilities&#8212;ensuring ethical design, building trust, and maintaining proper oversight are crucial for success.</p><p>What excites you the most about the Agentic Framework? Do you look forward to a future with smarter, more autonomous AI? Let&#8217;s discuss in comments. </p><p>Stay tuned for more beginner-friendly insights into AI and emerging technologies. Don&#8217;t forget to subscribe for updates!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[The Art and Science of Prompt Engineering]]></title><description><![CDATA[What is prompt engineering?]]></description><link>https://datajourney24.substack.com/p/the-art-and-science-of-prompt-engineering</link><guid isPermaLink="false">https://datajourney24.substack.com/p/the-art-and-science-of-prompt-engineering</guid><dc:creator><![CDATA[Pooja Palod]]></dc:creator><pubDate>Fri, 16 Feb 2024 08:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EGdH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>What is prompt engineering?</h4><p>Prompt engineering is the process of designing and refining prompts or instructions used to guide Generative AI systems in producing desired outputs. It involves crafting prompts that effectively elicit the desired responses while minimizing undesired or irrelevant outcomes. Prompt engineering is crucial for optimizing the performance, accuracy, and relevance of AI-generated content across different applications and domains.</p><p>Researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Developers use prompt engineering to design robust and effective prompting techniques that interface with LLMs and other tools.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Lets discuss few prompting techniques:</p><p><strong>Zero shot prompting:</strong></p><p>Zero-shot prompting refers to a technique in Generative AI where a model generates responses to prompts without any specific training examples or fine-tuning on the given prompt.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nooW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nooW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 424w, https://substackcdn.com/image/fetch/$s_!nooW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 848w, https://substackcdn.com/image/fetch/$s_!nooW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 1272w, https://substackcdn.com/image/fetch/$s_!nooW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nooW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png" width="700" height="198" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/185af537-167f-4bf7-ae1a-9d564f997838_700x198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:198,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nooW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 424w, https://substackcdn.com/image/fetch/$s_!nooW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 848w, 
https://substackcdn.com/image/fetch/$s_!nooW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 1272w, https://substackcdn.com/image/fetch/$s_!nooW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185af537-167f-4bf7-ae1a-9d564f997838_700x198.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Zero shot prompting</figcaption></figure></div><p><strong>Few shot prompting:</strong></p><p>Few-shot prompting is a technique in Generative AI where a model is provided with a small number of examples (shots) of input-output pairs, typically ranging from one to a few examples, to perform a specific task. Unlike zero-shot prompting, which does not involve any task-specific training examples, few-shot prompting enables the model to leverage the provided examples to fine-tune its parameters and adapt its behavior to the given task.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZJrn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZJrn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 424w, https://substackcdn.com/image/fetch/$s_!ZJrn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 848w, https://substackcdn.com/image/fetch/$s_!ZJrn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJrn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZJrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png" width="700" height="235" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:235,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZJrn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZJrn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 848w, https://substackcdn.com/image/fetch/$s_!ZJrn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed3c27bf-408e-49c5-b993-c8f12718a012_700x235.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Few shot prompting</figcaption></figure></div><p><strong>Chain of Thought Prompting:</strong></p><p>Chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EGdH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EGdH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 424w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 848w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 1272w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EGdH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png" width="700" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!EGdH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 424w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 848w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 1272w, https://substackcdn.com/image/fetch/$s_!EGdH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53e3883-b999-4eae-a3b4-b9eb2b6a4d0d_700x360.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">COT prompting</figcaption></figure></div><p><strong>Tree of Thought reasoning:</strong></p><p>&#8220;Tree of Thoughts&#8221; (ToT) generalizes over the popular &#8220;Chain of Thought&#8221; approach to prompting language models, and enables exploration over coherent units of text (&#8220;thoughts&#8221;) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.</p><p>ToT frames any problem as a search over a tree, where each node is a state representing a partial solution with the input and the sequence of thoughts so far.</p><p>ToT architecture comprises four fundamental components:</p><p>1. Thought Decomposition: This segment involves breaking down the problem-solving process into smaller, manageable thought steps. Each thought should be substantial enough for Large Language Models (LLMs) to assess its relevance and potential, yet small enough to foster the generation of diverse samples.</p><p>2. 
<p>2. Thought Generator: The thought generator proposes potential next thoughts from each state in the tree. Two strategies are employed:</p><p>a. Sampling from CoT prompts: Suitable for expansive thought spaces such as paragraphs, this strategy samples thoughts independently from a chain-of-thought prompt.</p><p>b. Sequential thought proposals: Better suited to constrained thought spaces such as single words or lines, this approach proposes thoughts sequentially using a &#8220;propose prompt&#8221;.</p><p>3. State Evaluator: This component assesses the progress each state has made toward solving the problem. It serves as the heuristic the search algorithm uses to decide which states warrant further exploration. Two evaluation strategies are employed:</p><p>a. Independent state valuation: Each state is evaluated independently through reasoning, producing a scalar value or a classification.</p><p>b. State comparison and voting: Different states are compared, and the most promising one is selected through a voting mechanism.</p><p>4. Search Algorithm: A tree-search algorithm is used to explore the problem space effectively. Two primary algorithms are considered:</p><p>a. Breadth-First Search (BFS): Suitable when the tree depth is limited, BFS maintains a set of the most promising states at each step, allowing initial thought steps to be evaluated and pruned down to a small set.</p><p>b. Depth-First Search (DFS): DFS explores the most promising state first, until it reaches a final output or determines that the current state cannot lead to a solution. In the latter case the subtree is pruned, prioritizing exploitation over further exploration, and DFS backtracks to the parent state to resume exploration.</p>
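<p>The four components map directly onto a short search loop. The following is a minimal, illustrative BFS-style sketch of that loop, not the paper&#8217;s implementation: <code>propose_thoughts</code>, <code>score_state</code>, and <code>is_solution</code> stand in for the propose-prompt, state-evaluation, and success-check LLM calls described above, and are given trivial placeholder bodies here so the script runs on its own.</p><pre><code class="language-python"># Minimal BFS-style Tree-of-Thoughts skeleton (illustrative only).
# propose_thoughts() and score_state() are deterministic placeholders;
# in a real system each would be an LLM call (a "propose prompt" and a
# value/vote prompt, respectively).

from dataclasses import dataclass, field

@dataclass
class State:
    problem: str
    thoughts: list[str] = field(default_factory=list)  # partial solution so far

def propose_thoughts(state: State, k: int = 3) -> list[str]:
    """Stand-in for the thought generator (would be an LLM 'propose prompt')."""
    return [f"candidate step {i} after {len(state.thoughts)} thoughts" for i in range(k)]

def score_state(state: State) -> float:
    """Stand-in for the state evaluator (would be an LLM value/vote prompt)."""
    return float(len(state.thoughts))  # placeholder heuristic: deeper = more progress

def is_solution(state: State, max_depth: int) -> bool:
    """Stand-in for a task-specific success check."""
    return len(state.thoughts) >= max_depth

def tot_bfs(problem: str, max_depth: int = 3, breadth: int = 2) -> State:
    frontier = [State(problem)]
    for _ in range(max_depth):
        # 1) Thought generation: expand every frontier state with candidate thoughts.
        candidates = [
            State(s.problem, s.thoughts + [t])
            for s in frontier
            for t in propose_thoughts(s)
        ]
        # 2) State evaluation + pruning: keep only the `breadth` best states.
        frontier = sorted(candidates, key=score_state, reverse=True)[:breadth]
        for s in frontier:
            if is_solution(s, max_depth):
                return s
    return frontier[0]

if __name__ == "__main__":
    best = tot_bfs("use 4 9 10 13 to reach 24")
    print(best.thoughts)
</code></pre>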
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!riIt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4ed38b-e037-4972-8374-d37c09041653_700x342.png"><img src="https://substackcdn.com/image/fetch/$s_!riIt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4ed38b-e037-4972-8374-d37c09041653_700x342.png" width="700" height="342" alt=""></a><figcaption class="image-caption">Various prompting techniques</figcaption></figure></div>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Various prompting techniques</figcaption></figure></div><p>Example:</p><p>Game of 24 -It is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (multiplication, addition, division, and subtraction) to obtain an answer of 24. For example, given input &#8220;4 9 10 13&#8221;, a solution output could be &#8220;(10&#8211;4) * (13&#8211;9) = 24&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLP7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLP7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 424w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 848w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 1272w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLP7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png" width="700" height="238" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nLP7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 424w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 848w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 1272w, https://substackcdn.com/image/fetch/$s_!nLP7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b079974-1922-468e-ba0f-b6b9a8579b7a_700x238.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">TOT prompting</figcaption></figure></div><p>The &#8220;propose prompt&#8221; function suggests possible next steps from four given numbers, creating new nodes. Each node&#8217;s contribution to reaching the solution is evaluated, and the best one is chosen based on the problem&#8217;s criteria. This process repeats until a solution equaling 24 or meeting the desired goal is found. Once found, a summary of the chosen path leading to the solution is provided as the final answer.</p><p>References:</p><ol><li><p><a href="https://arxiv.org/pdf/2305.10601.pdf">Tree of Thoughts: Deliberate Problem Solving with Large Language Models</a></p></li><li><p><a href="https://www.promptingguide.ai/">https://www.promptingguide.ai/</a></p></li><li><p><a href="https://arxiv.org/pdf/2109.01652.pdf">Finetuned language models are zero shot learners</a></p></li><li><p><a href="https://arxiv.org/abs/2201.11903">Chain of thought prompting elicits reasoning in large language models</a></p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajourney24.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DataJourney! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>