Prompt Engineering Is a Subset. Context Engineering Is the Job.

There's a quiet skill gap opening up between data scientists who are building LLM applications that work reliably in production and those who aren't. It doesn't show up in job descriptions yet, and it isn't taught in most courses. But if you talk to the engineers actually shipping AI systems that hold up under real workloads, they all converge on the same thing: the quality of your prompts matters far less than the quality of the context surrounding them.
The term that crystallised around mid-2025 is context engineering. Shopify's CEO described it as "the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy put it differently: filling the context window with just the right information for the next step. Anthropic formalised it as the set of strategies for curating and maintaining the optimal set of tokens during LLM inference.
All three definitions are pointing at the same thing: in production AI systems, what the model knows matters more than how you've phrased your instructions.
Why prompt engineering alone breaks in production
Prompt engineering works well for contained, self-sufficient tasks — summarisation, translation, extraction, single-turn Q&A. If the model already knows enough from training to answer and you just need to shape the output, a well-crafted prompt does the job.
The wheels come off once your system gets complex. Conversation history accumulates. Retrieved documents vary in quality. Tool outputs land in the context window. Agent state needs to persist across steps. At that point, you're not tuning a prompt anymore — you're managing an information architecture. And if that architecture is sloppy, no amount of prompt finesse will save you.
Research from 2025 made this concrete in an uncomfortable way. A Chroma study testing 18 LLMs found that model performance degrades measurably as context length grows — not because the model runs out of capacity, but because relevant information gets buried. A Databricks study found accuracy dropping around 32,000 tokens, well before advertised million-token limits. Stanford's "lost in the middle" research showed that LLMs systematically underweight information placed in the middle of long contexts, performing best when critical content sits at the start or end.
These aren't edge cases. They're structural properties of how transformers process information. And they mean that where you put things in the context window matters as much as what you put there.
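One practical response to the "lost in the middle" effect is to arrange retrieved content so the highest-ranked items sit at the edges of the context and the weakest items fall in the middle. A minimal sketch (the function name and input convention are my own; it assumes chunks arrive sorted most-relevant-first):

```python
def order_for_position_bias(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the
    context, where models attend most reliably, and push the least
    relevant ones toward the middle. Input is sorted best-first."""
    ordered = [None] * len(chunks_by_relevance)
    front, back = 0, len(chunks_by_relevance) - 1
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:       # ranks 1, 3, 5, ... fill from the front
            ordered[front] = chunk
            front += 1
        else:                # ranks 2, 4, 6, ... fill from the back
            ordered[back] = chunk
            back -= 1
    return ordered

print(order_for_position_bias(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# → ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The top two results end up at the positions the model weights most heavily, and the fifth-ranked chunk absorbs the middle penalty instead.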
What context engineering actually involves
It's less about writing and more about architecture. The core questions are: What gets retrieved and when? How is conversation history compressed before it becomes noise? What gets stored in long-term memory versus kept in the active window? How are tool outputs formatted so the model can reason over them rather than just read them?
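The history-compression question can be sketched concretely: keep the most recent turns verbatim and collapse everything older into a single summary entry. This is a simplified illustration, not a library API; the message-dict shape and the injected `summarise` callable (in practice an LLM call) are assumptions:

```python
def compress_history(turns, keep_recent=4, summarise=None):
    """Keep the last `keep_recent` turns verbatim; collapse older
    turns into one summary entry so history stops accumulating as
    noise. `summarise` would normally be an LLM call; a placeholder
    string is used when none is supplied."""
    if len(turns) <= keep_recent:
        return list(turns)
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarise(old) if summarise else f"[summary of {len(old)} earlier turns]"
    return [{"role": "system", "content": summary}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compact = compress_history(history, keep_recent=3)
# compact is 4 entries: one summary plus the last three turns
```

The design choice worth noting: compression is lossy by construction, so anything that must survive indefinitely belongs in long-term memory, not in the summarised window.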
In practice, this means treating your RAG pipeline as an active curation system rather than a lookup table. It means chunking strategies that preserve semantic coherence, not just character limits. It means metadata schemas that let you filter before you retrieve, so irrelevant context never reaches the model in the first place. And it means monitoring what lands in your context window the same way you'd monitor what lands in a production database — because garbage in still produces garbage out, just more fluently.
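The filter-before-retrieve idea can be shown in miniature: restrict candidates by metadata first, then rank only the survivors by vector similarity, so off-topic documents never compete for context space. A self-contained sketch under assumed data shapes (the `meta`/`embedding` dict fields and function names are illustrative, not any particular vector store's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, docs, filters, top_k=3):
    """Apply metadata filters first, then score only the remaining
    documents by similarity and return the top_k matches."""
    candidates = [d for d in docs
                  if all(d["meta"].get(k) == v for k, v in filters.items())]
    scored = sorted(candidates,
                    key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return scored[:top_k]

docs = [
    {"id": "A", "meta": {"source": "docs"}, "embedding": [1.0, 0.0]},
    {"id": "B", "meta": {"source": "web"},  "embedding": [1.0, 0.0]},
    {"id": "C", "meta": {"source": "docs"}, "embedding": [0.0, 1.0]},
]
hits = retrieve([1.0, 0.0], docs, {"source": "docs"}, top_k=1)
# → only "A": "B" scores identically but is filtered out before ranking
```

Note that "B" has a perfect similarity score yet never reaches the model, which is exactly the point: filtering happens before relevance is even computed.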
LangChain's 2025 State of Agent Engineering report found that 32% of organisations cited quality as the top barrier to scaling AI agents — and most traced those failures not to the model, but to poor context management.
The model is not the bottleneck. The information it receives is.