One Algorithm to Rule Them All?

In the quest for Artificial Intelligence, researchers have often dreamed of discovering a single Master Algorithm – one method that could solve any problem thrown at it, much like a Swiss Army knife ready for any task. This vision of a universally dominant model, however, runs up against two guiding principles in AI that at first glance seem to clash. On one hand, the No Free Lunch theorem mathematically assures us that there’s no such thing as a one-size-fits-all algorithm. On the other hand, Rich Sutton’s Bitter Lesson from decades of AI research argues that general-purpose approaches which make minimal assumptions tend to win out in the long run. Are these two principles contradictory, or do they actually complement each other in guiding us toward true artificial general intelligence (AGI)? To find out, let’s explore the paradox of the master algorithm through these dual lenses.

I. The “No Free Lunch” Theorem: No Universal Freebies

In the world of machine learning theory, the No Free Lunch theorem stands as a formal proof of a humbling truth: there is no universally best algorithm. First formulated by David Wolpert and William Macready in 1997, the NFL theorem shows that if you consider all possible problems, every learning algorithm has equal average performance. In Wolpert and Macready’s words, “if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems”. In plain terms, any gain an algorithm has on some tasks is offset by losses on others. There is no free ride and no all-conquering champion when we take the average over every conceivable problem.
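
Stated schematically, the theorem’s core claim (here in Wolpert and Macready’s notation, where $d^y_m$ is the sequence of objective values an algorithm has observed after $m$ evaluations of an objective function $f$) is that for any pair of algorithms $a_1$ and $a_2$:

$$\sum_{f} P(d^y_m \mid f, m, a_1) = \sum_{f} P(d^y_m \mid f, m, a_2)$$

Summed over all possible objective functions, the probability of observing any particular sequence of results is the same for every algorithm – which is exactly the “equal average performance” claim above.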

This theorem is both mathematically deep and intuitively understandable. It’s essentially telling us that specialization is the key to getting ahead on any particular task. Imagine an endless casino of games: no single strategy will win at every game consistently. If an AI algorithm is tuned to excel at one category of tasks, it must trade off performance in tasks that don’t fit those assumptions. The NFL theorem reminds us that any performance advantage comes from aligning with structure in the task – and if there’s no assumed structure, you can’t do better than random guessing on average. As a result, no algorithm can dominate on all possible data distributions. There is no free lunch in learning; you only eat what you (or your algorithm) can hunt.

II. Picking the Right Tool for the Job

An easy way to grasp the No Free Lunch principle is through analogy. Think of machine learning algorithms as tools in a toolbox. There’s no single “magic” tool in a craftsman’s kit that effortlessly handles every job. A hammer drives nails with ease but is lousy at cutting wood; for that you’d reach for a saw. A wrench tightens bolts but won’t help you paint a wall. Likewise, each AI algorithm is a specialized tool: a convolutional neural network shines in image recognition tasks, whereas a recurrent neural network or transformer might excel at sequence modeling. If you try to use a hammer when the task calls for a screwdriver, you’re going to have a bad time. The No Free Lunch theorem formalizes this intuition by proving that for every problem an algorithm solves well, there’s some other problem where it stumbles.

This is why in practice AI experts emphasize the importance of understanding the problem domain and choosing an algorithm whose built-in assumptions match that domain. For example, a CNN assumes that nearby pixels are related and that patterns in an image repeat across the image (translational invariance). These are excellent assumptions for vision – they exploit the local continuity of images – but the very same CNN architecture might perform poorly on a language task, where word order and long-range dependencies are key. A vanilla transformer-based language model, conversely, assumes nothing about locality and can attend to information across an entire sequence, which is great for text or code, but that flexibility comes at the cost of needing far more data (and computation) to learn patterns that a more tailored model might handle with less. In essence, each algorithm carves up the space of problems, thriving on those that fit its assumptions and floundering on those that defy them.
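
To make that contrast concrete, here is a minimal sketch (assuming PyTorch; the layer sizes are arbitrary) of how much the locality and weight-sharing assumptions of a convolutional layer buy you in parameter count, compared with a fully connected layer producing the same number of outputs:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a single 32x32 RGB "image"

# Convolution: local 3x3 receptive fields, weights shared across all positions.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Fully connected layer producing the same number of outputs,
# with no locality or weight sharing assumed.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

print(sum(p.numel() for p in conv.parameters()))  # 448 parameters
print(sum(p.numel() for p in fc.parameters()))    # ~50 million parameters
```

The convolutional layer can afford to be so compact only because it assumes images have local, translation-invariant structure – exactly the kind of assumption the NFL theorem says you must pay for elsewhere.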

So the NFL theorem isn’t just an abstract math result – it’s a guiding principle for practitioners: always pick the right tool for the job. Just as a skilled craftsman evaluates the task and selects a suitable tool, an AI practitioner considers the nature of their data and objectives when choosing a model. There’s no free lunch, no silver bullet algorithm that works best for everything. Trying to force one might be as futile as eating soup with a fork; better to use a spoon or, in the case of AI, to use (or design) an algorithm that “likes” the kind of patterns your problem exhibits.

III. Modern Generalists: LLMs and the Illusion of Universality

Given the No Free Lunch theorem, one might wonder: how do today’s giant models like GPT5, Gemini2.5, Claude3.7, Seed1.6, GLM4.5, and Qwen3 (to name just a few) seemingly excel at so many tasks? At first glance, Large Language Models (LLMs) – and their visual cousins, vision-language models (VLMs) – appear to be astonishingly all-rounded. They can code, write poetry, answer factual questions, translate languages, analyze images, and more. Do LLMs somehow break the No Free Lunch rule? Have we found an algorithmic Swiss Army knife that truly does it all?

The reality is that NFL still applies, even to these modern generalists. LLMs have such broad capabilities not because one model magically sidestepped NFL, but because they were trained on an immense variety of data, essentially sampling a very wide distribution of tasks. In effect, an LLM like GPT5 is specialized – not to one narrow task, but to the broad distribution of tasks found in its training data (a huge swath of the internet). It’s as if we forged a tool that’s particularly suited to the “meta-problem” of predicting the kind of text (or image) that humans produce. This gives it a kind of versatility across many language and knowledge tasks. But even this versatility has limits:

  • Specialists vs. Generalists: A general-purpose model often pays a price for its breadth. On highly specialized problems, a model or system fine-tuned specifically for that task will likely outperform a generic LLM. For example, a dedicated time-series forecasting model can outpredict a large language model on financial data trends, and a protein-folding model like AlphaFold will beat a general vision model at determining molecular structures. The LLM’s broad training is not free – it trades off some depth in any one area for breadth across many areas. Its “free lunch” on one class of tasks is paid for by sacrifices on tasks that fall outside its training distribution or expertise.
  • Architectural Biases: Even highly flexible architectures have built-in biases. The transformer architecture underlying most LLMs assumes that data can be processed as a sequence and that “attention” is the key operation. This bias is incredibly powerful for language and sequential data, but it might not be ideal for everything. Some problems – say, certain physics simulations, symbolic reasoning tasks, or graph-structured data – may be solved more naturally by other architectures. Modern transformers struggle with tasks requiring precise iterative logic or handling extremely long sequences (despite extensions like retrieval and sparse attention). In other words, the choice of architecture is itself an inductive bias, and no architecture is optimal for all possible tasks. LLMs just hit a sweet spot for an impressively large set of tasks, thanks to the richness of their training corpora and the power of attention, but they too have failure modes.
  • The Training Data Universe: Crucially, LLMs excel only in the universe of data similar to what they’ve seen. The moment you take them far out-of-distribution – ask about completely new physics beyond known science, or give gibberish inputs with no relation to natural language – their performance degrades. They do better at extrapolating than earlier narrow models, but they are not truly universally intelligent (not that human beings are, either). Their seeming all-roundedness is a testament to the broad coverage of their training data (and the underlying structures of human language and knowledge that that data reflects), rather than a repeal of the No Free Lunch theorem.

In summary, modern AI giants demonstrate that you can build very general systems – by training on very broad data and using scalable architectures – but the No Free Lunch insight lurks in the background: these models are only as good as the alignment between their training experience and the new task at hand. When that alignment breaks, so does their competence.

IV. Inductive Bias: The Hidden Assumptions Behind Success

At the heart of the No Free Lunch theorem is the concept of inductive bias – the set of assumptions that a learning algorithm makes about the data. Every model has some inductive bias; without it, learning would be impossible because the model wouldn’t know how to generalize from limited data. In fact, NFL can be interpreted as a blunt statement that an algorithm’s power comes from its biases being well-matched to the problem. If you have no bias (i.e. you assume nothing about the data), you can’t generalize at all. If you have the right bias, you can generalize amazingly well within that domain.

Consider how a CNN bakes in assumptions of locality and translation invariance for images. Those assumptions are a big reason why CNNs, even with relatively fewer parameters, outperformed earlier fully-connected networks on vision tasks – the CNN was essentially stealing a peek at the underlying structure of images (nearby pixels correlate, features can move in an image). Likewise, a recurrent neural network or transformer assumes that outputs depend on sequences of inputs, which aligns with temporal or linguistic data. These inductive biases act like shortcut bridges, allowing the model to learn the true signal from fewer examples by ruling out unreasonably complex or irrelevant patterns. As NFL dictates, this advantage on one class of tasks comes at a cost: if you feed an image CNN some non-image data where locality isn’t meaningful (say, a shuffled pixel array or a financial time series), its built-in assumptions become liabilities.
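
As a tiny illustration of that last point, here is a sketch (using NumPy; the random 32x32 array is just a stand-in for a real image) of how a single fixed pixel permutation preserves every value in the data while erasing the local correlations a CNN’s bias depends on:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))      # stand-in for a real image
perm = rng.permutation(32 * 32)   # one fixed permutation, applied to every image

# Shuffle pixel positions with the fixed permutation.
shuffled = image.reshape(-1)[perm].reshape(32, 32)

# Nothing is lost: the permutation is invertible...
unshuffled = np.empty(32 * 32)
unshuffled[perm] = shuffled.reshape(-1)
assert np.allclose(unshuffled.reshape(32, 32), image)
# ...but neighbouring entries of `shuffled` are no longer related, so the
# CNN's assumption that nearby pixels correlate now works against it.
```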

The practical lesson is that embracing the right assumptions is crucial. NFL doesn’t say “learning is futile”; it says you only gain performance through assumptions. This sounds limiting, but it’s actually empowering: it means we can design better algorithms by baking in the appropriate inductive biases for the problems we care about. Much of the art in machine learning is about finding a good bias–variance tradeoff: making enough assumptions to efficiently learn the pattern, but not so many that you’re inflexible or misaligned with reality. In other words, your key must fit the shape of the problem’s lock – too generic a key (no bias) and it opens nothing; too specialized and it opens only one particular door.
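
The classic toy demonstration of this tradeoff is polynomial curve fitting. Here is a short sketch (assuming NumPy and scikit-learn; the sine-wave data is invented for illustration) where too strong a bias underfits and too weak a bias overfits:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)  # noisy samples
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                             # the true signal

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    mse = np.mean((model.predict(x_test.reshape(-1, 1)) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {mse:.3f}")

# Typically the degree-1 fit underfits (too strong a bias), the degree-15 fit
# chases the noise (too weak a bias), and the mid-sized model tracks the
# sine wave best.
```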

From this perspective, the No Free Lunch theorem encourages us to incorporate knowledge of the domain at the algorithmic design level. If you know something about the problem – e.g. “images have spatial structure” or “language has grammar and long-range dependencies” – building an algorithm that assumes those properties can give you a massive head start. You get a “free lunch” relative to an uninformed algorithm within that domain, paid for by worse performance on some other domain where those assumptions don’t hold. In essence, the NFL trade-off is the price of making intelligent assumptions.

V. Rich Sutton’s Bitter Lesson: The Power of Generality

In stark contrast to the advice of choosing specific assumptions, Richard Sutton’s “The Bitter Lesson” (2019) delivered a provocative message: in the long run, the winners in AI have been those methods that rely less on built-in human knowledge and more on general-purpose learning that scales with computation. Sutton reviewed decades of AI progress and noticed a pattern: whenever researchers painstakingly added human insight or domain-specific tricks to their AI systems, those systems saw short-term gains – only to be eventually outperformed by approaches that made fewer assumptions and instead leveraged massive data or computation. This was the “bitter lesson” that many AI researchers, himself included, had to learn over and over.

Some classic examples make Sutton’s point crystal clear:

  • Chess and Go: Early chess programs in the 80s and 90s were loaded with handcrafted evaluation functions (expert knowledge of chess strategy). They improved gradually, but ultimately it was IBM’s Deep Blue – which still used some chess heuristics but leaned heavily on brute-force search – that defeated world champion Garry Kasparov in 1997. Traditional researchers grumbled that brute force “wasn’t how humans played,” but it worked. Fast forward two decades: AlphaGo and later AlphaZero took it further, using deep learning and self-play reinforcement learning with almost no human chess or Go knowledge beyond the basic rules. AlphaZero discarded all the specialized Go heuristics and still crushed the best human and machine players by relying on general learning and search, scaled with enormous computing power. The human-centric approaches, once deemed elegant, were rendered obsolete by a method that learned from scratch with enough computation.
  • Speech and Vision: In speech recognition, systems in the 1970s and 80s tried to build in knowledge of phonetics, the vocal tract, grammar rules, etc. They were quickly eclipsed by data-driven statistical methods (e.g. hidden Markov models) in the 1990s, which themselves were later outperformed by even more general deep learning models in the 2010s. Similarly, for computer vision, years of manually designed features (edge detectors, SIFT features, etc.) were swept away by convolutional neural networks that learned their own features from raw pixels, given enough data. Modern vision models use surprisingly minimal built-in knowledge – perhaps just the notion of local connectivity in CNNs – and gain everything else from learning at scale. The result? Dramatically better performance than any human-engineered features could achieve.

Sutton distilled these observations into a guiding principle: whenever we have tried to “help” the AI by pre-loading human knowledge or clever tricks, those efforts yielded only temporary benefits. The approaches that ultimately prevailed did so by being more generic and by harnessing the growing power of computation (per Moore’s Law) to learn. In his essay he writes that we must learn the bitter lesson that building in how we think we think does not work in the long run. It’s “bitter” because it’s somewhat humbling – it suggests that many of our clever, proud insights as researchers are actually dead ends. Instead, simple architectures that scale with data and compute (like deep neural networks, or massive search algorithms) end up finding better solutions than those we crafted by hand.

Sutton specifically highlights search and learning as the two broad methods that scale indefinitely with more computation. If you give an algorithm more compute, either it can search deeper (consider brute-forcing a problem space or running a genetic algorithm for more generations), or it can learn from more data (training a bigger network on a bigger dataset). These two capabilities – search and learning – are essentially the engines behind modern AI breakthroughs, and notably they are quite general-purpose. They don’t care what the problem is; they’ll just churn away given the resources. Contrast this with a human-knowledge-infused system, which might do well on the scenario it was tailor-made for, but cannot easily scale beyond or outside its narrow domain.
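
To make the “search scales with compute” half of that claim concrete, here is a toy sketch (pure NumPy; the bumpy objective is an invented stand-in for a hard problem) in which an entirely generic random search improves simply because we let it spend more evaluations:

```python
import numpy as np

def objective(x):
    # A bumpy 1-D function standing in for a hard problem with no known structure.
    return np.sin(5 * x) + 0.5 * np.cos(17 * x) - (x - 0.3) ** 2

rng = np.random.default_rng(0)
for budget in (10, 100, 10_000):
    candidates = rng.uniform(-2, 2, budget)   # no domain knowledge, just sampling
    best = objective(candidates).max()
    print(f"{budget:6d} evaluations -> best value found: {best:.3f}")

# The only thing that changes between rows is the amount of computation spent;
# the method itself knows nothing about the problem.
```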

The Bitter Lesson therefore advises us to focus on methods that minimize the baked-in assumptions and maximize the ability to leverage computation. It’s an argument for generality and for trusting data over preconceived models of the world. Instead of building an AI that we think reasons like a human, just build a big learning engine and let it discover what it needs from raw experience. This philosophy is one of the driving forces behind the success of deep learning and large models – we don’t hard-code rules for translation or image recognition or theorem proving; we use one general learner and feed it tons of data. And time after time, this strategy has yielded “unreasonably effective” results that surprise even the experts.

VI. Reconciling the Two Lessons: Map vs. Territory

At first glance, the No Free Lunch theorem and The Bitter Lesson appear to send conflicting messages. NFL says: you can’t have it all, you must specialize (i.e. build in assumptions aligned to your problem) to get superior performance. The Bitter Lesson says: don’t over-specialize your designs, the most general methods will ultimately win. Is this an outright contradiction?

On the surface, it is a paradox – almost a yin and yang of AI strategy. But the key to reconciling them is to realize they’re talking about different contexts and timescales. It’s like two maps of the world drawn at different scales: if you zoom out far enough, all algorithms look equal (NFL’s view); but in the actual territories we care about – the real problems and data of our universe – certain strategies are vastly superior (Bitter Lesson’s view). Let’s break down the apparent conflict:

  • No Free Lunch is a mathematical truth, but in an abstract world: NFL is proven on the assumption of averaging over all possible problems (often assuming a uniform distribution over problem space). This includes utterly bizarre, structureless problems along with the sensible ones. Under those conditions, it’s true that no algorithm can outperform another on average. It’s a statement about the limits of omnipotence: no one method can be best if you literally consider every conceivable task (many of which are random or adversarial). In that extremely broad sense, all algorithms are on equal footing. Think of NFL as a grand democratic law in algorithm-space – it says no algorithmic aristocracy shall exist when considering all worlds and all tasks.
  • The Bitter Lesson is an empirical truth about the problems we actually face in this universe: Sutton’s observation is not a formal theorem but rather a general trend noticed in our experience of AI research. Crucially, the problems we care about (playing chess, recognizing speech, driving cars, answering questions) are not a uniform random sample of all imaginable problems. They are highly structured and correlated; they follow the physics and regularities of our world and our human languages. In this realm, some methods are objectively better than others. The Bitter Lesson boldly claims that the methods which assume less and learn more have consistently overtaken those that relied on human insight or rigid structure. It’s as if within the space of real-world tasks, there is a kind of algorithmic aristocracy: a few general approaches stand above the rest because they can exploit the vast commonalities in the tasks we care about. In practice, not all lunches are created equal – our universe serves a menu of problems with rich patterns, and general learners have feasted on it.

The “contradiction” then dissolves when we realize that No Free Lunch includes all possible lunches, while the Bitter Lesson is only concerned with the lunches actually served on our table. NFL’s harsh guarantee applies in a fantasy land where every weird problem is as likely as a well-behaved one. But our reality is kinder: it offers many problems that share underlying principles (like continuity, locality, compositionality, causality), and a well-designed general algorithm can leverage these principles across tasks. In technical terms, the success of deep learning suggests that the distribution of real-world problems is highly biased and not adversarial or uniform. There are “free lunches” in practice – but only because we’re not averaging over evil, structureless tasks. We’re dining on the subset of meals that have nutritious structure, and a smart chef (general learning algorithm) can cook a lot of them well.

Another way to see it: No Free Lunch sets an unavoidable trade-off – you can’t get performance without assumptions. The Bitter Lesson agrees but says: choose as broad and general an assumption as possible, so that it applies to many tasks, and let computation fill in the rest. In effect, the Bitter Lesson is telling us what kind of inductive bias to favor: not a narrow, human-crafted one, but a simple, general bias that can mold itself to the data. For example, the neural network’s bias is “I can approximate any function given enough data and layers,” which is very general. The transformer’s bias is “attention is all you need (for sequences),” which turned out to be a remarkably flexible assumption, encompassing language, vision, and more with minimal tweaks. These biases are still biases – they don’t magically avoid NFL – but they are broad enough that they capture a huge swath of useful tasks, and thus there hasn’t yet been a need to swap them out for something else.

It helps to remember that NFL is not saying domain knowledge is useless, nor is Sutton saying inductive bias is evil. Instead, Sutton warns against overly specific, human-driven biases that don’t scale, whereas NFL warns against mindlessly trusting one approach everywhere. When Sutton advocates general methods, he is still implicitly choosing an inductive bias – just one that is very general (like “use a big deep network and gradient descent”). This choice is guided by the insight that such a bias, albeit generic, aligns well with a wide range of real-world problems. The paradox resolves because the Bitter Lesson’s favored methods do pay a cost on some imaginable tasks (they wouldn’t fare well on purely random data, for instance), but we willingly pay that cost because those tasks are rarely of interest. In short, the two principles operate at different levels: NFL is about the theoretical limits across all possible worlds, while the Bitter Lesson is about practical strategy in this world.

To use a metaphor: NFL is like a law of physics (say, conservation of energy) – it’s always true, but it doesn’t tell you which engine will win the race. The Bitter Lesson is like engineering advice – it tells you which engines (algorithms) have historically proven to go fastest given the fuel (compute and data) we have, even if no engine can break the ultimate speed limit. They aren’t truly at war; they’re just speaking to different aspects of the AI journey. An enlightened AI practitioner carries both lessons: know that no single model can magically solve everything by default, yet also know that betting on general, scalable methods will yield the most progress on the problems we care about.

VII. Toward the Ultimate Learner: Fewer Assumptions, More Intelligence?

The interplay of NFL and the Bitter Lesson invites a tantalizing question about the future: is it possible that we will discover an approach with virtually no hand-designed assumptions at all – a learner so general that it figures out how to solve any problem (at least any in our universe) purely from data? In other words, could there be something close to a universal algorithm for intelligence? And if so, would that contradict the No Free Lunch theorem, or simply be an ingenious exploitation of the structure of our world?

Current trends in AI suggest that we are moving toward fewer human-imposed assumptions at the high level. We already let neural networks learn low-level features instead of designing filters by hand. We use the same architectures (like transformers) across vision, language, and audio, instead of crafting different models for each modality – a convergence toward generality. But some assumptions still remain: for instance, we still fix the network architecture itself and the learning rule (backpropagation) by hand. The Bitter Lesson’s extrapolation would be: maybe even those choices could be made by the AI itself or replaced with more general processes. Here are a few frontiers that researchers are exploring, which push the boundary of making the “how to learn” part more flexible:

  • Evolving Architectures: Instead of a human designing a neural network’s structure, we can use evolutionary algorithms or neural architecture search to evolve network topologies. Given enough compute, an evolutionary process can discover architectures optimized for the task, a bit like how nature evolved brains. This approach reduces our bias about what the network should look like. In principle, one could imagine an AI that evolves a specialized sub-network for each new kind of problem it faces, on the fly. We already see glimpses of this in automated machine learning (AutoML) and architecture search systems. (A toy sketch of such an evolutionary loop follows this list.)
  • Meta-Learning and Learning to Learn: Why fix the learning algorithm (such as gradient descent) in stone? Meta-learning approaches have AI systems learn their own update rules or training strategies. For example, one model can train another, or a model can adjust its own learning rate and mechanism based on experience. The ultimate meta-learning scenario is an AI that figures out how to improve itself without our guidance – it would be essentially writing its own learning code as it goes. This reduces the assumption that gradient descent (or any specific optimizer) is the best way to learn; maybe the AI can invent a better one for its purposes.
  • General Problem-Solving Programs: Another avenue is the quest for algorithms like AIXI (a theoretical formulation of a super-general AI by Marcus Hutter) or other universal learners that, given enough computation, can attain optimal behavior in any computable environment. AIXI, for instance, is an optimal agent in a very general sense – it treats the problem of learning as essentially a brute-force search through all possible strategies, weighted by simplicity. The catch? It’s uncomputable in practice and astronomically expensive even in approximations. But it serves as a proof-of-concept that, if we drop all efficiency constraints, a sufficiently general approach can outperform any specialized one, given enough time. No Free Lunch isn’t violated here; AIXI doesn’t excel on literally every possible world (it has implicit assumptions like computability and simplicity biases), but it’s as close to a universal problem-solver as one can define mathematically.
  • Integration of Modalities and Objectives: We’re also trending toward systems that don’t assume a single task or modality. The same model can handle vision, language, robotics, etc., given multi-modal training (e.g. a model that sees images and reads text and controls robots). Such a model has to discover for itself how concepts in one domain relate to another (like connecting the word “cat” to images of cats). Every time we successfully combine domains in one learning system, we peel away another assumption (like “text and images must be handled by separate mechanisms”). A truly assumption-light AI might take as input everything – all sensor modalities, all feedback signals – and simply learn whatever can be learned.
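
To make the first item in the list above concrete, here is a deliberately toy sketch of such an evolutionary loop. Everything in it is invented for illustration – in particular, the fitness function is a stand-in; a real neural architecture search system would score each candidate “genome” (here, just a list of hidden-layer widths) by actually training and validating it:

```python
import random

def fitness(genome):
    # Stand-in for "train the network and measure validation accuracy":
    # this toy task happens to prefer two hidden layers of width 64.
    return -sum((width - 64) ** 2 for width in genome) - 10 * abs(len(genome) - 2)

def mutate(genome):
    genome = list(genome)
    if random.random() < 0.3 and len(genome) < 4:
        genome.append(random.choice([16, 32, 64, 128]))            # grow a layer
    elif random.random() < 0.3 and len(genome) > 1:
        genome.pop(random.randrange(len(genome)))                  # drop a layer
    else:
        i = random.randrange(len(genome))
        genome[i] = max(8, genome[i] + random.choice([-16, 16]))   # resize a layer
    return genome

random.seed(0)
population = [[random.choice([16, 32, 64, 128])] for _ in range(8)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]                                       # keep the fittest
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]

print("best architecture found:", max(population, key=fitness))
```

No human decided how many layers to use; the loop discovered a structure that fits the (toy) task, which is the spirit of reducing our architectural bias.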

If one day we do arrive at a system that feels like a general problem solver, it will likely be because we’ve found a core set of assumptions (or a meta-method) so general that it applies to essentially all tasks we care about. Perhaps that core method will be something like “predict the future in a self-improving loop” or some form of general reasoning engine – we don’t know yet. But importantly, even such a master algorithm would not defy the No Free Lunch theorem; rather, it would cleverly capitalize on the structure of our universe. It would be the ultimate free lunch only in the sense that our universe’s lunch menu has a lot of recurring ingredients. If you imagine a universe with completely random rules every moment, no single AI could handle that any better than another (NFL still reigns). But in our world, physics is consistent, chemistry is stable, biology has patterns, human culture has redundancies – in short, our world has regularity. A sufficiently powerful general learner would be one whose inductive bias is essentially “reality has patterns; find them”. That might be the closest thing to a universal assumption that isn’t trivial.

In a way, the Bitter Lesson suggests that we should let the AI discover those deep patterns itself, rather than us trying to pre-program them. And if we succeed, we will have paid for that success by making our own role almost minimal – which is a bit bitter for our egos, but sweet in its result. The No Free Lunch theorem would politely nod in the background, noting that our master learner is phenomenally good in our realm, even if it’s not magic in absolute mathematical terms. We will have essentially chosen the right bias – the bias of very general learnability – and the universe rewarded us.

VIII. Embracing the Paradox

The journey through No Free Lunch and the Bitter Lesson leads us to an enlightened perspective: these two principles are not enemies, but rather complementary guides on the path to AI. The No Free Lunch theorem keeps us grounded – it’s a reminder not to chase unicorns, not to assume we can solve everything with a single trick. It teaches humility: if your model is doing great on some problems, ask yourself what price you paid in assumptions and what situations might foil it. It also encourages diversity in our toolkit; just as biodiversity strengthens an ecosystem, a diversity of algorithms and approaches strengthens the field of AI, because we have different tools for different challenges.

The Bitter Lesson, in turn, keeps us aspirational – it urges us not to get too attached to our clever biases and instead to build machines that can outgrow our knowledge. It’s a call to focus on methods that learn and methods that scale, because those are the engines that have driven every major leap of the past decades. It teaches patience and boldness: patience to allow a generic learner to figure things out from data, and boldness to allocate massive computation to a simple method that theoretically could solve the problem given enough resources. This lesson has a certain faith to it – faith that, ultimately, truth lies in the data, and that a sufficiently large and flexible model will uncover that truth without needing our hand-holding.

Are the two principles truly contradictory? In the purest abstract sense, one could say yes – if you took NFL to mean “no method can be better generally” and the Bitter Lesson to mean “general methods are better,” they clash. But by now we understand the nuance: the Bitter Lesson operates within the space of structured, earthly problems, where general methods are better (for those problems), whereas the NFL theorem whispers about the broader infinite landscape that includes unstructured madness as well. It’s a map vs. territory issue – the map (mathematical possibility) is vast and flat, but the territory (our reality) has mountains and valleys where certain routes (algorithms) are simply superior.

In practice, the savviest approach to AI is to follow both lessons. We should be strategic about our inductive biases (NFL’s advice) – choose them wisely to suit the problem – but we should prefer biases that are as broad and general as possible (Bitter Lesson’s advice), so that our solutions remain powerful and adaptable. For example, choosing the Transformer architecture for an NLP task is guided by domain knowledge (sequential data needs an appropriate inductive bias), echoing No Free Lunch; but the Transformer is a very general sequence learner, not a collection of hard-coded linguistic rules, echoing the Bitter Lesson. We still made a high-level assumption (that attention-based sequence modeling is a good approach), but we didn’t try to micromanage the solution. Instead, we let the model learn the details from data. This combination – a sensible high-level bias plus massive learning capacity – is behind many of the best AI systems today.

Looking ahead, as we flirt with the idea of a true Artificial General Intelligence, we will constantly balance on this tightrope. If we inject too much human wisdom into the design, we might box the AGI into our limited understanding (and history shows that’s likely suboptimal). If we make the system too unstructured, we may drown in infinite possibilities and need unreal amounts of data to learn anything. The sweet spot will be an architecture that embodies just the right level of inductive bias – perhaps only the very fundamental assumptions of our universe – and everything else it figures out by itself. Achieving that will feel like magic, but it will really be the culmination of these principles: paying for performance with the cleverest, most universal bias we can find.

In conclusion, the No Free Lunch theorem and the Bitter Lesson are not so much contradicting as illuminating different facets of the AI endeavor. NFL reminds us there is no effortless mastery – every gain has a trade-off – while the Bitter Lesson suggests a pragmatic way to maximize those gains across many domains by minimizing our preconceptions. Together, they counsel us to be both wise and bold: wise in understanding the limits, bold in pushing them. If there is a “free lunch” to be had in our universe, it will come only to those who deeply understand what makes a lunch not free in the first place. Perhaps the ultimate free lunch in AI will be won by fully embracing the bitter recipe of generality – a paradox we may happily live with as our machines grow ever more powerful and intelligent.

In the end, we find that the road to AI greatness lies in embracing the paradox: no free lunch, except maybe the one you earn by swallowing the bitter lesson.