
AI Needs Tools, Not Just Intelligence

There’s a blog post by Daniel Stenberg titled “The I in LLM stands for Intelligence.” The joke is that there is no I in LLM.

He’s right. LLMs are prediction machines. You feed them a sequence of tokens and they predict what comes next. Everything that looks like “thinking” is pattern matching.

Ask one to explain quantum mechanics and it’ll impress you. Ask it how many R’s are in “strawberry” and it’ll confidently say 2. The model can’t see letters. It sees tokens. Asking it to count characters is like asking someone to count bricks through frosted glass.

What you see:

Would I rather be feared or loved? Easy. Both. I want people to be afraid of how much they love me

What an LLM sees:

[the same sentence, but chopped into tokens rather than individual characters]

Token-based prediction and character-level analysis are different operations. More training data won’t fix this. Bigger models won’t fix this.

But a one-line function just might.

Give it a calculator

// Count occurrences of a letter by splitting the word on it.
function countLetter(word, letter) {
  return word.toLowerCase().split(letter.toLowerCase()).length - 1;
}

Now the LLM doesn’t need to count. It needs to recognize “this is a counting question” and call the function. That’s a language task, the one thing it’s good at.

calls countLetter("strawberry", "r") → 3

Correct. Every time.
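For this to work, the model has to know the function exists. Here's a rough sketch of how countLetter might be described to it; the JSON-schema-ish shape is typical of chat APIs, but the exact wrapper varies by provider, so treat the names below as illustrative rather than any particular SDK.

// Sketch: a tool description the LLM can see. It never runs this code;
// it only decides, from the description, when to ask for it.
const countLetterTool = {
  name: "countLetter",
  description: "Count how many times a letter appears in a word",
  parameters: {
    type: "object",
    properties: {
      word: { type: "string", description: "The word to inspect" },
      letter: { type: "string", description: "The single letter to count" },
    },
    required: ["word", "letter"],
  },
};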

Same story with math. Ask an LLM “what’s 76447 × 1254?” and it might get close. Or it might not. It’s predicting digits, not computing them. Hand it a function:

// Route a named operator to the matching arithmetic function.
function calculate(a, b, operator) {
  const ops = {
    add: (a, b) => a + b,
    subtract: (a, b) => a - b,
    multiply: (a, b) => a * b,
    divide: (a, b) => a / b,
  };
  return ops[operator](a, b);
}

calls calculate(76447, 1254, "multiply") → 95,864,538

The LLM didn’t get smarter. It got access to a tool.

A real example: The Office

Counting letters and multiplying numbers are toy problems. Here’s a better one.

I have IMDb data for all 188 episodes of The Office: ratings, seasons, vote counts. When I ask a bare LLM "which episode of The Office (US) is the best?", it will probably name whichever episode got talked up in some Reddit thread.

It might land on the right answer, but swap in a question about your company's internal data and the picture changes: the model has no access to it and will confidently hallucinate one. You can't trust the answer, you can't verify it, and you can't reproduce it.

Give it a SQLite database and let it write queries instead:

Q: What are the top rated episodes?

SELECT title, season, episode_num, imdb_rating
FROM episodes
ORDER BY imdb_rating DESC LIMIT 3;
title             season  episode_num  imdb_rating
Goodbye, Michael  7       21           9.8
Finale            9       23           9.8
Stress Relief     5       13           9.7

Q: What is the average rating for each season?

SELECT season, ROUND(AVG(imdb_rating), 2) AS avg_rating
FROM episodes
GROUP BY season
ORDER BY avg_rating DESC;
season  avg_rating
3       8.59
4       8.56
5       8.49
2       8.41
7       8.29
6       8.18
1       7.92
9       7.89
8       7.56

Season 3 wins. Not because the AI hallucinated a plausible answer, but because it ran a query you can read, check, and rerun.

The LLM didn’t learn anything new about The Office. It learned how to ask a database.
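The glue code for this can be tiny. Here's a minimal sketch, assuming Node with the better-sqlite3 package and a local office.db file (my assumptions, not necessarily the setup behind the tables above):

// Sketch: run model-written SQL against a local SQLite file.
// Assumes better-sqlite3 and an episodes table like the one above.
const Database = require("better-sqlite3");
const db = new Database("office.db", { readonly: true });

function runQuery(sql) {
  // Basic guardrail: only let the model read, never write.
  if (!/^\s*select\b/i.test(sql)) {
    throw new Error("Only SELECT queries are allowed");
  }
  return db.prepare(sql).all();
}

console.log(runQuery(
  "SELECT season, ROUND(AVG(imdb_rating), 2) AS avg_rating " +
  "FROM episodes GROUP BY season ORDER BY avg_rating DESC"
));

Because the queries are plain text, every answer ships with the exact SQL that produced it, which is what makes the result verifiable and reproducible.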

The pattern

Without tools, the entire pipeline lives inside the LLM. Every step, including the ones that need precision, runs through probabilistic prediction.

User → LLM: "Which season has the highest rating?"
LLM: parse question → recall training data → guess the answer → generate response
LLM → User: "Season 4, I think?"

The weak link is step 3. The LLM has to guess because it has nothing else to work with. The result is plausible but unverifiable.

Adding a tool splits the flow. The LLM handles language, the tool handles computation, and they meet back in the middle.

User → LLM: "Which season has the highest rating?"
LLM: parse question → write SQL query
LLM → SQLite: SELECT season, AVG(imdb_rating)...
SQLite → LLM: Season 3, 8.59
LLM: explain result
LLM → User: "Season 3, average rating 8.59"

Four out of five steps are language. The tool handles the one step that needs to be exact. The LLM becomes an orchestrator: routing intent to the right system and translating the result back into plain language.
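In code, that orchestration step can be surprisingly small. A minimal sketch, where the tool-call object is only an illustration of what a model might emit:

// Sketch: the model only names a tool and its arguments;
// deterministic code does the exact work.
const tools = {
  countLetter: (word, letter) =>
    word.toLowerCase().split(letter.toLowerCase()).length - 1,
  // calculate, runQuery, ... would be registered the same way
};

// Illustrative shape of what the model emits instead of a final answer:
const toolCall = { name: "countLetter", args: ["strawberry", "r"] };

const result = tools[toolCall.name](...toolCall.args);
console.log(result); // 3, which the model then phrases as a sentence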

Small model, right tools

Here’s something that sounds wrong but isn’t: a 7B parameter model with a calculator, a database, and a web browser will beat a 70B model with nothing.

The 70B model knows more. It has better reasoning. But ask it “what’s the current EUR/USD exchange rate?” and it guesses from training data that’s months old. The 7B model calls an API and gives you the live number.
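That live-data tool can be as small as a wrapper around fetch. The endpoint below is a placeholder, not a real API; any exchange-rate service or internal backend would slot in the same way:

// Sketch: a tool for live data. The URL is hypothetical.
async function getExchangeRate(base, quote) {
  const res = await fetch(`https://api.example.com/rates?base=${base}&quote=${quote}`);
  if (!res.ok) throw new Error(`Rate lookup failed: ${res.status}`);
  const data = await res.json();
  return data.rate; // whatever EUR/USD is right now, not months ago
}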

Parameters store knowledge; tools provide access to live data.

If you’re building a product, better tooling for a smaller model will get you further than throwing money at a bigger model with no tools.

OpenClaw and the proof at scale

This thesis played out in public. Peter Steinberger built OpenClaw, a WhatsApp relay connecting a standard LLM to email, calendars, web browsing, and file systems. Nothing special about the model. The tools made it useful. It hit 140,000 GitHub stars, OpenAI hired the creator, and the same model that hallucinates TV trivia is now booking flights.

Count letters? Give it a function. Query a database? Give it SQL. Manage someone's digital life? Give it access to their apps. Each layer of tooling multiplies what the LLM can do. The security concerns are real, but the trajectory is clear: the value of these models scales with what you connect them to.

The goal of AI: Fake it till you make it

The I in LLM still doesn’t stand for intelligence. These models predict tokens. They don’t compute. They don’t verify. They don’t look things up.

But prediction plus the right tool at the right moment looks a lot like intelligence from the outside. The model recognizes it needs help, picks the right function, formats the right input, and explains the output. That workflow is where the value lives.

An LLM alone is like a brilliant person locked in a room with no calculator and no internet. Give it tools and the gap between "impressive demo" and "reliable system" closes fast.

We’re the ones building those tools. That’s where the real work is.
