AI Engineering Report

Beautiful Landing Pages with Nano Banana Pro

Bill Prin — Thu, 18 Dec 2025 18:35:54 GMT

Coding agents like Claude Opus 4.5 and Gemini 3 Pro can now generate decent-looking and functional UIs, but the designs they produce are often bland and uninspiring, falling back on the same overdone patterns like purple gradients. This can be improved with smart prompting strategies, but there’s still a creative limit.

If you stick purely with coding agents for UI design, you are missing out on the massive improvements you can get by combining language models with image models to get the best of both worlds. This is 10x more true than ever since Google released Nano Banana Pro, an incredible image model.

Here’s a comparison of two landing pages for an AI coding course, the first generated by Gemini Pro and the second by Nano Banana Pro.

The page generated by Nano Banana is much more creative and visually appealing. Additionally, it’s faster to iterate because you can generate images faster than a coding agent can code a new page. Here’s another landing page for the same concept we quickly generated:

Nano Banana Pro is particularly exciting because it’s much better than previous image models at UI design, which often struggled to render text or comply with the prompt on small details.

Let’s talk about a workflow to combine language and image models for landing page design.

Anthropic’s Frontend Design Skill

Before we dive into the Gemini + Nano Banana workflow, it’s worth reviewing Anthropic’s “frontend design” Claude Skill, which is open-source on GitHub, and can be installed as a plugin with the following commands:

/plugin marketplace add anthropics/claude-code
/plugin install frontend-design@claude-code-plugins

This skill was created precisely to solve the “purple gradient AI UI slop” problem, and it does a decent job at pushing the agent to do something a bit more unique and creative.

What I find especially fascinating is that the skill doesn’t provide that much specific guidance but instead provides a framework for the model to apply “design thinking” and to commit to a creative aesthetic.

If you’re doing UI coding in Claude Code, I’d consider installing this plugin and skill mandatory. But while it significantly improves the UIs that Claude generates, the creative output is still significantly short of what Nano Banana can create.

Building an AI Design Workflow

The important takeaway here from Anthropic’s frontend skill is something most devs who have done a lot of AI coding already know - LLMs are much better at complex tasks if they first build a plan for that task, then execute that plan.

Design is no different. So our workflow will be something like:

Use the “design thinking” prompts to build a plan for the landing page we want to make. Besides applying design thinking principles, this is also where we ask it to decide on a hierarchy, typography, color scheme, and general creative aesthetic. Optionally, include reference images for aesthetic inspiration.
Ask the language model to generate ASCII mocks so we can quickly iterate on specific copy and features of the landing page we want.
Now provide this design plan to the image model, Nano Banana Pro. This is an important step, as image generators will create much more visually interesting designs than a coding agent, which tends to avoid anything visually complex.
Once we have an image we like, we often want to extract smaller assets out of the full design. We can ask the image model to remove various components until we’ve isolated the assets we need.
Finally, we provide a coding agent with our reference image and isolated assets and have it code up the final page.

This approach lets us iterate on the design more efficiently, as having a coding agent code up a landing page is the slowest way to try a new design. With this system, we first iterate on the components on the landing page with fast text discussion and ASCII mocks. Then, we iterate on the visual design with the image generator, which is still much faster than coding up a full page. Finally, once we’re happy with the look and feel of our page, we have the coding agent actually build it.

1. Planning the Design with the Language Model (Gemini, Claude, or ChatGPT)

For iterating with Gemini and Nano Banana, I mostly stuck with Google AI Studio where it’s easy to switch back and forth between the two.

Here is the system prompt I used for Gemini 3 Pro to plan the design (link). Similar to the Anthropic frontend plugin, it encourages design thinking and creating something unique.

Then, I prompted Gemini 3 Pro to give me a plan for a landing page. Here’s my second user prompt. (link)

I also provided additional reference images, which are optional but could help. This is an opportunity to find visual inspiration online on sites like Dribbble, Mobbin, or Reeoo. You can also use any stock photography site or inspiration site like Pinterest if you just want to provide an aesthetic inspiration. And you can provide a reference image of an existing version of the page if you want to stick to that structure.

But what’s important is that we explicitly ask for a landing page plan with this part of the prompt:

Plan the design, specifically regarding:
Ratio of the section
Layout, spacing, and white space
Texture and backgrounds
Animations
Be extremely creative

This lets the language model use its knowledge of design, layout, and hierarchy to create a structure for our landing page:

2. Iterate on the basic landing page structure in ASCII

As mentioned, we can start to discuss the basic look and feel of the landing page structure with ASCII mocks. This is going to be the fastest way to iterate on the basic structure.

3. Switch to an image model like Nano Banana and use your design plan as the input

Now that we have a plan and landing page mocks we’re happy with, we can switch our model from Gemini to Nano Banana Pro and provide this instruction:

help me design a UI mock for the hero section, output image

From there, we can iterate on a design that looks good:

4. Extract Visual Assets The Coding Agent Will Need

As you can see in the example above, Nano Banana can generate great-looking visual assets that a coding agent would never make, but it generates a single image. For our coding agent to build this page, it might need specific assets.

Fortunately, we can talk to Nano Banana and ask it to remove various pieces until we have just the individual assets that we can use.

Another optional way to spice up this step is to use image-to-video models like Google Veo to generate short video animations that could make your landing page really pop.

5. Have the coding agent code it up!

Now we have a landing page design plan and design system generated in step 1, we have a final reference image that we’re happy with in step 3, and we have individual creative assets the coding agent can use generated in step 4.

Now, we can go back to Claude Code, Cursor, or our favorite coding agent or IDE and implement the landing page. The coding agent will have a clear plan and a clear visual reference. Just as importantly, you have a clear plan and a clear visual reference to make sure the page looks the way you wanted it to.

An interesting area to explore here is using the Playwright MCP to try to get your agent to build the app until the screenshots of the app match the reference screenshots.
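
To make "match the reference" concrete, the agent needs some measurable notion of closeness. Here is a minimal sketch of one possible metric, assuming both screenshots have already been decoded into rows of RGB tuples at the same size. The function names, the tolerance, and the 5% threshold are all hypothetical, and a real setup would decode the PNGs with an imaging library first.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels, already decoded (assumption)


def mismatch_ratio(reference: Image, candidate: Image, tolerance: int = 16) -> float:
    """Fraction of pixels where any channel differs by more than `tolerance`."""
    same_shape = len(reference) == len(candidate) and all(
        len(ref_row) == len(cand_row) for ref_row, cand_row in zip(reference, candidate)
    )
    if not same_shape:
        raise ValueError("screenshots must have identical dimensions")

    total = 0
    mismatched = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref_px, cand_px in zip(ref_row, cand_row):
            total += 1
            if any(abs(a - b) > tolerance for a, b in zip(ref_px, cand_px)):
                mismatched += 1
    return mismatched / total if total else 0.0


def close_enough(reference: Image, candidate: Image, max_ratio: float = 0.05) -> bool:
    """Hypothetical stopping rule: at most 5% of pixels may differ."""
    return mismatch_ratio(reference, candidate) <= max_ratio
```

A generated mock will rarely match a coded page pixel for pixel, so a loose threshold (or a visual judgment by the model itself) is probably more realistic than an exact comparison.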

Wrapping It Up

I copied most of this workflow from AI Jason’s YouTube video “Nano Banana + Gemini 3 = S-Tier UI Designer”. My mind was blown watching his video and trying it out on various projects of mine to great success, so I felt compelled to share.

I’m a typical programmer who has lots of fun ideas, but like many programmers, I’ve struggled with the visual aesthetic aspect of app design. And anyone who’s published any sort of software online knows that your apps looking good often matters a lot.

What I love about this workflow is that it leverages both language models and image models to get the best of both worlds. With the language model, you get a cohesive plan and design system that you can re-use throughout your project. With the image model, you get the maximum amount of visual creativity. Then finally, you head back to your coding agent to get it built for real.

Building professional-looking software is more accessible than ever, but there are still plenty of challenges in optimizing design workflows like these. I’m hoping to dive deeper on this topic, especially around building more complex UIs, streamlining the process, and experimenting with tools like Playwright MCP to match the agent’s work to the reference image.

I hope this topic fascinates you as much as it does me. Happy holidays and I look forward to sharing more AI coding experiments in 2026!


Was MCP a mistake? The internet weighs in

Bill Prin — Wed, 19 Nov 2025 19:06:21 GMT

MCP is pissing a lot of people off, which may seem surprising for a software protocol. Earlier this month, Anthropic published a blog post on how to make MCP 98.7% more token efficient with code execution. You’d think this would be a good thing, but large parts of the community don’t see “MCP with code execution” as an improvement to MCP so much as proof that MCP never should have existed at all. And honestly? They’re making some pretty strong points.

First, let’s cover the problems Anthropic is highlighting in their blog post.

Two Problems with MCP that Anthropic Highlights

The first problem is that MCP tools eat up a lot of context, which both limits how much additional context you can fit into your window for actually doing things and raises your LLM usage and cost. As I tweeted earlier, using just two MCP tools, I already eat up 20% of my context window, before I’ve even done anything!

The second problem Anthropic highlights is that MCP tools are not composable. Anthropic gives the example of pulling a transcript from Google Drive and uploading that transcript to Salesforce. The problem is that in a real coding environment, you could store that long transcript in a variable, but since MCP has no concept of variables, the entire transcript is passed between the two different tools as text. If the transcript is long, this could eat up a ton of tokens.

Anthropic’s Solution - And Why The Community Is Skeptical

Anthropic’s proposed solution is to make MCP more code-oriented. To solve the first problem that MCP tools take up too much context, the proposed idea is for MCP tools to expose a much smaller surface area meant only to discover options, and have the MCP client query that to determine what’s available as necessary.
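
Here is a rough sketch of that discovery idea, not Anthropic's actual design. The tool names and schemas are invented, the function names are hypothetical, and character counts stand in for tokens. It compares loading every tool definition up front with listing names first and fetching a full definition only when needed.

```python
import json
from typing import Dict, List

# Hypothetical tool catalog; names and schemas are invented for illustration.
TOOLS: Dict[str, dict] = {
    "github.create_issue": {
        "description": "Create an issue in a repository.",
        "parameters": {"repo": "string", "title": "string", "body": "string"},
    },
    "github.list_pull_requests": {
        "description": "List pull requests, optionally filtered by state.",
        "parameters": {"repo": "string", "state": "open | closed | all"},
    },
    "firebase.deploy": {
        "description": "Deploy the selected targets of a Firebase project.",
        "parameters": {"project": "string", "only": "string"},
    },
}


def list_tools() -> List[str]:
    """Cheap discovery call: names only, no schemas."""
    return sorted(TOOLS)


def describe_tool(name: str) -> dict:
    """Full definition, fetched only when the agent decides it needs the tool."""
    return TOOLS[name]


def upfront_cost() -> int:
    """Characters placed in context when every definition is loaded at startup."""
    return len(json.dumps(TOOLS))


def on_demand_cost(needed: List[str]) -> int:
    """Characters placed in context when only the needed definitions are loaded."""
    names = len(json.dumps(list_tools()))
    details = sum(len(json.dumps(describe_tool(name))) for name in needed)
    return names + details
```

With three tiny tools the saving is modest, but the gap grows with every tool and every lengthy description a server exposes.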

To solve the second problem of composability, the proposed solution is to have MCP clients use the discovered MCP surface areas and write code to interact with it. In our example above including Google Drive and Salesforce, you can imagine the MCP client writing a short script with a variable to capture the transcript, and now the context has a short variable name instead of a long transcript.
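
A minimal sketch of the script an agent might write for that example. Both tool wrappers are hypothetical stand-ins (the real ones would call Google Drive and Salesforce), and the document and record IDs are made up. The point is only that the transcript lives in a variable and the agent's context sees a one-line summary.

```python
from typing import Dict

# Hypothetical stand-in for a Salesforce backend.
SALESFORCE_RECORDS: Dict[str, str] = {}


def get_document(document_id: str) -> str:
    """Hypothetical Google Drive wrapper; returns a long fake transcript."""
    return "Speaker A: hello. " * 2000


def update_record(record_id: str, notes: str) -> None:
    """Hypothetical Salesforce wrapper; stores the notes on a record."""
    SALESFORCE_RECORDS[record_id] = notes


def attach_transcript(document_id: str, record_id: str) -> str:
    # The transcript stays inside the execution environment as a variable...
    transcript = get_document(document_id)
    update_record(record_id, transcript)
    # ...and only this short summary is returned to the model's context.
    return f"Attached {len(transcript)} characters to {record_id}"


summary = attach_transcript("doc-1", "lead-42")
```

Here `summary` is a few dozen characters, while the transcript it describes is tens of thousands.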

If you don’t think too hard about it, this all sounds appealing. MCP has problems, Anthropic has proposed ways to improve MCP to fix those problems. But there’s a bigger picture issue - if LLMs can write code, why do we need MCP at all?

Steve Krouse took to Twitter to make the following points:

LLMs were bad at writing JSON
So OpenAI asked us to write good JSON schemas & OpenAPI specs
But LLMs sucked at tool calling, so it didn’t matter. OpenAPI specs were too long, so everyone wrote custom subsets
Then LLMs got good at tool calling (yay!) but everyone had to integrate differently with every LLM
Then MCP comes along and promises a write-once-integrate everywhere story.
….
Now this next part sounds like a joke, but it’s not. They generate a TypeScript SDK based on the MCP server, and then ask the LLM to write code using that SDK

Are you kidding me? After all this, we want the LLM to use the SAME EXACT INTERFACE that human programmers use?

Let me paraphrase Steve’s point here: a big reason that MCP came into being in the first place was that LLMs were not very reliable at writing code to an API spec. Now they are more reliable at writing code to an API spec, so much so that Anthropic is suggesting that MCP should generate a spec and MCP clients should write code to it. But that raises the question - why not just use the original API specs that humans are using, such as those generated by tools like OpenAPI?

In other words: we started with OpenAPI specs, abandoned them for MCP because LLMs couldn’t handle them, and now that LLMs can handle specs again, we’re... generating specs from MCP. We’ve gone full circle.

Theo (t3.gg) Piles on the Hatred

Theo Browne, one of the biggest and most influential YouTubers in the developer space with over 480k subscribers, jumped into the criticism with his own video, “Anthropic admits that MCP sucks”, which has over 100k views. This video contained a hilarious amount of anger towards a software protocol, including the following quotes:

“How the fuck can you pretend that MCP is the right standard when doing a shitty code gen solution instead saves you 99% of the wasted bullshit? That is so funny to me. The creators of MCP are sitting here and telling us that writing shit TypeScript code is 99% more effective than using their spec as they wrote it. This is so amusing to me.”

and

“This is what happens when we let these LLM people make the things that we have to use as devs. Devs should be defining what devs use. And if you don’t let them do that, then you’ll end up realizing they were right all along.”

Are Skills Just MCP But Better?

I wanted to make this post somewhat two-sided on this debate, but honestly, I found the anti-MCP points much more compelling. Still, MCP solves more than one problem. The first is how the LLM uses a third-party tool, which, as we’ve already seen argued, does not require MCP. The second is how the LLM discovers what tools are even possible for it to call. We might not want to clutter up our context window with every possible GitHub API call we could make, but we do want a way for the LLM to reliably know how to use the GitHub API properly when we need it, ideally with some examples.

Coincidentally, shortly before Anthropic wrote their own blog post addressing MCP’s problems, Mario Zechner wrote his own MCP-alternative solution in a blog post titled “What if you don’t need MCP at all?”. He was even prescient enough to highlight the exact same problems that Anthropic did, two days before their own post!

Unfortunately, many of the most popular MCP servers are inefficient for a specific task. They need to cover all bases, which means they provide large numbers of tools with lengthy descriptions, consuming significant context.
…
MCP servers also aren’t composable. Results returned by an MCP server have to go through the agent’s context to be persisted to disk or combined with other results.

Mario’s alternative approach to tool calling is to create a README that references certain scripts (the tools he needs to use), as well as examples on how to use them. He specifically picks the “MCP enabling the agent to see the browser window during web dev” as an example, since it feels like it should be one of the more complex use-cases of MCP, but he easily replaces it with a few scripts.

Of course, if we’re replacing MCP with “small markdown files that explain how to do a task”, that sounds an awful lot like the recently announced Claude Skills. And Mario specifically calls out that Claude Skills are aligned with his approach, although he still prefers doing it his way for even greater control.

But it certainly feels like Claude Skills could fix many of MCP’s problems without the additional complexity that Anthropic is proposing in their post. Skills achieve the same progressive disclosure Anthropic describes, but without needing MCP as the underlying protocol. To circle back to my Firebase and GitHub MCP examples, the idea is that I could replace their MCPs with markdown files that explain how to correctly call their APIs or use their CLIs, and then I only need a single sentence to load those markdown files when using Firebase or GitHub. This means far fewer tokens and far less complexity.

Do We Need A Standard At All?

The only argument I truly see left for MCP is that if there are many different AI agents, and many different external tools that AI agents will need to work with, it feels like there must be some reason we need to standardize that communication. But protocols were created to help machines that understood neither English nor unstructured data by creating a highly structured machine format for them to parse. AI agents are excellent at understanding both English and unstructured data, raising the question of whether protocols even make sense in this new world.

Even if we imagine a world where Claude has a marketplace for third-party tools, if LLMs can now reliably call API specs, then MCP feels like it’s unnecessarily re-inventing OpenAPI’s wheel.

Ironically, Claude Skills seems like the more “AI-native” approach to LLMs using third party skills, but Skills is not an open standard. However, the underlying concept of a sentence that “points” to a larger set of prompts seems simple enough that it’s likely we see broader adoption of the concept if not Claude’s exact version. If anything, Mario’s blog post above indicates that people were hand-rolling their own version of Skills even before Anthropic formalized the concept.

But let’s finally note Anthropic created Claude and Claude Code, so they have some pretty smart and talented people. When they’ve invested massive amounts of time and resources into formalizing a standard like MCP, it’s possible they have insight and foresight that the broader developer community lacks. If not, MCP risks going down as a major blunder in a company that until now has seemed to be untouchable.


Claude Code Custom Commands: 3 Practical Examples + When to (Not) Use Them

Bill Prin — Thu, 06 Nov 2025 19:00:10 GMT

Like any great developer tool, Claude Code is highly customizable, most notably through custom slash commands and custom output styles. But it might not be obvious when you should reach for these customization features.

This post will focus on custom slash commands by walking through three practical examples, sharing resources for finding more examples, and some guidelines on when to create custom commands and when not to use them.

What Is A Custom Command

Claude Code has many builtin slash commands such as /clear to empty the conversation, /context to view the context usage, /init to set up a CLAUDE.md, and /agents for managing subagents. Custom commands let us add our own special commands that similarly accomplish a common task with one quick word.

Custom commands live either in your home directory (~/.claude), where they apply to all your sessions, or in the .claude directory in your project, which lets you add the commands to a Git repo so they’re available to any other devs working on the project.

Custom commands are a bit like Bash scripts, but written in English instead of Bash. They can take arguments which can be accessed via the $ARGUMENTS flag or individually with $1, $2, $3, similar to Bash.
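
To make the placeholders concrete, here is a small sketch of how such a template could be expanded. This is an illustration only, not Claude Code's implementation: the `render_command` function is hypothetical, and splitting the arguments on whitespace is an assumption.

```python
import re


def render_command(template: str, raw_args: str) -> str:
    """Expand $ARGUMENTS and $1, $2, ... in a saved prompt (illustrative only)."""
    positional = raw_args.split()  # assumption: arguments are split on whitespace

    def expand(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name == "ARGUMENTS":
            return raw_args  # the full argument string, untouched
        index = int(name)
        # Missing positional arguments expand to an empty string, as in Bash.
        return positional[index - 1] if 1 <= index <= len(positional) else ""

    # One pass over the template, so argument text is never re-expanded.
    return re.sub(r"\$(ARGUMENTS|\d+)", expand, template)


prompt = render_command(
    "Fix issue $1 with priority $2. Full request: $ARGUMENTS",
    "123 high",
)
```

Here `prompt` becomes "Fix issue 123 with priority high. Full request: 123 high".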

The most important insight is that custom slash commands are just prompts you save somewhere. This also creates a simple guideline:

Use a custom slash command for a long prompt that you frequently use

If we don’t frequently use a prompt, there’s not much value in saving it. If it’s something we do rarely, all the custom slash command does is obscure what the prompt is and make it harder to tweak it.

The reason we probably shouldn’t bother with custom commands for short prompts is we can simply type them in again. For example, if our prompt was very short like “do a git commit” or “review uncommitted changes”, then we could simply type those phrases.

However, both code commits and code reviews are examples of good places to build a custom slash command since those short phrases rarely capture the entirety of what you need in your development workflows, as we’ll explore next.

Example 1: Sanity Check On Git Commits

Paul Graham, founder of YCombinator, once wrote:

I was taught in college that one ought to figure out a program completely on paper before even going near a computer. I found that I did not program this way… I tended to just spew out code that was hopelessly broken, and gradually beat it into shape.

Like Paul, I often find myself writing code with a lot of exploratory work, commented out code, and TODO markers. However, I generally don’t want to check that code in.

Even more risky, sometimes when developing my apps I enable test flags or rig mock data so that I can test certain user flows. But it would be a critical mistake to commit that test code into production.

Historically, these problems were solved with tools like linters triggered by Git commit hooks, and with code review. But linters can only catch a small subset of issues, and code reviewers tend to find it annoying when they have to repeatedly view “sloppy” commits by the code author. Code review is perhaps not even available if you’re a solo dev, and it takes up valuable human time. LLMs to the rescue.
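
For contrast, here is roughly what a pattern-based check in a commit hook can do. It is a minimal sketch with a hypothetical pattern list, scanning only the added lines of a unified diff.

```python
import re
from typing import List, Tuple

# Hypothetical pattern list; a real hook would tune this per project.
SUSPICIOUS = [
    ("developer note", re.compile(r"\b(TODO|FIXME|HACK)\b")),
    ("debug flag", re.compile(r"\bDEBUG\s*=\s*(true|True|1)\b")),
    ("debug print", re.compile(r"\b(console\.log|print)\s*\(")),
]


def scan_diff(diff_text: str) -> List[Tuple[int, str, str]]:
    """Return (diff line number, label, text) for each suspicious added line."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines added by the change matter; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        for label, pattern in SUSPICIOUS:
            if pattern.search(added):
                findings.append((number, label, added.strip()))
    return findings


example_diff = """--- a/app.py
+++ b/app.py
@@ -1,2 +1,3 @@
 def price(user):
+    # TODO: remove before launch
+    return 0  # everyone gets it free while I test checkout
-    return lookup_price(user)
"""

findings = scan_diff(example_diff)
```

The scan flags the TODO but not the rigged `return 0` next to it, which is exactly the kind of change a regex never notices.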

We can build a Claude Code /commit command that will first check for any TODOs, commented out code, test flags enabled, or anything that simply looks like it’s a temporary test change and commit only if those things don’t exist in the changeset.

I created the file ~/.claude/commands/commit.md and entered the following text:

Review all local modifications relative to HEAD, including both staged and unstaged changes.
Before committing, check for any of the following patterns:
TODO, FIXME, HACK, or similar developer notes.
Commented-out blocks of code.
Debugging or test flags left enabled (e.g., DEBUG = true, test_mode = 1, or console.log/print calls used for debugging).
Clearly temporary or experimental code (e.g., “temp”, “try”, “placeholder”, “test data”).
Hardcoded, rigged, or mocked logic used for testing — like forcing outcomes, overriding randomness, mocking API responses, or using fixed test data.
If any of these are found:
Abort the commit.
Output a clear, structured summary describing what was found and where, with short cleanup suggestions.
If none are found:
Stage all changes (git add -A).
Perform the commit using the following message: “$ARGUMENTS”, or if the previous arguments between the quotes are empty, read through the changes being committed and write an appropriate commit message that reflects the changes being made.
Output a short confirmation message indicating success (e.g., “✅ Commit completed — working tree clean.”).

As you can see, I use the $ARGUMENTS flag so that I can assign a commit message when using the command like /commit update the landing page, but I can also just type /commit and have Claude write a commit message for me based on the changes - pretty cool!

This custom command solves a huge pain point for me on solo projects, as I always felt that double checking I didn’t accidentally commit some temporary testing code was a timesink. Claude is far smarter than a traditional linter and can understand changes that would otherwise never be flagged. You can see it being suspicious of this “hardcoded logic” in the following screenshot of my app development:

Besides sanity checking for TODOs or commented out code, there are many other things you might want to review before you commit. For example, some people feel strongly that a commit should only do one thing, so that’s another condition you could add to the custom prompt.

Custom prompts are custom, so while you could use other people’s prompts verbatim, it’s worth considering massaging them to your specific workflows.

Example 2: Catchup Context After Context Clear

This next command is one that I’m borrowing from the blog post How I use Every Claude Code Feature by Shrivu Shankar which did extremely well on Hacker News (500+ upvotes).

Shrivu recommends having a /catchup custom command that reads all the uncommitted changes into your context. The idea is that you can call /clear as the context window gets full, then reload the necessary work-in-progress into your new conversation. It might look something like this:

“read all uncommitted git changes into this conversation”

Honestly, I’m on the fence about whether this is a great example of a custom Claude Code command because it falls a bit short of the “long” prompt qualifier I mentioned above. It’s not difficult to simply type “read all uncommitted git changes into this conversation”. On the other hand, one word is shorter than eight if you’re doing it frequently.

I included this example for a few reasons anyway: first, I wanted to link to Shrivu’s excellent post; second, it’s an interesting example of managing the mechanics of Claude Code itself; and third, it ties very nicely into my third example.
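
For a sense of what "all uncommitted changes" covers, here is a sketch of gathering the same information by hand with two standard Git commands. The function names are mine, and Claude may well go about it differently when it runs the prompt.

```python
import subprocess
from typing import List, Tuple


def parse_porcelain(output: str) -> List[Tuple[str, str]]:
    """Parse `git status --porcelain` lines ("XY path") into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        # Two status characters, one space, then the path.
        entries.append((line[:2].strip(), line[3:]))
    return entries


def run_git(args: List[str], repo_dir: str) -> str:
    result = subprocess.run(
        ["git", *args], cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout


def uncommitted_context(repo_dir: str = ".") -> str:
    """Summarize staged, unstaged, and untracked work as one block of text."""
    files = parse_porcelain(run_git(["status", "--porcelain"], repo_dir))
    # `git diff HEAD` covers staged and unstaged edits to tracked files;
    # untracked files appear in the file list only, without their contents.
    diff = run_git(["diff", "HEAD"], repo_dir)
    listing = "\n".join(f"{status} {path}" for status, path in files)
    return f"Changed files:\n{listing}\n\nDiff against HEAD:\n{diff}"
```

Calling `uncommitted_context()` inside a repository returns text you could paste into a fresh conversation, which is roughly what /catchup asks Claude to do for you.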

Example 3: Catchup Context With Github Issues via MCP

I expect that for many people, the most useful custom commands will interact with MCP tools. Why? Because if a task exists that a general Claude Code user needs to do frequently, it’ll probably just get added to Claude Code builtin commands. Adding external tools is one of the ways Claude Code gets unique to you - thus needing customization - and external tools often require various forms of finagling that saved prompts can capture.

One of the most obviously useful MCP servers is the Github MCP. While you can already talk to Claude directly on Github as I wrote about in Scaling Claude Code with Github Actions, there are still plenty of times you will want to be developing locally in Claude Code and want to access something on Github. Setting this up is fairly quick: you can create a GitHub personal access token and add the MCP to Claude Code as documented here.

Let’s tie this Github MCP concept with Example 2 of catching up on context after a /clear command. In Example 2, we only added the code context. But oftentimes, we don’t just want to know what the current code changes are, but why we’re making those changes, and that reasoning often lives in an issue tracker like Github Issues.

So now we can extend our /catchup command with something like this:

Find issue $ARGUMENTS on repo and load its contents into the context here

Assuming you’ve installed the Github MCP, now you can pull all the information about the Github Issue into your conversation along with the code change after you run the /catchup command, resetting all necessary context after you cleared it. We could also consider pulling in information from relevant Pull Requests.
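
For reference, the information the command pulls in is roughly what GitHub's REST API returns for a single issue. The sketch below fetches an issue directly and flattens it into text for a conversation. It is not how the Github MCP works internally; the function names, the output layout, and the owner and repo values in the usage note are hypothetical.

```python
import json
import urllib.request


def issue_url(owner: str, repo: str, number: int) -> str:
    """REST endpoint for a single issue."""
    return f"https://api.github.com/repos/{owner}/{repo}/issues/{number}"


def format_issue(issue: dict) -> str:
    """Flatten the fields that matter for context into plain text."""
    labels = ", ".join(label["name"] for label in issue.get("labels", []))
    return (
        f"Issue #{issue['number']}: {issue['title']}\n"
        f"State: {issue['state']}\n"
        f"Labels: {labels or 'none'}\n\n"
        f"{issue.get('body') or ''}"
    )


def fetch_issue(owner: str, repo: str, number: int, token: str) -> dict:
    """Fetch one issue using a personal access token."""
    request = urllib.request.Request(
        issue_url(owner, repo, number),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Something like `format_issue(fetch_issue("my-org", "my-repo", 42, token))` would produce the block of text to load alongside the code changes.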

Where To Find More Examples

If you want to find more examples, you can check out the awesome-claude-code repo which has multiple examples of custom commands people have contributed. There’s a website directory at https://claudecodecommands.directory/ with more resources.

Browsing through there, we can look at several more templates of what people are using Custom Commands for:

Managing Git commits as demonstrated in Example 1
Context management tools as demonstrated in Example 2 and Example 3
Create documentation for new changes being added
Create a changelog.md or release notes based on recent commits or PRs
Managing releases
Creating formal product planning docs based on input docs

However, I would recommend against using anyone else’s custom commands verbatim and instead treating them as inspiration. After all, these are custom commands - it makes most sense to tailor them to your specific workflows. It’s also worth considering that if you add too many custom commands, you’re adding a level of indirection that might confuse new developers on a project if they have to learn dozens of project-specific command meanings.

Custom commands work best when they encode your specific workflows, and capture some sort of prompt you frequently find yourself reaching for but find tedious to repeatedly re-enter.

Thanks for reading and as always reply to this email or leave a comment on Substack.


Claude Code vs Codex: I Built A Sentiment Dashboard From 500+ Reddit Comments

Bill Prin — Thu, 16 Oct 2025 16:39:53 GMT

Most benchmarks tell us how AI coding models perform in carefully constructed scenarios. But they don’t tell us what developers actually think when they use these tools every day. That gap is why I built a Reddit sentiment analysis dashboard to see how real engineers compare Claude Code vs Codex in the wild. You can find the dashboard at

https://claude-vs-codex-dashboard.vercel.app/

and the source code at: https://github.com/waprin/claude-vs-codex-dashboard

There are some options to view sentiment weighted or unweighted by upvotes, and compare on specific categories like speed, problem solving, and workflows.

In this newsletter edition, I’ll discuss:

What trends the sentiment analysis dashboard uncovers on Claude Code vs Codex discussions on Reddit
The methodology I used to build the dashboard and plans for future improvements

While notable AI benchmarks like SWE-bench, PR Arena, Terminal-Bench, and LMArena help us navigate the quality of AI models, I don’t think any benchmark can truly capture how most software engineers are using agentic coding models day-to-day. We don’t typically “set-it-and-forget-it” the agent on a constructed task; rather, there’s an interactive back-and-forth conversational session. Furthermore, engineers in the wild are facing a far greater diversity of tasks than any given benchmark could hope to capture.

For those reasons, I believe a survey of the “wisdom of the crowd” is valuable to gain a broader understanding of which agentic coding models are performing better. To do so, I scraped a wide variety of comments on Reddit from AI-coding focused subreddits such as /r/ChatGPTCoding, /r/ClaudeCode, and /r/Codex. I then used the Claude Haiku model to classify whether the comment directly compared Claude Code and Codex, and classified the sentiment accordingly.
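
As a rough sketch of what that classification step can look like (not the dashboard's actual code, which is in the linked repo), here is a prompt builder and a strict parser for the model's reply. The labels "Codex>CC" and "CC>Codex" appear on the dashboard; the other two labels, the prompt wording, and the JSON shape are assumptions. The model call itself is left out.

```python
import json
from typing import Optional

# "Codex>CC" and "CC>Codex" appear on the dashboard; the rest are hypothetical.
ALLOWED = {"Codex>CC", "CC>Codex", "neutral", "not_a_comparison"}


def build_prompt(comment: str) -> str:
    """Build a classification prompt for a single Reddit comment."""
    labels = ", ".join(sorted(ALLOWED))
    return (
        "Classify the Reddit comment below.\n"
        'Reply with a JSON object whose "label" field is one of: '
        + labels
        + "\n\nComment:\n"
        + comment
    )


def parse_label(model_reply: str) -> Optional[str]:
    """Return a valid label, or None if the reply is malformed or unexpected."""
    try:
        label = json.loads(model_reply).get("label")
    except (json.JSONDecodeError, AttributeError):
        return None
    if isinstance(label, str) and label in ALLOWED:
        return label
    return None
```

Rejecting anything outside the allowed set keeps a chatty or off-format reply from silently skewing the tallies.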

(note: this analysis was done before the new Haiku model that Anthropic announced yesterday)

Since this post is fairly long, I’ll summarize here:

Overall, Codex has much more positive sentiment than Claude Code in comments that compare the two directly
However, Claude Code has much more discussion overall, at about 4x the volume of Codex, raising the question of whether its popularity is what attracts its detractors
On specific topics like performance, model quality, and problem-solving, Codex leads in all categories except two - speed and workflows. Claude Code is considered faster to respond and has a better terminal UX and ecosystem of tools. Codex frequently gets complimented for outperforming Claude Code on more challenging problems.

Claude Code performing better on speed but Codex on problem-solving also aligns with some tweets I’ve seen in the wild.

Let’s dive into some specific takeaways, and then I’ll circle back to my notes on the methodology.

Codex Wins the Sentiment War By A Large Margin

The first takeaway is that Codex is compared more positively against Claude Code by a fairly large margin:

As you can see, 65.3% of Reddit comments comparing Claude Code vs Codex prefer Codex.

The above metric only tallies raw number of comments. If we weight those comments by upvotes, so a comment with 10 upvotes counts ten times as much as a comment with 1 upvote, the sentiment difference is even more stark.

We can see that 79.9% of Reddit upvotes prefer Codex to Claude Code.
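
The difference between the two tallies is easy to see in a small sketch. The comments below are made-up numbers, not the dashboard's data, and treating negative scores as zero weight is my assumption.

```python
from typing import Dict, List, Tuple


def preference_share(comments: List[Tuple[str, int]], weighted: bool) -> Dict[str, float]:
    """Percentage share per label, counting each comment once or once per upvote."""
    totals: Dict[str, int] = {}
    for label, upvotes in comments:
        # Assumption: downvoted comments (negative scores) get zero weight.
        weight = max(upvotes, 0) if weighted else 1
        totals[label] = totals.get(label, 0) + weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100 * value / grand_total for label, value in totals.items()}


# Made-up example: (label, upvotes)
sample = [("Codex>CC", 10), ("Codex>CC", 1), ("CC>Codex", 4), ("CC>Codex", 5)]

unweighted = preference_share(sample, weighted=False)
by_upvotes = preference_share(sample, weighted=True)
```

In this toy sample the raw count is an even 50/50 split, while weighting by upvotes tips it to 55/45, because a single highly upvoted comment carries ten votes.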

The dashboard also lets you see all the Reddit comments at the bottom, sorted by upvotes and optionally filtered by themes such as speed or price, if you want to see the original comments.

But People Are Talking About Claude Code Much More… And Reddit Is A Negative Place

While Codex has far more positive sentiment than Claude Code, it’s worth noting that people are simply talking about Claude Code significantly more than Codex.

As you can see, “Codex>CC” has 98 comments vs. 18 comments that have “CC>Codex”. But if we look at comments that don’t directly compare the tools, both Claude Code and Codex have more negative comments than positive comments. People tend to complain more than praise on the internet. But of the 500 comments, Claude Code had 40 comments and Codex only had 10. This means people are talking about Claude Code about four times as much as they’re talking about Codex.

You can also see the volume of Claude Code discussion evidenced by the subreddits themselves, with /r/ClaudeCode having 4.2k weekly contributions and /r/codex having 1.2k weekly contributions.

This raises the question of how much negativity towards Claude Code is because the most popular tool tends to get the most criticism.

GLM As A New Dark Horse

I was surprised to see extensive discussion comparing Claude Code and Codex to another competitor I had never even heard of, GLM, a Chinese agentic coding model. In fact, one of the top threads in /r/ClaudeCode recently is: Why I Finally Quit Claude (and Claude Code) for GLM, with the top comment reading:

GLM surprised me when I tried it recently. It’s not as good (yet) in terms of agentic capabilities compared to Codex or Claude, but it’s good enough it produces quality results for pennies on the dollar in terms of cost. Easily the best value LLM around right now.

Top Comment in Favor of Claude Code

i‘ve been testing sonnet 4.5, gpt5-codex and glm 4.6 plans over the past few days with a nextjs project.
i think sonnet 4.5 is easily the best of the bunch. glm 4.6 is the worst, it needs a really good plan by a sota model or well broken down tasks, otherwise it codes itself in a corner.
gpt5 codex is great, sometimes best but i hit the limits much quicker than sonnet 4.5, even after the anthropic rate limit changes a few days ago. that’s on the 20$ plan. i also like claude code much more and the surrounding ecosystem of tools.
— User serialoverflow (source)

Top Comment in Favor Of Codex

Former Claude Code user for a few months on Max 20x, fairly heavy user too. Loved it at the time, but feels like at least during part of last month the quality of the model responses degraded. I found myself having to regularly steer Claude into not making changes I didn’t actually agree on (yes I use the plan mode, it’s highly valuable). Claude also often told me that code was production ready when it wasn’t, it either failed to compile or had some kind of flaw that needed addressing.
The biggest challenge I’ve given it so far was to refactor a long overdue and messy .cs file that contained about 3k LOC. I’ve tried this with various other AI LLMs, including Claude Code (which couldn’t read the entire file as it was over 25k tokens), but they just ultimately make bugs and mess things up when trying to do so. I didn’t think GPT-5 would be any different, but my god, it surprised me again. I planned with it, did it in small bits and pieces at a time, and a day or so later I’m now down to around 1k LOC for that file. It seems to be working fine too.
— User Hauven (source)

Claude Beats Codex On Two Categories - Speed and Workflows. Codex won the rest.

The dashboard allows you to filter by specific topics, and Codex leads Claude Code on 8 of 10 categories. However, Claude Code leads on two - speed and workflows. This aligns with much of the discussion I’ve seen online, where people generally think Codex is a stronger model but notice that Claude Code just returns a response faster and that its terminal UX and tool ecosystem are stronger.

Codex won the rest of the categories: pricing, performance, reliability, usage limits, code generation, problem solving, and code quality.

Notes On Methodology - Data Collection Via Scraping Reddit

To decide which Reddit comments to even scrape, I first used Google Search with the query `site:reddit.com "claude code" codex`. The vast majority of the results were from Claude-related subreddits - namely /r/ClaudeCode, /r/ClaudeAI, and /r/Anthropic. The only other subreddit with a large number of results is /r/ChatGPTCoding, which despite its name, is geared towards any sort of AI-coding discussion and is not ChatGPT specific.

Some other subreddits such as /r/Cursor, /r/OpenAI, /r/LLMDevs, /r/vibecoding, and /r/AI_Agents had a small number of results, but not enough to be significant.

Given that the Google Search API is severely limited in the number of results you can return, and the Bing API is being deprecated, the simplest way to scrape these comments is to use the Reddit API.

I focused on /r/ClaudeCode, /r/ChatGPTCoding, and /r/Codex. However, while I have some /r/Codex comments, none of them made it into this first analysis pass. I decided not to include /r/ClaudeAI despite a significant amount of discussion there because the dataset was already heavily biased towards Claude-centric discussions.

Notes On Methodology - Sentiment Analysis Time And Cost With Claude Haiku

I decided to use Claude Haiku to analyze the sentiment on each comment. I did make sure that each comment had its entire parent chain within its context when doing the sentiment analysis. This is important context if a comment says something like “I agree”.
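As an illustration of the parent-chain idea, here is a minimal sketch that walks from a comment up to its top-level ancestor and assembles the thread as prompt context. The dictionary layout, the `parent_id` and `body` keys, and the `build_context` helper are simplified assumptions, not the Reddit API's actual schema or the code used for this analysis.

```python
def build_context(comment_id, comments):
    """Return comment bodies from the top-level ancestor down to the target."""
    chain = []
    current = comment_id
    while current is not None:
        comment = comments[current]
        chain.append(comment["body"])
        current = comment["parent_id"]  # None marks a top-level comment
    chain.reverse()  # oldest ancestor first, target comment last
    return chain


# Hypothetical thread: c3 replies to c2, which replies to c1.
comments = {
    "c1": {"parent_id": None, "body": "Codex handles big refactors better."},
    "c2": {"parent_id": "c1", "body": "Claude Code is faster though."},
    "c3": {"parent_id": "c2", "body": "I agree"},
}

context = build_context("c3", comments)
prompt = "\n".join(f"[{depth}] {text}" for depth, text in enumerate(context))
print(prompt)
```

Without the chain, "I agree" is unclassifiable. With it, the model can see that the commenter is agreeing with a claim about Claude Code's speed.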

I was curious if I should save time and cost by batching comments, but Claude itself recommended against this approach and suggested it could too easily distort results. There are actually two ways to consider batching here. We could batch by asking the model to rate multiple comments at a time, which I decided against. There’s also Anthropic’s Batch processing API, which I haven’t gotten around to using yet but might add in the future to save time.

I did use Haiku since it’s one of the cheaper models. Overall, the cost was not an issue, but the time took surprisingly long. For some data points:

50 comments took 3 minutes, 28k input tokens, 10k output tokens, and $0.08 to analyze.

500 comments took 26.7 minutes, 273k input tokens, and $0.77 to analyze

As you can see, even fairly big batches cost under $1 to analyze, but waiting 25 minutes for the results was slightly annoying.
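The two data points above can be reduced to per-comment figures with a little arithmetic. The sketch below uses only the numbers reported in this post; the `runs` dictionary and `per_comment` helper are illustrative names.

```python
# Figures reported above for the two analysis runs.
runs = {
    "50 comments": {"comments": 50, "minutes": 3.0, "cost_usd": 0.08},
    "500 comments": {"comments": 500, "minutes": 26.7, "cost_usd": 0.77},
}


def per_comment(run):
    """Return (seconds per comment, dollars per comment) for one run."""
    seconds = run["minutes"] * 60 / run["comments"]
    cost = run["cost_usd"] / run["comments"]
    return seconds, cost


summary = {name: per_comment(run) for name, run in runs.items()}
for name, (seconds, cost) in summary.items():
    print(f"{name}: {seconds:.1f} s and ${cost:.4f} per comment")
```

Both runs land at a little over three seconds and well under a cent per comment, so time and cost scale roughly linearly with the number of comments.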

Haiku may have been slightly overkill, as Gemini Flash may have been cheaper, but Haiku was cheap enough as-is.

Takeaways And Conclusions

Again, check out the link to the dashboard at the top of the post or the GitHub repo if you’re curious to dig in yourself.

On a personal note, I’ve played with Codex but have been finding myself using Claude Code more because speed makes programming more fun for me, and having fun means I stick to the work longer. But I am exploring how to upskill my agentic coding with background agents and spec-driven development, and it’s very clear that the broader community sentiment suggests Codex is a stronger tool here than Claude Code.

Let me know if you’re interested in this topic by replying to this email or leaving a comment on Substack. I plan to add more comments to the analysis, and I’m also interested in comparing the sentiment analysis result of Haiku vs stronger models like GPT-5 or Sonnet and see if any differences emerge. Thanks for reading!


OpenAI Dev Day: Sora 2 Stole the Show

Bill Prin — Tue, 07 Oct 2025 19:16:49 GMT

Yesterday, I had the opportunity to attend OpenAI Dev Day in San Francisco. Since this is such a big event, almost everyone writing about AI covered it, so I don’t want to re-hash the big announcements too much, but instead offer my personal impressions.

Let’s quickly recap the biggest announcements that I’ll discuss in this newsletter:

Sora 2 gets an API, meaning you can now build apps around generative video for surprisingly affordable pricing
OpenAI launched the Apps SDK, meaning you’ll be able to use MCP to build apps that are accessible directly within ChatGPT, hinting at a new ‘App Store’ moment
OpenAI launched AgentKit, a toolkit for building complex agent orchestrations, with a special focus on a visual drag-and-drop UI agent builder and an SDK for adding ChatGPT-like experiences in your app

Unrelated to DevDay, there’s also a big promo deal being offered for 40M free tokens for the #1 agent on TerminalBench at the end!

Takeaway #1: Sora 2 Stole The Show

While I was looking forward to primarily developer talks, it was a creatives talk that blew my mind the most - Sora, ImageGen, and Codex: The Next Wave of Creative Production. They showed off a new “Storyboarding” tool that lets you organize consistent characters, settings, and related prompts, in a way designed to tell consistent stories with consistent characters.

As someone who’s attended many of Machine Cinema’s “GenJam” meetups, where we do things like make music videos with AI, I’ve seen AI-generated videos struggle badly with a lack of consistency. Even if individual clips looked great, the inconsistency between the clips is what made the longer videos easily identifiable as “AI slop”. This storyboard tool looks to solve that problem.

I loved that they linked traditional human creativity to AI-related creativity by having a human draw a real novel character on an iPad, then loading that character into the tool and generating a Pixar-style short animated film with that character as the lead.

But there was great Sora 2 news for developers after all, as they also announced pricing for the Sora 2 API, at $0.10 per second of video for the regular model, $0.30 for the pro model, and $0.50 for the larger resolution.

While not exactly cheap in the general sense, this is fairly cheap by AI video generation standards, which is significantly more expensive than text or code generation.
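To put the per-second pricing in perspective, here is a small sketch that estimates the cost of a clip at each of the three price points quoted above. The tier names and the `clip_cost` helper are hypothetical labels for illustration; only the per-second prices come from the announcement as reported here.

```python
# Per-second prices quoted above; the tier names are illustrative labels.
PRICE_PER_SECOND = {
    "regular": 0.10,
    "pro": 0.30,
    "larger_resolution": 0.50,
}


def clip_cost(seconds, tier):
    """Return the estimated cost in USD of a clip of the given length."""
    return round(seconds * PRICE_PER_SECOND[tier], 2)


for tier in PRICE_PER_SECOND:
    print(f"30-second clip, {tier}: ${clip_cost(30, tier):.2f}")
```

At these rates, a 30-second clip costs a few dollars on the regular model and several times that at the higher tiers, and each regeneration of a clip costs the same again.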

One use case they showed off was a toy company using Sora to make proof of concepts for new toy and game designs. While interesting, given these models are expensive, I think the use cases we’ll see emerge the most will relate to advertisement, as video advertisements are both hard to make and one of the most monetizable use cases of AI video generation.

Takeaway #2: OpenAI Competes Against Yet Simultaneously Promotes Their Startup Ecosystem

One of the biggest announcements was Agent Builder, a visual toolkit for building agents. At first, I thought this was strictly a no-code tool, but an OpenAI employee I spoke with later told me that it’s actually just a visualization of an underlying SDK and you can still mix and match coding with the drag & drop UI.

Many noted this might put huge competitive pressure on startups in the visual agent space, notably n8n, Zapier, and YC-Backed Gumloop. Any startup in the AI space has to consider that if their market is even remotely big, they will be competing directly with the AI behemoths they are building on top of. It’s honestly unclear to me whether startups can differentiate in the space, or if “nobody got fired for buying from OpenAI” will demolish them.

But the silver lining for those startups is that OpenAI does not seem interested in sucking the air out of the room for competing startups. In fact, very much the opposite: OpenAI gave speaking slots to many direct competitors to Codex, including a highlighted main stage talk given to Cursor, and a lightning round talk given to Warp. It was somewhat fascinating to me that Sam Altman would make Codex moving from Research Preview to GA a central aspect of his keynote address, and then let a direct competitor on stage immediately after.

To me, the takeaway is that OpenAI just wants to grow the overall AI ecosystem, and they’re more than happy to compete with startups and let the best products win. It was inevitable that they would want to build their own coding agents, but it’s also wise that they’re willing to force their internal teams to compete against external ones and let the best agent win.

It seems Sam Altman is avoiding many pitfalls that Microsoft ran into in the past trying to build a walled garden on Windows, and there was even a callback to the heyday of Microsoft when they played a remixed version of Steve Ballmer’s classic “Developers, Developers, Developers” meme on the main stage, with Ballmer replaced by Sam Altman and generated by Sora 2.

Takeaway #3: OpenAI Feels A Step Behind Anthropic on AI Coding But A Leap Ahead of Anthropic on Everything Else AI

Despite Codex moving to GA being a big announcement, everything about coding agents felt underwhelming. They demonstrated several use cases of MCP that felt like something that might have impressed me 6 months ago, but in the AI coding world, that might as well be a decade ago. Nowadays, using MCP to connect to an external device is a big ‘meh’, and lest we forget, it’s OpenAI adopting a standard that Anthropic created.

Moreover, while both Twitter and Reddit seem genuinely divided on whether Codex or Claude Code is better, I’m personally still a bigger fan of Claude Code, particularly since Sonnet 4.5 came out. I wrote about this in the last edition of the newsletter, and it’s almost entirely because it’s faster, which makes it more fun for me. Nobody disputes that Sonnet is faster - even the OpenAI employees I spoke with. The only argument is whether Codex is better on larger, more complex tasks, but personally, I’d usually rather ‘pair’ on larger tasks interactively than let the agent do the entire thing by itself.

But meanwhile, OpenAI feels significantly ahead of Anthropic on everything except coding. The biggest announcement here was the Apps SDK, which lets people build apps within ChatGPT. Although whether this will be realistic to monetize is still up in the air, it will certainly be an exciting moment for the AI consumer market.

Their AgentKit is clearly meant to move into the market of the ‘semi-technical’ agent builders, expanding OpenAI’s ambitions beyond pure devs.

And of course, Sora 2 is both one of the hottest consumer apps on the planet and a leading option for the creative industry to adopt AI into their workflows.

Given the conference was titled Developer Day, I thought it’d be more focused on coding. But of course, as developers, we don’t just want to code faster but also to build awesome stuff for end-users, and almost all of these announcements enable that.

The Apps SDK might trigger a new “App Store” gold rush (I’m very skeptical it will be easy to monetize, but it’s possible and great to be early)
AgentKit might bridge the gap between technical and non-technical teams for building advanced agent use cases
Sora 2 will open up a whole new class of application categories enabled by AI video

ICYMI: Free Tokens for #1 Agent On Terminal Bench

Something unrelated to DevDay, but worth noting, in case you like free stuff - Factory shared a tweet offering 40M free tokens to use their agent Droid with Sonnet 4.5. Factory’s Droid agent is actually the #1 ranked agent according to TerminalBench, so this is a lot of free value if you want to compare their product to Claude Code or Codex.

Thanks for reading AI Engineering Report. Reply to this email or leave a comment on Substack with any thoughts.


Claude Code 2.0 Is Promising But Flawed

Bill Prin — Wed, 01 Oct 2025 22:22:49 GMT

Two weeks ago, Anthropic looked like it was stumbling. Developers were bailing on Claude Code, OpenAI had just shipped GPT-5 Codex, and Sonnet itself was rumored to be degrading. But in the roller-coaster world of AI coding agents, scripts flip fast and Anthropic is once again flexing on the competition with the release of Claude Sonnet 4.5, beating Codex in head-to-head tests and winning over both veteran engineers and hype bros across the internet.

But while the underlying model feels stronger than ever, Anthropic’s simultaneous release of Claude Code v2 is more mixed. While it’s great to see progress on critical quality of life features like checkpoints and usage monitoring, both of these new features seem half-baked, especially next to third party options like Git and Claude Monitor.

In this edition of AI Engineering Report, I will cover:

Online sentiment towards Sonnet including one YouTuber’s fascinating breakdown comparing Sonnet 4.5 directly to GPT-5-Codex on a web dev task.
Why I’m disappointed with the new /rewind feature in Claude Code v2
Why I’m disappointed with the new /usage feature in Claude Code v2
Using OpenAI’s new Sora 2 app to pretend my dog can do a kickflip

Claude Code 2.0.0 Release - What’s New

If you’re curious about exactly what Claude Code v2 contains, you can find it in the CHANGELOG.md in their GitHub repo. Claude Code is not open source, but their GitHub repo is used for Issue Tracking, Changelogs, and some small examples. Here are the v2 release notes:

There are a few neat things in here, including tab-to-think, which is a way to force the model to think harder. Sonnet is now outperforming Opus on most tasks - raising the question of whether there’s even a point to Opus - but like GPT-5 it can either respond quickly or think harder for a better thought-out response. The straightforward way to get it to think harder is to simply use the words ‘think harder’ or ‘ultrathink’ in your prompt, but as an alternative you can now press tab to turn on a mode that will always do it.

Ctrl-R to search history is a nice quality of life feature mirroring bash terminals, although honestly, I rarely find myself wanting to do this.

The two features that stood out to me are /rewind and /usage, both of which I found disappointing, as I’ll explain.

But first, let’s talk about the Sonnet 4.5 hype.

Sonnet Puts Codex Back In Its Place

Despite weeks of speculation that Sonnet was deteriorating as a model, the new release of Sonnet 4.5 completely upended the sentiment. Simon Willison, one of the top 5 bloggers ever on Hacker News, wrote:

My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago.

Simon wasn’t alone. YouTuber Cole Medin has 167k subscribers, and in his video titled Claude Sonnet 4.5 - The New Coding King? (Sonnet 4.5 vs. GPT 5 Codex), he set out to compare Sonnet 4.5 directly to Codex on a web dev task he orchestrated. The task was to add a Stripe integration to an existing web app.

On Sonnet 4.5, he had the following to say:

There we go, Claude Code with Sonnet 4.5 has finished the implementation and did it in 15 minutes, the entire Stripe integration, it’s very impressive. I actually did this exact same build with Opus 4.1 in the past and it took 35 minutes to build the whole thing, so more than 2.5 faster with 4.5….it made a couple mistakes, so it wasn’t quite a one-shot but it was pretty close.

Meanwhile, he ran the same test with GPT-5-Codex:

It took 1 hour and 20 minutes in total while Sonnet 4.5 took 15 minutes, so the speed...was pretty disappointing. It seemed to do a lot of weird things like after editing a file, it would re-read the files and see what changes it made.

While he had minor issues with both implementations, he found Claude Code’s to be slightly closer to his desired output.

Overall, Cole put Claude Code ahead of Codex on both speed and quality.

As an aside, I found Cole’s benchmarking strategy using a Claude Code custom command (/execute-prp) that took a product requirements doc to be a really cool way of benchmarking these two agents.

Claude Code /rewind Command: Useful But Inferior To Git

In the Claude Code v2 announcement post, Anthropic describes the /rewind command in the following way:

Complex development often involves exploration and iteration. Our new checkpoint system automatically saves your code state before each change, and you can instantly rewind to previous versions by tapping Esc twice or using the /rewind command. Checkpoints let you pursue more ambitious and wide-scale tasks knowing you can always return to a prior code state.
When you rewind to a checkpoint, you can choose to restore the code, the conversation, or both to the prior state. Checkpoints apply to Claude’s edits and not user edits or bash commands, and we recommend using them in combination with version control.

I tested this command out by doing some work, then typing `/rewind`.

Upon typing the command, you get a list of all your previous messages, which are checkpointed. You can return to those checkpoints, and either revert code, revert the conversation, or both - undoing all code and/or conversation changes since the checkpoint.

If it does what it advertises it does, why am I disappointed? Because Git is still way better at solving the stated problem. Anthropic advertises checkpoints as a way to manage exploration and iteration of long-running tasks, but the checkpoint feature is far too rudimentary to accomplish this.

When you revert the code, it destroys all the code changes made. Even though that’s what you asked for, losing written code can still catch you off guard. With Git, you could just commit all the changes you want to destroy to a branch, then move off the branch. That way, the code is gone from your working tree, but in an emergency where you deleted something that you actually needed, it’s still right there in a branch.

And if you need Git for anything, why not just use it for everything? In particular, I find that GitHub Desktop is an intuitive and visual tool for managing Git branches on your local machine, and light-years ahead of this new checkpoint feature of Claude.

In fact, a Reddit user even built a tool that auto-commits after every Claude edit, essentially giving you checkpoints, but powered by Git’s full ecosystem that’s evolved over two decades of production use.

To me, there’s one great use case of the /rewind command, which is to clean up the conversation to avoid context rot. You manage the code in Git but you cannot manage the conversation in Git. Still, I increasingly suspect it’s better to make more frequent check-ins into files like CLAUDE.md / AGENTS.md as a way of organizing work, and to start clean, fresh sessions based on those markdown notes, than it is to try to repair a session that went off the rails.

Overall, it’s nice to see the Claude Code team make some progress towards managing complex work all within one session and one tool, but it’s simply not there yet.

Claude Code /usage Command: Objectively Worse Than Claude-Monitor

Last week, I wrote a post highlighting that Claude Code saves all your session’s token usage, and cost, in a local ‘~/.claude/projects’ JSONL file and simply doesn’t show it to you. Despite being easily accessible on your local box, you need a third party tool called Claude Monitor to view it. So when I saw that Claude Code v2 released a new /usage command, I assumed they incorporated the essentials of claude-monitor into Claude Code.

To my surprise, the /usage command is even more crippled than /rewind. It only shows you the percent of usage left in the current session and week.

It doesn’t show the amount of tokens you used - only the percentage
It doesn’t show what the cost would be if you were using the API instead of a monthly plan.
It doesn’t differentiate between token usage and message usage, even though the quotas limit both
It doesn’t let you switch between a session view, daily view, and monthly view.

Meanwhile, claude-monitor does all of these things.

Furthermore, if you are on a monthly plan, the /cost command still simply tells you that there’s “no need to monitor cost”, hiding the information that you might save money paying for usage-based pricing via API instead of a monthly plan. And once again, I have to speculate that Anthropic doesn’t want you to know that you could save money, and is banking on people preferring the predictability of fixed-pricing over taking the time to realize they’re overpaying.

In short, claude-monitor is still a mandatory tool if you care about optimizing cost and token usage at all.

Using Sora 2 To Watch My Dog Do A Kickflip

This newsletter is primarily targeted towards engineers and will primarily cover engineering content, but I’m also closely following the AI media and arts trends, including being a participant in the excellent Machine Cinema AI art meetup group. Through their WhatsApp group, I was able to get an invite into the just-announced Sora 2 by OpenAI, and I have to say, it’s a lot of fun. The UI is clearly heavily inspired by TikTok, except instead of posting and editing videos with a traditional editor, you only prompt videos and edit them with followup prompts.

For now, I’ve mostly focused on pretending my dog Lucky could skateboard.

That one actually looked somewhat believable as he does not succeed at the kickflip and some dogs can actually skateboard a little bit. But I did push my luck and have him get some hang-time off of the half-pipe:

Thanks for reading AI Engineering Report! As always, please leave a comment on Substack or reply to this email with any thoughts.


The Hidden Costs of Claude Code: Cost Optimization and Token Usage Monitoring

Bill Prin — Wed, 24 Sep 2025 15:52:14 GMT

In the last edition of this newsletter, I discussed how many developers were switching from Claude Code to Codex, in part because of tighter usage limits on Claude Plans. Even before the new weekly caps, the 5-hour usage limits are a common point of frustration, since you can hit a quota in the middle of your working session.

But like many others, I had no idea that Anthropic hides the token and cost information you need to optimize your way around these problems.

Once you find this critical info, you can develop strategies for avoiding hitting your quota in your 5-hour window. You can also more accurately determine whether a monthly subscription is more cost-effective than using the API key directly, in which case you pay by usage, as well as evaluate how many tokens using various MCP tools will burn. But first, you must find this info.

In this edition of AI Engineering Report, I dig into topics around understanding token usage and cost optimization on Claude Code, covering:

An invaluable OSS tool for measuring token usage in Claude Code, and pro tips on how to get the most value from it
The critical cost and token usage information that Anthropic hides in plain sight, which I learned by reverse engineering how the OSS tool reverse engineers Claude Code
How to compare costs of Claude Pro and Max plans vs API key usage
Thoughts on why Anthropic makes understanding cost intentionally difficult

To work around the 5-hour reset limit, some developers on Reddit are even reporting waking up early to send Claude a quick message so that they get a fresh quota reset fairly early in their workday, meaning Anthropic has solved one of the hardest problems in software engineering - getting programmers out of bed before 8AM.

Jokes aside, I’d remind those developers that Claude Code has a non-interactive -p flag, meaning they could set up a crontab job to run a “hi Claude” early in the morning and enjoy a few extra hours of sleep.

Claude Code Usage Monitor - A Mandatory OSS Tool

It’s very interesting that Anthropic effectively hides your cost usage. If you’re on a subscription plan, the /cost command in Claude Code gives you a message like this:

> /cost 
  ⎿  With your Claude Max subscription, no need to monitor cost — your subscription includes Claude Code usage

Given the limited quotas, Anthropic is wrong - you do need to monitor costs. Fortunately, a third party developer gave us an excellent way to do so.

Almost all of the research I did for this article comes from discovering the Claude Code Usage Monitor, developed by Maciek-roboblog on GitHub, downloading it, then having Claude Code answer my questions about how it works. This tool is amazingly useful, and I’d consider it mandatory for any Claude Code users.

After you install it, you can run claude-monitor in a terminal window and it will show you the cost, token usage, and message limit of your current session, and the percentage you’ve moved toward your maximum quota. Keep in mind, Claude has quota on both token usage and the number of messages you send to Claude, so you should keep an eye on both. The cost is only your true cost if you’re using an API key. If you’re on a monthly subscription plan, then the cost is just what it would cost if you used the API key - which is quite helpful if you’re evaluating whether a monthly plan is cheaper than using the API key.

Just as interesting as what information this tool tells you is how it gets this information. But first, I want to provide a few more tips on using it.

Claude Code Usage Monitor - Two Pro Tips

The Claude Code Usage Monitor tool has a README that clearly explains how the product works, but it’s exhaustive enough that it’s easy to miss some critical options for getting the most value out of it.

My first piece of advice is that if you’re on a monthly plan, you need to tell it what your plan is with the --plan flag, like this: claude-monitor --plan max20 (or pro, max5). This is important because the usage tool tells you the percentage of quota you’ve used up, but it has no way of knowing what your quota actually is without you telling it.

If you don’t provide the plan, the tool estimates your plan by looking at all your recent sessions, assuming your largest session was one that ran into a quota, and making that your max usage. This is a heuristic that might often be right, but is easily wrong if you have yet to hit your current quota. It’s best to just explicitly tell the tool with the --plan flag.
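To see why that heuristic can mislead, here is a minimal sketch of the idea as described above: treat the largest recent session as the quota, then report the current session as a percentage of it. The `estimate_quota` and `percent_used` helpers and all the numbers are hypothetical; this is not the tool's actual code.

```python
def estimate_quota(session_token_totals):
    """Guess the quota as the largest total seen across recent sessions."""
    return max(session_token_totals, default=0)


def percent_used(current_tokens, quota):
    """Return the current session's usage as a percentage of the quota."""
    if quota <= 0:
        return 0.0
    return 100 * current_tokens / quota


# Hypothetical history from a user who has never actually hit the limit.
recent_sessions = [12_000, 30_000, 18_000]
true_quota = 100_000  # assumed real limit, unknown to the heuristic
current = 15_000

guessed = percent_used(current, estimate_quota(recent_sessions))
actual = percent_used(current, true_quota)
print(f"heuristic: {guessed:.0f}% used, actual: {actual:.0f}% used")
```

Because this user's largest session was far below the real limit, the heuristic reports half the quota used when only a small fraction is gone. Passing the plan explicitly removes the guesswork.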

The next pro tip for this tool is that it has several different views configured with the --view flag. The default is session, but you can also look at daily and monthly. The two important ones are session and monthly.

session is important because, behind the scenes, it calculates where you’re at in your current 5-hour quota window. If you want to avoid getting stuck without Claude Code for several hours, understanding where you’re at within the current window is critical. session is the default view if you don’t provide the --view flag. (N.B. technically the default appears to be realtime which I believe is simply an alias of session).

The --view monthly flag is not the default, so unless you look for it, you might not find it. As the name suggests, it shows the token usage, message usage, and API cost (or what it would be) of the current month.

The monthly view is critical if you’re evaluating between the API usage-based pricing and monthly subscription, as it’s the best way to see what your API usage-based pricing would be even if you’re on a monthly plan.

As you can see from the above screenshot, when I ran the --view monthly, I saw that I would have only spent $70.46 if I used an API key, when I’m actually paying $100/mo for Claude Max. Granted, the month is not over so I still may end up hypothetically saving money with the subscription, but the point remains that this view is the only reliable way to even make this comparison.
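Since the month was not over, a rough way to make the comparison is to project the month-to-date figure forward. The sketch below does a simple linear projection. It assumes, purely for illustration, that the screenshot was taken on this post's publication date (September 24, 2025) and that usage stays steady for the rest of the month; the `projected_month_end` helper is a hypothetical name.

```python
import calendar
from datetime import date


def projected_month_end(spend_so_far, today):
    """Linearly project month-to-date spend out to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_so_far * days_in_month / today.day


SUBSCRIPTION_USD = 100.00  # monthly plan price mentioned above
spend_so_far = 70.46       # API-equivalent cost shown in the monthly view

# Assumption: the figure was captured on the publication date.
projection = projected_month_end(spend_so_far, date(2025, 9, 24))
print(f"projected API-equivalent cost: ${projection:.2f}")
print("subscription is cheaper" if projection > SUBSCRIPTION_USD else "API key is cheaper")
```

Under those assumptions the projection stays below the subscription price, but a few heavy days late in the month could flip the result, which is why it is worth rechecking the monthly view over time.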

Unfortunately there’s no way to switch between views once you start the program, so you have to open up several instances of it or kill and restart it to switch between realtime, daily, and monthly views.

The final reason I consider this Claude Code Usage Monitor so important is that since Anthropic hides your usage statistics if you’re on the monthly plan, it’s the best way to understand token usage of various things you might do on Claude Code. For example, I’ve frequently seen people warn about tools like Playwright MCP burning many tokens, but there’s no good way of measuring how many tokens it burns without a tool like this.

Claude Code Usage Monitor - How It Works

When I first saw this usage monitor tool, I immediately wondered how it worked. Given that Anthropic doesn’t provide this info, I assumed the tool must do advanced reverse engineering of something complex like network packet captures. Out of curiosity, I downloaded the project and asked Claude Code to answer questions about it.

To my astonishment, the tool is surfacing information that Anthropic is saving locally to JSON files, and simply not revealing!

Every time you start a new Claude Code session, a JSONL file is created under ~/.claude/projects. While the ~/.claude/settings file is officially documented, this projects directory remains undocumented. However, if you dig in, you’ll see a new file appear there for each session you create.

In that file, every message that Claude Code sends to the backend API has a field showing the input_tokens and output_tokens used, which can be trivially multiplied by the current model pricing to calculate cost. This means the exact cost of each Claude Code message is sitting right on your computer in JSON format.

So all Claude Code Usage Monitor has to do is read those files to get your token usage, and multiply by $/token to get the cost. For example, to get the monthly costs, it simply reads all the JSON messages you sent this month and sums their cost. It still has a bit of nuance, for example, for the realtime view it has to determine where your 5-hour window reset occurs, but overall, the information is more hidden in plain sight than reverse engineered in a complex way.
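As a rough illustration of that idea (not the tool's actual code), the Python sketch below sums token counts from JSONL lines and multiplies them by a price table. The nesting of the usage record under message.usage and the prices in PRICE_PER_MTOK are assumptions made for the example; only the input_tokens and output_tokens field names come from the description above. It also ignores any other token categories the files might record.

```python
import json
from pathlib import Path

# Hypothetical prices in USD per million tokens; substitute current model pricing.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def usage_from_lines(lines):
    """Sum input/output tokens over JSONL lines that carry a usage record."""
    totals = {"input": 0, "output": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt or partially written lines
        if not isinstance(record, dict):
            continue
        # Assumed layout: {"message": {"usage": {"input_tokens": ..., "output_tokens": ...}}}
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        totals["input"] += usage.get("input_tokens", 0)
        totals["output"] += usage.get("output_tokens", 0)
    return totals


def cost_usd(totals, prices=PRICE_PER_MTOK):
    """Convert token totals to dollars using per-million-token prices."""
    return (totals["input"] * prices["input"]
            + totals["output"] * prices["output"]) / 1_000_000


def usage_for_directory(root=None):
    """Walk every .jsonl file under root (default: ~/.claude/projects)."""
    root = Path(root) if root is not None else Path.home() / ".claude" / "projects"
    totals = {"input": 0, "output": 0}
    for path in root.rglob("*.jsonl"):
        with path.open(encoding="utf-8") as handle:
            part = usage_from_lines(handle)
        totals["input"] += part["input"]
        totals["output"] += part["output"]
    return totals
```

Calling cost_usd(usage_for_directory()) would give an all-time estimate; a monthly figure would additionally need to filter records by timestamp, which is left out here.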

Why Is Anthropic Hiding This Information?

It’s fascinating to me that Anthropic is putting this cost information neatly formatted on your local machine in JSONL files, then simply not showing it to you. If you are deciding between using the API key and picking a monthly plan, then how much you’d be spending with the API key is critical information. You might - like me - find that you’re paying $100/mo for a monthly plan but spending less than $100/mo on tokens, and therefore would save money with the API key.

The uncharitable interpretation would be that Anthropic doesn’t want you to save money. By telling monthly plan users, “don’t worry about usage, it’s included!”, they let a subset of users overpay for their subscription and perhaps subsidize power users. Even if you started with an API key and saw how much your usage was, you’d still need to know how many tokens you used to evaluate whether you’d be under the quota set by the monthly plans, which, like the cost, is sitting on a JSONL file on your computer but simply not surfaced to you by Claude Code.

The most charitable interpretation I could give Anthropic is that they want users to simply focus on getting value out of Claude Code and not stress over saving $20/mo by min-maxing their usage. But given how expensive AI coding can be, it’s simply inevitable that most users will care about cost optimization, and it’s hard not to suspect that Anthropic was discouraging that so that lower-usage customers could subsidize power users.

Fortunately, Claude Code Usage Monitor is here to the rescue, and if you’re a Claude Code user, it’s simply a mandatory tool. I hope this article helped you understand why it’s important and how to get the most value out of it.

Thanks for reading AI Engineering Report. Let me know your thoughts by replying to this email or leaving a comment on Substack.


Devs Cancel Claude Code En Masse - But Why?

Bill Prin — Tue, 09 Sep 2025 18:26:10 GMT

Claude Code has gone from developer darling to facing a mass cancellation campaign in the blink of an eye. The top post on Anthropic’s subreddit last week was Claude Is Dead, with over 841 upvotes, more than double the number that Anthropic’s official response got.

Meanwhile, metrics from Vibe Kanban, a tool that orchestrates AI agents, show Claude Code usage dropping from 83% to 70%, with OpenAI’s Codex agent taking up the slack (and taking most of Google’s Gemini agent’s usage along with it).

(source)

In this edition of AI Engineering Report, I’ll cover:

The reasons devs and Redditors have started a mass cancellation campaign for Claude Code
What AI agent benchmarks are saying about how AI coding agents compare
What this means for the AI coding agent landscape

Why Devs Are Canceling Claude Code En Masse

There are two reasons that this cancellation campaign started:

The first reason is usage limit changes. Anthropic announced that, starting August 28, 2025, weekly usage limits would apply across all Claude Pro and Max plans. These new limits act alongside the existing 5-hour reset window, introducing an additional layer of restrictions.

Before (pre-Aug 28, 2025):

Only 5-hour rolling reset windows.
Pro: ~45 msgs / 5h.
Max $100: ~225 msgs / 5h.
Max $200: ~900 msgs / 5h.
No weekly caps.

After (post-Aug 28, 2025):

Still 5-hour reset windows, plus new weekly caps.
Pro: ~40–80 hrs Sonnet 4 / week (no Opus).
Max $100: ~140–280 hrs Sonnet 4 + ~15–35 hrs Opus / week.
Max $200: ~240–480 hrs Sonnet 4 + ~24–40 hrs Opus / week.

As a result, many users have hit rate limits that have made it difficult to complete their tasks despite paying $200 a month.

Perceived Quality Issues and Benchmarks

The second reason is perceived quality issues. Many Redditors have complained that they feel Claude Code is producing worse outputs. Some go on to theorize that Anthropic has degraded the model in an effort to reduce costs, for example by quantizing it, that is, reducing its numerical precision, which slightly degrades performance but massively saves on cost.

Many are switching to OpenAI Codex as their preferred alternative. Here are some of the top comments:

Claude → better for rapid prototyping, creative fill-in, mimicking style.
Codex → better for structured, step-controlled builds.
I use both. BUT when considering to go to the Pro plan on Claude vs OpenAI, I went with OpenAI. I dont trust Claude not to just crash or further downgrade/limit access. (source)

I find codex is better with writing the minimum code required to get the job done (source)

Most damning of all is the analysis from YouTuber GosuCoder, who did a deep dive on AI agent performance using his custom benchmark suite, which you can learn more about at GosuEvals.

In his video, he summarizes his benchmark system as follows:

Instruction Following
Unit Tests
LLM as judge

While GosuCoder clearly spends an incredible amount of time evaluating AI agents, it’s important to note that his eval framework is not open source so that it can’t be gamed by AI agent creators. While his motivation makes sense, it does make it harder to evaluate the quality of his benchmarks.

Claude Code At The Bottom

Regardless of the quality of GosuCoder’s benchmarking system, it was shocking to find Claude Code toward the bottom of the pack.

GosuCoder added the following commentary:

This is the thing that blows my mind….claude code used to be one of the better ones, and it has actually come down to be 24,314 now, it’s behind Kiro and Windsurf and Crush, and part of me wonders if comes with the nature of them trying to preserve tokens…or if the other agents have just caught up. Regardless Claude Code is still incredibly good value, but it’s surprising to me how much it’s fallen.

Anthropic Responds

Anthropic took to their own subreddit to make an official statement on the perceived quality degradation. They admitted that known bugs had hurt performance for some users but denied intentionally degrading performance to manage costs.

We've received reports, including from this community, that Claude and Claude Code users have been experiencing inconsistent responses. We shared your feedback with our teams, and last week we opened investigations into a number of bugs causing degraded output quality on several of our models for some users. Two bugs have been resolved, and we are continuing to monitor for any ongoing quality issues, including investigating reports of degradation for Claude Opus 4.1.

Implications and My Take

Let’s state the obvious - Redditors like to overdramatize things and vocal minorities often seem bigger than they are, with some stating that Claude Code is still easily the best. Many commenters still preferred Claude Code to Codex, and even the ones who do prefer Codex described the CLI as poor and suggested a separate repo to fix the UX issues.

Furthermore, GosuCoder’s benchmarks were shown as “proof” that quality degraded despite not being open to public scrutiny, still being subjective (as all benchmarks are), and perhaps most importantly, Claude Code was only 10% behind the leader in his scoring system. That type of difference could easily be changed by a tweak to the scoring system.

In my own experience, benchmarking AI agents is incredibly difficult. Most of them are quite good at accomplishing well-defined tasks and struggle at novel tasks or tasks they lack good training examples on. This often makes them more similar than different. Furthermore, they are often more influenced by prompting strategies and the context provided than by differences in the quality of the agents themselves. Finally, “good code” has always been somewhat subjective, so evaluations of AI coding “quality” are likewise subjective.

Even by the “stats” proving the demise of Claude Code, we can see that it dropped from 83% to 70% on the Vibe Kanban tool (and it’s unclear how reflective that is of broader industry trends), which is still the market leader. Still, Anthropic issuing an official response does indicate they were at least somewhat concerned by the accusations and wanted to quell them, with many end-users appreciating Anthropic’s transparency on the issue.

Thanks for reading AI Engineering Report, please reply by email or leave a Substack comment if you have any opinions on this topic!

Scaling Claude Code with GitHub Actions and Pull Requests

Bill Prin — Wed, 20 Aug 2025 18:35:16 GMT

The GitHub integration stands out as one of Claude Code's most powerful features. There are many exciting aspects of it, but what I'm most excited about is the ability to have dozens of agents running in isolated environments and adding features to your codebase, all in the extremely well-documented and battle-tested toolset of GitHub Actions.

GitHub Actions provide isolated container environments for each agent, which means Claude Code can work on multiple features simultaneously without agents stepping on each other.

Code Review, Linting, and Testing

Before diving into the scaling potential, there are some immediate wins from this integration. For example, very often when you make a new code change, it fails “linting” (a static code analysis step that catches bugs and style issues). Now, you can forget worrying about fixing those issues and just ask Claude to fix them for you after you’re done, eliminating the 10-15 minute cycle of fixing linting errors, running tests, and checking results.

Something I found even more frustrating during code review was the minor style issues that modern linting tools weren’t quite sophisticated enough to catch, but that a code reviewer felt strongly about. In this case, you have to manually read the code review comments, fix the issues, and test everything again, which can be a huge time drain. Now Claude can handle all of that for you in the background so you can focus on more important work.

Scaling Feature Development

A former colleague has been chatting with me quite a bit about scaling AI coding agents using sketch.dev. He has very positive things to say about it, other than the high cost of running so many agents, but when I took a look at the product, I found the UI/UX a bit intimidating and ambiguous.

Meanwhile, GitHub Actions is extremely well-documented, battle tested, and well understood. It’s very likely that you already use GitHub Actions to run your testing and possibly your continuous deployments.

Now, by running a single command in Claude Code (/install-github-app), you can mention @Claude on your GitHub Issue and it will spin up a pull request in an isolated environment. This implies that if you have 20 GitHub Issues, you can mention @Claude on each of them and have 20 Claude agents get to work adding 20 features to your app, each in an isolated environment.
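To make the "20 issues, 20 agents" idea concrete, here is a minimal Python sketch that builds one GitHub CLI command per issue, each posting a comment that mentions @claude. It assumes the gh CLI is installed and authenticated, and that the workflow installed by /install-github-app reacts to issue comments; the issue numbers and request text are hypothetical.

```python
import subprocess


def mention_command(issue_number, request, repo=None):
    """Build a `gh issue comment` command that mentions @claude on one issue."""
    cmd = ["gh", "issue", "comment", str(issue_number),
           "--body", f"@claude {request}"]
    if repo:
        cmd += ["--repo", repo]  # e.g. "owner/name"; omit to use the current repo
    return cmd


def dispatch(issue_requests, repo=None, dry_run=True):
    """Create one command per issue; only execute them when dry_run is False."""
    commands = [mention_command(number, request, repo)
                for number, request in issue_requests.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Keeping dry_run=True by default makes it easy to review the commands before spending tokens on a batch of agents.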

My Experience Setting Up Claude Code and GitHub Actions on My Project and the Value of Vercel Preview Environments

The top of this post will include a YouTube video where I walk through my experience setting up Claude Code on my GitHub repo, but ultimately it was just running /install-github-app and following a few OAuth steps.

However, this experience revealed a key insight: AI coding agents make robust testing more critical, not less. While Claude codes incredibly fast, it still hallucinates and makes mistakes. It became crystal clear that having Claude Code create PRs, while having no way to verify those PRs didn’t break anything, wasn’t actually useful. The more you can rely on automated testing to catch these issues, the faster you can move with confidence.

Of course, Claude Code was helpful in getting a Jest integration test suite set up very quickly. But its initial pass ended up writing a lot of tests that didn’t even catch bugs I introduced, so I had to coax Claude a bit more to accomplish what I actually cared about, which is a test suite that catches app breakages. In Claude’s defense, I’ve seen plenty of human engineering teams write huge amounts of useless tests.

Once I had those tests, and had run the /install-github-app command, it truly was as simple as creating a GitHub Issue and mentioning @Claude, and it created the feature. This is demonstrated in the video.

Additionally, besides automated testing, since I deploy to Vercel, it automatically creates a preview deployment of the changes Claude made. This is useful for manually checking that everything looks good before merging the changes, and it offers an easy path to setting up end-to-end tests on a real environment.

Conclusion, Cost / Pricing, and Looking Forward

I'm curious about the cost and scalability of running many agents simultaneously, and will learn more as I continue to use this integration. But for now I wanted to share this initial setup experience.

Thanks for reading! As always happy to read comments on Substack or by replying to this email.

How YC Startups Use AI: Agents, OCR, and Prompt Engineering with Mercoa (YC W23)

Bill Prin — Wed, 06 Aug 2025 15:55:37 GMT

Y Combinator is the highest-profile startup incubator and one of the loudest advocates of LLM adoption. They report that roughly a quarter of their portfolio lets AI write 95 percent of its code, and that nearly every new company touches AI in some way.

Mercoa fits that mold. The Winter 2023 batch company turns accounts-payable rails into an AI-powered bill-pay agent, and its CTO, Sandeep Dinesh (formerly my colleague at Google), has been shipping LLM features since the GPT-3.5 era.

In this interview we cover:

The state-machine AI agent architecture behind Mercoa’s payment product
Prompt-engineering tactics their team refined in production
How they pick models and tools (Gemini, GPT-4, BAML, Stagehand)
Advice for founders and engineers who want to move faster with AI

Interviewer: Bill Prin, AI Engineering Report

Bill: Quick intro – what is Mercoa?

Sandeep: Mercoa is an embedded accounts‑payable and accounts‑receivable platform. We sell to vertical SaaS, banks, and payment issuers such as Mercury or Brex. Their end customers are businesses that handle thousands of bills each month.

Recently we pivoted to an AI agent that pays invoices with virtual credit cards. The agent processes invoices, figures out whether the vendor will accept card, navigates the payment portal or checkout page, and executes the payment.

What are some of the ways you’ve used AI and LLMs to build features that help customers or otherwise give your business an advantage?

Sandeep: Back in the GPT‑3.5 days we dumped invoice text into ChatGPT with a structured prompt. It outperformed specialized OCR APIs that cost ten cents per document and struggled with layout variation. It was a bit surprising that these much more general language-focused models would outperform specialized computer vision techniques, but ultimately their ability to handle the slight imperfections in the input is where they came out ahead. Today we run Gemini 2.5 Pro for vision OCR. No fine‑tuning.

After one-shotting the invoice page through an LLM, we pipe the raw text into BAML so it can extract the fields it needs into clean JSON.

What are some of your lessons learned about AI Engineering based on that experience?

Sandeep: A key lesson around prompt engineering has been that less context reduces hallucination. So we split multi‑page docs, process pages in parallel, then use business logic to reconcile totals. We one‑shot with a tight system prompt and minimal examples.
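The split, parallelize, and reconcile pattern described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: extract_page is a hypothetical stand-in for a one-shot LLM call on a single page (here it just parses "name: amount" lines), and the reconciliation rule (line items must sum to the stated total) is an assumed example of "business logic", not Mercoa's actual rule.

```python
from concurrent.futures import ThreadPoolExecutor
from decimal import Decimal


def extract_page(page_text):
    """Hypothetical stand-in for a per-page LLM call.

    Parses lines like 'Widget: 12.50' plus an optional 'TOTAL: 25.00'.
    """
    items, stated_total = [], None
    for line in page_text.splitlines():
        name, _, amount = line.partition(":")
        if not amount.strip():
            continue
        if name.strip().upper() == "TOTAL":
            stated_total = Decimal(amount.strip())
        else:
            items.append((name.strip(), Decimal(amount.strip())))
    return {"items": items, "stated_total": stated_total}


def process_invoice(pages, extract=extract_page):
    """Process each page independently, then reconcile totals in plain code."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(extract, pages))  # map() preserves page order
    items = [item for result in results for item in result["items"]]
    stated = [r["stated_total"] for r in results if r["stated_total"] is not None]
    computed = sum((amount for _, amount in items), Decimal("0"))
    stated_total = stated[-1] if stated else None
    return {
        "items": items,
        "computed_total": computed,
        "stated_total": stated_total,
        "reconciled": stated_total is not None and stated_total == computed,
    }
```

Each page gets its own small context, and the arithmetic check happens in deterministic code rather than in the model.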

Interesting, that sounds like a term I've recently seen called ‘context rot’ that Simon Willison has written about and I’ve seen some YouTube videos pop up around.

Sandeep: Yeah exactly, with the early versions of LLMs, context windows were so small, GPT-3 had 2048 tokens, people were begging for bigger windows. So LLMs like Gemini come out today with million-token context windows, and at first it sounds great, until you realize that too much context can actually hurt your results. You want to figure out the smallest amount of information you can give the model to complete the task as that will usually have the best results.

What are some other applications you’ve built? Have you found the need to do any fine-tuning or RAG applications?

Sandeep: We’ve completely ignored fine-tuning. There’s an emerging consensus that it’s too time-consuming and expensive to do relative to other approaches, especially with the foundational models themselves evolving so fast. You’d have to fine tune again for every model you want to try every time a new one releases. Meanwhile, providing better prompts and using better examples is typically cheaper and more effective.

For RAG, we have a use case where we predict metadata for invoices, and we’ve found that a “lazy RAG” approach has been most effective. In this case, we just do a database lookup for invoices that have some similarity based on the limited metadata we do have, and then we use those as examples so that the model can predict the rest of the metadata. For our use cases, we haven’t found the need for more complex embeddings or vector database approaches.
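A minimal sketch of what such a "lazy RAG" lookup might look like, using Python's built-in sqlite3 module. The invoices table, its columns, and the gl_code field being predicted are all hypothetical names chosen for the example; the point is only that retrieval is an ordinary SQL query on known metadata, with the hits pasted into the prompt as examples.

```python
import sqlite3


def similar_invoices(conn, vendor, limit=3):
    """Fetch the most recent invoices sharing the one field we already know."""
    return conn.execute(
        "SELECT vendor, description, gl_code FROM invoices "
        "WHERE vendor = ? ORDER BY id DESC LIMIT ?",
        (vendor, limit),
    ).fetchall()


def build_prompt(conn, vendor, description):
    """Turn the retrieved rows into few-shot examples, then append the new invoice."""
    lines = ["Predict the gl_code for the final invoice."]
    for ex_vendor, ex_description, ex_code in similar_invoices(conn, vendor):
        lines.append(f"vendor={ex_vendor}; description={ex_description}; gl_code={ex_code}")
    lines.append(f"vendor={vendor}; description={description}; gl_code=")
    return "\n".join(lines)
```

No embeddings or vector store are involved; if the exact-match query returns nothing, the prompt simply contains no examples.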

Let’s talk more about this AI agent you’ve been focusing on recently and have made your primary product offering. What does the agent architecture look like?

Sandeep: Think of it as a state machine that only goes in one direction. State 1: We have an invoice PDF. State 2: Determine if the vendor supports card. State 3: Get to the card form. State 4: Fill details and submit. Transition logic is fuzzy, so we let the model decide how to move forward inside each state. The secret sauce is the chain of prompts and loops to get from one state to the next in a reliable way.
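One way to picture a forward-only state machine like this is the Python sketch below. The four states mirror the ones listed above, but the state names, the handler interface, and the retry limit are assumptions for illustration; in a real agent each handler would wrap the prompts and loops that decide whether the state's goal has been reached.

```python
from enum import IntEnum


class State(IntEnum):
    INVOICE_RECEIVED = 1
    CHECK_CARD_SUPPORT = 2
    REACH_CARD_FORM = 3
    FILL_AND_SUBMIT = 4


def run_agent(handlers, max_attempts=3):
    """Walk the states in order; retry inside a state, never move backwards.

    handlers maps a State to a callable taking the attempt number and
    returning True once that state's goal is met (hypothetical interface).
    """
    visited = []
    for state in State:  # IntEnum iterates in definition order
        visited.append(state)
        handler = handlers.get(state)
        if handler is None:
            continue  # nothing to do in this state
        for attempt in range(max_attempts):
            if handler(attempt):
                break
        else:
            return visited, False  # retries exhausted: stop instead of looping back
    return visited, True
```

The fuzzy part lives inside the handlers, while the outer loop stays simple enough to reason about and log.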

One very useful tip we learned is that LLMs have a tendency to hallucinate: if you ask them for an answer, they will answer confidently even when they don’t know. But if you provide an ‘escape hatch’ and explicitly tell them they can say they don’t know, they will use it, which reduces bad answers. So typically we use BAML and have our model provide an answer of ‘yes’, ‘no’, or ‘unknown’. BAML lets the model write its chain-of-thought into reasoning while our code reads only acceptCard, which keeps the agent deterministic. The ‘unknown’ answer is the escape hatch that prevents some hallucinations.
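The escape-hatch idea can be illustrated without BAML. The plain-Python sketch below assumes the model returns JSON with a reasoning field and an acceptCard field (the two names mentioned above); the parsing code and the action names are hypothetical. Anything malformed or unexpected is treated as "unknown", so the calling code only ever branches on three values.

```python
import json
from dataclasses import dataclass
from enum import Enum


class AcceptCard(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"  # the escape hatch


@dataclass
class CardDecision:
    reasoning: str           # free-form chain of thought, never branched on
    accept_card: AcceptCard  # the only field the agent logic reads


def parse_decision(raw_json):
    """Parse model output; fall back to UNKNOWN on anything unexpected."""
    try:
        data = json.loads(raw_json)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        answer = AcceptCard(str(data["acceptCard"]).lower())
        return CardDecision(str(data.get("reasoning", "")), answer)
    except (ValueError, KeyError):  # bad JSON, bad enum value, or missing field
        return CardDecision("", AcceptCard.UNKNOWN)


def next_action(decision):
    """Deterministic branching on the enum (action names are hypothetical)."""
    if decision.accept_card is AcceptCard.YES:
        return "pay_by_card"
    if decision.accept_card is AcceptCard.NO:
        return "use_other_method"
    return "escalate_to_human"
```

The prompt itself would still need to tell the model that "unknown" is an acceptable answer; the code only guarantees that the rest of the pipeline handles it.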

There is an explosion of browser‑automation libraries. What are you using?

Sandeep: We shipped v1 on Stagehand because it worked fastest for us in TypeScript. We are evaluating Browserless, Browserflow, and Playwright‑based agents. Vendor lock‑in is a concern, so we prefer thin wrappers around browser primitives.

How do you choose between GPT‑4, Claude, Gemini, or open models?

Sandeep: Criteria are quality first, then latency, then price. For heavy OCR we use Gemini 2.5 Pro – accuracy matters. For the agent’s incremental steps Gemini Flash is fastest and cheap.

How do you evaluate the quality of your LLM responses?

Sandeep: We keep a unit‑test style suite per task. New model comes out, swap it in BAML, run the suite. If it passes, we migrate.
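A unit-test style suite of this kind could be as small as the following Python sketch. The model is represented as any callable from prompt to answer, which is an assumption made to keep the example self-contained; a real suite would call the candidate model through whatever client the task uses.

```python
def run_suite(model, cases):
    """Run every (prompt, expected) case and collect the failures."""
    failures = []
    for prompt, expected in cases:
        actual = model(prompt)
        if actual != expected:
            failures.append((prompt, expected, actual))
    return failures


def should_migrate(candidate_model, cases):
    """Migrate only if the candidate passes every case in the task's suite."""
    return not run_suite(candidate_model, cases)
```

Exact-match comparison works best when outputs are constrained (for example enums or structured fields), which fits the structured-output approach described earlier in the interview.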

What prompt engineering tips have stuck with you over this time?

Keep the context window as small as possible – chunk large docs and post‑process.
Use enums instead of booleans (yes, no, unknown) to avoid bias.
Chain‑of‑thought is helpful but only include it if it actually helps.
Prompt engineering is probably the best way to get better performance. A good prompt can outperform a better model.

What’s your advice for people building startups, if you started over, what would you do differently?

Sandeep: I would build agentic workflows sooner. If something today seems like it’s delivering some value, but the underlying tech feels not quite ready, well, the underlying tech is evolving so fast that it’s likely it will be ready soon and you’ll be the first one to capitalize on it because you have a head start.

On hype and competition: Massive VC funding can look scary. It simply reflects a huge surface area of problems to solve. Focus on shipping something customers pay for.

What’s your advice for junior engineers looking for a job, or perhaps senior engineers looking to upskill on AI skills? Do you recommend any libraries or frameworks that seem hot like LangChain or LlamaIndex?

I’ve always been skeptical about learning frameworks or libraries because they’re popular or for their own sake. Instead, try to build something interesting, but also talk to a lot of people in the space, be active on AI social media, go to hackathons and meetups. When you do this, typically when you have conversations with people, the need for new tools will emerge more organically. You’ll say, “oh this problem is giving me a hard time” and someone will say, “oh this library can help”.

I was a bit late on the Cursor adoption, but I was on a call with a vendor when I saw them solve something very quickly with Cursor and that led me to try it. With this approach, any tool you adopt has some clear purpose and you understand why you’re using it, plus you have tangible output for all your work.

Resources and links

Mercoa – https://mercoa.ai
BAML – structured prompt and JSON extraction toolkit https://github.com/BoundaryML/baml
Stagehand – TypeScript browser automation, https://docs.stagehand.dev/get_started/quickstart

Claude Code vs Cursor: First Look On A Real Dev Problem

Bill Prin — Tue, 29 Jul 2025 21:36:43 GMT

A few months ago during the Vibe Coding Game Jam, almost everyone I saw on Twitter discussing their work was using Cursor, one of the most popular AI-assisted IDEs. Claude Sonnet 4 had just been released, and Cursor had put their coding “Agent” feature front-and-center and moved its “Ask” feature to the side. The combination of the power of Sonnet with the tooling of Cursor made vibe coding more powerful and fun than ever.

More recently, I’ve seen countless people talk about embracing Claude Code, which Anthropic, the creator of Claude, released in “research preview” earlier this year, in favor of tools like Cursor, an AI-based IDE I’ve written about on my newsletter and YouTube a few times.

Unlike Cursor, Claude Code is a primarily terminal-based tool, though there is some IDE support via plugins, including one for VSCode and Cursor itself. The terminal-based nature of Claude Code is a huge appeal to many engineers, some of whom prefer not to use IDEs, and others who want to use agentic coding in contexts outside of the IDE such as terminal scripts and GitHub.

When Cursor hype was at its peak (raising money at a $9B valuation), many highlighted the irony that a “ChatGPT-wrapper” (a tool that just builds around foundational models) seemed to be a stronger business than the foundational models themselves. This inverted previous assumptions that “wrappers” were low effort with little to no competitive business moats.

Since the release of Claude Code in Research Preview, we’ve seen a foundational model company strike back. Claude Sonnet is usually cited as the current strongest coding model, and Anthropic - being the company that makes Claude - has at least some advantages in building the best coding tools on top of it.

Claude Code has emerged as one of the most popular tools for AI-assisted coding, and I’ve seen countless tweets and Reddit posts asserting that they’ve abandoned Cursor in favor of simply using Claude Code. So I’ve decided to give both a spin and come to some early opinions.

The Task They Battled Over - Improving ESLint Config in a Large Turborepo Monorepo

The main task I set out to accomplish is one that’s been giving me headaches for months. In TypeScript, you can write async functions, which are almost always intended to be called with await. Forgetting this has been a huge source of bugs in my codebase.

As an example of such a bug: I mean to send a customer an email, but the server function finishes and terminates because I did not await the email function. Because the server thread no longer exists, the async email function never gets around to running, and a critical email is not sent.
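The same class of bug can be reproduced outside TypeScript. The Python asyncio sketch below is only an analogue, not the code from my project: the handler names and the email address are hypothetical. In the buggy handler the email coroutine is started but never awaited, so when the handler returns and the event loop shuts down, the pending send is cancelled before it completes.

```python
import asyncio


async def send_email(outbox, to):
    await asyncio.sleep(0.01)  # stands in for the network round trip
    outbox.append(to)


async def handler_buggy(outbox):
    # Started but never awaited: the handler returns before the send finishes.
    asyncio.create_task(send_email(outbox, "customer@example.com"))
    return "200 OK"


async def handler_fixed(outbox):
    await send_email(outbox, "customer@example.com")  # wait for the send to finish
    return "200 OK"


def run(handler):
    """Run one request to completion and report what was actually sent."""
    outbox = []
    status = asyncio.run(handler(outbox))
    return status, outbox
```

Both handlers report success, but only the fixed one leaves an entry in the outbox, which is exactly why this kind of mistake is easy to miss without a lint rule.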

This is not technically an error as far as the TypeScript compiler is concerned, and is instead something that should be checked by a linter like ESLint. In a small project, this is usually trivial to add.

The real challenge arose when I moved to a Turborepo monorepo, which I mostly did so that the web version of my project (written in React) and the native iOS/Android versions (built with Expo / React Native) could share code. NodeJS dependency management is notoriously error-prone and fragile, and in the world of Expo and React Native, doubly so, and in a world of trying to combine Expo with NextJS, triply so.

A few months ago, I spent over 10 hours with both Claude and ChatGPT trying to get my linter working, and still failed. It sounds a bit ridiculous, but does speak to the complexity and fragility of the ecosystem.

Claude Code One-Shots It

I downloaded Claude Code and ran just claude in my CLI, which presented me with a prompt where I described my issue.

Before I did that, I ran the initialization to create a CLAUDE.md, and I was pleasantly surprised that it deduced that my project was really two separate but closely connected apps (web and native) sharing code, without any instruction from me explaining this.

I then asked it to fix the ESLint issues, including that I was looking for a sample file that could reproduce the bug and show “red squiggly lines” in my IDE, which would call visual attention to the bugs as I develop.

Claude Code had no problem navigating my codebase (including the two separate versions of the product) and setting up ESLint:

It also created the test file like I asked and pointed it to me to verify I could see the red squiggly lines:

But Cursor One Shots It Too

I was blown away by Claude Code finally getting the configuration right and fixing this long-standing issue, since previous AI tooling had failed. However, I asked Cursor to accomplish the same thing, and it also one-shot it, fixing many quirky dependency issues:

It’s All The Same Underlying Model - So What’s The Difference

It’s not surprising that both Claude Code and Cursor were able to fix the issue since they both use the same model, Claude Sonnet, behind the scenes.

While there are clearly some minor differences in how their agents operate, both Claude Code and Cursor have top-notch teams so any big difference in capabilities is likely to be ironed out over time.

So if both Claude Code and Cursor can accomplish this task - what is the difference between the two tools? I’ve come to the following (early) conclusions:

Cursor still offers better IDE usability. While I tried the Claude Code VSCode Plugin, it’s bare-bones in comparison to Cursor, mostly just showing diffs to approve. Cursor, meanwhile, has tab to complete and multi-line edit, which are much more sophisticated ways for a human to use AI to code faster.
Cursor is a faster workflow if you can use your codebase knowledge to hint the AI along its task. When I asked Claude Code to fix one of the linter errors, it took several Unix search and grep commands to even identify the area of the code I was talking about. Cursor, on the other hand, by default will add the file you’re looking at into the context, and specifically add context about where your cursor is highlighting code. This makes Cursor’s agent slightly faster when you are actively working in your codebase.
Claude Code is scriptable. Claude Code has a -p flag that lets you give it a prompt as you call it, which means you can write a script that will trigger Claude Code to do its job. This has almost infinite use cases but to name one, you could imagine a simple script to pull a GitHub PR and run claude -p "fix the linter errors". This way, Claude Code can be dispatched to every use case outside of the IDE, while Cursor’s agent is trapped inside of it.
Claude Code has tight GitHub integration - This ties into the above scriptability topic, but it’s worth calling out that Anthropic has already added first-class support to GitHub for Claude Code. There’s tons of cool out-of-the-box features like being able to go to a GitHub issue, mention @claude, and have it create a new PR using a GitHub Action in the background.
Claude Code is slightly more price efficient - Both Claude Pro and Cursor Pro are $20/month, but Claude Pro resets its usage slightly more frequently. If you need to step up from there, Claude Code offers a halfway price point at $100/mo, while both offer $200/mo plans.
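As a sketch of the scriptability point above, the Python snippet below builds the two commands such a script might run: check out a pull request with the GitHub CLI, then hand Claude Code a prompt via -p. It assumes gh and claude are both on the PATH and authenticated, and that Claude Code's permission settings allow it to edit files when run non-interactively; the PR number is hypothetical.

```python
import subprocess


def lint_fix_commands(pr_number, prompt="fix the linter errors"):
    """Commands to check out a PR and run Claude Code on it with a one-off prompt."""
    return [
        ["gh", "pr", "checkout", str(pr_number)],
        ["claude", "-p", prompt],
    ]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)
```

Passing the runner as a parameter keeps the script easy to dry-run: swap in a function that only prints or records the commands before letting it touch a real checkout.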

Why Not Both?

It’s worth noting that Cursor, being a fork of VS Code, has a terminal window as one of its panels. That means it’s trivial to run Claude Code inside of Cursor, and in fact Cursor is explicitly supported.

If you take this approach, you’ll still have to pick whether you want to use Cursor’s agent or Claude Code (with Cursor support) on any given task. The difference is subtle, but as noted above, if you reach for Claude Code within Cursor, you’ll get basic code diffs but won’t have access to Cursor Rules or Projects. In practice, I think if I’m within the IDE I’ll usually stick with Cursor’s agent, but there may be specific workflows where Claude Code is better set up to accomplish a task.

Conclusion

Both Claude Code and Cursor are fast moving products (along with the rest of the ecosystem) so it’s tough to compare any of these tools as anything written can be changed within a few weeks.

But overall, the comparison between them becomes pretty clear. Cursor is an IDE-first tool with a focus on a great IDE experience and Claude Code is a terminal-first tool with a focus on flexibility and scriptability. Since they both use the same models under the hood, most of the difference in these two tools becomes centered around these focus points.

Finally, you can run Claude Code in the terminal panel of Cursor and have access to the best of both worlds.

There’s still tons of interesting features being released by both products, such as Cursor background agents, Claude sub-agents, and Claude GitHub integration, which I hope to explore in future posts. I’m also intrigued by Amazon’s new Cursor competitor - Kiro.

From Zero to Monetized iOS App in 10 hours with Bolt.new, Expo, and RevenueCat

Bill Prin — Wed, 25 Jun 2025 21:07:01 GMT

As part of the Bolt.new hackathon, I was able to go from an idea to a monetized iOS app in about 11 hours of focused work. The app is called Systematic, and it’s a flashcard app to review key concepts on databases, networks, and API designs to prepare for tech system design interviews. You can find it on the App Store here. Here’s a screenshot of it:

I measure my time spent on projects in my time-tracking app, Interstitch, and it took me about 6 hours of focused work to go from “idea” to “in the App Store” and about another 6 hours of focused work to add the subscription monetization features with RevenueCat. Pretty good!

A couple weeks ago, I wrote about speedrunning an AI art app using Bolt.new, Supabase, and Replicate. The goal there was to build an app with a last-minute deadline as a favor for the organizer of an AI Art Meetup, but I was also motivated to use Bolt.new because of their hackathon with over a million in prizes.

Bolt.new also has several $25k prizes for apps that use Expo and RevenueCat to build monetized native iOS / Android apps. And as someone who already has an app using Expo and RevenueCat in production, I was eager to try out Bolt.new’s integration so I can circle back to that app with new tools.

So, I set out to again try to ship and monetize an app as fast as I could.

This post walks through what it took to go from zero to shipped, what went well, and what still had rough edges.

First Pass In Bolt.New

I started with the following prompt:

Build a mobile app that is a system design flashcard app. It should be minimal and offer a few quizzes on :
databases
networking
api design
All the flashcards can be statically stored for the first pass. No user login is needed. Keep it simple and minimal, but good looking, with maybe a few visual fx like confetti.

I stressed keeping it minimal and simple, since AI agents can sometimes get a bit too creative and start adding lots of UI for features like user authentication that don’t exist and aren’t necessary for a v1.

The request for the confetti was to make sure there was some visual “juice”, though if I were to be more thorough here, I should have first brainstormed different “juice” ideas with the AI, then asked it to add that. But showering confetti on a right answer seemed like a good start, and the agent did an excellent job with the animation.

My Favorite Prompt Engineering Trick

The first pass of the app looked great, but there was a slight problem. Whenever I answered a question, and clicked “got it right”, and expected to move to the next question, it would “flicker” the old question for about a second. It was minor, but very distracting and made the app look unprofessional.

I told the agent to fix the flicker several times, and it went into the frustrating loop that AI agents sometimes get into, where they repeatedly say:

They understand the problem
They have a working fix
The fix is implemented
...but then I check and the issue is still there

Fortunately, I was able to re-use a trick that I mentioned in my post on the Vibe Coding Game Jam.

I prompted the model in the following way:

First, determine the logic for when the next question should be rendered and replace the current question

Next, determine the logic for the timing of when the “question answered” buttons are pressed
Verify those occur at the exact same time. If not, fix it.

This prompting strategy fixed it. Asking an LLM to make a multi-step plan is a well-known way to get agents to accomplish tasks, but providing your own plan can sometimes get the agent out of a rut.
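The post doesn’t show what the agent actually changed, so the following is only a hypothetical sketch of the kind of bug that plan points at. In a React-style app, if the "answered" flag and the question index are updated at different moments, one render can briefly show the old question. Folding both into a single state transition means they can’t drift apart. Every name here (QuizState, quizReducer, the action types) is made up for illustration.

```typescript
// Hypothetical quiz state; field and action names are illustrative only.
interface QuizState {
  index: number;     // which flashcard is showing
  revealed: boolean; // whether the answer side is visible
  correct: number;   // running score
}

type QuizAction =
  | { type: "reveal" }
  | { type: "answer"; gotItRight: boolean };

function quizReducer(state: QuizState, action: QuizAction): QuizState {
  switch (action.type) {
    case "reveal":
      return { ...state, revealed: true };
    case "answer":
      // Advance the card, hide the answer, and update the score in ONE
      // transition, so no render sees the old card with the new flags.
      return {
        index: state.index + 1,
        revealed: false,
        correct: state.correct + (action.gotItRight ? 1 : 0),
      };
  }
}

const initialState: QuizState = { index: 0, revealed: false, correct: 0 };
const afterReveal = quizReducer(initialState, { type: "reveal" });
const afterAnswer = quizReducer(afterReveal, { type: "answer", gotItRight: true });
console.log(afterAnswer);
```

In a component this would typically sit behind a reducer hook, but the reducer itself is plain TypeScript and can be checked without rendering anything.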

Minor Bumps On The Road

After I had a pass of the app working in the Bolt.new preview, I synced it to a Github repo and finished most of the building and App Store publishing in my local repo using Cursor.

There were a few minor issues with building it for the App Store, which were fixed when I ran npx expo doctor, which is an Expo tool to make sure all dependencies are on the same page.

I was surprised to get errors related to permissions to use the iPhone Camera when I deployed to the App Store. Turns out, the Bolt agent added a bunch of dependencies to set up the Camera to my flashcard app, which makes no sense, and once again shows that the AI-generated code still benefits from human review.

Publishing the App and Trying Out RevenueCat Paywall

To create the iOS screenshots, I used App Mockup. I initially generated fancier ones but they got rejected during review. iOS review is notoriously inconsistent - they encourage straightforward screenshots, yet nearly every top app disregards that guidance. So in my experience you have to try a few times until you get a more permissive reviewer. For now, I used simpler screenshot annotations and will try a fancier one in a future update.

The hackathon requirement mentioned using RevenueCat’s Paywall system, which lets you design a paywall in a web design kit, that you can then experiment on without changing your code. Here’s what my Paywall design looked like:

As you can see, the design tool is similar to Figma.

I’m a little unclear what the value proposition of this tool is. I think the idea is that a developer can build an app, and then the marketer can just iterate on the paywall without a developer’s assistance, and without re-publishing the app. This makes some sense since the paywall will be one of the most important pieces of monetization, and a screen you’re going to iterate on more than average.

But this doesn’t seem to add much value to me, since using AI like Bolt.new to build screens is faster than what I’ll slowly design myself, and you can already publish Expo apps OTA (over-the-air) without going through the full app publication lifecycle.

So I found their desire to highlight an “old-school no-code” feature as part of a “new-school AI no-code” hackathon a bit odd. However, I am curious to try this Paywall feature along with RevenueCat’s Experiment features. It’s possible that the Paywall still makes pricing A/B tests easier, so I’m open to exploring more there.

Wrapping Up

About 2 years ago, I wrote a post on my old blog about going from an app idea to a paid subscriber in one month. At the time, I thought that was a pretty ambitious schedule and I was happy to have succeeded. (As an aside, that app and blog post have since been removed. Long story.)

Throughout my 13+ years at big Silicon Valley companies like Google and Pinterest, I mostly did backend work. When I set out to build my own apps, both as side projects while working and when focusing full time on indie apps, writing frontend code was a huge time sink. I know React and React Native fairly well, but doing good UX design in Figma, translating that to CSS, and just churning through the high volume of code required was a massive drag on my velocity.

AI frontend-generators like Bolt.new massively accelerate that velocity and unlock the ability to experiment with many more small apps to see what gets traction.

But, I will note that while Bolt was fantastic for creating UI screens, I found myself syncing to Github and working in Cursor for all further work, such as setting up RevenueCat. Both Bolt and Cursor are using similar models like Claude under the hood, so the difference really came down to the rest of the development environment. Cursor is still built for coders and makes working with the code easier, and even though I’m letting the AI do most of the coding, Cursor lets me “keep an eye” on things more as I fill in the rest of the backend.

One rough edge I hit was that Bolt’s one-project-per-repo structure doesn’t play well with my Turborepo setup, which I use to manage a growing portfolio of apps in one repo with standardized dependencies and tooling. I’m toying with the idea of building an AI agent to sync Bolt-generated screens into larger codebases.

Signing Off

Thanks for reading! Let me know if you checked out the app, have any questions about Bolt or Expo, or any other thoughts on AI no-code UI generators. Feel free to get in touch by replying to this email, leaving a comment on Substack, or emailing me using the email listed on my personal site.

Speedrunning an AI Art App With bolt.new, Supabase, and Replicate

Bill Prin — Thu, 12 Jun 2025 16:40:43 GMT

This week, an organizer of an AI art meetup I’ve attended messaged me on WhatsApp around 7pm: “Can you build an AI art app by tomorrow afternoon?”

A few years ago, I would’ve said that timeline was impossible. But with tools like Bolt.new, Supabase, and Replicate, I had a working version live the next day.

The goal was to build an “experience” app so that 300 people at a meetup in Washington DC could all share their fears and hopes about AI, have images generated, and then create a gallery where all those fears and hopes stood side-by-side.

bolt.new enabled me to build this app in a few hours.

bolt.new has been on my radar for a while, along with its competitors like v0 and lovable. This month, they are hosting a $1 million hackathon, which seemed like a good excuse to give it a try.

Overall, I’ve been very impressed by bolt, particularly for quickly shipping web apps. They now have first-class integration with Supabase, a popular open-source database, which enables not just data-persistence but server side API endpoints and file storage - both critical features for my AI art app.

In this post, I’ll cover why I’m interested in no-code AI platforms despite being a competent coder, why bolt.new was especially well-suited to this task, and other lessons and challenges.

(btw, you can check out the app here, but please don’t stress-test it too much as the event is Friday June 13)

The AI Art App We Set Out To Build using Replicate

The AI art meetup is called Machine Cinema, whose mission is to bring AI + art into real life, with a community of people sharing the latest tips and techniques. The organizer - Minh Do - wanted to build a collaborative experience for meetup attendees.

The app asks the following four questions:

What fear, tension, or challenge about AI or society keeps you up at night?
At some point, we stopped believing ________, and started imagining ________.
Imagine a breakthrough moment that shifted the story of AI for the better - what was it?
In one sentence: What headline would make you feel like we've truly made it?

Each user submits an answer to each question, and the answers are then combined in a gallery so you can see each “act”: Act 1 where we review everyone’s fears, Act 2 where we review what we start imagining, Act 3 where we review everyone’s breakthrough moment, etc.

For image generation via an API, there are a few different options. Both OpenAI and Google’s Gemini have APIs for generating images, but their flagship models are fairly slow and expensive.

We ended up going with Replicate, which is a company that runs open source image generation models for you, while handling various complexities like GPU provisioning. Our main motivation for using Replicate was the wide variety of models they offer so we can optimize the tradeoff between time, cost, and quality.

On Replicate, we started with their default suggested model, “flux-pro”. This model creates high quality images but, like OpenAI and Gemini, is fairly slow and expensive. Fortunately, it was easy to switch to “flux-schnell”, a much faster and cheaper image model. While the images weren’t as consistently relevant to the prompt, a snappier app at a cost we could afford made it the right tradeoff.

Why Bolt Was the Right Choice

AI no-code platforms are heavily marketed towards non-engineers, sometimes with an implicit assumption that coders won’t use no-code. But just because you can code doesn’t mean you should.

Even though I’ve extensively coded in React and React Native, I’m more than happy to let AI do the lion’s share of the work - especially since Bolt enables both Github sync and a project download, easily enabling you to take your code off the platform if you ever run into a limitation.

When I first got asked to do the project, Minh, the organizer of Machine Cinema, was already working with a teammate using v0.dev and Replit, other competitors in the space. I immediately suggested switching to bolt.new, not only because of the hackathon but because of critical features we ended up using.

Overall, Minh was happy with how quickly I turned around the app, and I was happy that I was able to do effectively the entire app in plain English, and most features in a single prompt.

Why Supabase Made It Work

Bolt.new has a built-in integration with Supabase, which adds an actual database to the project, a critical feature for making anything more than a transient demo. Without a database, whatever your app creates will disappear on the next page refresh.

Supabase helped in two other ways: edge functions and file storage.

Supabase adds another key feature to Bolt - Edge Functions - server side API endpoints. This turns out to be a key feature for our use case, because to generate the AI image, we use the Replicate API. But given images can be expensive to generate, we don’t want to expose the API key in the client, and instead need to use a server side endpoint. This is exactly where Supabase edge functions worked great, since it gave us a server side endpoint immediately.
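As a rough sketch of that split, the server-side function is the only place that ever sees the key, and the browser only sends a prompt. The helper below just builds the outgoing request. The function name, the placeholder token, and the exact endpoint path and header format are assumptions to check against Replicate’s HTTP API docs, not details taken from the app’s code.

```typescript
// Shape of the outbound call the server-side function would make.
interface OutboundRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: runs only on the server, where the token lives.
function buildImageRequest(prompt: string, apiToken: string): OutboundRequest {
  const trimmed = prompt.trim();
  if (trimmed.length === 0) throw new Error("prompt is required");
  if (apiToken.length === 0) throw new Error("server is missing its API token");
  return {
    // Assumed endpoint for an official model; verify against Replicate's docs.
    url: "https://api.replicate.com/v1/models/black-forest-labs/flux-schnell/predictions",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ input: { prompt: trimmed } }),
  };
}

// What the client is allowed to send: the prompt and nothing else.
const clientPayload = { prompt: "  a hopeful city skyline  " };
// In a real edge function the token would come from a server-side secret.
const imageRequest = buildImageRequest(clientPayload.prompt, "token-from-server-env");
console.log(imageRequest.url, imageRequest.body);
```

The point of the shape is that nothing in clientPayload can carry or reveal the token, so the key never ships in the client bundle.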

Another problem we ran into was that our image links kept breaking. I was using the URLs Replicate returned for our generated images, but Replicate is not meant to serve images in a production app. You’re supposed to download the image and store it somewhere yourself, which we learned the hard way.

Fortunately, this was again easily fixed. In Bolt.New, we simply asked the agent to download the images from Replicate, upload the files to Supabase Storage, and save the links to our Supabase database. It worked like a charm.

Strengths and Weaknesses of Bolt.New

While I’m excited about various AI coding products, I do try to remain realistic and grounded rather than buy into any hype.

Overall I found Bolt extremely functional, but there were some setbacks. By far the biggest pain point came when trying to set up Replicate, since the agent was not familiar with the API. While Replicate can work with just HTTP fetch, it’s recommended to instead use the Replicate NPM package, which the agent did not use until I told it to install it.

By contrast, the Supabase integrations worked incredibly well. I asked it to add the database and it created a schema that worked on the first try. Later, when we needed the Supabase edge functions and file storage, it again worked on the first try in both cases.

It’s not surprising that the Supabase prompts worked well and the Replicate prompts didn’t, given that bolt.new advertises Supabase as a first-class integration on the platform. Presumably, they already provide a variety of prompts that guide the agent toward using Supabase correctly, and those were missing for Replicate.

Another major strength of bolt.new is its ability to sync your code to Github or download the project altogether. Because I was having problems with Replicate, I did download the code and fix it locally in Cursor, where I have more typical IDE tooling that’s partially missing in the bolt cloud environment. To me, this is a critical feature of any platform since being able to ‘backdoor’ into the code provides a workaround if the agent is ever simply unable to fix the problem.

Is Bolt Ready for Primetime?

One question I got asked on Twitter is whether bolt is ready for real, production, monetized apps. My answer to that is: it depends on what the app is. If it’s something simple, bolt can probably do it well. Of course, simple apps rarely have competitive advantages outright, although simple apps in a niche market with the right sales and marketing can still greatly succeed. And now a no-coder could easily spin up a bolt.new app, integrate Stripe, and be off to the races.

On the other hand, if you wanted to start installing custom system libraries, you might run into the platform limitations. Even in this case, I could imagine bolt.new being a useful platform to build a frontend on that uses a more productionized backend API for the heavy lifting.

My Plans for Bolt Going Forward - Youtube and Expo Native iOS Apps

While the app was built for Minh’s event, I handled the engineering side, and I’m sharing it here primarily to showcase the tech stack, so here’s the link. Keep in mind this is primarily for Minh’s event so don’t share it too widely.

I’m very eager to continue playing with Bolt. If AI agents can build frontends, or at least frontend MVPs, I’m not worried about being ‘replaced’; I’m excited to work on more interesting problems than boilerplate frontend code. The fact that Bolt is offering a huge number of prizes is icing on the cake in case I win, though given there are already 80,000 people competing for 10 prizes, it’s sure to be competitive and I’m not counting on anything.

The project I’m most excited to work on is tooling to improve at YouTube. In case you didn’t notice, most of these Substack posts have YouTube embeds at the top, as I view YouTube as a platform with massive reach and its algorithm one of the best ways to reach new people. At the same time, I’ve frequently struggled with the video production. I have a vision for an app that helps me create “b-roll”, or background images that match my written script, as a way to more quickly add engaging visuals to my YouTube scripts. Since that’s a fairly complex app, I plan to start with a simpler use case - generating YouTube thumbnails.

Bolt also has an integration with Expo, which is a framework that simplifies React Native development and is a tool I used for my monetized native app, Live Poker Theory. I’m curious to try out Bolt on a new Expo app to compare my experience doing it with and without a no-code platform.

Overall, I’m very excited about the potential to ship web and mobile MVPs much faster with AI, and it’s clear that AI tools specifically built to accomplish this are making fast gains, with incredible demos from bolt, v0 by Vercel, and Lovable. Bolt’s Supabase integration, for the moment, puts it at the head of the pack for me, though I plan to keep an eye on all of them.

Thanks for reading! Reply to this email, comment on Substack, or email me with any thoughts. I look forward to sharing more of my progress on my bolt.new apps.

Building a Custom MCP Server to Query Firebase from Cursor

Bill Prin — Wed, 28 May 2025 20:47:24 GMT

Lately, I've been obsessed with bridging the gap between AI agents and real-world data. For my poker training app, Live Poker Theory, I often just wanted to quickly answer product questions, like "Are users studying tournaments or cash games more?" (These are two different sets of flashcards that users might study).

Now, I could go the traditional route: invest hours in setting up a full analytics platform like Mixpanel or Posthog, defining custom events, and building dashboards. Those are incredible tools, but they take time to configure and learn to use, and for a quick, ad-hoc question, it feels like overkill. My data lived in Firebase, my AI agent in Cursor, and the constant friction was that there wasn't an easy, intelligent way for them to talk.

I've already shown how MCP can connect AI tools to Notion for project planning, but what if we could apply the same principle to the product's actual backend data?

Managing a TODO list is cool, but what’s even cooler is the AI doing the TODO list. To do that, the AI needs access to the real data.

This post details exactly how I built a custom MCP server to enable Cursor and Claude Desktop to query my Firebase database directly. My goal was to put MCP to a tangible test, letting me analyze product usage with natural language, all within the comfort of my editor.

The Setup

Quick recap: MCP lets you expose tools that agents, like Claude or Cursor, can call to obtain external knowledge like the weather or your TODO list.

LLMs are trained once and remain static and unchanged, so any “new” knowledge they have must be provided by the user or a tool, and MCP provides a standard way to create these tools.

My goal was to use an MCP server so that a tool like Cursor can talk to Firebase, and specifically Firestore, the document database that is part of the Firebase suite of products.

When I started this work, no official Firebase MCP server existed, but an open source repo gave me a starting point. The repo is written by Gannon Hall, and it’s an excellent small reference repo to look at if you’re building an MCP server for any database. Since then, Firebase has released their own official MCP server.

But there was a big problem: neither Firebase’s official server nor Gannon’s unofficial server supported the Firebase count method, the exact method I needed.

The existing MCP servers do support listing and retrieving documents, and in theory you could simply get all the documents and count them, but with tens of thousands of sessions in my database, this would be slow and expensive, since Firebase charges you based on the number of times you read a document.

Fortunately, the firebase-admin API supports a way to count all documents that match a query without retrieving them all, but that API call just wasn’t added to either of the Firebase MCP servers.
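To make the cost difference concrete, here is a back-of-the-envelope sketch. It assumes one billed read per document fetched, and that a count aggregation is billed at one read per batch of up to 1,000 matched index entries. That pricing rule and the 40,000 figure are assumptions for illustration, so check the current Firestore pricing docs before relying on the numbers. In the Node Admin SDK, the count itself is expressed as query.count().get().

```typescript
// Assumed billing rule: a count aggregation costs one read per 1,000
// matched index entries (minimum one read). Verify against current pricing.
const ENTRIES_PER_AGGREGATION_READ = 1000;

// Listing every document: each one retrieved is a billed read.
function readsToFetchAll(matchingDocs: number): number {
  return matchingDocs;
}

// Server-side count: billed per batch of index entries, not per document.
function readsToCount(matchingDocs: number): number {
  return Math.max(1, Math.ceil(matchingDocs / ENTRIES_PER_AGGREGATION_READ));
}

const sessionCount = 40000; // illustrative stand-in for "tens of thousands"
console.log(readsToFetchAll(sessionCount)); // 40000
console.log(readsToCount(sessionCount));    // 40
```

Under those assumptions the count is about a thousand times cheaper in reads, and it also avoids shipping every document over the wire just to throw it away.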

Of course, this problem was a great opportunity for me to explore extending an MCP server for a real use case and write about it in this newsletter!

To Vibe Code Or Not To Vibe Code?

To do this, I forked Gannon’s repo and made the changes. I’ll try to get these merged upstream, but in the meantime, you can check out my working fork with firestore_count_documents support on my Github.

While adding the new tool, I did what most MCP tutorials suggest: I asked the AI (via Cursor) to write the code for me.

And to its credit, it worked. The assistant quickly scaffolded a functioning count_documents tool using Firestore’s admin SDK.

But this is also where “vibe coding” started to break down.

I asked it to add a filters parameter to the count query. It did - but it also added orderBy and pagination fields. That makes no sense for a count - you’re getting a single number. When I pointed this out, the AI agreed and removed them. But it was only my own experience reviewing the code that caught the problem.

To get it working, I had to dig into how MCP servers handle tool registration and request handling under the hood. I also leaned heavily on the MCP Inspector, a debugging tool that helped me simulate tool calls and verify responses without needing to go through Claude or Cursor directly. If you're building your own MCP integration, especially with a backend like Firebase, this next section covers the key pieces you'll want to get right.

If you’re curious about just the changes I made to Gannon’s repo, I’ve uploaded the two steps (listing what the tool does, and actually doing it) in this Github Gist.
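Those two steps can be sketched without the SDK at all. The snippet below is a simplified stand-in, not the MCP SDK’s API and not the code from my fork: one function describes the tool, another dispatches a call by name. The input shape (a collection plus filters, and deliberately no orderBy or limit), the chart values, and the in-memory data are all assumptions for illustration.

```typescript
type Op = "==" | ">=" | "<";
interface Filter { field: string; op: Op; value: string }
interface CountArgs { collection: string; filters?: Filter[] }
type Doc = Record<string, string>;

// Step 1: describe the tool so a client can discover it.
const countTool = {
  name: "firestore_count_documents",
  description: "Count documents in a collection that match optional filters",
  inputSchema: {
    type: "object",
    properties: {
      collection: { type: "string" },
      filters: { type: "array" }, // no orderBy or limit: a count is one number
    },
    required: ["collection"],
  },
};

function listTools() {
  return [countTool];
}

function matches(doc: Doc, f: Filter): boolean {
  const v = doc[f.field];
  if (v === undefined) return false;
  if (f.op === "==") return v === f.value;
  if (f.op === ">=") return v >= f.value;
  return v < f.value;
}

// Step 2: handle a call. A real server would query Firestore here;
// this stand-in counts over an in-memory map instead.
function callTool(name: string, args: CountArgs, data: Record<string, Doc[]>): { count: number } {
  if (name !== countTool.name) throw new Error(`Unknown tool: ${name}`);
  const docs = data[args.collection] ?? [];
  const filters = args.filters ?? [];
  return { count: docs.filter((d) => filters.every((f) => matches(d, f))).length };
}

const fakeData: Record<string, Doc[]> = {
  sessions: [{ chart: "MTT-20bb" }, { chart: "CASH-100bb" }, { chart: "MTT-40bb" }],
};
const countResult = callTool(
  countTool.name,
  {
    collection: "sessions",
    filters: [
      { field: "chart", op: ">=", value: "MTT" },
      { field: "chart", op: "<", value: "MTU" },
    ],
  },
  fakeData,
);
console.log(listTools().map((t) => t.name), countResult);
```

Keeping the description and the handler side by side makes it easier to spot a schema that promises parameters the handler has no use for, which is exactly the orderBy and pagination mistake described above.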

Challenges I Encountered and Lessons I Learned

The process was mostly smooth, but I ran into a few challenges, and learned a few things:

Restart fatigue: I often forgot to restart Claude Desktop, reload MCP servers in Cursor, or rebuild the local server after making changes. Small things, but they frequently tripped me up: I thought my changes weren’t working when in fact they weren’t being picked up.
Environment handling in MCP Inspector: It's a great tool, but make sure you learn about providing command line environment variables with the `-e` flag. Otherwise it’s another thing that’s easy to forget, and it’s tedious to copy and paste them in every time. In my case, I needed to add the credential to authenticate to Firebase as an environment variable.
Firebase data modeling trickiness: My sessions don’t explicitly say “tournament” or “cash.” Instead, I inferred it from the chart field. Firebase doesn’t support any sort of substring match in its filters. However, it does support string comparison, and fortunately, all my tournament charts begin with MTT, so I used string comparison ("MTT" <= x < "MTU") to filter them.
Tool registration differences: Gannon’s implementation uses server.setRequestHandler(ListToolsRequestSchema) and CallToolRequestSchema. This gives more control than the server.tool() shortcut highlighted in the weather tutorial, the first tutorial on the official MCP website. This is helpful if you want to customize routing or shape tools more directly.
Smithery integration: The repo includes a smithery.yaml file for publishing to Smithery AI, a community MCP registry. It also includes HTTP transport, which is forward-compatible even if you're using stdio today.
Port check issue: The server refused to start if port 3000 was in use, even when running in stdio mode where that port is irrelevant. Not a dealbreaker, but it was confusing until I understood the cause.
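The string-comparison trick from the data modeling item generalizes to any prefix: the upper bound is the prefix with its last character bumped by one. Here is a minimal sketch with a hypothetical helper name and made-up chart values. It assumes plain ASCII prefixes, where JavaScript’s string ordering and the database’s ordering agree.

```typescript
// Returns the half-open range [start, end) covering every string
// that begins with `prefix`, under plain code-unit ordering.
function prefixRange(prefix: string): { start: string; end: string } {
  if (prefix.length === 0) throw new Error("prefix must be non-empty");
  const last = prefix.charCodeAt(prefix.length - 1);
  if (last === 0xffff) throw new Error("cannot increment last character");
  const end = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { start: prefix, end };
}

// True when `value` falls inside the half-open range.
function inRange(value: string, range: { start: string; end: string }): boolean {
  return value >= range.start && value < range.end;
}

const mtt = prefixRange("MTT");
console.log(mtt); // { start: "MTT", end: "MTU" }
console.log(["MTT-20bb", "MTTX", "MTU", "CASH-100bb"].map((c) => inRange(c, mtt)));
// [true, true, false, false]
```

The two bounds map directly onto a pair of filters, one with >= on the start and one with < on the end.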

Product Insights I Obtained

About 17,000 sessions, or roughly 40 percent, were tournament-focused. That surprised me. I had assumed tournaments would dominate, especially since the demo flow starts in the tournament trainer and I didn’t exclude demo sessions from the count.

For those unfamiliar with poker strategy: tournaments tend to play short-stacked, which makes preflop decisions more structured and math-driven. Cash games, on the other hand, are often deep-stacked and reward postflop creativity. That usually makes preflop study less essential in cash compared to tournaments.

But the data is clear. If 60 percent of sessions are spent studying cash game spots, I’d be ignoring reality if I continued thinking of tournaments as the main use case.

Takeaways

MCP + Cursor lets me query my real product data in my editor, by translating natural language questions into database queries.
You can vibe code most of this, but you should still review the logic closely, as even state-of-the-art models still make major mistakes.
MCP tooling is still in the early stages, and while remote registries like Smithery AI exist and server authors are adding HTTP transports, for now, we mostly live in a local MCP server world. For that, MCP Inspector is a critical and useful tool.

Conclusion

Thanks for reading. If you have any thoughts or opinions on Firebase + MCP, please get in touch by replying to this email, leaving a comment on Substack, or shooting me an email linked on my personal site.

For the next edition of AI Engineering Report, I’m equally excited about bolt.new’s $1 million (!) June hackathon, AI Agent architectures, and OpenAI’s recently released image generation API. Thanks for reading and stay tuned!

AI Engineer (AIE) vs. Machine Learning Engineer (MLE) : Why Hacker News Purists Are Wrong

Bill Prin — Thu, 15 May 2025 17:43:11 GMT

Everyone’s talking about AI, but not everyone agrees on what “AI Engineering” actually means, especially when compared to a more established role like Machine Learning Engineer.

Machine Learning Engineers are typically the ones that build the AI, and the job title is a standard one at leading tech companies like Meta, Uber, and Netflix. So at first, it might seem odd that ‘AI Engineering’ largely refers to something other than what those engineers building AI are doing.

In this article, I’ll explain how I think the terms are evolving, and why I believe “AI Engineering” is going to become the dominant label for a growing category of engineering work. In essence, AI Engineering is becoming a shorthand for Applied AI Engineering, building applications, systems, and tooling on top of pretrained large foundational models, especially LLMs and image generators like Stable Diffusion.

AI - or Artificial Intelligence - has been around since the 1950s so it might seem strange that AI Engineering is such a recent term. But it is, and I’ll explain why below. And like all new terminology, there’s no clear consensus on what exactly it means.

The definition I’ll use in this article reflects a growing industry trend - and one I believe is likely to become the standard, for reasons I’ll explore, even if it contradicts how some purists feel. There's also uncertainty around whether the formal job title “AI Engineer” will catch on, or whether it will morph into something like “Software Engineer - Applied AI”.

Since I’ve named this newsletter AI Engineering Report, it’s only fair that I explain what the term means to me. I’ll also look at how AI Engineering compares to Machine Learning Engineering work.

AI Engineering vs. ML Engineering

The definition of "AI Engineering" varies depending on who you ask.

Some purists argue that only those building the AI models deserve the “AI Engineer” label, and that those merely using the models are just software engineers applying machine learning.

I think the purists are going to lose that fight.

Yes, their argument makes sense in a historical context. But “AI” has become a tidal-wave buzzword, and in practice, it now mostly refers to using models like GPT, Claude, and Stable Diffusion, not building them from scratch. Far more people will work with these models than will train them.

And importantly, the people building the models already have accurate titles: AI Researcher, Research Engineer, ML Engineer. There's room for a new label for those who specialize in building software with these models at the core.

A Brief History of AI Nomenclature

AI has always had many flavors, from brute-force search (e.g. chess), to symbolic logic (rule-based systems), to today’s statistical and neural methods.

Ultimately, machine learning, especially statistical and neural approaches, became the dominant paradigm for real-world AI products.

From Google Search and Translate, to the Facebook feed, to Netflix recommendations and Google Photos, machine learning powers nearly every major "AI product" of the last 15 years.

The people building those systems were called Machine Learning Engineers, because that’s what they were doing day-to-day: training models, optimizing pipelines, working closely with data and researchers.

Meanwhile, most academic researchers simply labeled their work AI research. Even if it included engineering, it wasn't usually framed that way.

Because the industry embraced "machine learning" as its operational term, and the academic community was smaller and less media-visible, the term AI remained loosely defined… until the rise of LLMs and generative models in the 2020s.

Today, AI in popular discourse largely refers to generative foundation models, especially LLMs. So it’s natural that AI Engineering now refers to engineering work focused on using those models.

ML Engineer vs. AI Engineer: What They Work On

Let’s be honest: job titles are messy, and responsibilities often overlap.

Roles like software engineer, data scientist, infra engineer, and ML engineer frequently blur. But we can still draw a rough outline of where things tend to cluster.

Machine Learning Engineers:

These are the people who build foundational models and bring them to production scale. Some specialize in algorithm design and research; others focus on training pipelines, distributed compute, and massive-scale data engineering.

Common areas of work:

Ranking systems (e.g. search engines, ad platforms)
Recommendation systems (e.g. Netflix, YouTube)
Classification (e.g. spam detection, fraud)
Forecasting and time series prediction
Computer vision and traditional NLP

Applied AI Engineers (AI Engineers):

These are software engineers who build apps, tools, workflows, and systems on top of pretrained models, especially generative ones like GPT-4 or Stable Diffusion.

Common areas of work:

Prompt engineering, prompt routing, and finetuning
Retrieval-augmented generation (RAG) and vector search
LLM pricing and latency optimization
Model benchmarking and evals
Agent architectures
AI-Assisted coding and AI-Driven coding (vibe coding)
Apps with LLM technology as a central feature

No Neat Boxes

In practice, there's no clean boundary between AI engineering and other disciplines.

A senior FAANG engineer recently told me: “At our company, data scientists run prompt evals, and software engineers handle prompt routing optimization.”

This reflects the reality that AI Engineering often spans teams and disciplines. And while I can’t say with certainty where industry consensus will land, I’m absolutely confident that this type of work is exploding in complexity, scope, and importance.

That means we need better language and clearer mental models, and this newsletter exists to help shape that conversation.

Connecting Claude to Notion with MCP

Bill Prin — Fri, 09 May 2025 18:38:14 GMT

Welcome to the second edition of the AI Engineering Report. This week, I’m exploring how I used Notion’s recently released official MCP server to connect my AI chatbot (Claude) to the documents, spreadsheets, and calendar I use to manage my life and projects. I’ll cover:

What Notion is and how the new official MCP server works
Why this general idea of connecting AIs to real tools like Notion, Google Docs, JIRA, Confluence, or Linear is so powerful
How this changes what AI agents can actually do

Notion MCP

If you’re not familiar with Notion, think of it like Google Docs but with a better UI - you can store and manage a document like a resume, a spreadsheet, or a calendar of events. I use it to keep track of almost everything in my life.

That’s why I was so excited to try the new official Notion MCP server.

If you’re not using a coding IDE like Cursor or Windsurf, your best option for an MCP-enabled chatbot is Claude Desktop. (OpenAI has announced MCP support for ChatGPT, but it’s not released yet. It’s a similar story with Claude Web.)

Once the Notion MCP server is installed, Claude can do things like:

“Look at my TODO list and pull up any tasks related to home improvement that are older than 6 months.”

Under the hood, Claude:

Locates the TODO document
Queries all items
Filters by date and natural language relevance (“home improvement”)
Uses the Notion API efficiently, but fills in the gaps where intelligence is needed
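
The split between deterministic filtering and judgment-based filtering can be sketched in a few lines of Python. This is only an illustration, not how Claude or the Notion MCP server actually works: the Task class, the keyword-based is_relevant check (a crude stand-in for the model's natural-language judgment), and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Task:
    title: str
    created: date


def is_relevant(task: Task, keywords: set[str]) -> bool:
    # Stand-in for the LLM's relevance judgment: a crude keyword match.
    words = set(task.title.lower().split())
    return bool(words & keywords)


def stale_tasks(tasks, keywords, today, max_age_days=183):
    # The date check is deterministic, so plain code (or the API) can handle it.
    cutoff = today - timedelta(days=max_age_days)
    return [t for t in tasks if t.created < cutoff and is_relevant(t, keywords)]


# Hypothetical TODO items, for illustration only.
tasks = [
    Task("Paint the fence", date(2024, 9, 1)),
    Task("Renew passport", date(2024, 8, 15)),
    Task("Fix leaky faucet", date(2025, 4, 20)),
]
keywords = {"paint", "fix", "repair", "fence", "faucet"}
old = stale_tasks(tasks, keywords, today=date(2025, 5, 9))
print([t.title for t in old])
```

In a real agent, the date comparison would stay as ordinary code or an API filter, while the relevance check is the part the model fills in.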

This hybrid model, which combines API-level speed with AI-level reasoning, is the future of agentic workflows.

Why This Pattern Matters for Engineers (and most other people too)

I wrote about MCP in the first issue of this newsletter, and I’ll likely return to it again. It’s a simple concept, but it unlocks a huge surface area for useful AI applications.

LLMs are only as good as the information they have access to. They hallucinate. They make up details. And they often can’t tell when they’re doing it.

MCP changes that. It gives them structured, real-time access to the data you already use. Some early use cases that stand out:

Letting an AI coding assistant reference live business requirements while writing code
Forecasting project timelines using real calendars and backlogs
Understanding team org charts and routing questions to the right person

These use cases don’t require “superintelligence.” They just require access. MCP provides that access.

Want to Try It?

If you want to try the Notion MCP setup, I wrote a short tutorial here:

How to Let Claude Access Your Notion data via MCP

I originally planned to cover building a custom MCP server (e.g., for a niche internal tool or a more streamlined Notion setup). But before doing that, it helps to see what’s possible with an official, full-featured one.

Thanks for reading. If this was useful, feel free to forward it or reply with thoughts!

Making a Video Game with AI By Just Typing English (Vibe Coding Game Jam)

Bill Prin — Thu, 01 May 2025 18:56:45 GMT

Welcome to the first edition of the AI Engineering Report.

In this newsletter, I’ll cover my experience participating in the Vibe Coding Game Jam, where the challenge was to create a game from scratch using only English prompts - no coding allowed. I used AI tools to build a playable deckbuilder game from scratch, discovered the power of the new Model Context Protocol (MCP), and came away with some new opinions on AI-assisted development.

My AI-Generated Game

Vibe coding - using AI to write 100% of your code - is having a major moment.

I recently made a video game from scratch by just typing English into a prompt window, as part of a game jam to celebrate “vibe coding”.

Here’s a quick video of my game in action:

And here is a link to Play My Game - Vibeslayer. The game should be playable in your browser on both web and mobile, but be warned: I only made the first level!

The game is in the deckbuilding genre, which you can read about on Wikipedia. Specifically, it’s very similar to Slay The Spire, which is a great game if you like mobile strategy games, and a terrible game if you have anything more important to do with your time.

There’s only 1 level in my game and it’s very thin. The game jam had over 1000 entries and I didn’t expect to get much attention. My main goal was to learn pure AI coding better.

The AI was a massive unlock for the mechanics of grabbing and dragging a card. The CSS and physics needed for a smooth card-playing motion aren’t exactly difficult, but they are extremely tedious and complicated. I estimate it would have taken me at least a week to get the game feeling right, but it took the AI about 90 minutes.

I used Midjourney for the art, though this was before OpenAI released their incredible new image gen. I also asked the AI agent for advice on getting the best Midjourney prompts, which is one of the best image-generation hacks (use the AI for advice on prompting the AI better).

Vibe Code Jam and Why Vibe Coding Is Huge Right Now

I made this game as part of the “vibe code jam”, organized by indie hacker legend Pieter Levels, which drew over 1,000 entries.

Why has vibe coding exploded this year? It’s a confluence of a few factors:

Base models have gotten much better at coding - Every time OpenAI, Anthropic, and Google release new coding models, their ability improves. This year, Anthropic released Claude 3.7 and Google released Gemini 2.5 Pro - both leading models for AI coding.
IDE assistants like Cursor, Cline, and Windsurf have improved at using the base models - The base models give you a starting point, but there’s still a lot of logistics involved in using them effectively in your coding workflows. The first pass at AI coding assistants basically just replaced the workflow of “copy my code into ChatGPT, ask the question, copy the response”. The new generation of these tools creates agents that can form complex plans, such as searching for the files needed to implement a new feature, and can make fancy multi-line automatic edits as you write code by continually scanning the context around your cursor.
The tools are getting less siloed - Late last year, Anthropic released MCP, or Model Context Protocol, an open standard for connecting AI agents to external tools and data sources. Now, AI agents can look not just at your code, but at your browser to see any errors as they build your web app, at your database to understand the structure of the data you’re working with, or at your company’s internal wiki for key documentation to understand requirements.

MCP was particularly relevant while I was creating my game. Often, I’d ask the AI to implement a feature, and it would cause an error in the browser that the AI agent couldn’t see, because it had no access to the browser logs. With MCP, I was able to give the AI a way to check for errors and fix any it found.
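
To make the browser-log idea concrete, here is a rough sketch of what an agent's request to an MCP server might look like. MCP messages are JSON-RPC 2.0, and tool invocations use the tools/call method. The tool name get_console_errors, its arguments, and the sample response are hypothetical and not taken from any particular server.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    # MCP tool invocations are JSON-RPC 2.0 requests with method "tools/call".
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response):
    # Tool results carry a list of content items; collect the text ones.
    items = response.get("result", {}).get("content", [])
    return [item["text"] for item in items if item.get("type") == "text"]


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "get_console_errors", {"limit": 10})
print(json.dumps(request, indent=2))

# Hypothetical response a browser-tools server might send back.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "TypeError: card is undefined"}],
        "isError": False,
    },
}
errors = extract_text(response)
print(errors)
```

The agent never touches the browser directly. It only sees whatever text the server returns, which is enough for it to notice the error and attempt a fix.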

I wrote a tutorial on how to use MCP to debug browser errors on my website here:

How to Use MCP to Let Your AI IDE See and Fix Browser Console Errors

Vibe Coding Tips and Why I Believe Prompt Engineering Will Be Huge (But Called Something Else)

The biggest lesson by far from the vibe coding jam: I’m now fully convinced that prompt engineering is a massive, important field.

There were several points while making my game for the vibe coding contest when I wanted to just stop the AI from going in circles and write some code myself to get it unstuck. I didn’t do that, so as to stay within the spirit of the competition and the exercise.

But what I learned is that I was able to get myself “unstuck” from situations where the LLM was going in loops by prompting it creatively.

For example, here are some of the ways I got it unstuck:

Breaking it down into smaller steps - instead of “the user grabs the card and drags it up to play the card”, first just focus on making the card draggable and worry about playing it later
Breaking down its reasoning with it - At one point, “attack cards” were playable, but “defend cards” weren’t. Telling the agent “attack cards work, just copy that” wasn’t working. But what did work was telling it “first, describe the exact logic by which an attack card will be played. next, describe the exact logic by which a defend card will be played. next, verify if those will always be exactly the same, if not, fix it”. That worked!
Changing models and just rewording commands - The equivalent of “smack your hand against the television until it works”, this strategy is more viable than ever in an era of non-deterministic machines. The two leading models are Gemini 2.5 Pro and Claude 3.7, so I usually switched between those, but others are worth trying.
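
The second tip, asking the model to spell out its logic for the working case and the broken case before fixing anything, can be turned into a reusable template. This is a minimal sketch. The function name compare_logic_prompt and the exact wording are illustrative assumptions, not a tested recipe.

```python
def compare_logic_prompt(working: str, broken: str) -> str:
    # Ask for each flow to be described separately before any fix is attempted.
    steps = [
        f"First, describe the exact logic by which {working} will be played.",
        f"Next, describe the exact logic by which {broken} will be played.",
        "Next, verify whether those two flows will always be exactly the same.",
        "If they are not, fix the difference.",
    ]
    # Number the steps so the model works through them in order.
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


prompt = compare_logic_prompt("an attack card", "a defend card")
print(prompt)
```

The same shape works for any pair of features where one behaves correctly and a near-identical one does not.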

Ironically, as I’m writing this post, I’m seeing an email alert from LinkedIn saying that “prompt engineering” is a dead job title. I think the reason is that prompt engineering is still engineering, so I expect most people who do prompt engineering work will instead use the title AI Engineer.

Conclusion

Building my game with only AI was a surprisingly emotional experience.
In many ways, it felt like what I imagined programming would be as a kid: just telling a machine what you want it to do.

Today, "vibe coding" is often pitched as a way for non-technical people to build apps without engineers. I don't think we're quite there yet. AI still struggles with novel problems, and engineering involves much more than code: testing, deployment, and monitoring all still require real judgment.

But it's close, and that's why vibe coding is so exciting for senior engineers.
The best results come when AI handles the busywork, and experienced engineers step in to fix architecture and unblock edge cases.

History shows a familiar pattern:

COBOL promised English-like programming to replace engineers.
Java promised programming without worrying about system details like memory management to replace engineers.
Each time, we didn’t eliminate engineers; we shifted the line between "technical" and "non-technical."

I believe AI will do the same. Most people still won't build full software products.
But a new wave of "almost-technical" builders will, and senior engineers who master these AI tools will move even faster.