How YC Startups Use AI: Agents, OCR, and Prompt Engineering with Mercoa (YC W23)
An interview with Sandeep Dinesh, Co‑Founder and CTO of Mercoa (YC W23), on building AI agents, doing OCR with LLMs, and lessons learned from AI in the trenches
Y Combinator is the highest-profile startup incubator and one of the loudest advocates of LLM adoption. They report that roughly a quarter of their portfolio lets AI write 95 percent of its code, and that nearly every new company touches AI in some way.
Mercoa fits that mold. The Winter 2023 batch company turns accounts-payable rails into an AI-powered bill-pay agent, and its CTO, Sandeep Dinesh (formerly my colleague at Google), has been shipping LLM features since the GPT-3.5 era.
In this interview we cover:
The state-machine AI agent architecture behind Mercoa’s payment product
Prompt-engineering tactics their team refined in production
How they pick models and tools (Gemini, GPT-4, BAML, Stagehand)
Advice for founders and engineers who want to move faster with AI
Interviewer: Bill Prin, AI Engineering Report
Bill: Quick intro – what is Mercoa?
Sandeep: Mercoa is an embedded accounts‑payable and accounts‑receivable platform. We sell to vertical SaaS, banks, and payment issuers such as Mercury or Brex. Their end customers are businesses that handle thousands of bills each month.
Recently we pivoted to an AI agent that pays invoices with virtual credit cards. The agent processes invoices, figures out whether the vendor will accept card, navigates the payment portal or checkout page, and executes the payment.
What are some of the ways you’ve used AI and LLMs to build features that help customers or otherwise give your business an advantage?
Sandeep: Back in the GPT‑3.5 days we dumped invoice text into ChatGPT with a structured prompt. It outperformed specialized OCR APIs that cost ten cents per document and struggled with layout variation. It was a bit surprising that these much more general, language-focused models would beat specialized computer-vision techniques, but their ability to handle slight imperfections in the input is ultimately where they came out ahead. Today we run Gemini 2.5 Pro for vision OCR. No fine‑tuning.
After one-shotting the invoice page through an LLM, we pipe the raw text into BAML, which extracts the fields we need into clean JSON.
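A minimal TypeScript sketch of that two-step flow. The InvoiceFields schema and the placeholder helpers are illustrative stand-ins for the real Gemini vision call and BAML extraction function, not Mercoa’s actual code:

```typescript
// Hypothetical sketch of the OCR -> structured-extraction pipeline.

interface InvoiceFields {
  vendorName: string;
  invoiceNumber: string;
  dueDate: string;     // ISO 8601
  totalAmount: number; // in cents
  currency: string;
}

// Placeholder for the vision-OCR call (e.g. Gemini 2.5 Pro behind an SDK).
async function ocrPage(pageImage: Uint8Array): Promise<string> {
  throw new Error("wire up your vision model client here");
}

// Placeholder for the structured-extraction call (Mercoa routes this through
// BAML so the model returns clean, schema-checked JSON).
async function extractFields(rawText: string): Promise<InvoiceFields> {
  throw new Error("wire up your extraction prompt / BAML function here");
}

// Step 1: one-shot the page through the vision model.
// Step 2: pull out only the fields downstream code needs.
export async function processInvoicePage(page: Uint8Array): Promise<InvoiceFields> {
  const rawText = await ocrPage(page);
  return extractFields(rawText);
}
```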
What are some of your lessons learned about AI Engineering based on that experience?
Sandeep: A key lesson around prompt engineering has been that less context reduces hallucination. So we split multi‑page docs, process pages in parallel, then use business logic to reconcile totals. We one‑shot with a tight system prompt and minimal examples.
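A rough sketch of that split-and-reconcile pattern, with a hypothetical per-page extraction helper and a stated invoice total to check against:

```typescript
// Each page gets its own small prompt, pages run in parallel, and plain
// business logic (not the model) reconciles the results.

interface PageResult {
  pageTotal: number; // line-item total found on this page, in cents
  lineItems: number; // how many line items the model extracted
}

// Placeholder: tight system prompt + minimal examples, one page at a time.
async function extractPage(pageText: string): Promise<PageResult> {
  throw new Error("wire up the per-page extraction prompt here");
}

export async function processInvoice(pages: string[], statedTotal: number) {
  // Process every page in parallel rather than one giant prompt.
  const results = await Promise.all(pages.map(extractPage));

  // Reconcile with business logic instead of asking the model to add numbers.
  const summed = results.reduce((acc, r) => acc + r.pageTotal, 0);
  const reconciled = summed === statedTotal;

  return { results, summed, reconciled };
}
```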
Bill: Interesting, that sounds like ‘context rot’, a term Simon Willison has written about recently and that I’ve seen popping up in YouTube videos.
Sandeep: Yeah, exactly. With early LLMs the context windows were tiny, GPT-3 had 2,048 tokens, and people were begging for bigger windows. Now models like Gemini ship with million-token context windows, which sounds great at first, until you realize that too much context can actually hurt your results. You want to figure out the smallest amount of information you can give the model to complete the task, as that will usually produce the best results.
What are some other applications you’ve built? Have you found the need to do any fine-tuning or RAG applications?
Sandeep: We’ve completely ignored fine-tuning. There’s an emerging consensus that it’s too time-consuming and expensive relative to other approaches, especially with the foundation models themselves evolving so fast. You’d have to fine-tune again for every model you want to try each time a new one is released. Meanwhile, providing better prompts and better examples is typically cheaper and more effective.
For RAG, we have a use case where we predict metadata for invoices, and we’ve found that a “lazy RAG” approach has been most effective. In this case, we just do a database lookup for invoices that have some similarity based on the limited metadata we do have, and then we use those as examples so that the model can predict the rest of the metadata. For our use cases, we haven’t found the need for more complex embeddings or vector database approaches.
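A sketch of what that lazy-RAG lookup could look like. The metadata schema and query are illustrative, and an ordinary database lookup stands in for any embedding or vector-store step:

```typescript
// Look up similar past invoices with a plain database query and feed them in
// as few-shot examples so the model can predict the missing metadata.

interface InvoiceMetadata {
  vendorName: string;
  glCode?: string;   // a field we want the model to predict
  approver?: string; // another predicted field
}

// Placeholder for an ordinary SQL/ORM lookup, no embeddings or vector DB.
async function findSimilarInvoices(vendorName: string, limit: number): Promise<InvoiceMetadata[]> {
  throw new Error("SELECT ... FROM invoices WHERE vendor_name ILIKE ... LIMIT ...");
}

export async function buildPredictionPrompt(invoice: InvoiceMetadata): Promise<string> {
  const examples = await findSimilarInvoices(invoice.vendorName, 5);

  // The retrieved rows become few-shot examples; the model fills in the rest.
  return [
    "Given past invoices from similar vendors, predict the missing metadata.",
    "Examples:",
    ...examples.map((e) => JSON.stringify(e)),
    "New invoice:",
    JSON.stringify(invoice),
  ].join("\n");
}
```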
Let’s talk more about this AI agent you’ve been focusing on recently and have made your primary product offering. What does the agent architecture look like?
Sandeep: Think of it as a state machine that only goes in one direction. State 1: We have an invoice PDF. State 2: Determine if the vendor supports card. State 3: Get to the card form. State 4: Fill details and submit. Transition logic is fuzzy, so we let the model decide how to move forward inside each state. The secret sauce is the chain of prompts and loops to get from one state to the next in a reliable way.
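A minimal sketch of that one-directional state machine. The state names follow the four states above; the step functions are illustrative placeholders rather than Mercoa’s implementation:

```typescript
// Forward-only state machine: code controls the transitions, the model
// decides how to move forward inside each state.

type AgentState =
  | { step: "HAS_INVOICE"; invoicePdf: Uint8Array }
  | { step: "CARD_SUPPORT_CHECKED"; acceptsCard: boolean }
  | { step: "ON_CARD_FORM" }
  | { step: "PAYMENT_SUBMITTED"; confirmation: string }
  | { step: "DONE"; paid: boolean };

async function nextState(state: AgentState): Promise<AgentState> {
  switch (state.step) {
    case "HAS_INVOICE":
      return { step: "CARD_SUPPORT_CHECKED", acceptsCard: await checkCardSupport(state.invoicePdf) };
    case "CARD_SUPPORT_CHECKED":
      if (!state.acceptsCard) return { step: "DONE", paid: false };
      await navigateToCardForm(); // model-guided browsing to reach the form
      return { step: "ON_CARD_FORM" };
    case "ON_CARD_FORM":
      return { step: "PAYMENT_SUBMITTED", confirmation: await fillAndSubmitCardForm() };
    case "PAYMENT_SUBMITTED":
      return { step: "DONE", paid: true };
    case "DONE":
      return state;
  }
}

// Placeholders for the model- and browser-driven steps.
async function checkCardSupport(pdf: Uint8Array): Promise<boolean> {
  throw new Error("LLM + lookup to decide whether the vendor accepts card");
}
async function navigateToCardForm(): Promise<void> {
  throw new Error("browser automation step that reaches the card form");
}
async function fillAndSubmitCardForm(): Promise<string> {
  throw new Error("browser automation step that fills and submits the form");
}
```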
One very useful tip we learned is that LLMs tend to hallucinate and answer confidently even when they don’t know, if you ask them for a definite answer. But if you provide an ‘escape hatch’ and explicitly tell them they can say they don’t know, they will use it, which reduces bad answers. So typically we use BAML and have the model answer ‘yes’, ‘no’, or ‘unknown’. BAML lets the model write its chain-of-thought into a reasoning field while our code reads only the acceptCard field, which keeps the agent deterministic. The ‘unknown’ answer is the escape hatch that prevents some hallucinations.
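A sketch of the enum escape hatch; the field names mirror the reasoning and acceptCard fields mentioned above, while the downstream actions are illustrative:

```typescript
// The model must answer with one of three enum values. It can think out loud
// in a separate reasoning field that the calling code never branches on.

type CardSupportAnswer = "yes" | "no" | "unknown";

interface CardSupportResult {
  reasoning: string;             // free-form chain-of-thought, logged but ignored
  acceptCard: CardSupportAnswer; // the only field the agent's code reads
}

function decideNextStep(result: CardSupportResult): "pay_by_card" | "fallback" | "ask_human" {
  switch (result.acceptCard) {
    case "yes":
      return "pay_by_card";
    case "no":
      return "fallback";  // e.g. pay by ACH or check instead of card
    case "unknown":
      return "ask_human"; // the escape hatch: don't guess, escalate
  }
}
```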
There is an explosion of browser‑automation libraries. What are you using?
Sandeep: We shipped v1 on Stagehand because it worked fastest for us in TypeScript. We are evaluating Browserless, Browserflow, and Playwright‑based agents. Vendor lock‑in is a concern, so we prefer thin wrappers around browser primitives.
How do you choose between GPT‑4, Claude, Gemini, or open models?
Sandeep: Criteria are quality first, then latency, then price. For heavy OCR we use Gemini 2.5 Pro – accuracy matters most there. For the agent’s many incremental steps, Gemini Flash is fast and cheap.
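An illustrative per-task routing map based on those criteria; the model IDs are indicative only, not a confirmed Mercoa configuration:

```typescript
// Route each task to the cheapest model that meets its quality bar.
const MODEL_FOR_TASK = {
  invoiceOcr: "gemini-2.5-pro",   // heavy OCR: accuracy matters most
  agentStep: "gemini-2.5-flash",  // many small incremental steps: fast and cheap
} as const;

type Task = keyof typeof MODEL_FOR_TASK;

function modelFor(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```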
How do you evaluate the quality of your LLM responses?
Sandeep: We keep a unit‑test style suite per task. New model comes out, swap it in BAML, run the suite. If it passes, we migrate.
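A sketch of what such a suite could look like using Node’s built-in test runner; the fixture and the extraction call under test are placeholders:

```typescript
// Unit-test-style evals: a handful of known invoices with expected outputs,
// run against whichever model is currently configured behind the extractor.

import { test } from "node:test";
import assert from "node:assert/strict";

interface Expected {
  vendorName: string;
  totalAmount: number; // cents
}

// Placeholder for the extraction call under test. Swap the model behind it
// and re-run the suite to decide whether to migrate.
async function extractInvoice(rawText: string): Promise<Expected> {
  throw new Error("call the currently configured model here");
}

const fixtures: Array<{ rawText: string; expected: Expected }> = [
  {
    rawText: "ACME Corp ... Total due: $1,234.00",
    expected: { vendorName: "ACME Corp", totalAmount: 123400 },
  },
];

for (const { rawText, expected } of fixtures) {
  test(`extracts ${expected.vendorName}`, async () => {
    const actual = await extractInvoice(rawText);
    assert.equal(actual.vendorName, expected.vendorName);
    assert.equal(actual.totalAmount, expected.totalAmount);
  });
}
```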
What prompt engineering tips have stuck with you over this time?
Keep the context window as small as possible – chunk large docs and post‑process.
Use enums instead of booleans (yes, no, unknown) to avoid bias.
Chain-of-thought can help, but only include it if it actually improves results for your task.
Prompt engineering is probably the best way to get better performance. A good prompt can outperform a better model.
What’s your advice for people building startups? If you started over, what would you do differently?
Sandeep: I would build agentic workflows sooner. If something today seems to deliver real value but the underlying tech feels not quite ready, remember that the tech is evolving so fast that it will likely be ready soon, and you’ll be the first to capitalize on it because you have a head start.
On hype and competition: massive VC funding can look scary, but it simply reflects a huge surface area of problems to solve. Focus on shipping something customers pay for.
What’s your advice for junior engineers looking for a job, or senior engineers looking to upskill in AI? Do you recommend any libraries or frameworks that seem hot, like LangChain or LlamaIndex?
Sandeep: I’ve always been skeptical of learning frameworks or libraries just because they’re popular, or for their own sake. Instead, try to build something interesting, but also talk to a lot of people in the space, be active on AI social media, and go to hackathons and meetups. When you do this, the need for new tools tends to emerge more organically from conversations. You’ll say, “this problem is giving me a hard time,” and someone will say, “oh, this library can help.”
I was a bit late to adopt Cursor, but I was on a call with a vendor and saw them solve something very quickly with it, which led me to try it. With this approach, any tool you adopt has a clear purpose and you understand why you’re using it, plus you have tangible output for all your work.
Resources and links
Mercoa – https://mercoa.ai
BAML – structured prompt and JSON extraction toolkit, https://github.com/BoundaryML/baml
Stagehand – TypeScript browser automation, https://docs.stagehand.dev/get_started/quickstart