Step 0: Initial Prompt
The error analysis process above assumes you have an AI system to analyze. Since I was starting from scratch, I first iterated on an initial system prompt using OpenAI’s playground and the prompting tips discussed in my previous blog post.
The resulting prompt was long, but you can see it in the GitHub repo. I used GPT-4.1.
Step 1: Curate Inputs
Next, we want to create a set of realistic inputs for this use case. While I could have generated these synthetically [3], I chose to write them myself since I will be the only user of this system anyway.
I did this by 1) reviewing existing posts and writing short notes for them, and 2) grabbing old notes I had sitting in a Notion database. The result was a .csv file containing 49 post ideas.
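For what it's worth, the ideas file is just a single-column table. Here's a minimal sketch of loading it (the file name and `idea` column header are assumptions, not necessarily what's in the repo):

```python
import pandas as pd

# File and column names are assumptions for illustration.
ideas = pd.read_csv("post_ideas.csv")

print(len(ideas))             # expecting 49 rows
print(ideas["idea"].iloc[0])  # preview the first post idea
```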
Step 2: Generate Outputs
With a system prompt and post ideas in hand, we can now generate LinkedIn posts for each idea. I used OpenAI’s Responses API for this.
Generating all the outputs took about 10 minutes. They were then saved to a new .csv file containing all 49 input-response pairs, available in the GitHub repo.
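Here's a minimal sketch of that generation loop, assuming the system prompt lives in a text file and the ideas sit in the CSV from Step 1 (file and column names are illustrative, not the exact ones in the repo):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# File and column names are assumptions for illustration.
system_prompt = open("system_prompt.txt").read()
ideas = pd.read_csv("post_ideas.csv")

rows = []
for idea in ideas["idea"]:
    # Responses API call: system prompt as instructions, post idea as input.
    response = client.responses.create(
        model="gpt-4.1",
        instructions=system_prompt,
        input=idea,
    )
    rows.append({"idea": idea, "response": response.output_text})

# Save all input-response pairs for annotation in the next steps.
pd.DataFrame(rows).to_csv("idea_response_pairs.csv", index=False)
```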
Here’s one example post:
Landing a job isn’t the only way to make money with AI.

Here’s how you can turn your AI skills into income—no matter your background:

1. **Get a job:** Land a full-time role as a machine learning engineer, data scientist, or prompt engineer. The fastest way to steady income and learning.
2. **Implementation:** Help teams build and deploy real AI solutions as a freelancer or contractor. Solve specific technical problems for businesses.
3. **Strategy & Consulting:** Guide founders, managers, or non-technical teams on what’s possible (and what’s hype), helping them make smarter decisions.
4. **Education:** Teach others—run workshops, create courses, write tutorials, or build a YouTube channel. There’s a hungry audience eager to learn.
5. **Products:** Build your own SaaS, tools, or micro products powered by AI. Riskier, but the most scalable path if you can nail a real pain point.
Which path are you focused on right now?
Step 3: Create Data Annotator
Next, I used Claude (via Cursor) to vibe-code a custom data annotator: a simple Streamlit app with sections for the input, the response, and my notes.
While I had to make some aesthetic tweaks to the app, Claude got all the key components and logic right. Surprisingly, this was the fastest part of the 5-step process.
While you could also do this in Excel (or the like), that adds a non-trivial amount of friction for long responses like the ones here. Manually reviewing data is hard enough, so making the process as frictionless as possible is critical [3].
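For illustration, here's a rough sketch of what such an annotator can look like. The file path and column names are assumptions for this sketch; the actual app in the repo differs in its details.

```python
import pandas as pd
import streamlit as st

# File path and column names are assumptions for illustration.
DATA_PATH = "idea_response_pairs.csv"

df = pd.read_csv(DATA_PATH)
if "notes" not in df.columns:
    df["notes"] = ""
df["notes"] = df["notes"].fillna("")

# Simple index-based navigation through the examples.
idx = st.number_input("Example", min_value=0, max_value=len(df) - 1, value=0, step=1)

st.subheader("Input")
st.write(df.loc[idx, "idea"])

st.subheader("Response")
st.write(df.loc[idx, "response"])

st.subheader("Notes")
notes = st.text_area("Observations / failure modes", value=str(df.loc[idx, "notes"]))

if st.button("Save notes"):
    df.loc[idx, "notes"] = notes
    df.to_csv(DATA_PATH, index=False)
    st.success("Saved.")
```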
Step 4: Manual Review
Now we get to the meat of error analysis: looking at the data. For each example, I asked two key questions.
- Did any failures or mistakes occur?
- What could the model do differently? (Optional)
These were recorded as unstructured notes in the “Notes” section of the data (as shown below). Going through all 49 examples took me about 90 minutes.
While I was initially skeptical (or just lazy) about doing this, I now realize how essential it is.
In past projects, I’d try to improve the system without a good understanding of what was wrong and why. For this one, however, after seeing the same errors over and over, it became obvious which aspects of my prompt were unclear, poorly formatted, or both.
Step 5: Categorizing Errors
Although I already had some good ideas of what improvements to make next, it was still helpful to take a step back and properly analyze the errors. I started by uploading a .csv file of the (input, response, note) triplets to ChatGPT and Claude to group them into common failure modes.
From there, I refined the categories based on notes I had taken during the annotation process. Then, I manually reviewed the notes (using my custom data annotator) to label each note with the corresponding category.
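Once each note has a label, tallying the failure modes is a one-liner. Here's a sketch, assuming the labels were saved to a `category` column in the annotated CSV (column and file names are illustrative):

```python
import pandas as pd

# "category" holds the failure-mode label added during the second annotation
# pass (column and file names are assumptions for illustration).
annotated = pd.read_csv("idea_response_pairs_annotated.csv")

# Count how often each failure mode appears, most common first.
print(annotated["category"].value_counts())
```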
Here’s the final list of common failures with counts:
This categorization gives a clearer picture of the greatest leverage points for improving the system. Based on it, I went back to the OpenAI playground, implemented a quick fix for the double CTA, made the step-by-step instructions clearer, and got the model to generate hook ideas before writing one.
Here’s what these numbers looked like after those fixes.
Building software on top of unpredictable AI models can feel more like praying to the ML gods than engineering. Here, we saw how error analysis removes the guesswork by giving us a systematic way to identify and fix the most significant failure modes in an LLM system.
Although manual review is an essential starting point for LLM error analysis, it is not scalable. In my next post, I will discuss how to scale this process using automated evals and unlock even greater system improvements.