Step 0: Initial Prompt
The error analysis process above assumes you have an AI system to analyze. Since I was starting from scratch, I first iterated on an initial system prompt using OpenAI’s playground and the prompting tips discussed in my previous blog post.
The resulting prompt was long, but you can see it in the GitHub repo. I used GPT-4.1.
Step 1: Curate Inputs
Next, we want to create a set of realistic inputs for this use case. While I could have generated these synthetically [3], I chose to write them myself since I will be the only user of this system anyway.
I did this by 1) reviewing existing posts and writing short notes for them, and 2) grabbing old notes I had sitting in a Notion database. The result was a .csv file containing 49 post ideas.
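For what it's worth, the ideas file is just a single-column table. Here's a minimal sketch of loading it (the file name and `idea` column header are assumptions, not necessarily what's in the repo):

```python
import pandas as pd

# File and column names are assumptions for illustration.
ideas = pd.read_csv("post_ideas.csv")

print(len(ideas))             # expecting 49 rows
print(ideas["idea"].iloc[0])  # preview the first post idea
```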
Step 2: Generate Outputs
With a system prompt and post ideas in hand, we can now generate LinkedIn posts for each idea. I used OpenAI’s Responses API for this.
Generating all the outputs took about 10 minutes. They were then saved to a new .csv file containing all 49 input-response pairs, available in the GitHub repo.
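Here's a minimal sketch of that generation loop, assuming the system prompt lives in a text file and the ideas sit in the CSV from Step 1 (file and column names are illustrative, not the exact ones in the repo):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# File and column names are assumptions for illustration.
system_prompt = open("system_prompt.txt").read()
ideas = pd.read_csv("post_ideas.csv")

rows = []
for idea in ideas["idea"]:
    # Responses API call: system prompt as instructions, post idea as input.
    response = client.responses.create(
        model="gpt-4.1",
        instructions=system_prompt,
        input=idea,
    )
    rows.append({"idea": idea, "response": response.output_text})

# Save all input-response pairs for annotation in the next steps.
pd.DataFrame(rows).to_csv("idea_response_pairs.csv", index=False)
```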
Here’s one example post:
Landing a job isn’t the only way to make money with AI.

Here’s how you can turn your AI skills into income—no matter your background:

1. **Get a job:** Land a full-time role as a machine learning engineer, data scientist, or prompt engineer. The fastest way to steady income and learning.
2. **Implementation:** Help teams build and deploy real AI solutions as a freelancer or contractor. Solve specific technical problems for businesses.
3. **Strategy & Consulting:** Guide founders, managers, or non-technical teams on what’s possible (and what’s hype), helping them make smarter decisions.
4. **Education:** Teach others—run workshops, create courses, write tutorials, or build a YouTube channel. There’s a hungry audience eager to learn.
5. **Products:** Build your own SaaS, tools, or micro products powered by AI. Riskier, but the most scalable path if you can nail a real pain point.
Which path are you focused on right now?
Step 3: Create Data Annotator
Next, I used Claude (via Cursor) to vibe-code a custom data annotator: a simple Streamlit app with sections for the input, the response, and my notes.
While I had to make some aesthetic tweaks to the app, Claude got all the key components and logic right. Surprisingly, this was the fastest part of the 5-step process.
While you could also do this in Excel (or the like), that adds a non-trivial amount of friction for long responses like the ones here. Manually reviewing data is hard enough, so making the process as frictionless as possible is critical [3].
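For illustration, here's a rough sketch of what such an annotator can look like. The file path and column names are assumptions for this sketch; the actual app in the repo differs in its details.

```python
import pandas as pd
import streamlit as st

# File path and column names are assumptions for illustration.
DATA_PATH = "idea_response_pairs.csv"

df = pd.read_csv(DATA_PATH)
if "notes" not in df.columns:
    df["notes"] = ""
df["notes"] = df["notes"].fillna("")

# Simple index-based navigation through the examples.
idx = st.number_input("Example", min_value=0, max_value=len(df) - 1, value=0, step=1)

st.subheader("Input")
st.write(df.loc[idx, "idea"])

st.subheader("Response")
st.write(df.loc[idx, "response"])

st.subheader("Notes")
notes = st.text_area("Observations / failure modes", value=str(df.loc[idx, "notes"]))

if st.button("Save notes"):
    df.loc[idx, "notes"] = notes
    df.to_csv(DATA_PATH, index=False)
    st.success("Saved.")
```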
Step 4: Manual Review
Now we get to the meat of error analysis: looking at the data. For each example, I asked two key questions.
- Did any failures or mistakes occur?
- What could the model do differently? (Optional)
These were recorded as unstructured notes in the “Notes” section of the data (as shown below). Going through all 49 examples took me about 90 minutes.
While I was initially skeptical (or just lazy) about doing this, I now realize how essential it is.
In past projects, I’d try to improve the system without a good understanding of what was wrong and why. For this one, however, after seeing the same errors over and over, it became obvious which aspects of my prompt were unclear, poorly formatted, or both.
Step 5: Categorizing Errors
Although I already had some good ideas of what improvements to make next, it was still helpful to take a step back and properly analyze the errors. I started by uploading a .csv file of the (input, response, note) triplets to ChatGPT and Claude to group them into common failure modes.
From there, I refined the categories based on notes I had taken during the annotation process. Then, I manually reviewed the notes (using my custom data annotator) to label each note with the corresponding category.
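Once each note has a label, tallying the failure modes is a one-liner. Here's a sketch, assuming the labels were saved to a `category` column in the annotated CSV (column and file names are illustrative):

```python
import pandas as pd

# "category" holds the failure-mode label added during the second annotation
# pass (column and file names are assumptions for illustration).
annotated = pd.read_csv("idea_response_pairs_annotated.csv")

# Count how often each failure mode appears, most common first.
print(annotated["category"].value_counts())
```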
Here’s the final list of common failures with counts:
This categorization gives a clearer picture of the greatest leverage points for improving the system. Based on it, I went back to the OpenAI playground, implemented a quick fix for the double CTA, made the step-by-step instructions clearer, and got the model to generate hook ideas before writing one.
Here’s what these numbers looked like after those fixes.
Building software on top of unpredictable AI models can feel more like praying to the ML gods than engineering. Here, we saw how error analysis removes the guesswork by giving us a systematic way to identify and fix the most significant failure modes in an LLM system.
Although manual review is an essential starting point for LLM error analysis, it is not scalable. In my next post, I will discuss how to scale this process using automated evals and unlock even greater system improvements.