Lessons Learned about API Calls and Automated Testing

Last week, I made a post about needing to find someone to pay me for my work or I was going to have to sell plasma to cover the token usage I was going through.

@David Vogel suggested I make a post about it in a bit more detail.

When I posted it, I had no idea what was actually going on. Here is the story and the lesson I learned the hard way.

I was working on Week 7's competition agent. I had tested it manually six ways to Sunday, made some fixes, tested again, made more fixes. Before I submitted it, I wanted to really put it through its paces with an automated testing script. The plan was simple. It would randomly assign one of six project documents as the default and then run the agent through a number of scenarios to catch any logic issues or edge cases, then switch to another random project and try again.
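A rough sketch of that plan, for anyone who wants to picture it. Everything here is a placeholder: the file names, the scenario list, and `run_agent` are made up for illustration and are not the actual competition code.

```python
import random

PROJECTS = [f"project_{i}.md" for i in range(1, 7)]  # hypothetical file names
SCENARIOS = ["happy path", "missing field", "conflicting dates"]  # placeholders


def run_agent(default_project, scenario):
    """Stand-in for the real agent call; it always 'passes' here."""
    return {"project": default_project, "scenario": scenario, "passed": True}


def run_suite(rounds, seed=None):
    rng = random.Random(seed)  # a fixed seed makes a failing run reproducible
    results = []
    for _ in range(rounds):
        # Pick ONE project per round (the same one may come up twice).
        project = rng.choice(PROJECTS)
        for scenario in SCENARIOS:
            results.append(run_agent(project, scenario))
    return results


results = run_suite(rounds=2, seed=7)
print(len(results), "checks run")
```

The point of the seed is that a run that turns up a bug can be replayed with the same project order.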

So far so good, right?

I decided that reusing code from another project was going to be the best option here. I had used it three times already and it was working. All I would need to do is change the details for this project and voilà. I copied that test file into my folder. I gave Claude a list of things I wanted to test for and how it should be done. I also told it to modify the existing script to work for this situation and let it rip.

A few minutes later, I had a ready-to-run test script. It was a big and complex sucker. It would randomly pick a company project, set that as the default, then run 25 tests with multiple steps. It came back and said it would take about 30 minutes to run and cost about $10 in API calls. I thought to myself, I put $50 in a couple of weeks ago and hardly used any, so that isn't an issue, and I can go grab some food and when I come back, it should be done.
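A back-of-the-envelope check along these lines is one way to sanity-check an estimate like that before hitting run. It is only a sketch: the steps per test, token counts, and per-token prices below are illustrative assumptions, not the real figures from this run or any provider's actual pricing.

```python
def estimate_run_cost(num_tests, steps_per_test, input_tokens_per_call,
                      output_tokens_per_call, usd_per_million_input,
                      usd_per_million_output):
    """Rough cost of a test run, assuming one API call per step."""
    calls = num_tests * steps_per_test
    input_cost = calls * input_tokens_per_call * usd_per_million_input / 1_000_000
    output_cost = calls * output_tokens_per_call * usd_per_million_output / 1_000_000
    return calls, input_cost + output_cost


# All numbers below are made up for illustration.
calls, lean = estimate_run_cost(25, 4, 2_000, 500, 3.0, 15.0)
_, bloated = estimate_run_cost(25, 4, 150_000, 500, 3.0, 15.0)
print(f"{calls} calls, lean prompts: ${lean:.2f}, bloated prompts: ${bloated:.2f}")
```

The only thing that changes between the two estimates is the input size per call, which is why prompt size is worth checking first when a bill looks wrong.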

While waiting in line, I get an email that I am out of credits for my API key. I just reauthorized for another $10 to get it going again.

I get home and before I can check my computer, I get another email that I am out of credits.

This is when I made the post in the forum, in jest, confusion and frustration.

I started to dig into what was going on. I asked Claude and it was telling me that there was nothing wrong and it was working like it should. Something clearly wasn't right.

I looked at the testing script and got about 100 or so lines in and saw the problem.

Both Claude and I hallucinated. I didn't give Claude clear enough instructions about what I wanted, or any guardrails. Instead of just picking one project file, it would load all 6 of them into the context along with ALL the other files in the agent.

Since each test had multiple steps, it was sending a HUGE prompt on every single step. What it should have done was send just the query through the API and let the agent read the files in the directory and run normally.

The first automated test script I wrote a couple of weeks ago hardly used any tokens, so what changed?

It was the reused code.

My second project was where I wrote the basis for the automated test feature that I was reusing for this project. I used it for the last project and then used it again for this one.

The flaw was there the whole time. Instead of making the call and reading the files, it was just sending everything each time it made an API call.
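To make the difference concrete, here is a minimal sketch of the two behaviors. The file names and contents are filler, the 4-characters-per-token figure is only a rough rule of thumb, and neither function is the actual code from the test script.

```python
def rough_tokens(text):
    """Very rough heuristic: about 4 characters per token for English text."""
    return len(text) // 4


def bloated_prompt(query, all_files):
    """The flawed pattern described above: paste every file into every call."""
    dump = "\n\n".join(f"# {name}\n{body}" for name, body in all_files.items())
    return f"{dump}\n\n{query}"


def lean_prompt(query, default_project):
    """The intended pattern: send the query, name the one default project,
    and leave it to the agent to read files from its own directory."""
    return f"Default project: {default_project}\n\n{query}"


# Fake stand-ins for the six project documents (contents are filler).
files = {f"project_{i}.md": "lorem ipsum " * 5_000 for i in range(1, 7)}
query = "Summarize the open risks for the default project."

big = rough_tokens(bloated_prompt(query, files))
small = rough_tokens(lean_prompt(query, "project_3.md"))
print(f"bloated: ~{big} tokens per call, lean: ~{small} tokens per call")
```

Multiply the bloated number by every step of every test and the credits disappear fast.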

Fifty dollars in credits burned up because neither Claude nor I thought to rule that behavior out or to check that it wasn't happening. I didn't notice it on the last 2 projects because I had plenty of money in credits and never bothered to check. Each time I ran it, it just burned through a ton of tokens and wasn't doing what I wanted it to do.

If I didn't have a bit of Python knowledge, I might not have found the problem as quickly. Without it, I would have spent a lot of time with Claude trying to ask the right question to figure out what the heck was really going on.

After a couple of rounds of editing, I got a test script that ran again and only cost me $3 in credit instead of $25.

There are a couple of lessons to be learned here:

Always check the work the AI does. Even if you have to open up a new context window and just quiz it.
Knowing Python and other languages is NOT a bad idea, even in the age of AI. It helps you see if the AI made any mistakes, because it will.
Be VERY clear on what you want the outcome to be and do. Give it guardrails. Be as detailed as possible. I really should have created a PRD for the test script and built it from that, not just a "Build me an automated test script."
Check your credit balances every once in a while. I could have caught this weeks ago, if I had bothered to look.
Automated tests are awesome, when you don't build them like I did...
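One possible guardrail, sketched under assumptions: a small spend tracker that aborts the run once the estimated cost passes a hard cap. `SpendGuard`, `fake_api_call`, and the price used here are hypothetical. A real script would feed in the token usage reported by whatever API it calls.

```python
class BudgetExceeded(RuntimeError):
    pass


class SpendGuard:
    """Stops a test run once the estimated spend passes a hard cap."""

    def __init__(self, cap_usd, usd_per_million_tokens):
        self.cap_usd = cap_usd
        self.rate = usd_per_million_tokens / 1_000_000
        self.spent_usd = 0.0

    def charge(self, tokens_used):
        self.spent_usd += tokens_used * self.rate
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"estimated spend ${self.spent_usd:.2f} exceeds cap ${self.cap_usd:.2f}"
            )


def fake_api_call(prompt):
    """Stand-in for a real API call; returns a pretend token count."""
    return len(prompt) // 4


guard = SpendGuard(cap_usd=1.00, usd_per_million_tokens=3.0)  # assumed price
completed = 0
try:
    for step in range(100):
        tokens = fake_api_call("x" * 400_000)  # deliberately oversized prompt
        guard.charge(tokens)
        completed += 1
except BudgetExceeded as err:
    print(f"stopped after {completed} calls: {err}")
```

With a cap like this in place, a runaway script fails loudly within a few calls instead of quietly draining the balance.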
