When AI Gets It Wrong (With 91% Confidence)
A guide to understanding when AI adds value and when it creates more work.
91% confidence. Completely wrong matches. The AI was so certain, and I was still stuck with unusable results after building my AI matching system.
A couple of weeks ago, I tried to use AI to solve a very familiar Ops problem: matching a list of accounts to the ones in our database. On the surface, it looked like a perfect use case for AI search. I was already using this approach in my Analysis Dossier tool, so I assumed it would work just as well here.
I built the workflow, ran the matching process, and everything worked exactly as designed. But the results were not accurate, and many of the wrong matches were the ones the model felt most confident about:
TechCorp Solutions → TechFlow Industries (91% confidence)
Global Airlines → Global Financial Group (88% confidence)
Metro Manufacturing → Metro Insurance (85% confidence)
Summit Healthcare → Summit Energy Partners (89% confidence)
The output looked polished and intelligent, but it was unusable. After spending hours going through the list manually (something this engine was supposed to help me avoid doing), I eventually realized the problem wasn’t with the AI; it was with my assumption that AI was the right tool at all.
If I had taken just a few minutes to identify the problem correctly, I could have avoided the entire first attempt. This post breaks down what I missed, why it mattered, and the framework I now use to evaluate whether AI will help or make everything worse.
The “High-Confidence” Failure
Most Marketing Ops professionals know this pain. You get a spreadsheet from an event or a vendor, and you have to match the accounts in their list to the ones in your database. The names are inconsistent, the same company can appear in different formats, and the process is messy.
My first approach was standard fuzzy matching in Power Query, but it created too many false positives. When “TechCorp” matched to “TechFlow,” I knew I needed something more reliable.
So I moved to what felt like the next logical option: AI-powered matching. With help from my AI assistant, I quickly set up an Azure AI Search pipeline (a simplified sketch follows these steps):
Generate embeddings for the account names by uploading an Excel list to Azure AI Search.
Set up a Jupyter notebook to run Python scripts with the new list of names to match.
Calculate similarity scores between the new names and our list of accounts.
Export the matches for review.
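To make the steps concrete, here is a minimal sketch of what this kind of embedding-based matching looks like. It is not my exact Azure AI Search pipeline; it uses the open-source sentence-transformers library and an assumed model name purely to illustrate the idea of embedding names and ranking by similarity.

```python
# Minimal sketch of embedding-based name matching (illustration only).
# Assumes the sentence-transformers and scikit-learn packages; the real
# workflow used Azure AI Search rather than a local model.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

database_accounts = ["TechFlow Industries", "Global Financial Group", "Metro Insurance"]
new_names = ["TechCorp Solutions", "Global Airlines", "Metro Manufacturing"]

# Convert each name into an embedding (a vector that encodes its meaning).
db_vectors = model.encode(database_accounts)
new_vectors = model.encode(new_names)

# Score every new name against every database account and keep the best match.
scores = cosine_similarity(new_vectors, db_vectors)
for i, name in enumerate(new_names):
    best = scores[i].argmax()
    print(f"{name} -> {database_accounts[best]} ({scores[i][best]:.0%} similarity)")
```

The scores this produces measure semantic closeness, not whether two names refer to the same company, and that distinction is exactly the trap I walked into.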
It only took a couple of hours. And at first glance, the output looked accurate. But when I actually reviewed the results, the “High Confidence” section exposed the core issue: semantic intelligence does not equal entity matching.
Semantic Meaning vs. Entity Matching
To understand where things went wrong, it helps to understand what embeddings actually do. In simple terms, “generating embeddings” means converting text into vectors, or sets of numbers that represent meaning.
Two phrases positioned close together in this “meaning space” look conceptually similar to the AI model. For example:
“Supplier consolidation” and “vendor rationalization” → close together
“Metro Airlines” and “Metro Insurance” → also close together, despite being completely different companies (a toy numeric sketch of this follows)
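To make “close together” concrete, here is a toy sketch with made-up three-number vectors (real embeddings have hundreds of dimensions). The numbers are invented purely to show how cosine similarity rewards shared meaning rather than shared identity.

```python
# Toy illustration of "closeness in meaning space" (made-up 3-number vectors,
# not real embeddings).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: the two "Metro ..." names point in almost the same
# direction because the shared word dominates, even though the companies are unrelated.
metro_airlines  = [0.90, 0.20, 0.10]
metro_insurance = [0.85, 0.25, 0.15]
acme_robotics   = [0.10, 0.80, 0.60]

print(f"{cosine(metro_airlines, metro_insurance):.2f}")  # ~1.00 -> "looks like a match"
print(f"{cosine(metro_airlines, acme_robotics):.2f}")    # ~0.33 -> "looks unrelated"
```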
That can be incredibly powerful in the right context. It’s why my Analysis Dossier tool works so well. When a 10-K mentions “reducing operational costs”, I want the system to surface relevant content about “process automation” or “efficiency optimization” even though the terminology differs.
But for account matching, this is exactly the wrong kind of intelligence. I didn’t need conceptual similarity; I needed literal, deterministic precision.
The Non-AI Approach That Worked
Once I stopped trying to force AI as the solution, I rebuilt the workflow using deterministic logic.
1. Normalization
I created a Python function (with the help of AI) to standardize company names; a simplified version is sketched after this list. It works by:
Lowercasing
Removing legal suffixes (Inc, Corp, LLC, Ltd, GmbH, SA, etc.)
Removing punctuation and whitespace
Extracting a consistent “core” name
Filtering by first letter to reduce noise matches
This removed many superficial mismatches immediately.
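Here is a simplified sketch of that normalization step. The suffix list and the first-letter filter below are illustrative assumptions, not the exact production code.

```python
# Simplified normalization step (illustrative; suffix list is an assumption).
import string

LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "gmbh", "sa", "co", "company"}

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation and legal suffixes, and return the 'core' name."""
    lowered = name.lower()
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w not in LEGAL_SUFFIXES]
    return " ".join(words)

def same_first_letter(a: str, b: str) -> bool:
    """Cheap filter: only compare names whose core names start with the same letter."""
    return bool(a) and bool(b) and a[0] == b[0]

print(normalize_name("TechCorp Solutions, Inc."))  # techcorp solutions
print(normalize_name("Metro Insurance LLC"))       # metro insurance
```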
2. Multi-Strategy Matching
Instead of relying on just one method, I layered several (sketched after this list):
Exact matches of normalized names
Fuzzy matching for near-identical names
Word-overlap analysis, excluding generic business terms
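A sketch of the layering, applied to the normalized names from the previous step. The fuzzy layer here uses Python’s standard-library difflib, and the thresholds and generic-word list are illustrative assumptions rather than the original values.

```python
# Layered matching sketch (thresholds and generic-word list are assumptions).
from difflib import SequenceMatcher

GENERIC_WORDS = {"group", "global", "solutions", "services", "systems", "industries"}

def match_names(new_core: str, db_core: str):
    """Return (strategy, score) for the first strategy that fires, else None."""
    # 1. Exact match on normalized names.
    if new_core == db_core:
        return ("exact", 1.0)

    # 2. Fuzzy match for near-identical spellings.
    ratio = SequenceMatcher(None, new_core, db_core).ratio()
    if ratio >= 0.9:
        return ("fuzzy", ratio)

    # 3. Word overlap, ignoring generic business terms.
    a = set(new_core.split()) - GENERIC_WORDS
    b = set(db_core.split()) - GENERIC_WORDS
    overlap = a & b
    if overlap:
        return ("word_overlap", len(overlap) / len(a | b))

    return None

print(match_names("techcorp solutions", "techflow industries"))  # None: no shared core word
```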
3. Business Logic Checks
This additional layer made the workflow more reliable (see the sketch after this list):
Rejecting matches with zero meaningful word overlap
Flagging parent/subsidiary scenarios for review
Assigning clear match categories (Exact, High Confidence, Medium, Review, Low Confidence)
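Building on the match_names sketch above, the final layer maps each result onto a category. The cutoffs below are illustrative assumptions, and the parent/subsidiary flagging is omitted for brevity.

```python
# Categorization sketch (cutoffs are assumptions, not the exact production values).
def categorize(result):
    """Map a (strategy, score) result from match_names onto a review bucket."""
    if result is None:
        return "No Match"  # zero meaningful word overlap -> rejected outright
    strategy, score = result
    if strategy == "exact":
        return "Exact"
    if score >= 0.9:
        return "High Confidence"
    if score >= 0.75:
        return "Medium"
    if score >= 0.5:
        return "Review"
    return "Low Confidence"

print(categorize(("fuzzy", 0.93)))  # High Confidence
print(categorize(None))             # No Match
```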
The false positive rate dropped from roughly 40% to under 5%. Just as important, the logic was transparent: if something didn’t look right, I could trace exactly why it happened and debug it quickly.
A Framework for When to Use AI and When Not To
This experiment taught me an important lesson: the Ops Superpower isn’t just knowing how to use AI, it’s knowing when to use it.
Here’s the framework I now use to decide which approach makes the most sense before building a solution:
1. Determine the Nature of the Task
Interpretation Tasks (AI Excels): These depend on context and meaning, which is why tools like my Analysis Dossier (for research) and my ABM ad engine (for personalization) work so well.
Synthesizing information
Identifying strategic themes
Understanding intent
Mapping insights to personas
Extracting insights from unstructured data
Precision Tasks (AI Struggles): These depend on consistency and literal accuracy, requiring deterministic rules.
Entity matching
Field validation
Data standardization
Calculations and date logic
Compliance checks
2. Evaluate the Acceptable Risk Level
Ask: What happens if the model is wrong?
Low Impact (surfacing a related document, generating draft content): AI is fine.
High Impact (pushing matches into CRM, enriching data, routing records): avoid AI. You need a deterministic, rules-based approach.
3. Consider the Structure of Your Inputs
Messy, narrative, or unstructured data: AI can help make sense of it.
Highly structured or numeric data: Traditional logic is safer and more consistent.
4. Look at How the Output Will Be Used
If a human reviews it: AI is acceptable as a “co-pilot.”
If a system consumes it directly: You need the 100% predictable behavior of deterministic logic.
5. Identify Your Debugging Needs
A predictable, rules-based system is easy to maintain and debug.
A probabilistic AI system is harder, as you can’t always guarantee the same output from the same input.
If I’d had a framework like this before building my solution, I would have saved hours. Here’s what I should have asked myself at the start:
Does this task require interpretation or precision?
Will a wrong answer create downstream issues?
Is the input unstructured or structured?
Does the output need to be 100% deterministic?
Can I define the rules clearly?
If the answer to most of these is “precision,” “structured,” or “deterministic,” you may not need AI at all.
This account matching experiment was a reminder that AI can produce impressive-looking results that are confidently incorrect. In the end, the deterministic Python approach was far more reliable and maintainable.
There’s a lot of pressure in Ops to use AI everywhere, but the best contribution we can make is a solution that’s stable, transparent, and reliable.
AI expands what is possible, but your Ops judgment determines whether it actually works in practice. The most valuable skill isn’t how many AI tools you build, it’s knowing when AI adds value and when it makes the problem harder.