Native vs. Aho-Corasick in PHP: How Simplicity Sometimes Wins

I recently ran into a real-world conundrum where I needed to scan product content – titles, descriptions, and other long-form fields – for a growing list of keywords.
The catch? The input text varied wildly in size, from a few lines to full-length articles pulled from manufacturers. And the keyword list wasn’t small – well into the hundreds, with plans to grow over time.

This posed the question: should I stick with PHP’s native string functions like mb_strpos(), or bring in a more complex multi-pattern matching algorithm like Aho-Corasick?

This article walks through that decision, the trade-offs involved, and benchmark results to help you decide which approach might work best for your own use case.

The Real Problem: One Text, Many Keywords

Let’s define the problem more precisely.

We’re scanning a single large, aggregated block of text for the presence of many keywords.
From a computer science perspective, this is known as multi-pattern string matching, and there’s an ideal runtime we’re aiming for: O(k + m + z), where:

  • k = total length of all keywords (not the number of keywords n)
  • m = length of the input text
  • z = number of matches found in the text

That’s the benchmark algorithms like Aho-Corasick strive to hit. It preprocesses the entire keyword set, then performs a single pass through the text to find matches.

In contrast, PHP’s built-in string functions are fast but don’t optimize for this case.
The most common approach – looping over the keyword list and using mb_strpos() to search each term individually – ends up scanning the entire input text once per keyword.

So if you’re checking 500 keywords against a 1MB product description, you’re effectively doing 500MB of work.

The simplicity of this approach is appealing, but the scaling behavior can quickly become a bottleneck. So what we want to test is: developer simplicity and native speed versus custom algorithmic efficiency at scale.

Two Approaches: Simple Loop vs Aho-Corasick

Before we dive into the benchmarks, here’s a quick overview of the two strategies being compared.

1. Simple Iterator (Native PHP)

The native PHP solution is straightforward:

foreach ($keywords as $word) {
    if (mb_strpos($text, $word) !== false) {
        $found[] = $word;
    }
}

This approach is easy to read, dependency-free, and built on top of PHP’s native string functions, implemented in C.

For quick checks like “does this text contain any of these keywords?” (read: we just want to know whether a keyword appears – we don’t care if it’s once or more), each call to mb_strpos() performs a linear scan of the text*, resulting in an overall time complexity of O(n × m × l), where:

  • n = the number of keywords
  • m = length of the text being scanned
  • l = average length of each keyword

But once you need the positions of all matches of a keyword, things get heavier. Now you’re calling mb_strpos() in a loop, advancing the offset after each match. Each extra occurrence triggers another scan of the remaining text, so even though the Big-O shape hasn’t changed, the real-world effort stacks up fast.

*PHP’s mb_strpos() uses a simple, naive string search approach internally. No Boyer-Moore, KMP, or other advanced pattern-matching logic happening under the hood.
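The position-tracking variant described above looks something like this – a minimal sketch, where the sample $text and $keywords are made up for illustration:

```php
<?php
// Hypothetical sample data standing in for the article's inputs.
$text = 'the cargo ship carried car parts and more cargo';
$keywords = ['car', 'cargo', 'ship'];

// Collect every occurrence of every keyword by repeatedly calling
// mb_strpos() with an advancing offset.
$positions = [];
foreach ($keywords as $word) {
    $offset = 0;
    while (($pos = mb_strpos($text, $word, $offset)) !== false) {
        $positions[$word][] = $pos;
        // Advance past this match; use $pos + 1 instead if you also
        // want to catch overlapping occurrences.
        $offset = $pos + mb_strlen($word);
    }
}
```

Note that, without word-boundary checks, this happily reports “car” inside “carried” and “cargo” – which may or may not be what you want.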

2. Aho-Corasick Algorithm

Aho-Corasick solves this problem differently. It builds a trie-based finite state machine from the entire keyword set. Once built, you scan the input only once, and it emits all matches – including their positions.

There is an upfront cost to building the structure, but the pay-off is significant at scale: performance is tied to the length of the input text (m), not the number of keywords.

This approach is ideal for high-volume inputs, but comes with complexity: you’ll likely rely on a third-party library (like Wikimedia’s PHP implementation), and the data structures can be harder to debug or customize.
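To make the mechanics concrete, here’s a minimal, self-contained sketch of the algorithm in plain PHP. This is an illustrative toy, not the Wikimedia library: it is byte-oriented and ASCII-only for brevity, and the function name ahoCorasickSearch is made up.

```php
<?php
// Minimal Aho-Corasick sketch: build a trie over the keywords, wire up
// failure links with a BFS, then scan the text in a single pass.
function ahoCorasickSearch(array $keywords, string $text): array
{
    // 1. Build the trie. Each node: ['next' => [], 'fail' => 0, 'out' => []].
    $nodes = [['next' => [], 'fail' => 0, 'out' => []]];
    foreach ($keywords as $word) {
        $state = 0;
        foreach (str_split($word) as $ch) {
            if (!isset($nodes[$state]['next'][$ch])) {
                $nodes[] = ['next' => [], 'fail' => 0, 'out' => []];
                $nodes[$state]['next'][$ch] = count($nodes) - 1;
            }
            $state = $nodes[$state]['next'][$ch];
        }
        $nodes[$state]['out'][] = $word; // this state completes $word
    }

    // 2. BFS to set failure links (longest proper suffix also in the trie).
    $queue = array_values($nodes[0]['next']); // depth-1 nodes fail to root
    while ($queue) {
        $state = array_shift($queue);
        foreach ($nodes[$state]['next'] as $ch => $child) {
            $queue[] = $child;
            $fail = $nodes[$state]['fail'];
            while ($fail !== 0 && !isset($nodes[$fail]['next'][$ch])) {
                $fail = $nodes[$fail]['fail'];
            }
            $nodes[$child]['fail'] = $nodes[$fail]['next'][$ch] ?? 0;
            // Inherit matches reachable through the failure link.
            $nodes[$child]['out'] = array_merge(
                $nodes[$child]['out'],
                $nodes[$nodes[$child]['fail']]['out']
            );
        }
    }

    // 3. Single pass over the text, emitting [keyword, start offset] pairs.
    $matches = [];
    $state = 0;
    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        $ch = $text[$i];
        while ($state !== 0 && !isset($nodes[$state]['next'][$ch])) {
            $state = $nodes[$state]['fail'];
        }
        $state = $nodes[$state]['next'][$ch] ?? 0;
        foreach ($nodes[$state]['out'] as $word) {
            $matches[] = [$word, $i - strlen($word) + 1];
        }
    }
    return $matches;
}
```

With the classic example – keywords ['he', 'she', 'his', 'hers'] and the text 'ushers' – this returns [['she', 1], ['he', 2], ['hers', 2]]: every match and its start offset from one pass, including overlapping ones.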

Benchmark Setup: Testing the Trade-Offs

To better understand the performance trade-offs, I benchmarked both the Aho-Corasick algorithm and the simple iterator approach under two common use cases:

  1. Position Tracking: Return all keyword matches and their positions in the text (simulates tasks like highlighting or frequency analysis)
    This is useful when you need to know exactly where matches occur – for example, if you’re highlighting terms in search results, or you need to find matches that are part of larger words (e.g., matching “car” inside “cargo” or “carpet” without strict word boundaries).
  2. Unique Matching: Return a deduplicated list of matched keywords, ignoring frequency or location (useful for flagging content without deeper analysis)

These were tested against two types of input:

  • Large text (~1.2MB – the full text of Moby Dick)
  • Small text (~6KB – comparable to a long product description)

Benchmark Results

Each test was also run with other keyword counts, but 50, 500, and 1,000 keywords are enough to show how each approach behaves, so I’m sticking with those for this comparison:

Large Text (~1.2MB)

Method                     | 50 keywords (ms) | 500 keywords (ms) | 1,000 keywords (ms)
Aho-Corasick (Position)    | 129.87           | 128.02            | 130.64
Simple Iterator (Position) | 2,336.86         | 10,638.38         | 20,158.47
Aho-Corasick (Unique)      | 126.15           | 128.20            | 131.17
Simple Iterator (Unique)   | 21.01            | 218.46            | 405.95

Small Text (~6KB)

Method                     | 50 keywords (ms) | 500 keywords (ms) | 1,000 keywords (ms)
Aho-Corasick (Position)    | 0.88             | 2.17              | 3.95
Simple Iterator (Position) | 0.86             | 2.04              | 3.33
Aho-Corasick (Unique)      | 0.81             | 2.35              | 4.20
Simple Iterator (Unique)   | 0.12             | 1.21              | 2.30

What the Data Tells Us

TL;DR: For small texts, the simple iterator does surprisingly well – even when full positional data is required.
For most scenarios with larger amounts of text, Aho-Corasick is the way to go (the only exception being a large text with very few keywords: building Aho-Corasick’s state trie is an up-front cost that doesn’t pay off until the keyword count grows).

Positional data

When positional data is needed (i.e., tracking where each match occurs), Aho-Corasick is significantly faster for large texts; in our 500-keyword experiment it’s ~80x faster, and ~155x faster for 1,000 keywords. This gap only widens with more keywords and/or text, so it’s definitely worth the extra overhead of a dependency.

The smallest experiment (~6KB of text with 50 keywords) should be where the simple iterator has its best chances – and it does outperform Aho-Corasick – but only by margin-of-error gains.
To be honest, I was surprised Aho-Corasick wasn’t the one barely winning here, considering it only has to scan the text once; but if you know you’ll never process large amounts of text, even with thousands of keywords, there’s no reason to worry about it.
I’m personally curious how many keywords it would take for Aho-Corasick to actually pull ahead of the simple iterator, but that was outside the scope I was working on, so I didn’t investigate. I suspect it would take several thousand keywords to break even, and beyond that point it would only scale slightly better.

Unique data

When only unique keyword matches are needed (i.e., we just need to know that a match exists; how often doesn’t matter), the performance advantage of Aho-Corasick is no longer clear-cut.

On large texts with 50 keywords, the simple native iterator was ~6x faster because of the upfront cost of Aho-Corasick’s state trie.
Although it’s not part of the table above, I did some extra runs and found the break-even point at ~300 keywords. From there, Aho-Corasick’s result stayed essentially flat even beyond 1,000 keywords, whereas the simple iterator’s processing time kept increasing and was ~3x slower at our largest data sample.

On small texts, you’d need thousands of keywords for Aho-Corasick to be worth it. I didn’t dig into exactly where it breaks even, but it’s well into the thousands.

Note: it’s interesting to see that Aho-Corasick’s processing time on small texts keeps growing as keywords are added – with so little text to scan, the total length of the keywords (k) becomes the dominant variable.

Final Remarks

The benchmark confirmed something I found counter-intuitive: when you only need to know if keywords exist, PHP’s native mb_strpos() loop is relatively fast – even on large texts.
I had expected Aho-Corasick to pull ahead in more of the combinations of text and keyword sizes since, theoretically, scanning the text once regardless of keyword count just sounds fast. In reality, the overhead of building and running a more complex matcher doesn’t always pay off.

For the feature I was building, it was possible to change the project scope and rely on unique matches instead of using positional information. TL;DR: I was scanning product descriptions for hazardous material indicators – for example, checking if terms like “lithium” appeared – so the merchant can be prompted about missing special handling indicators. In that context, simply knowing that a keyword existed was enough; I didn’t need to show where or how often a keyword appeared or whether it was part of a larger word.

Note: to expand on the feature requirements, we still wanted to know all the keywords that matched – just because “lithium” turned out to be a false positive doesn’t mean the other matches would be.

Having some data about average product description sizes also pointed us much, much closer to the small text sample than to the full Moby Dick book. So based on the benchmark, the simple iterator should be the fastest solution – and if someone has a really, really large product description (as in, they wrote a whole book and, for some reason, decided to include it in their product description), the performance is still decent.

At the end of the day, it’s a reminder that real-world software is about pragmatic trade-offs, not theoretical ideals. “Trust, but verify.”