If you’ve tried using an AI humanizer to clean up ChatGPT output, you’ve probably had a mixed experience. Sometimes the text passes detection. Sometimes it doesn’t. There’s rarely any explanation for why.
We got tired of guessing. So we built an experiment to find out what actually works, what doesn’t, and — most importantly — why.
The problem with AI detection
AI detectors like GPTZero have become the de facto standard for checking whether content was machine-generated. Universities use them. Publishers use them. Clients use them. And they’re getting better.
But here’s what most people don’t realise: these tools aren’t just looking at your writing style. They’re analysing patterns in how words are selected — statistical fingerprints that are invisible when you read the text but obvious to an algorithm. That’s why simply prompting ChatGPT to “sound more human” rarely works. The words might change, but the underlying pattern stays the same.
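To make "statistical fingerprint" a bit more concrete, here is a minimal sketch of one signal detectors build on: how predictable each word is to a language model, often measured as perplexity. The example uses the open-source GPT-2 model via Hugging Face transformers purely as an illustration; GPTZero and similar tools use their own proprietary models and combine many more features than this.

```python
# Rough sketch of one statistical signal detectors rely on: perplexity,
# i.e. how predictable each word is to a language model. GPT-2 is used
# here only as a stand-in; real detectors combine many more features.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Lower perplexity = more predictable word choices = more 'AI-like'."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean cross-entropy over tokens; exp() gives perplexity
    return torch.exp(outputs.loss).item()

print(perplexity("The results were clear and the methodology was sound."))
```

Uniformly predictable word choices are exactly the kind of pattern that a "sound more human" prompt leaves untouched, because the model is still picking its most probable words underneath.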
What we tested
Our team at Rephrasy ran a controlled experiment across 100 AI-generated texts, 50 topics, and three different text lengths. We tested six different humanization methods — including fine-tuned models, prompt-based approaches, and our own production tools — against three independent AI detectors.
The goal was simple: find out which approaches actually reduce AI detection scores, and which are just noise.
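For readers who want a sense of what a "pass rate" means mechanically, the sketch below shows the kind of tally involved. The detect() function and the 0.5 threshold are placeholders, not the real detector APIs or thresholds from our experiment.

```python
# Simplified sketch of a pass-rate tally, not our production harness.
# detect() is a placeholder for a real detector API call and is assumed
# to return a 0-1 probability that the text is AI-generated.
from typing import Dict, List

def detect(text: str, detector: str) -> float:
    raise NotImplementedError("wire up a real detector client here")

def pass_rates(samples: List[str], detectors: List[str],
               threshold: float = 0.5) -> Dict[str, float]:
    """Fraction of samples each detector scores below the AI threshold."""
    rates: Dict[str, float] = {}
    for name in detectors:
        passed = sum(1 for text in samples if detect(text, name) < threshold)
        rates[name] = passed / len(samples)
    return rates

# Example shape of a run: 100 humanized samples against three detectors.
# rates = pass_rates(humanized_samples, ["gptzero", "detector_b", "detector_c"])
```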
The results were clear
Five out of six methods failed against GPTZero. Pass rates sat between 1% and 7%. Fine-tuned models that cost significant time and resources to build performed no better than basic prompting.
Only one approach showed real promise, achieving a 48% bypass rate on first pass. We took that as our starting point.
Rather than throwing more compute at the problem, we went back to the data. We analysed hundreds of humanized outputs — comparing the ones that passed detection against the ones that didn’t — and identified consistent, repeatable patterns that separate the two groups.
We then built those findings into an improved humanization pipeline. Same model. Same infrastructure. Better results.
The improved version hit a 67% bypass rate across 100 fresh test samples. For context, that's nearly 20 percentage points ahead of the previous best of 48%, and significantly ahead of anything else we've tested, including tools from other providers.
Short content used to be a weakness
One finding that surprised us: text length plays a major role in detection. Short content — a few paragraphs, a product description, a social media caption — was historically much harder to humanize. Our baseline only passed 17.5% of the time on short texts.
The improved approach brought that up to 67.5%. That's a meaningful change for anyone working with shorter formats, which covers most real-world use cases.
Writing style changes everything
We also tested whether the tone and style of the output affects detection rates. It does — significantly.
We found that certain writing styles are inherently harder for detectors to flag. Our best-performing style variant achieved an 84% bypass rate, with long-form content in that style passing detection 100% of the time in our sample.
This makes sense intuitively. AI detectors are trained on what AI text typically looks like. The further your output moves from that expected pattern, the harder it is to catch.
What we learned doesn’t work
Some approaches we tested failed badly, and they’re worth mentioning because they’re strategies many people still rely on.
Running humanized text through a second AI model made things worse, not better. The bypass rate dropped to zero. Each pass through a language model reinforces the same statistical patterns that detectors pick up on.
Running the same text through a humanizer twice gave marginal improvement at best — a few percentage points — and doubled the processing cost. Not worth it.
And fine-tuning models from scratch, even with sophisticated training methods, couldn’t break past a 7% ceiling against GPTZero. Without the right guidance, even a well-trained model produces detectable output.
What this means for content teams
AI-assisted content isn’t going away. Neither are AI detectors. The question for businesses, agencies, and freelancers is whether the tools they’re using actually deliver.
Most don’t. The majority of AI humanizers on the market have never published independent test results — and there’s usually a reason for that.
If you’re evaluating humanization tools, here’s what to look for: test results against multiple detectors, not just one. Performance data across different content lengths. And transparency about methodology.
The bar for AI content quality is rising. The tools you use should be rising with it.
Su Schwarz, Support Engineer at Rephrasy — an AI humanization platform used by content creators and businesses to produce natural-sounding text that passes AI detection. The Rephrasy research team conducts ongoing independent testing across multiple AI detection systems.