Measure HTML Token Bloat
Detect when boilerplate overwhelms useful content for LLM crawlers.
The Problem
LLMs and RAG pipelines tokenize your HTML before extracting content. When navigation, footers, tracking scripts, and repeated UI components dominate the page, the signal-to-noise ratio collapses. Claude, GPT, and Perplexity either truncate or hallucinate when the useful content is buried under 10× its weight in boilerplate tokens.
The Hard Way
Calculate token bloat manually: fetch the page, strip the boilerplate, and count tokens in the useful content versus the full HTML. You'd need a tokenizer (tiktoken or similar), a content extractor, and a baseline for what counts as "good". For programmatic SEO (pSEO) at scale, you'd repeat this for every template.
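The manual pipeline can be sketched in Python. This is a rough sketch, not SEODiff's implementation: it uses the stdlib `html.parser` as a crude content extractor and a ~4-characters-per-token heuristic in place of a real tokenizer like tiktoken, and the set of tags treated as boilerplate is an assumption.

```python
from html.parser import HTMLParser

# Assumed boilerplate tags; a real extractor would be more sophisticated.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

class ContentExtractor(HTMLParser):
    """Collects text that sits outside boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside boilerplate tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token; swap in tiktoken for accuracy.
    return max(1, len(text) // 4)

def token_bloat(html: str) -> float:
    """Ratio of total HTML tokens to useful-content tokens (higher = worse)."""
    parser = ContentExtractor()
    parser.feed(html)
    content = " ".join(parser.chunks)
    return approx_tokens(html) / approx_tokens(content)

html = "<html><nav>" + "menu " * 200 + "</nav><main>Useful article text.</main></html>"
print(f"bloat ratio: {token_bloat(html):.1f}x")
```

A page that is mostly navigation scores far above the 8.0 threshold used in the example below, while a page that is mostly article text stays close to 1.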
The SEODiff Way
One API call. Results in under 2 seconds.
POST https://seodiff.io/api/v1/agent/evaluate
{"urls": ["https://example.com/blog/post-1"], "assertions": [{"rule": "max_token_bloat", "value": 8.0}]}

| Parameter | Type | Example |
|---|---|---|
| value | float | 8.0 |
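The request above can be sketched with the Python standard library alone. The endpoint and payload come from the example; the bearer-token `Authorization` header is an assumption, so check the API reference for the actual auth scheme.

```python
import json
import urllib.request

API_URL = "https://seodiff.io/api/v1/agent/evaluate"

payload = {
    "urls": ["https://example.com/blog/post-1"],
    "assertions": [{"rule": "max_token_bloat", "value": 8.0}],
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Auth header name/scheme is an assumption; see your API dashboard.
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```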
Code Examples
Copy-paste examples in your preferred language:
cURL
See the full evaluation example in cURL →
Python
See the full evaluation example in Python →
Node.js
See the full evaluation example in Node.js →
Go
See the full evaluation example in Go →
PHP
See the full evaluation example in PHP →
Related Assertions
min_word_count
Prevent thin content by requiring a minimum number of words per page.
max_js_ghost_ratio
Flag pages where content is rendered client-side and invisible to crawlers.
Use in CI/CD
Add this assertion to your deployment pipeline. Works with any CI platform:
🐙 GitHub Actions
Block bad deployments with automated SEO checks in your GitHub Actions CI/CD pipeline.
🦊 GitLab CI
Add automated SEO quality gates to your GitLab CI/CD pipelines.
▲ Vercel
Automatically validate SEO on every Vercel preview deployment before promoting to production.
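As a sketch of the pipeline integration, a GitHub Actions step might call the evaluate endpoint with curl and fail the job on a non-2xx response (`--fail`). The secret name, headers, and flags here are assumptions, not SEODiff's documented integration; see the platform guides above.

```yaml
# .github/workflows/seo-check.yml (sketch; adapt to your pipeline)
- name: Check token bloat
  run: |
    curl --fail --silent --show-error \
      -X POST https://seodiff.io/api/v1/agent/evaluate \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${{ secrets.SEODIFF_API_KEY }}" \
      -d '{"urls": ["https://example.com/blog/post-1"],
           "assertions": [{"rule": "max_token_bloat", "value": 8.0}]}'
```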