Measure HTML Token Bloat
Detect when boilerplate overwhelms useful content for LLM crawlers.
The Problem
LLMs and RAG pipelines tokenize your HTML before extracting content. When navigation, footers, tracking scripts, and repeated UI components dominate the page, the signal-to-noise ratio collapses. Claude, GPT, and Perplexity either truncate or hallucinate when the useful content is buried under 10× its weight in boilerplate tokens.
The Hard Way
Calculate token bloat manually: fetch the page, strip the boilerplate, and count tokens in the useful content versus the full HTML. You'd need a tokenizer (tiktoken or similar), a content extractor, and a baseline for what counts as "good". For programmatic SEO (pSEO) at scale, you'd repeat this for every template.
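The manual pipeline can be sketched in Python. This is a rough sketch, not SEODiff's implementation: it uses the stdlib `html.parser` as a crude content extractor and a ~4-characters-per-token heuristic in place of a real tokenizer like tiktoken, and the set of tags treated as boilerplate is an assumption.

```python
from html.parser import HTMLParser

# Assumed boilerplate tags; a real extractor would be more sophisticated.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

class ContentExtractor(HTMLParser):
    """Collects text that sits outside boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside boilerplate tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token; swap in tiktoken for accuracy.
    return max(1, len(text) // 4)

def token_bloat(html: str) -> float:
    """Ratio of total HTML tokens to useful-content tokens (higher = worse)."""
    parser = ContentExtractor()
    parser.feed(html)
    content = " ".join(parser.chunks)
    return approx_tokens(html) / approx_tokens(content)

html = "<html><nav>" + "menu " * 200 + "</nav><main>Useful article text.</main></html>"
print(f"bloat ratio: {token_bloat(html):.1f}x")
```

A page that is mostly navigation scores far above the 8.0 threshold used in the example below, while a page that is mostly article text stays close to 1.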
The SEODiff Way
One API call. Results in under 2 seconds.
POST https://seodiff.io/api/v1/agent/evaluate
{"urls": ["https://example.com/blog/post-1"], "assertions": [{"rule": "max_token_bloat", "value": 8.0}]}

| Parameter | Type | Example |
|---|---|---|
| value | float | 8.0 |
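The request above can be sketched with the Python standard library alone. The endpoint and payload come from the example; the bearer-token `Authorization` header is an assumption, so check the API reference for the actual auth scheme.

```python
import json
import urllib.request

API_URL = "https://seodiff.io/api/v1/agent/evaluate"

payload = {
    "urls": ["https://example.com/blog/post-1"],
    "assertions": [{"rule": "max_token_bloat", "value": 8.0}],
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Auth header name/scheme is an assumption; see your API dashboard.
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```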
Code Examples
Copy-paste examples in your preferred language:
cURL
See the full evaluation example in cURL →
Python
See the full evaluation example in Python →
Node.js
See the full evaluation example in Node.js →
Go
See the full evaluation example in Go →
PHP
See the full evaluation example in PHP →
Related Assertions
min_word_count
Prevent thin content by requiring a minimum number of words per page.
max_js_ghost_ratio
Flag pages where content is rendered client-side and invisible to crawlers.
Use in CI/CD
Add this assertion to your deployment pipeline. Works with any CI platform:
🐙 GitHub Actions
Block bad deployments with automated SEO checks in your GitHub Actions CI/CD pipeline.
🦊 GitLab CI
Add automated SEO quality gates to your GitLab CI/CD pipelines.
▲ Vercel
Automatically validate SEO on every Vercel preview deployment before promoting to production.
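As a sketch of the pipeline integration, a GitHub Actions step might call the evaluate endpoint with curl and fail the job on a non-2xx response (`--fail`). The secret name, headers, and flags here are assumptions, not SEODiff's documented integration; see the platform guides above.

```yaml
# .github/workflows/seo-check.yml (sketch; adapt to your pipeline)
- name: Check token bloat
  run: |
    curl --fail --silent --show-error \
      -X POST https://seodiff.io/api/v1/agent/evaluate \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${{ secrets.SEODIFF_API_KEY }}" \
      -d '{"urls": ["https://example.com/blog/post-1"],
           "assertions": [{"rule": "max_token_bloat", "value": 8.0}]}'
```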