SEO Crawler Test Checklist
This page outlines the technical SEO features and scenarios implemented across this test website.
On-Page Elements
- Basic Tags:
  - Presence/content of `<title>` (e.g., `index.html`; see the extraction sketch after this list)
  - Handling of missing `<title>` (`page4-no-title.html`)
  - Handling of very long `<title>` (`content-1.html`)
  - Presence/content of `<meta name="description">` (e.g., `index.html`)
  - Presence/content of `<h1>` (e.g., `index.html`)
  - Handling of multiple `<h1>` tags (`content-2.html`)
  - Use of other heading tags (`<h2>`, etc.)
- Canonicals:
  - Presence/correct URL in `<link rel="canonical">` (e.g., `index.html`)
  - Self-referencing canonical (`page3-duplicate.html`)
  - Canonical pointing to destination on redirects (`redirect-page.html`, `page6-js-redirect.html`, `page12-delayed-refresh.html`)
  - Handling of multiple `<link rel="canonical">` tags (`page9-multi-canonical.html`)
  - Handling canonical to a different domain (`content-4.html`)
- Meta Robots:
  - Parsing `index, follow` (`index.html`)
  - Parsing `noindex, follow` (`page2.html`, redirects)
  - Parsing specific bot directives (`<meta name="google" content="nosnippet">` on `page2.html`)
- Links:
  - Internal/external link identification
  - Broken internal link detection (`/nonexistent-page.html`)
  - Parsing `rel="nofollow"` (internal/external on `index.html`)
  - Parsing `rel="sponsored"` (`index.html`)
  - Parsing `rel="ugc"` (`index.html`)
  - Links within `<iframe>` (`iframe-content.html`)
  - Links added via JavaScript (`page5-js-content.html`, hidden link)
  - Handling links with empty anchor text (`content-5.html`)
  - Handling links to non-HTML resources (PDF link on `content-6.html`)
  - Handling pages with high outgoing link counts (`content-7.html`)
  - Handling links with mixed-case URLs (`content-8.html`)
  - Relative link resolution with `<base>` tag (`page7-base-tag.html`)
  - Relative link resolution from deep paths (`level1/level2/deep-page.html`)
  - Following internal links
  - Handling fragment identifiers (`#`) (`index.html`, `faq.html#question1`)
- Images:
  - Presence of `<img>` tags
  - Presence/content of `alt` text
  - Discovery of images in CSS (`style.css` -> `index.html`)
  - Handling references to large image files (`content-3.html`)
- Internationalization:
  - Parsing/validation of `hreflang` tags (`index.html`, `index-gb.html`)
- Structured Data (Schema.org):
  - Parsing JSON-LD
  - Validation of `WebSite` schema (`index.html`)
  - Validation of `Article` schema (`blog-1.html`)
  - Validation of `FAQPage` schema (`faq.html`)
  - Validation of `BreadcrumbList` schema (`level1/level2/deep-page.html`)
- Pagination:
  - Detection/understanding of `rel="next"`/`rel="prev"` (`blog-*.html`)
- Mobile Friendliness:
  - Detection of missing `<meta name="viewport">` (`page11-no-viewport.html`)
- Social Tags:
  - Parsing Open Graph tags (`index.html`)
  - Parsing Twitter Card tags (`index.html`)
- Content:
  - Extraction of main page content (paragraphs, lists)
  - Handling content hidden via CSS (`display: none` on `index.html`)
  - Detection of identical content across multiple pages (`duplicate-*.html`)
  - Handling of canonical tags on duplicate pages (`duplicate-2.html`, `duplicate-3.html`)
- Alternate Links:
  - Parsing `rel="alternate"` for feeds (RSS link on `index.html`)
Crawling & Indexing Control
- `robots.txt`:
  - Fetching/parsing `robots.txt` (see the sketch after this list)
  - Handling different `User-agent` blocks
  - Respecting `Disallow:` (including wildcards)
  - Respecting `Allow:`
  - Parsing `Crawl-delay:`
  - Parsing `Sitemap:` directive
- Sitemaps:
  - Fetching/parsing Sitemap Index files (`sitemap-index.xml`)
  - Fetching/parsing XML Sitemaps (`sitemap-pages.xml`, `sitemap-blog.xml`)
  - Extracting standard URL data
  - Handling URLs blocked by `robots.txt`
  - Parsing Sitemap extensions (Image sitemap in `sitemap-blog.xml`)
  - Cross-referencing sitemap/crawled URLs
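As a rough illustration of these crawling-control checks, the sketch below parses `robots.txt` with Python's standard `urllib.robotparser`, walks the sitemap index, and cross-references sitemap URLs against the robots rules. The base URL is a hypothetical local server; note that `robotparser` does not understand wildcard `Disallow` patterns, so a real crawler would need its own matcher for that item.

```python
# Illustrative only: the base URL is an assumption, not part of the test site.
from urllib import robotparser
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE = "http://localhost:8000"
SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

rp = robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

print("crawl-delay:", rp.crawl_delay("*"))
print("sitemaps declared:", rp.site_maps())
print("may fetch /page2.html:", rp.can_fetch("*", f"{BASE}/page2.html"))

def sitemap_urls(sitemap_url: str):
    """Yield <loc> values from a sitemap or sitemap index, recursively."""
    root = ET.parse(urlopen(sitemap_url)).getroot()
    for loc in root.iter(f"{{{SM_NS}}}loc"):
        url = loc.text.strip()
        if url.endswith(".xml"):  # simplification: treat .xml entries as nested sitemaps
            yield from sitemap_urls(url)
        else:
            yield url

for url in sitemap_urls(f"{BASE}/sitemap-index.xml"):
    # Cross-reference: a URL listed in a sitemap but disallowed in robots.txt is a flag.
    if not rp.can_fetch("*", url):
        print("listed in sitemap but blocked by robots.txt:", url)
```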
Technical Health
- Redirects:
  - Handling meta refresh redirects (immediate & delayed; see the sketch after this list)
  - Handling JavaScript redirects
  - Handling HTTP 301 Permanent Redirects (server-side config for `/old-page.html`)
  - Handling HTTP 302 Temporary Redirects (server-side config for `/temporary-promo.html`)
  - Detection of simulated Soft 404s (`soft-404-example.html` with 200 status)
- JavaScript:
  - Rendering JS-dependent content/links
  - Handling JS redirects
  - Respecting `robots.txt` for JS resources
- HTML Validity:
  - Handling URLs with excessive length/keywords (`a-very-long-page-name...html`)
  - Robustness with parsing errors (`page8-html-errors.html`)
- Character Encoding:
  - Handling encoding conflicts (`page10-encoding-error.html`)
- Resource Loading:
  - Fetching external CSS/JS (respecting `robots.txt`)
- iFrames:
  - Detection and handling of `<iframe>` content
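For the redirect and soft-404 items, the sketch below follows HTTP redirects with `requests`, records the chain, looks for a meta refresh tag in the body, and applies a crude soft-404 heuristic (200 status plus error-like wording in a thin page). The base URL, thresholds, and phrases are assumptions for illustration; the actual test pages may use different wording.

```python
# Illustrative only: classifies redirects and flags possible soft 404s.
import re
import requests

def check_url(url: str) -> dict:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    report = {
        "final_url": resp.url,
        "status": resp.status_code,
        # e.g. [(301, ".../redirect-target.html")] for /old-page.html
        "redirect_chain": [(r.status_code, r.headers.get("Location")) for r in resp.history],
    }

    # Meta refresh redirects (immediate or delayed) are not HTTP redirects,
    # so they have to be detected in the HTML body.
    meta_refresh = re.search(
        r'<meta[^>]+http-equiv=["\']refresh["\']', resp.text, re.I)
    report["meta_refresh"] = bool(meta_refresh)

    # Crude soft-404 heuristic: 200 status, thin page, error-like wording.
    looks_like_error = re.search(
        r"not found|no longer exists|page unavailable", resp.text, re.I)
    report["possible_soft_404"] = (
        resp.status_code == 200 and looks_like_error is not None and len(resp.text) < 2048
    )
    return report

for path in ["/old-page.html", "/temporary-promo.html", "/soft-404-example.html"]:
    print(path, check_url("http://localhost:8000" + path))
```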