Robots.txt is a simple but powerful file that controls which pages search engine crawlers can and cannot access. Proper robots.txt configuration preserves crawl budget for your most important pages, keeps crawlers out of non-public content, and prevents duplicate-content issues from parameter URLs and internal systems.
Robots.txt Syntax
The file lives at the root of your domain: yourdomain.com/robots.txt
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /internal-search/
Disallow: /*?sort=
Disallow: /*?filter=

Sitemap: https://yourdomain.com/sitemap.xml
Key Directives
- User-agent: Specifies which bot the rules apply to (* = all bots)
- Disallow: Blocks the specified path from crawling
- Allow: Explicitly permits crawling (overrides Disallow for specific paths; see the example after this list)
- Sitemap: Points crawlers to your XML sitemap
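These rules combine. In Google's implementation, the most specific (longest) matching rule wins, so a narrow Allow can carve an exception out of a broader Disallow. A sketch using a hypothetical /admin/help/ section:

User-agent: *
Disallow: /admin/
Allow: /admin/help/

Here /admin/settings stays blocked, while /admin/help/getting-started remains crawlable because the Allow rule matches it more specifically.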
What to Block
- Admin and login pages: /admin/, /wp-admin/, /login/
- Internal search results: /search?, /internal-search/
- Parameter URLs: ?sort=, ?filter=, ?page= (if canonical tags handle these; see the wildcard example after this list)
- Development and staging paths: /staging/, /test/, /dev/
- Thank-you and confirmation pages: /thank-you/, /order-confirmation/
- User-generated parameter pages with thin content
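Googlebot and most major crawlers support two pattern characters in these rules: * matches any sequence of characters, and $ anchors a rule to the end of the URL. A sketch with illustrative paths (the .pdf rule is included purely to show the $ syntax):

User-agent: *
# Match sort= and filter= anywhere in the URL, not just as the first parameter
Disallow: /*sort=
Disallow: /*filter=
# Block internal search result pages
Disallow: /search?
# $ anchors the match: this blocks URLs ending in .pdf and nothing else
Disallow: /*.pdf$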
What NOT to Block
- CSS and JavaScript files: Googlebot needs these to render pages properly (see the Allow pattern after this list)
- Image directories: Unless you deliberately want to exclude images from Google Images
- Your sitemap: Always allow crawling of your sitemap
- Any page you want to rank: Disallow prevents crawling, so Google can never see the content it would need to rank the page
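If a broad Disallow would otherwise catch render-critical assets, a common pattern is to carve them back out with more specific Allow rules. For example, WordPress sites routinely block /wp-admin/ while re-allowing the one file that front-end features depend on; the wildcard rules below are an extra safeguard, not a requirement:

User-agent: *
Disallow: /wp-admin/
# Front-end features on WordPress sites call this endpoint, so keep it crawlable
Allow: /wp-admin/admin-ajax.php
# Re-allow stylesheets and scripts in case any blocked path contains them
Allow: /*.css$
Allow: /*.js$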
Robots.txt vs. Noindex
Important distinction:
- Robots.txt Disallow: Prevents crawling but does NOT remove pages from the index. If a page has backlinks, Google may index the URL without crawling it.
- Noindex meta tag: Tells Google to remove the page from the index. Requires the page to be crawlable (not blocked by robots.txt).
To remove a page from search results: use noindex, NOT robots.txt. To save crawl budget on non-content pages: use robots.txt.
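For reference, noindex lives in the page itself rather than in robots.txt, either as a meta tag in the HTML head or, for non-HTML files such as PDFs, as an X-Robots-Tag HTTP response header:

<meta name="robots" content="noindex">

Remember: if the same URL is also disallowed in robots.txt, Google never fetches the page and therefore never sees this tag.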
Testing Robots.txt
Use the robots.txt report in Google Search Console (the successor to the legacy robots.txt Tester) to verify your rules work as intended. Test specific URLs to confirm they’re allowed or blocked. After making changes, monitor Search Console’s crawl stats for unexpected shifts in crawl behavior.
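For a quick local sanity check, Python's standard-library urllib.robotparser can fetch the file and evaluate URLs against it. One caveat: it implements the original robots exclusion protocol and does not understand Google's * and $ wildcard extensions, so wildcard rules still need verification in Search Console. The domain and URLs below are placeholders:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt file
parser = RobotFileParser()
parser.set_url("https://yourdomain.com/robots.txt")
parser.read()

# Check whether specific URLs are crawlable for a given user agent
for url in (
    "https://yourdomain.com/blog/example-post/",
    "https://yourdomain.com/admin/settings",
):
    allowed = parser.can_fetch("Googlebot", url)
    print("ALLOW" if allowed else "BLOCK", url)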
Review your robots.txt quarterly. As your site grows, new URL patterns emerge that may need blocking. A well-maintained robots.txt ensures Googlebot spends its crawl budget on pages that drive your SEO performance.