
# Website crawler

Use the website crawler when your content already lives on a public website.

It works well for docs, help centers, and policy pages.

### What do you need first?

You need:

* A public base URL or sitemap URL
* A clear idea of which paths should be included
* A page limit for the first crawl

{% hint style="info" %}
The crawler respects `robots.txt`, so pages disallowed there are not crawled.
{% endhint %}
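
If you want to check ahead of time which URLs a `robots.txt` file would block, a minimal sketch with Python's standard `urllib.robotparser` looks like this. The rules, the `example.com` URLs, and the `ExampleCrawler` user agent are hypothetical; this is not the crawler's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /internal/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "ExampleCrawler") -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://example.com/docs/getting-started",
        "https://example.com/search",
        "https://example.com/internal/notes",
    ):
        print(url, is_allowed(rules, url))
```

Running a check like this against your own `robots.txt` can explain why an expected page is missing from the preview.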

### How do you set it up?

{% stepper %}
{% step %}

### Start with the sitemap

Use the sitemap URL when one exists.

If no sitemap exists, start from the base URL.
{% endstep %}

{% step %}

### Set path filters

Add include and exclude paths.

Keep search pages, archive pages, and thin landing pages out of scope.
{% endstep %}

{% step %}

### Set the max pages

Start with a low page limit so the first crawl stays small.

Expand later if the first preview looks clean.
{% endstep %}

{% step %}

### Preview pages

Review the discovered pages before the full sync.

Then launch the crawl.
{% endstep %}
{% endstepper %}
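
To reason about how include paths, exclude paths, and the page limit interact, the sketch below applies them to a list of discovered URLs. It assumes simple path-prefix matching and that excludes win over includes; the function name `select_pages` and the sample URLs are hypothetical, and the product's actual matching rules may differ.

```python
from urllib.parse import urlparse


def select_pages(urls, include, exclude, max_pages):
    """Keep URLs whose path matches an include prefix and no exclude prefix,
    stopping once max_pages URLs have been selected."""
    selected = []
    for url in urls:
        if len(selected) >= max_pages:
            break
        path = urlparse(url).path or "/"
        # An empty include list is treated as "include everything" (assumption).
        if include and not any(path.startswith(prefix) for prefix in include):
            continue
        if any(path.startswith(prefix) for prefix in exclude):
            continue
        selected.append(url)
    return selected


if __name__ == "__main__":
    discovered = [
        "https://example.com/docs/install",
        "https://example.com/docs/archive/2019",
        "https://example.com/search",
        "https://example.com/docs/billing",
        "https://example.com/docs/refunds",
    ]
    preview = select_pages(
        discovered,
        include=["/docs/"],
        exclude=["/docs/archive/", "/search"],
        max_pages=2,
    )
    print(preview)
```

With the sample input, the archive and search pages are filtered out and the limit of 2 stops the selection before `/docs/refunds`.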

### What happens if there is no sitemap?

The crawler follows same-domain links from the starting URL.

This works, but a sitemap usually gives cleaner coverage.
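
As an illustration of what following same-domain links involves, here is a minimal sketch that extracts links from one page and keeps only those on the same host. It uses only the Python standard library; the helper names and sample HTML are hypothetical, and the real crawler may resolve or filter links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Resolve links against page_url and keep unique http(s) links on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


if __name__ == "__main__":
    sample = (
        '<a href="/docs/a">A</a>'
        '<a href="https://other.example.org/x">External</a>'
        '<a href="b#top">B</a>'
        '<a href="mailto:hi@example.com">Mail</a>'
    )
    print(same_domain_links("https://example.com/docs/", sample))
```

Because every navigation and footer link is followed this way, a link-based crawl tends to pick up more noise than a sitemap does.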

### What should you watch for?

Watch for duplicate pages, noisy navigation pages, and sections outside your real support scope.

Tighter filters usually improve answer quality.
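
One way to spot duplicate pages in a list of discovered URLs is to normalize each URL before comparing. The sketch below is an assumption about what counts as a duplicate (same page with a trailing slash, a fragment, or tracking parameters); the `TRACKING_PARAMS` set and function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed not to change page content (hypothetical list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Lowercase the host, drop the fragment, trailing slash, and tracking params."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


def dedupe(urls):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    print(dedupe([
        "https://example.com/docs/",
        "https://example.com/docs",
        "https://example.com/docs?utm_source=news",
        "https://example.com/pricing",
    ]))
```

If many URLs collapse to the same normalized form, an exclude path or a sitemap-based crawl is usually the simpler fix.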

### Related links

* [Sources overview](/replai-docs/connectors-sources/sources-overview.md)
* [Sync & scheduling](/replai-docs/connectors-sources/sync-and-scheduling.md)
* [Troubleshoot connectors](/replai-docs/connectors-sources/troubleshooting-connectors.md)
* [Citations](/replai-docs/answers-and-grounding/citations.md)


