Web Crawler

Connection

The Web Crawler integration enables customers to extract data from any website by scraping its pages. The crawl is guided by sitemap files, which tell the integration which pages to scan and scrape. For more information about sitemaps, see the sitemap protocol documentation at sitemaps.org.
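
As a rough sketch of how a sitemap drives a crawl, the Python example below fetches a sitemap and lists the page URLs it declares. This illustrates the sitemap format only, not Unleash's actual implementation; the sitemap URL is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

# The sitemap protocol puts every page URL in a namespaced <loc> element.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_page_urls(sitemap_url: str) -> list[str]:
    """Fetch a sitemap and return the page URLs it lists."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

# Placeholder URL -- substitute the sitemap discovered for your site.
for url in sitemap_page_urls("https://example.com/sitemap.xml"):
    print(url)
```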

On the connection interface, users are prompted to enter a website URL to scan. The system then attempts to auto-discover sitemap files through the site's robots.txt file.
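
Auto-discovery works because robots.txt can declare sitemap locations via `Sitemap:` directives. A minimal sketch using Python's standard-library robotparser (the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# robots.txt may contain lines such as:
#   Sitemap: https://example.com/sitemap.xml
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()
print(parser.site_maps())  # declared sitemap URLs, or None if absent
```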

Note:

If you cannot access the Web Crawler connection, it needs to be enabled for your account. Please reach out to your Unleash representative.

Filtering and Selection

Users can connect the Web Crawler integration on their own. However, given the variance in website layouts and sitemap structures, as well as the diversity of information presented, we recommend contacting your Unleash representative, who can fine-tune the crawling process behind the scenes. Our data specialists are equipped with a wide range of tools to tailor the configuration to your specific needs, including:

  • Selection and filtering of DOM elements (see the sketch after this list)

  • Language filters

  • Title manipulation
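
To give a sense of what DOM-element selection and filtering involves, the sketch below strips common page chrome and keeps the main content region. It uses BeautifulSoup with selectors that are assumptions about typical layouts, not Unleash's internal tooling:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_main_text(html: str) -> str:
    """Drop navigation chrome and return the text of the main content area."""
    soup = BeautifulSoup(html, "html.parser")
    # Elements that rarely carry article content (assumed typical chrome).
    for tag in soup.select("nav, header, footer, aside, script, style"):
        tag.decompose()
    # Prefer a <main> or <article> region when the page provides one.
    main = soup.select_one("main, article") or soup.body or soup
    return main.get_text(" ", strip=True)
```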

Sync

  • After establishing a connection, a scraping cycle is initiated every 6 hours to capture updates.

  • To expedite subsequent synchronizations, we utilize the If-Modified-Since header (where supported) to determine whether a page has changed; a sketch of this conditional-request pattern follows.
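
The conditional-request pattern looks roughly like this: send the last sync time in the If-Modified-Since header and skip the page when the server answers 304 Not Modified. A minimal Python sketch, not Unleash's code:

```python
import urllib.error
import urllib.request
from email.utils import formatdate

def fetch_if_changed(url: str, last_sync_epoch: float) -> bytes | None:
    """Return the page body, or None if unchanged since last_sync_epoch."""
    req = urllib.request.Request(url)
    # HTTP-date for the header, e.g. "Sun, 01 Sep 2024 00:00:00 GMT".
    req.add_header("If-Modified-Since", formatdate(last_sync_epoch, usegmt=True))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # 200 OK: the page changed, re-scrape it
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # 304 Not Modified: skip this page this cycle
        raise
```

Servers that do not honor the header simply return the full page, which matches the "(where supported)" caveat above.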

