Web Crawler

Connection

The Web Crawler integration enables customers to extract data from any website by scraping its pages. The crawl is guided by sitemap files, which tell the integration which pages to scan and scrape. For more information about sitemaps, see the sitemap protocol documentation at sitemaps.org.
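
As a rough sketch of how a sitemap drives a crawl, the Python example below fetches a sitemap and lists the page URLs it declares. This illustrates the sitemap format only, not Unleash's actual implementation; the sitemap URL is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

# The sitemap protocol puts every page URL in a namespaced <loc> element.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_page_urls(sitemap_url: str) -> list[str]:
    """Fetch a sitemap and return the page URLs it lists."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

# Placeholder URL -- substitute the sitemap discovered for your site.
for url in sitemap_page_urls("https://example.com/sitemap.xml"):
    print(url)
```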

On the connection interface, users are prompted to enter a website URL to scan. The system then attempts to auto-discover sitemap files through the site's robots.txt file.
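
Auto-discovery works because robots.txt can declare sitemap locations via `Sitemap:` directives. A minimal sketch using Python's standard-library robotparser (the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# robots.txt may contain lines such as:
#   Sitemap: https://example.com/sitemap.xml
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()
print(parser.site_maps())  # declared sitemap URLs, or None if absent
```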

Note:

If you cannot access the Web Crawler connection, it needs to be enabled for your account. Please reach out to your Unleash representative.

Filtering and Selection

Users can connect the Web Crawler integration on their own. However, given the variance in website layouts and sitemap structures, as well as the diversity of information presented, we recommend contacting your Unleash representative, who can fine-tune the crawling process behind the scenes. Our data specialists are equipped with a wide range of tools to tailor the configuration to your specific needs, including:

  • Selection and filtering of DOM elements (see the sketch after this list)

  • Language filters

  • Title manipulation
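
To give a sense of what DOM-element selection and filtering involves, the sketch below strips common page chrome and keeps the main content region. It uses BeautifulSoup with selectors that are assumptions about typical layouts, not Unleash's internal tooling:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_main_text(html: str) -> str:
    """Drop navigation chrome and return the text of the main content area."""
    soup = BeautifulSoup(html, "html.parser")
    # Elements that rarely carry article content (assumed typical chrome).
    for tag in soup.select("nav, header, footer, aside, script, style"):
        tag.decompose()
    # Prefer a <main> or <article> region when the page provides one.
    main = soup.select_one("main, article") or soup.body or soup
    return main.get_text(" ", strip=True)
```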

Sync

  • After establishing a connection, a scraping cycle is initiated every 6 hours to capture updates.

  • To expedite subsequent synchronizations, we utilize the If-Modified-Since header (where supported) to determine whether a page has changed; a sketch of this conditional-request pattern follows.
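
The conditional-request pattern looks roughly like this: send the last sync time in the If-Modified-Since header and skip the page when the server answers 304 Not Modified. A minimal Python sketch, not Unleash's code:

```python
import urllib.error
import urllib.request
from email.utils import formatdate

def fetch_if_changed(url: str, last_sync_epoch: float) -> bytes | None:
    """Return the page body, or None if unchanged since last_sync_epoch."""
    req = urllib.request.Request(url)
    # HTTP-date for the header, e.g. "Sun, 01 Sep 2024 00:00:00 GMT".
    req.add_header("If-Modified-Since", formatdate(last_sync_epoch, usegmt=True))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # 200 OK: the page changed, re-scrape it
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # 304 Not Modified: skip this page this cycle
        raise
```

Servers that do not honor the header simply return the full page, which matches the "(where supported)" caveat above.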

