Download URL Crawler
Author: s | 2025-04-25
🕸 Crawl the web using PHP 🕷

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best-in-class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer:

```bash
composer require spatie/crawler
```

Usage

The crawler can be instantiated like this:

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

```php
namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
```

Using multiple observers

You can set multiple observers with setCrawlObservers:

```php
Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
    ])
    ->startCrawling($url);
```

Alternatively you can set multiple observers one by one with addCrawlObserver:

```php
Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

```php
Crawler::create()
    ->executeJavaScript()
    ...
```

In order to get the body HTML after the JavaScript has been executed, this package depends on our Browsershot package. Browsershot uses Puppeteer under the hood; see its documentation for pointers on installing Puppeteer on your system.

Browsershot will make an educated guess as to where its dependencies are installed. By default, the Crawler will instantiate a new Browsershot instance.
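To make the observer contract concrete, here is a minimal sketch of an observer that records the status code of every crawled URL. The class name LoggingCrawlObserver and its statusByUrl property are illustrative and not part of the package; only the method signatures come from the abstract class above.

```php
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that simply collects the URLs it has seen.
class LoggingCrawlObserver extends CrawlObserver
{
    /** @var array<string, int> URL => HTTP status code (0 on failure) */
    public array $statusByUrl = [];

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Record the HTTP status code for every successfully crawled URL.
        $this->statusByUrl[(string) $url] = $response->getStatusCode();
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Mark failed URLs with a 0 status.
        $this->statusByUrl[(string) $url] = 0;
    }
}
```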
If you need to pass a custom-created instance, you can set it using the setBrowsershot(Browsershot $browsershot) method:

```php
Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
```

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile method. That method expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile:

```php
/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;
```

This package comes with three CrawlProfiles out of the box:

- CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs to external sites.
- CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
- CrawlSubdomains: this profile will only crawl the internal URLs and their subdomains on the pages of a host.

Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler:

```php
Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...
```

By default, the LinkUrlParser is used. This parser extracts all links from the href attribute of a tags.

There is also a built-in SitemapUrlParser that will extract and crawl all links from a sitemap. It supports sitemap index files.

```php
Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...
```
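As a concrete illustration of a custom profile, here is a minimal sketch of a CrawlProfile that only allows URLs under a given path prefix. The class name CrawlOnlyDocsPages and the '/docs' prefix are hypothetical; the shouldCrawl signature is the one shown above.

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: only crawl URLs whose path starts with a given prefix.
class CrawlOnlyDocsPages extends CrawlProfile
{
    public function __construct(protected string $pathPrefix = '/docs')
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        return str_starts_with($url->getPath(), $this->pathPrefix);
    }
}

// Usage sketch:
// Crawler::create()
//     ->setCrawlProfile(new CrawlOnlyDocsPages('/docs'))
//     ->startCrawling($url);
```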
Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

```php
Crawler::create()
    ->ignoreRobots()
    ...
```

Robots data can come from a robots.txt file, meta tags, or response headers. More information on the spec can be found here. Parsing robots data is done by our package spatie/robots-txt.

Accepting links with the rel="nofollow" attribute

By default, the crawler will reject all links carrying the rel="nofollow" attribute. It is possible to disable these checks like so:

```php
Crawler::create()
    ->acceptNofollowLinks()
    ...
```

Using a custom User Agent

In order to respect robots.txt rules for a custom User Agent, you can specify your own User Agent:

```php
Crawler::create()
    ->setUserAgent('my-agent')
```

You can add a specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent':

```
// Disallow crawling for my-agent
User-agent: my-agent
Disallow: /
```

Setting the number of concurrent requests

To improve the speed of the crawl, the package crawls 10 URLs concurrently by default. If you want to change that number, you can use the setConcurrency method:

```php
Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one
```
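These options can be chained on the same builder. Here is a minimal sketch of a crawl configured with a custom User Agent and reduced concurrency; the start URL is a placeholder and LoggingCrawlObserver is the hypothetical observer sketched earlier, not a class shipped with the package.

```php
use Spatie\Crawler\Crawler;

$url = 'https://example.com'; // placeholder start URL

Crawler::create()
    ->setUserAgent('my-agent')        // match the 'my-agent' group in robots.txt
    ->setConcurrency(2)               // crawl at most 2 URLs at the same time
    ->acceptNofollowLinks()           // also follow links marked rel="nofollow"
    ->setCrawlObserver(new LoggingCrawlObserver()) // hypothetical observer from the earlier sketch
    ->startCrawling($url);
```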
Defining Crawl and Time Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations, such as a serverless environment.

The crawl behavior can be controlled with the following options:

- Total Crawl Limit (setTotalCrawlLimit): This limit defines the maximal count of URLs to crawl.
- Current Crawl Limit (setCurrentCrawlLimit): This defines how many URLs are processed during the current crawl.
- Total Execution Time Limit (setTotalExecutionTimeLimit): This limit defines the maximal execution time of the crawl.
- Current Execution Time Limit (setCurrentExecutionTimeLimit): This limits the execution time of the current crawl.

Let's take a look at some examples to clarify the difference between setTotalCrawlLimit and setCurrentCrawlLimit. The difference between setTotalExecutionTimeLimit and setCurrentExecutionTimeLimit is analogous.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler:

```php
$queue = <your crawl queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit on the number of pages to crawl:

```php
$queue = <your crawl queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```
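The execution time limits work the same way. Here is a minimal sketch, assuming the limits are expressed in seconds (check the package documentation for the exact unit); $queue and $url are placeholders as in the examples above.

```php
use Spatie\Crawler\Crawler;

// Stop the whole crawl after 120 seconds in total,
// and each individual run after 30 seconds.
Crawler::create()
    ->setCrawlQueue($queue)              // reuse the same queue across runs
    ->setTotalExecutionTimeLimit(120)    // assumed to be seconds
    ->setCurrentExecutionTimeLimit(30)   // assumed to be seconds
    ->startCrawling($url);
```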
Example 3: Combining the total and current crawl limit

Both limits can be combined to control the crawler:

```php
$queue = <your crawl queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

Example 4: Crawling across requests

You can use setCurrentCrawlLimit to break up long-running crawls. The following (simplified) example consists of an initial request and any number of follow-up requests that continue the crawl.

Initial Request

To start crawling across different requests, create a new queue with your selected queue driver and pass the queue instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue reference after the crawler has finished (using the current crawl limit):

```php
// Create a queue using your queue-driver.
$queue = <your crawl queue implementation>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

Subsequent Requests

For any following requests, unserialize the original queue and pass it to the crawler:

```php
// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

The behavior is based on the information in the queue: the limits only work as described when the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls (even for the same website) won't apply. An example with more details can be found here.
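To make the cross-request flow concrete, here is a minimal sketch that persists the serialized queue to a file between runs. The file path and start URL are illustrative, and the sketch assumes the built-in ArrayCrawlQueue lives in the Spatie\Crawler\CrawlQueues namespace alongside the CrawlQueue interface; in practice you would store the queue wherever your application keeps state.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue; // assumed namespace for the built-in queue

$queueFile = __DIR__ . '/crawl-queue.dat'; // hypothetical storage location

// Reuse the stored queue if a previous run left one behind,
// otherwise start with a fresh in-memory queue.
$queue = file_exists($queueFile)
    ? unserialize(file_get_contents($queueFile))
    : new ArrayCrawlQueue();

Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling('https://example.com');

// Persist the queue so the next run continues where this one stopped.
file_put_contents($queueFile, serialize($queue));
```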
"Permanently keep your stuff, for life."

- Fossilo: A commercial archiving solution that appears to be very similar to ArchiveBox
- NeonLink: Simple self-hosted bookmark management + Benotes note-taking app with limited archiving features
- Archivematica: Web GUI for institutional long-term archiving of web and other content
- Headless Chrome Crawler: Distributed web crawler built on Puppeteer with screenshots
- WWWofle: Old proxying recorder software similar to ArchiveBox
- Erised: Super simple CLI utility to bookmark and archive webpages
- Zotero: Collect, organize, cite, and share research (mainly for technical/scientific papers & citations)
- TiddlyWiki: Non-linear bookmark and note-taking tool with archiving support
- Joplin: Desktop + mobile app for knowledge-base-style info collection and notes (with an optional plugin for archiving)
- Hunchly: A paid web archiving / session recording tool designed for OSINT
- Monolith: CLI tool for saving complete web pages as a single HTML file
- Obelisk: Go package and CLI tool for saving a web page as a single HTML file
- Munin Archiver: Social media archiver for Facebook, Instagram and VKontakte accounts
- Wayback: Archiving in style like ArchiveBox, but with a chat

Smaller Utilities

Random helpful utilities for web archiving, WARC creation and replay, and more...

- A utility to sync xBrowserSync bookmarks with ArchiveBox
- A browser extension that collects and collates all the URLs you visit into a hierarchical/graph structure with metadata
- A Chrome extension for saving the state of a page in multiple formats
- A command-line tool that lets you download the entire Wayback Machine archive for a given URL
- Download an entire website from the Internet Archive Wayback Machine
- Replace any broken URLs in some content with Wayback Machine URL equivalents
- Download an archived page or
…by this tool. Unsupported types may exist and be valid on the page, and may appear in Search results, but they will not appear in this tool.

Additional response data: To see additional response data such as the rendered raw HTML, HTTP headers, JavaScript console output, and all loaded page resources, click View crawled page. Additional response information is only available for URLs with the status "URL is on Google" or "URL is on Google, but has issues". The crawler used to generate the data depends on where you open the side panel from: when opened from the top level of the report, the HTTPS sub-report, or any structured data sub-report under Enhancements & Experience, the crawler type is shown under Page availability > Crawled > Crawled as; when opened from the AMP sub-report, the crawler type is Googlebot smartphone. A screenshot of the rendered page is only available in the live test.

Live URL test: Run a live test for a URL in your property to check for indexing issues, structured data, and more. The live test is useful when you are fixing a page, to verify whether the issue has been resolved. To run a live test for potential indexing errors: inspect the URL (note: it does not matter if the page has not been indexed yet, or failed to be indexed, but it must be reachable from the internet without a login requirement), click Test live URL, and read Understanding the live test results to interpret the report. You can switch between the live test result and the indexed result by clicking Google Index or Live Test on the page. To re-run the live test, click the re-run test button on the test page. To view the details…
Spider is the fastest and most affordable crawler and scraper that returns LLM-ready data.

[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World's Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy\`\`\`import requests, osheaders = { 'Authorization': os.environ["SPIDER_API_KEY"], 'Content-Type': 'application/json',}json_data = {"limit":50,"url":" = requests.post(' headers=headers, json=json_data)print(response.json())\`\`\`Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs]( Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs]( [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub]( metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]

The params parameter is a dictionary that can be passed to the loader. See the Spider documentation for all available parameters.
Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawl, you can use the setMaximumDepth method:

```php
Crawler::create()
    ->setMaximumDepth(2)
```

Setting the maximum response size

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses that are smaller than 2 MB. If, while streaming a response, it becomes larger than 2 MB, the crawler will stop streaming the response and assume an empty response body.

You can change the maximum response size:

```php
// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)
```

Adding a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds:

```php
Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms
```

Limiting which content types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types:

```php
Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])
```

This prevents downloading the body of pages with other MIME types, such as binary files and audio/video, which are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large you may want to store that queue elsewhere, for example in a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler:

```php
Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)
```

Available implementations include:

- ArrayCrawlQueue
- RedisCrawlQueue (third-party package)
- CacheCrawlQueue for Laravel (third-party package)
- Laravel Model as Queue (third-party example app)

Changing the default base URL scheme

By default, the crawler will set the base URL scheme to http if none is given. You can change that with setDefaultScheme:

```php
Crawler::create()
    ->setDefaultScheme('https')
```

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail. To run the tests you'll have to start the included Node-based server in a separate terminal window:

```bash
cd tests/server
npm install
node server.js
```

With the server running, you can start testing.

Security

If you've found a bug regarding security, please mail security@spatie.be instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

Freek Van der Herten and all contributors.

License

The MIT License (MIT). Please see the License File for more information.
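Pulling several of these options together, here is a minimal sketch of a politely configured crawl using only the methods described above. The start URL, limits, and the LoggingCrawlObserver class are placeholders from the earlier sketches, and the ArrayCrawlQueue namespace is assumed to match the CrawlQueue interface shown above.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue; // assumed namespace for the built-in queue

Crawler::create()
    ->setCrawlQueue(new ArrayCrawlQueue())            // in-memory queue (the default)
    ->setDefaultScheme('https')                       // assume https when a URL has no scheme
    ->setMaximumDepth(3)                              // don't follow links deeper than 3 levels
    ->setMaximumResponseSize(1024 * 1024)             // ignore bodies larger than 1 MB
    ->setDelayBetweenRequests(250)                    // wait 250 ms between requests
    ->setParseableMimeTypes(['text/html'])            // only parse HTML responses for new links
    ->setCrawlObserver(new LoggingCrawlObserver())    // hypothetical observer from the earlier sketch
    ->startCrawling('https://example.com');
```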
Given

A page linking to a tel: URI:

```html
<html lang="en">
  <head>
    <title>Norconex test</title>
  </head>
  <body>
    <a href="tel:123">Phone Number</a>
  </body>
</html>
```

And the following config:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url></url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```

Expected

The collector should not follow this link, or that of any other scheme it can't actually process.

Actual

The collector tries to follow the tel: link.

```
INFO [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO [JobSuite] JEF work directory is: ./progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] No previous execution detected.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO [HttpCrawler] test-crawler: RobotsTxt support: true
INFO [HttpCrawler] test-crawler: RobotsMeta support: true
INFO [HttpCrawler] test-crawler: Sitemap support: true
INFO [HttpCrawler] test-crawler: Canonical links support: true
INFO [HttpCrawler] test-crawler: User-Agent:
INFO [SitemapStore] test-crawler: Initializing sitemap store...
INFO [SitemapStore] test-crawler: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] test-crawler: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED:
INFO [CrawlerEventManager] CREATED_ROBOTS_META:
INFO [CrawlerEventManager] URLS_EXTRACTED:
INFO [CrawlerEventManager] DOCUMENT_IMPORTED:
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD:
INFO [CrawlerEventManager] REJECTED_NOTFOUND:
INFO [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] test-crawler: Crawler completed.
INFO [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)
```