Quickly scan a website, extract the entire content (text, HTML or Markdown) or specific divs or spans, and save the results as CSV or JSON
Version: 4.15.2
WebScraper uses the Integrity v6 Engine to quickly scan a website, and can currently output the data as CSV or JSON.
License: Free to try ($25.00)
Operating System: Mac OS X
The output can include various metadata, the entire content of each page (as text, HTML or Markdown), and extracted parts of pages (currently a named class, id or itemprop of div, span, dd or p elements).
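To make the class/id extraction concrete, here is a minimal sketch of the same idea in Python, using only the standard library's HTML parser. It is an illustration of the technique, not WebScraper's implementation; the sample HTML and the "price" class name are invented for the example.

```python
# Illustrative sketch: collect the text of every <div class="price">,
# the kind of class-based extraction described above.
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collects the text inside tags whose class attribute matches a target."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0          # > 0 while inside a matching tag
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested tags of the same name so we close at the right one.
            self.depth += 1 if tag == self.tag else 0
        elif tag == self.tag and self.cls in dict(attrs).get("class", "").split():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

html = '<div class="price">$25.00</div><div class="name">WebScraper</div>'
p = ClassExtractor("div", "price")
p.feed(html)
print(p.results)  # ['$25.00']
```

The collected strings could then be written out with the `csv` or `json` modules, mirroring the app's two export formats.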
WebScraper is new. Please use it for free, and please get in touch with any requests, bug reports or observations.
- Easy to scan a site: just enter the starting URL and press Go
- Easy to export: checkboxes for the columns you want
- Plenty of options and configuration
- Configurable limits on the crawl and on the output file size
Version 4.15.2:
- Small enhancement to downloading of images when the 'single folder' option is chosen
- Adds timeout control under Advanced Scan Settings
- Adds option to 'render page / run js' before parsing a page for links
- Fixes a problem preventing scanning of a list of URLs
- Now a Universal Binary (Intel/M1)
Version 4.14.4: Fixes a problem preventing scanning of a list of URLs
Version 4.14.3:
- Updates the selectable user-agent strings and adds more
- Changes the default handling of http:// links on the same domain (when starting from an https:// URL); they are now treated as internal
- Fixes a problem with the plain-text content option
- Inherits some general updates in the crawling engine
Version 4.14.1:
- Adds option to recreate the directory structure when downloading PDFs or images to a local folder
- Fixes a crash where a regex matches on the page but the collected part is an empty string
Version 4.12.1:
- Adds option to download and save PDF files to a folder as it scans
- Adds support for charset=GBK, charset=koi8-r, charset=euc-kr and some other Latin and non-Latin character encodings
Version 4.11.0:
- Adds option in simple setup and complex setup for scraping email addresses
- Adds a field in Preferences for editing the regular expression used when scraping email addresses
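An editable email-scraping expression might look like the pattern below. This is a common general-purpose email regex given purely as an example; the app's actual default expression is not documented here, and the sample page text is invented.

```python
import re

# A typical email-matching pattern of the kind such a preference field
# might hold (assumption: not WebScraper's actual default).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

page = "Contact sales@example.com or support@example.org for help."
print(EMAIL_RE.findall(page))  # ['sales@example.com', 'support@example.org']
```

Making the expression user-editable lets a scan be tightened (e.g. to one domain) or loosened without a new app release.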
Version 4.10.2:
- Can use the ProxyCrawl service
- Adds option to strip HTML markup from the results of class/id or regex extraction
- Adds td to the list of tags searched when you specify a class or id
- Many other fixes and enhancements
Version 4.4.1:
- Dark-mode-ready
- Fixes a bug that could leave column information (complex setup) misaligned after dragging and dropping to reorder columns
- Other small fixes
Version 4.4.0:
- Dark-mode-ready
- Adds 'crawl above starting directory' control
- Fixes some issues with Markdown generation
- Improves the scan blacklist/whitelist rules in the UI
- Other small fixes
Version 4.1.1:
- Adds capability to download images to a folder during the scan
- Adds option to filter the output file
- Allows editing and re-ordering of table columns
- Unifies the helper windows