htmlcxx is a C++ library that provides a simple, non-validating parser for HTML and CSS1. It gives C++ developers a straightforward way to extract meaningful data from web pages.
The htmlcxx parser is designed to mimic Mozilla Firefox's behavior, so users can expect parse trees similar to the ones Firefox builds. Unlike Firefox, however, htmlcxx does not insert elements that are absent from the HTML. As a result, serializing the DOM tree reproduces exactly the bytes contained in the original document.
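To make the byte-preservation property concrete, here is a minimal, self-contained sketch. It is not htmlcxx code and does not reflect how htmlcxx is implemented: `Token`, `tokenize`, and `serialize` are hypothetical names invented for this illustration. The sketch only shows the general idea that a parser which stores the raw source span of every node, and never synthesizes missing tags, can write the document back out unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy node (NOT htmlcxx's HTML::Node).
struct Token {
    bool isTag;       // true for "<...>" spans, false for text
    std::string raw;  // exact source bytes of this token
};

// Split the input into tag and text tokens without normalizing,
// validating, or inserting anything.
std::vector<Token> tokenize(const std::string& html) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t end;
        bool isTag = (html[pos] == '<');
        if (isTag) {
            // A tag runs up to and including the next '>'.
            // An unterminated tag swallows the rest of the input.
            std::size_t close = html.find('>', pos);
            end = (close == std::string::npos) ? html.size() : close + 1;
        } else {
            // Text runs up to (not including) the next '<'.
            std::size_t next = html.find('<', pos);
            end = (next == std::string::npos) ? html.size() : next;
        }
        out.push_back(Token{isTag, html.substr(pos, end - pos)});
        pos = end;
    }
    return out;
}

// Concatenate the raw spans; nothing is added or dropped.
std::string serialize(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        out += t.raw;
    }
    return out;
}
```

With this sketch, `serialize(tokenize(s))` returns `s` for any input, including malformed markup such as `<p>unclosed <B>bold`, because no closing tags are invented along the way.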
Using htmlcxx is straightforward. The following example parses a small document, prints the DOM tree, lists the link targets, and dumps the document text:
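    #include <htmlcxx/html/ParserDom.h>
    ...
    // (the snippet assumes "using namespace std;" and "using namespace htmlcxx;")

    // Parse some HTML code
    string html = "<html><body>hey</body></html>";
    HTML::ParserDom parser;
    tree<HTML::Node> dom = parser.parseTree(html);

    // Print the entire DOM tree
    cout << dom << endl;

    // Dump all links in the tree
    tree<HTML::Node>::iterator it = dom.begin();
    tree<HTML::Node>::iterator end = dom.end();
    for (; it != end; ++it) {
        if (it->tagName() == "A") {
            // Attributes are parsed on demand
            it->parseAttributes();
            // attribute() returns a (found, value) pair
            cout << it->attribute("href").second;
        }
    }

    // Dump all text of the document
    it = dom.begin();
    end = dom.end();
    for (; it != end; ++it) {
        // Skip tags and comments; only text nodes remain
        if ((!it->isTag()) && (!it->isComment())) {
            cout << it->text();
        }
    }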
Overall, htmlcxx is a useful and unique option for C++ developers who need an HTML and CSS parser.
Version 0.83: N/A