October 13, 2008

htmlcxx is a C++ library that efficiently parses HTML and CSS1 without validating them. It gives C++ developers a simple way to extract meaningful data from web pages.

Version: 0.83

License: LGPL

Platform: Linux

Supported Languages: English

Homepage: htmlcxx.sourceforge.net

Developed by: Davi de Castro Reis and Robson Braga Araújo

The htmlcxx project is a non-validating C++ parser for HTML and CSS1. Several other parsers are available, but htmlcxx stands out in a few respects. It allows STL-like navigation of the DOM tree, using the tree.hh library from Kasper Peeters. It can reproduce the original document character by character from the parse tree. It comes with a bundled CSS parser, and attribute parsing is optional, so attributes are parsed only when requested. The library is written in idiomatic C++, and the offsets of tags and elements are stored in the nodes of the DOM tree.

The parsing approach of htmlcxx is designed to mimic Mozilla Firefox's behavior, so users should expect parse trees similar to those Firefox builds. Unlike Firefox, however, htmlcxx does not insert elements that are missing from the HTML. As a result, serializing the DOM tree yields exactly the bytes contained in the original document.
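The idea of storing offsets and reproducing the document byte for byte can be illustrated with a small standalone sketch. It does not use htmlcxx and is not how htmlcxx is implemented; the names Token, tokenize and serialize are hypothetical. It only shows that when every byte of the input belongs to exactly one recorded (offset, length) span, and nothing is inserted for missing tags, the original document can be rebuilt exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type for illustration; not htmlcxx's HTML::Node.
struct Token {
    std::size_t offset;  // byte position of the token in the source
    std::size_t length;  // number of bytes the token covers
    bool isTag;          // true for "<...>" spans, false for text
};

// Split the input into tag and text spans without validating anything.
// Every byte of the input ends up in exactly one token.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < src.size()) {
        std::size_t end = src.size();
        bool isTag = false;
        if (src[pos] == '<') {
            const std::size_t close = src.find('>', pos);
            if (close != std::string::npos) {
                end = close + 1;
                isTag = true;
            }
            // An unterminated '<' is kept as plain text up to the end.
        } else {
            const std::size_t next = src.find('<', pos);
            if (next != std::string::npos) end = next;
        }
        tokens.push_back({pos, end - pos, isTag});
        pos = end;
    }
    return tokens;
}

// Rebuild the document from the recorded offsets alone.
std::string serialize(const std::string& src, const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        assert(t.offset == out.size());  // tokens tile the input with no gaps
        out.append(src, t.offset, t.length);
    }
    return out;
}
```

Calling tokenize on an input with a missing closing tag, such as "<html><body>hey</body>", produces no token for the absent </html>, and serialize returns the input unchanged.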

Using htmlcxx is relatively easy; the following example demonstrates how it works:

    #include <htmlcxx/html/ParserDom.h>
    ...

    // Parse some HTML code
    string html = "<html><body>hey</body></html>";
    HTML::ParserDom parser;
    tree<HTML::Node> dom = parser.parseTree(html);

    // Print the entire DOM tree
    cout << dom << endl;

    // Dump all links in the tree
    tree<HTML::Node>::iterator it = dom.begin();
    tree<HTML::Node>::iterator end = dom.end();
    for (; it != end; ++it) {
        if (it->tagName() == "A") {
            // Attributes are parsed only on request
            it->parseAttributes();
            // attribute() returns a (found, value) pair
            cout << it->attribute("href").second << endl;
        }
    }

    // Dump all text of the document
    it = dom.begin();
    end = dom.end();
    for (; it != end; ++it) {
        if ((!it->isTag()) && (!it->isComment())) {
            cout << it->text();
        }
    }
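The example calls parseAttributes() only on the tags it cares about, which is what makes attribute parsing optional: the cost is paid per tag, on demand. The sketch below shows what such a deferred step might look like when applied to the raw text of a single tag. It is a standalone illustration, not htmlcxx code; the function name parseTagAttributes and its handling of quoting are assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper for illustration; not htmlcxx's parseAttributes().
// Extracts name/value pairs from the raw text of one opening tag, for
// example <a href="x" class=y>. Values may be quoted with ' or ", or
// left unquoted; a name without '=' maps to an empty string.
std::map<std::string, std::string> parseTagAttributes(const std::string& tag) {
    std::map<std::string, std::string> attrs;
    const std::size_t n = tag.size();
    std::size_t i = 0;
    auto isSpace = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    // Skip the leading '<' and the tag name.
    if (i < n && tag[i] == '<') ++i;
    while (i < n && !isSpace(tag[i]) && tag[i] != '>') ++i;

    while (i < n) {
        // Skip separators between attributes.
        while (i < n && (isSpace(tag[i]) || tag[i] == '/')) ++i;
        if (i >= n || tag[i] == '>') break;

        // Read the attribute name.
        std::size_t start = i;
        while (i < n && !isSpace(tag[i]) && tag[i] != '=' &&
               tag[i] != '>' && tag[i] != '/') ++i;
        const std::string name = tag.substr(start, i - start);

        // Read the optional value after '='.
        while (i < n && isSpace(tag[i])) ++i;
        std::string value;
        if (i < n && tag[i] == '=') {
            ++i;
            while (i < n && isSpace(tag[i])) ++i;
            if (i < n && (tag[i] == '"' || tag[i] == '\'')) {
                const char quote = tag[i++];
                std::size_t close = tag.find(quote, i);
                if (close == std::string::npos) close = n;  // unterminated quote
                value = tag.substr(i, close - i);
                i = (close < n) ? close + 1 : n;
            } else {
                start = i;
                while (i < n && !isSpace(tag[i]) && tag[i] != '>') ++i;
                value = tag.substr(start, i - start);
            }
        }
        if (!name.empty()) attrs[name] = value;
    }
    return attrs;
}
```

A parser built this way can keep only the raw tag text per node during the main pass and run a helper like this later, for just the nodes whose attributes are actually needed.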

Overall, htmlcxx is a useful option for C++ developers who need a non-validating HTML and CSS parser.

What's New

Version 0.83: N/A

Free Download 410K
