$Header: /CVSROOT/tinohtmlparse/README,v 1.2 2006/06/11 06:57:30 tino Exp $

THIS IS CURRENTLY PUBLIC DOMAIN AND CAN BE DISTRIBUTED UNDER ANY
LICENSE.  However, as ekhtml is linked in statically, binaries must
comply with the ekhtml license.

Origin: http://www.scylla-charybdis.com/tool.php?tool=tinohtmlparse

This is currently based on ekhtml, a seriously imperfect HTML parser (for
example, it does not parse comments correctly, as in the following:
<!DTD -- comment 1 -- more DTD -- comment 2 -- again DTD />).  Perhaps
someday I will get around to writing a working version (which shall then
be able to sanitize HTML as well), but for now, we keep it as it is.

First fetch ekhtml from CVS at sourceforge.net:
	cvs -d :pserver:anonymous@cvs.sourceforge.net:/cvsroot/ekhtml login
		(empty password)
	cvs -z3 -d :pserver:anonymous@cvs.sourceforge.net:/cvsroot/ekhtml co ekhtml
Compile ekhtml:
	cd ekhtml; ./autogen.sh; make

To compile tinohtmlparse:

Create a softlink to the ekhtml source if you compiled it in another directory:
	ln -s ../somewhere/ekhtml ekhtml
Then type
	make

You can ignore the error that tinolib is missing, as tinolib is not
required for this.  If you really want it, grab the tinolib
distribution and point the softlink at it like this:
	ln -s ../somewhere/tinolib-*/old tinolib
Note that tinolib restricts distributions which bundle it to the GPL!


Usage:
	tinohtmlparse [-r|--raw] [-l|--list]

--list shows a list of known entities.

--raw leaves these entities in attributes as they are instead of
converting them to their % representation.

The HTML file is read from stdin and the output is written to stdout.
Every output line follows this template:

TYPE TAG ATTR Q TEXT

The first 4 words are guaranteed never to contain a space (SPC).  If
one of them is empty, it is guaranteed that no more words or text
follow on that line.

- TYPE is a type string (see below)
- TAG usually is the HTML TAG (ekhtml converts this to uppercase)
- ATTR is the attribute name
- Q is the quote type of the text which follows.
- TEXT is the text, %-escaped such that it is URL compatible.

When TYPE is "text" or "comment", TAG is a number counting the
lines starting with 0, ATTR is an LF flag and Q is always -.

Q can be B for boolean attributes (those without =), N (was not quoted),
' or " (the quote which was used).  There is also a form where Q is two
hex digits HH, but this should never show up (it exists in case ekhtml
sends some unusual quote character).

So you can do

./tinohtmlparse < htmlfile |
./tinohtmlabsurl.sh "BASEURL" |
while read -r type tag name q text
do
	...
done
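
As an illustration of what the loop body could do, here is a minimal
sketch that prints the TEXT of one attribute of one tag, for example
the link targets of A tags.  The names upper and list_attr are made up
for this example and are not part of tinohtmlparse.  The comparison
ignores case, because only TAG is documented to be uppercased; that
ATTR may arrive in either case is an assumption.  TEXT is printed as
is, still % escaped.

```shell
#!/bin/sh
# Hypothetical helpers, not part of tinohtmlparse.

# upper STRING: write STRING converted to uppercase
upper()
{
	printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# list_attr TAG ATTR: read tinohtmlparse output on stdin and print the
# (still % escaped) TEXT of every "attr" line matching TAG and ATTR
list_attr()
{
	want_tag=$(upper "$1")
	want_attr=$(upper "$2")
	while read -r type tag name q text
	do
		# only attr lines carry attribute values
		[ "$type" = attr ] || continue
		if [ "$(upper "$tag")" = "$want_tag" ] &&
		   [ "$(upper "$name")" = "$want_attr" ]
		then
			printf '%s\n' "$text"
		fi
	done
}
```

It could then be used like this:

	./tinohtmlparse < htmlfile | list_attr A HREF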


Output documentation:

open TAG
close TAG
	An opening or closing tag TAG was encountered.  TEXT is empty.

attr TAG ATTR Q TEXT
	A named attribute.  These lines immediately follow "open".

	TAG is the tag the attribute belongs to, repeated for easier parsing.
	ATTR is the attribute's name.

	The text is URL-escaped with %, that is, %xx is the hex
	representation of any unusual character (including %).  For
	Unicode characters there is the representation %uXXXX.

	If 

text COUNT LF - TEXT
comment COUNT LF - TEXT
	COUNT is the line count.

	LF is either 0 (TEXT does not contain an LF) or 1 (TEXT does
	contain an LF).  Text spanning multiple lines is output as one
	line per input line with the line count counted up, so there are
	no hard-to-parse continuation lines.

	For the comment form, TEXT is the commented-out text.
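
Because COUNT starts at 0 again for each new text or comment, it can
be used to tell the blocks apart.  A minimal sketch, assuming you want
to list all comments of a document: the name list_comments and the
"comment block N:" header are made up for this example, and TEXT is
printed still % escaped.

```shell
#!/bin/sh
# list_comments: hypothetical helper, not part of tinohtmlparse.
# Reads tinohtmlparse output on stdin and prints the TEXT of all
# "comment" lines, with a header in front of each new comment.
list_comments()
{
	n=0
	while read -r type count lf q text
	do
		[ "$type" = comment ] || continue
		# COUNT 0 marks the first line of a new comment
		if [ "$count" = 0 ]
		then
			n=$((n + 1))
			printf 'comment block %d:\n' "$n"
		fi
		printf '%s\n' "$text"
	done
}
```

It could be used like this:

	./tinohtmlparse < htmlfile | list_comments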

-Tino
webmaster@scylla-charybdis.com
$Log: README,v $
Revision 1.2  2006/06/11 06:57:30  tino
Mainly only documentation corrected

Revision 1.1  2005/02/05 23:07:28  tino
first commit, tinohtmlparse.c is missing "text" aggregation

