Team,
I have observed that the HTML parser (hubbub-0.1.2) breaks when it finds a
semicolon in an attribute's text value. Below is an example of the input
that triggers it.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>MIT - Massachusetts Institute of Technology</title>
<meta name="keywords" content="Massachusetts Institute of
Technology, MIT" />
<meta name="description" content="MIT is devoted to the
advancement
of knowledge and education of students in areas that contribute to or
prosper in an environment of science and technology." />
<meta name="robots" content="index,follow,noodp,noydir"
/>
<meta name="allow-search" content="yes" />
<meta name="language" content="en" />
<meta name="distribution" content="global" />
<meta http-equiv="content-type"
content="text/html; charset=UTF-8" />
When the parser reaches the ';' in the last meta tag, it stops working. When
I remove this ';' from the string, it works fine. Could you please check
whether this is an issue with the parser, or whether I am missing something?
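To narrow this down, the two cases could be reduced to a pair of small files that differ only in the semicolon. The following is a rough sketch in Python, not something taken from hubbub: the helper name `write_variants`, the file names, and the reduced document are all assumptions made for illustration.

```python
import os
import tempfile

# Reduced document (an assumption): only the head and the content-type
# meta tag from the original page are kept.
TEMPLATE = (
    '<!DOCTYPE html>\n'
    '<html>\n<head>\n'
    '<title>test</title>\n'
    '<meta http-equiv="content-type" content="{content}" />\n'
    '</head>\n<body></body>\n</html>\n'
)


def write_variants(directory):
    """Write two files that differ only in the ';' and return their paths."""
    # Hypothetical file names; the second value is the first with ';' removed.
    variants = {
        "with-semicolon.htm": "text/html; charset=UTF-8",
        "without-semicolon.htm": "text/html charset=UTF-8",
    }
    paths = []
    for name, content in variants.items():
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(TEMPLATE.format(content=content))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Print the paths so each file can be fed to the parser by hand.
    for p in write_variants(tempfile.mkdtemp()):
        print(p)
```

Each generated file can then be passed to ./libxml in the same way as mit-edu.htm, which should show whether the semicolon alone is enough to reproduce the problem.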
Below is the output of the parser example (i.e. ./libxml). mit-edu.htm is
the HTML web page I am giving as input.
anilj@ubuntu:~/apache/sandbox/hubbub-0.1.2/examples$ ./libxml mit-edu.htm
WARNING: Failed creating namespace xml
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN, SYSTEM
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
ELEMENT html
default namespace
href=http://www.w3.org/1999/xhtml
namespace math
href=http://www.w3.org/1998/Math/MathML
namespace svg
href=http://www.w3.org/2000/svg
namespace xlink
href=http://www.w3.org/1999/xlink
namespace xmlns
href=http://www.w3.org/2000/xmlns/
ATTRIBUTE xmlns
TEXT
content=http://www.w3.org/1999/xhtml
ELEMENT head
TEXT
content=
ELEMENT title
TEXT
content=MIT - Massachusetts Institute of Technol...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=keywords
ATTRIBUTE content
TEXT
content=Massachusetts Institute of Technology, M...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=description
ATTRIBUTE content
TEXT
content=MIT is devoted to the advancement of kno...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=robots
ATTRIBUTE content
TEXT
content=index,follow,noodp,noydir
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=allow-search
ATTRIBUTE content
TEXT
content=yes
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=language
ATTRIBUTE content
TEXT
content=en
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=distribution
ATTRIBUTE content
TEXT
content=global
TEXT
content=
ELEMENT meta
ATTRIBUTE http-equiv
TEXT
content=content-type
ATTRIBUTE content
TEXT
content=text/html; charset=UTF-8