Hello,
I'm one of the Google Summer of Code participants this year. More
specifically, I'll be working on a new HTML parser for NetSurf with
John-Mark Bell. What exactly an HTML parser does may not be immediately
apparent, so here's a rough explanation of what I'll be doing. :)
In one line: an HTML parser takes an HTML file and turns it into a
structure in memory that can be used by the display engine to render to
the screen.
In a little more detail: an HTML file usually looks something like:
<html>
<title>A page</title>
<h1>A heading</h1>
<p>A paragraph: some <b>bold text!</b>. And this is
normal</p>
<p>Another paragraph</p>
</html>
The parser takes the HTML file and turns it into a tree-shaped structure in
RAM which the display engine can navigate much more easily than just
trying to read the above text. For the above file, you might get
something looking like: (use your imaginations!)
+ <html>
  + <title>
  |   A page
  + <h1>
  |   A heading
  + <p>
  |   A paragraph: some
  |   + <b>
  |   |   bold text!
  |   . And this is normal
  + <p>
      Another paragraph
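To make the diagram a little more concrete, here is a minimal sketch in C
of how such a tree could be held in memory. Every name in it (struct node,
node_new, node_append, build_example_tree, print_tree, node_free) is made
up for illustration and is not NetSurf's actual data structure; the tree
is also built by hand here rather than by a parser.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node type: an element (<p>, <b>, ...) or a run of text. */
enum node_kind { NODE_ELEMENT, NODE_TEXT };

struct node {
    enum node_kind kind;
    const char *value;          /* tag name for elements, contents for text */
    struct node *first_child;
    struct node *last_child;
    struct node *next_sibling;
};

/* Allocate a zeroed node; exits on allocation failure to keep this short. */
struct node *node_new(enum node_kind kind, const char *value)
{
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    n->kind = kind;
    n->value = value;
    return n;
}

/* Append child as the last child of parent and return the child. */
struct node *node_append(struct node *parent, struct node *child)
{
    assert(parent != NULL && child != NULL);
    if (parent->last_child != NULL)
        parent->last_child->next_sibling = child;
    else
        parent->first_child = child;
    parent->last_child = child;
    return child;
}

/* Build, by hand, the tree from the diagram for the example page. */
struct node *build_example_tree(void)
{
    struct node *html = node_new(NODE_ELEMENT, "html");

    struct node *title = node_append(html, node_new(NODE_ELEMENT, "title"));
    node_append(title, node_new(NODE_TEXT, "A page"));

    struct node *h1 = node_append(html, node_new(NODE_ELEMENT, "h1"));
    node_append(h1, node_new(NODE_TEXT, "A heading"));

    struct node *p1 = node_append(html, node_new(NODE_ELEMENT, "p"));
    node_append(p1, node_new(NODE_TEXT, "A paragraph: some "));
    struct node *b = node_append(p1, node_new(NODE_ELEMENT, "b"));
    node_append(b, node_new(NODE_TEXT, "bold text!"));
    node_append(p1, node_new(NODE_TEXT, ". And this is normal"));

    struct node *p2 = node_append(html, node_new(NODE_ELEMENT, "p"));
    node_append(p2, node_new(NODE_TEXT, "Another paragraph"));

    return html;
}

/* Print a node and its descendants, indented two spaces per level. */
void print_tree(const struct node *n, int depth)
{
    if (n->kind == NODE_ELEMENT)
        printf("%*s<%s>\n", depth * 2, "", n->value);
    else
        printf("%*s\"%s\"\n", depth * 2, "", n->value);

    for (const struct node *c = n->first_child; c != NULL; c = c->next_sibling)
        print_tree(c, depth + 1);
}

/* Free a node and everything below it. */
void node_free(struct node *n)
{
    struct node *c = n->first_child;
    while (c != NULL) {
        struct node *next = c->next_sibling;
        node_free(c);
        c = next;
    }
    free(n);
}
```

Calling build_example_tree(), passing the result to print_tree(root, 0)
and then to node_free(root) prints one node per line, indented by depth.
A real parser's tree would carry more detail (attributes, for one), so
treat this purely as a picture of the shape.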
Each start tag (the bits in angle brackets, <>) gets its own branch of
the tree. This is useful for a whole variety of reasons: for example,
now it's really easy to see what should be highlighted if you wanted the
display engine to display in bright red all paragraphs (<p> tags). You
just run down the tree, and find all the branches that are marked as
paragraphs.
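As a rough illustration of "running down the tree", the sketch below walks
a hand-built copy of the example tree and marks every element with a given
tag name. The struct layout, the highlight flag and the function names are
all assumptions made for this example; the flag simply stands in for
whatever a display engine would really do to turn a paragraph red.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tree node; not NetSurf's real structure. */
struct node {
    const char *tag;            /* element name, or NULL for a text node */
    const char *text;           /* text contents, or NULL for an element */
    struct node *first_child;
    struct node *next_sibling;
    int highlight;              /* set to 1 when the node should be marked */
};

/*
 * Depth-first walk: mark every element under n (including n itself)
 * whose tag equals the one asked for, and return how many were marked.
 */
size_t highlight_elements(struct node *n, const char *tag)
{
    size_t found = 0;

    if (n == NULL)
        return 0;

    if (n->tag != NULL && strcmp(n->tag, tag) == 0) {
        n->highlight = 1;
        found++;
    }

    for (struct node *c = n->first_child; c != NULL; c = c->next_sibling)
        found += highlight_elements(c, tag);

    return found;
}

/*
 * Build the example page's tree on the stack (leaves first, so each node
 * can point at ones already declared), then highlight the given tag.
 */
size_t highlight_in_example(const char *tag)
{
    struct node p2_text    = { NULL, "Another paragraph", NULL, NULL, 0 };
    struct node p2         = { "p", NULL, &p2_text, NULL, 0 };

    struct node p1_tail    = { NULL, ". And this is normal", NULL, NULL, 0 };
    struct node b_text     = { NULL, "bold text!", NULL, NULL, 0 };
    struct node b          = { "b", NULL, &b_text, &p1_tail, 0 };
    struct node p1_head    = { NULL, "A paragraph: some ", NULL, &b, 0 };
    struct node p1         = { "p", NULL, &p1_head, &p2, 0 };

    struct node h1_text    = { NULL, "A heading", NULL, NULL, 0 };
    struct node h1         = { "h1", NULL, &h1_text, &p1, 0 };

    struct node title_text = { NULL, "A page", NULL, NULL, 0 };
    struct node title      = { "title", NULL, &title_text, &h1, 0 };

    struct node html       = { "html", NULL, &title, NULL, 0 };

    size_t found = highlight_elements(&html, tag);
    printf("marked %zu <%s> element(s)\n", found, tag);
    return found;
}
```

For instance, highlight_in_example("p") visits the whole tree once and
marks the two paragraph branches. Doing the same job on the raw text would
mean re-reading the angle brackets every time, which is exactly the work
the parser is there to do once, up front.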
In other words, the parser turns the document from the form its author
wrote it in (a kind of text document) into a logical structure in memory,
far more suitable for manipulation.
Now, there's already a parser being used in NetSurf, or it wouldn't be
displaying anything at all, so why am I interested in writing a new one?
Well, the current one isn't really very good at dealing with
badly-written documents. If you like, it's like a passable Microsoft
Word file converter: you get most of the meaning out of the document but
some bits of it aren't *quite* right. This is because most people who
write HTML just test in one browser, and since how to parse HTML has
never been formally defined, all the browsers do it slightly
differently. Actually, HTML parsing in the major non-IE browsers
(Safari, Firefox, Opera) is mostly reverse-engineered from IE's
behaviour, but reverse-engineering is both (a) not very easy and (b) very
error-prone, so they're all quite inconsistent.
NetSurf's current parser just hasn't had the time spent on it that the
big web browsers' parsers have, which is over a decade of work. It
would be silly of me to suggest, then, that in one summer, someone could
reverse-engineer all these browsers and write a brand new parser that
parsed every page like other browsers did.
It's lucky, then, that most of the reverse-engineering work is already
done. :) There's a new version of HTML on the horizon, HTML5, and its
editor has spent many years looking at how the different browsers go
about parsing badly-written documents. The draft specification includes
very carefully-written rules that combine the best aspects of each
browser's parser.
My job, then, is to implement the parsing bit of HTML5. In doing this,
NetSurf gets that little bit closer to the major browsers, and the world
gets a new HTML parser written in the C programming language, which can
hopefully be reused by many other projects over the course of time.
I hope I've explained myself well enough, and I look forward to helping
get NetSurf's users a better-parsed web. :)
Cheers,
--
Andrew Sidwell
Thank you for this. It's very interesting to see what goes on under the
bonnet - even though I'm only going to use the result of your work, rather
than the work itself.
--
David Wild using RISC OS on broadband
www.davidhwild.me.uk
In message <4816748F.3090508(a)entai.co.uk>
Andrew Sidwell <andy(a)entai.co.uk> wrote:
> Hello,
> I'm one of the Google Summer of Code participants this year. More
> specifically, I'll be working on a new HTML parser for NetSurf with
> John-Mark Bell.
Welcome to this noble cause. You'll be working with an excellent
team. I'd like to thank you in advance for all the good work that
you will no doubt do over the summer, and, maybe, beyond - who
knows!
Dave