On Sat, 2009-04-25 at 21:19 +0800, Bo Yang wrote:
> There may be hundreds and thousands of strings; interning all of them
> will cause more collisions in the hash table and waste a lot of memory.

Indeed.

> 2. Which type of string should we intern?
> Following from point 1, I propose to intern strings optionally. We
> should use lwc_string to store the strings which appear multiple times
> and which need to be compared frequently. Generally, I mean HTML tag
> names, attribute names, and enum-like attribute values (such as the
> values of the "display" attribute, which include "inline", "block",
> "inline-block"...).
> The id attribute value and the class attribute value are also good
> candidates for interning.

I wouldn't say "optionally".
In order for libcss, hubbub and libdom to work sensibly together, the
attribute names, tag names, and many of the attribute values *MUST* be
interned for rapid selection, tree processing etc. Thus at minimum, I'd
say tag names, attribute names and attribute values should be interned.
The wapcaplet context can always be built to have a larger hash chain
count if it turns out to be an overhead we can't ignore.
However, I don't think CDATA and the like need to be interned at all.
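
To illustrate why interning makes selection and tree processing fast, here is a minimal sketch of a toy interner. It is not the libwapcaplet API: `toy_ctx`, `toy_istr`, `toy_intern` and `TOY_BUCKETS` are hypothetical names made up for this example. The point is only that once two strings have been interned, comparing them is a pointer comparison, and that the bucket count controls how long the hash chains get.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy interner for illustration; NOT the libwapcaplet API. */
typedef struct toy_istr {
    struct toy_istr *next;  /* next entry in the same hash chain */
    size_t len;             /* length in bytes, excluding the NUL */
    char data[];            /* string bytes, NUL-terminated for convenience */
} toy_istr;

/* A larger bucket count gives shorter chains at the cost of more memory. */
#define TOY_BUCKETS 64u

typedef struct {
    toy_istr *buckets[TOY_BUCKETS];  /* must start zero-initialised */
} toy_ctx;

/* 32-bit FNV-1a hash over the string bytes. */
static uint32_t toy_hash(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the unique entry for these bytes, creating it on first use.
 * Returns NULL if memory allocation fails. */
static const toy_istr *toy_intern(toy_ctx *ctx, const char *s, size_t len)
{
    assert(ctx != NULL && s != NULL);
    toy_istr **head = &ctx->buckets[toy_hash(s, len) % TOY_BUCKETS];

    /* Walk the chain: this is the only place bytes are compared. */
    for (toy_istr *e = *head; e != NULL; e = e->next) {
        if (e->len == len && memcmp(e->data, s, len) == 0)
            return e;
    }

    toy_istr *e = malloc(sizeof(*e) + len + 1);
    if (e == NULL)
        return NULL;
    e->len = len;
    memcpy(e->data, s, len);
    e->data[len] = '\0';
    e->next = *head;  /* push onto the front of the chain */
    *head = e;
    return e;
}
```

With a context declared as `toy_ctx ctx = {0};`, interning "div" twice yields the same pointer, so a selector match on tag name reduces to `a == b` with no byte comparison. The sketch never frees entries; a real library would pair this with reference counting.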

> 3. When to intern strings?
> I think the best time to create an lwc_string is when the web page is
> being parsed. The hubbub parser should create an lwc_string when it
> comes across the above types of strings. I propose this because, if
> hubbub does not create an lwc_string, we have to create one in libDOM,
> and that requires scanning the string twice (once when the page is
> parsed in hubbub and once in libDOM for interning), which is of course
> not efficient.
> For the XML parser, the only way left for us is to intern strings in
> libDOM; I mean, in the callbacks of the libDOM binding.

Certainly hubbub will be interning strings which ought to be interned as
it goes. As for the libxml binding, it'd be the responsibility of the
binding to intern them before giving them to libdom.
All libdom will need to do is ensure it increases the refcount on an
lwc_string if it stores a pointer to it in a struct (and decreases the
refcount when it frees the container, obviously).
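
The ownership rule above can be sketched as follows. This is a toy model under stated assumptions, not libdom or libwapcaplet code: `toy_rcstr`, `toy_ref`, `toy_unref` and `toy_attr` are hypothetical names, and the sketch only tracks the count rather than freeing the string when it reaches zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an interned, reference-counted string. */
typedef struct {
    unsigned refs;     /* number of owners currently holding this string */
    const char *text;  /* the string bytes (owned elsewhere in this sketch) */
} toy_rcstr;

/* Take a reference: called by anything that stores the pointer. */
static toy_rcstr *toy_ref(toy_rcstr *s)
{
    assert(s != NULL);
    s->refs++;
    return s;
}

/* Drop a reference. A real implementation would free the string when
 * the count reaches zero; this sketch only decrements the count. */
static void toy_unref(toy_rcstr *s)
{
    assert(s != NULL && s->refs > 0);
    s->refs--;
}

/* Hypothetical container that keeps a pointer to an interned string. */
typedef struct {
    toy_rcstr *name;
} toy_attr;

/* Storing the pointer in the struct means taking a reference. */
static toy_attr *toy_attr_create(toy_rcstr *name)
{
    toy_attr *a = malloc(sizeof(*a));
    if (a == NULL)
        return NULL;
    a->name = toy_ref(name);
    return a;
}

/* Freeing the container means giving that reference back. */
static void toy_attr_destroy(toy_attr *a)
{
    assert(a != NULL);
    toy_unref(a->name);
    free(a);
}
```

The binding that handed the string over keeps its own reference and releases it whenever it likes; the container's reference keeps the string alive for as long as the container exists.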

> 4. What does the dom_string look like?
> I propose to change a little:

Rather than change a little, my counterproposal is to change a lot.
Refactoring code doesn't take a vast amount of time, and by ensuring we
hit *every* use of a string, we can be sure we've considered everything
in libdom appropriately.
Thusly I propose:

Anywhere dom_string is currently used for tag names, attribute names or
attribute values, it is changed to directly use lwc_string.

Anywhere dom_string is currently used for CDATA and the like, it is
changed to dom_cdata_string (whose structure is identical to the current
dom_string).

We remove dom_string entirely.

This means we will catch everything in one fell swoop. It'll be painful
for a couple of days, but in the long-run will be superior.
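
As a rough picture of what the split could look like in a node definition, here is a sketch with made-up types. Everything in it is an assumption for illustration: `toy_interned` stands in for lwc_string, `toy_cdata_string` stands in for the proposed dom_cdata_string, and the pointer-plus-length layout is a guess, since the current dom_string structure is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an interned string handle (e.g. lwc_string).
 * Only its address matters in this sketch. */
struct toy_interned {
    int unused;
};

/* Hypothetical stand-in for dom_cdata_string: plain UTF-8 bytes with a
 * length, never interned. The real layout may differ. */
typedef struct {
    const uint8_t *data;
    size_t len;
} toy_cdata_string;

/* Names live in interned handles... */
typedef struct {
    const struct toy_interned *tag_name;
} toy_element;

/* ...while character data stays in the non-interned string type. */
typedef struct {
    toy_cdata_string content;
} toy_text;

/* Interned names compare by pointer identity. */
static int toy_same_tag(const toy_element *a, const toy_element *b)
{
    assert(a != NULL && b != NULL);
    return a->tag_name == b->tag_name;
}

/* Character data still needs a byte-wise comparison. */
static int toy_cdata_equal(const toy_cdata_string *a,
                           const toy_cdata_string *b)
{
    assert(a != NULL && b != NULL);
    if (a->len != b->len)
        return 0;
    return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}
```

Keeping the two kinds of string in distinct types means the compiler flags any call site that still mixes them up, which is one way a wholesale refactor catches every use.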

> 5. Some more considerations...
> Thinking about the strings, I also wonder how we should store a
> character. At the moment we use uint8_t in libDOM, but I think we
> should use UTF-16 encoding in DOM, and use uint16_t to replace uint8_t.

Everything in the new libraries is based around UTF-8. Also, UTF-8 makes
sense, where UTF-16 just feels like a kludge on top of Microsoft's
insane fuckup wrt. their resource strings.

> I hope I have expressed my idea clearly. If anybody is confused,
> please shoot any questions to me. Any criticism and advice will be
> appreciated very much! Thanks!

You were clear, but I fear slightly misguided.
Remember, a big refactor now can reduce effort in the long run. Don't be
afraid to change the API. Until the first release of libdom, the API
should be considered entirely fluid and subject to change so that we can
get it to be as right as possible. Once we start integrating it into
NetSurf proper, it will be much much harder to change.
Regards,
Daniel.
--
Daniel Silverstone
http://www.digital-scurf.org/
PGP mail accepted and encouraged. Key Id: 2BC8 4016 2068 7895