Hi jmb and Kinnson,
After some work on the libDOM string type conversion, I have come
across some new thoughts, and I think they may affect how we design
dom_string.
At the center of the dom_string problem is whether we should have one
type of string or two distinct types of strings.
While programming, I often got stuck on whether a given field should
be an lwc_string or a dom_cdata_string.
Take the Node interface for example. I changed nodeName to
lwc_string, and when I came across nodeValue, I changed it to
lwc_string, too. But later, when I read the W3C DOM Level 3 Core
document, I found that nodeValue should be a dom_cdata_string,
because the nodeValue attribute acts like a "virtual attribute". I
mean that when the Node is an Attr node, nodeValue is equal to the
attribute's value, which is an lwc_string. But when the Node is a Text
node, nodeValue is equal to Text.data, which is a
dom_cdata_string. So the only solution is to make this field a
dom_cdata_string. The DOM interface treats all strings as one
type, but we split them into two distinct types, and I think this will
cause us more problems like the one above later.
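To make the "virtual attribute" point concrete, here is a minimal
sketch of how nodeValue has to dispatch on the node type. The node
struct, the node_type enum and node_get_value are simplified,
hypothetical stand-ins for illustration, not libDOM's real
definitions, and plain char pointers stand in for both string types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node kinds; not libDOM's real enum. */
typedef enum { NODE_ELEMENT, NODE_ATTRIBUTE, NODE_TEXT } node_type;

/* Hypothetical node: plain C strings stand in for both string types. */
typedef struct node {
    node_type type;
    const char *name;       /* nodeName: would be an interned string */
    const char *attr_value; /* Attr only: interned in the current design */
    const char *text_data;  /* Text only: plain character data */
} node;

/*
 * nodeValue is "virtual": the field it reads depends on the node type,
 * so a single return type has to cover both kinds of string.
 * Returns NULL for node types that have no value (e.g. Element).
 */
const char *node_get_value(const node *n)
{
    assert(n != NULL);

    switch (n->type) {
    case NODE_ATTRIBUTE:
        return n->attr_value;
    case NODE_TEXT:
        return n->text_data;
    default:
        return NULL;
    }
}
```

Whatever type node_get_value returns, it must be able to carry both
an attribute value and text data, which is why the two-type split
is awkward at this point in the interface.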
The DOM test suite has problems, too. A test case
may declare:
<var type="DOMString" value="div"/>
How should we deal with this declaration? Do we create an lwc_string
or a dom_cdata_string? Somebody may argue that we should
create an lwc_string because the string's value is "div", which is a
tag name. But I don't think creating different types of string
according to their content is a good idea. So maybe the best way is to
create a cdata_string first, and intern it only when the string is
used as a tag name, attribute name or element id.
This is not only about the DOM Test Suite. Think about when we support
JavaScript. If the user writes:
var name = "div";
How do we deal with this string? I think it should be handled exactly
the same way as the one above.
And that is my point: we use two distinct string types inside DOM,
but only one string type in the DOM public interfaces, and we convert
from cdata_string to an interned string whenever necessary. The
conversion is done inside the corresponding API function.
So we still provide dom_string and hide lwc_string
entirely from the DOM public API. The dom_string will look like:
struct dom_string {
        void *ptr;              /**< Pointer to string data or the
                                 *   lwc_string */
        int len;                /**< Byte length of string */
        lwc_context *ctx;       /**< The lwc_context of this string, if
                                 *   the string is an lwc_string */
        dom_alloc alloc;        /**< Memory (de)allocation function */
        void *pw;               /**< Client-specific data */
        uint32_t refcnt;        /**< Reference count */
};
I think the above struct is self-explanatory.
Done this way, our clients never need to know about things like
lwc_context. They can just call:
dom_string *str;
dom_string_create("div", 3, &str);
dom_document_get_element_by_id(str, &ele);
dom_characterdata_append_data(cdata, str);
In dom_document_get_element_by_id, the function will detect
whether str contains an lwc_string, and if not, it will intern
the string data and create one. But
dom_characterdata_append_data just extracts the char * from the
dom_string and appends it to the cdata.
We should also provide dom_string_cmp and other helper functions that
use the lwc_string inside the dom_string to speed up
comparison.
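As a rough sketch of the lazy-interning idea, the code below uses a
toy intern table in place of a real lwc_context. All of the names
here (toy_intern, sketch_string and its functions) are hypothetical
and only illustrate the behaviour; they are not the libwapcaplet or
libDOM API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy intern table standing in for an lwc_context (hypothetical). */
#define INTERN_MAX 64
static char *intern_data[INTERN_MAX];
static size_t intern_len[INTERN_MAX];
static size_t intern_count;

/* Return the single stored copy of data[0..len), or NULL on failure. */
static const char *toy_intern(const char *data, size_t len)
{
    for (size_t i = 0; i < intern_count; i++) {
        if (intern_len[i] == len &&
            memcmp(intern_data[i], data, len) == 0)
            return intern_data[i];
    }
    if (intern_count == INTERN_MAX)
        return NULL;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, data, len);
    copy[len] = '\0';
    intern_data[intern_count] = copy;
    intern_len[intern_count] = len;
    intern_count++;
    return copy;
}

/* Simplified string: owns raw bytes, caches the interned form on demand. */
typedef struct sketch_string {
    char *data;
    size_t len;
    const char *interned; /* NULL until an API first needs it */
} sketch_string;

sketch_string *sketch_string_create(const char *data, size_t len)
{
    sketch_string *s = malloc(sizeof(*s));
    if (s == NULL)
        return NULL;
    s->data = malloc(len + 1);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->data, data, len);
    s->data[len] = '\0';
    s->len = len;
    s->interned = NULL;
    return s;
}

void sketch_string_destroy(sketch_string *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}

/* What a get_element_by_id-style API would call: intern on first use. */
const char *sketch_string_get_interned(sketch_string *s)
{
    assert(s != NULL);
    if (s->interned == NULL)
        s->interned = toy_intern(s->data, s->len);
    return s->interned;
}

/* Pointer comparison when both sides are interned, bytes otherwise. */
bool sketch_string_equal(const sketch_string *a, const sketch_string *b)
{
    if (a->interned != NULL && b->interned != NULL)
        return a->interned == b->interned;
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

An append-style function would read only data and len and never
touch the intern table, so character data is never interned unless
some other API asks for it.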
Internally, we can still use two distinct types of strings. I mean,
we declare the various fields of the DOM interfaces as either
lwc_string or dom_string. The ones we are sure should be interned we
declare as lwc_string, and the ones we do not want to intern, like
nodeValue, we declare as dom_string.
This way, we keep an interface consistent with the DOM spec,
maintain our performance, and stay flexible
internally. We can change an interned string to a cdata_string, or
the reverse, without disturbing our clients. And because we hide
lwc_*, we are free to decide how many lwc_contexts to use. I mean,
maybe one constant lwc_context for all tag names, attribute names and
enum attribute values, and another lwc_context for ids, class names
and so on. Of course, this is just a quick thought: smaller
lwc_contexts may speed up our matching. :)
Writing such a long article in English is really not an easy
job. I just want to keep our API simple and our code base simple, so
please send me as many questions as you can, and feel free to point
out any mistakes I have made. Thanks for your advice!
Regards!
Bo