Saturday, August 18, 2007

Ambiguity, Disambiguation and KISS

My recent work on the Wisdi Project has me thinking quite a bit about ambiguity. Evolution has obviously provided us humans with an amazing ability to function quite well in the face of ambiguity. In fact, we often fail to perceive ambiguity until it is specifically brought to our attention. Ambiguity can arise in many different contexts and it is instructive to review some of these contexts, although you probably will not find them to be unfamiliar.

Human ability to deal with ambiguity has had some undesirable consequences. Our skill at disambiguation has left a legacy of ambiguous content spewed across the web. While almost all the content of the web was targeted for human consumption, its present vastness and continued exponential growth have made it paramount that machines come to our aid in dealing with it. Unfortunately, ambiguity is the bane of the information architects, knowledge engineers, ontologists and software developers who seek to distill knowledge from the morass of HTML.

Of all the forms of ambiguity mentioned in the above-referenced Wikipedia article, word sense ambiguity is probably the most relevant to further development of search engines and other tools. You may find it instructive to read a survey of the state of the art in Word Sense Disambiguation (circa 1998). There is also a more recent book on the topic here.

An important goal, although certainly not the only goal, of the Semantic Web initiative is to eliminate ambiguity from online content via various ontology technologies such as Topic Maps, RDF, OWL and DAML+OIL. These are fairly heavy-handed technologies, and perhaps it is instructive to consider how far we can proceed with a more lightweight facility.

Keep It Simple Silly

Consider the history of the development of HTML. There are many reasons why HTML was successful, but simplicity was clearly a major one. This quote from Raggett on HTML 4 says it all:
What was needed was something very simple, at least in the beginning. Tim demonstrated a basic, but attractive way of publishing text by developing some software himself, and also his own simple protocol - HTTP - for retrieving other documents' text via hypertext links. Tim's own protocol, HTTP, stands for Hypertext Transfer Protocol. The text format for HTTP was named HTML, for Hypertext Mark-up Language; Tim's hypertext implementation was demonstrated on a NeXT workstation, which provided many of the tools he needed to develop his first prototype. By keeping things very simple, Tim encouraged others to build upon his ideas and to design further software for displaying HTML, and for setting up their own HTML documents ready for access.

Although I have great respect for Tim Berners-Lee, it is somewhat ironic that his proposals for the semantic web seemingly ignore the tried and true principles of KISS that made the web the success it is today. Some may argue that the oversimplicity of the original design of HTML was what got us into this mess, but few who truly understand the history of computing would buy that argument. For better or worse, worse is better (caution, this link is a bit off topic, but interesting nonetheless)!

So, circling back to the start of this post, I have been doing a lot of thinking about ambiguity and disambiguation. The Wisdi Sets subproject hinges on the notion that an element of a set must be unambiguous (referentially transparent). This has me thinking about the role knowledge bases can play in improving the plight of those whose mission it is to build a better web. Perhaps, a very simple technology is all that is needed at the start.

Consider the exceedingly useful HTML span tag. The purpose of this tag is to group inline elements so that they can be styled, typically in conjunction with CSS. Why not also allow span (or a similar tag) to be used to provide the contextual information needed to reduce ambiguity? There are numerous ways this could be accomplished, but to make this suggestion concrete I'll simply propose a new span attribute called context.

I had a great time at the <span context="">rock</span> concert. My favorite moment was when Goth <span context="">Rock</span>
climbed on top of the large <span context="">rock</span>
and did a guitar solo.

It should not be too difficult to guess the intent of the span tags. They act as disambiguation aids for software, like a search engine's web crawler, that might process this page. The idea is that an authoritative site provides standardized URLs for word disambiguation. Now one can argue that authors of content would not take the time to add this markup (and this is essentially the major argument against the Semantic Web), but the simplicity of this proposal clearly lends itself to automation. A web authoring tool or service could easily flag words with ambiguous meaning, and the author would simply point and click to direct the tool to insert the needed tags.
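To make the crawler side concrete, here is a minimal sketch in Python of how software might collect the disambiguation hints from a page. It assumes the proposed context attribute holds a URL; the `senses.example.org` addresses, the class name and the function name are all hypothetical placeholders, not part of any real service or standard.

```python
from html.parser import HTMLParser


class ContextSpanParser(HTMLParser):
    """Collect (text, context URL) pairs from <span context="..."> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (text, context) pairs, in closing order
        self._contexts = []  # context value (or None) for each open <span>
        self._buffers = []   # text gathered so far for each open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._contexts.append(dict(attrs).get("context"))
            self._buffers.append([])

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for buf in self._buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._contexts:
            context = self._contexts.pop()
            text = "".join(self._buffers.pop()).strip()
            if context:  # skip spans with a missing or empty context
                self.pairs.append((text, context))


def extract_contexts(html_text):
    """Return the list of (text, context URL) pairs found in html_text."""
    parser = ContextSpanParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


# Hypothetical URLs standing in for an authoritative disambiguation site.
sample = (
    'I had a great time at the '
    '<span context="https://senses.example.org/rock/music">rock</span> concert, '
    'right next to a large '
    '<span context="https://senses.example.org/rock/stone">rock</span>.'
)
for word, url in extract_contexts(sample):
    print(word, "->", url)
```

A crawler would not need to understand what lives at each URL; it only needs to treat two occurrences with the same context value as the same sense.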

One can debate the merits of overloading the span tag in this way, but the principle is more important than the implementation. The relevant points are:

  1. Familiar low-tech HTML facilities are used.
  2. URLs provide the semantic context via an external service that both search engines and authoring tools can use.
  3. We need not consider here what content exists at those URLs; they simply need to be accepted as definitive disambiguation resources by all parties.
  4. This facility cannot do everything that more sophisticated ontology languages can do, but who cares? Worse is better, after all.
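
The point-and-click authoring workflow mentioned earlier could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SENSES` table stands in for whatever external service would supply the disambiguation URLs, and its entries, the `senses.example.org` addresses and both function names are hypothetical.

```python
import html
import re

# Hypothetical sense inventory: word -> {sense label: disambiguation URL}.
SENSES = {
    "rock": {
        "music": "https://senses.example.org/rock/music",
        "stone": "https://senses.example.org/rock/stone",
    },
}


def flag_ambiguous(text, senses=SENSES):
    """Return (start, end, word) for each word with more than one known sense."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        if len(senses.get(match.group().lower(), {})) > 1:
            hits.append((match.start(), match.end(), match.group()))
    return hits


def annotate(text, choices, senses=SENSES):
    """Wrap flagged words in <span context="..."> using the author's choices.

    choices maps the start offset of a flagged word to a sense label;
    flagged words without a choice are left unmarked.
    """
    out, last = [], 0
    for start, end, word in flag_ambiguous(text, senses):
        label = choices.get(start)
        if label is None:
            continue
        url = senses[word.lower()][label]
        out.append(html.escape(text[last:start]))
        out.append('<span context="%s">%s</span>'
                   % (html.escape(url, quote=True), html.escape(word)))
        last = end
    out.append(html.escape(text[last:]))
    return "".join(out)


text = "A rock concert on a rock."
flags = flag_ambiguous(text)  # what the tool would highlight for the author
marked_up = annotate(text, {flags[0][0]: "music", flags[1][0]: "stone"})
print(marked_up)
```

The author only ever picks a label from a short menu; the tool takes care of the URL and the markup.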
