Wednesday, August 8, 2007
Wisdi has a Wiki!
Monday, August 6, 2007
Ontoworld
Saturday, August 4, 2007
Erlang Example
The program's purpose is to convert a file from the Moby Words project that contains English Parts of Speech information to an Erlang representation. The language has a very clean and intuitive syntax (IMO) and you may be able to guess its basic operation before reading my explanation below.
 1 -module(moby_pos).
 2 -export([convert/1]).
 3 %-compile(export_all).
 4
 5 convert(File) ->
 6     {ok, Device} = file:open(File,read),
 7     process_words(Device,[]).
 8
 9
10 process_words(Device, Result) ->
11     case io:get_line(Device,'') of
12         eof -> Result;
13         Rec ->
14             {Word, POSList} = parse_record(Rec),
15             process_words(Device, [ [ {Word, P, Q} || {P, Q} <- POSList] | Result])
16     end.
17
18 parse_record(Record) ->
19     [Word,PosChars] = string:tokens(Record,[215]),
20     {Word, parse_pos(PosChars,[])}.
21
22 parse_pos([],Result) -> Result;
23 parse_pos([$\n],Result) -> Result;
24 parse_pos([P|R], Result) ->
25     parse_pos(R, [classify(P) | Result]).
26
27
28 classify($N) -> {noun,simple};
29 classify($p) -> {noun,plural};
30 classify($h) -> {noun,phrase};
31 classify($V) -> {verb,participle};
32 classify($t) -> {verb,transitive};
33 classify($i) -> {verb,intransitive};
34 classify($A) -> {adjective,none};
35 classify($v) -> {adverb,none};
36 classify($C) -> {conjunction,none};
37 classify($P) -> {preposition,none};
38 classify($!) -> {interjection,none};
39 classify($r) -> {pronoun,none};
40 classify($D) -> {article,definite};
41 classify($I) -> {article,indefinite};
42 classify($o) -> {nominative,none};
43 classify(X) -> {X,error}.
Lines 1-3 are module attributes that define the module name and what is exported. The % character is used for comments. Line 3 is commented out because it is used only for debugging, to export everything.
Line 5 is the definition of a function. Variable names always begin with an uppercase letter, so the function takes one argument, which is a file name. The -> characters are called an arrow and suggest that a function is a transformation.
Line 6 illustrates a few concepts. First, { and } are used to define tuples, which are fixed-length lists of Erlang terms. On this line we define a tuple consisting of the atom ok and the variable Device. Atoms are constants that are represented very efficiently by Erlang. Here we see the first use of =, which is not an assignment in the traditional procedural language sense but a pattern matching operator. It succeeds if the left- and right-hand sides can be matched. Here, we are counting on the fact that file:open(File,read) returns a tuple, either {ok, IoDevice} or {error, Reason}. If it returns the former, then the match succeeds and the variable Device becomes bound; otherwise the match fails and the program will abort. There are of course more sophisticated ways to handle errors but we won't touch on those here.
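To make the two outcomes of a match concrete, here is a minimal sketch. The module name open_demo and the helpers match_demo/1 and safe_open/1 are hypothetical and not part of the original program; the second helper shows one simple way to handle the {error, Reason} case explicitly instead of letting the match fail.

```erlang
-module(open_demo).
-export([match_demo/1, safe_open/1]).

%% Hypothetical helper: try to match Result against {ok, Value}.
%% A failed match raises a badmatch error, which is caught here so
%% the two outcomes can be compared side by side.
match_demo(Result) ->
    try
        {ok, Value} = Result,
        {matched, Value}
    catch
        error:{badmatch, Other} -> {no_match, Other}
    end.

%% Hypothetical helper: handle both result shapes of file:open
%% with a case expression rather than a single match.
safe_open(File) ->
    case file:open(File, [read]) of
        {ok, Device} ->
            %% The match succeeded, so Device is bound to the I/O device.
            ok = file:close(Device),
            {ok, opened};
        {error, Reason} ->
            %% Reason is an atom describing the failure.
            {error, Reason}
    end.
```

With this sketch, open_demo:match_demo({ok, 1}) gives {matched, 1}, while open_demo:match_demo({error, enoent}) gives {no_match, {error, enoent}}.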
Lines 10-16 illustrate a recursive function that uses a case expression. Each case is a pattern match. Here we are counting on the fact that io:get_line(Device,'') returns the atom eof or the next line as a string that will get bound to the variable Rec.
Line 15 is a bit dense so let's consider it piece by piece.
15 process_words(Device, [ [ {Word, P, Q} || {P, Q} <- POSList] | Result]).
The first thing you need to know is that a single | is used to construct a list from a head and another list. In this usage it is equivalent to cons(A,B) in Lisp. So we are building a list whose new first element is [ {Word, P, Q} || {P, Q} <- POSList]. This expression, using the double || and <-, is called a list comprehension. It is a concise way of building a new list from an expression and an existing list. Here we take POSList, match each element (a tuple of size 2) against the variables P and Q, and build a resulting triplet {Word, P, Q} where Word is an English word, P is a part of speech and Q is some qualifier to the part of speech.
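The two pieces of line 15 can be tried in isolation. This is a small sketch; the module name lc_demo and the functions tag/2 and prepend/2 are hypothetical names used only for illustration.

```erlang
-module(lc_demo).
-export([tag/2, prepend/2]).

%% Same comprehension shape as line 15: turn each {P, Q} pair in
%% POSList into a {Word, P, Q} triple.
tag(Word, POSList) ->
    [{Word, P, Q} || {P, Q} <- POSList].

%% [Head | Tail] builds a new list with Head in front of Tail.
prepend(Head, Tail) ->
    [Head | Tail].
```

For example, lc_demo:tag("run", [{verb,intransitive},{noun,simple}]) returns [{"run",verb,intransitive},{"run",noun,simple}], and lc_demo:prepend(a, [b,c]) returns [a,b,c].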
Lines 22-43 show how Erlang allows the definition of multi-part functions by exploiting pattern matching. For example, the function classify is a multi-part function defined in terms of single character matches. The $ prefix denotes a character literal, so $N is the character N.
One important detail of Erlang is that it supports tail recursive optimization so tail recursive functions are very space efficient. You can see that all the recursive functions defined in this program are tail recursive.
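To illustrate the difference, here is a small sketch contrasting a function that is not tail recursive with one that is. The module tail_demo and both functions are hypothetical examples, not taken from the program above.

```erlang
-module(tail_demo).
-export([sum_body/1, sum_tail/1]).

%% Not tail recursive: the addition happens after the recursive call
%% returns, so every element needs its own pending stack frame.
sum_body([]) -> 0;
sum_body([H | T]) -> H + sum_body(T).

%% Tail recursive: the recursive call is the last thing the function
%% does, and the running total travels in the accumulator Acc.
sum_tail(List) -> sum_tail(List, 0).

sum_tail([], Acc) -> Acc;
sum_tail([H | T], Acc) -> sum_tail(T, Acc + H).
```

Both functions return 6 for the list [1,2,3]. The accumulator style is the same one process_words and parse_pos use with their Result argument.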
On my system it took Erlang ~3.6 seconds to process ~234000 words in the Moby file or about 15 uSec per entry.
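A per-entry figure like this can be measured with timer:tc, which returns the elapsed time in microseconds together with the function's result. A rough sketch, assuming a hypothetical helper module time_demo that is not part of the original program:

```erlang
-module(time_demo).
-export([avg_micros/2]).

%% Hypothetical helper: run Fun once and return the average number of
%% microseconds spent per item, given the number of items processed.
avg_micros(Fun, Count) when Count > 0 ->
    {Micros, _Result} = timer:tc(Fun),
    Micros / Count.
```

For the program above, a call along the lines of time_demo:avg_micros(fun() -> moby_pos:convert(File) end, 234000) would give the average per word, where File is the path to the Moby file. The numbers will of course vary from system to system.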
Another Road to Web 3.0
To achieve a semantic web one needs technologies like XML, RDF, OWL and others applied at the source of the data. The Semantic Web is a distributed knowledge model. A Wisdi in contrast is a centralized knowledge model.
The past 15 or so years of computing practice have instilled a dogma that "distributed = good" and "centralized = bad". However, in the case of the goals of Web 3.0, a centralized approach may be a more viable model.
One of the major objections to the semantic web is that people are "too lazy and too stupid" to reliably mark up their web pages with semantic information. Here is where a Wisdi can come to the rescue. A fully realized Wisdi will have a rich store of knowledge about the world and the relationships between things in the world. Together with a natural language parser, a Wisdi can provide a "Semantic Markup Service" that will automate Web 3.0. Initially, this capability might still require some cooperation from human creators of web pages. For example, it will be quite some time before a Wisdi can deal with disambiguation problems with high degrees of reliability. However, requiring a bit of meta from content producers is a more viable model than asking them for the whole thing.
What do you think?
Friday, July 27, 2007
Wisdi v0.1
So Wisdi v0.1 will provide the following services:
- A Part of Speech Service - given a word it will classify it as an adjective, adverb, interjection, noun, verb, auxiliary verb, determiner, pronoun, etc.
- A Verb Conjugation Service - given a verb it will provide the Past Simple, Past Participle, 3rd person singular, Present Participle and plural forms.
I plan to use JSON-RPC for my service interface because I think SOAP is way too heavy and because there are Erlang implementations available.
I believe this goal is simple enough to get something working quickly (although sadly this weekend I have personal obligations) but rich enough to play around with a bunch of ideas before moving to more interesting knowledge bases.
Wednesday, July 25, 2007
Adjectives and Verbs
In some recent posts I have argued that natural languages (NL) are programming languages in the sense that they execute inside of a cognitive computer (mind or software system) to achieve understanding. Another way of expressing this is to think of a NL as a high-level simulation language and understanding as information derived through simulation.
In this model I proposed that verbs and adjectives act like functions. In English, we view verbs and adjectives as quite distinct grammatical categories. Therefore, for an English speaker, it may seem counterintuitive to suggest that underneath the covers verbs and adjectives are the same. However, I find it quite suggestive that there are languages that do not have adjectives and instead use verbs.
Not all languages have adjectives, but most, including English, do. (English adjectives include big, old, and tired, among many others.) Those that don't typically use words of another part of speech, often verbs, to serve the same semantic function; for example, such a language might have a verb that means "to be big", and would use a construction analogous to "big-being house" to express what English expresses as "big house". Even in languages that do have adjectives, one language's adjective might not be another's; for example, where English has "to be hungry" (hungry being an adjective), French has "avoir faim" (literally "to have hunger"), and where Hebrew has the adjective "צריך" (roughly "in need of"), English uses the verb "to need".
Friday, July 20, 2007
Er, which lang to use?
A compromise seemed to be Java 5 (or 6). Generics removed some of my disdain for the language. However, generics are not C++ templates and sometimes having half of something can be more painful than having none.
I also (very briefly) considered Ruby, Python and even Mathematica but none of these would do for reasons that are both logical and admittedly emotional.
This past Wednesday I received a copy, fresh off the press, of Programming Erlang: Software for a Concurrent World by Joe Armstrong. That clinched it for me.
The great thing about Erlang is that I practically knew it already because many of the constructs related to list processing and pattern matching are similar or identical to those in two languages I am comfortable with (Prolog and Mathematica). The second trait was that it is a very clean functional language and I always wanted to do a large development project in a functional language. Third, and unique for functional languages, it lets you go way down close to the metal and manipulate arbitrarily complex binary structures without escaping to C. Fourth, you can escape to C. Fifth, and by far the most important, Erlang will scale due to its elegant concurrency model and it will do so without all the typical headaches associated with writing concurrent code. And finally, I imagine the capability of hot swapping will be welcome when exposing one's creations to the world and getting that first bug report.
Now, Erlang is not perfect; no language is. It sacrifices type safety and does not have a good security model for distributed computing. However, when almost all your D is to drive R then these issues are less important.
So it begins this weekend. Me and Erlang are about to become quite intimate. Time to brew a big pot of coffee. See you on Monday.