Saturday, August 4, 2007

Erlang Example

Here is a bit of an Erlang program that I wrote recently. It does not show any of the really nice features that make Erlang ideal for writing highly concurrent fault tolerant systems but it does illustrate many of the basic features of the language so I thought it would be interesting to those without much Erlang experience.

The program's purpose is to convert a file from the Moby Words project that contains English Parts of Speech information to an Erlang representation. The language has a very clean and intuitive syntax (IMO) and you may be able to guess its basic operation before reading my explanation below.
1 -module(moby_pos).
2 -export([convert/1]).
3 %-compile(export_all).
5 convert(File) ->
6 {ok, Device} = file:open(File,read),
7 process_words(Device,[]).
10 process_words(Device, Result) ->
11 case io:get_line(Device,'') of
12 eof -> Result;
13 Rec ->
14 {Word, POSList} = parse_record(Rec),
15 process_words(Device, [ [ {Word, P, Q} || {P, Q} <- POSList] | Result])
16 end.
18 parse_record(Record) ->
19 [Word,PosChars] = string:tokens(Record,[215]),
20 {Word, parse_pos(PosChars,[])}.
22 parse_pos([],Result) -> Result;
23 parse_pos([$\n],Result) -> Result;
24 parse_pos([P|R], Result) ->
25 parse_pos(R, [classify(P) | Result]).
28 classify($N) -> {noun,simple};
29 classify($p) -> {noun,plural};
30 classify($h) -> {noun,phrase};
31 classify($V) -> {verb,participle};
32 classify($t) -> {verb,transitive};
33 classify($i) -> {verb,intransitive};
34 classify($A) -> {adjective,none};
35 classify($v) -> {adverb,none};
36 classify($C) -> {conjunction,none};
37 classify($P) -> {preposition,none};
38 classify($!) -> {interjection,none};
39 classify($r) -> {pronoun,none};
40 classify($D) -> {article,definite};
41 classify($I) -> {article,indefinite};
42 classify($o) -> {nominative,none};
43 classify(X) -> {X,error}.
Lines 1 - 3 are module attributes that define the module name and what is exported. The % character is used for comments. Line 3 is commented out because it is used only for debugging to export everything.

Line 5 is the definition of a function. Variable names always begin with uppercase. So the function takes one arg which is a file name. The -> characters are called arrow and imply a function is a transformation.

Line 6 illustrates a few concepts. First { and } are used to define tuples which are fixed length lists of Erlang terms. On this line we define a tuple consisting of the atom ok and the variable Device. Atoms are constants that are represented very efficiently by Erlang. Here we see the first use of = which is not an assignment in the traditional procedural language sense but a pattern matching operator. It succeeds if the left and right hand sides can be matched. Here, we are counting on the fact that file:open(File,read) returns a tuple either {ok, IoDevice} or {error, Reason}. If it returns the former than the match succeeds and Device variable becomes bound otherwise the match fails and the program will abort. There are of course more sophisticated ways to handle errors but we won't touch on those here.

Lines 10-16 illustrate a recursive function that uses a case expression. Each case is a pattern match. Here we are counting on the fact that io:get_line(Device,'') returns the atom eof or the next line as a string that will get bound to the variable Rec.

Line 15 is a bit dense so lets consider it piece by piece.

15 process_words(Device, [ [ {Word, P, Q} || {P, Q} <- POSList] | Result]).

First thing you need to know is that single | is used to construct lists form a head and another list. It is equivalent in this usage to cons(A,B) in Lisp. So we are building a list whose new first element is [ {Word, P, Q} || {P, Q} <- POSList]. This expression using double || and <- is called a list comprehension. It is a concise way of building a new list from an expression and a existing list. Here we are taking POSList and map each element (which are tuples of size 2) to the variables P and Q and building a resulting triplet {Word, P, and Q} where Word is an English word, P is a part of speech and Q is some qualifier to the part of speech.

Lines 22-43 show how Erlang allows the definition of multi-part functions by exploiting pattern matching. For example, the function classify is a multi-part function defined in terms of single character matches. The $ notation means the ASCII value of the . So classify is a simple map from a single character encoding used by Moby Words to a tuple consisting of two atoms designating a primary part of speech (e.g., verb) and a qualifier (e.g., transitive).

One important detail of Erlang is that it supports tail recursive optimization so tail recursive functions are very space efficient. You can see that all the recursive functions defined in this program are tail recursive.

On my system it took Erlang ~3.6 seconds to process ~234000 words in the Moby file or about 15 uSec per entry.

1 comment:

fusiongyro said...

Thanks for the insight! However, you linked to Moby Words II, but I think you meant to link to Moby Part-of-Speech II.