Dear reader,

This blog is moving to some other hosting service (and other blog platform).

This blog URL will change and the RSS feed will change too; sorry about that.

If you want to follow this blog, please consider browsing http://blog.khigia.net and changing your RSS subscription to http://feeds2.feedburner.com/khigia/blog (or http://feeds2.feedburner.com/khigia/blog/erlang to filter only erlang related posts).

Thanks WordPress.com for both the hosting and the platform you’re providing.

Quite a few times one or the other of my automation shell scripts got hang while running some external command (e.g. a remote shell command may hang if network is in weird state). For those cases where a stopped automation script could really hurt the system, I ended up writing a script wrapper that add a timeout to any command.

This is kind of heavy weight implementation: instead of running the command directly, this start a first process sleeping for the duration of the timeout, and execute the command in a background process, and finally return the result of the first finishing process! 3 processes: 1 working, 1 sleeping, 1 waiting … but it helped me few times!

http://gist.github.com/101192

Can’t embed the gist in this blog 😦

Tweet-tweet here, tweet-tweet there, here tweet, there tweet, … everybody tweet! And you got to choose who to follow or you’ll be quickly overwhelmed!

MrTweet can help you there by proposing you some tweets to follow. How can he achieve that? Through this blog post, let’s focus on the subject of recommending twitter users to follow to existing user.

Disclamer: I have no accointance what’s so ever with MrTweet, and absolutely no knowledge about it, nada; all the following are random thoughts; oh, and I have no accointance with Twitter neither!

Let say you have access to all the tweets of the world [1], and you can know who’s following who on twitter. How would you look for new user to follow?
Obviously there are quite a few possibilities … and some of them are probably yet to be invented. Traditionally, mining this kind of info can be done at 2 levels: (1) looking at the content and recommend to user U1 to follow user U2 if there is any correlation between tweets content of users U1 and U2; (2) second level is to look at the graph on connected communities and recommend user U1 to follow user U2 if they share some links in the graph.
On top of this there are many refinements in what can be done to mine such information: focusing and differenciating static data and dynamic aspects of user behaviours, mixing approaches of content and meta data, looking for time-series correlation … this is a vast domain.

Following are some ideas about different way to implement MrTweet.

The UserRank dictature

Using a Pagerank-like algorithm to explore the graph of followers, you can obtain a global UserRank for all the twitter users.

Why Pagerank? Well, this is probably not the best approach of the problem, as it extract a global knowledge of your data, where in this case we should focus on communities. But heck! I like PageRank, it’s simple, and it’s a good start to think about the problem. PageRank was the first scientific article that I study thoroughly, and it was  a ha-ha moment: simple to read, easy to understand, easy to implement … computer science was possible! No need of linear algebra, matrix and eigenvector to understand it (it helps for the proof though :)). [2]

So how to apply Pagerank to “tweeterers”? Well, following is one way to do it.

For each user consider its followers as mark of interest. The algo define PageRank number Pr of user U as the weighted sum of the Pr of all its followers, where the weight is in inverse proportion of the number of people followed by the follower.

That is, with some notations to make it hopefully clearer:

  • Let Pr(U) be the PageRank of user U;
  • Let Fo(U) be the set of all user that U is following, and [Fo(U)] its cardinality;
  • Let Fi(U) be the set of all user that are following U;
  • PageRank is: Pr(Ux) = Sum(Pr(Un)/[Fo(Un)]), for all  Un in Fi(Ux).

Thus Pr is high if the user is followed by users which have a high Pr and are not following everybody! Easy enough but wait … how to get the Pr of the followers? this is kind of a recursive definition!

The beauty of the PageRank algo is that it propose an simple iterative solution to solve this problem: asume first that everybody as a Pr=Pr_0 of 1/N (N is the number of users, this give everybody a fair starting value), compute the new Pr_1 of each user using Pr_0 assumption, then compute the Pr_2 of each user given your Pr_1 assumption … and iterate a few more times until the Pr stabilize, usually after few computations only (this is not magic but math … which sometimes looks alike). If you like matrix and linear algebra, you can write the whole process as a loop which multiply a the vector of all PageRank by a square matrix of all ‘follow-relations’.

Now, all we have is a Pr for each user? then what? Who shall I follow? Well, use your network to discover it!

Take all the people you are following, grab all the people they follow themselves, and remove your own sibling from this big set of people; sort them by Pr and you’ll get to follow the most “popular” people linked to your network.

We have a UserRank algo to recommend people to follow to any twitter user! Well, all this rely on assumption that being followed give the user some importance. And I don’t know if this works, I haven’t tried it! Maybe it can be done with google API and the twitter page of the user 😉

Let explore another algo with another kind of idea.

Following the Slope

In fact, what we’re looking for with MrTweet, is the same kind of recommendation as Amazon made popular: “users who like this also like that”. Using the same meta-data as previously (the graph of followers), we can rephrase it as “user who follow Ux and Uy are also following Uz”.

Algo to solve those problems are usually classified as recommender systems (see the NetFlix for importance of those kind of algo). A big problem of those solution for web application is often scalability and dynamic nature. But let’s pick one such recommender system and see what can be done for MrTweet problem.

No surprise, let’s look at SlopeOne algorithm. The wikipedia page explain it very simply and I surely can’t do better. All we need to do is map the data to our problem. Let say that if user U1 follow user U2, this means user U1 rate user U2 with weight 1. Doing so, we’ll have only rating of weight 1. This is ok for SlopeOne to give some result though.

The main difference with previous method of UserRank is that this is not a global analysis of all the users but rely more on the network of the user to which you want to do the recommendation.

Wanna real rating value to feed your SlopeOne algo? well, why not use the UserRank you defined previously as the rating value, and compute the SlopeOne prediction based on that? thus you would get as recommendation not the most popular user linked to your network but the most popular user liked by your network! Not so bad! Maybe yes, maybe not … need to be tested!

More than a graph?

We’ve been looking at the graph of followers, which give recommendatons based on “neighborhood”. Let’s open the problem a bit more …

  • instead of using followers, we can look at the graph of replies
  • why not look at the tweet content? we could check correlation in keyword, or in the URL (even following the URL)?
  • why not cross data from twitter with data from other network? … and I’m going to read about this just here: http://33bits.org/2009/03/19/de-anonymizing-social-networks 🙂

I don’t know you, but this kind of problem delight me. But after all this thinking and no action, I’m now going to write some real code to do more than think about algo ;).

[1] With N millions users, 7 tweets a day, 140 bytes a tweet in raw format … storage is not that huge, even for few years. How many twitter users? where does seven come from? who care! engineers only need order of magnitude and units!

[2] Pagerank is closely related to the hub and authorities: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.3875 [3]

[3] I know, all those ref are “old” 🙂

Surprising! as of now, in my experiences of software engineer, I met more people that read/know about TAOCP than people who read Code Complete.

It may be a probability accident (I kind of remember that probabilities and small numbers are not good friends). It may also be influenced by my fields of experiences. But assuming this is real (i.e. programmers know more about TOACP), it’s doesn’t seems right.

I consider TAOCP as a reference to write efficient code; Code Complete is a reference to write good software (eventually being a product maintained by a team). Both are important, but the first one being used way less often!

This blog is not completely abandoned … I’m just buzy lazy those days.

Just to relate my brief come back to roots: doing physics simulation in 3D ;).

I’ve been playing with Irrlicht as 3D framework. It’s good, not to complex. But the API is not always easy (you’re forced to use the irrlicht not-automatic counting ref). I haven’t found doc about coordinate systems (orientation and angle’s units) but it’s using common practices in game engine. It’s pretty fast, and the simple included demos help a lot to get started (Ogre3D don’t come with simple examples).

For the physics engine, I’ve used Bullet, and it’s really pleasant to use! and result are great! I only quickly went through some part of the code, and it’s great code, I think I will learn quite a few tricks of this code (e.g. how to avoid conditional branch (‘if’) to select between 2 unsigned integers … and it even can be usefull for efficiency on some processor like Cell (maybe related to parallelism and pipelining?). No really, this is beautiful code, and if you just want to play a little bit with physics, using Bullet directly is more fun than PAL (Physic Abstraction Layer).

I’ll post the code when/if my Domino Rally simu is done 🙂 … but without sound it’s not good.

I took some time to get the terminology: radix tree/trie, patricia tree/trie … it’s all the same, really:

  • A trie is a multi-way tree;
  • I guess radix term replaced the Patricia appellation because it sounds more generic (not specific to information retrieval);
  • And in the case of radix trie, if you use the binary representation of the key, you have an alphabet of 2 and the trie become a 2-ways tree, thus radix tree 😛

Lately I’ve been playing a bit … well, that’s a programmer kind of game: I wrote yet other implementations (with yet other bugs I assume) of some classic data-structures, in Ocaml.

Those are all about string set or string map, and are purely functional (thus persistent).

The code is in the project ocaml-stringset (http://github.com/khigia/ocaml-stringset/tree/master). There is a TST (ternary search tree), a Collapsed TST, a Trie (bootstrap version of Chris Okasaki) and maybe other will be added. Most of the code is in file/module StringSet so it is easy to embed in a project. There is one example file for each datastructure (as well as some unit test … and the unit tests of TST iss pretty bad as it test the internal structure, not the API).

Haven’t been as far as doing stress tests though it would be nice to compare the different behaviours.

Update: those algo are implemented for string, but we could certainly provide a generic implementation for any datatype that is iterable collection and whose items are comparable. I may try to make some generic version later on.

Update: most of those tree algo implementation are NOT tail recursive; this would probably need continuations … I’ll add this if/when needed (in production environment, it should probably be done anyway except if the use case is restricted and ensure a limit in recursion depth).

(that time I learnt the lesson: never promise what the next post will be about coz I often not deliver)

Optimization algo need to explore the space of solutions, and this exploration is a major part of the algo! Let’s focus on one simple case: exploring in random order two finite dimensions. For example, let two integer variables, V1 taking value in interval [1, 100] and V2 taking value in interval [1, 20]: how to explore all the possible pairs (v1,v2) in random order?

A simple (almost-)solution is given by the following python code:

[(e1,e2) for e1 in shuffle(range([1,100)), for e2 in shuffle(range(1,20))]

which is in fact the cartesian product of the shuffles of V1 and V2 intervals … but this is ordered by values of the first shuffle and thus not random.

OK, now is the bad news: I have no solution completely space efficient to propose. My best effort is a solution which compute at least two shuffles, one on each dimension (kind of O(n + m) for space) and is even not random … just random enough for most of the cases 😛

So how? the solution is describe in this post http://weblog.raganwald.com/2007/02/haskell-ruby-and-infinity.html and especialy the section about the tabular view of the cartesian product. Did you read it? so the proposed solution is to navigate the cross-product table along its diagonal instead of row by row … simply smart isn’t it? It give “impression” of randomness 😛

Ocaml code for this algo is here: http://github.com/khigia/ocaml-anneal/tree/master/walks.ml (function pair_permutation_random_seq); It uses extensively an ad-hoc stream implementation (Seq module) to perform the walk lazyly. It was a good example to test the stream implementation!

I’ve been using this python implementation of simulated annealing to deduce Singapore bus service from a database of bus stops (see previous posts about Google Map API).

I wanted to play a little bit more with this algo and decided to port it to OCaml: you can find the code on GitHub.

As a first try, I wrote a direct code translation. Only few points differ:

  • The OCaml code use a stream implementation (lazy list) … (mostly equivalent to my previous Erlang implementation).
  • The OCaml code uses array structure where Python uses list.
  • The OCaml implementation of function reversed_section do only one array reverse in all cases.

Result is quite ok. Without any optimization on the algo, the OCaml native code performed quite faster (around 10 times faster), meaning that solving the TSP problem for SG bus services took 40 minutes for OCaml version where the Python code ran for hours (of course we could improve this version too).

The algo itself can be improve (I guestimate the time for SG map problem can be reduce by one more order of magnitude with the same simulated annealing approach).

But in coming post (code is there, need time to clean it), I’ll be looking at other interesting problems I found inside this one, especially on how to walk in solutions space in quasi-random order.

This post is a following to the previous post about drawing SG bus route with google map API.

Status of the system

This is for now only a toy demo, all in a single javascript document that draw the map when loaded (the javascript file is generated from a KML file by a python program that also computes the bus service routes).

Toward drawing all the bus services …

Drawing all the bus services on the same map not only is *very heavy* for the browser (at least for firefox, and its SVG renderer) but also give very little information (it is too much data at one time).

See by yourself:

Even zooming (on Dhoby Ghaut station):

What could be done …

  • To solve the problem of too many services displayed at same time, definitively a good UI could improve usability.
    What I have in mind is a kind of UI which enable to select which services to display … but maybe something better can be done like auto-selection of services to display depending of zoom factor or user itinary?
  • The previous point could also help to not send all the data to the client (in order to save bandwith).
  • Automatic route simplification: on one route, when adjacent stops are “too close” to each other, we could remove one of them on the drawing (which would make it lighter, and also remove some artifact like tiny wigzag caused by stop on the opposite side of the road).
  • Adding some constraints manually (editing few distance between some stops) to correct some wrong path.

More?

As in the previous post conclusions, I think a mashup of existing itinery service to display the itinery on a map could be quite interesting.

Playing with the DB of SG bus stop locations and google map API.

My goal is to try to draw the route of the bus services … but the DB has no info about the order of the stops per service (there is info of which services stop for each bus stop). Following are some experiments, with display about bus service 124.

First try was kind of funny (not unexpected!) as I draw the bus route “as stops are found in DB” … which apparently is from left to right:

So, to get a better idea of the bus route, I’m trying to draw the shortest path … better and even quite enough:

But of course this is inexact map as:

  1. this shortest route uses all bus stops without taking care of which ones are on the way to go or the way to go back (this data is not in my DB);
  2. the length of the path does not depend of the road … those are only straight lines!

But anyway, this give a better view of the bus service than nothing or a guide book of bus stop names …

New example in ocamerl lib of erlocaml project: a tiny map-reduce-like (1) word-count program (word-count is the simple example often used to illustrate map-reduce principle).

First part of this example is an ocaml distributed erlang node [l52]. It is composed of one permanent registered (‘wc’) mbox [l40] which upon request create other mboxes implementing the mappers [l19] (on request, a mapper read a text file [l8] and send it word by word to reducer [l29]).

Second part of this map-reduce example is an erlang module implementing two functionalities. One of them is the reduce part implementation [l34] done by a process which accept messages from mappers and update a dictionary accordingly.
Second responsibility of this module is orchestration of the map-reduce flow [l90], consisting of running all mappers [l66], waiting end of all those executions [l79], and collecting result of all reducers [l95].

Assuming EPMD is running, this script run an ocaml node and an erlang node, and run a simple map-reduce to count words in two test files.

As usual, not sure it’s a useful example, but it was sure fun to write 🙂

(1) this is not a map-reduce lib … there is no failure case taken care of, processes pipe is mostly done through messages over network (while map-reduce is mainly meant to use an efficient filesystem based communication), etc!

Little intro step-by-step to the ocaml-erlang message exchange mechanism offered by ocamerl (a lib of erlocaml).

Subject of the following code is a wonderfull application … named “echo server”!

  1. In the interpretor shell of ocaml, we create a hidden erlang node and an activity to echo all received message (both on console as well as in respond to the caller).
  2. In the erlang interpretor shell, we send few message and compare them to the received ones.

The following assumes code is running on machine “comp1”.

Ocaml shell:

(* getting Ocamerl available in toplevel *)
ocaml> #use "topfind";; (* ocamerl used findlib for install *)
ocaml> #thread;; (* ocamerl is multithreaded app *)
ocaml> #require "ocamerl";;

(* creating/declaring a node, assuming epmd is running *)
ocaml> let o1 = Ocamerl.Enode.run "o1" ~cookie:"cookie";;
val o1 : Ocamerl.Enode.t = <abstr>

(* creating/declaring a mbox, equivalent of erlang process *)
ocaml> let m = Ocamerl.Enode.create_mbox o1;;
val m : Ocamerl.Enode.Mbox.t = <abstr>
ocaml> Ocamerl.Enode.register_mbox o1 m "echoer";; (* give it a name *)
- : unit = ()

ocaml> Ocamerl.Enode.Mbox.create_activity m (fun msg -> match msg with
    | Ocamerl.Eterm.ET_tuple [|pid; any;|] ->
        Printf.eprintf "MSG:%s\n%!" (Ocamerl.Eterm.to_string msg);
        Ocamerl.Enode.send o1 pid any
    | _ ->
        () (* drop unexpected msg *)
);;
- : unit = ()

Erlang shell:

# starting erlang node with same cookie
erl -sname e1 -setcookie cookie

% check connection
erl> pong = net_adm:ping(o1@comp1).
pong

% utility to print whatever is in message queue
erl> F = fun() -> receive W -> io:format("got back: ~w~n", [W]) after 1 -> error end end.
#Fun<erl_eval.20.67289768>

% some tests ... send data, received it back
erl> {echo1, o1@comp1} ! {self(), {1,2,"test"}}.
{<0.37.0>,{1,2,"test"}}
erl> F().
got back: {1,2,[116,101,115,116]}
ok
% in the mean time, ocaml shell also display the data

That’s it! A wonderfull echo server 🙂

Amongst things on the to-do list:

  • Should not have to create a mbox and set its activity separately (need some wrapper)
  • Could have an onode ocaml toplevel which run one node by default and offer direct interface (e.g. “send”).

Yet another useless example for erlocaml, but this time it is not completely silly … it does something 🙂

In fact the project eocarve uses:

Idea is simple: an ocaml node run and provide a API for Erlang node to call the seamcarving library. See eocarve wiki for details/example.

Aim of this project was mainly to demonstrate use of ocamerl lib … however it may be usefull for an Erlang web app which would need seamcarving (heavy weight for CPU!). Had fun to integrate those lib in same app.

Minor updates for erlocaml (more exactly on ocamerl) … mostly code cleaning and minor refactoring!

Only one new feature: ability for Erlang processes to send message to unregistered ocaml processes. An example of that is “ex_node_mult” which generalize “ex_node_double” by dynamically creating ocaml process to perform a multiplication … yep I know, not very useful

% erlang code using the ocaml node
{byn, OcamlNode} ! {self(), 2},
By2 = receive M -> M after 1 -> error,
{byn, OcamlNode} ! {self(), 3},
By3 = receive M -> M after 1 -> error,
P1 = make_ref(),
P2 = make_ref(),
By2 ! {self(), P1, 21},
By3 ! {self(), P1, 21},
ok = receive {P1, 42} -> ok after 1 -> error end,
ok = receive {P2, 63} -> ok after 1 -> error end

With this feature, ocamerl begin to be usable … and that’s exactly what I will be doing next: experiment with more useful (at least less silly) examples! Some ideas are:

  • tiny mapreduce example with ocaml workers (erlang doing the network part);
  • using the ocaml image processing lib from erlang (not best example as lot of data need to be exchange … this is task to be solved for future erlocaml development);
  • others???