July 2007

I didn’t know this was such a popular problem when I coded a sieve of Eratosthenes using the stream design from SICP.

In fact, I saw this video about NewSqueak, which presents an example of a concurrent algorithm for the sieve, and in this related discussion on the plan9 group I also found a Haskell implementation (more related to the stream design than to the concurrent design, IMHO). I also had a look at libtask, which offers a C version of the concurrent sieve.

That’s already a lot, so one more shouldn’t be necessary … but I like this example, so here is my Erlang implementation, using a spawn for each prime in the sieve.
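In this design, each prime that is discovered gets its own process, which filters out its multiples and forwards the survivors down the chain. Here is a minimal, self-contained sketch of the idea (module and function names are mine, not necessarily those of the original code):

```erlang
-module(sieve).
-export([primes/1]).

%% Return the list of primes up to Max.
primes(Max) ->
    Collector = self(),
    First = spawn(fun() -> filter_loop(Collector) end),
    feed(2, Max, First),
    First ! done,
    gather([]).

%% Feed the candidates 2..Max into the head of the filter chain.
feed(N, Max, Pid) when N =< Max ->
    Pid ! N,
    feed(N + 1, Max, Pid);
feed(_, _, _) ->
    ok.

%% A fresh filter: the first number it receives is prime (it survived
%% every earlier filter), so report it and spawn the next filter.
filter_loop(Collector) ->
    receive
        done ->
            Collector ! done;
        Prime ->
            Collector ! {prime, Prime},
            Next = spawn(fun() -> filter_loop(Collector) end),
            filter(Prime, Next)
    end.

%% An active filter: drop multiples of its prime, forward the rest.
filter(Prime, Next) ->
    receive
        done ->
            Next ! done;
        N when N rem Prime =:= 0 ->
            filter(Prime, Next);
        N ->
            Next ! N,
            filter(Prime, Next)
    end.

%% Collect the reported primes until the 'done' marker comes through.
gather(Acc) ->
    receive
        done -> lists:reverse(Acc);
        {prime, P} -> gather([P | Acc])
    end.
```

Calling sieve:primes(20) spawns one filter process per prime found and returns [2,3,5,7,11,13,17,19].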


Because CF needs data to learn, small examples to illustrate adviserl are not easy to find. And thus the first “real” example is already a misuse of adviserl: it uses the CF algorithm as an IR tool!

Anyway, here we go: I created a tag recommender from my delicious bookmarks (this is nothing new, delicious already displays related tags, but it is just a toy application example).

This is done by considering each bookmark as a source (a user) and each tag as an item: each time a tag is associated with a bookmark, this is translated as “the bookmark rates the tag with a score of 1”. The complete code is this:
{ok, DeliciousPID} = deli_posts:start_link(),
ok = gen_server:call(DeliciousPID, {login, User, Password}),
{ok, Posts, _Status} = gen_server:call(DeliciousPID, {get_posts, User, Options}, infinity),
io:format("Loading posts", []),
lists:foreach(
    fun(#delipost{href=HRef, tags=Tags}) ->
        io:format(".", []),
        lists:foreach(fun(Tag) -> adviserl:rate(HRef, Tag, {1,no_data}) end, Tags)
    end,
    Posts),
io:format("~n", []).
Getting recommendations for a few keywords is then:
Keywords = ["erlang", "concurrency"],
KeywordIDs = lists:map(fun(K) -> adv_items:id_from_key(K) end, Keywords),
Ratings = lists:map(fun(ID) -> {ID, 1} end, KeywordIDs),
Rec0 = adviserl:recommend_all(Ratings),
lists:map(fun({ID,_}) -> {ok,K} = adv_items:key_from_id(ID), K end, Rec0).

(A lot of this code is about formatting and conversion; hopefully this will be handled in the next API release.)

This delicious toy example can be run with keywords.sh in the delicious example folder.

Yeah, I know, we could do the same more easily with a few statistics (and R) and no CF … but (1) I needed a small example and (2) this could be extended to use different user accounts.

Hey! I should try to use citeUlike instead of delicious for the next example!

I began to write a delicious API client to implement a toy example for adviserl (thus the code is here). More about the toy example in the next blog post.

I haven’t found a delicious API library for Erlang anywhere … but I haven’t really looked for one, because I wanted to avoid dependencies. So I wrote one, organised in two parts.

  • The first part (deli_api) is able to:
    • connect to the delicious server and retrieve data (posts) using the inets http client;
    • convert those data (posts) into Erlang terms using xmerl (discarding the XML, as the different delicious APIs (rss, api, …) have different formats for the same data);
  • The second part is a gen_server (deli_posts) which accepts these requests:
    • {login, User, Password}, returns ok
    • logout, returns ok
    • {get_posts, User, [Option]} with Option=no_update, returns {ok, [Post], archive|uptodate}

The server maintains a local cache (a DETS file) and downloads from the delicious server only if needed and requested. The login is only used to retrieve data from the delicious server: data put in the cache can always be accessed without a password.
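The cache-or-fetch decision can be sketched like this (a hypothetical fragment, not the actual deli_posts code: the table layout, the FetchFun callback, and the meaning I give to the archive/uptodate atoms are my own assumptions for illustration):

```erlang
-module(post_cache).
-export([get_posts/3]).

%% Look the user's posts up in the DETS table; on a miss, call FetchFun
%% (standing in for the real HTTP request to delicious) and cache the
%% result. The third element of the reply says where the data came from:
%% 'uptodate' for a fresh download, 'archive' for the local cache.
get_posts(Table, User, FetchFun) ->
    case dets:lookup(Table, User) of
        [{User, Posts}] ->
            {ok, Posts, archive};
        [] ->
            {ok, Posts} = FetchFun(User),
            ok = dets:insert(Table, {User, Posts}),
            {ok, Posts, uptodate}
    end.
```

With this layout, a second call for the same user never touches the network, which matches the "download only if needed" behaviour described above.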

I would like to use this API to write a plugin for SharedCopy: any shared URL could be automatically posted to delicious (this seems easy with Yaws, but I would need web hosting for that).

The recommender system adviserl manages all data (items and users, or sources) through identifiers.

Keeping those identifiers opaque to the whole system would be possible, but as many prediction algorithms use integer identifiers (for matrix operations), adviserl provides an API at the input of the system to generate integer IDs if needed. Thus all identifiers in adviserl are integers, with an automatic conversion if needed.
erl> adviserl:rate(5, 1, {3,no_data}). % user 5 rates item 1 with score 3 (no rating data)
erl> adviserl:rate("bill", "microsoft", {5, no_data}). % user "bill" rates item "microsoft" ...
erl> adv_items:id_from_key("microsoft").
2 % item "microsoft" got ID 2
erl> adv_source:id_from_key("bill").
1 % source "bill" got ID 1
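Such a key ↔ integer ID mapping can be sketched with two ETS tables, one per direction, plus a counter for fresh IDs (this is my own illustration of the mechanism, not adviserl's actual implementation):

```erlang
-module(idmap).
-export([new/0, id_from_key/2, key_from_id/2]).

%% Create a mapping: one table per direction, one for the ID counter.
new() ->
    Fwd = ets:new(fwd, [set, public]),
    Rev = ets:new(rev, [set, public]),
    Ctr = ets:new(ctr, [set, public]),
    ets:insert(Ctr, {next, 0}),
    {Fwd, Rev, Ctr}.

%% Return the integer ID for Key, allocating the next one on first use.
id_from_key({Fwd, Rev, Ctr}, Key) ->
    case ets:lookup(Fwd, Key) of
        [{Key, ID}] ->
            ID;
        [] ->
            ID = ets:update_counter(Ctr, next, 1),
            true = ets:insert(Fwd, {Key, ID}),
            true = ets:insert(Rev, {ID, Key}),
            ID
    end.

%% Translate an ID back to its external key.
key_from_id({_Fwd, Rev, _Ctr}, ID) ->
    case ets:lookup(Rev, ID) of
        [{ID, Key}] -> {ok, Key};
        [] -> undefined
    end.
```

Looking up an existing key is then idempotent: asking twice for "bill" returns the same ID, while each new key gets the next integer.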

The API is also useful for keeping data about items or sources, which opens the door to content-based recommendation algorithms. For this, it’s enough to call the item/source API before setting the related ratings. Any item/source has an integer ID, an external key (any Erlang term), and a data term (a property list in the following example).
erl> adv_items:insert_new("linux", [{tags, ["oss", "os"]}]).
{true,3} % bool: inserted or already existing; the integer is the ID
erl> adv_items:object_from_id(3).
{3,"linux",[{tags,["oss","os"]}]} % {ID, Key, Data}
erl> adviserl:rate("linus", "linux", {6,no_data}).

Until now, to access a remote node running adviserl, you could use the native Erlang RPC:
erl> rpc:call(adviserl_node@localhost, adviserl, rate, [1, 2, {3,no_data}]).

I added a gen_server API to do exactly the same thing:
erl> gen_server:call({adv_api, adviserl_node@localhost}, {rate, 1, 2, {3,no_data}}).

This is not simpler, but not more complicated either, and hopefully it will be more flexible when it comes to distribution (the API may then become a global process).

I also added a bunch of shell scripts to start, stop, rate, or run predictions from the command line. The start-stop script (adv.sh) is quite useful, while the others are convenient when debugging. Those scripts have yet to be documented 😦 but they are more or less straightforward and print a minimal help message when given bad options or arguments.
shell> ./bin/adv.sh start
shell> ./bin/adv-rate.sh 1 2 2 # user 1 rates item 2 with score 2
shell> ./bin/adv-rate.sh 1 3 3 # ...
shell> ./bin/adv-rate.sh 2 3 3
shell> ./bin/adv-rate.sh 2 4 4
shell> ./bin/adv-rate.sh 3 3 3
shell> ./bin/adv-getratings.sh 1
3 3
2 2
shell> ./bin/adv-recommendall-source.sh 3 # prediction for user 3
4 4.00000
2 2.00000

I think my next post on adviserl will show a more useful example! 🙂

From a programmer perspective, how easy (difficult) is it to mix OO and functional programming?

Earlier this year, I was reading SICP and Programming Pearls at the same time; last month, I was writing both OO Python code and functional Erlang code. Those experiences are probably the best programming feeling I’ve had in a while: I’m not writing code, I am coding abstractions.

But I’m still wondering about mixing both styles/paradigms in the same project or the same source file. My intuition is that it may be nice to work with different styles for different parts of a program, but I would really be afraid to introduce that kind of mix into a project maintained by multiple programmers. Any experience to share on this subject?

Perhaps I should give Nemerle a try. Any other?

I have no idea what a good way to discover and learn new things is. Surely the “science of learning” has some models and advice that might have been useful, but I chose to follow the lazy way. To learn about collaborative filtering, I googled.

The first source of information reported by the search engine is (of course?) Wikipedia: collaborative filtering. Then I skipped two links in the Google results (general information) to direct myself toward specialized articles through this great personal page. I navigated around it a bit, and with the help of a few more Google searches I got lost … on my way to being lost, at least I figured out some articles that seem fundamental: Reporting and evaluating choices in a virtual community of use; GroupLens: an open architecture for collaborative filtering of netnews; Social information filtering: algorithms for automating ‘word of mouth’. My machine learning background was appreciated at this stage.

But wow! That’s a lot of information: did I really understand what I read? It was time to get my hands dirty, so I chose one system that looked relatively simple and tackled it at the implementation level: the Slope-One algorithm.
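For reference, weighted Slope-One predicts a user's score for item j by averaging, over the items i the user has already rated, the quantity dev(j,i) + r(u,i), where dev(j,i) is the mean rating difference between j and i over all users who rated both, weighted by how many such users there are. A minimal sketch of it (module and function names are mine, not adviserl's code):

```erlang
-module(slope_one).
-export([deviations/1, predict/3]).

%% Ratings is a list of {User, Item, Score} tuples.
%% Returns a map #{ {ItemJ, ItemI} => {SumOfDeviations, Count} }.
deviations(Ratings) ->
    %% Group the ratings by user ...
    ByUser = lists:foldl(
        fun({User, Item, Score}, Acc) ->
            maps:update_with(User, fun(L) -> [{Item, Score} | L] end,
                             [{Item, Score}], Acc)
        end, #{}, Ratings),
    %% ... then accumulate Sj - Si for every item pair each user rated.
    maps:fold(
        fun(_User, ItemScores, Acc0) ->
            lists:foldl(
                fun({J, Sj}, Acc1) ->
                    lists:foldl(
                        fun({I, Si}, Acc2) when I =/= J ->
                                maps:update_with({J, I},
                                    fun({Sum, N}) -> {Sum + Sj - Si, N + 1} end,
                                    {Sj - Si, 1}, Acc2);
                           (_, Acc2) ->
                                Acc2
                        end, Acc1, ItemScores)
                end, Acc0, ItemScores)
        end, #{}, ByUser).

%% Predict a score for item J from the user's known {Item, Score} pairs,
%% weighting each deviation by the number of users behind it.
predict(J, UserScores, Devs) ->
    {Num, Den} = lists:foldl(
        fun({I, Si}, {N, D}) ->
            case maps:find({J, I}, Devs) of
                {ok, {Sum, Count}} ->
                    {N + (Sum / Count + Si) * Count, D + Count};
                error ->
                    {N, D}
            end
        end, {0.0, 0}, UserScores),
    case Den of 0 -> undefined; _ -> Num / Den end.
```

On the classic textbook example (Alice rates item1=1.0 and item2=1.5, Bob rates item1=2.0), the deviation of item2 over item1 is 0.5, so Bob's predicted score for item2 is 2.5.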

Then I found a recent article about the Google News personalization recommender system: I still don’t feel comfortable comparing the different algorithms, but at least this great article puts things in a real context (taking into account the big scalability problem). It also pointed me toward a wonderful survey: Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions.

I have enough (too much) information for now, so I guess that (1) this is why I feel a bit confused and (2) it is a good time to write all that stuff down to try to organize it: field description and algorithm classifications, evaluation of recommender systems (see for example here and here), and last but not least the scalability problem. And I will try to use citeUlike and its recommendations 🙂

Hopefully, after that, I can go further with “Incorporating contextual information in recommender systems using a multidimensional approach” (available here), but before that I still have to find a survey of available recommender systems. Any recommendation? 😉

Did I miss something important?
