Writing music on the Web

January 19th, 2009

I’ve been looking forward to see a web-based music edition application for a while back. I was delighted to discover Noteflight, a flash-based music edition application. It is very functional and you can even playback your scores and share them on your website. Here’s the first part of the Monaghan Jig, one of my favorites of the Quebecois traditional repertoire. 

Now I am looking forward to see a feature that translate the scores into guitar tablatures, my music reading skills are just not that good.  

js Web Technologies

Y! Web Ranking Algorithm running on a 10,000+ Hadoop Cluster

February 26th, 2008

A few weeks ago, I posted about MapReduce and mentioned that Hadoop was an open-source Java implementation of MapReduce. Since this post, I’ve continued to investigate Hadoop and found out that it is supported by a few large and smaller companies, including Yahoo!.

Last week, Y! announced that WebMap, their link inversion and ranking algorithm, is now running on what is believed to be the largest Hadoop cluster, 10,000+ cores on Linux machines).

Some more info about WebMap data size:

  • Number of links between pages in the index: roughly 1 trillion links
  • Size of output: over 300 TB, compressed!
  • Number of cores used to run a single Map-Reduce job: over 10,000
  • Raw disk used in the production cluster: over 5 Petabytes

Pretty impressive stuff! You now have this Java-based, open-source library that allows you to run your favorite machine learning algorithms on web-scale datasets.

js Web Technologies , ,

Blogger Play

February 25th, 2008

I just discovered Blogger Play. I think this is really cool:

Blogger Play is a real-time slideshow of photos Blogger users have recently uploaded to their blogs. It’s a great snapshot of what people are thinking and posting about, right now!

Most of the time, the web interface fails miserably at representing real-time online users activity. For example, consider a real-world outdoor market versus ebay, the former is full of lively activity while the later is pretty stale. This, even though ebay has probably 1000X more activity than any outdoor market in the world. It just fails to represent that real-time activity very well.

js Web Technologies ,

Easy Distributed Computing with MapReduce

January 31st, 2008

I’ve been hearing about MapReduce quite often lately, so I’ve decided to give it a closer look. MapReduce was originally developed and implemented by Google to perform task on massive amount of data (e.g., index the web). It is a programming model that aims to make distributed computing on large clusters (thousands) of machine easily accessible to the average programmer.

MapReduce programming model is inspired from functional programming (i.e., Lisp) and is quite simple to understand. To run an algorithm as a MapReduce distributed task, the programmer has to implement the algorithm in the form of two functions, “map” and “reduce”, the platform takes care of all the rest. Here’s an explanation from the original MapReduce paper by Dean and Ghemawat:

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

May sounds a bit strange at first, but let’s look at a very simple example. The example is not realistic but serves the purpose of explaining MapReduce.

Example: Counting Character Occurrence in a String

Imagine we want to count the number of times each character occurs in a string. The sequential algorithm to perform this task is pretty trivial:

count_characters(string text)
  map<string><int> character_count
  for each character c in text
    character_count[c] += 1
  end
end

The resulting character counts for:

count_characters("hello world")

would be something like:

'd' => 1
'e' => 1
'h' => 1
'l' => 3
'o' => 2
'r' => 1
'w' => 1
' ' => 1

Counting Character Occurrence with MapReduce

To implement this algorithm using MapReduce programming model, we need to express it as map and reduce functions:

map(string text)
  for each character c in text do
    emit_intermediate(c, "1")
  end
end
reduce(string char, list<string> values)
  int count
  for each value v in values do
    count += integer(v)
  end
  emit(char, count)
end

The map function of our algorithm goes through each character of an input string and store a count of “1″ in the intermediate key/value pair (through the emit_intermediate function).

The reduce function receives a character with an associated list of values (i.e., a list of “1″s) and sums up the values for the character. Once all the values are added, the reduce function outputs the total count for the current character (through the emit function).

Figure 1 illustrates counting the number of occurrences for each character in the string “hello world” with MapReduce. The map and reduce functions are executed in different processes and can be executed by multiple processes at the same time. In the Figure, M1 and M2 are worker processes executing the map function on the input data. R1 and R2 are processes executing the reduce function on the intermediate data generate by M1 and M2. In a real-world scenario, there could be a large number of worker processes, furthermore, the number of processes in the map phase need not be the same as for the reduce phase. Steps are labeled from 1 to 6, let’s explain each one of them.

1) The input data is partitioned into “splits”. In our example we split the “hello world” into two pieces (ignoring the white space for simplicity) and we assign one piece to each worker process (i.e., M1 and M2). Typically, at this step the data is broken into chunks of 16MB to 128MB.

2) Each worker passes their assigned sub-string to the map function implemented by the programmer.

3) The map function calls emit_intermediate which writes the data to a local
bucket, there is one bucket for each reduce process. In this example, keys are assigned to buckets by splitting the alphabet in two, a-m go to R1, n-z go to R2. However, in a read world scenario we would typically assign the buckets by taking a hash function of the key and apply the modulo of the number of reduce processes hash(key) mod R.

4) Each reduce process reads their bucket of data from each map process. This data is aggregated in a list for each key.

5) The reduce function implemented by the programmer is called on each “key/list of value” pair.

6) The reduce function emit the data which is written to the output.

This example aims to illustrate the “big picture” of MapReduce, but a lot of subtleties that can vary from an implementation to another were left out on purpose. If you are interested to know more about MapReduce, I encourage you to look at the resources listed below.

The MapReduce model seems pretty simplistic, but it turns out that in practice a lot of useful algorithms can be implemented in this framework, such as web indexing, machine learning algorithms, statistical analysis…

MapReduce References and Resources

The Concepts

MapReduce Implementations

  • Hadoop - an open-source Java implementation of MapReduce by the Apache Foundation.
  • Simple MapReduce in Ruby - a ruby implementation just to play around.
  • Skynet - Another ruby MapReduce implementation.

js Algorithms , , ,

A new HTML5 draft published

January 23rd, 2008

The W3C announced the publication of a new HTML5 specification draft. Interesting new features, quoted from the official press release:

Some of the most interesting new features for authors are APIs for drawing two-dimensional graphics, embedding and controlling audio and video content, maintaining persistent client-side data storage, and for enabling users to edit documents and parts of documents interactively. Other features make it easier to represent familiar page elements, including

;

also…

The HTML 5 specification helps to improve interoperability and reduce software costs by giving precise rules not only about how to handle all correct HTML documents but also how to recover from errors.

How long before we can actually use these new features? HTML 4 became a W3C Recommendation in 1997…

js News, Web Technologies , , ,

A First Drive on the Facebook Developers’ Platform…

January 13th, 2008

This weekend I spent a few hours trying out the Facebook developers’ platform and developed my first facebook app (I’m in). I was quite surprised (and pleased) to see the flexibility and ease-of-use of the platform. There are some limitation in what you can do on the profile page of users, mainly because the profile page is cached (for performance reasons) and facebook proxies everything that is on the profile page (for privacy reasons). On the other hand, in your app’s dedicated pages (i.e., canvas pages) you can do anything you want: html, iframes, javascript, java, flash, third-party maps, and you can also use facebook’s FBML that has some helpful markups. In canvas pages, the content is also served directly from your server to the user’s browser.

To try out the fb platform I have built a simple app in PHP that uses a mix of technologies: geolocation, geocoding, google maps, and the fb social graph (of course). The purpose of my app, “I’m in“, is to allow the user to share his/her current location (i.e., city) with his friends easily, as well as be able to know where his friends are. The app figures out the location of the user automatically using the user’s IP address or can use the location that is set manually in his facebook profile.

Profile

The content that an app puts on the profile page must be pushed, it cannot be dynamically generated when a user sees it. My app pushes the user’s current city info (with a nice image if available) to the profile when he accesses the app’s canvas page.

“I’m in” user profile view

Canvas Page

The canvas pages of an app are directly served from the app’s creator, thus it can be dynamic. In my app I show the user’s friends for which the location is known on a google map.

“I’m in” canvas page

Although Facebook is a proprietary social graph (as opposed to opensocial), the fact is that it has a HUGE users base, and I was really pleased to see the flexibility of the developers’ platform. As I continue to mature my app I might post further about the fb api. Stay tune and install I’m in!!!

js Programming , , ,

A new canadian copyright law is coming…

November 30th, 2007

According to a recent Globe and Mail article:

Ottawa copyright circles are buzzing with hints that the government is preparing its new revised copyright bill, and will be tabling it soon, perhaps as early as next week.

And the buzz is that the new law will basically be a copy of the controversial U.S. Digital Millennium Copyright Act (DMCA).

Howard Knopf, a copyright lawyer and litigator is explaining the predicted implications of the law and what is already known from the US experience.

If like me, you think that this all sounds quite worrying, here’s 30 things you can do to protest DRM on Michael Geist’s blog (another copyright lawyer).

js Uncategorized , ,

Rails respond_to method and Internet Explorer

November 22nd, 2007

I recently encountered a strange problem with my Rails (1.2.5)  application that took me sometime to figure out, so I thought I would share it.

While my application would work perfectly fine in Firefox, in IE (both 6 and 7) the wrong document type would be returned to the web browser (an Excel file was returned rather than HTML). Addressing the issue required a deeper understanding of the functioning of the respond_to method.

First, my controller is shown below, it returns a list of accounts in HTML or Excel (xls) format:

def index
  @accounts = Account.find(:all)
  respond_to do |format|
    format.html
    format.xls do
      headers['Content-Type'] = "application/vnd.ms-excel"
      headers['Content-Disposition'] = 'attachment; filename="accounts.xls"'
      render :template => 'accounts/excel', :layout => false
    end
  end
end

I also registered the xls mime type in the environment.rb file:


Mime::Type.register "application/vnd.ms-excel", :xls

The behavior I expected out of this controller was that a user would see the list of accounts as html at http://localhost:3000/accounts and would get the Excel version at http://localhost:3000/account.xls
It worked as expected in Firefox, however, in IE http://localhost:3000/accounts would return the Excel content rather the HTML, meaning the Excel format would be returned as default. Why was that?

How does the respond_to method works?

The respond_to method in the controller determines the content to return based on params[:format] and request.accepts. The parameter params[:format] is set by rails from the extension in the url. If you request
http://localhost:3000/accounts.xls, params[:format] will be set to ‘xls’ and an excel document will be returned.
If the format is not specified, rails will use request.accepts to determine a list of mime types ordered by priority. Request.accepts returns the list of accepted mime types sent in the HTTP header by the browser, (if “Accept” is not specified in the header, it returns the list of all rails known mime types). From the mime type list it will execute the code block of the type that is found first. For example, if we have a priority list with xml, html, and xls, html will returned since the first type to be found in the list with an existing handler is html (our controller does not have an xml handler). This logic makes sense if the browser sends the list of mime type it accepts in order of preference, but… in IE word document, excel document, image formats, and other mime types come before HTML. Below is the Accept HTTP header sent by my installation of IE6:

image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
application/x-shockwave-flash, application/vnd.ms-powerpoint,
application/vnd.ms-excel, application/msword, */*

As we can see, a whole bunch of mime types have priority over HTML, which is implicitly specified by the wildcard */*.
It means that with this header IE tells to rails that the user prefer to see excel document over HTML document (which is probably not the case, not for my app anyway).

The Solution

I am not sure if this is a Rails bug or a IE bug, I would have to find in the HTTP specification if the order of accepted mime types is meaningful, but in any case I fixed the problem pretty easily. To solve this problem, I decided to not rely on what is passed by the browser in the Accept HTTP header and always make sure params[:format] is set to something. I added the following filter to my ApplicationController:

before_filter: set_default_formatdef set_default_format

params[:format] = 'html' unless params[:format]

end

Problem solved!

js Programming ,

What does Quebec, Cuba, Iran, Syria, North Korea, Sudan, and Myanmar have in common?

November 18th, 2007

They have in common that their residents are not allowed to participate in at least two of the biggest programming challenges recently announced, namely, the Google Android Developer Challenge and the NetFlix Prize challenge. As a Quebecois it is a bit embarrassing to see my province (I am sometime so proud of) enumerated along countries where governments are somewhat controversial.

The Google Android Challenge is a programming contest that will distribute a total $10 millions to individuals and teams that will have developed the best applications on Google’s new mobile development platform. The challenge is meant to stimulate the development of innovative apps using platform. Unfortunately, as the FAQs stipulate, Quebec residents are not allowed to enter the contest:

The Android Developer Challenge is open to individuals, teams of individuals, and business entities. While we seek to make the Challenge open worldwide, we cannot open the Challenge to residents of Cuba, Iran, Syria, North Korea, Sudan, and Myanmar (Burma) because of U.S. laws. In addition, the Challenge is not open to residents of Italy or Quebec because of local restrictions.

Netflix on the other hand is trying to stimulate the development of a better algorithm that predicts how much someone is going to love a movie based on their movie preferences. Unfortunately, Quebecers are also not allowed to play the game:

Residents of the province of Quebec in Canada are ineligible to participate. Residents of Cuba, Iran, Syria, North Korea, Myanmar (formerly Burma) and Sudan are also ineligible to participate. The Contest is void in these countries and where prohibited or restricted by law. Netflix reserves the right to limit, or restrict upon notice, participation in the Contest to any person at any time for any reason.

It seems that this restriction for Quebecois would be due to a law that states that the organizer of an international contest in which Quebec residents can participate must pay to the government of Quebec, 0.5% of the total prizes distributed among all participants (not just resident of Quebec) . This is ridiculous, it implies Google would have to pay $50,000 to the government of Quebec to allow their residents to enter the contest and Netflix $5,000.

58. Une personne au bénéfice de laquelle est tenu un concours publicitaire
dont la valeur des prix offerts dépasse 100 $ doit payer à
la Régie, en même temps qu’est transmise la formule prévue à
l’article 59, les droits suivants :
a) 10 % de la valeur d’un prix offert à des participants du Québec
exclusivement;
b) 3 % de la valeur d’un prix offert à un ensemble de participants
du Canada exclusivement, lorsque cet ensemble comprend des
participants du Québec;
c) 0,5 % de la valeur d’un prix offert à tout autre ensemble de participant
comprenant des participants du Québec.

Well… we are the most taxed province/state in north america, we have to live up to our reputation… and we have this great “free” health system to finance. But hey wait… wouldn’t we be making more tax money from a company that could grow out of innovation motivated by one of those contest… ahhh it’s not worth the risk. ;o)

js Montreal Technologies

Montreal on Rails 4

November 9th, 2007

Last night was the fourth edition of MoR. As usual we had great presentations by three different speakers.

The most interesting was Mathieu Martin who presented JRuby, a 100% Java implementation of Ruby. JRuby basically allows to run Ruby programs (including Rails) inside the Java Virtual Machine and to use Java classes within ruby code. It’s pretty neat, it allows to leverage all the enterprise power of Java inside your (more fun) Ruby programs. Even more interesting are the performance benchmarks that show a gain with JRuby over MRI implementation of the ruby interpreter, the trick is that Ruby code is compiled to Java bytecode. However, JRuby is still in heavy development, but it looks very promising.

The two other presentations were by Alain Pilon and Sylvain Carle who presented the acts_as_state_machine plugin and the Comatose light-weight CMS, respectively.

js Montreal Technologies , , ,