Generating music with deep learning

Automatic, machine-generated music has been a small interest of mine for some time now. A few days ago, I tried out a deep learning approach for generating music… and failed miserably. Here’s the story about my efforts so far, and how computational complexity killed the post-rock.

The spark of an idea

When Photo Amaze was created in 2014, I thought it would be fun to have some kind of ambient music playing while navigating through the 3D maze. But I did not want to play pre-recorded music. I wanted it to be automatically generated on-the-fly, based on the contents of the pictures in the maze.

That was the spark. A picture is worth a thousand words, so why can’t it be worth a few seconds of music as well? For example, take a look at this picture:

Mountain with running water in the foreground

When I look at this picture, I can imagine sounds that match its emotional impact, like the ambient sound of a running water stream or the whistle of the wind picking up speed over the mountain.

Since I can make a connection between photo and music, perhaps a machine could do this automatically as well. This is not a novel idea, but it is a nut that has yet to be cracked, and it was an intriguing idea to start exploring.

Hard-coded music mappings

Modern browsers have all the ingredients necessary for doing both image analysis and sound generation. There are numerous JavaScript libraries for analyzing and manipulating pictures, and the Web Audio API makes it possible to create synthesized sound in a fairly straightforward way. Thus, it made sense to start here.

The first experiment I did was to create a more-or-less fixed mapping between an image’s content and some kind of sound output. The high-level idea of the implementation was to simply map brighter colors to brighter sound notes. The steps to produce the output sound were something like this:

  • Find at most 200 “feature pixels” in the input image using tracking.js.
  • For each “feature pixel” found:
    • Calculate the average of the pixel’s three RGB color values. This produces a single number between 0 and 255.
    • Map that value from the 0-255 range to the 20-500 range. This produces the base frequency for that pixel’s contribution to the output sound.
    • Create a sine wave oscillator at that frequency using the Web Audio API.
  • Combine all oscillators into a single sound output.
  • While playing the sound, randomize the frequency of each oscillator slightly over time.

Using this approach, an image would be turned into a randomly changing output sound consisting of about 200 sine waves, each with a frequency between 20 and 500 Hz.
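
To make the idea concrete, here is a minimal TypeScript sketch of that mapping. It assumes the “feature pixels” have already been extracted (for example with tracking.js) into an array of RGB values; the names and numbers (such as the ±5 Hz jitter) are illustrative, not the original implementation.

interface FeaturePixel { r: number; g: number; b: number; }

// Map a pixel's average brightness (0-255) to a base frequency (20-500 Hz).
function pixelToFrequency(p: FeaturePixel): number {
  const brightness = (p.r + p.g + p.b) / 3;
  return 20 + (brightness / 255) * (500 - 20);
}

// Create one sine oscillator per feature pixel and mix them into a single output.
function playSoundscape(pixels: FeaturePixel[]): void {
  const ctx = new AudioContext();
  const mix = ctx.createGain();
  mix.gain.value = 1 / Math.max(pixels.length, 1); // keep the summed signal from clipping
  mix.connect(ctx.destination);

  for (const p of pixels) {
    const base = pixelToFrequency(p);
    const osc = ctx.createOscillator();
    osc.type = "sine";
    osc.frequency.value = base;
    osc.connect(mix);
    osc.start();

    // Drift each oscillator's frequency slightly over time.
    setInterval(() => {
      osc.frequency.value = base + (Math.random() - 0.5) * 10;
    }, 250);
  }
}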

Here is an example output using the mountain from above as the input image (the red dots mark the found “features” of the image).

That might not sound terrible, until you realize that the sound is basically the same for any input image:

Mila might be a monster dog, but that output is just too dark :-)

There were a ton of problems with this implementation, to the point that it was actually outright silly. For example, the “feature pixel” selection mostly found edges and corners of the image, and using just 100 pixels as input corresponded to just 0.01% of all available pixels in the test images. Another problem was that the final pixel value was calculated as the plain average of the pixel’s red, green and blue values. Some colors arguably have more impact on the viewer than others, but this fact is not captured by taking the average.
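
One possible fix for that last point (my suggestion, not something the experiment actually did) would be to weight the channels by perceived brightness, for example the standard ITU-R BT.601 luma formula, instead of taking a plain average:

// Perceived brightness (ITU-R BT.601 luma): green contributes most, blue least.
function perceivedBrightness(r: number, g: number, b: number): number {
  return 0.299 * r + 0.587 * g + 0.114 * b;
}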

Even with all its problems, the first experiment was a good first step, considering I had not known where to start at all. It is possible that with a lot of tweaks, lots of new ideas and lots of time, this approach could start producing more interesting soundscapes. The downside, however, was that the music creation would always be guided by the experimenters: the humans. And I wanted to remove them from the equation.

Machine learning to the rescue

The second experiment ended before it even really started. It was clear that some kind of machine learning was needed to move forward, and it seemed that an artificial neural network might be the solution.

This was the idea:

  • Use every pixel of the input image as a single input node of the neural network.
  • Treat every output node as a single sound sample.

For the purposes of this blog post, everything that happens between input and output nodes of the network is largely hidden magic. With that in mind, here is how the network would look (P1 – Pm are the input pixels and S1 – Sn are the output samples):

Photo to sound neural network
Mapping a photo’s input pixels (P1-Pm) to a soundwave output (S1-Sn).

To get an idea of the size of the network, consider this: the mountain test image from above is 1024 by 683 pixels, so the network would have 699,392 input nodes when using images of that size. Digital sound is just a long sequence of amplitude values called samples. The most commonly used sampling rate for music is 44.1 kHz, which means that every second of digital music consists of 44,100 individual samples. For a neural network of this design to produce a five-second sound, it would thus require 220,500 output nodes.
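
As a quick sanity check of those numbers (my own back-of-envelope arithmetic, including the weight count for a hypothetical single dense layer):

const inputNodes = 1024 * 683;      // 699,392 pixels in the mountain test image
const sampleRate = 44100;           // samples per second
const outputNodes = sampleRate * 5; // 220,500 samples for five seconds of sound

// Even one fully connected layer straight from input to output would need
// roughly 1.5e11 weights:
const weights = inputNodes * outputNodes; // ~154,000,000,000

In hindsight, that weight count already hints at the computational wall this story runs into later.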

The intentions were good but the implementation never happened. After having the initial idea, I fired up Python and tried to simply read and write sound files, but it didn’t go so well, and the weekend was nearly over, and, “oh a squirrel!”… and the code was never touched again.

Machine learning is great, but the motivation was suddenly lacking, and the project was put on ice. This was about two years ago, and the project was not revived until quite recently.

AI art

Deep learning has been steadily on the rise in recent years, often outperforming other machine learning techniques in specific areas such as voice recognition, language translation and image analysis. But deep learning is not limited to “practical” use cases. It has also been used to create art.

A well-known example of “AI Art” is Google’s Deep Dream Generator. The software is based on an initial project called DeepDream which produced images based on how it perceived the world internally. Some of its images were even shown and sold at an art exhibition.

The A.I. Duet project shows another interesting use case for deep learning: the creator of the project, Yotam Mann, trained a model that can produce short sequences of piano notes based on the note input of a human. So if I played C-D-E, the software might respond with F-G-A, although the result would most likely be slightly more interesting than that.1

A.I. Duet is impressive, but it still has a big limitation: it only works for specific notes for a specific instrument. So while the result is amazing, what I really want is more complex arrangements and raw audio output. Even so, the above examples show that deep learning is a powerful and versatile machine learning technique, and it is now finally becoming more feasible than ever to achieve the goal of creating music using AI.

Mila drawn with Deep Dream
Although it is not really relevant to this blog post, I could not help myself: here is the Mila image from above processed with default settings of Deep Dream. It is slightly disturbing to see that… chicken?… coming out of her left paw. Thank you for ruining my sleep, robot!

The bleeding edge, where the story ends

While doing some research on the latest state of the art for machine-generated sound, I stumbled upon yet another Google project called WaveNet. In an interesting blog post, the authors of WaveNet discuss how their research can be used to improve text-to-speech quality, but what is really exciting to me is that they also managed to produce short piano sequences that sound natural (there are some examples at the bottom of their blog post).

The big surprise here is that the piano samples are not just based on specific notes. They are raw audio samples generated from a model trained with actual piano music.2

Finally! A tried and tested machine learning technique that produced raw audio. Reading about WaveNet marked the beginning of my final experiment with music generation, and is the entire reason this blog post exists.

I found an open source implementation of WaveNet, and to test the implementation, I wanted to start simple by using just one sound clip. For this purpose, I extracted an eight-second guitar intro from the post-rock track Ledge by Seas of Years3:

My hope was that by training the model on a single sound clip, I would be able to reproduce the original clip, or something very close to it, and thereby validate that the model could produce at least some coherent sound. If that worked, I could then train the model with more sound clips and see what happens.

Unfortunately, even with various tweaks to the network parameters, I could not manage to produce anything other than noise. Sharing an example of the output here is not even appropriate, because it would hurt your ears. The experiment ended with an early failure.

So what was the problem? I soon realized that even with this fairly simple example, I had been overly optimistic about the speed at which I would be able to train the model. I thought that I could train the network in just a few minutes, but the reality was very different.

The first warning sign showed itself pretty quickly: every single step of the training process took more than 30 seconds to complete. In the beginning, I did not think much about this. Some machine learning models actually start producing decent results within the first few steps of training, so I was hoping it would be the same here. However, after doing more research on WaveNet, it became clear that training a WaveNet model did not just require a few learning steps; it required tens of thousands, if not hundreds of thousands. Clearly, training WaveNet on my machine was completely unfeasible, unless I was willing to wait more than a month for any kind of result.
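
A rough back-of-envelope calculation shows the problem (the step count is my assumption, loosely based on the ~200k iterations quoted for NSynth further down):

const secondsPerStep = 30;       // measured on my machine
const steps = 100000;            // assumed; NSynth reports ~200k iterations
const totalDays = (secondsPerStep * steps) / (60 * 60 * 24); // ~35 days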

Where do we go from here?

Machine learning has been rapidly evolving in recent years, propelled by software libraries like TensorFlow, and the technology is more accessible than ever for all kinds of developers. But there is another side to the coin: in order to use the state of the art, we are often required to have massive amounts of computing power at our disposal. This is probably why a lot of high-profile AI research and projects come from companies like Google, Microsoft and IBM: they have the capacity to run machine learning at massive scale. For lone developers like me who just want to test the waters, it can be difficult to get very far because of the complexities of scale.

As a final example to illustrate this point, consider NSynth, an open source TensorFlow model for raw audio synthesis. It is based on WaveNet, and NSynth’s project page says:

The WaveNet model takes around 10 days on 32 K40 gpus (synchronous) to converge at ~200k iterations.
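
To put that in perspective, here is the quoted setup converted into raw GPU-hours (a back-of-envelope figure of my own, not from the NSynth page):

const gpus = 32;
const days = 10;
const gpuHours = gpus * days * 24; // 7,680 GPU-hours to train the model once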

Training a model like that would cost more than $5,000 using Google Cloud resources4. Of course, it is possible that a simpler model could be trained faster and cheaper, but the example still shows that some technologies are most definitely not available to everyone. We live in a time of great access to technological advances, but that availability is often limited in practice because of the scale at which the technologies need to operate.

So where do we go from here? Well, computational complexity killed my AI post-rock for now, but I doubt it will take long before significant progress is made in this field. For now, I will enjoy listening to human-generated music. In a way, it is reassuring that machines cannot outperform us in everything yet.


  1. The video explaining how A.I. Duet works is quite good. 

  2. Describing how WaveNet works is beyond the scope of this blog post, but the original paper for WaveNet is not terribly difficult to read (unlike most other AI research). 

  3. Seas of Years’s album The Ever Shifting Fields was one of my favorite post-rock albums of 2016. I recommend a listen. 

  4. I used Google’s pricing calculator with 4 machines, each with 8 GPU cores. 

Privacy Leak “By Design”

I had an interesting chat with Intercom support about what I perceived to be a security and privacy hole in their support messenger app, but it turned out that what I thought should be a great concern for them was happening “by design”.

Intercom is a popular customer relations tool, and one of their cool features is the chat messenger app. It adds a little chat icon to the bottom-right of a website and allows real-time chat with customers for help and support. We use it at Receiptful, where it allows us to chat directly with our users while they are signed in to our app. It looks like this:

Intercom in Receiptful

Chats are not private

A few days ago, I was using the Intercom chat app on a website that hosts some of our data. I needed to update some basic settings for our account and asked for help using the Intercom chat while I was signed in to the service. A common use case for the Intercom chat is to allow support for both anonymous and signed-in users. What I found out is that there is no distinction between the two by default.

When I signed out from the website, I noticed that my private chat session was still visible in the “anonymous” chat window. Even after restarting the browser and without signing in to the service, my private chat session was visible.

In other words: If I was on a shared computer, the next person using the browser would be able to see my private chat sessions, even though I signed out from the service where I had the chat in the first place.

Next, I tried to do the same thing on the Intercom website, and it was the same deal: All previous announcements and private chats were visible from their front page without me signing in:

Intercom chat

“This is, in fact, by design”

When I noticed that my private support chats were leaking into the anonymous part of their website, I reported it to Intercom as a possible security hole, because I did not think it was intentional for private chats to remain visible while signed out. This is the response from Intercom support:

This is, in fact, by design. We track users using an anonymous cookie, and when they logout that cookie still exists, so we can use that to keep the conversations in the messenger. I think your concern though is interesting, and I’ll forward this as feedback to our Messenger team.

If you’d like to ensure that others won’t see the conversations, I recommend clearing your cookies with us after logging out.

Apologies for the confusion there, it’s clear that sometimes what we think is a good idea isn’t always agreed upon by others.

No kidding…

So the privacy leak is “by design” and I have to remember to clear all my cookies to avoid it. What a joke. Imagine having a private chat on Facebook that was still visible after signing out. That would be quite horrible. Intercom clearly does not see their support chat system as a private conversation, although it most certainly is. In the chats, both my real name and email are used, and what is even worse: I can create a new conversation using the same chat window, thereby impersonating whoever was the last one to use the system.

Now, to be fair, there is a documented API call, Intercom('shutdown'), which clears the user cookie and resets the state of Intercom. However, Intercom does not even use this API themselves, and I cannot imagine many websites that do. So leaked chats are probably quite common.
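
For reference, here is a minimal sketch of what using it could look like in a web app’s sign-out handler. The endAppSession function is hypothetical and stands in for whatever sign-out logic the app already has; Intercom('shutdown') is the documented call mentioned above:

// Hypothetical placeholder for the app's own sign-out logic.
declare function endAppSession(): Promise<void>;

async function onLogout(): Promise<void> {
  await endAppSession();
  // Clears Intercom's visitor cookie and resets the messenger, so the next
  // person using this browser does not see the previous conversations.
  (window as any).Intercom("shutdown");
}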

The bigger picture

I think what really bothered me is that I already knew what Intercom would say when I reported the issue. Before I got the above answer from Intercom, I wrote this message to my colleagues:

It's been reported to Intercom. Now awaiting the response "that's by design". I ​_know_​ that will be their response because that's the world we live in these days. Where you can close your browser and then someone else can view your data afterwards. Noone cares about these things. Sigh...

The problem with lack of privacy is systemic. In this case with Intercom, usability won over privacy. They thought it was a “good idea” to keep chat windows open even after the user had signed out of their service, and in most cases this decision does not present a problem, as long as the user is not on a shared computer. But by asking the questions “should private chats be visible after the user signs out?”, “what if the user is on a shared computer?” and “how does this relate to the privacy of our users?”, I think they would have arrived at a different conclusion.

As developers in a world of increasing surveillance, we need to ask ourselves questions about privacy when developing our solutions. And if there is an obvious case of private information leaking to a non-secured area, we should most definitely not consider it to be “by design”.


For full transparency, here is a copy of my support chat with Intercom.

Historical design decisions and consequences

Douglas Crockford discovered JSON, and he also wrote a reference implementation for JSON in Java. One of the great stories I heard him share was on Hanselminutes episode 396: he was contacted by some developers who were getting syntax errors while parsing JSON using his reference implementation. It turned out that they were transferring JSON documents more than 2GB in size, but the Java implementation was using a 32-bit integer to keep track of the number of characters in the JSON document. The maximum value of a 32-bit int in Java is:

2^31 - 1 = 2,147,483,647 ≈ 2 billion = 2G

So because 2GB of data contains more characters than can be counted in a 32-bit integer, that was obviously a problem. As Douglas Crockford says in the podcast, he had no idea that anyone would ever use JSON to store so much data, yet it happened anyway, and all of a sudden it was actually a bug.
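
The wraparound behaviour of signed 32-bit arithmetic is easy to demonstrate. Here is a small illustration (using JavaScript’s 32-bit bitwise coercion, not Crockford’s Java code, but the limit is the same):

const MAX_INT32 = 2 ** 31 - 1;     // 2,147,483,647
console.log(MAX_INT32 + 1);        // 2147483648 as a regular JS number
console.log((MAX_INT32 + 1) | 0);  // -2147483648: a signed 32-bit counter wraps around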

Historical design decisions and their consequences for modern-day computing are very interesting, and I recommend listening to the entire Hanselminutes show through the link above. I just wanted to share a recent example of the same kind of problem. Here is a screenshot from the landing page for Rocksmith (a Guitar Hero kind of game, but with a real guitar):

Rocksmith splash screen

The interesting part of this screenshot is the highest arcade score. It is exactly the maximum value of a 32-bit int (2,147,483,647). I do not think that is a coincidence. I think the Rocksmith developers simply did not expect that anyone would ever get more than 2 billion points in the arcade game, and that was probably a fair guess. I could be wrong of course, but it is still a good reminder that even small decisions, like choosing a datatype to keep track of a score, have consequences for the software we produce.

Ubuntu — not ready for primetime

I wanted to install Ubuntu on my Dell XPS 15 to try out Steam for Linux. This was not the enjoyable experience I had hoped for since a lot of things did not work perfectly out of the box. Below are some steps I had to take to get the system going.

Fixing the graphics

My laptop has NVIDIA Optimus technology, which automatically switches between Intel’s HD 4000 graphics and the faster NVIDIA GeForce GFX 640. Apparently, Optimus support on Linux is not good.

In Ubuntu, I had no 3D support, and the graphics would spontaneously turn off after a restart, so I was presented with only the terminal. Fortunately, some nice people maintain a project called Bumblebee, which adds support for Optimus in Linux. After installing this, my graphics system has been fairly stable. Just do this:

sudo add-apt-repository ppa:bumblebee/stable
sudo add-apt-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install bumblebee bumblebee-nvidia linux-headers-generic

I also recommend the primus package, which provides the primusrun command:

sudo apt-get install primus

With the above installed, it is possible to run programs specifically with the NVIDIA card like so:

optirun glxspheres
primusrun glxspheres

Fixing the mouse

Yes, the mouse did not work. Well, the touchpad worked, but my wireless Logitech M705 mouse did not. The problem, it turned out, was the Logitech Unifying Receiver: a small USB dongle used by many Logitech devices, such as wireless mice and external keyboards. After searching for many hours, somewhere on some forum, I found the following simple command-line trick:

#!/bin/bash
# Keep reloading the Logitech receiver's driver module until it comes up
# without "failed with error -32" in dmesg.
while :; do
  dmesg | grep logitech-djreceiver | tail -1 | grep -q -c "failed with error -32" || exit
  echo -n "$(date) Driver Reload"
  rmmod hid_logitech_dj
  modprobe hid_logitech_dj
  dmesg | grep logitech-djreceiver | tail -1
  sleep 1
done

You can also find it as a github gist here.

The script simply keeps reloading the receiver’s driver module with rmmod and modprobe until it succeeds. Sometimes after one loop, sometimes after ten. And it is a pain in the ass to run it at every startup.

Getting Steam to work

The real reason I wanted to try Ubuntu again was the recently released Steam for Linux client. After installing Bumblebee, Steam actually installed and ran quite well. However, it is worth taking a look at this guide for running programs with optirun/primusrun.

Conclusion

In the above, I left out the fact that I had to reinstall Ubuntu three times because of playing around with graphics drivers that broke the system, before I finally found out about the Bumblebee project. This is definitely something most users would not want to mess around with. Not only that, but my mouse is still not working after a restart, and sometimes I am still greeted with the terminal login instead of a graphical login. It is quite random, actually.

I should also note that I have had similar experiences with Ubuntu in the past. I love Linux, but it just does not work like Windows or Mac. As soon as you are faced with a weird hardware problem, good luck fixing it without the command line!

Therefore, I have to recommend not installing Ubuntu at the current time, at least if you have a dual graphics setup with Optimus technology or are not willing to spend hours trying to fix things. It is a big shame, because the Linux platform, and Ubuntu in particular, shows great promise. But it is not for everyone. It is not ready for primetime.

Why you should not use Microsoft Silverlight for your next web application

You are running a Silverlight 4 application. You may experience incompatibilities as Moonlight does not have full support for this runtime yet.

The above warning message sums up everything I dislike about proprietary web technologies. Silverlight is a fairly new Microsoft technology from 2007, and Moonlight is its open source, not-quite-up-to-date equivalent that I have to use because Microsoft does not provide Silverlight for Linux. Since most software only runs on Windows anyway, why is this so upsetting? Well, everything is different on the web.

First, the premise for Rich Internet Applications (RIAs) like Silverlight, Flash and even Java is OK: a website usually does not provide much “action” in itself, so having an extra plugin running on the website, with some access to the underlying system resources as well as built-in extra functionality, will provide a better user experience. Microsoft writes on the Silverlight website:

Silverlight is a powerful development platform for creating engaging, interactive user experiences for Web, desktop, and mobile applications when online or offline

That sounds good, and I do not necessarily disagree with the premise for RIAs, but the question is: Why did we need another RIA platform that doesn’t really work? I don’t exactly know. However, it is understandable from Microsoft’s viewpoint, given their history of locking people in to their platforms and the fact that they have legions of .NET developers who are probably very comfortable staying in their own environment when writing web applications. For them, Silverlight is probably bliss.

But there are some problems, one of them being its availability. As of this writing, Silverlight is only supported in roughly 61% of all browsers, according to statowl, or roughly 69%, according to riastats. As the above warning message suggests, it does not work perfectly on (my version of) Linux, even with the newest Chrome browser and Ubuntu 11.04, arguably the best supported Linux version out there. Indeed, most of the Silverlight applications that I have looked at did not work very well, including an app that I was offered to work on (I politely declined the offer with the explanation that I was busy at the moment, but there were also ideological reasons for not working on the app, which should be apparent from this post) and popular services such as Netflix, which currently does not work with Moonlight.

Again, the question arises: Why is this so bad? Isn’t this the same as always? No, it is a problem because the beauty of the web is its openness, and this openness is what Microsoft is challenging. The same applies to Flash. You may call Apple snobbish and manipulative if you like, but Steve Jobs does have a point regarding Flash as a platform:

Most Flash websites will need to be rewritten to support touch-based devices. If developers need to rewrite their Flash websites, why not use modern technologies like HTML5, CSS and JavaScript? … New open standards created in the mobile era, such as HTML5, will win on mobile devices (and PCs too). Perhaps Adobe should focus more on creating great HTML5 tools for the future …

In my opinion, this applies to Silverlight as well. To be fair, Microsoft is embracing standards like HTML5 in Internet Explorer 9, but at the same time they continue to push Silverlight forward. Some people think that this is not a problem since HTML5 and Silverlight are not direct competitors. But because of Microsoft’s power, Silverlight applications are shooting up everywhere, which cuts off some users from certain services on the Internet, something we have not seen at the same level with Flash. And this is a troubling development.

There is only one way to avoid Silverlight dominating the next decade of web applications as Flash did the last: Stop developing applications for it. My choice should now be clear. But I fear that I am almost alone.