Generating music with deep learning

Automatic, machine-generated music has been a small interest of mine for some time now. A few days ago, I tried out a deep learning approach for generating music… and failed miserably. Here’s the story about my efforts so far, and how computational complexity killed the post-rock.

The spark of an idea

When Photo Amaze was created in 2014, I thought it would be fun to have some kind of ambient music playing while navigating through the 3D maze. But I did not want to play pre-recorded music. I wanted it to be automatically generated on-the-fly, based on the contents of the pictures in the maze.

That was the spark. A picture is worth a thousand words, so why can’t it be worth a few seconds of music as well? For example, take a look at this picture:

Mountain with running water in the foreground

Looking at it, I can imagine sounds that would match its emotional impact, like the ambient sound of a running water stream or the whistle of the wind picking up speed over the mountain.

Since I can make a connection between photo and music, perhaps a machine could do this automatically as well. This is not a novel idea, but it is a nut that has yet to be cracked, and it was an intriguing idea to start exploring.

Hard-coded music mappings

Modern browsers have all the ingredients necessary for doing both image analysis and sound generation. There are numerous JavaScript libraries for analyzing and manipulating pictures, and the Web Audio API makes it possible to create synthesized sound in a fairly straightforward way. Thus, it made sense to start here.

The first experiment I did was to create a more-or-less fixed mapping between an image’s content and some kind of sound output. The high-level idea of the implementation was to simply map brighter colors to brighter sound notes. The steps to produce the output sound were something like this:

  • Find at most 200 “feature pixels” in the input image using tracking.js.
  • For each found “feature pixel”:
    • Calculate the average of the pixel’s three RGB color values. This produces a single number per pixel between 0 and 255.
    • Map that value from the 0-255 range to the 20-500 range. This becomes the base frequency (in Hz) for the output sound.
    • Create a sine wave oscillator at that frequency using the Web Audio API.
    • Combine the oscillators into a single sound output.
    • While playing the sound, randomize the frequency of each oscillator slightly over time.

Using this approach, an image would be turned into a randomly changing output sound consisting of about 200 sine waves, each with a frequency between 20 and 500 Hz.
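
A minimal sketch of that pipeline in TypeScript might look something like the snippet below. It assumes the feature points have already been found with tracking.js and collected as RGB triples; the names and parameters are illustrative, not the original code.

// One sine wave oscillator per "feature pixel", with brightness mapped to frequency.
type RGB = [number, number, number];

function toFrequency([r, g, b]: RGB): number {
  const brightness = (r + g + b) / 3;          // average value, 0-255
  return 20 + (brightness / 255) * (500 - 20); // map to 20-500 Hz
}

function playImage(features: RGB[]): void {
  const ctx = new AudioContext();
  const master = ctx.createGain();
  master.gain.value = 1 / features.length;     // avoid clipping when summing ~200 waves
  master.connect(ctx.destination);

  for (const rgb of features) {
    const base = toFrequency(rgb);
    const osc = ctx.createOscillator();
    osc.type = "sine";
    osc.frequency.value = base;
    osc.connect(master);
    osc.start();

    // Randomize each oscillator's frequency slightly over time.
    setInterval(() => {
      osc.frequency.value = base * (0.98 + Math.random() * 0.04);
    }, 250);
  }
}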

Here is an example output using the mountain from above as the input image (the red dots mark the found “features” of the image).

That might not sound terrible, until you realize that the sound is basically the same for any input image:

Mila might be a monster dog, but that output is just too dark :-)

There were a ton of problems with this implementation, to the point that it was actually outright silly. For example, the “feature pixel” selection mostly found edges and corners of the image, and using at most 200 pixels as input covered only a tiny fraction of all the available pixels in the test images. Another problem was how the final pixel value was calculated as the plain average of the pixel’s red, green and blue values. Some colors arguably have more impact on the viewer than others, but that is not captured by a plain average.
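
Had the experiment continued, one common way to account for that would have been a perceptually weighted average instead of the plain mean. This is just an illustration of the idea, not something the experiment actually used:

// The plain average treats red, green and blue as equally important.
const average = (r: number, g: number, b: number) => (r + g + b) / 3;

// A luma weighting (the standard Rec. 601 coefficients) gives green the most
// weight, roughly matching how bright the colors appear to the human eye.
const luma = (r: number, g: number, b: number) => 0.299 * r + 0.587 * g + 0.114 * b;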

Even with all its problems, the first experiment was a good first step, considering I did not know where to start before. It is possible that with a lot of tweaks, new ideas and time, this approach could start producing more interesting soundscapes. However, the approach had a fundamental downside: the music creation would always be guided by the experimenters, the humans. And I wanted to remove them from the equation.

Machine learning to the rescue

The second experiment ended before it even really started. It was clear that some kind of machine learning was needed to move forward, and it seemed that an artificial neural network might be the solution.

This was the idea:

  • Use every pixel of the input image as a single input node of the neural network.
  • Treat every output node as a single sound sample.

For the purposes of this blog post, everything that happens between input and output nodes of the network is largely hidden magic. With that in mind, here is how the network would look (P1 – Pm are the input pixels and S1 – Sn are the output samples):

Photo to sound neural network
Mapping a photo’s input pixels (P1-Pm) to a soundwave output (S1-Sn).

To get an idea of the size of the network, consider this: the mountain test image from above is 1024 by 683 pixels, so the network would have 699,392 input nodes when using images of that size. Digital sound is just a collection of amplitudes in very tiny pieces called samples. The most commonly used sampling rate for music is 44.1 kHz, which means that every second of digital music consists of 44,100 individual samples. For a neural network of this design to produce a five-second sound, it would thus require 220,500 output nodes.
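
To put those numbers in code, here is a back-of-the-envelope calculation. The single fully connected layer is my own simplifying assumption, purely to illustrate the scale; the real architecture between input and output is the “hidden magic” mentioned above.

// Back-of-the-envelope size of the imagined photo-to-sound network.
const inputNodes = 1024 * 683;       // one node per pixel -> 699,392
const sampleRate = 44100;            // samples per second of audio
const outputNodes = sampleRate * 5;  // five seconds of sound -> 220,500

// If input and output were joined by a single fully connected layer, that
// layer alone would need inputNodes * outputNodes weights -- roughly 1.5e11.
const weights = inputNodes * outputNodes;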

The intentions were good but the implementation never happened. After having the initial idea, I fired up Python and tried to simply read and write sound files, but it didn’t go so well, and the weekend was nearly over, and, “oh a squirrel!”… and the code was never touched again.

Machine learning is great, but the motivation was suddenly lacking, and the project was put on ice. This was about two years ago, and the project was not revived until quite recently.

AI art

Deep learning has been steadily on the rise in recent years, often outperforming other machine learning techniques in specific areas such as voice recognition, language translation and image analysis. But deep learning is not limited to “practical” use cases. It has also been used to create art.

A well-known example of “AI Art” is Google’s Deep Dream Generator. The software is based on an initial project called DeepDream which produced images based on how it perceived the world internally. Some of its images were even shown and sold at an art exhibition.

The A.I. Duet project shows another interesting use case for deep learning: the creator of the project, Yotam Mann, trained a model that can produce short sequences of piano notes based on the note input of a human. So if I played C-D-E, the software might respond with F-G-A, although the result would most likely be a bit more interesting than that.1

A.I. Duet is impressive, but it still has a big limitation: it only works with specific notes for a specific instrument. So while the result is amazing, what I really want is more complex arrangements and raw audio output. Even so, the above examples show that deep learning is a powerful and versatile machine learning technique, and it is now more feasible than ever to pursue the goal of creating music with AI.

Mila drawn with Deep Dream
Although it is not really relevant to this blog post, I could not help myself: here is the Mila image from above processed with default settings of Deep Dream. It is slightly disturbing to see that… chicken?… coming out of her left paw. Thank you for ruining my sleep, robot!

The bleeding edge, where the story ends

While doing some research on the latest state of the art for machine-generated sound, I stumbled upon yet another Google project called WaveNet. In an interesting blog post, the authors of WaveNet discuss how their research can be used to improve text-to-speech quality, but what is really exciting to me is that they also managed to produce short piano sequences that sound natural (there are some examples at the bottom of their blog post).

The big surprise here is that the piano samples are not just based on specific notes. They are raw audio samples generated from a model trained with actual piano music.2

Finally! A tried and tested machine learning technique that produced raw audio. Reading about WaveNet marked the beginning of my final experiment with music generation, and is the entire reason this blog post exists.

I found an open source implementation of WaveNet, and to test the implementation, I wanted to start simple by using just one sound clip. For this purpose, I extracted an eight-second guitar intro from the post-rock track Ledge by Seas of Years3:

My hope was that by training the model on this single clip, it would effectively memorize it, and I would be able to reproduce the original (or something very close to it) to validate that the model could produce at least some sound. If that worked, I would train the model with more sound clips and see what happened.

Unfortunately, even with various tweaks to the network parameters, I could not manage to produce anything other than noise. Sharing an example of the output here is not even appropriate, because it would hurt your ears. The experiment ended with an early failure.

So what was the problem? I soon realized that even with this fairly simple example, I had been overly optimistic about the speed at which I would be able to train the model. I thought that I could train the network in just a few minutes, but the reality was very different.

The first warning sign showed itself pretty quickly: every single step of the training process took more than 30 seconds to complete. In the beginning, I did not think much about this. Some machine learning models start producing decent results within the first few steps of training, so I was hoping it would be the same here. However, after doing more research on WaveNet, it became clear that training a WaveNet model does not just require a few learning steps; it requires tens of thousands of them. Clearly, training WaveNet on my machine was completely infeasible unless I was willing to wait more than a month for any kind of result.
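
A rough back-of-the-envelope check makes the problem obvious. The step count below is a hypothetical assumption on my part (the NSynth quote further down mentions ~200k iterations for a full model):

// Rough training-time estimate on my machine, using the measured step time
// and a hypothetical 100,000 training steps.
const secondsPerStep = 30;
const steps = 100000;                                   // assumption
const days = (secondsPerStep * steps) / (60 * 60 * 24); // about 35 days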

Where do we go from here?

Machine learning has been rapidly evolving in recent years, propelled by software libraries like TensorFlow, and the technology is more accessible than ever for all kinds of developers. But there is another side of the coin: in order to use the state of the art, we are often required to have massive amounts of computing power at our disposal. This is probably why a lot of high-profile AI research and projects come from companies like Google, Microsoft and IBM: they have the capacity to run machine learning at massive scale. For lone developers like me who just want to test the waters, it can be difficult to get very far because of the complexities of scale.

As a final example to illustrate this point, consider NSynth, an open source TensorFlow model for raw audio synthesis. It is based on WaveNet, and the NSynth project page says:

The WaveNet model takes around 10 days on 32 K40 gpus (synchronous) to converge at ~200k iterations.

Training a model like that would cost more than $5,000 using Google Cloud resources4. Of course, it is possible that a simpler model could be trained faster and cheaper, but the example still shows that some technologies are most definitely not available for everyone. We live in a time where there is great access to many technological advances, but the availability is often limited in practice, because of the scale at which the technologies need to operate.
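
Coming back to that $5,000 estimate, a quick sanity check looks like this. The hourly GPU rate is my assumption; the real price depends on machine type, region and discounts:

// 10 days on 32 GPUs, at an assumed ~$0.70 per GPU-hour.
const gpuHours = 10 * 24 * 32;                          // 7,680 GPU-hours
const assumedRatePerGpuHour = 0.70;                     // USD, assumption
const estimatedCost = gpuHours * assumedRatePerGpuHour; // about $5,400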

So where do we go from here? Well, computational complexity killed my AI post-rock for now, but I doubt that it will take long before significant progress is made in this field. For now, I will enjoy listening to human-generated music. In a way, it is reassuring that machines cannot outperform us in everything yet.


  1. The video explaining how A.I. Duet works is quite good. 

  2. Describing how WaveNet works is beyond the scope of this blog post, but the original paper for WaveNet is not terribly difficult to read (unlike most other AI research). 

  3. Seas of Years’s album The Ever Shifting Fields was one of my favorite post-rock albums of 2016. I recommend a listen. 

  4. I used Google’s pricing calculator with 4 machines, each with 8 GPU cores. 

A Learning Plateau

Ancient Atlantis
Ancient Atlantis by European Space Agency (CC-BY-SA)

Anyone can play guitar and they won’t be a nothing anymore Radiohead

So you want to learn how to play guitar? Awesome!

Step one, get yourself a guitar. Great, you took the first step!

Step two, start playing any pop song ever. Great, now you know chords!

Step three, check out some cool and/or classic guitar riffs. Great, now you can read tabs!

Step four, learn a bit of music theory and realize that the pentatonic scale always sounds good. Great, now you can improvise!

Congratulations, you can now play guitar. This is your progress so far in ASCII art:

      |
      |    /
      |   /
skill |  /
      | /
      |/
      +----------------------------
                 time

Finally, we have step five: Practice and hone your skills. Oh shoot, the progress now feels like this:

      |     ___________________
      |    /
      |   /
skill |  /
      | /
      |/
      +----------------------------
                 time

I call this the learning plateau, and it seems that when trying to learn new skills, arriving at the learning plateau is an inevitable part of the process.

It goes like this: In the beginning, the learning curve rises steadily, but at some point it starts to feel like the curve is flattening out and the improvements become smaller and smaller. While this is not inherently a bad thing, it gets frustrating when you know that you can improve, but there is no clear indication of progress.

When it comes to playing the guitar, knowing all the scales, chords and music theory there is to know will not by itself make your playing faster or your riff improvisations cooler. Improving that particular subset of skills takes far longer than learning a new chord. At this point, it is easy to start making excuses for practicing less or even to give up entirely. I kind of did that…

Bend the curve

So how does one deal with being “stuck” on the learning plateau for a particular skill? To be honest, I don’t really know, but here are some thoughts:

Decide whether it is actually an important priority to improve the skill.

For example, is it really important for me that I improve my guitar play? Will I see any long-term benefits from putting in that effort or am I content with the current skill-set that I have obtained?

It seems silly to continue pushing forward if it does not feel “important” to do so. However, in my own experience, what feels like lost interest is sometimes just a reaction to facing big difficulties rather than an actual loss of interest, and that is a bad excuse for quitting :-) This leads to:

Realize that it is always possible to improve.

Shake off that fixed mindset and start nurturing your growth mindset. For example, learn to play that guitar riff at 120 beats per minute rather than 110 — not a huge improvement, but it is an improvement. And playing the guitar just a bit faster sometimes opens the door to different genres of music or makes improvisation more interesting. Small improvements add up.

Form productive habits.

It is no secret that the people who get very good at their trade (whether it is art, business, entrepreneurship, etc.) put in a lot of deliberate practice, directly or indirectly. Overnight success is largely a myth, and even the most talented musicians practice several hours per day. I recently stumbled upon an inspiring quote from Andrew Ng, a prolific and well-known figure in the machine learning field:

When I talk to researchers, when I talk to people wanting to engage in entrepreneurship, I tell them that if you read research papers consistently, if you seriously study half a dozen papers a week and you do that for two years, after those two years you will have learned a lot. This is a fantastic investment in your own long term development.

But that sort of investment, if you spend a whole Saturday studying rather than watching TV, there’s no one there to pat you on the back or tell you you did a good job. Chances are what you learned studying all Saturday won’t make you that much better at your job the following Monday. There are very few, almost no short-term rewards for these things. But it’s a fantastic long-term investment. This is really how you become a great researcher, you have to read a lot.

People that count on willpower to do these things, it almost never works because willpower peters out. Instead I think people that are into creating habits — you know, studying every week, working hard every week — those are the most important. Those are the people most likely to succeed. Andrew Ng

The question is: How the hell do successful people get motivated, and how do they stay consistent? Andrew Ng seems to suggest that forming habits is important, and this is a topic I have only recently started researching in more detail. A blog post about forming identity-based habits was a good starting point, for me at least.

Take the first step.

It sounds almost like a cliché, but taking the first small steps towards a specific goal is important. It is also a good place to end this post. It is my first step into a hopefully more creative 2016. Happy New Year and thank you for reading :-)

Share your art

Frisbjär stone... art?

A few days ago, I talked to a guy that plays the piano. He uses YouTube tutorials to learn new songs and told me that he would like to record himself playing the piano. I asked him if he would put the recordings online somewhere, and his reply was one I have heard often before:

“Yes, if they are good.”

I believe in releasing as much art as possible, even if the artist does not think that their art is “good”. I have often heard people claim that there is “too much bad art” out there so it is difficult to find the “good” stuff. I strongly disagree with this point of view.

Consider this: A toddler draws their first doodle and proclaims: “Look, it’s mommy and daddy!”. This example illustrates my two main points for this post:

  1. “Good” is highly subjective. To most people, the toddler drawing will probably look like randomly connected lines. To the parents, the drawing marks an important moment in the child’s development.
  2. “Good” is often a result of unnecessary self-imposed criticism. The toddler does not have this. They really do think that their drawing looks like mommy and daddy.

While the first point is commonly stated as “people have different tastes”, we do not talk enough about the second point, even though it is widespread in our society.

As we age, self-criticism often increases. This is not inherently a bad thing. After all, it is important to improve our skills, and a healthy amount of self-criticism and feedback can help with that. However, it becomes a problem when we start giving up completely. I think there are several reasons why this happens, and they are often social. We institutionalize criticism at an early age in our schools, and we glorify high-performing individuals to the point where it feels like anything below that level is not worth doing at all. And finally, there is always a horde of “critics” who will gladly tell you when they do not like something.

When it comes to art, the focus is often on the end result, not the journey and the story behind the art. I am not saying that we should release everything we create, but too much self-imposed criticism does no good for anyone, least of all ourselves. For example, creating a full piano cover of a song is a major achievement, and although it might not get a million likes on YouTube, it is still worth sharing, even if just one other person listens to it.

This is a wonderful time to share art, especially digital art. Having a blog is great for writing, SoundCloud is great for sharing music, Flickr is an excellent photo-sharing app, etc. So get your stuff out there. If you enjoyed creating it, other people might enjoy it too.

Copying is not stealing, period

A copy is just a copy
A copy is just a copy by kioan

Taylor Swift recently received a lot of media attention for pulling her music off Spotify. I am not going to comment specifically on Swift’s decision to pull the music, but I would like to take a look at the following quote from Swift in an interview with Yahoo:

I felt like I was saying to my fans, ‘If you create music someday, if you create a painting someday, someone can just walk into a museum, take it off the wall, rip off a corner off it, and it’s theirs now and they don’t have to pay for it.’ Taylor Swift

Comparing physical art with digital art in this way is like comparing apples and skyscrapers. Does Swift really think that vandalizing and stealing part of a painting is the same as streaming a song? I hope not. The painting is a physical object and it is unique. The bits and bytes of a song are not unique. If I rip off the corner of your painting, it is not the same painting anymore. If I stream your song on Spotify, the song is still the same. If your painting is stolen, you do not have it anymore. If I copy your song, we both have the same song.

A digital copy is a perfect copy — identical to the original. Analogies like Swift’s convey the wrong message about streaming and it sounds very similar to the old music industry slogan that “copying is stealing”. But let’s be perfectly clear about something: Copying is not stealing.

Copying, pirating, streaming or whatever may or may not be a bad thing, but we cannot and should not use physical analogies to describe the act of copying or streaming. It is very disappointing that Taylor Swift is perpetuating the industry’s traditional discourse when talking about digital art.

Spotify royalties

Spotify
Spotify by Blixt A.

Spotify is a cool service but I do not agree with how they pay out royalties. In this post, I will propose a different way.

The current royalty calculation is explained by Spotify like this: There is a big chunk of money (the revenue) and each artist is paid according to their global “market share”. The share is calculated by taking the number of artist streams and dividing it by the total number of streams on Spotify.1

On the surface this looks like a good thing because everyone is paying for everyone. But the problem is that the equation does not account for how much each user actually uses Spotify. Sometimes I can go days without using Spotify, and every second I am not using the service, the market shares of the artists I listen to go down relative to users who use Spotify more than I do. For example, if I stream two Radiohead tracks during one month and another user streams eight tracks from Justin Bieber, the market share for Justin Bieber will be four times higher than Radiohead’s, simply because the other user uses Spotify more.

I think this is an unfair way of distributing royalties, and I am not the first one to say so.2 Instead of calculating a global market share for each artist, I propose calculating each artist’s market share as the average of that artist’s per-user market shares.

So instead of:

for each artist:
  market_share = artist.streams / total_streams

I propose:

for each artist:
  market_share_sum = 0

  for each user:
    market_share_sum +=
      user.artist.streams / user.total_streams

  market_share = market_share_sum / number_of_users
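
For the curious, here is a runnable sketch of the proposed calculation in TypeScript. The data layout is made up for illustration; Spotify’s real data model is obviously richer than this:

// Per-user market share: each user's listening sums to 1 and is then averaged.
type StreamCounts = Map<string, Map<string, number>>; // user -> artist -> streams

function marketShares(streams: StreamCounts): Map<string, number> {
  const shares = new Map<string, number>();
  const numberOfUsers = streams.size;

  for (const artistCounts of streams.values()) {
    const userTotal = [...artistCounts.values()].reduce((a, b) => a + b, 0);
    for (const [artist, count] of artistCounts) {
      // Each user contributes their own fraction of listening to the artist.
      const previous = shares.get(artist) ?? 0;
      shares.set(artist, previous + count / userTotal / numberOfUsers);
    }
  }
  return shares;
}

// The example below: I stream Radiohead twice, someone else streams
// Justin Bieber eight times -- both artists end up with a 50% share.
const example: StreamCounts = new Map([
  ["David", new Map([["Radiohead", 2]])],
  ["someone", new Map([["JustinBieber", 8]])],
]);
console.log(marketShares(example)); // Radiohead: 0.5, JustinBieber: 0.5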

Spotify’s actual calculation is probably more complicated than their explanation suggests, but I do not think the proposed change is unreasonable. Let us see how it fixes the market share bias from the example above.

Old market share calculation:

Radiohead.streams = 2
JustinBieber.streams = 8
total_streams = 10

Radiohead.market_share =
  Radiohead.streams / total_streams = 
  2 / 10 = 20%
JustinBieber.market_share =
  JustinBieber.streams / total_streams =
  8 / 10 = 80%

New market share calculation:

David.Radiohead.streams = 2
someone.JustinBieber.streams = 8
total_streams = 10

Radiohead.market_share =
  (David.Radiohead.streams / David.total_streams +
   someone.Radiohead.streams / someone.total_streams)
  / number_of_users = 
  (2/2 + 0/8) / 2 = 50%
JustinBieber.market_share =
  (David.JustinBieber.streams / David.total_streams +
   someone.JustinBieber.streams / someone.total_streams)
  / number_of_users = 
  (0/2 + 8/8) / 2 = 50%

The two artists now have an equal market share. The reason I think this is fair is that it values our listening preferences equally, not the time we spend listening.

I love Spotify and have been a happy (paying) customer for almost two years. The 99 SEK per month price means that I have spent more money on music in the last two years than I did in the ten years before that and I am sure I am not alone. Spotify says that about 70% of their revenue is paid to artists and rights holders so to me, it seems like a win for the industry. But I hope they redo their royalty calculation. Until then, my limited usage does not warrant a premium account. I really don’t want to support Justin Bieber while I’m sleeping.


  1. http://www.spotifyartists.com/spotify-explained/#how-we-pay-royalties-overview 

  2. The problem has also been discussed on Hacker News.