Generating music with deep learning

Automatic, machine-generated music has been a small interest of mine for some time now. A few days ago, I tried out a deep learning approach for generating music… and failed miserably. Here’s the story about my efforts so far, and how computational complexity killed the post-rock.

The spark of an idea

When Photo Amaze was created in 2014, I thought it would be fun to have some kind of ambient music playing while navigating through the 3D maze. But I did not want to play pre-recorded music. I wanted it to be automatically generated on-the-fly, based on the contents of the pictures in the maze.

That was the spark. A picture is worth a thousand words, so why can’t it be worth a few seconds of music as well? For example, take a look at this picture:

Mountain with running water in the foreground

When I look at this picture, I can imagine sounds with a matching emotional impact, like the ambient sound of a running water stream or the whistle of the wind picking up speed over the mountain.

Since I can make a connection between photo and music, perhaps a machine could do this automatically as well. This is not a novel idea, but it is a nut that has yet to be cracked, and it was an intriguing idea to start exploring.

Hard-coded music mappings

Modern browsers have all the ingredients necessary for doing both image analysis and sound generation. There are numerous JavaScript libraries for analyzing and manipulating pictures, and the Web Audio API makes it possible to create synthesized sound in a fairly straightforward way. Thus, it made sense to start here.

The first experiment I did was to create a more-or-less fixed mapping between an image’s content and some kind of sound output. The high-level idea of the implementation was to simply map brighter colors to brighter sound notes. The steps to produce the output sound were something like this:

  • Find at most 200 “feature pixels” in the input image using tracking.js.
  • For each “feature pixel” found:
    • Calculate the average of its three RGB color values. This produces a single number between 0 and 255.
    • Map that value from the 0-255 range to the 20-500 Hz range. This produces the base frequency for that pixel.
    • Create a sine wave oscillator at that frequency using the Web Audio API.
  • Combine all oscillators into a single sound output.
  • While playing the sound, randomize the frequency of each oscillator slightly over time.

Using this approach, an image would be turned into a randomly changing output sound consisting of about 200 sine waves, each with a frequency between 20 and 500 Hz.
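To make that concrete, here is a minimal sketch of the mapping using the Web Audio API. This is not the original code: the feature detection is assumed to have happened already (with tracking.js), and featurePixels is a made-up input of RGB values.

```js
// Minimal sketch: turn "feature pixels" into slowly wobbling sine waves.
// featurePixels is assumed to be an array of {r, g, b} objects found
// by a feature detector such as tracking.js.
function playImageSound(featurePixels) {
  const ctx = new AudioContext();
  const master = ctx.createGain();
  master.gain.value = 1 / featurePixels.length; // keep the sum from clipping
  master.connect(ctx.destination);

  featurePixels.forEach(({ r, g, b }) => {
    const brightness = (r + g + b) / 3;              // 0-255
    const frequency = 20 + (brightness / 255) * 480; // 20-500 Hz
    const osc = ctx.createOscillator();
    osc.type = 'sine';
    osc.frequency.value = frequency;
    osc.connect(master);
    osc.start();

    // Randomize the frequency slightly over time.
    setInterval(() => {
      osc.frequency.value = frequency + (Math.random() - 0.5) * 10;
    }, 250);
  });
}
```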

Here is an example output using the mountain from above as the input image (the red dots mark the found “features” of the image).

That might not sound terrible, until you realize that the sound is basically the same for any input image:

Mila might be a monster dog, but that output is just too dark :-)

There were a ton of problems with this implementation, to the point that it was actually outright silly. For example, the “feature pixel” selection mostly found edges and corners of the image, and using just 100 pixels as input corresponded to only about 0.01% of all available pixels in the test images. Another problem was that the final pixel value was calculated as the average of the red, green and blue values of the pixel. Some colors arguably have more impact on the viewer than others, but a plain average does not capture that.
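One possible fix for that last problem, which I never implemented, would be to weight the channels by perceived brightness instead of taking a plain average, for example using the standard Rec. 601 luma weights:

```js
// Perceived brightness (Rec. 601 luma) instead of a plain RGB average.
const luma = ({ r, g, b }) => 0.299 * r + 0.587 * g + 0.114 * b;
```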

Even with all its problems, the first experiment was a good first step, considering that I had not known where to start before. It is possible that with a lot of tweaks, new ideas and time, this approach could start producing more interesting soundscapes. However, another downside of the approach was that the music creation would always be guided by the experimenters: the humans. And I wanted to remove them from the equation.

Machine learning to the rescue

The second experiment ended before it even really started. It was clear that some kind of machine learning was needed to move forward, and it seemed that an artificial neural network might be the solution.

This was the idea:

  • Use every pixel of the input image as a single input node of the neural network.
  • Treat every output node as a single sound sample.

For the purposes of this blog post, everything that happens between input and output nodes of the network is largely hidden magic. With that in mind, here is how the network would look (P1 – Pm are the input pixels and S1 – Sn are the output samples):

Photo to sound neural network
Mapping a photo’s input pixels (P1-Pm) to a soundwave output (S1-Sn).

To get an idea of the size of the network, consider this: the mountain test image from above is 1024 by 683 pixels, so the network would have 699,392 input nodes when using images of that size. Digital sound is just a collection of amplitudes in very tiny pieces called samples. The most commonly used sampling rate for music is 44.1 kHz, which means that every second of digital music consists of 44,100 individual samples. For a neural network of this design to produce a five-second sound, it would thus require 220,500 output nodes.
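For reference, here is the back-of-the-envelope arithmetic behind those numbers (image size and clip length taken from the text above):

```js
// Node counts for the imagined pixels-to-samples network.
const inputNodes = 1024 * 683;       // one input node per pixel = 699,392
const sampleRate = 44100;            // samples per second of audio
const outputNodes = 5 * sampleRate;  // five seconds of sound = 220,500
console.log(inputNodes, outputNodes);
```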

The intentions were good but the implementation never happened. After having the initial idea, I fired up Python and tried to simply read and write sound files, but it didn’t go so well, and the weekend was nearly over, and, “oh a squirrel!”… and the code was never touched again.

Machine learning is great, but the motivation was suddenly lacking, and the project was put on ice. This was about two years ago, and the project was not revived until quite recently.

AI art

Deep learning has been steadily on the rise in recent years, often outperforming other machine learning techniques in specific areas such as voice recognition, language translation and image analysis. But deep learning is not limited to “practical” use cases. It has also been used to create art.

A well-known example of “AI Art” is Google’s Deep Dream Generator. The software is based on an initial project called DeepDream which produced images based on how it perceived the world internally. Some of its images were even shown and sold at an art exhibition.

The A.I. Duet project shows another interesting use case for deep learning: the creator of the project, Yotam Mann, trained a model that can produce short sequences of piano notes based on the note input of a human. So if I played C-D-E, the software might respond with F-G-A, although the result would most likely be slightly more interesting than that.1

A.I. Duet is impressive, but it still has a big limitation: it only works for specific notes for a specific instrument. So while the result is amazing, what I really want is more complex arrangements and raw audio output. Even so, the above examples show that deep learning is a powerful and versatile machine learning technique, and it is now finally becoming more feasible than ever to achieve the goal of creating music using AI.

Mila drawn with Deep Dream
Although it is not really relevant to this blog post, I could not help myself: here is the Mila image from above processed with default settings of Deep Dream. It is slightly disturbing to see that… chicken?… coming out of her left paw. Thank you for ruining my sleep, robot!

The bleeding edge, where the story ends

While doing some research on the latest state of the art for machine-generated sound, I stumbled upon yet another Google project called WaveNet. In an interesting blog post, the authors of WaveNet discuss how their research can be used to improve text-to-speech quality, but what is really exciting to me is that they also managed to produce short piano sequences that sound natural (there are some examples at the bottom of their blog post).

The big surprise here is that the piano samples are not just based on specific notes. They are raw audio samples generated from a model trained with actual piano music.2

Finally! A tried and tested machine learning technique that produced raw audio. Reading about WaveNet marked the beginning of my final experiment with music generation, and is the entire reason this blog post exists.

I found an open source implementation of WaveNet, and to test the implementation, I wanted to start simple by using just one sound clip. For this purpose, I extracted an eight-second guitar intro from the post-rock track Ledge by Seas of Years3:

My hope was that by training the model with a single sound clip, I would be able to reproduce the original clip, or something very close to it, and thereby validate that the model produced at least some sound. If that worked, I could then train the model with more sound clips and see what happened.

Unfortunately, even with various tweaks to the network parameters, I could not manage to produce anything other than noise. Sharing an example of the output here is not even appropriate, because it would hurt your ears. The experiment ended with an early failure.

So what was the problem? I soon realized that even with this fairly simple example, I had been overly optimistic about the speed at which I would be able to train the model. I thought that I could train the network in just a few minutes, but the reality was very different.

The first warning sign showed itself pretty quickly: every single step of the training process took more than 30 seconds to complete. In the beginning, I did not think much about this. Some machine learning models actually start producing decent results within the first few steps of training, so I was hoping it would be the same here. However, after doing more research on WaveNet, it became clear that training a WaveNet model did not just require a few learning steps; it required many thousands of them (the NSynth example quoted below only converges at around 200,000 iterations). Clearly, training WaveNet on my machine was completely unfeasible, unless I was willing to wait more than a month for any kind of result.
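For a rough sense of scale, here is the back-of-the-envelope calculation; the step count is an assumption, chosen to be in the same ballpark as the NSynth figure quoted below:

```js
// Rough training-time estimate on my machine.
const secondsPerStep = 30;    // measured locally
const steps = 100000;         // assumption, same order of magnitude as NSynth's ~200k
const days = (secondsPerStep * steps) / (60 * 60 * 24);
console.log(days.toFixed(1) + ' days'); // ≈ 34.7 days
```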

Where do we go from here?

Machine learning has been rapidly evolving in recent years, propelled by software libraries like TensorFlow, and the technology is more accessible than ever for all kinds of developers. But there is another side to the coin: in order to use the state of the art, we are often required to have massive amounts of computing power at our disposal. This is probably why a lot of high-profile AI research and projects come from companies like Google, Microsoft and IBM: they have the capacity to run machine learning at a massive scale. For lone developers like me who just want to test the waters, it can be difficult to get very far because of the complexities of scale.

As a final example to illustrate this point, consider NSynth, an open source TensorFlow model for raw audio synthesis. It is based on WaveNet, and NSynth’s project page says:

The WaveNet model takes around 10 days on 32 K40 gpus (synchronous) to converge at ~200k iterations.

Training a model like that would cost more than $5,000 using Google Cloud resources4. Of course, it is possible that a simpler model could be trained faster and cheaper, but the example still shows that some technologies are most definitely not available for everyone. We live in a time where there is great access to many technological advances, but the availability is often limited in practice, because of the scale at which the technologies need to operate.
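For what it is worth, here is what that estimate implies per GPU-hour; both numbers come from the text above (my pricing-calculator estimate and the NSynth quote), not from an official price list:

```js
// Implied price per GPU-hour for the rough $5,000 estimate.
const gpus = 32;
const days = 10;
const gpuHours = gpus * days * 24;    // 7,680 GPU-hours
const estimatedCost = 5000;           // USD, rough pricing-calculator estimate
console.log((estimatedCost / gpuHours).toFixed(2) + ' USD per GPU-hour'); // ≈ 0.65
```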

So where do we go from here? Well, computational complexity killed my AI post-rock for now, but I doubt that it will take long before significant progress is made in this field. For now, I will enjoy listening to human-generated music. In a way, it is reassuring that machines cannot outperform us in everything yet.


  1. The video explaining how A.I. Duet works is quite good. 

  2. Describing how WaveNet works is beyond the scope of this blog post, but the original paper for WaveNet is not terribly difficult to read (unlike most other AI research). 

  3. Seas of Years’s album The Ever Shifting Fields was one of my favorite post-rock albums of 2016. I recommend a listen. 

  4. I used Google’s pricing calculator with 4 machines, each with 8 GPU cores. 

Golang and the search of the past

Gopher inspecting Go code
Gopher inspecting Go code by Anthony Starks (CC-BY-NC)

This post is mostly a status of what I have been up to here at the nearly-almost-half-year mark of 2016.

Introducing Product Search

By the end of 2015, we had already been using Elasticsearch for a while. It was the first part of a long-term strategy of moving data away from Google App Engine. Event data such as page views and clicks, as well as order aggregations such as revenue-per-day for our users, were being stored and calculated in Elasticsearch. Although Elasticsearch is popular for collecting log data, its main selling point is that it is a very fast full-text search engine.

During the Christmas holidays, I wanted to see how easy it would be to add a search widget, powered by Elasticsearch. After about 3-4 hours, I posted this proof-of-concept video to our Slack channel with the following message:

Product AJAX search proof-of-concept, powered by Receiptful

As it turned out, the product search feature quickly found its way onto the roadmap :-)

Unlisted Antecons

It was going to happen at some point, and in early April, we finally removed the listing of Antecons from the Shopify app store. The app continues to run and interestingly, we have some users that are still using it, even though we have contacted everyone and tried to get them to switch over to Receiptful. Loyal customers.

Popular metrics report

By the end of April, we released the report “8% of all product page traffic converts to sales”. For a short while, I think it made a little splash and was read by quite a few people. Although I did not write the article, all the data for the article was gathered by me a few months before. One of those little side tasks that spice up developer life — although doing data analysis is slightly more exciting than data gathering :-)

Go nuts with Golang

Currently, I am in Golang land. I did not think I would end up there, but when tasked with creating a new web app for some simple store metrics, I decided to create it with Go after consulting with the team. After some initial headaches (i.e. getting used to a statically typed, compiled language again), I must say that Go has some good things going for it. My colleagues mock me for using tabs, but that is the Go way.

In the same project, I also said hello to my old friend MapReduce. It is a feature of MongoDB and we use it to create pre-aggregated reports for the project. It might be a short affair though, as I am also considering other options such as Google BigQuery. We will see…
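As a small illustration of the kind of pre-aggregation I mean, here is a hypothetical mapReduce job written for the MongoDB shell; the collection and field names are made up and not from the actual project:

```js
// Hypothetical pre-aggregation: revenue per store per day.
// Collection and field names are made up for illustration.
db.orders.mapReduce(
  function () {
    var day = this.createdAt.toISOString().slice(0, 10);
    emit({ store: this.storeId, day: day }, this.total);
  },
  function (key, values) {
    return Array.sum(values);
  },
  { out: 'revenue_per_day' }
);
```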

So those are the major headlines (I probably missed something). I have been meaning to write slightly more technical articles, but I do not feel like I am in the right mindset to do so yet. Those pieces also tend to be much longer and much more difficult to write, so for now, you will have to make do with these random rambles.

What happened next

Future Or Bust!
Future Or Bust! by Paul Hocksenar

In July last year (2014), I started working full-time on Antecons. Back then, I wrote a post about complete independence where I predicted that I could support myself for up to half a year without income. This turned out to be true, but now that we have entered a new year, it is time for an update. It is an update that I was nervous about until very recently, when a story about monetary failure took an interesting turn for the better.

The challenge

A lot of people try to hide their failures, myself included. For someone afraid of failure, of taking risks and of facing challenges, it seems like a very weird and poor life choice to give up a lucrative and comfortable position as a consultant in order to work without income or earning potential for an extended period of time. The explanation is simple though: things were starting to get fun.

When Antecons launched on Shopify in 2013, it was mostly developed while doing freelance consulting work. Having a high-revenue commercial product was never really the goal. I simply wanted to see if it was possible for me to start something from scratch and make it through all the way to a finished product and not just an abandoned side-project.

When the first few customers started coming in, I realized that this product might have some potential, and it was quite a different feeling having customers that bought a service rather than buying my labor. It was… fun. So I started working on Antecons full-time in July with the vision of slowly building up a list of stable clients. I had a good relationship and regular contact with a webshop house, and they sounded very interested in offering Antecons to their webshop customers, so I pushed towards a first “revenue milestone” with them as a re-seller. It was a great summer, and I built an API beta that I was satisfied with offering to potential buyers.

Fail

My plan failed. Nothing came of my contact with the webshop re-seller, and in the meantime, Shopify sales increased only very slowly. Antecons was even featured on the front page of the Shopify app store, but there were disappointingly few signups.

There is no need to go into great detail about the technicalities, because the main problem was that I neglected sales, so I failed to sell. That makes sense now, but I did not realize it soon enough… so I failed. Period. The end.

Late fall came, I was not making any money, and the product was not going to generate significant revenue anytime soon. I had to take on a few hours of consulting work again to pay the bills, which meant that I had less time to work on Antecons, and at the beginning of December, I was hired full-time as a Python developer for Neodev.

I had mixed feelings about starting a new job, because it felt like a defeat to stop working on Antecons. Both the job and my colleagues were great though, and it was sometimes quite difficult to explain to friends and family how having a rewarding and well-compensated job could still feel a bit like a step back or a let-down. Nevertheless, it was difficult to leave Antecons behind. As it turned out though, the departure was short-lived.

An unexpected journey

It was exactly one day after signing my new contract that I was contacted by Adii, a successful entrepreneur with experience in the e-commerce field. Adii was looking for a recommendation engine to improve a product upsell feature in a young startup called Receiptful. Initially, I was skeptical and without hope, because I had basically given up on monetizing Antecons since it was not making any real money. But after a few weeks of communication, we decided to work together: I was “acquihired” (yes, that is a real word), and Antecons was revived before it had even drawn its final breath.

Fast forward to today and I have been with Receiptful for a month, working on integrating Antecons as part of the Receiptful system. It is really great to be able to work on Antecons and data analysis in a full-time position and it has also presented some new challenges, but that is a topic for a different post.

Working independently was a great experience. It was an unexpected journey with an unexpected ending. A new and different chapter has now begun: Life in a startup.

Photo Amaze

Screenshot of Photo Amaze app

About a month ago, I attended the wedding of a childhood friend. Since I have had some extra free time lately, I came up with an idea of combining my interest in 3D with an app that could be used for the wedding. The result was Photo Maze, a 3D maze where the guests at the wedding could upload a photo from their phones and it would appear immediately in the maze, giving the bride and groom a kind of interactive photo album from their party.

I felt the urge to develop this idea a bit further, and it has now been renamed to Photo Amaze — a pun on maze and amaze. It is available for everyone, and I hope you will try it out!

Antecons blog

I have just created a new Antecons blog. From now on, Antecons-related news and information will be posted there. I will also continue the developer diary there.

For some time now I have felt that Antecons needed its own blog and website to separate my private writings from my product writings. Today, I finally made the move and started up a simple blog for Antecons. I will continue to post on Thought Flow, just as I have always done.