Thought Flow

Tag: tensorflow

  • A game of (AI) telephone

    Do you remember playing a game called “telephone” (or “Chinese Whispers”) as a kid?

    The game is simple: The first player comes up with a short sentence like “Alice and Bob walked to the bakery”. They then whisper this sentence to the second player, who then whispers to the third player etc.

    The fun part happens at the end, when the final player tells everyone what they heard, and everyone usually laughs because the sentence has changed a lot when passed from ear to ear, e.g. “Alice ran after Bob who stole her cake”1

    What if we made a pair of AI systems play a similar game of communication with each other, but with a small twist? Instead of communicating with voices, the systems would communicate by text and images.

    The game would go like this:

    1. I choose a real image and write a one-sentence description of the image, just to make sure the first input is “real”.
    2. AI 1 — a text-to-image generator — would take the description and turn it into an image.
    3. AI 2 — an image-to-text generator — would take the image from step 2 and create a new image description.
    4. Go back to step 2 and feed this image description to AI 1.
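
    The loop in steps 2 to 4 can be sketched roughly as follows. This is only an illustration of the control flow: `text_to_image` and `image_to_text` are hypothetical placeholders for whatever models play "AI 1" and "AI 2", and the number of rounds is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins: any text-to-image and image-to-text model can be
# plugged in here. The image type is deliberately left abstract.
TextToImage = Callable[[str], object]
ImageToText = Callable[[object], str]


def play_telephone(
    first_description: str,
    text_to_image: TextToImage,
    image_to_text: ImageToText,
    rounds: int = 4,
) -> List[Tuple[str, Optional[object]]]:
    """Alternate between the two models and record every (text, image) pair."""
    history: List[Tuple[str, Optional[object]]] = []
    description = first_description
    for _ in range(rounds):
        image = text_to_image(description)   # step 2: AI 1 draws the text
        history.append((description, image))
        description = image_to_text(image)   # step 3: AI 2 captions the image
    history.append((description, None))      # last caption; no image is drawn for it
    return history
```

    The returned history holds one entry per generated image plus the final caption, which is all that is needed to lay out a series like the one below.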

    Without further ado, let us try it out, and we will get to the technology later in the post.

    For the first image, I chose this picture with the description “dog standing on a grass hill with a yellow field in the background”:

    With both the image and the description as the first inputs to the image generator, I got the following series of images and text descriptions:

    Image generated from text “dog standing on a grass hill with a yellow field in the background”

    The above image was interpreted by AI 2 as “a dog on green grass” (not bad actually) and that description was fed back to AI 1 to produce:

    Image generated from text “a dog on green grass”

    This image was interpreted by AI 2 as “a bunch of atm food” which doesn’t make sense, but who am I to judge, so back it went to AI 1, and we got the following image:

    Image generated from text “a bunch of atm food”

    Yeah that looks like bread, sausages and an… “AMT”? Not quite an ATM, but hey, it’s close. This image was interpreted as “a display case are sitting on a kitchen table” and based on this, we get our final image:

    Image generated from text “a display case are sitting on a kitchen table”

    I think it tried to draw cameras and a cellphone, but it is a bit abstract. This image is interpreted as “a few on a wall”.

    And there you have it. We went from “dog standing on a grass hill with a yellow field in the background” to “a few on a wall” and from a nice summer image of my dog to a display of electronic devices?

    It is worth noting that the above is just one of many possible outcomes from the same starting point. The models use randomness in their configurations, which means that the end result is almost never the same. It might be interesting to automate the process in the future.

    For now, I hope you just enjoyed this little experiment.

    The tech behind

    As mentioned in the introduction, the “game” consists of two deep learning (“AI”) systems. I actually already wrote a post about one of them, the image captioning model from the TensorFlow tutorial. This is what I called “AI 2”, and it can take an image and produce a caption for it.

    For this experiment, I let the model train a bit longer than in the previous post, but I did not really evaluate it, so the captions are still hit or miss. However, you can see from its first interpretation “a dog on green grass” that it is not terrible.

    The image generator (“AI 1”) is the new and shiny thing here. It is called VQGAN+CLIP, and it actually consists of two models that work together to produce an image from a piece of text. The specific version I am using here is based on work by Katherine Crowson found in this vqgan-clip repository on GitHub (specifically the notebook with the z+quantize method).

    For this experiment, I let the system run for a few minutes before stopping it and taking the produced image. I do not fully understand how the VQGAN+CLIP system works, and it is probably also beyond the scope of this post to discuss it, but I encourage you to search for examples online.

    Its creations are often abstract with a hint of reality, so they end up looking quite surreal and sometimes disturbing. This blog post about “AI movie posters” is what got me interested in VQGAN+CLIP, and I might explore it a bit more in the future as well.

    By the way, the image at the beginning of this post was also made by VQGAN+CLIP from the text “An artificial intelligence whispers to another artificial intelligence”, and it was scaled up with a super-resolution neural network called ESRGAN. Good stuff!

  • Image captioning model

    Progress is slow for my various hobby efforts to generate images, create lyrics or even detect my dog in an image, but I do still experiment a little bit when I have the time for it.

    Today, I ran through the image captioning tutorial from TensorFlow, because why not. I trained it on a reduced set of 5000 images compared to the tutorial, but otherwise the code is identical.

    The purpose of the model is to look at an image and predict a caption for the image, e.g. “man sits next to computer” if the model was looking at me right now.

    The results are quite hilarious at this point, and it has given me some ideas for future development that could potentially delight or confuse.

    Anyway, for now, here is a caption that is as close to being accurate as I could get today. It’s a picture of Mila, my dog, with the predicted caption “a little dog sitting on dirt field”. Not too bad.

    Image of Mila, my dog, sitting on a lawn
    Best predicted caption: a little dog sitting on dirt field

    The model has a bit of randomness in its output, so the same image might produce multiple different results. For example, the image above also produced “a dog is surfing on a field with it’s toppings in front of grass” as well as the completely nonsensical “a fire up of a green grass next to a grass in a grassy field”.

    I also tried a few other images, but I am too lazy to look up the proper attribution for including them in this post, so here is another image of Mila with the strange caption “a dog that is standing in tall grass in the grassy field with a large dog and white dog is standing in a grassy field next to a tree”

    Best predicted caption: a dog that is standing in tall grass in the grassy field with a large dog and white dog is standing in a grassy field next to a tree

    There is also the slightly more boring “a dog standing in front of a grass” and my personal favorite (although it’s a bit disturbing) “a dressed dog in a field eaten horse to enjoy zebras grass in the grass”.

    That is it for now. Not much content here, but I felt like writing a post. Stay tuned for possibly more silliness when the spark of energy hits.

  • Hello World is a Dream

    Generating song lyrics using neural networks… that sounds cool! Those were my thoughts on an evening in the recent past, so I started experimenting with the idea.

    I decided to name the neural network that came out of this project Gene Lyrica One, a very AI-sounding name (it is “the year of AI” after all). But before we get into the details of how Gene was born, let’s set the stage for this blog post with Gene’s unique take on the classic “hello world”:

    hello world is a dream
    i know when i been like your love
    and i can’t go home

    i can’t cry
    i don’t want to see
    i don’t know why i can’t run
    i got me
    i got the way you know
    i want to feel my love
    i want to go
    i need you
    and i want to feel so good
    i want to get it and i can’t see

    i’m gonna stop if i had it
    i can’t really want to see

    Gene Lyrica One

    The neural network is given some initial text, called the seed text, and then the network creates new lyrics based on this text. As mentioned above, the seed text for these lyrics was “hello world” which, given the subject matter, makes sense on multiple levels.

    If you want to create your own lyrics, you can try it out here,2 and the code that generated the network can be found on GitHub.

    In the following sections, I will describe the process that led to Gene Lyrica One, including more lyrics from Gene as well as other networks that were part of the experiment.

    I have no clue

    i have no clue
    i want to lose the night
    and i’m just a look in your mind
    but i know what you want to do
    all the way
    i gave you love
    and i can think you want to

    Gene Lyrica One

    Generating lyrics is not a walk in the park, and I have not quite cracked the nut yet. To be honest, I would say I generally have no clue what I am doing.

    I know where I started though: To get a feeling for how to “predict language” with a neural network, I created neural networks to generate text based on two different techniques:3

    1. Given a sequence of words, predict another full sequence of words.
    2. Given a sequence of words, predict just one word as the next word.

    The second kind of model (sequence-to-single-word) is the one that conceptually and practically was easiest for me to understand. The idea is this: For an input sentence like “all work and no play makes jack a dull boy”, we can split the sentence into small chunks that the neural network can learn from. For example “all work and no” as input and “play” (the next word) as output. Here is some code that does just that.
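
    As a minimal illustration of that chunking idea (a sketch, not the project's actual code; the function name and the window size of 4 are assumptions made for this example):

```python
from typing import List, Tuple


def make_training_pairs(words: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Slide a fixed-size window over the words; the word right after it is the target."""
    pairs = []
    for start in range(len(words) - window):
        context = words[start:start + window]
        target = words[start + window]
        pairs.append((context, target))
    return pairs


sentence = "all work and no play makes jack a dull boy"
pairs = make_training_pairs(sentence.split(), window=4)
for context, target in pairs[:3]:
    print(" ".join(context), "->", target)
```

    The first pair is “all work and no” with the target “play”, the second is “work and no play” with the target “makes”, and so on until the window reaches the end of the sentence.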

    With the basic proof-of-concept architecture in place, I started looking for a dataset of song lyrics. One of the first hits on Google was a Kaggle dataset with more than 55 thousand song lyrics. This felt like an adequate amount so I went with that.

    New Lines

    new lines
    gotta see what you can do

    oh you know

    Gene Lyrica One

    Lyrics consist of a lot of short sentences on separate lines, and while the texts on each line are often related in content, they do not necessarily follow the same flow as the prose in a book.

    This led to two specific design decisions for creating the training data. First, newline characters (\n) are treated as words on their own, which means that a “new line” can be predicted by the network. Second, the length of the input sequences should not be too long since the context of a song is often only important within a verse or chorus. The average length of a line for all songs happens to be exactly 7 words, so I decided to use 14 words for the input sequences to potentially capture multiline word relationships.

    A few other decisions worth mentioning:

    • Words are not pre-processed. This means that e.g. running, runnin, and runnin’ will be treated as three different words.
    • Words are not removed or cleaned. For example, the word “chorus” sometimes appears in the dataset to mark the beginning of the song’s chorus.
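
    A small sketch of what these decisions could look like in code, assuming plain whitespace splitting (the function name and the example lyric are made up for illustration):

```python
from typing import List

NEWLINE = "\n"


def tokenize_lyrics(text: str) -> List[str]:
    """Split lyrics on whitespace, keeping every line break as its own token."""
    tokens: List[str] = []
    for line in text.split(NEWLINE):
        tokens.extend(line.split())  # no lowercasing, stemming or cleaning
        tokens.append(NEWLINE)
    tokens.pop()                     # drop the newline added after the last line
    return tokens


lyrics = "i'm runnin down\nrunning down"  # invented example text
tokens = tokenize_lyrics(lyrics)
vocabulary = sorted(set(tokens))          # unique "words", newline included
```

    Because nothing is normalized, “runnin” and “running” end up as two separate vocabulary entries, and the newline token sits in the vocabulary next to the ordinary words so the network can predict it like any other word.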

    Well Known With a Twist

    well known with a twist for the bed

    i got the

    oh oh

    what you want to do
    i’m goin’ down

    Gene Lyrica One

    The first attempt at training the network yielded some funny results. Because there were hundreds of thousands of parameters to tune in the network, training was extremely slow, so I initially tested it on just the first 100 songs in the dataset. Because of alphabetical ordering, these all happened to be Abba songs.

    The final accuracy of the network was somewhere around 80%. One way to interpret this is to say that the network knew 80% of the Abba songs “by heart”. Thus, the network was creating “Abba songs with a twist”. For example, it created the verse:

    so long see you baby
    so long see you honey
    you let me be

    Baba Twist

    The Abba song “So long” has the phrase “so long see you honey” so it improvised a little bit with the “so long see you baby” (“you baby” appears in a different Abba song “Crying Over You” which probably explains the variation). Or how about:

    like a feeling a little more
    oh no waiting for the time
    if you would the day with you
    ’cause i’m still my life is a friend
    happy new year
    happy new year
    happy new year
    ……
    [many more happy new years] :-)

    Baba Twist

    which is probably “inspired” by the Abba song “Happy New Year”. The network was overfitting the data for Abba, which turned out to be fun, so this was a promising start.

    Too much information

    too much information
    i can’t go

    Gene Lyrica One

    With decent results from Baba Twist (the Abba-network), it was time to try training the network using all 55 thousand songs as input data. I was excited and hopeful that this network would be able to create a hit, so I let the training process run overnight.

    Unfortunately, my computer apparently could not handle the amount of data, so I woke up to a frozen process that had only finished running through all the songs once (this is called one epoch, and training often requires 50 or more epochs for good results).

    Luckily, the training process automatically saves checkpoints of the model at certain time intervals, so I had some model, but it was really bad. Here is an example:

    i don’t know what i don’t know

    i don’t know what i don’t know

    i don’t know what i don’t know

    Tod Wonkin’

    Not exactly a masterpiece, but at least Tod was honest about its situation. Actually, “I don’t know what I don’t know” was the only text Tod produced, regardless of the seed text.

    In this case, I think there was too much information for the network. This feels a bit counter-intuitive. We usually seem to always want more data, not less, but for a small hobby project like this, it probably made sense to reduce the data size a bit to make the project more manageable and practical.

    Famous Rock

    famous rock are the dream

    chorus

    well i got a fool in my head
    i can be

    i want to be
    i want to be
    i want to be
    i want to be

    Gene Lyrica One

    After the failure of Tod Wonkin’, I decided to limit the data used for training the network. I theorized that it would be better to only include artists with more than 50 songs and have a smaller number of artists in general, because it would potentially create some consistency across songs. Once again, this is a case of “I have no clue what I’m doing”, but at least the theory sounded reasonable.

    A “top rock bands of all time” list became the inspiration for what artists to choose. In the end, there were 20 rock artists in the reduced dataset, including Beatles, Rolling Stones, Pink Floyd, Bob Dylan etc. Collectively, they had 2689 songs in the dataset and 16389 unique words.

    The lyrics from these artists are what created Gene Lyrica One.

    It took some hours to train the network on the data, and it stopped by itself when it was no longer improving, with a final “accuracy” of something like 22%. This might sound low, but high accuracy is not desirable, because the network would just replicate the existing lyrics (like Baba Twist). Instead, the network should be trained just enough that it makes sentences that are somewhat coherent with the English language.
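
    Stopping “by itself” is usually done with some form of early stopping (Keras, for example, ships an EarlyStopping callback). Which setup was used here is not stated, so the following is only a library-free sketch of the general idea, with an assumed patience of 3 epochs and made-up loss values:

```python
from typing import List


def stopping_epoch(losses: List[float], patience: int = 3) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the monitored loss has failed to improve on its
    best value for `patience` epochs in a row.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)  # never triggered: training ran for all epochs


# Made-up validation losses, only to exercise the function.
example_losses = [3.0, 2.5, 2.4, 2.45, 2.41, 2.4, 2.3]
stopped_at = stopping_epoch(example_losses, patience=3)
```

    With these invented numbers the loss stops improving after epoch 3, so training would end at epoch 6 and never reach the better value at epoch 7. That is the usual trade-off when picking the patience.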

    Gene Lyrica One felt like an instant disappointment at first, mirroring the failure of Tod Wonkin’ by producing “I want to be” over and over. At the beginning of this post, I mentioned Gene Lyrica One’s “Hello World” lyrics. Actually, the deterministic version of these are:

    hello world is a little man

    i can’t be a little little little girl
    i want to be
    i want to be
    ……
    [many more “i want to be”]

    Gene Lyrica One

    At least Gene knew that it wanted to be something (not a little little little girl, it seems), whereas Tod did not know anything :-)

    The pattern of repeating “I want to be” was (is) quite consistent for Gene Lyrica One. The network might produce some initial words that seem interesting (like “hello world is a little man”), but it very quickly gets into a loop of repeating itself with “i want to be”.

    Adding Random

    adding random the little

    you are

    and i don’t want to say

    i know i don’t know
    i know i want to get out

    Gene Lyrica One

    The output of a neural network is deterministic in most cases. Given the same input, it will produce the same output, always. The output from the lyric generators is a huge list of “probabilities that the next word will be X”. For Gene Lyrica One, for example, the output is a list of 16389 probabilities, one for each of the available unique words.

    The networks I trained were biased towards common words like “I”, “to”, “be”, etc. as well as the newline character. This explains why both Gene Lyrica One and Tod Wonkin’ got into word loops. In Gene’s case, the words in “I want to be” were the most likely to be predicted, almost no matter what the initial text seed was.

    Inspired by another Kaggle user, who in turn was inspired by an example from Keras, I added some “randomness” to the chosen words in the output.4 The randomness could be adjusted, but adding too much of it would produce lyrics that do not make sense at all.
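
    One common way to add adjustable randomness is so-called temperature sampling. Whether the networks in this post use exactly this scheme is an assumption. The sketch below uses only the Python standard library, and the function names are made up for illustration:

```python
import math
import random
from typing import List, Optional


def greedy_index(probabilities: List[float]) -> int:
    """Deterministic choice: the index of the most likely word."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def sample_index(probabilities: List[float], temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Randomly pick a word index from the network's output probabilities.

    A temperature close to 0 approaches the greedy choice; higher values
    flatten the distribution and make unlikely words more probable.
    """
    if rng is None:
        rng = random.Random()
    # Rescale in log space; the tiny constant avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probabilities]
    highest = max(logits)
    weights = [math.exp(x - highest) for x in logits]  # subtract max for stability
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

    Always taking `greedy_index` is what leads to loops like “i want to be”, since the same context keeps producing the same most likely word. Sampling with a moderate temperature breaks those loops, while a very high temperature makes every word almost equally likely, which matches the observation that too much randomness yields nonsense.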

    All the quotes generated by Gene Lyrica One for this post have been created using a bit of “randomness”. For most of the sections above, the lyrics were chosen from a small handful of outputs. I did not spend hours finding the perfect lyrics for each section, just something that sounded fun.

    The final trick

    the final trick or my heart is the one of a world

    you can get out of the road
    we know the sun
    i know i know
    i’ll see you with you

    Gene Lyrica One

    A few months ago, TensorFlow.js was introduced, which brings machine learning into the browser. It is not the first time we have seen something like this, but I think TensorFlow.js is a potential game changer, because it is backed by an already-successful library and community.

    I have been looking for an excuse to try out TensorFlow.js since it was introduced, so for my final trick, I thought it would be perfect to see if the lyrics generators could be exported to browser versions, so they could be included more easily on a web page.

    There were a few roadblocks and headaches involved with this, since TensorFlow.js is a young library, but if you already tried out my lyrics generator in the browser, then that is at least proof that I managed to kind-of do it. And it is in fact Gene Lyrica One producing lyrics in the browser!

    This is the end

    this is the end of the night

    i was not every world
    i feel it

    Gene Lyrica One

    With this surprisingly insightful observation from Gene (“I was not every world”), it is time to wrap up for now. Overall, I am pleased with the outcome of the project. Even with my limited knowledge of recurrent neural networks, it was possible to train a network that can produce lyrics-like texts.

    It is ok to be skeptical towards the entire premise of this setup though. One could argue that the neural network is just an unnecessarily complex probability model, and that simpler models using different techniques could produce equally good results. For example, a hand-coded language model might produce text with better grammar.

    However, the cool thing about deep learning is that it does not necessarily require knowledge of grammar and language structure — it just needs enough data to learn on its own.

    This is both a blessing and a curse. Although I learned a lot about neural networks and deep learning during this project, I did not gain any knowledge regarding the language structure and composition of lyrics.

    I will probably not understand why hello world is a dream for now.

    But I am ok with that.


  • Is it Mila?

    One of the great things about the Internet is that people create all sorts of silly, but interesting, stuff. I was recently fascinated by a deep learning project where an app can classify images as “hotdog” or “not hotdog”. The project was itself inspired by a fictional app that appears in HBO’s show Silicon Valley, and the project was organized by an employee at HBO.

    The creator of the app wrote an excellent article outlining how the team approached building the app, from gathering data, through designing and training a deep learning neural network, to building an app for the Android and iPhone app stores.

    Naturally, I thought to myself: perhaps I can be silly too. So I started a small project to try and classify whether an image contains my dog Mila or not. (Also, the architecture for the hotdog app is called DeepDog, so as you can see, it is all deeply connected!)

    The is-mila project is not as large and detailed as the hotdog project (for example, I am not building an app), but it was a fun way to get to know deep learning a bit better.

    The full code for the project is available on GitHub, and feel free to try and classify a photo as well.

    A simple start

    One of the obstacles to any kind of machine learning task is to get good training data. Fortunately, I have been using Flickr for years, and many of my photos have Mila in them. Furthermore, most of these photos are tagged with “Mila”, so it seemed like a good idea to use the Flickr photos as the basis for training the network.

    Mila as a puppy

    I prepared a small script and command-line interface (CLI) for fetching pictures via the Flickr API. Of course, my data was not as clean as I thought it would be, so I had to manually move some photos around. I also removed photos that only showed Mila from a great distance or with her back to the camera.

    In the end, I had 263 photos of Mila. There were many more “not Mila” photos available of course, but I decided to also use only 263 “not Mila” photos so the training set for the two classes “Mila” and “not Mila” had equal size. I do not really want to discuss overfitting, data quality, classification accuracy, etc. in this post, but there are many interesting topics to discuss there for another time.
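
    Balancing the two classes by downsampling the larger one could look roughly like this. It is a sketch under assumptions: the function name, the fixed seed and the file names in the usage example are invented, and only the count of 263 comes from the text above.

```python
import random
from typing import List, Sequence, Tuple


def balance_classes(positives: Sequence[str], negatives: Sequence[str],
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    size = min(len(positives), len(negatives))
    return rng.sample(list(positives), size), rng.sample(list(negatives), size)


# Hypothetical file names; the size of the "not Mila" pool is made up.
mila_photos = [f"mila_{i}.jpg" for i in range(263)]
other_photos = [f"other_{i}.jpg" for i in range(1000)]
mila_set, not_mila_set = balance_classes(mila_photos, other_photos)
```

    With equally sized classes, a classifier that always answers “Mila” scores 50% accuracy, which makes the accuracy numbers later in the post easier to interpret.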

    For the deep learning part, I used Keras which is a deep learning library that is a bit simpler to get started with than e.g. TensorFlow. In the first iteration, I created a super-simple convolutional neural network (CNN) with just three convolutional layers and one fully-connected layer (and some MaxPooling and Dropout layers in between).

    Training this network was faster than I thought and only took a few minutes. In my latest run, the accuracy settled at around 79% and validation accuracy (i.e. for photos that were not used to train the network) at 77% after 57 epochs of roughly six seconds each. This is not very impressive, but for binary classification, anything above 50-60% accuracy is at least better than a coin flip.

    Finally, I created a simple website for testing the classification. I did not bother using a JavaScript transpiler/bundler like Babel/Webpack, so the site only works in modern browsers. You can try the simple classification here if you like.

    The results from this initial experiment were interesting. In the validation set, most of the photos containing Mila were correctly classified as Mila, and a few were classified as not Mila for no obvious reasons. For example, these two images are from a similar setting, with similar lighting, but with different positioning of Mila, and they are classified differently:

    Mila, correctly classified
    Mila, incorrectly classified as not Mila

    Perhaps more surprising though are the false positives, the photos classified as Mila when they do not have Mila in them. Here are some examples:

    Sports car, classified as Mila
    Rainbow crosswalk, classified as Mila
    Goats, classified as Mila

    Mila is certainly fast, but she is no sports car :-)

    As of writing this, I am still uncertain what the simple network sees in the photos it is given. I have not investigated this yet, but it would be an interesting topic to dive into at a later stage.

    Going deeper

    A cool feature of Keras is that it comes with a few pre-trained deep learning architectures. In an effort to improve accuracy, I tried my luck with using a slightly modified MobileNet architecture using pre-trained weights for the ImageNet dataset, which contains a big and diverse set of images.

    The Keras-provided MobileNet network is 55 layers deep so it is quite a different beast than the “simple” network outlined above. But by freezing the weights of the existing network layers and adding a few extra output layers as needed for my use case (binary classification of “Mila” and “not Mila”), the complexity of training the network was reduced since there were fewer weights to adjust.

    After training the network for 48 epochs of about 18 seconds each, the training accuracy settled around 97% and validation accuracy at 98%. The high accuracy was surprising and felt like an excellent result! For example, the Mila pictures shown above were now both correctly classified, and the sports car and rainbow crosswalk were no longer classified as being Mila. However, the goat was still “Mila” so something was still not quite right…

    You can try out the network here if you like.

    At this point, I had a hunch that the increased accuracy of MobileNet was mainly due to its ability to detect dogs in pictures (and the occasional goat). Unfortunately, it was worse than that: photos of dogs, cats, birds, butterflies, bears, kangaroos and even a squirrel were all classified as being Mila.

    It seemed I had not created a Mila detector, but an animal detector. I had kind of expected a result like this, but it was still a disappointing realization, and this is also where the story ends for now.
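
    A high overall accuracy can hide this kind of problem when the “not Mila” photos contain few animals. One way to make it visible is to look at false positives per kind of photo. The sketch below assumes each photo carries a group label (which the project's data may not have), and the records are invented toy values, not results from the project:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (group label, is it really Mila?, was it predicted as Mila?)
Record = Tuple[str, bool, bool]


def false_positive_rate_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """For every group, the share of non-Mila photos predicted as Mila."""
    negatives: Dict[str, int] = defaultdict(int)
    false_positives: Dict[str, int] = defaultdict(int)
    for group, is_mila, predicted_mila in records:
        if not is_mila:
            negatives[group] += 1
            if predicted_mila:
                false_positives[group] += 1
    return {group: false_positives[group] / negatives[group] for group in negatives}


# Invented toy records, purely to show the shape of the input.
toy_records = [
    ("mila", True, True),
    ("car", False, False),
    ("street", False, False),
    ("goat", False, True),
    ("goat", False, False),
    ("squirrel", False, True),
]
rates = false_positive_rate_by_group(toy_records)
```

    Groups with a rate close to 1 are the ones the classifier cannot tell apart from Mila, which points at where more “not Mila” training photos would be needed.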

    Sneaky squirrels and other animals

    To summarize, I tried to create an image classifier that could detect Mila in photos, but in the current state of the project, this is not really possible. Writing this blog post feels like the end of the journey, but there are still many tweaks and improvements that could be made.

    For example, it would be interesting to know why the “simple” network saw a rainbow crosswalk as Mila, and it would be nice to figure out how to improve the quality of the predictions for the MobileNet version such that it does not just say that all pets are Mila. One idea could be to clean the training data a bit more, e.g. by having more pets in the “not Mila” photo set or perhaps restrict the Mila photos to close-ups to improve consistency and quality in that part of the data.

    One thing is for sure: there is always room for improvement, and working on this project has been a nice learning experience so far. As an added benefit, I managed to mention squirrels in a (technical) blog post, and I will leave you with a picture of the sneaky “Mila” squirrel:

    Sneaky squirrel, classified as Mila

    (I like squirrels. A lot. It was all worth it just for the squirrel.)