Image captioning model

Progress is slow for my various hobby efforts to generate images, create lyrics or even detect my dog in an image, but I do still experiment a little bit when I have the time for it.

Today, I ran through the image captioning tutorial from TensorFlow, because why not. I trained it on a smaller set of 5000 images than the tutorial uses, but otherwise the code is identical.

The purpose of the model is to look at an image and predict a caption for the image, e.g. “man sits next to computer” if the model was looking at me right now.

The results are quite hilarious at this point, and it has given me some ideas for future development that could potentially delight or confuse.

Anyway, for now, here is a caption that is as close to being accurate as I could get today. It’s a picture of Mila, my dog, with the predicted caption “a little dog sitting on dirt field”. Not too bad.

Image of Mila, my dog, sitting on a lawn
Best predicted caption: a little dog sitting on dirt field

The model has a bit of randomness in its output, so the same image might produce multiple different results. For example, the image above also produced “a dog is surfing on a field with it’s toppings in front of grass” as well as the completely nonsensical “a fire up of a green grass next to a grass in a grassy field”.
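The randomness most likely comes from how the next word is chosen: instead of always taking the highest-scoring word, the word can be drawn at random in proportion to its predicted probability. Below is a minimal Python sketch of that difference, under the assumption that this is what happens here. The vocabulary, the scores and the function names are made up for illustration and are not taken from the tutorial code.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(vocab, logits):
    """Always return the highest-scoring word (deterministic)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

def pick_sampled(vocab, logits, rng):
    """Draw a word at random, weighted by its probability."""
    return rng.choices(vocab, weights=softmax(logits), k=1)[0]

# Hypothetical next-word scores after the words "a dog sitting on"
vocab = ["grass", "dirt", "field", "toppings"]
logits = [2.0, 1.8, 1.5, 0.2]

rng = random.Random(0)
print("greedy: ", pick_greedy(vocab, logits))
print("sampled:", [pick_sampled(vocab, logits, rng) for _ in range(5)])
```

With sampling, even an unlikely word such as "toppings" gets picked once in a while, which would explain how the same image can end up with both a sensible and a nonsensical caption.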

I also tried a few other images, but I am too lazy to look up the proper attribution for including them in this post, so here is another image of Mila with the strange caption “a dog that is standing in tall grass in the grassy field with a large dog and white dog is standing in a grassy field next to a tree”.

Best predicted caption: a dog that is standing in tall grass in the grassy field with a large dog and white dog is standing in a grassy field next to a tree

There is also the slightly more boring “a dog standing in front of a grass” and my personal favorite (although it’s a bit disturbing) “a dressed dog in a field eaten horse to enjoy zebras grass in the grass”.

That is it for now. Not much content here, but I felt like writing a post. Stay tuned for possibly more silliness when the spark of energy hits.


Dipping the feet in the game design pond

For my wife’s birthday this year, I created a prototype of a small game-like 3D environment that she could “walk around” in using the keyboard and mouse. The idea was to have an “exhibit” for each year we have known each other, consisting of a few photos from that year as well as a short text describing major events that happened during the year.

Unfortunately, I am terrible at planning things and started a bit too late so I did not finish the game in time for the birthday.

With the help of Unity, I managed to finish the project eventually. This post is about that journey as well as some observations on Unity and game design in general.

Note: I usually make my stuff available online, e.g. on GitHub. Not this time though. This was a personal thing I created for my wife. I’m reproducing the images for this blog post with her permission :-)

Focus on the content

The prototype was created for the browser and used three.js for most of the 3D stuff. I am a fan of WebGL, and I had some prior experience from working on Photo Amaze and Zombie Hugs, so it seemed like a natural choice.

Here are two screenshots from the initial prototype:

The thick fog was added as a way to limit the initial view and ensure only one piece of text was visible at a time, and the rest was hidden in the fog until the player started moving forward. This fog then cleared up once the first exhibit was reached, making the scene more open.

I found photos for three out of ten exhibits and wrote the initial intro text, but the rest remained unfinished. One reason for this was that I quickly started obsessing over details like shadows and lighting instead of focusing on the core content, which was selecting good photos and writing the right text.

Missing the birthday deadline had both good and bad consequences. I was very disappointed and embarrassed that I only had an early prototype to show, but it also allowed me to step back and rethink the project.

Beyond the browser

An appealing aspect of using three.js is that everything has to be defined in code. This provides a lot of control, but I quickly realized that I was not able to iterate and tweak the experience very fast. I had wanted to get my feet wet with a more full-fledged game engine for a while, so this was a good opportunity to do that.

After researching the pros and cons of different game engines as well as experimenting briefly with Unity, Unreal Engine, Godot and Babylon.js (spending way too long trying to make grass out of fur), I ended up sticking with Unity, because it has a native Linux editor and good platform support.

Unity is easy to get started with and includes lots of helpful tools out of the box. For example, I had good initial impressions of the built-in terrain editor and tree creator, and it seemed very easy to set up a basic outdoor environment. The Unity asset store also has a generous offering of free assets, including a ready-made first-person controller which is very handy.

Once the surface is scratched though, it becomes apparent that Unity is neither perfect nor complete, and creating games is not easy. Some of the challenges I faced were fun while others were frustrating.

The first (fun) obstacle I came across was creating the photo exhibits.

3D modeling the photo exhibit

Concept art for the “photo exhibit”… I only drew it for this blog post though :-)

As you can see in the prototype above, an “exhibit” has three wall-like structures with a photo. I wanted the pictures to protrude a bit from the walls, making it look like a white canvas is hanging from the wall, with the photo “painted” on top of it. The drawing on the right illustrates the idea.

I thought it would be super easy to do this in Unity: just create two cubes (a canvas cube and a wall cube), flatten and stretch them a bit, and put them next to each other so they overlap.

Actually, this worked ok, but there were two problems:

  1. There was a weird flicker where the cubes touched each other.
  2. When adding the photo to the canvas cube, it showed the photo on all sides, not just the front.

I fixed the second problem by putting a “quad” — a flat surface — on top of the canvas cube (or rather, next to it). The wall structure thus consisted of three 3D objects that were technically separate from each other, and it did not look good. There was still a weird flicker, and it also felt like the wrong way to solve the problem.

Wall structure with photo canvas, created in Unity using two cubes and a quad. There is a flickering artifact at the edge of the photo, framed in red.

So I hit an early roadblock: Either I had to define the wall structure programmatically, or I had to make my own 3D model. I opted for the latter choice.

After going through some basic tutorials for Blender, I was able to create the wall structure and learn a thing or two about 3D modeling along the way. This is the result in all its simple glory:

Blender render of a wall structure aka. “photo exhibit”. I used different colors to indicate that separate materials and textures can be used for different parts of the structure.

Even though I only did very basic stuff in Blender, it felt like a big win to be able to make basic models. I also created an exhibit sign and a cylinder with one open end (to simulate a tunnel or tube). All models can be found here.

Free models are great

Besides the photos and text, I decided to also create a “display” for each exhibit. This consisted of a 3D model or effect that was either a direct or indirect reference to the year the exhibit was for. For example, I used a Big Foot model standing on top of Mt Saint Helens for the year when we visited the area.

Big Foot standing on top of a model of Mt Saint Helens erupting.

Using pre-made models was a fun and easy way to make the exhibits a bit more interesting. It took some time to find the right model, and it sometimes needed tweaking after import, but it gave me the opportunity to include visuals I could not have created on my own.

For the record, here is the list of the models I used:

All models are licensed under CC BY except for MtHelens (CC BY-NC-ND) and 15Legend (CC BY-NC).

All the models were found on Sketchfab, an online community with a lot of 3D models available either for free or purchase. It was a nice discovery!

An extra dimension

Besides downloading models, I also researched the possibility of adding models to the scene by simply scanning my environment or specific objects.

A technique known as photogrammetry makes it possible to turn multiple photos into 3D models. I played around with an open-source tool called Meshroom which is amazingly simple to work with. Just add a lot of photos from different viewing angles, wait a few hours, and a finished 3D model comes out.

A scan of a birch log from the forest made its way into the scene:

Photogrammetry scan of a fallen birch log. The rough/spiky surface is the result of reducing the model complexity by removing polygons. The light reflections are unnatural but I kept them because they look kind of fun.

I did not get outstanding results, but it is worth noting I also just took the photos with my bad phone camera and spent very little time making sure I got good shots from all angles.

It is mind-blowing that it is possible to go from 2D photos to a 3D model, and I will definitely revisit photogrammetry in the future.

Creating fake rain

A small feature I had fun creating was a super simple rain effect. There are numerous weather system plugins available for Unity (some are free), and there was even a “hose” effect available in the standard assets that kind of did the trick (it simulates spraying water). But I needed a more uniform downpour, and I really just needed something simple.

The effect was created by taking a bunch of small particles, giving them a blue/white gradient color, and applying gravity to them. That is basically it.

A simple rain effect using a Unity particle system.

I reused a texture from a water surface effect in the standard assets to give the raindrops a blue-ish appearance. The tails on the raindrops are automatically created by the particle system when using a render setting called “stretched billboard”. A bit of noise was added to the movement of the rain drops, so the rain does not fall straight down but looks slightly more natural and chaotic.

After playing around with the particle speed and size, I got the right look and feel I wanted. I was expecting this to be much more complicated, so it was a nice surprise when the process was fairly straightforward.
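For anyone curious what a particle system does with those settings, here is a rough Python sketch of the idea: every drop carries a position and a velocity, gravity speeds it up each frame, a bit of noise nudges it sideways, and a drop that reaches the ground is recycled at the top. This is not Unity code, and Unity handles all of this internally. The constants and the names `spawn_drop` and `step` are assumptions chosen for illustration.

```python
import random

GRAVITY = -9.81      # m/s^2, pulls drops downward
SPAWN_HEIGHT = 20.0  # hypothetical height at which drops appear
NOISE = 0.5          # hypothetical strength of the sideways jitter

def spawn_drop(rng):
    """Create a drop at a random spot above the scene, at rest."""
    return {"x": rng.uniform(-10.0, 10.0), "y": SPAWN_HEIGHT, "vx": 0.0, "vy": 0.0}

def step(drops, dt, rng):
    """Advance every drop by one time step of dt seconds."""
    for i, drop in enumerate(drops):
        drop["vy"] += GRAVITY * dt               # gravity speeds the drop up
        drop["vx"] = rng.uniform(-NOISE, NOISE)  # noise: small sideways drift
        drop["x"] += drop["vx"] * dt
        drop["y"] += drop["vy"] * dt
        if drop["y"] <= 0.0:                     # hit the ground: recycle it
            drops[i] = spawn_drop(rng)

rng = random.Random(1)
drops = [spawn_drop(rng) for _ in range(100)]
for _ in range(60):  # simulate one second at 60 steps per second
    step(drops, 1.0 / 60.0, rng)
print(min(d["y"] for d in drops), max(d["y"] for d in drops))
```

Tweaking `GRAVITY` and `NOISE` here corresponds loosely to playing with the particle speed and noise settings described above.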

Designing for the player

The most enjoyable aspect of creating this game-like experience was going through our old photos to find a few that represented each year as well as thinking about the various events that happened throughout the year. It was a nice trip down memory lane.

Although the photos and text tell a story which is sequential in nature, the question was if they necessarily had to be experienced sequentially as well.

I considered two ways to handle progression through the game:

  1. Limit the initial environment with something like walls and corridors, guiding the player from exhibit to exhibit.
  2. Make the environment completely open, allowing the player to freely visit each exhibit in any order and with no restrictions.

The first option, limiting the player, would give me more control over the player’s movements and the “narrative” (if there was such a thing) of the experience, but it also felt like it would constrain the player. This can sometimes be a good technique to control pacing (a lot of games do this), but here it seemed unnecessarily constricting.

So I decided to go for option 2, the open environment, but I still wanted to provide some guidance to help navigate the scene. I did this by creating a dirt path that leads through the grass between the exhibits. I thought it was a nice, obvious and non-constricting way to guide the player a bit:

Aerial top view showing the entire scene. The player starts in the center. The gray lines are dirt paths that go from exhibit to exhibit.
Example of a dirt path, this one leading between the 1st and 2nd exhibit.

During the first 5-60 seconds of the game, the player is presented with the movement keys and the purpose of the game in a series of three welcome messages that show up on the screen as 3D text.

I wanted to be absolutely sure that the player could not miss the information, especially the movement keys. The way I achieved this was to add some constraint to the otherwise open environment at the initial stage of the game.

If you look closely at the aerial top view above, you might notice a long green shape at the center of the scene. This is actually a cylinder (or tunnel) floating 50 meters above the ground. The player starts the game inside the tunnel, and can only move forward and backward, ensuring that the information is difficult (but not impossible) to miss.

Furthermore, during the first 1-2 seconds, the camera is actually fixed in place, showing the movement keys while the start menu is fading out.

To make the cylinder/tunnel slightly more interesting, I painted it a bright green and used a normal map from a tree bark texture to give some semblance of walking inside a tree trunk.

When the player steps over the edge of the cylinder, they land near the first exhibit.

The player starts the game inside a green cylindrical shape, and is presented with the movement keys and other information further ahead.

The launch

I hope the above sections have provided at least some idea of how my little game-like experience turned out. I have not described everything, and there were even a few more ideas that did not make their way into the game at all, but I decided to stop the project when the core content was in a state I was satisfied with.

And then it was time to launch it, i.e. get my wife to play the game. I really wanted to see her reaction while playing, but I let her go through it by herself at her own pace.

I got quite emotional about it actually. Having revisited the memories of nice moments from the past while working on the project, I was already on a trip of nostalgia. Showing the game to my wife was the culmination of that journey, and when I heard a giggle coming from her room, I shed a little tear.

Moving forward

Even for a simple game-like experience like the one I created, there are still many little decisions that go into making it. Thinking through these decisions, playing around with solutions and seeing the result is often rewarding and interesting, and I can totally understand the appeal of working professionally with games and similar creative endeavors.

I also have a newfound appreciation for how long it takes to produce game content. Even though I am an amateur in everything that has to do with game design (except for writing code), and my project was extremely small in scope, it is still easy to see why it takes so long to create games, and why people specialize in modeling, programming, animation, sound design etc. instead of trying to do everything.

I do not think this is the last time I will dabble with creating games. I hope to be able to combine aspects of my professional work-life (data science/ML/AI) with game creation. That would be a win-win for a side-project indeed.

Continue on page 2 if you are interested in reading a bit more about my experience with Unity. If this does not sound interesting, you can just stop reading here. Thank you for making it this far :-)


Generating cartoon avatars with GANs

You might have heard of Deepfakes, which are images or videos where someone’s face is replaced by another person’s face. There are various techniques for creating Deepfakes, one of them being Generative Adversarial Networks (GANs).

A GAN is a type of neural network that can generate realistic data from random input data. When used for image generation, a generator network creates images and tries to fool a discriminator network into believing that the images are real. The discriminator network gets better at distinguishing between real and fake images over time, which forces the generator to create better and better images.
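The tug-of-war between the two networks can be written down as two loss functions. The following is a minimal Python sketch using binary cross-entropy, which is a common choice for GANs. It assumes the discriminator outputs a probability between 0 and 1 that an image is real; the scores and function names are hypothetical and only serve to show which way each loss moves.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 target."""
    eps = 1e-7  # keep log() away from 0
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def discriminator_loss(real_scores, fake_scores):
    """The discriminator wants real images scored 1 and generated images scored 0."""
    real = sum(bce(s, 1) for s in real_scores) / len(real_scores)
    fake = sum(bce(s, 0) for s in fake_scores) / len(fake_scores)
    return real + fake

def generator_loss(fake_scores):
    """The generator wants its images scored 1, i.e. mistaken for real ones."""
    return sum(bce(s, 1) for s in fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs (probability that an image is real)
real_scores = [0.9, 0.8]
fooled = [0.7, 0.6]  # the discriminator is being fooled
caught = [0.1, 0.2]  # the discriminator spots the fakes

print(generator_loss(fooled), generator_loss(caught))
print(discriminator_loss(real_scores, fooled), discriminator_loss(real_scores, caught))
```

The generator's loss is low exactly when the discriminator's loss is high, and vice versa, which is what keeps pushing both networks to improve.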

I have wanted to play around with GANs for a while, specifically for generating small cartoon-like images. This post is a status update for the project so far.

Here is the code, and here are 16 examples of images generated by the current state of the network:

16 cartoon faces generated by a GAN

DCGAN Tutorial and drawing ellipses

There are many online tutorials on how to create a GAN. One of them is the DCGAN tutorial from the TensorFlow authors. This tutorial was my starting point for creating and training a GAN using the DCGAN (deep convolutional GAN) architecture.

In the tutorial, the authors train the GAN to generate hand-written digits, based on the famous MNIST dataset. Instead of creating hand-written-number-lookalikes, I wanted to see if I could generate simple shapes like these ellipses:

Color ellipses used for input

I thought these shapes would be a trivial task for the GAN to generate, but I was of course mistaken.

After implementing the DCGAN network based on the DCGAN tutorial, my first attempt that actually did something produced color in some kind of shape but not actual ellipses.

A note on the images shown throughout this post: Let’s say we have 10 thousand images in our dataset (in this case 10 thousand images of an ellipse). One epoch consists of running through all these images once and a network is trained for 50 epochs. At the end of each epoch, an image is captured based on 16 sample inputs to the generator. These inputs stay the same during training. Thus, we have 50 images (one for each epoch) with 16 generated samples when the network is done training, and we are ideally interested in seeing these 16 images get more realistic over time.
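The bookkeeping described in this note can be sketched in a few lines of Python. The training step and the generator are placeholders here, and the input length of 100 is an assumption, so this only illustrates how the same 16 inputs are reused after every epoch.

```python
import random

EPOCHS = 50
NUM_SAMPLES = 16
NOISE_DIM = 100  # assumed length of each random input vector

rng = random.Random(42)
# Drawn once, before training, and reused after every epoch
fixed_inputs = [[rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
                for _ in range(NUM_SAMPLES)]

def train_one_epoch(epoch):
    """Placeholder: a real version would run through the whole dataset once."""

def generate(noise_vector, epoch):
    """Placeholder generator: returns a label instead of an actual image."""
    return f"epoch {epoch:02d}, input starting with {noise_vector[0]:+.3f}"

snapshots = []
for epoch in range(1, EPOCHS + 1):
    train_one_epoch(epoch)
    # Same 16 inputs every time, so changes come from the network alone
    snapshots.append([generate(z, epoch) for z in fixed_inputs])

print(len(snapshots), "snapshots with", len(snapshots[0]), "samples each")
```

Keeping the inputs fixed is what makes the epoch images comparable: sample number 3 in epoch 10 and sample number 3 in epoch 40 come from the same input vector.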

The video below shows the evolution of one of these network training sessions. The video is stitched together from the 50 epoch images. Notice that at the beginning of training, the output of the generator is a gray blob which is the random data. Over time, some colors emerge, until training collapses in the end and it just generates white backgrounds :-)

First attempt at making ellipses with a GAN

Ellipses in opaque black and white

Taking a step back and reviewing the tutorial again, I took note of a few things that I did not pay attention to initially:

  1. The tutorial uses white, opaque digits on a black background. I was using unfilled (not opaque) ellipses on a white background.
  2. The images are only black and white (grayscale). I was using many colors.
  3. The MNIST dataset consists of 60 thousand examples. I was using a few hundred images.

If the goal of the generator is to fool the discriminator, but the images of ellipses are actually mostly white background with a little bit of color, it makes somewhat intuitive sense that the generator ends up just drawing white backgrounds as seen in the video above.

With this in mind, I created 10 thousand opaque white ellipses on a black background, just to prove that the network was indeed working. Here are some examples:

Opaque ellipses, black and white
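A dataset like this takes very little code to produce. Here is a minimal Python sketch that draws filled white ellipses on a black background as grids of 0s and 1s. The image size of 28 by 28 pixels, the ranges for position and radius, and the function name `make_ellipse` are assumptions made for illustration, not the settings actually used.

```python
import random

def make_ellipse(size, rng):
    """Return a size x size grid with a filled white (1) ellipse on black (0)."""
    cx = rng.uniform(size * 0.3, size * 0.7)  # centre
    cy = rng.uniform(size * 0.3, size * 0.7)
    rx = rng.uniform(size * 0.1, size * 0.3)  # radii
    ry = rng.uniform(size * 0.1, size * 0.3)
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            # A point is inside the ellipse when this sum is at most 1
            inside = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
            row.append(1 if inside else 0)
        image.append(row)
    return image

rng = random.Random(7)
dataset = [make_ellipse(28, rng) for _ in range(100)]  # the post used 10 thousand

# Print the first ellipse as text to check what it looks like
for row in dataset[0]:
    print("".join("#" if pixel else "." for pixel in row))
```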

The result from doing this was much better, and the generator ended up creating something that resembles circles:

Second attempt at making ellipses with a GAN

Wow, I created a neural network with 1 million parameters that can generate white blobs on a black background *crowd goes wild and gives a standing ovation*.

Sarcasm aside, it is always a good feeling when the network finally does something within a reasonable timeframe (it took about a minute to train this network).

Deeper, wider, opaque, color

After the “success” of the black and white ellipses, I started reviewing some tips on how to tweak a GAN (see references at the bottom of post). Without going into too much detail, I basically made the neural network slightly deeper (more layers) and slightly wider (more features) and switched back to using random colors for the ellipses, while keeping them opaque.

Here are some examples of the input ellipses:

Opaque ellipses, with color

After training the network with these images, it was interesting to see the 16 generated samples converge to colored blobs and then change dramatically between epochs. I think this is what is known as “mode collapse” and is a known issue/risk when training GANs:

Each iteration of [the] generator over-optimizes for a particular discriminator, and the discriminator never manages to learn its way out of the trap. As a result the generators rotate through a small set of output types. This form of GAN failure is called mode collapse.

Google Developers, Common Problems with GANs

Mode collapse is most obvious when viewing the epoch images individually, so rather than stitch them together into a video, I have included 50 images below. Notice that after about 20-25 epochs, the output starts to resemble colored ellipses, and all epochs after that do not seem to improve much:

I must admit, I think there’s a certain beauty to these generated images, but to be honest, they are still just randomly colored blobs, and they could be generated with much simpler algorithms than this beast of a neural network.

Generating cartoon avatars

Instead of continuing to tweak the ellipses-generating network, I wanted to see if I could generate more complex images. My original idea was to generate cartoon-like images, and to my great delight, Google provides the Cartoon Set, a dataset consisting of thousands of cartoon avatars, licensed under the CC-BY license.

You have already seen an example result of using this dataset at the top of this post. Here are the 50 epoch images from training the network on the small version of the dataset (10 thousand images). Notice that the network starts to create face-like images after just a few epochs, and then starts cycling the style of the face, probably due to the above-mentioned mode collapse:

This is as far as I got currently. I would like to create a little web app for generating these images in the browser, but that will have to wait for another day. It would also be nice to be able to provide the facial features (hair color, eye color, etc.) as inputs to the network and see how that performs.

To keep my motivation up though, I think I need to switch gears and try something else for now. This was fun! :-)


A search for “DCGAN Tensorflow” yields many useful results, a lot of which I have skimmed as well, but the above are the primary resources.