Manga Learn!
April 24, 2016
Why Manga?
Manga (漫画) are comics created in Japan, or by creators in the Japanese language, conforming to a style developed in Japan in the late 19th century.[1] They have a long and complex pre-history in earlier Japanese art. Manga stories are typically printed in black-and-white,[9] although some full-color manga exist (e.g., Colorful). When manga is colorized at all, it is usually done after the black-and-white release, and it is often skipped entirely because colorization is so time-consuming.
To this point, the goal of this project is to alleviate the time-consuming pain of colorization by using machine learning to train a model that can perform automatic colorization. It will train on colorized examples to identify shapes and objects in the manga image which are a consistent color. Then, when fed a black-and-white manga from the same artist and series, it should be able to guess the fill color by relating the properties of the shape to the training set. The same artist and series is chosen to simplify this procedure. Generalization may be part of an expansion of this project if it goes well :). Now, what manga will we use? What a silly question! One Piece of course! The inspiring story of a boy who eats the Gomu Gomu no Mi (rubber fruit) and is going to be the Pirate King one day! If you haven't read this epic tale, not only will you get a fun glimpse of it while reading through this notebook, but I highly suggest you read the rest.
Unfortunately, the code that Richard Zhang uses to train his network is not online, and even if it were, I question whether I could get it to work with caffe - I had a hard enough time just installing it to test the completed model.
Training the colornet architecture with manga images
Luckily, there are other colorization networks online that use tensorflow - something which I can install and have used before. One such network is well described here: Automatic Colorization with code here: https://github.com/pavelgonchar/colornet.
A notable difference between this architecture and other colorization architectures is that this uses the popular VGG16 network weights for feature identification. It feeds images forward through this network and then extracts the “hypercolumns” which give summary activations at various layers. These hypercolumns are then fed as features into the colorization half of the network. This essentially saves you the trouble of training the feature identification yourself.
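To make that idea concrete, here is a minimal sketch of hypercolumn extraction, assuming tf.keras and its bundled VGG16; the two layer names are illustrative choices of mine, not the exact layers colornet uses:

```python
# Sketch of hypercolumn extraction (assumes tf.keras and its bundled VGG16;
# the layer names are illustrative, not the exact ones colornet uses).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
layer_names = ["block2_pool", "block4_pool"]          # two pooling layers
extractor = Model(inputs=vgg.input,
                  outputs=[vgg.get_layer(n).output for n in layer_names])

def hypercolumns(batch):
    """batch: (n, 224, 224, 3) grayscale pages replicated to 3 channels.
    Returns per-pixel hypercolumns of shape (n, 224, 224, total_channels)."""
    feats = extractor.predict(batch)
    # Upsample every activation map back to input resolution and stack the channels.
    upsampled = [tf.image.resize(f, (224, 224)).numpy() for f in feats]
    return np.concatenate(upsampled, axis=-1)
```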
To start with, I fed 10,000 images through the network. Because the network accepts 224 x 224 image sizes, I had to slice my manga pages into separate squares. Since my 1200 x 780 pages don't quite chop up nicely, I ended up centering each page onto a 1344 x 1344 black backdrop and chopping this into 36 equal 224 x 224 squares. Thus there are some squares that are only black and others which have large black borders. The all-black squares are thrown away and not used for training; the partially black ones are kept.
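The slicing step itself is straightforward; here is a rough sketch of it with PIL and NumPy (the all-black test and the function name are my own assumptions, not the exact script I used):

```python
# Sketch of the page-slicing step (PIL + NumPy; the blackness check is an assumption).
import numpy as np
from PIL import Image

TILE, CANVAS = 224, 1344          # 1344 = 6 * 224

def slice_page(path):
    page = Image.open(path).convert("L")
    canvas = Image.new("L", (CANVAS, CANVAS), 0)                 # black backdrop
    canvas.paste(page, ((CANVAS - page.width) // 2,
                        (CANVAS - page.height) // 2))            # center the page
    arr = np.asarray(canvas)

    tiles = []
    for r in range(0, CANVAS, TILE):
        for c in range(0, CANVAS, TILE):
            tile = arr[r:r + TILE, c:c + TILE]
            if tile.max() > 0:                                   # drop all-black squares
                tiles.append(tile)
    return tiles                                                 # up to 36 tiles per page
```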
It's important to note one potential drawback of this method: since these squares must be stitched together again after the color is predicted, a difference in their edges will be apparent in the resulting colorized manga page. I haven't determined how big of an issue this will be. There are a couple of ways we might avoid this, but for now we are going to turn a blind eye to this problem as well. This is a research project, remember! Perfection is not required :)
Below are some test images taken at various stages of training. On the left is the black-and-white input image, the middle is the model's predicted colorization, and the right is the true colorization.
Because the network treats this as a regression problem, it learns to throw up its hands and guess greyish brown every time, since this minimizes the penalization it sees for each wrong guess. This is a more worrisome issue than the one described above. No matter how many images we train on, greyish brown will remain a reasonable choice for the model to guess over and over again. Thus, perhaps a change in architecture is afoot!
For this, we go back to Richard Zhang's approach. He mentions this issue in his paper and comes up with an excellent workaround. Instead of treating this as a regression problem, we quantize the output color space, and the task becomes classifying each pixel into one of these quantized color bins. Specifically, our task is to predict a probability distribution for each pixel of the image, where each probability is the likelihood of the pixel belonging to a particular bin. We then select the most likely bin and color the pixel as such. This way, rather than judging each guess as "correct" or "incorrect", we can judge a guess by its predictive power against each bin. This type of evaluation gives an optimizer much more information to improve with the next time around.
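As a rough illustration of the binning, here is one way to quantize the a,b channels into a grid of classes. The bin width and range are my assumptions for the sketch; Zhang et al. actually use a gamut-restricted set of 313 ab bins.

```python
# Rough sketch of the quantize-and-classify idea (bin size and ranges are assumptions).
import numpy as np

BIN = 10                      # width of each a/b bin
OFFSET = 110                  # a, b roughly span [-110, 110]
N_SIDE = (2 * OFFSET) // BIN  # bins per axis (22)

def ab_to_class(ab):
    """Map an (H, W, 2) float array of a,b values to integer bin labels."""
    idx = np.clip(((ab + OFFSET) // BIN).astype(int), 0, N_SIDE - 1)
    return idx[..., 0] * N_SIDE + idx[..., 1]          # (H, W) class ids

def class_to_ab(labels):
    """Invert the binning, returning the centre of each bin."""
    a = (labels // N_SIDE) * BIN - OFFSET + BIN / 2
    b = (labels %  N_SIDE) * BIN - OFFSET + BIN / 2
    return np.stack([a, b], axis=-1)
```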
Custom Model
My model borrows heavily from both Richard Zhang's and the colornet model from pavelgonchar. There is one file that converts pre-sliced images into raw HDF5 arrays - X, which results from feeding the black-and-white image through VGG-16, and y, which is the "binned" output columns. This is done in the populate_h5.py file, and the custom model architecture is shown in the train.py file. Here is this architecture:
Each sliced black-and-white image becomes the input. This image is fed into VGG16 and hypercolumns are extracted after the 8th and 15th layers - both pooling layers.
These hypercolumns are fed into a simple colorization network.
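Something like the following toy version conveys the shape of that colorization half. The layer widths and the number of bins are guesses, not what's in train.py; it just maps hypercolumn features to a per-pixel distribution over quantized color bins.

```python
# Toy colorization head (Keras; layer sizes and N_BINS are assumptions).
from tensorflow.keras import layers, models

N_BINS = 484                       # e.g. a 22 x 22 grid of a,b bins (assumed)

def build_color_head(channels):
    inp = layers.Input(shape=(224, 224, channels))
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(N_BINS, 1, activation="softmax")(x)      # per-pixel distribution
    return models.Model(inp, out)
```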
Here is the loss after the first attempt, tried with 3 different optimizers - SGD with a learning rate of 0.001, momentum of 0.9, and Nesterov enabled; Adadelta; and Adam.
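For reference, those three configurations look roughly like this in tf.keras (the actual script may have used raw TensorFlow optimizers instead):

```python
# The three optimizer settings tried above, expressed as tf.keras objects.
from tensorflow.keras import optimizers

sgd      = optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
adadelta = optimizers.Adadelta()
adam     = optimizers.Adam()
```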
Unfortunately, it's hovering at ~4850, seemingly no matter what sample I try it on.
All 3 seem to run into the same basic bottleneck at the same point, leading me to avoid trying yet more optimizers.
At this point in the project, I have two conflicting emotions. First of all, pride. I am finally training my own custom network! Uncharted waters. No one has ever tried to solve this problem in the way that I am doing it now. This really feels like I'm finally doing machine learning. Second, I feel overwhelmed. There are no guidelines for solving this type of problem. Uncharted waters are like that.
My first move is to simplify the network. Input features of size (n_samples, 960, 224, 224) are MASSIVE in comparison to most neural networks. The conical architecture, where the data block is reduced at each step, is also unusual compared to what I've seen. Thus, in an effort to conform, I think reducing the scale of the whole thing is a good step.
My current model is sitting at around 5,800,000 parameters. I have a total of 300,000 sample images to work with.
In order to simplify this model, there are several possibilities:
- remove hypercolumns
- reduce the image size. This would have the added benefit of increasing the number of samples; however, I am worried about the patching issue described above
- remove layers from our colorization network.
I ended up doing all three. I resized each image down from 224x224 to 50x50, effectively reducing the number of input features per channel by a factor of 50176/2500, or about 20.
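A quick sanity check of that reduction, plus the resize itself sketched with PIL (the interpolation mode is my assumption):

```python
# Per-channel pixel counts before and after, and the downscaling step itself.
from PIL import Image

print(224 * 224, 50 * 50, round((224 * 224) / (50 * 50), 1))   # 50176 2500 20.1

def downscale(tile):
    return tile.resize((50, 50), Image.BILINEAR)                # 224x224 -> 50x50
```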
Here is a comparison of the original and the scaled image (below). Note the obvious lack of detail on the right. This is an especially detailed image, however, so usually it won't be this bad.
I also only used hypercolumns 8 and 15, after some deliberation: I only wanted two layers. As you move forward through the network, the hypercolumns give increasingly semantic rather than location-specific information, so I want the earlier layers since I am interested in pixel-specific coloring. So not 22. I then picked 8 and 15 rather than 3 and 15, as 3 just seems a bit too early to give useful information, especially if it were one of only two hypercolumns.
Finally, removing layers from the colorization network basically came with the territory.
This reduces my network down to about 1.1 million parameters. When we train again….
This is the loss using a batch size of 3 samples - basically "online" training, or just standard SGD. As you can see, it seems to get stuck at a loss of ~4850. After doing some research into getting stuck like this, I found this paper: http://deeplearning.cs.cmu.edu/pdfs/Gori_Tesi.pdf, which recommends "batch mode" learning as a scenario in which local minima are more commonly avoided. Thus, I upped my batch size to 64.
This actually looks a bit better - you can see that perhaps it's not going to stay at that 175 loss and will indeed decrease. At this point, I realized that to truly evaluate it, I would of course need more data. Although the accuracy is great: 0.36!
Retrospective
Since this is my first time working with neural networks, and first non-kaggle machine learning project (where I can’t copy from others), I figured I should write down some things I’ve noticed/learned.
- Learn from others! My first attempts at SVG-based processing were not a waste at all - in fact I might try them again. However, reading more of the prior art first would have given me invaluable information about feature selection and algorithm selection/strategy. Apparently, machine learning is more than just throwing classifiers at data! :p
- Working with massive amounts of data was by far the most time-consuming part. Also the most mentally taxing part. By the time I actually had input data and labels and was running the neural networks, it felt like a huge relief. Slicing that many images, moving them around, worrying about the size, worrying about memory issues, connection speeds when uploading and downloading them, GPU tax - everything takes time. You have to strategize. You'll want to start certain processes before you go to sleep so that you don't waste time in the morning, seeing as they take several hours. This is not a "solvable" problem either. It doesn't go away - you just have to work around it in certain ways. HDF5 is a neat tool, but it's just another step in the chain of things you have to do to prep the data. Unfortunately this also means that you are incentivized to train with less data. If you look at everything I tried above, you'll see a pattern that results could have been better with more data. However, it's a tradeoff. You can't keep throwing time at a problem, at least not until it's proven to be worth it. A better computer can help a lot. I always figured computers were as fast as they needed to be until I tried my hand at machine learning. Apparently they have a ways to go.
- AWS is the second best thing you can do if you wish to solve machine learning problems. The best is to buy your own computer. g2.2xlarge GPU instances are a lot better than a MacBook Pro laptop (yep, I did attempt this), but getting the data onto them and installing everything is a serious PIA. For instance, the current version of tensorflow only works with newer graphics cards than the ones provided by Amazon. It also means that every time you add more data, you have to go through the process of uploading it, transferring it, downloading it, etc. I've done separate parts of this project on a total of 4 computers (not counting the many, many EC2 instances I tried to get things running on initially). Every time you add another computer to the mix, you are increasing the complexity of your workflow. One computer would have been the way to go.
- Blacks are never colored. You can especially see this in the kanji (lettering), where a black marking is supposedly bright red or purple to accent intensity, violence, surprise, etc. Note that in the case of black-and-white photographs, for which this architecture was designed, this is fine: black is supposed to stay black. Here, however, we are converting the image to its L (luminance) channel, trying to guess its a,b constituents, and then converting it back into RGB to be shown. Prediction of the luminance channel is not something the model covers. It assumes that the luminance will not change, and for the lettering this is not the case - the marking goes from fully black to a mid-luminance red. The only way we could foreseeably capture this is to include the luminance channel in the prediction as well. (A small sketch of the recombination step follows below.)
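Here is a small sketch of that recombination step, using scikit-image for the Lab <-> RGB conversion; the function and array names are mine, not from the project code. It shows why black ink stays black: the original luminance is reused as-is.

```python
# Sketch of recombining predicted a,b with the original luminance
# (scikit-image for the color conversions; names are illustrative).
import numpy as np
from skimage import color

def recombine(gray_rgb, predicted_ab):
    """gray_rgb: (H, W, 3) grayscale page replicated to RGB, floats in [0, 1].
    predicted_ab: (H, W, 2) predicted a,b values in Lab range."""
    L = color.rgb2lab(gray_rgb)[..., :1]              # keep the true luminance
    lab = np.concatenate([L, predicted_ab], axis=-1)
    return color.lab2rgb(lab)                         # L = 0 stays black, whatever a,b say
```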
Colorization worked well, with a few issues: it didn't color large images - only small ones which were entirely contained in the random 224x224 square. Also, boundaries were obvious, of course, because they showed two different squares colored differently.
Next step - take one panel at a time. A few questions: should each panel be stretched to 224x224 or just placed onto a 224x224 canvas?
Perhaps we should just do the contour…. Even if contours overlap, who cares? This might be a good thing?
Aspect ratios of a few random chapters:
![histogram-chapter-2](http://www.alexmarshall.website/assets/img/2nd-sample-chapter-aspect-ratios.png)
![histogram-chapter-3](http://www.alexmarshall.website/assets/img/3rd-sample-chapter-aspect-ratios.png)
However, as I was inspecting the data, I found something really interesting. Look at the contours on this chapter in particular.
![overlapping-contours](http://www.alexmarshall.website/assets/img/1rst-sample-chapter-aspect-ratios.png)
Note how in the main panel there are several contours on top of each other - no single contour forms the entire image. This initially seems like a bad thing: how do you know what to feed to the model? Well, I think you could feed all of these overlapping contours and they would all be analyzed separately, and there is no problem with that. Each contour holds a separate object. In fact, why are we so hung up on this concept of panel-specific contours? Why don't we just do the entire page by contour and let them overlap? Maybe this would have been obvious to some, but for me it was a big lightbulb moment.
One big question at hand is the area threshold above which we are willing to consider a contour "valid". Obviously, really small objects could be just noise. At the other extreme, if we wanted only panels, we would probably have to accept only really big contours. Since we are feeding images into the network at 224x224, it seems reasonable to use a threshold of 50176 pixels. However... I think it might be a question of how close the area is to 50176 and what shape the contour is. This might be a factor we have to compute specially. I'll get back to this.
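As a starting point, the area filter might look something like this with OpenCV; the binarization settings and the default threshold are assumptions for the sketch, not settled choices.

```python
# Sketch of the contour-area filter (OpenCV 4.x; binarization settings are assumptions).
import cv2

MIN_AREA = 224 * 224          # 50176 px, the size of one network input

def valid_contours(page_path, min_area=MIN_AREA):
    gray = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    # Dark linework -> white foreground so findContours picks up the panel borders.
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) >= min_area]
```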
Chapter 1
Now we're going to look at our two separate strategies for each chapter. Remember, the orange squares mean these were pulled apart first and then patched together. The normal ones were shrunk down to 224x224, colored, and then that color was mapped onto the larger images.
Generally, it looks to me like the entire sheet colored at once is doing better, but both still need work.