Our project is a Japanese OCR system for manga! We take manga text bubbles, segment the characters into subimages, classify each subimage as a machine-readable character, and combine the characters into sentences. We can even translate entire sentences to English with a simple API call. The user will highlight the text boxes themselves, which reduces the complexity of our project.
Fig 1. Our proposed implementation
Optical character recognition (OCR) is a field of ML that focuses on extracting text from images into machine-readable form (e.g. UTF-8 encoded strings). OCR uses a number of classification methods, including artificial neural networks, kernel methods, statistical methods, template matching techniques, and structural pattern recognition [1].
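As a concrete illustration of one of these families, the snippet below shows template matching with OpenCV: a character subimage is compared against a small dictionary of reference glyphs, and the best-scoring glyph wins. The template dictionary and its contents are placeholders, not project assets.

# Illustrative template matching (one of the OCR approaches listed above).
# The templates dictionary maps a character to a reference glyph image.
import cv2
import numpy as np

def match_character(subimage_gray: np.ndarray, templates: dict[str, np.ndarray]) -> str:
    """Return the character whose reference glyph best matches the subimage."""
    best_char, best_score = None, -1.0
    for char, template in templates.items():
        # Resize the subimage to the template's (width, height) so shapes match.
        resized = cv2.resize(subimage_gray, (template.shape[1], template.shape[0]))
        score = cv2.matchTemplate(resized, template, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_char, best_score = char, score
    return best_char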
Manga109 has been used to train an AI model to recognize and retrieve a manga based on a user's sketch of a character or panel contained within. This is not entirely relevant to OCR, but the authors do outline good image processing methods for Manga109 data [2].
Character classification is a central part of our project. There are several thousand unique kanji, and some kanji may not even be present in our dataset. One team achieved 48% accuracy on zero-shot kanji (characters absent from the training data) by decomposing kanji into radicals and using a ResNet model for feature extraction [3].
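Our rough reading of that approach is sketched below with placeholder names (this is not the authors' code): a ResNet backbone predicts which radicals appear in a glyph, and an unseen kanji is recognized by matching the predicted radical vector against a kanji-to-radical table, which we could build from a resource like KanjiVG.

# Sketch of radical-based zero-shot kanji recognition (our reading of [3]).
# NUM_RADICALS and kanji_to_radicals are placeholder assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_RADICALS = 300  # placeholder size of the radical vocabulary

class RadicalNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, NUM_RADICALS)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-label radical prediction: one sigmoid score per radical.
        return torch.sigmoid(self.backbone(x))

def recognize(glyph: torch.Tensor, model: RadicalNet,
              kanji_to_radicals: dict[str, torch.Tensor]) -> str:
    """Pick the kanji whose (float) binary radical vector is closest to the prediction."""
    pred = model(glyph.unsqueeze(0)).squeeze(0)  # shape: (NUM_RADICALS,)
    scores = {k: torch.cosine_similarity(pred, v, dim=0).item()
              for k, v in kanji_to_radicals.items()}
    return max(scores, key=scores.get)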
Manga109: A dataset of 109 manga volumes annotated with face, body, character, and (importantly for us) text regions.
CC-100: A collection of web-scraped text corpora for many languages, including Japanese, which we will use to construct sample sentences.
KMNIST: Three datasets of cursive Japanese (kuzushiji) characters, which are highly varied and often dissimilar to modern writing (a loading sketch follows this list).
KanjiVG: A collection of vector graphics that describe the strokes and stroke order of Japanese kanji.
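As a quick sanity check that these datasets are easy to work with, here is a minimal sketch that loads KMNIST through torchvision's built-in wrapper; the other datasets would need their own loaders, and the root directory is a placeholder.

# Minimal KMNIST loading sketch using torchvision's built-in dataset wrapper.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # 28x28 grayscale images scaled to [0, 1]
train_set = datasets.KMNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])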
The problem is to extract text from manga.
Learning Japanese is extremely difficult for native English speakers. Our project aims to convert manga images into machine-readable text so that Japanese learners can easily look up unfamiliar words.
We aim to host this project on our website, and our main goal is a good user experience, which is hard to quantify with a single metric.
We don't expect perfect results, since good OCR is genuinely hard. We did a small proof of concept with Google's Tesseract engine, and the output was not very good. That said, Tesseract is built for any language and any format, so our project may be able to leverage features specific to manga (e.g. vertical text inside clean speech bubbles) to get better results.
Fig 2. Naive Implementation with Google's Tesseract
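For reference, the naive baseline behind Fig 2 amounts to little more than a single pytesseract call like the one below (a sketch, not our exact script). The image path is a placeholder, and we are assuming the Japanese vertical-text language pack (jpn_vert) is installed alongside the Tesseract binary.

# Naive baseline: run Tesseract directly on a raw manga page.
from PIL import Image
import pytesseract

page = Image.open("sample_page.png")
# Manga dialogue is usually vertical, so the vertical Japanese model fits better.
text = pytesseract.image_to_string(page, lang="jpn_vert")
print(text)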
[1] J. Memon, M. Sami, R. A. Khan, and M. Uddin, "Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR)," IEEE Access, vol. 8, pp. 142642–142668, 2020.
[2] Aizawa Yamasaki Matsui Lab, "Manga109: Japanese Manga Dataset." Accessed: Sep. 27, 2024. [Online]. Available: http://www.manga109.org/en/annotations.html
[3] T. Ishikawa, T. Miyazaki, and S. Omachi, "Japanese historical character recognition by focusing on character parts," Pattern Recognition, vol. 148, p. 110181, 2024, doi: 10.1016/j.patcog.2023.110181.
Fig 3. Gantt Chart
Fig 4. Team Contribution Table
Fig 5. How to center a div