How to use Free Software to learn Japanese, and more.

Mining from manga

April 01, 2021 — Tatsumoto

When we read manga, sometimes there's a need to quickly OCR a portion of the screen to look up new words and add sentences to Anki. To do so, you're going to use an optical character recognition program and a few helper tools.


Preface

Our goal is to be able to use Yomichan with manga. We need a toolchain that does the following:

  1. Takes a screenshot, selecting the part of the screen that contains a speech bubble with Japanese text.
  2. Processes the taken screenshot.
  3. Recognizes the text and copies it to the system clipboard.
  4. Using Yomichan's Clipboard Monitor we can perform lookups.

To recognize text on the pages of a manga, you can use Tesseract or Transformers. Tesseract is a more lightweight tool but makes more mistakes on average. With Transformers you have to install a big number of Python packages that take up several gibibytes of disk space, but you get much better text recognition.

In this article I explain how to set up both. The resulting user workflow is identical, see the demo below.

Video demonstration.

Image viewer

To read manga, it is nice to have an image viewer. I use sxiv, but for this setup you can install any image viewer. On many manga sites you can also read online in a web browser.

OCR method

Although Transformers requires more system resources, I prefer it to Tesseract. Compared to Tesseract it handles manga better.

Setting up Transformers

Install transformers_ocr from the AUR.

$ trizen -S transformers_ocr

transformers_ocr makes use of the following programs:

  • maim to take screenshots.
  • xclip to copy text to the clipboard.

If you're not running a distribution based on Arch Linux, install manually by following the instructions on GitHub.

By itself transformers_ocr is just a short wrapper script that installs Transformers and other required Python packages. After the installation you need to download additional dependencies. Run the following command.

$ transformers_ocr download

It will download manga-ocr, a Python library responsible for optical character recognition. The files will be saved to ~/.local/share/manga_ocr and take up 2GiB of disk space.

Note: transformers_ocr saves the Python packages to a standalone directory to ensure that later you can uninstall everything by simply removing the directory.

Usage

To OCR text on a manga page, run:

$ transformers_ocr recognize

When run, it will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard. Use it in combination with Yomichan Search to quickly lookup Japanese words in real-time.

To open Yomichan Search, open your Web Browser and press Alt+Insert. Yomichan should be already installed.

The first run will take longer than usual. There's yet another set of files that have to be downloaded for the OCR to work. The files will be saved to ~/.cache/huggingface and take up another 500MiB.

Keyboard shortcut

Bind this script to a keyboard shortcut in your DE, WM, sxhkd, xbindkeysrc, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id transformers_ocr recognize

Autostart

Before transformers_ocr can recognize text, it needs to start a background listener. Although this is optional, to minimize the startup lag, add the following command to autostart.

transformers_ocr listen

Here's an example for i3wm:

exec --no-startup-id transformers_ocr listen

Setting up Tesseract

Install the following dependencies:

$ sudo pacman -S --needed tesseract maim xclip imagemagick unzip
  • tesseract is the OCR engine. It is considered fairly accurate, and many people like it.
  • maim is a utility for taking screenshots which can take parts of the screen.
  • xclip is a tool for copying text to the clipboard.
  • imagemagick is a command-line image editor. It's going to come handy to edit the screenshots before Tesseract analyzes them.
  • unzip is a tool for extracting zip archives.

Download maimocr and save it as ~/.local/bin/maimocr. maimocr is a script we are going to use to recognize Japanese text.

Make the file executable:

$ chmod +x ~/.local/bin/maimocr

The directory ~/.local/bin should be in your PATH.

Usage

Tesseract doesn't work without trained data files. These files tell Tesseract how to read and recognize text from images. When you first run maimocr, it should download Japanese data files automatically. Check the terminal output to see if the process succeeds.

When you run it the second time, maimocr will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard. Use it in combination with Yomichan Search to quickly lookup Japanese words in real-time.

To open Yomichan Search, open your Web Browser and press Alt+Insert. Yomichan should be already installed.

Keyboard shortcut

Bind this script to a keyboard shortcut in your DE, WM, sxhkd, xbindkeysrc, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id maimocr

Now you can quickly call maimocr anywhere by pressing the keyboard shortcut.

Expanding data set

By default, maimocr automatically downloads tessdata.zip (mirror) with Tesseract data files, then saves the files to ~/.local/share/tessdata.

To use additional data files with maimocr, copy any new *.traineddata files to ~/.local/share/tessdata.

Capture2Text files

These instructions are no longer necessary. The files are included by default.

Download capture2text. We won't need the program itself because it's garbage but the trained data files are going to be useful. Extract the contents of the tessdata folder to ~/.local/share/tessdata:

$ unzip -j Capture2Text_v*_64bit.zip 'Capture2Text/tessdata/*' -d ~/.local/share/tessdata

Alternatively, download just the Capture2Text Japanese files from here.

Capture2Text archive

Contents of the ZIP archive.

Troubleshooting

If you notice that the script fails to OCR certain images, try to zoom in or find a scan with a better resolution. Tesseract works poorly at low resolutions.

Nonstandard fonts often fail to OCR properly. In this case I don't have a definitive answer at the moment. Try searching for more *.traineddata files online and adding them to the tessdata folder.

Adding screenshots

If you want to add a screenshot from a manga to your Anki card, maim can do that too. maimpick is a script that uses maim to screenshot parts of the screen and copy them to the clipboard. Install it to ~/.local/bin, make it executable and bind it to a key. Explore my dotfiles for details.

In addition to maim, maimpick requires dmenu and xdotool to work.

Note: ames is another program that can add screenshots to Anki.

Other software

See Resources.

Tags: guide