How to use Free Software to learn Japanese, and more.

Mining from manga

April 01, 2021 — Tatsumoto

When reading manga, you sometimes need to quickly OCR a portion of the screen to look up new words and add sentences to Anki. To do so, you're going to use an optical character recognition (OCR) program and a few helper tools.

Install the following dependencies:

$ sudo pacman -S --needed sxiv maim tesseract xclip imagemagick
  • sxiv is an excellent image viewer. For this setup you can replace it with any image viewer, but sxiv is what I use.
  • tesseract is the OCR engine that does the actual text recognition. It is widely used and considered fairly accurate.
  • maim is a utility for taking screenshots that can capture a selected region of the screen.
  • xclip is a tool for copying text to the clipboard.
  • imagemagick is a command-line image editor. It's going to come in handy for editing the screenshots before Tesseract analyzes them.
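
To confirm that everything got installed, a quick check along these lines may help. This is a minimal sketch: the helper name missing_deps is made up for illustration, and it assumes ImageMagick provides a binary called convert (newer versions also ship magick).

```shell
#!/bin/sh
# Print every command from the argument list that cannot be found in PATH.
# "missing_deps" is a hypothetical helper name, not part of any package.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumption: ImageMagick is checked via its "convert" binary.
missing=$(missing_deps sxiv maim tesseract xclip convert)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing:" $missing
else
    echo "All dependencies found."
fi
```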

By default, Tesseract is not very good at recognizing Japanese characters, but OCR quality can be improved by using custom trained data.

Download Capture2Text. We won't need the program itself because it's garbage, but the trained data files are going to be useful. Extract the contents of the tessdata folder to ~/.local/share/capture2text_tessdata:

$ unzip -j Capture2Text_v* 'Capture2Text/tessdata/*' -d ~/.local/share/capture2text_tessdata

Alternatively, download just the Capture2Text Japanese files from here.

Contents of the ZIP archive.

You don't need to install any data files from the repositories of your distro. The ones in the Capture2Text archive are way better.
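
To check that the data files ended up where expected, something like the following could be used. It is only a sketch: the helper has_traineddata is hypothetical, and the file name jpn.traineddata is an assumption based on Tesseract's usual <lang>.traineddata naming, so adjust it if the archive names its files differently.

```shell
#!/bin/sh
tessdata_dir="$HOME/.local/share/capture2text_tessdata"

# Succeed if <dir>/<lang>.traineddata exists.
# "has_traineddata" is a hypothetical helper name.
has_traineddata() {
    [ -f "$1/$2.traineddata" ]
}

# Assumption: the Japanese data file is called jpn.traineddata.
if has_traineddata "$tessdata_dir" jpn; then
    # Ask Tesseract which languages it sees in that directory.
    tesseract --tessdata-dir "$tessdata_dir" --list-langs
else
    echo "jpn.traineddata not found in $tessdata_dir"
fi
```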

Download maimocr and save it as ~/.local/bin/maimocr.

Make the file executable:

$ chmod +x ~/.local/bin/maimocr

The directory ~/.local/bin should be in your PATH.
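
If it isn't, a snippet like this one can add it. This is a sketch that assumes a POSIX-compatible login shell. Which startup file it belongs in (for example ~/.profile) depends on your shell and setup.

```shell
# Prepend ~/.local/bin to PATH, but only if it is not already there.
case ":$PATH:" in
    *":$HOME/.local/bin:"*) ;;  # already present, nothing to do
    *) PATH="$HOME/.local/bin:$PATH"; export PATH ;;
esac
```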

Bind this script to any key in your DE, WM, sxhkd, xbindkeys, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id maimocr

The script is trivial, so you should be able to understand it without further explanation. When run, it will ask you to select an area with Japanese text and try to OCR it. The resulting text will be copied to the system clipboard. Use it in combination with Yomichan Search to quickly look up Japanese words in real time.
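
For orientation, the general idea can be sketched as a single pipeline. This is not the actual maimocr script, only an illustration under several assumptions: the language code jpn, page segmentation mode 5 (a single block of vertical text, which requires the --psm option of Tesseract 4 or newer), the grayscale and 200% resize preprocessing, and the whitespace stripping are all guesses, and the function names are made up.

```shell
#!/bin/sh
# Hypothetical sketch of a select -> preprocess -> OCR -> clipboard pipeline.
tessdata_dir="$HOME/.local/share/capture2text_tessdata"

# Remove spaces, tabs and newlines that OCR tends to insert between characters.
# Assumption: for a speech bubble, one unbroken line of text is what you want.
strip_whitespace() {
    tr -d ' \t\n'
}

ocr_selection() {
    # maim: let the user select a region, hide the mouse cursor, write PNG to stdout.
    # convert: grayscale and upscale the image (newer ImageMagick prefers "magick").
    # tesseract: read the image from stdin, print recognized text to stdout.
    # xclip: put the result into the clipboard selection.
    maim --select --hidecursor |
        convert - -colorspace Gray -resize 200% png:- |
        tesseract stdin stdout --tessdata-dir "$tessdata_dir" -l jpn --psm 5 |
        strip_whitespace |
        xclip -selection clipboard
}

# Only run when an X display and maim are available.
if [ -n "${DISPLAY:-}" ] && command -v maim >/dev/null 2>&1; then
    ocr_selection
fi
```

Upscaling before recognition is one way to work around small text. The right factor depends on the scan, so treat 200% as a starting point rather than a recommendation.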

To open Yomichan Search, open your web browser and press Alt+Insert. Yomichan should already be installed.

If you notice that the script fails to OCR certain images, try zooming in or finding a scan with a higher resolution. Tesseract works poorly at low resolutions.

Note: As an alternative, you can install kanjitomo, but it's quite big and forces you to use a Japanese-to-English dictionary instead of a Japanese-to-Japanese one.

Tags: guide