Mining from manga

April 01, 2021 — Tatsumoto Ren

When reading manga, you sometimes need to quickly OCR a portion of the screen to look up new words and add sentences to Anki. To do so, you're going to use an optical character recognition program and a few helper tools.

Preface

Our goal is to be able to look up words in manga. We need a toolchain that does the following:

Takes a screenshot, selecting the part of the screen that contains a speech bubble with Japanese text.
Processes the taken screenshot.
Returns the recognized text.
Sends the text to a dictionary program. For example, GoldenDict or Rikaitan Search.
Lets us look up words and make Anki flashcards.

To recognize text on the pages of a manga, you can use Tesseract or Transformers. Tesseract is the more lightweight tool but makes more mistakes on average. With Transformers, you have to install a large number of Python packages that take up several gibibytes of disk space, but you get much better text recognition.

In this article I explain how to set up both. The resulting user workflow is identical; see the demo below.

Video demonstration.

Obtain manga

See Resources for places where you can get manga. For the best image quality, I recommend downloading manga from torrent sites. However, if you don't want to wait for the download, you can also read manga online on various websites. No matter which option you choose, it's easy to find a great selection of manga to read.

Image viewer

To read manga, it is nice to have an image viewer. I use nsxiv, but for this setup you can install any image viewer. On many manga sites you can also read online in a web browser.

To open an entire image folder in nsxiv, you can run nsxiv . in the folder.

File manager

Another quick topic to cover up front is your file manager. Very often when you download manga, you get an archive which needs to be unpacked (*.zip, *.rar, etc.). For convenience, set up a keyboard shortcut to be able to extract archives in a simple keystroke.

For example, my file manager is lf. To extract archives by pressing E, put map E aunpack $fx in the config file at ~/.config/lf/lfrc (atool needs to be installed as well).

lf supports tags. When you finish a reading session, tag the last page (image file) you've read by pressing t so that you don't lose the position you're at. Next time you open the same folder, you will see a red asterisk next to the tagged file.
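If you want to find the tagged page from a script, here is a minimal sketch. It assumes that lf keeps its tags in a plain-text file with one path:tag entry per line, and that the file lives at ~/.local/share/lf/tags. Both are assumptions about lf's internals, so check them against your lf version. The function name tagged_in_dir is made up for this example.

```shell
#!/bin/sh
# Print every tagged file located under directory $1.
# $2 is the tags file. The default location below is an assumption.
# Assumption: each line of the tags file looks like "/absolute/path:TAG".
tagged_in_dir() {
    dir=${1%/}
    tags_file=${2:-$HOME/.local/share/lf/tags}
    [ -f "$tags_file" ] || return 0
    while IFS= read -r line; do
        # Strip the tag by cutting everything from the last colon onwards.
        path=${line%:*}
        case "$path" in
            "$dir"/*) printf '%s\n' "$path" ;;
        esac
    done < "$tags_file"
}

# Example: list the tagged pages of the manga in the current folder.
#   tagged_in_dir "$PWD"
```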

To have lf automatically select the image currently displayed in nsxiv, add the following code to ~/.config/nsxiv/exec/image-info. The snippet is taken from my dotfiles.

# If running as a child of lf, select the current file.
if [ -n "$id" ]; then
    lf -remote "send $id select \"$1\""
fi

It is possible to set up a keyboard shortcut in nsxiv that tells lf to add a tag to the currently displayed image. For example, to tag the current file by pressing t, add the following code to ~/.config/nsxiv/exec/key-handler. The snippet is taken from my dotfiles.

while IFS= read -r file; do
    case "$1" in
    # ...
    # other keys you may have set
    # ...
    "t")
        # Tag the current file using lf, e.g. the last read manga page.
        if [ -n "$id" ]; then
            lf -remote "send $id select \"$file\""
            lf -remote "send $id tag x"
        fi
        ;;
    esac
done

OCR method

Although Transformers requires more system resources, I prefer it to Tesseract. Compared to Tesseract it handles manga better.

Setting up Transformers

Install transformers_ocr from the AUR.

$ trizen -S transformers_ocr

transformers_ocr makes use of the following programs:

maim to take screenshots.
xclip to copy text to the clipboard.

If you're not running a distribution based on Arch Linux, install manually by following the instructions on GitHub.

By itself transformers_ocr is just a short wrapper script that installs Transformers and other required Python packages. After the installation you need to download additional dependencies. Run the following command.

$ transformers_ocr download

It will download manga-ocr, a Python library responsible for optical character recognition. The files will be saved to ~/.local/share/manga_ocr and take up 2 GiB of disk space.

Note: transformers_ocr saves the Python packages to a standalone directory to ensure that later you can uninstall everything by simply removing the directory.

Usage

To OCR text on a manga page, run:

$ transformers_ocr recognize

When run, it will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard. Use it in combination with Rikaitan Search to quickly look up Japanese words in real time.

The first run will take longer than usual. There's yet another set of files that have to be downloaded for the OCR to work. The files will be saved to ~/.cache/huggingface and take up another 500 MiB.
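To see how much space the downloaded files actually take on your machine, you can measure the two directories mentioned above. The helper below is just a convenience sketch. The name ocr_disk_usage is made up, and it simply skips directories that don't exist yet.

```shell
#!/bin/sh
# Print the disk usage of each directory passed as an argument,
# silently skipping the ones that do not exist.
ocr_disk_usage() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            du -sh -- "$dir"
        fi
    done
}

# Example:
#   ocr_disk_usage ~/.local/share/manga_ocr ~/.cache/huggingface
```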

To send the recognized text directly to GoldenDict without copying it to the system clipboard, append clip_command=goldendict %TEXT% to the config file. The config file is read from ~/.config/transformers_ocr/config.
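If you'd rather not edit the file by hand, the following sketch appends the line only when it isn't there already, so running it twice does no harm. The function name set_clip_command is made up for this example. The default path is the one given above.

```shell
#!/bin/sh
# Append the GoldenDict clip_command line to the config file ($1)
# unless an identical line is already present.
set_clip_command() {
    config=${1:-$HOME/.config/transformers_ocr/config}
    line='clip_command=goldendict %TEXT%'
    mkdir -p "$(dirname "$config")"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF "$line" "$config" 2>/dev/null || printf '%s\n' "$line" >> "$config"
}

# Example:
#   set_clip_command
```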

Keyboard shortcut

Bind this command to a keyboard shortcut in your DE, WM, sxhkd, xbindkeys, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id transformers_ocr recognize

Autostart

Before transformers_ocr can recognize text, it needs to start a background listener. Starting the listener in advance is optional, but it minimizes the startup lag. To do so, add the following command to autostart.

transformers_ocr listen

Here's an example for i3wm:

exec --no-startup-id transformers_ocr listen

Setting up Tesseract

Install the following dependencies:

$ sudo pacman -S --needed tesseract maim xclip imagemagick unzip

tesseract is the OCR engine. It is considered fairly accurate, and many people like it.
maim is a screenshot utility that can capture a selected part of the screen.
xclip is a tool for copying text to the clipboard.
imagemagick is a command-line image editor. It's going to come in handy to edit the screenshots before Tesseract analyzes them.
unzip is a tool for extracting zip archives.

Download maimocr and save it as ~/.local/bin/maimocr. maimocr is a script we are going to use to recognize Japanese text.

Make the file executable:

$ chmod +x ~/.local/bin/maimocr

The directory ~/.local/bin should be in your PATH.
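If you're not sure whether ~/.local/bin is already in your PATH, a small check like the one below can go into your shell profile (for example, ~/.profile). This is a sketch, and the helper name path_contains is made up.

```shell
#!/bin/sh
# Succeed if directory $1 is one of the entries of PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prepend ~/.local/bin only if it is missing, to avoid duplicate entries.
if ! path_contains "$HOME/.local/bin"; then
    PATH="$HOME/.local/bin:$PATH"
    export PATH
fi
```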

Usage

Tesseract doesn't work without trained data files. These files tell Tesseract how to read and recognize text from images. When you first run maimocr, it should download Japanese data files automatically. Check the terminal output to see if the process succeeds.

When you run it the second time, maimocr will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard. Use it in combination with Rikaitan Search to quickly look up Japanese words in real time.
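To give an idea of what happens when a script of this kind runs, here is a rough sketch of the same pipeline: select a region, OCR it, clean up the result. It is not the code of maimocr. The function names are made up, and the choice of the jpn_vert language and page segmentation mode 5 (a single block of vertical text) is my assumption about what suits a typical speech bubble.

```shell
#!/bin/sh
# Rough sketch of a screenshot-to-text OCR pipeline.
# NOT the actual maimocr code, only an illustration of the idea.

# Remove ASCII whitespace. OCR output for Japanese often contains stray
# spaces and line breaks that get in the way of dictionary lookups.
clean_text() {
    tr -d ' \t\n\r'
}

# Let the user select a region, recognize it as vertical Japanese text,
# and print the cleaned-up result.
# Assumption: jpn_vert.traineddata is present in ~/.local/share/tessdata.
ocr_selection() {
    maim --select --hidecursor |
        tesseract --tessdata-dir "$HOME/.local/share/tessdata" \
            stdin stdout -l jpn_vert --psm 5 2>/dev/null |
        clean_text
}

# Example: copy the recognized text to the clipboard.
#   ocr_selection | xclip -selection clipboard
```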

Keyboard shortcut

Bind this script to a keyboard shortcut in your DE, WM, sxhkd, xbindkeys, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id maimocr

Now you can quickly call maimocr anywhere by pressing the keyboard shortcut.

Expanding data set

By default, maimocr automatically downloads tessdata.zip (mirror) with Tesseract data files, then saves the files to ~/.local/share/tessdata.

To use additional data files with maimocr, copy any new *.traineddata files to ~/.local/share/tessdata.
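Copying the files can be scripted as well. The sketch below copies only files ending in .traineddata and skips everything else. The function name install_traineddata is made up for this example.

```shell
#!/bin/sh
# Copy the given *.traineddata files into the data directory $1.
# Files with any other extension are skipped with a warning.
install_traineddata() {
    dest=$1
    shift
    mkdir -p "$dest"
    for f in "$@"; do
        case "$f" in
            *.traineddata) cp -- "$f" "$dest/" ;;
            *) echo "skipping $f: not a .traineddata file" >&2 ;;
        esac
    done
}

# Example:
#   install_traineddata ~/.local/share/tessdata ~/Downloads/*.traineddata
# To verify that Tesseract sees the new files:
#   tesseract --tessdata-dir ~/.local/share/tessdata --list-langs
```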

Capture2Text files

These instructions are no longer necessary. The files are included by default.

Download capture2text. We won't need the program itself because it's garbage but the trained data files are going to be useful. Extract the contents of the tessdata folder to ~/.local/share/tessdata:

$ unzip -j Capture2Text_v*_64bit.zip 'Capture2Text/tessdata/*' -d ~/.local/share/tessdata

Alternatively, download just the Capture2Text Japanese files from here.

Contents of the Capture2Text ZIP archive.

Troubleshooting

If you notice that the script fails to OCR certain images, try to zoom in or find a scan with a better resolution. Tesseract works poorly at low resolutions.

Nonstandard fonts often fail to OCR properly. In this case I don't have a definitive answer at the moment. Try searching for more *.traineddata files online and adding them to the tessdata folder.

Adding screenshots

If you want to add a screenshot from a manga to your Anki card, maim can do that too. maimpick is a script that uses maim to screenshot parts of the screen and copy them to the clipboard. Install it to ~/.local/bin, make it executable and bind it to a key. Explore my dotfiles for details.

In addition to maim, maimpick requires dmenu and xdotool to work.
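Before binding the script to a key, it may be worth checking that all of these programs are installed. The helper below is a generic sketch, not part of maimpick, and the name missing_deps is made up.

```shell
#!/bin/sh
# Print the name of every command from the argument list
# that cannot be found in PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example check for the tools named above:
#   missing_deps maim xclip dmenu xdotool
# A bare-bones way to copy a selected region to the clipboard as PNG:
#   maim --select | xclip -selection clipboard -t image/png
```

If the function prints nothing, everything is in place.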

Note: ames is another program that can add screenshots to Anki.

Other software

See Resources.

Tags: guide