Japanese Known Word Checker: About


Overview

This program compares a list of words you know with a text you wish to read in order to estimate how readable the text is. Readability is displayed as a percentage and as a score based on the numbers of known and unknown words. Any word that is not matched is listed as an Unknown Word.
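As a rough illustration of the percentage, the sketch below assumes it is the share of known words among all counted words. The exact formula and the separate score are not documented here, so the function name and the formula are assumptions rather than the program's own code.

```python
def readability_percent(known_count, unknown_count):
    """Share of counted words that are known, as a percentage (assumed formula)."""
    total = known_count + unknown_count
    if total == 0:
        # Nothing was counted, so report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * known_count / total


# Example: 45 known words and 5 unknown words.
print(readability_percent(45, 5))
```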

A list of known words can be created using Microsoft Excel, LibreOffice Calc, Anki, or any program that can export a delimited text file.

The text you wish to read should be written in standard spelling (a mix of Kanji and Kana), not primarily or only in Hiragana, so that the program can recognize the words in it.

Form and Options

  • Known Words File: Every word must be on its own line, and the file should be in a text format (e.g. ending in .txt or .csv). The selected file and its data are not uploaded to the internet.
  • Delimiter: The character that separates the columns in the Known Words File. For example, in "Word Definition" the two columns are separated by a space, so the delimiter must be set to a space for the program to find "Word". A tab delimiter is written as "\t". If each line contains only the word and no delimiter, ignore this option.
  • Column: The column that contains the words, counted after each line has been split by the delimiter. Columns are numbered from 0, not from 1.
  • Remember Settings: By selecting this option, the Known Words File, Delimiter and Column will be saved to browser storage. These settings will be loaded the next time the page is opened from the same device. Settings are not saved online, and do not sync to other devices.
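The Delimiter and Column options can be pictured with a minimal sketch. It is written in Python for illustration only, and the function name extract_words is hypothetical, not part of the program.

```python
def extract_words(lines, delimiter=None, column=0):
    """Return the entry found in `column` (0-based) of each non-empty line."""
    words = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # With no delimiter, the whole line is treated as the word.
        fields = line.split(delimiter) if delimiter else [line]
        if column < len(fields) and fields[column]:
            words.append(fields[column])
    return words


# Tab-delimited lines such as a spreadsheet or Anki export might produce.
sample = ["出る\tto leave", "勉強する\tto study", "", "猫\tcat"]
print(extract_words(sample, delimiter="\t", column=0))  # ['出る', '勉強する', '猫']
```

Column 0 selects the Japanese word in this sample, and column 1 would select the definition.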

How Matches Are Found

1. Known Words File

Words are read from the file according to the delimiter and column. Each word is then cleaned by removing punctuation and English characters; for する verbs, the ending する/します is also removed.
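A minimal sketch of this cleaning step follows. The set of characters that is kept (Hiragana, Katakana, Kanji, 々 and ー) and the name clean_known_word are assumptions made for illustration; the program's own rules may differ.

```python
import re

# Anything that is not Hiragana, Katakana, ー, 々 or a common Kanji is removed.
# This character set is an assumption about what "cleaning" leaves behind.
NON_JAPANESE = re.compile(r"[^\u3041-\u3096\u30A1-\u30FA\u30FC\u3005\u4E00-\u9FFF]")
SURU_ENDING = re.compile(r"(する|します)$")


def clean_known_word(raw):
    """Strip punctuation and English characters, then a trailing する/します."""
    word = NON_JAPANESE.sub("", raw)
    stripped = SURU_ENDING.sub("", word)
    # A bare する/します is kept instead of being reduced to an empty string.
    return stripped if stripped else word


print(clean_known_word("勉強する (to study)"))  # 勉強
print(clean_known_word("散歩します。"))          # 散歩
```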

2. The Text

Common grammatical terms and separating terms (e.g. だろう/でしょう, けれど, ことが(できる), (て)しまう, のです) are removed, and word boundaries are created around modifiers (prefixes, suffixes, counters), verbs, and major place names so that words can be identified more reliably. The text is then separated into words using TinySegmenter and post-processed to combine any remaining modifiers and verb auxiliaries. Duplicate words are removed.

3. The Comparison

Non-Japanese words, single-character and two-character Hiragana terms (e.g. grammar particles such as は, が, のは, には) and verb auxiliaries (e.g. され(て), ます), words starting or ending in "ー" or "っ", and blacklisted words are ignored. The internal blacklist contains grammatical terms and common words, including verbs, pronouns, conjunctions, adverbs, counters and modifiers (e.g. いる/ある/行く, 私/僕/俺, しかし/でも, なぜ/もっと/もう, 分/全/別/他).
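The ignore rules can be sketched as a single filter. The blacklist below is a tiny stand-in built from the examples above, not the program's internal list, and the name is_ignored is hypothetical. Verb auxiliaries longer than two characters are not handled in this sketch.

```python
import re

HIRAGANA_ONLY = re.compile(r"^[\u3041-\u3096]+$")
JAPANESE_CHAR = re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u4E00-\u9FFF]")

# Stand-in blacklist taken from the examples in the text (assumption).
SAMPLE_BLACKLIST = {"いる", "ある", "行く", "私", "僕", "俺", "しかし", "もっと"}


def is_ignored(word, blacklist=SAMPLE_BLACKLIST):
    """Return True if the word would be skipped during the comparison."""
    if not JAPANESE_CHAR.search(word):
        return True  # non-Japanese (or empty) word
    if len(word) <= 2 and HIRAGANA_ONLY.match(word):
        return True  # particles and short auxiliaries such as は, には, ます
    if word[0] in "ーっ" or word[-1] in "ーっ":
        return True  # fragments starting or ending in ー or っ
    return word in blacklist


words = ["は", "には", "hello", "私", "勉強", "食べる"]
print([w for w in words if not is_ignored(w)])  # ['勉強', '食べる']
```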

Matches are made by checking for string and substring equality, a method known as Fuzzy Matching. In other words, accuracy is 'fuzzy'. This is necessary in order to quickly match terms which are slightly different but otherwise the same. For example, the terms 出る and 出ます are the same except for the verb ending, so they are not a 100% match but a 33% match. The matching algorithm recognizes this and will still match the two terms.

The matching algorithm checks whether:

  • the longer word contains the shorter word as Kanji
  • the longer word contains the shorter word as Kana, and is 3 or more characters
  • both words are a verb compound (i.e. Kanji-Kana-Kanji), and the first 3 characters are the same
  • both words contain Kanji and Kana and, once the Kana is removed, (partially) contain the same Kanji (e.g. other verbs and adjectives)
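The last rule in this list can be sketched as follows. This is an illustration under stated assumptions: the input is assumed to contain only Kanji and Kana, "partially" is read as one Kanji string containing the other, and the names kanji_only and share_kanji are hypothetical.

```python
import re

# Hiragana, Katakana and the long-vowel mark ー.
KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u30FC]")


def kanji_only(word):
    """Remove all Kana, leaving the Kanji (input assumed to be Kanji and Kana only)."""
    return KANA.sub("", word)


def share_kanji(a, b):
    """True if both words mix Kanji and Kana and their Kanji (partially) coincide."""
    ka, kb = kanji_only(a), kanji_only(b)
    # Each word must contain at least one Kanji and at least one Kana.
    if not ka or not kb or ka == a or kb == b:
        return False
    shorter, longer = sorted((ka, kb), key=len)
    return shorter in longer


print(share_kanji("出る", "出ます"))  # True: both reduce to 出
print(share_kanji("食べる", "飲む"))  # False: 食 and 飲 differ
```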

Matching Accuracy:

  • Words which are one or two characters need to match 100%
  • Words which are three or more characters need to match 60% or more
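These thresholds can be written as a small check. How the program counts matching characters, and which word's length it uses, is not stated here, so the matched_chars argument and the function names are assumptions made for illustration.

```python
def required_percent(word):
    """Minimum match percentage for a word of this length."""
    return 100 if len(word) <= 2 else 60


def is_accurate_enough(word, matched_chars):
    """Compare matched characters against the threshold for the word's length."""
    if not word:
        return False
    # Integer arithmetic avoids floating-point rounding at exactly 60%.
    return matched_chars * 100 >= required_percent(word) * len(word)


print(is_accurate_enough("出る", 1))      # False: a two-character word needs 100%
print(is_accurate_enough("勉強する", 3))  # True: 3 of 4 characters is 75%
```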

Troubleshooting

Open the console via F12 and type:

  • "_debug = true" to display match results
  • "text" to display text delimitation
  • "segs" to display text segmentation
  • "banned" to display banned terms

Download and Use Offline

This program can be downloaded and used offline; however, the offline copy will not receive updates or improvements to the matching algorithm.

To download this program, go back to the form and either press Ctrl+S or use the menu File > Save Page As... to save the web page as a complete page or as a web archive.

This process is explained below for major browsers: