Sentence collector for Japanese language (日本語の文章について)

Hi all.
I have been writing sentences for the Sentence Collector.
I'm publishing the sentences I've submitted for reference. I also note my thoughts and questions (This note is written in the kana orthography and seiji, sorry).

Now, let me ask you a few questions.

1. Numbers

Even in Japanese, there are different ways to read numbers. As written in the how-to, they should probably be avoided.
However, if it's part of a word, it would be an exception, I think. For example,

  • 一緒 (issho) meaning "together"
  • 十分 (jūbun) meaning "enough"
  • 三十路 (miso-ji) meaning "thirty years old"
  • 五十嵐 (Igarashi) is a person's family name
  • 九十九髪 (tsukumo-gami) meaning "old woman's gray hair" or "old woman with gray hair"

2. っ/ッ at the end of a word

For example,

  • 負けてたまるかっ
  • マジかよッ

Some such words are indeed exclamations. For example, "あっ" and "えっ".
There is no specific pronunciation for this spelling. The pronunciation depends on the speaker.
It is written to express some "momentum". In other words, a stress.
Can I include it in a sentence?

3. Sentence length

How about 35 characters or less, excluding punctuation?
In reference to the topic on sentence length limit, sentences that can be spoken in less than 10 seconds seem to be appropriate.

For reference, these are the times a native speaker took when reading slowly:

  1. 今日は良い天気ですね。(3 seconds / 10 characters)
  2. 私はそれを恋だなんて思ってないけどね。(5 seconds / 18 characters)
  3. お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。(7 seconds / 32 characters)
  4. このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。(11 seconds / 44 characters)
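
The character counts above can be reproduced with a short script. This is only a sketch: it assumes "punctuation" means any character in a Unicode punctuation category, and the names `count_chars`, `within_limit` and `MAX_CHARS` are made up for illustration; they are not part of the Sentence Collector.

```python
import unicodedata

MAX_CHARS = 35  # the limit proposed in this post, not an official rule


def count_chars(sentence: str) -> int:
    """Count characters, skipping whitespace and Unicode punctuation (categories P*)."""
    return sum(
        1
        for ch in sentence
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def within_limit(sentence: str, limit: int = MAX_CHARS) -> bool:
    """Return True if the sentence has at most `limit` counted characters."""
    return count_chars(sentence) <= limit


samples = [
    "今日は良い天気ですね。",
    "私はそれを恋だなんて思ってないけどね。",
    "お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。",
    "このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。",
]

for s in samples:
    print(count_chars(s), within_limit(s), s)
```

With this definition the four sample sentences count as 10, 18, 32 and 44 characters, so only the last one exceeds the proposed limit.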

4. Range of kanji to be used

This is a difficult question. Can I write the way a native speaker naturally would?
For example, 俺 (ore, meaning "I") is a very common first-person pronoun, and even children can read it, but I was surprised to see that someone read it as kare (彼, meaning "he") in Common Voice.
It is not wrong to write it in hiragana. However, a sentence with many hiragana characters becomes difficult to read.

I am automatically translating it into English. Sorry if that was unnatural!
I hope to inspire you.
Thank you!

  1. As long as there are not multiple ways to pronounce what you have written in its context, it should be fine. Also, even in English and other languages, numbers are sometimes spelled out, for example “thirtieth place” instead of “30th place”, or “the war started in nineteen thirty eight” in place of “the war started in 1938”.
  2. Do I understand correctly that those take the place of English punctuation such as ! ? . , or are similar in intent? In that case I would probably include them. Anyone who uses the data can strip them, or even try to use them for, I don’t know, determining how speakers usually indicate the presence of those.
  3. Just choose whatever length you find suitable, and better shorter than longer. If in doubt, try pronouncing the sentence, and keep in mind that people tend to stutter, so they may take longer to read it than you.
  4. No idea, someone else will have to give input on that.

Hmmm. Maybe. If someone can remove it from the data, I think it's okay.
How do we determine? .... "mood" is all I can say. A feeling. It works the same way as an exclamation mark. It expresses the speaker's feelings of surprise, emotion, anger, and joy.
Perhaps I'm just too concerned about it. But I'm glad you answered it. Thank you so much, @Adrijaned.

The shorter the better. You're right. I should break up my sentences more.

I see. So I guess "abbreviations" are okay too. With the exception of a few, Japanese abbreviations have a fixed reading.

And this was my question. There are multiple ways to read kanji. I'll write about it later.

On reading Kanji characters

日本語版: 漢字の讀み方について

The way a number is read depends on context and might introduce confusion in the dataset.

I've always wondered about this sentence from How to, too.
The way a kanji is read depends on context! And most kanji have two or more readings on their own.

I'll list as many as I can think of.

A. same meaning / same character / different reading

I think this is what @Adrijaned is concerned about:

  • 〇 (Rei / Zero / Maru) meaning "Zero". Maru is a limited reading.
  • 四 (Shi / Yon) meaning "Four". Both are major readings.
  • 七 (Shichi / Nana) meaning "Seven". Both are major readings, too.
  • 明日 (Ashita / Asu / Myōnichi) meaning "tomorrow". Asu and Myōnichi are a bit formal.
  • 昨日 (Kinō / Sakujitsu) meaning "yesterday". Sakujitsu is a bit formal, too.
  • 重複 (Chōfuku / Jūfuku) meaning "duplicate". Are there more people who read it as Jūfuku?
  • 経緯 (Keii / Ikisatsu) meaning "circumstance". Are there more people who read it as Keii?
  • 世論 (Seron / Seiron / Yoron) meaning "public opinion". I'm sure most people don't know about Seiron. It is generally read as Yoron.

Certainly, the context can narrow down the reading to some extent. But it's a "trend", not an absolute. How a speaker reads depends on their knowledge and lifestyle (e.g. occupation, amount of reading, etc.). Or, more to the point, it can be a matter of "preference". Therefore, when we are asked to read something correctly, we are perplexed. "They are all correct, aren't they?"

The speech algorithm needs to know how to read everything.

B. same meaning / different character / same reading

Which character is used depends on the nuance of each character, or on preference.

  • 暗黒 (An-Koku) / 闇黒 (An-Koku)
  • 日差し (Hi-Za-shi) / 陽射し (Hi-Za-shi)

C. different meaning / same character / different reading

The reading depends on the context and the word.

  • 小人 (Ko-Bito) / 小人 (Ko-Domo)
  • 最中 (Sai-Chū) / 最中 (Mo-Naka)
  • 落着 (Raku-Chaku) / 落着 (O-Tsu)
  • 過去 (Ka-Ko) / 過去 (Su-Sa)
  • 明るい (Aka-rui) / 暗い (Kura-i) / 明暗 (Mei-An)

  • ここは人気があります。
    • ここは人気 (Nin-Ki) があります。(This place is popular.)
    • ここは人気 (Hito-Ke) があります。(There are signs of people here.)

Yes, it's impossible to determine the reading with so little context.

D. different meaning / different character / same reading

So-called 同音異義語 (Dōon-Igi-Go, meaning "homonyms").

  • けんとう (Ken-Tō): 見当 / 拳闘 / 軒灯 / 健闘 / 検討 / 賢答 and more.
  • せいかく (Sei-Kaku): 正確 / 性格 / 正格 / 精確 / 醒覚 and more.
  • いし (I-Shi): 石 / 意志 / 医師 / 遺志 / 遺子 and more.
  • かなう (Kana-U): 適う / 叶う / 敵う

Example 1

  • きじにかけているぶぶんがある (Kiji ni kakete iru bubun ga aru)
    • 記事に書けている部分がある。(There are parts of the article that could be written about.)
    • 記事に欠けている部分がある。(There is a part of the article that is missing.)
    • 生地に欠けている部分がある。(There is a part of the fabric that is missing.)
    • 生地に掛けている部分がある。(There is a part that is draped over the fabric.)
    • Um, more?

All Japanese pronunciations can be written in hiragana, but here's why they shouldn't be. Of course, there is a difference in intonation between 書けて and 欠けて. But 記事 and 生地 are the same. If we're trying to figure out the meaning from a hiragana sentence, we're going to need more "background".

Example 2

  • ここではきものをぬぎます。
    • ここで履き物を脱ぎます。(Koko de hakimono wo nugimasu: This is where you take off your footwear.)
    • ここでは着物を脱ぎます。(Koko dewa kimono wo nugimasu: This is where you take off your kimono.)

It's a common pun. Like "Ice Cream" and "I Scream"? It's pronounced a little differently, though.
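
The four categories A to D above boil down to three yes/no questions about a pair of words: same meaning, same written form, same reading? A minimal sketch of that idea follows. The `Entry` type, the `classify` function, the romanized readings and the English glosses are my own illustration, not anything used by Common Voice.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    written: str  # how the word is written
    reading: str  # how it is read (romanized here for simplicity)
    meaning: str  # rough English gloss, for illustration only


def classify(a: Entry, b: Entry) -> str:
    """Return the category letter (A-D) used in this post, or "-" if none applies."""
    key = (a.meaning == b.meaning, a.written == b.written, a.reading == b.reading)
    categories = {
        (True, True, False): "A",   # same meaning / same character / different reading
        (True, False, True): "B",   # same meaning / different character / same reading
        (False, True, False): "C",  # different meaning / same character / different reading
        (False, False, True): "D",  # different meaning / different character / same reading
    }
    return categories.get(key, "-")


print(classify(Entry("明日", "ashita", "tomorrow"), Entry("明日", "asu", "tomorrow")))  # A
print(classify(Entry("暗黒", "ankoku", "darkness"), Entry("闇黒", "ankoku", "darkness")))  # B
print(classify(Entry("人気", "ninki", "popularity"), Entry("人気", "hitoke", "signs of people")))  # C
print(classify(Entry("記事", "kiji", "article"), Entry("生地", "kiji", "fabric")))  # D
```

Categories A and C are the ones that matter when someone reads a written prompt aloud; B and D only matter when going from sound back to text.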

The sentence collector should contain sentences in most instances, not single words. All languages may have ambiguity in how to read a single word, letter, or character, but there will typically be much less ambiguity for a whole sentence. I don’t think it’s a good idea to intentionally include puns in the prompts, which is a sort of sentence-level ambiguity. It is best if each prompt has just one unambiguous reading, but applications using the data always have to handle some degree of variation. In fact, the types of variation present in the dataset should ideally include those to be found in the target application, and the variation is important for the model to learn.

I don’t know Japanese, but in Chinese, handling of 多音字 (characters with multiple pronunciations) is an integral part of speech recognition systems. Such homographs are found for all languages except those with perfectly phonemic writing systems. The previous generation of ASR models typically used a pronunciation dictionary, which lists all possible pronunciations of each word, as part of training the system. In newer systems, like those that use connectionist temporal classification (CTC), the model learns about this stuff on its own and no dictionary is even needed.
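
To make the idea of a pronunciation dictionary concrete, here is a toy sketch. It is not how any particular ASR toolkit stores its lexicon; the `LEXICON` table and the `candidate_readings` helper are hypothetical, the input is assumed to be already split into words, and the readings are simply the romanized ones mentioned in this thread.

```python
from itertools import product

# Toy lexicon: each written word maps to every reading it may have.
LEXICON = {
    "明日": ["ashita", "asu", "myōnichi"],
    "昨日": ["kinō", "sakujitsu"],
    "行く": ["iku"],
    "よ": ["yo"],
}


def candidate_readings(tokens):
    """List every reading of a pre-tokenized sentence that the lexicon allows."""
    # Words missing from the lexicon are passed through unchanged.
    options = [LEXICON.get(tok, [tok]) for tok in tokens]
    return [" ".join(combo) for combo in product(*options)]


print(candidate_readings(["明日", "行く", "よ"]))
# ['ashita iku yo', 'asu iku yo', 'myōnichi iku yo']
```

A dictionary-based system treats all of these candidates as acceptable pronunciations of the same text and lets the audio decide which one was actually spoken.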

We already collect some amount of information on regional accents. Perhaps it would be possible to collect more detailed information on social class, educational background, etc. but I don’t know of any existing ASR systems that make use of this sort of data. It is also very difficult to get accurate self-reports of this, and Mozilla seems to be very concerned with privacy. It is more realistic for the model to infer these sorts of characteristics by listening to the utterance, and doing what’s called speaker adaptation. Again, the variation present in the dataset will help the model learn. We do our best to create sentences which might be similar to those found in the target application, and the model must take care of modeling the variation. Modern ASR models are more than capable of disambiguating 多音字 (homographs) when enough context is provided, and detecting and adapting to different accents and speaking styles, and it is potentially important that these phenomena are part of the dataset.

I don't know much about the system, so it was very helpful to get your input. Thank you, Craig.

So, for example, the sentence "明日行くよ。" can be read either as "Ashita iku yo" or "Asu iku yo", and I can provide such a sentence to the Collector, right?
That is, as long as the context (the meaning of the sentence) is clear, the speaker does not have to worry about multiple readings of the kanji. Rather, each reading is necessary for the system to learn. Is this interpretation correct?

Yes, if the voice recognition system can learn Chinese, it will probably be fine in Japanese as well (and the existing system actually understands what we speak). As you said, there are very few sentences that are difficult to interpret.

Thank you for providing the Japanese example, sinumade. According to my dictionary (Wiktionary), the pronunciation ashita for 明日 is a colloquial form meaning “tomorrow”, while asu is a polite form with the same meaning, both native Japanese words with two different origins (not etymologically Chinese). Since I don’t know Japanese (明日 is an archaic or formal term in Chinese, while the same word has persisted in colloquial Beijing dialect but is written 明兒 and pronounced as a single syllable míngr), I’m unable to determine the degree of ambiguity of this sentence. Would one pronunciation be more likely than another here, if you showed the sentence to several different native speakers? If one pronunciation is more likely here, then a computer model could capture this pattern.

I wonder if there would be a way to make this sentence unambiguous? For example, add some more polite words to show that asu is the best reading, or add some colloquial words for ashita. In my opinion this would be the best approach, if it is indeed very ambiguous. It really depends on the application; I believe it could be problematic for text-to-speech (TTS), but ok for speech-to-text (ASR), since there is no guessing involved here for ASR.

Hello, all!

I’m Japanese, so I bet I can give you all some hints.

As you know, Japanese does not put spaces between words. A sentence written entirely in ひらがな often confuses us, so we use 漢字 to make the meaning clear.
However, there are many sets of words with the same sound but different meanings, what you call 同音異義語. They make for good wordplay, but we need specific background to know the correct meaning.

Whether 明日 is read as あす or as あした depends on the person. Of course, as you know, we feel あす is formal.
They all have the same meaning, but 明日行きます is formal, while 明日行くね and 明日行くよ are casual. In the case of 明日行きます, both あす and あした are used, but in the case of 明日行くね, あした is the normal reading and あす is hardly ever used.

By the way, I have two more comments for @sinumade.
The first is that for 重複 (Chōfuku / Jūfuku), ちょうふく (Chōfuku) is actually the correct reading, but some people in Japan read it as じゅうふく (Jūfuku).
The second is that some of the 漢字 you use are not natural in present-day Japanese. For example, in [漢字の讀み方について], people in Japan normally write 読み方. I think 讀み方 is used in newspapers or old books.

Have a nice Japanese day!

Hi Craig.
I looked it up (on the web) too. As you say, the etymologies of あした (Ashita) and あす (Asu) are different. But today, they are used to mean the same thing: 'the next day'. In the example "明日行くよ。", most people would read Ashita more often than not. As @safejourney pointed out, the example is quite colloquial and casual in its wording. As I mentioned in post #4, most people will read it as Ashita, though not always.

Yes, it's reasonable to control it with the way we write it. But some people might read it as Ashita even in a stiff sentence. I'd like to hear more opinions from Japanese speakers.
Sure, I think speech-to-text is fine. Fortunately, there are no words that are pronounced the same as 明日 (actually, there are a few, but we can limit them by context).

Hmmm, the reading used in everyday life is certainly limited.

In order for the system to properly convert speech into text, the sentences should still be appropriate (i.e. natural), right? I used to write some of my sentences in hiragana as a workaround for the how-to's rule (e.g., I wrote 一回り as ひと回り. This kind of spelling is certainly in use. I see it all the time on blogs and in magazines).
Also, people tend to substitute hiragana for kanji that are difficult or that may be discriminatory. For example, 焼い弾 (shōidan; 焼夷弾, meaning "incendiary shell") in TV news, and 障がい者 (shōgaisha; 障害者, meaning "handicapped person") in public facilities. Of course, it depends on which characters we find difficult or discriminatory.
The names of plants and animals are not uniform, either. Some people write "cat" as 猫 (kanji), while others write ネコ (katakana).

Yeah, it doesn't matter either way (as long as it makes sense).

Hi safejourney!
Thank you for all the advice! I'm glad you took an interest in this topic!

It is true that people who use 正字 (seiji) today are a minority (of course, there are individuals and groups that use them. For example, 國語問題協議會 (KOKUGOMONDAI KYOUGIKAI); みんなのかなづかひ (Minnanokanazukai) is a doujinshi that publishes writings that include seiji). I was surprised when I first saw it, too. Considering how few people can read it, it's probably best not to use it in sentence collection. So the sentences I sent to the Collector tool are in 新字 (shinji).