So if I understand it right, with the current master branch we no longer need a language model / trie; all we need is a ‘scorer’. Now my questions are:
Can I use a model trained with DeepSpeech 0.6 together with the current master branch, and thus use a scorer instead of the LM + trie?
What is the scorer based on? Is it an n-gram language model or something new?
I haven’t found an explanation yet of how to create a scorer file and save it. Can someone link me to one or explain it?
Thank you
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
2
It’s just a one-file LM + trie.
I think it is documented under data/lm and generate_scorer.py:
import argparse

# Command-line flags accepted by the scorer generation script.
parser = argparse.ArgumentParser(
    description="Generate an external scorer package for DeepSpeech."
)
parser.add_argument(
    "--alphabet",
    help="Path of alphabet file to use for vocabulary construction. Words with characters not in the alphabet will not be included in the vocabulary. Optional if using UTF-8 mode.",
)
parser.add_argument(
    "--lm",
    required=True,
    help="Path of KenLM binary LM file. Must be built without including the vocabulary (use the -v flag). See generate_lm.py for how to create a binary LM.",
)
parser.add_argument(
    "--vocab",
    required=True,
    help="Path of vocabulary file. Must contain words separated by whitespace.",
)
parser.add_argument("--package", required=True, help="Path to save scorer package.")
parser.add_argument(
    "--default_alpha",
    type=float,
    required=True,
    help="Default value of alpha hyperparameter.",
)
parser.add_argument(
    "--default_beta",
    type=float,
    required=True,
    help="Default value of beta hyperparameter.",
)

args = parser.parse_args()
The --alphabet flag is not required.
The scorer will not work with code before PR 2681.
However, you should be able to use recently trained checkpoints (I think 0.6.1 or later should be fine, but you should test) to export a model that can be used with post-PR 2681 code.
Note that --alphabet is only optional if you’re using UTF-8 mode. If you don’t know what that is, you’re not using it, and you need to specify that flag.
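For illustration, here is a rough sketch of how those flags might be wired together, assuming the script lives at data/lm/generate_scorer.py as mentioned above, that lm.binary and vocab.txt were already produced with generate_lm.py, and that the alpha/beta values are placeholder numbers you would tune on your own dev set:

import subprocess

# Hypothetical paths and values; adjust them for your own setup.
subprocess.run(
    [
        "python", "data/lm/generate_scorer.py",
        "--alphabet", "data/alphabet.txt",  # drop this flag only in UTF-8 mode
        "--lm", "lm.binary",                # binary KenLM model from generate_lm.py
        "--vocab", "vocab.txt",             # whitespace-separated vocabulary words
        "--package", "kenlm.scorer",        # output scorer package
        "--default_alpha", "0.75",          # example LM weight
        "--default_beta", "1.85",           # example word insertion bonus
    ],
    check=True,
)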
So I should follow the procedure just to understand what that file is and which files are needed, then reproduce / re-engineer my own dataset into a similar format, and then delete the LibriSpeech file since I don’t need it for training or any future process…
nice…
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
7
It feels like you are unhappy?
It looks like you have not properly read the documentation. You need that file to reproduce the same LM as we do. If you want your own, use your own file.
I think the documentation does not explain this clearly…
I mean, the code is written only for LibriSpeech…
So other people who want to build something else, perhaps from a different database, have to understand the code’s flow.
If the structure of their source database is different, they must re-engineer that structure to make it fit DeepSpeech.
So, to do that, they must understand how the code works, which means everybody is pushed to follow the script (downloading a 1.4 GB file), compare it with the structure of their own database, and once they understand it, reconfigure or re-engineer their database structure to make it work with DeepSpeech.
Well, I think everybody who plans not to use the LibriSpeech dataset will agree with me.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
9
Well, you are welcome to articulate what is not clear: we can’t do divination, and we can’t know it is not clear if people don’t give us feedback.
I think the code is pretty trivial to understand, and the input, as documented elsewhere, is just a plain text file with one sentence per line.
This is not a DeepSpeech constraint; this is how KenLM works, and it’s explained.
Producing a flat text file should not be too complicated, right?
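As a sketch of how that flat file might be produced from a custom dataset (the CSV path and the "transcript" column name here are just assumptions about your own data layout):

import csv

# Hypothetical input: a metadata CSV from your own dataset with a
# "transcript" column. Adjust the path and column name to your data.
with open("my_dataset/metadata.csv", newline="", encoding="utf-8") as src, \
        open("sentences.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        sentence = row["transcript"].strip().lower()
        if sentence:
            dst.write(sentence + "\n")  # one sentence per line, as KenLM expects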