Question regarding the new scorer function instead of LM+trie

So if I understand it right (using the current master branch), we do not need a language model / trie anymore? All we need is a ‘scorer’. Now my questions are:

  1. Can I use a model trained with DS 0.6 together with the current master branch, and thus use a scorer instead of LM + trie?
  2. What is the scorer model based on? Is it an n-gram language model or something new?
  3. I haven’t found an explanation yet of how to make a scorer file and save it. Can someone link me to one or explain it?

Thank you :smile:

It’s just a one-file LM + trie
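
In client code that means one file gets loaded where the LM binary and trie used to be passed separately. A minimal sketch, assuming the 0.7-era Python client API (exact method names and defaults may differ on your build; all paths are examples):

    # Minimal sketch, assuming the 0.7-era DeepSpeech Python client API;
    # file paths are examples only.
    from deepspeech import Model

    ds = Model("output_graph.pbmm")          # acoustic model
    ds.enableExternalScorer("kenlm.scorer")  # one file replaces lm.binary + trie
    ds.setScorerAlphaBeta(0.75, 1.85)        # optionally override packaged defaults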

I think it is documented under data/lm and generate_scorer.py

Thank you, found it under data/lm/generate_package.py

So if I understand it correctly, all I would need is:

- alphabet
- language model binary made with KenLM
- vocabulary
- LM alpha and beta

Can you confirm this? Then I’m sure about what to do. Concretely, I’m imagining something like the sketch below.
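
A minimal sketch of the invocation, with flag names taken from data/lm/generate_package.py; all paths are placeholders, and the alpha/beta values are only examples to tune:

    # Hypothetical invocation of generate_package.py; every path here is a
    # placeholder, and alpha/beta are example values to tune for your LM.
    import subprocess

    subprocess.run(
        [
            "python", "data/lm/generate_package.py",
            "--alphabet", "alphabet.txt",   # optional in UTF-8 mode
            "--lm", "lm.binary",            # KenLM binary LM
            "--vocab", "vocab.txt",         # words separated by whitespace
            "--package", "kenlm.scorer",    # output scorer package
            "--default_alpha", "0.75",
            "--default_beta", "1.85",
        ],
        check=True,
    )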

And after making the scorer, will it work with models trained before the addition of this feature, or do I still need the LM and trie separately then?

Note the argument parsing:

    import argparse

    parser = argparse.ArgumentParser(
        description="Generate an external scorer package for DeepSpeech."
    )
    parser.add_argument(
        "--alphabet",
        help="Path of alphabet file to use for vocabulary construction. Words with characters not in the alphabet will not be included in the vocabulary. Optional if using UTF-8 mode.",
    )
    parser.add_argument(
        "--lm",
        required=True,
        help="Path of KenLM binary LM file. Must be built without including the vocabulary (use the -v flag). See generate_lm.py for how to create a binary LM.",
    )
    parser.add_argument(
        "--vocab",
        required=True,
        help="Path of vocabulary file. Must contain words separated by whitespace.",
    )
    parser.add_argument("--package", required=True, help="Path to save scorer package.")
    parser.add_argument(
        "--default_alpha",
        type=float,
        required=True,
        help="Default value of alpha hyperparameter.",
    )
    parser.add_argument(
        "--default_beta",
        type=float,
        required=True,
        help="Default value of beta hyperparameter.",
    )

The flag --alphabet is not required.
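
For the --lm input itself, here is a hedged sketch of the KenLM steps, roughly what data/lm/generate_lm.py does (check that script for the exact flags and any pruning/filtering; corpus.txt is plain text, one sentence per line):

    # Hedged sketch of building the binary LM with KenLM; the flags mirror
    # data/lm/generate_lm.py from memory, so verify against that script.
    import subprocess

    # Train a 5-gram ARPA model from a plain text corpus.
    subprocess.run(
        ["lmplz", "--order", "5", "--text", "corpus.txt", "--arpa", "lm.arpa"],
        check=True,
    )

    # Convert to a binary trie; -v leaves the vocabulary out of the binary,
    # which generate_package.py requires (it takes --vocab separately).
    subprocess.run(
        ["build_binary", "-a", "255", "-q", "8", "-v", "trie", "lm.arpa", "lm.binary"],
        check=True,
    )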

The scorer will not work with code before PR 2681.

However, you should be able to use recently trained checkpoints (I think 0.6.1 or later should be fine, but you should test) to export a model that can be used with post-PR 2681 code.

Note that --alphabet is only optional if you’re using UTF-8 mode. If you don’t know what that is, you’re not using it, and you need to specify that flag :slight_smile:

So… if I want to make my own language model…, I must download librispeech-lm-norm.txt.gz from http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz, which is 1.4 GB…

follow the procedure just to understand what that file is and which files are needed, then reproduce / re-engineer my own dataset to have a similar format, then delete the LibriSpeech file since I don’t need it for training or any future process…

nice…

It feels like you are unhappy?

It looks like you have not properly read the documentation. You need that file to reproduce the same LM as we do. If you want your own, use your own file.

I think the documentation does not explain it clearly…
I mean, the code is created only for LibriSpeech…

So other people who want to make something else, possibly from a different database, have to understand the code flow mechanism.

If the structure of their source database is different, then to make it work with their resources…
they must re-engineer the structure of their database… to make it fit DeepSpeech.

So, to do that, they must understand how the code works, and that means everybody is pushed to follow the script (downloading a file that is 1.4 GB), then compare the structure of the database, and after understanding it, reconfigure or re-engineer their database structure to make it work with DeepSpeech.

Well, I think everybody who plans not to use the LibriSpeech dataset will agree with me.

Well, you are welcome to articulate what is not clear: we can’t do divination, and we can’t know what is not clear if people don’t give us feedback.

I think the code is pretty trivial to understand: the input, as documented elsewhere, is just a plain text file, one sentence per line.

This is not a DeepSpeech constraint here; this is how KenLM works, and it’s explained.

Producing a flat text file should not be too complicated?
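
For instance, a minimal sketch, assuming your data lives in a CSV with a transcript column (the file name and column name are made up; adapt to your own database):

    # Minimal sketch: flatten a hypothetical CSV dataset into the plain
    # text corpus KenLM expects -- one sentence per line.
    import csv

    with open("my_dataset.csv", newline="", encoding="utf-8") as src, \
         open("corpus.txt", "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            sentence = row["transcript"].strip().lower()
            if sentence:
                dst.write(sentence + "\n")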
