Preprocessing, Silence, Lyric Recognition

Hello! My schoolmates and I are working on a group project using DeepSpeech, and we would like to ask a few questions.

  1. What preprocessing operations does DeepSpeech apply to audio files?

  2. Will silence at the beginning and end of an audio file affect the package’s ability to run inference?

  3. The main objective of our project is to recognize the syllables of sung notes (especially solfege). For example, someone says “do re mi fa re do”. We want to get exactly what that person said, regardless of the actual note they sang. (That is, if they say ‘do’ but sing ‘la’, we want ‘do’.) Is this Python package suitable for such a use case? If not, what packages would you suggest for this kind of project?

Thanks a bunch!