Fine-tuning DeepSpeech 0.9.1 with the same alphabet

I restarted my PC and nvidia-smi shows 0% GPU utilization.

Sometimes it helps to just start over because you fiddled with all the libs too much. Why don't you try it in a fresh environment? That helped me in the past.

It's not about GPU usage, it's about the GPU being locked / GPU memory being allocated, even a tiny portion of it.

I will try it in a new env right away!

@Ghada_Mjanah Like I said, I just ran into a similar stack trace, and stopping gdm3 / gnome3 fixed it. Those errors can have a lot of root causes, and debugging cuDNN is hard and not in our scope, sorry.

nvidia-smi shows that these processes are running on the GPU:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       980      G   /usr/lib/xorg/Xorg                 47MiB |
|    0   N/A  N/A      1065      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      1375      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      1561      G   /usr/bin/gnome-shell               39MiB |
+-----------------------------------------------------------------------------+

I can't stop either of them, so I don't think that's the problem for me…

I created a whole new environment and rebooted, same issue…

Believe whatever you want; I can just tell you I had exactly this issue earlier today and that killing GNOME3 with sudo systemctl stop gdm3.service helped. If you are not willing to try our suggestions, we can't help you.
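For reference, a rough sequence to check whether the display manager is really off the GPU (the unit may be called gdm3 or gdm depending on the distribution, and the paths are whatever your training setup uses):

sudo systemctl stop gdm3.service   # stop the GNOME display manager
nvidia-smi                         # the Xorg / gnome-shell entries should be gone now
# run training here, then bring the desktop back afterwards:
sudo systemctl start gdm3.service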

@lissyx I'm sorry if I gave the impression that I'm not willing to try your suggestion; I meant it didn't work for my case… I'm still getting this error:

tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[Mean/_61]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I changed to cuDNN 7.6.2 and a new environment; same issue.

You say it did not work, but you don't say explicitly whether you still have the Xorg and GNOME Shell processes. If they are still there, it's inconclusive.
If they are killed and you can still reproduce the issue, there's something wrong on your system and I have no idea. Please try using the Docker image, maybe?
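A minimal sketch of the Docker route, assuming a training image has been built from the repo's training Dockerfile (the image tag deepspeech-train and the host paths below are placeholders); it needs Docker 19.03+ with the NVIDIA container toolkit for --gpus:

docker run --gpus all -it --rm \
    -v /path/to/my-data:/data \
    -v /path/to/my-checkpoints:/checkpoints \
    deepspeech-train bash
# inside the container, run DeepSpeech.py as usual, pointing at /data and /checkpoints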

When I run sudo systemctl stop gdm3.service and then systemctl status gdm3 I get:

● gdm.service - GNOME Display Manager
   Loaded: loaded (/lib/systemd/system/gdm.service; static; vendor preset: enabled)
   Active: inactive (dead) since Mon 2020-11-30 08:52:19 EST; 45s ago
  Process: 915 ExecStart=/usr/sbin/gdm3 (code=exited, status=0/SUCCESS)
  Process: 907 ExecStartPre=/usr/share/gdm/generate-config (code=exited, status=0/SUCCESS)
 Main PID: 915 (code=exited, status=0/SUCCESS)

Nov 30 08:39:58 ghada-Inspiron-3593 systemd[1]: Starting GNOME Display Manager...
Nov 30 08:39:58 ghada-Inspiron-3593 systemd[1]: Started GNOME Display Manager.
Nov 30 08:39:58 ghada-Inspiron-3593 gdm-launch-environment][955]: pam_unix(gdm-launch-environment:session): session opened for user gdm by (uid=0)
Nov 30 08:40:12 ghada-Inspiron-3593 gdm-password][1347]: pam_unix(gdm-password:session): session opened for user ghada by (uid=0)
Nov 30 08:52:19 ghada-Inspiron-3593 systemd[1]: Stopping GNOME Display Manager...
Nov 30 08:52:19 ghada-Inspiron-3593 gdm3[915]: GLib: g_hash_table_find: assertion 'version == hash_table->version' failed
Nov 30 08:52:19 ghada-Inspiron-3593 systemd[1]: Stopped GNOME Display Manager.

And I still have the Xorg and gnome-shell processes.
I also noticed that even with the flag --load_cudnn instead of --train_cudnn it gives the same error, but it's fixed once I set: export CUDA_VISIBLE_DEVICES=-1

That makes you not use CUDA devices, so it's not what you want.

Then it could still be the cause.

Maybe it's gdm.service on your system. I'm not here to debug your setup, sorry. I explained what it could be, but I can't fix it for you.

Yes, I got it! Thanks a lot!

Problem solved after setting TF_FORCE_GPU_ALLOW_GROWTH=true
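For anyone else landing here: TF_FORCE_GPU_ALLOW_GROWTH is read by TensorFlow from the environment at startup, so set it before launching training, e.g. (the file paths below are placeholders):

export TF_FORCE_GPU_ALLOW_GROWTH=true
python3 DeepSpeech.py \
    --train_cudnn \
    --checkpoint_dir /path/to/checkpoints \
    --train_files /path/to/train.csv \
    --dev_files /path/to/dev.csv \
    --test_files /path/to/test.csv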

Hi lissyx.

This also gave me a huge headache.
The official documentation at https://deepspeech.readthedocs.io/en/v0.9.1/TRAINING.html states that we need CUDA 10.1 for DeepSpeech.
I wasted 6 hours today trying to solve this, lol.
The official documentation should be updated.

This solved my problem, though:

I'm running Ubuntu 18.04 (since that's the latest OS with CUDA support).
Follow the install steps at https://www.tensorflow.org/install/gpu but change every 10.1 -> 10.0 and it will work!
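A quick sanity check that the versions actually line up, assuming the toolkit landed under /usr/local/cuda-10.0 and you are on the TensorFlow 1.15 build that DeepSpeech 0.9 uses:

nvcc --version            # if nvcc is on your PATH, it should report release 10.0
ls -d /usr/local/cuda*    # installed toolkit directories
python3 -c 'import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())'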

Also, maybe I'm just being dense, but it took me an hour to solve another issue where DeepSpeech.py could not resolve a path with spaces.
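For anyone hitting the same thing, quoting the paths on the command line is usually enough (the paths here are made up):

python3 DeepSpeech.py \
    --train_files "/home/me/my data/train.csv" \
    --dev_files "/home/me/my data/dev.csv" \
    --test_files "/home/me/my data/test.csv" \
    --checkpoint_dir "/home/me/my checkpoints"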

It was fixed after the discrepancy was properly reported.

@soerengustenhoff, @Ghada_Mjanah thanks for pointing that problem out. I was still using my old setup and also didn’t realize we had a problem there.

@soerengustenhoff, as for the white spaces: if you can, make a PR and file it. If you don't have the time, open a new thread here on Discourse, give some examples and we'll fix that either in code or the docs for the next release.

Hi lissyx.

Where has it been fixed?
My problems were yesterday, and the issue persisted.
Is it fixed in the new version?

And thank you for your very swift reply!

Olaf, I will try to write down any further issues that I have.
Right now my model failed after running Common Voice English with a batch size of 50. Another 4 hours wasted, due to running out of memory I suppose?
I am not sure how it works though, I will have to look through the code :slightly_smiling_face:

Is a PR done through GitHub?
I assume it is a pull request, and I am happy to help.
I did, however, just think it was a simple python3 argument issue I ran into.

It is already fixed on GitHub, might not be updated on readthedocs though, @lissyx?

Try powers of 2 like 16 or 32. Finding the right batch size is a common problem. If it persists, start a new post here on Discourse.
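Something along these lines, for example (the paths are placeholders; lower the batch sizes until the out-of-memory error goes away):

python3 DeepSpeech.py \
    --train_files /data/cv-en/train.csv \
    --dev_files /data/cv-en/dev.csv \
    --test_files /data/cv-en/test.csv \
    --train_batch_size 16 \
    --dev_batch_size 16 \
    --test_batch_size 16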

Yes, it would be great if you could generate a Pull Request on GitHub. If not, open a post for the problem here and we'll see to it that we find a solution. If it is just a usage-related problem, also open a post, describe what happened and how you fixed it. Most people search for the problem and will find your post.

I fixed the docs on master and r0.9 a few days ago.