Fine-tuning DeepSpeech 0.9.1 with the same alphabet

I restarted my PC and nvidia-smi shows 0% GPU utilization.

Sometimes it helps to just start over because you fiddled with all the libs too much. Why don't you try it in a fresh environment? That helped me in the past.

It's not about GPU usage, it's about the GPU being locked / GPU memory being allocated, even a tiny portion of it.

I will try it in a new env right away!

@Ghada_Mjanah Like I said, I just ran into a similar stack trace, and stopping gdm3 / gnome3 fixed it. Those errors can have a lot of root causes, and debugging cuDNN is hard and not in our scope, sorry.

nvidia-smi shows that these processes are running on the GPU:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       980      G   /usr/lib/xorg/Xorg                 47MiB |
|    0   N/A  N/A      1065      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      1375      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      1561      G   /usr/bin/gnome-shell               39MiB |
+-----------------------------------------------------------------------------+

I can't stop either of them, so I don't think that's the problem for me…

I created a whole new environment and rebooted, same issue…

Believe whatever you want; I can just tell you I had exactly this issue earlier today and that killing GNOME3 with sudo systemctl stop gdm3.service helped. If you are not willing to try our suggestions, we can't help you.
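For reference, a rough sequence to check whether the display manager is really off the GPU (the unit may be called gdm3 or gdm depending on the distribution, and the paths are whatever your training setup uses):

sudo systemctl stop gdm3.service   # stop the GNOME display manager
nvidia-smi                         # the Xorg / gnome-shell entries should be gone now
# run training here, then bring the desktop back afterwards:
sudo systemctl start gdm3.service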

@lissyx I'm sorry if I gave the impression that I'm not willing to try your suggestion; I meant it didn't work for my case… I'm still getting this error:

tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[Mean/_61]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I changed to cuDNN 7.6.2 and a new environment; same issue.

You say it did not work, but you don't say explicitly whether you still have the Xorg and GNOME Shell processes. If they are still there, it's inconclusive.
If they are killed and you can still reproduce the issue, there's something wrong on your system and I have no idea. Please try using the Docker image, maybe?
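A minimal sketch of the Docker route, assuming a training image has been built from the repo's training Dockerfile (the image tag deepspeech-train and the host paths below are placeholders); it needs Docker 19.03+ with the NVIDIA container toolkit for --gpus:

docker run --gpus all -it --rm \
    -v /path/to/my-data:/data \
    -v /path/to/my-checkpoints:/checkpoints \
    deepspeech-train bash
# inside the container, run DeepSpeech.py as usual, pointing at /data and /checkpoints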

When I run sudo systemctl stop gdm3.service and then systemctl status gdm3 I get:

● gdm.service - GNOME Display Manager
   Loaded: loaded (/lib/systemd/system/gdm.service; static; vendor preset: enabled)
   Active: inactive (dead) since Mon 2020-11-30 08:52:19 EST; 45s ago
  Process: 915 ExecStart=/usr/sbin/gdm3 (code=exited, status=0/SUCCESS)
  Process: 907 ExecStartPre=/usr/share/gdm/generate-config (code=exited, status=0/SUCCESS)
 Main PID: 915 (code=exited, status=0/SUCCESS)

Nov 30 08:39:58 ghada-Inspiron-3593 systemd[1]: Starting GNOME Display Manager...
Nov 30 08:39:58 ghada-Inspiron-3593 systemd[1]: Started GNOME Display Manager.
Nov 30 08:39:58 ghada-Inspiron-3593 gdm-launch-environment][955]: pam_unix(gdm-launch-environment:session): session opened for user gdm by (uid=0)
Nov 30 08:40:12 ghada-Inspiron-3593 gdm-password][1347]: pam_unix(gdm-password:session): session opened for user ghada by (uid=0)
Nov 30 08:52:19 ghada-Inspiron-3593 systemd[1]: Stopping GNOME Display Manager...
Nov 30 08:52:19 ghada-Inspiron-3593 gdm3[915]: GLib: g_hash_table_find: assertion 'version == hash_table->version' failed
Nov 30 08:52:19 ghada-Inspiron-3593 systemd[1]: Stopped GNOME Display Manager.

And I still have the Xorg and gnome-shell processes.
I also noticed that even with the flag --load_cudnn instead of --train_cudnn it gives the same error, but it's fixed once I set: export CUDA_VISIBLE_DEVICES=-1

That makes you not use CUDA devices, so it's not what you want.

Then it could still be the cause.

Maybe it's gdm.service on your system. I'm not here to debug your setup, sorry. I explained what it could be, but I can't fix it for you.

Yes, I got it! Thanks a lot!

Problem solved after setting TF_FORCE_GPU_ALLOW_GROWTH=true
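For anyone else landing here: TF_FORCE_GPU_ALLOW_GROWTH is read by TensorFlow from the environment at startup, so set it before launching training, e.g. (the file paths below are placeholders):

export TF_FORCE_GPU_ALLOW_GROWTH=true
python3 DeepSpeech.py \
    --train_cudnn \
    --checkpoint_dir /path/to/checkpoints \
    --train_files /path/to/train.csv \
    --dev_files /path/to/dev.csv \
    --test_files /path/to/test.csv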

Hi lissyx.

This also gave me a huge headache.
The official documentation at https://deepspeech.readthedocs.io/en/v0.9.1/TRAINING.html states that we need CUDA 10.1 for DeepSpeech.
I wasted 6 hours today trying to solve this, lol.
The official documentation should be updated.

This solved my problem, though:

I'm running Ubuntu 18.04 (since that's the latest OS with CUDA support).
Follow the install steps at https://www.tensorflow.org/install/gpu but change every 10.1 -> 10.0 and it will work!
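A quick sanity check that the versions actually line up, assuming the toolkit landed under /usr/local/cuda-10.0 and you are on the TensorFlow 1.15 build that DeepSpeech 0.9 uses:

nvcc --version            # if nvcc is on your PATH, it should report release 10.0
ls -d /usr/local/cuda*    # installed toolkit directories
python3 -c 'import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())'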

Also, maybe I'm just being dense, but it took me an hour to solve another issue where DeepSpeech.py could not resolve a path with spaces.
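For anyone hitting the same thing, quoting the paths on the command line is usually enough (the paths here are made up):

python3 DeepSpeech.py \
    --train_files "/home/me/my data/train.csv" \
    --dev_files "/home/me/my data/dev.csv" \
    --test_files "/home/me/my data/test.csv" \
    --checkpoint_dir "/home/me/my checkpoints"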

It was fixed after the discrepancy was properly reported.

@soerengustenhoff, @Ghada_Mjanah thanks for pointing that problem out. I was still using my old setup and also didn’t realize we had a problem there.

@soerengustenhoff, as for the white spaces: if you can, make a PR and file it. If you don't have the time, open a new thread here on Discourse, give some examples and we'll fix that either in code or the docs for the next release.

Hi lissyx.

Where has it been fixed?
My problems were yesterday, and the issue persisted.
Is it fixed in the new version?

And thank you for your very swift reply!

Olaf, I will try to write down any further issues that I have.
Right now my model failed after running Common Voice English with a batch size of 50. Another 4 hours wasted, due to running out of memory I suppose?
I am not sure how it works though, I will have to look through the code :slightly_smiling_face:

Is a PR done through GitHub?
I assume it is a pull request, and I am happy to help.
I did, however, just think it was a simple python3 argument issue I ran into.

It is already fixed on GitHub, might not be updated on readthedocs though, @lissyx?

Try powers of 2 like 16 or 32. Finding the right batch size is a common problem. If it persists, start a new post here on Discourse.
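Something along these lines, for example (the paths are placeholders; lower the batch sizes until the out-of-memory error goes away):

python3 DeepSpeech.py \
    --train_files /data/cv-en/train.csv \
    --dev_files /data/cv-en/dev.csv \
    --test_files /data/cv-en/test.csv \
    --train_batch_size 16 \
    --dev_batch_size 16 \
    --test_batch_size 16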

Yes, it would be great if you could generate a Pull Request on GitHub. If not, open a post for the problem here and we'll see to it that we find a solution. If it is just a usage-related problem, also open a post, describe what happened and how you fixed it. Most people search for the problem and will find your post.

I fixed the docs on master and r0.9 a few days ago.