Building libdeepspeech.so: multiple definition errors (zlib_archive vs zlib)?

Well since you ask… the Dockerfile is based on
nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

Presumably that is to get TF 2.x compatibility, which carries CUDA 10.1 with it. But our deployment Dockerfiles are based on tensorflow/tensorflow:1.15.2-gpu-py3, which carries CUDA 10.0. I was seeing problems when loading the CUDA 10.1 libraries, so I tried various things, such as upgrading our deployed Docker image and downgrading Dockerfile.build. None of that has worked yet, so… well, you can imagine.

This is compounded by the GPU, a K80 (an AWS P2 instance), possibly not supporting 10.1. I think. Maybe. So I’m just trying to get the deployment Docker image and the models to line up.

That’s what I was saying: if you target TensorFlow r1.15, you need to change the FROM line to use the CUDA 10.0 image and not the 10.1 one.
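The pairing described above (r1.15 with CUDA 10.0, the TF 2.x build image with CUDA 10.1) can be checked mechanically before starting a long build. A minimal Python sketch, assuming the Dockerfile uses an `nvidia/cuda:<version>-…` base image. The `EXPECTED_CUDA` table and both function names are hypothetical helpers, not part of DeepSpeech or TensorFlow:

```python
import re

# Assumed pairing, taken from this thread: the r1.15 images carry CUDA 10.0,
# while the build image aimed at TF 2.x carries CUDA 10.1.
EXPECTED_CUDA = {"r1.15": "10.0", "r2.2": "10.1"}


def cuda_version_from_base_image(dockerfile_text):
    """Return the CUDA version of the first `FROM nvidia/cuda:<tag>` line, or None."""
    match = re.search(
        r"^FROM\s+nvidia/cuda:(\d+\.\d+)", dockerfile_text, re.MULTILINE
    )
    return match.group(1) if match else None


def base_image_matches(dockerfile_text, tf_branch):
    """True when the base image's CUDA version is the one expected for tf_branch."""
    return cuda_version_from_base_image(dockerfile_text) == EXPECTED_CUDA[tf_branch]
```

For example, `base_image_matches(open("Dockerfile.build").read(), "r1.15")` is False while the FROM line still points at the 10.1 image.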

We decoupled build and training because our training code still requires r1.15 but inference code leverages r2.2.


I know this is not the right forum for this, but I thought I’d just ask here in case you might have seen it. I’m building with a Dockerfile.build downgraded to nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04, but I’m still having problems matching the compiled version with the deployed Docker image, even (inexplicably) when I use nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 as the base image in production.

In debugging what’s going on I noticed this…

And I’m just like whaaat? How come cuda:10 docker has cuda 10.2 installed???
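One possible explanation, assuming the 10.2 was read from the `nvidia-smi` header: that field reports the highest CUDA version the host driver supports, not the toolkit installed in the image. A minimal Python sketch for telling the two apart, assuming the toolkit ships `/usr/local/cuda/version.txt` (as CUDA 10.x does); both function names are hypothetical:

```python
import re


def toolkit_version(version_txt):
    """Parse toolkit text such as 'CUDA Version 10.0.130' (from version.txt)."""
    match = re.search(r"CUDA Version\s+(\d+\.\d+)", version_txt)
    return match.group(1) if match else None


def driver_max_cuda(nvidia_smi_output):
    """Parse the 'CUDA Version: 10.2' field printed in the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None
```

Inside the container, `toolkit_version(open("/usr/local/cuda/version.txt").read())` gives the toolkit version, and (with `import subprocess`) `driver_max_cuda(subprocess.check_output(["nvidia-smi"], text=True))` gives the driver’s ceiling. A 10.0 toolkit under a driver that reports 10.2 is a normal combination.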

FWIW current status is that I can successfully build these two guys

But then when I run my transcribe test… this happens.

2020-06-29 12:08:07.365936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
TensorFlow: v1.15.0-24-gceb46aae58
DeepSpeech: v0.6.1-438-ga6c6dc21
2020-06-29 12:08:07.380255: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-29 12:08:07.381570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-29 12:08:09.989048: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:09.989887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-06-29 12:08:09.989929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-29 12:08:09.991633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-29 12:08:09.993011: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-29 12:08:09.993459: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-29 12:08:09.995427: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-29 12:08:09.996887: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-29 12:08:10.001230: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-29 12:08:10.001347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:10.002204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:10.002963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.
2020-06-29 12:08:10.095435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-29 12:08:10.095479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-06-29 12:08:10.095496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
Invalid argument: No OpKernel was registered to support Op 'Minimum' used by {{node Minimum}}with these attrs: [T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  <no registered kernels>

         [[Minimum]]
Loading deepspeech weights from file model/20200610_ds.0.7.1_thm/output_graph.pbmm
Traceback (most recent call last):
  File "client/transcribe.py", line 156, in <module>
    main()
  File "client/transcribe.py", line 141, in main
    scorer_path=args.scorer)
  File "client/transcribe.py", line 103, in load_model
    ds = Model(model_weights_path)
  File "/usr/local/lib/python3.7/dist-packages/deepspeech/__init__.py", line 38, in __init__
    raise RuntimeError("CreateModel failed with '{}' (0x{:X})".format(deepspeech.impl.ErrorCodeToErrorMessage(status),status))
RuntimeError: CreateModel failed with 'Failed to create session.' (0x3006)
makefile:35: recipe for target 'transcribe-test' failed
make: *** [transcribe-test] Error 1

I suspect it may have something to do with the build flags set up in Dockerfile.build, which reads as follows:

# Need devel version cause we need /usr/include/cudnn.h 
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
ENV DEEPSPEECH_REPO=https://github.com/tehikumedia/DeepSpeech.git
ENV DEEPSPEECH_SHA=origin/thm_confidences_71

# >> START Install base software

# Get basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
        apt-utils \
        bash-completion \
        build-essential \
        ca-certificates \
        cmake \
        curl \
        g++ \
        gcc \
        git \
        git-lfs \
        libbz2-dev \
        libboost-all-dev \
        libgsm1-dev \
        libltdl-dev \
        liblzma-dev \
        libmagic-dev \
        libpng-dev \
        libsox-fmt-mp3 \
        libsox-dev \
        locales \
        openjdk-8-jdk \
        pkg-config \
        python3 \
        python3-dev \
        python3-pip \
        python3-wheel \
        python3-numpy \
        sox \
        unzip \
        wget \
        zlib1g-dev \
        software-properties-common

# RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
# RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1

# install python 3.7  
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.7 python3.7-dev python3-dev python3-pip \
    python3-uno python3-setuptools


RUN rm /usr/bin/python3
RUN ln -s /usr/bin/python3.7 /usr/bin/python3
RUN rm /usr/bin/python
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 install --upgrade pip
RUN pip3 install wheel twine

ENV PYTHON_BIN_PATH /usr/bin/python3.7
ENV PYTHON_LIB_PATH /usr/local/lib/python3.7/dist-packages

# Install Bazel
RUN curl -LO "https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel_2.0.0-linux-x86_64.deb"
RUN dpkg -i bazel_*.deb

ARG BAZEL_VERSION
COPY bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh /root
RUN /root/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh

# << END Install base software

# >> START Configure Tensorflow Build

# Clone TensorFlow from Mozilla repo
RUN git clone https://github.com/mozilla/tensorflow/
WORKDIR /tensorflow
RUN git checkout r1.15

# GPU Environment Setup
ENV TF_NEED_ROCM 0
ENV TF_NEED_OPENCL_SYCL 0
ENV TF_NEED_OPENCL 0
ENV TF_NEED_CUDA 1
ENV TF_CUDA_PATHS "/usr,/usr/local/cuda-10.0,/usr/lib/x86_64-linux-gnu/"
ENV TF_CUDA_VERSION 10.0
ENV TF_CUDNN_VERSION 7.5
ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7
ENV TF_NCCL_VERSION 2.4

# Common Environment Setup
ENV TF_BUILD_CONTAINER_TYPE GPU
ENV TF_BUILD_OPTIONS OPT
ENV TF_BUILD_DISABLE_GCP 1
ENV TF_BUILD_ENABLE_XLA 0
ENV TF_BUILD_PYTHON_VERSION PYTHON3
ENV TF_BUILD_IS_OPT OPT
ENV TF_BUILD_IS_PIP PIP

# Other Parameters
ENV CC_OPT_FLAGS -mavx -mavx2 -msse4.1 -msse4.2 -mfma
ENV TF_NEED_GCP 0
ENV TF_NEED_HDFS 0
ENV TF_NEED_JEMALLOC 1
ENV TF_NEED_OPENCL 0
ENV TF_CUDA_CLANG 0
ENV TF_NEED_MKL 0
ENV TF_ENABLE_XLA 0
ENV TF_NEED_AWS 0
ENV TF_NEED_KAFKA 0
ENV TF_NEED_NGRAPH 0
ENV TF_DOWNLOAD_CLANG 0
ENV TF_NEED_TENSORRT 0
ENV TF_NEED_GDR 0
ENV TF_NEED_VERBS 0
ENV TF_NEED_OPENCL_SYCL 0

# << END Configure Tensorflow Build

# >> START Configure Bazel

# Running bazel inside a `docker build` command causes trouble, cf:
#   https://github.com/bazelbuild/bazel/issues/134
# The easiest solution is to set up a bazelrc file forcing --batch.
RUN echo "startup --batch" >>/etc/bazel.bazelrc
# Similarly, we need to workaround sandboxing issues:
#   https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
    >>/etc/bazel.bazelrc

# << END Configure Bazel

WORKDIR /

RUN git clone $DEEPSPEECH_REPO
WORKDIR /DeepSpeech
RUN git checkout $DEEPSPEECH_SHA

# Link DeepSpeech native_client libs to tf folder
RUN ln -s /DeepSpeech/native_client /tensorflow

# >> START Build and bind

WORKDIR /tensorflow

# Fix for not found script https://github.com/tensorflow/tensorflow/issues/471
RUN ./configure

# Using CPU optimizations:
# -mtune=generic -march=x86-64 -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx.
# Adding --config=cuda flag to build using CUDA.

# passing LD_LIBRARY_PATH is required cause Bazel doesn't pickup it from environment

# Build DeepSpeech
RUN bazel build \
	--workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" \
	--config=monolithic \
	--config=cuda \
	-c opt \
	--copt=-O3 \
	--copt="-D_GLIBCXX_USE_CXX11_ABI=0" \
	--copt=-mtune=generic \
	--copt=-march=x86-64 \
	--copt=-msse \
	--copt=-msse2 \
	--copt=-msse3 \
	--copt=-msse4.1 \
	--copt=-msse4.2 \
	--copt=-mavx \
	--copt=-fvisibility=hidden \
	//native_client:libdeepspeech.so \
	--verbose_failures \
	--action_env=LD_LIBRARY_PATH=${LD_LIBRARY_PATH}

# Copy built libs to /DeepSpeech/native_client
RUN cp /tensorflow/bazel-bin/native_client/libdeepspeech.so /DeepSpeech/native_client/

# Build client.cc and install Python client and decoder bindings
ENV TFDIR /tensorflow

RUN nproc

# Have to upgrade this because of upgrading python to 3.7 above
# FIXME: move this to above
RUN pip3 install --upgrade numpy==1.19.0

WORKDIR /DeepSpeech/native_client
RUN make NUM_PROCESSES=$(nproc) deepspeech

WORKDIR /DeepSpeech
RUN cd native_client/python && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/python/dist/*.whl

RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/ctcdecode/dist/*.whl

# << END Build and bind

# Allow Python printing utf-8
ENV PYTHONIOENCODING UTF-8

# Build KenLM in /DeepSpeech/native_client/kenlm folder
WORKDIR /DeepSpeech/native_client
RUN rm -rf kenlm && \
	git clone https://github.com/kpu/kenlm && \
	cd kenlm && \
	git checkout 87e85e66c99ceff1fab2500a7c60c01da7315eec && \
	mkdir -p build && \
	cd build && \
	cmake .. && \
	make -j $(nproc)

# Done
WORKDIR /DeepSpeech

You are missing something at build time. This might be because of your changes.

To me it seems likely it’s this thing…

Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.
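That line says TensorFlow skipped the K80, which is consistent with the later "Registered devices: [CPU]". It also suggests, though the log alone does not confirm it, that the library loaded at runtime was compiled for compute capability 6.0 and up rather than 3.7. A minimal Python sketch for pulling the mismatch out of a captured log; the function name and regex are hypothetical and match only the message format shown here:

```python
import re

# Matches the TensorFlow 1.15 message quoted above; other versions may word it differently.
IGNORED_GPU = re.compile(
    r"Ignoring visible gpu device \(device: \d+, name: (?P<name>[^,]+),.*?"
    r"compute capability: (?P<have>\d+\.\d+)\).*?"
    r"minimum required Cuda capability is (?P<need>\d+\.\d+)"
)


def ignored_gpus(log_text):
    """Return (name, device capability, required minimum) for each skipped GPU."""
    return [
        (m.group("name"), m.group("have"), m.group("need"))
        for m in IGNORED_GPU.finditer(log_text)
    ]
```

Running it over the output of the transcribe test after each rebuild shows quickly whether the binary in use actually picked up the 3.7 setting.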

FWIW I’ve tried setting the env variable below in Dockerfile.build before building, but I’m still getting the same error.

ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7

Guess I’ll build again without setting the custom DS repo just to get a clean error.