Building libdeepspeech.so: multiple definition errors (zlib_archive vs zlib)?

utunga · June 25, 2020, 8:45am

I don’t quite know what you mean here. In some sense Bazel is in fact telling me, because it’s saying it doesn’t build… but perhaps you mean this in some other sense.

Thanks again for the reply!

lissyx · June 25, 2020, 8:46am

The patches are applied right at the beginning; if that failed, you would get an error clearly stating that it was unable to apply them.

BTW, what are you building? We just merged the move to TensorFlow r2.2 for inference, so you might benefit from that.

lissyx · June 25, 2020, 8:49am

It’s not a tag, it’s a branch.

Right, I missed it when reading, so you do seem to have the monolithic config enabled. Just make sure you are not getting tricked by weird UTF-8 characters from copy-pasting that would mess up the -- in front of a flag, for example.
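One way to check for that kind of copy-paste damage is to scan the saved build command for bytes outside plain ASCII, since an en dash pasted in place of -- is a multi-byte UTF-8 character. This is only a sketch: the helper name find_non_ascii and the file name build_cmd.sh are made up for illustration, and it assumes a grep that supports -a (GNU and BSD grep do).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line (with its line number) that contains
# a byte outside printable ASCII, e.g. an en dash pasted in place of "--".
find_non_ascii() {
    # LC_ALL=C makes grep work byte-wise, so each byte of a multi-byte
    # UTF-8 character counts as non-printable. -a forces text mode.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}

# Usage (build_cmd.sh is an assumed file holding the pasted command):
#   find_non_ascii build_cmd.sh || echo "only plain ASCII found"
```

grep exits non-zero when nothing matches, so the function can be used directly in a condition.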

So maybe you changed something related?

lissyx · June 25, 2020, 8:49am

Also, have you tried completely wiping Bazel’s cache? Sometimes it goes crazy. And which version of Bazel are you using?
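For reference, a rough sketch of wiping the cache and confirming the version. bazel clean --expunge and bazel info release are standard Bazel commands; the helper names are hypothetical, and the expected version is whatever you pass in (0.24.1 in the usage line is simply the version reported later in this thread, not an official requirement).

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the output of `bazel info release`
# (for example "release 0.24.1") with the version you expect.
check_bazel_release() {
    [ "${1#release }" = "$2" ]
}

# Hypothetical helper, meant to be run by hand from the TensorFlow checkout.
# `bazel clean --expunge` removes the entire output base of this workspace,
# so the next build starts from scratch.
wipe_and_check() {
    bazel clean --expunge
    if check_bazel_release "$(bazel info release 2>/dev/null)" "$1"; then
        echo "Bazel version matches $1"
    else
        echo "Unexpected Bazel version" >&2
        return 1
    fi
}

# Usage: wipe_and_check 0.24.1
```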

utunga · June 25, 2020, 10:37am

Thanks for the replies, @lissyx. I’ll try to respond to each one as needed.

FWIW, I’m just trying to build v0.7.1 of DeepSpeech with these tiny changes that expose letter-by-letter confidences through the ds.sttWithMetadata(audio) method.

I’d totally think about upgrading to TensorFlow 2.2 etc., but we just spent ages upgrading everything to 0.7 and retraining the model, so I’m kinda keen to get this going at this version…

The letter-by-letter confidences stuff was working great in v0.5.1, btw; I’m just trying to bring it over into 0.7.

Sorry, yes, I meant branch.

Just to eliminate all other confusion, I’m going to roll back DeepSpeech to the v0.7.1 tag, leave TensorFlow at r1.15, rm -rf the .bazel_cache and rebuild. That takes ages, though, so it will be a while before I report back.

bazel 0.24.1

More specifically… (this is from within our Dockerfile btw)

root@13f49df5ac06:/code/tensorflow# bazel info
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=157
INFO: Reading rc options for 'info' from /code/tensorflow/.bazelrc:
  Inherited 'build' options: --apple_platform_type=macos --define framework_shared_object=true --define open_source_build=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone --strategy=Genrule=standalone -c opt --announce_rc --define=grpc_no_ares=true --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include
INFO: Reading rc options for 'info' from /code/tensorflow/.tf_configure.bazelrc:
  Inherited 'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python --action_env PYTHON_LIB_PATH=/usr/local/lib/python3.7/dist-packages --python_path=/usr/bin/python --config=xla --action_env CUDA_TOOLKIT_PATH=/usr/local/cuda --action_env TF_CUDA_COMPUTE_CAPABILITIES=7.0 --action_env LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 --action_env GCC_HOST_COMPILER_PATH=/usr/bin/gcc --config=cuda --action_env TF_CONFIGURE_IOS=0
INFO: Found applicable config definition build:xla in file /code/tensorflow/.tf_configure.bazelrc: --define with_xla_support=true
INFO: Found applicable config definition build:cuda in file /code/tensorflow/.bazelrc: --config=using_cuda --define=using_cuda_nvcc=true
INFO: Found applicable config definition build:using_cuda in file /code/tensorflow/.bazelrc: --define=using_cuda=true --action_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain
DEBUG: Rule 'io_bazel_rules_docker' indicated that a canonical reproducible form can be obtained by modifying arguments shallow_since = "1556410077 -0400"
bazel-bin: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/execroot/org_tensorflow/bazel-out/k8-opt/bin
bazel-genfiles: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/execroot/org_tensorflow/bazel-out/k8-opt/genfiles
bazel-testlogs: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/execroot/org_tensorflow/bazel-out/k8-opt/testlogs
character-encoding: file.encoding = ISO-8859-1, defaultCharset = ISO-8859-1
command_log: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/command.log
committed-heap-size: 964MB
execution_root: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/execroot/org_tensorflow
gc-count: 5
gc-time: 119ms
install_base: /root/.cache/bazel/_bazel_root/install/7da6a92c096ada842b8d48c251312343
java-home: /root/.cache/bazel/_bazel_root/install/7da6a92c096ada842b8d48c251312343/_embedded_binaries/embedded_tools/jdk
java-runtime: OpenJDK Runtime Environment (build 11.0.2+7-LTS) by Azul Systems, Inc.
java-vm: OpenJDK 64-Bit Server VM (build 11.0.2+7-LTS, mixed mode) by Azul Systems, Inc.
max-heap-size: 14309MB
output_base: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f
output_path: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/execroot/org_tensorflow/bazel-out
package_path: %workspace%
release: release 0.24.1
repository_cache: /root/.cache/bazel/_bazel_root/cache/repos/v1
server_log: /root/.cache/bazel/_bazel_root/dc321da52ae17570621748eb04acb03f/java.log.13f49df5ac06.root.log.java.20200625-103403.278
server_pid: 278
used-heap-size: 228MB
workspace: /code/tensorflow

utunga · June 25, 2020, 10:59am

As I mentioned, I’m currently rebuilding as described above.

That said, I really wish I could figure out which bit of code inside the many Bazel configs actually applies the patches. Pulling in multiple definitions of zlib via protobuf seems to be pretty much a known problem with the TensorFlow build, and the reason that patch is there in the first place… it just seems like, I dunno, it’s not being applied for some reason?

lissyx · June 25, 2020, 11:21am

Right, so nothing that should have an impact.

Training is still r1.15 and the models are compatible, so you can just do that safely

Ok, some people reported weird issues when building with 0.26.0

Again, if the patch is referenced in TensorFlow’s build configs, it is applied. If it failed to apply, you would be blocked at it.

Anyway, you can still verify the files manually if you are unsure. Likely something like find -L . -type f -name "zlib*"?
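A slightly narrower variant of that search, as a sketch: list the zlib-named external repositories that Bazel has fetched, to see whether both a zlib and a zlib_archive copy are present. The function name is hypothetical, and it assumes the external repositories sit in an external directory under the output_base path that bazel info prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list external repositories whose name starts with "zlib".
# $1 is the Bazel output base (the output_base value from `bazel info`).
list_zlib_repos() {
    find -L "$1/external" -mindepth 1 -maxdepth 1 -type d -name 'zlib*' | sort
}

# Usage, from inside the TensorFlow checkout:
#   list_zlib_repos "$(bazel info output_base 2>/dev/null)"
```

If two directories show up, comparing their contents is a reasonable next step before blaming the patch itself.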

lissyx · June 25, 2020, 11:22am

@utunga Also, are you able to rebuild a clean tree? Can you try with our supplied Dockerfile.build? You can change the repo used when issuing make Dockerfile.build.

utunga · June 25, 2020, 12:19pm

Thanks @lissyx. It’s late here, so I may not get to this till tomorrow, but FYI I finished a build after cleaning the Bazel cache and rolling DeepSpeech back to v0.7.1, and I’m still getting the same ‘multiple definition’ errors relating to zlib.

I will also try with the other Dockerfile.build as you ask (more from a ‘clean bug perspective’, tbh, because this Dockerfile was able to build earlier things OK).

Actually, just to clarify: when you say Dockerfile.build, do you mean the HEAD Dockerfile.build.tmpl or the v0.7.1 Dockerfile?

–

PS I am not quite sure what you mean by…

FWIW, I’m doing git clean -f -d and git reset --hard in both the tensorflow and DeepSpeech dirs before building, if that’s what you mean?

lissyx · June 25, 2020, 12:22pm

No, just building mozilla/tensorflow@r1.15 plain after cleaning up any bazel cache

HEAD one

utunga · June 27, 2020, 6:57am

Because I appreciate your help @lissyx I thought I’d give you an update on this…

I used the HEAD Dockerfile.build.tmpl and with a few changes (listed below) was able to build both libdeepspeech.so and the python wheel.

The first thing I did was verify, as you suggested, that I could build HEAD of mozilla/DeepSpeech against TensorFlow 2.2… I was then able to specify the custom DeepSpeech repo by altering the DEEPSPEECH_REPO and DEEPSPEECH_SHA params at the top of the Dockerfile.

In case someone else is wrestling with this, I had to also do the following, in order to build DeepSpeech at v0.7.1_gpu specifically:

downgrade BAZEL_VERSION to 0.24.1
downgrade tensorflow to r1.15

RUN git clone https://github.com/mozilla/tensorflow/
WORKDIR /tensorflow
RUN git checkout r1.15

–

The only problem I have now is that the Dockerfile builds with Python 3.6, but our code relies on async / await, so I have to upgrade to Python 3.7 and do it again.

lissyx · June 26, 2020, 7:59am

Ubuntu 18.04 has python 3.7 packages, so you should be able to do so.

Was that against our pristine repo or your code? Anyway, it looks like you are making progress, so you should be able to find out whether it’s an issue with your changes or with your build environment.

utunga · June 26, 2020, 11:41am

Both, actually. I was able to generate a py3.6 version of the wheel on a pristine version of DeepSpeech, and py3.6 and py3.7 versions on our own branch. That said, unfortunately I’m still having a few problems getting the exact right combination of CUDA library dependencies and such to work.

lissyx · June 26, 2020, 11:51am

Where? The Dockerfile has them clearly stated. But TensorFlow r1.15 needs CUDA 10.0 + cuDNN v7.6.

utunga · June 26, 2020, 12:09pm

Well since you ask… the Dockerfile is based on
nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

Presumably that is for TF 2.x compatibility, and it carries CUDA 10.1 with it. But our deployment Dockerfiles are based on tensorflow/tensorflow:1.15.2-gpu-py3, which carries CUDA 10.0. I was seeing problems attempting to load CUDA 10.1 files, so I tried various things, such as upgrading our deployed Docker image and downgrading the Dockerfile.build, none of which has worked yet… well, yeah, you can imagine.

This is also compounded by the GPU, a K80 (aka p3.2xlarge machine from AWS), not supporting 10.1, I think. Maybe. So yeah, I’m just trying to get the deployment Docker image and models to line up.

lissyx · June 26, 2020, 12:12pm

That’s what I was saying, if you target TensorFlow r1.15, you need to change the FROM to use the 10.0 CUDA version and not the 10.1.
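A quick way to confirm which CUDA version a given build file is based on is to read it back from the FROM line. This is a sketch only: the function name is hypothetical, and it assumes the base image follows the nvidia/cuda:<version>-... tag pattern seen in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CUDA version from the first
# "FROM nvidia/cuda:<version>-..." line of a Dockerfile.
base_image_cuda_version() {
    sed -n 's/^FROM[[:space:]]*nvidia\/cuda:\([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Usage: warn when the base image does not match what TensorFlow r1.15 needs.
#   [ "$(base_image_cuda_version Dockerfile.build)" = "10.0" ] \
#       || echo "base image is not CUDA 10.0" >&2
```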

We decoupled build and training because our training code still requires r1.15 but inference code leverages r2.2.

utunga · June 29, 2020, 12:05pm

I know this is not the right forum for this, but I thought I’d just ask here in case you might have seen this. I’m compiling/building using a Dockerfile.build downgraded to nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04, but I’m still getting problems matching the compiled version with the deployed Docker image, even (inexplicably) when I use nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 as the base image in production.

In debugging what’s going on I noticed this…

And I’m just like whaaat? How come cuda:10 docker has cuda 10.2 installed???

utunga · June 29, 2020, 12:17pm

FWIW current status is that I can successfully build these two guys

But then when I run my transcribe test… this happens.

2020-06-29 12:08:07.365936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
TensorFlow: v1.15.0-24-gceb46aae58
DeepSpeech: v0.6.1-438-ga6c6dc21
2020-06-29 12:08:07.380255: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-29 12:08:07.381570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-29 12:08:09.989048: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:09.989887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-06-29 12:08:09.989929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-29 12:08:09.991633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-29 12:08:09.993011: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-29 12:08:09.993459: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-29 12:08:09.995427: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-29 12:08:09.996887: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-29 12:08:10.001230: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-29 12:08:10.001347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:10.002204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-29 12:08:10.002963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.
2020-06-29 12:08:10.095435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-29 12:08:10.095479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-06-29 12:08:10.095496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
Invalid argument: No OpKernel was registered to support Op 'Minimum' used by {{node Minimum}}with these attrs: [T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  <no registered kernels>

         [[Minimum]]
Loading deepspeech weights from file model/20200610_ds.0.7.1_thm/output_graph.pbmm
Traceback (most recent call last):
  File "client/transcribe.py", line 156, in <module>
    main()
  File "client/transcribe.py", line 141, in main
    scorer_path=args.scorer)
  File "client/transcribe.py", line 103, in load_model
    ds = Model(model_weights_path)
  File "/usr/local/lib/python3.7/dist-packages/deepspeech/__init__.py", line 38, in __init__
    raise RuntimeError("CreateModel failed with '{}' (0x{:X})".format(deepspeech.impl.ErrorCodeToErrorMessage(status),status))
RuntimeError: CreateModel failed with 'Failed to create session.' (0x3006)
makefile:35: recipe for target 'transcribe-test' failed
make: *** [transcribe-test] Error 1
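The key line in this log can be pulled out mechanically. The sketch below assumes the log was saved to a file and that the message keeps the exact wording shown above; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the GPU's compute capability and the minimum
# the binary accepts, taken from TensorFlow's "Ignoring visible gpu device" line.
ignored_gpu_capability() {
    sed -n 's/.*with Cuda compute capability \([0-9.]*[0-9]\)\. The minimum required Cuda capability is \([0-9.]*[0-9]\)\..*/device=\1 required=\2/p' "$1"
}

# Usage (transcribe.log is an assumed file name):
#   ignored_gpu_capability transcribe.log
```

An empty result means the GPU was not rejected on capability grounds, so the missing kernels would need another explanation.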

I suspect it may have something to do with the build flags set up in the Dockerfile.build, which reads as follows:

# Need devel version cause we need /usr/include/cudnn.h 
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
ENV DEEPSPEECH_REPO=https://github.com/tehikumedia/DeepSpeech.git
ENV DEEPSPEECH_SHA=origin/thm_confidences_71

# >> START Install base software

# Get basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
        apt-utils \
        bash-completion \
        build-essential \
        ca-certificates \
        cmake \
        curl \
        g++ \
        gcc \
        git \
        git-lfs \
        libbz2-dev \
        libboost-all-dev \
        libgsm1-dev \
        libltdl-dev \
        liblzma-dev \
        libmagic-dev \
        libpng-dev \
        libsox-fmt-mp3 \
        libsox-dev \
        locales \
        openjdk-8-jdk \
        pkg-config \
        python3 \
        python3-dev \
        python3-pip \
        python3-wheel \
        python3-numpy \
        sox \
        unzip \
        wget \
        zlib1g-dev \
        software-properties-common

# RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
# RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1

# install python 3.7  
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.7 python3.7-dev python3-dev python3-pip \
    python3-uno python3-setuptools


RUN rm /usr/bin/python3
RUN ln -s /usr/bin/python3.7 /usr/bin/python3
RUN rm /usr/bin/python
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 install --upgrade pip
RUN pip3 install wheel twine

ENV PYTHON_BIN_PATH /usr/bin/python3.7
ENV PYTHON_LIB_PATH /usr/local/lib/python3.7/dist-packages

# Install Bazel
RUN curl -LO "https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel_2.0.0-linux-x86_64.deb"
RUN dpkg -i bazel_*.deb

ARG BAZEL_VERSION
COPY bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh /root
RUN /root/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh

# << END Install base software

# >> START Configure Tensorflow Build

# Clone TensorFlow from Mozilla repo
RUN git clone https://github.com/mozilla/tensorflow/
WORKDIR /tensorflow
RUN git checkout r1.15

# GPU Environment Setup
ENV TF_NEED_ROCM 0
ENV TF_NEED_OPENCL_SYCL 0
ENV TF_NEED_OPENCL 0
ENV TF_NEED_CUDA 1
ENV TF_CUDA_PATHS "/usr,/usr/local/cuda-10.0,/usr/lib/x86_64-linux-gnu/"
ENV TF_CUDA_VERSION 10.0
ENV TF_CUDNN_VERSION 7.5
ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7
ENV TF_NCCL_VERSION 2.4

# Common Environment Setup
ENV TF_BUILD_CONTAINER_TYPE GPU
ENV TF_BUILD_OPTIONS OPT
ENV TF_BUILD_DISABLE_GCP 1
ENV TF_BUILD_ENABLE_XLA 0
ENV TF_BUILD_PYTHON_VERSION PYTHON3
ENV TF_BUILD_IS_OPT OPT
ENV TF_BUILD_IS_PIP PIP

# Other Parameters
ENV CC_OPT_FLAGS -mavx -mavx2 -msse4.1 -msse4.2 -mfma
ENV TF_NEED_GCP 0
ENV TF_NEED_HDFS 0
ENV TF_NEED_JEMALLOC 1
ENV TF_NEED_OPENCL 0
ENV TF_CUDA_CLANG 0
ENV TF_NEED_MKL 0
ENV TF_ENABLE_XLA 0
ENV TF_NEED_AWS 0
ENV TF_NEED_KAFKA 0
ENV TF_NEED_NGRAPH 0
ENV TF_DOWNLOAD_CLANG 0
ENV TF_NEED_TENSORRT 0
ENV TF_NEED_GDR 0
ENV TF_NEED_VERBS 0
ENV TF_NEED_OPENCL_SYCL 0

# << END Configure Tensorflow Build

# >> START Configure Bazel

# Running bazel inside a `docker build` command causes trouble, cf:
#   https://github.com/bazelbuild/bazel/issues/134
# The easiest solution is to set up a bazelrc file forcing --batch.
RUN echo "startup --batch" >>/etc/bazel.bazelrc
# Similarly, we need to workaround sandboxing issues:
#   https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
    >>/etc/bazel.bazelrc

# << END Configure Bazel

WORKDIR /

RUN git clone $DEEPSPEECH_REPO
WORKDIR /DeepSpeech
RUN git checkout $DEEPSPEECH_SHA

# Link DeepSpeech native_client libs to tf folder
RUN ln -s /DeepSpeech/native_client /tensorflow

# >> START Build and bind

WORKDIR /tensorflow

# Fix for not found script https://github.com/tensorflow/tensorflow/issues/471
RUN ./configure

# Using CPU optimizations:
# -mtune=generic -march=x86-64 -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx.
# Adding --config=cuda flag to build using CUDA.

# Passing LD_LIBRARY_PATH is required because Bazel doesn't pick it up from the environment

# Build DeepSpeech
RUN bazel build \
	--workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" \
	--config=monolithic \
	--config=cuda \
	-c opt \
	--copt=-O3 \
	--copt="-D_GLIBCXX_USE_CXX11_ABI=0" \
	--copt=-mtune=generic \
	--copt=-march=x86-64 \
	--copt=-msse \
	--copt=-msse2 \
	--copt=-msse3 \
	--copt=-msse4.1 \
	--copt=-msse4.2 \
	--copt=-mavx \
	--copt=-fvisibility=hidden \
	//native_client:libdeepspeech.so \
	--verbose_failures \
	--action_env=LD_LIBRARY_PATH=${LD_LIBRARY_PATH}

# Copy built libs to /DeepSpeech/native_client
RUN cp /tensorflow/bazel-bin/native_client/libdeepspeech.so /DeepSpeech/native_client/

# Build client.cc and install Python client and decoder bindings
ENV TFDIR /tensorflow

RUN nproc

# Have to upgrade this because of upgrading python to 3.7 above
# FIXME: move this to above
RUN pip3 install --upgrade numpy==1.19.0

WORKDIR /DeepSpeech/native_client
RUN make NUM_PROCESSES=$(nproc) deepspeech

WORKDIR /DeepSpeech
RUN cd native_client/python && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/python/dist/*.whl

RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/ctcdecode/dist/*.whl

# << END Build and bind

# Allow Python printing utf-8
ENV PYTHONIOENCODING UTF-8

# Build KenLM in /DeepSpeech/native_client/kenlm folder
WORKDIR /DeepSpeech/native_client
RUN rm -rf kenlm && \
	git clone https://github.com/kpu/kenlm && \
	cd kenlm && \
	git checkout 87e85e66c99ceff1fab2500a7c60c01da7315eec && \
	mkdir -p build && \
	cd build && \
	cmake .. && \
	make -j $(nproc)

# Done
WORKDIR /DeepSpeech

lissyx · June 29, 2020, 12:33pm

utunga:

Invalid argument: No OpKernel was registered to support Op 'Minimum' used by {{node Minimum}}with these attrs: [T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  <no registered kernels>

You are missing something at build time. This might be because of your changes.

utunga · June 29, 2020, 12:40pm

To me it seems likely it’s this thing…

Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.

FWIW, I’ve tried setting this env variable in the Dockerfile.build before building… but I’m still getting the same error.

ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7
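Setting the ENV only helps if ./configure actually picks it up. In the earlier bazel info output in this thread, the value shows up as an --action_env entry read from .tf_configure.bazelrc, so one sanity check is to read it back from that file after ./configure has run. A sketch, with a hypothetical function name, assuming the entry has the form TF_CUDA_COMPUTE_CAPABILITIES=<value>:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the compute capabilities that ./configure
# recorded in the given .tf_configure.bazelrc (quotes stripped).
recorded_capabilities() {
    grep -o 'TF_CUDA_COMPUTE_CAPABILITIES=[^ ]*' "$1" | cut -d= -f2 | tr -d '"'
}

# Usage, from the TensorFlow checkout after ./configure:
#   recorded_capabilities .tf_configure.bazelrc
```

If the printed value differs from 3.7, the build is not using the value set in the Dockerfile.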

Guess I’ll build again without setting the custom DS repo just to get a clean error.