I suppose this is a stupid suggestion, but would it not be more efficient to offload some of the processing to an FPGA, instead of relying on higher clock speeds and more CPU cores? Parallel processing is fundamental to FPGA design, so at least some parts of voice decoding/encoding might be faster, although I’m not sure it would be cheaper. Verilog and VHDL code can be just as open source as C, and some voice services are probably already implemented in gate arrays. The reason for my interest is that I am using a rather low-throughput STM32F7 CPU with microphones, and voice recognition would be useful. The system is not connected to the internet, but there is space for an FPGA. If this idea is not nonsensical, perhaps someone can point me to an existing implementation, or at least to further information. Sorry if my request is far from Deep Speech interests.
That “just” means designing an FPGA implementation.
Maybe, but nobody on the team has experience in those. Not to mention that, to the best of my knowledge, the FPGA world is not as open as one would like; there is still a lot of proprietary tooling around.
I’d love having time to play around with that, but:
- we don’t have the bandwidth as of now
- we can’t enforce FPGA, as this would break a lot of other use-cases
- this requires a lot of hardware-related analysis / decisions
- we might still require quite a lot of gates
While I’d love to see something like that, we are not aware of any FPGA implementation.
Now, given that we rely on TensorFlow, and I think it supports an intermediate-level instruction set that can be used to target dedicated hardware, maybe someone can come up with something?
Is an FPGA solution for this workload that much more efficient than off-the-shelf GPUs? It seems TensorFlow supports CUDA anyway, and apparently you can run CUDA on some FPGAs.
Not pooh-poohing the idea btw, I’m genuinely interested in learning more.
I have no information on the relative efficiencies of the two approaches… that’s why I was asking. If TensorFlow is already implemented on FPGAs, it may not be faster than a GPU implementation. I was hoping for a design structured around an FPGA, rather than a design structured around a CPU but implemented in an FPGA. As you know, state machines are common elements in both spheres, so I can understand how CPU-to-FPGA conversions might be performed quickly, but at what cost to optimum performance? The question I asked was only half serious.
I could understand the preference for an FPGA over a GPU in some use cases, though.
I’d really be curious to see an FPGA implementation anyway.
When I see https://antmicro.com/blog/2019/12/tflite-in-zephyr-on-litex-vexriscv/ I’m jealous and I’d love to have some time to play with that
Just received an email from Arm regarding the Cortex-M55… it contains hardware for AI-related vector processing. I’m sure this is a trend that cannot be stopped. Not sure of the details, but it is probably a cross between an M4 and a gate array, so performance may be somewhere between the two. I don’t know how its performance compares with a GPU. A GPU can do many operations in parallel, but each operation may not be optimized for AI. The M55 is advertised as having operations optimized for AI, but only one core… at least that is my understanding now.
Basically, what would be interesting is: “can it accelerate LSTM?” Also, the Cortex-M parts you refer to are rather on the microcontroller side; I’m wondering if the size / complexity of our model would not be another issue here.
You are correct… I am not interested in voice processing on a mainframe supercomputer. I am interested in processing “at the edge”, without a connection to the cloud where Google/Amazon (or even i7) computers can do the heavy lifting… so addressing my question concerning FPGAs to Mozilla-related engineers was probably not optimal.
Your message sounds like you think I’m mocking you. That’s not the case. We are interested in the same thing (hint: it runs decently on an RPi4; that can still be considered a “supercomputer” in some contexts, but it’s already helping on the edge). An FPGA implementation is just well above what we can commit to, because we are such a small team. I’m very curious if anyone is interested in that.
I was just analyzing the Cortex-M55 announcement to see how easy it would be to leverage. I’m worried our model is still too complex.
Perhaps I’m misunderstanding, but how is an FPGA superior to a general hardware inference accelerator? I was under the impression that FPGAs were useful for acceleration use-cases that don’t currently have a hardware implementation, but as inference accelerators already exist (and many more are coming to market), surely it’s easier to just use those?
An FPGA allows custom implementations of many things, so you can get much better runtime efficiency than on a generic CPU. So far, general hardware inference accelerators seem to focus only on convolutional networks, whereas our network is recurrent, with LSTM cells. So basically those accelerators can’t help us at all.
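To unpack why recurrence resists the kind of acceleration convolution enjoys: each LSTM time step needs the previous hidden state before it can start, so the time loop is inherently sequential. A minimal single-unit LSTM cell in pure Python (toy weights chosen for illustration, not DeepSpeech’s actual parameters) makes the dependency visible:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate mixes the current input with the PREVIOUS hidden state,
    # so step t cannot begin until step t-1 has finished.
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate
    c = f * c_prev + i * g        # new cell state
    h = o * math.tanh(c)          # new hidden state
    return h, c

# Toy single-unit weights (illustrative values only).
w = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                      "wo", "uo", "bo", "wg", "ug", "bg")}

h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:         # inherently sequential over time steps
    h, c = lstm_step(x, h, c, w)
```

Within one step, the four gate computations are independent and can run in parallel (which is where an FPGA or GPU can still help), but the outer loop over time steps cannot be unrolled across hardware the way a convolution’s spatial positions can.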
I’m just wondering, though, given the complexity of our model, what the constraints on an FPGA would be (the TFLite model is 46MB, and then you still need to get back to the main CPU for the language model).
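To put the 46MB figure in perspective, here is a back-of-envelope capacity check against on-chip block RAM; the BRAM budgets below are illustrative ballpark assumptions, not the specs of any particular device:

```python
# Rough check: can 46 MB of model weights live entirely in on-chip block RAM?
model_bytes = 46 * 1024 * 1024          # 46 MB TFLite model, per the thread

# Illustrative on-chip RAM budgets in MB (assumed ballpark figures).
bram_budgets_mb = {
    "mid-range FPGA": 4,
    "large FPGA": 38,
}

for part, mb in bram_budgets_mb.items():
    fits = model_bytes <= mb * 1024 * 1024
    print(f"{part}: {mb} MB on-chip -> weights fit: {fits}")

# Either way the weights don't fit, so they would have to stream from
# external DDR, shifting the bottleneck from compute to memory bandwidth.
```

So even before any design work, a 46MB model implies external memory traffic on every inference pass, which eats into the efficiency advantage an FPGA would otherwise have.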
I am not an expert in any aspect of voice recognition. I used to be an FPGA designer and implemented complex algorithms that had previously existed only in software on a CPU. I will bet that your algorithm is much, much too large to be implemented on even a multi-hundred-dollar FPGA (or gate array). However, some part of the algorithm could still be implemented in an FPGA, or the algorithm could be modified to take advantage of the parallelism available in the FPGA. I know nothing about voice recognition algorithms, but it is even possible that the use of AI is wasteful of CPU resources.
Given that a GPU can improve execution, it’s pretty certain that something can be done.
Then you likely know much more about FPGAs than we do, so your opinion could be valuable. If you want to explore more, you are welcome.
I’m afraid that I will be of no help to you, so I will not explore more. When I started this discussion, I was looking for an already-complete FPGA design to experiment with, and then share the results with you. I already have an LCD touchscreen solution that works well, but speech recognition is handier because I can be anywhere in a room to control the system. In an earlier attempt to achieve this objective, I used a Sony TV remote control and the system spoke with fixed phrases/numbers/etc. stitched together by the software (using “situational awareness”, as is standard)… but of course, for that to work, I must be holding the remote control; there is no such limitation for voice. On the other hand, anyone in the building can use a remote control, whereas I would need speaker-independent algorithms for voice recognition (a more expensive/complex system).
I wish you guys every success and goodbye from Canada.