How to deal with music/film titles in a language other than the speaker's?

Hello everyone,

I was playing with my Alexa assistant (sorry, I have one) and it's difficult to ask it to play a music title when the title is in English (I'm French). An example?
« Alexa, joue-moi le morceau ‘Promises’ » -> "Alexa, play the track 'Promises'"

So I wonder how Common Voice will deal with specific expressions/words, like music or film titles, that are not in the language of the speaker?

Thanks all,

David.

Hello bidouilles,

In my opinion, the goal of Common Voice is to gather datasets for separate languages.

I think being able to identify the original language of a word should be a feature of the trained model, not an inherent feature of the dataset.

I don't think we should have French sentences containing English, Italian, Spanish, German, Polish, Russian, Chinese, etc. expressions, other than city names and the like.

It is up to the AI to have a global knowledge of speech that covers all the learnt languages, and then to be able, within a single sentence, to figure out which language each word is spoken in.

Otherwise, every word of every language would have to be in the corpus of each language so that it could be learned by every language-dependent AI.

In short, I think the problem does not come from the way we gather data, but from how we model our systems.

This is only my opinion but I am very interested in what others may think about this issue.

Luc.

I don't share your opinion, Luc, because it depends a lot on the AI model. I think mixing different languages in one sentence is hard whatever your architecture choice, and you cannot detect the language of each word.
It's not so hard if you limit your learnt corpus to specific expressions, a bit like when you program a chatbot: you have to define all the specific expressions that anyone might use.
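
To illustrate that chatbot-style approach, here is a minimal Python sketch; the pattern and the list of titles are made up for the example, not taken from any real assistant:

```python
import re

# Chatbot-style grammar: enumerate the expressions and titles up front
# instead of recognising arbitrary foreign words. Pattern and titles
# below are hypothetical, for illustration only.
KNOWN_TITLES = {"promises", "la vie en rose"}
PATTERN = re.compile(r"joue[- ]moi le morceau (.+)", re.IGNORECASE)

def match_play_command(utterance):
    """Return the requested title if the utterance matches a known pattern."""
    m = PATTERN.search(utterance)
    if m:
        title = m.group(1).strip(" '\"«»").lower()
        if title in KNOWN_TITLES:
            return title
    return None

print(match_play_command("Alexa joue-moi le morceau 'Promises'"))  # -> promises
```

The point is that the English title never needs to be "understood" as English: it is just one of the literal strings the system was told to expect.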

So you suggest adding the vocabulary of every language to the French corpus in order to be able to understand every word of every language?

As of today, Speech Recognition engines are trained on independent languages. In my opinion, the problem does not come from the fact that French sentences lack foreign expressions, but from the fact that the current AIs we have developed don't have a global understanding of speech.

I think the goal of Common Voice is to have independent datasets for each language, and if an AI developer wants an AI able to understand English expressions inside French sentences, then it is up to them to develop a solution that combines an English model with a French one.
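
As a rough illustration of what "combining an English model with a French one" could look like, here is a sketch; the two models and their `transcribe()` method are hypothetical placeholders, not a real API:

```python
def pick_per_segment(segments, fr_model, en_model):
    """Decode each audio segment with both monolingual models and keep
    the more confident hypothesis for each segment."""
    words = []
    for seg in segments:
        fr_text, fr_conf = fr_model.transcribe(seg)  # e.g. ("promesses", 0.41)
        en_text, en_conf = en_model.transcribe(seg)  # e.g. ("promises", 0.93)
        words.append(fr_text if fr_conf >= en_conf else en_text)
    return " ".join(words)
```

A real system would be more subtle than per-segment confidence voting, but the idea is the same: two independent models, combined at decoding time.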

The way we collect data today should not be biased by the limitations of today's state of the art.

The problem I have with adding foreign expressions to French is that Common Voice tries to gather datasets specific to each language; adding foreign expressions to the sentences would cause those foreign words to be added to the French vocabulary as if they were French, and they are not.

So at the end of the day, we would be trying to have independent datasets while at the same time creating a global vocabulary that includes every language, instead of a specifically French vocabulary.

Foreign expressions should keep being treated as foreign expressions.

The ideal we should aim for, in my opinion, is to have independent vocabularies and then merge them at training time if you need to be able to understand mixed sentences.
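
A minimal sketch of "independent vocabularies, merged at training time"; the file names and the tagging scheme are my own assumptions for illustration, not part of Common Voice:

```python
def load_vocab(path):
    """Read one word per line from a hypothetical per-language word list."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

fr_vocab = load_vocab("fr/vocab.txt")  # assumed layout, one list per language
en_vocab = load_vocab("en/vocab.txt")

# Tag each entry with its source language, so that "promises" stays an
# English word even inside the merged lexicon used for mixed sentences.
merged = {(w, "fr") for w in fr_vocab} | {(w, "en") for w in en_vocab}
```

This keeps each dataset clean and pushes the mixing entirely to training time.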

What I am trying to say is that, yes, in real life we often mix languages within one sentence; there are no clear boundaries between languages. But the way we model speech today does have boundaries, and that's why we have so much trouble with foreign expressions.

The problem does not come from the data, but from the way we model speech.

Still, I understand that others may have a different opinion on this issue.

I didn't propose anything; I just said "think about how to solve this issue", and apparently Amazon has not solved this problem!
And this is not really about the French language but rather about English, because English is everywhere. And many (most) people don't speak English with a native accent.
I don't have the answer.
When you say "The problem does not come from the data, but from the way we model speech", are you talking about the AI model?

I understand what you mean by "people don't speak with the accent".
I would say that in an ideal world, we would have people with every accent for every language on Common Voice.

For example, it is also important that people with English, Spanish, etc. accents record the French sentences. That way your AI would be able to understand people speaking French even with a foreign accent.

I think the solution for understanding English expressions spoken with a French accent would be to have French people contribute to the English dataset of Common Voice, rather than adding English expressions to the French dataset.
A lot of people, like myself, used to do exactly that before Common Voice started gathering a French dataset.

By working this way we would still have fully independent datasets, each with a variety of accents.

When I say "The problem does not come from the data, but from the way we model speech", I indeed mean that, in my opinion, the problem comes from the architectures of our neural networks.
We currently have great solutions for speech recognition, but not being able to go cross-language within the same sentence is a real limitation of today's approach.

I think the state of the art in Speech Recognition today is to have one model per language, but that will probably change in the future.

So English expressions with a French accent should go into the English dataset, and if you want an AI to understand English expressions inside French sentences, then I don't think you need a specific cross-language dataset; you would instead train a model that understands both French and English and knows how to recognise each.

That way the model would be able to recognise a mixed sentence, even if the speaker says French and English expressions with a Japanese accent.
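
In concrete terms, the starting point could be as simple as concatenating two independent Common Voice exports into one bilingual training table. The directory layout below is the usual per-language release, and the "path"/"sentence" columns match the Common Voice TSVs, but double-check them against the release you actually download:

```python
import pandas as pd

# Hypothetical local paths to two independently downloaded releases.
fr = pd.read_csv("cv-corpus/fr/validated.tsv", sep="\t")
en = pd.read_csv("cv-corpus/en/validated.tsv", sep="\t")

# Keep the language label per clip: the datasets stay independent on
# disk, and only the training table mixes them.
fr["locale"] = "fr"
en["locale"] = "en"

bilingual = pd.concat(
    [fr[["path", "sentence", "locale"]], en[["path", "sentence", "locale"]]],
    ignore_index=True,
)
bilingual.to_csv("bilingual_train.tsv", sep="\t", index=False)
```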

Thank you to all who answered; these were the answers I was looking for.