I see that CoVoST sentences were added for Swedish last week and are now shown in Common Voice. Were they checked in any way to see whether they conform to the standard we have for manually added sentences? At first glance their quality looks worse than that of the Europarl sentences I have prepared (but still haven’t gotten around to getting anyone else to quality check for me).
Right now I get the impression that I wouldn’t finish a set of five sentences without happening upon at least one problematic sentence, either because of incorrect grammar or because of a lot of words that would be pronounced in English.
I looked into this because I got the following sentence to listen to:
Vissa skärmdumpar av onlinedokument sk. thumbshots förbättras med informationsikoner eller korta textsammanfattningar.
which has the abbreviation “sk.”, which would be read out as “så kallade”, and the word “thumbshots”, which would be pronounced in English and not Swedish.
There are other sentences that contain even more words that would typically be read out with English pronunciation instead of Swedish, such as:
Hertigen är sponsor till The Northumberland Church of England Academy. (6/10 words in English)
Harmonic society choral (3/3 words in English, and not a real sentence either)
Mark Twain National Forest innehåller ytterligare allmän mark som bl.a. Bell Mountain Wilderness. (7/13 words, also note the abbreviation “bl.a.”)
Wyoming Women's Center, som är en del av Wyoming Department of Corrections, ligger i Lusk. (7/15 words)
“Källa: China Social Science Network" (4/5 words, also using quotes unnecessarily)
10556 of the 263556 lines in the file (about 4%) contain quotes, which looks a bit high to my eyes. (In addition to that, 8844 lines contain “typographical” quotes which haven’t been normalized to the other type.)
Caesar sallad hör inte hemma på en pizza även om den är vegansk. (all words could be pronounced in Swedish but the first two words should be written together or at least hyphenated to be correct Swedish)
Begäran filtreras av umask. (umask is probably the Unix command here, and not something we can expect users to even begin to pronounce)
Cryer stödde ett flertal vänsterpolitiska ändamål och han var även EU-skeptisk. (EU-skeptisk here contains the acronym EU which I don’t think Sentence Collector would allow?)
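For anyone who wants to reproduce quote counts like the ones mentioned above, here is a minimal Python sketch. It assumes one sentence per line; the set of characters treated as "typographical" quotes and the helper names are my assumptions, not necessarily what produced the numbers in this post.

```python
# Count how many lines contain straight double quotes and how many
# contain typographic quotes. Apostrophes (as in "Women's") are
# deliberately not counted as quotes.
STRAIGHT_QUOTES = {'"'}
TYPOGRAPHIC_QUOTES = {"“", "”", "„", "«", "»"}  # assumed set, adjust as needed


def count_quote_lines(lines):
    """Return (lines with straight quotes, lines with typographic quotes)."""
    straight = 0
    typographic = 0
    for line in lines:
        if any(ch in STRAIGHT_QUOTES for ch in line):
            straight += 1
        if any(ch in TYPOGRAPHIC_QUOTES for ch in line):
            typographic += 1
    return straight, typographic


def count_quote_lines_in_file(path):
    """Same count for a UTF-8 text file with one sentence per line."""
    with open(path, encoding="utf-8") as handle:
        return count_quote_lines(handle)


# Small in-memory demo using sentences quoted in this post.
sample = [
    '“Källa: China Social Science Network"',
    "Begäran filtreras av umask.",
    "Wyoming Women's Center, som är en del av Wyoming Department of Corrections, ligger i Lusk.",
]
straight_count, typographic_count = count_quote_lines(sample)
print(straight_count, typographic_count)
```

A line with mixed quote styles, like the “Källa” example, is counted in both totals, so the two numbers are not mutually exclusive.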
Looking for the most common Swedish abbreviations, I found:
37 with “bl.a.”
11 with “osv.”
15 with “etc.”
24 with “s.k.” or “sk.”
124 with “t.ex.”
10 with “pga.” or “p.g.a.”
So there are not too many sentences with abbreviations, but it is something that a native speaker would have checked before adding them.
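A check like this is easy to automate before sentences are imported. The sketch below counts lines containing each abbreviation; the regular expressions, the case-insensitive matching, and the helper names are assumptions for illustration, so the totals may differ slightly from the counts listed above.

```python
import re
from collections import Counter

# Each label maps to a pattern. The (?<!\w) lookbehind keeps "sk." from
# matching at the end of an ordinary word such as "disk.".
ABBREVIATION_PATTERNS = {
    "bl.a.": re.compile(r"(?<!\w)bl\.a\.", re.IGNORECASE),
    "osv.": re.compile(r"(?<!\w)osv\.", re.IGNORECASE),
    "etc.": re.compile(r"(?<!\w)etc\.", re.IGNORECASE),
    "s.k./sk.": re.compile(r"(?<!\w)(?:s\.k\.|sk\.)", re.IGNORECASE),
    "t.ex.": re.compile(r"(?<!\w)t\.ex\.", re.IGNORECASE),
    "pga./p.g.a.": re.compile(r"(?<!\w)(?:pga\.|p\.g\.a\.)", re.IGNORECASE),
}


def count_abbreviation_lines(lines):
    """Count, per abbreviation, the number of lines that contain it."""
    counts = Counter()
    for line in lines:
        for label, pattern in ABBREVIATION_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Demo on two sentences quoted earlier in this post.
sample = [
    "Mark Twain National Forest innehåller ytterligare allmän mark som bl.a. Bell Mountain Wilderness.",
    "Vissa skärmdumpar av onlinedokument sk. thumbshots förbättras med informationsikoner eller korta textsammanfattningar.",
]
for label, count in count_abbreviation_lines(sample).items():
    print(label, count)
```

The same patterns could also be used to reject or flag sentences instead of just counting them.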
There also seems to be trouble with grammar.
Staden var ursprungligen en etrusk och osker bosättning.
feels like an incorrect automatic translation of “The city was originally an Etruscan-Oscan settlement.” where the two adjectives are wrong.
Ni har gjort själv inte bra av det.
I have trouble even understanding what this is supposed to mean, but it probably needs some words rearranged and some words added to make any sense.