Trying to create a 'Common Voice overview' for newbies, SVG drawin' style

Well,

For those who are annoyed by my persistent will to improve
A/ my understanding of the project,
B/ the tools, and
C/ the associated documentation,

I’m sorry to say: Oops! I did it again. But not Britney’s style. More Corben Dallas style. I didn’t make it to the fog, but at least I tried.

Please review this with the link below, and tell me if it’s dope or nope.

file is located at:

Feel free to send :cupid: or rotten :tomato: to tell me what you think!

Nice diagram :slight_smile: But IMHO:

  • This is not simplified for the common user :slight_smile: It might be useful for those who are starting out in coding or working on AI. You might need to specify the target audience (or the different targets) beforehand, for UX.
  • All repos under commonvoice should be there and can be colored as repos, e.g. cvdataset repo has summaries of each dataset per release. The release process is more complex with human reviews and code additions and such and needs input from staff.
  • The whole “add your language” workflow, including Pontoon, is not there; you might need to check the About page and the Playbook…
  • Simpler stuff and details seem to be mixed here…

Perhaps it is not a good idea to put all of this into one graph; better to divide it into workflows, much like the About page/Playbook.

Thanks for keeping an eye on the documentation! Documentation usually doesn’t get the love it deserves, so any effort here is appreciated. Generally I agree with Bülent that this is not an “oversimplified” view and like his suggestion to split it up by workflow. Alternatively I could see benefit in a very simple diagram, but no subprocesses described. Basically Sentence Collector/Sentence Extractor/Bulk Upload -> Common Voice DB -> Recording/Validating on CV website -> Dataset without the subprocesses.

I would also like to avoid more duplication of diagrams, mostly when it comes to the Sentence Collector as we already have diagrams and explanations here: https://github.com/common-voice/sentence-collector#detailed-flow

The Sentence Extractor also has quite extensive documentation here: https://github.com/Common-Voice/cv-sentence-extractor#common-voice-sentence-extractor . For this specifically, I can see that it’s not clear whether the Sentence Extractor does cleanup or not. It currently does not; it’s all based on the rules file. Essentially it validates every sentence, and if a sentence does not fulfill the rules, it tries the next one. Of course only up to the per-article limit, which is 3 sentences in the case of Wikipedia, for example. A high-level diagram for this flow might be beneficial within that README, but I’d argue it should be very high-level and only include the steps that are absolutely necessary. I can give this a shot tomorrow!
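
To make that flow a bit more concrete, here is a minimal Python sketch. It is not the Sentence Extractor's actual code; the `Rules` fields, `is_valid` and `pick_sentences` are names made up for illustration. Only the logic comes from the description above: validate each sentence as-is (no cleanup), skip it if it fails, and stop at the per-article limit.

```python
import re
from dataclasses import dataclass


@dataclass
class Rules:
    # Hypothetical rule set; the real rules file has its own format and keys.
    min_words: int = 3
    max_words: int = 14
    disallowed: str = r"[0-9@#<>\[\]]"


def is_valid(sentence: str, rules: Rules) -> bool:
    """Return True only if the untouched sentence satisfies every rule."""
    words = sentence.split()
    if not (rules.min_words <= len(words) <= rules.max_words):
        return False
    return re.search(rules.disallowed, sentence) is None


def pick_sentences(article_sentences, rules, limit=3):
    """Keep valid sentences unchanged, skip invalid ones, stop at the limit."""
    picked = []
    for sentence in article_sentences:
        if len(picked) >= limit:
            break
        if is_valid(sentence, rules):
            picked.append(sentence)  # no cleanup: the sentence is kept as-is
    return picked


article = [
    "Too short.",
    "The river flows quietly through the old town.",
    "It was built in 1850 by local workers.",
    "Many visitors come here every single summer.",
    "The bridge is popular with photographers too.",
    "A fourth valid sentence should never be picked.",
]
picked = pick_sentences(article, Rules())
```

In this toy run the first sentence fails the word count, the third fails because of the digits, and the last one is never reached because the limit of 3 has already been hit.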

In the end, as processes and code change, I would also like to avoid forgetting to update the documentation. Clearly that would happen more often the more diagrams we have, and the more detailed diagrams or explanations are spread across different places. Therefore the closer the diagram and explanations live to the actual process, the easier it is to not forget to update them.

In my opinion, the medium for documentation usually defines the structure. When the second version of the Community Playbook came up for review, I suggested using readthedocs.io to host the whole documentation; it supports GitHub linking and localization.

Most of the Mozilla projects are/were there already, e.g. https://deepspeech.readthedocs.io/en/master/

In such an environment, one can divide it into different user levels and let the language communities give better examples/directions (e.g. replacing the dinosaurs in CV), localized to their needs.

This is one big step though, which would need a large team from the locales and many man-hours.

I’m with @mkohler to keep specifics of one repo in that repo. Repositories have different people/teams who make regular changes.

But I feel the need for a high-level graph of how they all fit together, with each box pointing to the individual repos.

Workflows towards different target audiences, like:

  • General public: Contribution
  • New teams: New language addition
  • Teams: How to analyze/make better (from sentence selection to gender equality)
  • Technical (modelers/coders): A simplified version of all workflows with repos, inc. release

Although many repos also have good docs, I find myself decoding the “re” in code lines to understand what it really does…

Maybe we can start a topic on “how we can compile a new documentation”?

I also don’t like how it is on the current CV web. About is too much to read, the most basic (and the most important) stuff in “contribution criteria” is very deep and not recognizable, etc.

I’d like that, as long as it’s high-level and linking to details, I’d support that for sure!

I think we don’t need a new topic for this, but rather can just keep going here?

Added here: https://github.com/Common-Voice/cv-sentence-extractor#flow

@mkohler, very nice :slight_smile: Perhaps some idea of what “manually” means? By whose hand, I mean :slight_smile:

Well, the weekend is gone, but I see you did well! I looked at your diagram, many thanks, it makes it more readable!

I totally agree on a “level 1 overview diagram” in the higher repo and a “deeper diagram” at the lower levels. As I was doing my diagram, I was thinking that the very simple diagram for the Common Voice website, with :studio_microphone: microphone and :loud_sound: sound icons, was as simple as needed, and that you’d say that :smile_cat:

I’ll try to simplify as you proposed, and I think we need to go further on diagrams in sub repos.

…Let’s start ! (:construction: work in progress :construction: )

@bozden Hey Bülent :raising_hand_man:, please accept my apology for not catching all your proposals upfront…

I’ll go back through your message and ask :smiley_cat: …I apologize if my questions come across as rude; that’s not the intent. I try to stick to the PROS/CONS, and of course I’m not trying to hurt/offend/[put whatever hurtful action] you.

In my opinion, the medium for documentation usually defines the structure. When the second version of the Community Playbook came up for review, I suggested using readthedocs.io to host the whole documentation; it supports GitHub linking and localization.

Could you please elaborate on “In my opinion, the medium for documentation usually defines the structure”? Do you mean that the tool we use shapes the way we build the doc, more than any design decision does? And, as I think you have much more experience than me in this matter, what are the consequences for project(s)? (either for any project or for this project, whichever is the easiest answer for you :slight_smile:)

For example, all .md files would be removed and all available in ONE place, with auto table and logical structure, instead of having to go inside each and every repo, looking for the needle in the haystack ? (No, I don’t speak from personal experience :sweat_smile:)

Most of the Mozilla projects are/were there already, e.g. https://deepspeech.readthedocs.io/en/master/

In such an environment, one can divide it into different user levels and let the language communities give better examples/directions (e.g. replacing the dinosaurs in CV), localized to their needs.

When you say “one can divide it into different user levels”, do you mean that the structure of the help files can be [more] easily divided [than in a raw .MD file]? By this, I mean that it’s not automatic; it is still a human who decides on the layout?

Indeed, I feel that it could be interesting, but I didn’t find proof or evidence that it’s better… I mean, I’m not used to this tool, so I don’t grasp the ‘wonderful feature that makes it a killer app’.

On the other hand, to anticipate the counter-argument: is it a good idea to add another player (the external website) for maintenance and documentation purposes?

This is one big step though, which would need a large team from the locales and many man-hours.

I’m with @mkohler to keep specifics of one repo in that repo. Repositories have different people/teams who make regular changes.

Isn’t that an argument against an external doc repo/website, and thus for having the docs maintained ‘inside the repo’? Don’t get me wrong, I agree with keeping specific things in the deeper repos… But again, will the community/developers be willing to go to a separate website to read and/or write documentation? …IMHO, there must be a good incentive to get people to switch… (…and that’s why I’m asking all this! To understand!)

But I feel the need for a high-level graph of how they all fit together, with each box pointing to the individual repos.

OK with that ! :slight_smile:

Workflows towards different target audiences, like:

  • General public: Contribution
  • New teams: New language addition
  • Teams: How to analyze/make better (from sentence selection to gender equality)
  • Technical (modelers/coders): A simplified version of all workflows with repos, inc. release

I don’t get it. I mean, I understand the idea (different target = different explanation), but not how to implement it in a README.md file. …At least, as it is currently built. But I think it’s a good idea, so how do we do it?

Although many repos also have good docs, I find myself decoding the “re” in code lines to understand what it really does…

Again, I don’t get it… what’s the re ?

EDIT: after re-reading for the [choose N, with N tending to infinity]th time: do you mean the REmarks in the code? If that’s the point, I don’t understand how http://readthedocs.io/ generates auto-documentation from code… :thinking: If it does that at all?!

Maybe we can start a topic on “how we can compile a new documentation”?

I also don’t like how it is on the current CV web. About is too much to read, the most basic (and the most important) stuff in “contribution criteria” is very deep and not recognizable, etc.

I agree on that… But (shots fired) I think the current README.md files are worse :sweat_smile: :firecracker::boom::face_with_head_bandage: More seriously, that’s why I’m trying hard to build something more newbie-friendly, because writing code is not the only way to help. And so, I want to help :mechanical_arm:.

Thanks for your time to :bulb: :man_teacher: help me to catch it !

Updated version…

File is still located at : https://github.com/CapitainFlam/common-voice/blob/main/docs/Process%20overview.svg

No need to apologize, this is a discussion area :slight_smile:

I was merely saying these:

Each medium has its limits and a different set of audiences. In each medium, you try to give the information you (the developer) find necessary. This is usually more than the end user requires. Therefore, from the very start of SW Engineering practice, and even for any appliance you buy for your home, there are multiple levels of documentation: a “Quick Start Guide”, a “User Manual”, a more technical one, sometimes a service manual, etc.

So a multi-level approach is required for multiple user requirements and/or knowledge levels. Some info can be replicated/rephrased, but you can put “for xxx please refer to yyy for more info” etc., which is done here.

The mediums are not designed to be documentation specific. E.g. if you design a mobile app and put a “?” button for help, you cannot put the whole manual, it should fit the screen/popup/whatever. And if it is multi-lingual, it becomes harder.

That is in addition to the fact that “people don’t read anymore”. The whole UX field was born from that, trying to give users intuitive layouts, icons, actions, etc. so that they do not have to read and can just start using. That is very good for Common Voice, but it has its own consequences: people do not read, so they do not understand how important their demographic data is for Voice AI development and therefore do not create a profile or stay logged in; they don’t know that their mic cable has problems; they don’t know what should not be validated; etc.

Same is happening here. CV frontend is React, so it finally uses bare HTML with div tags and text. It uses Pontoon for translations, which is based on sentence by sentence translations, which in turn is not appropriate for the work we are doing, which should be deeply localized and adapted if needed.

For example, the “contribution criteria” in English gives dinosaur examples and how English shortening works (“they’ve been” like stuff), which is not valid for any other locale and cannot be translated but should be converted/localized. But there are (say) 3 lines which can be converted, but you might need 5 examples for your locale. So the medium puts limits.

To overcome this, we tried these:

etc… These all could be incorporated in a multi-lingual, locale team managed documentation tool, such as the one I mentioned. They are made for these purposes.

For the levels, just check any good documentation for main headings: Introduction, How-To guides, FAQ’s, Basic Usage, Advanced Usage, Technical Details etc…

It refers to “Regular Expressions” in many languages. The cleanup procedures we’ve been talking about are mainly based on them, and sometimes they are hard to decode for humans and very error prone when coding.
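
A tiny Python illustration of what I mean. The patterns and the `clean` helper are invented for this post and are not taken from any Common Voice repo; they only show the kind of cleanup rule that is compact to write but not obvious to read.

```python
import re

# Hypothetical cleanup patterns, for illustration only.
REFERENCE = re.compile(r"\[\d+\]")  # bracketed reference markers such as "[12]"
SPACES = re.compile(r"\s+")         # any run of whitespace


def clean(text: str) -> str:
    """Drop reference markers, then collapse whitespace to single spaces."""
    text = REFERENCE.sub("", text)
    return SPACES.sub(" ", text).strip()


cleaned = clean("The  city[3] is   old.")
```

Even in this small case you have to know that `\[` is a literal bracket, `\d+` is one or more digits and `\s+` is a run of whitespace before you can tell what the code really does.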

They don’t, you can keep a documentation repo on github and link to it. E.g. each tool can have a doc repo and that would go into appropriate sub-topic in the documentation.

As an example check the Coqui STT repo, there are no DOCs in the repo, just a link:


And here it is:
https://stt.readthedocs.io/en/latest/

RFC (Request For Comments) update before official Pull Request in Main REPO

Bülent, @bozden

I wanted to thank you for both links

(I didn’t go through Facebook for “Facebook personal refusal” reasons).

Thanks to Google Translate, I’ve been able to read your texts, and I watched your very interesting video Common Voice Türkçe - FOSDEM’22 Sunumu, which is in English and is very helpful for understanding everything quite deeply. The programming part for Google Colab is clearly out of my league, but your analysis of the datasets is very interesting; it’s based on real data, and it allows me to understand some other technical discussions I saw in the CommonVoice-fr repo / Matrix / Discourse :slight_smile:

So thank you for this. Your video deserves to be seen more widely!

Thanks, these are just compilations from what I learned from great people and resources here.

I’m currently working on a set of automated utilities to help language communities analyze their datasets and create better models. I’m tired of using Excel, with 3 month releases :slight_smile:

Two things can be interesting for the current discussion:

As you can see (especially from the Discourse post) there are too many things which come into play, and that results in long texts/explanations. Nobody has so much time, thus TL;DR’s are a necessity.

Secondly, there are two lightly animated PowerPoint graphics in the presentation. The first one is simple and the second one is more detailed because it has many steps. When they are used with an explaining voice and with animation, you can reach the audience. Otherwise, if I showed the whole graph up front, they would be lost.

These might shed some light on your graph and what we were trying to say:

  • We need multiple graphs for different audiences
  • There must be an overall view (e.g. text-corpus => voice-corpus => release)
  • Each process/workflow should have a second level more detailed (not too much) graph (mentioned above)
  • Each module/repo should have a more detailed/technical graph in their repo.

And maybe explain the whole thing through a video :slight_smile:

Btw, your graph has errors in it, hard to explain in a limited time.

I saw your videos AFTER the next graph release… That might explain some mistakes that could have been avoided.

Anyway,

These might shed some light on your graph and what we were trying to say:

  • We need multiple graphs for different audiences
  • There must be an overall view (e.g. text-corpus => voice-corpus => release)
  • Each process/workflow should have a second level more detailed (not too much) graph (mentioned above)
  • Each module/repo should have a more detailed/technical graph in their repo.

I agree on that. It’s hard for me to DO it, rather than ‘just being OK with it’, as I have to simplify the simplification, but I totally agree (even if it’s not shown in the current graph) that we must go further in simplification.

I’ll try again :sweat_smile:

Btw, your graph has errors in it,

:scream::sob:
…just joking, I tried of course to do the correct things, but I still have to go deeper.

Indeed, after I saw your video, I thought that it was reaaaaaly complex, and reaaaaaaaaly needed A/ to be better explained (what I try to do), and B/ me to go deeper to catch all this stuff (what I have to do to do what I try to do :laughing: ).

hard to explain in a limited time.

:joy: No problem! I’ll refactor it anyway. …And I was hoping other people would comment on it. You’re not in charge alone; it has to be multi-people work with multiple inputs.

I’m not in charge of anything around here, I’m just a volunteer trying to help, trying to give back what I’ve learned from here.

/me : image

My idea was to have a “level one” graphic showing the REPOs. That’s why I go “as deep as” this level, to show the different repos :sweat_smile:

…Is it a bad approach? In the MAIN repo, it seems logical to me to explain what the sub-repositories are for.

Maybe we should have ANOTHER, FIRST graphic, along the lines of your presentation and your proposal “There must be an overall view (e.g. text-corpus => voice-corpus => release)”, and BELOW it, this “map of Common Voice repositories”?

Because, on second thought, this graphic (…this map) will never fit this ‘text-corpus => voice-corpus => release for dummies’ explanation.

I’ll go in this way to propose a SECOND graph (the flow ?) (…I’ll copycat your presentation idea)

I’m just a volunteer trying to help, trying to give back what I’ve learned from here.

…and many thanks for that :pray:

update 03 oct 2022
(removing graph, removing manual dataset, moving commonVoice-fr in community animation, first “macro” process to understand flow and adding “this is a MAP for repo” information)