What are Sample DBs?

Congratulations on release 0.7!

I saw that the release notes mention Sample DBs: "Added Sample DBs, a new format for training data that allows for much improved training speeds".

Is there any high-level detail on the format, and perhaps on how to convert an existing dataset to it?

I had a look in the code for Sample DB and found this:


So I may be able to piece together a bit more insight by reading that later on. Is it a project-specific format? (i.e. it's not some standard or formatting convention I should try googling? I couldn't find anything that way so far.)

Thanks again. Looking forward to trying this release out soon.

The Sample DB format is project-specific, yes. It can be produced using the bin/build_sdb.py tool.
Its main purpose/advantage is faster reading of training samples: the data is pre-ordered and thus requires less seeking on classic hard drives. If used with Opus compression, it can also cut the required disk space by a factor of around 16. You can also mix and match Sample DBs with our normal CSV files.
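
To illustrate why pre-ordered data needs less seeking, here is a minimal sketch in Python. It is not the actual Sample DB layout: the length-prefixed toy format, the size-based ordering, and the names write_packed and read_packed are assumptions made up for this example. It only shows the general idea that samples stored back to back in their final order can be read in one forward pass.

```python
import io
import struct


def write_packed(samples, stream):
    """Write samples shortest-first, each preceded by a 4-byte length.

    Hypothetical toy layout, not the real Sample DB format.
    """
    for payload in sorted(samples, key=len):
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_packed(stream):
    """Yield payloads in stored order using one forward pass (no seeking)."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream
        (size,) = struct.unpack("<I", header)
        yield stream.read(size)


# Stand-ins for audio payloads of different sizes.
samples = [b"a" * 5, b"b" * 2, b"c" * 9]

buffer = io.BytesIO()
write_packed(samples, buffer)

buffer.seek(0)  # rewind once, then read strictly forward
lengths = [len(p) for p in read_packed(buffer)]
print(lengths)  # [2, 5, 9]
```

With separate audio files referenced from a CSV, the reader would instead open one file per sample, which on a spinning disk can mean one seek per sample.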

And yes, I should soon add a doc page or section for the format.

Does using Opus slow down training or affect accuracy in any way compared to WAV?

Training speed has not been affected (at least for the setups I tried). I haven’t done a serious accuracy test yet.