Balancing most common words vs. number of sentences

Is there a methodology to identify this per language, or is this something we’ll have to get from an external source?