Can the site handle 50K+ users in one day?

April 14th is Georgian Language Day. As the local community, we are planning a wide-scale campaign, starting on Monday, to get 50K+ volunteers to contribute on that day. We are not sure we will reach this goal, but let's see.

I just wanted to check whether the site could handle such traffic.

Hey @Razmik-Badalyan, I hope everything goes well. I have no idea about the capacity of the backend, but I remember we had problems during the last global campaign two years ago, though that was on Amazon AWS.

But I can think of one additional possible problem:

Common Voice uses a caching mechanism (Redis) to feed the sentences, and the set is not re-randomized once it is cached. I have reported several issues regarding validation (see 1, 2). At the time of posting those issues I thought the culprit was in the dataset queries, but after checking, I realized it is the implemented “lazyCache”.

It works roughly like this (a code sketch follows the list):

  • The system caches N sentences (say a maximum of 50) every T seconds (say 60 sec = 1 minute) and feeds them to users as-is when they visit the website; the data is loaded into the browser and used from there until the whole set is consumed, then a new set is requested.
  • Say 100 people start recording within that minute: they will all get the same sentences in the same order. If recording a sentence takes 10 seconds, each of them records 6 sentences, resulting in the same 6 sentences being recorded 100 times.
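
Something like this minimal sketch, i.e. my reading of the behaviour rather than the actual Common Voice code (the function and query names are hypothetical):

```typescript
// A minimal sketch of how a time-based "lazy cache" hands the same ordered
// batch to every client that arrives within the same window.
type Sentence = { id: string; text: string };

const CACHE_SIZE = 50;       // "N sentences, say max 50"
const CACHE_TTL_MS = 60_000; // "every T seconds, say 60 sec"

let cached: Sentence[] = [];
let cachedAt = 0;

// Stand-in for the real database query (assumed, for illustration only).
async function querySentencesFromDb(limit: number): Promise<Sentence[]> {
  return []; // e.g. SELECT id, text FROM sentences ... LIMIT ?
}

async function getSentenceBatch(): Promise<Sentence[]> {
  const now = Date.now();
  if (cached.length === 0 || now - cachedAt > CACHE_TTL_MS) {
    cached = await querySentencesFromDb(CACHE_SIZE);
    cachedAt = now;
  }
  // Every visitor within the TTL window receives this same array, in the same
  // order, so concurrent users end up working on the same items.
  return cached;
}
```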

I’m not sure whether this mechanism applies to recording (the related code is too distributed to follow quickly), but I know it applies to validation. Perhaps the team can shed some light on this (@jesslynnrose).

In any case, I think distributing the load over time would be best, perhaps a “Language Week”?

Some more thoughts

Validation: You should also think about validating the recordings. If 50,000 people record 100 sentences on average, you will have 5 million recordings to validate. At the minimum (a single 5-sentence batch per person) you will have 250k. It will take many man-hours to validate them if it is done by a few people (roughly 1,000 to 25,000 man-hours).
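
A quick back-of-envelope version of that estimate; the per-clip time is my assumption (roughly 15–18 seconds of listening per clip), not a measured figure:

```typescript
// Back-of-envelope check of the validation workload.
const volunteers = 50_000;

const maxClips = volunteers * 100; // everyone records 100 sentences -> 5,000,000
const minClips = volunteers * 5;   // everyone records one 5-sentence batch -> 250,000

const hours = (clips: number, secPerClip: number) => (clips * secPerClip) / 3600;

console.log(hours(minClips, 15)); // ~1,042 h  -> the "min 1,000" figure
console.log(hours(maxClips, 18)); // 25,000 h  -> the "max 25,000" figure
```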

Maybe saying “record N, validate 2*N” is a better goal for the campaign.

Text-corpus: I can see that you have about 50k unrecorded sentences (see the text corpus tab here), so these will also be consumed rapidly and recorded multiple times. The existing text corpus of ~77k sentences will also be re-recorded many more times.

Maybe some people should also write new sentences and validate them?

Support: People will need support and have questions, and you will need a way to reach them in case something is not right. So you need a medium, a support channel, possibly an IM like Telegram.

These are just some ideas…

I really hope everything goes well; I’m happy for you (and a bit envious :slight_smile:).


An interesting engineering problem, so here are some rough calculations (also sketched in code after the second list below). Assuming:

  • 50,000 people (= devices) will record
  • Each records 100 sentences on average, i.e. 20 × 5-sentence batches per person
  • Each batch of 5 sentences takes about a minute to record, so one person posts a batch each minute (×20)
  • 5 sec/recording, which results (estimated from file sizes) in ~250 KB of upload per batch
  • The contributions are evenly distributed over a 10-hour day.

So:

  • There will be 1M 5-sentence batches uploaded (50,000 × (100 / 5))
  • 100,000 uploads per hour, ~1,667 uploads per minute
  • Database: that would result in 5M insert operations, ~139 per second, plus update operations for statistics etc.
  • Total upload per minute will be ~417 MB, or ~7 MB/sec, i.e. roughly a 70 Mbps upload connection
  • In total, ~250 GB of data will be uploaded, at ~7 MB/sec disk writes (that's OK)
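
For anyone who wants to tweak the assumptions, here is the same back-of-envelope calculation as a small snippet (decimal KB/MB/GB, rounded the same way as above; these are the post's assumptions, not measurements):

```typescript
// Rough load estimate for the campaign day.
const people = 50_000;
const sentencesPerPerson = 100;
const batchSize = 5;
const kbPerBatch = 250;      // ~250 KB uploaded per 5-sentence batch
const activeHours = 10;
const secPerRecording = 5;

const batches = (people * sentencesPerPerson) / batchSize;                      // 1,000,000
const batchesPerMinute = batches / (activeHours * 60);                          // ~1,667
const insertsPerSecond = (people * sentencesPerPerson) / (activeHours * 3600);  // ~139

const uploadMBPerMinute = (batchesPerMinute * kbPerBatch) / 1000;               // ~417 MB/min
const uploadMBPerSecond = uploadMBPerMinute / 60;                               // ~7 MB/s
const totalUploadGB = (batches * kbPerBatch) / 1e6;                             // 250 GB
const totalAudioHours = (people * sentencesPerPerson * secPerRecording) / 3600; // ~6,944 (~7,000)

console.log({
  batches,
  batchesPerMinute,
  insertsPerSecond,
  uploadMBPerSecond,
  totalUploadGB,
  totalAudioHours,
});
```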

Engineering-wise:

  • You should account for other users on the site at the same time
  • You should allow for slack: plan to saturate at most 50% of the bandwidth, so that peak hours can be handled (the load will not be evenly distributed)
  • So you end up with a ~200 Mbps upload-speed requirement (2 × 70 Mbps, plus headroom for other users)

My opinion, without knowing the infrastructure/deal:

  • The major problems will be the database and ingress bandwidth; it should be achievable on Google Cloud (see).

You will have ~7,000 hours of audio (5M recordings × 5 sec); you would need nice hardware to train on that :slight_smile:

Hi @bozden, thank you so much for the detailed response. As always, very informative. I will share it with the team, and we will consider distributing this over a week.

50K is shooting for the sky; reality will probably be more moderate.


Thanks so much, both! @Razmik-Badalyan, we’re working with our SRE team now to make sure you have the capacity you need, but there might be a brief slowdown this morning as we make adjustments.
