In classification tasks it is common to use the Softmax function, which turns a model's outputs into class probabilities. If the number of classes N is large, computing the gradients becomes a performance bottleneck: with plain Softmax it takes O(N) time per training example. This problem arises, for example, in language modelling and recommender systems.
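
To make the O(N) cost concrete, here is a minimal sketch in plain Python. The function names `softmax` and `cross_entropy_grad` are illustrative and assume a single training example with a list of N raw scores; both the normalizer and the gradient touch every class once.

```python
import math

def softmax(logits):
    """Turn raw model outputs (logits) into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)  # normalizer: one term per class, hence O(N)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    # d loss / d z_i = p_i - 1[i == target]: every class gets a nonzero entry
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
```

With millions of classes, this per-example pass over all N entries is what dominates training time.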

In practice, efficient approximations of Softmax are used, e.g. Sampled Softmax, which draws a small sample of classes and computes the loss over that sample only. The sampling distribution strongly affects the quality of the approximation. Nevertheless, almost all recent applications still use simple sampling distributions, such as uniform, which leads to either poor quality or poor performance.
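
As an illustration, the sketch below shows one common variant of Sampled Softmax in plain Python. It is not necessarily the formulation discussed at the seminar. The function name `sampled_softmax_loss` and its parameters are hypothetical; the sketch assumes that negatives are drawn with replacement from a distribution `q` restricted to the non-target classes, and that each sampled logit is corrected by `log(m * q_i)` so that the sampled normalizer is an importance-sampling estimate of the full one.

```python
import math
import random

def sampled_softmax_loss(logits, target, q, num_samples, rng=random):
    """Sampled softmax loss for one example (illustrative variant).

    logits      -- list of N raw scores (a real system would compute
                   only the scores of the sampled classes)
    target      -- index of the true class
    q           -- sampling distribution over the N classes
    num_samples -- number m of negative classes to draw
    """
    n = len(logits)
    # Restrict the sampling distribution to the negative classes.
    neg_mass = 1.0 - q[target]
    q_neg = [0.0 if i == target else q[i] / neg_mass for i in range(n)]
    negatives = rng.choices(range(n), weights=q_neg, k=num_samples)
    # Correction: exp(z_i - log(m * q_i)) = exp(z_i) / (m * q_i),
    # the importance weight of class i in the estimated normalizer.
    adjusted = [logits[target]] + [
        logits[i] - math.log(num_samples * q_neg[i]) for i in negatives
    ]
    mx = max(adjusted)
    log_norm = mx + math.log(sum(math.exp(a - mx) for a in adjusted))
    return log_norm - adjusted[0]  # cost depends on m, not on N
```

Passing a uniform `q`, e.g. `[1.0 / n] * n`, gives the simple baseline mentioned above. The closer `q` is to the model's own Softmax distribution over the negatives, the lower the variance of the estimated normalizer, which is why the choice of `q` matters.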

At the seminar we will discuss the issues of Sampled Softmax and take a look at a neat recent method for designing the sampling distribution that resolves them.

Speaker: Egor Shcherbin.

Presentation language: Russian.

Date and time: April 10th, 6:30-8:00 pm.

Location: Times, room 204.

Videos from previous seminars are available at http://bit.ly/MLJBSeminars