In the traditional skip-gram model, neural network weights are updated after processing a single (target word, context word) pair. Batch skip-gram is a variation that improves training efficiency by updating the weights after considering a batch of such pairs.
- The Concept: Instead of immediate updates after every word pair, batch skip-gram accumulates the gradients from multiple (target, context) pairs within a batch. Then, it updates the model's weights in a single step based on the average of those gradients.
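To make the idea concrete, here is a minimal NumPy sketch of one such batched update for skip-gram with negative sampling. The array names (`W_in`, `W_out`), the toy sizes, the uniform negative sampling, and the dense gradient buffers are illustrative simplifications rather than how any particular word2vec implementation works; real code would only touch the embedding rows involved in the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr, k = 1000, 100, 0.025, 5      # toy sizes; k = negative samples per pair

W_in = rng.normal(0.0, 0.01, (vocab_size, dim))   # target-word ("input") embeddings
W_out = np.zeros((vocab_size, dim))               # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_update(W_in, W_out, targets, contexts):
    """Accumulate gradients over every pair in the batch, then apply one averaged update."""
    grad_in = np.zeros_like(W_in)
    grad_out = np.zeros_like(W_out)
    for t, c in zip(targets, contexts):
        # one positive context plus k negatives (uniform sampling here, for simplicity)
        samples = np.concatenate(([c], rng.integers(0, vocab_size, k)))
        labels = np.array([1.0] + [0.0] * k)
        scores = sigmoid(W_out[samples] @ W_in[t])   # predictions for each sample
        err = scores - labels                        # gradient of the log loss w.r.t. the scores
        grad_in[t] += err @ W_out[samples]           # accumulate; no weights are touched yet
        np.add.at(grad_out, samples, np.outer(err, W_in[t]))  # handles repeated sample indices
    # single update for the whole batch, based on the averaged gradients
    W_in -= lr * grad_in / len(targets)
    W_out -= lr * grad_out / len(targets)

# a toy batch of (target, context) index pairs; real pairs come from a corpus window
targets = rng.integers(0, vocab_size, 512)
contexts = rng.integers(0, vocab_size, 512)
batch_update(W_in, W_out, targets, contexts)
```

The key difference from the traditional per-pair loop is that the learning-rate step happens once per batch, on the averaged gradient, rather than inside the inner loop.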
Why Use Batch Skip-gram?
- Faster Training: Batch updates tend to make training faster overall, especially when using hardware acceleration with GPUs, which are optimized for parallel operations on batches of data.
- Stabilized Gradients: Averaging gradients across a batch leads to smoother, less noisy updates, because no single example can pull the weights too far on its own.
How it Works (Simplified)
- Batch Formation: Create a batch of (target word, context word) pairs (a sketch of this step follows the list).
- Calculate Gradients: For each pair in the batch, calculate the prediction error and the resulting gradient for weight updates.
- Aggregate Gradients: Instead of immediate updates, sum or average the gradients across the entire batch.
- Update Weights: Update the neural network weights using the aggregated gradient.
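Step 1, batch formation, is the only part not covered by the earlier sketch. Below is a hedged sketch of one way to extract (target, context) pairs with a sliding context window and group them into fixed-size batches; the window size, batch size, and helper names (`generate_pairs`, `batches`) are illustrative choices, not a standard interface. Each batch it yields would then go through steps 2-4 via a batched update such as the one sketched earlier.

```python
from itertools import islice

def generate_pairs(token_ids, window=2):
    """Step 1: yield (target, context) index pairs from one encoded sentence."""
    for i, target in enumerate(token_ids):
        lo, hi = max(0, i - window), min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, token_ids[j]

def batches(pairs, batch_size=512):
    """Group a stream of pairs into lists of at most batch_size pairs."""
    pairs = iter(pairs)
    while True:
        chunk = list(islice(pairs, batch_size))
        if not chunk:
            return
        yield chunk

# toy corpus already mapped to integer word ids
corpus = [[0, 4, 7, 2, 9], [3, 4, 1, 8, 2, 6]]
pair_stream = (pair for sent in corpus for pair in generate_pairs(sent))
for batch in batches(pair_stream, batch_size=8):
    targets, contexts = zip(*batch)
    # steps 2-4: one averaged update per batch, e.g. the batch_update sketch shown earlier
    # batch_update(W_in, W_out, list(targets), list(contexts))
    print(len(batch), "pairs in this batch")
```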
Considerations
- Batch Size: As in deep learning generally, the batch size is a hyperparameter that influences performance. Smaller batches give more frequent but noisier updates, while larger batches give more stable updates at a higher time and memory cost per step.
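For a sense of where this knob appears in practice, the sketch below uses gensim's Word2Vec in skip-gram mode (sg=1). Its batch_words parameter controls roughly how many words are grouped into each job handed to the worker threads, which is gensim's notion of batching (work scheduling rather than strict gradient averaging). The specific values are illustrative starting points to tune, not recommendations.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]  # toy corpus

model = Word2Vec(
    sentences,
    sg=1,               # skip-gram (sg=0 would be CBOW)
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    batch_words=10000,  # batch-size knob: words per batch sent to each worker
)
print(model.wv.most_similar("cat", topn=2))
```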
Strengths
- Training Speed: The primary advantage of batch skip-gram is its increased training speed. Batch updates make better use of the parallel computation capabilities of GPUs, leading to faster training times on large datasets.
- Gradient Stability: Averaging gradients across a batch helps reduce noise and can lead to more stable convergence during the training process, potentially reducing overfitting.
- General Deep Learning Benefits: Batch skip-gram inherits the core benefits of mini-batch training seen across deep learning, particularly those related to computational efficiency, such as vectorized operations and better hardware utilization.
Weaknesses
- Hyperparameter Sensitivity: The choice of batch size becomes an additional hyperparameter to tune. Finding the optimal batch size to balance speed and model performance might require experimentation.
- Potential for Decreased Accuracy: In some cases, large batch sizes may converge to slightly worse solutions than smaller ones. This is because very large batches smooth out the gradient noise that often helps the optimizer settle into a better minimum.
- Memory Considerations: Larger batches require more memory. You may run into hardware constraints when dealing with very large datasets and extremely large batch sizes.
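As a rough, back-of-the-envelope illustration only (assuming float32 values and that per-batch memory is dominated by the embedding rows and gradients touched for each pair's positive and negative samples; real implementations and frameworks will differ):

```python
# Illustrative assumptions, not measurements.
batch_size = 8192       # (target, context) pairs per batch
negatives = 5           # negative samples per pair
dim = 300               # embedding dimension
bytes_per_float = 4     # float32

rows_touched = batch_size * (1 + negatives)        # one positive + 5 negatives per pair
per_copy = rows_touched * dim * bytes_per_float    # one copy of those rows (activations or gradients)
print(f"~{per_copy / 2**20:.0f} MiB per copy")     # ~56 MiB here; more with gradients and optimizer state
```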