This is a separate repo for a specific part of my undergrad thesis (my undergrad thesis repo is here). I attempt to improve sentence embeddings via discrete optimization, which is learnt through Gumbel-Softmax. The performance results are not ideal, yet the methodology and implementation behind this exploration are worth documenting.

The highlights are:
- Huggingface-style `Trainer` + `transformers:v4.22.0` compatibility
- Discrete optimization + Gumbel-Softmax method
- More reasonable evaluation
```bash
# for a separate python env
conda create -n hgf python=3.8.13 -y
conda activate hgf
# requirements
pip install -r requirements.txt
# download eval data
cd SentEval/data/downstream
```

```bash
bash scripts/reproduce.sh
# or
bash scripts/gumbel_softmax.sh
# eval scripts are also provided.
# The scripts are highly customizable, with arguments explained in metadata.
```
Gumbel Softmax (original paper)
Originally, discrete optimization is learnt in an RL style through the REINFORCE algorithm:
$$\nabla_\theta\mathbb{E}_z[\mathcal{L}(z)] = \nabla_\theta \int dz\ p_\theta(z)\mathcal{L}(z) = \int dz\ p_\theta(z)\nabla_\theta \log p_\theta(z)\mathcal{L}(z) = \mathbb{E}_z[\mathcal{L}(z)\nabla_\theta \log p_\theta(z)]$$
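As a minimal PyTorch sketch of this score-function estimator (the toy distribution, `loss_fn` and all names below are illustrative, not the repo's implementation):

```python
import torch
from torch.distributions import Categorical

# Illustrative setup: logits parameterize p_theta(z) over 4 discrete choices.
logits = torch.zeros(4, requires_grad=True)


def loss_fn(z: torch.Tensor) -> torch.Tensor:
    # Hypothetical loss L(z): pretend choice 2 is the best one.
    return (z - 2).float().abs()


dist = Categorical(logits=logits)
z = dist.sample()                                    # discrete sample; no gradient flows through it
surrogate = loss_fn(z).detach() * dist.log_prob(z)   # gradient of this is L(z) * grad_theta log p_theta(z)
surrogate.backward()                                 # logits.grad now holds the REINFORCE estimate
print(logits.grad)
```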
where one difficulty lies in efficiently sampling from the distribution during each learning step; even the number of binary combinations grows exponentially. Another difficulty is the high variance of this estimator. The motivation for a better optimization objective is strong.
Sampling from an arbitrary discrete distribution can be tackled through the Gumbel-Max trick, which takes the form of:

$$z = \underset{i}{\arg\max}\ (\log \pi_i + g_i),\quad g_i = -\log(-\log u_i),\ u_i \sim \text{Uniform}(0, 1)$$

Let us inspect the distribution of $z$. Denote $x_i = \log \pi_i + g_i$. We have

$$P(z = k) = P(x_k \geq x_i,\ \forall i \neq k) = \frac{\pi_k}{\sum_i \pi_i}$$

so the index of the maximum follows exactly the categorical distribution defined by $\pi$.
Recall, then, that sampling from an arbitrary categorical distribution can be reparameterized as sampling i.i.d. Gumbel noise and applying a deterministic $\arg\max$.
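A small sketch of the Gumbel-Max trick in PyTorch (the toy distribution `pi` and sample count are made up for illustration):

```python
import torch

# Toy categorical distribution pi over 4 classes (illustrative values).
pi = torch.tensor([0.1, 0.2, 0.3, 0.4])

# Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
u = torch.rand(10000, 4)
g = -torch.log(-torch.log(u))

# Gumbel-Max: argmax over (log pi + g) is an exact sample from pi.
samples = torch.argmax(torch.log(pi) + g, dim=-1)

# Empirical frequencies should be close to pi.
print(torch.bincount(samples, minlength=4) / samples.numel())
```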
To tackle the problem underlying discrete optimization, we would like to approximate discrete operations with continuous ones, such that gradients flow back to the parameters that the sampled distribution is conditioned on (here, the parameters of the proxy model that produces the token distribution).
It is intuitive that by replacing $\arg\max$ with a temperature-controlled softmax, we would obtain a continuous approximation in which gradients with respect to the distribution parameters are well defined. As the temperature $\tau \rightarrow 0$, the softmax operation would approach $\arg\max$ in both sampled value and expectation value.
For an input sentence, the proxy model predicts, for each token, how likely it is to be kept. Should a token be deleted from the input, the ATTN process would be affected. Here, we shrink the `attention_probs` with a continuous dropout mask whose values lie between 0 and 1. During inference, we keep tokens with sufficiently large keep probabilities.
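A sketch of how such per-token keep probabilities could be produced by a Gumbel-Softmax proxy and multiplied into the attention probabilities (shapes follow the `[bs, num_head, seq_len, seq_len]` layout quoted below; `keep_logits` and the toy dimensions are assumptions):

```python
import torch
import torch.nn.functional as F

bs, num_head, seq_len = 2, 12, 8

# Hypothetical proxy logits: one keep/drop decision per token.
keep_logits = torch.randn(bs, seq_len, 2, requires_grad=True)

# Continuous "keep" probability per token via Gumbel-Softmax (training time).
proxy_outputs = F.gumbel_softmax(keep_logits, tau=1.0, hard=False)[..., 0]  # [bs, seq_len]

# Shrink attention probabilities: down-weighting column j softly removes token j.
attention_probs = torch.softmax(torch.randn(bs, num_head, seq_len, seq_len), dim=-1)
attention_probs = attention_probs * proxy_outputs[:, None, None, :]
```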
In the original SimCSE repo, performance is evaluated on STS tasks. Interestingly, only the dev split of STSBenchmark is used for validation during training, while test evaluation is performed on the train and test splits. However, considering the semi-supervised nature of SimCSE, it is more natural to combine the train split and dev split as the validation set for less variation. Statistics are tracked through `Trainer` and tensorboard. Sample from sup-roberta-base:
Here we report Spearman correlation results.
| | avg_sts | avg_test | avg_train_and_dev | sickrelatedness | sts12 | sts13 | sts14 | sts15 | sts16 | stsbenchmark | stsbenchmark_test | stsbenchmark_train_and_dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
unsup-bert-GS | 0.7422 | 0.7394 | 0.7820 | 0.6948 | 0.6505 | 0.7950 | 0.7076 | 0.8034 | 0.7655 | 0.7782 | 0.7592 | 0.7820 |
unsup-bert | 0.7499 | 0.7473 | 0.7822 | 0.7077 | 0.6428 | 0.8168 | 0.7226 | 0.8025 | 0.7784 | 0.7785 | 0.7600 | 0.7822 |
unsup-roberta-GS | 0.7463 | 0.7458 | 0.7863 | 0.6831 | 0.6386 | 0.7931 | 0.7202 | 0.8179 | 0.7861 | 0.7852 | 0.7816 | 0.7863 |
unsup-roberta | 0.7602 | 0.7604 | 0.7955 | 0.6948 | 0.6597 | 0.8124 | 0.7339 | 0.8192 | 0.806 | 0.7954 | 0.7969 | 0.7955 |
sup-bert-GS | 0.8148 | 0.8144 | 0.8451 | 0.8095 | 0.7493 | 0.8461 | 0.7989 | 0.8535 | 0.8023 | 0.8444 | 0.8414 | 0.8451 |
sup-bert | 0.8153 | 0.8149 | 0.8456 | 0.8104 | 0.7521 | 0.8452 | 0.7981 | 0.8534 | 0.8027 | 0.8451 | 0.8427 | 0.8456 |
sup-roberta-GS | 0.8249 | 0.8257 | 0.8511 | 0.8063 | 0.7616 | 0.8571 | 0.8096 | 0.8580 | 0.8294 | 0.8521 | 0.8578 | 0.8511 |
sup-roberta | 0.8213 | 0.8224 | 0.8497 | 0.8025 | 0.7586 | 0.8475 | 0.8033 | 0.8536 | 0.8321 | 0.8511 | 0.8591 | 0.8497 |
It can be seen that merely introducing Gumbel-Softmax does not guarantee a boost in performance. Another metric worth noticing is the sparsity of the learned probabilities: as it turns out, the proxy model tends to retain all tokens instead of removing any.
The scheme for obtaining positive examples in SimCSE is to leverage different dropout masks. Though it seems simple, its implications are profound, and it might be more complicated than it appears.
```python
# attention_scores -> [bs, num_head, seq_len, seq_len]
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
```
`attention_probs[bs, num_head, i, j]` denotes the probability, for batch element `bs` and attention head `num_head`, with which `token[i]` should add up the value of `token[j]`. Applying dropout to this element is equivalent to removing `token[j]` when computing the ATTN score for `token[i]` in a specific head of a specific layer. However, the continuous dropout mask that I introduce applies to all layers, all heads and all tokens (as seen in the code, `proxy_outputs[:, None, None, :]`). Therefore, the original dropout scheme introduces fine-grained noise within the model, and it is no wonder that it outperforms other discrete schemes over the inputs.
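To make the contrast concrete, a small sketch (tensor names and sizes are illustrative assumptions) of the two noise schemes side by side:

```python
import torch
import torch.nn as nn

bs, num_head, seq_len = 2, 12, 8
attention_probs = torch.softmax(torch.randn(bs, num_head, seq_len, seq_len), dim=-1)

# Original SimCSE noise: independent dropout on every (head, i, j) entry,
# applied separately inside each layer.
fine_grained = nn.Dropout(p=0.1)(attention_probs)

# Token-level continuous mask: one weight per token, shared across
# all heads, all query positions and all layers.
proxy_outputs = torch.rand(bs, seq_len)
coarse = attention_probs * proxy_outputs[:, None, None, :]
```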