Introduction

Notes from reading Prototypical Contrastive Learning of Unsupervised Representations.

Intuition

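Contrastive learning methods that have become popular recently (e.g., SimCLR) learn representations by discriminating positive pairs from negative pairs, where the label is whether the two views come from the same source data point. The paper argues that such instance-wise contrastive learning treats two samples as different no matter how similar they are, and therefore fails to capture the semantic structure inherent in the data. It proposes prototypical contrastive learning in place of the instance-wise formulation.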

In one sentence: it combines contrastive learning with self-labeling (e.g., DeepCluster).

Prototypical Contrastive Learning

Let the dataset be $X=\{x_1,x_2,\dots,x_n\}$ and let $f_\theta$ be the model to be trained, with $v_i=f_\theta(x_i)$. Conventional instance-wise contrastive learning minimizes the following loss.

$\displaystyle \mathcal{L}_\text{InfoNCE}=\sum_{i=1}^n-\log\frac{\exp(v_i\cdot v_i'/\tau)}{\sum_{j=0}^r\exp(v_i\cdot v_j'/\tau)}$

$v_i'$ is the positive embedding for the $i$-th sample, and $v'_j$ ranges over one positive embedding and $r$ negative embeddings. $\tau$ is the temperature parameter.
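As a rough illustration of the loss above, here is a minimal NumPy sketch. The function names (`l2_normalize`, `info_nce`), the shared pool of negatives, and the random toy embeddings are my own assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(v, v_pos, v_neg, tau=0.1):
    """InfoNCE summed over a batch.

    v:     (n, d) embeddings v_i
    v_pos: (n, d) positive embeddings v_i' (the j = 0 term of the denominator)
    v_neg: (r, d) negative embeddings, shared by all i here (j = 1..r)
    """
    pos = np.sum(v * v_pos, axis=1, keepdims=True) / tau  # (n, 1)
    neg = v @ v_neg.T / tau                                # (n, r)
    logits = np.concatenate([pos, neg], axis=1)            # (n, r + 1)
    # Numerically stable log-softmax; the positive is in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.sum()

# Toy data: positives are slightly perturbed copies of v.
rng = np.random.default_rng(0)
v = l2_normalize(rng.normal(size=(4, 8)))
v_pos = l2_normalize(v + 0.05 * rng.normal(size=(4, 8)))
v_neg = l2_normalize(rng.normal(size=(16, 8)))
loss = info_nce(v, v_pos, v_neg)
print(loss)
```

The loss shrinks as $v_i\cdot v_i'$ grows relative to the dot products with the negatives, which is the "same instance or not" discrimination described above.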

Prototypical contrastive learning uses a prototype $c$ in place of $v'$. In addition, the hyperparameter $\tau$ is replaced by a per-prototype concentration $\phi$.

The goal is expressed as maximizing the following log-likelihood.

$\displaystyle \theta^\ast=\underset{\theta}{\mathrm{arg}\max}\sum_{i=1}^n\log p(x_i;\theta)$

Assume that the observed data $\{x_i\}_{i=1}^n$ are related to latent variables $C=\{c_i\}_{i=1}^k$ (the prototypes), and rewrite the log-likelihood as follows.

$\displaystyle \theta^\ast=\underset{\theta}{\mathrm{arg}\max}\sum_{i=1}^n\log p(x_i;\theta)=\underset{\theta}{\mathrm{arg}\max}\sum_{i=1}^n\log\sum_{c_i\in C}p(x_i,c_i;\theta)$

Optimizing this directly is difficult, so consider the following lower bound.

$\displaystyle \sum_{i=1}^n\log\sum_{c_i\in C}p(x_i,c_i;\theta)=\sum_{i=1}^n\log\sum_{c_i\in C}Q(c_i)\frac{p(x_i,c_i;\theta)}{Q(c_i)}\geq\sum_{i=1}^n\sum_{c_i\in C}Q(c_i)\log\frac{p(x_i,c_i;\theta)}{Q(c_i)}$

$Q(c_i)$ is a distribution satisfying $\sum_{c_i\in C}Q(c_i)=1$. The inequality follows from Jensen's inequality. Equality holds when the argument of the $\log$ is constant, which gives $Q(c_i)=p(c_i;x_i,\theta)$.
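The equality condition can be spelled out as a short worked step. The symbol $\lambda_i$ is introduced here only to name the constant and does not appear in the paper. If the ratio inside the $\log$ does not depend on $c_i$, then

$\displaystyle \frac{p(x_i,c_i;\theta)}{Q(c_i)}=\lambda_i\ \Rightarrow\ 1=\sum_{c_i\in C}Q(c_i)=\frac{1}{\lambda_i}\sum_{c_i\in C}p(x_i,c_i;\theta)=\frac{p(x_i;\theta)}{\lambda_i}$

so $\lambda_i=p(x_i;\theta)$ and

$\displaystyle Q(c_i)=\frac{p(x_i,c_i;\theta)}{p(x_i;\theta)}=p(c_i;x_i,\theta)$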

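Ignoring the term $-\sum_{i=1}^n\sum_{c_i\in C}Q(c_i)\log Q(c_i)$, which is constant with respect to $\theta$ once $Q$ is fixed, the problem becomes maximizing the following expression.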

$\displaystyle \sum_{i=1}^n\sum_{c_i\in C}Q(c_i)\log p(x_i,c_i;\theta)$

The optimization is carried out with the following EM algorithm.

The E-step estimates $p(c_i;x_i,\theta)$. This is done by running $k$-means on $v_i'=f_{\theta'}(x_i)$. The prototype $c_i$ is taken to be the centroid of the $i$-th cluster, and the posterior is computed as the hard assignment $p(c_i;x_i,\theta)=\mathbb{1}(x_i\in c_i)$. Here $f_{\theta'}$ is a momentum encoder, as in MoCo.
The M-step substitutes the $p(c_i;x_i,\theta)=\mathbb{1}(x_i\in c_i)$ obtained in the E-step into the lower bound above and maximizes it with respect to $\theta$.
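The E-step can be sketched roughly as below. This is a plain Lloyd's $k$-means written in NumPy for illustration, not the paper's implementation. The names `nearest` and `kmeans`, the iteration count, and the two-blob toy data standing in for momentum-encoder outputs are all assumptions of mine.

```python
import numpy as np

def nearest(x, centroids):
    """Index of the closest centroid for every row of x."""
    # Squared distances between every point and every centroid: (n, k).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids (k, d), assignments (n,))."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)

# Stand-in for v_i' = f_theta'(x_i): two well-separated blobs.
rng = np.random.default_rng(0)
v_mom = np.concatenate([
    rng.normal(loc=-5.0, size=(50, 2)),
    rng.normal(loc=5.0, size=(50, 2)),
])
centroids, assign = kmeans(v_mom, k=2)

# Hard posterior p(c_j; x_i, theta) = 1(x_i in c_j) as a one-hot (n, k) matrix.
posterior = np.eye(2)[assign]
print(centroids)
print(posterior.sum(axis=0))  # number of samples per cluster
```

Because the posterior is one-hot, the M-step objective reduces for each $x_i$ to the single term for the cluster that contains it.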

Here, assuming a uniform prior over the prototypes, $p(x_i,c_i;\theta)$ is rewritten as follows.

$\displaystyle p(x_i,c_i;\theta)=p(x_i;c_i,\theta)p(c_i;\theta)=\frac{1}{k}\cdot p(x_i;c_i,\theta)$

Further, assume an isotropic Gaussian for $p(x_i;c_i,\theta)$. Below, $s$ denotes the cluster to which $x_i$ belongs.

$\displaystyle p(x_i;c_i,\theta)=\exp\left(\frac{-(v_i-c_s)^2}{2\sigma_s^2}\right)/\sum_{j=1}^k\exp\left(\frac{-(v_i-c_j)^2}{2\sigma^2_j}\right)$

Substituting this into the objective gives the following.

$\displaystyle \theta^\ast=\underset{\theta}{\mathrm{arg}\min}\sum_{i=1}^n-\log\frac{\exp(v_i\cdot c_s/\phi_s)}{\sum_{j=1}^k\exp(v_i\cdot c_j/\phi_j)}$
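The step from the Gaussian form to the dot-product form can be sketched as follows, assuming $\|v_i\|=\|c_j\|=1$ and reading $\phi_j$ as $\sigma_j^2$ (this identification is my reading of the notation and is stated here as an assumption):

$\displaystyle (v_i-c_j)^2=\|v_i\|^2+\|c_j\|^2-2v_i\cdot c_j=2-2v_i\cdot c_j$

$\displaystyle \exp\left(\frac{-(v_i-c_j)^2}{2\sigma_j^2}\right)=\exp\left(-\frac{1}{\sigma_j^2}\right)\exp\left(\frac{v_i\cdot c_j}{\sigma_j^2}\right)$

The prefactor $\exp(-1/\sigma_j^2)$ cancels between numerator and denominator exactly only when all $\sigma_j$ are equal; the objective above keeps just the dot-product factor. The uniform prior $1/k$ only adds a constant to the log and does not affect the $\mathrm{arg}\min$.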

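This derivation assumes that $v$ and $c$ are normalized to unit l2 norm. On the practical side, two things apparently helped: adding the conventional InfoNCE term to the cost, and running the $k$-means centroid computation multiple times ($M$ clusterings, indexed by $m$). The final objective is as follows.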

$\displaystyle \mathcal{L}_\text{ProtoNCE}=\sum_{i=1}^n-\left(\log\frac{\exp(v_i\cdot v_i'/\tau)}{\sum_{j=0}^r\exp(v_i\cdot v_j'/\tau)}+\frac{1}{M}\sum_{m=1}^M\log\frac{\exp(v_i\cdot c_s^m/\phi_s^m)}{\sum_{j=0}^r\exp(v_i\cdot c_j^m/\phi_j^m)}\right)$
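To make the two terms concrete, here is a minimal NumPy sketch of the final objective. It is a sketch under stated assumptions, not the paper's code: the function names, the uniform sampling of $r$ negative prototypes per sample, the fixed value used for $\phi$, and the random toy clusterings are all hypothetical. In practice the centroids and assignments would come from $k$-means on momentum-encoder features.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(v, v_pos, v_neg, tau=0.1):
    """Instance-wise term: positive in column 0, r shared negatives after it."""
    pos = np.sum(v * v_pos, axis=1, keepdims=True) / tau
    logits = np.concatenate([pos, v @ v_neg.T / tau], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)
    return -(logits[:, 0] - np.log(np.exp(logits).sum(axis=1))).sum()

def proto_term(v, centroids, phi, assign, r, rng):
    """Prototype term for one clustering, summed over samples."""
    k = len(centroids)
    total = 0.0
    for i in range(len(v)):
        s = assign[i]                                   # cluster containing x_i
        others = np.delete(np.arange(k), s)
        neg = rng.choice(others, size=r, replace=False)  # r negative prototypes
        idx = np.concatenate([[s], neg])                 # positive first (j = 0)
        logits = centroids[idx] @ v[i] / phi[idx]        # each prototype has its own phi_j
        logits = logits - logits.max()
        total += -(logits[0] - np.log(np.exp(logits).sum()))
    return total

def proto_nce(v, v_pos, v_neg, clusterings, tau=0.1, r=2, seed=0):
    """clusterings: list of M tuples (centroids (k_m, d), phi (k_m,), assign (n,))."""
    rng = np.random.default_rng(seed)
    proto = sum(proto_term(v, c, phi, a, r, rng) for c, phi, a in clusterings)
    return info_nce(v, v_pos, v_neg, tau) + proto / len(clusterings)

# Toy inputs with M = 2 random "clusterings" (hypothetical stand-ins for k-means).
rng = np.random.default_rng(0)
n, d, k = 6, 8, 4
v = l2_normalize(rng.normal(size=(n, d)))
v_pos = l2_normalize(v + 0.05 * rng.normal(size=(n, d)))
v_neg = l2_normalize(rng.normal(size=(16, d)))
clusterings = []
for _ in range(2):
    c = l2_normalize(rng.normal(size=(k, d)))
    assign = (v @ c.T).argmax(axis=1)  # nearest prototype by dot product
    phi = np.full(k, 0.2)              # placeholder concentration values
    clusterings.append((c, phi, assign))
loss = proto_nce(v, v_pos, v_neg, clusterings)
print(loss)
```

A smaller $\phi_j$ sharpens the softmax around prototype $c_j$, so it plays the same role as $\tau$ but can differ per cluster.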