We revisit the core assumption underlying co-training in semi-supervised segmentation, namely multiple compatible and conditionally independent views, and show that current co-training models are tightly coupled, leading to sub-optimal performance. We present Diverse Co-training, which promotes diversity through different input domains, diverse architectures, and distinct strong augmentations, leading to state-of-the-art performance.
We first revisit the assumptions behind co-training: two or more conditionally independent views, each compatible with the target function. By deriving a generalization upper bound for co-training, we show theoretically that the homogenization of networks accounts for the generalization error of co-training methods.
Given a hypothesis class \(\mathcal{H}\) and a labeled set \(D_{l}\) of size \(l\) sufficient to learn an initial segmentor \(f_i^0\) whose generalization error is upper bounded by \(b_i^0\) with probability at least \(1-\delta\) (i.e. \(l \ge \max_i\{\tfrac1{b_i^0}\ln\tfrac{|\mathcal{H}|}{\delta}\}\)), we then train \(f_i^k\) by ERM on the union of the labeled set and the unlabeled set \(\sigma^i\) of size \(u\), whose pseudo-labels come from the other model \(f_{3-i}^{k-1}\). Then we have
\[ \Pr\bigl[d(f_i^k, f^*) \ge b_i^k\bigr] \;\le\; \delta \]
provided \(l\,b_i^0 \le e\,\sqrt[M]{M!} - M\), where \(M = u\,b_{3-i}^0\), and \( b_i^k = \max\!\Bigl\{\tfrac{l\,b_i^0 + u\,b_{3-i}^0 - u\,d(f_{3-i}^{\,k-1},f_i^k)}{l},\;0\Bigr\} \).
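To make the behavior of this bound concrete, the snippet below plugs hypothetical numbers (our own illustration; none of these values come from the paper) into the formulas above and evaluates \(b_i^k\) as the disagreement \(d(f_{3-i}^{k-1}, f_i^k)\) and the unlabeled size \(u\) vary.

```python
# Minimal numeric sketch of the Theorem 1 bound; all values are illustrative
# assumptions, not taken from the paper.
import math

l, b_i0, b_j0 = 100, 0.03, 0.10  # labeled size and initial error bounds (hypothetical)

def bound(d, u):
    """b_i^k = max{(l*b_i^0 + u*b_{3-i}^0 - u*d) / l, 0}."""
    return max((l * b_i0 + u * b_j0 - u * d) / l, 0.0)

def condition_ok(u):
    """Check l*b_i^0 <= e*(M!)^(1/M) - M with M = u*b_{3-i}^0, via lgamma."""
    M = u * b_j0
    return l * b_i0 <= math.e * math.exp(math.lgamma(M + 1) / M) - M

assert condition_ok(1000)
# Remark 1: a larger disagreement d between the two models -> a smaller bound.
for d in (0.100, 0.101, 0.102, 0.103):
    print(f"u=1000, d={d:.3f} -> b_i^k = {bound(d, 1000):.3f}")
# Remark 2: once d >= b_{3-i}^0, more unlabeled data u -> a smaller bound.
for u in (1000, 2000, 3000):
    print(f"u={u}, d=0.101 -> b_i^k = {bound(0.101, u):.3f}")
```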
Theorem 1 shows that the larger the difference \(d(f_{3-i}^{k-1}, f_i^k)\) between the two models, the smaller the upper bound on the generalization error. Thus we can conclude Remark 1.
Homogenization negatively impacts the generalization ability of co-training, leading to sub-optimal performance.
Under the condition that the difference between the two models is large enough, i.e. \(d(f_{3-i}^{k-1}, f_i^k) \ge b_{3-i}^0\), the larger \(u\) is, the smaller the upper bound on the generalization error, as the rearrangement after Remark 2 makes explicit. Then we have Remark 2.
Given a large difference between the two models, more unlabeled data decreases the generalization error of Co-training.
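To see the role of \(u\) explicitly, the bound from Theorem 1 can be rearranged in one line (keeping the clamp at zero):
\[ b_i^k \;=\; \max\Bigl\{\, b_i^0 \;-\; \frac{u}{l}\Bigl(d\bigl(f_{3-i}^{k-1}, f_i^k\bigr) - b_{3-i}^0\Bigr),\; 0 \Bigr\}, \]
so whenever \(d(f_{3-i}^{k-1}, f_i^k) \ge b_{3-i}^0\), the subtracted term is non-negative and grows linearly in \(u\).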
Remark 2 is consistent with the empirical observation that more unlabeled data leads to better performance. Further, this remark provides a theoretical guarantee for the strong augmentations used in our method.
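As a concrete illustration of "distinct strong augmentations", the sketch below (our own example built on torchvision; the paper's exact augmentation recipe may differ) assigns the two co-trained models different strong photometric pipelines over the same unlabeled image. Photometric-only transforms are convenient for segmentation because they leave the label mask unchanged.

```python
# Hypothetical distinct strong-augmentation pipelines, one per co-trained model;
# an illustrative sketch, not the paper's exact configuration.
import torchvision.transforms as T

# Pipeline for model 1: color-distortion-based strong augmentation.
strong_aug_1 = T.Compose([
    T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.25),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=7),
])

# Pipeline for model 2: intensity-remapping strong augmentation.
strong_aug_2 = T.Compose([
    T.RandomPosterize(bits=4),
    T.RandomSolarize(threshold=128),
    T.RandomEqualize(),
])

# Each model is trained on its own strongly augmented view of the same image,
# while pseudo-labels typically come from a weakly augmented view:
#   view_1, view_2 = strong_aug_1(img), strong_aug_2(img)
```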
Given that homogenization negatively impacts performance, we now investigate existing co-training methods. As shown in Figure 1 of the paper, we summarize two co-training paradigms besides CPS (b): co-training with cross heads and a shared backbone (c), and n-CPS (d), which leverages multiple models to perform co-training. As shown in Figure 3 of the paper, all three paradigms suffer from a severe homogenization issue. We also provide a rigorous analysis in logit and prediction space with L2 distance and KL divergence, demonstrating similar phenomena in Appendix B.
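For reference, a probe of this kind can be written in a few lines of PyTorch. The helper below is a hypothetical sketch (not the paper's code) that measures the L2 distance in logit space and a symmetric KL divergence in prediction space for two models on the same batch.

```python
# Hypothetical homogenization probe: how close are two models' outputs?
import torch
import torch.nn.functional as F

@torch.no_grad()
def homogenization_probe(model_1, model_2, images):
    """Return (L2 distance in logit space, symmetric KL in prediction space)."""
    logits_1, logits_2 = model_1(images), model_2(images)  # (B, C, H, W)
    l2 = (logits_1 - logits_2).pow(2).mean().sqrt()

    # Symmetric KL between per-pixel class distributions, averaged over the batch.
    log_p1 = F.log_softmax(logits_1, dim=1)
    log_p2 = F.log_softmax(logits_2, dim=1)
    kl_12 = F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean")
    kl_21 = F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")
    return l2, 0.5 * (kl_12 + kl_21)
```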
We further validate our arguments with the empirical performance of these models. As shown in Table 1 of the paper, less similar models bring performance benefits: co-training consistently outperforms the other two paradigms across all settings.
After analyzing the limitations of current co-training paradigms, we provide a comprehensive investigation of co-training to (i) promote diversity between models and (ii) provide relatively more independent pseudo views that better fit the co-training assumption. We propose Diverse Co-training, which incorporates (1) diverse input domains (e.g. RGB, HSV, and the frequency (DCT) domain) as pseudo views, (2) different strong augmentations to provide different views, and (3) different architectures (ViT and CNN) with different inductive biases. We provide two variants of Diverse Co-training, termed 2-cps ((e) of Figure 1) and 3-cps ((f) of Figure 1); a minimal training-step sketch follows below.
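The sketch below reflects our reading of the 2-cps variant; the function and its hyperparameters (e.g. lambda_cps) are hypothetical, and the released code may differ in details. One model consumes the RGB view while the other consumes a different input domain (HSV here; a DCT view would be produced analogously), and each model is supervised on unlabeled data by the other's pseudo-labels.

```python
# Minimal sketch of one 2-cps training step (illustrative, not the official code).
import torch.nn.functional as F
from kornia.color import rgb_to_hsv  # differentiable RGB -> HSV conversion

def diverse_cotraining_step(model_a, model_b, x_l, y_l, x_u, opt, lambda_cps=1.5):
    # Model A sees the RGB domain; model B sees the HSV domain as a pseudo view.
    x_l_b, x_u_b = rgb_to_hsv(x_l), rgb_to_hsv(x_u)

    # Supervised cross-entropy on the labeled batch, one input domain per model.
    sup = F.cross_entropy(model_a(x_l), y_l) + F.cross_entropy(model_b(x_l_b), y_l)

    # Cross pseudo supervision on the unlabeled batch: each model learns from
    # the other's hard pseudo-labels (argmax targets carry no gradient).
    logits_a, logits_b = model_a(x_u), model_b(x_u_b)
    cps = F.cross_entropy(logits_a, logits_b.argmax(dim=1)) \
        + F.cross_entropy(logits_b, logits_a.argmax(dim=1))

    loss = sup + lambda_cps * cps
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```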
We first show that all three techniques, which promote diversity and provide pseudo views, lead to substantial performance improvements.
Diverse Co-training demonstrates state-of-the-art (SOTA) performance on two datasets across various settings and architectures.
Example qualitative results from PASCAL VOC 2012. (a) RGB input; (b) ground truth; (c) FixMatch; (d) Co-training baseline; (e) Diverse Co-training (ours). (c) and (d) use DeepLabv3+ with ResNet-50 as the segmentation network, while (e) uses DeepLabv3+ with ResNet-50 and SegFormer-B2 (with an MLP head) as the two segmentation networks.
If you use this work or find it helpful, please consider citing it.
@inproceedings{li2023diverse,
  title={Diverse Cotraining Makes Strong Semi-Supervised Segmentor},
  author={Li, Yijiang and Wang, Xinjiang and Yang, Lihe and Feng, Litong and Zhang, Wayne and Gao, Ying},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={16055--16067},
  year={2023}
}
Credit: The design of this project page references the project pages of NeRF, DeepMotionEditing, and LERF.