Research Postgraduate, Imperial College London, South Kensington, United Kingdom
Nikolaos Giakoumoglou is currently a postgraduate researcher at Imperial College London in the Department of Electrical and Electronic Engineering (EEE), within the Communications and Signal Processing (CSP) group, where he is pursuing his PhD under the guidance of Professor Tania Stathaki. Before his current role, he worked as a Research Assistant at the Information Technologies Institute (ITI) of the Centre for Research and Technology Hellas (CERTH). He obtained his Diploma in Electrical and Computer Engineering from Aristotle University of Thessaloniki in 2021. Nikolaos' research is primarily focused on Artificial Intelligence, Machine Learning, and Deep Learning, with a special interest in applications within the field of Computer Vision.
SynCo: Synthetic Hard Negatives for Contrastive Visual Representation Learning
N. Giakoumoglou and T. Stathaki
Under review...
Contrastive learning has become a dominant approach in self-supervised visual representation learning, but efficiently leveraging hard negatives, which are samples closely resembling the anchor, remains challenging. We introduce SynCo (Synthetic negatives in Contrastive learning), a novel approach that improves model performance by generating synthetic hard negatives in the representation space. Building on the MoCo framework, SynCo introduces six strategies for creating diverse synthetic hard negatives "on-the-fly" with minimal computational overhead. SynCo achieves faster training and strong representation learning, surpassing MoCo-v2 by +0.4% and MoCHI by +1.0% on ImageNet ILSVRC-2012 linear evaluation. It also transfers more effectively to detection tasks, achieving strong results on PASCAL VOC detection (57.2% AP) and significantly improving over MoCo-v2 on COCO detection (+1.0% box AP) and instance segmentation (+0.8% mask AP). Our synthetic hard negative generation approach substantially enhances visual representations learned through self-supervised contrastive learning.
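As a hedged illustration of the general idea, the sketch below synthesizes a hard negative by interpolating the two queue negatives most similar to the anchor and re-normalizing. This is an assumption in the spirit of hard-negative mixing; SynCo's six actual strategies are not reproduced here.

```python
import math

def normalize(v):
    """L2-normalize a vector represented as a list of floats."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def synthesize_hard_negative(anchor, negatives, alpha=0.5):
    """Create one synthetic hard negative by interpolating the two
    negatives most similar to the anchor, then re-normalizing.
    Illustrative sketch only, not SynCo's actual strategies."""
    # rank negatives by cosine similarity to the (unit-norm) anchor
    sims = [sum(a * n for a, n in zip(anchor, neg)) for neg in negatives]
    order = sorted(range(len(negatives)), key=lambda i: -sims[i])
    n1, n2 = negatives[order[0]], negatives[order[1]]
    mixed = [alpha * x + (1 - alpha) * y for x, y in zip(n1, n2)]
    return normalize(mixed)

anchor = normalize([1.0, 0.2, 0.0])
negatives = [normalize(v) for v in ([0.9, 0.1, 0.1],
                                    [0.0, 1.0, 0.0],
                                    [0.8, 0.3, 0.0])]
hard_neg = synthesize_hard_negative(anchor, negatives)
```

Because the mixture stays on the unit sphere and is built from the hardest existing negatives, it lies close to the anchor, which is what makes it a useful "hard" contrast.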
SynCo-v2: An Empirical Study of Training Self-Supervised Vision Transformers with Synthetic Hard Negatives
N. Giakoumoglou, A. Floros, K. M. Papadopoulos, T. Stathaki
Under review...
We introduce SynCo-v2, a method that integrates synthetic hard negatives into unsupervised vision transformer pretraining to improve representation quality. Our approach is thoroughly benchmarked on ImageNet and on transfer learning, image retrieval, copy detection, and image and video segmentation tasks. Notably, our proposed negatives give rise to emergent properties, where learned representations contain explicit information about the semantic content of an image and serve as excellent classifiers (up to +11.3% over baselines). SynCo-v2 achieves these benefits through simple modifications to existing contrastive frameworks and outperforms competing methods while being more resource efficient, e.g., our ViT-B surpasses V-JEPA with ViT-L. Our findings motivate reconsidering contrastive learning as a simpler yet powerful alternative to dominant generative and self-distillation approaches.
Relational Representation Distillation
N. Giakoumoglou and T. Stathaki
Under review...
Knowledge distillation transfers knowledge from large, high-capacity teacher models to more compact student networks. The standard approach minimizes the Kullback-Leibler (KL) divergence between the probabilistic outputs of the teacher and student, effectively aligning predictions but neglecting the structural relationships encoded within the teacher's internal representations. Recent advances have adopted contrastive learning objectives to address this limitation; however, such instance-discrimination-based methods inevitably induce a "class collision problem", in which semantically related samples are inappropriately pushed apart despite belonging to similar classes. To overcome this, we propose Relational Representation Distillation (RRD), which preserves the relative relationships among instances rather than enforcing absolute separation. Our method introduces separate temperature parameters for teacher and student distributions, with a sharper teacher (low τ_t) emphasizing primary relationships and a softer student (high τ_s) maintaining secondary similarities. This dual-temperature formulation creates an implicit information bottleneck that preserves fine-grained relational structure while avoiding the over-separation characteristic of contrastive losses. We establish theoretical connections showing that InfoNCE emerges as a limiting case of our objective when τ_t approaches 0, and empirically demonstrate that this relaxed formulation yields superior relational alignment and generalization across classification and detection tasks.
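The dual-temperature idea can be sketched as a KL divergence between a sharp teacher similarity distribution and a softer student one. This is a minimal illustration over raw similarity scores; the exact RRD objective and hyperparameter values are assumptions here.

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax over a list of similarity scores."""
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def relational_kl(teacher_sims, student_sims, tau_t=0.05, tau_s=0.2):
    """KL divergence between a sharp teacher distribution (low tau_t)
    and a softer student distribution (high tau_s).
    Illustrative sketch of a dual-temperature relational objective."""
    p = softmax(teacher_sims, tau_t)  # sharp: emphasizes primary relations
    q = softmax(student_sims, tau_s)  # soft: keeps secondary similarities
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# identical underlying similarities, different temperatures
sims = [0.9, 0.5, 0.1]
loss = relational_kl(sims, sims)
```

With equal temperatures the two distributions coincide and the loss vanishes; the temperature gap is what forces the student to keep secondary similarities the sharp teacher suppresses.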
SNAP: Synthetically Negative Augmented Pretraining for Vision-Language Models
N. Giakoumoglou, P. Giakoumoglou, K. M. Papadopoulos, A. Floros, T. Stathaki
Under review...
Vision-language contrastive pretraining relies on large batches of randomly sampled image-text pairs to provide negative examples. Prior approaches address this by generating hard negatives in the input space—rewriting captions with LLMs or synthesizing images with diffusion models—but incur substantial computational overhead and typically augment only one modality. Synthetic hard negatives generated in the representation space have proven effective for unimodal self-supervised learning, but extending them to vision-language models that align two distinct modalities via an InfoNCE objective is not straightforward. We identify two failure modes: cross-modal synthetic negatives fall into the modality gap and are trivially rejected, while intra-modal negatives involving the positive pair suffer from positive leakage that sends contradictory gradients. Both failure modes additionally cause the learnable temperature to diverge. We propose SNAP, which generates intra-modal hard negatives that never involve the positive from either modality, avoiding both failure modes entirely. SNAP is model-agnostic, requires no external generative models, and adds less than 9% training time overhead. Evaluated on top of CLIP and FLIP across multiple architectures and datasets, SNAP delivers consistent improvements on zero-shot retrieval, zero-shot classification, and linear probe evaluation.
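The constraint that synthetic negatives never involve the positive pair can be sketched as an index-selection rule before mixing. The helper names and the convex-combination mixer below are illustrative assumptions, not SNAP's actual formulation.

```python
def intra_modal_candidates(batch_size, positive_index):
    """Indices eligible for synthetic-negative mixing within one modality,
    excluding the positive pair's index to avoid positive leakage.
    Illustrative sketch; SNAP's actual mixing rule is not reproduced."""
    return [i for i in range(batch_size) if i != positive_index]

def mix_negatives(embeddings, i, j, alpha=0.5):
    """Convex combination of two same-modality embeddings."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(embeddings[i], embeddings[j])]

# for anchor 1 in a batch of 4, only the other three items may be mixed
cands = intra_modal_candidates(4, positive_index=1)
mixed = mix_negatives([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], 0, 1)
```

Staying intra-modal sidesteps the modality gap, and excluding the positive index keeps gradients from the synthetic negatives consistent with the InfoNCE attraction term.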
Expert Clustering and Knowledge Transfer for Whole Slide Image Classification
K. M. Papadopoulos, N. Giakoumoglou, A. Floros, P. L. Dragotti, T. Stathaki
Accepted for presentation at ISBI 2026 Main Conference (Oral)
Multiple Instance Learning (MIL) is widely adopted for Whole Slide Image (WSI) classification. Existing MIL methods suffer from representation bottlenecks where slide-level aggregation compresses diverse patch information, limiting performance. Our proposed Divide-and-Distill (D&D) framework addresses this by partitioning the feature space into representation-coherent clusters, training specialized expert models on each cluster, and distilling their collective knowledge into a unified model. This reduces information compression loss while maintaining inference efficiency. Experiments across three datasets and six MIL methods demonstrate consistent performance gains without added inference cost.
A Multimodal Approach for Cross-Domain Image Retrieval
L. Iijima, N. Giakoumoglou and T. Stathaki
Accepted for presentation at VISAPP 2026 Main Conference (Poster)
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Traditional approaches focus on visual image features and rely heavily on supervised learning with labeled data and cross-domain correspondences, and consequently often struggle with the significant domain gap. This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models. Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in unsupervised settings with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
Mitigating Representation Bottlenecks in Multiple Instance Learning
K. M. Papadopoulos, N. Giakoumoglou, A. Floros, T. Stathaki
Accepted for presentation at NeurIPS 2025 Workshop "Medical Imaging meets EurIPS (MedEurIPS)"
Multiple Instance Learning (MIL) is widely used for Whole Slide Image classification in computational pathology, yet existing approaches suffer from a representation bottleneck where diverse patch-level features are compressed into a single slide-level embedding. We propose Divide-and-Distill (D&D), which clusters the feature space into coherent regions, trains expert models on each cluster, and distills their knowledge into a unified model. Experiments demonstrate that D&D consistently improves six state-of-the-art MIL methods in both accuracy and AUC while maintaining single-model inference efficiency.
Cluster Contrast for Unsupervised Visual Representation Learning
N. Giakoumoglou, T. Stathaki
Accepted for presentation at ICIP 2025 Main Conference
[ieeexplore] [arXiv] [pdf] [bibtex]
We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
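The slow-moving average of the query outputs mentioned above is the standard momentum (EMA) update of the key network. A minimal sketch follows; the momentum value is an assumption.

```python
def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key network's parameters toward the query's:
    key <- m * key + (1 - m) * query, applied element-wise.
    Illustrative sketch of a momentum key encoder."""
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]

# the key network slowly drifts toward the (fixed) query parameters
key = [0.0, 0.0]
query = [1.0, 2.0]
for _ in range(1000):
    key = momentum_update(key, query)
```

With m = 0.999 the key network changes very slowly, which keeps the contrastive targets consistent across training steps.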
Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives
N. Giakoumoglou, A. Floros, K. M. Papadopoulos, T. Stathaki
Accepted for presentation at ICCV 2025 Workshop "Representation Learning with Very Limited Resources: When Data, Modalities, Labels, and Computing Resources are Scarce"
[openreview] [arXiv] [pdf] [bibtex]
This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage "fake it till you make it". While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of "faking it" in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework—dubbed Syn2Co—combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.
Unsupervised Training of Vision Transformers with Synthetic Negatives
N. Giakoumoglou, A. Floros, K. M. Papadopoulos, T. Stathaki
Accepted for presentation at CVPR 2025 Workshop "Second Workshop on Visual Concepts"
[openreview] [arXiv] [pdf] [suppl] [bibtex] [code]
This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.
Discriminative and Consistent Representation Distillation
N. Giakoumoglou and T. Stathaki
Under review...
What Makes Pretraining Data Good for Self-Supervised Learning?
N. Giakoumoglou, A. Floros, K. M. Papadopoulos, T. Stathaki
Under review...
Open-World Semantic Segmentation with Sensitivity Modeling
A. R. Varvarigos, N. Giakoumoglou, T. Stathaki
Under review...
A Review on Discriminative Self-supervised Learning Methods
N. Giakoumoglou, T. Stathaki, A. Gkelias
Under review...
A Review on Artificial Intelligence Methods for Plant Disease and Pest Detection
N. Giakoumoglou, D. Kapetas, K. M. Papadopoulos, P. Christakakis, T. Stathaki, E. M. Pechlivani
Under review...