
Computer Vision and Pattern Recognition

New submissions

[ total of 117 entries: 1-117 ]

New submissions for Thu, 2 May 24

[1]  arXiv:2405.00021 [pdf, other]
Title: SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Recently, interpreting complex charts with logical reasoning has emerged as a challenge alongside the development of vision-language models. A prior state-of-the-art (SOTA) model, Deplot, presented an end-to-end method that leverages a vision-language model to convert charts into tables and then uses Large Language Models (LLMs) for reasoning. However, unlike natural images, charts contain a mix of information that is essential and information that is irrelevant for chart reasoning, and we find that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments. Furthermore, we propose a novel prompt that addresses a shortcoming of the recent SOTA model, namely its neglect of visual attributes such as color. Our source code is available at https://github.com/sangwu99/Simplot.

[2]  arXiv:2405.00023 [pdf, ps, other]
Title: Revolutionizing Retail Analytics: Advancing Inventory and Customer Insight with AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In response to the significant challenges facing the retail sector, including inefficient queue management, poor demand forecasting, and ineffective marketing, this paper introduces an innovative approach utilizing cutting-edge machine learning technologies. We aim to create an advanced smart retail analytics system (SRAS), leveraging these technologies to enhance retail efficiency and customer engagement. To enhance customer tracking capabilities, a new hybrid architecture is proposed integrating several predictive models. In the first stage of the proposed hybrid architecture for customer tracking, we fine-tuned the YOLOV8 algorithm using a diverse set of parameters, achieving exceptional results across various performance metrics. This fine-tuning process utilized actual surveillance footage from retail environments, ensuring its practical applicability. In the second stage, we explored integrating two sophisticated object-tracking models, BOT-SORT and ByteTrack, with the labels detected by YOLOV8. This integration is crucial for tracing customer paths within stores, which facilitates the creation of accurate visitor counts and heat maps. These insights are invaluable for understanding consumer behavior and improving store operations. To optimize inventory management, we delved into various predictive models, optimizing and contrasting their performance against complex retail data patterns. The GRU model, with its ability to interpret time-series data with long-range temporal dependencies, consistently surpassed other models like Linear Regression, showing 2.873% and 29.31% improvements in R2-score and mAPE, respectively.
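
For readers unfamiliar with the two reported metrics, a minimal sketch of how they are typically computed follows; the numbers are made up and merely stand in for a demand series and a model's forecasts (e.g., from the GRU).

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_percentage_error

# Toy values standing in for observed demand and a model's forecasts.
y_true = np.array([120.0, 135.0, 128.0, 150.0, 142.0])
y_pred = np.array([118.0, 140.0, 125.0, 147.0, 145.0])

r2 = r2_score(y_true, y_pred)                          # higher is better
mape = mean_absolute_percentage_error(y_true, y_pred)  # lower is better
print(f"R2 = {r2:.3f}, MAPE = {mape:.2%}")
```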

[3]  arXiv:2405.00025 [pdf, other]
Title: Leveraging Pre-trained CNNs for Efficient Feature Extraction in Rice Leaf Disease Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Rice disease classification is a critical task in agricultural research, and in this study, we rigorously evaluate the impact of integrating feature extraction methodologies within pre-trained convolutional neural networks (CNNs). Initial investigations into baseline models, devoid of feature extraction, revealed commendable performance with ResNet-50 and ResNet-101 achieving accuracies of 91% and 92%, respectively. Subsequent integration of Histogram of Oriented Gradients (HOG) yielded substantial improvements across architectures, notably propelling the accuracy of EfficientNet-B7 from 92% to an impressive 97%. Conversely, the application of Local Binary Patterns (LBP) demonstrated more conservative performance enhancements. Moreover, employing Gradient-weighted Class Activation Mapping (Grad-CAM) unveiled that HOG integration resulted in heightened attention to disease-specific features, corroborating the performance enhancements observed. Visual representations further validated HOG's notable influence, showcasing a discernible surge in accuracy across epochs due to focused attention on disease-affected regions. These results underscore the pivotal role of feature extraction, particularly HOG, in refining representations and bolstering classification accuracy. The study's significant highlight was the achievement of 97% accuracy with EfficientNet-B7 employing HOG and Grad-CAM, a noteworthy advancement in optimizing pre-trained CNN-based rice disease identification systems. The findings advocate for the strategic integration of advanced feature extraction techniques with cutting-edge pre-trained CNN architectures, presenting a promising avenue for substantially augmenting the precision and effectiveness of image-based disease classification systems in agricultural contexts.
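
The abstract does not include an implementation; the sketch below only illustrates the general recipe of fusing HOG descriptors with embeddings from a pre-trained CNN before a classifier head. The backbone (ResNet-50 as a stand-in), input size, and HOG cell sizes are assumptions, not the paper's settings.

```python
import numpy as np
import torch
from torchvision import models, transforms
from skimage.color import rgb2gray
from skimage.feature import hog

# Pre-trained CNN used purely as a feature extractor (classifier head removed).
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
cnn.fc = torch.nn.Identity()   # keep the 2048-d pooled features
cnn.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), antialias=True),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def combined_features(image_rgb):
    # image_rgb: HxWx3 uint8 array of a (consistently resized) leaf photograph
    hog_feat = hog(rgb2gray(image_rgb), orientations=9,
                   pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    with torch.no_grad():
        cnn_feat = cnn(preprocess(image_rgb).unsqueeze(0))[0].numpy()
    # Concatenated descriptor fed to a downstream classifier head.
    return np.concatenate([cnn_feat, hog_feat])
```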

[4]  arXiv:2405.00027 [pdf, other]
Title: Multidimensional Compressed Sensing for Spectral Light Field Imaging
Comments: 8 pages, published in the proceedings of VISAPP 2024
Journal-ref: In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP 2024, ISBN 978-989-758-679-8, ISSN 2184-4321, pages 349-356
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

This paper considers a compressive multi-spectral light field camera model that utilizes a one-hot spectral-coded mask and a microlens array to capture spatial, angular, and spectral information using a single monochrome sensor. We propose a model that employs compressed sensing techniques to reconstruct the complete multi-spectral light field from undersampled measurements. Unlike previous work where a light field is vectorized to a 1D signal, our method employs a 5D basis and a novel 5D measurement model, hence matching the intrinsic dimensionality of multispectral light fields. We mathematically and empirically show the equivalence of 5D and 1D sensing models, and most importantly that the 5D framework achieves orders of magnitude faster reconstruction while requiring a small fraction of the memory. Moreover, our new multidimensional sensing model opens new research directions for designing efficient visual data acquisition algorithms and hardware.

[5]  arXiv:2405.00029 [pdf, ps, other]
Title: Automatic Creative Selection with Cross-Modal Matching
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Application developers advertise their Apps by creating product pages with App images and bidding on search terms. It is therefore crucial for App images to be highly relevant to the search terms. Solutions to this problem require an image-text matching model to predict the quality of the match between the chosen image and the search terms. In this work, we present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model. We show that, compared to the CLIP model and a baseline that uses a Transformer model for search terms and a ResNet model for images, our approach significantly improves matching accuracy. We evaluate our approach using two sets of labels: advertiser-associated (image, search term) pairs for a given application, and human ratings of the relevance between (image, search term) pairs. Our approach achieves a 0.96 AUC score on the advertiser-associated ground truth, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. On the human-labeled ground truth, our approach achieves a 0.95 AUC score, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.

[6]  arXiv:2405.00031 [pdf, other]
Title: SegNet: A Segmented Deep Learning based Convolutional Neural Network Approach for Drones Wildfire Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

This research addresses the pressing challenge of improving processing times and detection capabilities in Unmanned Aerial Vehicle (UAV)/drone imagery for global wildfire detection, despite limited datasets. We propose a Segmented Neural Network (SegNet) selection approach that reduces the size of feature maps to improve both time resolution and accuracy, significantly advancing processing speed and accuracy in real-time wildfire detection. This paper contributes to increased processing speed enabling real-time wildfire detection, increased detection accuracy, and improved detection of early-stage wildfires, by proposing a new direction for image classification of amorphous objects such as fire, water, and smoke. We employ Convolutional Neural Networks (CNNs) for image classification and emphasize the reduction of irrelevant features, which is vital for deep learning pipelines, especially when processing live feed data for fire detection. Given the complexity of live feed data, our study highlights the urgency of enhancing real-time processing of image feeds. The proposed algorithm combats feature overload through segmentation, addressing challenges arising from diverse features such as objects, colors, and textures. Notably, a delicate balance between feature map size and dataset adequacy is pivotal: several prior studies use smaller image sizes, compromising feature richness, which necessitates a new approach. We highlight the critical role of pixel density in retaining essential details, especially for early wildfire detection, and by carefully selecting the number of filters during training we underscore the significance of higher pixel density for proper feature selection. The proposed SegNet approach is rigorously evaluated using a real-world dataset obtained from a drone flight and compared to the state-of-the-art literature.

[7]  arXiv:2405.00117 [pdf, ps, other]
Title: Training a high-performance retinal foundation model with half-the-data and 400 times less compute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, the data-efficient DERETFound was proposed, which provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster and computes vector embeddings 2.7 times faster, which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.

[8]  arXiv:2405.00156 [pdf, other]
Title: Expanding the Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray Classification
Comments: 11 pages, 13 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)

Quantum machine learning (QML) has the potential for improving the multi-label classification of rare, albeit critical, diseases in large-scale chest x-ray (CXR) datasets due to theoretical quantum advantages over classical machine learning (CML) in sample efficiency and generalizability. While prior literature has explored QML with CXRs, it has focused on binary classification tasks with small datasets due to limited access to quantum hardware and computationally expensive simulations. To that end, we implemented a Jax-based framework that enables the simulation of medium-sized qubit architectures with significant improvements in wall-clock time over current software offerings. We evaluated the efficiency and performance of our Jax-based framework for hybrid quantum transfer learning on long-tailed classification across 8, 14, and 19 disease labels using large-scale CXR datasets. The Jax-based framework resulted in up to a 58% and 95% speed-up compared to PyTorch and TensorFlow implementations, respectively. However, compared to CML, QML demonstrated slower convergence and an average AUROC of 0.70, 0.73, and 0.74 for the classification of 8, 14, and 19 CXR disease labels. In comparison, the CML models had an average AUROC of 0.77, 0.78, and 0.80, respectively. In conclusion, our work presents an accessible implementation of hybrid quantum transfer learning for long-tailed CXR classification with a computationally efficient Jax-based framework.

[9]  arXiv:2405.00168 [pdf, other]
Title: Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Method
Subjects: Computer Vision and Pattern Recognition (cs.CV)

RGBT tracking draws increasing attention due to its robustness in multi-modality warranting (MMW) scenarios, such as nighttime and bad weather, where relying on a single sensing modality fails to ensure stable tracking results. However, the existing benchmarks predominantly consist of videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This makes the data unrepresentative of severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark, MV-RGBT, captured specifically in MMW scenarios. In contrast with the existing datasets, MV-RGBT comprises more object categories and scenes, providing a diverse and challenging benchmark. Furthermore, for severe imaging conditions of MMW scenarios, a new problem is posed, namely "when to fuse", to stimulate the development of fusion strategies for such data. We propose a new method based on a mixture of experts, namely MoETrack, as a baseline fusion strategy. In MoETrack, each expert generates independent tracking results along with the corresponding confidence score, which is used to control the fusion process. Extensive experimental results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and elicit the conclusion that fusion is not always beneficial, especially in MMW scenarios. Notably, the proposed MoETrack method achieves new state-of-the-art results not only on MV-RGBT, but also on standard benchmarks, such as RGBT234, LasHeR, and the short-term split of VTUAV (VTUAV-ST). More information on MV-RGBT and the source code of MoETrack will be released at https://github.com/Zhangyong-Tang/MoETrack.

[10]  arXiv:2405.00181 [pdf, other]
Title: Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
Comments: Codebase: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization, we focus on a more practical setting, prompting us to raise the following crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly, including 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of an anomaly, and 3) free text reflecting the effect of the abnormality. In addition, we also introduce MMEval, a novel evaluation metric designed to better align with human preferences for CUVA, facilitating the assessment of how well existing LLMs comprehend the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at https://github.com/fesvhtr/CUVA.

[11]  arXiv:2405.00187 [pdf, other]
Title: Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer
Comments: ICDAR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques like non-maximal suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, eliminating the need for NMS and emphasizing object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a novel method for precise alignment between object queries and target features. Our approach demonstrates remarkable reductions in false positives and substantial enhancements in table detection performance, particularly in complex documents characterized by diverse table structures. This work provides more efficient and accurate table detection in semi-supervised settings.

[12]  arXiv:2405.00196 [pdf, other]
Title: Synthetic Image Verification in the Era of Generative AI: What Works and What Isn't There Yet
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this work we present an overview of approaches for the detection and attribution of synthetic images and highlight their strengths and weaknesses. We also point out and discuss hot topics in this field and outline promising directions for future research.

[13]  arXiv:2405.00228 [pdf, other]
Title: Synthetic Face Datasets Generation via Latent Space Exploration from Brownian Identity Diffusion
Comments: 17 pages, 7 figures, 10 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face Recognition (FR) models are trained on large-scale datasets, which raise privacy and ethical concerns. Lately, the use of synthetic data to complement or replace genuine data for the training of FR models has been proposed. While promising results have been obtained, it still remains unclear if generative models can yield diverse enough data for such tasks. In this work, we introduce a new method, inspired by the physical motion of soft particles subjected to stochastic Brownian forces, allowing us to sample identity distributions in a latent space under various constraints. With this in hand, we generate several face datasets and benchmark them by training FR models, showing that data generated with our method exceeds the performance of previous GAN-based datasets and achieves performance competitive with state-of-the-art diffusion-based synthetic datasets. We also show that this method can be used to mitigate leakage from the generator's training set and explore the ability of generative models to generate data beyond it.
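
For intuition only, the toy sketch below takes Brownian-style steps in a latent space while pushing apart identities that come too close; the actual dynamics, constraints, and latent space used in the paper are not reproduced here.

```python
import numpy as np

def sample_identities(num_ids, dim, steps=100, step_size=0.05, min_dist=1.0, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(num_ids, dim))                  # initial identity latents
    for _ in range(steps):
        z += step_size * rng.normal(size=z.shape)        # Brownian perturbation
        diff = z[:, None, :] - z[None, :, :]             # pairwise differences
        dist = np.linalg.norm(diff, axis=-1) + np.eye(num_ids)
        too_close = dist < min_dist                      # violated separation
        push = (diff / dist[..., None]) * too_close[..., None]
        z += step_size * push.sum(axis=1)                # repel crowded identities
    return z

latents = sample_identities(num_ids=8, dim=512)
```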

[14]  arXiv:2405.00242 [pdf, other]
Title: Guiding Attention in End-to-End Driving Models
Comments: Accepted for publication at the 35th IEEE Intelligent Vehicles Symposium (IV 2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these well-performing models usually requires a huge amount of data, and they still lack explicit and intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps, by adding a loss term during training that uses salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available at test time, nor does it require modifying the architecture of the model to which it is applied. We perform tests using both perfect and noisy salient semantic maps, with encouraging results in both cases; the noisy maps are inspired by possible errors encountered with real data. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.
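
A minimal sketch of one way such a training-time loss term could look, assuming the model exposes a spatial activation map and a salient semantic mask is available for each training frame; this illustrates the idea, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(activation_map, salient_mask):
    # activation_map: (B, H, W) unnormalized spatial activations from the model
    # salient_mask:   (B, H, W) binary or soft mask of task-relevant pixels
    log_attn = activation_map.flatten(1).log_softmax(dim=-1)
    target = salient_mask.flatten(1) + 1e-6
    target = target / target.sum(dim=-1, keepdim=True)
    # KL divergence pulls the model's spatial attention toward salient regions.
    return F.kl_div(log_attn, target, reduction="batchmean")

# total_loss = imitation_loss + lambda_att * attention_guidance_loss(act, mask)
```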

[15]  arXiv:2405.00244 [pdf, other]
Title: Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network
Comments: This paper has been accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As an important and practical way to obtain high dynamic range (HDR) video, HDR video reconstruction from sequences with alternating exposures is still less explored, mainly due to the lack of large-scale real-world datasets. Existing methods are mostly trained on synthetic datasets and thus perform poorly in real scenes. In this work, to facilitate the development of real-world HDR video reconstruction, we present Real-HDRV, a large-scale real-world benchmark dataset for HDR video reconstruction, featuring various scenes, diverse motion patterns, and high-quality labels. Specifically, our dataset contains 500 LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, covering daytime, nighttime, indoor, and outdoor scenes. To the best of our knowledge, our dataset is the largest real-world HDR video reconstruction dataset. Correspondingly, we propose an end-to-end network for HDR video reconstruction, where a novel two-stage strategy is designed to perform alignment sequentially. Specifically, the first stage performs global alignment with the adaptively estimated global offsets, reducing the difficulty of subsequent alignment. The second stage implicitly performs local alignment in a coarse-to-fine manner at the feature level using the adaptive separable convolution. Extensive experiments demonstrate that: (1) models trained on our dataset can achieve better performance on real scenes than those trained on synthetic datasets; (2) our method outperforms previous state-of-the-art methods. Our dataset is available at https://github.com/yungsyu99/Real-HDRV.

[16]  arXiv:2405.00250 [pdf, other]
Title: SemVecNet: Generalizable Vector Map Generation for Arbitrary Sensor Configurations
Comments: 8 pages, 6 figures, Accepted to IV 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Vector maps are essential in autonomous driving for tasks like localization and planning, yet their creation and maintenance are notably costly. While recent advances in online vector map generation for autonomous vehicles are promising, current models lack adaptability to different sensor configurations. They tend to overfit to specific sensor poses, leading to decreased performance and higher retraining costs. This limitation hampers their practical use in real-world applications. In response to this challenge, we propose a modular pipeline for vector map generation with improved generalization to sensor configurations. The pipeline leverages probabilistic semantic mapping to generate a bird's-eye-view (BEV) semantic map as an intermediate representation. This intermediate representation is then converted to a vector map using the MapTRv2 decoder. By adopting a BEV semantic map robust to different sensor configurations, our proposed approach significantly improves the generalization performance. We evaluate the model on datasets with sensor configurations not used during training. Our evaluation sets include larger public datasets and smaller-scale private data collected on our platform. Our model generalizes significantly better than the state-of-the-art methods.

[17]  arXiv:2405.00251 [pdf, other]
Title: Semantically Consistent Video Inpainting with Conditional Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.

[18]  arXiv:2405.00256 [pdf, other]
Title: ASAM: Boosting Segment Anything Model with Adversarial Tuning
Authors: Bo Li, Haoke Xiao, Lv Tang
Comments: This paper is accepted by CVPR2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the evolving landscape of computer vision, foundation models have emerged as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. Among these, the Segment Anything Model (SAM) by Meta AI has distinguished itself in image segmentation. However, SAM, like its counterparts, encounters limitations in specific niche applications, prompting a quest for enhancement strategies that do not compromise its inherent capabilities. This paper introduces ASAM, a novel methodology that amplifies SAM's performance through adversarial tuning. We harness the potential of natural adversarial examples, inspired by their successful implementation in natural language processing. By utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B dataset, generating adversarial instances that are representative of natural variations rather than conventional imperceptible perturbations. Our approach maintains the photorealism of adversarial examples and ensures alignment with original mask annotations, thereby preserving the integrity of the segmentation task. The fine-tuned ASAM demonstrates significant improvements across a diverse range of segmentation tasks without necessitating additional data or architectural modifications. The results of our extensive evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, thereby contributing to the advancement of foundational models in computer vision. Our project page is at https://asam2024.github.io/.

[19]  arXiv:2405.00260 [pdf, other]
Title: CREPE: Coordinate-Aware End-to-End Document Parser
Comments: Accepted at the International Conference on Document Analysis and Recognition (ICDAR 2024) main conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on the multi-head architecture. Named Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly-supervised framework for cost-efficient training, requiring only parsing annotations without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE's state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful use in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE's abilities, including OCR and semantic parsing, not only mitigate the error propagation issues of existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.

[20]  arXiv:2405.00264 [pdf, other]
Title: Using Texture to Classify Forests Separately from Vegetation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Identifying terrain within satellite image data is a key issue in geographical information sciences, with numerous environmental and safety implications. Many techniques exist to derive classifications from spectral data captured by satellites. However, the ability to reliably classify vegetation remains a challenge. In particular, no precise methods exist for classifying forest vs. non-forest vegetation in high-level satellite images. This paper provides an initial proposal for a static, algorithmic process to identify forest regions in satellite image data through texture features created from detected edges and the NDVI ratio captured by Sentinel-2 satellite images. With strong initial results, this paper also identifies the next steps to improve the accuracy of the classification and verification processes.
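
For reference, the NDVI mentioned above is the standard normalized difference of near-infrared and red reflectance; a minimal sketch with toy values follows (Sentinel-2 band 8 for NIR and band 4 for red; the vegetation threshold is illustrative, and the paper's forest/non-forest split additionally relies on edge-based texture features not reproduced here).

```python
import numpy as np

def ndvi(red, nir, eps=1e-8):
    red, nir = red.astype(np.float32), nir.astype(np.float32)
    return (nir - red) / (nir + red + eps)

red_band = np.array([[0.08, 0.30], [0.10, 0.28]])  # toy red reflectance
nir_band = np.array([[0.45, 0.32], [0.50, 0.30]])  # toy NIR reflectance
vegetation_mask = ndvi(red_band, nir_band) > 0.3   # likely vegetation pixels
```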

[21]  arXiv:2405.00293 [pdf, other]
Title: MoPEFT: A Mixture-of-PEFTs for the Segment Anything Model
Comments: Workshop on Foundation Models, CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The emergence of foundation models, such as the Segment Anything Model (SAM), has sparked interest in Parameter-Efficient Fine-Tuning (PEFT) methods that tailor these large models to application domains outside their training data. However, different PEFT techniques modify the representation of a model differently, making it a non-trivial task to select the most appropriate method for the domain of interest. We propose a new framework, Mixture-of-PEFTs methods (MoPEFT), that is inspired by traditional Mixture-of-Experts (MoE) methodologies and is utilized for fine-tuning SAM. Our MoPEFT framework incorporates three different PEFT techniques as submodules and dynamically learns to activate the ones that are best suited for a given data-task setup. We test our method on the Segment Anything Model and show that MoPEFT consistently outperforms other fine-tuning methods on the MESS benchmark.
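
A hedged sketch of what a learned gate over PEFT submodules could look like, in the spirit of a Mixture-of-Experts router; the module granularity, gating input, and shapes are assumptions rather than the MoPEFT specification.

```python
import torch
import torch.nn as nn

class MixtureOfPEFT(nn.Module):
    def __init__(self, peft_modules, hidden_dim):
        super().__init__()
        # peft_modules: e.g. LoRA-, adapter-, and prompt-style add-ons
        self.peft_modules = nn.ModuleList(peft_modules)
        self.gate = nn.Linear(hidden_dim, len(peft_modules))

    def forward(self, x):
        # x: (B, N, hidden_dim) token features from the frozen backbone
        weights = self.gate(x.mean(dim=1)).softmax(dim=-1)               # (B, E)
        outputs = torch.stack([m(x) for m in self.peft_modules], dim=1)  # (B, E, N, D)
        return (weights[:, :, None, None] * outputs).sum(dim=1)          # (B, N, D)
```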

[22]  arXiv:2405.00313 [pdf, other]
Title: Streamlining Image Editing with Layered Diffusion Brushes
Comments: arXiv admin note: text overlap with arXiv:2306.00219
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Denoising diffusion models have recently gained prominence as powerful tools for a variety of image generation and manipulation tasks. Building on this, we propose a novel tool for real-time editing of images that provides users with fine-grained region-targeted supervision in addition to existing prompt-based controls. Our novel editing technique, termed Layered Diffusion Brushes, leverages prompt-guided and region-targeted alteration of intermediate denoising steps, enabling precise modifications while maintaining the integrity and context of the input image. We provide an editor based on Layered Diffusion Brushes modifications, which incorporates well-known image editing concepts such as layer masks, visibility toggles, and independent manipulation of layers, regardless of their order. Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU, enabling real-time feedback and rapid exploration of candidate edits. We validated our method and editing system through a user study involving both natural images (using inversion) and generated images, showcasing its usability and effectiveness compared to existing techniques such as InstructPix2Pix and Stable Diffusion Inpainting for refining images. Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation, demonstrating its versatility and potential for enhancing creative workflows.

[23]  arXiv:2405.00340 [pdf, other]
Title: NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

State-of-the-art neural implicit surface representations have achieved impressive results in indoor scene reconstruction by incorporating monocular geometric priors as additional supervision. However, we have observed that multi-view inconsistency between such priors poses a challenge for high-quality reconstructions. In response, we present NC-SDF, a neural signed distance field (SDF) 3D reconstruction framework with view-dependent normal compensation (NC). Specifically, we integrate view-dependent biases in monocular normal priors into the neural implicit representation of the scene. By adaptively learning and correcting the biases, our NC-SDF effectively mitigates the adverse impact of inconsistent supervision, enhancing both the global consistency and local details in the reconstructions. To further refine the details, we introduce an informative pixel sampling strategy to pay more attention to intricate geometry with higher information content. Additionally, we design a hybrid geometry modeling approach to improve the neural implicit representation. Experiments on synthetic and real-world datasets demonstrate that NC-SDF outperforms existing approaches in terms of reconstruction quality.

[24]  arXiv:2405.00354 [pdf, other]
Title: CrossMatch: Enhance Semi-Supervised Medical Image Segmentation with Perturbation Strategies and Knowledge Distillation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Semi-supervised learning for medical image segmentation presents a unique challenge of efficiently using limited labeled data while leveraging abundant unlabeled data. Despite advancements, existing methods often do not fully exploit the potential of the unlabeled data for enhancing model robustness and accuracy. In this paper, we introduce CrossMatch, a novel framework that integrates knowledge distillation with dual perturbation strategies (image-level and feature-level) to improve the model's learning from both labeled and unlabeled data. CrossMatch employs multiple encoders and decoders to generate diverse data streams, which undergo self-knowledge distillation to enhance consistency and reliability of predictions across varied perturbations. Our method significantly surpasses other state-of-the-art techniques in standard benchmarks by effectively minimizing the gap between training on labeled and unlabeled data and improving edge accuracy and generalization in medical image segmentation. The efficacy of CrossMatch is demonstrated through extensive experimental validations, showing remarkable performance improvements without increasing computational costs. Code for this implementation is made available at https://github.com/AiEson/CrossMatch.git.

[25]  arXiv:2405.00355 [pdf, other]
Title: Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and conventional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models utilizing transformer architectures in various tasks, including zero-shot and few-shot learning, the deepfake detection community has still shown some reluctance to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data, and the resulting suboptimal generalization when training or fine-tuning data is small or less diverse. This compares unfavorably with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making this accessible primarily to large companies and hindering broader investigation within the academic community. Recent advancements in using self-supervised learning (SSL) in transformers, such as DINO and its derivatives, have showcased significant adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and implementing partial fine-tuning, we observe comparable adaptability to the task and the natural explainability of the detection result via the attention mechanism. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly fewer computational resources.

[26]  arXiv:2405.00378 [pdf, other]
Title: Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation
Comments: Accepted to CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Consistency learning is a central strategy for exploiting unlabeled data in semi-supervised medical image segmentation (SSMIS); it requires the model to produce consistent predictions under perturbation. However, most current approaches rely on a single specific perturbation, which can only cope with limited cases, while it is hard to guarantee the quality of consistency learning when multiple perturbations are applied simultaneously. In this paper, we propose an Adaptive Bidirectional Displacement (ABD) approach to address this challenge. Specifically, we first design a bidirectional patch displacement based on reliable prediction confidence for unlabeled data to generate new samples, which can effectively suppress uncontrollable regions while still retaining the influence of input perturbations. Meanwhile, to force the model to learn the potentially uncontrollable content, a bidirectional displacement operation with inverse confidence is proposed for the labeled images, which generates samples with more unreliable information to facilitate model learning. Extensive experiments show that ABD achieves new state-of-the-art performance for SSMIS, significantly improving over different baselines. Source code is available at https://github.com/chy-upc/ABD.

[27]  arXiv:2405.00384 [pdf, other]
Title: Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol
Comments: Accepted for publication, 3rd ACM Int. Workshop on Multimedia AI against Disinformation (MAD'24) at ACM ICMR'24, June 10, 2024, Phuket, Thailand. This is the "accepted version"
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier, to compare with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we can detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising outcomes in audio-visual discrepancy detection, highlighting its potential in content verification applications.
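
A minimal sketch of the discrepancy-detection step described above, assuming the scene classifier yields class-probability vectors for each modality; the agreement threshold is illustrative.

```python
import numpy as np

def detect_discrepancy(audio_probs, visual_probs, threshold=0.5):
    # audio_probs, visual_probs: (num_classes,) softmax outputs per modality
    if np.argmax(audio_probs) != np.argmax(visual_probs):
        return True   # the two modalities predict different scene classes
    # Also flag low overlap between the two probability distributions.
    agreement = float(np.minimum(audio_probs, visual_probs).sum())
    return agreement < threshold
```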

[28]  arXiv:2405.00420 [pdf, other]
Title: Self-supervised Pre-training of Text Recognizers
Comments: 18 pages, 6 figures, 4 tables, accepted to ICDAR24
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent a model collapse in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as a strong baseline. The evaluation shows that the self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies to explore self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining.

[29]  arXiv:2405.00431 [pdf, other]
Title: Detail-Enhancing Framework for Reference-Based Image Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent years have witnessed the prosperity of reference-based image super-resolution (Ref-SR). By importing the high-resolution (HR) reference images into the single image super-resolution (SISR) approach, the ill-posed nature of this long-standing field has been alleviated with the assistance of texture transferred from reference images. Although the significant improvement in quantitative and qualitative results has verified the superiority of Ref-SR methods, the presence of misalignment before texture transfer indicates room for further performance improvement. Existing methods tend to neglect the significance of details in the context of comparison and therefore do not fully leverage the information contained within low-resolution (LR) images. In this paper, we propose a Detail-Enhancing Framework (DEF) for reference-based super-resolution, which introduces the diffusion model to generate and enhance the underlying detail in LR images. If corresponding parts are present in the reference image, our method can facilitate rigorous alignment. In cases where the reference image lacks corresponding parts, it ensures a fundamental improvement while avoiding the influence of the reference image. Extensive experiments demonstrate that our proposed method achieves superior visual results while maintaining comparable numerical outcomes.

[30]  arXiv:2405.00448 [pdf, other]
Title: MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework, which can generate high-quality compositional try-on results by taking as inputs a text instruction and multiple garment images. Our MMTryon mainly addresses two problems overlooked in prior literature: 1) Support for multiple try-on items and dressing styles. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out, etc.). 2) Segmentation dependency. Existing methods further heavily rely on category-specific segmentation models to identify the replacement regions, with segmentation errors directly leading to significant artifacts in the try-on results. For the first issue, MMTryon introduces a novel multi-modality and multi-reference attention mechanism to combine the garment information from reference images with the dressing-style information from text instructions. To remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and leverages a novel scalable data generation pipeline to convert existing VITON datasets into a form that allows MMTryon to be trained without requiring any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. Moreover, MMTryon's impressive performance in multi-item and style-controllable virtual try-on scenarios, together with its ability to try on any outfit in a large variety of scenarios from any source image, opens up a new avenue for future investigation in the fashion community.

[31]  arXiv:2405.00452 [pdf, other]
Title: Predictive Accuracy-Based Active Learning for Medical Image Segmentation
Comments: 9 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Active learning is considered a viable solution to alleviate the contradiction between the high dependency of deep learning-based segmentation methods on annotated data and the expensive pixel-level annotation cost of medical images. However, most existing methods suffer from unreliable uncertainty assessment and the struggle to balance diversity and informativeness, leading to poor performance in segmentation tasks. In response, we propose an efficient Predictive Accuracy-based Active Learning (PAAL) method for medical image segmentation, first introducing predictive accuracy to define uncertainty. Specifically, PAAL mainly consists of an Accuracy Predictor (AP) and a Weighted Polling Strategy (WPS). The former is an attached learnable module that can accurately predict the segmentation accuracy of unlabeled samples relative to the target model with the predicted posterior probability. The latter provides an efficient hybrid querying scheme by combining predicted accuracy and feature representation, aiming to ensure the uncertainty and diversity of the acquired samples. Extensive experiment results on multiple datasets demonstrate the superiority of PAAL. PAAL achieves comparable accuracy to fully annotated data while reducing annotation costs by approximately 50% to 80%, showcasing significant potential in clinical applications. The code is available at https://github.com/shijun18/PAAL-MedSeg.
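
As an illustration of a hybrid querying scheme that combines predicted accuracy with feature diversity (a sketch in the spirit of the description above, not the exact Weighted Polling Strategy), one could cluster the unlabeled pool and pick the least-trusted sample per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_samples(pred_accuracy, features, budget):
    # pred_accuracy: (N,) accuracy predicted for each unlabeled sample
    # features:      (N, D) feature representations of the same samples
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(features)
    selected = []
    for c in range(budget):
        idx = np.where(clusters == c)[0]
        # within each cluster, query the sample the predictor trusts least
        selected.append(int(idx[np.argmin(pred_accuracy[idx])]))
    return selected
```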

[32]  arXiv:2405.00466 [pdf, other]
Title: Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Foundational generative models should be traceable to protect their owners and facilitate safety regulation. To achieve this, traditional approaches embed identifiers based on supervisory trigger-response signals, which are commonly known as backdoor watermarks. They are prone to failure when the model is fine-tuned with non-trigger data. Our experiments show that this vulnerability is due to energetic changes in only a few 'busy' layers during fine-tuning. This observation motivates a novel arbitrary-in-arbitrary-out (AIAO) strategy that makes watermarks resilient to fine-tuning-based removal. The trigger-response pairs of AIAO samples across various neural network depths can be used to construct watermarked subpaths, employing Monte Carlo sampling to achieve stable verification results. In addition, unlike existing methods that design a backdoor for the input/output space of diffusion models, we propose to embed the backdoor into the feature space of sampled subpaths, where a mask-controlled trigger function is proposed to preserve the generation performance and ensure the invisibility of the embedded backdoor. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO; while the verification rates of other trigger-based methods fall from ~90% to ~70% after fine-tuning, those of our method remain consistently above 90%.

[33]  arXiv:2405.00468 [pdf, other]
Title: Feature-Aware Noise Contrastive Learning For Unsupervised Red Panda Re-Identification
Comments: 7 pages, 5 figures, IJCNN2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

To facilitate the re-identification (Re-ID) of individual animals, existing methods primarily focus on maximizing feature similarity within the same individual and enhancing distinctiveness between different individuals. However, most of them still rely on supervised learning and require substantial labeled data, which is challenging to obtain. To avoid this issue, we propose a Feature-Aware Noise Contrastive Learning (FANCL) method to explore an unsupervised learning solution, which is then validated on the task of red panda re-ID. FANCL employs a Feature-Aware Noise Addition module to produce noised images that conceal critical features and designs two contrastive learning modules to calculate the losses. Firstly, a feature consistency module is designed to bridge the gap between the original and noised features. Secondly, the neural networks are trained through a cluster contrastive learning module. Through these more challenging learning tasks, FANCL can adaptively extract deeper representations of red pandas. The experimental results on a set of red panda images collected in both indoor and outdoor environments prove that FANCL outperforms several related state-of-the-art unsupervised methods, achieving high performance comparable to supervised learning methods.

[34]  arXiv:2405.00479 [pdf, other]
Title: Enhanced Visual Question Answering: A Comparative Analysis and Textual Feature Extraction Via Convolutions
Authors: Zhilin Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, attracting increasing research efforts aiming to enhance VQA accuracy through the deployment of advanced models such as Transformers. Despite this growing interest, there has been limited exploration into the comparative analysis and impact of textual modalities within VQA, particularly in terms of model complexity and its effect on performance. In this work, we conduct a comprehensive comparison between complex textual models that leverage long dependency mechanisms and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we introduce an improved model, ConvGRU, which incorporates convolutional layers to enhance the representation of question text. Tested on the VQA-v2 dataset, ConvGRU achieves better performance without substantially increasing parameter complexity.

[35]  arXiv:2405.00483 [pdf, other]
Title: In Anticipation of Perfect Deepfake: Identity-anchored Artifact-agnostic Detection under Rebalanced Deepfake Detection Protocol
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

As deep generative models advance, we anticipate deepfakes achieving "perfection": generating no discernible artifacts or noise. However, current deepfake detectors, intentionally or inadvertently, rely on such artifacts for detection, as they are exclusive to deepfakes and absent in genuine examples. To bridge this gap, we introduce the Rebalanced Deepfake Detection Protocol (RDDP) to stress-test detectors under balanced scenarios where genuine and forged examples bear similar artifacts. We offer two RDDP variants: RDDP-WHITEHAT uses white-hat deepfake algorithms to create 'self-deepfakes', genuine portrait videos that preserve the resemblance of the underlying identity yet carry artifacts similar to those of deepfake videos; RDDP-SURROGATE employs surrogate functions (e.g., Gaussian noise) to process both genuine and forged examples, introducing equivalent noise, thereby sidestepping the need for deepfake algorithms.
Towards detecting perfect deepfake videos that align with genuine ones, we present ID-Miner, a detector that identifies the puppeteer behind the disguise by focusing on motion over artifacts or appearances. As an identity-based detector, it authenticates videos by comparing them with reference footage. Equipped with the artifact-agnostic loss at frame-level and the identity-anchored loss at video-level, ID-Miner effectively singles out identity signals amidst distracting variations. Extensive experiments comparing ID-Miner with 12 baseline detectors under both conventional and RDDP evaluations with two deepfake datasets, along with additional qualitative studies, affirm the superiority of our method and the necessity for detectors designed to counter perfect deepfakes.

[36]  arXiv:2405.00485 [pdf, other]
Title: The Pyramid of Captions
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that detailed examination of local patches can reduce error risks and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions, maintaining brevity and interpretability.

[37]  arXiv:2405.00507 [pdf, other]
Title: NeRF-Guided Unsupervised Learning of RGB-D Registration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper focuses on training a robust RGB-D registration model without ground-truth pose supervision. Existing methods usually adopt a pairwise training strategy based on differentiable rendering, which enforces the photometric and the geometric consistency between the two registered frames as supervision. However, this frame-to-frame framework suffers from poor multi-view consistency due to factors such as lighting changes, geometric occlusion and reflective materials. In this paper, we present NeRF-UR, a novel frame-to-model optimization framework for unsupervised RGB-D registration. Instead of frame-to-frame consistency, we leverage the neural radiance field (NeRF) as a global model of the scene and use the consistency between the input and the NeRF-rerendered frames for pose optimization. This design can significantly improve the robustness in scenarios with poor multi-view consistency and provides a better learning signal for the registration model. Furthermore, to bootstrap the NeRF optimization, we create a synthetic dataset, Sim-RGBD, through a photo-realistic simulator to warm up the registration model. By first training the registration model on Sim-RGBD and later fine-tuning it on real data without supervision, our framework enables distilling the capability of feature extraction and registration from simulation to reality. Our method outperforms the state-of-the-art counterparts on two popular indoor RGB-D datasets, ScanNet and 3DMatch. Code and models will be released for paper reproduction.

[38]  arXiv:2405.00514 [pdf, other]
Title: Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image-level regression is an important task in Earth observation, where visual domain and label shifts are a core challenge hampering generalization. However, cross-domain regression with remote sensing data remains understudied due to the absence of suitable datasets. We introduce a new dataset with aerial and satellite imagery in five countries with three forest-related regression tasks. To match real-world application settings, we compare methods in a restrictive setup where no prior knowledge of the target domain is available during training and models are adapted with limited information during testing. Building on the assumption that ordered relationships generalize better, we propose manifold diffusion for regression as a strong baseline for transduction in low-data regimes. Our comparison highlights the comparative advantages of inductive and transductive methods in cross-domain regression.
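
As a rough illustration of transductive diffusion for regression (a generic label-propagation sketch, not the authors' exact formulation), known target values can be propagated over a k-NN graph built on image embeddings so that unlabeled target-domain samples inherit values from similar neighbours; `alpha`, `k`, and the iteration count are illustrative hyperparameters.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def diffuse_regression(features, y_labeled, labeled_idx, k=10, alpha=0.9, iters=50):
    """Propagate regression targets over a symmetrically normalized k-NN graph."""
    n = len(features)
    W = kneighbors_graph(features, k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                                   # symmetrize the graph
    d = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W.multiply(D_inv_sqrt[:, None]).multiply(D_inv_sqrt[None, :])
    y0 = np.zeros(n)
    y0[labeled_idx] = y_labeled                           # seed with known values
    y = y0.copy()
    for _ in range(iters):
        y = alpha * S.dot(y) + (1 - alpha) * y0           # diffusion update
    return y
```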

[39]  arXiv:2405.00571 [pdf, other]
Title: Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption describing desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
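
Slerp itself is a standard operation; the sketch below shows how an intermediate query embedding could be formed from precomputed image and caption embeddings (the interpolation weight `t=0.5` and the gallery names are illustrative assumptions, not the paper's settings).

```python
import torch
import torch.nn.functional as F

def slerp(v0, v1, t):
    """Spherical linear interpolation between two unit-normalized embeddings.
    t=0 returns the first embedding, t=1 the second; intermediate t gives a
    composed query usable for zero-shot composed image retrieval."""
    v0, v1 = F.normalize(v0, dim=-1), F.normalize(v1, dim=-1)
    cos = (v0 * v1).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# Hypothetical usage with precomputed CLIP-style features:
# query = slerp(image_embedding, caption_embedding, t=0.5)
# scores = query @ gallery_embeddings.T   # rank gallery images by similarity
```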

[40]  arXiv:2405.00574 [pdf, other]
Title: EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain: 1) previous studies have focused on emotion analysis in short video sequences while overlooking long sequences. However, emotions in short sequences reflect only instantaneous states, which may be deliberately guided or hidden, whereas long sequences can reveal authentic emotions; 2) previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, we construct EALD, a dataset for Emotion Analysis in Long-sequential and De-identity videos, by collecting and processing sequences of athletes' post-match interviews. In addition to annotations of the overall emotional state of each video, we also provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue for understanding emotional states. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable or even better performance than supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long-sequence emotion analysis. EALD will be available on an open-source platform.

[41]  arXiv:2405.00587 [pdf, other]
Title: GraCo: Granularity-Controllable Interactive Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant results. In this work, we introduce Granularity-Controllable Interactive Segmentation (GraCo), a novel approach that allows precise control of prediction granularity by introducing additional parameters to the input. This enhances the customization of the interactive system and eliminates redundancy while resolving ambiguity. Nevertheless, the exorbitant cost of annotating multi-granularity masks and the lack of available datasets with granularity annotations make it difficult for models to acquire the necessary guidance to control output granularity. To address this problem, we design an any-granularity mask generator that exploits the semantic property of the pre-trained IS model to automatically generate abundant mask-granularity pairs without requiring additional manual annotation. Based on these pairs, we propose a granularity-controllable learning strategy that efficiently imparts granularity controllability to the IS model. Extensive experiments on intricate scenarios at object and part levels demonstrate that our GraCo has significant advantages over previous methods. This highlights the potential of GraCo to be a flexible annotation tool, capable of adapting to diverse segmentation scenarios. Project page: https://zhao-yian.github.io/GraCo.

[42]  arXiv:2405.00620 [pdf, other]
Title: Lane Segmentation Refinement with Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The lane graph is a key component for building high-definition (HD) maps and crucial for downstream tasks such as autonomous driving or navigation planning. Previously, He et al. (2022) explored the extraction of the lane-level graph from aerial imagery utilizing a segmentation-based approach. However, segmentation networks struggle to achieve perfect segmentation masks, resulting in inaccurate lane graph extraction. We explore additional enhancements to refine this segmentation-based approach and extend it with a diffusion probabilistic model (DPM) component. This combination further improves the GEO F1 and TOPO F1 scores, which are crucial indicators of lane graph quality, on the undirected graph in non-intersection areas. We conduct experiments on a publicly available dataset, demonstrating that our method outperforms the previous approach, particularly in enhancing the connectivity of such a graph, as measured by the TOPO F1 score. Moreover, we perform ablation studies on the individual components of our method to understand their contribution and evaluate their effectiveness.

[43]  arXiv:2405.00630 [pdf, other]
Title: Depth Priors in Removal Neural Radiance Fields
Authors: Zhihao Guo, Peng Wang
Comments: 15 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and generating novel views. A key challenge within NeRF is the editing of reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or sparse depth measurements provided by COLMAP, to improve the performance of object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models to significantly reduce time consumption and enhance the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to a ground truth depth map where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into the removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
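
Since monocular depth is defined only up to scale and shift, a common preprocessing step before injecting it into a NeRF pipeline is to align it to whatever sparse metric depth is available (e.g. COLMAP points). The helper below is a hedged, generic least-squares alignment sketch, not the paper's specific procedure.

```python
import numpy as np

def align_depth(mono_depth, sparse_depth, mask):
    """Fit a global scale and shift so the monocular depth map agrees with the
    sparse metric depth at the pixels where sparse depth is valid (mask == True)."""
    m = mono_depth[mask].ravel()
    s = sparse_depth[mask].ravel()
    A = np.stack([m, np.ones_like(m)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, s, rcond=None)
    return scale * mono_depth + shift
```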

[44]  arXiv:2405.00631 [pdf, other]
Title: Deep Metric Learning-Based Out-of-Distribution Detection with Synthetic Outlier Exposure
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a novel approach that combines deep metric learning and synthetic data generation using diffusion models for out-of-distribution (OOD) detection. One popular approach for OOD detection is outlier exposure, where models are trained using a mixture of in-distribution (ID) samples and "seen" OOD samples. For the OOD samples, the model is trained to minimize the KL divergence between the output probability and the uniform distribution while correctly classifying the in-distribution (ID) data. We propose a label-mixup approach to generate synthetic OOD data using Denoising Diffusion Probabilistic Models (DDPMs). Additionally, we explore recent advancements in metric learning to train our models.
In the experiments, we found that metric learning-based loss functions perform better than the softmax baseline. Furthermore, the baseline models (including softmax and metric learning) show a significant improvement when trained with the generated OOD data. Our approach outperforms strong baselines on conventional OOD detection metrics.
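
The outlier-exposure objective described above (cross-entropy on ID data plus KL divergence to the uniform distribution on OOD data) can be sketched directly; the 0.5 weighting below is illustrative, not the paper's value.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_id, labels_id, logits_ood):
    """Classify in-distribution samples normally and push predictions on
    (synthetic) OOD samples towards the uniform distribution."""
    ce = F.cross_entropy(logits_id, labels_id)
    log_p_ood = F.log_softmax(logits_ood, dim=-1)
    uniform = torch.full_like(log_p_ood, 1.0 / logits_ood.size(-1))
    kl = F.kl_div(log_p_ood, uniform, reduction="batchmean")
    return ce + 0.5 * kl   # illustrative weighting of the OOD term
```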

[45]  arXiv:2405.00646 [pdf, other]
Title: Learning to Compose: Improving Object Centric Learning by Injecting Compositionality
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Learning compositional representations is a key aspect of object-centric learning, as it enables flexible systematic generalization and supports complex visual reasoning. However, most existing approaches rely on an auto-encoding objective, while compositionality is implicitly imposed by architectural or algorithmic bias in the encoder. This misalignment between the auto-encoding objective and learning compositionality often results in failure to capture meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon an existing object-centric learning framework (e.g., slot attention), our method incorporates an additional constraint that an arbitrary mixture of object representations from two images should be valid, enforced by maximizing the likelihood of the composite data. We demonstrate that incorporating our objective into the existing framework consistently improves object-centric learning and enhances robustness to architectural choices.

[46]  arXiv:2405.00650 [pdf, other]
Title: Grains of Saliency: Optimizing Saliency-based Training of Biometric Attack Detection Models
Comments: 10 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Incorporating human perceptual intelligence into model training has been shown to increase the generalization capability of models in several difficult biometric tasks, such as presentation attack detection (PAD) and detection of synthetic samples. After the initial collection phase, human visual saliency (e.g., eye-tracking data or handwritten annotations) can be integrated into model training through attention mechanisms, augmented training samples, or human perception-related components of loss functions. Despite these successes, a vital but seemingly neglected aspect of any saliency-based training is the level of saliency granularity (e.g., bounding boxes, single saliency maps, or saliency aggregated from multiple subjects) necessary to balance the full benefits of human saliency against the cost of its collection. In this paper, we explore several levels of saliency granularity and demonstrate that increased generalization capabilities of PAD and synthetic face detection can be achieved by using simple yet effective saliency post-processing techniques across several different CNNs.

[47]  arXiv:2405.00666 [pdf, other]
Title: RGB$\leftrightarrow$X: Image decomposition and synthesis using material- and lighting-aware diffusion models
Journal-ref: SIGGRAPH Conference Papers '24, July 27-August 1, 2024, Denver, CO, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB$\rightarrow$X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X$\rightarrow$RGB, can also be addressed in a diffusion framework.
Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB$\rightarrow$X, which also estimates lighting, as well as the first diffusion X$\rightarrow$RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X$\rightarrow$RGB model explores a middle ground between traditional rendering and generative models: we can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest.
This flexibility makes it possible to use a mix of heterogeneous training datasets, which differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.

[48]  arXiv:2405.00670 [pdf, other]
Title: Adapting Pretrained Networks for Image Quality Assessment on High Dynamic Range Displays
Comments: 7 pages, 3 figures, 3 tables. Submitted to Human Vision and Electronic Imaging 2024 (HVEI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Conventional image quality metrics (IQMs), such as PSNR and SSIM, are designed for perceptually uniform gamma-encoded pixel values and cannot be directly applied to perceptually non-uniform linear high-dynamic-range (HDR) colors. Similarly, most of the available datasets consist of standard-dynamic-range (SDR) images collected in standard and possibly uncontrolled viewing conditions. Popular pre-trained neural networks are likewise intended for SDR inputs, restricting their direct application to HDR content. On the other hand, training HDR models from scratch is challenging due to limited available HDR data. In this work, we explore more effective approaches for training deep learning-based models for image quality assessment (IQA) on HDR data. We leverage networks pre-trained on SDR data (source domain) and re-target these models to HDR (target domain) with additional fine-tuning and domain adaptation. We validate our methods on the available HDR IQA datasets, demonstrating that models trained with our combined recipe outperform previous baselines, converge much quicker, and reliably generalize to HDR inputs.

[49]  arXiv:2405.00676 [pdf, other]
Title: Spectrally Pruned Gaussian Fields with Neural Compensation
Comments: Code: this https URL Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, 3D Gaussian Splatting, as a novel 3D representation, has garnered attention for its fast rendering speed and high rendering quality. However, this comes with high memory consumption; e.g., a well-trained Gaussian field may utilize three million Gaussian primitives and over 700 MB of memory. We attribute this high memory footprint to the lack of consideration of the relationships between primitives. In this paper, we propose a memory-efficient Gaussian field named SUNDAE with spectral pruning and neural compensation. On one hand, we construct a graph on the set of Gaussian primitives to model their relationships and design a spectral down-sampling module to prune out primitives while preserving desired signals. On the other hand, to compensate for the quality loss of pruning Gaussians, we exploit a lightweight neural network head to mix splatted features, which effectively recovers quality while capturing the relationships between primitives in its weights. We demonstrate the performance of SUNDAE with extensive results. For example, on the Mip-NeRF360 dataset, SUNDAE achieves 26.80 PSNR at 145 FPS using 104 MB of memory, while the vanilla Gaussian splatting algorithm achieves 25.60 PSNR at 160 FPS using 523 MB of memory. Code is publicly available at https://runyiyang.github.io/projects/SUNDAE/.

Cross-lists for Thu, 2 May 24

[50]  arXiv:2405.00130 (cross-list from eess.IV) [pdf, other]
Title: A Flexible 2.5D Medical Image Segmentation Approach with In-Slice and Cross-Slice Attention
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Deep learning has become the de facto method for medical image segmentation, with 3D segmentation models excelling in capturing complex 3D structures and 2D models offering high computational efficiency. However, segmenting 2.5D images, which have high in-plane but low through-plane resolution, is a relatively unexplored challenge. While applying 2D models to individual slices of a 2.5D image is feasible, it fails to capture the spatial relationships between slices. On the other hand, 3D models face challenges such as resolution inconsistencies in 2.5D images, along with computational complexity and susceptibility to overfitting when trained with limited data. In this context, 2.5D models, which capture inter-slice correlations using only 2D neural networks, emerge as a promising solution due to their reduced computational demand and simplicity in implementation. In this paper, we introduce CSA-Net, a flexible 2.5D segmentation model capable of processing 2.5D images with an arbitrary number of slices through an innovative Cross-Slice Attention (CSA) module. This module uses the cross-slice attention mechanism to effectively capture 3D spatial information by learning long-range dependencies between the center slice (for segmentation) and its neighboring slices. Moreover, CSA-Net utilizes the self-attention mechanism to understand correlations among pixels within the center slice. We evaluated CSA-Net on three 2.5D segmentation tasks: (1) multi-class brain MRI segmentation, (2) binary prostate MRI segmentation, and (3) multi-class prostate MRI segmentation. CSA-Net outperformed leading 2D and 2.5D segmentation methods across all three tasks, demonstrating its efficacy and superiority. Our code is publicly available at https://github.com/mirthAI/CSA-Net.
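
An illustrative (not the authors') cross-slice attention block can be expressed with standard multi-head attention: tokens from the center slice act as queries, and tokens gathered from neighboring slices act as keys and values, injecting 3D context into an otherwise 2D pipeline. Dimensions and head counts below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class CrossSliceAttention(nn.Module):
    """Center-slice tokens attend to tokens from neighboring slices."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, center_tokens, neighbor_tokens):
        # center_tokens: (B, N, C); neighbor_tokens: (B, S*N, C) from S neighboring slices
        attended, _ = self.attn(center_tokens, neighbor_tokens, neighbor_tokens)
        return self.norm(center_tokens + attended)   # residual connection
```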

[51]  arXiv:2405.00142 (cross-list from cs.LG) [pdf, other]
Title: Utilizing Machine Learning and 3D Neuroimaging to Predict Hearing Loss: A Comparative Analysis of Dimensionality Reduction and Regression Techniques
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

In this project, we have explored machine learning approaches for predicting hearing loss thresholds on the brain's gray matter 3D images. We have solved the problem statement in two phases. In the first phase, we used a 3D CNN model to reduce high-dimensional input into latent space and decode it into an original image to represent the input in rich feature space. In the second phase, we utilized this model to reduce input into rich features and used these features to train standard machine learning models for predicting hearing thresholds. We have experimented with autoencoders and variational autoencoders in the first phase for dimensionality reduction and explored random forest, XGBoost and multi-layer perceptron for regressing the thresholds. We split the given data set into training and testing sets and achieved an 8.80 range and 22.57 range for PT500 and PT4000 on the test set, respectively. We got the lowest RMSE using multi-layer perceptron among the other models.
Our approach leverages the unique capabilities of VAEs to capture complex, non-linear relationships within high-dimensional neuroimaging data. We rigorously evaluated the models using various metrics, focusing on the root mean squared error (RMSE). The results highlight the efficacy of the multi-layer neural network model, which outperformed other techniques in terms of accuracy. This project advances the application of data mining in medical diagnostics and enhances our understanding of age-related hearing loss through innovative machine-learning frameworks.

[52]  arXiv:2405.00145 (cross-list from cs.SE) [pdf, other]
Title: GUing: A Mobile GUI Search Engine using a Vision-Language Model
Subjects: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not curated by app developers and often lack important app features, e.g. UI pages that require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots, selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This results in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip on other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
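
For context, Recall@10 in CLIP-style retrieval is typically computed as below; this is a generic evaluation sketch that assumes precomputed, L2-normalized text and screenshot embeddings, not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(text_emb, screen_emb, gt_index, k=10):
    """A query counts as a hit if its ground-truth screenshot appears among the
    top-k gallery items ranked by cosine similarity."""
    sims = text_emb @ screen_emb.T                      # (num_queries, num_screens)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(gt_index, topk)]
    return float(np.mean(hits))
```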

[53]  arXiv:2405.00236 (cross-list from cs.RO) [pdf, other]
Title: STT: Stateful Tracking with Transformers for Autonomous Driving
Comments: ICRA 2024
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their states such as velocity and acceleration in the present. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through long term history of detections and is jointly optimized for both data association and state estimation tasks. Since the standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks in the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.

[54]  arXiv:2405.00239 (cross-list from eess.IV) [pdf, other]
Title: IgCONDA-PET: Implicitly-Guided Counterfactual Diffusion for Detecting Anomalies in PET Images
Comments: 12 pages, 6 figures, 1 table
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Minimizing the need for pixel-level annotated data for training PET anomaly segmentation networks is crucial, particularly due to time and cost constraints related to expert annotations. Current un-/weakly-supervised anomaly detection methods rely on autoencoders or generative adversarial networks trained only on healthy data, although these are more challenging to train. In this work, we present a weakly supervised and Implicitly guided COuNterfactual diffusion model for Detecting Anomalies in PET images, branded as IgCONDA-PET. The training is conditioned on image class labels (healthy vs. unhealthy) along with implicit guidance to generate counterfactuals for an unhealthy image with anomalies. The counterfactual generation process synthesizes the healthy counterpart for a given unhealthy image, and the difference between the two facilitates the identification of anomaly locations. The code is available at: https://github.com/igcondapet/IgCONDA-PET.git
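
The detection step described above reduces to comparing an unhealthy image with its synthesized healthy counterfactual; a minimal sketch (the optional smoothing and normalization are illustrative choices, not the paper's) is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def anomaly_map(unhealthy_img, healthy_counterfactual, smooth=None):
    """Candidate anomaly locations are where the unhealthy image and its
    healthy counterfactual disagree most."""
    diff = np.abs(unhealthy_img.astype(np.float32) - healthy_counterfactual.astype(np.float32))
    if smooth is not None:
        diff = gaussian_filter(diff, sigma=smooth)     # optional noise suppression
    return diff / (diff.max() + 1e-8)                  # normalize to [0, 1]
```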

[55]  arXiv:2405.00314 (cross-list from cs.LG) [pdf, other]
Title: Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a promising alternative to convolutional neural networks (CNNs) in several vision-related applications. However, their large model sizes and high computational and memory demands hinder deployment, especially on resource-constrained devices. This underscores the necessity of algorithm-hardware co-design specific to ViTs, aiming to optimize their performance by tailoring both the algorithmic structure and the underlying hardware accelerator to each other's strengths. Model quantization, by converting high-precision numbers to lower-precision ones, reduces the computational demands and memory needs of ViTs, allowing the creation of hardware specifically optimized for these quantized algorithms and boosting efficiency. This article provides a comprehensive survey of ViT quantization and its hardware acceleration. We first delve into the unique architectural attributes of ViTs and their runtime characteristics. Subsequently, we examine the fundamental principles of model quantization, followed by a comparative analysis of the state-of-the-art quantization techniques for ViTs. Additionally, we explore the hardware acceleration of quantized ViTs, highlighting the importance of hardware-friendly algorithm design. Finally, this article discusses ongoing challenges and future research directions. We consistently maintain the related open-source materials at https://github.com/DD-DuDa/awesome-vit-quantization-acceleration.
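
As a point of reference for the techniques surveyed, the simplest form of post-training quantization (symmetric per-tensor int8) can be written in a few lines; this is a generic sketch, not a method from the survey.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale   # approximate reconstruction used at inference

# Hypothetical usage on a ViT layer:
# w_q, s = quantize_int8(vit_layer.weight.data)
```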

[56]  arXiv:2405.00318 (cross-list from cs.NE) [pdf, other]
Title: Covariant spatio-temporal receptive fields for neuromorphic computing
Comments: Code available at this https URL
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Biological nervous systems constitute important sources of inspiration towards computers that are faster, cheaper, and more energy efficient. Neuromorphic disciplines view the brain as a coevolved system, simultaneously optimizing the hardware and the algorithms running on it. There are clear efficiency gains when bringing the computations into a physical substrate, but we presently lack theories to guide efficient implementations. Here, we present a principled computational model for neuromorphic systems in terms of spatio-temporal receptive fields, based on affine Gaussian kernels over space and leaky-integrator and leaky integrate-and-fire models over time. Our theory is provably covariant to spatial affine and temporal scaling transformations, and shows close similarities to visual processing in mammalian brains. We use these spatio-temporal receptive fields as a prior in an event-based vision task, and show that this improves the training of spiking networks, which is otherwise known to be problematic for event-based vision. This work combines efforts within scale-space theory and computational neuroscience to identify theoretically well-founded ways to process spatio-temporal signals in neuromorphic systems. Our contributions are immediately relevant for signal processing and event-based vision, and can be extended to other processing tasks over space and time, such as memory and control.

[57]  arXiv:2405.00351 (cross-list from cs.HC) [pdf, other]
Title: Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality
Comments: 11 pages
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user's ability to engage with objects of interest. In this paper, we present a novel system, called OmniVR, designed to enhance visual clarity during VR navigation. Our system enables users to effortlessly locate and zoom in on the objects of interest in VR. It captures user commands for navigation and zoom, converting these inputs into parameters for the Mobius transformation matrix. Leveraging these parameters, the ODI is refined using a learning-based algorithm. The resultant ODI is presented within the VR media, effectively reducing blur and increasing user engagement. To verify the effectiveness of our system, we first evaluate our algorithm with state-of-the-art methods on public datasets, which achieves the best performance. Furthermore, we undertake a comprehensive user study to evaluate viewer experiences across diverse scenarios and to gather their qualitative feedback from multiple perspectives. The outcomes reveal that our system enhances user engagement by improving the viewers' recognition, reducing discomfort, and improving the overall immersive experience. Our system makes the navigation and zoom more user-friendly.

[58]  arXiv:2405.00430 (cross-list from physics.med-ph) [pdf, ps, other]
Title: Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)

Background and purpose: Deformable image registration (DIR) is a crucial tool in radiotherapy for extracting and modelling organ motion. However, when significant changes and sliding boundaries are present, its accuracy becomes compromised and uncertain, affecting the subsequent contour propagation and dose accumulation procedures. Materials and methods: We propose an implicit neural representation (INR)-based approach that models motion continuously in both space and time, named Continuous sPatial-Temporal DIR (CPT-DIR). This method uses a multilayer perceptron (MLP) network to map a 3D coordinate (x,y,z) to its corresponding velocity vector (vx,vy,vz). The displacement vectors (dx,dy,dz) are then calculated by integrating the velocity vectors over time. The MLP's parameters can rapidly adapt to new cases without pre-training, enhancing optimisation. The DIR's performance was tested on the DIR-Lab dataset of 10 lung 4DCT cases, using metrics of landmark accuracy (TRE), contour conformity (Dice) and image similarity (MAE). Results: The proposed CPT-DIR reduces landmark TRE from 2.79mm to 0.99mm, outperforming B-splines' results for all cases. The MAE of the whole-body region improves from 35.46HU to 28.99HU. Furthermore, CPT-DIR surpasses B-splines in accuracy in the sliding boundary region, lowering MAE and increasing Dice coefficients for the ribcage from 65.65HU and 90.41% to 42.04HU and 90.56%, versus 75.40HU and 89.30% without registration. Meanwhile, CPT-DIR offers significant speed advantages, completing in under 15 seconds compared to a few minutes with the conventional B-splines method. Conclusion: Leveraging continuous representations, the CPT-DIR method significantly enhances registration accuracy, automation and speed, outperforming traditional B-splines in landmark and contour precision, particularly in challenging areas.
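
A toy sketch of the coordinate-to-velocity idea and its integration, under stated assumptions: the network width, activation, and simple forward-Euler integration below are illustrative placeholders rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """An MLP maps a 3D coordinate to a velocity vector."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz):
        return self.net(xyz)

def displacement(field, xyz, n_steps=10):
    """Displacement obtained by integrating the velocity field over time
    (forward Euler from t=0 to t=1)."""
    pos = xyz.clone()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        pos = pos + field(pos) * dt
    return pos - xyz
```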

[59]  arXiv:2405.00472 (cross-list from eess.IV) [pdf, other]
Title: DmADs-Net: Dense multiscale attention and depth-supervised network for medical image segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep learning has made important contributions to the development of medical image segmentation. Convolutional neural networks, as a crucial branch, have attracted strong attention from researchers. Through the tireless efforts of numerous researchers, convolutional neural networks have yielded numerous outstanding algorithms for processing medical images. The ideas and architectures of these algorithms have also provided important inspiration for the development of later technologies. Through extensive experimentation, we have found that currently mainstream deep learning algorithms are not always able to achieve ideal results when processing complex datasets of different types. These networks still have room for improvement in lesion localization and feature extraction. Therefore, we have created the Dense Multiscale Attention and Depth-Supervised Network (DmADs-Net). We use ResNet for feature extraction at different depths and create a Multi-scale Convolutional Feature Attention Block to improve the network's attention to weak feature information. The Local Feature Attention Block is created to enable enhanced local feature attention for high-level semantic information. In addition, in the feature fusion phase, a Feature Refinement and Fusion Block is created to enhance the fusion of different semantic information. We validated the performance of the network using five datasets of varying sizes and types. Results from comparative experiments show that DmADs-Net outperforms mainstream networks. Ablation experiments further demonstrate the effectiveness of the created modules and the rationality of the network architecture.

[60]  arXiv:2405.00515 (cross-list from cs.RO) [pdf, other]
Title: GAD-Generative Learning for HD Map-Free Autonomous Driving
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Deep-learning-based techniques have been widely adopted in mass-production autonomous driving software stacks in recent years, focusing primarily on perception modules, with some work extending this method to prediction modules. However, the downstream planning and control modules are still designed with hefty handcrafted rules, dominated by optimization-based methods such as quadratic programming or model predictive control. This results in a performance bottleneck for autonomous driving systems, in that corner cases simply cannot be solved by enumerating hand-crafted rules. We present a deep-learning-based approach that brings prediction, decision, and planning modules together in an attempt to overcome the deficiency of rule-based methods in real-world applications of autonomous driving, especially for urban scenes. The proposed DNN model is trained solely on 10 hours of human driver data, and it supports all mass-production ADAS features available on the market to date. This method is deployed onto a Jiyue test car with no modification to its factory-ready sensor set and compute platform. The feasibility, usability, and commercial potential are demonstrated in this article.

[61]  arXiv:2405.00542 (cross-list from eess.IV) [pdf, other]
Title: UWAFA-GAN: Ultra-Wide-Angle Fluorescein Angiography Transformation via Multi-scale Generation and Registration Enhancement
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Fundus photography, in combination with ultra-wide-angle fundus (UWF) techniques, has become an indispensable diagnostic tool in clinical settings by offering a more comprehensive view of the retina. Nonetheless, unlike UWF scanning laser ophthalmoscopy (UWF-SLO), UWF fluorescein angiography (UWF-FA) necessitates the administration of a fluorescent dye via injection into the patient's hand or elbow. To mitigate potential adverse effects associated with injections, researchers have proposed cross-modality medical image generation algorithms capable of converting UWF-SLO images into their UWF-FA counterparts. Current image generation techniques applied to fundus photography encounter difficulties in producing high-resolution retinal images, particularly in capturing minute vascular lesions. To address these issues, we introduce a novel conditional generative adversarial network (UWAFA-GAN) to synthesize UWF-FA from UWF-SLO. This approach employs multi-scale generators and an attention transmit module to efficiently extract both global structures and local lesions. Additionally, to counteract the image blurriness that arises from training with misaligned data, a registration module is integrated within this framework. Our method performs well in terms of inception score and detail generation. Clinical user studies further indicate that the UWF-FA images generated by UWAFA-GAN are clinically comparable to authentic images in terms of diagnostic reliability. Empirical evaluations on our proprietary UWF image datasets show that UWAFA-GAN outperforms existing methodologies. The code is accessible at https://github.com/Tinysqua/UWAFA-GAN.

[62]  arXiv:2405.00588 (cross-list from cs.CL) [pdf, other]
Title: Are Models Biased on Text without Gender-related Language?
Comments: In International Conference on Learning Representations 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine if a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval.

[63]  arXiv:2405.00604 (cross-list from cs.RO) [pdf, other]
Title: A Preprocessing and Evaluation Toolbox for Trajectory Prediction Research on the Drone Datasets
Comments: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

The availability of high-quality datasets is crucial for the development of behavior prediction algorithms in autonomous vehicles. This paper highlights the need for standardizing the use of certain datasets for motion forecasting research to simplify comparative analysis and proposes a set of tools and practices to achieve this. Drawing on extensive experience and a comprehensive review of current literature, we summarize our proposals for preprocessing, visualization, and evaluation in the form of an open-source toolbox designed for researchers working on trajectory prediction problems. The clear specification of necessary preprocessing steps and evaluation metrics is intended to alleviate development efforts and facilitate the comparison of results across different studies. The toolbox is available at: https://github.com/westny/dronalize.
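
For reference, two metrics commonly standardized in trajectory-prediction evaluation are average and final displacement error; the sketch below is a generic implementation, not necessarily the toolbox's own.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error between predicted and ground-truth
    trajectories. pred, gt: arrays of shape (num_agents, horizon, 2) in the same frame."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # (num_agents, horizon)
    ade = dists.mean()
    fde = dists[:, -1].mean()
    return ade, fde
```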

[64]  arXiv:2405.00672 (cross-list from cs.GR) [pdf, other]
Title: TexSliders: Diffusion-Based Texture Editing in CLIP Space
Comments: SIGGRAPH 2024 Conference Proceedings
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.

Replacements for Thu, 2 May 24

[65]  arXiv:2012.12437 (replaced) [pdf, other]
Title: Pit30M: A Benchmark for Global Localization in the Age of Self-Driving Cars
Comments: Published at IROS 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[66]  arXiv:2211.10307 (replaced) [pdf, other]
Title: SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification
Comments: The SeaTurtleID2022 dataset is the latest version of the SeaTurtleID dataset which was described in the previous versions of this arXiv submission. Notice the change of title in the latest version
Journal-ref: Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7146-7156
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[67]  arXiv:2212.06278 (replaced) [pdf, other]
Title: Efficient Bayesian Uncertainty Estimation for nnU-Net
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[68]  arXiv:2212.13253 (replaced) [pdf, other]
Title: DSI2I: Dense Style for Unpaired Image-to-Image Translation
Comments: To appear on TMLR '24, Reviewed on OpenReview: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[69]  arXiv:2301.09430 (replaced) [pdf, other]
Title: Rethinking Real-world Image Deraining via An Unpaired Degradation-Conditioned Diffusion Model
Comments: 18 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[70]  arXiv:2302.06358 (replaced) [pdf, other]
Title: Anticipating Next Active Objects for Egocentric Videos
Comments: Accepted by IEEE ACCESS, this paper carries the Manuscript DOI: 10.1109/ACCESS.2024.3395282. The complete peer-reviewed version is available via this DOI, while the arXiv version is a post-author manuscript without peer-review
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2305.12661 (replaced) [pdf, other]
Title: Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[72]  arXiv:2306.10274 (replaced) [pdf, other]
Title: Benchmarking Deep Learning Architectures for Urban Vegetation Point Cloud Semantic Segmentation from MLS
Comments: The paper has been accepted for publication in IEEE Transactions on Geoscience and Remote Sensing. DOI: 10.1109/TGRS.2024.3381976
Journal-ref: in IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-14, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[73]  arXiv:2308.00692 (replaced) [pdf, other]
Title: LISA: Reasoning Segmentation via Large Language Model
Comments: Code, models, and data are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[74]  arXiv:2308.03290 (replaced) [pdf, other]
Title: FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
Comments: Accepted to AutoML 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[75]  arXiv:2310.07355 (replaced) [pdf, other]
Title: IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training
Comments: Under Review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[76]  arXiv:2311.02189 (replaced) [pdf, other]
Title: FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling
Comments: ICLR 2024; Codes available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[77]  arXiv:2311.04811 (replaced) [pdf, other]
Title: Image-Based Virtual Try-On: A Survey
Comments: 30 pages, 18 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[78]  arXiv:2311.11210 (replaced) [pdf, other]
Title: HiH: A Multi-modal Hierarchy in Hierarchy Network for Unconstrained Gait Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[79]  arXiv:2311.13172 (replaced) [pdf, other]
Title: Learning to Complement with Multiple Humans
Comments: Under review
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[80]  arXiv:2312.06709 (replaced) [pdf, other]
Title: AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
Comments: CVPR 2024. Version 2: added more acknowledgements, updated table 7 with more recent results, and ensured that the code link in the abstract works. Version 3: CVPR camera-ready; reconfigured the full paper, made table 1 more comprehensive, and fixed broken hyperlinks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[81]  arXiv:2401.04801 (replaced) [pdf, other]
Title: Refining Remote Photoplethysmography Architectures using CKA and Empirical Methods
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[82]  arXiv:2401.16465 (replaced) [pdf, other]
Title: DressCode: Autoregressively Sewing and Generating Garments from Text Guidance
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[83]  arXiv:2402.00724 (replaced) [pdf, ps, other]
Title: Automatic Segmentation of the Spinal Cord Nerve Rootlets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[84]  arXiv:2402.04829 (replaced) [pdf, other]
Title: NeRF as a Non-Distant Environment Emitter in Physics-based Inverse Rendering
Comments: SIGGRAPH 2024. Project page and video: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[85]  arXiv:2402.07330 (replaced) [pdf, other]
Title: Expert-Adaptive Medical Image Segmentation
Authors: Binyan Hu, A. K. Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
[86]  arXiv:2403.00175 (replaced) [pdf, other]
Title: FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything
Comments: 14 pages, 9 figures, 1 table
Journal-ref: Sensors 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[87]  arXiv:2403.00209 (replaced) [pdf, other]
Title: ChartReformer: Natural Language-Driven Chart Image Editing
Comments: Published in ICDAR 2024. Code and model are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[88]  arXiv:2403.13315 (replaced) [pdf, other]
Title: PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[89]  arXiv:2404.00231 (replaced) [pdf, ps, other]
Title: Attention-based Shape-Deformation Networks for Artifact-Free Geometry Reconstruction of Lumbar Spine from MR Images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[90]  arXiv:2404.01094 (replaced) [pdf, other]
Title: HairFastGAN: Realistic and Robust Hair Transfer with a Fast Encoder-Based Approach
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[91]  arXiv:2404.02877 (replaced) [pdf, other]
Title: FlightScope: A Deep Comprehensive Assessment of Aircraft Detection Algorithms in Satellite Imagery
Comments: 15 figures, 4 tables, comprehensive survey, comparative study
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[92]  arXiv:2404.03443 (replaced) [pdf, ps, other]
Title: Part-Attention Based Model Make Occluded Person Re-Identification Stronger
Comments: Accepted By International Joint Conference on Neural Networks 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[93]  arXiv:2404.09378 (replaced) [pdf, other]
Title: Orientation-conditioned Facial Texture Mapping for Video-based Facial Remote Photoplethysmography Estimation
Comments: 12 pages, 8 figures, 6 tables; minor corrections
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[94]  arXiv:2404.15378 (replaced) [pdf, other]
Title: Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint Distributions
Authors: Khai Nguyen, Nhat Ho
Comments: 28 pages, 11 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Machine Learning (stat.ML)
[95]  arXiv:2404.15789 (replaced) [pdf, other]
Title: MotionMaster: Training-free Camera Motion Transfer For Video Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[96]  arXiv:2404.17335 (replaced) [pdf, other]
Title: A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation
Comments: 16 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[97]  arXiv:2404.17888 (replaced) [pdf, other]
Title: A Hybrid Approach for Document Layout Analysis in Document images
Comments: ICDAR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[98]  arXiv:2404.18253 (replaced) [pdf, other]
Title: Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Authors: Tengjun Huang
Comments: Accepted by the Twelfth International Conference on Learning Representations (ICLR) Workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[99]  arXiv:2404.18399 (replaced) [pdf, other]
Title: Semantic Line Combination Detector
Comments: CVPR 2024 accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[100]  arXiv:2404.19242 (replaced) [pdf, other]
Title: A Minimal Set of Parameters Based Depth-Dependent Distortion Model and Its Calibration Method for Stereo Vision Systems
Comments: This paper has been accepted for publication in IEEE Transactions on Instrumentation and Measurement
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Methodology (stat.ME)
[101]  arXiv:2404.19265 (replaced) [pdf, other]
Title: Mapping New Realities: Ground Truth Image Creation with Pix2Pix Image-to-Image Translation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[102]  arXiv:2404.19326 (replaced) [pdf, other]
Title: LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation
Comments: LVOS V2
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[103]  arXiv:2404.19706 (replaced) [pdf, other]
Title: RTG-SLAM: Real-time 3D Reconstruction at Scale using Gaussian Splatting
Comments: To be published in ACM SIGGRAPH 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[104]  arXiv:2209.05557 (replaced) [pdf, other]
Title: Blurring Diffusion Models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[105]  arXiv:2209.11200 (replaced) [pdf, other]
Title: Attention is All They Need: Exploring the Media Archaeology of the Computer Vision Research Paper
Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
[106]  arXiv:2211.05716 (replaced) [pdf, other]
Title: Resource-Aware Heterogeneous Federated Learning using Neural Architecture Search
Comments: Accepted at the 30th International European Conference on Parallel and Distributed Computing (Euro-Par 2024)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[107]  arXiv:2301.02608 (replaced) [pdf, other]
Title: An interpretable machine learning system for colorectal cancer diagnosis from pathology slides
Comments: Accepted at npj Precision Oncology. Available at: this https URL
Journal-ref: npj Precis. Onc. 8, 56 (2024)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[108]  arXiv:2306.02176 (replaced) [pdf, other]
Title: TransRUPNet for Improved Polyp Segmentation
Comments: Accepted at EMBC 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[109]  arXiv:2307.10182 (replaced) [pdf, other]
Title: Enhancing Super-Resolution Networks through Realistic Thick-Slice CT Simulation
Comments: 11 pages, 4 figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
[110]  arXiv:2307.15615 (replaced) [pdf, other]
Title: A survey on deep learning in medical image registration: new technologies, uncertainty, evaluation metrics, and beyond
Comments: A list of open-sourced code from the papers reviewed has been organized and is available at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[111]  arXiv:2308.13646 (replaced) [pdf, other]
Title: GRASP: A Rehearsal Policy for Efficient Online Continual Learning
Comments: Accepted to the Conference on Lifelong Learning Agents (CoLLAs) 2024
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[112]  arXiv:2310.12153 (replaced) [pdf, other]
Title: Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
Comments: Accepted at CVPR 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[113]  arXiv:2403.00549 (replaced) [pdf, other]
Title: Relaxometry Guided Quantitative Cardiac Magnetic Resonance Image Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[114]  arXiv:2403.05452 (replaced) [pdf, other]
Title: The R2D2 deep neural network series paradigm for fast precision imaging in radio astronomy
Comments: Accepted for publication in ApJS
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[115]  arXiv:2403.13890 (replaced) [pdf, other]
Title: Towards Learning Contrast Kinetics with Multi-Condition Latent Diffusion Models
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[116]  arXiv:2404.07356 (replaced) [pdf, other]
Title: GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data
Comments: Accepted to the 37th Canadian Artificial Intelligence Conference (2024), 12 pages, 4 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[117]  arXiv:2404.18416 (replaced) [pdf, other]