Computer Vision 0219

2021-03-02 arXiv Daily

Follow this account to receive automatic daily pushes of arXiv papers;
if you have any questions or suggestions, please leave a message on the public account.
[If you find this public account helpful, it is our greatest honor]

Today there are 47 papers in cs.CV.

Detection (5 papers)

[1]: Unbiased Teacher for Semi-Supervised Object Detection

Title: Unbiased Teacher for Semi-Supervised Object Detection

Authors: Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, Peter Vajda

Note: Accepted to ICLR 2021; code is available at this https URL

Link: https://arxiv.org/abs/2102.09480

Abstract: Semi-supervised learning, i.e., training networks with both labeled and unlabeled data, has made significant progress recently. However, existing works have primarily focused on image classification tasks and neglected object detection, which requires more annotation effort. In this work, we revisit Semi-Supervised Object Detection (SS-OD) and identify the pseudo-labeling bias issue in SS-OD. To address this, we introduce Unbiased Teacher, a simple yet effective approach that jointly trains a student and a gradually progressing teacher in a mutually beneficial manner. Together with a class-balance loss to downweight overly confident pseudo-labels, Unbiased Teacher consistently improves on state-of-the-art methods by significant margins on the COCO-standard, COCO-additional, and VOC datasets. Specifically, Unbiased Teacher achieves a 6.8 absolute mAP improvement over the state-of-the-art method when using 1% of labeled data on MS-COCO, and around a 10 mAP improvement over the supervised baseline when using only 0.5%, 1%, and 2% of labeled data on MS-COCO.
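The teacher-student mechanics described above reduce to a few lines of code. Below is a minimal sketch of the two key steps, not the authors' released implementation: the teacher's weights track the student's by exponential moving average, and only the teacher's confident detections survive as pseudo-labels (`teacher`, `student`, and the 0.7 threshold are illustrative assumptions).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, keep=0.999):
    # The teacher drifts slowly toward the student (exponential moving
    # average), which is what makes it a "gradually progressing" teacher.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(keep).add_(s_p, alpha=1.0 - keep)

def filter_pseudo_labels(boxes, scores, labels, thresh=0.7):
    # Keep only detections the teacher is confident about; these act as
    # pseudo ground truth for the student's loss on unlabeled images.
    keep = scores > thresh
    return boxes[keep], labels[keep]
```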

[2]: DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results

Title: DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results

Authors: Liming Jiang, Zhengkui Guo, Wayne Wu, Zhaoyang Liu, Ziwei Liu, Chen Change Loy, Shuo Yang, Yuanjun Xiong, Wei Xia, Baoying Chen, Peiyu Zhuang, Sili Li, Shen Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Liujuan Cao, Rongrong Ji, Changlei Lu, Ganchao Tan

Note: Technical report. Challenge website: this https URL

Link: https://arxiv.org/abs/2102.09471

Abstract: This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions.

[3]: Minimizing false negative rate in melanoma detection and providing insight into the causes of classification

Title: Minimizing false negative rate in melanoma detection and providing insight into the causes of classification

Authors: Ellák Somfai, Benjámin Baffy, Kristian Fenech, Changlu Guo, Rita Hosszú, Dorina Korózs, Marcell Pólik, Attila Ulbert, András Lőrincz

Note: Supplementary materials included

Link: https://arxiv.org/abs/2102.09199

Abstract: Our goal is to bridge human and machine intelligence in melanoma detection. We develop a classification system exploiting a combination of visual pre-processing, deep learning, and ensembling to provide explanations to experts and to minimize the false negative rate while maintaining high accuracy in melanoma detection. Source images are first automatically segmented using a U-Net CNN. The result of the segmentation is then used to extract image sub-areas and specific parameters relevant in human evaluation, namely center, border, and asymmetry measures. These data are then processed by tailored neural networks which include structure-searching algorithms. Partial results are then ensembled by a committee machine. Our evaluation on ISIC-2019, the largest publicly available skin lesion dataset today, shows improvement in all evaluated metrics over a baseline using the original images only. We also show that indicative scores computed by the feature classifiers can provide useful insight into the various features on which the decision can be based.

[4]: Densely Nested Top-Down Flows for Salient Object Detection

Title: Densely Nested Top-Down Flows for Salient Object Detection

Authors: Chaowei Fang, Haibin Tian, Dingwen Zhang, Qiang Zhang, Jungong Han, Junwei Han

Link: https://arxiv.org/abs/2102.09133

Abstract: With the goal of identifying pixel-wise salient object regions from each input image, salient object detection (SOD) has been receiving great attention in recent years. One mainstream family of SOD methods is formed by a bottom-up feature encoding procedure and a top-down information decoding procedure. While numerous approaches have explored the bottom-up feature extraction for this task, the design of top-down flows remains under-studied. To this end, this paper revisits the role of top-down modeling in salient object detection and designs a novel densely nested top-down flows (DNTDF)-based framework. In every stage of DNTDF, features from higher levels are read in via progressive compression shortcut paths (PCSP). The notable characteristics of our proposed method are as follows. 1) The propagation of high-level features, which usually carry relatively strong semantic information, is enhanced in the decoding procedure; 2) with the help of PCSP, the gradient vanishing issues caused by non-linear operations in top-down information flows can be alleviated; 3) thanks to the full exploration of high-level features, the decoding process of our method is relatively memory efficient compared with those of existing methods. Integrating DNTDF with EfficientNet, we construct a highly lightweight SOD model with very low computational complexity. To demonstrate the effectiveness of the proposed model, comprehensive experiments are conducted on six widely used benchmark datasets. The comparisons to state-of-the-art methods, as well as to carefully designed baseline models, verify our insights on top-down flow modeling for SOD. The code of this paper is available at this https URL.

[5]: Automated Detection of Equine Facial Action Units

Title: Automated Detection of Equine Facial Action Units

Authors: Zhenghong Li, Sofia Broomé, Pia Haubro Andersen, Hedvig Kjellström

Link: https://arxiv.org/abs/2102.08983

Abstract: The recently developed Equine Facial Action Coding System (EquiFACS) provides a precise and exhaustive, but laborious, manual labelling method for facial action units of the horse. To automate parts of this process, we propose a deep learning-based method to detect EquiFACS units automatically from images. We use a cascade framework: we first train several object detectors to detect the predefined regions of interest (ROI), and then apply binary classifiers for each action unit in the related regions. We experiment with both regular CNNs and a more tailored model transferred from human facial action unit recognition. Promising initial results are presented for nine action units in the eye and lower face regions.

Segmentation (4 papers)

[1]: SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering

Title: SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering

Authors: Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, Xiao-Ming Wu

Note: ISBI 2021

Link: https://arxiv.org/abs/2102.09542

Abstract: Medical visual question answering (Med-VQA) has tremendous potential in healthcare. However, the development of this technology is hindered by the lack of publicly available, high-quality labeled datasets for training and evaluation. In this paper, we present a large bilingual dataset, SLAKE, with comprehensive semantic labels annotated by experienced physicians and a new structural medical knowledge base for Med-VQA. In addition, SLAKE includes richer modalities and covers more human body parts than the currently available dataset. We show that SLAKE can be used to facilitate the development and evaluation of Med-VQA systems. The dataset can be downloaded from this http URL.

[2]: Image Compositing for Segmentation of Surgical Tools without Manual Annotations

Title: Image Compositing for Segmentation of Surgical Tools without Manual Annotations

Authors: Luis C. Garcia-Peraza-Herrera, Lucas Fidon, Claudia D'Ettorre, Danail Stoyanov, Tom Vercauteren, Sebastien Ourselin

Note: Accepted by IEEE TMI

Link: https://arxiv.org/abs/2102.09528

Abstract: Producing manual, pixel-accurate image segmentation labels is tedious and time-consuming. This is often a rate-limiting factor when large amounts of labeled images are required, such as for training deep convolutional networks for instrument-background segmentation in surgical scenes. No large datasets comparable to industry standards in the computer vision community are available for this task. To circumvent this problem, we propose to automate the creation of a realistic training dataset by exploiting techniques stemming from special effects and harnessing them to target training performance rather than visual appeal. Foreground data is captured by placing sample surgical instruments over a chroma key (a.k.a. green screen) in a controlled environment, thereby making extraction of the relevant image segment straightforward. Multiple lighting conditions and viewpoints can be captured and introduced in the simulation by moving the instruments and camera and modulating the light source. Background data is captured by collecting videos that do not contain instruments. In the absence of pre-existing instrument-free background videos, minimal labeling effort is required, just to select frames that do not contain surgical instruments from videos of surgical interventions freely available online. We compare different methods to blend instruments over tissue and propose a novel data augmentation approach that takes advantage of the plurality of options. We show that by training a vanilla U-Net on semi-synthetic data only and applying a simple post-processing, we are able to match the results of the same network trained on a publicly available manually labeled real dataset.
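To make the chroma-key step concrete, here is a minimal sketch assuming 8-bit RGB arrays (my illustration, not the paper's blending pipeline): a pixel counts as green screen when its green channel dominates red and blue by a margin, and the resulting mask serves both for compositing and as a free segmentation label. The `green_dominance` margin is an assumed value.

```python
import numpy as np

def chroma_key_composite(fg_rgb, bg_rgb, green_dominance=40):
    # A pixel is treated as green screen when G exceeds both R and B
    # by the margin (naive RGB rule for illustration only).
    fg = fg_rgb.astype(np.int32)
    screen = (fg[..., 1] - fg[..., [0, 2]].max(axis=-1)) > green_dominance
    alpha = (~screen).astype(np.float32)[..., None]   # 1 = instrument, 0 = screen
    # Blend the extracted instrument over a tissue background; the alpha mask
    # doubles as a pixel-accurate segmentation label, at zero labeling cost.
    out = alpha * fg_rgb + (1.0 - alpha) * bg_rgb
    return out.astype(np.uint8), alpha[..., 0]
```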

[3]: NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation

Title: NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation

Authors: Mohamed Amgad, Lamees A. Atteya, Hagar Hussein, Kareem Hosny Mohammed, Ehab Hafiz, Maha A.T. Elsebaie, Ahmed M. Alhusseiny, Mohamed Atef AlMoslemany, Abdelmagid M. Elmatboly, Philip A. Pappalardo, Rokia Adel Sakr, Pooya Mobadersany, Ahmad Rachid, Anas M. Saad, Ahmad M. Alkashash, Inas A. Ruhban, Anas Alrefai, Nada M. Elgazar, Ali Abdulkarim, Abo-Alela Farag, Amira Etman, Ahmed G. Elsaeed, Yahya Alagha, Yomna A. Amer, Ahmed M. Raslan, Menatalla K. Nadim, Mai A.T. Elsebaie, Ahmed Ayad, Liza E. Hanna, Ahmed Gadallah, Mohamed Elkady, Bradley Drumheller, David Jaye, David Manthey, David A. Gutman, Habiba Elfandy, Lee A.D. Cooper

Link: https://arxiv.org/abs/2102.09099

Abstract: High-resolution mapping of cells and tissue structures provides a foundation for developing interpretable machine-learning models for computational pathology. Deep learning algorithms can provide accurate mappings given large numbers of labeled instances for training and validation. Generating an adequate volume of quality labels has emerged as a critical barrier in computational pathology given the time and effort required from pathologists. In this paper, we describe an approach for engaging crowds of medical students and pathologists that was used to produce a dataset of over 220,000 annotations of cell nuclei in breast cancers. We show how suggested annotations generated by a weak algorithm can improve the accuracy of annotations generated by non-experts and can yield useful data for training segmentation algorithms without laborious manual tracing. We systematically examine interrater agreement and describe modifications to the Mask R-CNN model to improve cell mapping. We also describe a technique we call Decision Tree Approximation of Learned Embeddings (DTALE) that leverages nucleus segmentations and morphologic features to improve the transparency of nucleus classification models. The annotation data produced in this study are freely available for algorithm development and benchmarking at: this https URL.

[4]: BEDS: Bagging ensemble deep segmentation for nucleus segmentation with testing stage stain augmentation

Title: BEDS: Bagging ensemble deep segmentation for nucleus segmentation with testing stage stain augmentation

Authors: Xing Li, Haichun Yang, Jiaxin He, Aadarsh Jha, Agnes B. Fogo, Lee E. Wheless, Shilin Zhao, Yuankai Huo

Note: 4 pages, 5 figures, ISBI 2021

Link: https://arxiv.org/abs/2102.08990

Abstract: Reducing outcome variance is an essential task in deep learning based medical image analysis. Bootstrap aggregating, also known as bagging, is a canonical ensemble algorithm for aggregating weak learners into a strong learner. Random forest is one of the most powerful machine learning algorithms from before the deep learning era, whose superior performance is driven by fitting bagged decision trees (weak learners). Inspired by the random forest technique, we propose a simple bagging ensemble deep segmentation (BEDs) method that trains multiple U-Nets with partial training data to segment dense nuclei on pathological images. The contributions of this study are three-fold: (1) developing a self-ensemble learning framework for nucleus segmentation; (2) aggregating testing stage augmentation with self-ensemble learning; and (3) elucidating the idea that self-ensemble and testing stage stain augmentation are complementary strategies for superior segmentation performance. Implementation details: this https URL.
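A minimal sketch of the bagging logic, not the released code: draw k random subsets for training k U-Nets, then fuse their per-pixel foreground probabilities by majority vote. The 0.63 subset fraction and 0.5 thresholds are assumed values.

```python
import numpy as np

def bagged_subsets(n_images, k=5, frac=0.63, seed=0):
    # Index subsets for training k segmentation models on partial data.
    rng = np.random.default_rng(seed)
    return [rng.choice(n_images, int(frac * n_images), replace=False) for _ in range(k)]

def majority_vote(prob_maps, thresh=0.5):
    # Fuse k per-pixel foreground probability maps into one binary mask:
    # a pixel is foreground if more than half of the models say so.
    votes = (np.stack(prob_maps) > thresh).sum(axis=0)
    return votes > (len(prob_maps) / 2)
```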

Classification & Recognition (2 papers)

[1]: Deep Gait Recognition: A Survey

Title: Deep Gait Recognition: A Survey

Authors: Alireza Sepas-Moghaddam, Ali Etemad

Link: https://arxiv.org/abs/2102.09546

Abstract: Gait recognition is an appealing biometric modality which aims to identify individuals based on the way they walk. Deep learning has reshaped the research landscape in this area since 2015 through the ability to automatically learn discriminative representations. Gait recognition methods based on deep learning now dominate the state-of-the-art in the field and have fostered real-world applications. In this paper, we present a comprehensive overview of breakthroughs and recent developments in gait recognition with deep learning, and cover broad topics including datasets, test protocols, state-of-the-art solutions, challenges, and future research directions. We first review the commonly used gait datasets along with the principles designed for evaluating them. We then propose a novel taxonomy made up of four separate dimensions, namely body representation, temporal representation, feature representation, and neural architecture, to help characterize and organize the research landscape and literature in this area. Following our proposed taxonomy, a comprehensive survey of gait recognition methods using deep learning is presented with discussions on their performances, characteristics, advantages, and limitations. We conclude this survey with a discussion on current challenges and mention a number of promising directions for future research in gait recognition.

[2]: One-shot action recognition towards novel assistive therapies

Title: One-shot action recognition towards novel assistive therapies

Authors: Alberto Sabater, Laura Santos, Jose Santos-Victor, Alexandre Bernardino, Luis Montesano, Ana C. Murillo

Link: https://arxiv.org/abs/2102.08997

Abstract: One-shot action recognition is a challenging problem, especially when the target video can contain one, several, or no repetitions of the target action. Solutions to this problem can be used in many real-world applications that require automated processing of activity videos. In particular, this work is motivated by the automated analysis of medical therapies that involve action imitation games. The presented approach incorporates a pre-processing step that standardizes heterogeneous motion data conditions and generates descriptive movement representations with a Temporal Convolutional Network for a final one-shot (or few-shot) action recognition. Our method achieves state-of-the-art results on the public NTU-120 one-shot action recognition challenge. Besides, we evaluate the approach on a real use-case of automated video analysis for therapy support with autistic people. The promising results prove its suitability for this kind of in-the-wild application, providing both quantitative and qualitative measures that are essential for patient evaluation and monitoring.

Tracking (1 paper)

[1]: Spatio-Temporal Graph Dual-Attention Network for Multi-Agent Prediction and Tracking

Title: Spatio-Temporal Graph Dual-Attention Network for Multi-Agent Prediction and Tracking

Authors: Jiachen Li, Hengbo Ma, Zhihao Zhang, Jinning Li, Masayoshi Tomizuka

Note: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Link: https://arxiv.org/abs/2102.09117

Abstract: An effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are indispensable for intelligent mobile systems (e.g. autonomous vehicles and social robots) to achieve safe and high-quality planning when they navigate in highly interactive and crowded scenarios. Due to the existence of frequent interactions and uncertainty in the scene evolution, it is desirable for the prediction system to enable relational reasoning on different entities and provide a distribution of future trajectories for each agent. In this paper, we propose a generic generative neural system (called STG-DAT) for multi-agent trajectory prediction involving heterogeneous agents. The system takes a step forward to explicit interaction modeling by incorporating relational inductive biases with a dynamic graph representation and leverages both trajectory and scene context information. We also employ an efficient kinematic constraint layer applied to vehicle trajectory prediction. The constraint not only ensures physical feasibility but also enhances model performance. Moreover, the proposed prediction model can be easily adopted by multi-target tracking frameworks, and empirical results show that the tracking accuracy is improved. The proposed system is evaluated on three public benchmark datasets for trajectory prediction, where the agents cover pedestrians, cyclists and on-road vehicles. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction and tracking accuracy.

Human Pose Estimation & 6D Pose Estimation (1 paper)

[1]: StablePose: Learning 6D Object Poses from Geometrically Stable Patches

Title: StablePose: Learning 6D Object Poses from Geometrically Stable Patches

Authors: Junwen Huang, Yifei Shi, Xin Xu, Yifan Zhang, Kai Xu

Link: https://arxiv.org/abs/2102.09334

Abstract: We introduce the concept of geometric stability to the problem of 6D object pose estimation and propose to learn pose inference based on geometrically stable patches extracted from observed 3D point clouds. According to the theory of geometric stability analysis, a minimal set of three planar/cylindrical patches is geometrically stable and determines the full 6DoFs of the object pose. We train a deep neural network to regress 6D object pose based on geometrically stable patch groups via learning both intra-patch geometric features and inter-patch contextual features. A subnetwork is jointly trained to predict per-patch poses. This auxiliary task is a relaxation of the group pose prediction: a single patch cannot determine the full 6DoFs but is able to improve pose accuracy in its corresponding DoFs. Working with patch groups makes our method generalize well to random occlusion and unseen instances. The method is also easily amenable to resolving symmetry ambiguities. Our method achieves state-of-the-art results on public benchmarks compared not only to depth-only but also to RGBD methods. It also performs well in category-level pose estimation.

Weakly/Semi/Un-supervised Learning (1 paper)

[1]: CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning

Title: CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning

Authors: Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, Fan Yang

Link: https://arxiv.org/abs/2102.09559

Abstract: Semi-supervised learning on class-imbalanced data, although a realistic problem, has been understudied. While existing semi-supervised learning (SSL) methods are known to perform poorly on minority classes, we find that they still generate high-precision pseudo-labels on minority classes. By exploiting this property, in this work we propose Class-Rebalancing Self-Training (CReST), a simple yet effective framework to improve existing SSL methods on class-imbalanced data. CReST iteratively retrains a baseline SSL model with a labeled set expanded by adding pseudo-labeled samples from an unlabeled set, where pseudo-labeled samples from minority classes are selected more frequently according to an estimated class distribution. We also propose a progressive distribution alignment that adaptively adjusts the rebalancing strength, dubbed CReST+. We show that CReST and CReST+ improve state-of-the-art SSL algorithms on various class-imbalanced datasets and consistently outperform other popular rebalancing methods.
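The selection step can be sketched as follows, with the caveat that CReST derives its exact rates from the estimated class distribution; the power schedule below is one plausible instantiation, where rarer classes are kept with probability approaching 1 and `alpha` controls the rebalancing strength.

```python
import numpy as np

def crest_sampling_rates(class_counts, alpha=2.0):
    # Per-class inclusion rates for pseudo-labels: rarer classes get higher
    # rates (1.0 for the rarest). Illustrative schedule, not the paper's exact one.
    counts = np.asarray(class_counts, dtype=np.float64)
    return (counts.min() / counts) ** (1.0 / alpha)

def select_pseudo_labels(pred_labels, rates, seed=0):
    # Keep each pseudo-labeled sample with the rate of its predicted class;
    # the surviving indices are added to the labeled set for retraining.
    rng = np.random.default_rng(seed)
    keep = rng.random(len(pred_labels)) < rates[np.asarray(pred_labels)]
    return np.flatnonzero(keep)
```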

Reinforcement Learning (1 paper)

[1]: Multi-Agent Reinforcement Learning of 3D Furniture Layout Simulation in Indoor Graphics Scenes

Title: Multi-Agent Reinforcement Learning of 3D Furniture Layout Simulation in Indoor Graphics Scenes

Authors: Xinhan Di, Pengqian Yu

Note: 8 pages, 3 figures, submitted to a conference. arXiv admin note: substantial text overlap with arXiv:2101.07462

Link: https://arxiv.org/abs/2102.09137

Abstract: In the industrial interior design process, professional designers plan the furniture layout to achieve a satisfactory 3D design for selling. In this paper, we explore the interior graphics scene design task as a Markov decision process (MDP) in 3D simulation, which is solved by multi-agent reinforcement learning. The goal is to produce a furniture layout in the 3D simulation of indoor graphics scenes. In particular, we first transform the 3D interior graphic scenes into two 2D simulated scenes. We then design the simulated environment and apply two reinforcement learning agents to learn the optimal 3D layout for the MDP formulation in a cooperative way. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts than the state-of-the-art model. The developed simulator and codes are available at \url{this https URL}.

Zero/One-Shot, Transfer Learning & Domain Adaptation (3 papers)

[1]: Domain Adaptation for Medical Image Analysis: A Survey

Title: Domain Adaptation for Medical Image Analysis: A Survey

Authors: Hao Guan, Mingxia Liu

Note: 15 pages, 7 figures

Link: https://arxiv.org/abs/2102.09508

Abstract: Machine learning techniques used in computer-aided medical image analysis usually suffer from the domain shift problem caused by different distributions between source/reference data and target data. As a promising solution, domain adaptation has attracted considerable attention in recent years. The aim of this paper is to survey the recent advances of domain adaptation methods in medical image analysis. We first present the motivation for introducing domain adaptation techniques to tackle domain heterogeneity issues in medical image analysis. Then we provide a review of recent domain adaptation models in various medical image analysis tasks. We categorize the existing methods into shallow and deep models, and each of them is further divided into supervised, semi-supervised and unsupervised methods. We also provide a brief summary of the benchmark medical image datasets that support current domain adaptation research. This survey will enable researchers to gain a better understanding of the current status and challenges.

[2]: Domain Impression: A Source Data Free Domain Adaptation Method

Title: Domain Impression: A Source Data Free Domain Adaptation Method

Authors: Vinod K Kurmi, Venkatesh K Subramanian, Vinay P Namboodiri

Note: Published at WACV 2021

Link: https://arxiv.org/abs/2102.09003

Abstract: Unsupervised domain adaptation methods solve the adaptation problem for an unlabeled target set, assuming that the source dataset is available with all labels. However, the availability of actual source samples is not always possible in practical cases, due to memory constraints, privacy concerns, and challenges in sharing data. This practical scenario creates a bottleneck in the domain adaptation problem. This paper addresses this challenging scenario by proposing a domain adaptation technique that does not need any source data. Instead of the source data, we are only provided with a classifier that is trained on the source data. Our proposed approach is based on a generative framework, where the trained classifier is used for generating samples from the source classes. We learn the joint distribution of data by using the energy-based modeling of the trained classifier. At the same time, a new classifier is also adapted for the target domain. We perform various ablation analyses under different experimental setups and demonstrate that the proposed approach achieves better results than the baseline models in this extremely novel scenario.

[3]: DINO: A Conditional Energy-Based GAN for Domain Translation

Title: DINO: A Conditional Energy-Based GAN for Domain Translation

Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

Note: Accepted to ICLR 2021

Link: https://arxiv.org/abs/2102.09281

Abstract: Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics, since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new framework, where two networks are simultaneously trained, in a supervised manner, to perform domain translation in opposite directions. Our method is not only better at capturing the shared information between two domains but is more generic and can be applied to a broader range of problems. The proposed framework performs well even in challenging cross-modal translations, such as video-driven speech reconstruction, for which other systems struggle to maintain correspondence.

Temporal Action Detection & Video (1 paper)

[1]: Clockwork Variational Autoencoders for Video Prediction

Title: Clockwork Variational Autoencoders for Video Prediction

Authors: Vaibhav Saxena, Jimmy Ba, Danijar Hafner

Note: 17 pages, 12 figures, 4 tables

Link: https://arxiv.org/abs/2102.09532

Abstract: Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects.
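The clockwork schedule itself is compact. Below is a minimal sketch, assuming per-level transition functions (not the paper's model): level l updates only every k**l steps and otherwise keeps its state, so higher levels naturally capture slowly changing content. With k=8 and three levels, the top level ticks once every 64 frames.

```python
def clockwork_tick(states, update_fns, step, k=8):
    # states: list of per-level latent states, index 0 = fastest level.
    # update_fns[l]: callable (own_state, state_from_level_above) -> new_state.
    for l in reversed(range(len(states))):        # update slow levels first
        if step % (k ** l) == 0:                  # level l ticks every k**l steps
            above = states[l + 1] if l + 1 < len(states) else None
            states[l] = update_fns[l](states[l], above)
    return states
```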

Networks (5 papers)

[1]: Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Title: Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Authors: Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo Carrio, Davide Scaramuzza

Link: https://arxiv.org/abs/2102.09320

Abstract: Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. The two sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods, because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames, where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
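One way to picture the RAM design, as a hedged sketch rather than the paper's architecture: keep one recurrent cell per sensor, let each cell overwrite a shared hidden state whenever its measurement arrives (events or frames, at arbitrary times), and decode a prediction from that state whenever it is queried.

```python
import torch
import torch.nn as nn

class RAMState(nn.Module):
    # Minimal asynchronous multimodal recurrence: one GRU cell per sensor,
    # all writing to a shared hidden state as their measurements arrive.
    def __init__(self, feat_dims, hidden=128):
        super().__init__()
        self.cells = nn.ModuleDict(
            {name: nn.GRUCell(dim, hidden) for name, dim in feat_dims.items()}
        )
        self.h = torch.zeros(1, hidden)

    def observe(self, sensor, features):
        # Called whenever `sensor` (e.g. "events" or "frames") produces a
        # (1, feat_dim) feature vector, at irregular times.
        self.h = self.cells[sensor](features, self.h)

    def query(self, head):
        # A prediction (e.g. a depth decoder) can be generated at any time.
        return head(self.h)
```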

[2]: Robust PDF Document Conversion Using Recurrent Neural Networks

Title: Robust PDF Document Conversion Using Recurrent Neural Networks

Authors: Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

Note: 9 pages, 2 tables, 4 figures, uses aaai21.sty. Accepted at the "Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21)". Received the "IAAI-21 Innovative Application Award"

Link: https://arxiv.org/abs/2102.09395

Abstract: The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input to a neural network and how the network can learn to classify each printing command according to its structural function on the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels, as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally than visual methods, because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and is computationally less expensive than visual methods, which allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human-annotated ground-truth labels for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.

[3]: Enhanced Magnetic Resonance Image Synthesis with Contrast-Aware Generative Adversarial Networks

Title: Enhanced Magnetic Resonance Image Synthesis with Contrast-Aware Generative Adversarial Networks

Authors: Jonas Denck, Jens Guehring, Andreas Maier, Eva Rothgang

Link: https://arxiv.org/abs/2102.09386

Abstract: A Magnetic Resonance Imaging (MRI) exam typically consists of the acquisition of multiple MR pulse sequences, which are required for a reliable diagnosis. Each sequence can be parameterized through multiple acquisition parameters affecting MR image contrast, signal-to-noise ratio, resolution, or scan time. With the rise of generative deep learning models, approaches for the synthesis of MR images have been developed to either synthesize additional MR contrasts, generate synthetic data, or augment existing data for AI training. However, current generative approaches for the synthesis of MR images are only trained on images with a specific set of acquisition parameter values, limiting the clinical value of these methods, as various sets of acquisition parameter settings are used in clinical practice. Therefore, we trained a generative adversarial network (GAN) to generate synthetic MR knee images conditioned on various acquisition parameters (repetition time, echo time, image orientation). This approach enables us to synthesize MR images with adjustable image contrast. In a visual Turing test, two experts mislabeled 40.5% of real and synthetic MR images, demonstrating that the image quality of the generated synthetic images is comparable to that of real MR images. This work can support radiologists and technologists during the parameterization of MR sequences by previewing the yielded MR contrast, can serve as a valuable tool for radiology training, and can be used for customized data generation to support AI training.

[4]: NFCNN: Toward a Noise Fusion Convolutional Neural Network for Image Denoising

Title: NFCNN: Toward a Noise Fusion Convolutional Neural Network for Image Denoising

Authors: Maoyuan Xu, Xiaoping Xie

Link: https://arxiv.org/abs/2102.09376

Abstract: Deep learning based methods have achieved state-of-the-art performance in image denoising. In this paper, a deep learning based denoising method is proposed, introducing a module called the fusion block into the convolutional neural network. This so-called Noise Fusion Convolutional Neural Network (NFCNN) has two branches in its multi-stage architecture: one branch predicts the latent clean image, while the other predicts the residual image. A fusion block is placed between every two stages; it takes the predicted clean image and the predicted residual image as part of its inputs and outputs a fused result to the next stage. NFCNN has an attractive texture-preserving ability because of the fusion block. To train NFCNN, a stage-wise supervised training strategy is adopted to avoid the vanishing gradient and exploding gradient problems. Experimental results show that NFCNN achieves competitive denoising results when compared with some state-of-the-art algorithms.

[5]: GradFreeBits: Gradient Free Bit Allocation for Dynamic Low Precision Neural Networks

Title: GradFreeBits: Gradient Free Bit Allocation for Dynamic Low Precision Neural Networks

Authors: Benjamin J. Bodner, Gil Ben Shalom, Eran Treister

Link: https://arxiv.org/abs/2102.09298

Abstract: Quantized neural networks (QNNs) are among the main approaches for deploying deep neural networks on low-resource edge devices. Training QNNs using different levels of precision throughout the network (dynamic quantization) typically achieves superior trade-offs between performance and computational load. However, optimizing the different precision levels of QNNs can be complicated, as the bit allocations are discrete and difficult to differentiate through. Adequately accounting for the dependencies between the bit allocations of different layers is also not straightforward. To meet these challenges, in this work we propose GradFreeBits: a novel joint optimization scheme for training dynamic QNNs, which alternates between gradient-based optimization of the weights and gradient-free optimization of the bit allocation. Our method achieves better or on-par performance compared with current state-of-the-art low-precision neural networks on CIFAR10/100 and ImageNet classification. Furthermore, our approach can be extended to a variety of other applications involving neural networks used in conjunction with parameters which are difficult to optimize for.
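The alternation can be sketched as follows; the paper uses a more sophisticated gradient-free optimizer, so the single-layer random-search mutation and the validation-score `eval_fn` below are illustrative assumptions.

```python
import random

def random_search_bits(bits, eval_fn, choices=(2, 4, 8), iters=20, seed=0):
    # Gradient-free step for the discrete per-layer bit widths: try random
    # single-layer mutations, keep whichever allocation scores best.
    rng = random.Random(seed)
    best, best_score = list(bits), eval_fn(bits)
    for _ in range(iters):
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.choice(choices)
        score = eval_fn(cand)
        if score > best_score:
            best, best_score = cand, score
    return best

# Training then alternates: a few epochs of gradient descent on the weights
# at fixed `bits`, one random_search_bits call scored on validation accuracy,
# and so on until both converge.
```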

Super-Resolution (1 paper)

[1]: A Comprehensive Review of Deep Learning-based Single Image Super-resolution

Title: A Comprehensive Review of Deep Learning-based Single Image Super-resolution

Authors: Syed Muhammad Arsalan Bashir, Yi Wang, Mahrukh Khan

Note: 35 pages, 10 figures

Link: https://arxiv.org/abs/2102.09351

Abstract: Image super-resolution (SR) is one of the vital image processing methods that improve the resolution of an image in the field of computer vision. In the last two decades, significant progress has been made in the field of super-resolution, especially utilizing deep learning methods. This survey provides a detailed account of recent progress in super-resolution from the perspective of deep learning, while also covering the initial classical methods used for achieving super-resolution. The survey classifies image SR methods into four categories, i.e., classical methods, supervised learning-based methods, unsupervised learning-based methods, and domain-specific SR methods. We also introduce the problem of SR to provide intuition about image quality metrics, available reference datasets, and SR challenges. Deep learning-based approaches to SR are evaluated using a reference dataset. Finally, the survey concludes with future directions and trends in the field of SR and open problems in SR to be addressed by researchers.

GANs & Image/Text Generation (1 paper)

[1]: An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild

Title: An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild

Authors: Decky Aspandi, Federico Sukno, Björn Schuller, Xavier Binefa

Note: Accepted version, VISAPP 2021

Link: https://arxiv.org/abs/2102.09150

Abstract: Affective computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas. In this context, the emergence of video-based data allows enriching the widely used spatial features with the inclusion of temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time-consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting our competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates both in qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as its weights seem to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe a sequence length of around 160 ms to be the optimum for temporal modelling, which is consistent with other relevant findings utilising similar lengths.

Person Re-identification & Pedestrian Detection (1 paper)

[1]: Deep Miner: A Deep and Multi-branch Network which Mines Rich and Diverse Features for Person Re-identification

Title: Deep Miner: A Deep and Multi-branch Network which Mines Rich and Diverse Features for Person Re-identification

Authors: Abdallah Benzine, Mohamed El Amine Seddik, Julien Desmarais

Link: https://arxiv.org/abs/2102.09321

Abstract: Most recent person re-identification approaches are based on the use of deep convolutional neural networks (CNNs). These networks, although effective in multiple tasks such as classification or object detection, tend to focus on the most discriminative part of an object rather than retrieving all its relevant features. This behavior penalizes the performance of a CNN for the re-identification task, since it should identify diverse and fine-grained features. It is then essential to make the network learn a wide variety of finer characteristics in order to make the re-identification process effective and robust to finer changes. In this article, we introduce Deep Miner, a method that allows CNNs to "mine" richer and more diverse features about people for their re-identification. Deep Miner is specifically composed of three types of branches: a Global branch (G-branch), a Local branch (L-branch) and an Input-Erased branch (IE-branch). G-branch corresponds to the initial backbone which predicts global characteristics, while L-branch retrieves part-level resolution features. The IE-branch, for its part, receives partially suppressed feature maps as input, thereby allowing the network to "mine" new features (those ignored by G-branch) as output. For this special purpose, a dedicated suppression procedure for identifying and removing features within a given CNN is introduced. This suppression procedure has the major benefit of being simple, while it produces a model that significantly outperforms state-of-the-art (SOTA) re-identification methods. Specifically, we conduct experiments on four standard person re-identification benchmarks and observe an absolute performance gain of up to 6.5% mAP compared to SOTA.

Datasets (1 paper)

[1]: HVAQ: A High-Resolution Vision-Based Air Quality Dataset

Title: HVAQ: A High-Resolution Vision-Based Air Quality Dataset

Authors: Zuohui Chen, Tony Zhang, Zhuangzhi Chen, Yun Xiang, Qi Xuan, Robert P. Dick

Link: https://arxiv.org/abs/2102.09332

Abstract: Air pollutants, such as particulate matter, strongly impact human health. Most existing pollution monitoring techniques use stationary sensors, which are typically sparsely deployed. However, real-world pollution distributions vary rapidly in space, and the visual effects of air pollutants can be used to estimate concentration, potentially at high spatial resolution. Accurate pollution monitoring requires either densely deployed conventional point sensors, at-a-distance vision-based pollution monitoring, or a combination of both.
This paper makes the following contributions: (1) we present a high temporal and spatial resolution air quality dataset consisting of PM2.5, PM10, temperature, and humidity data; (2) we simultaneously take images covering the locations of the particle counters; and (3) we evaluate several vision-based state-of-the-art PM concentration prediction algorithms on our dataset and demonstrate that prediction accuracy increases with sensor density and image. It is our intent and belief that this dataset can enable advances by other research teams working on air quality estimation.

Medical & Biological Applications (1 paper)

[1]: Biometrics in the Era of COVID-19: Challenges and Opportunities

Title: Biometrics in the Era of COVID-19: Challenges and Opportunities

Authors: Marta Gomez-Barrero, Pawel Drozdowski, Christian Rathgeb, Jose Patino, Massimmiliano Todisco, Andras Nautsch, Naser Damer, Jannis Priesnitz, Nicholas Evans, Christoph Busch

Link: https://arxiv.org/abs/2102.09258

Abstract: Since early 2020, the COVID-19 pandemic has had a considerable impact on many aspects of daily life. A range of different measures have been implemented worldwide to reduce the rate of new infections and to manage the pressure on national health services. A primary strategy has been to reduce gatherings and the potential for transmission through the prioritisation of remote working and education. Enhanced hand hygiene and the use of facial masks have decreased the spread of pathogens when gatherings are unavoidable. These particular measures present challenges for reliable biometric recognition, e.g. for facial-, voice- and hand-based biometrics. At the same time, new challenges create new opportunities and research directions, e.g. renewed interest in non-constrained iris or periocular recognition, touch-less fingerprint- and vein-based authentication and the use of biometric characteristics for disease detection. This article presents an overview of the research carried out to address those challenges and emerging opportunities.

Others (18 papers)

[1]: Essentials for Class Incremental Learning

Title: Essentials for Class Incremental Learning

Authors: Sudhanshu Mittal, Silvio Galesso, Thomas Brox

Link: https://arxiv.org/abs/2102.09517

Abstract: Contemporary neural networks are limited in their ability to learn from evolving streams of training data. When trained sequentially on new or evolving tasks, their accuracy drops sharply, making them unsuitable for many real-world applications. In this work, we shed light on the causes of this well-known yet unsolved phenomenon, often referred to as catastrophic forgetting, in a class-incremental setup. We show that a combination of simple components and a loss that balances intra-task and inter-task learning can already resolve forgetting to the same extent as more complex measures proposed in the literature. Moreover, we identify poor quality of the learned representation as another reason for catastrophic forgetting in class-IL. We show that performance is correlated with secondary class information (dark knowledge) learned by the model, and that it can be improved by an appropriate regularizer. With these lessons learned, class-incremental learning results on CIFAR-100 and ImageNet improve over the state-of-the-art by a large margin, while keeping the approach simple.

[2]: Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Progressive Exaggeration on Chest X-rays

Title: Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Progressive Exaggeration on Chest X-rays

Authors: Joseph Paul Cohen, Rupert Brooks, Sovann En, Evan Zucker, Anuj Pareek, Matthew P. Lungren, Akshay Chaudhari

Note: Under review at MIDL 2021

Link: https://arxiv.org/abs/2102.09475

Abstract: Motivation: Traditional image attribution methods struggle to satisfactorily explain predictions of neural networks. Prediction explanation is important, especially in medical imaging, for avoiding the unintended consequences of deploying AI systems when false positive predictions can impact patient care. Thus, there is a pressing need to develop improved models for model explainability and introspection.
Specific problem: A new approach is to transform input images to increase or decrease the features which cause the prediction. However, current approaches are difficult to implement as they are monolithic or rely on GANs. These hurdles prevent wide adoption.
Our approach: Given an arbitrary classifier, we propose a simple autoencoder and gradient update (Latent Shift) that can transform the latent representation of an input image to exaggerate or curtail the features used for prediction. We use this method to study chest X-ray classifiers and evaluate their performance. We conduct a reader study with two radiologists assessing 240 chest X-ray predictions to identify which ones are false positives (half are) using traditional attribution maps or our proposed method.
Results: We found low overlap with ground-truth pathology masks for models with reasonably high accuracy. However, the results from our reader study indicate that these models are generally looking at the correct features. We also found that the Latent Shift explanation allows a user to have more confidence in true positive predictions compared to traditional approaches (0.15$\pm$0.95 on a 5-point scale with p=0.01) with only a small increase in false positive predictions (0.04$\pm$1.06 with p=0.57).
Accompanying webpage: this https URL
Source code: this https URL
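The Latent Shift update itself is essentially one autograd call: encode the image, take the gradient of the classifier's output with respect to the latent code, and decode latents shifted along that gradient for several step sizes. A minimal sketch with hypothetical `encoder`, `decoder`, and `classifier` modules:

```python
import torch

def latent_shift(x, encoder, decoder, classifier, lambdas=(-1.0, -0.5, 0.5, 1.0)):
    # Exaggerate (positive lambda) or curtail (negative lambda) the features
    # driving the classifier's prediction; the decoded frames, played in
    # sequence, form the "Gifsplanation" animation.
    z = encoder(x).detach().requires_grad_(True)
    score = classifier(decoder(z)).sum()
    (grad,) = torch.autograd.grad(score, z)
    with torch.no_grad():
        return [decoder(z + lam * grad) for lam in lambdas]
```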

[3]: Hierarchical Similarity Learning for Language-based Product Image Retrieval

Title: Hierarchical Similarity Learning for Language-based Product Image Retrieval

Authors: Zhe Ma, Fenghao Liu, Jianfeng Dong, Xiaoye Qu, Yuan He, Shouling Ji

Note: Accepted by ICASSP 2021. Code and data will be available at this https URL

Link: https://arxiv.org/abs/2102.09375

Abstract: This paper aims at the language-based product image retrieval task. The majority of previous works have made significant progress by designing network structures, similarity measurements, and loss functions. However, they typically perform vision-text matching at a certain granularity, regardless of the intrinsic multiple granularities of images. In this paper, we focus on cross-modal similarity measurement and propose a novel Hierarchical Similarity Learning (HSL) network. HSL first learns multi-level representations of the input data with stacked encoders, and object-granularity similarity and image-granularity similarity are computed at each level. All the similarities are combined as the final hierarchical cross-modal similarity. Experiments on a large-scale product retrieval dataset demonstrate the effectiveness of our proposed method. Code and data are available at this https URL.
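In spirit, the hierarchical similarity is a weighted combination of per-level cross-modal similarities. A minimal sketch, assuming text and image features have already been projected into a shared space at each encoder level (the level count and equal weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(text_feats, image_feats, weights=None):
    # text_feats / image_feats: lists of (batch, dim_l) tensors, one pair per
    # level of the stacked encoders; cosine similarity is computed per level
    # and the levels are combined into the final cross-modal score.
    sims = [F.cosine_similarity(t, v, dim=-1) for t, v in zip(text_feats, image_feats)]
    weights = weights or [1.0 / len(sims)] * len(sims)
    return sum(w * s for w, s in zip(weights, sims))
```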

[4]: Sliced $\mathcal{L}_2$ Distance for Colour Grading

Title: Sliced $\mathcal{L}_2$ Distance for Colour Grading

Authors: Hana Alghamdi, Rozenn Dahyot

Note: 5 pages, 9 figures

Link: https://arxiv.org/abs/2102.09297

Abstract: We propose a new method with $\mathcal{L}_2$ distance that maps one $N$-dimensional distribution to another, taking into account available information about correspondences. We solve the high-dimensional problem in 1D space using an iterative projection approach. To show the potential of this mapping, we apply it to colour transfer between two images that exhibit overlapped scenes. Experiments show quantitative and qualitative competitive results compared with state-of-the-art colour transfer methods.
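The iterative 1D projection idea can be sketched as follows, assuming equally sized point sets and ignoring the paper's correspondence handling: each iteration projects both distributions onto a random direction, matches the sorted 1D samples, and nudges the source along that direction.

```python
import numpy as np

def sliced_mapping(source, target, iters=100, step=1.0, seed=0):
    # Iteratively transport `source` toward `target` (both (n, d) arrays,
    # e.g. RGB pixels with d=3) using one random 1D projection per iteration.
    rng = np.random.default_rng(seed)
    src = source.astype(np.float64).copy()
    for _ in range(iters):
        u = rng.standard_normal(src.shape[1])
        u /= np.linalg.norm(u)                        # random unit direction
        s_proj, t_proj = src @ u, target @ u
        s_ord, t_ord = np.argsort(s_proj), np.argsort(t_proj)
        shift = np.empty(len(src))
        shift[s_ord] = t_proj[t_ord] - s_proj[s_ord]  # match sorted 1D samples
        src += step * shift[:, None] * u              # move along the projection
    return src
```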

[5]: HandTailor: Towards High-Precision Monocular 3D Hand Recovery

Title: HandTailor: Towards High-Precision Monocular 3D Hand Recovery

Authors: Jun Lv, Wenqiang Xu, Lixin Yang, Sucheng Qian, Chongzhao Mao, Cewu Lu

Link: https://arxiv.org/abs/2102.09244

Abstract: 3D hand pose estimation and shape recovery are challenging tasks in computer vision. We introduce a novel framework, HandTailor, which combines a learning-based hand module and an optimization-based tailor module to achieve high-precision hand mesh recovery from a monocular RGB image. The proposed hand module unifies perspective projection and weak perspective projection in a single network, targeting both accuracy-oriented and in-the-wild scenarios. The proposed tailor module then utilizes the coarsely reconstructed mesh model provided by the hand module as initialization and iteratively optimizes an energy function to obtain better results. The tailor module is time-efficient, costing only 8 ms per frame on a modern CPU. We demonstrate that HandTailor achieves state-of-the-art performance on several public benchmarks, with impressive qualitative results in in-the-wild experiments.

[6]: DSRN: an Efficient Deep Network for Image Relighting

Title: DSRN: an Efficient Deep Network for Image Relighting

Authors: Sourya Dipta Das, Nisarg A. Shah, Saikat Dutta, Himanshu Kumar

Note: Under review

Link: https://arxiv.org/abs/2102.09242

Abstract: Custom and natural lighting conditions can be emulated in images of a scene during post-editing, and the extraordinary capabilities of deep learning frameworks can be utilized for this purpose. Deep image relighting allows automatic photo enhancement by illumination-specific retouching. Most of the state-of-the-art methods for relighting are run-time intensive and memory inefficient. In this paper, we propose an efficient, real-time framework, the Deep Stacked Relighting Network (DSRN), for image relighting, which utilizes aggregated features from the input image at different scales. Our model is very lightweight, with a total size of about 42 MB, and has an average inference time of about 0.0116 s for images of resolution $1024 \times 1024$, which is faster than other multi-scale models. Our solution is quite robust for translating image colour temperature from the input image to the target image, and it also performs moderately well for light gradient generation with respect to the target image. Additionally, we show that if images illuminated from opposite directions are used as input, the qualitative results improve over using a single input image.

[7]: Hierarchical Attention Fusion for Geo-Localization

Title: Hierarchical Attention Fusion for Geo-Localization

Authors: Liqi Yan, Yiming Cui, Yingjie Chen, Dongfang Liu

Link: https://arxiv.org/abs/2102.09186

Abstract: Geo-localization is a critical task in computer vision. In this work, we cast geo-localization as a 2D image retrieval task. Current state-of-the-art methods for 2D geo-localization are not robust when localizing scenes with drastic scale variations, because they exploit features from only one semantic level for image representations. To address this limitation, we introduce a hierarchical attention fusion network using multi-scale features for geo-localization. We extract hierarchical feature maps from a convolutional neural network (CNN) and organically fuse the extracted features for image representations. Our training is self-supervised, using adaptive weights to control the attention of feature emphasis from each hierarchical level. Evaluation results on image retrieval and large-scale geo-localization benchmarks indicate that our method outperforms the existing state-of-the-art methods. Code is available here: \url{this https URL}.

[8]: Improved Point Transformation Methods For Self-Supervised Depth Prediction

Title: Improved Point Transformation Methods For Self-Supervised Depth Prediction

Authors: Chen Ziwen, Zixuan Guo, Jerod Weinman

Link: https://arxiv.org/abs/2102.09142

Abstract: Given stereo or egomotion image pairs, a popular and successful method for unsupervised learning of monocular depth estimation is to measure the quality of image reconstructions resulting from the learned depth predictions. Continued research has improved the overall approach in recent years, yet the common framework still suffers from several important limitations, particularly when dealing with points occluded after transformation to a novel viewpoint. While prior work has addressed this problem heuristically, this paper introduces a z-buffering algorithm that correctly and efficiently handles occluded points. Because our algorithm is implemented with operators typical of machine learning libraries, it can be incorporated into any existing unsupervised depth learning framework with automatic support for differentiation. Additionally, because points having negative depth after transformation often signify erroneously shallow depth predictions, we introduce a loss function to penalize this undesirable behavior explicitly. Experimental results on the KITTI data set show that the z-buffer and negative depth loss both improve the performance of a state-of-the-art depth-prediction network.
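A z-buffer built from vectorized operators of the kind the authors allude to is short to write. Here is a NumPy sketch (my illustration, not their differentiable implementation) using a scatter-min via `np.minimum.at`: only the nearest depth per target pixel survives, and every other reprojected point is flagged as occluded.

```python
import numpy as np

def z_buffer(pix, depth, height, width):
    # pix: (n, 2) integer pixel coordinates (x, y) after reprojection;
    # depth: (n,) depths of the transformed points.
    buf = np.full(height * width, np.inf)
    inside = (pix[:, 0] >= 0) & (pix[:, 0] < width) & \
             (pix[:, 1] >= 0) & (pix[:, 1] < height)
    flat = pix[inside, 1] * width + pix[inside, 0]
    np.minimum.at(buf, flat, depth[inside])       # scatter-min: keep nearest depth
    visible = inside.copy()
    visible[inside] = depth[inside] <= buf[flat]  # points that survive the z-test
    return buf.reshape(height, width), visible
```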

[9]: Understanding and Creating Art with AI: Review and Outlook

Title: Understanding and Creating Art with AI: Review and Outlook

Authors: Eva Cetinic, James She

Note: 17 pages, 3 figures

Link: https://arxiv.org/abs/2102.09109

Abstract: Technologies related to artificial intelligence (AI) have a strong impact on the changes of research and creative practices in visual arts. The growing number of research initiatives and creative applications that emerge at the intersection of AI and art motivates us to examine and discuss the creative and explorative potentials of AI technologies in the context of art. This paper provides an integrated review of two facets of AI and art: 1) AI is used for art analysis and employed on digitized artwork collections; 2) AI is used for creative purposes and generating novel artworks. In the context of AI-related research for art understanding, we present a comprehensive overview of artwork datasets and recent works that address a variety of tasks such as classification, object detection, similarity retrieval, multimodal representations, and computational aesthetics. In relation to the role of AI in creating art, we address various practical and theoretical aspects of AI Art and consolidate related works that deal with those topics in detail. Finally, we provide a concise outlook on the future progression and potential impact of AI technologies on our understanding and creation of art.

[10]: DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates

Title: DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates

Authors: Minghua Liu, Minhyuk Sung, Radomir Mech, Hao Su

Link: https://arxiv.org/abs/2102.09105

Abstract: We propose DeepMetaHandles, a 3D conditional generative model based on mesh deformation. Given a collection of 3D meshes of a category and their deformation handles (control points), our method learns a set of meta-handles for each shape, which are represented as combinations of the given handles. The disentangled meta-handles factorize all the plausible deformations of the shape, while each of them corresponds to an intuitive deformation. A new deformation can then be generated by sampling the coefficients of the meta-handles in a specific range. We employ biharmonic coordinates as the deformation function, which can smoothly propagate the control points' translations to the entire mesh. To avoid learning zero deformation as meta-handles, we incorporate a target-fitting module which deforms the input mesh to match a random target. To enhance deformations' plausibility, we employ a soft-rasterizer-based discriminator that projects the meshes to a 2D space. Our experiments demonstrate the superiority of the generated deformations as well as the interpretability and consistency of the learned meta-handles.

[11]: Mobile Computational Photography: A Tour

Title: Mobile Computational Photography: A Tour

Authors: Mauricio Delbracio, Damien Kelly, Michael S. Brown, Peyman Milanfar

Link: https://arxiv.org/abs/2102.09000

Abstract: The first mobile camera phone was sold only 20 years ago, when taking pictures with one's phone was an oddity and sharing pictures online was unheard of. Today, the smartphone is more camera than phone. How did this happen? This transformation was enabled by advances in computational photography: the science and engineering of making great images from small-form-factor mobile cameras. Modern algorithmic and computing advances, including machine learning, have changed the rules of photography, bringing to it new modes of capture, post-processing, storage, and sharing. In this paper, we give a brief history of mobile computational photography and describe some of the key technological components, including burst photography, noise reduction, and super-resolution. At each step, we may draw naive parallels to the human visual system.

[12]:Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Title: Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts

Authors: Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut

Link: https://arxiv.org/abs/2102.08981

Abstract: The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs specifically meant for vision-and-language pre-training. We analyze this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

[13]:Delving into Deep Imbalanced Regression

Title: Delving into Deep Imbalanced Regression

Authors: Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, Dina Katabi

Notes: Code and data are available at this https URL

Link: https://arxiv.org/abs/2102.09554

Abstract: Real-world data often exhibit imbalanced distributions, where certain target values have significantly fewer observations. Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes. However, many tasks involve continuous targets, where hard boundaries between classes do not exist. We define Deep Imbalanced Regression (DIR) as learning from such imbalanced data with continuous targets, dealing with potentially missing data for certain target values, and generalizing to the entire target range. Motivated by the intrinsic difference between categorical and continuous label spaces, we propose distribution smoothing for both labels and features, which explicitly acknowledges the effects of nearby targets and calibrates both the label and learned feature distributions. We curate and benchmark large-scale DIR datasets from common real-world tasks in computer vision, natural language processing, and healthcare. Extensive experiments verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems. Code and data are available at this https URL.
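
The label-side half of this idea can be sketched directly: estimate an effective label density by convolving the empirical label histogram with a smoothing kernel, then reweight the loss by its inverse. A minimal NumPy version, with the Gaussian kernel, bin layout, and clipping as illustrative assumptions rather than the paper's exact choices:

    import numpy as np

    def smoothed_label_density(labels, bin_edges, sigma=2.0):
        # Convolve the empirical label histogram with a Gaussian kernel so
        # that nearby continuous targets share density mass.
        hist, _ = np.histogram(labels, bins=bin_edges)
        radius = int(3 * sigma)
        x = np.arange(-radius, radius + 1)
        kernel = np.exp(-0.5 * (x / sigma) ** 2)
        kernel /= kernel.sum()
        return np.convolve(hist.astype(float), kernel, mode="same")

    rng = np.random.default_rng(0)
    labels = np.clip(rng.normal(30.0, 5.0, size=10_000), 0.0, 99.0)  # imbalanced targets
    density = smoothed_label_density(labels, np.arange(0, 101))
    weights = 1.0 / np.clip(density, 1.0, None)  # inverse-frequency sample weights
    weights /= weights.mean()                    # normalize so the average weight is 1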

[14]:Vision-Aided 6G Wireless Communications: Blockage Prediction and Proactive Handoff

Title: Vision-Aided 6G Wireless Communications: Blockage Prediction and Proactive Handoff

Authors: Gouranga Charan, Muhammad Alrabeiah, Ahmed Alkhateeb

Notes: Submitted to IEEE, 30 pages, 11 figures. The dataset will be available soon on the ViWi website: this https URL

Link: https://arxiv.org/abs/2102.09527

Abstract: Sensitivity to blockages is a key challenge for high-frequency (5G millimeter-wave and 6G sub-terahertz) wireless networks. Since these networks rely mainly on line-of-sight (LOS) links, sudden link blockages seriously threaten their reliability. Further, when the LOS link is blocked, the network typically needs to hand the user off to another LOS base station, which may incur critical latency, especially if a search over a large codebook of narrow beams is needed. A promising way to tackle these reliability and latency challenges is to enable proaction in wireless networks: anticipating blockages, especially dynamic blockages, and initiating user hand-off beforehand. This paper presents a complete machine learning framework for enabling proaction in wireless networks using visual data captured, for example, by RGB cameras deployed at the base stations. In particular, the paper proposes a vision-aided wireless communication solution that utilizes bimodal machine learning to perform proactive blockage prediction and user hand-off. The bedrock of this solution is a deep learning algorithm that learns, from visual and wireless data, how to predict incoming blockages. The predictions of this algorithm are used by the wireless network to proactively initiate hand-off decisions and avoid unnecessary latency. The algorithm is developed on a vision-wireless dataset generated using the ViWi data-generation framework. Experimental results on two base stations with different cameras indicate that the algorithm accurately detects incoming blockages more than $\sim 90\%$ of the time. This blockage prediction ability is directly reflected in the accuracy of proactive hand-off, which approaches $87\%$. These results highlight a promising direction for enabling high reliability and low latency in future wireless networks.
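
The bimodal idea, fusing a camera view with a wireless feature such as received beam power, can be sketched as a small two-branch network. A hypothetical PyTorch layout in which the layer sizes, the beam-power input, and the binary "blockage within the horizon" target are all assumptions, not the paper's architecture:

    import torch
    from torch import nn

    class BimodalBlockagePredictor(nn.Module):
        # Two branches: a small CNN embeds the RGB view from the base station,
        # an MLP embeds the beam-power vector; a head fuses both into a single
        # logit for "the LOS link will be blocked soon".
        def __init__(self, num_beams=64):
            super().__init__()
            self.vision = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.wireless = nn.Sequential(nn.Linear(num_beams, 32), nn.ReLU())
            self.head = nn.Linear(32 + 32, 1)

        def forward(self, image, beam_power):
            fused = torch.cat([self.vision(image), self.wireless(beam_power)], dim=1)
            return self.head(fused)

    model = BimodalBlockagePredictor()
    logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 64))
    probs = torch.sigmoid(logits)  # blockage probabilities, used to trigger hand-off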

[15]:Equivariant Spherical Deconvolution: Learning Sparse Orientation Distribution Functions from Spherical Data

Title: Equivariant Spherical Deconvolution: Learning Sparse Orientation Distribution Functions from Spherical Data

Authors: Axel Elaldi, Neel Dey, Heejong Kim, Guido Gerig

Notes: Accepted to Information Processing in Medical Imaging (IPMI) 2021. Code available at this https URL. First two authors contributed equally. 13 pages with 6 figures

Link: https://arxiv.org/abs/2102.09462

Abstract: We present a rotation-equivariant unsupervised learning framework for the sparse deconvolution of non-negative scalar fields defined on the unit sphere. Spherical signals with multiple peaks naturally arise in Diffusion MRI (dMRI), where each voxel consists of one or more signal sources corresponding to anisotropic tissue structure such as white matter. Due to spatial and spectral partial voluming, clinically feasible dMRI struggles to resolve crossing-fiber white matter configurations, leading to extensive development in spherical deconvolution methodology to recover the underlying fiber directions. However, these methods are typically linear and struggle with small crossing angles and partial volume fraction estimation. In this work, we improve on current methodologies by nonlinearly estimating fiber structures via unsupervised spherical convolutional networks with guaranteed equivariance to spherical rotations. Experimentally, we first validate our approach via extensive single- and multi-shell synthetic benchmarks, demonstrating competitive performance against common baselines. We then show improved downstream performance on fiber tractography measures on the Tractometer benchmark dataset. Finally, we show downstream improvements in tractography and partial volume estimation on a multi-shell dataset of human subjects.
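
For context, the classical linear baseline the paper improves on is one line in the spherical-harmonic domain: convolution with an axially symmetric single-fiber response multiplies each degree-l coefficient by a rotational harmonic r_l, so deconvolution divides by it. A sketch with made-up coefficients; the paper replaces this linear step with a rotation-equivariant network:

    import numpy as np

    def sh_deconvolve(signal_sh, response_rh, degrees):
        # Divide each spherical-harmonic coefficient by the kernel's rotational
        # harmonic for its degree l (the Funk-Hecke / spherical convolution theorem).
        return np.array([c / response_rh[l] for c, l in zip(signal_sh, degrees)])

    # Even-degree real SH basis up to l = 4: 1 + 5 + 9 = 15 coefficients.
    degrees = [0] * 1 + [2] * 5 + [4] * 9
    response_rh = {0: 1.0, 2: 0.4, 4: 0.1}        # made-up single-fiber response
    rng = np.random.default_rng(0)
    signal_sh = rng.standard_normal(len(degrees))  # measured dMRI signal in SH form
    fodf_sh = sh_deconvolve(signal_sh, response_rh, degrees)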

[16]:VAE Approximation Error: ELBO and Conditional Independence

Title: VAE Approximation Error: ELBO and Conditional Independence

Authors: Dmitrij Schlesinger, Alexander Shekhovtsov, Boris Flach

Link: https://arxiv.org/abs/2102.09310

Abstract: The importance of Variational Autoencoders reaches far beyond standalone generative models: the approach is also used for learning latent representations and can be generalized to semi-supervised learning. This requires a thorough analysis of their commonly known shortcomings: posterior collapse and approximation errors. This paper analyzes VAE approximation errors caused by the combination of the ELBO objective with the choice of the encoder probability family, in particular under conditional independence assumptions. We identify the subclass of generative models consistent with the encoder family. We show that the ELBO optimizer is pulled away from the likelihood optimizer towards this consistent subset. Furthermore, this subset cannot be enlarged, and the respective error cannot be decreased, by only considering deeper encoder networks.
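
The "pull" described here follows from the standard identity relating the ELBO to the marginal log-likelihood; writing it out makes the argument concrete:

    \log p_\theta(x)
      = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}
          \left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathrm{ELBO}(\theta,\phi;x)}
      + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)

Since the KL term is non-negative, maximizing the ELBO over \theta trades likelihood against making the true posterior p_\theta(z \mid x) representable by the restricted encoder family q_\phi (e.g., a conditionally independent one); this is the bias towards the consistent subclass of generative models that the abstract describes.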

[17]:PLAM: a Posit Logarithm-Approximate Multiplier for Power Efficient Posit-based DNNs

Title: PLAM: A Posit Logarithm-Approximate Multiplier for Power-Efficient Posit-based DNNs

Authors: Raul Murillo, Alberto A. Del Barrio, Guillermo Botella, Min Soo Kim, HyunJin Kim, Nader Bagherzadeh

Link: https://arxiv.org/abs/2102.09262

Abstract: The posit number system was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application in neural network related tasks and produced unit designs that are still far from competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme that significantly reduces the complexity of posit multipliers, the most power-hungry units within deep neural network architectures. Compared with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of hardware multipliers by up to 72.86%, 81.79%, and 17.01%, respectively, without accuracy degradation.
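
The arithmetic trick behind logarithm-approximate multipliers is Mitchell's approximation log2(1 + f) ≈ f, which turns a multiplication into an addition of exponents and fractions. A float-based sketch of that general idea for positive inputs; PLAM applies the same principle to the posit format in hardware, so this is not the paper's circuit:

    import math

    def log_approx_multiply(a, b):
        # Decompose a = 2**e * (1 + f) with 0 <= f < 1, approximate
        # log2(a) ~= e + f, add the approximate logs, and convert back.
        ea = math.floor(math.log2(a)); fa = a / 2.0 ** ea - 1.0
        eb = math.floor(math.log2(b)); fb = b / 2.0 ** eb - 1.0
        e, f = ea + eb, fa + fb
        if f >= 1.0:               # carry out of the fraction: renormalize
            e, f = e + 1, f - 1.0
        return 2.0 ** e * (1.0 + f)

    print(log_approx_multiply(13.0, 7.0), 13.0 * 7.0)  # ~88.0 vs 91.0 (worst case ~11% error)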

[18]:Learning Invariant Representation of Tasks for Robust Surgical State Estimation

Title: Learning Invariant Representation of Tasks for Robust Surgical State Estimation

Authors: Yidan Qin, Max Allan, Yisong Yue, Joel W. Burdick, Mahdi Azizian

Notes: Accepted to IEEE Robotics & Automation Letters

Link: https://arxiv.org/abs/2102.09119

Abstract: Surgical state estimators in robot-assisted surgery (RAS), especially those trained via learning techniques, rely heavily on datasets that capture surgeon actions in laboratory or real-world surgical tasks. Real-world RAS datasets are costly to acquire, are obtained from multiple surgeons who may use different surgical strategies, and are recorded under uncontrolled conditions in highly complex environments. The combination of high diversity and limited data calls for new learning methods that are robust and invariant to operating conditions and surgical techniques. We propose StiseNet, a Surgical Task Invariance State Estimation Network with an invariance induction framework that minimizes the effects of variations in surgical technique and operating environment inherent to RAS datasets. StiseNet's adversarial architecture learns to separate nuisance factors from the information needed for surgical state estimation. StiseNet is shown to outperform state-of-the-art state estimation methods on three datasets (including a new real-world RAS dataset: HERNIA-20).
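
One standard way to realize such an adversarial "separate the nuisance factors" objective is a gradient reversal layer: an auxiliary head is trained to predict the nuisance variable (e.g., surgeon or technique identity) from the shared features, while the reversed gradient pushes the encoder to discard that information. A PyTorch sketch of this general technique, not StiseNet's exact architecture:

    import torch
    from torch import nn

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; negated gradient in the backward pass,
        # so the encoder is trained *against* the nuisance head.
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output

    encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # shared feature extractor
    state_head = nn.Linear(64, 10)     # surgical-state classifier (trained normally)
    nuisance_head = nn.Linear(64, 5)   # hypothetical surgeon/technique classifier

    z = encoder(torch.randn(8, 128))
    state_logits = state_head(z)
    nuisance_logits = nuisance_head(GradReverse.apply(z))  # reversed gradient reaches the encoder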

The Chinese translations above are machine-generated and provided for reference only.
