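Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes + Face-Cap: Image Captioning using Facial Expression Analysis + Combining SLAM with Multi-spectral Photometric Stereo for Real-time Dense 3D Reconstruction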

Minutia Texture Cylinder Codes for fingerprint matching

Wajih Ullah Baig, Umar Munir, Waqas Ellahi, Adeel Ejaz, Kashif Sardar

Minutia Cylinder Codes (MCC) are minutiae-based fingerprint descriptors that use the minutiae information in a fingerprint image for fingerprint matching. In this paper, we present a modification to the underlying information of the MCC descriptor and show that matching accuracy is strongly affected by the choice of features. MCC, originally a minutiae-only descriptor, is transformed into a texture descriptor: the minutiae angular information is replaced with orientation, frequency and energy information obtained through Short Time Fourier Transform (STFT) analysis. The resulting descriptors are called minutiae texture cylinder codes (MTCC). With a fixed set of parameters, the proposed changes to MCC improve performance on the FVC 2002 and 2004 data sets and surpass the traditional MCC performance. [1807.02251v1]


Dynamic Multimodal Instance Segmentation guided by natural language queries

Edgar A. Margffoy-Tuay, Juan C. Pérez, Emilio Botero, Pablo Arbeláez

In this paper, we address the task of segmenting an object given a natural language expression that references it, i.e. a referring expression. Current techniques tackle this task by either (i) directly or recursively merging the linguistic and visual information in the channel dimension and then performing convolutions; or by (ii) mapping the expression to a space in which it can be thought of as a filter, whose response is directly related to the presence of the object at a given spatial coordinate in the image, so that a convolution can be applied to look for the object. We propose a novel method that merges the best of both worlds to exploit the recursive nature of language, and that also, during the upsampling process, takes advantage of the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained. Our method is compared with the state-of-the-art approaches in four standard datasets, in which it yields high performance and surpasses all previous methods in six of eight of the standard dataset splits for this task. Code will be made available in the final version of this paper. Full implementation of our method and training routines, written in PyTorch, can be found at \url{} [1807.02257v1]
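As a rough illustration of approach (ii) in the abstract above, the sketch below treats a language-derived vector as a 1x1 filter and computes its response at every spatial location of a feature map. The function name `response_map`, the toy feature values and the filter weights are all hypothetical; in a real system the filter would be produced by a language encoder, which is not modelled here.

```python
def response_map(features, dyn_filter):
    """Apply a language-derived filter as a 1x1 convolution.

    features:   H x W x C nested lists (visual feature map)
    dyn_filter: length-C list (hypothetical output of a language encoder)
    Returns an H x W map; larger values mean a stronger match at that location.
    """
    return [
        [sum(f * w for f, w in zip(pixel, dyn_filter)) for pixel in row]
        for row in features
    ]


# Toy 2 x 2 feature map with 2 channels (made-up values).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
expression_filter = [2.0, -1.0]  # assumed filter weights, for illustration only

responses = response_map(features, expression_filter)
print(responses)
```

A segmentation mask could then be obtained by thresholding the response map; how the method in the paper combines this with recursive merging is not specified in the abstract.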


Adversarial Learning for Fine-grained Image Search

Kevin Lin, Fan Yang, Qiaosong Wang, Robinson Piramuthu

Fine-grained image search is still a challenging problem due to the difficulty in capturing subtle differences regardless of pose variations of objects from fine-grained categories. In practice, a dynamic inventory with new fine-grained categories adds another dimension to this challenge. In this work, we propose an end-to-end network, called FGGAN, that learns discriminative representations by implicitly learning a geometric transformation from multi-view images for fine-grained image search. We integrate a generative adversarial network (GAN) that can automatically handle complex view and pose variations by converting them to a canonical view without any predefined transformations. Moreover, in an open-set scenario, our network is able to better match images from unseen and unknown fine-grained categories. Extensive experiments on two public datasets and a newly collected dataset have demonstrated the outstanding robust performance of the proposed FGGAN in both closed-set and open-set scenarios, providing as much as 10% relative improvement compared to baselines. [1807.02247v1]


Reversed Active Learning based Atrous DenseNet for Pathological Image Classification

Yuexiang Li, Xinpeng Xie, Linlin Shen, Shaoxiong Liu

With the development of deep learning in recent years, an increasing number of studies have tried to adopt deep learning models for medical image analysis. However, the use of deep learning networks for pathological image analysis encounters several challenges, e.g. the high resolution (gigapixel) of pathological images and the lack of annotations of cancer areas. To address these challenges, we propose a complete framework for pathological image classification, which consists of a novel training strategy, namely reversed active learning (RAL), and an advanced network, namely atrous DenseNet (ADN). The proposed RAL can remove mislabeled patches from the training set. The refined training set can then be used to train widely used deep learning networks, e.g. VGG-16 and ResNets. The proposed ADN achieves multi-scale feature extraction by integrating atrous convolutions into the Dense Block. RAL and ADN have been evaluated on two pathological datasets, i.e. BACH and CCG. The experimental results demonstrate the excellent performance of the proposed ADN + RAL framework: average patch-level ACAs of 94.10% and 92.05% were achieved on the BACH and CCG validation sets, respectively. [1807.02420v1]
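To make the idea of multi-scale feature extraction with atrous convolutions concrete, here is a minimal 1D sketch in plain Python. It is not the ADN architecture; the function names and the toy signal are assumptions chosen for illustration. The same kernel is applied with different dilation rates, so each output sees a different receptive field without adding weights.

```python
def atrous_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D atrous (dilated) convolution, in cross-correlation form.

    Kernel taps are spaced `dilation` samples apart, so a 3-tap kernel
    covers 1 + 2 * dilation input samples.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


def multi_scale_features(signal, kernel, dilations=(1, 2)):
    """Run the same kernel at several dilation rates (hypothetical helper)."""
    return {d: atrous_conv1d(signal, kernel, d) for d in dilations}


signal = [1, 2, 3, 4, 5]
features = multi_scale_features(signal, [1, 1, 1])
print(features)
```

With dilation 1 the kernel sums three neighbouring samples; with dilation 2 it sums every other sample, covering the whole five-sample signal with the same three weights.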


Face-Cap: Image Captioning using Facial Expression Analysis

Omid Mohamad Nezami, Mark Dras, Peter Anderson, Len Hamey

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these, we use facial expression features extracted from images including human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. Using all standard evaluation metrics, our Face-Cap models outperform a state-of-the-art baseline model for generating image captions when applied to an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions. [1807.02250v1]


End-to-End Race Driving with Deep Reinforcement Learning

Maximilian Jaritz, Raoul de Charette, Marin Toromanoff, Etienne Perot, Fawzi Nashashibi

We present research using the latest reinforcement learning algorithm for end-to-end driving without any mediated perception (object recognition, scene understanding). The newly proposed reward and learning strategies together lead to faster convergence and more robust driving using only RGB images from a forward-facing camera. An Asynchronous Advantage Actor-Critic (A3C) framework is used to learn the car control in a physically and graphically realistic rally game, with the agents evolving simultaneously on tracks with a variety of road structures (turns, hills), graphics (seasons, location) and physics (road adherence). A thorough evaluation is conducted, and generalization is demonstrated on unseen tracks and under legal speed limits. Open-loop tests on real sequences of images show some domain adaptation capability of our method. [1807.02371v1]


Multi-modal Non-line-of-sight Passive Imaging

Andre Beckus, Alexandru Tamasan, George K. Atia

We consider the non-line-of-sight (NLOS) imaging of an object using light reflected off a diffusive wall. The wall scatters incident light such that a lens is no longer useful to form an image. Instead, we exploit the four-dimensional spatial coherence function to reconstruct a two-dimensional projection of the obscured object. The approach is completely passive in the sense that no control over the light illuminating the object is assumed, and is compatible with the partially coherent fields ubiquitous in both indoor and outdoor environments. We formulate a multi-criteria convex optimization problem for reconstruction, which fuses reflected field’s intensity and spatial coherence information at different scales. Our formulation leverages established optics models of light propagation and scattering and exploits the sparsity common to many images in different bases. We also develop an algorithm based on the Alternating Direction Method of Multipliers to efficiently solve the convex program proposed. A means for analyzing the null space of the measurement matrices is provided, as well as a means for weighing the contribution of individual measurements to the reconstruction. This work holds promise to advance passive imaging in challenging NLOS regimes in which the intensity does not necessarily retain distinguishable features, and provides a framework for multi-modal information fusion for efficient scene reconstruction. [1807.02444v1]
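The abstract above names the Alternating Direction Method of Multipliers (ADMM) and sparsity as ingredients of the reconstruction. The sketch below shows the ADMM update pattern on a much simpler, assumed problem: minimizing 0.5 * ||x - b||^2 + lam * ||x||_1 with the splitting x = z. It is not the paper's multi-criteria formulation, and the names `admm_l1_denoise`, `lam` and `rho` are illustrative choices.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(b, lam=1.0, rho=1.0, iterations=100):
    """Scaled-form ADMM for min_x 0.5*||x - b||^2 + lam*||z||_1  s.t. x = z."""
    n = len(b)
    x, z, u = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # x-update: closed-form minimizer of the quadratic terms.
        x = [(bi + rho * (zi - ui)) / (1.0 + rho) for bi, zi, ui in zip(b, z, u)]
        # z-update: soft thresholding enforces sparsity.
        z = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # u-update: accumulate the constraint violation x - z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z


solution = admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0)
print(solution)
```

For this toy problem the exact answer is known in closed form (soft thresholding of `b` by `lam`), which makes it a convenient check that the three ADMM updates are wired correctly before moving to a problem with a real measurement matrix.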


From Rank Estimation to Rank Approximation: Rank Residual Constraint for Image Denoising

Zhiyuan Zha, Xin Yuan, Tao Yue, Jiaotao Zhou

Inspired by the recent advances of Generative Adversarial Networks (GAN) in deep learning, we propose a novel rank minimization approach, termed rank residual constraint (RRC), for image denoising in the optimization framework. Different from GAN, where a discriminative model is trained jointly with a generative model, in image denoising, since the labels are not available, we build an unsupervised mechanism, where two generative models are employed and jointly optimized. Specifically, by integrating the image nonlocal self-similarity prior with the proposed RRC model, we develop an iterative algorithm for image denoising. We first present a recursive-based nonlocal means approach to obtain a good reference of the original image patch groups, and then the rank residual of image patch groups between this reference and the noisy image is minimized to achieve a better estimate of the desired image. In this manner, both the reference and the estimated image in each iteration are improved gradually and jointly; in the meantime, we progressively approximate the underlying low-rank matrix (constructed by image patch groups) via minimizing the rank residual, which is different from existing low-rank based approaches that estimate the underlying low-rank matrix directly from the corrupted observation. We further provide a theoretical analysis on the feasibility of the proposed RRC model from the perspective of group-based sparse representation. Experimental results demonstrate that the proposed RRC model outperforms many state-of-the-art denoising methods. [1807.02504v1]


Tangent Convolutions for Dense Prediction in 3D

Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, Qian-Yi Zhou

We present an approach to semantic scene analysis using deep convolutional networks. Our approach is based on tangent convolutions – a new construction for convolutional networks on 3D data. In contrast to volumetric approaches, our method operates directly on surface geometry. Crucially, the construction is applicable to unstructured point clouds and other noisy real-world data. We show that tangent convolutions can be evaluated efficiently on large-scale point clouds with millions of points. Using tangent convolutions, we design a deep fully-convolutional network for semantic segmentation of 3D point clouds, and apply it to challenging real-world datasets of indoor and outdoor 3D environments. Experimental results show that the presented approach outperforms other recent deep network constructions in detailed analysis of large 3D scenes. [1807.02443v1]


Optimal Sensor Data Fusion Architecture for Object Detection in Adverse Weather Conditions

Andreas Pfeuffer, Klaus Dietmayer

Good, robust sensor data fusion in diverse weather conditions is a challenging task. There are several fusion architectures in the literature: e.g. the sensor data can be fused right at the beginning (Early Fusion), or they can first be processed separately and then concatenated later (Late Fusion). In this work, different fusion architectures are compared and evaluated by means of object detection tasks, in which the goal is to recognize and localize predefined objects in a stream of data. State-of-the-art object detectors based on neural networks are usually highly optimized for good weather conditions, since the well-known benchmarks only consist of sensor data recorded in optimal weather conditions. As a result, these approaches degrade severely or even fail in adverse weather conditions. We therefore compare different sensor fusion architectures in both good and adverse weather conditions to find the optimal fusion architecture for diverse weather situations. A new training strategy is also introduced such that the performance of the object detector is greatly enhanced in adverse weather scenarios or if a sensor fails. Furthermore, the paper addresses the question of whether the detection accuracy can be increased further by providing the neural network with a-priori knowledge such as the spatial calibration of the sensors. [1807.02323v1]
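The difference between Early Fusion and Late Fusion described above can be sketched with a toy example. Everything here is an assumption made for illustration: `extract` is a stand-in for a network branch (it only computes a sum and a maximum), and `sensor_a` and `sensor_b` are made-up readings, not data from the paper.

```python
def extract(channels):
    """Hypothetical stand-in for a network branch: summarize its input."""
    return [sum(channels), max(channels)]


def early_fusion(sensor_a, sensor_b):
    # Concatenate the raw sensor channels first, then run one shared branch.
    return extract(sensor_a + sensor_b)


def late_fusion(sensor_a, sensor_b):
    # Run one branch per sensor, then concatenate the resulting features.
    return extract(sensor_a) + extract(sensor_b)


sensor_a = [1.0, 3.0]  # made-up readings from a first sensor
sensor_b = [2.0, 0.5]  # made-up readings from a second sensor

early = early_fusion(sensor_a, sensor_b)
late = late_fusion(sensor_a, sensor_b)
print(early, late)
```

In the early variant the branch sees both sensors at once and can mix them freely; in the late variant each sensor keeps its own features until the concatenation, which is one reason the two architectures can behave differently when a single sensor degrades.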


Combining SLAM with multi-spectral photometric stereo for real-time dense 3D reconstruction

Yuanhong Xu, Pei Dong, Junyu Dong, Lin Qi

Obtaining dense 3D reconstruction with low computational cost is one of the important goals in the field of SLAM. In this paper we propose a dense 3D reconstruction framework for monocular multispectral video sequences that jointly uses semi-dense SLAM and multispectral photometric stereo. Starting from multispectral video, SLAM (a) reconstructs a semi-dense 3D shape that will be densified; (b) recovers a relatively sparse depth map that is then fed as a prior into optimization-based multispectral photometric stereo for more accurate dense surface normal recovery; and (c) obtains the camera pose, which is subsequently used for view conversion during fusion, where we combine the relatively sparse point cloud with the dense surface normals using the automated cross-scale fusion method proposed in this paper to get a dense point cloud with subtle texture information. Experiments show that our method can effectively obtain denser 3D reconstructions. [1807.02294v1]


Parallel Convolutional Networks for Image Recognition via a Discriminator

Shiqi Yang, Gang Peng

In this paper, we introduce a simple but quite effective recognition framework dubbed D-PCN, which aims at enhancing the feature extraction ability of CNNs. The framework consists of two parallel CNNs, a discriminator, and an extra classifier which takes integrated features from the parallel networks and gives the final prediction. The discriminator is the core component: it drives the parallel networks to focus on different regions and learn different representations. A corresponding training strategy is introduced to ensure that the discriminator is utilized. We validate D-PCN with several CNN models on the benchmark datasets CIFAR-100 and ImageNet, and D-PCN enhances all models. In particular, it yields state-of-the-art performance on CIFAR-100 compared with related works. We also conduct a visualization experiment on the fine-grained Stanford Dogs dataset to verify our motivation. Additionally, we apply D-PCN to segmentation on PASCAL VOC 2012 and also observe an improvement. [1807.02265v1]


Deep Sequential Segmentation of Organs in Volumetric Medical Scans

Alexey Novikov, David Major, Maria Wimmer, Dimitrios Lenis, Katja Bühler

Segmentation in 3D scans is playing an increasingly important role in current clinical practice, supporting diagnosis, tissue quantification, or treatment planning. Current CNN-based 3D approaches usually suffer from at least three main issues caused predominantly by implementation constraints: first, they require resizing the volume to lower-resolution reference dimensions; second, their capacity is very limited due to memory restrictions; and third, all slices of a volume have to be available at any given training or testing time. We address these problems with a U-Net-like architecture consisting of bidirectional Convolutional LSTM and convolutional, pooling, upsampling and concatenation layers enclosed in time-distributed wrappers. Our network can either process full volumes in a sequential manner or segment slabs of slices on demand. We demonstrate the performance of our architecture on vertebrae and liver segmentation tasks in 3D CT scans. [1807.02437v1]


Progressive Spatial Recurrent Neural Network for Intra Prediction

Yueyu Hu, Wenhan Yang, Mading Li, Jiaying Liu

Intra prediction is an important component of modern video codecs, which is able to efficiently squeeze out the spatial redundancy in video frames. With preceding pixels as the context, traditional intra prediction schemes generate linear predictions based on several predefined directions (i.e. modes) for blocks to be encoded. However, these modes are relatively simple and their predictions may fail when facing blocks with complex textures, which leads to additional bits encoding the residue. In this paper, we design a Progressive Spatial Recurrent Neural Network (PS-RNN) that learns to conduct intra prediction. Specifically, our PS-RNN consists of three spatial recurrent units and progressively generates predictions by passing information along from preceding contents to blocks to be encoded. To make our network generate predictions considering both distortion and bit-rate, we propose to use Sum of Absolute Transformed Difference (SATD) as the loss function to train PS-RNN since SATD is able to measure rate-distortion cost of encoding a residue block. Moreover, our method supports variable-block-size for intra prediction, which is more practical in real coding conditions. The proposed intra prediction scheme achieves on average 2.4% bit-rate reduction on variable-block-size settings under the same reconstruction quality compared with HEVC. [1807.02232v1]
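For readers unfamiliar with the Sum of Absolute Transformed Difference (SATD) mentioned above, the following sketch computes it for a square block using a Hadamard transform of the residual. It is an illustrative, unnormalized version in plain Python; real codecs apply their own scaling conventions, and how the paper turns SATD into a differentiable training loss is not described in the abstract. The helper names are assumptions.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def satd(original, predicted):
    """Unnormalized SATD: sum of |H * (original - predicted) * H^T|."""
    n = len(original)
    h = hadamard(n)
    h_t = [list(col) for col in zip(*h)]
    residual = [
        [o - p for o, p in zip(row_o, row_p)]
        for row_o, row_p in zip(original, predicted)
    ]
    transformed = matmul(matmul(h, residual), h_t)
    return sum(abs(v) for row in transformed for v in row)


block = [[5] * 4 for _ in range(4)]
prediction = [[4] * 4 for _ in range(4)]
print(satd(block, prediction))
```

Because the residual is measured in the transform domain, SATD tracks the cost of coding the residual more closely than a plain sum of absolute differences, which is the motivation the abstract gives for using it as a loss.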


A Fully Convolutional Two-Stream Fusion Network for Interactive Image Segmentation

Yang Hu, Andrea Soltoggio, Russell Lock, Steve Carter

In this paper, we propose a novel fully convolutional two-stream fusion network (FCTSFN) for interactive image segmentation. The proposed network includes two sub-networks: a two-stream late fusion network (TSLFN) that predicts the foreground at a reduced resolution, and a multi-scale refining network (MSRN) that refines the foreground at full resolution. The TSLFN includes two distinct deep streams followed by a fusion network. The intuition is that, since user interactions carry more direct information on foreground/background than the image itself, the two-stream structure of the TSLFN reduces the number of layers between the pure user interaction features and the network output, allowing the user interactions to have a more direct impact on the segmentation result. The MSRN fuses the features from different layers of the TSLFN at different scales, in order to capture local-to-global information on the foreground and refine the segmentation result at full resolution. We conduct comprehensive experiments on four benchmark datasets. The results show that the proposed network achieves competitive performance compared to current state-of-the-art interactive image segmentation methods. [1807.02480v1]


Deep Back Projection for Sparse-View CT Reconstruction

Dong Hye Ye, Gregery T. Buzzard, Max Ruby, Charles A. Bouman

Filtered back projection (FBP) is a classical method for image reconstruction from sinogram CT data. FBP is computationally efficient but produces lower quality reconstructions than more sophisticated iterative methods, particularly when the number of views is lower than the number required by the Nyquist rate. In this paper, we use a deep convolutional neural network (CNN) to produce high-quality reconstructions directly from sinogram data. A primary novelty of our approach is that we first back project each view separately to form a stack of back projections and then feed this stack as input into the convolutional neural network. These single-view back projections convert the encoding of sinogram data into the appropriate spatial location, which can then be leveraged by the spatial invariance of the CNN to learn the reconstruction effectively. We demonstrate the benefit of our CNN based back projection on simulated sparse-view CT data over classical FBP. [1807.02370v1]
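The step of back projecting each view separately into a stack can be sketched as follows. This is a simplified parallel-beam model with nearest-neighbour detector lookup, written in plain Python; the geometry, the function names and the toy sinogram are assumptions for illustration, and the CNN that would consume the stack is not shown.

```python
import math


def back_project_view(projection, angle, size):
    """Smear one 1D projection across a size x size image along its view angle."""
    c = (size - 1) / 2.0  # image and detector centre
    image = []
    for row in range(size):
        y = row - c
        line = []
        for col in range(size):
            x = col - c
            # Detector coordinate hit by the ray through pixel (x, y).
            t = x * math.cos(angle) + y * math.sin(angle)
            idx = int(math.floor(t + c + 0.5))  # nearest detector bin
            line.append(projection[idx] if 0 <= idx < len(projection) else 0.0)
        image.append(line)
    return image


def back_projection_stack(sinogram, angles, size):
    """One back-projected image per view, kept separate rather than summed."""
    return [back_project_view(p, a, size) for p, a in zip(sinogram, angles)]


sinogram = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # two toy views, three bins each
angles = [0.0, math.pi / 2]
stack = back_projection_stack(sinogram, angles, size=3)
for view in stack:
    print(view)
```

Classical back projection would sum these images; keeping them as separate channels leaves it to a learned model to decide how the views are combined, which matches the idea stated in the abstract.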


Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, Xiang Bai

Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Different from previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks. [1807.02242v1]


Explainable Learning: Implicit Generative Modelling during Training for Adversarial Robustness

Priyadarshini Panda, Kaushik Roy

We introduce Explainable Learning (ExL), an approach for training neural networks that are intrinsically robust to adversarial attacks. We find that the implicit generative modelling of random noise, during posterior maximization, improves a model's understanding of the data manifold, furthering adversarial robustness. We demonstrate our approach's efficacy and provide a simple visualization tool for understanding adversarial data, using Principal Component Analysis. Our analysis reveals that adversarial robustness, in general, manifests in models with higher variance along the high-ranked principal components. We show that models learnt with ExL perform remarkably well against a wide range of black-box attacks. [1807.02188v1]
