TextTopicNet – Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces + 3D Human Action Recognition with Siamese-LSTM Based Deep Metric Learning + A Single Shot Text Detector with Scale-adaptive Anchors

Volumetric performance capture from minimal camera viewpoints

Andrew Gilbert, Marco Volino, John Collomosse, Adrian Hilton

We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count. [1807.01950v1]

 

Automatic deep learning-based normalization of breast dynamic contrast-enhanced magnetic resonance images

Jun Zhang, Ashirbani Saha, Brian J. Soher, Maciej A. Mazurowski

Objective: To develop an automatic image normalization algorithm for intensity correction of images from breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) acquired by different MRI scanners with various imaging parameters, using only image information. Methods: DCE-MR images of 460 subjects with breast cancer acquired by different scanners were used in this study. Each subject had one T1-weighted pre-contrast image and three T1-weighted post-contrast images available. Our normalization algorithm operated under the assumption that the same type of tissue in different patients should be represented by the same voxel value. We used four tissue/material types as the anchors for the normalization: 1) air, 2) fat tissue, 3) dense tissue, and 4) heart. The algorithm proceeded in the following two steps: First, a state-of-the-art deep learning-based algorithm was applied to perform tissue segmentation accurately and efficiently. Then, based on the segmentation results, a subject-specific piecewise linear mapping function was applied between the anchor points to normalize the same type of tissue in different patients into the same intensity ranges. We evaluated the algorithm with 300 subjects used for training and the rest used for testing. Results: The application of our algorithm to images with different scanning parameters resulted in highly improved consistency in pixel values and extracted radiomics features. Conclusion: The proposed image normalization strategy based on tissue segmentation can perform intensity correction fully automatically, without the knowledge of the scanner parameters. Significance: We have thoroughly tested our algorithm and showed that it successfully normalizes the intensity of DCE-MR images. We made our software publicly available for others to apply in their analyses. [1807.02152v1]
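The piecewise linear mapping at the core of this method is easy to sketch. Below is a minimal illustration, assuming the four anchor intensities have already been obtained from the segmentation step; all numeric values are hypothetical.

```python
import numpy as np

# A minimal sketch of the subject-specific piecewise linear mapping described
# above. The anchor intensities (air, fat, dense tissue, heart) would come from
# the deep-learning segmentation; the values below are hypothetical.
def normalize_intensities(image, subject_anchors, target_anchors):
    """Map each voxel through a piecewise linear function defined by anchors."""
    # np.interp performs piecewise linear interpolation between anchor points;
    # values outside the anchor range are clamped to the end segments.
    return np.interp(image, subject_anchors, target_anchors)

# Hypothetical anchor intensities measured in one subject's image ...
subject_anchors = [10.0, 220.0, 410.0, 890.0]   # air, fat, dense tissue, heart
# ... and the common target scale that all subjects are mapped onto.
target_anchors = [0.0, 0.25, 0.5, 1.0]

image = np.random.uniform(0, 1000, size=(64, 64))
normalized = normalize_intensities(image, subject_anchors, target_anchors)
```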

 

Subpixel-Precise Tracking of Rigid Objects in Real-time

Tobias Böttger, Markus Ulrich, Carsten Steger

We present a novel object tracking scheme that can track rigid objects in real time. The approach uses subpixel-precise image edges to track objects with high accuracy. It can determine the object position, scale, and rotation with subpixel precision at around 80 fps. The tracker returns a reliable score for each frame and is capable of self-diagnosing a tracking failure. Furthermore, the choice of the similarity measure makes the approach inherently robust against occlusion, clutter, and nonlinear illumination changes. We evaluate the method on sequences of rigid objects from the OTB-2015 and VOT2016 datasets and discuss its performance. The evaluation shows that the tracker is more accurate than state-of-the-art real-time trackers while being equally robust. [1807.01952v1]

 

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Christoph Wick, Christian Reul, Frank Puppe

Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Especially historical prints require book-specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort for manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient (Reul et al., 2018a, Reul et al., 2018b). Calamari is a new open source OCR line recognition software that both uses state-of-the-art Deep Neural Networks (DNNs) implemented in Tensorflow and gives native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained by the Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, which considerably outperforms the results of the existing software. [1807.02004v1]

 

Model-free Consensus Maximization for Non-Rigid Shapes

Thomas Probst, Ajad Chhatkuli, Danda Pani Paudel, Luc Van Gool

Many computer vision methods rely on consensus maximization to relate measurements containing outliers with a reliable transformation model. In the context of matching rigid shapes, this is typically done using Random Sample Consensus (RANSAC) to estimate an analytical model that agrees with the largest number of measurements, which constitute the inliers. However, such models are either not available or too complex for non-rigid shapes. In this paper, we formulate the model-free consensus maximization problem as an Integer Program in a graph using ‘rules’ on measurements. We then provide a method to solve such a formulation optimally using the Branch and Bound (BnB) paradigm. In the context of non-rigid shapes, we apply the method to filter out outlier 3D correspondences and achieve performance superior to the state-of-the-art. Our method works with outlier ratios as high as 80%. We further derive a similar formulation for 3D template-to-image correspondences. Our approach achieves similar or better performance compared to the state-of-the-art. [1807.01963v1]

 

Consistent Generative Query Networks

Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, Murray Shanahan

Stochastic video prediction is usually framed as an extrapolation problem where the goal is to sample a sequence of consecutive future image frames conditioned on a sequence of observed past frames. For the most part, algorithms for this task generate future video frames sequentially in an autoregressive fashion, which is slow and requires the input and output to be consecutive. We introduce a model that overcomes these drawbacks — it learns to generate a global latent representation from an arbitrary set of frames within a video. This representation can then be used to simultaneously and efficiently sample any number of temporally consistent frames at arbitrary time-points in the video. We apply our model to synthetic video prediction tasks and achieve results that are comparable to state-of-the-art video prediction models. In addition, we demonstrate the flexibility of our model by applying it to 3D scene reconstruction where we condition on location instead of time. To the best of our knowledge, our model is the first to provide flexible and coherent prediction on stochastic video datasets, as well as consistent 3D scene samples. Please check the project website https://bit.ly/2jX7Vyu to view scene reconstructions and videos produced by our model. [1807.02033v1]

 

Open Logo Detection Challenge

Hang Su, Xiatian Zhu, Shaogang Gong

Existing logo detection benchmarks consider artificial deployment scenarios by assuming that large training data with fine-grained bounding box annotations for each class are available for model training. Such assumptions are often invalid in realistic logo detection scenarios, where new logo classes arrive progressively and need to be detected with little or no budget for exhaustively labelling fine-grained training data for every new class. Existing benchmarks are thus unable to evaluate the true performance of a logo detection method in realistic and open deployments. In this work, we introduce a more realistic and challenging logo detection setting, called Open Logo Detection. Specifically, this new setting assumes fine-grained labelling only on a small proportion of logo classes whilst the remaining classes have no labelled training data, to simulate the open deployment. We further create an open logo detection benchmark, called OpenLogo, to promote the investigation of this new challenge. OpenLogo contains 27,189 images from 309 logo classes, built by aggregating/refining 7 existing datasets and establishing an open logo detection evaluation protocol. To address this challenge, we propose a Context Adversarial Learning (CAL) approach to synthesising training data with coherent logo instance appearance against diverse background context, enabling more effective optimisation of contemporary deep learning detection models. Experiments show the performance advantage of CAL over existing state-of-the-art alternative methods on the more realistic and challenging OpenLogo benchmark. [1807.01964v1]

 

Reflection Analysis for Face Morphing Attack Detection

Clemens Seibold, Anna Hilsmann, Peter Eisert

A facial morph is a synthetically created image of a face that looks similar to two different individuals and can even trick biometric facial recognition systems into recognizing both individuals. This attack is known as the face morphing attack. The process of creating such a facial morph is well documented, and many tutorials and software tools to create them are freely available. Therefore, it is mandatory to be able to detect this kind of fraud to ensure the integrity of the face as a reliable biometric feature. In this work, we study the effects of face morphing on the physical correctness of the illumination. We estimate the direction to the light sources based on specular highlights in the eyes and use them to generate a synthetic map for highlights on the skin. This map is compared with the highlights in the image that is suspected to be a fraud. Morphing faces with different geometries, poorly aligned source images, or source images with different illuminations can lead to inconsistencies in reflections that indicate the existence of a morphing attack. [1807.02030v1]

 

Learning a Representation Map for Robot Navigation using Deep Variational Autoencoder

Kaixin Hu, Peter O’Connor

The aim of this work is to use a Variational Autoencoder (VAE) to learn a representation of an indoor environment that can be used for robot navigation. We use images extracted from a video, in which a camera takes a tour around a house, to train a VAE model with a 4-dimensional latent space. After the model is trained, each real frame has a corresponding representation point on a manifold in the latent space, and each representation point has a corresponding reconstructed image. For the navigation problem, we map the starting image and destination image to the latent space, then optimize a path on the learned manifold connecting the two points, and finally map the path back through the decoder to a sequence of images. The ideal sequence of images should correspond to a route that is spatially continuous, i.e. neighboring images in the route should correspond to neighboring locations in physical space. Such a route could be used for navigation with computer vision techniques, i.e. a robot could follow the image sequence from the starting location to the destination in the environment step by step. We implement this algorithm, but find in our experimental results that the resulting route is not satisfactory. The route consists of several discontinuous image frames along the ideal route, so in practice it could not be followed by a robot using computer vision techniques. In our evaluation, we propose two reasons for our failure to automatically find continuous routes: (1) the VAE tends to capture global structures but discard the details; (2) the Euclidean similarity metric used for measuring continuity between house images is sub-optimal. For further work, we propose trying other generative models, such as VAE-GANs, which may be better at reconstructing the details, to learn the representation map, and adjusting the similarity metric in the path selection algorithm. [1807.02401v1]
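The navigation step can be sketched compactly. The snippet below assumes a trained VAE; `encode` and `decode` are hypothetical stand-ins (simple linear maps, so the example runs end to end), and the path is the unoptimized straight line between the two latent codes, the simplest instance of a path on the learned manifold.

```python
import numpy as np

# A minimal sketch of navigation by a latent-space path, assuming a trained
# VAE. `encode` and `decode` are hypothetical stand-ins for the real networks.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64 * 64))             # fake encoder weights
encode = lambda img: W @ img.ravel()          # image -> 4-D latent point
decode = lambda z: (W.T @ z).reshape(64, 64)  # latent point -> image

start_img = rng.uniform(size=(64, 64))
dest_img = rng.uniform(size=(64, 64))
z_start, z_dest = encode(start_img), encode(dest_img)

# Simplest path between the two latent codes: evenly spaced points on the
# straight line. The paper instead optimizes the path; linear interpolation
# is the degenerate, unoptimized case.
n_steps = 10
ts = np.linspace(0.0, 1.0, n_steps)
route = [decode((1 - t) * z_start + t * z_dest) for t in ts]
```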

 

Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving

Peiliang Li, Tong Qin, Shaojie Shen

We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box using end-to-end approaches, we propose to use easy-to-label 2D detection and discrete viewpoint classification together with a light-weight semantic inference method to obtain rough 3D object measurements. Based on object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach that fuses temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity, and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with state-of-the-art solutions. [1807.02062v1]

 

Beef Cattle Instance Segmentation Using Fully Convolutional Neural Network

Aram Ter-Sarkisov, Robert Ross, John Kelleher, Bernadette Earley, Michael Keane

We present an instance segmentation algorithm trained and applied to a CCTV recording of beef cattle during a winter finishing period. A fully convolutional network was transformed into an instance segmentation network that learns to label each instance of an animal separately. We introduce a conceptually simple framework that the network uses to output a single prediction for every animal. These results are a contribution towards behaviour analysis in winter finishing beef cattle for early detection of animal welfare-related problems. [1807.01972v1]

 

Perspective-Aware CNN For Crowd Counting

Miaojing Shi, Zhaohui Yang, Chao Xu, Qijun Chen

Crowd counting is the task of estimating pedestrian numbers in crowd images. Modern crowd counting methods employ deep neural networks to estimate crowd counts via crowd density regressions. A major challenge of this task lies in the drastic changes of scales and perspectives in images. Representative approaches usually utilize different (large) sized filters and conduct patch-based estimations to tackle it, which is however computationally expensive. In this paper, we propose a perspective-aware convolutional neural network (PACNN) with a single backbone of small filters (e.g. 3×3). It directly predicts a perspective map in the network and encodes it as a perspective-aware weighting layer to adaptively combine the density outputs from multi-scale feature maps. The weights are learned at every pixel of the map such that the final combination is robust to perspective changes and pedestrian size variations. We conduct extensive experiments on the ShanghaiTech, WorldExpo’10 and UCF_CC_50 datasets, and demonstrate that PACNN achieves state-of-the-art results and runs as fast as the fastest. [1807.01989v1]
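A minimal sketch of the perspective-aware weighting idea follows, assuming density maps from two feature scales and a predicted perspective map; the sigmoid weighting rule and all shapes are assumptions, not the paper's exact layer.

```python
import numpy as np

# A minimal sketch of perspective-aware weighting: density maps predicted from
# two feature scales are blended with a per-pixel weight map derived from the
# predicted perspective map. All shapes and the sigmoid rule are assumptions.
H, W = 96, 128
rng = np.random.default_rng(1)
density_fine = rng.uniform(size=(H, W))    # density from a shallow/fine scale
density_coarse = rng.uniform(size=(H, W))  # density from a deep/coarse scale
perspective = rng.uniform(size=(H, W))     # predicted perspective map

# Per-pixel weights in (0, 1): large perspective values (big pedestrians)
# favour the coarse map, small values favour the fine map.
w = 1.0 / (1.0 + np.exp(-(perspective - perspective.mean())))
density = w * density_coarse + (1.0 - w) * density_fine
crowd_count = density.sum()                # final count = integral of density
```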

 

MAT-CNN-SOPC: Motionless Analysis of Traffic Using Convolutional Neural Networks on System-On-a-Programmable-Chip

Somdip Dey, Grigorios Kalliatakis, Sangeet Saha, Amit Kumar Singh, Shoaib Ehsan, Klaus McDonald-Maier

Intelligent Transportation Systems (ITS) have become an important pillar in the modern “smart city” framework, which demands intelligent involvement of machines. Traffic load recognition can be categorized as an important and challenging issue for such systems. Recently, Convolutional Neural Network (CNN) models have drawn a considerable amount of interest in many areas, such as weather classification and human rights violation detection through images, due to their accurate prediction capabilities. This work tackles the real-life traffic load recognition problem on a System-On-a-Programmable-Chip (SOPC) platform and coins it MAT-CNN-SOPC, which uses an intelligent re-training mechanism of the CNN with known environments. The proposed methodology is capable of enhancing the efficacy of the approach by 2.44x in comparison to the state-of-the-art, as proven through experimental analysis. We have also introduced a mathematical equation capable of quantifying the suitability of using one CNN model over another for a particular application-based implementation. [1807.02098v1]

 

Detecting Visual Relationships Using Box Attention

Alexander Kolesnikov, Christoph H. Lampert, Vittorio Ferrari

In this paper we propose a new model for detecting visual relationships. Our main technical novelty is a Box Attention mechanism that allows modelling pairwise interactions between objects in visual scenes using standard object detection pipelines. The resulting model is conceptually clean, expressive and relies on well-justified training and prediction procedures. Moreover, unlike previously proposed approaches, our model does not introduce any additional complex components or hyperparameters on top of those already required by the underlying detection model. We conduct an experimental evaluation on two challenging datasets, V-COCO and Visual Relationships, demonstrating strong quantitative and qualitative results. [1807.02136v1]

 

A Gauss-Newton Approach to Real-Time Monocular Multiple Object Tracking

Henning Tjaden, Ulrich Schwanecke, Elmar Schömer, Daniel Cremers

We propose an algorithm for real-time 6DOF pose tracking of rigid 3D objects using a monocular RGB camera. The key idea is to derive a region-based cost function using temporally consistent local color histograms. While such region-based cost functions are commonly optimized using first-order gradient descent techniques, we systematically derive a Gauss-Newton optimization scheme which gives rise to drastically faster convergence and highly accurate and robust tracking performance. We furthermore propose a novel complex dataset dedicated to the task of monocular object pose tracking and make it publicly available to the community. To our knowledge, it is the first to address the common and important scenario in which both the camera as well as the objects are moving simultaneously in cluttered scenes. In numerous experiments – including our own proposed dataset – we demonstrate that the proposed Gauss-Newton approach outperforms existing approaches, in particular in the presence of cluttered backgrounds, heterogeneous objects and partial occlusions. [1807.02087v1]
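For readers unfamiliar with the optimization scheme, the snippet below shows a generic Gauss-Newton iteration on a toy least-squares problem; it illustrates the update rule the paper derives, not the paper's region-based cost function.

```python
import numpy as np

# A generic Gauss-Newton iteration: repeatedly solve the normal equations
# (J^T J) delta = -J^T r built from the residuals r and their Jacobian J.
def gauss_newton(residual, jacobian, theta, n_iters=20):
    for _ in range(n_iters):
        r = residual(theta)                  # residual vector at theta
        J = jacobian(theta)                  # Jacobian of the residuals
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        theta = theta + delta
    return theta

# Toy problem: fit y = a * exp(b * x) to noisy samples.
x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(-1.5 * x) + np.random.default_rng(2).normal(0, 0.01, x.size)
residual = lambda th: th[0] * np.exp(th[1] * x) - y
jacobian = lambda th: np.stack([np.exp(th[1] * x),
                                th[0] * x * np.exp(th[1] * x)], axis=1)
theta_hat = gauss_newton(residual, jacobian, np.array([1.0, -1.0]))
```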

 

3D Human Action Recognition with Siamese-LSTM Based Deep Metric Learning

Seyma Yucer, Yusuf Sinan Akgul

This paper proposes a new 3D Human Action Recognition system as a two-phase system: (1) a Deep Metric Learning Module which learns a similarity metric between two 3D joint sequences using Siamese-LSTM networks; (2) a Multiclass Classification Module that uses the output of the first module to produce the final recognition output. This model has several advantages: the first module is trained with a larger set of data because it uses many combinations of sequence pairs. Our deep metric learning module can also be trained independently of the datasets, which makes our system modular and generalizable. We tested the proposed system on standard and newly introduced datasets, and the initial results are promising. We will continue developing this system by adding more sophisticated LSTM blocks and by cross-training between different datasets. [1807.02131v1]
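A minimal sketch of such a Siamese-LSTM metric module follows, assuming skeleton sequences flattened to joints × coordinates per frame; all dimensions and the Euclidean output distance are assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of a Siamese-LSTM metric module: one shared LSTM encodes
# both 3D joint sequences and the distance between the final hidden states
# serves as the learned (dis)similarity. Dimensions are assumptions.
class SiameseLSTM(nn.Module):
    def __init__(self, n_joints=25, coords=3, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_joints * coords, hidden, batch_first=True)

    def embed(self, seq):                 # seq: (batch, time, joints*coords)
        _, (h, _) = self.lstm(seq)
        return h[-1]                      # final hidden state as embedding

    def forward(self, seq_a, seq_b):
        # Euclidean distance between the embeddings of the two sequences.
        return torch.norm(self.embed(seq_a) - self.embed(seq_b), dim=1)

model = SiameseLSTM()
a, b = torch.randn(4, 60, 75), torch.randn(4, 60, 75)
dist = model(a, b)   # trained with e.g. a contrastive loss on pair labels
```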

 

Detecting Tiny Moving Vehicles in Satellite Videos

Wei Ao, Yanwei Fu, Feng Xu

In recent years, satellite videos have been captured by moving satellite platforms. In contrast to consumer, movie, and common surveillance videos, satellite video can record a snapshot of a city-scale scene. In the broad field-of-view of satellite videos, each moving target is very tiny, usually comprising only a few pixels per frame. Even worse, noise signals also exist in the video frames, since the background undergoes subpixel-level, uneven motion due to the movement of the satellite. We argue that this is a new type of computer vision task, since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that can identify the small moving vehicles in satellite videos. In particular, we offer a novel detection algorithm based on local noise modeling. We differentiate the potential vehicle targets from noise patterns by an exponential probability distribution. Subsequently, a multi-morphological-cue based discrimination strategy is designed to further distinguish correct vehicle targets from the few remaining noises. Another significant contribution is to introduce a series of evaluation protocols to systematically measure the performance of tiny moving vehicle detection. We annotate a satellite video manually and use it to test our algorithms under different evaluation criteria. The proposed algorithm is also compared with state-of-the-art baselines, and demonstrates the advantages of our framework over the benchmarks. [1807.01864v1]

 

Real-Time Subpixel Fast Bilateral Stereo

Rui Fan, Yanan Liu, Mohammud Junaid Bocus, Ming Liu

Stereo vision techniques have been widely used in robotic systems to acquire 3-D information. In recent years, many researchers have applied bilateral filtering in stereo vision to adaptively aggregate the matching costs. This has greatly improved the accuracy of the estimated disparity maps. However, the process of filtering the whole cost volume is very time-consuming, and therefore researchers have had to resort to powerful hardware for real-time performance. This paper presents an implementation of fast bilateral stereo on a state-of-the-art GPU. By fully exploiting the parallel computing architecture of the GPU, the fast bilateral stereo performs in real time when processing the Middlebury stereo datasets. [1807.02044v1]

 

Detection and Analysis of Content Creator Collaborations in YouTube Videos using Face- and Speaker-Recognition

Moritz Lode, Michael Örtl, Christian Koch, Amr Rizk, Ralf Steinmetz

This work discusses and implements the application of speaker recognition for the detection of collaborations in YouTube videos. CATANA, an existing framework for detection and analysis of YouTube collaborations, utilizes face recognition for the detection of collaborators, which naturally performs poorly on video content in which no faces appear. This work proposes an extension of CATANA using active speaker detection and speaker recognition to improve the detection accuracy. [1807.02020v1]

 

Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders

Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, Carsten Steger

Convolutional autoencoders have emerged as popular models for unsupervised defect segmentation on image data. Most commonly, this task is performed by thresholding a pixel-wise reconstruction error based on an $\ell^p$ distance. However, this procedure generally leads to high novelty scores whenever the reconstruction encompasses slight localization inaccuracies around edges. We show that this problem prevents these approaches from being applied to complex real-world scenarios and that it cannot be easily avoided by employing more elaborate architectures. Instead, we propose to use a perceptual loss function based on structural similarity. Our approach achieves state-of-the-art performance on a real-world dataset of nanofibrous materials, while being trained end-to-end without requiring additional priors such as pretrained networks or handcrafted features. [1807.02011v1]
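A minimal sketch of the core idea follows: replace the per-pixel $\ell^p$ residual with a local SSIM map that is then thresholded. Window size, stabilizing constants, and the threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

# A minimal sketch of segmenting defects from an autoencoder by thresholding a
# local SSIM map instead of a per-pixel l^p residual. Window size, constants,
# and threshold are assumptions.
def ssim_map(x, y, win=11, c1=0.01**2, c2=0.03**2):
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x**2
    var_y = uniform_filter(y * y, win) - mu_y**2
    cov = uniform_filter(x * y, win) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2) /
            ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)))

image = np.random.default_rng(3).uniform(size=(256, 256))   # input image
recon = image.copy()                                        # autoencoder output
recon[100:120, 100:120] = 0.0                               # fake defect region
defect_mask = ssim_map(image, recon) < 0.5   # low structural similarity = defect
```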

 

Combining Background Subtraction Algorithms with Convolutional Neural Network

Dongdong Zeng, Ming Zhu, Arjan Kuijper

Accurate and fast extraction of foreground objects is a key prerequisite for a wide range of computer vision applications such as object tracking and recognition. Thus, numerous background subtraction methods for foreground object detection have been proposed in recent decades. However, it is still regarded as a tough problem due to a variety of challenges such as illumination variations, camera jitter, dynamic backgrounds, shadows, and so on. Currently, there is no single method that can handle all the challenges in a robust way. In this letter, we try to solve this problem from a new perspective by combining different state-of-the-art background subtraction algorithms to create a more robust and more advanced foreground detection algorithm. More concretely, an encoder-decoder fully convolutional neural network architecture is trained to automatically learn how to leverage the characteristics of different algorithms to fuse the results produced by different background subtraction algorithms and output a more precise result. Comprehensive experiments on the CDnet 2014 dataset demonstrate that the proposed method outperforms all the considered single background subtraction algorithms. We also show that our solution is more efficient than other combination strategies. [1807.02080v1]
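A minimal sketch of the fusion architecture follows, assuming the outputs of several background subtraction algorithms are stacked as the input channels of an encoder-decoder network; layer sizes are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# A minimal sketch of fusing several background subtraction (BGS) masks with
# an encoder-decoder CNN: masks are stacked as channels, the network outputs
# one fused foreground probability map. Layer sizes are assumptions.
class FusionNet(nn.Module):
    def __init__(self, n_algorithms=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_algorithms, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, masks):            # masks: (batch, n_algorithms, H, W)
        return self.decoder(self.encoder(masks))

net = FusionNet()
masks = torch.rand(2, 4, 240, 320)       # outputs of 4 BGS algorithms
fused = net(masks)                       # (2, 1, 240, 320) fused foreground
```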

 

Road surface 3d reconstruction based on dense subpixel disparity map estimation

Rui Fan, Xiao Ai, Naim Dahnoun

Various 3D reconstruction methods have enabled civil engineers to detect damage on a road surface. To achieve the millimetre accuracy required for road condition assessment, a disparity map with subpixel resolution needs to be used. However, none of the existing stereo matching algorithms is especially suitable for the reconstruction of the road surface. Hence, in this paper we propose a novel dense subpixel disparity estimation algorithm with high computational efficiency and robustness. This is achieved by first transforming the perspective view of the target frame into the reference view, which not only increases the accuracy of the block matching for the road surface but also improves the processing speed. The disparities are then estimated iteratively using our previously published algorithm, where the search range is propagated from three estimated neighbouring disparities. Since the search range is obtained from the previous iteration, errors may occur when the propagated search range is not sufficient. Therefore, a correlation maxima verification is performed to rectify this issue, and the subpixel resolution is achieved by conducting a parabola interpolation enhancement. Furthermore, a novel disparity global refinement approach developed from Markov Random Fields and Fast Bilateral Stereo is introduced to further improve the accuracy of the estimated disparity map, where disparities are updated iteratively by minimising the energy function that is related to their interpolated correlation polynomials. The algorithm is implemented in C with near real-time performance. The experimental results illustrate that the absolute error of the reconstruction varies from 0.1 mm to 3 mm. [1807.01874v1]
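The parabola interpolation step is standard and easy to illustrate: fit a parabola through the matching costs at the integer disparity and its two neighbours and take the vertex. The cost values below are hypothetical.

```python
import numpy as np

# A minimal sketch of parabola interpolation for subpixel disparity: fit a
# parabola through the matching costs at d-1, d, d+1 and take its vertex.
def subpixel_disparity(d, c_prev, c_best, c_next):
    """d: integer disparity minimizing the cost; c_*: costs at d-1, d, d+1."""
    denom = c_prev - 2.0 * c_best + c_next
    if denom <= 0:                # degenerate (flat or non-convex) cost curve
        return float(d)
    return d + 0.5 * (c_prev - c_next) / denom

# Example: a cost curve whose true minimum lies between disparities 17 and 18.
print(subpixel_disparity(17, c_prev=0.42, c_best=0.30, c_next=0.34))  # 17.25
```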

 

PortraitGAN for Flexible Portrait Manipulation

Jiali Duan, Xiaoyuan Guo, Yuhang Song, Chao Yang, C. -C. Jay Kuo

Previous methods have dealt with discrete manipulation of facial attributes such as smiling, sad, angry, and surprised, drawn from a set of canonical expressions; they are not scalable and operate in a single modality. In this paper, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. Specifically, we adapt cycle-consistency to the conditional setting by leveraging additional facial landmark information. This has two effects: first, cycle mapping induces bidirectional manipulation and identity preservation; second, paired samples from different modalities can thus be utilized. To ensure high-quality synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. Quantitative and qualitative experiments show the effectiveness of our framework in performing flexible and multi-modality portrait manipulation with photo-realistic effects. [1807.01826v1]

 

Face Recognition Using Map Discriminant on YCbCr Color Space

I Gede Pasek Suta Wijaya

This paper presents face recognition using a maximum a posteriori (MAP) discriminant on the YCbCr color space. The YCbCr color space is considered in order to cover the skin information of the face image in the recognition process. The proposed method is employed to improve the recognition rate and equal error rate (EER) of grayscale-based face recognition. In this case, a face feature vector consisting of a small set of dominant frequency elements, extracted by non-blocking DCT, is used as a dimensionality reduction of the raw face images. The matching process between the query face features and the trained face features is performed using the MAP discriminant. Experimental results on data from four face databases containing 2268 images across 196 classes show that face recognition in the YCbCr color space provides a better recognition rate and lower EER than grayscale-based face recognition, improving the first-rank result of the grayscale-based method by about 4%. However, it requires three times more computation time than the grayscale-based method. [1807.02135v1]
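A minimal sketch of MAP discriminant classification on such feature vectors follows, assuming Gaussian class-conditional densities with a shared covariance; dimensions and data are dummies.

```python
import numpy as np

# A minimal sketch of maximum a posteriori (MAP) discriminant classification:
# model each class with a Gaussian and pick the class maximizing
# log p(x|k) + log P(k). Feature dimensions and data are dummies.
def map_discriminant(x, means, cov_inv, log_det, priors):
    scores = []
    for mu, prior in zip(means, priors):
        d = x - mu
        log_lik = -0.5 * (d @ cov_inv @ d + log_det)   # Gaussian log-likelihood
        scores.append(log_lik + np.log(prior))          # add log prior -> MAP
    return int(np.argmax(scores))

rng = np.random.default_rng(4)
dim, n_classes = 16, 3                      # e.g. low-dim DCT face features
means = [rng.normal(size=dim) for _ in range(n_classes)]
cov = np.eye(dim)                           # shared covariance (assumption)
label = map_discriminant(rng.normal(size=dim), means,
                         np.linalg.inv(cov), np.log(np.linalg.det(cov)),
                         priors=[1 / n_classes] * n_classes)
```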

 

Spatiotemporal KSVD Dictionary Learning for Online Multi-target Tracking

Huynh Manh, Gita Alaghband

In this paper, we present a new spatiotemporal discriminative KSVD dictionary algorithm (STKSVD) for learning target appearance in online multi-target tracking. Different from other classification/recognition tasks (e.g. face or image recognition), learning a target's appearance in online multi-target tracking is impacted by factors such as posture/articulation changes, partial occlusion by the background scene or other targets, and background changes (a human detection bounding box covers human parts and part of the scene). However, we observe that these variations occur gradually relative to spatial and temporal dynamics. We characterize the spatial and temporal information between a target's samples through a new STKSVD appearance learning algorithm to better discriminate sparse codes and linear classifier parameters while minimizing reconstruction error in a single optimization system. Our appearance learning algorithm and tracking framework employ two different methods of calculating the appearance similarity score in each stage of a two-stage association: a linear classifier in the first stage, and minimum residual errors in the second stage. Results on the 2DMOT2015 dataset, using its public Aggregated Channel Features (ACF) human detections for all comparisons, show that our method outperforms the existing related learning methods. [1807.02143v1]

 

A Single Shot Text Detector with Scale-adaptive Anchors

Qi Yuan, Bingwang Zhang, Haojie Li, Zhihui Wang, Zhongxuan Luo

Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large number of anchors with different scales to discover texts in scene images, thus leading to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of the underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a small number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to the discrete scales used in previous methods, the learned continuous scales are more reliable, especially for detecting small text. Additionally, we propose Anchor convolution to better exploit necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only $0.28$ seconds per image, while outperforming most state-of-the-art methods in accuracy. [1807.01884v1]
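A minimal sketch of what a scale-adaptive anchor could look like: a continuous offset predicted by the scale regression layer rescales a base anchor before box regression. The exponential parameterization below is our assumption, not necessarily the paper's.

```python
import numpy as np

# A minimal sketch of a scale-adaptive anchor: a predicted scale offset from
# the additional regression layer rescales the base anchor, so one anchor can
# cover text of many sizes. All values are dummies.
def adapt_anchor(anchor_wh, scale_offset):
    """Rescale an anchor's (w, h) by a learned continuous scale factor."""
    return anchor_wh * np.exp(scale_offset)   # exp keeps the scale positive

base_anchor = np.array([32.0, 32.0])                 # one fixed base anchor
print(adapt_anchor(base_anchor, scale_offset=0.8))   # grows for large text
print(adapt_anchor(base_anchor, scale_offset=-0.6))  # shrinks for small text
```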

 

Learning Personalized Representation for Inverse Problems in Medical Imaging Using Deep Neural Network

Kuang Gong, Kyungsang Kim, Jianan Cui, Ning Guo, Ciprian Catana, Jinyi Qi, Quanzheng Li

Recently, deep neural networks have been widely and successfully applied in computer vision tasks and have attracted growing interest in medical imaging. One barrier to the application of deep neural networks to medical imaging is the need for large amounts of prior training pairs, which is not always feasible in clinical practice. In this work, we propose a personalized representation learning framework where no prior training pairs are needed; only the patient's own prior images are required. The representation is expressed using a deep neural network with the patient's prior images as network input. We then apply this novel image representation to inverse problems in medical imaging, in which the original inverse problem is formulated as a constrained optimization problem and solved using the alternating direction method of multipliers (ADMM) algorithm. Anatomically guided brain positron emission tomography (PET) image reconstruction and image denoising are employed as examples to demonstrate the effectiveness of the proposed framework. Quantification results based on simulation and real datasets show that the proposed personalized representation framework outperforms other widely adopted methods. [1807.01759v1]
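A hedged sketch of the constrained formulation and its ADMM splitting, reconstructed from the description above: the network $f$ with parameters $\theta$ takes the patient's prior images $z$ as input, and $\rho$, $\mu$ are the penalty weight and scaled dual variable.

```latex
% Hedged reconstruction of the constrained formulation and its ADMM updates:
\begin{aligned}
\hat{x} &= \arg\max_{x,\theta}\; L(y \mid x)
           \quad \text{s.t.} \quad x = f(\theta \mid z),\\[2pt]
x^{k+1} &= \arg\max_{x}\; L(y \mid x)
           - \tfrac{\rho}{2}\,\lVert x - f(\theta^{k}\mid z) + \mu^{k}\rVert^{2},\\
\theta^{k+1} &= \arg\min_{\theta}\;
           \tfrac{\rho}{2}\,\lVert x^{k+1} - f(\theta \mid z) + \mu^{k}\rVert^{2},\\
\mu^{k+1} &= \mu^{k} + x^{k+1} - f(\theta^{k+1}\mid z).
\end{aligned}
```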

 

Unbiased Image Style Transfer

Hyun-Chul Choi, Minseong Kim

Recent fast image style transfer methods use feed-forward neural networks to generate an output image of the desired style strength from an input pair of a content image and a target style image. In existing methods, an image of intermediate style between the content and the target style is obtained by decoding a linearly interpolated feature in the encoded feature space. However, there has been no work analyzing the effectiveness of this kind of style strength interpolation so far. In this paper, we tackle this missing in-depth analysis of style interpolation and propose a method that is more effective in controlling style strength. We interpret the training task of a style transfer network as regression learning between the control parameter and the output style strength. Under this interpretation, the existing methods are biased because training is performed only with one-sided data of full style strength (alpha = 1.0). Thus, this biased learning does not guarantee the generation of a desired intermediate style corresponding to a style control parameter between 0.0 and 1.0. To solve this problem of the biased network, we propose an unbiased learning technique which uses unbiased training data and a corresponding unbiased loss for alpha = 0.0, making the feed-forward network generate a zero-style image, i.e., the content image, when alpha = 0.0. Our experimental results verify that our unbiased learning method achieves the reconstruction of a content image with zero style strength, a better regression specification between the style control parameter and the output style, and more stable style transfer that is insensitive to the weight of the style loss, without additional complexity in the image generation process. [1807.01424v2]
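A minimal sketch of the feature interpolation being analyzed, with hypothetical stand-ins for the encoded content feature and the style-transformed feature:

```python
import numpy as np

# A minimal sketch of style-strength control by feature interpolation: the
# decoder input is a linear blend of the encoded content feature and the
# style-transformed feature. Both features are hypothetical stand-ins here.
rng = np.random.default_rng(5)
content_feat = rng.normal(size=(64,))        # encoded content image
styled_feat = rng.normal(size=(64,))         # feature after style transform

def blend(alpha):
    # alpha = 0.0 should reproduce the content image (the "unbiased" case the
    # paper adds to training); alpha = 1.0 gives full style strength.
    return (1.0 - alpha) * content_feat + alpha * styled_feat

# Decoder inputs at three style strengths; decode() would map these to pixels.
decoder_inputs = {alpha: blend(alpha) for alpha in (0.0, 0.5, 1.0)}
```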

 

Localization Recall Precision (LRP): A New Performance Metric for Object Detection

Kemal Oksuz, Baris Can Cam, Emre Akbas, Sinan Kalkan

Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of directly measuring bounding box localization accuracy. In this paper, we propose ‘Localization Recall Precision (LRP) Error’, a new metric which we specifically designed for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce the ‘Optimal LRP’, the minimum achievable LRP error representing the best achievable configuration of the detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the ‘best’ confidence score threshold for a class, which balances the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art (SOTA) object detectors, Optimal LRP provides richer and more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly among classes and detectors. Moreover, we present LRP results of a simple online video object detector which uses a SOTA still-image object detector and show that the class-specific optimized thresholds increase the accuracy against the common approach of using a general threshold for all classes. At https://github.com/cancam/LRP we provide the source code that can compute LRP for the PASCAL VOC and MSCOCO datasets. Our source code can easily be adapted to other datasets as well. [1807.01696v2]
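For reference, a hedged reconstruction of the LRP error consistent with the three components described above; see the paper for the authoritative definition. Here $N_{TP}$, $N_{FP}$, $N_{FN}$ are the true-positive, false-positive and false-negative counts, $\mathrm{IoU}(x_i, y_{x_i})$ is the overlap of each matched pair, and $\tau$ is the matching threshold.

```latex
% Hedged reconstruction of the LRP error (localization + FP rate + FN rate):
\mathrm{LRP}(X, Y) =
\frac{1}{N_{TP} + N_{FP} + N_{FN}}
\left(
\sum_{i=1}^{N_{TP}} \frac{1 - \mathrm{IoU}(x_i, y_{x_i})}{1 - \tau}
+ N_{FP} + N_{FN}
\right)
```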

 

MITOS-RCNN: A Novel Approach to Mitotic Figure Detection in Breast Cancer Histopathology Images using Region Based Convolutional Neural Networks

Siddhant Rao

Studies estimate that there will be 266,120 new cases of invasive breast cancer and 40,920 breast cancer induced deaths in 2018 alone. Despite the pervasiveness of this affliction, the current process to obtain an accurate breast cancer prognosis is tedious and time-consuming, requiring a trained pathologist to manually examine histopathological images in order to identify the features that characterize various cancer severity levels. We propose MITOS-RCNN: a novel region based convolutional neural network (RCNN) geared for small object detection to accurately grade one of the three factors that characterize tumor belligerence described by the Nottingham Grading System: mitotic count. Other computational approaches to mitotic figure counting and detection do not demonstrate ample recall or precision to be clinically viable. Our model outperformed all previous participants in the ICPR 2012 challenge, the AMIDA 2013 challenge and the MITOS-ATYPIA-14 challenge along with recently published works. Our model achieved an F-measure score of 0.955, a 6.11% improvement in accuracy over the most accurate of the previously proposed models. [1807.01788v1]

 

Deep Cross-modality Adaptation via Semantics Preserving Adversarial Learning for Sketch-based 3D Shape Retrieval

Jiaxin Chen, Yi Fang

Due to the large cross-modality discrepancy between 2D sketches and 3D shapes, retrieving 3D shapes by sketches is a significantly challenging task. To address this problem, we propose a novel framework to learn a discriminative deep cross-modality adaptation model in this paper. Specifically, we first separately adopt two metric networks, following two deep convolutional neural networks (CNNs), to learn modality-specific discriminative features based on an importance-aware metric learning method. Subsequently, we explicitly introduce a cross-modality transformation network to compensate for the divergence between two modalities, which can transfer features of 2D sketches to the feature space of 3D shapes. We develop an adversarial learning based method to train the transformation model, by simultaneously enhancing the holistic correlations between data distributions of two modalities, and mitigating the local semantic divergences through minimizing a cross-modality mean discrepancy term. Experimental results on the SHREC 2013 and SHREC 2014 datasets clearly show the superior retrieval performance of our proposed model, compared to the state-of-the-art approaches. [1807.01806v1]

 

TextTopicNet – Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces

Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

The immense success of deep learning based methods in computer vision heavily relies on large-scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort, and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community. In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is most likely to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches. [1807.02110v1]
