Volumetric performance capture from minimal camera viewpoints

Andrew Gilbert, Marco Volino, John Collomosse, Adrian Hilton

We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count. [1807.01950v1]

Automatic deep learning-based normalization of breast dynamic contrast-enhanced magnetic resonance images

Jun Zhang, Ashirbani Saha, Brian J. Soher, Maciej A. Mazurowski

Objective: To develop an automatic image normalization algorithm for intensity correction of images from breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) acquired by different MRI scanners with various imaging parameters, using only image information. Methods: DCE-MR images of 460 subjects with breast cancer acquired by different scanners were used in this study. Each subject had one T1-weighted pre-contrast image and three T1-weighted post-contrast images available. Our normalization algorithm operated under the assumption that the same type of tissue in different patients should be represented by the same voxel value. We used four tissue/material types as the anchors for the normalization: 1) air, 2) fat tissue, 3) dense tissue, and 4) heart. The algorithm proceeded in the following two steps: First, a state-of-the-art deep learning-based algorithm was applied to perform tissue segmentation accurately and efficiently. Then, based on the segmentation results, a subject-specific piecewise linear mapping function was applied between the anchor points to normalize the same type of tissue in different patients into the same intensity ranges. We evaluated the algorithm with 300 subjects used for training and the rest used for testing. Results: The application of our algorithm to images with different scanning parameters resulted in highly improved consistency in pixel values and extracted radiomics features. Conclusion: The proposed image normalization strategy based on tissue segmentation can perform intensity correction fully automatically, without the knowledge of the scanner parameters. Significance: We have thoroughly tested our algorithm and showed that it successfully normalizes the intensity of DCE-MR images. We made our software publicly available for others to apply in their analyses. [1807.02152v1]
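The piecewise linear mapping between anchor points described above can be sketched as follows. This is a minimal illustration, not the authors' released software: the function name and all anchor and target intensities are hypothetical, and it assumes that one representative intensity per anchor tissue has already been measured from the segmentation and sorted in increasing order.

```python
from bisect import bisect_right

def piecewise_linear_map(value, src_anchors, dst_anchors):
    """Map one intensity through the piecewise linear function that sends
    src_anchors[i] to dst_anchors[i].

    src_anchors must be strictly increasing. Values outside the anchor
    range are clamped to the first or last target value (an assumption
    made for this sketch).
    """
    if len(src_anchors) != len(dst_anchors) or len(src_anchors) < 2:
        raise ValueError("need at least two matching anchor pairs")
    if value <= src_anchors[0]:
        return float(dst_anchors[0])
    if value >= src_anchors[-1]:
        return float(dst_anchors[-1])
    # Index of the segment [src_anchors[i], src_anchors[i + 1]) containing value.
    i = bisect_right(src_anchors, value) - 1
    x0, x1 = src_anchors[i], src_anchors[i + 1]
    y0, y1 = dst_anchors[i], dst_anchors[i + 1]
    t = (value - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Hypothetical anchor intensities measured in one subject, and the common
# target intensities that every subject is mapped onto.
subject_anchors = [0.0, 120.0, 300.0, 520.0]
target_anchors = [0.0, 100.0, 200.0, 300.0]

normalized = piecewise_linear_map(210.0, subject_anchors, target_anchors)
```

Because the source anchors are recomputed per subject while the targets are shared, the same tissue type lands in the same intensity range for every subject, which is the assumption the abstract states.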

Subpixel-Precise Tracking of Rigid Objects in Real-time

Tobias Böttger, Markus Ulrich, Carsten Steger

We present a novel object tracking scheme that can track rigid objects in real time. The approach uses subpixel-precise image edges to track objects with high accuracy. It can determine the object position, scale, and rotation with subpixel precision at around 80fps. The tracker returns a reliable score for each frame and is capable of self-diagnosing a tracking failure. Furthermore, the choice of the similarity measure makes the approach inherently robust against occlusion, clutter, and nonlinear illumination changes. We evaluate the method on sequences of rigid objects from the OTB-2015 and VOT2016 datasets and discuss its performance. The evaluation shows that the tracker is more accurate than state-of-the-art real-time trackers while being equally robust. [1807.01952v1]

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Christoph Wick, Christian Reul, Frank Puppe

Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Historical prints in particular require book-specific trained OCR models to achieve usable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort of manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient (Reul et al., 2018a, Reul et al., 2018b). Calamari is a new open source OCR line recognition software that both uses state-of-the-art Deep Neural Networks (DNNs) implemented in Tensorflow and gives native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained by the so-called Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, which considerably outperforms the results of the existing software. [1807.02004v1]
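For readers unfamiliar with the metric, the Character Error Rate is commonly computed as the edit distance between the recognized line and the ground truth, divided by the ground-truth length. The sketch below follows that common definition; the abstract does not spell out the exact evaluation protocol, so treat the normalization and the function names as assumptions, and note that this is not Calamari's API.

```python
def levenshtein(ref, hyp):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalized by the ground-truth length."""
    if not ref:
        raise ValueError("ground-truth text must not be empty")
    return levenshtein(ref, hyp) / len(ref)

# One substituted character in a ten-character ground-truth line.
cer = character_error_rate("historical", "historlcal")
```

On this definition, a CER of 0.11% corresponds to roughly one character error per 900 ground-truth characters.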

Model-free Consensus Maximization for Non-Rigid Shapes

Thomas Probst, Ajad Chhatkuli, Danda Pani Paudel, Luc Van Gool

Many computer vision methods rely on consensus maximization to relate measurements containing outliers with a reliable transformation model. In the context of matching rigid shapes, this is typically done using Random Sampling and Consensus (RANSAC) to estimate an analytical model that agrees with the largest number of measurements, which constitute the inliers. However, such models are either not available or too complex for non-rigid shapes. In this paper, we formulate the model-free consensus maximization problem as an Integer Program in a graph using ‘rules’ on measurements. We then provide a method to solve such a formulation optimally using the Branch and Bound (BnB) paradigm. In the context of non-rigid shapes, we apply the method to filter out outlier 3D correspondences and achieve performance superior to the state-of-the-art. Our method works with outlier ratios as high as 80%. We further derive a similar formulation for 3D template to image correspondences. Our approach achieves similar or better performance compared to the state-of-the-art. [1807.01963v1]

Consistent Generative Query Networks

Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, Murray Shanahan

Stochastic video prediction is usually framed as an extrapolation problem where the goal is to sample a sequence of consecutive future image frames conditioned on a sequence of observed past frames. For the most part, algorithms for this task generate future video frames sequentially in an autoregressive fashion, which is slow and requires the input and output to be consecutive. We introduce a model that overcomes these drawbacks — it learns to generate a global latent representation from an arbitrary set of frames within a video. This representation can then be used to simultaneously and efficiently sample any number of temporally consistent frames at arbitrary time-points in the video. We apply our model to synthetic video prediction tasks and achieve results that are comparable to state-of-the-art video prediction models. In addition, we demonstrate the flexibility of our model by applying it to 3D scene reconstruction where we condition on location instead of time. To the best of our knowledge, our model is the first to provide flexible and coherent prediction on stochastic video datasets, as well as consistent 3D scene samples. Please check the project website https://bit.ly/2jX7Vyu to view scene reconstructions and videos produced by our model. [1807.02033v1]

Open Logo Detection Challenge

Hang Su, Xiatian Zhu, Shaogang Gong

Existing logo detection benchmarks consider artificial deployment scenarios by assuming that large training data with fine-grained bounding box annotations for each class are available for model training. Such assumptions are often invalid in realistic logo detection scenarios where new logo classes arrive progressively and must be detected with little or no budget for exhaustively labelling fine-grained training data for every new class. Existing benchmarks are thus unable to evaluate the true performance of a logo detection method in realistic and open deployments. In this work, we introduce a more realistic and challenging logo detection setting, called Open Logo Detection. Specifically, this new setting assumes fine-grained labelling only on a small proportion of logo classes whilst the remaining classes have no labelled training data to simulate the open deployment. We further create an open logo detection benchmark, called OpenLogo, to promote the investigation of this new challenge. OpenLogo contains 27,189 images from 309 logo classes, built by aggregating/refining 7 existing datasets and establishing an open logo detection evaluation protocol. To address this challenge, we propose a Context Adversarial Learning (CAL) approach to synthesising training data with coherent logo instance appearance against diverse background context for enabling more effective optimisation of contemporary deep learning detection models. Experiments show the performance advantage of CAL over existing state-of-the-art alternative methods on the more realistic and challenging OpenLogo benchmark. [1807.01964v1]

Reflection Analysis for Face Morphing Attack Detection

Clemens Seibold, Anna Hilsmann, Peter Eisert

A facial morph is a synthetically created image of a face that looks similar to two different individuals and can even trick biometric facial recognition systems into recognizing both individuals. This attack is known as a face morphing attack. The process of creating such a facial morph is well documented, and many tutorials and software tools for creating them are freely available. Therefore, it is mandatory to be able to detect this kind of fraud to ensure the integrity of the face as a reliable biometric feature. In this work, we study the effects of face morphing on the physical correctness of the illumination. We estimate the direction to the light sources based on specular highlights in the eyes and use them to generate a synthetic map for highlights on the skin. This map is compared with the highlights in the image that is suspected to be a fraud. Morphing faces with different geometries, a bad alignment of the source images, or using images with different illuminations can lead to inconsistencies in reflections that indicate the existence of a morphing attack. [1807.02030v1]

Learning a Representation Map for Robot Navigation using Deep Variational Autoencoder

Kaixin Hu, Peter O’Connor

The aim of this work is to use a Variational Autoencoder (VAE) to learn a representation of an indoor environment that can be used for robot navigation. We use images extracted from a video, in which a camera takes a tour around a house, for training the VAE model with a 4-dimensional latent space. After the model is trained, each real frame has a corresponding representation point on the manifold in the latent space, and each representation point has a corresponding reconstructed image. For the navigation problem, we map the starting image and destination image to the latent space, then optimize a path on the learned manifold connecting the two points, and finally map the path back through the decoder to a sequence of images. The ideal sequence of images should correspond to a route that is spatially continuous – i.e. neighbor images in the route should correspond to neighbor locations in physical space. Such a route could be used for navigation with computer vision techniques, i.e. a robot could follow the image sequence from starting location to destination in the environment step by step. We implement this algorithm, but find in our experimental results that the resulting route is not satisfactory. The route consists of several discontinuous image frames along the ideal routes, so that the route could not be followed by a robot with computer vision techniques in practice. In our evaluation, we propose two reasons for our failure to automatically find continuous routes: (1) the VAE tends to capture global structures but discards the details; (2) the Euclidean similarity metric used for measuring continuity between house images is sub-optimal. For further work, we propose trying other generative models such as VAE-GANs, which may be better at reconstructing the details, to learn the representation map, and adjusting the similarity metric in the path selection algorithm. [1807.02401v1]

Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving

Peiliang Li, Tong Qin, Shaojie Shen

We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box using end-to-end approaches, we propose to use easy-to-label 2D detection and discrete viewpoint classification together with a light-weight semantic inference method to obtain rough 3D object measurements. Based on object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach to fuse temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with state-of-the-art solutions. [1807.02062v1]

Beef Cattle Instance Segmentation Using Fully Convolutional Neural Network

Aram Ter-Sarkisov, Robert Ross, John Kelleher, Bernadette Earley, Michael Keane

We present an instance segmentation algorithm trained and applied to a CCTV recording of beef cattle during a winter finishing period. A fully convolutional network was transformed into an instance segmentation network that learns to label each instance of an animal separately. We introduce a conceptually simple framework that the network uses to output a single prediction for every animal. These results are a contribution towards behaviour analysis in winter finishing beef cattle for early detection of animal welfare-related problems. [1807.01972v1]

Perspective-Aware CNN For Crowd Counting

Miaojing Shi, Zhaohui Yang, Chao Xu, Qijun Chen

Crowd counting is the task of estimating pedestrian numbers in crowd images. Modern crowd counting methods employ deep neural networks to estimate crowd counts via crowd density regressions. A major challenge of this task lies in the drastic changes of scales and perspectives in images. Representative approaches usually utilize different (large) sized filters and conduct patch-based estimations to tackle it, which is however computationally expensive. In this paper, we propose a perspective-aware convolutional neural network (PACNN) with a single backbone of small filters (e.g. 3×3). It directly predicts a perspective map in the network and encodes it as a perspective-aware weighting layer to adaptively combine the density outputs from multi-scale feature maps. The weights are learned at every pixel of the map such that the final combination is robust to perspective changes and pedestrian size variations. We conduct extensive experiments on the ShanghaiTech, WorldExpo’10 and UCF_CC_50 datasets, and demonstrate that PACNN achieves state-of-the-art results and runs as fast as the fastest. [1807.01989v1]

MAT-CNN-SOPC: Motionless Analysis of Traffic Using Convolutional Neural Networks on System-On-a-Programmable-Chip

Somdip Dey, Grigorios Kalliatakis, Sangeet Saha, Amit Kumar Singh, Shoaib Ehsan, Klaus McDonald-Maier

Intelligent Transportation Systems (ITS) have become an important pillar of the modern “smart city” framework, which demands intelligent involvement of machines. Traffic load recognition is an important and challenging issue for such systems. Recently, Convolutional Neural Network (CNN) models have drawn a considerable amount of interest in many areas, such as weather classification and human rights violation detection through images, due to their accurate prediction capabilities. This work tackles the real-life traffic load recognition problem on a System-On-a-Programmable-Chip (SOPC) platform with an approach we call MAT-CNN-SOPC, which uses an intelligent re-training mechanism of the CNN with known environments. The proposed methodology enhances the efficacy of the approach by 2.44x in comparison to the state of the art, as shown through experimental analysis. We also introduce a mathematical equation that quantifies the suitability of one CNN model over another for a particular application-based implementation. [1807.02098v1]

Detecting Visual Relationships Using Box Attention

Alexander Kolesnikov, Christoph H. Lampert, Vittorio Ferrari

In this paper we propose a new model for detecting visual relationships. Our main technical novelty is a Box Attention mechanism that allows modelling pairwise interactions between objects in visual scenes using standard object detection pipelines. The resulting model is conceptually clean, expressive and relies on well-justified training and prediction procedures. Moreover, unlike previously proposed approaches, our model does not introduce any additional complex components or hyperparameters on top of those already required by the underlying detection model. We conduct an experimental evaluation on two challenging datasets, V-COCO and Visual Relationships, demonstrating strong quantitative and qualitative results. [1807.02136v1]

A Gauss-Newton Approach to Real-Time Monocular Multiple Object Tracking

Henning Tjaden, Ulrich Schwanecke, Elmar Schömer, Daniel Cremers

We propose an algorithm for real-time 6DOF pose tracking of rigid 3D objects using a monocular RGB camera. The key idea is to derive a region-based cost function using temporally consistent local color histograms. While such region-based cost functions are commonly optimized using first-order gradient descent techniques, we systematically derive a Gauss-Newton optimization scheme which gives rise to drastically faster convergence and highly accurate and robust tracking performance. We furthermore propose a novel complex dataset dedicated to the task of monocular object pose tracking and make it publicly available to the community. To our knowledge, it is the first to address the common and important scenario in which both the camera and the objects are moving simultaneously in cluttered scenes. In numerous experiments – including our own proposed data set – we demonstrate that the proposed Gauss-Newton approach outperforms existing approaches, in particular in the presence of cluttered backgrounds, heterogeneous objects and partial occlusions. [1807.02087v1]

3D Human Action Recognition with Siamese-LSTM Based Deep Metric Learning

Seyma Yucer, Yusuf Sinan Akgul

This paper proposes a new 3D Human Action Recognition system as a two-phase system: (1) a Deep Metric Learning Module which learns a similarity metric between two 3D joint sequences using Siamese-LSTM networks; (2) a Multiclass Classification Module that uses the output of the first module to produce the final recognition output. This model has several advantages: the first module is trained with a larger set of data because it uses many combinations of sequence pairs. Our deep metric learning module can also be trained independently of the datasets, which makes our system modular and generalizable. We tested the proposed system on standard and newly introduced datasets, and the initial results are promising. We will continue developing this system by adding more sophisticated LSTM blocks and by cross-training between different datasets. [1807.02131v1]

Detecting Tiny Moving Vehicles in Satellite Videos

Wei Ao, Yanwei Fu, Feng Xu

In recent years, satellite videos have been captured by moving satellite platforms. In contrast to consumer, movie, and common surveillance videos, satellite video can record a snapshot of a city-scale scene. In the broad field of view of satellite videos, each moving target is very tiny and usually composed of only several pixels per frame. Even worse, noise signals also exist in the video frames, since the background of each frame undergoes subpixel-level, uneven motion due to the motion of the satellite. We argue that this is a new type of computer vision task since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that can identify the small moving vehicles in satellite videos. In particular, we offer a novel detection algorithm based on local noise modeling. We differentiate the potential vehicle targets from noise patterns by an exponential probability distribution. Subsequently, a multi-morphological-cue based discrimination strategy is designed to further distinguish correct vehicle targets from the few remaining noise patterns. Another significant contribution is to introduce a series of evaluation protocols to measure the performance of tiny moving vehicle detection systematically. We annotate a satellite video manually and use it to test our algorithms under different evaluation criteria. The proposed algorithm is also compared with state-of-the-art baselines, and the comparison demonstrates the advantages of our framework over the benchmarks. [1807.01864v1]

Real-Time Subpixel Fast Bilateral Stereo

Rui Fan, Yanan Liu, Mohammud Junaid Bocus, Ming Liu

Stereo vision techniques have been widely used in robotic systems to acquire 3-D information. In recent years, many researchers have applied bilateral filtering in stereo vision to adaptively aggregate the matching costs. This has greatly improved the accuracy of the estimated disparity maps. However, filtering the whole cost volume is very time consuming, and researchers therefore have to resort to powerful hardware to achieve real-time performance. This paper presents the implementation of fast bilateral stereo on a state-of-the-art GPU. By exploiting the parallel computing architecture of the GPU, the fast bilateral stereo performs in real time when processing the Middlebury stereo datasets. [1807.02044v1]

Detection and Analysis of Content Creator Collaborations in YouTube Videos using Face- and Speaker-Recognition

Moritz Lode, Michael Örtl, Christian Koch, Amr Rizk, Ralf Steinmetz

This work discusses and implements the application of speaker recognition for the detection of collaborations in YouTube videos. CATANA, an existing framework for detection and analysis of YouTube collaborations, utilizes face recognition for the detection of collaborators, which naturally performs poorly on video content in which no faces appear. This work proposes an extension of CATANA using active speaker detection and speaker recognition to improve the detection accuracy. [1807.02020v1]

Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders

Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, Carsten Steger

Convolutional autoencoders have emerged as popular models for unsupervised defect segmentation on image data. Most commonly, this task is performed by thresholding a pixel-wise reconstruction error based on an $\ell^p$ distance. However, this procedure generally leads to high novelty scores whenever the reconstruction encompasses slight localization inaccuracies around edges. We show that this problem prevents these approaches from being applied to complex real-world scenarios and that it cannot be easily avoided by employing more elaborate architectures. Instead, we propose to use a perceptual loss function based on structural similarity. Our approach achieves state-of-the-art performance on a real-world dataset of nanofibrous materials, while being trained end-to-end without requiring additional priors such as pretrained networks or handcrafted features. [1807.02011v1]
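To make the two kinds of reconstruction score concrete, the sketch below computes a mean squared error (an $\ell^2$-type score) and a structural similarity (SSIM) score for a pair of small grayscale patches. It is a simplified illustration rather than the authors' implementation: SSIM is evaluated once over the whole patch instead of in a sliding window, the stabilizing constants 0.01 and 0.03 are the commonly used defaults, and the patch values are made up.

```python
def mean(values):
    return sum(values) / len(values)

def mse(x, y):
    """Mean squared per-pixel difference between two flattened patches."""
    return mean([(a - b) ** 2 for a, b in zip(x, y)])

def ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM between two flattened patches of equal size."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = mean(x), mean(y)
    var_x = mean([(a - mx) ** 2 for a in x])
    var_y = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    luminance_term = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    structure_term = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance_term * structure_term

# Hypothetical input patch and its autoencoder reconstruction (flattened).
patch = [0.0, 0.0, 1.0, 1.0]
reconstruction = [0.0, 1.0, 1.0, 1.0]

pixel_score = mse(patch, reconstruction)               # per-pixel error
structural_score = 1.0 - ssim(patch, reconstruction)   # 0 means identical
```

In a segmentation setting, either score would be computed per local window and thresholded to mark defective regions; which threshold to use is left open here.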

Combining Background Subtraction Algorithms with Convolutional Neural Network

Dongdong Zeng, Ming Zhu, Arjan Kuijper

Accurate and fast extraction of foreground objects is a key prerequisite for a wide range of computer vision applications such as object tracking and recognition. Thus, an enormous number of background subtraction methods for foreground object detection have been proposed in recent decades. However, it is still regarded as a tough problem due to a variety of challenges such as illumination variations, camera jitter, dynamic backgrounds, shadows, and so on. Currently, there is no single method that can handle all the challenges in a robust way. In this letter, we try to solve this problem from a new perspective by combining different state-of-the-art background subtraction algorithms to create a more robust and more advanced foreground detection algorithm. More concretely, an encoder-decoder fully convolutional neural network architecture is trained to automatically learn how to leverage the characteristics of different algorithms to fuse the results produced by different background subtraction algorithms and output a more precise result. Comprehensive experiments on the CDnet 2014 dataset demonstrate that the proposed method outperforms all the considered single background subtraction algorithms. We also show that our solution is more efficient than other combination strategies. [1807.02080v1]

Road surface 3D reconstruction based on dense subpixel disparity map estimation

Rui Fan, Xiao Ai, Naim Dahnoun

Various 3D reconstruction methods have enabled civil engineers to detect damage on a road surface. To achieve the millimetre accuracy required for road condition assessment, a disparity map with subpixel resolution needs to be used. However, none of the existing stereo matching algorithms are especially suitable for the reconstruction of the road surface. Hence in this paper, we propose a novel dense subpixel disparity estimation algorithm with high computational efficiency and robustness. This is achieved by first transforming the perspective view of the target frame into the reference view, which not only increases the accuracy of the block matching for the road surface but also improves the processing speed. The disparities are then estimated iteratively using our previously published algorithm where the search range is propagated from three estimated neighbouring disparities. Since the search range is obtained from the previous iteration, errors may occur when the propagated search range is not sufficient. Therefore, a correlation maxima verification is performed to rectify this issue, and the subpixel resolution is achieved by conducting a parabola interpolation enhancement. Furthermore, a novel disparity global refinement approach developed from the Markov Random Fields and Fast Bilateral Stereo is introduced to further improve the accuracy of the estimated disparity map, where disparities are updated iteratively by minimising the energy function that is related to their interpolated correlation polynomials. The algorithm is implemented in C with near real-time performance. The experimental results illustrate that the absolute error of the reconstruction varies from 0.1 mm to 3 mm. [1807.01874v1]
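The parabola interpolation step mentioned above is a standard way to refine an integer disparity: fit a parabola through the correlation score at the best integer disparity and its two neighbours, then take the vertex. The sketch below shows that generic technique under stated assumptions; the function name and sample scores are hypothetical, and the paper's exact formulation may differ.

```python
def subpixel_disparity(d, c_prev, c_best, c_next):
    """Refine integer disparity d using the correlation scores at d - 1,
    d and d + 1, where c_best is assumed to be the local maximum.

    The vertex of the parabola through the three points lies at
    d + (c_prev - c_next) / (2 * (c_prev - 2 * c_best + c_next)).
    """
    denominator = 2.0 * (c_prev - 2.0 * c_best + c_next)
    if denominator == 0.0:
        # The three scores are collinear, so no refinement is possible.
        return float(d)
    return d + (c_prev - c_next) / denominator

# Scores sampled from the parabola 1 - (x - 0.25)^2 at x = -1, 0, 1,
# so the true peak lies 0.25 pixels to the right of the integer disparity.
refined = subpixel_disparity(17, -0.5625, 0.9375, 0.4375)
```

When `c_best` really is the maximum of the three scores, the correction stays within half a pixel of the integer disparity.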

PortraitGAN for Flexible Portrait Manipulation

Jiali Duan, Xiaoyuan Guo, Yuhang Song, Chao Yang, C. -C. Jay Kuo

Previous methods have dealt with discrete manipulation of facial attributes, such as switching between canonical expressions (smiling, sad, angry, surprised, etc.); they are not scalable and operate in a single modality. In this paper, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. Specifically, we adapt cycle-consistency into the conditional setting by leveraging additional facial landmark information. This has two effects: first, cycle mapping induces bidirectional manipulation and identity preservation; second, paired samples from different modalities can thus be utilized. To ensure high-quality synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. Quantitative and qualitative experiments show the effectiveness of our framework in performing flexible and multi-modality portrait manipulation with photo-realistic effects. [1807.01826v1]

Face Recognition Using Map Discriminant on YCbCr Color Space

I Gede Pasek Suta Wijaya

This paper presents face recognition using a maximum a posteriori (MAP) discriminant on the YCbCr color space. The YCbCr color space is used so that the skin information of the face image contributes to the recognition process. The proposed method is employed to improve the recognition rate and equal error rate (EER) of gray scale based face recognition. The face feature vector, which consists of a small set of dominant frequency elements extracted by non-blocking DCT, serves as a dimensionality reduction of the raw face images. The matching process between the query face features and the trained face features is performed using the MAP discriminant. Experimental results on data from four face databases containing 2268 images with 196 classes show that face recognition on the YCbCr color space provides a better recognition rate and a lower EER than gray scale based face recognition, improving the first-rank result of the gray scale based method by about 4%. However, it requires three times more computation time than the gray scale based method. [1807.02135v1]
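As background, one widely used RGB to YCbCr conversion is the full-range ITU-R BT.601 form used by JPEG/JFIF, sketched below. The abstract does not say which YCbCr variant the author uses, so this particular choice of coefficients and offsets is an assumption. The Y channel alone is the gray scale image, which is why processing all three channels costs roughly three times as much as the gray scale baseline.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601, JFIF form).

    Returns unrounded floats; Y is luma, Cb and Cr are the blue-difference
    and red-difference chroma components centred on 128.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# A neutral gray pixel carries no chroma: Cb and Cr sit at the 128 midpoint.
gray_pixel = rgb_to_ycbcr(100, 100, 100)
```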

Spatiotemporal KSVD Dictionary Learning for Online Multi-target Tracking

Huynh Manh, Gita Alaghband

In this paper, we present a new spatial discriminative KSVD dictionary algorithm (STKSVD) for learning target appearance in online multi-target tracking. Unlike in other classification/recognition tasks (e.g. face or image recognition), learning a target’s appearance in online multi-target tracking is impacted by factors such as posture/articulation changes, partial occlusion by the background scene or other targets, background changes (the human detection bounding box covers human parts and part of the scene), etc. However, we observe that these variations occur gradually relative to spatial and temporal dynamics. We characterize the spatial and temporal information between a target’s samples through a new STKSVD appearance learning algorithm to better discriminate sparse codes and linear classifier parameters and to minimize reconstruction error in a single optimization system. Our appearance learning algorithm and tracking framework employ two different methods of calculating the appearance similarity score in each stage of a two-stage association: a linear classifier in the first stage, and minimum residual errors in the second stage. Results on the 2DMOT2015 dataset, using its public Aggregated Channel Features (ACF) human detections for all comparisons, show that our method outperforms the existing related learning methods. [1807.02143v1]

A Single Shot Text Detector with Scale-adaptive Anchors

Qi Yuan, Bingwang Zhang, Haojie Li, Zhihui Wang, Zhongxuan Luo

Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large number of anchors with different scales to discover texts in scene images, thus leading to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a small number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to discrete scales used in previous methods, the learned continuous scales are more reliable, especially for small text detection. Additionally, we propose Anchor convolution to better exploit necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only 0.28 seconds per image, while outperforming most state-of-the-art methods in accuracy. [1807.01884v1]

Learning Personalized Representation for Inverse Problems in Medical Imaging Using Deep Neural Network

Kuang Gong, Kyungsang Kim, Jianan Cui, Ning Guo, Ciprian Catana, Jinyi Qi, Quanzheng Li

Recently, deep neural networks have been widely and successfully applied in computer vision tasks and have attracted growing interest in medical imaging. One barrier to the application of deep neural networks to medical imaging is the need for large amounts of prior training pairs, which is not always feasible in clinical practice. In this work, we propose a personalized representation learning framework where no prior training pairs are needed, but only the patient’s own prior images. The representation is expressed using a deep neural network with the patient’s prior images as network input. We then apply this novel image representation to inverse problems in medical imaging, in which the original inverse problem is formulated as a constrained optimization problem and solved using the alternating direction method of multipliers (ADMM) algorithm. Anatomically guided brain positron emission tomography (PET) image reconstruction and image denoising are employed as examples to demonstrate the effectiveness of the proposed framework. Quantification results based on simulation and real datasets show that the proposed personalized representation framework outperforms other widely adopted methods. [1807.01759v1]

Unbiased Image Style Transfer

Hyun-Chul Choi, Minseong Kim

Recent fast image style transfer methods use feed-forward neural networks to generate an output image of the desired style strength from an input pair of a content image and a target style image. In the existing methods, an image of intermediate style between the content and the target style is obtained by decoding a linearly interpolated feature in the encoded feature space. However, there has been no work so far analyzing the effectiveness of this kind of style strength interpolation. In this paper, we provide this missing in-depth analysis of style interpolation and propose a method that is more effective in controlling style strength. We interpret the training task of a style transfer network as regression learning between the control parameter and the output style strength. Under this interpretation, the existing methods are biased because training is performed with one-sided data of full style strength (alpha = 1.0). Thus, this biased learning does not guarantee the generation of the desired intermediate style corresponding to a style control parameter between 0.0 and 1.0. To solve this problem of the biased network, we propose an unbiased learning technique which uses unbiased training data and a corresponding unbiased loss for alpha = 0.0 to make the feed-forward networks generate a zero-style image, i.e., the content image, when alpha = 0.0. Our experimental results verify that our unbiased learning method achieves the reconstruction of a content image with zero style strength, better regression specification between the style control parameter and the output style, and more stable style transfer that is insensitive to the weight of the style loss, without additional complexity in the image generation process. [1807.01424v2]
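
The linear interpolation in encoded feature space described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `interpolate_features` and the toy feature vectors are assumptions, and the encoder and decoder networks are omitted entirely.

```python
def interpolate_features(content_feat, stylized_feat, alpha):
    """Blend two encoded feature vectors linearly.

    alpha = 0.0 returns the content feature unchanged (zero style strength);
    alpha = 1.0 returns the fully stylized feature.
    The name and signature are hypothetical, chosen for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * c + alpha * s
            for c, s in zip(content_feat, stylized_feat)]


# Toy vectors standing in for encoder outputs (made-up values).
content_feat = [1.0, 2.0, 3.0]
stylized_feat = [3.0, 0.0, -1.0]

half = interpolate_features(content_feat, stylized_feat, 0.5)
print(half)
```

Interpolation alone only guarantees that the feature at alpha = 0.0 equals the content feature. Whether the decoded image then matches the content image depends on how the decoder was trained, which is the bias the abstract addresses.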

Localization Recall Precision (LRP): A New Performance Metric for Object Detection

Kemal Oksuz, Baris Can Cam, Emre Akbas, Sinan Kalkan

Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of a direct measure of bounding box localization accuracy. In this paper, we propose ‘Localization Recall Precision (LRP) Error’, a new metric which we specifically designed for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce the ‘Optimal LRP’, the minimum achievable LRP error representing the best achievable configuration of the detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the ‘best’ confidence score threshold for a class, which balances the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art (SOTA) object detectors, Optimal LRP provides richer and more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly among classes and detectors. Moreover, we present LRP results of a simple online video object detector which uses a SOTA still image object detector and show that the class-specific optimized thresholds increase the accuracy compared with the common approach of using a general threshold for all classes. At https://github.com/cancam/LRP we provide the source code that can compute LRP for the PASCAL VOC and MSCOCO datasets. Our source code can easily be adapted to other datasets as well. [1807.01696v2]
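
Shortcoming (i) can be illustrated with a small sketch that treats AP as the plain trapezoidal area under an RP curve. The helper `area_under_rp` and both toy curves are assumptions made for illustration. Benchmark toolkits typically use interpolated variants of AP, and this is not the LRP code from the linked repository.

```python
def area_under_rp(recalls, precisions):
    """Trapezoidal area under a recall-precision curve.

    Points must be ordered by non-decreasing recall. This is a simplified
    stand-in for AP, not a benchmark-exact implementation.
    """
    area = 0.0
    for i in range(len(recalls) - 1):
        width = recalls[i + 1] - recalls[i]
        area += width * (precisions[i] + precisions[i + 1]) / 2.0
    return area


# Curve A: perfect precision up to recall 0.5, then nothing is recovered.
curve_a = ([0.0, 0.5, 0.5, 1.0], [1.0, 1.0, 0.0, 0.0])
# Curve B: constant precision 0.5 over the whole recall range.
curve_b = ([0.0, 1.0], [0.5, 0.5])

ap_a = area_under_rp(*curve_a)
ap_b = area_under_rp(*curve_b)
print(ap_a, ap_b)  # both areas are 0.5
```

The two detectors behave very differently, yet the area alone cannot tell them apart, which is the kind of ambiguity the abstract attributes to AP.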

MITOS-RCNN: A Novel Approach to Mitotic Figure Detection in Breast Cancer Histopathology Images using Region Based Convolutional Neural Networks

Siddhant Rao

Studies estimate that there will be 266,120 new cases of invasive breast cancer and 40,920 breast cancer-induced deaths in 2018 alone. Despite the pervasiveness of this affliction, the current process to obtain an accurate breast cancer prognosis is tedious and time-consuming, requiring a trained pathologist to manually examine histopathological images in order to identify the features that characterize various cancer severity levels. We propose MITOS-RCNN: a novel region based convolutional neural network (RCNN) geared for small object detection to accurately grade one of the three factors that characterize tumor belligerence described by the Nottingham Grading System: mitotic count. Other computational approaches to mitotic figure counting and detection do not demonstrate ample recall or precision to be clinically viable. Our models outperformed all previous participants in the ICPR 2012 challenge, the AMIDA 2013 challenge and the MITOS-ATYPIA-14 challenge along with recently published works. Our model achieved an F-measure score of 0.955, a 6.11% improvement in accuracy over the most accurate of the previously proposed models. [1807.01788v1]
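
For readers unfamiliar with the reported metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from detection counts. The function name and the counts are hypothetical and are not taken from the paper's experiments.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from detection counts.

    tp: correctly detected mitotic figures
    fp: detections that match no ground-truth figure
    fn: ground-truth figures that were missed
    """
    if tp == 0:
        return 0.0  # no correct detections: define the score as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to show the calculation.
score = f_measure(tp=90, fp=10, fn=30)
print(round(score, 3))  # precision 0.9, recall 0.75, F-measure about 0.818
```

Because the harmonic mean is pulled toward the smaller of the two values, a detector cannot reach a high F-measure by trading away recall for precision or the reverse.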

Deep Cross-modality Adaptation via Semantics Preserving Adversarial Learning for Sketch-based 3D Shape Retrieval

Jiaxin Chen, Yi Fang

Due to the large cross-modality discrepancy between 2D sketches and 3D shapes, retrieving 3D shapes by sketches is a significantly challenging task. To address this problem, we propose a novel framework to learn a discriminative deep cross-modality adaptation model in this paper. Specifically, we first separately adopt two metric networks, following two deep convolutional neural networks (CNNs), to learn modality-specific discriminative features based on an importance-aware metric learning method. Subsequently, we explicitly introduce a cross-modality transformation network to compensate for the divergence between two modalities, which can transfer features of 2D sketches to the feature space of 3D shapes. We develop an adversarial learning based method to train the transformation model, by simultaneously enhancing the holistic correlations between data distributions of two modalities, and mitigating the local semantic divergences through minimizing a cross-modality mean discrepancy term. Experimental results on the SHREC 2013 and SHREC 2014 datasets clearly show the superior retrieval performance of our proposed model, compared to the state-of-the-art approaches. [1807.01806v1]

TextTopicNet – Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces

Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort, and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community. In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more likely to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches. [1807.02110v1]

Ananya Kumar, SM Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, Murray Shanahan

