SceneEDNet: A Deep Learning Approach for Scene Flow Estimation + Recovering Affine Features from Orientation- and Scale-Invariant Ones + CIRL: Controllable Imitative Reinforcement Learning for Vision-based Self-driving

Convolutional neural network based automatic plaque characterization from intracoronary optical coherence tomography images

Shenghua He, Jie Zheng, Akiko Maehara, Gary Mintz, Dalin Tang, Mark Anastasio, Hua Li

Optical coherence tomography (OCT) can provide high-resolution cross-sectional images for analyzing superficial plaques in coronary arteries. Commonly, plaque characterization using intra-coronary OCT images is performed manually by expert observers. This manual analysis is time-consuming, and its accuracy heavily relies on the experience of human observers. Traditional machine learning methods, such as least squares support vector machines and random forests, have recently been employed to automatically characterize plaque regions in OCT images. These traditional methods commonly involve several processing steps, including feature extraction, informative feature selection, and final pixel classification, so the final classification accuracy can be jeopardized by error or inaccuracy within each of these steps. In this study, we propose a convolutional neural network (CNN) based method to automatically characterize plaques in OCT images. Unlike traditional methods, our method uses the image as a direct input and performs classification as a single-step process. Experiments on 269 OCT images showed that the average prediction accuracy of the CNN-based method was 0.866, indicating great promise for clinical translation. [1807.03613v1]
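
As a rough illustration of the single-step idea (image in, plaque class scores out), here is a minimal PyTorch sketch; the layer sizes, input resolution, and four-class split are illustrative assumptions, since the abstract does not specify the architecture:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # hypothetical: e.g. fibrous, lipid, calcified, background

class PlaqueCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 16 * 16, NUM_CLASSES)  # for 64x64 inputs

    def forward(self, x):
        h = self.features(x)                  # (B, 64, 16, 16)
        return self.classifier(h.flatten(1))  # class scores in one step

logits = PlaqueCNN()(torch.randn(8, 1, 64, 64))  # grayscale OCT patches in
```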

 

Multiresolution Tree Networks for 3D Point Cloud Processing

Matheus Gadelha, Rui Wang, Subhransu Maji

We present multiresolution tree-structured networks to process point clouds for 3D shape understanding and generation tasks. Our network represents a 3D shape as a set of locality-preserving 1D ordered lists of points at multiple resolutions. This allows efficient feed-forward processing through 1D convolutions and coarse-to-fine analysis through a multi-grid architecture, and it leads to faster convergence and a smaller memory footprint during training. The proposed tree-structured encoders can be used to classify shapes and outperform existing point-based architectures on shape classification benchmarks, while the tree-structured decoders can be used to generate point clouds directly and outperform existing approaches on image-to-shape inference tasks learned using the ShapeNet dataset. Our model also allows unsupervised learning of point-cloud-based shapes by using a variational autoencoder, leading to higher-quality generated shapes. [1807.03520v1]
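
A minimal sketch of the core representational idea: a locality-preserving 1D ordering of points is processed with strided 1D convolutions that coarsen the list level by level. The tree construction, multi-grid connections, and layer widths here are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                 # input: (B, 3, N) ordered xyz points
    nn.Conv1d(3, 64, kernel_size=2, stride=2), nn.ReLU(),    # coarsen by 2
    nn.Conv1d(64, 128, kernel_size=2, stride=2), nn.ReLU(),  # coarsen again
    nn.Conv1d(128, 256, kernel_size=2, stride=2), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),             # global shape descriptor
)

points = torch.randn(4, 3, 1024)         # batch of 1024-point clouds
feature = encoder(points).squeeze(-1)    # (4, 256) feature for classification
```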

 

Essential Tensor Learning for Multi-view Spectral Clustering

Jianlong Wu, Zhouchen Lin, Hongbin Zha

Multi-view clustering has attracted much attention recently; it aims to take advantage of multi-view information to improve clustering performance. However, most recent work focuses on self-representation-based subspace clustering, which has high computational complexity. In this paper, we focus on the Markov chain based spectral clustering method and propose a novel essential tensor learning method to explore the high-order correlations of the multi-view representation. We first construct a tensor from the multi-view transition probability matrices of the Markov chain. By incorporating ideas from robust principal component analysis, a tensor nuclear norm based on the tensor singular value decomposition (t-SVD) is imposed to preserve the low-rank property of the essential tensor, which can well capture the principal information from multiple views. We also employ the tensor rotation operator for this task to better investigate the relationship among views and to reduce computational complexity. The proposed method can be efficiently optimized by the alternating direction method of multipliers (ADMM). Extensive experiments on six real-world datasets corresponding to five different applications show that our method achieves superior performance over other state-of-the-art methods. [1807.03602v1]
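
The workhorse of an ADMM solver for a t-SVD-based tensor nuclear norm is the tensor singular value thresholding step, which the following plain-numpy sketch captures (the threshold tau and tensor sizes are hypothetical): FFT along the third mode, shrink the singular values of each frontal slice, invert the FFT:

```python
import numpy as np

def tsvd_shrink(T, tau):
    """Soft-threshold the t-SVD singular values of a 3-way tensor T."""
    F = np.fft.fft(T, axis=2)                 # frontal slices in Fourier domain
    out = np.zeros_like(F)
    for k in range(F.shape[2]):
        U, s, Vh = np.linalg.svd(F[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)          # shrink singular values
        out[:, :, k] = (U * s) @ Vh
    return np.real(np.fft.ifft(out, axis=2))  # back to the original domain

X = tsvd_shrink(np.random.rand(50, 50, 6), tau=0.1)  # e.g. 6 views stacked
```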

 

Developing Brain Atlas through Deep Learning

Asim Iqbal, Romesa Khan, Theofanis Karayannis

To uncover the organizational principles governing the human brain, neuroscientists need to develop high-throughput methods that can explore the structure and function of distinct brain regions using animal models. The first step towards this goal is to accurately register the regions of interest in a mouse brain against a standard reference atlas with minimal human supervision. The second step is to scale this approach to different animal ages, so as to also allow insights into normal and pathological brain development and aging. We introduce here a fully automated convolutional neural network-based method (SeBRe) for registration through Segmenting Brain Regions of interest in mice at different ages. We demonstrate the validity of our method on mouse brains at different post-natal (P) developmental time points, across a range of neuronal markers. Our method outperforms existing brain registration methods and achieves the lowest mean squared error (MSE) score on a mouse brain dataset. We propose that our deep learning-based registration method can (i) accelerate brain-wide exploration of region-specific changes in brain development and (ii) replace the existing complex brain registration methodology, by simply segmenting brain regions of interest for high-throughput brain-wide analysis. [1807.03440v1]

 

Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration

De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, Juan Carlos Niebles

Our goal is for a robot to execute a previously unseen task based on a single video demonstration of the task. The success of our approach relies on the principle of transferring knowledge from seen tasks to unseen ones with similar semantics. More importantly, we hypothesize that to successfully execute a complex task from a single video demonstration, it is necessary to explicitly incorporate compositionality into the model. To test our hypothesis, we propose Neural Task Graph (NTG) Networks, which use task graphs as the intermediate representation to modularize the representations of both the video demonstration and the derived policy. We show this formulation achieves strong inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. We further show that the same principle is applicable to real-world videos. We show that NTG can improve the data efficiency of few-shot activity understanding on the Breakfast Dataset. [1807.03480v1]

 

Towards Head Motion Compensation Using Multi-Scale Convolutional Neural Networks

Omer Rajput, Nils Gessert, Martin Gromniak, Lars Matthäus, Alexander Schlaefer

Head pose estimation and tracking is useful in a variety of medical applications. With the advent of RGBD cameras like the Kinect, it has become feasible to do markerless tracking by estimating the head pose directly from point clouds. One specific medical application is robot-assisted transcranial magnetic stimulation (TMS), where any patient motion is compensated with the help of a robot. For increased patient comfort, it is important to track the head without markers. In this regard, we address the head pose estimation problem using two different approaches. In the first approach, we build upon the more traditional approach of model-based head tracking, where a head model is morphed according to the particular head to be tracked and the morphed model is used to track the head in the point cloud streams. In the second approach, we propose a new multi-scale convolutional neural network architecture for more accurate pose regression. Additionally, we outline a systematic data set acquisition strategy using a head phantom mounted on the robot and ground-truth labels generated by a highly accurate tracking system. [1807.03651v1]
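
A hedged sketch of what a multi-scale pose-regression CNN can look like: the same depth crop enters parallel branches at several scales and the fused features regress a 6-DoF pose. All sizes are assumptions, not the paper's published configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePoseNet(nn.Module):
    def __init__(self, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                          nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for _ in scales
        ])
        self.head = nn.Linear(32 * len(scales), 6)   # translation + rotation

    def forward(self, depth):                        # (B, 1, H, W) depth crop
        feats = [b(F.interpolate(depth, scale_factor=s)).flatten(1)
                 for b, s in zip(self.branches, self.scales)]
        return self.head(torch.cat(feats, dim=1))

pose = MultiScalePoseNet()(torch.randn(2, 1, 128, 128))  # (2, 6) pose vector
```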

 

Deep Underwater Image Enhancement

Saeed Anwar, Chongyi Li, Fatih Porikli

In an underwater scene, wavelength-dependent light absorption and scattering degrade the visibility of images, causing low contrast and distorted color casts. To address this problem, we propose a convolutional neural network based image enhancement model, UWCNN, which is trained efficiently using a synthetic underwater image database. Unlike existing works that require estimating the parameters of an underwater imaging model or impose inflexible frameworks applicable only to specific scenes, our model directly reconstructs the clear latent underwater image by leveraging an automatic, end-to-end, data-driven training mechanism. Consistent with underwater imaging models and the optical properties of underwater scenes, we first synthesize ten different marine image databases. Then, we separately train a UWCNN model for each underwater image formation type. Experimental results on real-world and synthetic underwater images demonstrate that the presented method generalizes well to different underwater scenes and outperforms existing methods both qualitatively and quantitatively. In addition, we conduct an ablation study to demonstrate the effect of each component of our network. [1807.03528v1]
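
For intuition, underwater image synthesis of this kind is often based on the simplified formation model I = J·t + B·(1−t), with a per-channel transmission t = exp(−β·d). The sketch below follows that model with illustrative attenuation coefficients and background light, not the paper's ten calibrated water types:

```python
import numpy as np

def synthesize_underwater(J, depth, beta=(0.35, 0.07, 0.04), B=(0.05, 0.45, 0.60)):
    """J: clear RGB image in [0,1]; depth: per-pixel depth map in meters."""
    beta = np.asarray(beta)                       # per-channel attenuation (R,G,B)
    B = np.asarray(B)                             # ambient background light
    t = np.exp(-beta[None, None, :] * depth[..., None])  # transmission map
    return J * t + B[None, None, :] * (1.0 - t)   # attenuated scene + backscatter

J = np.random.rand(240, 320, 3)                   # stand-in clear image
depth = np.full((240, 320), 5.0)                  # e.g. 5 m everywhere
I = synthesize_underwater(J, depth)               # red fades fastest, as in water
```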

 

Efficient identification, localization and quantification of grapevine inflorescences in unprepared field images using Fully Convolutional Networks

Robert Rudolph, Katja Herzog, Reinhard Töpfer, Volker Steinhage

Yield prediction is one of the most important tasks in grapevine breeding and vineyard management. Commonly, this trait is estimated manually right before harvest by extrapolation, which is mostly labor-intensive, destructive and inaccurate. In the present study, an automated image-based workflow was developed for quantifying inflorescences and single flowers in unprepared field images of grapevines, i.e. with no artificial background or lighting applied. It is a novel approach for non-invasive, inexpensive and objective high-throughput phenotyping. First, image regions depicting inflorescences were identified and localized by segmenting the images into the classes “inflorescence” and “non-inflorescence” using a Fully Convolutional Network (FCN). Efficient image segmentation is the most challenging step here, given the small size and dense distribution of flowers (several hundred flowers per inflorescence), the similar color of all plant organs in the foreground and background, and the circumstance that only approximately 5% of an image shows inflorescences. The trained FCN achieved a mean Intersection Over Union (IOU) of 87.6% on the test data set. Finally, individual flowers were extracted from the “inflorescence” areas using the Circular Hough Transform. The flower extraction achieved a recall of 80.3% and a precision of 70.7% using the segmentation derived from the trained FCN model. In summary, the presented approach is a promising strategy for automatically predicting yield potential at the earliest stage of grapevine development, applicable for objective monitoring and evaluation of breeding material, genetic repositories, or commercial vineyards. [1807.03770v1]
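
The final flower-extraction step can be sketched with OpenCV's Circular Hough Transform; the radius range and Hough parameters below are guesses for illustration, not the study's tuned values:

```python
import cv2
import numpy as np

# Stand-in for the FCN's binary "inflorescence" mask (two synthetic flowers).
mask = np.zeros((200, 200), np.uint8)
cv2.circle(mask, (60, 60), 6, 255, -1)
cv2.circle(mask, (120, 90), 7, 255, -1)

blurred = cv2.medianBlur(mask, 5)                # suppress segmentation noise
circles = cv2.HoughCircles(
    blurred, cv2.HOUGH_GRADIENT, dp=1, minDist=6,
    param1=50, param2=10, minRadius=3, maxRadius=12,
)
n_flowers = 0 if circles is None else circles.shape[1]   # output is (1, N, 3)
print(f"detected {n_flowers} candidate flowers")
```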

 

Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

Chuhui Xue, Shijian Lu, Fangneng Zhan

This paper presents a scene text detection technique that exploits bootstrapping and text border semantics for accurate localization of texts in scenes. A novel bootstrapping technique is designed which samples multiple ‘subsections’ of a word or text line and accordingly relieves the constraint of limited training data effectively. At the same time, the repeated sampling of text ‘subsections’ improves the consistency of the predicted text feature maps, which is critical for predicting a single complete box instead of multiple broken boxes for long words or text lines. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for each scene text. With semantics-aware text borders, scene texts can be localized more accurately by regressing text pixels around the ends of words or text lines instead of all text pixels, which often leads to inaccurate localization when dealing with long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained on several public datasets, e.g. an f-score of 80.1 on MSRA-TD500 and 67.1 on ICDAR2017-RCTW. [1807.03547v1]

 

A GPU-Oriented Algorithm Design for Secant-Based Dimensionality Reduction

Henry Kvinge, Elin Farnell, Michael Kirby, Chris Peterson

Dimensionality-reduction techniques are a fundamental tool for extracting useful information from high-dimensional data sets. Because secant sets encode manifold geometry, they are a useful tool for designing meaningful data-reduction algorithms. In one such approach, the goal is to construct a projection that maximally avoids secant directions and hence ensures that distinct data points are not mapped too close together in the reduced space. This type of algorithm is based on a mathematical framework inspired by the constructive proof of Whitney’s embedding theorem from differential topology. Computing all (unit) secants for a set of points is by nature computationally expensive, thus opening the door for exploitation of GPU architecture for achieving fast versions of these algorithms. We present a polynomial-time data-reduction algorithm that produces a meaningful low-dimensional representation of a data set by iteratively constructing improved projections within the framework described above. Key to our algorithm design and implementation is the use of GPUs which, among other things, minimizes the computational time required for the calculation of all secant lines. One goal of this report is to share ideas with GPU experts and to discuss a class of mathematical algorithms that may be of interest to the broader GPU community. [1807.03425v1]
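
The computational kernel being offloaded to the GPU is the enumeration of all unit secants; a numpy sketch is below (swapping numpy for cupy gives a drop-in GPU version), with the iterative projection search itself omitted:

```python
import numpy as np

def unit_secants(X):
    """X: (n, d) data matrix -> (n*(n-1)/2, d) unit secant directions."""
    i, j = np.triu_indices(X.shape[0], k=1)   # all unordered point pairs
    S = X[i] - X[j]                           # secant vectors
    return S / np.linalg.norm(S, axis=1, keepdims=True)

X = np.random.randn(500, 20)
S = unit_secants(X)                           # 124750 unit secants
print(S.shape)
```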

 

Two-stage iterative Procrustes match algorithm and its application for VQ-based speaker verification

Richeng Tan, Jing Li

Over the past decades, the Vector Quantization (VQ) model has been very popular across different pattern recognition areas, especially for feature-based tasks. However, the classification or regression performance of VQ-based systems always confronts the feature mismatch problem, which heavily affects their performance. In this paper, we propose a two-stage iterative Procrustes match algorithm (TIPM) to address the feature mismatch problem for VQ-based applications. In the first stage, the algorithm removes mismatched feature vector pairs from a pair of input feature sets. The second stage then collects those correctly matched feature pairs that were discarded during the first stage. To evaluate the effectiveness of the proposed TIPM algorithm, speaker verification is used as the case study in this paper. The experiments were conducted on the TIMIT database, and the results show that TIPM can improve VQ-based speaker verification performance under both clean and all noisy conditions. [1807.03587v1]
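
A hedged sketch of one TIPM-style iteration: align the two feature sets with the orthogonal Procrustes solution, drop high-residual pairs (stage 1), then re-admit discarded pairs that fit the refined alignment (stage 2). The thresholds and the exact per-stage logic are assumptions beyond the abstract:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def tipm_step(A, B, drop_thresh=1.0, readmit_thresh=0.5):
    R, _ = orthogonal_procrustes(A, B)              # stage 1: initial alignment
    resid = np.linalg.norm(A @ R - B, axis=1)
    keep = resid < drop_thresh                      # remove mismatched pairs
    R, _ = orthogonal_procrustes(A[keep], B[keep])  # refined alignment
    resid = np.linalg.norm(A @ R - B, axis=1)
    final = keep | (resid < readmit_thresh)         # stage 2: re-collect good pairs
    return R, final

A = np.random.randn(200, 13)                        # e.g. MFCC-like feature pairs
B = A @ np.linalg.qr(np.random.randn(13, 13))[0] + 0.05 * np.random.randn(200, 13)
R, matched = tipm_step(A, B)
```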

 

An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm

Shin Kamada, Takumi Ichimura

Deep Belief Network (DBN) has a deep architecture that represents multiple features of input patterns hierarchically using pre-trained Restricted Boltzmann Machines (RBMs). A traditional RBM or DBN model cannot change its network structure during the learning phase. Our proposed adaptive learning method can discover the optimal number of hidden neurons, weights, and/or layers according to the input space. Such a model is important for managing computational cost and model stability. Regularization that maintains a sparse network structure is also a considerable problem, since extraction of explicit knowledge from the trained network is often required. In our previous research, we developed a hybrid method combining the adaptive structural learning of RBMs with a learning-forgetting method applied to the trained RBM. In this paper, we propose an adaptive learning method for DBNs that can determine the optimal number of layers during learning. We evaluated our proposed model on some benchmark data sets. [1807.03486v1]

 

Learning a Single Tucker Decomposition Network for Lossy Image Compression with Multiple Bits-Per-Pixel Rates

Jianrui Cai, Zisheng Cao, Lei Zhang

Lossy image compression (LIC), which aims to utilize inexact approximations to represent an image more compactly, is a classical problem in image processing. Recently, deep convolutional neural networks (CNNs) have achieved promising results in LIC by learning an encoder-quantizer-decoder network from a large amount of data. However, existing CNN-based LIC methods usually can only train a network for a specific bits-per-pixel (bpp) rate. Such a “one network per bpp” problem limits the generality and flexibility of CNNs in practical LIC applications. In this paper, we propose to learn a single CNN which can perform LIC at multiple bpp rates. A simple yet effective Tucker Decomposition Network (TDNet) is developed, with a novel Tucker decomposition layer (TDL) that decomposes a latent image representation into a set of projection matrices and a core tensor. By changing the rank of the core tensor and its quantization, we can easily adjust the bpp rate of the latent image representation within a single CNN. Furthermore, an iterative non-uniform quantization scheme is presented to optimize the quantizer, and a coarse-to-fine training strategy is introduced to reconstruct the decompressed images. Extensive experiments demonstrate the state-of-the-art compression performance of TDNet in terms of both PSNR and MS-SSIM indices. [1807.03470v1]
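
The decomposition the TDL layer is built around can be sketched offline with a truncated higher-order SVD: a latent tensor is split into a small core and per-mode projection matrices, and shrinking the core ranks lowers the rate. The ranks and tensor size below are illustrative assumptions:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: per-mode factors U and the core G = T x_1 U1^T x_2 U2^T ..."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]        # per-mode projection matrices
    G = T
    for m, Um in enumerate(U):                # contract each mode with Um^T
        G = np.moveaxis(np.tensordot(Um.T, np.moveaxis(G, m, 0), axes=1), 0, m)
    return G, U

T = np.random.rand(16, 16, 32)                # stand-in latent representation
G, U = tucker_hosvd(T, ranks=(8, 8, 8))       # smaller core rank -> lower bpp
```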

 

An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm

Shin Kamada, Takumi Ichimura

Restricted Boltzmann Machine (RBM) is a generative stochastic energy-based artificial neural network model for unsupervised learning. RBM is also well known as a pre-training method for Deep Learning. In addition to visible and hidden neurons, the structure of an RBM has a number of parameters, such as the weights between neurons and their coefficients. It can therefore be difficult to determine an optimal network structure for analyzing big data. To avoid this problem, we investigate the variance of the parameters to find an optimal structure during learning; specifically, we monitor the parameter variance that causes fluctuations in the energy function of the RBM model. In this paper, we propose an adaptive learning method for RBMs that can discover an optimal number of hidden neurons according to the training situation by applying a neuron generation and annihilation algorithm. In this method, a new hidden neuron is generated if the energy function has not yet converged and the variance of the parameters is large. Moreover, an inactivated hidden neuron is annihilated if it does not affect the learning situation. Experimental results on some benchmark data sets are discussed in this paper. [1807.03478v1]
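
A self-contained sketch of how the generation and annihilation conditions could be monitored; the window length, thresholds, and the exact variance statistic are placeholders for quantities the paper defines precisely:

```python
import numpy as np

def should_generate(grad_history, energy_history, var_thresh=0.5, tol=1e-3):
    """Generate a neuron if energy has not converged and parameter variance is large."""
    var = np.mean(np.var(np.stack(grad_history), axis=0))  # variance over recent epochs
    not_converged = abs(energy_history[-1] - energy_history[-2]) > tol
    return not_converged and var > var_thresh

def annihilation_mask(hidden_probs, act_thresh=0.05):
    """Flag hidden neurons whose mean activation no longer affects learning."""
    return hidden_probs.mean(axis=0) < act_thresh          # one flag per neuron

grads = [np.random.randn(6, 4) * 0.9 for _ in range(5)]    # weight-gradient snapshots
print(should_generate(grads, [12.0, 11.5]))
print(annihilation_mask(np.random.rand(100, 4)))
```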

 

SceneEDNet: A Deep Learning Approach for Scene Flow Estimation

Ravi Kumar Thakur, Snehasis Mukherjee

Estimating scene flow in RGB-D videos is attracting much interest from computer vision researchers, due to its potential applications in robotics. State-of-the-art techniques for scene flow estimation typically rely on knowledge of the scene structure of each frame and the correspondences between frames. However, with the increasing amount of RGB-D data captured by sophisticated sensors like the Microsoft Kinect, and the recent advances in deep learning, introducing an efficient deep learning technique for scene flow estimation is becoming important. This paper introduces a first effort to apply deep learning to direct estimation of scene flow, presenting a fully convolutional neural network with an encoder-decoder (ED) architecture. The proposed network, SceneEDNet, estimates the three-dimensional motion vectors of all scene points from a sequence of stereo images. Training for direct estimation of scene flow is done using consecutive pairs of stereo images and the corresponding scene flow ground truth. The proposed architecture is applied to a large dataset and provides meaningful results. [1807.03464v1]
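
A minimal sketch of an encoder-decoder fully convolutional network regressing a 3-channel scene flow field from stacked stereo frames; the channel counts and depth are illustrative, not SceneEDNet's published configuration:

```python
import torch
import torch.nn as nn

class TinyEDNet(nn.Module):
    def __init__(self, in_ch=12):         # e.g. two consecutive stereo RGB pairs
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),  # (dx, dy, dz)
        )

    def forward(self, x):
        return self.dec(self.enc(x))      # per-pixel 3D motion vectors

flow = TinyEDNet()(torch.randn(2, 12, 128, 256))  # output: (2, 3, 128, 256)
```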

 

Shape analysis of framed space curves

Tom Needham

In the elastic shape analysis approach to shape matching and object classification, plane curves are represented as points in an infinite-dimensional Riemannian manifold, wherein shape dissimilarity is measured by geodesic distance. A remarkable result of Younes, Michor, Shah and Mumford says that the space of closed planar shapes, endowed with a natural metric, is isometric to an infinite-dimensional Grassmann manifold via the so-called square root transform. This result facilitates efficient shape comparison by virtue of explicit descriptions of Grassmannian geodesics. In this paper, we extend this shape analysis framework to treat shapes of framed space curves. By considering framed curves, we are able to generalize the square root transform by using quaternionic arithmetic and properties of the Hopf fibration. Under our coordinate transformation, the space of closed framed curves corresponds to an infinite-dimensional complex Grassmannian. This allows us to describe geodesics in framed curve space explicitly. We are also able to produce explicit geodesics between closed, unframed space curves by studying the action of the loop group of the circle on the Grassmann manifold. Averages of collections of plane and space curves are computed via a novel algorithm utilizing flag means. [1807.03477v1]
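
For orientation, the two ingredients named in the abstract can be written compactly; conventions vary, and this is one common form rather than necessarily the paper's exact notation:

```latex
% Square-root transform of a curve c (the plane-curve case of Younes et al.):
\[ q(t) \;=\; \frac{\dot{c}(t)}{\sqrt{\lVert \dot{c}(t)\rVert}} . \]
% Hopf map from unit quaternions to the 2-sphere (one common convention):
\[ \pi : S^3 \to S^2, \qquad \pi(q) \;=\; \bar{q}\,\mathbf{i}\,q . \]
% A framed space curve then lifts to a quaternionic curve whose Hopf projection
% recovers the tangent direction, and the transformed closed curves sit in an
% infinite-dimensional complex Grassmannian.
```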

 

Efficient Evaluation of the Number of False Alarm Criterion

Sylvie Le Hégarat-Mascle, Emanuel Aldea, Jennifer Vandoni

This paper proposes a method for efficiently computing the significance of a parametric pattern inside a binary image. On the one hand, a-contrario strategies avoid user involvement in tuning detection thresholds and allow one to account fairly for different pattern sizes. On the other hand, a-contrario criteria become intractable when the pattern complexity, in terms of parametrization, increases. In this work, we introduce a strategy which relies on a cumulative space of reduced dimensionality, derived from coupling a classic (Hough) cumulative space with an integral histogram trick. This space allows us to store the partial computations required by the a-contrario criterion and to evaluate the significance at a lower computational cost than a straightforward approach. The method is illustrated on synthetic examples with patterns of various parametrizations of up to five dimensions. To demonstrate how to apply this generic concept in a real scenario, we consider a difficult crack detection task in still images, which has been addressed in the literature with various local and global detection strategies. We model cracks as bounded segments, detected by the proposed a-contrario criterion, which allows us to introduce additional spatial constraints based on their relative alignment. On this application, the proposed strategy yields state-of-the-art results and underlines its potential for handling complex pattern detection tasks. [1807.03594v1]
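
For reference, the quantity being evaluated is the standard a-contrario Number of False Alarms: a binomial tail scaled by the number of tests. For a candidate pattern supported on n pixels of which k agree, with background probability p and N_T tested patterns:

```latex
\[
  \mathrm{NFA}(\text{pattern}) \;=\; N_T \cdot
  \sum_{j=k}^{n} \binom{n}{j}\, p^{\,j} (1-p)^{\,n-j},
\]
\[
  \text{and a pattern is declared } \varepsilon\text{-meaningful when }
  \mathrm{NFA} \le \varepsilon .
\]
```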

 

Recovering affine features from orientation- and scale-invariant ones

Daniel Barath

An approach is proposed for recovering affine correspondences (ACs) from orientation- and scale-invariant features, e.g. SIFT. The method calculates the affine parameters consistent with a pre-estimated epipolar geometry from the point coordinates and the scales and rotations obtained by the feature detector. The closed-form solution is given by the roots of a quadratic polynomial equation, thus yielding at most two real candidates and a fast procedure, i.e. <1 millisecond. As a possible application, it is shown that the proposed algorithm allows us to estimate a homography for every single correspondence independently. It is validated, both in a synthetic environment and on publicly available real-world datasets, that the proposed technique leads to accurate ACs. Also, the estimated homographies have accuracy similar to that of state-of-the-art methods, but because only a single correspondence is required, robust estimation, e.g. by locally optimized RANSAC, is an order of magnitude faster. [1807.03503v1]

 

Unsupervised Domain Adaptation for Automatic Estimation of Cardiothoracic Ratio

Nanqing Dong, Michael Kampffmeyer, Xiaodan Liang, Zeya Wang, Wei Dai, Eric P. Xing

The cardiothoracic ratio (CTR), a clinical metric of heart size in chest X-rays (CXRs), is a key indicator of cardiomegaly. Manual measurement of CTR is time-consuming and can be affected by human subjectivity, making it desirable to design computer-aided systems that assist clinicians in the diagnosis process. Automatic CTR estimation through chest organ segmentation, however, requires large amounts of pixel-level annotated data, which is often unavailable. To alleviate this problem, we propose an unsupervised domain adaptation framework based on adversarial networks. The framework learns domain invariant feature representations from openly available data sources to produce accurate chest organ segmentation for unlabeled datasets. Specifically, we propose a model that enforces our intuition that prediction masks should be domain independent. Hence, we introduce a discriminator that distinguishes segmentation predictions from ground truth masks. We evaluate our system’s prediction based on the assessment of radiologists and demonstrate the clinical practicability for the diagnosis of cardiomegaly. We finally illustrate on the JSRT dataset that the semi-supervised performance of our model is also very promising. [1807.03434v1]
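
A minimal sketch of the adversarial idea in the abstract: a discriminator D tries to tell predicted masks from ground-truth masks, and the segmenter is trained to fool it so that its predictions become domain independent. The network bodies and loss weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),
                  nn.Conv2d(32, 1, 4, 2, 1))          # patch-level real/fake scores
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(pred_mask, gt_mask):
    real = D(gt_mask)
    fake = D(pred_mask.detach())                       # don't update the segmenter here
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def adversarial_seg_loss(pred_mask):
    fake = D(pred_mask)                                # gradients reach the segmenter
    return bce(fake, torch.ones_like(fake))            # push predictions toward "real"

mask = torch.sigmoid(torch.randn(2, 1, 64, 64))        # stand-in prediction
loss_d = discriminator_loss(mask, torch.rand(2, 1, 64, 64).round())
```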

 

CIRL: Controllable Imitative Reinforcement Learning for Vision-based Self-driving

Xiaodan Liang, Tairui Wang, Luona Yang, Eric Xing

Autonomous urban driving navigation with complex multi-agent dynamics is under-explored due to the difficulty of learning an optimal driving policy. The traditional modular pipeline heavily relies on hand-designed rules and a pre-processing perception system, while supervised learning-based models are limited by the accessibility of extensive human experience. We present a general and principled Controllable Imitative Reinforcement Learning (CIRL) approach which successfully makes the driving agent achieve higher success rates based only on vision inputs in a high-fidelity car simulator. To alleviate the low exploration efficiency of a large continuous action space, which often prohibits the use of classical RL on challenging real tasks, CIRL explores over a reasonably constrained action space guided by encoded experiences that imitate human demonstrations, building upon the Deep Deterministic Policy Gradient (DDPG). Moreover, we propose specialized adaptive policies and steering-angle reward designs for different control signals (i.e. follow, straight, turn right, turn left) based on shared representations, to improve the model's capability in tackling diverse cases. Extensive experiments on the CARLA driving benchmark demonstrate that CIRL substantially outperforms all previous methods in terms of the percentage of successfully completed episodes on a variety of goal-directed driving tasks. We also show its superior generalization capability in unseen environments. To our knowledge, this is the first successful case of a driving policy learned through reinforcement learning in a high-fidelity simulator that performs better than supervised imitation learning. [1807.03776v1]

 

Topic-Guided Attention for Image Captioning

Zhihao Zhu, Zhan Xue, Zejian Yuan

Attention mechanisms have attracted considerable interest in image captioning because of their powerful performance. Existing attention-based models use feedback information from the caption generator as guidance to determine which image features should be attended to. A common defect of these attention generation methods is that they lack higher-level guiding information from the image itself, which limits their ability to select the most informative image features. Therefore, in this paper, we propose a novel attention mechanism, called topic-guided attention, which integrates image topics into the attention model as guiding information to help select the most important image features. Moreover, we extract image features and image topics with separate networks, which can be fine-tuned jointly in an end-to-end manner during training. Experimental results on the benchmark Microsoft COCO dataset show that our method yields state-of-the-art performance on various quantitative metrics. [1807.03514v1]
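
A hedged sketch of additive attention guided by a topic vector, the core idea named in the abstract: the topic embedding joins the region features in scoring which regions to attend to. All dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicGuidedAttention(nn.Module):
    def __init__(self, feat_dim=512, topic_dim=128, hid=256):
        super().__init__()
        self.wf = nn.Linear(feat_dim, hid)
        self.wt = nn.Linear(topic_dim, hid)
        self.v = nn.Linear(hid, 1)

    def forward(self, feats, topic):
        # feats: (B, R, feat_dim) region features; topic: (B, topic_dim)
        scores = self.v(torch.tanh(self.wf(feats) + self.wt(topic).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)           # (B, R, 1) attention weights
        return (alpha * feats).sum(dim=1)          # topic-weighted image feature

ctx = TopicGuidedAttention()(torch.randn(4, 49, 512), torch.randn(4, 128))
```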

 

Video Summarisation by Classification with Deep Reinforcement Learning

Kaiyang Zhou, Tao Xiang, Andrea Cavallaro

Most existing video summarisation methods are based on either supervised or unsupervised learning. In this paper, we propose a reinforcement learning-based weakly supervised method that exploits easy-to-obtain, video-level category labels and encourages summaries to contain category-related information and maintain category recognisability. Specifically, we formulate video summarisation as a sequential decision-making process and train a summarisation network with deep Q-learning (DQSN). A companion classification network is also trained to provide rewards for training the DQSN. With the classification network, we develop a global recognisability reward based on the classification result. Critically, a novel dense ranking-based reward is also proposed to cope with the temporally delayed and sparse reward problems of long-sequence reinforcement learning. Extensive experiments on two benchmark datasets show that the proposed approach achieves state-of-the-art performance. [1807.03089v2]

 

High-Resolution Mammogram Synthesis using Progressive Generative Adversarial Networks

Dimitrios Korkinof, Tobias Rijken, Michael O’Neill, Joseph Yearsley, Hugh Harvey, Ben Glocker

The ability to generate synthetic medical images is useful for data augmentation, domain transfer, and out-of-distribution detection. However, generating realistic, high-resolution medical images is challenging, particularly for Full Field Digital Mammograms (FFDM), due to the textural heterogeneity, fine structural details and specific tissue properties. In this paper, we explore the use of progressively trained generative adversarial networks (GANs) to synthesize mammograms, overcoming the underlying instabilities when training such adversarial models. This work is the first to show that generation of realistic synthetic medical images is feasible at up to 1280×1024 pixels, the highest resolution achieved for medical image synthesis, enabling visualizations within standard mammographic hanging protocols. We hope this work can serve as a useful guide and facilitate further research on GANs in the medical imaging domain. [1807.03401v1]

 

High Fidelity Semantic Shape Completion for Point Clouds using Latent Optimization

Swaminathan Gurumurthy, Shubham Agrawal

Semantic shape completion is a challenging problem in 3D computer vision, where the task is to generate a complete 3D shape from a partial 3D shape as input. We propose a learning-based approach to complete incomplete 3D shapes through generative modeling and latent manifold optimization. Our algorithm works directly on point clouds. We use an autoencoder and a GAN to learn a distribution of embeddings for point clouds of object classes. An input point cloud with missing regions is first encoded to a feature vector. The representations learnt by the GAN are then used to find the best latent vector on the manifold via a combined optimization that finds a vector, in the manifold of plausible vectors, that is close to the original input in both the feature space and the output space of the decoder. Experiments show that our algorithm is capable of successfully reconstructing point clouds with large missing regions with very high fidelity, without having to rely on exemplar-based database retrieval. [1807.03407v1]
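
A minimal sketch of the latent-manifold search described in the abstract: starting from the encoder's code for the partial cloud, optimize z so the decoded shape both explains the input points and stays near the encoded input. The encoder, decoder, step count, and weighting are placeholders for the trained models and tuned values:

```python
import torch

def complete(partial, encoder, decoder, steps=200, lam=0.1):
    """partial: (B, N, 3) point cloud; encoder/decoder: trained networks."""
    z0 = encoder(partial).detach()
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z)                             # (B, M, 3) candidate shape
        # one-sided Chamfer: every observed point should be explained by the output
        d = torch.cdist(partial, recon).min(dim=2).values.mean()
        loss = d + lam * (z - z0).pow(2).sum()         # stay near the encoded input
        loss.backward()
        opt.step()
    return decoder(z).detach()
```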

 

An Attention Model for group-level emotion recognition

Aarush Gupta, Dakshit Agrawal, Hardik Chauhan, Jose Dolz, Marco Pedersoli

In this paper we propose a new approach for classifying the global emotion of images containing groups of people. To achieve this task, we consider two different and complementary sources of information: (i) a global representation of the entire image and (ii) a local representation in which only faces are considered. While the global representation of the image is learned with a convolutional neural network (CNN), the local representation is obtained by merging face features through an attention mechanism. The two representations are first learned independently with two separate CNN branches and then fused through concatenation to obtain the final group-emotion classifier. For our submission to the EmotiW 2018 group-level emotion recognition challenge, we combined several variations of the proposed model into an ensemble, obtaining a final accuracy of 64.83% on the test set and ranking 4th among all challenge participants. [1807.03380v1]

 

Deep Co-Clustering for Unsupervised Audiovisual Learning

Di Hu, Feiping Nie, Xuelong Li

Birds twitter, running cars are accompanied by noise, people talk face-to-face, and so on. These natural audiovisual correspondences provide the possibility to explore and understand the outside world. However, mixtures of multiple objects and sounds make it intractable to perform efficient matching in unconstrained environments. To address this problem, we propose to adequately excavate audio and visual components and perform elaborate correspondence learning among them. Concretely, a novel unsupervised audiovisual learning model is proposed, named Deep Co-Clustering (DCC), that synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces to capture multiple audiovisual correspondences. Such an integrated multimodal clustering network can be effectively trained with a max-margin loss in an end-to-end fashion. Extensive experiments on feature evaluation and audiovisual tasks are performed. The results demonstrate that DCC can learn effective unimodal representations, with which the classifier can even outperform humans. Further, DCC shows noticeable performance in the tasks of sound localization, multisource detection, and audiovisual understanding. [1807.03094v2]

 

Talk the Walk: Navigating New York City through Grounded Dialogue

Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela

We introduce “Talk The Walk”, the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a “guide” and a “tourist”) that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide’s map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task. [1807.03367v1]

 

Beyond Pixels: Image Provenance Analysis Leveraging Metadata

Aparna Bharati, Daniel Moreira, Joel Brogan, Patricia Hale, Kevin W. Bowyer, Patrick J. Flynn, Anderson Rocha, Walter J. Scheirer

Creative works, whether paintings or memes, follow unique journeys that result in their final form. Understanding these journeys, a process known as “provenance analysis”, provides rich insights into the use, motivation, and authenticity underlying any given work. The application of this type of study to the expanse of unregulated content on the Internet is what we consider in this paper. Provenance analysis provides a snapshot of the chronology and validity of content as it is uploaded, re-uploaded, and modified over time. Although still in its infancy, automated provenance analysis for online multimedia is already being applied to different types of content. Most current works seek to build provenance graphs based on the shared content between images or videos. This can be a computationally expensive task, especially when considering the vast influx of content that the Internet sees every day. Utilizing non-content-based information, such as timestamps, geotags, and camera IDs can help provide important insights into the path a particular image or video has traveled during its time on the Internet without large computational overhead. This paper tests the scope and applicability of metadata-based inferences for provenance graph construction in two different scenarios: digital image forensics and cultural analytics. [1807.03376v1]
