Semantic Segmentation with Scarce Data + SymmNet: A Symmetric Convolutional Neural Network for Occlusion Detection + ReCoNet: Real-time Coherent Video Style Transfer

Stochastic Channel Decorrelation Network and Its Application to Visual Tracking

Jie Guo, Tingfa Xu, Shenwang Jiang, Ziyi Shen

Deep convolutional neural networks (CNNs) have dominated many computer vision domains because of their great power to extract good features automatically. However, many deep CNN-based computer vision tasks suffer from a lack of training data even though the deep models contain millions of parameters. Obviously, these two conflicting facts result in parameter redundancy in many poorly designed deep CNNs. Therefore, we look deep into existing CNNs and find that the redundancy of network parameters comes from the correlation between features in different channels within a convolutional layer. To solve this problem, we propose the stochastic channel decorrelation (SCD) block which, in every iteration, randomly selects multiple pairs of channels within a convolutional layer and calculates their normalized cross correlation (NCC). A squared max-margin loss is then proposed as the objective of SCD to explicitly suppress correlation and maintain diversity between channels. The proposed SCD is very flexible and can be applied simply to any existing CNN model. Based on SCD and the Fully-Convolutional Siamese Networks, we propose a visual tracking algorithm to verify the effectiveness of SCD. [1807.01103v1]
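The SCD objective can be sketched in a few lines. Below is a minimal PyTorch illustration, assuming a feature map of shape (B, C, H, W); the pair count `num_pairs` and the margin are hypothetical hyper-parameters, not values taken from the paper.

```python
import torch

def scd_loss(feat, num_pairs=32, margin=0.1):
    """Squared max-margin penalty on the NCC of random channel pairs."""
    b, c = feat.shape[0], feat.shape[1]
    x = feat.reshape(b, c, -1)
    # zero-mean, unit-norm each channel so the inner product is an NCC
    x = x - x.mean(dim=2, keepdim=True)
    x = x / (x.norm(dim=2, keepdim=True) + 1e-8)
    i = torch.randint(0, c, (num_pairs,))
    j = (i + torch.randint(1, c, (num_pairs,))) % c  # guarantees i != j
    ncc = (x[:, i] * x[:, j]).sum(dim=2)             # (B, num_pairs)
    # penalize only correlations whose magnitude exceeds the margin
    return torch.clamp(ncc.abs() - margin, min=0).pow(2).mean()
```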

 

Modular Vehicle Control for Transferring Semantic Information to Unseen Weather Conditions using GANs

Patrick Wenzel, Qadeer Khan, Daniel Cremers, Laura Leal-Taixé

End-to-end supervised learning has shown promising results for self-driving cars, particularly under conditions for which it was trained. However, it may not necessarily perform well under unseen conditions. In this paper, we demonstrate how knowledge can be transferred from one weather condition for which semantic labels and steering commands are available to a completely new set of conditions for which we have no access to labeled data. The problem is addressed by dividing the task of vehicle control into independent perception and control modules, such that changing one does not affect the other. We train the control module only on the data for the available condition and keep it fixed even under new conditions. The perception module is then used as an interface between the new weather conditions and this control model. The perception module in turn is trained using semantic labels, which we assume are already available for the same weather condition on which the control model was trained. However, obtaining them for other conditions is a tedious and error-prone process. Therefore, we propose to use a generative adversarial network (GAN)-based model to retrieve the semantic information for the new conditions in an unsupervised manner. We introduce a master-servant architecture, where the master model (semantic labels available) trains the servant model (semantic labels not available). The servant model can then be used for steering the vehicle without retraining the control module. [1807.01001v1]

 

Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors

Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, Yaser Sheikh

In this paper, we present supervision-by-registration, an unsupervised approach to improve the precision of facial landmark detectors on both images and video. Our key observation is that the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow. Interestingly, the coherency of optical flow is a source of supervision that does not require manual labeling, and can be leveraged during detector training. For example, we can enforce in the training loss function that a detected landmark at frame$_{t-1}$ followed by optical flow tracking from frame$_{t-1}$ to frame$_t$ should coincide with the location of the detection at frame$_t$. Essentially, supervision-by-registration augments the training loss function with a registration loss, thus training the detector to have output that is not only close to the annotations in labeled images, but also consistent with registration on large amounts of unlabeled videos. End-to-end training with the registration loss is made possible by a differentiable Lucas-Kanade operation, which computes optical flow registration in the forward pass, and back-propagates gradients that encourage temporal coherency in the detector. The output of our method is a more precise image-based facial landmark detector, which can be applied to single images or video. With supervision-by-registration, we demonstrate (1) improvements in facial landmark detection on both images (300W, ALFW) and video (300VW, Youtube-Celebrities), and (2) significant reduction of jittering in video detections. [1807.00966v1]
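As a rough illustration of the registration loss, the sketch below assumes landmark tensors of shape (B, K, 2) and a differentiable flow operator `lk_flow` standing in for the paper's Lucas-Kanade operation; the operator and its signature are assumptions for exposition only.

```python
import torch

def registration_loss(pred_prev, pred_curr, frames_prev, frames_curr, lk_flow):
    # track the previous-frame detections into the current frame via optical flow
    tracked = pred_prev + lk_flow(frames_prev, frames_curr, pred_prev)
    # penalize disagreement between flow-tracked and directly detected landmarks
    return (tracked - pred_curr).pow(2).sum(dim=2).mean()
```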

 

A Weakly Supervised Adaptive DenseNet for Classifying Thoracic Diseases and Identifying Abnormalities

Bo Zhou, Yuemeng Li, Jiangcong Wang

We present a weakly supervised deep learning model for classifying diseases and identifying abnormalities based on medical imaging data. In this work, instead of learning from medical imaging data with region-level annotations, our model was trained on imaging data with image-level labels to classify diseases, and is able to identify abnormal image regions simultaneously. Our model consists of a customized pooling structure and an adaptive DenseNet front-end, which can effectively recognize possible disease features for classification and localization tasks. Our method has been validated on the publicly available ChestX-ray14 dataset. Experimental results have demonstrated that our classification and localization prediction performance achieved significant improvement over the previous models on the ChestX-ray14 dataset. In summary, our network can produce accurate disease classification and localization, which can potentially support clinical decisions. [1807.01257v1]

 

MetaAnchor: Learning to Detect Objects with Customized Anchors

Tong Yang, Xiangyu Zhang, Wenqiang Zhang, Jian Sun

We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. Our experiments on the COCO detection task show that MetaAnchor consistently outperforms its counterparts in various scenarios. [1807.00980v1]
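A minimal sketch of the weight-prediction idea: a small generator maps a customized prior box to the weights of an anchor-specific prediction head. The layer sizes and the normalized (w, h) parameterization are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AnchorFunctionGenerator(nn.Module):
    """Generates the weights of a box-regression head from a prior box."""
    def __init__(self, feat_dim=256, out_dim=4, hidden=128):
        super().__init__()
        self.embed = nn.Linear(2, hidden)                  # prior box -> hidden
        self.w_gen = nn.Linear(hidden, feat_dim * out_dim) # predicted weight
        self.b_gen = nn.Linear(hidden, out_dim)            # predicted bias
        self.feat_dim, self.out_dim = feat_dim, out_dim

    def forward(self, prior_box, feat):
        # prior_box: (2,) normalized (w, h); feat: (N, feat_dim) pixel features
        h = torch.relu(self.embed(prior_box))
        w = self.w_gen(h).view(self.out_dim, self.feat_dim)
        b = self.b_gen(h)
        return feat @ w.t() + b  # predictions tied to this customized anchor
```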

 

Semi-supervised Anomaly Detection Using GANs for Visual Inspection in Noisy Training Data

Masanari Kimura, Takashi Yanagihara

The detection and quantification of anomalies in image data are critical tasks in industrial scenes, such as detecting micro scratches on products. In recent years, due to the difficulty of defining anomalies and the limited ability to correct their labels, research on unsupervised anomaly detection using generative models has attracted attention. Generally, in those studies, only normal images are used for training to model the distribution of normal images. The model measures the anomalies in the target images by reproducing the most similar images and scoring image patches according to their fit to the learned distribution. This approach is based on a strong presumption: the trained model should not be able to generate abnormal images. However, in reality, the model can generate abnormal images, mainly because of noisy normal data that include small abnormal pixels, and such noise severely affects the accuracy of the model. Therefore, we propose a novel semi-supervised method to distort the distribution of the model with existing abnormal images. The proposed method detects pixel-level micro anomalies with high accuracy from 1024×1024 high-resolution images that are actually used in an industrial scene. In this paper, due to the confidentiality of the data, we share experimental results on open datasets. [1807.01136v1]
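The patch-scoring step described above can be sketched as follows; `reconstruct` stands in for whatever generative model produces the closest normal reconstruction, and the patch size is an illustrative assumption.

```python
import numpy as np

def anomaly_map(img, reconstruct, patch=16):
    """Per-patch reconstruction error between a grayscale image and its
    closest generated reconstruction (higher = more anomalous)."""
    recon = reconstruct(img)
    h, w = img.shape
    score = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            diff = img[i:i + patch, j:j + patch] - recon[i:i + patch, j:j + patch]
            score[i // patch, j // patch] = np.mean(diff ** 2)
    return score
```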

 

Long Activity Video Understanding using Functional Object-Oriented Network

Ahmad Babaeian Jelodar, David Paulius, Yu Sun

Video understanding is one of the most challenging topics in computer vision. In this paper, a four-stage video understanding pipeline is presented to simultaneously recognize all atomic actions and the single on-going activity in a video. This pipeline uses objects and motions from the video and a graph-based knowledge representation network as prior reference. Two deep networks are trained to identify objects and motions in each video sequence associated with an action. Low-level image features are then used to identify objects of interest in that video sequence. Confidence scores are assigned to objects of interest based on their involvement in the action, and to motion classes based on results from a deep neural network that classifies the on-going action in the video into motion classes. Confidence scores are computed for each candidate functional unit associated with an action using a knowledge representation network, object confidences, and motion confidences. Each action is therefore associated with a functional unit, and the sequence of actions is further evaluated to identify the single on-going activity in the video. The knowledge representation used in the pipeline is called the functional object-oriented network, which is a graph-based network useful for encoding knowledge about manipulation tasks. Experiments are performed on a dataset of cooking videos to test the proposed algorithm with action inference and activity classification. Experiments show that using the functional object-oriented network improves video understanding significantly. [1807.00983v1]

 

HAMLET: Hierarchical Harmonic Filters for Learning Tracts from Diffusion MRI

Marco Reisert, Volker A. Coenen, Christoph Kaller, Karl Egger, Henrik Skibbe

In this work we propose HAMLET, a novel tract learning algorithm which, after training, maps raw diffusion-weighted MRI directly onto an image that simultaneously indicates tract direction and tract presence. The automatic learning of fiber tracts based on diffusion MRI data is a rather new idea that tries to overcome limitations of atlas-based techniques; HAMLET takes such an approach. Unlike the current trend in machine learning, HAMLET has only a small number of free parameters. HAMLET is based on spherical tensor algebra, which allows a translation- and rotation-covariant treatment of the problem, and on a repeated application of convolutions and non-linearities that all respect the rotation covariance. The intrinsic treatment of such basic image transformations in HAMLET allows the training and generalization of the algorithm without any additional data augmentation. We demonstrate the performance of our approach on twelve prominent bundles, and show that the obtained tract estimates are robust and reliable. It is also shown that the learned models are portable from one sequence to another. [1807.01068v1]

 

Deep Architectures and Ensembles for Semantic Video Classification

Eng-Jon Ong, Sameed Husain, Mikel Bober, Miroslaw Bober

This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture, with state-of-the-art classification performance at significantly reduced complexity. Further, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by “clever” ensembling of diverse nets, and we investigate the factors contributing to high diversity. Based on the extensive YouTube8M dataset, we perform a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others, presenting for the first time an in-depth evaluation and analysis of their behaviour. [1807.01026v1]

 

A Spatial and Temporal Features Mixture Model with Body Parts for Video-based Person Re-Identification

Jie Liu, Cheng Sun, Xiang Xu, Baomin Xu, Shuangyuan Yu

Video-based person re-identification aims to recognize a person under different cameras, which is a crucial task in visual surveillance systems. Most previous methods mainly focused on features of the full body in each frame. In this paper we propose a novel Spatial and Temporal Features Mixture Model (STFMM) based on a convolutional neural network (CNN) and a recurrent neural network (RNN), in which the human body is split into $N$ parts in the horizontal direction so that more specific features can be obtained. The proposed method skillfully integrates the features of each part to achieve a more expressive representation of each person. We first split the video sequence into $N$ part sequences, which include information about the head, waist, legs and so on. The features are then extracted by STFMM, whose $2N$ inputs are obtained from the developed Siamese network, and these features are combined into a discriminative representation for one person. Experiments are conducted on the iLIDS-VID and PRID-2011 datasets. The results demonstrate that our approach outperforms existing methods for video-based person re-identification. It achieves a rank-1 CMC accuracy of 74% on the iLIDS-VID dataset, exceeding the most recently developed method ASTPN by 12%. For cross-data testing, our method achieves a rank-1 CMC accuracy of 48%, exceeding the ASTPN method by 18%, which shows that our model has strong stability. [1807.00975v1]

 

MediaEval 2018: Predicting Media Memorability Task

Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, France Rennes

In this paper, we present the Predicting Media Memorability task, which is proposed as part of the MediaEval 2018 Benchmarking Initiative for Multimedia Evaluation. Participants are expected to design systems that automatically predict memorability scores for videos, which reflect the probability of a video being remembered. In contrast to previous work in image memorability prediction, where memorability was measured a few minutes after memorization, the proposed dataset comes with short-term and long-term memorability annotations. All task characteristics are described, namely: the task’s challenges and breakthrough, the released data set and ground truth, the required participant runs and the evaluation metrics. [1807.01052v1]

 

Resembled Generative Adversarial Networks: Two Domains with Similar Attributes

Duhyeon Bang, Hyunjung Shim

We propose a novel algorithm, namely Resembled Generative Adversarial Networks (GAN), that generates data from two different domains simultaneously such that they resemble each other. Although recent GAN algorithms have achieved great success in learning cross-domain relationships, their application is limited to domain transfer, which requires an input image. The first attempt to tackle data generation for two domains was made by CoGAN; however, that solution is inherently vulnerable to varying levels of domain similarity. Unlike CoGAN, our Resembled GAN implicitly induces two generators to match feature covariance from both domains, thus leading them to share semantic attributes. Hence, we effectively handle a wide range of structural and semantic similarities between two domains. Based on experimental analysis on various datasets, we verify that the proposed algorithm is effective for generating two domains with similar attributes. [1807.00947v1]

 

SpaceNet: A Remote Sensing Dataset and Challenge Series

Adam Van Etten, Dave Lindenbaum, Todd M. Bacastow

Foundational mapping remains a challenge in many parts of the world, particularly during dynamic scenarios such as natural disasters when timely updates are critical. Modifying maps is currently a highly manual process requiring a large number of human labelers to either create features or rigorously validate automated outputs. We propose that the frequent revisits of earth imaging satellite constellations may accelerate existing efforts to quickly revise foundational maps when combined with advanced machine learning techniques. Accordingly, the SpaceNet partners (CosmiQ Works, Radiant Solutions, and NVIDIA) released a large corpus of labeled satellite imagery on Amazon Web Services (AWS) called SpaceNet. The SpaceNet partners also launched a series of public prize competitions to encourage improvement of remote sensing machine learning algorithms. The first two of these competitions focused on automated building footprint extraction, and the most recent challenge focused on road network extraction. In this paper we discuss the SpaceNet imagery, labels, evaluation metrics, and prize challenge results to date. [1807.01232v1]

 

Local Gradients Smoothing: Defense against localized adversarial attacks

Muzammal Naseer, Salman Khan, Fatih Porikli

Deep neural networks (DNNs) have shown vulnerability to adversarial attacks, i.e., carefully perturbed inputs designed to mislead the network at inference time. Recently introduced localized attacks, LaVAN and Adversarial Patch, posed a new challenge to deep learning security by adding adversarial noise only within a specific region, without affecting the salient objects in an image. Driven by the observation that such attacks introduce concentrated high-frequency changes at a particular image location, we have developed an effective method to estimate the noise location in the gradient domain and transform those high-activation regions caused by adversarial noise in the image domain, while having a minimal effect on the salient object that is important for correct classification. Our proposed Local Gradients Smoothing (LGS) scheme achieves this by regularizing gradients in the estimated noisy region before feeding the image to a DNN for inference. We have shown the effectiveness of our method in comparison to other defense methods, including JPEG compression, Total Variance Minimization (TVM) and Feature Squeezing, on the ImageNet dataset. In addition, we systematically study the robustness of the proposed defense mechanism against Backward Pass Differentiable Approximation (BPDA), a state-of-the-art attack recently developed to break defenses that transform an input sample to minimize the adversarial effect. Compared to other defense mechanisms, LGS is by far the most resistant to BPDA in the localized adversarial attack setting. [1807.01216v1]
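As a loose, simplified stand-in for the defense described above (not the paper's exact algorithm), the sketch below locates the window with the most concentrated gradient energy in a grayscale image and attenuates the pixel variation there before classification; the window size and attenuation factor are assumptions.

```python
import numpy as np

def local_gradients_smoothing(img, win=32, factor=0.1):
    """Suppress the highest-gradient-energy window of a grayscale image."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    h, w = mag.shape
    best, by, bx = -1.0, 0, 0
    for y in range(0, h - win + 1, win // 2):   # coarse sliding-window search
        for x in range(0, w - win + 1, win // 2):
            e = mag[y:y + win, x:x + win].sum()
            if e > best:
                best, by, bx = e, y, x
    out = img.astype(np.float64).copy()
    patch = out[by:by + win, bx:bx + win]
    # shrink the variation inside the suspected adversarial region
    out[by:by + win, bx:bx + win] = patch.mean() + factor * (patch - patch.mean())
    return out
```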

 

ReCoNet: Real-time Coherent Video Style Transfer Network

Chang Gao, Derun Gu, Fangjun Zhang, Yizhou Yu

Image style transfer models based on convolutional neural networks usually suffer from high temporal inconsistency when applied to videos. Some video style transfer models have been proposed to improve temporal consistency, yet they fail to guarantee fast processing speed, nice perceptual style quality and high temporal consistency at the same time. In this paper, we propose a novel real-time video style transfer model, ReCoNet, which can generate temporally coherent style transfer videos while maintaining favorable perceptual styles. A novel luminance warping constraint is added to the temporal loss at the output level to capture luminance changes between consecutive frames and increase stylization stability under illumination effects. We also propose a novel feature-map-level temporal loss to further enhance temporal consistency on traceable objects. Experimental results indicate that our model exhibits outstanding performance both qualitatively and quantitatively. [1807.01197v1]
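A rough sketch of an output-level temporal loss with a luminance warping term is given below, assuming stylized frames `o_t`, `o_prev` and input frames `i_t`, `i_prev` of shape (B, 3, H, W), a `warp` function applying ground-truth optical flow, and an occlusion `mask`; the ITU-R 601 luminance weights and the exact loss form are illustrative assumptions.

```python
import torch

def luminance(x):
    # relative luminance from RGB (ITU-R 601 weights)
    return 0.299 * x[:, 0] + 0.587 * x[:, 1] + 0.114 * x[:, 2]

def temporal_loss(o_t, o_prev, i_t, i_prev, warp, mask):
    warped_o = warp(o_prev)   # stylized previous frame warped to time t
    warped_i = warp(i_prev)   # input previous frame warped to time t
    # allow luminance changes of the inputs to pass through to the outputs
    lum_change = (luminance(i_t) - luminance(warped_i)).unsqueeze(1)
    diff = o_t - warped_o - lum_change
    return (mask * diff.pow(2)).mean()
```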

 

Viewpoint Estimation-Insights & Model

Gilad Divon, Ayellet Tal

This paper addresses the problem of viewpoint estimation of an object in a given image. It presents five key insights that should be taken into consideration when designing a CNN that solves the problem. Based on these insights, the paper proposes a network in which (i) the architecture jointly solves detection, classification, and viewpoint estimation; (ii) new types of data are added and trained on; (iii) a novel loss function, which takes into account both the geometry of the problem and the new types of data, is proposed. Our network improves the state-of-the-art results for this problem by 9.8%. [1807.01312v1]

 

Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities

Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, Walter J. Scheirer

In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquity of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data. We select high-level features for our model that have been successful in nonaffect domains in order to test their generalizability in the sentiment detection domain. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out challenge test set. [1807.01122v1]

 

Iterative Attention Mining for Weakly Supervised Thoracic Disease Pattern Localization in Chest X-Rays

Jinzheng Cai, Le Lu, Adam P. Harrison, Xiaoshuang Shi, Pingjun Chen, Lin Yang

Given image labels as the only supervisory signal, we focus on harvesting, or mining, thoracic disease localizations from chest X-ray images. Harvesting such localizations from existing datasets allows for the creation of improved data sources for computer-aided diagnosis and retrospective analyses. We train a convolutional neural network (CNN) for image classification and propose an attention mining (AM) strategy to improve the model’s sensitivity or saliency to disease patterns. The intuition of AM is that once the most salient disease area is blocked or hidden from the CNN model, it will pay attention to alternative image regions, while still attempting to make correct predictions. However, the model needs to be properly constrained during AM; otherwise, it may overfit to uncorrelated image parts and forget the valuable knowledge that it has learned from the original image classification task. To alleviate such side effects, we then design a knowledge preservation (KP) loss, which minimizes the discrepancy between responses for X-ray images from the original and the updated networks. Furthermore, we modify the CNN model to include multi-scale aggregation (MSA), improving its localization ability on small-scale disease findings, e.g., lung nodules. We experimentally validate our method on the publicly-available ChestX-ray14 dataset, outperforming a class activation map (CAM)-based approach, and demonstrating the value of our novel framework for mining disease locations. [1807.00958v1]

 

Who did What at Where and When: Simultaneous Multi-Person Tracking and Activity Recognition

Wenbo Li, Ming-Ching Chang, Siwei Lyu

We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at individual, interaction and social group activity levels. The inference consists of identifying trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems via multiple stages: (1) activity-aware tracking, (2) joint interaction recognition and occlusion recovery, and (3) collective activity recognition. We solve the where and when problem with visual tracking, as well as the who and what problem with recognition. High-order correlations among the visible and occluded individuals, pairwise interactions, groups, and activities are then solved using a hypergraph formulation within the Bayesian framework. Experiments on several benchmarks show the advantages of our approach over state-of-the-art methods. [1807.01253v1]

 

Ballistocardiogram Signal Processing: A Literature Review

Ibrahim Sadek

First, time-domain algorithms focus on detecting local maxima or local minima using a moving window, and thereby finding the interval between the dominant J-peaks of the ballistocardiogram (BCG) signal. However, this approach has many limitations due to the nonlinear and nonstationary behavior of the BCG signal, which does not display consistent J-peaks; this is usually the case for overnight, in-home monitoring, particularly with frail elderly people. Additionally, its accuracy is undoubtedly affected by motion artifacts. Second, frequency-domain algorithms do not provide information about interbeat intervals; nevertheless, they can provide information about heart rate variability. This is usually done by taking the fast Fourier transform or the inverse Fourier transform of the logarithm of the estimated spectrum, i.e., the cepstrum of the signal, using a sliding window. Thereafter, the dominant frequency is obtained in a particular frequency range. The limitation of these algorithms is that the peak in the spectrum may get wider and multiple peaks may appear, which can cause problems in measuring the vital signs. Finally, the objective of wavelet-domain algorithms is to decompose the signal into different components, so that the component that agrees with the vital signs can be selected, i.e., the selected component contains only information about the heart cycles or the respiratory cycles, respectively. Empirical mode decomposition is an alternative to wavelet decomposition, and it is also a very suitable approach for coping with nonlinear and nonstationary signals such as cardiorespiratory signals. Apart from the above-mentioned algorithms, machine learning approaches have been implemented for measuring heartbeats; however, the manual labeling of training data is a restricting property. [1807.00951v1]
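The cepstral approach described above can be illustrated in a few lines of NumPy; the sampling rate, search band, and window length (a 1-D segment several seconds long) are illustrative assumptions.

```python
import numpy as np

def heart_rate_cepstrum(window, fs=100.0, bpm_lo=40.0, bpm_hi=140.0):
    """Estimate heart rate (bpm) from one BCG window via the cepstrum."""
    spectrum = np.abs(np.fft.rfft(window))
    # cepstrum: inverse FFT of the log spectrum
    cepstrum = np.abs(np.fft.irfft(np.log(spectrum + 1e-12)))
    # quefrency range (in samples) corresponding to the heart-rate band
    q_lo = int(fs * 60.0 / bpm_hi)
    q_hi = int(fs * 60.0 / bpm_lo)
    q = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return 60.0 * fs / q
```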

 

Kitting in the Wild through Online Domain Adaptation

Massimiliano Mancini, Hakan Karaoguz, Elisa Ricci, Patric Jensfelt, Barbara Caputo

Technological developments call for increasing perception and action capabilities of robots. Among other skills, vision systems that can adapt to any possible change in the working conditions are needed. Since these conditions are unpredictable, we need benchmarks that allow us to assess the generalization and robustness capabilities of our visual recognition algorithms. In this work we focus on robotic kitting in unconstrained scenarios. As a first contribution, we present a new visual dataset for the kitting task. Differently from standard object recognition datasets, we provide images of the same objects acquired under various conditions where camera, illumination and background are changed. This novel dataset allows for testing the robustness of robot visual recognition algorithms to a series of different domain shifts, both in isolation and in combination. Our second contribution is a novel online adaptation algorithm for deep models, based on batch-normalization layers, which allows a model to be continuously adapted to the current working conditions. Differently from standard domain adaptation algorithms, it does not require any image from the target domain at training time. We benchmark the performance of the algorithm on the proposed dataset, showing its capability to fill the gap between the performance of a standard architecture and that of its counterpart adapted offline to the given target domain. [1807.01028v1]

 

SymmNet: A Symmetric Convolutional Neural Network for Occlusion Detection

Ang Li, Zejian Yuan

Detecting occlusion from stereo images or video frames is important to many computer vision applications. Previous efforts focus on bundling it with the computation of disparity or optical flow, leading to a chicken-and-egg problem. In this paper, we leverage a convolutional neural network to liberate the occlusion detection task from the interleaved, traditional calculation framework. We propose a Symmetric Network (SymmNet) to directly exploit information from an image pair, without estimating disparity or motion in advance. The proposed network is structurally left-right symmetric to learn the binocular occlusion simultaneously, aimed at jointly improving both results. Comprehensive experiments show that our model achieves state-of-the-art results on detecting stereo and motion occlusion. [1807.00959v1]

 

Semantic Segmentation with Scarce Data

Isay Katsman, Rohun Tripathi, Andreas Veit, Serge Belongie

Semantic segmentation is a challenging vision problem that usually necessitates the collection of large amounts of finely annotated data, which is often quite expensive to obtain. Coarsely annotated data provides an interesting alternative as it is usually substantially cheaper. In this work, we present a method to leverage coarsely annotated data along with fine supervision to produce better segmentation results than would be obtained when training using only the fine data. We validate our approach by simulating a scarce data setting with less than 200 low resolution images from the Cityscapes dataset and show that our method substantially outperforms solely training on the fine annotation data by an average of 15.52% mIoU and outperforms the coarse mask by an average of 5.28% mIoU. [1807.00911v1]

 

Recurrent-OctoMap: Learning State-based Map Refinement for Long-Term Semantic Mapping with 3D-Lidar Data

Li Sun, Zhi Yan, Anestis Zaganidis, Cheng Zhao, Tom Duckett

This paper presents a novel semantic mapping approach, Recurrent-OctoMap, learned from long-term 3D Lidar data. Most existing semantic mapping approaches focus on improving semantic understanding of single frames, rather than 3D refinement of semantic maps (i.e. fusing semantic observations). The most widely-used approach for 3D semantic map refinement is a Bayes update, which fuses the consecutive predictive probabilities following a Markov-Chain model. Instead, we propose a learning approach to fuse the semantic features, rather than simply fusing predictions from a classifier. In our approach, we represent and maintain our 3D map as an OctoMap, and model each cell as a recurrent neural network (RNN), to obtain a Recurrent-OctoMap. In this case, the semantic mapping process can be formulated as a sequence-to-sequence encoding-decoding problem. Moreover, in order to extend the duration of observations in our Recurrent-OctoMap, we developed a robust 3D localization and mapping system for successively mapping a dynamic environment using more than two weeks of data, and the system can be trained and deployed with arbitrary memory length. We validate our approach on the ETH long-term 3D Lidar dataset [1]. The experimental results show that our proposed approach outperforms the conventional “Bayes update” approach. [1807.00925v1]

 

Model-based Hand Pose Estimation for Generalized Hand Shape with Appearance Normalization

Jan Wöhlke, Shile Li, Dongheui Lee

Since the emergence of large annotated datasets, state-of-the-art hand pose estimation methods have been mostly based on discriminative learning. Recently, a hybrid approach has embedded a kinematic layer into the deep learning structure in such a way that the pose estimates obey the physical constraints of human hand kinematics. However, the existing approach relies on a single person’s hand shape parameters, which are fixed constants. Therefore, the existing hybrid method has problems to generalize to new, unseen hands. In this work, we extend the kinematic layer to make the hand shape parameters learnable. In this way, the learnt network can generalize towards arbitrary hand shapes. Furthermore, inspired by the idea of Spatial Transformer Networks, we apply a cascade of appearance normalization networks to decrease the variance in the input data. The input images are shifted, rotated, and globally scaled to a similar appearance. The effectiveness and limitations of our proposed approach are extensively evaluated on the Hands 2017 challenge dataset and the NYU dataset. [1807.00898v1]

 

Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera

Fangchang Ma, Guilherme Venturelli Cavalheiro, Sertac Karaman

Depth completion, the technique of estimating a dense depth image from sparse depth measurements, has a variety of applications in robotics and autonomous driving. However, depth completion faces 3 main challenges: the irregularly spaced pattern in the sparse depth input, the difficulty in handling multiple sensor modalities (when color images are available), as well as the lack of dense, pixel-level ground truth depth labels. In this work, we address all these challenges. Specifically, we develop a deep regression model to learn a direct mapping from sparse depth (and color images) to dense depth. We also propose a self-supervised training framework that requires only sequences of color and sparse depth images, without the need for dense depth labels. Our experiments demonstrate that our network, when trained with semi-dense annotations, attains state-of-the-art accuracy and is the winning approach on the KITTI depth completion benchmark at the time of submission. Furthermore, the self-supervised framework outperforms a number of existing solutions trained with semi-dense annotations. [1807.00275v2]

 

Differentiable Learning-to-Normalize via Switchable Normalization

Ping Luo, Jiamin Ren, Zhanglin Peng

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different operations for different normalization layers of a deep neural network (DNN). SN switches among three distinct scopes to compute statistics (means and variances), namely a channel, a layer, and a minibatch, by learning their importance weights in an end-to-end manner. SN has several good properties. First, it adapts to various network architectures and tasks (see Fig.1). Second, it is robust to a wide range of batch sizes, maintaining high performance even when a small minibatch is used (e.g. 2 images/GPU). Third, SN treats all channels as a group, unlike group normalization, which searches for the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging problems, such as image classification on ImageNet, object detection and segmentation on COCO, artistic image stylization, and neural architecture search. We hope SN will help ease the use and improve the understanding of normalization techniques in deep learning. The code of SN will be made available at https://github.com/switchablenorms/. [1806.10779v3]
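The switching mechanism can be sketched directly from the description above: compute per-channel (IN), per-layer (LN), and per-minibatch (BN) statistics and mix them with softmax-normalized importance weights. The sketch below is a training-mode-only illustration (no running BN statistics), with separate weights for means and variances.

```python
import torch
import torch.nn as nn

class SwitchableNorm2d(nn.Module):
    """Mixes IN/LN/BN statistics with learned importance weights."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.mean_w = nn.Parameter(torch.ones(3))  # weights over (IN, LN, BN)
        self.var_w = nn.Parameter(torch.ones(3))

    def forward(self, x):  # x: (B, C, H, W)
        mu_in = x.mean((2, 3), keepdim=True)
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mu_ln = x.mean((1, 2, 3), keepdim=True)
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mu_bn = x.mean((0, 2, 3), keepdim=True)
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)
        wm = torch.softmax(self.mean_w, dim=0)
        wv = torch.softmax(self.var_w, dim=0)
        mu = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
```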

 

Compact Deep Neural Networks for Computationally Efficient Gesture Classification From Electromyography Signals

Adam Hartwell, Visakan Kadirkamanathan, Sean R Anderson

Machine learning classifiers using surface electromyography are important for human-machine interfacing and device control. Conventional classifiers such as support vector machines (SVMs) use manually extracted features based on, e.g., wavelets. These features tend to be fixed and non-person-specific, which is a key limitation due to the high person-to-person variability of myography signals. Deep neural networks, by contrast, can automatically extract person-specific features – an important advantage. However, deep neural networks typically have the drawback of large numbers of parameters, requiring large training data sets and powerful hardware not suited to embedded systems. This paper solves these problems by introducing a compact deep neural network architecture that is much smaller than existing counterparts. The performance of the compact deep net is benchmarked against an SVM and compared to other contemporary architectures across 10 human subjects, comparing the Myo and Delsys Trigno electrode sets. The accuracy of the compact deep net was found to be 84.2 +/- 0.06% versus 70.5 +/- 0.07% for the SVM on the Myo, and 80.3 +/- 0.07% versus 67.8 +/- 0.09% for the Delsys system, demonstrating the superior effectiveness of the proposed compact network, which had just 5,889 parameters – orders of magnitude fewer than some contemporary alternatives in this domain – while maintaining better performance. [1806.08641v2]

 

Fully Convolutional Networks and Generative Neural Networks Applied to Sclera Segmentation

Diego R. Lucio, Rayson Laroca, Evair Severo, Alceu S. Britto Jr., David Menotti

Due to the world's demand for security systems, biometrics can be seen as an important topic of research in computer vision. One of the biometric forms that has been gaining attention is recognition based on the sclera. The initial and paramount step for performing this type of recognition is the segmentation of the region of interest, i.e. the sclera. In this context, two approaches for this task, based on the Fully Convolutional Network (FCN) and on the Generative Adversarial Network (GAN), are introduced in this work. An FCN is similar to a common convolutional neural network; however, the fully connected layers (i.e., the classification layers) are removed from the end of the network, and the output is generated by combining the outputs of pooling layers from different convolutional ones. The GAN is based on game theory, where two networks compete with each other to generate the best segmentation. In order to perform a fair comparison with baselines and quantitative and objective evaluations of the proposed approaches, we provide the scientific community with 1,300 new manually segmented images from two databases. The experiments are performed on the UBIRIS.v2 and MICHE databases, and the best performing configurations of our propositions achieved F-score measures of 87.48% and 88.32%, respectively. [1806.08722v2]

 

Real-time Monocular Visual Odometry for Turbid and Dynamic Underwater Environments

Maxime Ferrera, Julien Moras, Pauline Trouvé-Peloux, Vincent Creuze

In the context of robotic underwater operations, the visual degradations induced by the medium properties make it difficult to rely exclusively on cameras for localization. Hence, most localization methods are based on expensive navigational sensors associated with acoustic positioning. On the other hand, visual odometry and visual SLAM have been exhaustively studied for aerial or terrestrial applications, but state-of-the-art algorithms fail underwater. In this paper we tackle the problem of using a simple low-cost camera for underwater localization and propose a new monocular visual odometry method dedicated to the underwater environment. We evaluate different tracking methods and show that optical flow based tracking is better suited to underwater images than classical approaches based on descriptors. We also propose a keyframe-based visual odometry approach relying heavily on nonlinear optimization. The proposed algorithm has been assessed on both simulated and real underwater datasets and outperforms state-of-the-art visual SLAM methods under many of the most challenging conditions. The main application of this work is the localization of Remotely Operated Vehicles (ROVs) used for underwater archaeological missions, but the developed system can be used in any other application where visual information is available. [1806.05842v2]

 

GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

Zhilin Yang, Jake Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels. [1806.05662v3]

 

Local Learning with Deep and Handcrafted Features for Facial Expression Recognition

Mariana-Iuliana Georgescu, Radu Tudor Ionescu, Marius Popescu

We present an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition. To obtain automatic features, we experiment with multiple CNN architectures, pre-trained models and training procedures, e.g. Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest neighbors model is applied for selecting the nearest training samples for an input test image. Second, a one-versus-all Support Vector Machines (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used for predicting the class label only for the test image it was trained for. Although local learning has been used before in combination with handcrafted features, to the best of our knowledge, it has never been employed in combination with deep features. The experiments on the 2013 Facial Expression Recognition (FER) Challenge data set and the FER+ data set demonstrate that our approach achieves state-of-the-art results. With a top accuracy of 75.42% on the FER 2013 data set and 87.76% on the FER+ data set, we surpass all competition by more than 2% on both data sets. [1804.10892v3]
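The three-step local learning procedure lends itself to a compact scikit-learn sketch; feature extraction (the fused CNN + BOVW representation) is assumed to have been done already, and the neighborhood size `k` is a hypothetical choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def local_predict(train_feats, train_labels, test_feat, k=200):
    # Step 1: select the k nearest training samples for this test image
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    _, idx = nn.kneighbors(test_feat.reshape(1, -1))
    neigh_x, neigh_y = train_feats[idx[0]], train_labels[idx[0]]
    if len(np.unique(neigh_y)) == 1:   # all neighbors agree; no SVM needed
        return neigh_y[0]
    # Step 2: train a one-versus-all SVM on the selected neighbors only
    svm = LinearSVC().fit(neigh_x, neigh_y)
    # Step 3: the classifier predicts the label of this one test image
    return svm.predict(test_feat.reshape(1, -1))[0]
```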

 

Target Driven Instance Detection

Phil Ammirato, Cheng-Yang Fu, Mykhailo Shvets, Jana Kosecka, Alexander C. Berg

While state-of-the-art general object detectors are getting better and better, there are not many systems specifically designed to take advantage of the instance detection problem. For many applications, such as household robotics, a system may need to recognize a few very specific instances at a time. Speed can be critical in these applications, as can the need to recognize previously unseen instances. We introduce a Target Driven Instance Detector (TDID), which modifies existing general object detectors for the instance recognition setting. TDID not only improves performance on instances seen during training, with a fast runtime, but is also able to generalize to detect novel instances. [1803.04610v2]

 

Improved Training of Generative Adversarial Networks Using Representative Features

Duhyeon Bang, Hyunjung Shim

Despite the success of generative adversarial networks (GANs) for image generation, the trade-off between visual quality and image diversity remains a significant issue. This paper achieves both aims simultaneously by improving the stability of training GANs. The key idea of the proposed approach is to implicitly regularize the discriminator using representative features. Focusing on the fact that standard GAN minimizes reverse Kullback-Leibler (KL) divergence, we transfer the representative feature, which is extracted from the data distribution using a pre-trained autoencoder (AE), to the discriminator of standard GANs. Because the AE learns to minimize forward KL divergence, our GAN training with representative features is influenced by both reverse and forward KL divergence. Consequently, the proposed approach is verified to improve the visual quality and diversity of state-of-the-art GANs using extensive evaluations. [1801.09195v3]

 

Enlarging Context with Low Cost: Efficient Arithmetic Coding with Trimmed Convolution

Mu Li, Shuhang Gu, David Zhang, Wangmeng Zuo

Arithmetic coding is an essential class of coding techniques. One key issue of arithmetic encoding is predicting the probability of the current coding symbol from its context, i.e., the preceding encoded symbols, which can usually be done by building a look-up table (LUT). However, the complexity of a LUT increases exponentially with the length of the context, so such solutions are limited in modeling large contexts, which inevitably restricts compression performance. Several recent deep neural network-based solutions have been developed to account for large contexts, but are still costly in computation. The inefficiency of the existing methods is mainly attributable to the fact that probability prediction is performed independently for neighboring symbols, whereas it can actually be conducted efficiently via shared computation. To this end, we propose a trimmed convolutional network for arithmetic encoding (TCAE) to model large contexts while maintaining computational efficiency. In trimmed convolution, the convolutional kernels are specially trimmed to respect the compression order and context dependency of the input symbols. Benefiting from trimmed convolution, the probability prediction of all symbols can be performed efficiently in one single forward pass of a fully convolutional network. Furthermore, to speed up the decoding process, a slope TCAE model is presented to divide the codes from a 3D code map into several blocks and remove the dependency between the codes within one block for parallel decoding, which can speed up the decoding process by 60x. Experiments show that our TCAE and slope TCAE attain better compression ratios in lossless gray image compression, and can be adopted in CNN-based lossy image compression to achieve state-of-the-art rate-distortion performance with real-time encoding speed. [1801.04662v2]
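Trimming a convolution so that each output depends only on symbols that precede it in raster order mirrors the masked convolutions used in autoregressive models; the sketch below implements that idea in PyTorch (the exact trimming pattern in TCAE may differ).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrimmedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at and after the current position
    in raster order, so predictions use only already-encoded symbols."""
    def __init__(self, *args, include_center=False, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + int(include_center):] = 0  # rest of center row
        mask[kh // 2 + 1:] = 0                             # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```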

 

InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity

Hee Jung Ryu, Hartwig Adam, Margaret Mitchell

We demonstrate an approach to face attribute detection that retains or improves attribute detection accuracy across gender and race subgroups by learning demographic information prior to learning the attribute detection task. The system, which we call InclusiveFaceNet, detects face attributes by transferring race and gender representations learned from a held-out dataset of public race and gender identities. Leveraging learned demographic representations while withholding demographic inference from the downstream face attribute detection task preserves potential users’ demographic privacy while resulting in some of the best reported numbers to date on attribute detection in the Faces of the World and CelebA datasets. [1712.00193v2]

 

Pixel-wise object tracking

Yilin Song, Chenge Li, Yao Wang

In this paper, we propose a novel pixel-wise visual object tracking framework that can track any anonymous object against a noisy background. The framework consists of two submodels: a global attention model and a local segmentation model. The global model generates a region of interest (ROI) in which the object may lie in the new frame, based on the past object segmentation maps, while the local model segments the new image within the ROI. Each model uses an LSTM structure to model the temporal dynamics of the motion and appearance, respectively. To circumvent the dependency of the training data between the two models, we use an iterative update strategy. Once the models are trained, there is no need to refine them to track specific objects, making our method efficient compared to online learning approaches. We demonstrate our real-time pixel-wise object tracking framework on a challenging VOT dataset. [1711.07377v2]

 

Tensor-Based Classifiers for Hyperspectral Data Analysis

Konstantinos Makantasis, Anastasios Doulamis, Nikolaos Doulamis, Antonis Nikitakis

In this work, we present tensor-based linear and nonlinear models for hyperspectral data classification and analysis. By exploiting principles of tensor algebra, we introduce new classification architectures whose weight parameters satisfy the {\it rank}-1 canonical decomposition property. We then introduce learning algorithms to train both the linear and the non-linear classifiers so as to i) minimize the error over the training samples and ii) ensure that the weight coefficients satisfy the {\it rank}-1 canonical decomposition property. The advantages of the proposed classification model are that i) it reduces the number of parameters required, and thus the number of training samples needed to properly train the model, ii) it provides a physical interpretation of the model coefficients with respect to the classification output, and iii) it retains the spatial and spectral coherency of the input samples. To address the issues related to linear classification, which is characterized by low capacity since it can only produce rules that are linear in the input space, we introduce non-linear classification models based on a modification of a feedforward neural network. We call the proposed architecture the {\it rank}-1 Feedforward Neural Network (FNN), since its weights satisfy the {\it rank}-1 canonical decomposition property. Appropriate learning algorithms are also proposed to train the network. Experimental results and comparisons with state-of-the-art classification methods, both linear (e.g., SVM) and non-linear (e.g., deep learning), indicate that the proposed scheme outperforms them, especially in cases where only a small number of training samples are available. Furthermore, the proposed tensor-based classifiers are also evaluated for their dimensionality reduction capabilities. [1709.08164v2]
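As an illustration of the rank-1 constraint, the sketch below builds a linear classifier whose weight matrix is the outer product of a spatial-mode and a spectral-mode vector, so the parameter count drops from d_spatial × d_spectral to d_spatial + d_spectral; the input layout is an assumption for exposition.

```python
import torch
import torch.nn as nn

class Rank1Linear(nn.Module):
    """Linear score with a rank-1 weight: W = outer(u, v)."""
    def __init__(self, d_spatial, d_spectral):
        super().__init__()
        self.u = nn.Parameter(torch.randn(d_spatial) * 0.01)   # spatial mode
        self.v = nn.Parameter(torch.randn(d_spectral) * 0.01)  # spectral mode
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (B, d_spatial, d_spectral) patch cubes; one score per sample
        return torch.einsum('bij,i,j->b', x, self.u, self.v) + self.b
```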

 

H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation from CT Volumes

Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, Pheng Ann Heng

Liver cancer is one of the leading causes of cancer death. To assist doctors in hepatocellular carcinoma diagnosis and treatment planning, an accurate and automatic liver and tumor segmentation method is in high demand in clinical practice. Recently, fully convolutional neural networks (FCNs), including 2D and 3D FCNs, serve as the backbone in many volumetric image segmentation tasks. However, 2D convolutions cannot fully leverage the spatial information along the third dimension, while 3D convolutions suffer from high computational cost and GPU memory consumption. To address these issues, we propose a novel hybrid densely connected UNet (H-DenseUNet), which consists of a 2D DenseUNet for efficiently extracting intra-slice features and a 3D counterpart for hierarchically aggregating volumetric contexts, in the spirit of the auto-context algorithm, for liver and tumor segmentation. We formulate the learning process of H-DenseUNet in an end-to-end manner, where the intra-slice representations and inter-slice features can be jointly optimized through a hybrid feature fusion (HFF) layer. We extensively evaluated our method on the dataset of the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge and the 3DIRCADb dataset. Our method outperformed other state-of-the-art methods on the segmentation of tumors and achieved very competitive performance for liver segmentation, even with a single model. [1709.07330v3]

 

Multi Resolution LSTM For Long Term Prediction In Neural Activity Video

Yilin Song, Jonathan Viventi, Yao Wang

Epileptic seizures are caused by abnormal, overly synchronized, electrical activity in the brain. The abnormal electrical activity manifests as waves propagating across the brain. Accurate prediction of the propagation velocity and direction of these waves could enable real-time responsive brain stimulation to suppress or prevent the seizures entirely. However, this problem is very challenging because the algorithm must be able to predict the neural signals over a sufficiently long time horizon to allow enough time for medical intervention. We consider how to accomplish long term prediction using a LSTM network. To alleviate the vanishing gradient problem, we propose two encoder-decoder-predictor structures, both using multi-resolution representation. The novel LSTM structure with multi-resolution layers can significantly outperform the single-resolution benchmark with a similar number of parameters. To overcome the blurring effect associated with video prediction in the pixel domain using the standard mean square error (MSE) loss, we use energy-based adversarial training to improve the long-term prediction. We demonstrate and analyze how a discriminative model with an encoder-decoder structure using a 3D CNN model improves long term prediction. [1705.02893v2]

 

Diversity encouraged learning of unsupervised LSTM ensemble for neural activity video prediction

Yilin Song, Jonathan Viventi, Yao Wang

Being able to predict the neural signal in the near future from the current and previous observations has the potential to enable real-time responsive brain stimulation to suppress seizures. We have investigated how to use an auto-encoder model consisting of LSTM cells for such prediction. Recognizing that there exist multiple activity pattern clusters, we have further explored training an ensemble of LSTM models so that each model can specialize in modeling certain neural activities, without explicitly clustering the training data. We train the ensemble using an ensemble-awareness loss, which jointly solves the model assignment problem and the error minimization problem. During training, for each training sequence, only the model that has the lowest reconstruction and prediction error is updated. Intrinsically, such a loss function enables each LSTM model to adapt to a subset of the training sequences that share similar dynamic behavior. We demonstrate that this can be trained in an end-to-end manner and achieves significant accuracy in neural activity prediction. [1611.04899v2]
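The winner-take-all update rule described above fits in a few lines of PyTorch; the assumption that each model returns a (reconstruction, prediction) pair, and the use of plain MSE terms, are illustrative choices rather than the paper's exact formulation.

```python
import torch

def ensemble_step(models, optimizers, seq, target):
    """One training step: update only the lowest-error ensemble member."""
    losses = []
    for m in models:
        recon, pred = m(seq)  # assumed interface: reconstruction + prediction
        losses.append(((recon - seq) ** 2).mean() + ((pred - target) ** 2).mean())
    best = min(range(len(models)), key=lambda i: losses[i].item())
    optimizers[best].zero_grad()
    losses[best].backward()   # gradients flow only into the winning model
    optimizers[best].step()
    return best, losses[best].item()
```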

随机信道解相关网络及其在视觉跟踪中的应用

Jie Guo, Tingfa Xu, Shenwang Jiang, Ziyi Shen

深度卷积神经网络(CNN)已经占据了许多计算机视觉领域,因为它们能够自动提取好的特征。然而,许多基于CNN的深度计算机视觉任务都缺乏训练数据,而深度模型中有数百万个参数。显然,这两个双相违规事实将导致许多设计不佳的深CNN的参数冗余。因此,我们深入研究现有的CNN,发现网络参数的冗余来自卷积层内不同信道的特征之间的相关性。为了解决这个问题,我们提出了随机信道去相关(SCD)块,其在每次迭代中随机选择卷积层内的多对信道并计算它们的归一化交叉相关(NCC)。然后提出平方最大边际损失作为SCD的目标,以抑制相关性并明确地保持通道之间的多样性。所提出的SCD非常灵活,可以简单地应用于任何现有的CNN模型。基于SCD和全卷积连体网络,我们提出了一种可视化跟踪算法来验证SCD的有效性。[1807.01103v1]

 

用于使用GAN将语义信息传递到看不见的天气条件的模块化车辆控制

Patrick WenzelQadeer KhanDaniel CremersLaura Leal-Taixé

端到端监督学习已经为自动驾驶汽车带来了可喜的结果,特别是在训练条件下。但是,在不可见的条件下,它可能不一定表现良好。在本文中,我们演示了如何将知识从一个可用语义标签和转向命令的天气条件转移到一组我们无法访问标记数据的全新条件。通过将车辆控制的任务划分为独立的感知和控制模块来解决该问题,使得改变一个不影响另一个。我们仅根据可用条件的数据训练控制模块,即使在新条件下也能保持固定。然后将感知模块用作新天气条件和该控制模型之间的接口。反过来使用语义标签训练感知模块,我们假设这些语义标签已经可用于训练控制模型的相同天气条件。但是,为其他条件获取它们是一个单调乏味且容易出错的过程。因此,我们建议使用基于生成对抗网络(GAN)的模型以无监督的方式检索新条件的语义信息。我们引入了一个主服务器架构,其中主模型(可用的语义标签)训练服务方模型(语义标签不可用)。然后可以使用仆人模型来转向车辆而无需重新训练控制模块。[1807.01001v1] 为其他条件获取它们是一个单调乏味且容易出错的过程。因此,我们建议使用基于生成对抗网络(GAN)的模型以无监督的方式检索新条件的语义信息。我们引入了一个主服务器架构,其中主模型(可用的语义标签)训练服务方模型(语义标签不可用)。然后可以使用仆人模型来转向车辆而无需重新训练控制模块。[1807.01001v1] 为其他条件获取它们是一个单调乏味且容易出错的过程。因此,我们建议使用基于生成对抗网络(GAN)的模型以无监督的方式检索新条件的语义信息。我们引入了一个主服务器架构,其中主模型(可用的语义标签)训练服务方模型(语义标签不可用)。然后可以使用仆人模型来转向车辆而无需重新训练控制模块。[1807.01001v1] 其中主模型(可用的语义标签)训练仆人模型(语义标签不可用)。然后可以使用仆人模型来转向车辆而无需重新训练控制模块。[1807.01001v1] 其中主模型(可用的语义标签)训练仆人模型(语义标签不可用)。然后可以使用仆人模型来转向车辆而无需重新训练控制模块。[1807.01001v1]

 

监督注册:一种无监督的方法来提高面部地标检测器的精度

Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, Yaser Sheikh

在本文中,我们提出了一种注册监督,一种无监督的方法来提高图像和视频上面部标志检测器的精度。我们的关键观察是相邻帧中相同地标的检测应该与配准(即光流)一致。有趣的是,光流的一致性是监督的来源,不需要手动标记,并且可以在探测器训练期间使用。例如,我们可以在训练损失函数中强制执行在帧$ _ {t-1} $处检测到的地标,然后从帧$ _ {t-1} $到帧$ _t $的光流跟踪应该与位置重合帧$ _t $的检测。从本质上讲,登记监督增加了登记损失的培训损失功能,因此,训练探测器的输出不仅接近标记图像中的注释,而且与大量未标记视频上的注册一致。通过可微分的Lucas-Kanade操作实现具有配准损失的端到端训练,该操作计算前向通道中的光流配准,并且反向传播促进检测器中的时间一致性的梯度。我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 但也与大量未标记视频的注册一致。通过可微分的Lucas-Kanade操作实现具有配准损失的端到端训练,该操作计算前向通道中的光流配准,并且反向传播促进检测器中的时间一致性的梯度。我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 但也与大量未标记视频的注册一致。通过可微分的Lucas-Kanade操作实现具有配准损失的端到端训练,该操作计算前向通道中的光流配准,并且反向传播促进检测器中的时间一致性的梯度。我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 通过可微分的Lucas-Kanade操作实现具有配准损失的端到端训练,该操作计算前向通道中的光流配准,并且反向传播促进检测器中的时间一致性的梯度。我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 通过可微分的Lucas-Kanade操作实现具有配准损失的端到端训练,该操作计算前向通道中的光流配准,并且反向传播促进检测器中的时间一致性的梯度。我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1] 我们的方法的输出是更精确的基于图像的面部标志检测器,其可以应用于单个图像或视频。通过注册监督,我们证明了(1)在图像(300WALFW)和视频(300VWYoutube-Celebrities)上的面部地标检测的改进,以及(2)视频检测中的抖动的显着减少。[1807.00966v1]

 

A Weakly Supervised Adaptive DenseNet for Classifying Thoracic Diseases and Identifying Abnormalities

Bo Zhou, Yuemeng Li, Jiangcong Wang

We present a weakly supervised deep learning model for classifying diseases and identifying abnormalities from medical imaging data. In this work, instead of learning from region-level annotations, our model is trained on imaging data with only image-level labels to classify diseases, and is able to identify abnormal image regions at the same time. Our model consists of a customized pooling structure and an adaptive DenseNet front-end, which can effectively recognize possible disease features for both the classification and localization tasks. Our method has been validated on the publicly available ChestX-ray14 dataset. Experimental results show that our classification and localization predictions improve significantly over previous models on the ChestX-ray14 dataset. In summary, our network can produce accurate disease classification and localization, which may assist clinical decision making. [1807.01257v1]

 

MetaAnchor: Learning to Detect Objects with Customized Anchors

Tong Yang, Xiangyu Zhang, Wenqiang Zhang, Jian Sun

We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding-box distributions; in addition, it also shows potential on transfer tasks. Our experiments on the COCO detection task show that MetaAnchor consistently outperforms its counterparts in various scenarios. [1807.00980v1]
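
To make the weight-prediction idea concrete, here is a hedged sketch in the spirit of MetaAnchor: a tiny MLP maps an arbitrary prior box (w, h) to the parameters of a 1x1 convolution that scores that anchor at every position. The layer sizes, log-scale input, and single objectness output are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFunctionGenerator(nn.Module):
    """Generates a per-anchor scoring function from a customized prior box."""
    def __init__(self, feat_ch, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_ch + 1),  # feat_ch conv weights + 1 bias
        )
        self.feat_ch = feat_ch

    def forward(self, feat, prior_boxes):
        """feat: [B, C, H, W]; prior_boxes: [A, 2] customized (w, h) priors.
        Returns objectness logits [B, A, H, W], one map per prior box."""
        params = self.mlp(torch.log(prior_boxes))      # log-scale is a common choice
        w = params[:, :self.feat_ch].view(-1, self.feat_ch, 1, 1)
        b = params[:, self.feat_ch]
        return F.conv2d(feat, w, bias=b)               # dynamically generated 1x1 conv

# usage: scores for three customized priors on a dummy feature map
gen = AnchorFunctionGenerator(feat_ch=256)
feat = torch.randn(2, 256, 32, 32)
priors = torch.tensor([[32., 32.], [64., 32.], [32., 64.]])
print(gen(feat, priors).shape)  # torch.Size([2, 3, 32, 32])
```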

 

Semi-supervised Anomaly Detection using GANs for Visual Inspection in Noisy Training Data

Masanari Kimura, Takashi Yanagihara

The detection and quantification of anomalies in image data is a critical task in industrial scenarios, such as detecting micro-scratches on products. In recent years, due to the difficulty of defining anomalies and the limited availability of correct labels, research on unsupervised anomaly detection using generative models has attracted attention. Generally, in those studies only normal images are used for training, in order to model the distribution of normal images. The model measures the anomalies in a target image by reproducing the most similar image and scoring image patches, indicating how well they fit the learned distribution. This approach is based on the strong presumption that the trained model should not be able to generate abnormal images. In practice, however, the model does generate abnormal images, mainly due to noisy normal data that includes small abnormal pixels, and such noise severely affects the accuracy of the model. Therefore, we propose a novel semi-supervised method to warp the model's distribution away from existing abnormal images. The proposed method detects pixel-level micro-anomalies with high accuracy from 1024×1024 high-resolution images that are actually used in industrial scenarios. In this paper, owing to the confidentiality of the data, we share experimental results on an open dataset. [1807.01136v1]

 

Long Activity Video Understanding using Functional Object-Oriented Networks

Ahmad Babaeian Jelodar, David Paul, Yu Sun

Video understanding is one of the most challenging topics in computer vision. In this paper, a four-stage video understanding pipeline is presented to simultaneously recognize all atomic actions and the single ongoing activity in a video. The pipeline uses objects and motions from the video, and a graph-based knowledge representation network as prior reference. Two deep networks are trained to identify the objects and actions in each video sequence associated with an action. Low-level image features are then used to identify the objects of interest in that video sequence. Confidence scores are assigned to objects of interest based on the action classes they participate in and on the results of a deep neural network that classifies the ongoing action in the video into action classes. Using the knowledge representation network, the object confidences, and the motion confidences, a confidence score is computed for each candidate functional unit associated with an action. Each action is therefore associated with a functional unit, and the sequence of actions is further evaluated to identify the single ongoing activity in the video. The knowledge representation used in the pipeline is called the functional object-oriented network, which is a graph-based network useful for encoding knowledge about manipulation tasks. Experiments are performed on a dataset of cooking videos to test the proposed algorithm on action inference and activity classification. The experiments show that the use of the functional object-oriented network significantly improves video understanding. [1807.00983v1]

 

HAMLET: Hierarchical Harmonic Filters for Learning Tracts from Diffusion MRI

Marco Reisert, Volker A. Coenen, Christoph Kaller, Karl Egger, Henrik Skibbe

In this work we present HAMLET, a novel tract-learning algorithm that, after training, maps raw diffusion-weighted MRI directly onto images that simultaneously indicate tract direction and tract presence. The automatic learning of fiber bundles from diffusion MRI data is a rather new idea that tries to overcome the limitations of atlas-based techniques. HAMLET takes such an approach. Unlike the current trend in machine learning, HAMLET has only a small number of free parameters. HAMLET is based on spherical tensor algebra, which allows a translation- and rotation-covariant treatment of the problem. HAMLET is built on repeated applications of convolutions and nonlinearities, both of which respect rotational covariance. This intrinsic handling of basic image transformations allows the training and generalization of the algorithm without any additional data augmentation. We demonstrate the performance of our approach on twelve prominent bundles and show that the obtained tract estimates are robust and reliable. It is also shown that the learned models can be transferred from one scan sequence to another. [1807.01068v1]

 

Deep Architectures and Ensembles for Semantic Video Classification

Eng-Jon Ong, Sameed Husain, Mikel Bober, Miroslaw Bober

This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture, with state-of-the-art classification performance at significantly reduced complexity. Furthermore, we propose four new approaches to diversity-driven multi-network ensembling, one based on fast correlation measures and another employing a DNN-based combiner. We show that significant performance gains can be achieved through a clever ensembling of diverse networks, and we investigate the factors that lead to high diversity. Based on the extensive YouTube-8M dataset, we provide a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature-space aggregation (FV, VLAD, BoW), simple statistical aggregation, and mid-stage AV fusion, among others, and present, for the first time, an in-depth evaluation and analysis of their behaviour. [1807.01026v1]

 

A Spatial-Temporal Feature Mixing Model with Body Parts for Video-based Person Re-identification

Jie Liu, Cheng Sun, Xiang Xu, Baomin Xu, Shuangyuan Yu

Video-based person re-identification is to identify people across different cameras, which is a key task for applications in visual surveillance systems. Most previous methods mainly focus on the features of the whole body in the frames. In this paper, we propose a novel spatial-temporal feature mixing model (STFMM) based on convolutional neural networks (CNN) and recurrent neural networks (RNN), in which the human body is split into $N$ parts in the horizontal direction so that more specific features can be obtained. The proposed method skillfully integrates the features of each part to achieve a more expressive representation of each person. We first split the video sequence into $N$ part sequences, which carry information about the head, waist, legs, and so on. The features are then extracted by the STFMM, whose $2N$ inputs are obtained from the developed Siamese network, and these features are combined into a discriminative representation for one person. Experiments are conducted on the iLIDS-VID and PRID-2011 datasets. The results demonstrate that our approach outperforms existing methods for video-based person re-identification. It achieves a rank-1 CMC accuracy of 74% on the iLIDS-VID dataset, exceeding the recently developed ASTPN method by 12%. For cross-data testing, our method achieves a rank-1 CMC accuracy 18% higher than the ASTPN method, which shows that our model has significant stability. [1807.00975v1]

 

MediaEval 2018: Predicting Media Memorability Task

Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, Qing-Toan Do

In this paper, we present the Predicting Media Memorability task, which is proposed as part of the MediaEval 2018 Benchmarking Initiative for Multimedia Evaluation. Participants are expected to design systems that automatically predict memorability scores for videos, which reflect the probability that a video will be remembered. In contrast to previous work on image memorability prediction, where memorability was measured a few minutes after memorization, the proposed dataset comes with short-term and long-term memorability annotations. All task characteristics are described, namely: the task's challenges and breakthrough, the released dataset and ground truth, the required participant runs, and the evaluation metrics. [1807.01052v1]

 

Resembled Generative Adversarial Networks: Two Domains with Similar Attributes

Duhyeon Bang, Hyunjung Shim

We propose a novel algorithm, Resembled Generative Adversarial Networks (GAN), that generates data from two different domains simultaneously, where the domains resemble each other. Although recent GAN algorithms have achieved great success in learning cross-domain relations, their application is limited to domain transfer, which requires an input image. CoGAN presented the first attempt at the data generation problem for two domains. However, its solution is inherently vulnerable to varying levels of domain similarity. Unlike CoGAN, our Resembled GAN implicitly induces two generators to match the feature covariance from both domains, leading to shared semantic attributes. Hence, we effectively handle a wide range of structural and semantic similarities between the two domains. Based on experimental analysis on various datasets, we verify that the proposed algorithm is effective for generating two domains with similar attributes. [1807.00947v1]
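
The covariance-matching idea can be sketched compactly. Below is a minimal, hedged version of such a penalty between generator features from the two domains; the layer choice and loss weighting are assumptions, not the paper's exact formulation:

```python
import torch

def covariance_matching_loss(f1, f2):
    """Penalize the distance between the feature covariance matrices of the
    two domains. f1, f2: feature batches [N, D] taken from some layer of
    each generator."""
    def cov(f):
        f = f - f.mean(dim=0, keepdim=True)
        return f.t() @ f / (f.size(0) - 1)             # [D, D] sample covariance
    return (cov(f1) - cov(f2)).pow(2).mean()

# usage with dummy generator features from the two domains
loss = covariance_matching_loss(torch.randn(64, 128), torch.randn(64, 128))
```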

 

SpaceNet: A Remote Sensing Dataset and Challenge Series

Adam Van Etten, Dave Lindenbaum, Todd M. Bacastow

Foundational mapping remains a challenge in many parts of the world, particularly in dynamic scenarios such as natural disasters where timely updates are critical. Updating maps is currently a highly manual process requiring a large number of human labelers to either create features or rigorously validate automated outputs. We propose that the frequent revisits of earth-imaging satellite constellations can accelerate existing efforts, when combined with advanced machine learning techniques, to quickly update foundational maps. To this end, the SpaceNet partners (CosmiQ Works, Radiant Solutions, NVIDIA) released a large corpus of labeled satellite imagery on Amazon Web Services (AWS) called SpaceNet. The SpaceNet partners also launched a series of public prize competitions to encourage improvement of remote-sensing machine learning algorithms. The first two competitions focused on automated building footprint extraction, and the most recent challenge focused on road network extraction. In this paper we discuss the SpaceNet imagery, labels, evaluation metrics, and prize challenge results to date. [1807.01232v1]

 

Local Gradients Smoothing: Defense against Localized Adversarial Attacks

Muzammal Naseer, Salman Khan, Fatih Porikli

Deep neural networks (DNNs) have shown vulnerability to adversarial attacks, i.e., carefully perturbed inputs designed to mislead the network at inference time. Recently introduced localized attacks, LaVAN and Adversarial Patch, pose a new challenge to deep learning security by adding adversarial noise only within a specific region without affecting the salient objects in an image. Driven by the observation that such attacks introduce concentrated high-frequency changes at a particular image location, we have developed an effective method to estimate the noise location in the gradient domain and transform those high-activation regions caused by adversarial noise in the image domain, while having minimal effect on the salient objects that are important for correct classification. Our proposed Local Gradients Smoothing (LGS) scheme achieves this by regularizing gradients in the estimated noisy regions before feeding the image to a DNN for inference. We have shown the effectiveness of our method in comparison to other defenses, including JPEG compression, Total Variance Minimization (TVM), and feature squeezing, on the ImageNet dataset. In addition, we systematically study the robustness of the proposed defense mechanism against Backward Pass Differentiable Approximation (BPDA), a state-of-the-art technique recently developed to break defenses that transform an input sample to minimize the adversarial effect. Compared to other defense mechanisms, LGS is by far the most resistant to BPDA in the localized adversarial attack setting. [1807.01216v1]
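
A simplified sketch of the pipeline is below. The paper smooths (scales down) the image gradients inside the estimated regions; this sketch approximates that suppression with local averaging, and the block size and threshold are illustrative choices:

```python
import torch
import torch.nn.functional as F

def local_gradients_smoothing(img, thresh=0.5, block=16):
    """img: [B, 1, H, W] grayscale in [0, 1]. 1) estimate first-order gradient
    magnitude, 2) find blocks with unusually high gradient density (suspected
    patch location), 3) suppress high-frequency content inside those blocks
    before the image is fed to the classifier."""
    kx = torch.tensor([[[[-1., 1.]]]])                 # horizontal difference
    ky = torch.tensor([[[[-1.], [1.]]]])               # vertical difference
    gx = F.conv2d(img, kx, padding=(0, 1))[..., :, : img.size(-1)]
    gy = F.conv2d(img, ky, padding=(1, 0))[..., : img.size(-2), :]
    mag = (gx ** 2 + gy ** 2).sqrt()
    mag = mag / (mag.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    dens = F.avg_pool2d(mag, block)                    # per-block gradient density
    mask = (dens > thresh).float()
    mask = F.interpolate(mask, size=img.shape[-2:], mode="nearest")
    blurred = F.avg_pool2d(img, 5, stride=1, padding=2)  # cheap local smoothing
    return img * (1 - mask) + blurred * mask
```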

 

ReCoNet: Real-time Coherent Video Style Transfer Network

Chang Gao, Derun Gu, Fangjun Zhang, Yizhou Yu

Image style transfer models based on convolutional neural networks typically suffer from high temporal inconsistency when applied to videos. Some video style transfer models have been proposed to improve temporal consistency, yet they fail to guarantee fast processing speed, nice perceptual style quality, and high temporal consistency at the same time. In this paper, we propose ReCoNet, a novel real-time video style transfer model that can generate temporally coherent style-transfer videos while maintaining favorable perceptual style. A novel luminance warping constraint is added to the temporal loss at the output level to capture luminance changes between consecutive frames and increase stylization stability under illumination effects. We also propose a novel feature-map-level temporal loss to further enhance temporal consistency on traceable objects. Experimental results indicate that our model exhibits outstanding performance both qualitatively and quantitatively. [1807.01197v1]
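
To illustrate what a luminance-aware output-level temporal loss can look like, here is a hedged sketch. The warping with ground-truth flow is assumed done elsewhere, the Rec. 601 luminance coefficients are a standard choice, and the exact formulation in the paper may differ:

```python
import torch

def output_temporal_loss(o_prev_warped, o_curr, i_prev_warped, i_curr, mask):
    """o_*: stylized frames [B, 3, H, W] (previous frame warped to current);
    i_*: input frames warped with the same ground-truth flow;
    mask: [B, 1, H, W] flow-validity (occlusion) mask."""
    w = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    # luminance change of the *input* between frames, attributed to lighting
    lum_change = ((i_curr - i_prev_warped) * w).sum(dim=1, keepdim=True)
    # the stylized output is allowed to change by the same luminance amount
    diff = (o_curr - o_prev_warped) - lum_change
    return (mask * diff.pow(2)).mean()
```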

 

Viewpoint Estimation: Insights and Model

Gilad Divon, Ayellet Tal

This paper addresses the problem of viewpoint estimation of an object in a given image. It presents five key insights that should be taken into account when designing a CNN that solves the problem. Based on these insights, the paper proposes a network in which (i) the architecture jointly solves detection, classification, and viewpoint estimation; (ii) new types of data are added and trained on; (iii) a novel loss function, which takes into account both the geometry of the problem and the new types of data, is proposed. Our network improves the state-of-the-art results for this problem by 9.8%. [1807.01312v1]

 

Getting the Subtext without the Text: Scalable Multimodal Sentiment Classification from Visual and Acoustic Modalities

Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, Walter J. Scheirer

In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquitousness of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data. We select high-level features for our model that have been successful in non-affect domains in order to test their generalizability in the sentiment detection domain. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out challenge test set. [1807.01122v1]

 

Iterative Attention Mining for Weakly Supervised Thoracic Disease Pattern Localization in Chest X-Rays

Jinzheng Cai, Le Lu, Adam P. Harrison, Xiaoshuang Shi, Pingjun Chen, Lin Yang

Given image labels as the only supervisory signal, we focus on harvesting, or mining, thoracic disease localizations from chest X-ray images. Harvesting such localizations from existing datasets allows for the creation of improved data sources for computer-aided diagnosis and retrospective analyses. We train a convolutional neural network (CNN) for image classification and propose an attention mining (AM) strategy to improve the model's sensitivity or saliency to disease patterns. The intuition of AM is that once the most salient disease area is blocked or hidden from the CNN model, it will pay attention to alternative image regions while still attempting to make correct predictions. However, the model requires proper constraint during AM, otherwise it may overfit to irrelevant image parts and forget the valuable knowledge it has learned from the original image classification task. To alleviate such side effects, we design a knowledge preservation (KP) loss, which minimizes the discrepancy between the responses for X-ray images from the original and the updated networks. Furthermore, we modify the CNN model to include multi-scale aggregation (MSA), improving its localization ability on small-scale disease findings, e.g., lung nodules. We experimentally validate our method on the publicly available ChestX-ray14 dataset, outperforming a class activation map (CAM)-based approach, and demonstrate the value of our novel framework for mining disease locations. [1807.00958v1]
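
The knowledge preservation term is simple enough to sketch. Below is a minimal, hedged version; the use of sigmoid outputs and MSE as the discrepancy measure is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def knowledge_preservation_loss(model_new, model_old, x):
    """While attention mining updates the network on masked images, keep its
    responses on the original X-ray close to those of the frozen, original
    network. model_old is a frozen copy made before AM starts."""
    with torch.no_grad():
        ref = torch.sigmoid(model_old(x))     # responses of the original network
    cur = torch.sigmoid(model_new(x))         # responses of the updated network
    return F.mse_loss(cur, ref)
```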

 

Who Did What at Where and When: Simultaneous Multi-Person Tracking and Activity Recognition

Wenbo Li, Ming-Ching Chang, Siwei Lyu

We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at the individual, interaction, and social group activity levels. The inference consists of identifying the trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems via multiple stages: (1) activity-aware tracking, (2) joint interaction recognition and occlusion recovery, and (3) collective activity recognition. We solve the where and when problem of visual tracking, as well as the who and what problem of recognition. The higher-order correlations among visible and occluded individuals, pairwise interactions, groups, and activities are then solved using a hypergraph formulation within a Bayesian framework. Experiments on several benchmarks show that our method outperforms state-of-the-art approaches. [1807.01253v1]

 

Ballistocardiogram Signal Processing: A Literature Review

Ibrahim Sadek

Time-domain algorithms focus on detecting local maxima or local minima using a moving window, and therefore on finding the interval between the dominant J-peaks of the ballistocardiogram (BCG) signal. However, this approach has many limitations owing to the nonlinear and nonstationary behavior of the BCG signal. This is because the BCG signal does not display consistent J-peaks, which can often be the case for overnight, in-home monitoring, particularly with frail elderly subjects. Additionally, its accuracy will unavoidably be affected by motion artifacts. Second, frequency-domain algorithms do not provide information about interbeat intervals; nevertheless, they can provide information about heart rate variability. This is usually done by taking the fast Fourier transform, or the inverse Fourier transform of the logarithm of the estimated spectrum, i.e., the cepstrum of the signal, using a sliding window. Thereafter, the dominant frequency is obtained in a particular frequency range. The limitation of these algorithms is that the peaks in the spectrum may get wider and multiple peaks may appear, which might cause problems in measuring vital signs. Finally, the objective of wavelet-domain algorithms is to decompose the signal into different components, such that the component that corresponds to the vital signs can be selected, i.e., the selected component contains only information about the cardiac cycle or the respiratory cycle. Empirical mode decomposition is an alternative to wavelet decomposition, and it is also a very suitable approach for dealing with nonlinear and nonstationary signals such as cardiorespiratory signals. Apart from the above algorithms, machine learning approaches have been implemented for measuring heartbeats. However, the manual labeling of training data is a restrictive property. [1807.00951v1]
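
As a rough illustration of the time-domain approach described above, here is a minimal NumPy sketch of moving-window J-peak detection and interbeat-interval extraction. The window length, amplitude threshold, and refractory rule are illustrative choices, not values from the reviewed literature:

```python
import numpy as np

def j_peaks_and_intervals(bcg, fs, min_rr=0.4):
    """bcg: 1-D BCG signal; fs: sampling rate (Hz); min_rr: minimum plausible
    interbeat interval in seconds (an illustrative physiological bound)."""
    win = int(min_rr * fs)
    peaks = []
    for n in range(win, len(bcg) - win):
        seg = bcg[n - win : n + win + 1]
        # local maximum of the moving window that is also prominently large
        if bcg[n] == seg.max() and bcg[n] > bcg.mean() + bcg.std():
            if not peaks or n - peaks[-1] >= win:      # enforce refractory period
                peaks.append(n)
    peaks = np.asarray(peaks)
    rr = np.diff(peaks) / fs                           # J-J intervals in seconds
    return peaks, rr                                   # heart rate ~ 60 / rr.mean()
```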

 

Kitting in the Wild through Online Domain Adaptation

Massimiliano Mancini, Hakan Karaoguz, Elisa Ricci, Patric Jensfelt, Barbara Caputo

Technological developments call for increasing perception and action capabilities of robots. Among other skills, vision systems that can adapt to any possible change in the working conditions are needed. Since these conditions are unpredictable, we need benchmarks that allow us to assess the generalization and robustness capabilities of our visual recognition algorithms. In this work we focus on robotic kitting in unconstrained scenarios. As a first contribution, we present a new visual dataset for the kitting task. Differently from standard object recognition datasets, we provide images of the same objects acquired under various conditions where the camera, illumination, and background are changed. This novel dataset allows for testing the robustness of robot visual recognition algorithms to a series of different domain shifts, both in isolation and unified. Our second contribution is a novel online adaptation algorithm for deep models, based on batch-normalization layers, which allows us to continuously adapt a model to the current working conditions. Differently from standard domain adaptation algorithms, it does not require any image from the target domain at training time. We benchmark the performance of the algorithm on the proposed dataset, showing its capability to fill the gap between the performance of a standard architecture and its counterpart adapted offline to the given target domain. [1807.01028v1]
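
A hedged sketch of the core mechanism, adapting batch-norm statistics online from the unlabeled target stream, is below. The momentum value and the update schedule are illustrative assumptions; the paper's algorithm builds on batch-normalization layers but its exact update rule may differ:

```python
import torch

@torch.no_grad()
def adapt_batchnorm_online(model, frame_batch):
    """Run unlabeled target-domain frames through the network in train() mode
    so that every BatchNorm layer updates its running mean/variance toward
    the current working conditions. No labels, no gradient steps."""
    was_training = model.training
    model.train()                          # BN uses and updates batch statistics
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.momentum = 0.1               # how fast statistics track the stream
    model(frame_batch)                     # forward pass only: updates BN stats
    model.train(was_training)
```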

 

SymmNet: A Symmetric Convolutional Neural Network for Occlusion Detection

Ang Li, Zejian Yuan

Detecting occlusion from stereo images or video frames is important for many computer vision applications. Previous efforts focused on bundling it with the computation of disparity or optical flow, leading to a chicken-and-egg problem. In this paper, we leverage convolutional neural networks to liberate the occlusion detection task from the interleaved, traditional calculation framework. We propose a Symmetric Network (SymmNet) to directly exploit information from an image pair, without estimating disparity or motion in advance. The proposed network is structurally left-right symmetric to learn the binocular occlusion simultaneously, aimed at jointly improving both results. Extensive experiments show that our model achieves state-of-the-art results on detecting stereo and motion occlusion. [1807.00959v1]

 

Semantic Segmentation with Scarce Data

Isay Katsman, Rohun Tripathi, Andreas Veit, Serge Belongie

Semantic segmentation is a challenging vision problem that usually necessitates the collection of large amounts of finely annotated data, which is often quite expensive to obtain. Coarsely annotated data provides an interesting alternative, as it is usually substantially cheaper. In this work, we present a method to leverage coarsely annotated data along with fine supervision to produce better segmentation results than would be obtained when training using only the fine data. We validate our approach by simulating a scarce data setting with fewer than 200 low-resolution images from the Cityscapes dataset and show that our method substantially outperforms solely training on the fine-annotated data by an average of 15.52 mIoU, and outperforms the coarse mask by an average of 5.28 mIoU. [1807.00911v1]

 

Recurrent-OctoMap: Learning State-based Map Refinement for Long-Term Semantic Mapping with 3D LiDAR Data

Li Sun, Zhi Yan, Anestis Zaganidis, Cheng Zhao, Tom Duckett

This paper presents a novel semantic mapping approach, Recurrent-OctoMap, learned from long-term 3D LiDAR data. Most existing semantic mapping approaches focus on improving the semantic understanding of single frames, rather than the 3D refinement of semantic maps (i.e., fusing semantic observations). The most widely used approach for 3D semantic map refinement is Bayesian update, which fuses the consecutive predictive probabilities following a Markov-chain model. Instead, we propose a learning approach to fuse the semantic features, rather than simply fusing the predictions of a classifier. In our approach, we represent and maintain our 3D map as an OctoMap, and model each cell as a recurrent neural network (RNN), to obtain a Recurrent-OctoMap. In this case, the semantic mapping process can be formulated as a sequence-to-sequence encoding-decoding problem. Moreover, in order to extend the duration of observations in our Recurrent-OctoMap, we developed a robust 3D localization and mapping system for successively mapping a dynamic environment using more than two weeks of data, and the system can be trained and deployed with arbitrary memory length. We validate our approach on the ETH long-term 3D LiDAR dataset [1]. The experimental results show that our proposed approach outperforms the conventional Bayesian update approach. [1807.00925v1]

 

Model-based Hand Pose Estimation for Generalized Hand Shape with Appearance Normalization

Jan Wöhlke, Shile Li, Dongheui Lee

Since the emergence of large annotated datasets, state-of-the-art hand pose estimation methods have been mostly based on discriminative learning. Recently, hybrid approaches have embedded a kinematic layer into the deep learning structure such that the pose estimates obey the physical constraints of human hand kinematics. However, the existing approaches rely on a single person's hand shape parameters, which are fixed constants. Therefore, the existing hybrid methods have problems generalizing to new, unseen hands. In this work, we extend the kinematic layer to make the hand shape parameters learnable. In this way, the learned network can generalize to arbitrary hand shapes. Furthermore, inspired by the spatial transformer network, we apply a cascade of appearance normalization networks to decrease the variance in the input data. The input images are shifted, rotated, and globally scaled to a similar appearance. The effectiveness and limitations of our proposed approach are extensively evaluated on the Hands 2017 challenge dataset and the NYU dataset. [1807.00898v1]

 

Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera

Fangchang Ma, Guilherme Venturelli Cavalheiro, Sertac Karaman

Depth completion, the technique of estimating a dense depth image from sparse depth measurements, has a variety of applications in robotics and autonomous driving. However, depth completion faces three main challenges: the irregularly spaced pattern in the sparse depth input, the difficulty in handling multiple sensor modalities (when color images are available), and the lack of dense, pixel-level ground-truth depth labels. In this work, we address all these challenges. Specifically, we develop a deep regression model to learn a direct mapping from sparse depth (and color images) to dense depth. We also propose a self-supervised training framework that requires only sequences of color and sparse depth images, without the need for dense depth labels. Our experiments demonstrate that our network, when trained with semi-dense annotations, attains state-of-the-art accuracy and is the winning approach on the KITTI depth completion benchmark at the time of submission. Furthermore, the self-supervised framework outperforms a number of existing solutions trained with semi-dense annotations. [1807.00275v2]
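
To make the self-supervised signals concrete, here is a hedged sketch of two loss terms consistent with the description above: agreement with the sparse LiDAR input where it is valid, plus photometric consistency with a neighboring frame warped into the current view. The warping (using the predicted depth and an estimated pose) is assumed done elsewhere, and the loss weights are illustrative:

```python
import torch
import torch.nn.functional as F

def self_supervised_losses(pred_depth, sparse_depth, curr_rgb, warped_rgb):
    """pred_depth: [B, 1, H, W] network output; sparse_depth: [B, 1, H, W]
    LiDAR input with zeros at missing pixels; curr_rgb: current frame;
    warped_rgb: a nearby frame warped into the current view."""
    valid = (sparse_depth > 0).float()                 # LiDAR hits only
    depth_loss = (valid * (pred_depth - sparse_depth).abs()).sum() / (valid.sum() + 1e-8)
    photo_loss = F.l1_loss(warped_rgb, curr_rgb)       # photometric consistency
    return depth_loss + 0.5 * photo_loss
```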

 

Differentiable Learning-to-Normalize via Switchable Normalization

Ping Luo, Jiamin Ren, Zhanglin Peng

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different operations for different normalization layers of a deep neural network (DNN). SN switches among three distinct scopes to compute statistics (means and variances), including channel-wise, layer-wise, and minibatch-wise, by learning their importance weights in an end-to-end manner. SN has several good properties. First, it adapts to various network architectures and tasks (see Fig. 1). Second, it is robust to a wide range of batch sizes, maintaining high performance when small minibatches are presented (e.g., 2 images/GPU). Third, SN treats all channels as a group, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging problems, such as image classification in ImageNet, object detection and segmentation in COCO, artistic image stylization, and neural architecture search. We hope SN will help ease the usage and understanding of normalization techniques in deep learning. The code of SN will be made available at https://github.com/switchablenorms/. [1806.10779v3]
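
The switching mechanism can be sketched in a few lines. Below is a minimal, hedged PyTorch module that mixes instance-, layer-, and batch-wise statistics with softmax-learned importance weights; it omits the running statistics needed for inference, which the released implementation handles:

```python
import torch
import torch.nn as nn

class SwitchableNorm2d(nn.Module):
    """Minimal sketch of Switchable Normalization for [B, C, H, W] inputs."""
    def __init__(self, ch, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, ch, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, ch, 1, 1))
        self.mean_w = nn.Parameter(torch.ones(3))      # IN, LN, BN importances
        self.var_w = nn.Parameter(torch.ones(3))
        self.eps = eps

    def forward(self, x):
        m_in = x.mean((2, 3), keepdim=True);   v_in = x.var((2, 3), keepdim=True)
        m_ln = x.mean((1, 2, 3), keepdim=True); v_ln = x.var((1, 2, 3), keepdim=True)
        m_bn = x.mean((0, 2, 3), keepdim=True); v_bn = x.var((0, 2, 3), keepdim=True)
        mw = torch.softmax(self.mean_w, 0)             # learned importance weights
        vw = torch.softmax(self.var_w, 0)
        mean = mw[0] * m_in + mw[1] * m_ln + mw[2] * m_bn
        var = vw[0] * v_in + vw[1] * v_ln + vw[2] * v_bn
        return self.weight * (x - mean) / (var + self.eps).sqrt() + self.bias
```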

 

Compact Deep Neural Networks for Computationally Efficient Gesture Classification from Electromyography Signals

Adam Hartwell, Visakan Kadirkamanathan, Sean R Anderson

Machine learning classifiers using surface electromyography are important for human-machine interfacing and device control. Conventional classifiers such as support vector machines (SVMs) use manually extracted features based on, e.g., wavelets. These features tend to be fixed and non-person-specific, which is a key limitation due to the high person-to-person variability of electromyography signals. Deep neural networks, by contrast, can automatically extract person-specific features, an important advantage. However, deep neural networks typically have the drawback of a large number of parameters, requiring large training datasets and powerful hardware not suited to embedded systems. This paper addresses these problems by introducing a compact deep neural network architecture that is much smaller than existing counterparts. The performance of the compact deep net is benchmarked against an SVM and compared to other contemporary architectures across 10 human subjects, comparing Myo and Delsys Trigno electrode sets. The accuracy of the compact deep net was 84.2 +/- 0.06% versus 70.5 +/- 0.07% for the SVM on the Myo, and 80.3 +/- 0.07% versus 67.8 +/- 0.09% on the Delsys system, demonstrating the superior effectiveness of the proposed compact network, which had just 5,889 parameters, orders of magnitude fewer than some contemporary alternatives in this domain while maintaining better performance. [1806.08641v2]

 

Fully Convolutional Networks and Generative Adversarial Networks Applied to Sclera Segmentation

Diego R. Lucio, Rayson Laroca, Evair Severo, Alceu S. Britto Jr., David Menotti

Due to the world's demand for security systems, biometrics can be seen as an important topic of research in computer vision. One of the biometric forms that has been gaining attention is recognition based on the sclera. The initial and most important step for such recognition is the segmentation of the region of interest, i.e., the sclera. In this context, two approaches for this task, based on Fully Convolutional Networks (FCN) and Generative Adversarial Networks (GAN), are introduced in this work. An FCN is similar to a common convolutional neural network, but the fully connected layers (i.e., the classification layers) are removed from the end of the network, and the output is generated by combining the outputs of pooling layers from different convolutional layers. A GAN is based on game theory, where we have two networks competing with each other to generate the best segmentation. In order to perform a fair comparison with baselines, and a quantitative and objective evaluation of the proposed approaches, we provide to the scientific community 1,300 new manually segmented images from two databases. Experiments were performed on the UBIRIS.v2 and MICHE databases, and the best performing configurations of our propositions achieved F-score measures of 87.48% and 88.32%, respectively. [1806.08722v2]

 

Real-time Monocular Visual Odometry for Turbid and Dynamic Underwater Environments

Maxime Ferrera, Julien Moras, Pauline Trouvé-Peloux, Vincent Creuze

In the context of robotic underwater operations, the visual degradation induced by the medium properties makes it difficult to rely on cameras for localization purposes. Hence, most localization methods are based on expensive navigation sensors associated with acoustic positioning. On the other hand, visual odometry and visual SLAM have been exhaustively studied for aerial or terrestrial applications, but state-of-the-art algorithms fail underwater. In this paper we tackle the problem of underwater localization using a simple low-cost camera and propose a new monocular visual odometry method dedicated to the underwater environment. We evaluate different tracking methods and show that optical-flow-based tracking is more suited to underwater images than classical descriptor-based approaches. We also propose a keyframe-based visual odometry approach that relies heavily on nonlinear optimization. The proposed algorithm has been assessed on both simulated and real underwater datasets and outperforms state-of-the-art visual SLAM methods under many of the most challenging conditions. The main application of this work is the localization of remotely operated vehicles (ROVs) for underwater archaeological missions, but the developed system can be used in any other application as long as visual information is available. [1806.05842v2]

 

GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

Zhilin Yang, Jake Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels. [1806.05662v3]
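
A hedged sketch of the transfer step is below: a nonnegative affinity graph over input units is used to mix downstream embeddings that the graph was never trained with. The dot-product graph predictor shown here is a generic stand-in; the paper's graph predictor is a network with its own parameterization:

```python
import torch
import torch.nn.functional as F

def affinity_graph(keys, queries):
    """Pairwise affinities from learned key/query features [B, T, D]."""
    scores = torch.bmm(queries, keys.transpose(1, 2)) / keys.size(-1) ** 0.5
    return F.relu(scores) ** 2                         # sparse, nonnegative

def propagate_with_graph(graph, embeddings):
    """graph: [B, T, T] nonnegative affinities; embeddings: [B, T, D] from any
    downstream model. Each unit aggregates its graph neighbors."""
    graph = graph / (graph.sum(dim=-1, keepdim=True) + 1e-8)  # row-normalize
    mixed = torch.bmm(graph, embeddings)
    return torch.cat([embeddings, mixed], dim=-1)      # unary + relational features
```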

 

Local Learning with Deep and Handcrafted Features for Facial Expression Recognition

Mariana-Iuliana Georgescu, Radu Tudor Ionescu, Marius Popescu

We present an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition. To obtain the automatic features, we experimented with multiple CNN architectures, pretrained models, and training procedures, e.g., Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest-neighbors model is applied to select the nearest training samples for an input test image. Second, a one-versus-all support vector machine (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used to predict the class label only for the test image it was trained for. Although local learning has been used before in combination with handcrafted features, to the best of our knowledge, it has never been combined with deep features. Experiments on the 2013 Facial Expression Recognition (FER) Challenge dataset and the FER+ dataset show that our approach achieves state-of-the-art results. With a top accuracy of 75.42% on the FER 2013 dataset and 87.76% on the FER+ dataset, we surpass all competition on both datasets by more than 2%. [1804.10892v3]
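
The three-step local learning framework is easy to sketch with scikit-learn. The value of k and the SVM settings below are illustrative choices, not the paper's tuned values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def local_predict(train_x, train_y, test_x, k=200):
    """For every test sample: (1) pick its k nearest training samples,
    (2) fit a one-versus-all SVM on just those samples, and (3) use that
    throwaway classifier for this single prediction."""
    knn = NearestNeighbors(n_neighbors=k).fit(train_x)
    preds = []
    for x in test_x:
        _, idx = knn.kneighbors(x.reshape(1, -1))
        nbr_x, nbr_y = train_x[idx[0]], train_y[idx[0]]
        if len(np.unique(nbr_y)) == 1:                 # all neighbors agree
            preds.append(nbr_y[0])
            continue
        svm = LinearSVC().fit(nbr_x, nbr_y)            # one-vs-rest by default
        preds.append(svm.predict(x.reshape(1, -1))[0])
    return np.array(preds)
```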

 

Target Driven Instance Detection

Phil Ammirato, Cheng-Yang Fu, Mykhailo Shvets, Jana Kosecka, Alexander C. Berg

While state-of-the-art general object detectors are getting better and better, there are not many systems specifically designed to take advantage of the instance detection problem. For many applications, such as household robotics, a system may need to recognize a few very specific instances at a time. Speed is critical in these applications, as is the ability to recognize previously unseen instances. We introduce a Target Driven Instance Detector (TDID), which modifies existing general object detectors for the instance recognition setting. TDID not only improves performance on instances seen during training, with a faster runtime, but can also generalize to detect novel instances. [1803.04610v2]

 

Improved Training of Generative Adversarial Networks using Representative Features

Duhyeon Bang, Hyunjung Shim

Despite the success of generative adversarial networks (GANs) for image generation, the trade-off between visual quality and image diversity remains a significant issue. This paper achieves both aims simultaneously by improving the stability of training GANs. The key idea of the proposed approach is to implicitly regularize the discriminator using representative features. Focusing on the fact that standard GAN minimizes the reverse Kullback-Leibler (KL) divergence, we transfer the representative feature, which is extracted from the data distribution using a pre-trained autoencoder (AE), to the discriminator of the standard GAN. Because the AE learns to minimize the forward KL divergence, our GAN training with representative features is influenced by both reverse and forward KL divergences. Consequently, the proposed approach is verified, through extensive evaluation, to improve the visual quality and diversity of state-of-the-art GANs. [1801.09195v3]

 

Extending Context at Low Cost: Efficient Arithmetic Coding with Trimmed Convolution

Mu Li, Shuhang Gu, David Zhang, Wangmeng Zuo

Arithmetic coding is an essential class of coding techniques. One key issue of arithmetic coding methods is to predict the probability of the current coding symbol from its context, i.e., the previously encoded symbols, which usually can be executed by building a look-up table (LUT). However, the complexity of the LUT increases exponentially with the length of the context. Thus, such solutions are limited in modeling large contexts, which inevitably restricts the compression performance. Several deep neural network-based solutions have recently been developed to account for large contexts, but are still costly in computation. The inefficiency of the existing methods is mainly attributed to the fact that probability prediction is performed independently for the neighboring symbols, while it actually can be efficiently conducted by shared computation. To this end, we propose a trimmed convolutional network for arithmetic encoding (TCAE) to model large contexts while maintaining computational efficiency. As for trimmed convolution, the convolutional kernels are specially trimmed to respect the compression order and context dependency of the input symbols. Benefiting from trimmed convolution, the probability predictions for all symbols can be efficiently performed in a single forward pass via a fully convolutional network. Furthermore, to speed up the decoding process, a slope TCAE model is presented to divide the codes from a 3D code map into several blocks and remove the dependencies between the codes inside a block for parallel decoding, which can speed up the decoding process by 60x. Experiments show that our TCAE and slope TCAE attain better compression ratios in lossless gray-scale image compression, and can be adopted in CNN-based lossy image compression to achieve state-of-the-art rate-distortion performance with real-time encoding speed. [1801.04662v2]
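
The trimming idea is the same kernel-masking trick popularized by PixelCNN: kernel entries that would look at not-yet-encoded symbols, in raster-scan coding order, are zeroed, so the context model for all symbols runs in one forward pass. Below is a hedged sketch of such a masked convolution; the paper's trimming is adapted to its specific 3D code maps:

```python
import torch
import torch.nn as nn

class TrimmedConv2d(nn.Conv2d):
    """A 2D convolution whose kernel is masked to respect raster-scan
    coding order (only previously encoded symbols are visible)."""
    def __init__(self, *args, include_center=False, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + int(include_center):] = 0  # center row, right part
        mask[kh // 2 + 1:, :] = 0                          # rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return self._conv_forward(x, self.weight * self.mask, self.bias)

# usage: context model over an 8-bit grayscale code map
ctx = TrimmedConv2d(1, 256, kernel_size=5, padding=2)
logits = ctx(torch.randn(1, 1, 64, 64))   # per-pixel symbol distribution
```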

 

InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity

Hee Jung Ryu, Hartwig Adam, Margaret Mitchell

We demonstrate an approach to face attribute detection that retains or improves attribute detection accuracy across gender and race subgroups by learning demographic information prior to learning the attribute detection task. The system, which we call InclusiveFaceNet, detects face attributes by transferring race and gender representations learned from a held-out dataset of public race and gender identities. Leveraging learned demographic representations while withholding demographic inference from the downstream face attribute detection task preserves the demographic privacy of potential users while resulting in some of the best reported attribute detection numbers to date on the Faces of the World and CelebA datasets. [1712.00193v2]

 

Pixel-wise Object Tracking

Yilin Song, Chenge Li, Yao Wang

In this paper, we propose a novel pixel-wise visual object tracking framework that can track any anonymous object in a noisy background. The framework consists of two submodels, a global attention model and a local segmentation model. The global model generates a region of interest (ROI) in which the object may lie in the new frame, based on past object segmentation maps, while the local model segments the new image within the ROI. Each model uses an LSTM structure to model the temporal dynamics of motion and appearance, respectively. To circumvent the dependency of the training data between the two models, we use an iterative update strategy. Once the models are trained, there is no need to refine them to track specific objects, making our method efficient compared with online learning approaches.

 

Tensor-based Classifiers for Hyperspectral Data Analysis

Konstantinos Makantasis, Anastasios Doulamis, Nikolaos Doulamis, Antonis Nikitakis

In this work, we present tensor-based linear and nonlinear models for hyperspectral data classification and analysis. By exploiting principles of tensor algebra, we introduce new classification architectures, the weight parameters of which satisfy the {\it rank}-1 canonical decomposition property. We then introduce learning algorithms to train both linear and nonlinear classifiers in a way to i) minimize the error over the training samples and ii) have the weight coefficients satisfy the {\it rank}-1 canonical decomposition property. The advantages of the proposed classification model are that i) it reduces the number of parameters required, and thus the number of training samples needed to properly train the model, ii) it provides a physical interpretation of the model coefficients on the classification output, and iii) it retains the spatial and spectral coherency of the input samples. To address issues related to linear classification, which is characterized by low capacity since it can only produce rules that are linear in the input space, we introduce modified nonlinear classification models based on feedforward neural networks. We call the proposed architectures {\it rank}-1 Feedforward Neural Networks (FNN), since their weights satisfy the {\it rank}-1 canonical decomposition property. Appropriate learning algorithms are also proposed to train the networks. Experimental results and comparisons with state-of-the-art classification methods, either linear (e.g., SVM) or nonlinear (e.g., deep learning), indicate the outstanding performance of the proposed scheme, especially in cases where a small number of training samples is available. Furthermore, the proposed tensor-based classifiers are evaluated in terms of their capability for dimensionality reduction. [1709.08164v2]
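
As a rough illustration of the rank-1 constraint, here is a minimal PyTorch sketch of a linear classifier over a hyperspectral patch whose per-class weight tensor is the outer product of three mode vectors; sizes and initialization are illustrative:

```python
import torch
import torch.nn as nn

class Rank1LinearClassifier(nn.Module):
    """Rank-1 constrained linear model: the weight tensor over a patch
    x in R^{H x W x S} is w_h (outer) w_w (outer) w_s, so each class needs
    only H + W + S parameters instead of H*W*S."""
    def __init__(self, H, W, S, num_classes):
        super().__init__()
        self.wh = nn.Parameter(torch.randn(num_classes, H) * 0.1)
        self.ww = nn.Parameter(torch.randn(num_classes, W) * 0.1)
        self.ws = nn.Parameter(torch.randn(num_classes, S) * 0.1)
        self.b = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x):                  # x: [B, H, W, S]
        # contract one mode at a time; equivalent to <x, w_h (x) w_w (x) w_s>
        t = torch.einsum("bhws,kh->bkws", x, self.wh)
        t = torch.einsum("bkws,kw->bks", t, self.ww)
        return torch.einsum("bks,ks->bk", t, self.ws) + self.b
```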

 

H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation from CT Volumes

Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, Pheng Ann Heng

Liver cancer is one of the leading causes of cancer death. To assist doctors in hepatocellular carcinoma diagnosis and treatment planning, an accurate and automatic liver and tumor segmentation method is highly demanded in clinical practice. Recently, fully convolutional neural networks (FCNs), including 2D and 3D FCNs, have served as the backbone in many volumetric image segmentation tasks. However, 2D convolutions cannot fully leverage the spatial information along the third dimension, while 3D convolutions suffer from high computational cost and GPU memory consumption. To address these issues, we propose a novel hybrid densely connected UNet (H-DenseUNet), which consists of a 2D DenseUNet for efficiently extracting intra-slice features and a 3D counterpart for hierarchically aggregating volumetric contexts, in the spirit of the auto-context algorithm, for liver and tumor segmentation. We formulate the learning process of H-DenseUNet in an end-to-end manner, where the intra-slice representations and inter-slice features can be jointly optimized through a hybrid feature fusion (HFF) layer. We extensively evaluated our method on the dataset of the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge and the 3DIRCADb dataset. Our method outperforms other state-of-the-art approaches on the tumor segmentation results and achieves very competitive performance for liver segmentation, even with a single model. [1709.07330v3]

 

Multi-resolution LSTM for Long-Term Prediction in Neural Activity Videos

Yilin Song, Jonathan Viventi, Yao Wang

Epileptic seizures are caused by abnormal, overly synchronized electrical activity in the brain. The abnormal electrical activity manifests as waves propagating across the brain. Accurate prediction of the propagation speed and direction of these waves could enable real-time responsive brain stimulation to suppress or prevent the seizures entirely. However, this problem is very challenging because the algorithm must be able to predict the neural signals over a sufficiently long time horizon to allow enough time for medical intervention. We consider how to accomplish long-term prediction using an LSTM network. To alleviate the vanishing gradient problem, we propose two encoder-decoder-predictor structures, both using multi-resolution representations. The novel LSTM structure with multi-resolution layers can significantly outperform the single-resolution benchmark with a similar number of parameters. To overcome the blurring effect associated with video prediction in the pixel domain using standard mean squared error (MSE) loss, we use energy-based adversarial training to improve long-term prediction. We demonstrate and analyze how a discriminative model with an encoder-decoder structure using a 3D CNN model improves long-term prediction. [1705.02893v2]

 

Diversity Encouraged Learning of Unsupervised LSTM Ensemble for Neural Activity Video Prediction

Yilin Song, Jonathan Viventi, Yao Wang

Being able to predict the neural signal in the near future from the current and previous observations has the potential to enable real-time responsive brain stimulation to suppress seizures. We have investigated how to use an auto-encoder model consisting of LSTM cells for such prediction. Recognizing that there exist multiple activity pattern clusters, we have further explored training an ensemble of LSTM models so that each model can specialize in modeling certain neural activities, without explicitly clustering the training data. We train the ensemble using an ensemble-awareness loss, which jointly solves the model assignment problem and the error minimization problem. During training, for each training sequence, only the model that has the lowest reconstruction and prediction error is updated. Intrinsically, such a loss function enables each LSTM model to be adapted to a subset of the training sequences that share similar dynamic behavior. We demonstrate that this can be trained in an end-to-end manner and achieve significant accuracy in neural activity prediction. [1611.04899v2]
