Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling + Multi-task Mid-level Feature Alignment Network for Unsupervised Cross-Dataset Person Re-Identification + TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Diversity in Machine Learning

Zhiqiang Gong, Ping Zhong, Weidong Hu

Machine learning methods have achieved good performance and been widely applied in various real-world applications. They can learn models adaptively and better fit the special requirements of different tasks. Many factors affect the performance of the machine learning process, among which the diversity of machine learning is an important one. Generally, a good machine learning system is composed of plentiful training data, a good model training process, and accurate inference. Diversity helps each of these procedures guarantee a good overall machine learning system: diversity of the training data ensures that the data contain enough discriminative information; diversity of the learned model (diversity in the parameters of each model, or diversity among models) makes each parameter/model capture unique or complementary information; and diversity in inference provides multiple choices, each of which corresponds to a plausible result. However, there has been no systematic analysis of diversification in machine learning systems. In this paper, we systematically summarize methods for data diversification, model diversification, and inference diversification in the machine learning process. In addition, we survey typical applications where diversity technology has improved machine learning performance, including remote sensing imaging tasks, machine translation, camera relocalization, image segmentation, object detection, topic modeling, and others. Finally, we discuss some challenges of diversity technology in machine learning and point out directions for future work. Our analysis provides a deeper understanding of diversity technology in machine learning tasks, and hence can help in designing and learning more effective models for specific tasks. [1807.01477v1]

 

Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations

Dan Hendrycks, Thomas G. Dietterich

In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Unlike recent robustness research, this benchmark evaluates performance on commonplace corruptions, not worst-case adversarial corruptions. We find that there are negligible changes in relative corruption robustness from AlexNet to ResNet classifiers, and we discover ways to enhance corruption robustness. We then propose a new dataset called Icons-50, which opens research on a new kind of robustness: surface variation robustness. With this dataset we evaluate the frailty of classifiers on new styles of known objects and unexpected instances of known classes. We also demonstrate two methods that improve surface variation robustness. Together, our benchmarks may aid future work toward networks that learn fundamental class structure and also robustly generalize. [1807.01697v1]
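
As a concrete illustration of how such a corruption benchmark is typically scored, here is a minimal sketch of a mean corruption error: per-corruption error rates are summed over severities, normalized by a baseline classifier's errors (AlexNet in the paper), and averaged over corruptions. The array shapes and toy numbers are illustrative assumptions based on our reading of the ImageNet-C protocol, not the released evaluation code.

```python
import numpy as np

def mean_corruption_error(err, err_baseline):
    """Aggregate per-corruption, per-severity error rates into a single score.

    err, err_baseline: arrays of shape (n_corruptions, n_severities) holding
    top-1 error rates for the evaluated model and a baseline (AlexNet in the
    paper). Each corruption's error is normalized by the baseline's error
    summed over severities, then averaged over corruptions.
    """
    err = np.asarray(err, dtype=float)
    err_baseline = np.asarray(err_baseline, dtype=float)
    ce = err.sum(axis=1) / err_baseline.sum(axis=1)  # one CE per corruption
    return ce.mean()

# Toy example: 2 corruptions x 3 severities.
model_err = [[0.30, 0.40, 0.55], [0.25, 0.35, 0.50]]
alexnet_err = [[0.45, 0.60, 0.75], [0.40, 0.55, 0.70]]
print(mean_corruption_error(model_err, alexnet_err))  # < 1.0 beats the baseline
```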

 

SGAD: Soft-Guided Adaptively-Dropped Neural Network

Zhisheng Wang, Fangxuan Sun, Jun Lin, Zhongfeng Wang, Bo Yuan

Deep neural networks (DNNs) have been proven to contain many redundancies. Hence, many efforts have been made to compress DNNs. However, existing model compression methods treat all input samples equally, ignoring the fact that different input samples differ in how difficult they are to classify correctly. To address this problem, DNNs with an adaptive dropping mechanism are explored in this work. To inform the DNNs of how difficult an input sample is to classify, a guideline that contains information about the input samples is introduced to improve performance. Based on the developed guideline and adaptive dropping mechanism, an innovative soft-guided adaptively-dropped (SGAD) neural network is proposed in this paper. Compared with a 32-layer residual neural network, the proposed SGAD can reduce FLOPs by 77% with less than a 1% drop in accuracy on CIFAR-10. [1807.01430v1]
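
The abstract gives no implementation details, but the PyTorch sketch below illustrates the general flavour of an adaptive dropping mechanism: a tiny per-sample gate (standing in for the paper's soft guideline) decides whether an easy input may bypass a residual block at inference, saving that block's FLOPs. The gate design, threshold, and training scheme are assumptions for illustration, not the SGAD architecture itself.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block that individual samples may skip at inference."""
    def __init__(self, channels, threshold=0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Tiny gate scoring per-sample "difficulty" from pooled features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid(),
        )
        self.threshold = threshold

    def forward(self, x):
        g = self.gate(x).view(-1)                       # per-sample gate in (0, 1)
        if self.training:                               # soft guidance while training
            return x + g.view(-1, 1, 1, 1) * self.body(x)
        keep = g > self.threshold                       # hard per-sample drop decision
        out = x.clone()
        if keep.any():                                  # run the block only where needed
            out[keep] = x[keep] + self.body(x[keep])
        return out
```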

 

Uncorrelated Feature Encoding for Faster Image Style Transfer

Minseong Kim, Jongju Shin, Myung-Cheol Roh, Hyun-Chul Choi

Recent fast style transfer methods use a pre-trained convolutional neural network as a feature encoder and a perceptual loss network. Although the pre-trained network is used to generate responses of receptive fields effective for representing the style and content of an image, it is optimized for image classification rather than for image style transfer. Furthermore, because of its inter-channel correlation, it also requires a time-consuming, correlation-aware feature alignment process for image style transfer. In this paper, we propose an end-to-end learning method that optimizes an encoder/decoder network for the purpose of style transfer and relieves the feature alignment process from having to consider inter-channel correlation. We use an uncorrelation loss, i.e., the total correlation coefficient between the responses of different encoder channels, together with style and content losses for training the style transfer network. This trains the encoder network to generate inter-channel uncorrelated features and to be optimized for the task of image style transfer, maintaining the quality of the image style with only a light-weight, correlation-unaware feature alignment process. Moreover, our method drastically reduces redundant channels of the encoded feature, which results in a more compact network structure and faster forward processing speed. Our method can also be applied to a cascade network scheme for multiple-scale style transfer, and allows user control of style strength through a content-style trade-off parameter. [1807.01493v1]
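
The uncorrelation loss lends itself to a compact implementation. The sketch below computes a Pearson correlation matrix across encoder channels and penalizes its off-diagonal entries; this is one plausible reading of the "total correlation coefficient between the responses of different encoder channels", and the exact normalization in the paper may differ.

```python
import torch

def uncorrelation_loss(features, eps=1e-8):
    """Penalize inter-channel correlation of encoder responses.

    features: (B, C, H, W) encoder activations. Batch and spatial dims are
    flattened per channel, a Pearson correlation matrix is computed across
    channels, and the mean absolute off-diagonal entry is returned.
    """
    b, c, h, w = features.shape
    x = features.permute(1, 0, 2, 3).reshape(c, -1)   # (C, B*H*W)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x @ x.t() / (x.shape[1] - 1)
    std = torch.sqrt(torch.diag(cov) + eps)
    corr = cov / (std[:, None] * std[None, :])
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().sum() / (c * (c - 1))       # mean |corr| over channel pairs
```

During training this term is simply added to the perceptual objective, e.g. total = content_loss + style_loss + lam * uncorrelation_loss(features), with lam an assumed weight.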

 

Sensors, SLAM and Long-term Autonomy: A Review

Mubariz Zaffar, Shoaib Ehsan, Rustam Stolkin, Klaus McDonald Maier

Simultaneous Localization and Mapping, commonly known as SLAM, has been an active research area in the field of Robotics over the past three decades. For solving the SLAM problem, every robot is equipped with either a single sensor or a combination of similar/different sensors. This paper attempts to review, discuss, evaluate and compare these sensors. Keeping an eye on the future, this paper also assesses the characteristics of these sensors against factors critical to the long-term autonomy challenge. [1807.01605v1]

 

Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling

Matthew Trumble, Andrew Gilbert, Adrian Hilton, John Collomosse

We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of $4 \times$, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose. [1807.01511v1]
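
A minimal sketch of such a dual-loss volumetric autoencoder is given below: one head regresses skeletal joint positions from the latent code while the decoder reconstructs the occupancy volume at 4x resolution. All layer sizes, the joint count, and the loss weighting are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLossAutoencoder(nn.Module):
    """Symmetric volumetric autoencoder with a pose head and a 4x decoder."""
    def __init__(self, n_joints=21, latent=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 32^3 -> 16^3
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16^3 -> 8^3
            nn.Flatten(), nn.Linear(32 * 8 ** 3, latent),
        )
        self.joints = nn.Linear(latent, n_joints * 3)   # skeletal joint positions
        self.decoder = nn.Sequential(
            nn.Linear(latent, 32 * 8 ** 3), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8, 8)),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose3d(8, 4, 4, stride=2, padding=1), nn.ReLU(),    # 32 -> 64
            nn.ConvTranspose3d(4, 1, 4, stride=2, padding=1),               # 64 -> 128 (4x)
        )

def dual_loss(model, volume, joints_gt, volume_hr, w=1.0):
    """Joint-position loss plus high-resolution reconstruction loss."""
    z = model.encoder(volume)                        # volume: (B, 1, 32, 32, 32)
    j = model.joints(z).view(volume.shape[0], -1, 3)
    recon = model.decoder(z)                         # (B, 1, 128, 128, 128)
    return F.mse_loss(j, joints_gt) + w * F.mse_loss(recon, volume_hr)
```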

 

Video Frame Interpolation by Plug-and-Play Deep Locally Linear Embedding

Anh-Duc Nguyen, Woojae Kim, Jongyoo Kim, Sanghoon Lee

We propose a generative framework which takes on the video frame interpolation problem. Our framework, which we call Deep Locally Linear Embedding (DeepLLE), is powered by a deep convolutional neural network (CNN) while it can be used instantly like conventional models. DeepLLE fits an auto-encoding CNN to a set of several consecutive frames and embeds a linearity constraint on the latent codes so that new frames can be generated by interpolating new latent codes. Different from the current deep learning paradigm which requires training on large datasets, DeepLLE works in a plug-and-play and unsupervised manner, and is able to generate an arbitrary number of frames. Thorough experiments demonstrate that without bells and whistles, our method is highly competitive among current state-of-the-art models. [1807.01462v1]
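
Two ingredients make this plug-and-play scheme work, sketched below: a linearity penalty that keeps the latent codes of consecutive frames on a line, and latent interpolation at generation time. The midpoint form of the constraint is our assumption of one plausible formulation; the paper's exact objective may differ.

```python
import torch

def linearity_penalty(z):
    """z: (T, D) latent codes of T consecutive frames.

    Penalizes each interior code's deviation from the midpoint of its two
    neighbours, encouraging the codes to lie on a line in latent space.
    """
    mid = 0.5 * (z[:-2] + z[2:])
    return ((z[1:-1] - mid) ** 2).mean()

def interpolated_codes(z0, z1, n):
    """Codes for n in-between frames; decode each to synthesize a new frame."""
    alphas = torch.linspace(0.0, 1.0, n + 2)[1:-1]
    return torch.stack([(1 - a) * z0 + a * z1 for a in alphas])
```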

 

Localization Recall Precision (LRP): A New Performance Metric for Object Detection

Kemal Oksuz, Baris Can Cam, Emre Akbas, Sinan Kalkan

Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of a direct measure of bounding box localization accuracy. In this paper, we propose ‘Localization Recall Precision (LRP) Error’, a new metric specifically designed for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce the ‘Optimal LRP’, the minimum achievable LRP error, representing the best achievable configuration of the detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the ‘best’ confidence score threshold for a class, which balances the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art (SOTA) object detectors, Optimal LRP provides richer and more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly among classes and detectors. Moreover, we present LRP results of a simple online video object detector which uses a SOTA still-image object detector, and show that class-specific optimized thresholds increase the accuracy over the common approach of using a general threshold for all classes. We provide source code that can compute LRP for the PASCAL VOC and MSCOCO datasets at https://github.com/cancam/LRP. Our source code can easily be adapted to other datasets as well. [1807.01696v1]
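
For reference, here is a sketch of the LRP error assembled from its three components. The (1 - IoU)/(1 - tau) localization term and the shared normalization by NTP + NFP + NFN follow our reading of the paper; treat the exact form as an assumption and defer to the released source code.

```python
import numpy as np

def lrp_error(ious, n_fp, n_fn, tau=0.5):
    """LRP error from matched-detection IoUs and FP/FN counts.

    ious: IoU of each true-positive detection with its matched ground truth
    (all assumed >= tau). The localization term averages (1 - IoU)/(1 - tau)
    over TPs; the FP and FN terms count unmatched detections and ground
    truths; all three are normalized by NTP + NFP + NFN.
    """
    ious = np.asarray(ious, dtype=float)
    n_tp = len(ious)
    total = n_tp + n_fp + n_fn
    if total == 0:
        return 0.0
    loc = ((1.0 - ious) / (1.0 - tau)).sum()
    return (loc + n_fp + n_fn) / total

# Example: 3 TPs with IoUs 0.9/0.8/0.6, 1 FP, 2 FNs at tau = 0.5.
print(lrp_error([0.9, 0.8, 0.6], n_fp=1, n_fn=2, tau=0.5))
```

Optimal LRP is then the minimum of this error over confidence score thresholds, evaluated per class.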

 

Selective Deep Convolutional Neural Network for Low Cost Distorted Image Classification

Minho Ha, Younghoon Byeon, Youngjoo Lee, Sunggu Lee

Deep convolutional neural networks have proven to be well suited for image classification applications. However, if there is distortion in the image, the classification accuracy can be significantly degraded, even with state-of-the-art neural networks. The accuracy cannot be significantly improved by simply training with distorted images. Instead, this paper proposes a multiple neural network topology referred to as a selective deep convolutional neural network. By modifying existing state-of-the-art neural networks in the proposed manner, it is shown that a similar level of classification accuracy can be achieved, but at a significantly lower cost. The cost reduction is obtained primarily through the use of fewer weight parameters. Using fewer weights reduces the number of multiply-accumulate operations and also reduces the energy required for data accesses. Finally, it is shown that the effectiveness of the proposed selective deep convolutional neural network can be further improved by combining it with previously proposed network cost reduction methods. [1807.01418v1]

 

Small-scale Pedestrian Detection Based on Somatic Topology Localization and Temporal Feature Aggregation

Tao Song, Leiyu Sun, Di Xie, Haiming Sun, Shiliang Pu

A critical issue in pedestrian detection is detecting small-scale objects, which exhibit feeble contrast and motion blur in images and videos; in our opinion, this should be partially attributed to deep-rooted annotation bias. Motivated by this, we propose a novel method integrating somatic topological line localization (TLL) and temporal feature aggregation for detecting multi-scale pedestrians, which works particularly well for small-scale pedestrians that are relatively far from the camera. Moreover, a post-processing scheme based on Markov Random Fields (MRF) is introduced to eliminate ambiguities in occlusion cases. Applying these methodologies comprehensively, we achieve the best detection performance on the Caltech benchmark and improve the performance on small-scale objects significantly (miss rate decreases from 74.53% to 60.79%). Beyond this, we also achieve competitive performance on the CityPersons dataset and show the existence of annotation bias in the KITTI dataset. [1807.01438v1]

 

An Integration of Bottom-up and Top-Down Salient Cues on RGB-D Data: Saliency from Objectness vs. Non-Objectness

Nevrez Imamoglu, Wataru Shimoda, Chi Zhang, Yuming Fang, Asako Kanezaki, Keiji Yanai, Yoshifumi Nishida

Bottom-up and top-down visual cues are two types of information that help visual saliency models. These salient cues can come from spatial distributions of features (space-based saliency) or contextual/task-dependent features (object-based saliency). Saliency models generally incorporate salient cues in either a bottom-up or a top-down manner separately. In this work, we combine bottom-up and top-down cues from both space-based and object-based salient features on RGB-D data. In addition, we investigate the ability of various pre-trained convolutional neural networks to extract top-down saliency on color images based on object-dependent feature activation. We demonstrate that combining salient features from color and depth through bottom-up and top-down methods gives significant improvement on salient object detection with space-based and object-based salient cues. The RGB-D saliency integration framework yields promising results compared with several state-of-the-art models. [1807.01532v1]

 

Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention

Karuna Ahuja, Karan Sikka, Anirban Roy, Ajay Divakaran

We tackle the problem of understanding visual ads where given an ad image, our goal is to rank appropriate human generated statements describing the purpose of the ad. This problem is generally addressed by jointly embedding images and candidate statements to establish correspondence. Decoding a visual ad requires inference of both semantic and symbolic nuances referenced in an image and prior methods may fail to capture such associations especially with weakly annotated symbols. In order to create better embeddings, we leverage an attention mechanism to associate image proposals with symbols and thus effectively aggregate information from aligned multimodal representations. We propose a multihop co-attention mechanism that iteratively refines the attention map to ensure accurate attention estimation. Our attention based embedding model is learned end-to-end guided by a max-margin loss function. We show that our model outperforms other baselines on the benchmark Ad dataset and also show qualitative results to highlight the advantages of using multihop co-attention. [1807.01448v1]
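
The sketch below illustrates one plausible shape of the multihop co-attention loop between region proposals and symbol embeddings: each hop attends over proposals with a symbol-conditioned query, then refines the query from the attended context, and the final attended representation would feed the max-margin ranking objective. The parameterization (dot-product attention, shared dimensionality, mean-initialized query) is an illustrative assumption, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def multihop_coattention(proposals, symbols, hops=3):
    """Iteratively refine attention between image proposals and symbols.

    proposals: (P, D) region features; symbols: (S, D) symbol embeddings.
    Returns the attended image context and the final proposal attention map.
    """
    query = symbols.mean(dim=0)                        # initial query, shape (D,)
    for _ in range(hops):
        attn_p = F.softmax(proposals @ query, dim=0)   # attention over proposals
        ctx_p = attn_p @ proposals                     # attended image context
        attn_s = F.softmax(symbols @ ctx_p, dim=0)     # attention over symbols
        query = attn_s @ symbols                       # refined query for next hop
    return ctx_p, attn_p
```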

 

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, Cong Yao

Driven by deep neural networks and large-scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered on its symmetric axis, each of which is associated with a potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure. [1807.01544v1]
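
The representation itself is easy to write down. Below is a minimal sketch of a text instance as a sequence of ordered, overlapping disks, each carrying a center on the symmetric axis, a local radius and a local orientation. The field names are our own; in the paper these geometry attributes are regressed by an FCN rather than constructed by hand.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Disk:
    """One element of a TextSnake: a disk on the text's symmetric axis."""
    cx: float      # center x (on the symmetric axis)
    cy: float      # center y
    radius: float  # local half-height of the text region
    theta: float   # local orientation of the axis, in radians

@dataclass
class TextInstance:
    """A text instance as a sequence of ordered, overlapping disks.

    The union of the disks covers the (possibly curved) text region.
    """
    disks: List[Disk]

# A gently curved instance, for illustration only.
snake = TextInstance([
    Disk(cx=10.0 + 8 * i, cy=20.0 + 2 * math.sin(i), radius=6.0,
         theta=math.atan2(2 * math.cos(i), 8.0))
    for i in range(8)
])
print(len(snake.disks), snake.disks[0])
```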

 

Discriminative Feature Learning with Foreground Attention for Person Re-Identification

Sanping Zhou, Jinjun Wang, Deyu Meng, Yudong Liang, Yihong Gong, Nanning Zheng

The performance of person re-identification (Re-ID) depends heavily on the camera environment, where large cross-view appearance variations caused by mutual occlusions and background clutter severely decrease identification accuracy. It is therefore essential to learn a feature representation that can adaptively emphasize the foreground of each input image. In this paper, we propose a simple yet effective deep neural network for the person Re-ID problem, which attempts to learn a discriminative feature representation by attending to the foreground of input images. Specifically, a novel foreground attentive neural network (FANN) is first built to strengthen the positive influence of the foreground while weakening the side effects of the background, in which an encoder and decoder subnetwork is carefully designed to drive the whole neural network to put its attention on the foreground of input images. Then, a novel symmetric triplet loss function is designed to enhance the feature learning capability by jointly minimizing the intra-class distance and maximizing the inter-class distance in each triplet unit. By training the deep neural network in an end-to-end fashion, a discriminative feature representation is finally learned to find the matched reference to the probe among the various candidates in the gallery. Comprehensive experiments on several public benchmark datasets evaluate the performance and show clear improvements of our method over state-of-the-art approaches. [1807.01455v1]
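
A minimal sketch of a symmetric triplet loss in the spirit described here: in addition to the usual anchor-positive versus anchor-negative comparison, the negative is also pushed away from the positive, so intra-class distance is minimized and inter-class distance maximized within each triplet unit. The specific symmetric combination and the margin value are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def symmetric_triplet_loss(anchor, positive, negative, margin=0.5):
    """anchor, positive, negative: (N, D) embedding batches.

    The inter-class term averages the anchor-negative and positive-negative
    distances, so both ends of the positive pair repel the negative.
    """
    d_ap = F.pairwise_distance(anchor, positive)   # intra-class distance
    d_an = F.pairwise_distance(anchor, negative)
    d_pn = F.pairwise_distance(positive, negative)
    inter = 0.5 * (d_an + d_pn)                    # symmetric inter-class distance
    return F.relu(d_ap - inter + margin).mean()
```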

 

The SEN1-2 Dataset for Deep Learning in SAR-Optical Data Fusion

Michael Schmitt, Lloyd Haydn Hughes, Xiao Xiang Zhu

While deep learning techniques have an increasing impact on many technical fields, gathering sufficient amounts of training data is a challenging problem in remote sensing. In particular, this holds for applications involving data from multiple sensors with heterogeneous characteristics. One example for that is the fusion of synthetic aperture radar (SAR) data and optical imagery. With this paper, we publish the SEN1-2 dataset to foster deep learning research in SAR-optical data fusion. SEN1-2 comprises 282,384 pairs of corresponding image patches, collected from across the globe and throughout all meteorological seasons. Besides a detailed description of the dataset, we show exemplary results for several possible applications, such as SAR image colorization, SAR-optical image matching, and creation of artificial optical images from SAR input data. Since SEN1-2 is the first large open dataset of this kind, we believe it will support further developments in the field of deep learning for remote sensing as well as multi-sensor data fusion. [1807.01569v1]

 

Encoding Spatial Relations from Natural Language

Tiago Ramalho, Tomáš Kociský, Frederic Besse, S. M. Ali Eslami, Gábor Melis, Fabio Viola, Phil Blunsom, Karl Moritz Hermann

Natural language processing has made significant inroads into learning the semantics of words through distributional approaches; however, representations learnt via these methods fail to capture certain kinds of information implicit in the real world. In particular, spatial relations are encoded in a way that is inconsistent with human spatial reasoning and lacks invariance to viewpoint changes. We present a system capable of capturing the semantics of spatial relations such as behind, left of, etc. from natural language. Our key contributions are a novel multi-modal objective based on generating images of scenes from their textual descriptions, and a new dataset on which to train it. We demonstrate that the internal representations are robust to meaning-preserving transformations of descriptions (paraphrase invariance), while viewpoint invariance is an emergent property of the system. [1807.01670v1]

 

Deep Learning Based Damage Detection on Post-Hurricane Satellite Imagery

Quoc Dung Cao, Youngjun Choe

After a hurricane, damage assessment is critical to emergency managers and first responders. To improve the efficiency and accuracy of damage assessment, instead of using windshield survey, we propose to automatically detect damaged buildings using image classification algorithms. The method is applied to the case study of 2017 Hurricane Harvey. [1807.01688v1]

 

AND: Autoregressive Novelty Detectors

Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara

We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes’ likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks. [1807.01653v1]
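
The core of the autoregressive module can be sketched with a single causally masked linear layer that predicts a Gaussian for each latent dimension conditioned on the preceding ones; maximizing the resulting log-likelihood over training codes, and flagging low-likelihood codes as novel, is the idea. The paper stacks richer autoregressive layers, so treat this smallest-instance sketch as illustrative only.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveDensity(nn.Module):
    """log p(z) = sum_i log N(z_i; mu_i(z_<i), sigma_i(z_<i)^2)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)
        # Strictly lower-triangular mask: dimension i sees only z_<i.
        self.register_buffer("mask", torch.tril(torch.ones(dim, dim), diagonal=-1))

    def log_prob(self, z):
        mu = F.linear(z, self.mu.weight * self.mask, self.mu.bias)
        logvar = F.linear(z, self.logvar.weight * self.mask, self.logvar.bias)
        ll = -0.5 * (logvar + (z - mu) ** 2 / logvar.exp() + math.log(2 * math.pi))
        return ll.sum(dim=1)   # per-sample log-likelihood; low values flag novelty
```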

 

Neonatal Pain Expression Recognition Using Transfer Learning

Ghada Zamzmi, Dmitry Goldgof, Rangachar Kasturi, Yu Sun

Transfer learning using pre-trained Convolutional Neural Networks (CNNs) has been successfully applied to images for different classification tasks. In this paper, we propose a new pipeline for pain expression recognition in neonates using transfer learning. Specifically, we propose to exploit a pre-trained CNN that was originally trained on a relatively similar dataset for face recognition (VGG Face), as well as CNNs that were pre-trained on a relatively different dataset for image classification (iVGG F, M, and S), to extract deep features from neonates’ faces. In the final stage, several supervised machine learning classifiers are trained to classify neonates’ facial expression into pain or no-pain expression. The proposed pipeline achieved, on a testing dataset, 0.841 AUC and 90.34% accuracy, which is approx. 7% higher than the accuracy of handcrafted traditional features. We also propose to combine deep features with traditional features and hypothesize that the mixed features would improve pain classification performance. Combining deep features with traditional features achieved 92.71% accuracy and 0.948 AUC. These results show that transfer learning, which is a faster and more practical option than training a CNN from scratch, can be used to extract useful features for pain expression recognition in neonates. It also shows that combining deep features with traditional handcrafted features is a good practice for improving the performance of pain expression recognition and possibly the performance of similar applications. [1807.01631v1]

 

VideoKifu, or the automatic transcription of a Go game

Mario Corsolini, Andrea Carta

In two previous papers [arXiv:1508.03269, arXiv:1701.05419] we described the techniques we employed for reconstructing the whole move sequence of a Go game. That task was at first accomplished by means of a series of photographs, manually shot, as explained during the scientific conference held within the LIX European Go Congress (Liberec, CZ). The photographs were subsequently replaced by a possibly unattended live video stream (provided by webcams, video cameras, smartphones and so on) or, where a live stream was not available, by a pre-recorded video of the game itself, on condition that the goban and the stones were clearly visible more often than not. As we hinted in the latter paper, in the last two years we have improved both the algorithms employed for reconstructing the grid and detecting the stones, making extensive use of the multicore capabilities offered by modern CPUs. Those capabilities prompted us to develop some asynchronous routines, capable of double-checking the position of the grid and the number and colour of any stones previously detected, in order to get rid of minor errors that may have occurred during the main analysis and that may pass undetected, especially in the course of an unattended live stream. Those routines will be described in detail, as they address some problems of general interest when reconstructing the move sequence, for example what to do when large movements of the whole goban occur (deliberate or not), and how to deal with captures of dead stones, which could be wrongly detected and recorded as “fresh” moves if not promptly removed. [1807.01577v1]

 

Unbiased Decoder Learning for Fast Image Style Transfer

Hyun-Chul Choi, Minseong Kim

Image style transfer is one of the computer vision applications related to deep machine learning. Since the proposal of the first online learning approach using a single-layer neural network, known as neural style, image style transfer methods have continuously improved in processing speed and style capacity. However, controlling the style strength of an image has not been investigated deeply. As an early stage of research into style strength control, we propose a method of style manifold learning in the image decoder which can generate unbiased style images for image style transfer. [1807.01424v1]

 

Deep Saliency Hashing

Sheng Jin

In recent years, hashing methods have proven efficient for large-scale Web media search. However, existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have subtle differences. To solve this problem, we introduce, for the first time, an attention mechanism to the learning of hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components: the semantic loss, the saliency loss, and the quantization loss. The saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on Oxford Flowers-17 and Stanford Dogs-120 demonstrate that our DSaH performs best on the fine-grained retrieval task and beats the existing best retrieval performance (DPSH) by approximately 12%. DSaH also outperforms several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE. [1807.01459v1]
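
Of the three loss components, the quantization loss has a standard compact form, sketched below together with an assumed weighted combination; the semantic and saliency losses depend on the pairwise supervision and the attention network, so they are only stubbed here as inputs.

```python
import torch

def quantization_loss(h):
    """Push relaxed hash outputs h (values in (-1, 1)) toward +/-1 codes."""
    return ((h.abs() - 1.0) ** 2).mean()

def dsah_objective(semantic_loss, saliency_loss, h, alpha=1.0, beta=1.0):
    """Weighted three-part objective; alpha and beta are assumed weights."""
    return semantic_loss + alpha * saliency_loss + beta * quantization_loss(h)

def binary_codes(h):
    """Final retrieval codes are the signs of the relaxed outputs."""
    return torch.sign(h)
```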

 

Video Semantic Salient Instance Segmentation: Benchmark Dataset and Baseline

Trung-Nghia Le, Akihiro Sugimoto

This paper pushes the envelope on salient regions in a video by decomposing them into semantically meaningful components: semantic salient instances. To address this video semantic salient instance segmentation task, we construct a new dataset, the Semantic Salient Instance Video (SESIV) dataset. Our SESIV dataset consists of 84 high-quality video sequences with pixel-wise per-frame ground-truth labels annotated for different segmentation tasks. We also provide a baseline for this problem, called the Fork-Join Strategy (FJS). FJS is a two-stream network leveraging the advantages of two different segmentation tasks, i.e., semantic instance segmentation and salient object segmentation. In FJS, we introduce a sequential fusion that combines the outputs of the two streams to obtain non-overlapping instances one by one. We also introduce a recurrent instance propagation to refine the shapes and semantic meanings of instances, and an identity tracking to maintain both the identity and the semantic meaning of an instance over the entire video. Experimental results demonstrate the effectiveness of our proposed FJS. [1807.01452v1]

 

Multi-task Mid-level Feature Alignment Network for Unsupervised Cross-Dataset Person Re-Identification

Shan Lin, Haoliang Li, Chang-Tsun Li, Alex Chichung Kot

Most existing person re-identification (Re-ID) approaches follow a supervised learning framework, in which a large number of labelled matching pairs are required for training. Such a setting severely limits their scalability in real-world applications where no labelled samples are available during the training phase. To overcome this limitation, we develop a novel unsupervised Multi-task Mid-level Feature Alignment (MMFA) network for the unsupervised cross-dataset person re-identification task. Under the assumption that the source and target datasets share the same set of mid-level semantic attributes, our proposed model can be jointly optimised under the person’s identity classification and the attribute learning task with a cross-dataset mid-level feature alignment regularisation term. In this way, the learned feature representation can be better generalised from one dataset to another, which further improves person re-identification accuracy. Experimental results on four benchmark datasets demonstrate that our proposed method outperforms state-of-the-art baselines. [1807.01440v1]
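
A sketch of the alignment idea under the stated assumption of shared mid-level attributes: alongside identity classification on the labelled source and attribute prediction on both datasets, a regularizer penalizes the discrepancy between the mid-level feature distributions of the two datasets. We use a simple linear-kernel MMD (mean-feature matching) as a stand-in; the paper's regularisation term may be parameterized differently.

```python
import torch

def mmd_alignment(fs, ft):
    """Linear-kernel MMD between source and target mid-level features.

    fs: (Ns, D) source features; ft: (Nt, D) target features, both taken
    from the shared backbone. Matching the mean embeddings pulls the two
    dataset-level feature distributions together.
    """
    return ((fs.mean(dim=0) - ft.mean(dim=0)) ** 2).sum()

# Joint objective (sketch): identity loss on the labelled source, attribute
# loss on both datasets, plus the cross-dataset alignment term.
# total = id_loss + attr_loss + lam * mmd_alignment(fs, ft)
```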

 

Endmember Extraction on the Grassmannian

Elin Farnell, Henry Kvinge, Michael Kirby, Chris Peterson

Endmember extraction plays a prominent role in a variety of data analysis problems as endmembers often correspond to data representing the purest or best representative of some feature. Identifying endmembers then can be useful for further identification and classification tasks. In settings with high-dimensional data, such as hyperspectral imagery, it can be useful to consider endmembers that are subspaces as they are capable of capturing a wider range of variations of a signature. The endmember extraction problem in this setting thus translates to finding the vertices of the convex hull of a set of points on a Grassmannian. In the presence of noise, it can be less clear whether a point should be considered a vertex. In this paper, we propose an algorithm to extract endmembers on a Grassmannian, identify subspaces of interest that lie near the boundary of a convex hull, and demonstrate the use of the algorithm on a synthetic example and on the 220 spectral band AVIRIS Indian Pines hyperspectral image. [1807.01401v1]

 

Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors

Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, Yaser Sheikh

In this paper, we present supervision-by-registration, an unsupervised approach to improve the precision of facial landmark detectors on both images and video. Our key observation is that the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow. Interestingly, the coherency of optical flow is a source of supervision that does not require manual labeling, and can be leveraged during detector training. For example, we can enforce in the training loss function that a detected landmark at frame$_{t-1}$ followed by optical flow tracking from frame$_{t-1}$ to frame$_t$ should coincide with the location of the detection at frame$_t$. Essentially, supervision-by-registration augments the training loss function with a registration loss, thus training the detector to have output that is not only close to the annotations in labeled images, but also consistent with registration on large amounts of unlabeled videos. End-to-end training with the registration loss is made possible by a differentiable Lucas-Kanade operation, which computes optical flow registration in the forward pass, and back-propagates gradients that encourage temporal coherency in the detector. The output of our method is a more precise image-based facial landmark detector, which can be applied to single images or video. With supervision-by-registration, we demonstrate (1) improvements in facial landmark detection on both images (300W, ALFW) and video (300VW, Youtube-Celebrities), and (2) significant reduction of jittering in video detections. [1807.00966v2]
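
The registration loss itself is compact. The sketch below assumes a differentiable point tracker flow_track (a Lucas-Kanade operation in the paper) that propagates frame t-1 detections to frame t, and penalizes their disagreement with the frame-t detections; this term is added to the usual supervised landmark loss on labeled images. The function name and squared-distance form are illustrative assumptions.

```python
import torch

def registration_loss(det_prev, det_curr, flow_track):
    """Supervision-by-registration loss on unlabeled video.

    det_prev, det_curr: (N, 2) landmark detections at frames t-1 and t.
    flow_track: differentiable tracker mapping frame t-1 points to frame t.
    Gradients flow through both the detections and the tracking, encouraging
    temporally coherent detectors without manual labels.
    """
    tracked = flow_track(det_prev)   # where frame t-1 detections land in frame t
    return ((tracked - det_curr) ** 2).sum(dim=1).mean()
```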

 

ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations

Shuai Zheng, Fan Yang, M. Hadi Kiapour, Robinson Piramuthu

Understanding clothes from a single image has strong commercial and cultural impacts on modern societies. However, this task remains a challenging computer vision problem due to wide variations in the appearance, style, brand and layering of clothing items. We present a new database called “ModaNet”, a large-scale collection of images based on the Paperdoll dataset. Our dataset provides 55,176 street images, fully annotated with polygons, on top of the 1 million weakly annotated street images in Paperdoll. ModaNet aims to provide a technical benchmark to fairly evaluate the progress of applying the latest computer vision techniques that rely on large data for fashion understanding. The rich annotation of the dataset allows measuring the performance of state-of-the-art algorithms for object detection, semantic segmentation and polygon prediction on street fashion images in detail. The dataset will be released soon. [1807.01394v1]

