Presentation Attack Detection for Cadaver Irises
Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz
This paper presents a deep-learning-based method for iris presentation attack detection (PAD) when iris images are obtained from deceased people. Our approach is based on the VGG-16 architecture fine-tuned with a database of 574 post-mortem, near-infrared iris images from the Warsaw-BioBase-PostMortem-Iris-v1 database, complemented by a dataset of 256 images of live irises collected within the scope of this study. Experiments described in this paper show that our approach is able to correctly classify iris images as either representing a live or a dead eye in almost 99% of the trials, averaged over 20 subject-disjoint train/test splits. We also show that the post-mortem iris detection accuracy increases as time since death elapses, and that we are able to construct a classification system with APCER=0%@BPCER=1% (Attack Presentation and Bona Fide Presentation Classification Error Rates, respectively) when only post-mortem samples collected at least 16 hours post-mortem are considered. Since acquisitions of ante- and post-mortem samples differ significantly, we applied countermeasures to minimize bias in our classification methodology caused by image properties that are not related to PAD. These included using the same iris sensor for the collection of ante- and post-mortem samples, and analyzing class activation maps to ensure that the discriminant iris regions utilized by our classifier are related to properties of the eye, not to those of the acquisition protocol. This paper offers the first PAD method known to us in a post-mortem setting, together with an explanation of the decisions made by the convolutional neural network. Along with the paper we offer source code, the weights of the trained network, and a dataset of live iris images to facilitate reproducibility and further research. [1807.04058v1]
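As an illustration of the fine-tuning setup described above, here is a minimal PyTorch sketch of adapting VGG-16 to a two-class live/post-mortem decision; the layer-freezing choice and hyperparameters are our assumptions, not the authors' exact recipe.

```python
# Hedged sketch of the fine-tuning setup, assuming standard torchvision VGG-16;
# frozen layers and hyperparameters are illustrative, not the authors' recipe.
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)
# Replace the 1000-way ImageNet head with a 2-way head: live vs. post-mortem.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

# Freeze the earliest convolutional layers; fine-tune the rest.
for param in model.features[:10].parameters():
    param.requires_grad = False

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
```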
Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound
Nicolo’ Savioli, Silvia Visentin, Erich Cosmi, Enrico Grisan, Pablo Lamata, Giovanni Montana
The automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present our attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a Convolution Gated Recurrent Unit (C-GRU) for enforcing temporal coherence across video frames and exploiting the temporal redundancy of the signal, and a regularized loss function, called \textit{CyclicLoss}, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, reducing the mean squared error from $0.31\,mm^2$ (state of the art) to $0.09\,mm^2$, and the relative error from $8.1\%$ to $5.3\%$. The mean execution speed of the proposed approach, 289 frames per second, makes it suitable for real-time clinical use. [1807.04056v1]
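The abstract does not spell out the exact form of the CyclicLoss; below is a minimal sketch of one plausible reading, assuming the cardiac period (in frames) is known and predictions one period apart should agree.

```python
import torch

def cyclic_loss(pred, target, period, weight=0.1):
    # Data term: mean squared error against the annotated diameters.
    mse = torch.mean((pred - target) ** 2)
    # Periodicity term (assumed form): predictions one cardiac period
    # apart along the frame axis should agree.
    cyc = torch.mean((pred[period:] - pred[:-period]) ** 2)
    return mse + weight * cyc
```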
DeSTNet: Densely Fused Spatial Transformer Networks
Roberto Annunziata, Christos Sagonas, Jacques Calì
Modern Convolutional Neural Networks (CNNs) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs with the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose the Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks, namely MNIST, GTSRB, and IDocDB, show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness. [1807.04050v1]
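A minimal sketch of the dense-connectivity idea, under the assumption that each STN predicts an affine update that is fused (here, summed) with the updates of all preceding STNs; the paper's actual fusion operator and localization networks are richer than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenselyFusedSTNs(nn.Module):
    """Each stage predicts an affine update from the current warp, and the
    updates of all preceding stages are fused (summed here) before warping.
    `in_features` must equal C*H*W of the input image."""
    def __init__(self, num_stns, in_features):
        super().__init__()
        self.locs = nn.ModuleList(
            nn.Sequential(nn.Flatten(), nn.Linear(in_features, 6))
            for _ in range(num_stns))

    def forward(self, x):
        n = x.size(0)
        # Start from the identity transform [[1, 0, 0], [0, 1, 0]].
        theta = torch.eye(2, 3, device=x.device).flatten().repeat(n, 1)
        warped = x
        for loc in self.locs:
            theta = theta + loc(warped)  # dense fusion of affine updates
            grid = F.affine_grid(theta.view(n, 2, 3), x.size(),
                                 align_corners=False)
            warped = F.grid_sample(x, grid, align_corners=False)
        return warped
```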
Cross-spectral Iris Recognition for Mobile Applications using High-quality Color Images
Mateusz Trokielewicz, Ewelina Bartuzi
With the recent shift towards mobile computing, new challenges for biometric authentication appear on the horizon. This paper provides a comprehensive study of cross-spectral iris recognition in a scenario in which high-quality color images obtained with a mobile phone are used against enrollment images collected in typical, near-infrared setups. Grayscale conversion of the color images that employs selective RGB channel choice depending on the iris coloration is shown to improve the recognition accuracy for some combinations of eye colors and matching software, when compared to using the red channel only, with equal error rates driven down to as low as 2%. The authors are not aware of any other paper focusing on cross-spectral iris recognition in a scenario with near-infrared enrollment using a professional iris recognition setup followed by mobile-based verification employing color images. [1807.04061v1]
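A toy sketch of the selective-channel grayscale conversion; the color-to-channel mapping below is purely illustrative, as the abstract does not give the paper's exact assignment.

```python
import numpy as np

R, G, B = 0, 1, 2  # channel indices in an RGB array

def iris_to_grayscale(rgb, eye_color):
    # Purely illustrative color-to-channel mapping; the paper's actual
    # assignment per iris coloration is not reproduced here.
    channel = {"blue": G, "green": G, "hazel": R, "dark": R}.get(eye_color, R)
    return rgb[..., channel]
```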
DCNN-based Human-Interpretable Post-mortem Iris Recognition
Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz
With post-mortem iris recognition getting increasing attention throughout the biometric and forensic communities, no specific, cadaver-aware recognition methodologies have been proposed to date. This paper makes the first step in assessing the discriminatory capabilities of post-mortem iris images collected at multiple time points after a person’s demise, by proposing a deep convolutional neural network (DCNN) classifier fine-tuned with cadaver iris images. The proposed method is able to learn these features and provide classification of post-mortem irises in a closed-set scenario, proving that even with the onset of post-mortem biological processes after a person’s death, features in the iris remain and can be utilized as a biometric trait. This is also the first work (known to us) to analyze the class-activation maps produced by the DCNN-based iris classifier, and to compare them with attention maps acquired by a gaze-tracking device observing human subjects performing a post-mortem iris recognition task. We show how humans perceive post-mortem irises when challenged with the task of classification, and hypothesize that the proposed DCNN-based method can offer human-intelligible decisions backed by visual explanations, which may be valuable for iris examiners in a forensic/courthouse scenario. [1807.04049v1]
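For reference, a minimal sketch of the standard class-activation-map computation analyzed above, assuming a network that ends in global average pooling followed by a linear classifier.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weights, class_idx):
    """`features`: (C, H, W) maps from the last conv layer;
    `fc_weights`: (num_classes, C) weights of the final linear classifier."""
    cam = torch.einsum("c,chw->hw", fc_weights[class_idx], features)
    cam = F.relu(cam)                      # keep only positive evidence
    return cam / (cam.max() + 1e-8)        # normalize to [0, 1]
```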
FINN-L: Library Extension and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGA
Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, Michaela Blott
It is well known that many types of artificial neural networks, including recurrent networks, can achieve a high classification accuracy even with low-precision weights and activations. The reduction in precision generally yields much more efficient hardware implementations with regard to hardware cost, memory requirements, energy, and achievable throughput. In this paper, we present the first systematic exploration of this design space as a function of precision for a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Specifically, we include an in-depth investigation of precision vs. accuracy using a fully hardware-aware training flow, where quantization of all aspects of the network, including weights, inputs, outputs, and in-memory cell activations, is taken into consideration during training. In addition, hardware resource cost, power consumption, and throughput scalability are explored as a function of precision for FPGA-based implementations of the BiLSTM, together with multiple approaches to parallelizing the hardware. We provide the first open-source HLS library extension of FINN for parameterizable hardware architectures of LSTM layers on FPGAs, which offers full precision flexibility and allows for parameterizable performance scaling with different levels of parallelism within the architecture. Based on this library, we present an FPGA-based accelerator for a BiLSTM neural network designed for optical character recognition, along with numerous other experimental proof points for a Zynq UltraScale+ XCZU7EV MPSoC within the given design space. [1807.04093v1]
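A minimal sketch of the kind of uniform fake-quantization applied during hardware-aware training; the bit-width handling and clipping range are generic assumptions, not FINN-L's exact scheme.

```python
import torch

def fake_quantize(x, bits):
    # Uniform symmetric quantization: clip to [-1, 1] and round to
    # 2^(bits-1) - 1 levels per sign. In training, a straight-through
    # estimator would pass gradients through the rounding.
    scale = 2 ** (bits - 1) - 1
    return torch.clamp(x, -1.0, 1.0).mul(scale).round().div(scale)
```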
How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization
Yandong Li, Liqiang Wang, Tianbao Yang, Boqing Gong
The large volume of video content and high viewing frequency demand automatic video summarization algorithms, of which a key property is the capability of modeling diversity. If videos are lengthy, like hours-long egocentric videos, it is necessary to track the temporal structure of the videos and enforce local diversity. Local diversity means that the shots selected from a short time duration are diverse, while visually similar shots are allowed to co-exist in the summary if they appear far apart in the video. In this paper, we propose a novel probabilistic model, built upon SeqDPP, to dynamically control the time span of a video segment upon which the local diversity is imposed. In particular, we enable SeqDPP to learn to automatically infer how local the local diversity is supposed to be from the input video. The resulting model is difficult to train by the hallmark maximum likelihood estimation (MLE), which further suffers from exposure bias and non-differentiable evaluation metrics. To tackle these problems, we instead devise a reinforcement learning algorithm for training the proposed model. Extensive experiments verify the advantages of our model and the new learning algorithm over MLE-based methods. [1807.04219v1]
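For context, the L-ensemble determinantal point process that SeqDPP applies sequentially assigns a subset $S$ of shots the probability $P(S)=\det(L_S)/\det(L+I)$; a direct NumPy rendering:

```python
import numpy as np

def dpp_prob(L, subset):
    """P(S) = det(L_S) / det(L + I) for an L-ensemble DPP over shots;
    `subset` is a list of selected shot indices."""
    L_S = L[np.ix_(subset, subset)]
    return np.linalg.det(L_S) / np.linalg.det(L + np.eye(len(L)))
```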
Learning Neural Models for End-to-End Clustering
Benjamin Bruno Meier, Ismail Elezi, Mohammadreza Amirian, Oliver Durr, Thilo Stadelmann
We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this “learning to cluster” and show its conceptual difference to deep metric learning, semi-supervised clustering, and other related approaches, while having the advantage of performing learnable clustering fully end-to-end. [1807.04001v1]
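A minimal sketch of the two output heads described above (a distribution over $k$ plus per-point assignment distributions for each candidate $k$); the embedding network producing the features, and all training losses, are omitted.

```python
import torch
import torch.nn as nn

class ClusterHeads(nn.Module):
    """Output heads only: a distribution over k, and for each k a per-point
    assignment distribution. The embedding network is omitted."""
    def __init__(self, d, k_max):
        super().__init__()
        self.k_head = nn.Linear(d, k_max)               # logits for P(k)
        self.assign = nn.ModuleList(
            nn.Linear(d, k + 1) for k in range(k_max))  # heads with 1..k_max outputs

    def forward(self, features):                        # features: (N, d)
        p_k = torch.softmax(self.k_head(features.mean(0)), dim=0)
        assigns = [torch.softmax(head(features), dim=1) for head in self.assign]
        return p_k, assigns
```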
Deep attention-based classification network for robust depth prediction
Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Lingxiao Hang
In this paper, we present our deep attention-based classification (DABC) network for robust single-image depth prediction, in the context of the Robust Vision Challenge 2018 (ROB 2018). Unlike conventional depth prediction, our goal is to design a model that can perform well in both indoor and outdoor scenes with a single parameter set. However, robust depth prediction suffers from two challenging problems: a) How to extract more discriminative features for different scenes (compared to a single scene)? b) How to handle the large differences in depth ranges between indoor and outdoor datasets? To address these two problems, we first formulate depth prediction as a multi-class classification task and apply a softmax classifier to classify the depth label of each pixel. We then introduce a global pooling layer and a channel-wise attention mechanism to adaptively select the discriminative channels of features and to update the original features by assigning higher weights to important channels. Further, to reduce the influence of quantization errors, we employ a soft-weighted sum inference strategy for the final prediction. Experimental results on both indoor and outdoor datasets demonstrate the effectiveness of our method. It is worth mentioning that we won second place in the single-image depth prediction entry of ROB 2018, held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018. [1807.03959v1]
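The soft-weighted sum inference reduces to averaging the depth-bin centers under the softmax posterior, $\hat{d} = \sum_k p_k d_k$; a direct PyTorch rendering:

```python
import torch

def soft_weighted_depth(logits, bin_centers):
    """`logits`: (N, K, H, W) per-pixel class scores over K depth bins;
    `bin_centers`: (K,) depth value represented by each bin."""
    probs = torch.softmax(logits, dim=1)
    return (probs * bin_centers.view(1, -1, 1, 1)).sum(dim=1)  # (N, H, W)
```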
Decision method choice in a human posture recognition context
Stéphane Perrin, Eric Benoit, Didier Coquin
Human posture recognition is a dynamic field that has produced many methods. Using fuzzy-subset-based data fusion methods to aggregate the results given by different types of recognition processes is a convenient way to improve recognition methods. Nevertheless, choosing a defuzzification method to implement the decision is a crucial point of this approach. The goal of this paper is to present an approach where the choice of the defuzzification method is driven by the constraints of the final data user, which are expressed as limitations on indicators like confidence or accuracy. A practical experiment illustrating this approach is presented: from a depth camera sensor, human posture is interpreted and the defuzzification method is selected in accordance with the constraints of the final information consumer. The paper illustrates the interest of the approach in a context of posture-based human-robot communication. [1807.04170v1]
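Two classic defuzzification operators, shown here to make the choice concrete; the approach above selects among such operators according to the end user's confidence/accuracy constraints.

```python
import numpy as np

def defuzzify(memberships, values, method="centroid"):
    # Centroid favours accuracy of the aggregate; max favours confidence
    # in the single best-supported posture.
    if method == "centroid":
        return float(np.sum(memberships * values) / np.sum(memberships))
    if method == "max":
        return float(values[np.argmax(memberships)])
    raise ValueError(method)
```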
A Computational Method for Evaluating UI Patterns
Bardia Doosti, Tao Dong, Biplab Deka, Jeffrey Nichols
UI design languages, such as Google’s Material Design, make applications both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of design languages in the wild. Moreover, designers often get stranded by strongly opinionated debates around the merit of certain UI components, such as the Floating Action Button and the Navigation Drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates through analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated with app ratings, and to some extent, also with the number of installs. Furthermore, we show that the use of UI components varies by app category, suggesting that a more nuanced view is needed in design debates. [1807.04191v1]
MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network
Muhammed Kocabas, Salih Karagoz, Emre Akbas
In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation, and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN), which receives keypoint and person detections and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4-point mAP over the previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. At 23 frames/sec, our method is the fastest real-time system. Source code is available at: https://github.com/mkocabas/pose-residual-network [1807.04067v1]
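A loose sketch of the Pose Residual Network idea, assuming a small residual MLP over keypoint heatmaps cropped to one person detection (the paper describes PRN as a residual network over such inputs); the sizes are illustrative.

```python
import torch
import torch.nn as nn

class PoseResidualNet(nn.Module):
    """Residual MLP over keypoint heatmaps cropped to one person detection;
    the softmax re-normalizes evidence so keypoints of other people in the
    crop are suppressed. Sizes are illustrative."""
    def __init__(self, h=36, w=56, k=17):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(h * w * k, 1024), nn.ReLU(),
            nn.Linear(1024, h * w * k))

    def forward(self, cropped_heatmaps):               # (N, K, H, W)
        x = cropped_heatmaps.flatten(1)
        out = torch.softmax(x + self.mlp(x), dim=1)    # residual connection
        return out.view_as(cropped_heatmaps)
```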
Variational Capsules for Image Analysis and Synthesis
Huaibo Huang, Lingxiao Song, Ran He, Zhenan Sun, Tieniu Tan
A capsule is a group of neurons whose activity vector models different properties of the same entity. This paper extends the capsule to a generative version, named variational capsules (VCs). Each VC produces a latent variable for a specific entity, making it possible to integrate image analysis and image synthesis into a unified framework. Variational capsules model an image as a composition of entities in a probabilistic model. Different capsules’ divergence with a specific prior distribution represents the presence of different entities, which can be applied in image analysis tasks such as classification. In addition, variational capsules encode multiple entities in a semantically-disentangling way. Diverse instantiations of capsules are related to various properties of the same entity, making it easy to generate diverse samples with fine-grained semantic attributes. Extensive experiments demonstrate that deep networks designed with variational capsules can not only achieve promising performance on image analysis tasks (including image classification and attribute prediction) but can also improve the diversity and controllability of image synthesis. [1807.04099v1]
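Under the usual Gaussian-posterior assumption, a capsule's divergence from a standard normal prior has a closed form; a minimal sketch of the per-capsule computation:

```python
import torch

def capsule_divergences(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)) per capsule; `mu`, `logvar`: (num_capsules, d).
    A large divergence signals that the capsule's entity is present."""
    return 0.5 * torch.sum(mu ** 2 + logvar.exp() - 1.0 - logvar, dim=1)
```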
CG-DIQA: No-reference Document Image Quality Assessment Based on Character Gradient
Hongyu Li, Fan Zhu, Junhua Qiu
Document image quality assessment (DIQA) is an important and challenging problem in real applications. In order to predict the quality scores of document images, this paper proposes a novel no-reference DIQA method based on character gradient, where OCR accuracy is used as the ground-truth quality metric. The character gradient is computed on character patches detected with a method based on maximally stable extremal regions (MSER). Character patches are essential to character recognition and therefore well suited to estimating document image quality. Experiments on a benchmark dataset show that the proposed method outperforms the state-of-the-art methods in estimating the quality score of document images. [1807.04047v1]
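A hedged sketch of the character-gradient idea: detect candidate character patches with MSER and use the mean gradient magnitude over those patches as a quality proxy; the paper's exact scoring formula may differ.

```python
import cv2
import numpy as np

def character_gradient_score(gray):
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)      # candidate character patches
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    # Mean gradient magnitude over detected character pixels as quality proxy.
    scores = [mag[pts[:, 1], pts[:, 0]].mean() for pts in regions if len(pts)]
    return float(np.mean(scores)) if scores else 0.0
```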
With Friends Like These, Who Needs Adversaries?
Saumya Jetley*, Nicholas A. Lord*, Philip H. S. Torr
The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks (DCNs) that shed new light on their behaviour and its connection to the problem of adversaries, with two key results. The first is a straightforward explanation of the existence of universal adversarial perturbations and their association with specific class identities, obtained by analysing the properties of nets’ logit responses as functions of 1D movements along specific image-space directions. The second is the clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Prior work has noted the importance of low-dimensional subspaces in adversarial vulnerability: we illustrate that this likewise represents the nets’ notion of saliency. In all, we provide a digestible perspective from which to understand previously reported results which have appeared disjoint or contradictory, with implications for efforts to construct neural nets that are both accurate and robust to adversarial attack. [1807.04200v1]
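The 1D analysis described above amounts to probing the logits along a fixed image-space direction; a minimal sketch:

```python
import torch

def logits_along_direction(model, image, direction, alphas):
    """Evaluate the network's logits at image + alpha * direction for a range
    of step sizes alpha; `image` and `direction` are (C, H, W) tensors."""
    outs = []
    with torch.no_grad():
        for a in alphas:
            outs.append(model((image + a * direction).unsqueeze(0))[0])
    return torch.stack(outs)      # (len(alphas), num_classes)
```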
Generative Adversarial Networks with Decoder-Encoder Output Noise
Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang
In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAE) was proposed in 2013; it uses variational inference to learn a latent space from the image database and then generates images using the decoder. Generative adversarial networks (GANs) came out as a promising framework, using adversarial training to improve the generative ability of the generator. However, the images generated by GANs are generally blurry. Deep convolutional generative adversarial networks (DCGANs) were then proposed to improve the quality of generated images. Since the input noise vectors are randomly sampled from a Gaussian distribution, the generator has to map the whole normal distribution to the images. This makes DCGANs unable to reflect the inherent structure of the training data. In this paper, we propose a novel deep model, called generative adversarial networks with decoder-encoder output noise (DE-GANs), which takes advantage of both adversarial training and variational Bayesian inference to improve the performance of image generation. DE-GANs use a pre-trained decoder-encoder architecture to map the random Gaussian noise vectors to informative ones and pass them to the generator of the adversarial networks. Since the decoder-encoder architecture is trained on the same images as the generator, the output vectors can carry the intrinsic distribution information of the original images. Moreover, the loss function of DE-GANs differs from that of GANs and DCGANs: a hidden-space loss function is added to the adversarial loss function to enhance the robustness of the model. Extensive empirical results show that DE-GANs can accelerate the convergence of the adversarial training process and improve the quality of the generated images. [1807.03923v1]
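A minimal sketch of the DE-GAN noise pipeline: a pre-trained decoder-encoder maps raw Gaussian noise to informative latent vectors before the generator sees them. The module sizes and the exact form of the hidden-space loss are our assumptions.

```python
import torch
import torch.nn as nn

z_dim = 100  # latent dimensionality (illustrative)

# Pre-trained decoder-encoder pair; 784 = 28*28 images here, purely illustrative.
decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784))
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, z_dim))

def informative_noise(batch_size):
    z = torch.randn(batch_size, z_dim)   # raw Gaussian noise
    with torch.no_grad():                # decoder-encoder stays frozen
        return encoder(decoder(z))       # informative latent vectors for G

def hidden_space_loss(z_fake, z_real):
    # Assumed form of the hidden-space term: match latents of generated
    # images to latents encoded from real images.
    return torch.mean((z_fake - z_real) ** 2)
```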
Data-Driven Segmentation of Post-mortem Iris Images
Mateusz Trokielewicz, Adam Czajka
This paper presents a method for segmenting iris images obtained from deceased subjects, by training a deep convolutional neural network (DCNN) designed for the purpose of semantic segmentation. Post-mortem iris recognition has recently emerged as an alternative, or additional, method useful in forensic analysis. At the same time it poses many new challenges from the technological standpoint, one of them being the image segmentation stage, which has proven difficult to execute reliably with conventional iris recognition methods. Our approach is based on the SegNet architecture, fine-tuned with 1,300 manually segmented post-mortem iris images taken from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database. The experiments presented in this paper show that this data-driven solution is able to learn specific deformations present in post-mortem samples, which are absent from live irises, and offers a considerable improvement over the state-of-the-art conventional segmentation algorithm (OSIRIS): the Intersection over Union (IoU) metric was improved from 73.6% (for OSIRIS) to 83% (for the DCNN-based method presented in this paper), averaged over multiple subject-disjoint splits of the data into train and test subsets. This paper offers the first method known to us for automatic processing of post-mortem iris images. We offer source code with the trained DCNN that performs end-to-end segmentation of post-mortem iris images, as described in this paper. Also, we offer binary masks corresponding to the manual segmentation of samples from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database, to facilitate the development of alternative methods for post-mortem iris segmentation. [1807.04154v1]
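For reference, the Intersection over Union metric quoted above, computed on binary iris masks:

```python
import numpy as np

def iou(pred_mask, gt_mask):
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```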
Underwater Image Haze Removal and Color Correction with an Underwater-ready Dark Channel Prior
Tomasz Łuczyński, Andreas Birk
Underwater images suffer from extremely unfavourable conditions. Light is heavily attenuated and scattered. Attenuation creates a change in hue, while scattering causes so-called veiling light. State-of-the-art methods for enhancing image quality in general are either unreliable or cannot easily be used in underwater operations. On the other hand, there is a well-known method for haze removal in air, called the Dark Channel Prior. Even though there are known adaptations of this method to underwater applications, they do not always work correctly. This work elaborates and improves upon the initial concept presented in [1]. A modification to the Dark Channel Prior is proposed that allows for an easy application to underwater images. It is also shown that our method outperforms competing solutions based on the Dark Channel Prior. Experiments on real-life data collected within the DexROV project are also presented, showing the robustness and high performance of the proposed algorithm. [1807.04169v1]
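A hedged sketch of an underwater-ready dark channel: since red light is attenuated fastest underwater, a common adaptation takes the per-pixel minimum over green and blue only before the local minimum filter. The paper's specific modification may differ.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def underwater_dark_channel(img, patch=15):
    """`img`: float RGB array of shape (H, W, 3). The red channel is excluded
    from the per-pixel minimum because it is attenuated fastest underwater."""
    gb_min = img[..., 1:3].min(axis=2)         # min over green and blue only
    return minimum_filter(gb_min, size=patch)  # local patch minimum
```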
Model-based free-breathing cardiac MRI reconstruction using deep learned \& STORM priors: MoDL-STORM
Sampurna Biswas, Hemant K. Aggarwal, Sunrita Poddar, Mathews Jacob
We introduce a model-based reconstruction framework with deep learned (DL) and smoothness regularization on manifolds (STORM) priors to recover free breathing and ungated (FBU) cardiac MRI from highly undersampled measurements. The DL priors enable us to exploit the local correlations, while the STORM prior enables us to make use of the extensive non-local similarities that are subject dependent. We introduce a novel model-based formulation that allows the seamless integration of deep learning methods with available prior information, which current deep learning algorithms are not capable of. The experimental results demonstrate the preliminary potential of this work in accelerating FBU cardiac MRI. [1807.03845v1]
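A plausible form of the combined objective, reading the description above as the usual data-consistency term plus the two priors (our reconstruction, not the paper's exact formulation):

$$\mathbf{x}^{*} = \arg\min_{\mathbf{x}} \ \|\mathcal{A}(\mathbf{x})-\mathbf{b}\|_2^2 + \lambda_1 \|\mathbf{x}-\mathcal{D}_w(\mathbf{x})\|^2 + \lambda_2 \operatorname{tr}\!\left(\mathbf{x}^{H}\mathbf{L}\,\mathbf{x}\right),$$

where $\mathcal{A}$ is the undersampled acquisition operator, $\mathcal{D}_w$ the learned denoiser (DL prior), and $\mathbf{L}$ the subject-dependent STORM manifold Laplacian.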
An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm
Shin Kamada, Takumi Ichimura
Deep Belief Network (DBN) has a deep architecture that represents multiple features of input patterns hierarchically with pre-trained Restricted Boltzmann Machines (RBMs). A traditional RBM or DBN model cannot change its network structure during the learning phase. Our proposed adaptive learning method can discover the optimal number of hidden neurons, weights, and/or layers according to the input space. Such adaptivity is important for controlling the computational cost and the model stability. Maintaining a sparse network structure is also a considerable problem, since extraction of explicit knowledge from the trained network may be required. In our previous research, we developed a hybrid method combining an adaptive structural learning method for RBMs with a learning-forgetting method applied to the trained RBM. In this paper, we propose an adaptive learning method for DBNs that can determine the optimal number of layers during learning. We evaluated our proposed model on several benchmark data sets. [1807.03486v2]
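A toy rendering of a layer-generation criterion in the spirit of the method above: stack a new RBM when the current top layer's energy has converged but its parameters still fluctuate strongly. The thresholds, and the criterion itself, are assumptions.

```python
def should_add_layer(energy_delta, param_variance,
                     conv_tol=1e-3, var_thresh=0.5):
    # Stack a new RBM when the top layer's energy has converged but its
    # parameters still fluctuate strongly (assumed criterion and thresholds).
    return abs(energy_delta) < conv_tol and param_variance > var_thresh
```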
Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping
Chuhui Xue, Shijian Lu, Fangneng Zhan
This paper presents a scene text detection technique that exploits bootstrapping and text border semantics for accurate localization of texts in scenes. A novel bootstrapping technique is designed which samples multiple ‘subsections’ of a word or text line and thereby effectively relieves the constraint of limited training data. At the same time, the repeated sampling of text ‘subsections’ improves the consistency of the predicted text feature maps, which is critical for predicting a single complete box instead of multiple broken boxes for long words or text lines. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for each scene text. With semantics-aware text borders, scene texts can be localized more accurately by regressing text pixels around the ends of words or text lines instead of all text pixels, which often leads to inaccurate localization when dealing with long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained over several public datasets, e.g., an f-score of 80.1 on MSRA-TD500 and 67.1 on ICDAR2017-RCTW. [1807.03547v2]
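A minimal sketch of the bootstrapping step: randomly sample a horizontal 'subsection' of a word or text-line box to augment limited training data; the sampling bounds are assumptions.

```python
import random

def sample_subsection(box, min_frac=0.5):
    """Sample a horizontal 'subsection' of a word/text-line box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    sub_w = random.uniform(min_frac, 1.0) * w
    sub_x1 = x1 + random.uniform(0.0, w - sub_w)
    return (sub_x1, y1, sub_x1 + sub_w, y2)
```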
“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention
Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin, Jiebo Luo
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, or negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels and adaptively computes how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms the state-of-the-art approaches, without using extra ground-truth supervision. [1807.03871v1]
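A standalone sketch of the two-group blending mechanism; the real model applies it inside each LSTM gate, and the gate parameterization here is an assumption.

```python
import torch
import torch.nn as nn

class StyleFactualLinear(nn.Module):
    """One factual and one stylized weight matrix, blended per step by a gate
    predicted from the current context vector."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.factual = nn.Linear(d_in, d_out)
        self.stylized = nn.Linear(d_in, d_out)
        self.gate = nn.Linear(d_in, 1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))    # word-level weight in (0, 1)
        return g * self.factual(x) + (1 - g) * self.stylized(x)
```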
Vision System for AGI: Problems and Directions
Alexey Potapov, Sergey Rodionov, Maxim Peterson, Oleg Shcherbakov, Innokentii Zhdanov, Nikolai Skorobogatko
What frameworks and architectures are necessary to create a vision system for AGI? In this paper, we propose a formal model that states the task of perception within AGI. We show the role of discriminative and generative models in achieving efficient and general solution of this task, thus specifying the task in more detail. We discuss some existing generative and discriminative models and demonstrate their insufficiency for our purposes. Finally, we discuss some architectural dilemmas and open questions. [1807.03887v1]
Deep Imbalanced Attribute Classification using Visual Attention Aggregation
Nikolaos Sarafianos, Ioannis A. Kakadiaris
For many computer vision applications such as image description and human identification, recognizing the visual attributes of humans is an essential yet challenging problem. Its challenges originate from its multi-label nature, the large underlying class imbalance, and the lack of spatial annotations. Existing methods either follow a computer vision approach while failing to account for class imbalance, or explore machine learning solutions that disregard the spatial and semantic relations present in the images. With that in mind, we propose an effective method that extracts and aggregates visual attention masks at different scales. We introduce a loss function to handle class imbalance both at the class and at the instance level, and further demonstrate that penalizing attention masks with high prediction variance accounts for the weak supervision of the attention mechanism. By identifying and addressing these challenges, we achieve state-of-the-art results with a simple attention mechanism on both the PETA and WIDER-Attribute datasets without additional context or side information. [1807.03903v1]
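A sketch of class-level re-weighting for imbalance, using a generic inverse-frequency scheme; the paper's exact class/instance-level formulation and the attention-variance penalty are not reproduced here.

```python
import torch
import torch.nn.functional as F

def imbalance_weighted_bce(logits, targets, pos_freq):
    """`pos_freq`: (num_attributes,) positive rate of each attribute in (0, 1);
    rare positives are up-weighted by inverse frequency (assumed scheme)."""
    w_pos = (1.0 - pos_freq) / pos_freq
    return F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=w_pos, reduction="mean")
```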
An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm
Shin Kamada, Takumi Ichimura
Restricted Boltzmann Machine (RBM) is a generative stochastic energy-based artificial neural network model for unsupervised learning. RBM is now well known as a pre-training method for deep learning. In addition to visible and hidden neurons, the structure of an RBM includes a number of parameters, such as the weights between neurons and their coefficients. It can therefore be difficult to determine an optimal network structure for analyzing big data. To avoid this problem, we investigated the variance of the parameters to find an optimal structure during learning, since fluctuations of the energy function in the RBM model are reflected in the variance of its parameters. In this paper, we propose an adaptive learning method for RBMs that can discover an optimal number of hidden neurons according to the training situation by applying a neuron generation and annihilation algorithm. In this method, a new hidden neuron is generated if the energy function has not yet converged and the variance of the parameters is large; conversely, an inactive hidden neuron is annihilated if it does not affect the learning situation. Experimental results on several benchmark data sets are discussed. [1807.03478v2]
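An illustrative rendering of the generation/annihilation rule described above; the thresholds are assumptions.

```python
import numpy as np

def adapt_hidden_neurons(weights, d_energy, activity,
                         var_thresh=0.5, act_thresh=0.05, conv_tol=1e-3):
    """`weights`: (n_visible, n_hidden); `activity`: (n_hidden,) mean activations."""
    if abs(d_energy) > conv_tol and np.var(weights) > var_thresh:
        # Generation: energy not yet converged and parameters fluctuate,
        # so add a hidden unit with small random weights.
        weights = np.hstack([weights, 0.01 * np.random.randn(weights.shape[0], 1)])
        activity = np.append(activity, 1.0)
    keep = activity > act_thresh          # annihilation of inactive units
    return weights[:, keep], activity[keep]
```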
Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition
Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris
In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good trade-offs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvements in model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% while improving accuracy by 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% of FLOPs with slightly better word error rates, showing good generalization across domains. [1807.03848v1]
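A minimal two-branch block showing the big/little idea: a cheap branch runs at half resolution and is merged back into the full-resolution branch. The channel and scale choices are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigLittleBlock(nn.Module):
    """A full-resolution 'big' branch and a half-resolution 'little' branch,
    merged by addition; assumes even spatial dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.big = nn.Conv2d(channels, channels, 3, padding=1)
        self.little = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        big = self.big(x)                           # full resolution
        small = self.little(F.avg_pool2d(x, 2))     # cheap half-resolution path
        little = F.interpolate(small, scale_factor=2, mode="bilinear",
                               align_corners=False)
        return F.relu(big + little)                 # frequent feature merging
```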