Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition + DeSTNet: Densely Fused Spatial Transformer Networks + Learning Neural Models for End-to-End Clustering

Presentation Attack Detection for Cadaver Irises

Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

This paper presents a deep-learning-based method for iris presentation attack detection (PAD) when iris images are obtained from deceased people. Our approach is based on the VGG-16 architecture fine-tuned with a database of 574 post-mortem, near-infrared iris images from the Warsaw-BioBase-PostMortem-Iris-v1 database, complemented by a dataset of 256 images of live irises collected within the scope of this study. Experiments described in this paper show that our approach is able to correctly classify iris images as representing either a live or a dead eye in almost 99% of the trials, averaged over 20 subject-disjoint train/test splits. We also show that post-mortem iris detection accuracy increases as time since death elapses, and that we are able to construct a classification system with APCER=0%@BPCER=1% (Attack Presentation and Bona Fide Presentation Classification Error Rates, respectively) when only post-mortem samples collected at least 16 hours after death are considered. Since acquisitions of ante- and post-mortem samples differ significantly, we applied countermeasures to minimize bias in our classification methodology caused by image properties that are not related to PAD. This included using the same iris sensor in the collection of ante- and post-mortem samples, and analyzing class activation maps to ensure that the discriminant iris regions utilized by our classifier are related to properties of the eye, and not to those of the acquisition protocol. This paper offers, to our knowledge, the first PAD method for a post-mortem setting, together with an explanation of the decisions made by the convolutional neural network. Along with the paper we offer source codes, weights of the trained network, and a dataset of live iris images to facilitate reproducibility and further research. [1807.04058v1]


Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound

Nicolo’ Savioli, Silvia Visentin, Erich Cosmi, Enrico Grisan, Pablo Lamata, Giovanni Montana

The automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present our attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a Convolution Gated Recurrent Unit (C-GRU) for enforcing temporal coherence across video frames and exploiting the temporal redundancy of the signal, and a regularized loss function, called CyclicLoss, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, providing an average reduction of the mean squared error from $0.31 mm^2$ (state-of-the-art) to $0.09 mm^2$, and a relative error reduction from $8.1\%$ to $5.3\%$. The mean execution speed of the proposed approach, 289 frames per second, makes it suitable for real-time clinical use. [1807.04056v1]
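The abstract's CyclicLoss is a regularizer encoding the cardiac signal's periodicity. As a rough illustration only (the paper's exact formulation, weighting, and period handling are not reproduced here), a periodicity-regularized loss can be sketched in plain numpy:

```python
import numpy as np

def cyclic_loss(pred, target, period, weight=0.1):
    """Illustrative periodicity-regularized loss, assuming a known integer
    period in samples. Not the paper's exact CyclicLoss."""
    # Standard mean squared error between predicted and reference diameters.
    mse = np.mean((pred - target) ** 2)
    # Periodicity penalty: samples exactly one cycle apart should agree,
    # so a perfectly periodic prediction incurs zero extra cost.
    cyc = np.mean((pred[period:] - pred[:-period]) ** 2)
    return mse + weight * cyc
```

A perfectly periodic, error-free prediction yields zero loss; any aperiodic drift is penalized on top of the pointwise error.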


DeSTNet: Densely Fused Spatial Transformer Networks

Roberto Annunziata, Christos Sagonas, Jacques Calì

Modern Convolutional Neural Networks (CNN) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs with the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose the Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks, namely MNIST, GTSRB, and IDocDB, show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness. [1807.04050v1]


Cross-spectral Iris Recognition for Mobile Applications using High-quality Color Images

Mateusz Trokielewicz, Ewelina Bartuzi

With the recent shift towards mobile computing, new challenges for biometric authentication appear on the horizon. This paper provides a comprehensive study of cross-spectral iris recognition in a scenario in which high-quality color images obtained with a mobile phone are used against enrollment images collected in typical, near-infrared setups. Grayscale conversion of the color images that employs selective RGB channel choice depending on the iris coloration is shown to improve the recognition accuracy for some combinations of eye colors and matching software, when compared to using the red channel only, with equal error rates driven down to as low as 2%. The authors are not aware of any other paper focusing on cross-spectral iris recognition in a scenario with near-infrared enrollment using a professional iris recognition setup followed by mobile-based verification employing color images. [1807.04061v1]


DCNN-based Human-Interpretable Post-mortem Iris Recognition

Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

With post-mortem iris recognition getting increasing attention throughout the biometric and forensic communities, no specific, cadaver-aware recognition methodologies have been proposed to date. This paper makes the first step in assessing the discriminatory capabilities of post-mortem iris images collected at multiple time points after a person’s demise, by proposing a deep convolutional neural network (DCNN) classifier fine-tuned with cadaver iris images. The proposed method is able to learn the discriminatory features and provide classification of post-mortem irises in a closed-set scenario, proving that even with the onset of post-mortem biological processes after a person’s death, features in their irises remain and can be utilized as a biometric trait. This is also the first work (known to us) to analyze the class-activation maps produced by the DCNN-based iris classifier, and to compare them with attention maps acquired by a gaze-tracking device observing human subjects performing a post-mortem iris recognition task. We show how humans perceive post-mortem irises when challenged with the task of classification, and hypothesize that the proposed DCNN-based method can offer human-intelligible decisions backed by visual explanations, which may be valuable for iris examiners in a forensic/courthouse scenario. [1807.04049v1]


FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs

Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, Michaela Blott

It is well known that many types of artificial neural networks, including recurrent networks, can achieve a high classification accuracy even with low-precision weights and activations. The reduction in precision generally yields much more efficient hardware implementations with regard to hardware cost, memory requirements, energy, and achievable throughput. In this paper, we present the first systematic exploration of this design space as a function of precision for Bidirectional Long Short-Term Memory (BiLSTM) neural networks. Specifically, we include an in-depth investigation of precision vs. accuracy using a fully hardware-aware training flow, in which quantization of all aspects of the network, including weights, inputs, outputs and in-memory cell activations, is taken into consideration during training. In addition, hardware resource cost, power consumption and throughput scalability are explored as a function of precision for FPGA-based implementations of BiLSTM, along with multiple approaches to parallelizing the hardware. We provide the first open-source HLS library extension of FINN for parameterizable hardware architectures of LSTM layers on FPGAs, which offers full precision flexibility and allows for parameterizable performance scaling with different levels of parallelism within the architecture. Based on this library, we present an FPGA-based accelerator for a BiLSTM neural network designed for optical character recognition, along with numerous other experimental proof points for a Zynq UltraScale+ XCZU7EV MPSoC within the given design space. [1807.04093v1]
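The core operation behind precision/accuracy trade-offs like the one explored here is a quantizer that maps full-precision values onto a small set of levels. As a generic sketch (a plain symmetric uniform quantizer, not FINN-L's specific scheme; thresholds and scaling are illustrative):

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantizer: maps x onto at most 2**bits - 1 levels
    spanning [-max|x|, +max|x|]. Assumes x is not all zeros. A generic
    sketch of low-precision quantization, not the paper's exact scheme."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax    # step size between levels
    q = np.round(x / scale)             # integer code per element
    return np.clip(q, -qmax, qmax) * scale
```

At 8 bits the quantized weights are nearly indistinguishable from the originals; at 2 bits only three levels survive, which is the regime where hardware savings are largest and accuracy must be checked.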


How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization

Yandong Li, Liqiang Wang, Tianbao Yang, Boqing Gong

The large volume of video content and high viewing frequency demand automatic video summarization algorithms, a key property of which is the capability of modeling diversity. If videos are lengthy, like hours-long egocentric videos, it is necessary to track the temporal structure of the videos and enforce local diversity: shots selected within a short time span should be diverse, but visually similar shots are allowed to co-exist in the summary if they appear far apart in the video. In this paper, we propose a novel probabilistic model, built upon SeqDPP, to dynamically control the time span of a video segment upon which the local diversity is imposed. In particular, we enable SeqDPP to learn to automatically infer, from the input video, how local the local diversity is supposed to be. The resulting model is very involved to train by maximum likelihood estimation (MLE), which further suffers from exposure bias and non-differentiable evaluation metrics. To tackle these problems, we instead devise a reinforcement learning algorithm for training the proposed model. Extensive experiments verify the advantages of our model and the new learning algorithm over MLE-based methods. [1807.04219v1]


Learning Neural Models for End-to-End Clustering

Benjamin Bruno Meier, Ismail Elezi, Mohammadreza Amirian, Oliver Durr, Thilo Stadelmann

We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this “learning to cluster” and show its conceptual difference to deep metric learning, semi-supervised clustering and other related approaches, while having the advantage of performing learnable clustering fully end-to-end. [1807.04001v1]


Deep attention-based classification network for robust depth prediction

Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Lingxiao Hang

In this paper, we present our deep attention-based classification (DABC) network for robust single image depth prediction, in the context of the Robust Vision Challenge 2018 (ROB 2018). Unlike conventional depth prediction, our goal is to design a model that can perform well in both indoor and outdoor scenes with a single parameter set. However, robust depth prediction suffers from two challenging problems: a) How to extract more discriminative features for different scenes (compared to a single scene)? b) How to handle the large differences in depth range between indoor and outdoor datasets? To address these two problems, we first formulate depth prediction as a multi-class classification task and apply a softmax classifier to classify the depth label of each pixel. We then introduce a global pooling layer and a channel-wise attention mechanism to adaptively select the discriminative channels of features and to update the original features by assigning higher weights to important channels. Further, to reduce the influence of quantization errors, we employ a soft-weighted sum inference strategy for the final prediction. Experimental results on both indoor and outdoor datasets demonstrate the effectiveness of our method. It is worth mentioning that we won 2nd place in the single image depth prediction entry of ROB 2018, held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018. [1807.03959v1]
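The soft-weighted sum inference mentioned above replaces the argmax over depth bins with an expectation under the softmax distribution, which smooths out quantization error. A minimal sketch of that idea (bin centres and logits are illustrative; the paper's discretization is not reproduced):

```python
import numpy as np

def soft_weighted_depth(logits, bin_centers):
    """Soft-weighted sum inference for depth-as-classification: weight each
    depth-bin centre by its softmax probability instead of taking the argmax
    bin. A generic sketch of the strategy, not the paper's exact model."""
    e = np.exp(logits - np.max(logits))  # numerically stable softmax
    p = e / e.sum()
    return float(np.dot(p, bin_centers))
```

When probability mass is split between neighbouring bins, the expectation lands between their centres, recovering sub-bin precision that a hard classification decision would throw away.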


Decision method choice in a human posture recognition context

Stéphane Perrin, Eric Benoit, Didier Coquin

Human posture recognition is a dynamic field that has produced many methods. Using fuzzy-subset-based data fusion methods to aggregate the results given by different types of recognition processes is a convenient way to improve recognition methods. Nevertheless, choosing a defuzzification method to implement the decision is a crucial point of this approach. The goal of this paper is to present an approach where the choice of the defuzzification method is driven by the constraints of the final data user, which are expressed as limitations on indicators like confidence or accuracy. A practical experiment illustrating this approach is presented: from a depth camera sensor, human posture is interpreted and the defuzzification method is selected in accordance with the constraints of the final information consumer. The paper illustrates the interest of the approach in a context of posture-based human-robot communication. [1807.04170v1]
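Since the paper's decision step hinges on which defuzzification method is chosen, it may help to see two standard candidates side by side. A sketch of two common defuzzifiers over a sampled membership function (these are textbook methods, not necessarily the ones the paper selects):

```python
import numpy as np

def centroid_defuzz(x, mu):
    """Centre-of-gravity defuzzification: membership-weighted mean of the
    support. Assumes mu has at least one nonzero entry."""
    return float(np.sum(x * mu) / np.sum(mu))

def mom_defuzz(x, mu):
    """Mean-of-maxima defuzzification: average of the support points where
    membership is maximal."""
    return float(np.mean(x[mu == mu.max()]))
```

For a symmetric membership function both agree; for a skewed one the centroid shifts toward the heavier tail while mean-of-maxima stays at the peak, which is exactly the kind of behavioural difference a confidence- or accuracy-constrained consumer would select between.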


A Computational Method for Evaluating UI Patterns

Bardia Doosti, Tao Dong, Biplab Deka, Jeffrey Nichols

UI design languages, such as Google’s Material Design, make applications both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of design languages in the wild. Moreover, designers often get stranded by strongly opinionated debates around the merit of certain UI components, such as the Floating Action Button and the Navigation Drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates through analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated with app ratings, and to some extent, also with the number of installs. Furthermore, we show that use of UI components varies by app category, suggesting that a more nuanced view is needed in design debates. [1807.04191v1]


MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network

Muhammed Kocabas, Salih Karagoz, Emre Akbas

In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN), which receives keypoint and person detections and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4-point mAP over the previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. Our method is the fastest real-time system, at 23 frames/sec. Source code is available at: [1807.04067v1]
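The PRN is a learned module, but the assignment problem it solves — which detected person does each keypoint belong to — can be made concrete with a naive nearest-centre baseline. This is purely illustrative and not the paper's method:

```python
import numpy as np

def assign_keypoints(keypoints, person_centers):
    """Assign each detected keypoint to the nearest person centre.
    A naive geometric baseline for the assignment problem that the
    learned Pose Residual Network addresses; not the paper's approach."""
    kp = np.asarray(keypoints, dtype=float)[:, None, :]       # (K, 1, 2)
    pc = np.asarray(person_centers, dtype=float)[None, :, :]  # (1, P, 2)
    d = np.linalg.norm(kp - pc, axis=-1)                      # (K, P) distances
    return d.argmin(axis=1)                                   # person index per keypoint
```

A learned assigner like the PRN can outperform such a baseline precisely where geometry alone is ambiguous, e.g. for overlapping people.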


Variational Capsules for Image Analysis and Synthesis

Huaibo Huang, Lingxiao Song, Ran He, Zhenan Sun, Tieniu Tan

A capsule is a group of neurons whose activity vector models different properties of the same entity. This paper extends the capsule to a generative version, named variational capsules (VCs). Each VC produces a latent variable for a specific entity, making it possible to integrate image analysis and image synthesis into a unified framework. Variational capsules model an image as a composition of entities in a probabilistic model. Different capsules’ divergence with a specific prior distribution represents the presence of different entities, which can be applied in image analysis tasks such as classification. In addition, variational capsules encode multiple entities in a semantically-disentangling way. Diverse instantiations of capsules are related to various properties of the same entity, making it easy to generate diverse samples with fine-grained semantic attributes. Extensive experiments demonstrate that deep networks designed with variational capsules can not only achieve promising performance on image analysis tasks (including image classification and attribute prediction) but can also improve the diversity and controllability of image synthesis. [1807.04099v1]


CG-DIQA: No-reference Document Image Quality Assessment Based on Character Gradient

Hongyu Li, Fan Zhu, Junhua Qiu

Document image quality assessment (DIQA) is an important and challenging problem in real applications. In order to predict the quality scores of document images, this paper proposes a novel no-reference DIQA method based on character gradient, where OCR accuracy is used as the ground-truth quality metric. The character gradient is computed on character patches detected with a maximally stable extremal regions (MSER) based method. Character patches are essential to character recognition and therefore suitable for use in estimating document image quality. Experiments on a benchmark dataset show that the proposed method outperforms the state-of-the-art methods in estimating the quality score of document images. [1807.04047v1]
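The abstract's pipeline — detect character patches, then score their gradients — can be sketched with the patch-detection step assumed given. The scoring below (mean gradient magnitude over boxes) is a generic sharpness proxy, not the paper's exact character-gradient formula, and the MSER detection is omitted:

```python
import numpy as np

def patch_gradient_score(img, patches):
    """Mean gradient magnitude over character patches as a no-reference
    sharpness proxy. `patches` are (row, col, height, width) boxes assumed
    to come from a detector such as MSER. Illustrative only."""
    gy, gx = np.gradient(img.astype(float))   # finite-difference gradients
    mag = np.hypot(gx, gy)                    # per-pixel gradient magnitude
    scores = [mag[r:r + h, c:c + w].mean() for r, c, h, w in patches]
    return float(np.mean(scores))
```

Restricting the score to character patches, rather than the whole page, is what keeps large blank regions from diluting the quality estimate.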


With Friends Like These, Who Needs Adversaries?

Saumya Jetley*, Nicholas A. Lord*, Philip H. S. Torr

The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks (DCNs) that shed new light on their behaviour and its connection to the problem of adversaries, with two key results. The first is a straightforward explanation of the existence of universal adversarial perturbations and their association with specific class identities, obtained by analysing the properties of nets’ logit responses as functions of 1D movements along specific image-space directions. The second is the clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Prior work has noted the importance of low-dimensional subspaces in adversarial vulnerability: we illustrate that this likewise represents the nets’ notion of saliency. In all, we provide a digestible perspective from which to understand previously reported results which have appeared disjoint or contradictory, with implications for efforts to construct neural nets that are both accurate and robust to adversarial attack. [1807.04200v1]


Generative Adversarial Networks with Decoder-Encoder Output Noise

Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang

In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAE) was proposed in 2013; it uses variational inference to learn a latent space from the image database and then generates images using the decoder. Generative adversarial networks (GANs) came out as a promising framework, which uses adversarial training to improve the generative ability of the generator. However, the images generated by GANs are generally blurry. Deep convolutional generative adversarial networks (DCGANs) were then proposed to improve the quality of generated images. Since the input noise vectors are randomly sampled from a Gaussian distribution, the generator has to map from a whole normal distribution to the images. This makes DCGANs unable to reflect the inherent structure of the training data. In this paper, we propose a novel deep model, called generative adversarial networks with decoder-encoder output noise (DE-GANs), which takes advantage of both adversarial training and variational Bayesian inference to improve the performance of image generation. DE-GANs use a pre-trained decoder-encoder architecture to map the random Gaussian noise vectors to informative ones and pass them to the generator of the adversarial networks. Since the decoder-encoder architecture is trained on the same images as the generator, the output vectors can carry the intrinsic distribution information of the original images. Moreover, the loss function of DE-GANs differs from that of GANs and DCGANs: a hidden-space loss function is added to the adversarial loss function to enhance the robustness of the model. Extensive empirical results show that DE-GANs can accelerate the convergence of the adversarial training process and improve the quality of the generated images. [1807.03923v1]


Data-Driven Segmentation of Post-mortem Iris Images

Mateusz Trokielewicz, Adam Czajka

This paper presents a method for segmenting iris images obtained from deceased subjects, by training a deep convolutional neural network (DCNN) designed for the purpose of semantic segmentation. Post-mortem iris recognition has recently emerged as an alternative, or additional, method useful in forensic analysis. At the same time it poses many new challenges from the technological standpoint, one of them being the image segmentation stage, which has proven difficult to execute reliably with conventional iris recognition methods. Our approach is based on the SegNet architecture, fine-tuned with 1,300 manually segmented post-mortem iris images taken from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database. The experiments presented in this paper show that this data-driven solution is able to learn specific deformations present in post-mortem samples, which are missing from live irises, and offers a considerable improvement over the state-of-the-art, conventional segmentation algorithm (OSIRIS): the Intersection over Union (IoU) metric was improved from 73.6% (for OSIRIS) to 83% (for the DCNN-based method presented in this paper), averaged over subject-disjoint, multiple splits of the data into train and test subsets. This paper offers, to our knowledge, the first method for automatic processing of post-mortem iris images. We offer source codes with the trained DCNN that performs end-to-end segmentation of post-mortem iris images, as described in this paper. Also, we offer binary masks corresponding to manual segmentation of samples from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database to facilitate development of alternative methods for post-mortem iris segmentation. [1807.04154v1]
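The IoU metric used to compare OSIRIS (73.6%) with the DCNN (83%) is standard and easy to state precisely. A minimal implementation for binary segmentation masks (the empty-mask convention below is one common choice, not taken from the paper):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union between two binary segmentation masks."""
    a = np.asarray(mask_a).astype(bool)
    b = np.asarray(mask_b).astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement (a convention)
    return float(np.logical_and(a, b).sum() / union)
```

Averaging this per-image score over the subject-disjoint test splits gives the kind of aggregate figure the abstract reports.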


Underwater Image Haze Removal and Color Correction with an Underwater-ready Dark Channel Prior

Tomasz Łuczyński, Andreas Birk

Underwater images suffer from extremely unfavourable conditions: light is heavily attenuated and scattered. Attenuation creates a change in hue, while scattering causes so-called veiling light. General state-of-the-art methods for enhancing image quality are either unreliable or cannot be easily used in underwater operations. On the other hand, there is a well-known method for haze removal in air, called the Dark Channel Prior. Even though there are known adaptations of this method to underwater applications, they do not always work correctly. This work elaborates and improves upon the initial concept presented in [1]. A modification to the Dark Channel Prior is proposed that allows for an easy application to underwater images. It is also shown that our method outperforms competing solutions based on the Dark Channel Prior. Experiments on real-life data collected within the DexROV project are also presented, showing the robustness and high performance of the proposed algorithm. [1807.04169v1]
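The Dark Channel Prior the paper builds on starts from a simple computation: the per-pixel minimum over colour channels, followed by a minimum filter over a local patch. A plain-numpy sketch of the standard in-air dark channel (the paper's underwater-specific channel handling is not reproduced here):

```python
import numpy as np

def dark_channel(img, patch=3):
    """Dark channel of an HxWx3 image: per-pixel channel minimum, then a
    minimum filter over a patch x patch neighbourhood with edge padding.
    Standard in-air formulation; illustrative, not the paper's variant."""
    chan_min = img.min(axis=2)                 # min over the colour channels
    h, w = chan_min.shape
    r = patch // 2
    padded = np.pad(chan_min, r, mode='edge')  # replicate borders
    out = np.empty_like(chan_min)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

In haze-free regions the dark channel is near zero (some channel is dark somewhere in every patch), while haze or veiling light lifts it; that gap is what the prior exploits to estimate transmission.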


Model-based free-breathing cardiac MRI reconstruction using deep learned & STORM priors: MoDL-STORM

Sampurna Biswas, Hemant K. Aggarwal, Sunrita Poddar, Mathews Jacob

We introduce a model-based reconstruction framework with deep learned (DL) and smoothness regularization on manifolds (STORM) priors to recover free breathing and ungated (FBU) cardiac MRI from highly undersampled measurements. The DL priors enable us to exploit the local correlations, while the STORM prior enables us to make use of the extensive non-local similarities that are subject dependent. We introduce a novel model-based formulation that allows the seamless integration of deep learning methods with available prior information, which current deep learning algorithms are not capable of. The experimental results demonstrate the preliminary potential of this work in accelerating FBU cardiac MRI. [1807.03845v1]


An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm

Shin Kamada, Takumi Ichimura

A Deep Belief Network (DBN) has a deep architecture that represents multiple features of input patterns hierarchically using pre-trained Restricted Boltzmann Machines (RBMs). A traditional RBM or DBN model cannot change its network structure during the learning phase. Our proposed adaptive learning method can discover the optimal number of hidden neurons, weights, and/or layers according to the input space. Such adaptation is important for controlling the computational cost and the stability of the model. Maintaining a sparse network structure is also a considerable problem, since the extraction of explicit knowledge from the trained network is required. In our previous research, we developed a hybrid method combining the adaptive structural learning of RBMs with a learning-forgetting method applied to the trained RBM. In this paper, we propose an adaptive learning method for DBNs that can determine the optimal number of layers during learning. We evaluated our proposed model on several benchmark data sets. [1807.03486v2]


Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

Chuhui Xue, Shijian Lu, Fangneng Zhan

This paper presents a scene text detection technique that exploits bootstrapping and text border semantics for accurate localization of texts in scenes. A novel bootstrapping technique is designed which samples multiple ‘subsections’ of a word or text line and accordingly relieves the constraint of limited training data effectively. At the same time, the repeated sampling of text ‘subsections’ improves the consistency of the predicted text feature maps, which is critical in predicting a single complete box instead of multiple broken boxes for long words or text lines. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for each scene text. With semantics-aware text borders, scene texts can be localized more accurately by regressing text pixels around the ends of words or text lines instead of all text pixels, which often leads to inaccurate localization for long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained over several public datasets, e.g., an 80.1 f-score on MSRA-TD500 and a 67.1 f-score on ICDAR2017-RCTW. [1807.03547v2]


“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention

Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin, Jiebo Luo

Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption with a specific style (e.g., humorous, romantic, positive, or negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as the model learns from stylized caption labels, and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms the state-of-the-art approaches, without using extra ground truth supervision. [1807.03871v1]


Vision System for AGI: Problems and Directions

Alexey Potapov, Sergey Rodionov, Maxim Peterson, Oleg Shcherbakov, Innokentii Zhdanov, Nikolai Skorobogatko

What frameworks and architectures are necessary to create a vision system for AGI? In this paper, we propose a formal model that states the task of perception within AGI. We show the role of discriminative and generative models in achieving efficient and general solution of this task, thus specifying the task in more detail. We discuss some existing generative and discriminative models and demonstrate their insufficiency for our purposes. Finally, we discuss some architectural dilemmas and open questions. [1807.03887v1]


Deep Imbalanced Attribute Classification using Visual Attention Aggregation

Nikolaos Sarafianos, Ioannis A. Kakadiaris

For many computer vision applications such as image description and human identification, recognizing the visual attributes of humans is an essential yet challenging problem. Its challenges originate from its multi-label nature, the large underlying class imbalance, and the lack of spatial annotations. Existing methods either follow a computer vision approach while failing to account for class imbalance, or explore machine learning solutions that disregard the spatial and semantic relations present in the images. With that in mind, we propose an effective method that extracts and aggregates visual attention masks at different scales. We introduce a loss function to handle class imbalance both at the class and at the instance level, and further demonstrate that penalizing attention masks with high prediction variance accounts for the weak supervision of the attention mechanism. By identifying and addressing these challenges, we achieve state-of-the-art results with a simple attention mechanism on both the PETA and WIDER-Attribute datasets without additional context or side information. [1807.03903v1]


An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm

Shin Kamada, Takumi Ichimura

The Restricted Boltzmann Machine (RBM) is a generative stochastic energy-based artificial neural network model for unsupervised learning. RBMs are well known as a pre-training method in deep learning. In addition to visible and hidden neurons, the structure of an RBM has a number of parameters, such as the weights between neurons and their coefficients. It can therefore be difficult to determine an optimal network structure for analyzing big data. To avoid this problem, we investigate the variance of the parameters to find an optimal structure during learning, since fluctuations of the energy function in the RBM model are reflected in the variance of the parameters. In this paper, we propose an adaptive learning method for RBMs that can discover an optimal number of hidden neurons according to the training situation by applying a neuron generation and annihilation algorithm. In this method, a new hidden neuron is generated if the energy function has not yet converged and the variance of the parameters is large. Moreover, an inactive hidden neuron is annihilated if it does not affect the learning situation. Experimental results for several benchmark data sets are discussed in this paper. [1807.03478v2]
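The generation/annihilation rules described in the abstract reduce to two boolean conditions checked during training. A sketch of those conditions as stated (the threshold values and the activation-rate test for annihilation are illustrative assumptions, not taken from the paper):

```python
def should_generate(param_variance, energy_delta,
                    var_threshold=0.1, energy_threshold=1e-3):
    """Generate a new hidden neuron when the energy function has not yet
    converged (its change is still large) AND the parameter variance is
    large. Thresholds are illustrative, not the paper's values."""
    return energy_delta > energy_threshold and param_variance > var_threshold

def should_annihilate(activation_rate, act_threshold=0.01):
    """Annihilate a hidden neuron whose activation rate is so low that it
    no longer affects learning. The rate-based test is an assumption."""
    return activation_rate < act_threshold
```

Both conditions must hold for generation, which prevents adding capacity once training has settled even if individual parameters still fluctuate.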


Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition

Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris

In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% while improving accuracy by 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains. [1807.03848v1]
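The branch-and-merge idea can be sketched with plain numpy: a heavy "big" branch runs at half resolution, a cheap "little" branch at full resolution, and the two are merged by addition. The pooling/upsampling operators and additive merge are assumptions standing in for the paper's actual convolutional sub-networks:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on an (H, W, C) array."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def big_little_block(x, big_layer, little_layer):
    """Sketch of one Big-Little merge: the heavy branch processes a
    downsampled input (cheaper per layer), the light branch keeps full
    resolution, and their outputs are fused by addition."""
    big = upsample2(big_layer(avg_pool2(x)))   # heavy branch, low res
    little = little_layer(x)                   # light branch, full res
    return big + little
```

Stacking several such blocks, with merges after every block, is what yields multi-scale features at reduced total FLOPs.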


Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

This paper presents a deep-learning-based method for iris presentation attack detection (PAD) when iris images are obtained from deceased people. Our approach is based on the VGG-16 architecture fine-tuned with a database of 574 post-mortem, near-infrared iris images from the Warsaw-BioBase-PostMortem-Iris-v1 database, complemented by a dataset of 256 images of live irises, collected within the scope of this study. Experiments described in this paper show that our approach is able to correctly classify iris images as either representing a live or a dead eye in almost 99% of the trials, averaged over 20 subject-disjoint, train/test splits. We also show that the post-mortem iris detection accuracy increases as time since death elapses, and that we are able to construct a classification system with APCER=0%@BPCER=1% (Attack Presentation and Bona Fide Presentation Classification Error Rates, respectively) when only post-mortem samples collected at least 16 hours post-mortem are considered. Since acquisitions of ante- and post-mortem samples differ significantly, we applied countermeasures to minimize bias in our classification methodology caused by image properties that are not related to the PAD. This included using the same iris sensor in collection of ante- and post-mortem samples, and analysis of class activation maps to ensure that discriminant iris regions utilized by our classifier are related to properties of the eye, and not to those of the acquisition protocol. This paper offers the first known to us PAD method in a post-mortem setting, together with an explanation of the decisions made by the convolutional neural network. Along with the paper we offer source codes, weights of the trained network, and a dataset of live iris images to facilitate reproducibility and further research. [1807.04058v1]
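The APCER/BPCER operating point quoted above is a standard PAD evaluation; a minimal sketch of computing both rates from classifier scores (assuming higher score means "bona fide / live"):

```python
def pad_error_rates(attack_scores, bonafide_scores, threshold):
    """APCER / BPCER for a PAD system (sketch).

    Scores at or above the threshold are classified as bona fide (live).
    APCER: fraction of attack (post-mortem) samples wrongly accepted;
    BPCER: fraction of bona fide (live) samples wrongly rejected.
    """
    apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    bpcer = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
    return apcer, bpcer
```

Sweeping the threshold and reporting APCER at a fixed BPCER (here, 1%) is how operating points such as APCER=0%@BPCER=1% are obtained.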



Nicolò Savioli, Silvia Visentin, Erich Cosmi, Enrico Grisan, Pablo Lamata, Giovanni Montana

Automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present an attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a convolutional gated recurrent unit (C-GRU) for enforcing the temporal coherence across video frames and exploiting the temporal redundancy of a signal, and a regularized loss function, called CyclicLoss, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, reducing the mean squared error from 0.31 mm² (state of the art) to 0.09 mm², and the relative error from 8.1% to 5.3%. The mean execution speed of the proposed approach of 289 frames per second makes it suitable for real-time clinical use. [1807.04056v1]
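The periodicity prior can be sketched as a shift-consistency penalty: a sequence of predicted diameters should match itself shifted by one cardiac period. This is an assumed simplification of the CyclicLoss, not the paper's exact formulation:

```python
import numpy as np

def cyclic_loss(diameters, period):
    """Sketch of a periodicity prior: penalise the mean squared
    difference between the predicted diameter sequence and the same
    sequence shifted by one (known or estimated) period.  The actual
    CyclicLoss in the paper may be formulated differently."""
    d = np.asarray(diameters, dtype=float)
    return float(np.mean((d[period:] - d[:-period]) ** 2))
```

A perfectly periodic prediction incurs zero penalty; drift or aperiodic noise is penalized quadratically.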



Roberto Annunziata, Christos Sagonas, Jacques Calì




Mateusz Trokielewicz, Ewelina Bartuzi




Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

With post-mortem iris recognition getting increasing attention throughout the biometric and forensic communities, no specific methods for the recognition of cadaver irises have yet been proposed. This paper makes the first step in assessing the discriminatory capabilities of post-mortem iris images collected at multiple time points after a person's death, by proposing a deep convolutional neural network (DCNN) classifier fine-tuned with cadaver iris images. The proposed method is able to learn the features and classify post-mortem irises in a closed-set scenario, proving that even though post-mortem biological processes onset after a person's demise, features in their irises remain, and can be exploited as a biometric. This is also the first work (known to us) to analyze the class activation maps generated by the DCNN-based iris classifier, and to compare them with attention maps acquired by a gaze-tracking device observing human subjects performing a post-mortem iris recognition task. We show how humans perceive post-mortem irises when challenged with a classification task, and hypothesize that the proposed DCNN-based method can offer human-intelligible decisions backed by visual explanations, which may be valuable for iris examiners in forensic/courtroom scenarios. [1807.04049v1]
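The class activation maps mentioned above are computed as a weighted sum of the last convolutional feature maps, using the output-layer weights of the predicted class. A dependency-free sketch:

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Class activation map (CAM): weight each of the C last-layer
    feature maps (shape (C, H, W)) by the corresponding output-layer
    weight for the predicted class, sum, rectify, and normalise to
    [0, 1] for visualisation."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting heat map, upsampled to the input size, is what can be compared against human gaze-tracking attention maps.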



Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, Michaela Blott

It is well known that many types of artificial neural networks, including recurrent networks, can achieve high classification accuracy even with low-precision weights and activations. The reduction in precision generally yields much more efficient hardware implementations in regard to hardware cost, memory requirements, energy, and achievable throughput. In this paper, we present the first systematic exploration of this design space as a function of precision for a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Specifically, we conduct an in-depth investigation of precision versus accuracy using a fully hardware-aware training flow, where during training the quantization of all aspects of the network, including weights, inputs, outputs, and in-memory cell activations, is taken into account. In addition, hardware resource cost, power consumption, and throughput scalability are explored as a function of precision for FPGA-based BiLSTM implementations, together with multiple approaches to parallelizing the hardware. We provide the first open-source HLS library extension of FINN for parameterizable hardware architectures of LSTM layers on FPGAs, which offers full precision flexibility and allows for parameterizable performance scaling, providing different levels of parallelism within the architecture. Based on this library, we present an FPGA-based BiLSTM neural network accelerator for optical character recognition, along with numerous other experimentally validated points within the given design space for the Zynq UltraScale+ XCZU7EV MPSoC. [1807.04093v1]
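The kind of precision reduction explored in the paper can be illustrated with uniform symmetric quantization of a weight tensor. This is a generic sketch of the concept, not the FINN library's actual API or the paper's training-time quantizer:

```python
import numpy as np

def quantize_uniform(w, n_bits):
    """Uniform symmetric quantisation of weights to n_bits.

    Maps weights onto 2**(n_bits - 1) - 1 signed integer levels per
    side, scaled by the tensor's maximum magnitude, then back to the
    real-valued grid (straight-through style; gradients are handled
    separately in hardware-aware training)."""
    levels = 2 ** (n_bits - 1) - 1            # e.g. 3 bits -> {-3..3}
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale
```

Applying such a quantizer to weights, inputs, outputs, and cell activations during the forward pass is the essence of a fully hardware-aware training flow.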



Yandong Li, liqiang Wang, Tianbao Yang, Boqing Gong




Benjamin Bruno Meier, Ismail Elezi, Mohammadreza Amiri, Oliver Dürr, Thilo Stadelmann

We present a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate excellent performance on high-dimensional data such as images (COIL-100) and speech (TIMIT). We call this learning to cluster and show its conceptual difference to deep metric learning, semi-supervised clustering, and other related approaches, while having the advantage of performing learnable clustering fully end-to-end. [1807.04001v1]
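The network's two-headed output can be sketched as plain softmaxes: one over the cluster count k, and for each candidate k, one per data point over the k labels. The tensor shapes and names here are illustrative assumptions about how such a head could be wired, not the paper's exact architecture:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clustering_head(k_logits, assignment_logits):
    """Sketch of the probabilistic clustering output: a distribution
    over the number of clusters k (1..k_max), plus, for each k, a
    per-point distribution over the k cluster labels.
    assignment_logits is a list indexed by k-1, each of shape
    (n_points, k)."""
    p_k = softmax(k_logits)
    p_assign = [softmax(a, axis=1) for a in assignment_logits]
    return p_k, p_assign
```

Reading out a hard clustering means picking the most probable k and then the argmax label per point.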



Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Lingxiao Hang

In this paper, we present the deep attention-based classification (DABC) network for robust single image depth prediction, in the context of the Robust Vision Challenge 2018 (ROB 2018). Unlike conventional depth prediction, our goal is to design a model that can perform well in both indoor and outdoor scenes with a single set of parameters. However, robust depth prediction suffers from two challenging problems: a) how to extract more discriminative features for different scenes (compared to a single scene)? b) how to handle the large difference of depth ranges between indoor and outdoor datasets? To address these two problems, we first formulate depth prediction as a multi-class classification task and apply a softmax classifier to classify the depth label of each pixel. We then introduce a global pooling layer and a channel-wise attention mechanism to adaptively select the discriminative channels of features and to update the original features by assigning the important channels higher weights. Moreover, to reduce the influence of quantization error, we employ a soft-weighted-sum inference strategy for the final prediction. Experimental results on both indoor and outdoor datasets demonstrate the effectiveness of our method. It is worth mentioning that we won the 2nd place in the single image depth prediction entry of ROB 2018, held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018. [1807.03959v1]
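The soft-weighted-sum inference step has a simple closed form: instead of taking the argmax depth bin, weight each bin centre by its softmax probability. A per-pixel sketch:

```python
import numpy as np

def soft_weighted_depth(logits, bin_centers):
    """Soft-weighted-sum inference for one pixel: the predicted depth
    is the expectation of the bin centres under the softmax over the
    classification logits, which reduces quantisation error relative
    to picking the single most likely bin."""
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    return float(np.dot(probs, bin_centers))
```

With uniform logits the prediction falls midway between bins, exactly the sub-bin resolution an argmax readout cannot provide.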



Stéphane Perrin, Eric Benoit, Didier Coquin




Bardia Doosti, Tao Dong, Biplab Deka, Jeffrey Nichols

UI design languages, such as Google's Material Design, make apps both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of a design language in the wild. Moreover, designers often debate passionately about the merits of certain UI components, such as the floating action button and the navigation drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates by analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated with app ratings, and to some extent, also with the number of installs. Furthermore, we show that the use of UI components varies by app category, suggesting that a more nuanced view is needed in design debates. [1807.04191v1]



Muhammed Kocabas, Salih Karagoz, Emre Akbas

In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation, and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN), which receives keypoint and person detections, and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4 mAP over the previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. Our method is the fastest real-time system, at about 23 frames/sec. Source code is available at: https://github
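To make the assignment problem concrete, here is a toy stand-in for the PRN's role: give each detected keypoint to the person whose bounding-box centre is nearest. The real PRN learns this mapping with a residual network rather than using a geometric rule; this nearest-centre heuristic is purely illustrative:

```python
import numpy as np

def assign_keypoints(keypoints, person_boxes):
    """Toy keypoint-to-person assignment (NOT the learned PRN):
    each keypoint goes to the person whose bounding-box centre is
    closest.  keypoints: (K, 2) array of (x, y); person_boxes: (P, 4)
    array of (x1, y1, x2, y2).  Returns a person index per keypoint."""
    centres = (person_boxes[:, :2] + person_boxes[:, 2:]) / 2     # (P, 2)
    dists = np.linalg.norm(keypoints[:, None, :] - centres[None], axis=2)
    return dists.argmin(axis=1)
```

The learned PRN improves on such geometric heuristics precisely in crowded scenes where boxes overlap and proximity alone is ambiguous.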



Huaibo Huang, Lingxiao Song, Ran He, Zhenan Sun, Tieniu Tan




Hongyu Li, Fan Zhu, Junhua Qiu




Saumya Jetley*, Nicholas A. Lord*, Philip H. S. Torr




Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang




Mateusz Trokielewicz, Adam Czajka

This paper proposes a method for segmenting iris images obtained from deceased subjects, by training a deep convolutional neural network (DCNN) designed for the purpose of semantic segmentation. Post-mortem iris recognition has recently emerged as an alternative, or additional, method useful in forensic analysis. At the same time, it poses many new challenges from the technological standpoint, one of them being the image segmentation stage, which has proven difficult to execute reliably with conventional iris recognition methods. Our approach is based on the SegNet architecture, fine-tuned with 1,300 manually segmented post-mortem iris images taken from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database. The experiments presented in this paper show that this data-driven solution is able to learn the specific deformations present in post-mortem samples, which are missing from live irises, and offers a considerable improvement over the state-of-the-art conventional segmentation algorithm (OSIRIS): the Intersection over Union (IoU) metric is improved from 73.6% (for OSIRIS) to 83% (for the DCNN-based approach presented in this paper), averaged over subject-disjoint, multiple splits of the data into train and test subsets. This paper offers the first known to us method of automatic processing of post-mortem iris images. As described in this paper, we offer the source codes with the trained DCNN that performs end-to-end segmentation of post-mortem iris images. Also, we provide binary masks corresponding to the manual segmentation of samples from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database, to facilitate the development of alternative methods for post-mortem iris segmentation. [1807.04154v1]
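The IoU metric used to compare the DCNN against OSIRIS has a direct definition on binary masks; a minimal sketch:

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection-over-Union between two binary segmentation masks:
    |pred AND gt| / |pred OR gt|.  Returns 1.0 for two empty masks by
    convention."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```

Averaging this per-image score over subject-disjoint test splits gives the 73.6% vs. 83% comparison reported above.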



Tomasz Łuczyński, Andreas Birk

Underwater images suffer from extremely unfavourable conditions. Light is heavily attenuated and scattered. Attenuation creates a colour-tone shift, while scattering causes so-called veiling light. General state-of-the-art methods for enhancing image quality are either unreliable or cannot easily be used in underwater operations. On the other hand, there is a well-known method for haze removal in air, called the Dark Channel Prior. Even though such methods are known to be applicable to underwater scenarios, they do not always work properly. This work elaborates on and improves the initial concept presented in [1]. A modification of the Dark Channel Prior is presented that allows easy application to underwater images. It is also shown that our method outperforms competing solutions based on the Dark Channel Prior. Experiments on real data collected within the DexROV project are also presented, showing the robustness and high performance of the proposed algorithm. [1807.04169v1]
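The dark channel at the core of the prior is simply the per-pixel minimum over colour channels and a local patch. A dependency-free sketch (real implementations use a fast erosion filter instead of explicit loops):

```python
import numpy as np

def dark_channel(image, patch=3):
    """Dark channel of an (H, W, 3) image: for each pixel, the minimum
    intensity over all colour channels within a local patch.  In
    haze-free regions this is near zero; haze (or veiling light)
    raises it, which is what the prior exploits."""
    mins = image.min(axis=2)          # per-pixel minimum over channels
    H, W = mins.shape
    r = patch // 2
    out = np.empty_like(mins)
    for i in range(H):
        for j in range(W):
            out[i, j] = mins[max(0, i - r):i + r + 1,
                             max(0, j - r):j + r + 1].min()
    return out
```

Underwater adaptations typically change which channels enter the minimum, since red light is attenuated far more strongly than blue/green; the exact modification proposed in the paper is not reproduced here.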



Sampurna Biswas, Hemant K. Aggarwal, Sunrita Poddar, Mathews Jacob




Shin Kamada, Takumi Ichimura

A Deep Belief Network (DBN) has a deep architecture that can represent multiple features of input patterns hierarchically, using pre-trained Restricted Boltzmann Machines (RBMs). A traditional RBM or DBN model cannot change its network structure during the learning phase. Our proposed adaptive learning method can discover the optimal number of hidden neurons and weights and/or layers according to the input space. The model is an important method with respect to computational cost and model stability. Keeping the regularity of a sparse network structure is a considerable problem, since explicit knowledge should be extracted from the trained network. In our previous research, we developed a hybrid method combining the adaptive structure learning method of RBM with a Forgetting method applied to the trained RBM. In this paper, we propose an adaptive learning method of DBN that can determine the optimal number of layers during learning. We evaluated our proposed model on some benchmark data sets. [1807.03486v2]



Chuhui Xue, Shijian Lu, Fangneng Zhan

This paper presents a scene text detection technique that exploits bootstrapping and text border semantics for accurate localization of text in scenes. A novel bootstrapping technique is designed which samples multiple subsections of a word or text line and accordingly relieves the constraint of limited training data effectively. At the same time, the repeated sampling of text subsections improves the consistency of the predicted text feature maps, which is critical for predicting a single complete box instead of multiple broken boxes for long words or text lines. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for each scene text instance. With semantics-aware text borders, scene text can be localized more accurately by regressing text pixels around the ends of words or text lines instead of all text pixels, which often leads to inaccurate localization when dealing with long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained over several public datasets, e.g. an 80.1 f-score on MSRA-TD500, a 67.1 f-score on ICDAR2017-RCTW, etc. [1807.03547v2]
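The bootstrapping idea, sampling multiple subsections of a labelled text line to augment limited training data, can be sketched in one dimension. The minimum-width fraction and uniform sampling scheme are illustrative assumptions, not the paper's exact parameters:

```python
import random

def sample_subsections(box, n_samples, min_frac=0.5, seed=0):
    """Bootstrapping sketch: sample horizontal subsections (x1, x2) of
    a text-line box whose width is at least min_frac of the original,
    each a valid positive training example for the detector.  The
    fraction and sampling scheme are illustrative."""
    x1, x2 = box
    width = x2 - x1
    rng = random.Random(seed)
    subs = []
    for _ in range(n_samples):
        w = rng.uniform(min_frac, 1.0) * width
        start = rng.uniform(x1, x2 - w)
        subs.append((start, start + w))
    return subs
```

Each sampled subsection inherits the line's label, so one annotated line yields many consistent training crops.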



Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin, Jiebo Luo

Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, or negative) while semantically accurately describing the image content. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and the stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as the model learns from stylized caption labels, and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms state-of-the-art approaches, without using extra ground-truth supervision. [1807.03871v1]
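The core mixing idea, two parameter groups blended by a learned per-step weight, can be shown for a single linear map. In the paper this blend is applied inside each LSTM gate; the scalar gate `g` here stands in for the learned word-level weight:

```python
import numpy as np

def style_factual_mix(x, W_fact, W_style, g):
    """Core idea of the style-factual LSTM (sketch): one matrix group
    captures factual knowledge, the other stylised knowledge, and a
    learned weight g in [0, 1] blends their contributions at each time
    step.  Shown for a single linear map, not the full LSTM cell."""
    return g * (W_fact @ x) + (1.0 - g) * (W_style @ x)
```

When g is near 1 the model speaks "factually"; as g decreases, the stylized parameter group dominates the word being generated.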



Alexey Potapov, Sergey Rodionov, Maxim Peterson, Oleg Shcherbakov, Innokentii Zhdanov, Nikolai Skorobogatko




Nikolaos Sarafianos, Ioannis A. Kakadiaris








Please credit the source when reprinting: "Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition + DeSTNet: Densely Fused Spatial Transformer Networks + Neural Networks Learning End-to-End Clustering"