# Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition + DeSTNet: Densely Fused Spatial Transformer Networks + Learning Neural Models for End-to-End Clustering

Presentation Attack Detection for Cadaver Irises

Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

This paper presents a deep-learning-based method for iris presentation attack detection (PAD) when iris images are obtained from deceased people. Our approach is based on the VGG-16 architecture fine-tuned with a database of 574 post-mortem, near-infrared iris images from the Warsaw-BioBase-PostMortem-Iris-v1 database, complemented by a dataset of 256 images of live irises, collected within the scope of this study. Experiments described in this paper show that our approach is able to correctly classify iris images as either representing a live or a dead eye in almost 99% of the trials, averaged over 20 subject-disjoint, train/test splits. We also show that the post-mortem iris detection accuracy increases as time since death elapses, and that we are able to construct a classification system with APCER=0%@BPCER=1% (Attack Presentation and Bona Fide Presentation Classification Error Rates, respectively) when only post-mortem samples collected at least 16 hours post-mortem are considered. Since acquisitions of ante- and post-mortem samples differ significantly, we applied countermeasures to minimize bias in our classification methodology caused by image properties that are not related to the PAD. This included using the same iris sensor in the collection of ante- and post-mortem samples, and analysis of class activation maps to ensure that the discriminant iris regions utilized by our classifier are related to properties of the eye, and not to those of the acquisition protocol. This paper offers the first PAD method known to us in a post-mortem setting, together with an explanation of the decisions made by the convolutional neural network. Along with the paper we offer source code, weights of the trained network, and a dataset of live iris images to facilitate reproducibility and further research. [1807.04058v1]

Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound

Nicolo’ Savioli, Silvia Visentin, Erich Cosmi, Enrico Grisan, Pablo Lamata, Giovanni Montana

The automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present our attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a Convolution Gated Recurrent Unit (C-GRU) for enforcing the temporal coherence across video frames and exploiting the temporal redundancy of a signal, and a regularized loss function, called \textit{CyclicLoss}, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, providing an average reduction of the mean squared error from $0.31 mm^2$ (state of the art) to $0.09 mm^2$, and a relative error reduction from $8.1\%$ to $5.3\%$. The mean execution speed of the proposed approach of 289 frames per second makes it suitable for real-time clinical use. [1807.04056v1]
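The periodicity prior lends itself to a compact sketch. Assuming, as one plausible reading rather than the paper's exact definition, that the CyclicLoss penalizes disagreement between diameter predictions spaced one (known) cardiac period apart:

```python
import numpy as np

def cyclic_loss(preds, period, weight=1.0):
    """Hypothetical CyclicLoss sketch: penalize the mean squared difference
    between each predicted diameter and the prediction one full period
    later, encoding the prior that the observed signal is periodic.
    `period` is the period length in frames (an integer)."""
    preds = np.asarray(preds, dtype=float)
    n = len(preds) - period
    if n <= 0:
        return 0.0
    return weight * float(np.mean((preds[:n] - preds[period:]) ** 2))

# a perfectly periodic diameter trace incurs (numerically) zero penalty
t = np.arange(40)
trace = 1.0 + 0.1 * np.sin(2 * np.pi * t / 10)
print(round(cyclic_loss(trace, period=10), 6))  # 0.0
```

In the paper this term is added as a regularizer to the main regression loss, pulling predictions toward the expected cardiac periodicity.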

DeSTNet: Densely Fused Spatial Transformer Networks

Roberto Annunziata, Christos Sagonas, Jacques Calì

Modern Convolutional Neural Networks (CNN) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks namely, MNIST, GTSRB, and IDocDB show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness. [1807.04050v1]

Cross-spectral Iris Recognition for Mobile Applications using High-quality Color Images

Mateusz Trokielewicz, Ewelina Bartuzi

With the recent shift towards mobile computing, new challenges for biometric authentication appear on the horizon. This paper provides a comprehensive study of cross-spectral iris recognition in a scenario, in which high quality color images obtained with a mobile phone are used against enrollment images collected in typical, near-infrared setups. Grayscale conversion of the color images that employs selective RGB channel choice depending on the iris coloration is shown to improve the recognition accuracy for some combinations of eye colors and matching software, when compared to using the red channel only, with equal error rates driven down to as low as 2%. The authors are not aware of any other paper focusing on cross-spectral iris recognition in a scenario with near-infrared enrollment using a professional iris recognition setup and then a mobile-based verification employing color images. [1807.04061v1]
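The selective channel choice can be illustrated with a toy conversion routine. The color-to-channel mapping below is an assumption made for illustration; the paper selects channels per eye-color/matcher combination empirically:

```python
import numpy as np

def iris_to_grayscale(rgb, iris_color):
    """Illustrative selective-channel grayscale conversion: pick one RGB
    channel depending on iris coloration instead of always using red.
    The mapping here is hypothetical, not the paper's exact rule."""
    channel = {"brown": 0, "green": 1, "blue": 2}.get(iris_color, 0)
    return rgb[..., channel]

img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 2] = 200  # iris texture concentrated in the blue channel
print(iris_to_grayscale(img, "blue").max())  # 200
```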

DCNN-based Human-Interpretable Post-mortem Iris Recognition

Mateusz Trokielewicz, Adam Czajka, Piotr Maciejewicz

With post-mortem iris recognition getting increasing attention throughout the biometric and forensic communities, no specific, cadaver-aware recognition methodologies have been proposed to date. This paper makes the first step in assessing the discriminatory capabilities of post-mortem iris images collected in multiple time points after a person’s demise, by proposing a deep convolutional neural network (DCNN) classifier fine-tuned with cadaver iris images. The proposed method is able to learn these features and provide classification of post-mortem irises in a closed-set scenario, proving that even with post-mortem biological processes’ onset after a person’s death, features in their irises remain, and can be utilized as a biometric trait. This is also the first work (known to us) to analyze the class-activation maps produced by the DCNN-based iris classifier, and to compare them with attention maps acquired by a gaze-tracking device observing human subjects performing post-mortem iris recognition task. We show how humans perceive post-mortem irises when challenged with the task of classification, and hypothesize that the proposed DCNN-based method can offer human-intelligible decisions backed by visual explanations which may be valuable for iris examiners in a forensic/courthouse scenario. [1807.04049v1]

FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs

Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, Michaela Blott

It is well known that many types of artificial neural networks, including recurrent networks, can achieve a high classification accuracy even with low-precision weights and activations. The reduction in precision generally yields much more efficient hardware implementations with regard to hardware cost, memory requirements, energy, and achievable throughput. In this paper, we present the first systematic exploration of this design space as a function of precision for Bidirectional Long Short-Term Memory (BiLSTM) neural networks. Specifically, we include an in-depth investigation of precision vs. accuracy using a fully hardware-aware training flow, where during training the quantization of all aspects of the network, including weights, input, output and in-memory cell activations, is taken into consideration. In addition, hardware resource cost, power consumption and throughput scalability are explored as a function of precision for FPGA-based implementations of BiLSTM, and multiple approaches of parallelizing the hardware. We provide the first open source HLS library extension of FINN for parameterizable hardware architectures of LSTM layers on FPGAs which offers full precision flexibility and allows for parameterizable performance scaling offering different levels of parallelism within the architecture. Based on this library, we present an FPGA-based accelerator for a BiLSTM neural network designed for optical character recognition, along with numerous other experimental proof points for a Zynq UltraScale+ XCZU7EV MPSoC within the given design space. [1807.04093v1]

How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization

Yandong Li, Liqiang Wang, Tianbao Yang, Boqing Gong

The large volume of video content and high viewing frequency demand automatic video summarization algorithms, a key property of which is the capability of modeling diversity. If videos are lengthy, like hours-long egocentric videos, it is necessary to track the temporal structures of the videos and enforce local diversity. Local diversity means that shots selected from within a short time span should be diverse, while visually similar shots are allowed to co-exist in the summary if they appear far apart in the video. In this paper, we propose a novel probabilistic model, built upon SeqDPP, to dynamically control the time span of a video segment upon which the local diversity is imposed. In particular, we enable SeqDPP to learn to automatically infer how local the local diversity is supposed to be from the input video. The resulting model is challenging to train with the hallmark maximum likelihood estimation (MLE), which further suffers from exposure bias and non-differentiable evaluation metrics. To tackle these problems, we instead devise a reinforcement learning algorithm for training the proposed model. Extensive experiments verify the advantages of our model and the new learning algorithm over MLE-based methods. [1807.04219v1]

Learning Neural Models for End-to-End Clustering

We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this “learning to cluster” and show its conceptual difference to deep metric learning, semi-supervised clustering and other related approaches while having the advantage of performing learnable clustering fully end-to-end. [1807.04001v1]
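The two output heads described above can be sketched as follows; random projections stand in for the trained network, so only the shapes and normalizations are meaningful:

```python
import numpy as np

def cluster_heads(embeddings, k_max=3, seed=0):
    """Toy sketch of the network's two output heads: a distribution over
    the number of clusters k, and for each k <= k_max a soft assignment
    of every point to one of k clusters. Random projections stand in for
    the trained network here."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    k_logits = rng.standard_normal(k_max)
    p_k = np.exp(k_logits) / np.exp(k_logits).sum()    # P(k), k = 1..k_max
    assignments = {}
    for k in range(1, k_max + 1):
        logits = embeddings @ rng.standard_normal((d, k))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        assignments[k] = e / e.sum(axis=1, keepdims=True)  # per-point P(cluster | k)
    return p_k, assignments

p_k, asg = cluster_heads(np.random.default_rng(1).standard_normal((5, 4)))
print(round(float(p_k.sum()), 6), asg[3].shape)  # 1.0 (5, 3)
```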

Deep attention-based classification network for robust depth prediction

Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Lingxiao Hang

In this paper, we present our deep attention-based classification (DABC) network for robust single image depth prediction, in the context of the Robust Vision Challenge 2018 (ROB 2018). Unlike conventional depth prediction, our goal is to design a model that can perform well in both indoor and outdoor scenes with a single parameter set. However, robust depth prediction suffers from two challenging problems: a) How to extract more discriminative features for different scenes (compared to a single scene)? b) How to handle the large differences of depth ranges between indoor and outdoor datasets? To address these two problems, we first formulate depth prediction as a multi-class classification task and apply a softmax classifier to classify the depth label of each pixel. We then introduce a global pooling layer and a channel-wise attention mechanism to adaptively select the discriminative channels of features and to update the original features by assigning higher weights to important channels. Further, to reduce the influence of quantization errors, we employ a soft-weighted sum inference strategy for the final prediction. Experimental results on both indoor and outdoor datasets demonstrate the effectiveness of our method. It is worth mentioning that we won 2nd place in the single-image depth prediction entry of ROB 2018, in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018. [1807.03959v1]
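The soft-weighted sum inference step admits a compact sketch: rather than taking the argmax depth class (which quantizes the output), each discrete depth-bin center is weighted by its softmax probability. The bin centers below are illustrative values, not the paper's discretization:

```python
import numpy as np

def soft_weighted_depth(logits, bin_centers):
    """Soft-weighted sum inference: convert per-pixel class logits into a
    continuous depth by weighting each depth-bin center with its softmax
    probability, reducing quantization error versus argmax decoding."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return (p * np.asarray(bin_centers)).sum(axis=-1)

centers = np.array([1.0, 2.0, 4.0, 8.0])  # illustrative depth-bin centers (metres)
logits = np.array([0.0, 3.0, 3.0, 0.0])   # ambiguous between the 2 m and 4 m bins
print(float(soft_weighted_depth(logits, centers)))  # a depth between 2 m and 4 m
```

With uniform logits this reduces to the mean of the bin centers; sharper distributions approach the argmax depth.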

Decision method choice in a human posture recognition context

Stéphane Perrin, Eric Benoit, Didier Coquin

Human posture recognition provides a dynamic field that has produced many methods. Using fuzzy-subset-based data fusion methods to aggregate the results given by different types of recognition processes is a convenient way to improve recognition methods. Nevertheless, choosing a defuzzification method to implement the decision is a crucial point of this approach. The goal of this paper is to present an approach where the choice of the defuzzification method is driven by the constraints of the final data user, which are expressed as limitations on indicators like confidence or accuracy. A practical experimentation illustrating this approach is presented: from a depth camera sensor, human posture is interpreted and the defuzzification method is selected in accordance with the constraints of the final information consumer. The paper illustrates the interest of the approach in a context of posture-based human-robot communication. [1807.04170v1]

A Computational Method for Evaluating UI Patterns

Bardia Doosti, Tao Dong, Biplab Deka, Jeffrey Nichols

UI design languages, such as Google’s Material Design, make applications both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of design languages in the wild. Moreover, designers often get stranded by strong-opinionated debates around the merit of certain UI components, such as the Floating Action Button and the Navigation Drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates through analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated to app ratings, and to some extent, also the number of installs. Furthermore, we show that use of UI components vary by app category, suggesting a more nuanced view needed in design debates. [1807.04191v1]

MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network

Muhammed Kocabas, Salih Karagoz, Emre Akbas

In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN) which receives keypoint and person detections, and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4-point mAP over previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. Our method is the fastest real-time system, running at 23 frames/sec. Source code is available at: https://github.com/mkocabas/pose-residual-network [1807.04067v1]

Variational Capsules for Image Analysis and Synthesis

Huaibo Huang, Lingxiao Song, Ran He, Zhenan Sun, Tieniu Tan

A capsule is a group of neurons whose activity vector models different properties of the same entity. This paper extends the capsule to a generative version, named variational capsules (VCs). Each VC produces a latent variable for a specific entity, making it possible to integrate image analysis and image synthesis into a unified framework. Variational capsules model an image as a composition of entities in a probabilistic model. Different capsules’ divergence with a specific prior distribution represents the presence of different entities, which can be applied in image analysis tasks such as classification. In addition, variational capsules encode multiple entities in a semantically-disentangling way. Diverse instantiations of capsules are related to various properties of the same entity, making it easy to generate diverse samples with fine-grained semantic attributes. Extensive experiments demonstrate that deep networks designed with variational capsules can not only achieve promising performance on image analysis tasks (including image classification and attribute prediction) but can also improve the diversity and controllability of image synthesis. [1807.04099v1]

CG-DIQA: No-reference Document Image Quality Assessment Based on Character Gradient

Hongyu Li, Fan Zhu, Junhua Qiu

Document image quality assessment (DIQA) is an important and challenging problem in real applications. In order to predict the quality scores of document images, this paper proposes a novel no-reference DIQA method based on character gradient, where the OCR accuracy is used as a ground-truth quality metric. Character gradient is computed on character patches detected with the maximally stable extremal regions (MSER) based method. Character patches are essential to character recognition and therefore suitable for use in estimating document image quality. Experiments on a benchmark dataset show that the proposed method outperforms the state-of-the-art methods in estimating the quality score of document images. [1807.04047v1]
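The character-gradient idea can be sketched as follows. Here character patches are supplied directly (the paper detects them with MSER), and the mean squared gradient magnitude is an illustrative stand-in for the paper's exact statistic:

```python
import numpy as np

def character_gradient_score(image, patches):
    """Sketch of the character-gradient quality cue: average squared
    gradient magnitude inside character patches. Sharp characters keep
    strong, concentrated edges; blur spreads the transition and lowers
    this score. Patches are (x, y, w, h) boxes, assumed given."""
    gy, gx = np.gradient(image.astype(float))
    sq = gx ** 2 + gy ** 2
    scores = [sq[y:y + h, x:x + w].mean() for (x, y, w, h) in patches]
    return float(np.mean(scores))

sharp = np.zeros((20, 20)); sharp[:, 10:] = 255.0         # hard character edge
blurred = np.linspace(0, 255, 20)[None, :].repeat(20, 0)  # smeared-out edge
patch = [(0, 0, 20, 20)]
print(character_gradient_score(sharp, patch) > character_gradient_score(blurred, patch))  # True
```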

With Friends Like These, Who Needs Adversaries?

Saumya Jetley*, Nicholas A. Lord*, Philip H. S. Torr

The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks (DCNs) that shed new light on their behaviour and its connection to the problem of adversaries, with two key results. The first is a straightforward explanation of the existence of universal adversarial perturbations and their association with specific class identities, obtained by analysing the properties of nets’ logit responses as functions of 1D movements along specific image-space directions. The second is the clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Prior work has noted the importance of low-dimensional subspaces in adversarial vulnerability: we illustrate that this likewise represents the nets’ notion of saliency. In all, we provide a digestible perspective from which to understand previously reported results which have appeared disjoint or contradictory, with implications for efforts to construct neural nets that are both accurate and robust to adversarial attack. [1807.04200v1]
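The 1D probing described above, examining logit responses along specific image-space directions, can be illustrated on a toy linear stand-in for a network, where the logits are exactly linear in the step size:

```python
import numpy as np

def logits_along_direction(W, b, x0, direction, alphas):
    """Toy probe in the spirit of the paper's analysis: evaluate class
    logits as the input moves along one image-space direction, x0 + a*d.
    For this linear stand-in 'net' the logits are exactly linear in a."""
    d = np.asarray(direction, float)
    d = d / np.linalg.norm(d)
    return np.array([W @ (x0 + a * d) + b for a in alphas])

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 8)), np.zeros(3)
x0, d = rng.standard_normal(8), rng.standard_normal(8)
traj = logits_along_direction(W, b, x0, d, alphas=np.linspace(-2, 2, 5))
# one direction sweeps all logits linearly; the argmax class can flip,
# which is the mechanism the paper ties to universal perturbations
print(traj.shape)  # (5, 3)
```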

Generative Adversarial Networks with Decoder-Encoder Output Noise

Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang

In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAEs) was proposed in 2013, which uses variational inference to learn a latent space from the image database and then generates images using the decoder. The generative adversarial networks (GANs) came out as a promising framework, which uses adversarial training to improve the generative ability of the generator. However, the images generated by GANs are generally blurry. The deep convolutional generative adversarial networks (DCGANs) were then proposed to improve the quality of generated images. Since the input noise vectors are randomly sampled from a Gaussian distribution, the generator has to map from a whole normal distribution to the images. This makes DCGANs unable to reflect the inherent structure of the training data. In this paper, we propose a novel deep model, called generative adversarial networks with decoder-encoder output noise (DE-GANs), which takes advantage of both the adversarial training and the variational Bayesian inference to improve the performance of image generation. DE-GANs use a pre-trained decoder-encoder architecture to map the random Gaussian noise vectors to informative ones and pass them to the generator of the adversarial networks. Since the decoder-encoder architecture is trained by the same images as the generators, the output vectors could carry the intrinsic distribution information of the original images. Moreover, the loss function of DE-GANs is different from GANs and DCGANs. A hidden-space loss function is added to the adversarial loss function to enhance the robustness of the model. Extensive empirical results show that DE-GANs can accelerate the convergence of the adversarial training process and improve the quality of the generated images. [1807.03923v1]

Data-Driven Segmentation of Post-mortem Iris Images

This paper presents a method for segmenting iris images obtained from the deceased subjects, by training a deep convolutional neural network (DCNN) designed for the purpose of semantic segmentation. Post-mortem iris recognition has recently emerged as an alternative, or additional, method useful in forensic analysis. At the same time it poses many new challenges from the technological standpoint, one of them being the image segmentation stage, which has proven difficult to be reliably executed by conventional iris recognition methods. Our approach is based on the SegNet architecture, fine-tuned with 1,300 manually segmented post-mortem iris images taken from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database. The experiments presented in this paper show that this data-driven solution is able to learn specific deformations present in post-mortem samples, which are absent in irises of live subjects, and offers a considerable improvement over the state-of-the-art, conventional segmentation algorithm (OSIRIS): the Intersection over Union (IoU) metric was improved from 73.6% (for OSIRIS) to 83% (for the DCNN-based method presented in this paper), averaged over subject-disjoint, multiple splits of the data into train and test subsets. This paper offers the first method known to us for automatic processing of post-mortem iris images. We offer source codes with the trained DCNN that perform end-to-end segmentation of post-mortem iris images, as described in this paper. Also, we offer binary masks corresponding to manual segmentation of samples from the Warsaw-BioBase-Post-Mortem-Iris v1.0 database to facilitate development of alternative methods for post-mortem iris segmentation. [1807.04154v1]
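The reported comparison uses the Intersection over Union metric, which is straightforward to compute from binary segmentation masks:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union between two binary segmentation masks:
    the overlapping area divided by the combined area. This is the
    metric used to compare the DCNN segmentation against OSIRIS."""
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

pred = np.zeros((4, 4), int); pred[1:3, 1:3] = 1  # 4-pixel predicted square
true = np.zeros((4, 4), int); true[1:3, 1:4] = 1  # 6-pixel ground-truth rectangle
print(iou(pred, true))  # 4 overlapping pixels / 6 in the union ≈ 0.667
```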

Underwater Image Haze Removal and Color Correction with an Underwater-ready Dark Channel Prior

Tomasz Łuczyński, Andreas Birk

Underwater images suffer from extremely unfavourable conditions. Light is heavily attenuated and scattered: attenuation creates a change in hue, while scattering causes so-called veiling light. General state-of-the-art methods for enhancing image quality are either unreliable or cannot be easily used in underwater operations. On the other hand, there is a well-known method for haze removal in air, called the Dark Channel Prior. Even though there are known adaptations of this method to underwater applications, they do not always work correctly. This work elaborates and improves upon the initial concept presented in [1]. A modification to the Dark Channel Prior is proposed that allows for an easy application to underwater images. It is also shown that our method outperforms competing solutions based on the Dark Channel Prior. Experiments on real-life data collected within the DexROV project are also presented, showing the robustness and high performance of the proposed algorithm. [1807.04169v1]

Model-based free-breathing cardiac MRI reconstruction using deep learned \& STORM priors: MoDL-STORM

Sampurna Biswas, Hemant K. Aggarwal, Sunrita Poddar, Mathews Jacob

We introduce a model-based reconstruction framework with deep learned (DL) and smoothness regularization on manifolds (STORM) priors to recover free breathing and ungated (FBU) cardiac MRI from highly undersampled measurements. The DL priors enable us to exploit the local correlations, while the STORM prior enables us to make use of the extensive non-local similarities that are subject dependent. We introduce a novel model-based formulation that allows the seamless integration of deep learning methods with available prior information, which current deep learning algorithms are not capable of. The experimental results demonstrate the preliminary potential of this work in accelerating FBU cardiac MRI. [1807.03845v1]

An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm

Deep Belief Network (DBN) has a deep architecture that represents multiple features of input patterns hierarchically with pre-trained Restricted Boltzmann Machines (RBMs). A traditional RBM or DBN model cannot change its network structure during the learning phase. Our proposed adaptive learning method can discover the optimal number of hidden neurons, weights, and/or layers according to the input space. Such a model is important for keeping both the computational cost and the model stability in check. Maintaining a sparse network structure is also a considerable problem, since extracting explicit knowledge from the trained network requires it. In our previous research, we developed a hybrid method combining the adaptive structural learning of RBMs with a learning-forgetting method applied to the trained RBM. In this paper, we propose an adaptive learning method for DBNs that can determine the optimal number of layers during learning. We evaluated our proposed model on several benchmark data sets. [1807.03486v2]
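A hedged sketch of the layer-generation idea: keep stacking RBM layers only while doing so still improves the error by more than a threshold. The stopping rule and numbers below are illustrative, not the paper's exact criterion:

```python
def grow_dbn_layers(layer_errors, improve_thresh=0.01, max_layers=6):
    """Illustrative layer-generation rule for an adaptive DBN: given the
    error achieved after training each successive RBM layer, keep adding
    layers while stacking still improves the error by more than a
    threshold, and stop at the first plateau (or at max_layers)."""
    depth = 1
    for prev, cur in zip(layer_errors, layer_errors[1:]):
        if prev - cur <= improve_thresh or depth >= max_layers:
            break
        depth += 1
    return depth

# error keeps dropping for three layers, then plateaus -> depth 3
print(grow_dbn_layers([0.30, 0.20, 0.12, 0.119, 0.118]))  # 3
```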

Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

Chuhui Xue, Shijian Lu, Fangneng Zhan

This paper presents a scene text detection technique that exploits bootstrapping and text border semantics for accurate localization of texts in scenes. A novel bootstrapping technique is designed which samples multiple ‘subsections’ of a word or text line and accordingly relieves the constraint of limited training data effectively. At the same time, the repeated sampling of text ‘subsections’ improves the consistency of the predicted text feature maps, which is critical in predicting a single complete box instead of multiple broken boxes for long words or text lines. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for each scene text. With semantics-aware text borders, scene texts can be localized more accurately by regressing text pixels around the ends of words or text lines instead of all text pixels, which often leads to inaccurate localization while dealing with long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained over several public datasets, e.g., an 80.1 f-score on MSRA-TD500 and a 67.1 f-score on ICDAR2017-RCTW. [1807.03547v2]

“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention

Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin, Jiebo Luo

Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels, and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms the state-of-the-art approaches, without using extra ground truth supervision. [1807.03871v1]

Vision System for AGI: Problems and Directions

Alexey Potapov, Sergey Rodionov, Maxim Peterson, Oleg Shcherbakov, Innokentii Zhdanov, Nikolai Skorobogatko

What frameworks and architectures are necessary to create a vision system for AGI? In this paper, we propose a formal model that states the task of perception within AGI. We show the role of discriminative and generative models in achieving efficient and general solution of this task, thus specifying the task in more detail. We discuss some existing generative and discriminative models and demonstrate their insufficiency for our purposes. Finally, we discuss some architectural dilemmas and open questions. [1807.03887v1]

Deep Imbalanced Attribute Classification using Visual Attention Aggregation

For many computer vision applications such as image description and human identification, recognizing the visual attributes of humans is an essential yet challenging problem. Its challenges originate from its multi-label nature, the large underlying class imbalance and the lack of spatial annotations. Existing methods follow either a computer vision approach while failing to account for class imbalance, or explore machine learning solutions, which disregard the spatial and semantic relations that exist in the images. With that in mind, we propose an effective method that extracts and aggregates visual attention masks at different scales. We introduce a loss function to handle class imbalance both at class and at an instance level and further demonstrate that penalizing attention masks with high prediction variance accounts for the weak supervision of the attention mechanism. By identifying and addressing these challenges, we achieve state-of-the-art results with a simple attention mechanism in both PETA and WIDER-Attribute datasets without additional context or side information. [1807.03903v1]
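One way to realize a class-level imbalance-aware loss, shown here as an illustrative sketch rather than the paper's exact formulation, is to up-weight positives of rare attributes in a binary cross-entropy:

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_freq, eps=1e-7):
    """Illustrative imbalance-aware binary cross-entropy for multi-label
    attributes: positives of rare attributes (low pos_freq) are up-weighted
    so the model cannot ignore them. The exponential weighting is an
    assumption, not the paper's exact loss."""
    y_true = np.asarray(y_true, float)
    y_pred = np.clip(np.asarray(y_pred, float), eps, 1 - eps)
    w_pos = np.exp(1 - pos_freq)   # rare attribute -> larger positive weight
    w_neg = np.exp(pos_freq)       # common attribute -> larger negative weight
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1 - y_true) * np.log(1 - y_pred))
    return float(loss.mean())

# missing a rare positive attribute costs more than missing a common one
print(weighted_bce([1.0], [0.1], pos_freq=0.05) > weighted_bce([1.0], [0.1], pos_freq=0.95))  # True
```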

An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm

Restricted Boltzmann Machine (RBM) is a generative stochastic energy-based artificial neural network model for unsupervised learning. RBM is well known as a pre-training method for deep learning. In addition to visible and hidden neurons, the structure of an RBM involves a number of parameters, such as the weights between neurons and their coefficients. Determining an optimal network structure for analyzing big data is therefore difficult. To avoid this problem, we investigated the variance of the parameters during learning in order to find an optimal structure; specifically, we monitor the parameter variance that causes fluctuations in the RBM energy function. In this paper, we propose an adaptive learning method for RBMs that can discover an optimal number of hidden neurons according to the training situation by applying a neuron generation and annihilation algorithm. In this method, a new hidden neuron is generated if the energy function has not yet converged and the variance of the parameters is large, while an inactive hidden neuron is annihilated if it does not affect the learning situation. Experimental results on several benchmark data sets are discussed in this paper. [1807.03478v2]
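The generation/annihilation rule can be sketched as follows; the thresholds and the split-with-perturbation choice are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def adapt_hidden_neurons(weights, param_variance, activity,
                         var_thresh=0.5, act_thresh=0.05, seed=0):
    """Sketch of neuron generation/annihilation: a hidden neuron whose
    parameters still fluctuate strongly is 'overloaded', so a new neuron
    is generated by splitting it with a small perturbation; a neuron with
    negligible activity is annihilated. Thresholds are illustrative."""
    rng = np.random.default_rng(seed)
    cols = []
    for j in range(weights.shape[1]):
        if activity[j] < act_thresh:
            continue                         # annihilation: drop inactive neuron
        cols.append(weights[:, j])
        if param_variance[j] > var_thresh:   # generation: split fluctuating neuron
            cols.append(weights[:, j] + 0.01 * rng.standard_normal(weights.shape[0]))
    return np.column_stack(cols)

W = np.ones((4, 3))           # 4 visible x 3 hidden units
var = [0.9, 0.1, 0.1]         # hidden neuron 0 still fluctuates
act = [0.8, 0.9, 0.01]        # hidden neuron 2 is nearly inactive
print(adapt_hidden_neurons(W, var, act).shape)  # (4, 3): one split, one removed
```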

Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition

Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris

In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% while improving accuracy by 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains. [1807.03848v1]
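A shapes-only sketch of the big/little branch merge, with no learned convolutions; the channel widths and the nearest-neighbour upsampling choice are illustrative:

```python
import numpy as np

def big_little_merge(x):
    """Minimal two-branch sketch of the Big-Little idea: the 'big' branch
    runs at half resolution (cheap per pixel, more channels), the 'little'
    branch at full resolution with fewer channels; their features are
    merged by upsampling the big branch and summing at matching scale."""
    h, w, c = x.shape
    big = x[::2, ::2, :]                           # half-resolution branch
    big = np.concatenate([big, big], axis=-1)      # stands in for a wider conv: 2c channels
    big = big.repeat(2, axis=0).repeat(2, axis=1)  # nearest-neighbour upsample back
    little = np.concatenate([x, np.zeros_like(x)], axis=-1)  # project little branch to 2c
    return big + little                            # frequent merge of the two scales

x = np.ones((8, 8, 3))
print(big_little_merge(x).shape)  # (8, 8, 6)
```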
