# W-net：用于2D医学图像分割的桥接U-net + 对真实照片的卷积盲去噪 + Sem-GAN：语义一致的图像到图像转换

W-net: Bridged U-net for 2D Medical Image Segmentation

Wanli Chen, Yue Zhang, Junjun He, Yu Qiao, Yifan Chen, Hongjian Shi, Xiaoying Tang

In this paper, we focus on three problems in deep learning based medical image segmentation. Firstly, U-net, as a popular model for medical image segmentation, is difficult to train when convolutional layers increase even though a deeper network usually has a better generalization ability because of more learnable parameters. Secondly, the exponential ReLU (ELU), as an alternative of ReLU, is not much different from ReLU when the network of interest gets deep. Thirdly, the Dice loss, as one of the pervasive loss functions for medical image segmentation, is not effective when the prediction is close to ground truth and will cause oscillation during training. To address the aforementioned three problems, we propose and validate a deeper network that can fit medical image datasets that are usually small in the sample size. Meanwhile, we propose a new loss function to accelerate the learning process and a combination of different activation functions to improve the network performance. Our experimental results suggest that our network is comparable or superior to state-of-the-art methods. [1807.04459v1]

Visual Reinforcement Learning with Imagined Goals

Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised “practice” phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. [1807.04742v1]

Learning Product Codebooks using Vector Quantized Autoencoders for Image Retrieval

Hanwei Wu, Markus Flierl

The Vector Quantized-Variational Autoencoder (VQ-VAE) provides an unsupervised model for learning discrete representations by combining vector quantization and autoencoders. The VQ-VAE can avoid the issue of “posterior collapse” so that its learned discrete representation is meaningful. In this paper, we incorporate the product quantization into the bottleneck stage of VQ-VAE and propose an end-to-end unsupervised learning model for the image retrieval tasks. Compared to the classic vector quantization, product quantization has the advantage of generating large size codebook and fast retrieval can be achieved by using the lookup tables that store the distance between every two sub-codewords. In our proposed model, the product codebook is jointly learned with the encoders and decoders of the autoencoders. The encodings of query and database images can be generated by feeding the images into the trained encoder and learned product codebook. The experiment shows that our proposed model outperforms other state-of-the-art hashing and quantization methods for image retrieval. [1807.04629v1]

Competitive Analysis System for Theatrical Movie Releases Based on Movie Trailer Deep Video Representation

Miguel Campo, Cheng-Kang Hsieh, Matt Nickens, JJ Espinoza, Abhinav Taliyan, Julie Rieger, Jean Ho, Bettina Sherick

Audience discovery is an important activity at major movie studios. Deep models that use convolutional networks to extract frame-by-frame features of a movie trailer and represent it in a form that is suitable for prediction are now possible thanks to the availability of pre-built feature extractors trained on large image datasets. Using these pre-built feature extractors, we are able to process hundreds of publicly available movie trailers, extract frame-by-frame low level features (e.g., a face, an object, etc) and create video-level representations. We use the video-level representations to train a hybrid Collaborative Filtering model that combines video features with historical movie attendance records. The trained model not only makes accurate attendance and audience prediction for existing movies, but also successfully profiles new movies six to eight months prior to their release. [1807.04465v1]

Adding Attentiveness to the Neurons in Recurrent Neural Networks

Pengfei Zhang, Jianru Xue, Cuiling Lan, Wenjun Zeng, Zhanning Gao, Nanning Zheng

Recurrent neural networks (RNNs) are capable of modeling the temporal dynamics of complex sequential information. However, the structures of existing RNN neurons mainly focus on controlling the contributions of current and historical information but do not explore the different importance levels of different elements in an input vector of a time slot. We propose adding a simple yet effective Element-wiseAttention Gate (EleAttG) to an RNN block (e.g., all RNN neurons in a network layer) that empowers the RNN neurons to have the attentiveness capability. For an RNN block, an EleAttG is added to adaptively modulate the input by assigning different levels of importance, i.e., attention, to each element/dimension of the input. We refer to an RNN block equipped with an EleAttG as an EleAtt-RNN block. Specifically, the modulation of the input is content adaptive and is performed at fine granularity, being element-wise rather than input-wise. The proposed EleAttG, as an additional fundamental unit, is general and can be applied to any RNN structures, e.g., standard RNN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). We demonstrate the effectiveness of the proposed EleAtt-RNN by applying it to the action recognition tasks on both 3D human skeleton data and RGB videos. Experiments show that adding attentiveness through EleAttGs to RNN blocks significantly boosts the power of RNNs. [1807.04445v1]

Toward Convolutional Blind Denoising of Real Photographs

Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, Lei Zhang

Despite their success in Gaussian denoising, deep convolutional neural networks (CNNs) are still very limited on real noisy photographs, and may even perform worse than the representative traditional methods such as BM3D and K-SVD. In order to improve the robustness and practicability of deep denoising models, this paper presents a convolutional blind denoising network (CBDNet) by incorporating network architecture, noise modeling, and asymmetric learning. Our CBDNet is comprised of a noise estimation subnetwork and a denoising subnetwork, and is trained using a more realistic noise model by considering both signal-dependent noise and in-camera processing pipeline. Motivated by the asymmetric sensitivity of non-blind denoisers (e.g., BM3D) to noise estimation error, the asymmetric learning is presented on the noise estimation subnetwork to suppress more on under-estimation of noise level. To make the learned model applicable to real photographs, both synthetic images based on realistic noise model and real noisy photographs with nearly noise-free images are incorporated to train our CBDNet. The results on three datasets of real noisy photographs clearly demonstrate the superiority of our CBDNet over the state-of-the-art denoisers in terms of quantitative metrics and visual quality. The code and models will be publicly available at https://github.com/GuoShi28/CBDNet. [1807.04686v1]

LandmarkBoost: Efficient Visual Context Classifiers for Robust Localization

Marcin Dymczyk, Igor Gilitschenski, Juan Nieto, Simon Lynen, Bernhard Zeisl, Roland Siegwart

The growing popularity of autonomous systems creates a need for reliable and efficient metric pose retrieval algorithms. Currently used approaches tend to rely on nearest neighbor search of binary descriptors to perform the 2D-3D matching and guarantee realtime capabilities on mobile platforms. These methods struggle, however, with the growing size of the map, changes in viewpoint or appearance, and visual aliasing present in the environment. The rigidly defined descriptor patterns only capture a limited neighborhood of the keypoint and completely ignore the overall visual context. We propose LandmarkBoost – an approach that, in contrast to the conventional 2D-3D matching methods, casts the search problem as a landmark classification task. We use a boosted classifier to classify landmark observations and directly obtain correspondences as classifier scores. We also introduce a formulation of visual context that is flexible, efficient to compute, and can capture relationships in the entire image plane. The original binary descriptors are augmented with contextual information and informative features are selected by the boosting framework. Through detailed experiments, we evaluate the retrieval quality and performance of LandmarkBoost, demonstrating that it outperforms common state-of-the-art descriptor matching methods. [1807.04702v1]

Video Saliency Detection by 3D Convolutional Neural Networks

Guanqun Ding, Yuming Fang

Different from salient object detection methods for still images, a key challenging for video saliency detection is how to extract and combine spatial and temporal features. In this paper, we present a novel and effective approach for salient object detection for video sequences based on 3D convolutional neural networks. First, we design a 3D convolutional network (Conv3DNet) with the input as three video frame to learn the spatiotemporal features for video sequences. Then, we design a 3D deconvolutional network (Deconv3DNet) to combine the spatiotemporal features to predict the final saliency map for video sequences. Experimental results show that the proposed saliency detection model performs better in video saliency prediction compared with the state-of-the-art video saliency detection methods. [1807.04514v1]

Learning to Segment Medical Images with Scribble-Supervision Alone

Yigit B. Can, Krishna Chaitanya, Basil Mustafa, Lisa M. Koch, Ender Konukoglu, Christian F. Baumgartner

Semantic segmentation of medical images is a crucial step for the quantification of healthy anatomy and diseases alike. The majority of the current state-of-the-art segmentation algorithms are based on deep neural networks and rely on large datasets with full pixel-wise annotations. Producing such annotations can often only be done by medical professionals and requires large amounts of valuable time. Training a medical image segmentation network with weak annotations remains a relatively unexplored topic. In this work we investigate training strategies to learn the parameters of a pixel-wise segmentation network from scribble annotations alone. We evaluate the techniques on public cardiac (ACDC) and prostate (NCI-ISBI) segmentation datasets. We find that the networks trained on scribbles suffer from a remarkably small degradation in Dice of only 2.9% (cardiac) and 4.5% (prostate) with respect to a network trained on full annotations. [1807.04668v1]

Deep semi-supervised segmentation with weight-averaged consistency targets

Recently proposed techniques for semi-supervised learning such as Temporal Ensembling and Mean Teacher have achieved state-of-the-art results in many important classification benchmarks. In this work, we expand the Mean Teacher approach to segmentation tasks and show that it can bring important improvements in a realistic small data regime using a publicly available multi-center dataset from the Magnetic Resonance Imaging (MRI) domain. We also devise a method to solve the problems that arise when using traditional data augmentation strategies for segmentation tasks on our new training scheme. [1807.04657v1]

Robustness Analysis of Pedestrian Detectors for Surveillance

Yuming Fang, Guanqun Ding, Yuan Yuan, Weisi Lin, Haiwen Liu

To obtain effective pedestrian detection results in surveillance video, there have been many methods proposed to handle the problems from severe occlusion, pose variation, clutter background, \emph{etc}. Besides detection accuracy, a robust surveillance video system should be stable to video quality degradation by network transmission, environment variation, \emph{etc}. In this study, we conduct the research on the robustness of pedestrian detection algorithms to video quality degradation. The main contribution of this work includes the following three aspects. First, a large-scale Distorted Surveillance Video Data Set (\emph{DSurVD}) is constructed from high-quality video sequences and their corresponding distorted versions. Second, we design a method to evaluate detection stability and a robustness measure called \emph{Robustness Quadrangle}, which can be adopted to visualize detection accuracy of pedestrian detection algorithms on high-quality video sequences and stability with video quality degradation. Third, the robustness of seven existing pedestrian detection algorithms is evaluated by the built \emph{DSurVD}. Experimental results show that the robustness can be further improved for existing pedestrian detection algorithms. Additionally, we provide much in-depth discussion on how different distortion types influence the performance of pedestrian detection algorithms, which is important to design effective pedestrian detection algorithms for surveillance. [1807.04562v1]

Deep Learning for Imbalance Data Classification using Class Expert Generative Adversarial Network

Fanny, Tjeng Wawan Cenggoro

Without any specific way for imbalance data classification, artificial intelligence algorithm cannot recognize data from minority classes easily. In general, modifying the existing algorithm by assuming that the training data is imbalanced, is the only way to handle imbalance data. However, for a normal data handling, this way mostly produces a deficient result. In this research, we propose a class expert generative adversarial network (CE-GAN) as the solution for imbalance data classification. CE-GAN is a modification in deep learning algorithm architecture that does not have an assumption that the training data is imbalance data. Moreover, CE-GAN is designed to identify more detail about the character of each class before classification step. CE-GAN has been proved in this research to give a good performance for imbalance data classification. [1807.04585v1]

Subsampled Turbulence Removal Network

Wai Ho Chak, Chun Pong Lau, Lok Ming Lui

We present a deep-learning approach to restore a sequence of turbulence-distorted video frames from turbulent deformations and space-time varying blurs. Instead of requiring a massive training sample size in deep networks, we purpose a training strategy that is based on a new data augmentation method to model turbulence from a relatively small dataset. Then we introduce a subsampled method to enhance the restoration performance of the presented GAN model. The contributions of the paper is threefold: first, we introduce a simple but effective data augmentation algorithm to model the turbulence in real life for training in the deep network; Second, we firstly purpose the Wasserstein GAN combined with $\ell_1$ cost for successful restoration of turbulence-corrupted video sequence; Third, we combine the subsampling algorithm to filter out strongly corrupted frames to generate a video sequence with better quality. [1807.04418v1]

Sem-GAN: Semantically-Consistent Image-to-Image Translation

Anoop Cherian, Alan Sullivan

Unpaired image-to-image translation is the problem of mapping an image in the source domain to one in the target domain, without requiring corresponding image pairs. To ensure the translated images are realistically plausible, recent works, such as Cycle-GAN, demands this mapping to be invertible. While, this requirement demonstrates promising results when the domains are unimodal, its performance is unpredictable in a multi-modal scenario such as in an image segmentation task. This is because, invertibility does not necessarily enforce semantic correctness. To this end, we present a semantically-consistent GAN framework, dubbed Sem-GAN, in which the semantics are defined by the class identities of image segments in the source domain as produced by a semantic segmentation algorithm. Our proposed framework includes consistency constraints on the translation task that, together with the GAN loss and the cycle-constraints, enforces that the images when translated will inherit the appearances of the target domain, while (approximately) maintaining their identities from the source domain. We present experiments on several image-to-image translation tasks and demonstrate that Sem-GAN improves the quality of the translated images significantly, sometimes by more than 20% on the FCN score. Further, we show that semantic segmentation models, trained with synthetic images translated via Sem-GAN, leads to significantly better segmentation results than other variants. [1807.04409v1]

A Reflectance Based Method For Shadow Detection and Removal

Shadows are common aspect of images and when left undetected can hinder scene understanding and visual processing. We propose a simple yet effective approach based on reflectance to detect shadows from single image. An image is first segmented and based on the reflectance, illumination and texture characteristics, segments pairs are identified as shadow and non-shadow pairs. The proposed method is tested on two publicly available and widely used datasets. Our method achieves higher accuracy in detecting shadows compared to previous reported methods despite requiring fewer parameters. We also show results of shadow-free images by relighting the pixels in the detected shadow regions. [1807.04352v1]

Deepwound: Automated Postoperative Wound Assessment and Surgical Site Surveillance through Convolutional Neural Networks

Varun Shenoy, Elizabeth Foster, Lauren Aalami, Bakar Majeed, Oliver Aalami

Postoperative wound complications are a significant cause of expense for hospitals, doctors, and patients. Hence, an effective method to diagnose the onset of wound complications is strongly desired. Algorithmically classifying wound images is a difficult task due to the variability in the appearance of wound sites. Convolutional neural networks (CNNs), a subgroup of artificial neural networks that have shown great promise in analyzing visual imagery, can be leveraged to categorize surgical wounds. We present a multi-label CNN ensemble, Deepwound, trained to classify wound images using only image pixels and corresponding labels as inputs. Our final computational model can accurately identify the presence of nine labels: drainage, fibrinous exudate, granulation tissue, surgical site infection, open wound, staples, steri strips, and sutures. Our model achieves receiver operating curve (ROC) area under curve (AUC) scores, sensitivity, specificity, and F1 scores superior to prior work in this area. Smartphones provide a means to deliver accessible wound care due to their increasing ubiquity. Paired with deep neural networks, they offer the capability to provide clinical insight to assist surgeons during postoperative care. We also present a mobile application frontend to Deepwound that assists patients in tracking their wound and surgical recovery from the comfort of their home. [1807.04355v1]

A Generic Approach to Lung Field Segmentation from Chest Radiographs using Deep Space and Shape Learning

Awais Mansoor, Juan J. Cerrolaza, Geovanny Perez, Elijah Biggs, Kazunori Okada, Gustavo Nino, Marius George Linguraru

Computer-aided diagnosis (CAD) techniques for lung field segmentation from chest radiographs (CXR) have been proposed for adult cohorts, but rarely for pediatric subjects. Statistical shape models (SSMs), the workhorse of most state-of-the-art CXR-based lung field segmentation methods, do not efficiently accommodate shape variation of the lung field during the pediatric developmental stages. The main contributions of our work are: (1) a generic lung field segmentation framework from CXR accommodating large shape variation for adult and pediatric cohorts; (2) a deep representation learning detection mechanism, \emph{ensemble space learning}, for robust object localization; and (3) \emph{marginal shape deep learning} for the shape deformation parameter estimation. Unlike the iterative approach of conventional SSMs, the proposed shape learning mechanism transforms the parameter space into marginal subspaces that are solvable efficiently using the recursive representation learning mechanism. Furthermore, our method is the first to include the challenging retro-cardiac region in the CXR-based lung segmentation for accurate lung capacity estimation. The framework is evaluated on 668 CXRs of patients between 3 month to 89 year of age. We obtain a mean Dice similarity coefficient of $0.96\pm0.03$ (including the retro-cardiac region). For a given accuracy, the proposed approach is also found to be faster than conventional SSM-based iterative segmentation methods. The computational simplicity of the proposed generic framework could be similarly applied to the fast segmentation of other deformable objects. [1807.04339v1]

A Trilateral Weighted Sparse Coding Scheme for Real-World Image Denoising

Jun Xu, Lei Zhang, David Zhang

Most of existing image denoising methods assume the corrupted noise to be additive white Gaussian noise (AWGN). However, the realistic noise in real-world noisy images is much more complex than AWGN, and is hard to be modelled by simple analytical distributions. As a result, many state-of-the-art denoising methods in literature become much less effective when applied to real-world noisy images captured by CCD or CMOS cameras. In this paper, we develop a trilateral weighted sparse coding (TWSC) scheme for robust real-world image denoising. Specifically, we introduce three weight matrices into the data and regularisation terms of the sparse coding framework to characterise the statistics of realistic noise and image priors. TWSC can be reformulated as a linear equality-constrained problem and can be solved by the alternating direction method of multipliers. The existence and uniqueness of the solution and convergence of the proposed algorithm are analysed. Extensive experiments demonstrate that the proposed TWSC scheme outperforms state-of-the-art denoising methods on removing realistic noise. [1807.04364v1]

W-net：用于2D医学图像分割的桥接U-net

Wanli Chen, Yue Zhang, Junjun He, Yu Qiao, Yifan Chen, Hongjian Shi, Xiaoying Tang

Ashvin NairVitchyr PongMurtaza DalalShikhar BahlSteven LinSergey Levine

Miguel CampoCheng-Kang HsiehMatt NickensJJ EspinozaAbhinav TaliyanJulie RiegerJean HoBettina Sherick

Pengfei Zhang, Jianru Xue, Cuiling Lan, Wenjun Zeng, Zhanning Gao, Nanning Zheng

Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, Lei Zhang

LandmarkBoost：用于稳健本地化的高效视觉上下文分类器

Marcin DymczykIgor GilitschenskiJuan NietoSimon LynenBernhard ZeislRoland Siegwart

Guanqun Ding, Yuming Fang

Yigit B. CanKrishna ChaitanyaBasil MustafaLisa M. KochEnder KonukogluChristian F. Baumgartner

Yuming Fang, Guanqun Ding, Yuan Yuan, Weisi Lin, Haiwen Liu

Wai Ho Chak, Chun Pong Lau, Lok Ming Lui

Sem-GAN：语义一致的图像到图像转换

Anoop CherianAlan Sullivan

Deepwound：通过卷积神经网络自动化术后伤口评估和手术部位监测

Varun ShenoyElizabeth FosterLauren AalamiBakar MajeedOliver Aalami

Awais MansoorJuan J. CerrolazaGeovanny PerezElijah BiggsKazunori OkadaGustavo NinoMarius George Linguraru

Jun Xu, Lei Zhang, David Zhang