## Neural and Evolutionary Computing

### Neural Trajectory Analysis of Recurrent Neural Network In Handwriting Synthesis

Osamu Shouno

Comments: 4 pages, 3 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

Recurrent neural networks (RNNs) are capable of learning to generate highly

realistic, online handwritings in a wide variety of styles from a given text

sequence. Furthermore, the networks can generate handwritings in the style of a

particular writer when the network states are primed with a real sequence of

pen movements from the writer. However, how populations of neurons in the RNN

collectively achieve such performance still remains poorly understood. To

tackle this problem, we investigated learned representations in RNNs by

extracting low-dimensional, neural trajectories that summarize the activity of

a population of neurons in the network during individual syntheses of

handwritings. The neural trajectories show that different writing styles are

encoded in different subspaces inside an internal space of the network. Within

each subspace, different characters of the same style are represented as

different state dynamics. These results demonstrate the effectiveness of

analyzing the neural trajectory for intuitive understanding of how the RNNs

work.

### The unreasonable effectiveness of the forget gate

Joan Lasenby

Comments: 15 pages, 5 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

; Learning (cs.LG); Machine Learning (stat.ML)

Given the success of the gated recurrent unit, a natural question is whether

all the gates of the long short-term memory (LSTM) network are necessary.

Previous research has shown that the forget gate is one of the most important

gates in the LSTM. Here we show that a forget-gate-only version of the LSTM

with chrono-initialized biases not only provides computational savings but also

outperforms the standard LSTM on multiple benchmark datasets and competes with

some of the best contemporary models. Our proposed network, the JANET, achieves

accuracies of 99% and 92.5% on the MNIST and pMNIST datasets, outperforming the

standard LSTM which yields accuracies of 98.5% and 91%.
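
The forget-gate-only update described above is compact enough to sketch. Below is a minimal NumPy illustration of such a cell with chrono-initialized forget biases; the layer sizes, weight initialization, and variable names are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ForgetGateOnlyCell:
    """Sketch of a forget-gate-only recurrent cell (JANET-style).

    The candidate update is gated by f and (1 - f); there is no input or
    output gate. The forget-gate bias is chrono-initialized so that initial
    memory time scales are spread over roughly [1, t_max].
    """
    def __init__(self, input_size, hidden_size, t_max=100, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = 1.0 / np.sqrt(hidden_size)
        self.U_f = rng.uniform(-scale, scale, (hidden_size, input_size))
        self.W_f = rng.uniform(-scale, scale, (hidden_size, hidden_size))
        self.U_c = rng.uniform(-scale, scale, (hidden_size, input_size))
        self.W_c = rng.uniform(-scale, scale, (hidden_size, hidden_size))
        # chrono initialization: b_f ~ log(Uniform(1, t_max - 1))
        self.b_f = np.log(rng.uniform(1.0, t_max - 1.0, hidden_size))
        self.b_c = np.zeros(hidden_size)

    def step(self, x, h_prev):
        f = sigmoid(self.U_f @ x + self.W_f @ h_prev + self.b_f)
        c_tilde = np.tanh(self.U_c @ x + self.W_c @ h_prev + self.b_c)
        return f * h_prev + (1.0 - f) * c_tilde  # a single gate does all the mixing

cell = ForgetGateOnlyCell(input_size=8, hidden_size=16)
h = np.zeros(16)
for x in np.random.randn(20, 8):  # a toy length-20 input sequence
    h = cell.step(x, h)
```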

### Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Peter L. Bartlett , Steven N. Evans , Philip M. Long **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Statistics Theory (math.ST); Machine Learning (stat.ML)

We show that any smooth bi-Lipschitz \(h\) can be represented exactly as a

composition \(h_m \circ \dots \circ h_1\) of functions \(h_1, \dots, h_m\) that are close

to the identity in the sense that each \(\left(h_i - \mathrm{Id}\right)\) is

Lipschitz, and the Lipschitz constant decreases inversely with the number \(m\)

of functions composed. This implies that \(h\) can be represented to any accuracy

by a deep residual network whose nonlinear layers compute functions with a

small Lipschitz constant. Next, we consider nonlinear regression with a

composition of near-identity nonlinear maps. We show that, regarding Fréchet

derivatives with respect to the \(h_1, \dots, h_m\), any critical point of a

quadratic criterion in this near-identity region must be a global minimizer. In

contrast, if we consider derivatives with respect to parameters of a fixed-size

residual network with sigmoid activation functions, we show that there are

near-identity critical points that are suboptimal, even in the realizable case.

Informally, this means that functional gradient methods for residual networks

cannot get stuck at suboptimal critical points corresponding to near-identity

layers, whereas parametric gradient methods for sigmoidal residual networks

suffer from suboptimal critical points in the near-identity region.
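
In symbols, the representation result can be restated as follows (our paraphrase of the abstract; the constant \(c\) is unspecified and depends on \(h\)):

```latex
% Paraphrase: h splits into m maps, each within O(1/m) of the identity
% in Lipschitz norm; c is an unspecified constant depending on h.
h \;=\; h_m \circ \cdots \circ h_1,
\qquad
\bigl\| h_i - \mathrm{Id} \bigr\|_{\mathrm{Lip}} \;\le\; \frac{c}{m},
\qquad i = 1, \dots, m.
```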

### μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Tal Ben-Nun ,

Torsten Hoefler ,

Satoshi Matsuoka

Comments: 11 pages, 14 figures. Part of the content has been published in IPSJ SIG Technical Report, Vol. 2017-HPC-162, No. 22, pp. 1-9, 2017. (DOI: this http URL )

**Subjects**

:

Learning (cs.LG)

; Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used

in deep learning. Specifically, cuDNN implements several equivalent convolution

algorithms, whose performance and memory footprint may vary considerably,

depending on the layer dimensions. When an algorithm is automatically selected

by cuDNN, the decision is performed on a per-layer basis, and thus it often

resorts to slower algorithms that fit the workspace size constraints. We

present μ-cuDNN, a transparent wrapper library for cuDNN, which divides

layers’ mini-batch computation into several micro-batches. Based on Dynamic

Programming and Integer Linear Programming, μ-cuDNN enables faster

algorithms by decreasing the workspace requirements. At the same time,

μ-cuDNN keeps the computational semantics unchanged, so that it decouples

statistical efficiency from the hardware efficiency safely. We demonstrate the

effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow,

achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on a P100-SXM2

GPU. These results indicate that using micro-batches can seamlessly increase

the performance of deep learning, while maintaining the same memory footprint.
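
The workspace-constrained algorithm choice can be illustrated with a small dynamic program over micro-batch sizes. The sketch below is a simplified reading of the idea: the benchmark table, micro-batch sizes, and workspace limit are made up for illustration, whereas the real library obtains such figures from cuDNN itself.

```python
from functools import lru_cache

# Hypothetical per-micro-batch measurements: for a micro-batch of size b,
# candidate convolution algorithms as (runtime_ms, workspace_bytes).
BENCH = {
    1: [(1.0, 0), (0.7, 64 << 20)],
    2: [(1.8, 0), (1.1, 128 << 20)],
    4: [(3.4, 0), (1.9, 256 << 20)],
    8: [(6.6, 0), (3.5, 512 << 20)],
}
WORKSPACE_LIMIT = 256 << 20  # 256 MiB

def best_time(b):
    """Fastest algorithm for micro-batch size b that fits the workspace limit."""
    feasible = [t for t, ws in BENCH[b] if ws <= WORKSPACE_LIMIT]
    return min(feasible) if feasible else None

@lru_cache(maxsize=None)
def best_split(n):
    """Minimal total time to process n samples as a sequence of micro-batches."""
    if n == 0:
        return 0.0, ()
    best = (float("inf"), ())
    for b in BENCH:
        t_b = best_time(b)
        if b <= n and t_b is not None:
            t_rest, split = best_split(n - b)
            best = min(best, (t_b + t_rest, (b,) + split))
    return best

print(best_split(8))  # e.g. (3.8, (4, 4)): two micro-batches of 4 beat one batch of 8
```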

### Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

Christoph Treude , Markus Wagner **Subjects** : Computation and Language (cs.CL) ; Neural and Evolutionary Computing (cs.NE)

To make sense of large amounts of textual data, topic modelling is frequently

used as a text-mining tool for the discovery of hidden semantic structures in

text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model

that aims to explain the structure of a corpus by grouping texts. LDA requires

multiple parameters to work well, and there are only rough and sometimes

conflicting guidelines available on how these parameters should be set. In this

paper, we contribute (i) a broad study of parameters to arrive at good local

optima, (ii) an a-posteriori characterisation of text corpora related to eight

programming languages from GitHub and Stack Overflow, and (iii) an analysis of

corpus feature importance via per-corpus LDA configuration.
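
A per-corpus parameter sweep of this kind can be sketched with gensim. The toy documents, parameter grid, and coherence score below are illustrative assumptions and stand in for the paper's much larger corpora and tuning procedure.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy tokenized documents standing in for GitHub / Stack Overflow text.
docs = [
    ["python", "list", "comprehension", "loop"],
    ["java", "nullpointerexception", "stack", "trace"],
    ["python", "pandas", "dataframe", "merge"],
    ["java", "spring", "bean", "injection"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

best = None
for num_topics in (2, 3, 4):
    for alpha in ("symmetric", "asymmetric"):
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, alpha=alpha,
                       passes=10, random_state=0)
        score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if best is None or score > best[0]:
            best = (score, num_topics, alpha)

print("best per-corpus configuration:", best)
```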

## Computer Vision and Pattern Recognition

### Unsupervised Sparse Dirichlet-Net for Hyperspectral Image Super-Resolution

Hairong Qi ,

Chiman Kwan

Comments: Accepted by The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018)

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

In many computer vision applications, obtaining images of high resolution in

both the spatial and spectral domains is equally important. However, due to

hardware limitations, one can only expect to acquire images of high resolution

in either the spatial or spectral domains. This paper focuses on hyperspectral

image super-resolution (HSI-SR), where a hyperspectral image (HSI) with low

spatial resolution (LR) but high spectral resolution is fused with a

multispectral image (MSI) with high spatial resolution (HR) but low spectral

resolution to obtain HR HSI. Existing deep learning-based solutions are all

supervised that would need a large training set and the availability of HR HSI,

which is unrealistic. Here, we make the first attempt to solve the HSI-SR

problem using an unsupervised encoder-decoder architecture that carries the

following unique properties. First, it is composed of two encoder-decoder networks,

coupled through a shared decoder, in order to preserve the rich spectral

information from the HSI network. Second, the network encourages the

representations from both modalities to follow a sparse Dirichlet distribution

which naturally incorporates the two physical constraints of HSI and MSI.

Third, the angular difference between representations is minimized in order to

reduce the spectral distortion. We refer to the proposed architecture as

unsupervised Sparse Dirichlet-Net, or uSDN. Extensive experimental results

demonstrate the superior performance of uSDN as compared to the

state-of-the-art.
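
The angular-difference term mentioned above can be written down directly. The snippet below is a minimal sketch of a spectral-angle measure between two representation vectors; it reflects our reading of the constraint, not the paper's exact loss.

```python
import numpy as np

def spectral_angle(a, b, eps=1e-8):
    """Angle between two representation vectors; driving it toward zero
    keeps the two branches spectrally consistent."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

a = np.array([0.20, 0.50, 0.30])
b = np.array([0.25, 0.45, 0.30])
print(spectral_angle(a, b))  # small angle -> little spectral distortion
```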

### Comparatives, Quantifiers, Proportions: A Multi-Task Model for the Learning of Quantities from Vision

Ionut-Teodor Sorodoc ,

Raffaella Bernardi

Comments: 12 pages (references included). To appear in the Proceedings of NAACL-HLT 2018

Journal-ref: Proceedings of NAACL-HLT 2018

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Learning (cs.LG); Machine Learning (stat.ML)

The present work investigates whether different quantification mechanisms

(set comparison, vague quantification, and proportional estimation) can be

jointly learned from visual scenes by a multi-task computational model. The

motivation is that, in humans, these processes underlie the same cognitive,

non-symbolic ability, which allows an automatic estimation and comparison of

set magnitudes. We show that when information about lower-complexity tasks is

available, the higher-level proportional task becomes more accurate than when

performed in isolation. Moreover, the multi-task model is able to generalize to

unseen combinations of target/non-target objects. Consistently with behavioral

evidence showing the interference of absolute number in the proportional task,

the multi-task model no longer works when asked to provide the number of target

objects in the scene.

### Convolutional Neural Networks for Skull-stripping in Brain MR Imaging using Consensus-based Silver standard Masks

Oeslle Lucena , Roberto Souza , Leticia Rittner , Richard Frayne , Roberto Lotufo **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Convolutional neural networks (CNN) for medical imaging are constrained by

the amount of annotated data required in the training stage. Usually, manual

annotation is considered to be the “gold standard”. However, medical imaging

datasets that include expert manual segmentation are scarce as this step is

time-consuming, and therefore expensive. Moreover, single-rater manual

annotation is most often used in data-driven approaches, making the network

optimal with respect to only that single expert. In this work, we propose a CNN

for brain extraction in magnetic resonance (MR) imaging, that is fully trained

with what we refer to as silver standard masks. Our method consists of 1)

developing a dataset with “silver standard” masks as input, and implementing

both 2) a tri-planar method using parallel 2D U-Net-based CNNs (referred to as

CONSNet) and 3) an auto-context implementation of CONSNet. The term CONSNet

refers to our integrated approach, i.e., training with silver standard masks

and using a 2D U-Net-based architecture. Our results showed that we

outperformed (i.e., larger Dice coefficients) the current state-of-the-art SS

methods. Our use of silver standard masks reduced the cost of manual

annotation, decreased inter-intra-rater variability, and avoided CNN

segmentation super-specialization towards one specific manual annotation

guideline that can occur when gold standard masks are used. Moreover, the usage

of silver standard masks greatly enlarges the volume of input annotated data

because we can relatively easily generate labels for unlabeled data. In

addition, our method has the advantage that, once trained, it takes only a few

seconds to process a typical brain image volume using modern hardware, such as

a high-end graphics processing unit. In contrast, many of the other competitive

methods have processing times in the order of minutes.

### An efficient deep convolutional Laplacian pyramid architecture for CS reconstruction at low sampling ratios

Heyao Xu ,

Xinwei Gao ,

Shengping Zhang ,

Feng Jiang ,

Debin Zhao

Comments: 5 pages. Accepted by ICASSP2018

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Compressed sensing (CS) has been successfully applied to image

compression in the past few years as most image signals are sparse in a certain

domain. Several CS reconstruction models have been proposed and obtained

superior performance. However, these methods suffer from blocking artifacts or

ringing effects at low sampling ratios in most cases. To address this problem,

we propose a deep convolutional Laplacian Pyramid Compressed Sensing Network

(LapCSNet) for CS, which consists of a sampling sub-network and a

reconstruction sub-network. In the sampling sub-network, we utilize a

convolutional layer to mimic the sampling operator. In contrast to the fixed

sampling matrices used in traditional CS methods, the filters used in our

convolutional layer are jointly optimized with the reconstruction sub-network.

In the reconstruction sub-network, two branches are designed to reconstruct

multi-scale residual images and multi-scale target images progressively using a

Laplacian pyramid architecture. The proposed LapCSNet not only integrates

multi-scale information to achieve better performance but also reduces

computational cost dramatically. Experimental results on benchmark datasets

demonstrate that the proposed method is capable of reconstructing more details

and sharper edges than the state-of-the-art methods.
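
The idea of replacing a fixed CS sampling matrix with a convolutional sampling operator can be sketched as block-wise projections. The block size, filter count, and random filters below are placeholders for the jointly optimized filters described in the abstract.

```python
import numpy as np

def conv_sampling(image, filters, block=32):
    """Mimic CS sampling with a convolution: every non-overlapping block is
    projected onto a small set of filters, giving measurements at a
    sampling ratio of len(filters) / block**2."""
    h, w = image.shape
    meas = np.zeros((len(filters), h // block, w // block))
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            patch = image[i:i + block, j:j + block].ravel()
            meas[:, i // block, j // block] = filters @ patch
    return meas

rng = np.random.default_rng(0)
image = rng.random((64, 64))
filters = rng.standard_normal((10, 32 * 32))  # ~1% sampling ratio (10 / 1024)
print(conv_sampling(image, filters).shape)    # (10, 2, 2)
```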

### CNN-based Landmark Detection in Cardiac CTA Scans

Bob D. de Vos ,

Jelmer M. Wolterink ,

Tim Leiner ,

Ivana Išgum

Comments: This work was submitted to MIDL 2018 Conference

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Fast and accurate anatomical landmark detection can benefit many medical

image analysis methods. Here, we propose a method to automatically detect

anatomical landmarks in medical images. Automatic landmark detection is

performed with a patch-based fully convolutional neural network (FCNN) that

combines regression and classification. For any given image patch, regression

is used to predict the 3D displacement vector from the image patch to the

landmark. Simultaneously, classification is used to identify patches that

contain the landmark. Under the assumption that patches close to a landmark can

determine the landmark location more precisely than patches farther from it,

only those patches that contain the landmark according to classification are

used to determine the landmark location. The landmark location is obtained by

calculating the average landmark location using the computed 3D displacement

vectors. The method is evaluated using detection of six clinically relevant

landmarks in coronary CT angiography (CCTA) scans: the right and left ostium,

the bifurcation of the left main coronary artery (LM) into the left anterior

descending and the left circumflex artery, and the origin of the right,

non-coronary, and left aortic valve commissure. The proposed method achieved an

average Euclidean distance error of 2.19 mm and 2.88 mm for the right and left

ostium respectively, 3.78 mm for the bifurcation of the LM, and 1.82 mm, 2.10

mm and 1.89 mm for the origin of the right, non-coronary, and left aortic valve

commissure respectively, demonstrating accurate performance. The proposed

combination of regression and classification can be used to accurately detect

landmarks in CCTA scans.
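
The way the classification and regression outputs are combined can be written as a simple voting step. The sketch below follows the abstract's description; the array shapes, threshold, and fallback rule are our own assumptions.

```python
import numpy as np

def estimate_landmark(patch_centers, displacements, contains_prob, threshold=0.5):
    """Average the landmark votes of patches classified as containing it.

    patch_centers : (N, 3) voxel coordinates of patch centres
    displacements : (N, 3) predicted offsets from patch centre to landmark
    contains_prob : (N,)  classification output per patch
    """
    keep = contains_prob >= threshold
    if not keep.any():
        keep = contains_prob == contains_prob.max()  # fall back to the most confident patch
    votes = patch_centers[keep] + displacements[keep]
    return votes.mean(axis=0)

centers = np.array([[10., 10., 10.], [12., 9., 11.], [40., 40., 40.]])
disps   = np.array([[ 1., 0.5, 0.2], [-1., 1.5, -0.8], [0., 0., 0.]])
probs   = np.array([0.9, 0.8, 0.1])
print(estimate_landmark(centers, disps, probs))  # [11. 10.5 10.2]; the far patch is ignored
```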

### Pose estimation of a single circle using default intrinsic calibration

Mariyanayagam Damien , Gurdjos Pierre , Chambon Sylvie , Brunet Florent , Charvillat Vincent **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Circular markers are planar markers which offer great performances for

detection and pose estimation. For an uncalibrated camera with an unknown focal

length, the images of at least two coplanar circles are generally

required to recover their poses. Unfortunately, detecting more than one ellipse

in the image can be tricky and time-consuming, especially for concentric

circles. On the other hand, when the camera is calibrated, one circle suffices

but the solution is twofold and can hardly be disambiguated. Our contribution

is to go beyond this limit by dealing with the uncalibrated case of a camera

seeing one circle and discussing how to remove the ambiguity. We propose a new

problem formulation that enables us to show how to detect geometric configurations

in which the ambiguity can be removed. Furthermore, we introduce the notion of

default camera intrinsics and show, through extensive empirical work, the

surprising observation that very approximate calibration can lead to accurate

circle pose estimation.

### Learning to Exploit the Prior Network Knowledge for Weakly-Supervised Semantic Segmentation

Carolina Redondo-Cabrera , Roberto J. López-Sastre **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Training a Convolutional Neural Network (CNN) for semantic segmentation

typically requires collecting a large amount of accurate pixel-level

annotations, a hard and expensive task. In contrast, simple image tags are

easier to gather. With this paper we introduce a novel weakly-supervised

semantic segmentation model able to learn from image labels, and just image

labels. Our model uses the prior knowledge of a network trained for image

recognition, employing these image annotations, as an attention mechanism to

identify semantic regions in the images. We then present a methodology that

builds accurate class-specific segmentation masks from these regions, where

neither external objectness nor saliency algorithms are required. We describe

how to incorporate this mask generation strategy into a fully end-to-end

trainable process where the network jointly learns to classify and segment

images. Our experiments on the PASCAL VOC 2012 dataset show that exploiting these

generated class-specific masks in conjunction with our novel end-to-end

learning process outperforms several recent weakly-supervised semantic

segmentation methods that use image tags only, and even some models that

leverage additional supervision or training data.
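
One common way to obtain such attention from an image-recognition network is a class activation map. The sketch below uses CAM-style weighting as a stand-in for the paper's prior-network attention; the shapes and threshold are illustrative assumptions.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Weight the last conv feature maps by the classifier weights of one
    class to highlight the regions driving that class prediction.
    features: (C, H, W) conv features; fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=(0, 0))  # (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)

def seed_mask(cam, threshold=0.3):
    """Binarize the attention map into a rough class-specific seed mask."""
    return (cam >= threshold).astype(np.uint8)
```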

### Group Anomaly Detection using Deep Generative Models

Edward Toth (School of Information Technologies, The University of Sydney),

Sanjay Chawla (Qatar Computing Research Institute, HBKU)

Comments: Under review at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML 2018), Dublin, Ireland, 10-14 September 2018

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Unlike conventional anomaly detection research that focuses on point

anomalies, our goal is to detect anomalous collections of individual data

points. In particular, we perform group anomaly detection (GAD) with an

emphasis on irregular group distributions (e.g. irregular mixtures of image

pixels). GAD is an important task in detecting unusual and anomalous phenomena

in real-world applications such as high energy particle physics, social media,

and medical imaging. In this paper, we take a generative approach by proposing

deep generative models: Adversarial autoencoder (AAE) and variational

autoencoder (VAE) for group anomaly detection. Both AAE and VAE detect group

anomalies using point-wise input data where group memberships are known a

priori. We conduct extensive experiments to evaluate our models on real-world

datasets. The empirical results demonstrate that our approach is effective and

robust in detecting group anomalies.
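
Given a trained AAE or VAE, a natural group score is an aggregate of point-wise reconstruction errors. The sketch below illustrates that scoring step; the aggregation choice and the toy "autoencoder" are our assumptions.

```python
import numpy as np

def group_anomaly_score(group_points, reconstruct):
    """Score a group by the mean reconstruction error of its points;
    groups whose points are poorly reconstructed score as anomalous."""
    errors = [np.sum((x - reconstruct(x)) ** 2) for x in group_points]
    return float(np.mean(errors))

rng = np.random.default_rng(0)
normal_group    = rng.normal(0.0, 1.0, size=(50, 8))
anomalous_group = rng.normal(5.0, 1.0, size=(50, 8))
reconstruct = lambda x: np.clip(x, -2, 2)  # toy stand-in for a trained autoencoder
print(group_anomaly_score(normal_group, reconstruct),
      group_anomaly_score(anomalous_group, reconstruct))  # the second score is far larger
```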

### BodyNet: Volumetric Inference of 3D Human Body Shapes

Gül Varol , Duygu Ceylan , Bryan Russell , Jimei Yang , Ersin Yumer , Ivan Laptev , Cordelia Schmid **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Human shape estimation is an important task for video editing, animation and

the fashion industry. Predicting 3D human body shape from natural images, however,

is highly challenging due to factors such as variation in human bodies,

clothing and viewpoint. Prior methods addressing this problem typically attempt

to fit parametric body models with certain priors on pose and shape. In this

work we argue for an alternative representation and propose BodyNet, a neural

network for direct inference of volumetric body shape from a single image.

BodyNet is an end-to-end trainable network that benefits from (i) a volumetric

3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate

supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them

results in performance improvement as demonstrated by our experiments. To

evaluate the method, we fit the SMPL model to our network output and show

state-of-the-art results on the SURREAL and Unite the People datasets,

outperforming recent approaches. Besides achieving state-of-the-art

performance, our method also enables volumetric body-part segmentation.

### Learning Warped Guidance for Blind Face Restoration

Ming Liu ,

Yuting Ye ,

Wangmeng Zuo ,

Liang Lin ,

Ruigang Yang

Comments: 25 pages, 14 figures and 1 table

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

This paper studies the problem of blind face restoration from an

unconstrained blurry, noisy, low-resolution, or compressed image (i.e.,

degraded observation). For better recovery of fine facial details, we modify

the problem setting by taking both the degraded observation and a high-quality

guided image of the same identity as input to our guided face restoration

network (GFRNet). However, the degraded observation and guided image generally

are different in pose, illumination and expression, thereby making plain CNNs

(e.g., U-Net) fail to recover fine and identity-aware facial details. To tackle

this issue, our GFRNet model includes both a warping subnetwork (WarpNet) and a

reconstruction subnetwork (RecNet). The WarpNet is introduced to predict flow

field for warping the guided image to correct pose and expression (i.e., warped

guidance), while the RecNet takes the degraded observation and warped guidance

as input to produce the restoration result. Because the ground-truth flow

field is unavailable, a landmark loss together with total variation

regularization is incorporated to guide the learning of WarpNet. Furthermore,

to make the model applicable to blind restoration, our GFRNet is trained on the

synthetic data with versatile settings on blur kernel, noise level,

downsampling scale factor, and JPEG quality factor. Experiments show that our

GFRNet not only performs favorably against the state-of-the-art image and face

restoration methods, but also generates visually photo-realistic results on

real degraded facial images.
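
The WarpNet guidance terms can be sketched directly from the abstract: a landmark loss on the warped guidance plus a total-variation penalty on the flow field. The weighting and exact norms below are our assumptions.

```python
import numpy as np

def total_variation(flow):
    """TV regularizer on an (H, W, 2) flow field: penalizes non-smooth warps
    where no ground-truth flow is available."""
    return np.abs(np.diff(flow, axis=0)).sum() + np.abs(np.diff(flow, axis=1)).sum()

def landmark_loss(warped_landmarks, target_landmarks):
    """Mean squared distance between landmarks of the warped guidance image
    and those of the degraded observation."""
    return np.mean(np.sum((warped_landmarks - target_landmarks) ** 2, axis=1))

def warpnet_objective(flow, warped_lm, target_lm, tv_weight=1e-4):
    return landmark_loss(warped_lm, target_lm) + tv_weight * total_variation(flow)
```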

### Spline Error Weighting for Robust Visual-Inertial Fusion

Per-Erik Forssén

Comments: To appear in CVPR 2018

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

In this paper we derive and test a probability-based weighting that can

balance residuals of different types in spline fitting. In contrast to previous

formulations, the proposed spline error weighting scheme also incorporates a

prediction of the approximation error of the spline fit. We demonstrate the

effectiveness of the prediction in a synthetic experiment, and apply it to

visual-inertial fusion on rolling shutter cameras. This results in a method

that can estimate 3D structure with metric scale on generic first-person

videos. We also propose a quality measure for spline fitting, that can be used

to automatically select the knot spacing. Experiments verify that the obtained

trajectory quality corresponds well with the requested quality. Finally, by

linearly scaling the weights, we show that the proposed spline error weighting

minimizes the estimation errors on real sequences, in terms of scale and

end-point errors.

### Offline and Online calibration of Mobile Robot and SLAM Device for Navigation

Ryoichi Ishikawa , Takeshi Oishi , Katsushi Ikeuchi **Subjects** : Computer Vision and Pattern Recognition (cs.CV) ; Robotics (cs.RO)

Robot navigation technology is required to accomplish difficult tasks in

various environments. For navigation, it is necessary to know both the external

environment and the state of the robot within that environment. Meanwhile,

various studies have been done on SLAM technology, which is used not only for

navigation but also in devices for Mixed Reality and the like.

In this paper, we propose a robot-device calibration method for navigation

with a device using SLAM technology on a robot. The calibration is performed by

using the position and orientation information given by the robot and the

device. During calibration, the most efficient way of moving is determined

according to the constraints on the robot's movement. Furthermore, we also show a

method to dynamically correct the position and orientation of the robot so that

the information of the external environment and the shape information of the

robot maintain consistency in order to reduce the dynamic error occurring

during navigation.

Our method can be easily used for various kinds of robots and localization

with sufficient precision for navigation is possible with offline calibration

and online position correction. In the experiments, we confirm the parameters

obtained by two types of offline calibration according to the degree of freedom

of robot movement and validate the effectiveness of online correction method by

plotting localized position error during robot’s intense movement. Finally, we

show the demonstration of navigation using SLAM device.

### MSnet: Mutual Suppression Network for Disentangled Video Representations

Jangho Lee ,

Sungmin Lee ,

Sungroh Yoon

Comments: 17 pages, 7 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

The extraction of meaningful features from videos is important as they can be

used in various applications. Despite its importance, video representation

learning has not been studied much, because it is challenging to deal with both

content and motion information. We present a Mutual Suppression network (MSnet)

to learn disentangled motion and content features in videos. The MSnet is

trained in such a way that content features do not contain motion information and

motion features do not contain content information; this is done by suppressing

each other with adversarial training. We utilize the disentangled features from

the MSnet for several tasks, such as frame reproduction, pixel-level video

frame prediction, and dense optical flow estimation, to demonstrate the

strength of MSnet. The proposed model outperforms the state-of-the-art methods

in pixel-level video frame prediction. The source code will be publicly

available.

### Learning Deep Sketch Abstraction

Yongxin Yang ,

Yi-Zhe Song ,

Tao Xiang ,

Timothy M. Hospedales

Comments: This paper is accepted at CVPR 2018 as poster

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Human free-hand sketches have been studied in various contexts including

sketch recognition, synthesis and fine-grained sketch-based image retrieval

(FG-SBIR). A fundamental challenge for sketch analysis is to deal with

drastically different human drawing styles, particularly in terms of

abstraction level. In this work, we propose the first stroke-level sketch

abstraction model based on the insight of sketch abstraction as a process of

trading off between the recognizability of a sketch and the number of strokes

used to draw it. Concretely, we train a model for abstract sketch generation

through reinforcement learning of a stroke removal policy that learns to

predict which strokes can be safely removed without affecting recognizability.

We show that our abstraction model can be used for various sketch analysis

tasks including: (1) modeling stroke saliency and understanding the decision of

sketch recognition models, (2) synthesizing sketches of variable abstraction

for a given category, or reference object instance in a photo, and (3) training

a FG-SBIR model with photos only, bypassing the expensive photo-sketch pair

collection step.

### Precise Temporal Action Localization by Evolving Temporal Proposals

Haonan Qiu , Yingbin Zheng , Hao Ye , Yao Lu , Feng Wang , Liang He **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Locating actions in long untrimmed videos has been a challenging problem in

video content analysis. The performances of existing action localization

approaches remain unsatisfactory in precisely determining the beginning and the

end of an action. Imitating the human perception procedure with observations

and refinements, we propose a novel three-phase action localization framework.

Our framework is embedded with an Actionness Network to generate initial

proposals through frame-wise similarity grouping, and then a Refinement Network

to conduct boundary adjustment on these proposals. Finally, the refined

proposals are sent to a Localization Network for further fine-grained location

regression. The whole process can be deemed as multi-stage refinement using a

novel non-local pyramid feature under various temporal granularities. We

evaluate our framework on THUMOS14 benchmark and obtain a significant

improvement over the state-of-the-art approaches. Specifically, the

performance gain is remarkable under precise localization with high IoU

thresholds. Our proposed framework achieves mAP@IoU=0.5 of 34.2%.

### Talking Face Generation by Conditional Recurrent Adversarial Network

Jingwen Zhu ,

Xiaolong Wang ,

Hairong Qi

Comments: Project Page: this http URL

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Given an arbitrary face image and an arbitrary speech clip, the proposed work

attempts to generate a talking face video with accurate lip synchronization

while maintaining smooth transition of both lip and facial movement over the

entire video clip. Existing works either do not consider temporal dependency on

face images across different video frames, thus easily yielding

noticeable/abrupt facial and lip movement, or are limited to the generation

of talking face video for a specific person thus lacking generalization

capacity. We propose a novel conditional video generation network where the

audio input is treated as a condition for the recurrent adversarial network

such that temporal dependency is incorporated to realize smooth transition for

the lip and facial movement. In addition, we deploy a multi-task adversarial

training scheme in the context of video generation to improve both

photo-realism and the accuracy for lip synchronization. Finally, based on the

phoneme distribution information extracted from the audio clip, we develop a

sample selection method that effectively reduces the size of the training

dataset without sacrificing the quality of the generated video. Extensive

experiments on both controlled and uncontrolled datasets demonstrate the

superiority of the proposed approach in terms of visual quality, lip sync

accuracy, and smooth transition of lip and facial movement, as compared to the

state-of-the-art.

### Deep Motion Boundary Detection

Xiyang Dai ,

Xinchao Wang ,

Maojun Zhang ,

Dacheng Tao ,

Larry Davis

Comments: 17 pages, 5 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Motion boundary detection is a crucial yet challenging problem. Prior methods

focus on analyzing the gradients and distributions of optical flow fields, or

use hand-crafted features for motion boundary learning. In this paper, we

propose the first dedicated end-to-end deep learning approach for motion

boundary detection, which we term MoBoNet. We introduce a refinement network

structure which takes source input images, initial forward and backward optical

flows as well as corresponding warping errors as inputs and produces

high-resolution motion boundaries. Furthermore, we show that the obtained

motion boundaries, through a fusion sub-network we design, can in turn guide

the optical flows for removing the artifacts. The proposed MoBoNet is generic

and works with any optical flows. Our motion boundary detection and the refined

optical flow estimation achieve results superior to the state of the art.

### FishEyeRecNet: A Multi-Context Collaborative Deep Network for Fisheye Image Rectification

Xinchao Wang ,

Jun Yu ,

Maojun Zhang ,

Pascal Fua ,

Dacheng Tao

Comments: 16 pages, 5 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Images captured by fisheye lenses violate the pinhole camera assumption and

suffer from distortions. Rectification of fisheye images is therefore a crucial

preprocessing step for many computer vision applications. In this paper, we

propose an end-to-end multi-context collaborative deep network for removing

distortions from single fisheye images. In contrast to conventional approaches,

which focus on extracting hand-crafted features from input images, our method

learns high-level semantics and low-level appearance features simultaneously to

estimate the distortion parameters. To facilitate training, we construct a

synthesized dataset that covers various scenes and distortion parameter

settings. Experiments on both synthesized and real-world datasets show that the

proposed model significantly outperforms current state-of-the-art methods. Our

code and synthesized dataset will be made publicly available.

### A Hybrid Model for Identity Obfuscation by Face Replacement

Ayush Tewari ,

Weipeng Xu ,

Mario Fritz ,

Christian Theobalt ,

Bernt Schiele

Comments: 17 pages of main paper and 5 pages of supplementary materials

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Cryptography and Security (cs.CR)

As more and more personal photos are shared and tagged in social media,

avoiding privacy risks such as unintended recognition becomes increasingly

challenging. We propose a new hybrid approach to obfuscate identities in photos

by head replacement. Our approach combines state of the art parametric face

synthesis with latest advances in Generative Adversarial Networks (GAN) for

data-driven image synthesis. On the one hand, the parametric part of our method

gives us control over the facial parameters and allows for explicit

manipulation of the identity. On the other hand, the data-driven aspects allow

for adding fine details and overall realism as well as seamless blending into

the scene context. In our experiments, we show highly realistic output of our

system that improves over the previous state of the art in obfuscation rate

while preserving a higher similarity to the original image content.

### Multimodal Unsupervised Image-to-Image Translation

Ming-Yu Liu ,

Serge Belongie ,

Jan Kautz

Comments: Code: this https URL

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Learning (cs.LG); Machine Learning (stat.ML)

Unsupervised image-to-image translation is an important and challenging

problem in computer vision. Given an image in the source domain, the goal is to

learn the conditional distribution of corresponding images in the target

domain, without seeing any pairs of corresponding images. While this

conditional distribution is inherently multimodal, existing approaches make an

overly simplified assumption, modeling it as a deterministic one-to-one

mapping. As a result, they fail to generate diverse outputs from a given source

domain image. To address this limitation, we propose a Multimodal Unsupervised

Image-to-image Translation (MUNIT) framework. We assume that the image

representation can be decomposed into a content code that is domain-invariant,

and a style code that captures domain-specific properties. To translate an

image to another domain, we recombine its content code with a random style code

sampled from the style space of the target domain. We analyze the proposed

framework and establish several theoretical results. Extensive experiments with

comparisons to the state-of-the-art approaches further demonstrate the

advantage of the proposed framework. Moreover, our framework allows users to

control the style of translation outputs by providing an example style image.

Code and pretrained models are available at this https URL

### A Variational U-Net for Conditional Appearance and Shape Generation

Ekaterina Sutter ,

Björn Ommer

Comments: CVPR 2018 (Spotlight). Project Page at this https URL

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Deep generative models have demonstrated great performance in image

synthesis. However, results deteriorate in case of spatial deformations, since

they generate images of objects directly, rather than modeling the intricate

interplay of their inherent shape and appearance. We present a conditional

U-Net for shape-guided image generation, conditioned on the output of a

variational autoencoder for appearance. The approach is trained end-to-end on

images, without requiring samples of the same object with varying pose or

appearance. Experiments show that the model enables conditional image

generation and transfer. Therefore, either shape or appearance can be retained

from a query image, while freely altering the other. Moreover, appearance can

be sampled due to its stochastic latent representation, while preserving shape.

In quantitative and qualitative experiments on COCO, DeepFashion, shoes,

Market-1501 and handbags, the approach demonstrates significant improvements

over the state-of-the-art.

### Cross-Domain Visual Recognition via Domain Adaptive Dictionary Learning

Jingjing Zheng ,

Azadeh Alavi ,

Rama Chellappa

Comments: Submitted to IEEE TIP Journal

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

In real-world visual recognition problems, the assumption that the training

data (source domain) and test data (target domain) are sampled from the same

distribution is often violated. This is known as the domain adaptation problem.

In this work, we propose a novel domain-adaptive dictionary learning framework

for cross-domain visual recognition. Our method generates a set of intermediate

domains. These intermediate domains form a smooth path and bridge the gap

between the source and target domains. Specifically, we not only learn a common

dictionary to encode the domain-shared features, but also learn a set of

domain-specific dictionaries to model the domain shift. The separation of the

common and domain-specific dictionaries enables us to learn more compact and

reconstructive dictionaries for domain adaptation. These dictionaries are

learned by alternating between domain-adaptive sparse coding and dictionary

updating steps. Meanwhile, our approach gradually recovers the feature

representations of both source and target data along the domain path. By

aligning all the recovered domain data, we derive the final domain-adaptive

features for cross-domain visual recognition. Extensive experiments on three

public datasets demonstrate that our approach outperforms most

state-of-the-art methods.

### Geometric Consistency for Self-Supervised End-to-End Visual Odometry

Ganesh Iyer , J. Krishna Murthy , Gunshi Gupta , K. Madhava Krishna , Liam Paull **Subjects** : Robotics (cs.RO) ; Computer Vision and Pattern Recognition (cs.CV)

With the success of deep learning based approaches in tackling challenging

problems in computer vision, a wide range of deep architectures have recently

been proposed for the task of visual odometry (VO) estimation. Most of these

proposed solutions rely on supervision, which requires the acquisition of

precise ground-truth camera pose information, collected using expensive motion

capture systems or high-precision IMU/GPS sensor rigs. In this work, we propose

an unsupervised paradigm for deep visual odometry learning. We show that using

a noisy teacher, which could be a standard VO pipeline, and by designing a loss

term that enforces geometric consistency of the trajectory, we can train

accurate deep models for VO that do not require ground-truth labels. We

leverage geometry as a self-supervisory signal and propose “Composite

Transformation Constraints (CTCs)”, that automatically generate supervisory

signals for training and enforce geometric consistency in the VO estimate. We

also present a method of characterizing the uncertainty in VO estimates thus

obtained. To evaluate our VO pipeline, we present exhaustive ablation studies

that demonstrate the efficacy of end-to-end, self-supervised methodologies to

train deep models for monocular VO. We show that leveraging concepts from

geometry and incorporating them into the training of a recurrent neural network

results in performance competitive to supervised deep VO methods.
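
One way to read the Composite Transformation Constraints is as a consistency residual between a directly predicted transform and the composition of intermediate predictions. The sketch below expresses this on 4x4 homogeneous matrices; the exact residual norm used in the paper may differ.

```python
import numpy as np

def composite_transformation_constraint(T_01, T_12, T_02):
    """Residual between the directly predicted frame0->frame2 transform and
    the composition of the two single-step predictions; usable as a
    self-supervised loss term."""
    residual = (T_12 @ T_01) @ np.linalg.inv(T_02) - np.eye(4)
    return np.linalg.norm(residual, ord="fro")

I = np.eye(4)
print(composite_transformation_constraint(I, I, I))  # perfectly consistent -> 0.0
```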

### CalibNet: Self-Supervised Extrinsic Calibration using 3D Spatial Transformer Networks

Karnik Ram R. ,

J. Krishna Murthy ,

K. Madhava Krishna

Comments: Submitted to IEEE International Conference on Intelligent Robots and Systems (IROS) 2018

**Subjects**

:

Robotics (cs.RO)

; Computer Vision and Pattern Recognition (cs.CV)

3D LiDARs and 2D cameras are increasingly being used alongside each other in

sensor rigs for perception tasks. Before these sensors can be used to gather

meaningful data, however, their extrinsics (and intrinsics) need to be

accurately calibrated, as the performance of the sensor rig is extremely

sensitive to these calibration parameters. A vast majority of existing

calibration techniques require significant amounts of data and/or calibration

targets and human effort, severely impacting their applicability in large-scale

production systems. We address this gap with CalibNet: a self-supervised deep

network capable of automatically estimating the 6-DoF rigid body transformation

between a 3D LiDAR and a 2D camera in real-time. CalibNet alleviates the need

for calibration targets, thereby resulting in significant savings in

calibration efforts. During training, the network only takes as input a LiDAR

point cloud, the corresponding monocular image, and the camera calibration

matrix K. At train time, we do not impose direct supervision (i.e., we do not

directly regress to the calibration parameters, for example). Instead, we train

the network to predict calibration parameters that maximize the geometric and

photometric consistency of the input images and point clouds. CalibNet learns

to iteratively solve the underlying geometric problem and accurately predicts

extrinsic calibration parameters for a wide range of mis-calibrations, without

requiring retraining or domain adaptation. The project page is hosted at

this https URL

## Artificial Intelligence

### Monitoring and Executing Workflows in Linked Data Environments

Tobias Käfer , Andreas Harth **Subjects** : Artificial Intelligence (cs.AI) ; Software Engineering (cs.SE)

The W3C’s Web of Things working group is aimed at addressing the

interoperability problem on the Internet of Things using Linked Data as a uniform

interface. While Linked Data paves the way towards combining such devices into

integrated applications, traditional solutions for specifying the control flow

of applications do not work seamlessly with Linked Data. We therefore tackle

the problem of the specification, execution, and monitoring of applications in

the context of Linked Data. We present a novel approach that combines

workflows, semantic reasoning, and RESTful interaction into one integrated

solution. We contribute to the state of the art by (1) defining an ontology for

describing workflow models and instances, (2) providing operational semantics

for the ontology that allows for the execution and monitoring of workflow

instances, (3) presenting a benchmark to evaluate our solution. Moreover, we

showcase how we used the ontology and the operational semantics to monitor

pilots executing workflows in virtual aircraft cockpits.

### Roster Evaluation Based on Classifiers for the Nurse Rostering Problem

Roman Václavík , Přemysl Šůcha , Zdeněk Hanzálek **Subjects** : Artificial Intelligence (cs.AI) ; Learning (cs.LG); Optimization and Control (math.OC)

The personnel scheduling problem is a well-known NP-hard combinatorial

problem. Due to the complexity of this problem and the size of the real-world

instances, it is not possible to use exact methods, and thus heuristics,

meta-heuristics, or hyper-heuristics must be employed. The majority of

heuristic approaches are based on iterative search, where the quality of

intermediate solutions must be calculated. Unfortunately, this is

computationally highly expensive because these problems have many constraints

and some are very complex. In this study, we propose a machine learning

technique as a tool to accelerate the evaluation phase in heuristic approaches.

The solution is based on a simple classifier, which is able to determine

whether the changed solution (more precisely, the changed part of the solution)

is better than the original or not. This decision is made much faster than a

standard cost-oriented evaluation process. However, the classification process

cannot guarantee 100% correctness. Therefore, our approach, which is

illustrated using a tabu search algorithm in this study, includes a filtering

mechanism, where the classifier rejects the majority of the potentially bad

solutions and the remaining solutions are then evaluated in a standard manner.

We also show how the boosting algorithms can improve the quality of the final

solution compared with a simple classifier. We verified our proposed approach

and premises, based on standard and real-world benchmark instances, to

demonstrate the significant speedup obtained with comparable solution quality.
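
The classifier-as-filter idea can be sketched independently of the full tabu search. Below is a simplified local-search loop where a cheap classifier screens neighbours before the expensive roster evaluation; the tabu list and boosting details from the paper are omitted, and the toy usage is purely illustrative.

```python
def classifier_filtered_search(initial, neighbours, evaluate, is_promising, iterations=100):
    """Local search with a learned filter.

    neighbours(s)      -> candidate solutions reachable from s
    evaluate(s)        -> expensive, exact cost of solution s
    is_promising(s, n) -> cheap classifier: is n likely better than s?
    Only candidates accepted by the classifier are evaluated exactly.
    """
    best = current = initial
    best_cost = evaluate(initial)
    for _ in range(iterations):
        survivors = [n for n in neighbours(current) if is_promising(current, n)]
        if not survivors:
            continue
        scored = [(evaluate(n), n) for n in survivors]  # expensive step, now smaller
        current_cost, current = min(scored, key=lambda t: t[0])
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost

# Toy usage: states are integers, the true cost is distance to 42,
# and the "classifier" is a crude check of whether a move gets closer.
neigh = lambda s: [s - 1, s + 1]
cost = lambda s: abs(s - 42)
promising = lambda s, n: abs(n - 42) <= abs(s - 42) + 1
print(classifier_filtered_search(0, neigh, cost, promising, iterations=60))  # (42, 0)
```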

### Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Peter L. Bartlett , Steven N. Evans , Philip M. Long **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Statistics Theory (math.ST); Machine Learning (stat.ML)

We show that any smooth bi-Lipschitz \(h\) can be represented exactly as a

composition \(h_m \circ \dots \circ h_1\) of functions \(h_1, \dots, h_m\) that are close

to the identity in the sense that each \(\left(h_i - \mathrm{Id}\right)\) is

Lipschitz, and the Lipschitz constant decreases inversely with the number \(m\)

of functions composed. This implies that \(h\) can be represented to any accuracy

by a deep residual network whose nonlinear layers compute functions with a

small Lipschitz constant. Next, we consider nonlinear regression with a

composition of near-identity nonlinear maps. We show that, regarding Fréchet

derivatives with respect to the \(h_1, \dots, h_m\), any critical point of a

quadratic criterion in this near-identity region must be a global minimizer. In

contrast, if we consider derivatives with respect to parameters of a fixed-size

residual network with sigmoid activation functions, we show that there are

near-identity critical points that are suboptimal, even in the realizable case.

Informally, this means that functional gradient methods for residual networks

cannot get stuck at suboptimal critical points corresponding to near-identity

layers, whereas parametric gradient methods for sigmoidal residual networks

suffer from suboptimal critical points in the near-identity region.

### Affective Recommendation System for Tourists by Using Emotion Generating Calculations

Issei Tachibana

Comments: 6 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:1804.02657 and arXiv:1804.03994

Journal-ref: Proc. of IEEE 7th International Workshop on Computational

Intelligence and Applications (IWCIA2014)

**Subjects**

:

Human-Computer Interaction (cs.HC)

; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

An emotion-oriented intelligent interface consists of Emotion Generating

Calculations (EGC) and Mental State Transition Network (MSTN). We have

developed the Android EGC application software, in which an agent

evaluates the feelings in a conversation. In this paper, we develop a

tourist information system that can estimate the user's feelings at a

sightseeing spot. The system can recommend sightseeing spots and local

food corresponding to the user's feelings. The system calculates the

recommendation list with an estimation function that combines Google search

results, the importance of a term on the sightseeing website, and the

emotion aroused by EGC. To show the effectiveness, this paper

describes the experimental results for some situations during Hiroshima

sightseeing.

### Successful Nash Equilibrium Agent for a 3-Player Imperfect-Information Game

Sam Ganzfried , Austin Nowak , Joannier Pinales **Subjects** : Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)

Creating strong agents for games with more than two players is a major open

problem in AI. Common approaches are based on approximating game-theoretic

solution concepts such as Nash equilibrium, which have strong theoretical

guarantees in two-player zero-sum games, but no guarantees in non-zero-sum

games or in games with more than two players. We describe an agent that is able

to defeat a variety of realistic opponents using an exact Nash equilibrium

strategy in a 3-player imperfect-information game. This shows that, despite a

lack of theoretical guarantees, agents based on Nash equilibrium strategies can

be successful in multiplayer games after all.

### Efficient Model Identification for Tensegrity Locomotion

Shaojun Zhu , David Surovik , Kostas E. Bekris , Abdeslam Boularias **Subjects** : Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Learning (cs.LG); Systems and Control (cs.SY)

This paper aims to identify in a practical manner unknown physical

parameters, such as mechanical models of actuated robot links, which are

critical in dynamical robotic tasks. Key features include the use of an

off-the-shelf physics engine and the Bayesian optimization framework. The task

being considered is locomotion with a high-dimensional, compliant Tensegrity

robot. A key insight, in this case, is the need to project the model

identification challenge into an appropriate lower dimensional space for

efficiency. Comparisons with alternatives indicate that the proposed method can

identify the parameters more accurately within the given time budget, which

also results in more precise locomotion control.

## Information Retrieval

### DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction

Ruiming Tang ,

Yunming Ye ,

Zhenguo Li ,

Xiuqiang He ,

Zhenhua Dong

Comments: 14 pages. arXiv admin note: text overlap with arXiv:1703.04247

**Subjects**

:

Information Retrieval (cs.IR)

; Learning (cs.LG); Machine Learning (stat.ML)

Learning sophisticated feature interactions behind user behaviors is critical

in maximizing CTR for recommender systems. Despite great progress, existing

methods have a strong bias towards low- or high-order interactions, or rely on

expert feature engineering. In this paper, we show that it is possible to

derive an end-to-end learning model that emphasizes both low- and high-order

feature interactions. The proposed framework, DeepFM, combines the power of

factorization machines for recommendation and deep learning for feature

learning in a new neural network architecture. Compared to the latest Wide &

Deep model from Google, DeepFM has a shared raw feature input to both its

“wide” and “deep” components, with no need for feature engineering besides raw

features. DeepFM, as a general learning framework, can incorporate various

network architectures in its deep component. In this paper, we study two

instances of DeepFM where its “deep” component is a DNN or a PNN, respectively,

which we denote as DeepFM-D and DeepFM-P. Comprehensive experiments are

conducted to demonstrate the effectiveness of DeepFM-D and DeepFM-P over the

existing models for CTR prediction, on both benchmark data and commercial data.

We conduct online A/B test in Huawei App Market, which reveals that DeepFM-D

leads to more than 10% improvement of click-through rate in the production

environment, compared to a well-engineered LR model. We also describe our

practice in deploying the framework in the Huawei App Market.
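
The way the FM term and the deep component share the same embeddings can be sketched in a few lines. The shapes, the toy MLP, and the sigmoid output below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fm_second_order(embeddings):
    """Second-order factorization-machine term over field embeddings,
    via the sum-square minus square-sum trick. embeddings: (num_fields, k)."""
    sum_sq = np.square(embeddings.sum(axis=0))
    sq_sum = np.square(embeddings).sum(axis=0)
    return 0.5 * float((sum_sq - sq_sum).sum())

def deepfm_score(embeddings, w_linear, x_linear, deep_fn):
    """DeepFM-style score: linear term + FM interactions + a deep component
    applied to the same shared embeddings (deep_fn is any callable MLP)."""
    linear = float(np.dot(w_linear, x_linear))
    deep = float(deep_fn(embeddings.reshape(-1)))
    logit = linear + fm_second_order(embeddings) + deep
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 4))     # 5 fields, embedding size 4 (shared input)
w   = rng.standard_normal(5)
x   = np.ones(5)                      # one active feature per field
toy_mlp = lambda v: np.tanh(v).sum()  # stand-in for the DNN / PNN component
print(deepfm_score(emb, w, x, toy_mlp))
```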

### RIPEx: Extracting malicious IP addresses from security forums using cross-forum learning

Evangelos E. Papalexakis ,

Michalis Faloutsos

Comments: 12 pages, Accepted in the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2018

**Subjects**

:

Information Retrieval (cs.IR)

; Learning (cs.LG)

Is it possible to extract malicious IP addresses reported in security forums

in an automatic way? This is the question at the heart of our work. We focus on

security forums, where security professionals and hackers share knowledge and

information, and often report misbehaving IP addresses. So far, there have only

been a few efforts to extract information from such security forums. We propose

RIPEx, a systematic approach to identify and label IP addresses in security

forums by utilizing a cross-forum learning method. In more detail, the

challenge is twofold: (a) identifying IP addresses from other numerical

entities, such as software version numbers, and (b) classifying the IP address

as benign or malicious. We propose an integrated solution that tackles both

these problems. A novelty of our approach is that it does not require training

data for each new forum. Our approach performs knowledge transfer across forums: we

use a classifier from our source forums to identify seed information for

training a classifier on the target forum. We evaluate our method using data

collected from five security forums with a total of 31K users and 542K posts.

First, RIPEx can distinguish IP addresses from other numeric expressions with 95%

precision and above 93% recall on average. Second, RIPEx identifies malicious

IP addresses with an average precision of 88% and over 78% recall, using our

cross-forum learning. Our work is a first step towards harnessing the wealth of

useful information that can be found in security forums.
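
The first sub-problem (telling IP addresses apart from other dotted numbers) can be sketched with a regular expression plus a range check. The regex, example post, and filter below are illustrative and only stand in for the paper's learned, cross-forum classifier, which also handles the genuinely ambiguous cases.

```python
import re

CANDIDATE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

def looks_like_ip(token):
    """Reject dotted numbers whose parts fall outside the 0-255 octet range,
    e.g. build or version identifiers; ambiguous quads would be left to a
    trained classifier."""
    return all(0 <= int(part) <= 255 for part in token.split("."))

def extract_candidate_ips(post):
    return [m for m in CANDIDATE.findall(post) if looks_like_ip(m)]

post = "Block 203.0.113.42, it keeps probing port 22. We run build 10.4.301.7 on the box."
print(extract_candidate_ips(post))  # ['203.0.113.42'] - the build number is filtered out
```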

### Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

Andre Greiner-Petter ,

Philipp Scharpf ,

Norman Meuschke ,

Howard Cohl ,

Bela Gipp

Comments: 10 pages, 4 figures

Journal-ref: Proceedings of the ACM/IEEE-CS Joint Conference on Digital

Libraries (JCDL), Jun. 2018, Fort Worth, USA

**Subjects**

:

Digital Libraries (cs.DL)

; Information Retrieval (cs.IR)

Mathematical formulae represent complex semantic information in a concise

form. Especially in Science, Technology, Engineering, and Mathematics,

mathematical formulae are crucial to communicate information, e.g., in

scientific papers, and to perform computations using computer algebra systems.

Enabling computers to access the information encoded in mathematical formulae

requires machine-readable formats that can represent both the presentation and

content, i.e., the semantics, of formulae. Exchanging such information between

systems additionally requires conversion methods for mathematical

representation formats. We analyze how the semantic enrichment of formulae

improves the format conversion process and show that considering the textual

context of formulae reduces the error rate of such conversions. Our main

contributions are: (1) providing an openly available benchmark dataset for the

mathematical format conversion task consisting of a newly created test

collection, an extensive, manually curated gold standard and task-specific

evaluation metrics; (2) performing a quantitative evaluation of

state-of-the-art tools for mathematical format conversions; (3) presenting a

new approach that considers the textual context of formulae to reduce the error

rate for mathematical format conversions. Our benchmark dataset facilitates

future research on mathematical format conversions as well as research on many

problems in mathematical information retrieval. Because we annotated and linked

all components of formulae, e.g., identifiers, operators and other entities, to

Wikidata entries, the gold standard can, for instance, be used to train methods

for formula concept discovery and recognition. Such methods can then be applied

to improve mathematical information retrieval systems, e.g., for semantic

formula search, recommendation of mathematical content, or detection of

mathematical plagiarism.

### Affective Recommendation System for Tourists by Using Emotion Generating Calculations

Issei Tachibana

Comments: 6 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:1804.02657 and arXiv:1804.03994

Journal-ref: Proc. of IEEE 7th International Workshop on Computational

Intelligence and Applications (IWCIA2014)

**Subjects**

:

Human-Computer Interaction (cs.HC)

; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

An emotion-oriented intelligent interface consists of Emotion Generating
Calculations (EGC) and a Mental State Transition Network (MSTN). We have
developed an Android EGC application in which an agent evaluates the feelings
expressed in a conversation. In this paper, we develop a tourist information
system that estimates the user's feelings at a sightseeing spot and recommends
sightseeing spots and local food corresponding to those feelings. The system
computes the recommendation list with an estimate function that combines Google
search results, the importance of a term on the sightseeing website, and the
emotion aroused by EGC. To show the effectiveness of the system, this paper
reports experimental results for several situations during sightseeing in
Hiroshima.
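
A minimal Python sketch of how such an estimate function could combine the three signals named in the abstract. The weights, field names, and example spots are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): combine the three signals the
# abstract mentions -- search-result counts, term importance on the sightseeing
# website, and the emotion aroused by EGC -- into one recommendation score.

def estimate_score(search_hits, term_importance, egc_emotion, weights=(0.3, 0.3, 0.4)):
    """Weighted combination of three normalized signals, each assumed in [0, 1]."""
    w_search, w_term, w_emotion = weights
    return w_search * search_hits + w_term * term_importance + w_emotion * egc_emotion

spots = {
    "Itsukushima Shrine": estimate_score(0.9, 0.8, 0.7),
    "Shukkei-en Garden":  estimate_score(0.4, 0.6, 0.9),
}
recommendations = sorted(spots, key=spots.get, reverse=True)
print(recommendations)
```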

### Distributed Collaborative Hashing and Its Applications in Ant Financial

Chaochao Chen , Ziqi Liu , Peilin Zhao , Longfei Li , Jun Zhou , Xiaolong Li **Subjects** : Learning (cs.LG) ; Information Retrieval (cs.IR); Machine Learning (stat.ML)

Collaborative filtering, especially latent factor model, has been popularly

used in personalized recommendation. Latent factor model aims to learn user and

item latent factors from user-item historic behaviors. To apply it to real
big data scenarios, efficiency becomes the first concern, including offline
model training efficiency and online recommendation efficiency. In this paper,
we propose a Distributed Collaborative Hashing (DCH) model which can
significantly improve both efficiencies. Specifically, we first propose a
distributed learning framework, following the state-of-the-art parameter server
paradigm, to learn the offline collaborative model. Our model can be learnt
efficiently by distributedly computing subgradients in minibatches on workers
and updating model parameters on servers asynchronously. We then adopt a
hashing technique to speed up the online recommendation procedure.
Recommendations can be quickly made through exploiting lookup hash tables. We
conduct thorough experiments on two real large-scale datasets. The experimental
results demonstrate that, compared with the classic and state-of-the-art
(distributed) latent factor models, DCH has comparable performance in terms of
recommendation accuracy while offering both fast convergence in the offline
model training procedure and real-time efficiency in the online recommendation
procedure.

Furthermore, the encouraging performance of DCH is also shown for several

real-world applications in Ant Financial.
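
A conceptual Python sketch of hashing-based retrieval of the kind the abstract describes: latent factors are binarized and items are ranked by Hamming distance. DCH itself builds lookup hash tables; the brute-force scan and sign-based binarization below are simplifying assumptions for illustration.

```python
import numpy as np

# Conceptual sketch (not the DCH implementation): binarize learned user/item
# latent factors by sign, then rank items for a user by Hamming distance
# between binary codes. DCH uses lookup hash tables; brute force is used here.

rng = np.random.default_rng(0)
user_factors = rng.standard_normal((100, 32))    # toy latent factors
item_factors = rng.standard_normal((5000, 32))

user_codes = (user_factors > 0).astype(np.uint8)
item_codes = (item_factors > 0).astype(np.uint8)

def recommend(user_id, top_k=10):
    # Hamming distance = number of differing bits between the binary codes.
    dists = (user_codes[user_id] ^ item_codes).sum(axis=1)
    return np.argsort(dists)[:top_k]

print(recommend(0))
```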

## Computation and Language

### Pieces of Eight: 8-bit Neural Machine Translation

Miguel Ballesteros

Comments: To appear at NAACL 2018 Industry Track

**Subjects**

:

Computation and Language (cs.CL)

Neural machine translation has achieved levels of fluency and adequacy that

would have been surprising a short time ago. Output quality is extremely
relevant for industry purposes; however, it is equally important to produce

results in the shortest time possible, mainly for latency-sensitive

applications and to control cloud hosting costs. In this paper we show the

effectiveness of translating with 8-bit quantization for models that have been

trained using 32-bit floating point values. Results show that 8-bit translation

makes a non-negligible impact in terms of speed with no degradation in accuracy

and adequacy.
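
A small Python sketch of generic post-training 8-bit quantization of float32 weights. This is a standard symmetric scheme given for orientation; the paper's exact quantizer may differ.

```python
import numpy as np

# Sketch of post-training 8-bit quantization of float32 weights (a generic
# symmetric scheme; not necessarily the scheme used in the paper).

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```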

### Incorporating Dictionaries into Deep Neural Networks for the Chinese Clinical Named Entity Recognition

Yuhang Xia ,

Yangming Zhou ,

Tong Ruan ,

Daqi Gao ,

Ping He

Comments: 21 pages, 6 figures

**Subjects**

:

Computation and Language (cs.CL)

Clinical Named Entity Recognition (CNER) aims to identify and classify

clinical terms such as diseases, symptoms, treatments, exams, and body parts in

electronic health records, which is a fundamental and crucial task for clinical

and translational research. In recent years, deep neural networks have achieved

significant success in named entity recognition and many other Natural Language

Processing (NLP) tasks. Most of these algorithms are trained end to end, and

can automatically learn features from large scale labeled datasets. However,

these data-driven methods typically lack the capability of processing rare or

unseen entities. Previous statistical methods and feature engineering practice

have demonstrated that human knowledge can provide valuable information for

handling rare and unseen cases. In this paper, we address the problem by

incorporating dictionaries into deep neural networks for the Chinese CNER task.

Two different architectures that extend the Bi-directional Long Short-Term

Memory (Bi-LSTM) neural network and five different feature representation

schemes are proposed to handle the task. Computational results on the CCKS-2017

Task 2 benchmark dataset show that the proposed method achieves highly
competitive performance compared with state-of-the-art deep learning
methods.
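
A Python sketch of one simple dictionary feature of the kind that could be concatenated to character embeddings before a Bi-LSTM: per-character flags for beginning or lying inside a dictionary entry. The toy dictionary and the specific scheme are assumptions; the paper's five feature representation schemes are not reproduced here.

```python
# Sketch of a simple dictionary feature for character-level Chinese NER:
# flag whether each character begins or lies inside a dictionary entry.
# One generic scheme, not a reproduction of the paper's five schemes.

dictionary = {"高血压", "糖尿病", "胸痛"}     # toy clinical dictionary
max_len = max(len(w) for w in dictionary)

def dict_features(sentence):
    feats = [[0, 0] for _ in sentence]       # [begins_entry, inside_entry]
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in dictionary:
                feats[i][0] = 1
                for k in range(i + 1, j):
                    feats[k][1] = 1
    return feats

print(dict_features("患者有高血压病史"))
```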

### An Ontology-Based Dialogue Management System for Banking and Finance Dialogue Systems

Comments: 9 pages, 27 figures, goes to 1st Financial Narrative Processing Workshop @ LREC 7-12 May 2018, Miyazaki, Japan

**Subjects**

:

Computation and Language (cs.CL)

Keeping the dialogue state in dialogue systems is a notoriously difficult

task. We introduce an ontology-based dialogue manager (OntoDM), a dialogue

manager that keeps the state of the conversation, provides a basis for anaphora

resolution and drives the conversation via domain ontologies. The banking and

finance area promises great potential for disambiguating the context via a rich

set of products and specificity of proper nouns, named entities and verbs. We

used ontologies both as a knowledge base and a basis for the dialogue manager;

the knowledge base component and dialogue manager components coalesce in a

sense. Domain knowledge is used to track Entities of Interest, i.e. nodes

(classes) of the ontology which happen to be products and services. In this way

we also introduced conversation memory and attention in a sense. We finely

blended linguistic methods, domain-driven keyword ranking and domain ontologies

to create ways of domain-driven conversation. The proposed framework is used in our

in-house German language banking and finance chatbots. General challenges of

German language processing and finance-banking domain chatbot language models

and lexicons are also introduced. This work is still in progress, hence no

success metrics have been introduced yet.

### Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

Christoph Treude , Markus Wagner **Subjects** : Computation and Language (cs.CL) ; Neural and Evolutionary Computing (cs.NE)

To make sense of large amounts of textual data, topic modelling is frequently

used as a text-mining tool for the discovery of hidden semantic structures in

text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model

that aims to explain the structure of a corpus by grouping texts. LDA requires

multiple parameters to work well, and there are only rough and sometimes

conflicting guidelines available on how these parameters should be set. In this

paper, we contribute (i) a broad study of parameters to arrive at good local

optima, (ii) an a-posteriori characterisation of text corpora related to eight

programming languages from GitHub and Stack Overflow, and (iii) an analysis of

corpus feature importance via per-corpus LDA configuration.
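
A short Python sketch of a per-corpus LDA parameter sweep using scikit-learn, scored here by held-out-free training perplexity. The toy corpus, grid, and quality measure are assumptions for illustration and do not reproduce the study's actual corpora or search space.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sketch of a per-corpus LDA parameter sweep (toy corpus and grid only).
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
X = CountVectorizer(max_features=2000, stop_words="english").fit_transform(docs)

best = None
for n_topics in (10, 20, 40):
    for alpha in (0.1, 0.5):
        lda = LatentDirichletAllocation(n_components=n_topics, doc_topic_prior=alpha,
                                        random_state=0).fit(X)
        score = lda.perplexity(X)            # lower is better
        if best is None or score < best[0]:
            best = (score, n_topics, alpha)
print("best (perplexity, n_topics, alpha):", best)
```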

## Distributed, Parallel, and Cluster Computing

### On the Efficiency of Localized Work Stealing

Charles E. Leiserson ,

Tao B. Schardl

Comments: 13 pages, 1 figure

Journal-ref: Information Processing Letters, 116(2):100-106 (2016)

**Subjects**

:

Distributed, Parallel, and Cluster Computing (cs.DC)

; Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)

This paper investigates a variant of the work-stealing algorithm that we call

the localized work-stealing algorithm. The intuition behind this variant is

that because of locality, processors can benefit from working on their own

work. Consequently, when a processor is free, it makes a steal attempt to get

back its own work. We call this type of steal a steal-back. We show that the

expected running time of the algorithm is \(T_1/P+O(T_\infty P)\), and that under
the “even distribution of free agents assumption”, the expected running time of
the algorithm is \(T_1/P+O(T_\infty \lg P)\). In addition, we obtain another
running-time bound based on ratios between the sizes of serial tasks in the
computation. If \(M\) denotes the maximum ratio between the largest and the
smallest serial tasks of a processor after removing a total of \(O(P)\) serial
tasks across all processors from consideration, then the expected running time
of the algorithm is \(T_1/P+O(T_\infty M)\).

### A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

Demetrios Coutinho , Samuel Xavier-de-Souza , Daniel Aloise **Subjects** : Distributed, Parallel, and Cluster Computing (cs.DC)

We present a shared-memory parallel implementation of the Simplex tableau

algorithm for dense large-scale Linear Programming (LP) problems. We present

the general scheme and explain each parallelization step of the standard

simplex algorithm, emphasizing important solutions for solving performance

bottlenecks. We analyzed the speedup and the parallel efficiency for the

proposed implementation relative to the standard Simplex algorithm using a

shared-memory system with 64 processing cores. The experiments were performed

for several different problems, with up to 8192 variables and constraints, in

their primal and dual formulations. The results show that performance is
generally much better when we use the formulation with more variables than
inequality constraints. They also show that the parallelization strategies
applied to avoid bottlenecks allowed the implementation to scale well with the
problem size and the core count, up to a certain problem size. Further
analysis showed that this limit was an effect of resource limitation. Even so,
our implementation was able to reach speedups on the order of 19x.

### Mitigating Docker Security Issues

Comments: 11 pages

**Subjects**

:

Cryptography and Security (cs.CR)

; Distributed, Parallel, and Cluster Computing (cs.DC)

It is very easy to run applications in Docker. Docker offers an ecosystem that
provides a platform for packaging, distributing and managing applications
within containers. However, the Docker platform is not yet mature. Presently,
Docker is less secure than virtual machines (VMs) and most other cloud
technologies. A key reason for Docker's inadequate security is that containers
share the Linux kernel, which creates a risk of privilege escalation. This
research outlines major security vulnerabilities in Docker and countermeasures
to neutralize such attacks. Security attacks come in several varieties,
including insider and outsider attacks; this research outlines both types and
their mitigation strategies, since taking precautionary measures can avert
serious incidents. This research also presents Docker secure deployment
guidelines, which suggest different configurations for deploying Docker
containers in a more secure way.

### μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Tal Ben-Nun ,

Torsten Hoefler ,

Satoshi Matsuoka

Comments: 11 pages, 14 figures. Part of the content have been published in IPSJ SIG Technical Report, Vol. 2017-HPC-162, No. 22, pp. 1-9, 2017. (DOI: this http URL )

**Subjects**

:

Learning (cs.LG)

; Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used

in deep learning. Specifically, cuDNN implements several equivalent convolution

algorithms, whose performance and memory footprint may vary considerably,

depending on the layer dimensions. When an algorithm is automatically selected

by cuDNN, the decision is performed on a per-layer basis, and thus it often

resorts to slower algorithms that fit the workspace size constraints. We

present μ-cuDNN, a transparent wrapper library for cuDNN, which divides
layers’ mini-batch computation into several micro-batches. Based on Dynamic
Programming and Integer Linear Programming, μ-cuDNN enables faster
algorithms by decreasing the workspace requirements. At the same time,
μ-cuDNN keeps the computational semantics unchanged, so that it decouples
statistical efficiency from the hardware efficiency safely. We demonstrate the
effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow,

achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on P100-SXM2

GPU. These results indicate that using micro-batches can seamlessly increase

the performance of deep learning, while maintaining the same memory footprint.
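
A toy Python sketch of the core idea for a single layer: choose a micro-batch size and convolution algorithm that minimize time while fitting a workspace limit. The algorithm names, timings, and workspace sizes are made up; the real tool optimizes all layers jointly with dynamic programming and ILP.

```python
# Toy sketch of micro-batching: for one layer, pick a micro-batch size and
# convolution algorithm minimizing time under a workspace limit. All numbers
# are invented; this is not the library's actual optimization.

MINI_BATCH = 256
WORKSPACE_LIMIT = 64 << 20           # 64 MiB

# (name, time per sample in microseconds, workspace bytes per sample)
ALGOS = [("implicit_gemm", 9.0, 0), ("fft_tiling", 4.0, 1 << 20), ("winograd", 3.0, 4 << 20)]

best = None
for micro in (256, 128, 64, 32):
    for name, t, ws_per_sample in ALGOS:
        if micro * ws_per_sample > WORKSPACE_LIMIT:
            continue                  # algorithm does not fit at this micro-batch size
        total_time = MINI_BATCH * t   # smaller micro-batches only help by unlocking faster algorithms
        if best is None or total_time < best[0]:
            best = (total_time, micro, name)
print("chosen (time_us, micro_batch, algorithm):", best)
```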

### MPSM: Multi-prospective PaaS Security Model

Robail Yasrab **Subjects** : Cryptography and Security (cs.CR) ; Distributed, Parallel, and Cluster Computing (cs.DC)

Cloud computing has brought a revolution in the field of information

technology and improved the efficiency of computational resources. It offers

computing as a service enabling huge cost and resource efficiency. Despite its

advantages, certain security issues still hinder organizations and enterprises

from adopting it. This study mainly focused on the security of

Platform-as-a-Service (PaaS) as well as the most critical security issues that

were documented regarding PaaS infrastructure. The prime outcome of this study

was a security model proposed to mitigate security vulnerabilities of PaaS.

This security model consists of a number of tools, techniques and guidelines to

mitigate and neutralize security issues of PaaS. The security vulnerabilities

along with mitigation strategies were discussed to offer a deep insight into

PaaS security for both vendor and client that may facilitate future design to

implement secure PaaS platforms.

### Asynchronous Parallel Sampling Gradient Boosting Decision Tree

Cheng Daning , Xia Fen , Li Shigang , Zhang Yunquan **Subjects** : Learning (cs.LG) ; Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

With the development of big data technology, the Gradient Boosting Decision
Tree (GBDT) has become one of the most important machine learning algorithms
owing to its accurate output. However, training a GBDT requires substantial
computational resources and time. To accelerate GBDT training, this paper
proposes the asynchronous parallel sampling gradient boosting decision tree
(asynch-SGBDT). By introducing sampling, we recast the numerical optimization
of traditional GBDT training as a stochastic optimization process and use
asynchronous parallel stochastic gradient descent to accelerate training. We
also provide a theoretical analysis of asynch-SGBDT. Experimental results show
that asynch-SGBDT accelerates GBDT training, and our asynchronous parallel
strategy achieves an almost linear speedup, especially for high-dimensional
sparse datasets.

## Learning

### Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Peter L. Bartlett , Steven N. Evans , Philip M. Long **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Statistics Theory (math.ST); Machine Learning (stat.ML)

We show that any smooth bi-Lipschitz \(h\) can be represented exactly as a
composition \(h_m \circ \ldots \circ h_1\) of functions \(h_1,\ldots,h_m\) that are close
to the identity in the sense that each \(\left(h_i - \mathrm{Id}\right)\) is
Lipschitz, and the Lipschitz constant decreases inversely with the number \(m\)
of functions composed. This implies that \(h\) can be represented to any accuracy
by a deep residual network whose nonlinear layers compute functions with a
small Lipschitz constant. Next, we consider nonlinear regression with a
composition of near-identity nonlinear maps. We show that, regarding Fréchet
derivatives with respect to the \(h_1,\ldots,h_m\), any critical point of a
quadratic criterion in this near-identity region must be a global minimizer. In
contrast, if we consider derivatives with respect to parameters of a fixed-size
residual network with sigmoid activation functions, we show that there are
near-identity critical points that are suboptimal, even in the realizable case.

Informally, this means that functional gradient methods for residual networks

cannot get stuck at suboptimal critical points corresponding to near-identity

layers, whereas parametric gradient methods for sigmoidal residual networks

suffer from suboptimal critical points in the near-identity region.
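
A small numerical Python sketch of the near-identity picture: a residual network whose per-layer residual maps have Lipschitz constant on the order of \(1/m\). This is illustrative only and is not the paper's construction or proof.

```python
import numpy as np

# Numerical sketch of near-identity residual layers x -> x + g_i(x), where
# each g_i is forced to have Lipschitz constant about 1/m. Illustrative only.

rng = np.random.default_rng(0)
m, d = 50, 3                                 # number of layers, input dimension

def make_layer(scale):
    W = rng.standard_normal((d, d))
    W *= scale / np.linalg.norm(W, 2)        # set the spectral norm (Lipschitz constant) to `scale`
    b = 0.1 * rng.standard_normal(d)
    return lambda x: x + np.tanh(W @ x + b)  # near-identity residual layer

layers = [make_layer(scale=1.0 / m) for _ in range(m)]

x = rng.standard_normal(d)
y = x
for h in layers:
    y = h(y)
print("input:", x, "\noutput of 50 near-identity layers:", y)
```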

### Distributed Collaborative Hashing and Its Applications in Ant Financial

Chaochao Chen , Ziqi Liu , Peilin Zhao , Longfei Li , Jun Zhou , Xiaolong Li **Subjects** : Learning (cs.LG) ; Information Retrieval (cs.IR); Machine Learning (stat.ML)

Collaborative filtering, especially latent factor model, has been popularly

used in personalized recommendation. Latent factor model aims to learn user and

item latent factors from user-item historic behaviors. To apply it to real
big data scenarios, efficiency becomes the first concern, including offline
model training efficiency and online recommendation efficiency. In this paper,
we propose a Distributed Collaborative Hashing (DCH) model which can
significantly improve both efficiencies. Specifically, we first propose a
distributed learning framework, following the state-of-the-art parameter server
paradigm, to learn the offline collaborative model. Our model can be learnt
efficiently by distributedly computing subgradients in minibatches on workers
and updating model parameters on servers asynchronously. We then adopt a
hashing technique to speed up the online recommendation procedure.
Recommendations can be quickly made through exploiting lookup hash tables. We
conduct thorough experiments on two real large-scale datasets. The experimental
results demonstrate that, compared with the classic and state-of-the-art
(distributed) latent factor models, DCH has comparable performance in terms of
recommendation accuracy while offering both fast convergence in the offline
model training procedure and real-time efficiency in the online recommendation
procedure.

Furthermore, the encouraging performance of DCH is also shown for several

real-world applications in Ant Financial.

### Scalable and Interpretable One-class SVMs with Deep Learning and Random Fourier features

Ngo Anh Vien

Comments: Submitted to ECML-PKDD 2018

**Subjects**

:

Learning (cs.LG)

; Machine Learning (stat.ML)

One-class Support Vector Machine (OC-SVM) has long been one of the most
effective anomaly detection methods and is widely adopted in both research and
industrial applications. Its biggest issue, however, is its limited ability to
operate on large, high-dimensional datasets due to inefficient features and
optimization complexity. Those problems might be mitigated via dimensionality
reduction techniques such as manifold learning or autoencoders. However,
previous work often treats representation learning and anomaly prediction
separately. In this paper, we propose an autoencoder-based one-class SVM
(AE-1SVM) that brings OC-SVM into the deep learning context: random Fourier
features approximate the radial basis kernel, the model is combined with a
representation learning architecture, and stochastic gradient descent is used
for end-to-end training. Interestingly, this also opens up the possible use of
gradient-based attribution methods to explain the decision making for anomaly
detection, which has long been challenging as a result of the implicit mappings
between the input space and the kernel space.

To the best of our knowledge, this is the first work to study the

interpretability of deep learning in anomaly detection. We evaluate our method

on a wide range of unsupervised anomaly detection tasks in which our end-to-end

training architecture achieves a performance significantly better than the

previous work using separate training.
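
A short Python sketch of the random Fourier feature approximation to the RBF kernel, which the abstract names as the bridge between OC-SVM and gradient-based training. The dimensions and bandwidth below are arbitrary choices, and this does not reproduce the AE-1SVM architecture.

```python
import numpy as np

# Sketch of random Fourier features approximating the RBF kernel
# k(x, y) = exp(-gamma * ||x - y||^2). Dimensions and gamma are arbitrary.

rng = np.random.default_rng(0)
d, D, gamma = 20, 2000, 0.5

W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))   # spectral density of the RBF kernel
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = rff(x) @ rff(y)
print(f"exact kernel {exact:.4f} vs RFF approximation {approx:.4f}")
```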

### μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Tal Ben-Nun ,

Torsten Hoefler ,

Satoshi Matsuoka

Comments: 11 pages, 14 figures. Part of the content have been published in IPSJ SIG Technical Report, Vol. 2017-HPC-162, No. 22, pp. 1-9, 2017. (DOI: this http URL )

**Subjects**

:

Learning (cs.LG)

; Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used

in deep learning. Specifically, cuDNN implements several equivalent convolution

algorithms, whose performance and memory footprint may vary considerably,

depending on the layer dimensions. When an algorithm is automatically selected

by cuDNN, the decision is performed on a per-layer basis, and thus it often

resorts to slower algorithms that fit the workspace size constraints. We

present μ-cuDNN, a transparent wrapper library for cuDNN, which divides
layers’ mini-batch computation into several micro-batches. Based on Dynamic
Programming and Integer Linear Programming, μ-cuDNN enables faster
algorithms by decreasing the workspace requirements. At the same time,
μ-cuDNN keeps the computational semantics unchanged, so that it decouples
statistical efficiency from the hardware efficiency safely. We demonstrate the
effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow,

achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on P100-SXM2

GPU. These results indicate that using micro-batches can seamlessly increase

the performance of deep learning, while maintaining the same memory footprint.

### Distribution Regression Network

Connie Kou , Hwee Kuan Lee , Teck Khim Ng **Subjects** : Learning (cs.LG) ; Machine Learning (stat.ML)

We introduce our Distribution Regression Network (DRN) which performs

regression from input probability distributions to output probability

distributions. Compared to existing methods, DRN learns with fewer model

parameters and easily extends to multiple input and multiple output

distributions. On synthetic and real-world datasets, DRN performs similarly or

better than the state-of-the-art. Furthermore, DRN generalizes the conventional

multilayer perceptron (MLP). In the framework of MLP, each node encodes a real

number, whereas in DRN, each node encodes a probability distribution.

### MOVI: A Model-Free Approach to Dynamic Fleet Management

Takuma Oda , Carlee Joe-Wong **Subjects** : Learning (cs.LG) ; Machine Learning (stat.ML)

Modern vehicle fleets, e.g., for ridesharing platforms and taxi companies,

can reduce passengers’ waiting times by proactively dispatching vehicles to

locations where pickup requests are anticipated in the future. Yet it is

unclear how to best do this: optimal dispatching requires optimizing over

several sources of uncertainty, including vehicles’ travel times to their

dispatched locations, as well as coordinating between vehicles so that they do

not attempt to pick up the same passenger. While prior works have developed

models for this uncertainty and used them to optimize dispatch policies, in

this work we introduce a model-free approach. Specifically, we propose MOVI, a

Deep Q-network (DQN)-based framework that directly learns the optimal vehicle

dispatch policy. Since DQNs scale poorly with a large number of possible

dispatches, we streamline our DQN training and suppose that each individual

vehicle independently learns its own optimal policy, ensuring scalability at

the cost of less coordination between vehicles. We then formulate a centralized

receding-horizon control (RHC) policy to compare with our DQN policies. To

compare these policies, we design and build MOVI as a large-scale realistic

simulator based on 15 million taxi trip records that simulates policy-agnostic

responses to dispatch decisions. We show that the DQN dispatch policy reduces

the number of unserviced requests by 76% compared to no dispatching and by 20%

compared to the RHC approach, emphasizing the benefits of a model-free approach

and suggesting that there is limited value to coordinating vehicle actions.

This finding may help to explain the success of ridesharing platforms, for

which drivers make individual decisions.

### Asynchronous Parallel Sampling Gradient Boosting Decision Tree

Cheng Daning , Xia Fen , Li Shigang , Zhang Yunquan **Subjects** : Learning (cs.LG) ; Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

With the development of big data technology, the Gradient Boosting Decision
Tree (GBDT) has become one of the most important machine learning algorithms
owing to its accurate output. However, training a GBDT requires substantial
computational resources and time. To accelerate GBDT training, this paper
proposes the asynchronous parallel sampling gradient boosting decision tree
(asynch-SGBDT). By introducing sampling, we recast the numerical optimization
of traditional GBDT training as a stochastic optimization process and use
asynchronous parallel stochastic gradient descent to accelerate training. We
also provide a theoretical analysis of asynch-SGBDT. Experimental results show
that asynch-SGBDT accelerates GBDT training, and our asynchronous parallel
strategy achieves an almost linear speedup, especially for high-dimensional
sparse datasets.

### 3D G-CNNs for Pulmonary Nodule Detection

Marysia Winkels , Taco S. Cohen **Subjects** : Learning (cs.LG) ; Machine Learning (stat.ML)

Convolutional Neural Networks (CNNs) require a large amount of annotated data

to learn from, which is often difficult to obtain in the medical domain. In

this paper we show that the sample complexity of CNNs can be significantly

improved by using 3D roto-translation group convolutions (G-Convs) instead of

the more conventional translational convolutions. These 3D G-CNNs were applied

to the problem of false positive reduction for pulmonary nodule detection, and

proved to be substantially more effective in terms of performance, sensitivity

to malignant nodules, and speed of convergence compared to a strong and

comparable baseline architecture with regular convolutions, data augmentation

and a similar number of parameters. For every dataset size tested, the G-CNN

achieved a FROC score close to the CNN trained on ten times more data.

### Machine Learning in Astronomy: A Case Study in Quasar-Star Classification

Suryoday Basak ,

Ariruna Dasgupta ,

Surbhi Agrawal ,

Snehanshu Saha

Comments: 10 pages, 8 figures

**Subjects**

:

Instrumentation and Methods for Astrophysics (astro-ph.IM)

; Learning (cs.LG)

We present the results of various automated classification methods, based on

machine learning (ML), of objects from data releases 6 and 7 (DR6 and DR7) of

the Sloan Digital Sky Survey (SDSS), primarily distinguishing stars from

quasars. We provide a careful scrutiny of approaches available in the

literature and have highlighted the pitfalls in those approaches based on the

nature of data used for the study. The aim is to investigate the

appropriateness of the application of certain ML methods. The manuscript argues

convincingly in favor of the efficacy of asymmetric AdaBoost to classify

photometric data. The paper presents a critical review of existing studies and
puts forward an application of asymmetric AdaBoost as an outcome of that
exercise.

### A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

Joshua Saxe , Richard Harang , Cody Wild , Hillary Sanders **Subjects** : Cryptography and Security (cs.CR) ; Learning (cs.LG); Machine Learning (stat.ML)

Malicious web content is a serious problem on the Internet today. In this

paper we propose a deep learning approach to detecting malevolent web pages.

While past work on web content detection has relied on syntactic parsing or on

emulation of HTML and Javascript to extract features, our approach operates

directly on a language-agnostic stream of tokens extracted directly from static

HTML files with a simple regular expression. This makes it fast enough to

operate in high-frequency data contexts like firewalls and web proxies, and

allows it to avoid the attack surface exposure of complex parsing and emulation

code. Unlike well-known approaches such as bag-of-words models, which ignore

spatial information, our neural network examines content at hierarchical

spatial scales, allowing our model to capture locality and yielding superior

accuracy compared to bag-of-words baselines. Our proposed architecture achieves

a 97.5% detection rate at a 0.1% false positive rate, and classifies

small-batched web pages at a rate of over 100 per second on commodity hardware.

The speed and accuracy of our approach makes it appropriate for deployment to

endpoints, firewalls, and web proxies.
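
A Python sketch of the language-agnostic front end the abstract describes: tokens are pulled from raw HTML with a simple regular expression and hashed into a fixed-size bag. The regular expression, hash function, and bin count are illustrative assumptions, not the authors' choices, and the hierarchical network itself is not shown.

```python
import re
import hashlib
import numpy as np

# Sketch of regex-based token extraction from static HTML plus the hashing
# trick to produce fixed-size inputs. Regex and bin count are assumptions.

TOKEN_RE = re.compile(rb"[\w.%+-]+")     # alphanumeric-ish runs in the raw bytes
NUM_BINS = 1024

def featurize(html_bytes):
    counts = np.zeros(NUM_BINS, dtype=np.float32)
    for tok in TOKEN_RE.findall(html_bytes):
        h = int.from_bytes(hashlib.md5(tok).digest()[:4], "little")
        counts[h % NUM_BINS] += 1.0
    return counts

page = b"<html><script src='http://evil.example/x.js'></script></html>"
print(featurize(page).sum(), "tokens hashed into", NUM_BINS, "bins")
```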

### Comparatives, Quantifiers, Proportions: A Multi-Task Model for the Learning of Quantities from Vision

Ionut-Teodor Sorodoc ,

Raffaella Bernardi

Comments: 12 pages (references included). To appear in the Proceedings of NAACL-HLT 2018

Journal-ref: Proceedings of NAACL-HLT 2018

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Learning (cs.LG); Machine Learning (stat.ML)

The present work investigates whether different quantification mechanisms

(set comparison, vague quantification, and proportional estimation) can be

jointly learned from visual scenes by a multi-task computational model. The

motivation is that, in humans, these processes underlie the same cognitive,

non-symbolic ability, which allows an automatic estimation and comparison of

set magnitudes. We show that when information about lower-complexity tasks is

available, the higher-level proportional task becomes more accurate than when

performed in isolation. Moreover, the multi-task model is able to generalize to

unseen combinations of target/non-target objects. Consistently with behavioral

evidence showing the interference of absolute number in the proportional task,

the multi-task model no longer works when asked to provide the number of target

objects in the scene.

### Connectivity in Random Annulus Graphs and the Geometric Block Model

Sainyam Galhotra , Arya Mazumdar , Soumyabrata Pal , Barna Saha **Subjects** : Discrete Mathematics (cs.DM) ; Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Learning (cs.LG)

Random geometric graphs are the simplest, and perhaps the earliest possible

random graph model of spatial networks, introduced by Gilbert in 1961. In the

most basic setting, a random geometric graph \(G(n,r)\) has \(n\) vertices. Each
vertex of the graph is assigned a real number in \([0,1]\) randomly and
uniformly. There is an edge between two vertices if the corresponding two
random numbers differ by at most \(r\) (to mitigate the boundary effect, let us
consider the Lee distance here, \(d_L(u,v) = \min\{|u-v|, 1-|u-v|\}\)). It is
well-known that the connectivity threshold regime for random geometric graphs
is at \(r \approx \frac{\log n}{n}\). In particular, if \(r = \frac{a\log n}{n}\),
then a random geometric graph is connected with high probability if and only if
\(a > 1\). Consider \(G(n,\frac{(1+\epsilon)\log{n}}{n})\) for any \(\epsilon > 0\) to
satisfy the connectivity requirement and delete half of its edges which have
distance at most \(\frac{\log{n}}{2n}\). It is natural to believe that the
resultant graph will be disconnected. Surprisingly, we show that the graph
still remains connected!

Formally, generalizing random geometric graphs, we define a random annulus
graph \(G(n, [r_1, r_2])\), \(r_1 < r_2\), with \(n\) vertices. Each vertex of the graph
is assigned a real number in \([0,1]\) randomly and uniformly as before. There is
an edge between two vertices if the Lee distance between the corresponding two
random numbers is between \(r_1\) and \(r_2\), \(0 < r_1 < r_2\). Let us assume \(r_1 =
\frac{b \log n}{n}\) and \(r_2 = \frac{a \log n}{n}\), \(0 < b < a\). We show that this
graph is connected with high probability if and only if \(a - b > \frac12\) and \(a
> 1\). That is, \(G(n, [0,\frac{0.99\log n}{n}])\) is not connected but
\(G(n,[\frac{0.50 \log n}{n},\frac{(1+\epsilon)\log n}{n}])\) is.

This result is then used to give improved lower and upper bounds on the

recovery threshold of the geometric block model.
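
A small Python simulation sketch of the object studied here: sample a random annulus graph on \([0,1]\) with the Lee distance and test connectivity with a union-find. The parameters are small and chosen only to illustrate the connectivity regime, not to verify the theorem.

```python
import math
import random

# Simulation sketch: sample a random annulus graph G(n, [r1, r2]) on [0, 1]
# with the Lee distance and test connectivity via union-find. Illustration only.

def lee(u, v):
    d = abs(u - v)
    return min(d, 1 - d)

def connected(n, r1, r2, seed=0):
    random.seed(seed)
    pts = [random.random() for _ in range(n)]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if r1 <= lee(pts[i], pts[j]) <= r2:
                parent[find(i)] = find(j)   # union the two components
    return len({find(i) for i in range(n)}) == 1

n = 2000
# a = 1.1, b = 0.5 satisfies a > 1 and a - b > 1/2, so connectivity is expected whp.
print("annulus graph connected:", connected(n, 0.5 * math.log(n) / n, 1.1 * math.log(n) / n))
```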

### Roster Evaluation Based on Classifiers for the Nurse Rostering Problem

Roman Václavík , Přemysl Šůcha , Zdeněk Hanzálek **Subjects** : Artificial Intelligence (cs.AI) ; Learning (cs.LG); Optimization and Control (math.OC)

The personnel scheduling problem is a well-known NP-hard combinatorial

problem. Due to the complexity of this problem and the size of the real-world

instances, it is not possible to use exact methods, and thus heuristics,

meta-heuristics, or hyper-heuristics must be employed. The majority of

heuristic approaches are based on iterative search, where the quality of

intermediate solutions must be calculated. Unfortunately, this is

computationally highly expensive because these problems have many constraints

and some are very complex. In this study, we propose a machine learning

technique as a tool to accelerate the evaluation phase in heuristic approaches.

The solution is based on a simple classifier, which is able to determine

whether the changed solution (more precisely, the changed part of the solution)

is better than the original or not. This decision is made much faster than a

standard cost-oriented evaluation process. However, the classification process

cannot guarantee 100% correctness. Therefore, our approach, which is

illustrated using a tabu search algorithm in this study, includes a filtering

mechanism, where the classifier rejects the majority of the potentially bad

solutions and the remaining solutions are then evaluated in a standard manner.

We also show how the boosting algorithms can improve the quality of the final

solution compared with a simple classifier. We verified our proposed approach

and premises, based on standard and real-world benchmark instances, to

demonstrate the significant speedup obtained with comparable solution quality.
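
A schematic Python sketch of the filtering idea inside a local-search loop: a cheap classifier rejects most candidate moves, and only the survivors receive the expensive exact evaluation. The classifier, features, and cost function below are placeholders, not the paper's trained model or roster constraints.

```python
import random

# Schematic sketch of classifier-filtered evaluation in local search.
# `cheap_classifier`, `features`, and `exact_cost` are placeholder stand-ins.

def exact_cost(solution):                       # expensive, constraint-heavy evaluation
    return sum(x * x for x in solution)

def features(old, new):                         # features of the changed part of the solution
    return [n - o for o, n in zip(old, new)]

def cheap_classifier(feats):                    # predicts "new is better than old"
    return sum(feats) < 0                       # toy stand-in for a trained classifier

def local_search(solution, iters=1000):
    cost = exact_cost(solution)
    for _ in range(iters):
        candidate = list(solution)
        i = random.randrange(len(candidate))
        candidate[i] += random.choice((-1, 1))  # a small "move"
        if not cheap_classifier(features(solution, candidate)):
            continue                            # filtered out: skip the expensive evaluation
        c = exact_cost(candidate)               # survivors get the standard evaluation
        if c < cost:
            solution, cost = candidate, c
    return solution, cost

print(local_search([5, -3, 7, 2]))
```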

### Online Fall Detection using Recurrent Neural Networks

Daniele De Martini ,

Nicola Blago ,

Tullio Facchinetti ,

Marco Piastra

Comments: 6 pages, ICRA 2018

**Subjects**

:

Computers and Society (cs.CY)

; Learning (cs.LG); Machine Learning (stat.ML)

Unintentional falls can cause severe injuries and even death, especially if

no immediate assistance is given. The aim of Fall Detection Systems (FDSs) is

to detect an occurring fall. This information can be used to trigger the

necessary assistance in case of injury. This can be done by using either

ambient-based sensors, e.g. cameras, or wearable devices. The aim of this work

is to study the technical aspects of FDSs based on wearable devices and

artificial intelligence techniques, in particular Deep Learning (DL), to

implement an effective algorithm for on-line fall detection. The proposed

classifier is based on a Recurrent Neural Network (RNN) model with underlying

Long Short-Term Memory (LSTM) blocks. The method is tested on the publicly

available SisFall dataset, with extended annotation, and compared with the

results obtained by the SisFall authors.

### DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction

Ruiming Tang ,

Yunming Ye ,

Zhenguo Li ,

Xiuqiang He ,

Zhenhua Dong

Comments: 14 pages. arXiv admin note: text overlap with arXiv:1703.04247

**Subjects**

:

Information Retrieval (cs.IR)

; Learning (cs.LG); Machine Learning (stat.ML)

Learning sophisticated feature interactions behind user behaviors is critical

in maximizing CTR for recommender systems. Despite great progress, existing

methods have a strong bias towards low- or high-order interactions, or rely on

expertise feature engineering. In this paper, we show that it is possible to

derive an end-to-end learning model that emphasizes both low- and high-order

feature interactions. The proposed framework, DeepFM, combines the power of

factorization machines for recommendation and deep learning for feature

learning in a new neural network architecture. Compared to the latest Wide &

Deep model from Google, DeepFM has a shared raw feature input to both its

“wide” and “deep” components, with no need of feature engineering besides raw

features. DeepFM, as a general learning framework, can incorporate various

network architectures in its deep component. In this paper, we study two

instances of DeepFM where its “deep” component is DNN and PNN respectively,

which we denote as DeepFM-D and DeepFM-P. Comprehensive experiments are

conducted to demonstrate the effectiveness of DeepFM-D and DeepFM-P over the

existing models for CTR prediction, on both benchmark data and commercial data.

We conduct online A/B test in Huawei App Market, which reveals that DeepFM-D

leads to more than 10% improvement of click-through rate in the production

environment, compared to a well-engineered LR model. We also covered related

practice in deploying our framework in Huawei App Market.
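
A Python sketch of the factorization-machine second-order interaction term computed over shared feature embeddings, which is the "wide" side of a DeepFM-style model; it uses the standard efficient FM identity. Shapes and data are toy values, and the deep component is omitted.

```python
import numpy as np

# Sketch of the FM second-order term over shared embeddings (the "wide" side
# of a DeepFM-style model). Toy shapes and random data; deep part omitted.

rng = np.random.default_rng(0)
n_features, k = 1000, 8
V = rng.normal(scale=0.01, size=(n_features, k))   # shared embedding table

def fm_second_order(active_idx):
    """0.5 * sum_f [ (sum_i v_if)^2 - sum_i v_if^2 ] over active one-hot features."""
    emb = V[active_idx]                            # (num_active, k)
    sum_sq = emb.sum(axis=0) ** 2
    sq_sum = (emb ** 2).sum(axis=0)
    return 0.5 * (sum_sq - sq_sum).sum()

sample = [3, 57, 912]                              # indices of active categorical features
print("FM interaction logit:", fm_second_order(sample))
```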

### Learning Contracting Vector Fields For Stable Imitation Learning

Vikas Sindhwani , Stephen Tu , Mohi Khansari **Subjects** : Robotics (cs.RO) ; Learning (cs.LG); Machine Learning (stat.ML)

We propose a new non-parametric framework for learning incrementally stable

dynamical systems x’ = f(x) from a set of sampled trajectories. We construct a

rich family of smooth vector fields induced by certain classes of matrix-valued

kernels, whose equilibria are placed exactly at a desired set of locations and

whose local contraction and curvature properties at various points can be

explicitly controlled using convex optimization. With curl-free kernels, our

framework may also be viewed as a mechanism to learn potential fields and

gradient flows. We develop large-scale techniques using randomized kernel

approximations in this context. We demonstrate our approach, called contracting

vector fields (CVF), on imitation learning tasks involving complex

point-to-point human handwriting motions.

### The unreasonable effectiveness of the forget gate

Joan Lasenby

Comments: 15 pages, 5 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

; Learning (cs.LG); Machine Learning (stat.ML)

Given the success of the gated recurrent unit, a natural question is whether

all the gates of the long short-term memory (LSTM) network are necessary.

Previous research has shown that the forget gate is one of the most important

gates in the LSTM. Here we show that a forget-gate-only version of the LSTM
with chrono-initialized biases not only provides computational savings but

outperforms the standard LSTM on multiple benchmark datasets and competes with

some of the best contemporary models. Our proposed network, the JANET, achieves

accuracies of 99% and 92.5% on the MNIST and pMNIST datasets, outperforming the

standard LSTM which yields accuracies of 98.5% and 91%.
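
A NumPy sketch of a forget-gate-only recurrent cell with a chrono-initialized forget bias, following the description in the abstract. The weight shapes and the exact update below are a plausible reading of such a cell, not necessarily the paper's precise equations.

```python
import numpy as np

# Sketch of a forget-gate-only recurrent cell with chrono-initialized forget
# bias. A plausible reading of the abstract, not the paper's exact equations.

rng = np.random.default_rng(0)
n_in, n_hid, T_max = 16, 32, 100

Uf, Wf = rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid))
Uc, Wc = rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid))
bf = np.log(rng.uniform(1, T_max - 1, size=n_hid))   # chrono initialization of the forget bias
bc = np.zeros(n_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, h):
    f = sigmoid(Uf @ x + Wf @ h + bf)                 # forget gate (the only gate)
    c = np.tanh(Uc @ x + Wc @ h + bc)                 # candidate update
    return f * h + (1.0 - f) * c                      # new hidden state

h = np.zeros(n_hid)
for x in rng.standard_normal((T_max, n_in)):          # run over a toy input sequence
    h = step(x, h)
print("final hidden state norm:", np.linalg.norm(h))
```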

### Fast, Parameter free Outlier Identification for Robust PCA

Sheetal Kalyani

Comments: 13 pages. Submitted to IEEE JSTSP Special Issue on Data Science: Robust Subspace Learning and Tracking: Theory, Algorithms, and Applications

**Subjects**

:

Machine Learning (stat.ML)

; Learning (cs.LG)

Robust PCA, the problem of PCA in the presence of outliers, has been
extensively investigated in the last few years. Here we focus on Robust PCA in
the column sparse outlier model. The existing methods for the column sparse
outlier model assume either knowledge of the dimension of the lower-dimensional
subspace or of the fraction of outliers in the system. However, in many

applications knowledge of these parameters is not available. Motivated by this

we propose a parameter free outlier identification method for robust PCA which

a) does not require the knowledge of outlier fraction, b) does not require the

knowledge of the dimension of the underlying subspace, c) is computationally

simple and fast. Further, analytical guarantees are derived for outlier

identification and the performance of the algorithm is compared with the

existing state of the art methods.

### Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries

Wutao Wei , Bowei Xi , Murat Kantarcioglu **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG)

Nowadays more and more data are gathered for detecting and preventing cyber

attacks. In cyber security applications, data analytics techniques have to deal

with active adversaries that try to deceive the data analytics models and avoid

being detected. The existence of such adversarial behavior motivates the

development of robust and resilient adversarial learning techniques for various

tasks. Most of the previous work focused on adversarial classification

techniques, which assumed the existence of a reasonably large amount of

carefully labeled data instances. However, in practice, labeling the data

instances often requires costly and time-consuming human expertise and becomes

a significant bottleneck. Meanwhile, a large number of unlabeled instances can

also be used to understand the adversaries’ behavior. To address the above

mentioned challenges, in this paper, we develop a novel grid based adversarial

clustering algorithm. Our adversarial clustering algorithm is able to identify

the core normal regions, and to draw defensive walls around the centers of the

normal objects utilizing game theoretic ideas. Our algorithm also identifies

sub-clusters of attack objects, the overlapping areas within clusters, and

outliers which may be potential anomalies.

### Understanding Community Structure in Layered Neural Networks

Chihiro Watanabe , Kaoru Hiramatsu , Kunio Kashino **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG)

A layered neural network is now one of the most common choices for the

prediction of high-dimensional practical data sets, where the relationship

between input and output data is complex and cannot be represented well by

simple conventional models. Its effectiveness has been shown in various tasks;
however, the lack of interpretability of the trained result of a layered neural
network has limited its application area.

In our previous studies, we proposed methods for extracting a simplified

global structure of a trained layered neural network by classifying the units

into communities according to their connection patterns with adjacent layers.

These methods provided us with knowledge about the strength of the relationship

between communities from the existence of bundled connections, which are

determined by threshold processing of the connection ratio between pairs of

communities.

However, it has been difficult to understand the role of each community

quantitatively by observing the modular structure. We could only know to which

sets of the input and output dimensions each community was mainly connected, by

tracing the bundled connections from the community to the input and output

layers. Another problem is that the finally obtained modular structure is

changed greatly depending on the setting of the threshold hyperparameter used

for determining bundled connections.

In this paper, we propose a new method for interpreting quantitatively the

role of each community in inference, by defining the effect of each input

dimension on a community, and the effect of a community on each output

dimension. We show experimentally that our proposed method can reveal the role

of each part of a layered neural network by applying the neural networks to

three types of data sets, extracting communities from the trained network, and

applying the proposed method to the community structure.

### RIPEx: Extracting malicious IP addresses from security forums using cross-forum learning

Evangelos E. Papalexakis ,

Michalis Faloutsos

Comments: 12 pages, Accepted in n 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2018

**Subjects**

:

Information Retrieval (cs.IR)

; Learning (cs.LG)

Is it possible to extract malicious IP addresses reported in security forums

in an automatic way? This is the question at the heart of our work. We focus on

security forums, where security professionals and hackers share knowledge and

information, and often report misbehaving IP addresses. So far, there have only

been a few efforts to extract information from such security forums. We propose

RIPEx, a systematic approach to identify and label IP addresses in security

forums by utilizing a cross-forum learning method. In more detail, the

challenge is twofold: (a) identifying IP addresses from other numerical

entities, such as software version numbers, and (b) classifying the IP address

as benign or malicious. We propose an integrated solution that tackles both

these problems. A novelty of our approach is that it does not require training

data for each new forum. Our approach does knowledge transfer across forums: we

use a classifier from our source forums to identify seed information for

training a classifier on the target forum. We evaluate our method using data

collected from five security forums with a total of 31K users and 542K posts.

First, RIPEx can distinguish IP addresses from other numeric expressions with 95%

precision and above 93% recall on average. Second, RIPEx identifies malicious

IP addresses with an average precision of 88% and over 78% recall, using our

cross-forum learning. Our work is a first step towards harnessing the wealth of

useful information that can be found in security forums.
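
A Python sketch of the cross-forum seeding loop described above: a classifier trained on labeled source forums labels the most confident target-forum examples, which then seed a classifier trained on the target forum. The features, thresholds, and data below are random stand-ins, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of cross-forum seeding: a source-forum classifier provides confident
# seed labels on the target forum, which train the target-forum classifier.
# Features, thresholds, and data are illustrative stand-ins.

rng = np.random.default_rng(0)
X_src = rng.standard_normal((500, 10))                                   # labeled source forums
y_src = (X_src[:, 0] + 0.3 * rng.standard_normal(500) > 0).astype(int)
X_tgt = rng.standard_normal((300, 10))                                   # unlabeled target forum

source_clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)

proba = source_clf.predict_proba(X_tgt)[:, 1]
confident = (proba > 0.9) | (proba < 0.1)           # keep only high-confidence seed labels
X_seed, y_seed = X_tgt[confident], (proba[confident] > 0.5).astype(int)

target_clf = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
print("seed examples used on the target forum:", int(confident.sum()))
```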

### Multimodal Unsupervised Image-to-Image Translation

Ming-Yu Liu ,

Serge Belongie ,

Jan Kautz

Comments: Code: this https URL

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Learning (cs.LG); Machine Learning (stat.ML)

Unsupervised image-to-image translation is an important and challenging

problem in computer vision. Given an image in the source domain, the goal is to

learn the conditional distribution of corresponding images in the target

domain, without seeing any pairs of corresponding images. While this

conditional distribution is inherently multimodal, existing approaches make an

overly simplified assumption, modeling it as a deterministic one-to-one

mapping. As a result, they fail to generate diverse outputs from a given source

domain image. To address this limitation, we propose a Multimodal Unsupervised

Image-to-image Translation (MUNIT) framework. We assume that the image

representation can be decomposed into a content code that is domain-invariant,

and a style code that captures domain-specific properties. To translate an

image to another domain, we recombine its content code with a random style code

sampled from the style space of the target domain. We analyze the proposed

framework and establish several theoretical results. Extensive experiments with

comparisons to the state-of-the-art approaches further demonstrate the

advantage of the proposed framework. Moreover, our framework allows users to

control the style of translation outputs by providing an example style image.

Code and pretrained models are available at this https URL

### Network-based protein structural classification

Arash Rahnama , Khalique Newaz , Panos J. Antsaklis , Tijana Milenkovic **Subjects** : Molecular Networks (q-bio.MN) ; Learning (cs.LG); Machine Learning (stat.ML)

Experimental determination of protein function is resource-consuming. As an

alternative, computational prediction of protein function has received

attention. In this context, protein structural classification (PSC) can help,

by allowing for determining structural classes of currently unclassified

proteins based on their features, and then relying on the fact that proteins

with similar structures have similar functions. Existing PSC approaches rely on

sequence-based or direct (“raw”) 3-dimensional (3D) structure-based protein

features. Instead, we first model 3D structures as protein structure networks

(PSNs). Then, we use (“processed”) network-based features for PSC. We are the

first ones to do so. We propose the use of graphlets, state-of-the-art features

in many domains of network science, in the task of PSC. Moreover, because

graphlets can deal only with unweighted PSNs, and because accounting for edge

weights when constructing PSNs could improve PSC accuracy, we also propose a

deep learning framework that automatically learns network features from the

weighted PSNs. When evaluated on a large set of 9,509 CATH and 11,451 SCOP

protein domains, our proposed approaches are superior to existing PSC

approaches in terms of both accuracy and running time.

### Efficient Model Identification for Tensegrity Locomotion

Shaojun Zhu , David Surovik , Kostas E. Bekris , Abdeslam Boularias **Subjects** : Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Learning (cs.LG); Systems and Control (cs.SY)

This paper aims to identify in a practical manner unknown physical

parameters, such as mechanical models of actuated robot links, which are

critical in dynamical robotic tasks. Key features include the use of an

off-the-shelf physics engine and the Bayesian optimization framework. The task

being considered is locomotion with a high-dimensional, compliant Tensegrity

robot. A key insight, in this case, is the need to project the model

identification challenge into an appropriate lower dimensional space for

efficiency. Comparisons with alternatives indicate that the proposed method can

identify the parameters more accurately within the given time budget, which

also results in more precise locomotion control.

## Information Theory

### Shifted Coded Slotted ALOHA

Takayuki Nozaki

Comments: 5 pages, 7 figures, submitted to ISITA 2018

**Subjects**

:

Information Theory (cs.IT)

The random access scheme is a fundamental scenario in which users transmit

through a shared channel and cannot coordinate with each other. In recent years,
successive interference cancellation (SIC) was introduced into the random
access scheme. With SIC, it is possible to decode transmitted packets from
collided packets. The coded slotted ALOHA (CSA) is a random access scheme
using SIC. The CSA encodes each packet using a local code prior to
transmission. It is known that the CSA achieves excellent throughput. On the
other hand, it has been reported in coding theory that time shifts improve the
decoding performance of packet-oriented erasure correcting codes. In this

paper, we propose a random access scheme which applies the time shift to the

CSA in order to achieve better throughput. Numerical examples show that our

proposed random access scheme achieves better throughput and packet loss rate

than the CSA.

### Erasure Correcting Codes by Using Shift Operation and Exclusive OR

Takayuki Nozaki

Comments: 6 pages, 1 figure, 3 tables, submitted to ISITA 2018

**Subjects**

:

Information Theory (cs.IT)

This paper proposes an erasure correcting code and its systematic form for

the distributed storage system.

The proposed codes are encoded by exclusive OR and bit-level shift operation.

By the shift operation, the encoded packets are slightly longer than the

source packets.

This paper evaluates the extra length of encoded packets, called overhead,

and shows that the proposed codes have smaller overheads than the zigzag

decodable code, which is an existing code using exclusive OR and bit-level

shift operation.
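
A toy Python sketch of shift-and-XOR encoding: two source packets are combined into parities using bit-level shifts and exclusive OR, and the shifted parity is slightly longer than the sources, which is the overhead the abstract evaluates. The shift amounts and structure below are illustrative and do not reproduce the proposed code or the zigzag decodable code.

```python
# Toy sketch of shift-and-XOR encoding. The shift schedule is illustrative,
# not the paper's construction; note the slightly longer shifted parity.

def shift_xor(packets, shifts):
    """XOR the packets after left-shifting packet i by shifts[i] bits."""
    length = max(len(p) * 8 + s for p, s in zip(packets, shifts))   # parity length in bits
    acc = 0
    for p, s in zip(packets, shifts):
        acc ^= int.from_bytes(p, "big") << s
    return acc.to_bytes((length + 7) // 8, "big")

s1, s2 = b"\xde\xad\xbe\xef", b"\x12\x34\x56\x78"
p1 = shift_xor([s1, s2], [0, 0])     # plain XOR parity (same length as the sources)
p2 = shift_xor([s1, s2], [0, 3])     # shifted parity: 3 extra bits of overhead
print(len(s1), "byte sources ->", len(p1), "and", len(p2), "byte parities")
```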

### Non-binary Code Correcting Single b-Burst of Insertions or Deletions

Takayuki Nozaki

Comments: 5 pages, submitted to ISITA 2018

**Subjects**

:

Information Theory (cs.IT)

This paper constructs a non-binary code correcting a single \(b\)-burst of

insertions or deletions. This paper also proposes a decoding algorithm of this

code and evaluates a lower bound of the cardinality of this code. Moreover, we

evaluate an asymptotic upper bound on the cardinality of codes which can

correct a single burst of insertions or deletions.

### Cooperative Strategies for {UAV}-Enabled Small Cell Networks Sharing Unlicensed Spectrum

Sung Hoon Lim ,

Sang-Woon Jeon ,

Seungjae Baek

Comments: 26 pages, 10 figures

**Subjects**

:

Information Theory (cs.IT)

In this paper, we study an aerial drone base station (DBS) assisted cellular

network that consists of a single ground macro base station (MBS), multiple

DBSs, and multiple ground terminals (GT). We assume that the MBS transmits to

the DBSs and the GTs in the licensed band while the DBSs use a separate

unlicensed band (e.g. Wi-Fi) to transmit to the GTs. For the utilization of the

DBSs, we propose a cooperative decode–forward (DF) protocol in which multiple

DBSs assist the terminals simultaneously while maintaining a predetermined

interference level on the coexisting unlicensed band users. For our network

setup, we formulate a joint optimization problem for minimizing the aggregate

gap between the target rates and the throughputs of terminals by optimizing

over the 3D positions of the DBSs and the resources (power, time, bandwidth) of

the network. To solve the optimization problem, we propose an efficient nested

structured algorithm based on particle swarm optimization and convex

optimization methods. Extensive numerical evaluations of the proposed algorithm
are performed considering various aspects to demonstrate the performance of our
algorithm and the gain from utilizing DBSs.

### 5G Wireless Network Slicing for eMBB, URLLC, and mMTC: A Communication-Theoretic View

Kasper F. Trillingsgaard ,

Osvaldo Simeone ,

Giuseppe Durisi

Comments: Submitted to IEEE

**Subjects**

:

Networking and Internet Architecture (cs.NI)

; Information Theory (cs.IT)

The grand objective of 5G wireless technology is to support services with

vastly heterogeneous requirements. Network slicing, in which each service

operates within an exclusive slice of allocated resources, is seen as a way to

cope with this heterogeneity. However, the shared nature of the wireless

channel allows non-orthogonal slicing, where services use overlapping slices of

resources at the cost of interference. This paper investigates the performance

of orthogonal and non-orthogonal slicing of radio resources for the

provisioning of the three generic services of 5G: enhanced mobile broadband

(eMBB), massive machine-type communications (mMTC), and ultra-reliable

low-latency communications (URLLC). We consider uplink communications from a

set of eMBB, mMTC and URLLC devices to a common base station. A

communication-theoretic model is proposed that accounts for the heterogeneous

requirements and characteristics of the three services. For non-orthogonal

slicing, different decoding architectures are considered, such as puncturing

and successive interference cancellation. The concept of reliability diversity

is introduced here as a design principle that takes advantage of the vastly

different reliability requirements across the services. This study reveals that

non-orthogonal slicing can lead, in some regimes, to significant gains in terms

of performance trade-offs among the three generic services compared to

orthogonal slicing.

### Connectivity in Random Annulus Graphs and the Geometric Block Model

Sainyam Galhotra , Arya Mazumdar , Soumyabrata Pal , Barna Saha **Subjects** : Discrete Mathematics (cs.DM) ; Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Learning (cs.LG)

Random geometric graphs are the simplest, and perhaps the earliest possible random graph model of spatial networks, introduced by Gilbert in 1961. In the most basic setting, a random geometric graph \(G(n,r)\) has \(n\) vertices. Each vertex of the graph is assigned a real number in \([0,1]\) randomly and uniformly. There is an edge between two vertices if the corresponding two random numbers differ by at most \(r\) (to mitigate the boundary effect, let us consider the Lee distance here, \(d_L(u,v) = \min\{|u-v|, 1-|u-v|\}\)). It is well-known that the connectivity threshold regime for random geometric graphs is at \(r \approx \frac{\log n}{n}\). In particular, if \(r = \frac{a\log n}{n}\), then a random geometric graph is connected with high probability if and only if \(a > 1\). Consider \(G(n,\frac{(1+\epsilon)\log n}{n})\) for any \(\epsilon > 0\) to satisfy the connectivity requirement and delete half of its edges which have distance at most \(\frac{\log n}{2n}\). It is natural to believe that the resultant graph will be disconnected. Surprisingly, we show that the graph still remains connected!

Formally, generalizing random geometric graphs, we define a random annulus graph \(G(n, [r_1, r_2])\), \(r_1 < r_2\), with \(n\) vertices. Each vertex of the graph is assigned a real number in \([0,1]\) randomly and uniformly as before. There is an edge between two vertices if the Lee distance between the corresponding two random numbers is between \(r_1\) and \(r_2\), \(0 < r_1 < r_2\). Let us assume \(r_1 = \frac{b \log n}{n}\) and \(r_2 = \frac{a \log n}{n}\), \(0 < b < a\). We show that this graph is connected with high probability if and only if \(a - b > \frac{1}{2}\) and \(a > 1\). That is, \(G(n, [0, \frac{0.99\log n}{n}])\) is not connected but \(G(n, [\frac{0.50 \log n}{n}, \frac{(1+\epsilon)\log n}{n}])\) is.

This result is then used to give improved lower and upper bounds on the recovery threshold of the geometric block model.
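
The model is easy to simulate: sample \(n\) points uniformly on \([0,1)\), connect pairs whose Lee distance falls in \([r_1, r_2]\), and test connectivity. The sketch below does this for the two parameter choices quoted above; at finite \(n\) the simulation only approximates the asymptotic with-high-probability statements.

```python
# Sample a random annulus graph on the unit circle (Lee distance) and check
# connectivity with a breadth-first search.
import math
import random
from collections import deque

def lee(u, v):
    d = abs(u - v)
    return min(d, 1 - d)

def random_annulus_graph(n, r1, r2, seed=0):
    random.seed(seed)
    pts = [random.random() for _ in range(n)]
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if r1 <= lee(pts[i], pts[j]) <= r2:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def is_connected(adj):
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(adj)

n = 2000
r = math.log(n) / n
print("G(n, [0, 0.99 log n / n]) connected:",
      is_connected(random_annulus_graph(n, 0.0, 0.99 * r)))
print("G(n, [0.50 log n / n, 1.1 log n / n]) connected:",
      is_connected(random_annulus_graph(n, 0.5 * r, 1.1 * r)))
```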

### On the Minimal Overcompleteness Allowing Universal Sparse Representation

Rotem Mulayoff , Tomer Michaeli **Subjects** : Signal Processing (eess.SP) ; Information Theory (cs.IT)

Sparse representation over redundant dictionaries constitutes a good model

for many classes of signals (e.g., patches of natural images, segments of

speech signals, etc.). However, despite its popularity, very little is known

about the representation capacity of this model. In this paper, we study how

redundant a dictionary must be so as to allow any vector to admit a sparse

approximation with a prescribed sparsity and a prescribed level of accuracy. We

address this problem both in a worst-case setting and in an average-case one.

For each scenario we derive lower and upper bounds on the minimal required

overcompleteness. Our bounds have simple closed-form expressions that make it easy to deduce the asymptotic behavior in large dimensions. In particular, we

find that the required overcompleteness grows exponentially with the sparsity

level and polynomially with the allowed representation error. This implies that

universal sparse representation is practical only at moderate sparsity levels,

but can be achieved at relatively high accuracy. As a side effect of our

analysis, we obtain a tight lower bound on the regularized incomplete beta

function, which may be interesting in its own right. We illustrate the validity

of our results through numerical simulations, which support our findings.
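
For readers unfamiliar with the quantity mentioned above, the regularized incomplete beta function is \(I_x(a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt\). The snippet below evaluates it with SciPy and cross-checks against direct numerical integration of the definition; the arguments are arbitrary, and the paper's actual lower bound is not reproduced here.

```python
# Evaluate the regularized incomplete beta function I_x(a, b) two ways.
from scipy.special import betainc, beta
from scipy.integrate import quad

def regularized_incomplete_beta(x, a, b):
    integral, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0.0, x)
    return integral / beta(a, b)

a, b, x = 3.0, 5.0, 0.4
print("scipy betainc:      ", betainc(a, b, x))
print("direct integration: ", regularized_incomplete_beta(x, a, b))
```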

### Robust 1-Bit Compressed Sensing via Hinge Loss Minimization

Martin Genzel , Alexander Stollenwerk **Subjects** : Statistics Theory (math.ST) ; Information Theory (cs.IT)

This work theoretically studies the problem of estimating a structured

high-dimensional signal \(x_0 \in \mathbb{R}^n\) from noisy \(1\)-bit Gaussian

measurements. Our recovery approach is based on a simple convex program which

uses the hinge loss function as data fidelity term. While such a risk

minimization strategy is very natural to learn binary output models, such as in

classification, its capacity to estimate a specific signal vector is largely

unexplored. A major difficulty is that the hinge loss is just piecewise linear,

so that its “curvature energy” is concentrated in a single point. This is

substantially different from other popular loss functions considered in signal

estimation, e.g., the square or logistic loss, which are at least locally

strongly convex. It is therefore somewhat unexpected that we can still prove

very similar types of recovery guarantees for the hinge loss estimator, even in

the presence of strong noise. More specifically, our non-asymptotic error

bounds show that stable and robust reconstruction of \(x_0\) can be achieved with the optimal oversampling rate \(O(m^{-1/2})\) in terms of the number of measurements \(m\). Moreover, we permit a wide class of structural assumptions on the ground truth signal, in the sense that \(x_0\) can belong to an arbitrary bounded convex set \(K \subset \mathbb{R}^n\). The proofs of our main results

rely on some recent advances in statistical learning theory due to Mendelson.

In particular, we invoke an adapted version of Mendelson’s small ball method

that allows us to establish a quadratic lower bound on the error of the first

order Taylor approximation of the empirical hinge loss function.
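
A minimal sketch of the estimator being analyzed, under simplifying assumptions: generate noisy one-bit Gaussian measurements \(y_i = \mathrm{sign}(\langle a_i, x_0\rangle + \text{noise})\) and minimize the empirical hinge loss over a convex set by projected subgradient descent. Here the constraint set is the unit \(\ell_2\) ball purely for simplicity (the paper allows an arbitrary bounded convex set \(K\)), and the problem sizes, noise level, and step sizes are illustrative.

```python
# Hinge-loss-based recovery from noisy 1-bit Gaussian measurements via
# projected subgradient descent over the unit l2 ball.
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 200, 1500, 5

# sparse ground truth on the unit sphere
x0 = np.zeros(n)
x0[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x0 /= np.linalg.norm(x0)

A = rng.standard_normal((m, n))
y = np.sign(A @ x0 + 0.1 * rng.standard_normal(m))   # noisy 1-bit measurements

def hinge_subgradient(x):
    margins = y * (A @ x)
    active = margins < 1.0                    # hinge loss max(0, 1 - y <a, x>)
    return -(A[active] * y[active, None]).sum(axis=0) / m

x = np.zeros(n)
for t in range(1, 2001):
    x -= (1.0 / np.sqrt(t)) * hinge_subgradient(x)
    norm = np.linalg.norm(x)
    if norm > 1.0:                            # project back onto the unit l2 ball
        x /= norm

# 1-bit measurements lose the scale of x0, so compare directions only
x_hat = x / (np.linalg.norm(x) + 1e-12)
print("direction error ||x_hat - x0||:", np.linalg.norm(x_hat - x0))
```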

### On Deep Learning-based Massive MIMO Indoor User Localization

Sebastian Dörner ,

Sebastian Cammerer ,

Stephan ten Brink

Comments: submitted to SPAWC 2018

**Subjects**

:

Signal Processing (eess.SP)

; Information Theory (cs.IT)

We examine the usability of deep neural networks for multiple-input

multiple-output (MIMO) user positioning solely based on the orthogonal

frequency division multiplex (OFDM) complex channel coefficients. In contrast

to other indoor positioning systems (IPSs), the proposed method does not

require any additional piloting overhead or any other changes in the

communications system itself as it is deployed on top of an existing OFDM MIMO

system. Supported by actual measurements, we are mainly interested in the more

challenging non-line of sight (NLoS) scenario. However, gradient descent

optimization is known to require a large number of data points for training, i.e., the required database would be too large when compared to conventional methods. Thus, we propose a two-step training procedure, with training on simulated line of sight (LoS) data in the first step, and fine-tuning on measured NLoS positions in the second step. This turns out to reduce the number of measured training positions required and thus reduces the effort for data acquisition.
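
The two-step procedure can be sketched as ordinary pre-training followed by fine-tuning. The network, feature layout, learning rates, and the random stand-in data below are assumptions made for illustration; they are not the authors' architecture or measurement data.

```python
# Pre-train a position-regression network on (simulated) LoS channel features,
# then fine-tune it on a much smaller set of measured NLoS samples.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_FEAT = 2 * 64        # real + imag parts of 64 OFDM channel coefficients (hypothetical)

model = nn.Sequential(
    nn.Linear(N_FEAT, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 2),                 # 2-D user position estimate
)
loss_fn = nn.MSELoss()

def train(features, positions, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), positions)
        loss.backward()
        opt.step()
    return loss.item()

# Step 1: plenty of cheap simulated LoS samples (random placeholders here).
los_x, los_y = torch.randn(10000, N_FEAT), torch.rand(10000, 2)
print("LoS pre-training loss:", train(los_x, los_y, lr=1e-3, epochs=50))

# Step 2: fine-tune on a small measured NLoS set with a lower learning rate.
nlos_x, nlos_y = torch.randn(500, N_FEAT), torch.rand(500, 2)
print("NLoS fine-tuning loss:", train(nlos_x, nlos_y, lr=1e-4, epochs=100))
```

Keeping the fine-tuning learning rate lower than the pre-training rate is the usual way to preserve what was learned on the simulated data while adapting to the measured NLoS distribution.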

