arXiv Paper Daily: Mon, 20 Jan 2020

June 26, 2010

Computer Vision and Pattern Recognition

Unsupervised Learning of Camera Pose with Compositional Re-estimation

Seyed Shahabeddin Nabavi ,
Mehrdad Hosseinzadeh ,
Ramin Fahimi ,
Yang Wang

Comments: Accepted to WACV 2020

Subjects

Computer Vision and Pattern Recognition (cs.CV)

We consider the problem of unsupervised camera pose estimation. Given an

input video sequence, our goal is to estimate the camera pose (i.e. the camera

motion) between consecutive frames. Traditionally, this problem is tackled by

placing strict constraints on the transformation vector or by incorporating

optical flow through a complex pipeline. We propose an alternative approach

that utilizes a compositional re-estimation process for camera pose estimation.

Given an input, we first estimate a depth map. Our method then iteratively

estimates the camera motion based on the estimated depth map. Our approach

significantly improves the predicted camera motion both quantitatively and

visually. Furthermore, the re-estimation resolves the problem of

out-of-boundaries pixels in a novel and simple way. Another advantage of our

approach is that it is adaptable to other camera pose estimation approaches.

Experimental analysis on KITTI benchmark dataset demonstrates that our method

outperforms existing state-of-the-art approaches in unsupervised camera

ego-motion estimation.
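
The iterative re-estimation described above can be pictured as composing a sequence of small pose corrections into one camera motion. The sketch below is only an illustration under stated assumptions: the abstract does not say how each pass is parameterized, so the 4x4 homogeneous transforms, the yaw-only rotation, and the helper names (`make_pose`, `compose`, `compose_increments`) are all hypothetical and not taken from the paper.

```python
import math


def identity():
    """Return the 4x4 identity transform as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def compose(a, b):
    """Return the matrix product a @ b, i.e. apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw, tx, ty, tz):
    """Hypothetical pose: rotation about the z axis plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def compose_increments(increments):
    """Fold a list of incremental transforms into one accumulated pose.

    Each element stands in for the output of one re-estimation pass
    (an assumption; in the paper a network would predict it).
    """
    pose = identity()
    for delta in increments:
        pose = compose(delta, pose)
    return pose


if __name__ == "__main__":
    # Two passes: a coarse translation, then a small corrective rotation.
    passes = [make_pose(0.0, 1.0, 0.0, 0.0), make_pose(math.pi / 2, 0.0, 0.0, 0.0)]
    for row in compose_increments(passes):
        print(row)
```

Composing by left-multiplication means each later correction is expressed in the frame produced by the earlier ones; other conventions are equally possible.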

Combining PRNU and noiseprint for robust and efficient device source identification

Davide Cozzolino, Francesco Marra, Diego Gragnaniello, Giovanni Poggi, Luisa Verdoliva

Subjects

Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

PRNU-based image processing is a key asset in digital multimedia forensics.

It allows for reliable device identification and effective detection and

localization of image forgeries, in very general conditions. However,

performance impairs significantly in challenging conditions involving low

quality and quantity of data. These include working on compressed and cropped

images, or estimating the camera PRNU pattern based on only a few images. To

boost the performance of PRNU-based analyses in such conditions we propose to

leverage the image noiseprint, a recently proposed camera-model fingerprint

that has proved effective for several forensic tasks. Numerical experiments on

datasets widely used for source identification prove that the proposed method

ensures a significant performance improvement in a wide range of challenging

situations.

TailorGAN: Making User-Defined Fashion Designs

Lele Chen ,
Justin Tian ,
Guo Li ,
Cheng-Haw Wu ,
Erh-Kan King ,
Kuan-Ting Chen ,
Shao-Hang Hsieh

Comments: fashion

Journal-ref: 2020 Winter Conference on Applications of Computer Vision

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Attribute editing has become an important and emerging topic of computer

vision. In this paper, we consider a task: given a reference garment image A

and another image B with target attribute (collar/sleeve), generate a

photo-realistic image which combines the texture from reference A and the new

attribute from reference B. The highly convoluted attributes and the lack of

paired data are the main challenges to the task. To overcome those limitations,

we propose a novel self-supervised model to synthesize garment images with

disentangled attributes (e.g., collar and sleeves) without paired data. Our

method consists of a reconstruction learning step and an adversarial learning

step. The model learns texture and location information through reconstruction

learning. And, the model’s capability is generalized to achieve

single-attribute manipulation by adversarial learning. Meanwhile, we compose a

new dataset, named GarmentSet, with annotation of landmarks of collars and

sleeves on clean garment images. Extensive experiments on this dataset and

real-world samples demonstrate that our method can synthesize much better

results than the state-of-the-art methods in both quantitative and qualitative

comparisons.

Subjective Annotation for a Frame Interpolation Benchmark using Artifact Amplification

Hui Men ,
Vlad Hosu ,
Hanhe Lin ,
Andrés Bruhn ,
Dietmar Saupe

Comments: arXiv admin note: text overlap with arXiv:1901.05362

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Current benchmarks for optical flow algorithms evaluate the estimation either

directly by comparing the predicted flow fields with the ground truth or

indirectly by using the predicted flow fields for frame interpolation and then

comparing the interpolated frames with the actual frames. In the latter case,

objective quality measures such as the mean squared error are typically

employed. However, it is well known that for image quality assessment, the

actual quality experienced by the user cannot be fully deduced from such simple

measures. Hence, we conducted a subjective quality assessment crowdsourcing

study for the interpolated frames provided by one of the optical flow

benchmarks, the Middlebury benchmark. It contains interpolated frames from 155

methods applied to each of 8 contents. We collected forced choice paired

comparisons between interpolated images and corresponding ground truth. To

increase the sensitivity of observers when judging minute differences in paired

comparisons we introduced a new method to the field of full-reference quality

assessment, called artifact amplification. From the crowdsourcing data we

reconstructed absolute quality scale values according to Thurstone’s model. As

a result, we obtained a re-ranking of the 155 participating algorithms w.r.t.

the visual quality of the interpolated frames. This re-ranking not only shows

the necessity of visual quality assessment as another evaluation metric for

optical flow and frame interpolation benchmarks, the results also provide the

ground truth for designing novel image quality assessment (IQA) methods

dedicated to perceptual quality of interpolated images. As a first step, we

proposed such a new full-reference method, called WAE-IQA. By weighing the

local differences between an interpolated image and its ground truth, WAE-IQA

performed slightly better than the currently best FR-IQA approach from the

literature.
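
To make the scale reconstruction step concrete, here is a minimal sketch of turning paired-comparison counts into scale values. It assumes Thurstone's Case V (equal, uncorrelated discriminal dispersions) with the classical row-mean solution; the abstract only says "Thurstone's model", so the choice of case, the clipping constant `eps`, and the function name are assumptions rather than the authors' procedure.

```python
from statistics import NormalDist


def thurstone_case_v(wins, eps=1e-3):
    """Estimate scale values from a paired-comparison count matrix.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns one score per item; only differences between scores matter.
    """
    n = len(wins)
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    scores = []
    for i in range(n):
        total_z = 0.0
        for j in range(n):
            if i == j:
                continue
            trials = wins[i][j] + wins[j][i]
            # With no comparisons for a pair, fall back to "no preference".
            p = wins[i][j] / trials if trials else 0.5
            # Clip unanimous outcomes so the quantile stays finite.
            p = min(max(p, eps), 1.0 - eps)
            total_z += inv_cdf(p)
        scores.append(total_z / n)
    return scores


if __name__ == "__main__":
    # Made-up counts for three hypothetical interpolation methods.
    counts = [[0, 80, 95],
              [20, 0, 80],
              [5, 20, 0]]
    print(thurstone_case_v(counts))
```

A higher score means the item was preferred more often; the scores are centered near zero because each pairwise z-value appears once with each sign.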

GraphBGS: Background Subtraction via Recovery of Graph Signals

Jhony H. Giraldo, Thierry Bouwmans

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Graph-based algorithms have been successful in approaching the problems of

unsupervised and semi-supervised learning. Recently, the theory of graph signal

processing and semi-supervised learning have been combined leading to new

developments and insights in the field of machine learning. In this paper,

concepts of recovery of graph signals and semi-supervised learning are

introduced in the problem of background subtraction. We propose a new algorithm

named GraphBGS. This method uses a Mask R-CNN for instance segmentation;

temporal median filter for background initialization; motion, texture, color,

and structural features for representing the nodes of a graph; k-nearest

neighbors for the construction of the graph; and finally a semi-supervised

method inspired from the theory of recovery of graph signals to solve the

problem of background subtraction. The method is evaluated on the publicly

available change detection and scene background initialization databases.

Experimental results show that GraphBGS outperforms unsupervised background

subtraction algorithms in some challenges of the change detection dataset. And

most significantly, this method outperforms generative adversarial networks in

unseen videos in some sequences of the scene background initialization

database.
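
The last two stages of the pipeline above (a k-nearest-neighbor graph over node features, then semi-supervised recovery of a label signal) can be sketched as follows. This is a deliberately simplified stand-in: it uses plain harmonic label propagation on an unweighted graph, not the recovery method of the paper, and the feature vectors, function names, and the 0/1 background/foreground encoding are all assumptions made for illustration.

```python
def knn_graph(features, k):
    """Build a symmetric k-NN graph; returns a list of neighbor sets."""
    n = len(features)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distance from node i to every other node.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # keep the graph undirected
    return neighbors


def propagate(neighbors, labeled, iterations=200):
    """Fill in unlabeled nodes by repeated neighbor averaging.

    labeled maps node index -> known value (0.0 background, 1.0 foreground).
    Labeled nodes stay fixed; unlabeled nodes start at 0.5.
    """
    values = [labeled.get(i, 0.5) for i in range(len(neighbors))]
    for _ in range(iterations):
        for i, nbrs in enumerate(neighbors):
            if i not in labeled and nbrs:
                values[i] = sum(values[j] for j in nbrs) / len(nbrs)
    return values


if __name__ == "__main__":
    # Six hypothetical nodes with one-dimensional features in two clusters.
    feats = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
    graph = knn_graph(feats, k=2)
    recovered = propagate(graph, {0: 0.0, 5: 1.0})
    print([round(v) for v in recovered])
```

Thresholding the recovered values at 0.5 gives a foreground/background decision per node; a weighted graph or a different smoothness prior would change the result.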

Latency-Aware Differentiable Neural Architecture Search

Yuhui Xu ,
Lingxi Xie ,
Xiaopeng Zhang ,
Xin Chen ,
Bowen Shi ,
Qi Tian ,
Hongkai Xiong

Comments: 11 pages, 7 figures

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Differentiable neural architecture search methods have become popular in automated

machine learning, mainly due to their low search costs and flexibility in

designing the search space. However, these methods have difficulty

optimizing the network, so the searched network is often unfriendly to

hardware. This paper addresses this problem by adding a differentiable latency

loss term to the optimization, so that the search process can trade off between

accuracy and latency with a balancing coefficient. The core of latency

prediction is to encode each network architecture and feed it into a

multi-layer regressor, with the training data collected by randomly

sampling a number of architectures and evaluating them on the hardware. We

evaluate our approach on NVIDIA Tesla-P100 GPUs. With 100K sampled

architectures (requiring a few hours), the latency prediction module achieves

a relative error below 10%. Equipped with this module, the search

method can reduce the latency by 20% while preserving the accuracy. Our

approach can also be transplanted to a wide range of

hardware platforms with very little effort, or used to optimize other

non-differentiable factors such as power consumption.
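
Two quantities in this abstract are simple enough to write down: the combined objective with a balancing coefficient, and the relative error used to judge the latency predictor. The sketch below assumes a plain additive form, accuracy loss plus coefficient times predicted latency, and a mean relative error over sampled architectures; the abstract does not give the exact formulas, so both forms and all names here are illustrative assumptions.

```python
def total_loss(accuracy_loss, predicted_latency, balance):
    """Combine the task loss with a latency penalty.

    balance is the coefficient that trades accuracy against latency
    (assumed additive form, not confirmed by the abstract).
    """
    return accuracy_loss + balance * predicted_latency


def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all samples."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


if __name__ == "__main__":
    # Made-up latencies in milliseconds for two sampled architectures.
    print(mean_relative_error([11.0, 18.0], [10.0, 20.0]))
    print(total_loss(accuracy_loss=2.0, predicted_latency=10.0, balance=0.1))
```

A larger `balance` pushes the search toward faster architectures at the expense of accuracy; setting it to zero recovers the unconstrained search.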

BigEarthNet Deep Learning Models with A New Class-Nomenclature for Remote Sensing Image Understanding

Gencer Sumbul ,
Jian Kang ,
Tristan Kreuziger ,
Filipe Marcelino ,
Hugo Costa ,
Pedro Benevides ,
Mario Caetano ,
Begüm Demir

Comments: Submitted to IEEE Geoscience and Remote Sensing Magazine

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Success of deep neural networks in the framework of remote sensing (RS) image

analysis depends on the availability of a high number of annotated images.

BigEarthNet is a new large-scale Sentinel-2 benchmark archive that has been

recently introduced in RS to advance deep learning (DL) studies. Each image

patch in BigEarthNet is annotated with multi-labels provided by the CORINE Land

Cover (CLC) map of 2018 based on its thematically most detailed Level-3 class

nomenclature. BigEarthNet has enabled data-hungry DL algorithms to reach high

performance in the context of multi-label RS image retrieval and

classification. However, initial research demonstrates that some CLC classes

are difficult to describe accurately by considering only (single-date)

Sentinel-2 images. To further increase the effectiveness of BigEarthNet, in

this paper we introduce an alternative class-nomenclature to allow DL models

to better learn and describe the complex spatial and spectral information

content of the Sentinel-2 images. This is achieved by interpreting and

arranging the CLC Level-3 nomenclature based on the properties of Sentinel-2

images in a new nomenclature of 19 classes. Then, the new class-nomenclature of

BigEarthNet is used within state-of-the-art DL models (namely VGG model at the

depth of 16 and 19 layers [VGG16 and VGG19] and ResNet model at the depth of

50, 101 and 152 layers [ResNet50, ResNet101, ResNet152] as well as K-Branch CNN

model) in the context of multi-label classification. Experimental results show

that the models trained from scratch on BigEarthNet outperform those

pre-trained on ImageNet, especially in relation to some complex classes

including agriculture and other vegetated and natural environments. All DL

models are made publicly available, offering an important resource to guide

future progress on content based image retrieval and scene classification

problems in RS.

Efficient Facial Feature Learning with Wide Ensemble-based Convolutional Neural Networks

Henrique Siqueira ,
Sven Magg ,
Stefan Wermter

Comments: Accepted at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 1-1, New York, USA

Subjects

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Ensemble methods, traditionally built with independently trained

de-correlated models, have proven to be efficient methods for reducing the

remaining residual generalization error, which results in robust and accurate

methods for real-world applications. In the context of deep learning, however,

training an ensemble of deep networks is costly and generates high redundancy

which is inefficient. In this paper, we present experiments on Ensembles with

Shared Representations (ESRs) based on convolutional networks to demonstrate,

quantitatively and qualitatively, their data processing efficiency and

scalability to large-scale datasets of facial expressions. We show that

redundancy and computational load can be dramatically reduced by varying the

branching level of the ESR without loss of diversity and generalization power,

which are both important for ensemble performance. Experiments on large-scale

datasets suggest that ESRs reduce the remaining residual generalization error

on the AffectNet and FER+ datasets, reach human-level performance, and

outperform state-of-the-art methods on facial expression recognition in the

wild using emotion and affect concepts.

Vision Meets Drones: Past, Present and Future

Pengfei Zhu ,
Longyin Wen ,
Dawei Du ,
Xiao Bian ,
Qinghua Hu ,
Haibin Ling

Comments: arXiv admin note: text overlap with arXiv:1804.07437

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Drones, or general UAVs, equipped with cameras have been rapidly deployed in a

wide range of applications, including agriculture, aerial photography, fast

delivery, and surveillance. Consequently, automatic understanding of visual

data collected from drones is in high demand, bringing computer vision

and drones ever closer together. To promote and track the developments of

object detection and tracking algorithms, we have organized two challenge

workshops in conjunction with European Conference on Computer Vision (ECCV)

2018, and IEEE International Conference on Computer Vision (ICCV) 2019,

attracting more than 100 teams around the world. We provide a large-scale drone

captured dataset, VisDrone, which includes four tracks, i.e., (1) image object

detection, (2) video object detection, (3) single object tracking, and (4)

multi-object tracking. This paper first presents a thorough review of object

detection and tracking datasets and benchmarks, and discusses the challenges of

collecting large-scale drone-based object detection and tracking datasets with

fully manual annotations. After that, we describe our VisDrone dataset, which

is captured over various urban/suburban areas of 14 different cities across

China from North to South. Being the largest such dataset ever published,

VisDrone enables extensive evaluation and investigation of visual analysis

algorithms on the drone platform. We provide a detailed analysis of the current

state of the field of large-scale object detection and tracking on drones, and

conclude the challenge as well as propose future directions and improvements.

We expect the benchmark to greatly boost the research and development in video

analysis on drone platforms. All the datasets and experimental results can be

downloaded from the website: this https URL .

Predicting the Physical Dynamics of Unseen 3D Objects

Davis Rempe ,
Srinath Sridhar ,
He Wang ,
Leonidas J. Guibas

Comments: In Proceedings of Winter Conference on Applications of Computer Vision (WACV) 2020. arXiv admin note: text overlap with arXiv:1901.00466

Subjects

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Machines that can predict the effect of physical interactions on the dynamics

of previously unseen object instances are important for creating better robots

and interactive virtual worlds. In this work, we focus on predicting the

dynamics of 3D objects on a plane that have just been subjected to an impulsive

force. In particular, we predict the changes in state – 3D position, rotation,

velocities, and stability. Different from previous work, our approach can

generalize dynamics predictions to object shapes and initial conditions that

were unseen during training. Our method takes the 3D object’s shape as a point

cloud and its initial linear and angular velocities as input. We extract shape

features and use a recurrent neural network to predict the full change in state

at each time step. Our model can support training with data from either a physics

engine or the real world. Experiments show that we can accurately predict the

changes in state for unseen object geometries and initial conditions.

Review: deep learning on 3D point clouds

Saifullahi Aminu Bello, Shangshu Yu, Cheng Wang

Subjects

Computer Vision and Pattern Recognition (cs.CV)

A point cloud is a set of points defined in a 3D metric space. Point clouds have become

one of the most significant data formats for 3D representation. They are gaining

popularity as a result of the increased availability of acquisition

devices, such as LiDAR, as well as increased application in areas such as

robotics, autonomous driving, and augmented and virtual reality. Deep learning is

now the most powerful tool for data processing in computer vision, becoming the

most preferred technique for tasks such as classification, segmentation, and

detection. While deep learning techniques are mainly applied to data with a

structured grid, a point cloud, on the other hand, is unstructured. This lack of

structure makes the direct use of deep learning for point cloud processing

very challenging. Earlier approaches overcome this challenge by

preprocessing the point cloud into a structured grid format at the cost of

increased computation or loss of depth information. Recently, however,

many state-of-the-art deep learning techniques that operate directly on point

clouds are being developed. This paper surveys recent

state-of-the-art deep learning techniques that mainly focus on point cloud

data. We first briefly discuss the major challenges faced when using deep

learning directly on point clouds, and we also briefly discuss earlier approaches

that overcome these challenges by preprocessing the point cloud into a

structured grid. We then review the various state-of-the-art deep

learning approaches that directly process point clouds in their unstructured form.

We introduce the popular 3D point cloud benchmark datasets, and we

further discuss the application of deep learning in popular 3D vision tasks

including classification, segmentation, and detection.

Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network

Jungkyu Lee ,
Taeryun Won ,
Kiho Hong

Comments: 11 pages, 3 figures, 16 tables

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Recent studies in image classification have demonstrated a variety of

techniques for improving the performance of Convolutional Neural Networks

(CNNs). However, attempts to combine existing techniques to create a practical

model are still uncommon. In this study, we carry out extensive experiments to

validate that carefully assembling these techniques and applying them to a

basic CNN model in combination can improve the accuracy and robustness of the

model while minimizing the loss of throughput. For example, our proposed

ResNet-50 shows an improvement in top-1 accuracy from 76.3% to 82.78%, and an

mCE improvement from 76.0% to 48.9%, on the ImageNet ILSVRC2012 validation set.

With these improvements, inference throughput only decreases from 536 to 312.

The resulting model significantly outperforms state-of-the-art models with

similar accuracy in terms of mCE and inference throughput. To verify the

performance improvement in transfer learning, fine-grained classification and

image retrieval tasks were tested on several open datasets and showed that the

improvement to backbone network performance boosted transfer learning

performance significantly. Our approach achieved 1st place in the iFood

Competition Fine-Grained Visual Recognition at CVPR 2019, and the source code

and trained models are available at this https URL

SieveNet: A Unified Framework for Robust Image-Based Virtual Try-On

Surgan Jandial ,
Ayush Chopra ,
Kumar Ayush ,
Mayur Hemani ,
Abhijeet Kumar ,
Balaji Krishnamurthy

Comments: Accepted at IEEE WACV 2020

Subjects

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Image-based virtual try-on for fashion has gained considerable attention

recently. The task requires trying on a clothing item on a target model image.

An efficient framework for this is composed of two stages: (1) warping

(transforming) the try-on cloth to align with the pose and shape of the target

model, and (2) a texture transfer module to seamlessly integrate the warped

try-on cloth onto the target model image. Existing methods suffer from

artifacts and distortions in their try-on output. In this work, we present

SieveNet, a framework for robust image-based virtual try-on. Firstly, we

introduce a multi-stage coarse-to-fine warping network to better model

fine-grained intricacies (while transforming the try-on cloth) and train it

with a novel perceptual geometric matching loss. Next, we introduce a try-on

cloth conditioned segmentation mask prior to improve the texture transfer

network. Finally, we also introduce a dueling triplet loss strategy for

training the texture translation network which further improves the quality of

the generated try-on results. We present extensive qualitative and quantitative

evaluations of each component of the proposed pipeline and show significant

performance improvements against the current state-of-the-art method.

Two-Phase Object-Based Deep Learning for Multi-temporal SAR Image Change Detection

Xinzheng Zhang, Guo Liu, Ce Zhang, Peter M Atkinson, Xiaoheng Tan, Xin Jian, Xichuan Zhou, Yongming Li

Subjects

Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Change detection is one of the fundamental applications of synthetic aperture

radar (SAR) images. However, speckle noise present in SAR images has a strongly

negative effect on change detection. In this research, a novel two-phase

object-based deep learning approach is proposed for multi-temporal SAR image

change detection. Compared with traditional methods, the proposed approach

brings two main innovations. One is to classify all pixels into three

categories rather than two categories: unchanged pixels, changed pixels caused

by strong speckle (false changes), and changed pixels formed by real terrain

variation (real changes). The other is to group neighboring pixels into

superpixel objects so as to exploit local

spatial context. Two phases are designed in the methodology: 1) Generate

objects based on the simple linear iterative clustering (SLIC) algorithm, and

discriminate these objects into changed and unchanged classes using fuzzy

c-means (FCM) clustering and a deep PCANet. The prediction of this phase is the

set of changed and unchanged superpixels. 2) Deep learning on the pixel sets

over the changed superpixels only, obtained in the first phase, to discriminate

real changes from false changes. SLIC is employed again to obtain new

superpixels in the second phase. Low-rank and sparse decomposition is applied

to these new superpixels to suppress speckle noise significantly. A further

clustering step is applied to these new superpixels via FCM. A new PCANet is

then trained to classify two kinds of changed superpixels to achieve the final

change maps. Numerical experiments demonstrate that, compared with benchmark

methods, the proposed approach can distinguish real changes from false changes

effectively with significantly reduced false alarm rates, and achieve up to

99.71% change detection accuracy using multi-temporal SAR imagery.

Registration made easy — standalone orthopedic navigation with HoloLens

Florentin Liebmann ,
Simon Roner ,
Marco von Atzigen ,
Florian Wanivenhaus ,
Caroline Neuhaus ,
José Spirig ,
Davide Scaramuzza ,
Reto Sutter ,
Jess Snedeker ,
Mazda Farshad ,
Philipp Fürnstahl

Comments: 6 pages, 5 figures, accepted at CVPR 2019 workshop on Computer Vision Applications for Mixed Reality Headsets ( this https URL )

Subjects

Computer Vision and Pattern Recognition (cs.CV)

In surgical navigation, finding correspondence between preoperative plan and

intraoperative anatomy, the so-called registration task, is imperative. One

promising approach is to intraoperatively digitize anatomy and register it with

the preoperative plan. State-of-the-art commercial navigation systems implement

such approaches for pedicle screw placement in spinal fusion surgery. Although

these systems improve surgical accuracy, they are not the gold standard in clinical

practice. Besides economic reasons, this may be due to their difficult

integration into clinical workflows and unintuitive navigation feedback.

Augmented Reality has the potential to overcome these limitations.

Consequently, we propose a surgical navigation approach comprising

intraoperative surface digitization for registration and intuitive holographic

navigation for pedicle screw placement that runs entirely on the Microsoft

HoloLens. Preliminary results from phantom experiments suggest that the method

may meet clinical accuracy requirements.

FPCR-Net: Feature Pyramidal Correlation and Residual Reconstruction for Semi-supervised Optical Flow Estimation

Xiaolin Song ,
Jingyu Yang ,
Cuiling Lan ,
Wenjun Zeng

Comments: 8 pages, 8 figures, 6 tables

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Optical flow estimation is an important yet challenging problem in the field

of video analytics. The features of different semantic levels/layers of a

convolutional neural network can provide information of different granularity.

To exploit such flexible and comprehensive information, we propose a

semi-supervised Feature Pyramidal Correlation and Residual Reconstruction

Network (FPCR-Net) for optical flow estimation from frame pairs. It consists of

two main modules: pyramid correlation mapping and residual reconstruction. The

pyramid correlation mapping module takes advantage of the multi-scale

correlations of global/local patches by aggregating features of different

scales to form a multi-level cost volume. The residual reconstruction module

aims to reconstruct the sub-band high-frequency residuals of finer optical flow

in each stage. Based on the pyramid correlation mapping, we further propose a

correlation-warping-normalization (CWN) module to efficiently exploit the

correlation dependency. Experimental results show that the proposed scheme

achieves the state-of-the-art performance, with improvement by 0.80, 1.15 and

0.10 in terms of average end-point error (AEE) against competing baseline

methods – FlowNet2, LiteFlowNet and PWC-Net on the Final pass of Sintel

dataset, respectively.
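
The reported metric, average end-point error (AEE), is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch follows; representing a flow field as a flat list of `(u, v)` tuples is an assumption made to keep the example self-contained, and real evaluations operate on dense image-sized arrays.

```python
import math


def average_endpoint_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding flow vectors.

    Both arguments are equal-length sequences of (u, v) displacements.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for (u_p, v_p), (u_g, v_g) in zip(predicted, ground_truth):
        total += math.hypot(u_p - u_g, v_p - v_g)
    return total / len(predicted)


if __name__ == "__main__":
    # Two made-up pixels: one off by a (3, 4) displacement, one exact.
    print(average_endpoint_error([(3.0, 4.0), (0.0, 0.0)],
                                 [(0.0, 0.0), (0.0, 0.0)]))
```

Lower is better, so the improvements quoted above correspond to reductions in this value.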

Interpreting Galaxy Deblender GAN from the Discriminator’s Perspective

Heyi Li ,
Yuewei Lin ,
Klaus Mueller ,
Wei Xu

Comments: 5 pages, 4 figures

Subjects

Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Generative adversarial networks (GANs) are well known for their unsupervised

learning capabilities. A recent success in the field of astronomy is deblending

two overlapping galaxy images via a branched GAN model. However, it remains a

significant challenge to comprehend how the network works, which is

particularly difficult for non-expert users. This research focuses on behaviors

of one of the network’s major components, the Discriminator, which plays a

vital role but is often overlooked. Specifically, we enhance the Layer-wise

Relevance Propagation (LRP) scheme to generate a heatmap-based visualization.

We call this technique Polarized-LRP and it consists of two parts, i.e., positive

contribution heatmaps for ground truth images and negative contribution

heatmaps for generated images. Using the Galaxy Zoo dataset we demonstrate that

our method clearly reveals attention areas of the Discriminator when

differentiating generated galaxy images from ground truth images. To connect

the Discriminator’s impact on the Generator, we visualize the gradual changes

of the Generator across the training process. An interesting result we have

achieved there is the detection of a problematic data augmentation procedure

that would otherwise have remained hidden. We find that our proposed method serves

as a useful visual analytical tool for a deeper understanding of GAN models.

Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition

Wenxuan Wang ,
Yanwei Fu ,
Qiang Sun ,
Tao Chen ,
Chenjie Cao ,
Ziqi Zheng ,
Guoqiang Xu ,
Han Qiu ,
Yu-Gang Jiang ,
Xiangyang Xue

Comments: 17 pages, 18 figures

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Affective computing and cognitive theory are widely used in modern

human-computer interaction scenarios. Human faces, as the most prominent and

easily accessible features, have attracted great attention from researchers.

Since humans have rich emotions and developed musculature, there exist a lot of

fine-grained expressions in real-world applications. However, it is extremely

time-consuming to collect and annotate a large number of facial images,

which may even require psychologists to categorize them correctly. To the best

of our knowledge, the existing expression datasets are only limited to several

basic facial expressions, which are not sufficient to support our ambitions in

developing successful human-computer interaction systems. To this end, a novel

Fine-grained Facial Expression Database – F2ED is contributed in this paper,

and it includes more than 200k images with 54 facial expressions from 119

persons. Considering that uneven data distribution and a lack of

samples are common in real-world scenarios, we further evaluate several tasks of

few-shot expression learning using our F2ED, which aim to recognize

facial expressions given only a few training instances. These tasks mimic the human

ability to learn robust and general representations from few examples. To

address such few-shot tasks, we propose a unified task-driven framework,

the Compositional Generative Adversarial Network (Comp-GAN), which learns to synthesize

facial images and thus augments the instances of few-shot expression classes.

Extensive experiments are conducted on F2ED and existing facial expression

datasets, i.e., JAFFE and FER2013, to validate the efficacy of our F2ED in

pre-training facial expression recognition network and the effectiveness of our

proposed approach Comp-GAN to improve the performance of few-shot recognition

tasks.

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

Subjects

Computer Vision and Pattern Recognition (cs.CV)

Generating video descriptions automatically is a challenging task that

involves a complex interplay between spatio-temporal visual features and

language models. Given that videos consist of spatial (frame-level) features

and their temporal evolutions, an effective captioning model should be able to

attend to these different cues selectively. To this end, we propose a

Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned

on the language state, hierarchically combines spatial and temporal attention

to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which

first attends to regions that have temporal evolution, then temporally pools

the features from these regions; and (ii) a temporo-spatial (TS) sub-model,

which first decides a single frame to attend to, then applies spatial attention

within that frame. We propose a novel LSTM-based temporal ranking function,

which we call ranked attention, for the ST model to capture action dynamics.

Our entire framework is trained end-to-end. We provide experiments on two

benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy

between the ST and TS modules, outperforming recent state-of-the-art methods.

Automatic Discovery of Political Meme Genres with Diverse Appearances

William Theisen ,
Joel Brogan ,
Pamela Bilo Thomas ,
Daniel Moreira ,
Pascal Phoa ,
Tim Weninger ,
Walter Scheirer

Comments: 16 pages, 10 figures

Subjects

Computer Vision and Pattern Recognition (cs.CV)

; Social and Information Networks (cs.SI)

Forms of human communication are not static — we expect some evolution in

the way information is conveyed over time because of advances in technology.

One example of this phenomenon is the image-based meme, which has emerged as a

dominant form of political messaging in the past decade. While originally used

to spread jokes on social media, memes are now having an outsized impact on

public perception of world events. A significant challenge in automatic meme

analysis has been the development of a strategy to match memes from within a

single genre when the appearances of the images vary. Such variation is

especially common in memes exhibiting mimicry, for example when voters perform

a common hand gesture to signal their support for a candidate. In this paper we

introduce a scalable automated visual recognition pipeline for discovering

political meme genres of diverse appearance. This pipeline can ingest meme

images from a social network, apply computer vision-based techniques to extract

local features and index new images into a database, and then organize the

memes into related genres. To validate this approach, we perform a large case

study on the 2019 Indonesian Presidential Election using a new dataset of over

two million images collected from Twitter and Instagram. Results show that this

approach can discover new meme genres with visually diverse images that share

common stylistic elements, paving the way forward for further work in semantic

analysis and content attribution.

On-Device Information Extraction from Screenshots in form of tags

Sumit Kumar , Gopi Ramena , Manoj Goyal , Debi Mohanty , Ankur Agarwal , Benu Changmai , Sukumar Moharana Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Computation and Language (cs.CL); Information Retrieval (cs.IR)

We propose a method to make mobile screenshots easily searchable. In this

paper, we present the workflow in which we: 1) preprocessed a collection of

screenshots, 2) identified script present in image, 3) extracted unstructured

text from images, 4) identified language of the extracted text, 5) extracted

keywords from the text, 6) identified tags based on image features, 7) expanded

tag set by identifying related keywords, 8) inserted image tags with relevant

images after ranking and indexed them to make it searchable on device. We made

the pipeline which supports multiple languages and executed it on-device, which

addressed privacy concerns. We developed novel architectures for components in

the pipeline, optimized performance and memory for on-device computation. We

observed from experimentation that the solution developed can reduce overall

user effort and improve end user experience while searching, whose results are

published.

Tracking of Micro Unmanned Aerial Vehicles: A Comparative Study

Fatih Gökçe

Comments: In proceedings of the International Conference on Artificial Intelligence and Applied Mathematics in Engineering (ICAIAME 2019), 13 pages, 9 Figures

Journal-ref: F. Gökçe. Tracking of Micro Unmanned Aerial Vehicles: A

Comparative Study. In Proceedings of the International Conference on

Artificial Intelligence and Applied Mathematics in Engineering, Antalya,

Turkey, 20-22 Apr. 2019, pp.374-386

Subjects

Computer Vision and Pattern Recognition (cs.CV)

; Robotics (cs.RO)

Micro unmanned aerial vehicles (mUAVs) have become very common in recent years. As

a result of their widespread usage, when they are flown by hobbyists illegally,

crucial risks are imposed and such mUAVs need to be sensed by security systems.

Furthermore, the sensing of mUAVs is also essential for swarm robotics

research where the individuals in a flock of robots require systems to sense

and localize each other for coordinated operation. In order to obtain such

systems, there are studies to detect mUAVs utilizing different sensing mediums,

such as vision, infrared and sound signals, and small-scale radars. However,

there are still challenges that await handling in this field, such as

integrating tracking approaches to the vision-based detection systems to

enhance accuracy and computational complexity. For this reason, in this study,

we combine various tracking approaches to a vision-based mUAV detection system

available in the literature, in order to evaluate different tracking approaches

in terms of accuracy as well as investigate the effect of such integration

on the computational cost.

Increasing the robustness of DNNs against image corruptions by playing the Game of Noise

Evgenia Rusak , Lukas Schott , Roland Zimmermann , Julian Bitterwolf , Oliver Bringmann , Matthias Bethge , Wieland Brendel Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG); Machine Learning (stat.ML)

The human visual system is remarkably robust against a wide range of

naturally occurring variations and corruptions like rain or snow. In contrast,

the performance of modern image recognition models strongly degrades when

evaluated on previously unseen corruptions. Here, we demonstrate that a simple

but properly tuned training with additive Gaussian and Speckle noise

generalizes surprisingly well to unseen corruptions, easily reaching the

previous state of the art on the corruption benchmark ImageNet-C (with

ResNet50) and on MNIST-C. We build on top of these strong baseline results and

show that an adversarial training of the recognition model against uncorrelated

worst-case noise distributions leads to an additional increase in performance.

This regularization can be combined with previously proposed defense methods

for further improvement.
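
As a rough illustration of the two noise types named in this abstract, the sketch below applies additive Gaussian noise and speckle (multiplicative) noise to a flat list of pixel intensities in [0, 1]. The function names, the noise levels, and the form x + x*n for speckle noise are assumptions made for this example; it is not the authors' training code.

```python
import random


def clip01(value):
    """Clamp a pixel intensity to the valid range [0, 1]."""
    return min(1.0, max(0.0, value))


def add_gaussian_noise(pixels, sigma, rng):
    """Additive noise: each pixel x becomes x + n, with n ~ N(0, sigma^2)."""
    return [clip01(p + rng.gauss(0.0, sigma)) for p in pixels]


def add_speckle_noise(pixels, sigma, rng):
    """Multiplicative noise: each pixel x becomes x + x * n, with n ~ N(0, sigma^2)."""
    return [clip01(p + p * rng.gauss(0.0, sigma)) for p in pixels]


# Hypothetical 5-pixel "image" and a seeded generator for repeatability.
rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy_gaussian = add_gaussian_noise(image, 0.1, rng)
noisy_speckle = add_speckle_noise(image, 0.1, rng)
```

Because speckle noise scales with the pixel value, a fully black pixel is left unchanged, whereas additive Gaussian noise perturbs every pixel regardless of its intensity.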

Modality-Balanced Models for Visual Dialogue

Hyounghun Kim ,
Hao Tan ,
Mohit Bansal

Comments: AAAI 2020 (11 pages)

Subjects

Computation and Language (cs.CL)

; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The Visual Dialog task requires a model to exploit both image and

conversational context information to generate the next response to the

dialogue. However, via manual analysis, we find that a large number of

conversational questions can be answered by only looking at the image without

any access to the context history, while others still need the conversation

context to predict the correct answers. We demonstrate that due to this reason,

previous joint-modality (history and image) models over-rely on and are more

prone to memorizing the dialogue history (e.g., by extracting certain keywords

or patterns in the context information), whereas image-only models are more

generalizable (because they cannot memorize or extract keywords from history)

and perform substantially better at the primary normalized discounted

cumulative gain (NDCG) task metric which allows multiple correct answers.

Hence, this observation encourages us to explicitly maintain two models, i.e.,

an image-only model and an image-history joint model, and combine their

complementary abilities for a more balanced multimodal model. We present

multiple methods for this integration of the two models, via ensemble and

consensus dropout fusion with shared parameters. Empirically, our models

achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and

high balance across metrics), and substantially outperform the winner of the

Visual Dialog challenge 2018 on most metrics.
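
For readers unfamiliar with the NDCG metric mentioned in this abstract, here is a minimal sketch of the textbook definition: candidates are ranked by model score, the relevance of each is discounted by the logarithm of its rank, and the total is divided by the value of an ideal ranking. The function names and the optional cutoff `k` are assumptions for this example, and the exact protocol of the Visual Dialog challenge may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, relevances, k=None):
    """NDCG of the ranking induced by `scores` against graded `relevances`."""
    # Rank candidate indices from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    # With no relevant candidates at all, the metric is defined here as 0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: three candidate answers with graded relevance.
relevance = [1.0, 0.5, 0.0]
perfect = ndcg([0.9, 0.5, 0.1], relevance)
reversed_order = ndcg([0.1, 0.5, 0.9], relevance)
```

Graded relevance is what lets the metric credit several acceptable answers rather than a single ground-truth one.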

Tethered Aerial Visual Assistance

Xuesu Xiao ,
Jan Dufek ,
Robin R. Murphy

Comments: Submitted to special issue of “Field and Service Robotics” of the Journal of Field Robotics (JFR). arXiv admin note: text overlap with arXiv:1904.00078

Subjects

Robotics (cs.RO)

; Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

In this paper, an autonomous tethered Unmanned Aerial Vehicle (UAV) is

developed into a visual assistant in a marsupial co-robots team, collaborating

with a tele-operated Unmanned Ground Vehicle (UGV) for robot operations in

unstructured or confined environments. These environments pose extreme

challenges to the remote tele-operator due to the lack of sufficient

situational awareness, mostly caused by the unstructuredness and confinement,

stationary and limited field-of-view and lack of depth perception from the

robot’s onboard cameras. To overcome these problems, a secondary tele-operated

robot is used in current practice, which acts as a visual assistant and provides

external viewpoints to overcome the perceptual limitations of the primary

robot’s onboard sensors. However, a second tele-operated robot requires extra

manpower and teamwork demand between primary and secondary operators. The

manually chosen viewpoints tend to be subjective and sub-optimal. Considering

these intricacies, we develop an autonomous tethered aerial visual assistant in

place of the secondary tele-operated robot and operator, to reduce human robot

ratio from 2:2 to 1:2. Using a fundamental viewpoint quality theory, a formal

risk reasoning framework, and a newly developed tethered motion suite, our

visual assistant is able to autonomously navigate to good-quality viewpoints in

a risk-aware manner through unstructured or confined spaces with a tether. The

developed marsupial co-robots team could improve tele-operation efficiency in

nuclear operations, bomb squad, disaster robots, and other domains with novel

tasks or highly occluded environments, by reducing manpower and teamwork

demand, and achieving better visual assistance quality with trustworthy

risk-aware motion.

DeepSUM++: Non-local Deep Neural Network for Super-Resolution of Unregistered Multitemporal Images

Andrea Bordone Molini ,
Diego Valsesia ,
Giulia Fracastoro ,
Enrico Magli

Comments: arXiv admin note: text overlap with arXiv:1907.06490

Subjects

Image and Video Processing (eess.IV)

; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Deep learning methods for super-resolution of a remote sensing scene from

multiple unregistered low-resolution images have recently gained attention

thanks to a challenge proposed by the European Space Agency. This paper

presents an evolution of the winner of the challenge, showing how incorporating

non-local information in a convolutional neural network makes it possible to exploit

self-similar patterns that provide enhanced regularization of the

super-resolution problem. Experiments on the dataset of the challenge show

improved performance over the state-of-the-art, which does not exploit

non-local information.

Detection Method Based on Automatic Visual Shape Clustering for Pin-Missing Defect in Transmission Lines

Zhenbing Zhao , Hongyu Qi , Yincheng Qi , Ke Zhang , Yongjie Zhai , Wenqing Zhao Subjects : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV)

Bolts are the most numerous fasteners in transmission lines and are prone to

losing their split pins. How to realize the automatic pin-missing defect

detection for bolts in transmission lines so as to achieve timely and efficient

trouble shooting is a difficult problem and the long-term research target of

power systems. In this paper, an automatic detection model called Automatic

Visual Shape Clustering Network (AVSCNet) for pin-missing defect is

constructed. Firstly, an unsupervised clustering method for the visual shapes

of bolts is proposed and applied to construct a defect detection model which

can learn the difference of visual shape. Next, three deep convolutional neural

network optimization methods are used in the model: the feature enhancement,

feature fusion and region feature extraction. The defect detection results are

obtained by applying the regression calculation and classification to the

regional features. In this paper, the object detection model of different

networks is used to test the dataset of pin-missing defect constructed by the

aerial images of transmission lines from multiple locations, and it is

evaluated by various indicators and is fully verified. The results show that

our method can achieve considerably satisfactory detection effect.

Sideways: Depth-Parallel Training of Video Models

Mateusz Malinowski , Grzegorz Swirszcz , Joao Carreira , Viorica Patraucean Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We propose Sideways, an approximate backpropagation scheme for training video

models. In standard backpropagation, the gradients and activations at every

computation step through the model are temporally synchronized. The forward

activations need to be stored until the backward pass is executed, preventing

inter-layer (depth) parallelization. However, can we leverage smooth, redundant

input streams such as videos to develop a more efficient training scheme? Here,

we explore an alternative to backpropagation; we overwrite network activations

whenever new ones, i.e., from new frames, become available. Such a more gradual

accumulation of information from both passes breaks the precise correspondence

between gradients and activations, leading to theoretically more noisy weight

updates. Counter-intuitively, we show that Sideways training of deep

convolutional video networks not only still converges, but can also potentially

exhibit better generalization compared to standard synchronized

backpropagation.

FedVision: An Online Visual Object Detection Platform Powered by Federated Learning

Yang Liu , Anbu Huang , Yun Luo , He Huang , Youzhi Liu , Yuanyuan Chen , Lican Feng , Tianjian Chen , Han Yu , Qiang Yang Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Visual object detection is a computer vision-based artificial intelligence

(AI) technique which has many practical applications (e.g., fire hazard

monitoring). However, due to privacy concerns and the high cost of transmitting

video data, it is highly challenging to build object detection models on

centrally stored large training datasets following the current approach.

Federated learning (FL) is a promising approach to resolve this challenge.

Nevertheless, there is currently no easy-to-use tool to enable computer

vision application developers who are not experts in federated learning to

conveniently leverage this technology and apply it in their systems. In this

paper, we report FedVision – a machine learning engineering platform to support

the development of federated learning powered computer vision applications. The

platform has been deployed through a collaboration between WeBank and Extreme

Vision to help customers develop computer vision-based safety monitoring

solutions in smart city applications. Over four months of usage, it has

achieved significant efficiency improvement and cost reduction while removing

the need to transmit sensitive data for three major corporate customers. To the

best of our knowledge, this is the first real application of FL in computer

vision-based tasks.

Spatiotemporal Camera-LiDAR Calibration: A Targetless and Structureless Approach

Chanoh Park ,
Peyman Moghadam ,
Soohwan Kim ,
Sridha Sridharan ,
Clinton Fookes

Comments: 8 pages, To appear, IEEE Robotics and Automation Letters 2020

Subjects

Robotics (cs.RO)

; Computer Vision and Pattern Recognition (cs.CV)

The demand for multimodal sensing systems for robotics is growing due to the

increase in robustness, reliability and accuracy offered by these systems.

These systems also need to be spatially and temporally co-registered to be

effective. In this paper, we propose a targetless and structureless

spatiotemporal camera-LiDAR calibration method. Our method combines a

closed-form solution with a modified structureless bundle adjustment where the

coarse-to-fine approach does not require an initial guess on the

spatiotemporal parameters. Also, as 3D features (structure) are calculated from

triangulation only, there is no need to have a calibration target or to match

2D features with the 3D point cloud which provides flexibility in the

calibration process and sensor configuration. We demonstrate the accuracy and

robustness of the proposed method through both simulation and real data

experiments using multiple sensor payload configurations mounted to hand-held,

aerial and legged robot systems. Also, qualitative results are given in the

form of a colorized point cloud visualization.

An adversarial learning framework for preserving users’ anonymity in face-based emotion recognition

Vansh Narula , Zhangyang (Atlas) Wang , Theodora Chaspari Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Image and video-capturing technologies have permeated our every-day life.

Such technologies can continuously monitor individuals’ expressions in

real-life settings, affording us new insights into their emotional states and

transitions, thus paving the way to novel well-being and healthcare

applications. Yet, due to the strong privacy concerns, the use of such

technologies is met with strong skepticism, since current face-based emotion

recognition systems relying on deep learning techniques tend to preserve

substantial information related to the identity of the user, apart from the

emotion-specific information. This paper proposes an adversarial learning

framework which relies on a convolutional neural network (CNN) architecture

trained through an iterative procedure for minimizing identity-specific

information and maximizing emotion-dependent information. The proposed approach

is evaluated through emotion classification and face identification metrics,

and is compared against two CNNs, one trained solely for emotion recognition

and the other trained solely for face identification. Experiments are performed

using the Yale Face Dataset and Japanese Female Facial Expression Database.

Results indicate that the proposed approach can learn a convolutional

transformation for preserving emotion recognition accuracy and degrading face

identity recognition, providing a foundation toward privacy-aware emotion

recognition technologies.

Code-Bridged Classifier (CBC): A Low or Negative Overhead Defense for Making a CNN Classifier Robust Against Adversarial Attacks

Farnaz Behnia ,
Ali Mirzaeian ,
Mohammad Sabokrou ,
Sai Manoj ,
Tinoosh Mohsenin ,
Khaled N. Khasawneh ,
Liang Zhao ,
Houman Homayoun ,
Avesta Sasan

Comments: 6 pages, Accepted and to appear in ISQED 2020

Subjects

Machine Learning (cs.LG)

; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

In this paper, we propose Code-Bridged Classifier (CBC), a framework for

making Convolutional Neural Networks (CNNs) robust against adversarial attacks

without increasing, or even while decreasing, the overall model’s computational

complexity. More specifically, we propose a stacked encoder-convolutional

model, in which the input image is first encoded by the encoder module of a

denoising auto-encoder, and then the resulting latent representation (without

being decoded) is fed to a reduced complexity CNN for image classification. We

illustrate that this network not only is more robust to adversarial examples

but also has a significantly lower computational complexity when compared to

the prior art defenses.

Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning

Paola Cascante-Bonilla , Fuwen Tan , Yanjun Qi , Vicente Ordonez Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Semi-supervised learning aims to take advantage of a large amount of

unlabeled data to improve the accuracy of a model that only has access to a

small number of labeled examples. We propose curriculum labeling, an approach

that exploits pseudo-labeling for propagating labels to unlabeled samples in an

iterative and self-paced fashion. This approach is surprisingly simple and

effective and surpasses or is comparable with the best methods proposed in the

recent literature across all the standard benchmarks for image classification.

Notably, we obtain 94.91% accuracy on CIFAR-10 using only 4,000 labeled

samples, and 88.56% top-5 accuracy on Imagenet-ILSVRC using 128,000 labeled

samples. In contrast to prior works, our approach shows improvements even in a

more realistic scenario that leverages out-of-distribution unlabeled data

samples.
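
One way to picture the iterative, self-paced pseudo-labeling described in this abstract is sketched below. In each round, only the most confident fraction of predictions on unlabeled samples is kept as pseudo-labels, and that fraction grows from round to round. The linear schedule, the function names, and the toy predictions are assumptions for this example; the abstract does not specify the selection rule, so this should not be read as the authors' procedure.

```python
import math


def select_pseudo_labels(predictions, keep_fraction):
    """Keep the most confident share of predictions.

    predictions: list of (label, confidence) pairs, one per unlabeled sample.
    Returns (sample_index, label) pairs sorted by sample index.
    """
    n_keep = math.ceil(keep_fraction * len(predictions))
    by_confidence = sorted(
        range(len(predictions)),
        key=lambda i: predictions[i][1],
        reverse=True,
    )
    return [(i, predictions[i][0]) for i in sorted(by_confidence[:n_keep])]


def curriculum_schedule(num_rounds):
    """Assumed linear schedule: the kept fraction grows until all samples are used."""
    return [(r + 1) / num_rounds for r in range(num_rounds)]


# Hypothetical model outputs on five unlabeled samples.
predictions = [("cat", 0.9), ("dog", 0.4), ("cat", 0.7), ("dog", 0.95), ("cat", 0.55)]
selected = select_pseudo_labels(predictions, 0.5)
schedule = curriculum_schedule(5)
```

In a full training loop, the model would be retrained on the labeled data plus the selected pseudo-labeled samples after each round, and the predictions would then be recomputed before the next selection.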

Artificial Intelligence

Fast Compliance Checking with General Vocabularies

P. A. Bonatti ,
L. Ioffredo ,
I. M. Petrova ,
L. Sauro

Comments: arXiv admin note: substantial text overlap with arXiv:2001.05390

Subjects

Artificial Intelligence (cs.AI)

We address the problem of complying with the GDPR while processing and

transferring personal data on the web. For this purpose we introduce an

extensible profile of OWL2 for representing data protection policies. With this

language, a company’s data usage policy can be checked for compliance with data

subjects’ consent and with a formalized fragment of the GDPR by means of

subsumption queries. The outer structure of the policies is restricted in order

to make compliance checking highly scalable, as required when processing

high-frequency data streams or large data volumes. However, the vocabularies

for specifying policy properties can be chosen rather freely from expressive

Horn fragments of OWL2. We exploit IBQ reasoning to integrate specialized

reasoners for the policy language and the vocabulary’s language. Our

experiments show that this approach significantly improves performance.

Visual Simplified Characters’ Emotion Emulator Implementing OCC Model

Ana Lilia Laureano-Cruces ,
Laura Hernández-Domínguez ,
Martha Mora-Torres ,
Juan-Manuel Torres-Moreno ,
Jaime Enrique Cabrera-López

Comments: 7 pages, 14 figures, 2 tables

Journal-ref: CGST Conference on Computer Science and Engineering, Istanbul,

Turkey, 19-21 December 2011

Subjects

Artificial Intelligence (cs.AI)

In this paper, we present a visual emulator of the emotions seen in

characters in stories. This system is based on a simplified view of the

cognitive structure of emotions proposed by Ortony, Clore and Collins (OCC

Model). The goal of this paper is to provide a visual platform that allows us

to observe changes in the characters’ different emotions, and the intricate

interrelationships between: 1) each character’s emotions, 2) their affective

relationships and actions, 3) the events that take place in the development of

a plot, and 4) the objects of desire that make up the emotional map of any

story. This tool was tested on stories with a contrasting variety of emotional

and affective environments: Othello, Twilight, and Harry Potter, behaving

sensibly and in keeping with the atmosphere in which the characters were

immersed.

A Critical Look at the Applicability of Markov Logic Networks for Music Signal Analysis

Johan Pauwels ,
György Fazekas ,
Mark B. Sandler

Comments: Accepted for presentation at the Ninth International Workshop on Statistical Relational AI (StarAI 2020) at the 34th AAAI Conference on Artificial Intelligence (AAAI) in New York, on February 7th 2020

Subjects

Artificial Intelligence (cs.AI)

; Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

In recent years, Markov logic networks (MLNs) have been proposed as a

potentially useful paradigm for music signal analysis. Because all hidden

Markov models can be reformulated as MLNs, the latter can provide an

all-encompassing framework that reuses and extends previous work in the field.

However, just because it is theoretically possible to reformulate previous work

as MLNs, does not mean that it is advantageous. In this paper, we analyse some

proposed examples of MLNs for musical analysis and consider their practical

disadvantages when compared to formulating the same musical dependence

relationships as (dynamic) Bayesian networks. We argue that a number of

practical hurdles such as the lack of support for sequences and for arbitrary

continuous probability distributions make MLNs less than ideal for the proposed

musical applications, both in terms of ease of formulation and computational

requirements due to their required inference algorithms. These conclusions are

not specific to music, but apply to other fields as well, especially when

sequential data with continuous observations is involved. Finally, we show that

the ideas underlying the proposed examples can be expressed perfectly well in

the more commonly used framework of (dynamic) Bayesian networks.

Plato Dialogue System: A Flexible Conversational AI Research Platform

Alexandros Papangelis , Mahdi Namazifar , Chandra Khatri , Yi-Chia Wang , Piero Molino , Gokhan Tur Subjects : Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

As the field of Spoken Dialogue Systems and Conversational AI grows, so does

the need for tools and environments that abstract away implementation details

in order to expedite the development process, lower the barrier of entry to the

field, and offer a common test-bed for new ideas. In this paper, we present

Plato, a flexible Conversational AI platform written in Python that supports

any kind of conversational agent architecture, from standard architectures to

architectures with jointly-trained components, single- or multi-party

interactions, and offline or online training of any conversational agent

component. Plato has been designed to be easy to understand and debug and is

agnostic to the underlying learning frameworks that train each component.