DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021
Object Detection with Deep
Convolutional Neural Networks in
Images with Various Lighting
Conditions and Limited Resolution
ROMAN LANDIN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Object Detection with Deep
Convolutional Neural Networks in
Images with Various Lighting
Conditions and Limited Resolution
ROMAN LANDIN
Master’s Programme, System Control and Robotics, 120 credits
Date: June 16, 2021
Supervisor: John Folkesson
Supervisor at Atea: Ali Leylani
Examiner: Danica Kragic
School of Electrical Engineering and Computer Science
Host company: Atea Sverige AB
Swedish title: Detektion av objekt med Convolutional Neural Networks
(CNN) i bilder med dåliga belysningförhållanden och låg upplösning
Abstract
Computer vision is a key component of any autonomous system. Real-world computer vision applications rely on proper and accurate detection and classification of objects. A detection algorithm that does not guarantee reasonable detection accuracy is not applicable in real-time scenarios where safety is the main objective. Two factors that impact detection accuracy are illumination conditions and image resolution. Both contribute to degradation of objects and lead to low classification and detection accuracy.
Recent development of Convolutional Neural Networks (CNNs) based
algorithms offers possibilities for low-light (LL) image enhancement and super
resolution (SR) image generation which makes it possible to combine such
models in order to improve image quality and increase detection accuracy.
This thesis evaluates different CNNs models for SR generation and LL
enhancement by comparing generated images against ground truth images.
To quantify the impact of the respective model on detection accuracy, a
detection procedure was evaluated on generated images. Experimental results on images selected from the NightOwls and Caltech Pedestrian datasets showed that super resolution image generation and low-light image enhancement improve detection accuracy by a substantial margin. Additionally, it was shown that a cascade of SR generation
and LL enhancement further boosts detection accuracy. However, the main
drawback of such cascades is related to an increased computational time
which limits possibilities for a range of real time applications.
Keywords
Object detection, Super Resolution image generation, Low-Light image enhancement,
Computer Vision
Sammanfattning
Datorseende är en nyckelkomponent i alla autonoma system. Applikationer
för datorseende i realtid är beroende av en korrekt detektering och
klassificering av objekt. En detekteringsalgoritm som inte kan garantera rimlig noggrannhet är inte tillämpningsbar i realtidsscenarier där huvudmålet är säkerhet. Faktorer som påverkar detekteringsnoggrannheten är belysningsförhållanden och bildupplösning. Dessa bidrar till degradering av objekt och leder till låg klassificerings- och detekteringsnoggrannhet.
Den senaste utvecklingen av Convolutional Neural Networks (CNN)-baserade algoritmer erbjuder möjligheter
för förbättring av bilder med dålig belysning och bildgenerering med superupplösning
vilket gör det möjligt att kombinera sådana modeller för att förbättra bildkvaliteten
och öka detekteringsnoggrannheten.
I denna uppsats utvärderas olika CNN-modeller för superupplösning
och förbättring av bilder med dålig belysning genom att jämföra genererade
bilder med det faktiska data. För att kvantifiera inverkan av respektive
modell på detektionsnoggrannhet utvärderades en detekteringsprocedur
på genererade bilder. Experimentella resultat utvärderades på bilder
utvalda från NightOwls och Caltech datauppsättningar för fotgängare och
visade att bildgenerering med superupplösning och bildförbättring i svagt
ljus förbättrar noggrannheten med en betydande marginal. Dessutom har
det bevisats att en kaskad av superupplösning-generering och förbättring
av bilder med dålig belysning ytterligare ökar noggrannheten. Den största
nackdelen med sådana kaskader är relaterad till en ökad beräkningstid som
begränsar möjligheterna för en rad realtidsapplikationer.
Nyckelord
Detektion av objekt, Bildgenerering med superupplösning, Förbättring
av bilder med dålig belysning, Datorseende
Acknowledgments
I would like to thank my supervisor at Atea Sverige AB, Ali Leylani who
mentored me and shared his experience despite the situation with the
pandemic. Ali Leylani guided me throughout this project and accurately answered my questions, which helped me gain a deeper understanding of the work.
I thank my supervisor at KTH, John Folkesson, for all the constructive advice and the cheerful atmosphere during our Zoom meetings.
I also want to thank my amazing wife Maria Karlsson Osipova and our
families who supported me throughout all these years at KTH.
Last but not least, I want to thank my friends "La Familia" and Victor Cardenas, who definitely made this journey easier to accomplish.
Stockholm, June 2021
Roman Landin
Contents
1 Introduction 1
1.1 Background ...................................................................................... 1
1.2 Problem ........................................................................................... 3
1.2.1 Original problem and definition .......................................... 3
1.3 Purpose ............................................................................................ 3
1.4 Goals ................................................................................................ 3
1.5 Research Methodology .................................................................... 4
1.6 Delimitations .................................................................................... 5
1.7 Structure of the thesis ..................................................................... 5
2 Theory and related work 7
2.1 Image Classification ......................................................................... 7
2.2 Object Detection .............................................................................. 9
2.2.1 CNNs based detectors ....................................................... 10
2.3 Resolution ...................................................................................... 11
2.4 Illumination .................................................................................... 13
2.5 Data................................................................................................ 14
2.6 Augmentation ................................................................................ 15
3 Method 19
3.1 Research process ........................................................................... 19
3.2 Selected CNNs models ................................................................... 20
3.2.1 D-DBPN ............................................................................. 20
3.2.2 MBLLEN ............................................................................. 22
3.2.3 YOLOv3 .............................................................................. 24
3.3 Selected augmentation methods................................................... 27
3.3.1 Bicubic downsampling ...................................................... 27
3.3.2 Gamma correction ............................................................ 27
3.3.3 Cropping ............................................................................ 28
3.3.4 Poisson noise injection ......................................................29
3.4 Selected data ..................................................................................29
3.5 Experimental design .......................................................................30
4 Evaluation 33
4.1 Data preparation ............................................................................33
4.2 Training and evaluation metrics .....................................................34
4.2.1 YOLOv3 ..............................................................................35
4.2.2 D-DBPN ..............................................................................36
4.2.3 MBLLEN .............................................................................36
5 Results and Analysis 39
5.1 SR generation .................................................................................39
5.2 LL enhancement .............................................................................41
5.3 Detection ........................................................................................43
5.4 Summary ........................................................................................46
6 Discussion 49
6.1 SR generation .................................................................................49
6.2 LL enhancement .............................................................................49
6.3 Detection ........................................................................................51
6.3.1 Normal vs low resolution ...................................................51
6.3.2 SR generation ....................................................................51
6.3.3 LL enhancement ................................................................51
6.3.4 Subsequent LL enhancement and SR generation ..............52
6.4 Ethics and sustainability .................................................................53
7 Conclusions and Future work 55
7.1 Conclusions ....................................................................................55
7.2 Limitations ......................................................................................56
7.3 Future work ....................................................................................57
References 59
List of Figures
1.1 Three phases A: Super resolution generation, B: Low-light
enhancement, C: Detection ............................................................. 2
2.1 A simplified structure of ANN .......................................................... 8
2.2 A simplified structure of CNNs ......................................................... 9
2.3 Information loss due to downsampling. More complex objects lose more information (Effects of varying resolution on performance of CNN based image classification: an experimental study [1]) ......... 12
2.4 Overexposed image A, underexposed image B and both cases
in one image C (Detecting objects under challenging illumination
conditions [2]) ................................................................................ 14
3.1 Research process. .......................................................................... 20
3.2 The architecture of D-DBPN [3] ..................................................... 21
3.3 A: Up-projection unit and B: down-projection unit for iterative
SR generation [3] ........................................................................... 21
3.4 The architecture of MBLLEN [4] ..................................................... 23
3.5 The architecture of YOLOv3 ........................................................... 25
3.6 YOLOv3 detection process. A: S×S grid generation, B: Anchor box generation, C: Bounding box prediction .................................. 26
3.7 Effect of different γ values on output intensity. Here γ = 1
corresponds to the original image, γ > 1 and γ < 1 results in
decreased and increased illumination intensity ....................... 28
3.8 Images from A: Caltech Pedestrian, B: NightOwls data sets. ......... 29
3.9 Planned experiments ..................................................................... 31
4.1 Selected images from NightOwls and Caltech Pedestrian
datasets. Low resolution training data for SR generation has
been prepared by bicubic downsampling. Low-light training
data for LL enhancement has been prepared by gamma
correction
and Poisson noise injection. .......................................................... 34
4.2 Intersection over union (IoU) ........................................................ 35
5.1 Resolution reconstruction progress for test data expressed as
PSNR metric and MSE loss for each epoch. .................................. 40
5.2 From left to right: Downsampled, generated SR and ground
truth images .................................................................................. 41
5.3 Illumination reconstruction progress for test data expressed as
PSNR metric and combined loss for each epoch. ......................... 42
5.4 From left to right: Synthesized low-light, enhanced and ground
truth images .................................................................................. 42
5.5 Green: Low resolution and normal resolution detection, Red:
ground truth pedestrians .............................................................. 43
5.6 Detection performed on the generated SR image provided
by D-DBPN. Green: detected pedestrians, Red: ground truth
pedestrians .................................................................................... 44
5.7 A: ground truth images. B: Detection performed on ground
truth images. C: Detection performed on LL enhanced images.
Green: detected pedestrians, Red: ground truth pedestrians ..... 45
5.8 A: Image resolution was first increased by D-DBPN model
and then illumination was enhanced by MBLLEN model. B:
Reversed setup. Illumination was first enhanced by MBLLEN
model and thereafter D-DBPN increased resolution. .................. 46
6.1 Patch boundaries observed in generated images. From left to
right, SR generated images for epoch 5 and 20. ............................ 50
6.2 From left to right: low resolution and SR generated image.
Pedestrian positioned at the background has been successfully
detected while pedestrian in the foreground is ignored. .............. 52
List of Tables
3.1 YOLOv3 performance on the COCO Dataset .................................. 25
3.2 Main characteristics of NightOwls and Caltech Pedestrian
datasets. ........................................................................................ 30
4.1 D-DBPN training settings ............................................................... 36
4.2 MBLLEN training settings ............................................................... 37
5.1 AP of the respective experiment ................................................... 47
List of acronyms and abbreviations
ANN Artificial Neural Network
AP Average Precision
CNNs Convolutional Neural Networks
CV Computer Vision
D-DBPN Dense Deep Back-Projection Networks
DBPN Deep Back-Projection Networks
DL Deep Learning
FDSR Feature Driven Super Resolution
fps Frames per second
GAN Generative Adversarial Networks
HD High Definition
KNN K Nearest Neighbour
LL Low-Light
mAP Mean Average Precision
MBLLEN Multi-Branch Low-Light Enhancement Network
ML Machine Learning
MSE Mean Squared Error
PSNR Peak Signal to Noise Ratio
RGB Red Green Blue
SR Super Resolution
SSIM Structural Similarity
SVM Support Vector Machine
YOLO You Only Look Once
Chapter 1
Introduction
1.1 Background
Object detection and classification is a critical element of any Computer Vision
(CV) system. Robust object detection algorithms are of great importance in
visual surveillance and security systems, human-machine interaction applications,
traffic monitoring, collision avoidance systems and many others.
To accurately detect and classify an object is a challenging task. Images
taken in an environment with the absence of an appropriate light source
usually include shot noise related to the photon counting in optical devices.
Moreover, these images might include motion blur as the result of a longer
exposure time required for collection of reflected light. These factors are
very common in night images, which makes it hard to create a general
solution and provide accurate detection.
Interaction between the surrounding world and CV system is achieved
through sensors or vision systems. This system could be a Light detection
and ranging (Lidar), Sound navigation and ranging (Sonar) or a simple red-green-blue (RGB) monocular camera. In this thesis we explore the constraints
of object detection given a monocular camera vision system as the most
available way for machines to observe the world.
The most important factor of any vision system is resolution. The image resolution provided by a monocular camera depends strictly on the number of light-sensitive sensors that make up the sensor matrix.
Image resolution usually impacts detection accuracy for objects
positioned in the background [1]. As objects become smaller, the number of pixels representing these objects decreases. It is generally very hard to analyze low resolution objects and images due to this lack of information.
While image resolution is important to create detailed images it is also
important to actually see objects clearly. Illumination conditions affect the whole image, resulting in the appearance of unwanted artifacts. Such
artifacts as reflections, shades, and completely darkened areas are very
common in images taken in dark environments. Objects altered by these
artifacts are extremely hard to analyze due to the high level of object
degradation [2].
The field of CV was revolutionized by the introduction of Artificial Neural Networks (ANNs). This imitation of the brain's neural structure has proven useful for many applications such as speech recognition
and digit recognition, forecasting and natural language processing. One of
the most useful forms of ANN in the CV field is Convolutional Neural
Networks (CNNs).
CNNs allow us to analyze images of higher resolution with less computational
resources by reducing spatial dimensionality of images while preserving
important information. CNNs provide solutions for object detection,
segmentation and classification along with multiple augmentation methods
such as Super Resolution (SR) generation and Low-Light (LL) enhancement.
Finally, CNNs
are extremely flexible and could be cascaded with other image processing
techniques.
In this thesis we explore possibilities of CNNs to perform object detection
and classification in low resolution images taken in dark environments. This
procedure could be divided into three steps as illustrated in figure 1.1:
super resolution generation (SR), low-light enhancement (LL) and finally
object detection.
Figure 1.1: Three phases A: Super resolution generation, B: Low-light
enhancement, C: Detection
The three step procedure could prove to be useful for applications
that involve image analysis in an environment with changing illumination
conditions.
1.2 Problem
1.2.1 Original problem and definition
Given an image taken in a dark environment a CNNs based algorithm will be
trained to enhance contrast, improve resolution and perform object
detection. This is a complex problem whose parts have been reviewed in the literature independently. This project will focus on the investigation of these subsequent tasks. The original problem can be formulated as the following questions:
• How do ML methodologies and structure of CNNs benefit robust
object detection and tracking in low resolution images taken in dark
environments?
• What data and data augmentation methods can be used to support
training?
• Are CNNs based models able to improve detection accuracy by cascading
SR and LL modules? In what order should this cascading be performed?
1.3 Purpose
This project is conducted as a final part of the System Control and Robotics
master’s degree programme in cooperation with Atea Sverige AB and the Royal
Institute of Technology (KTH). The main purpose of this thesis is to develop
deeper knowledge and understanding in the context of the programme,
conduct scientific experiments and apply acquired skills. Achieved results
and gained knowledge could be used by universities and industrial
organizations as a base for further improvement of object detection and
classification in low resolution dark images.
1.4 Goals
The main goal of this thesis is to achieve improved detection in low resolution images with various illumination conditions using deep convolutional neural networks. The main goal can be achieved through several sub-goals defined as:
• Improve image resolution with selected CNNs model.
• Investigate the quality of super resolution generation given normal- and low-light images as input.
• Enhance contrast and brightness with selected CNNs model.
• Investigate quality of enhanced images given normal- and low-light
images as an input.
• Evaluate detection accuracy on generated SR and LL enhanced images.
In this thesis, we will show that image resolution and illumination
conditions are two critical factors that impact detection accuracy.
1.5 Research Methodology
Multiple CNNs based models for SR generation and LL enhancement have
been proposed lately in order to improve image quality. These models could
be cascaded with a detector in order to improve detection accuracy by
enhancing contrast and increasing resolution of the input image.
To evaluate the project, the following research methods will be used:
• Literature review
• Exploratory data analysis:
– Analysis of available data sets suitable for the project.
• Implementation:
– Augmentation methods.
– Data sampling and annotation.
– Selected CNNs models.
• Experiments:
– Performance analysis of super resolution generation and low-light enhancement models.
– Analysis of detection accuracy on generated images.
1.6 Delimitations
• This work will not cover analysis of detection in overexposed images due to several reasons. Most algorithms and proposed solutions are based on a static environment where the light direction has to be estimated. To perform this estimation, a target image along with a ground truth image have to be collected in order to correct the overexposed areas. This approach is not applicable when the source of
the light is a detection target itself. For example a car with headlights
turned on or vehicles in a non-static environment.
• Selected CNNs based models will not be implemented from scratch
and instead, publicly available algorithms will be used.
• Hyperparameter values given by the authors of the proposed CNNs
models will be used. Smaller modifications might be done.
• In order to provide reasonable results a pre-trained detector will be
used. This work will focus mainly on training of SR generation and LL
enhancement models.
1.7 Structure of the thesis
Chapter 2 covers theory and related work regarding object classification
and detection, resolution and illumination impact on detection accuracy as
well as augmentation methods for data preparation. Chapter 3 presents
the method used to evaluate the project. It provides information about
selected models for detection, SR generation and LL enhancement.
Additionally, a few data augmentation methods are covered and the experimental setup is presented in the same chapter. Chapter 4 presents the
data preparation process and training settings for the respective selected
network. Chapter 5 reports the results obtained and provides an analysis
for each experiment. Chapter 6 covers specific observations obtained
during each experiment. Finally, in chapter 7 the final conclusions are
drawn and future work is suggested. Additionally, answers to all the scientific questions are presented.
Chapter 2
Theory and related work
This chapter provides theory and related work information about object
classification and detection, data augmentation, super resolution
generation and low-light enhancement techniques.
Deep Learning (DL) is a powerful tool in the domain of digital image
processing. A wide range of challenges such as artificial data-generation,
segmentation, classification and object detection have been addressed with the introduction of DL based methods. Convolutional Neural Networks
(CNNs) is an impressive form of DL models that revolutionized the
computer vision field and pushed boundaries of object detection precision
and accuracy. The ever-growing interest for new applications resulted in the
appearance of CNNs based super resolution (SR) generators and methods for
low-light (LL) image enhancement. Both methods could be used in the
scope of robust object detection improving detection accuracy and
classification precision in low resolution images taken in dark environments.
2.1 Image Classification
Image classification is a key component of a wide range of computer vision
applications such as surveillance, traffic monitoring, collision avoidance, face
recognition, augmented reality, eye tracking, medical imaging etc.
Classification is a process of assigning a class to the context of an
image. Some of the widely used classification methods include naive Bayes
classifier, Support Vector Machines (SVM), K-Nearest Neighbors (KNN),
Gaussian mixture model, Decision Tree, Random Forest, Logistic Regression
and Radial Basis Function (RBF) classifiers. The biggest advantage of these classifiers is their ability to perform classification using relatively small
data sets. Despite satisfying performance on data sets with a limited number of classes, the complexity and accuracy of the mentioned algorithms become an issue for larger problems [5], mostly because of a strong assumption of feature independence which rarely holds true in the real world.
Appearance of ANN based classification methods resulted in overall
improvement of image classification accuracy [6]. ANN based classifiers
provide good performance while working with large data sets that
consist of hundreds of classes. Traditional classification methods can
compete with ANN when data is limited [7].
A fully connected ANN is a core of modern image classification and
detection algorithms. The standard ANN architecture includes an input layer, a few hidden layers and an output layer, as shown in figure 2.1.
Figure 2.1: A simplified structure of ANN.
The dimension of the input layer is defined by the dimension of the
input data. In case of image classification this dimension corresponds to
the dimension of an image usually defined as Width x Height for gray scale
format and Width x Height x 3 for RGB format. Hence the amount of
calculations that ANN should perform in order to provide classification
depends strictly on the image size. Extensive calculation is the main factor that limits the ability of ANNs to perform classification and detection on large size images. The introduction of convolutional layers followed by a fully connected neural network addressed this problem and resulted in the architecture now known as CNNs.
2.2 Object Detection
A significant improvement of image classification was achieved with the
appearance of CNNs based classifiers [8]. As shown in figure 2.2 the problem
of redundant calculations has been resolved by introducing convolutional
and pooling layers. These layers reduce dimensionality of the input image
by producing high level feature maps that are further analyzed by a fully
connected neural network. Feature maps obtained by first convolutional
layers could represent simple things like curves and edges while deeper
convolutional layers could provide such high level features as legs or wheels
that represent classes pedestrian and car. These high level features are
further analyzed by a fully connected ANN that assigns the most probable
class to the image. In other words the ANN will make decisions based on
the class characteristics instead of analyzing each pixel.
Figure 2.2: A simplified structure of CNNs.
A process of object detection is similar to the process of image classification.
The only difference is that image classification aims to estimate a class for
the whole image while object detection aims to detect a specific location of
each class and mark it with a bounding box and label. This procedure is
usually referred to as detection through classification.
2.2.1 CNNs based detectors
Appearance of CNNs based detectors was a milestone in the CV field. Since
that moment, algorithms have been evolving at an unprecedented speed
and many useful applications were integrated in our life [9].
The AlexNet architecture proposed in 2012 and developed by Krizhevsky
et al. [6] wasn’t the first attempt to apply CNNs for object classification
and detection. However, it is considered to be "The one that started it all".
AlexNet proposed a set of convolutional layers along with pooling layers that
extract features and downsample images. This approach reduces
complexity of neural networks improving computational capabilities and
accuracy.
Region Based Convolutional Neural Networks (R-CNN) proposed by R.
Girshick et al. (2014) was the first introduced two-stage CNNs detector. R-
CNN improved detection accuracy evaluated on VOC07 dataset achieving
58.5% mean average precision (mAP) [10]. Despite great success R-CNN was
extremely slow, requiring redundant computations of region proposals
(almost 2000 proposals per image).
To address this problem K. He et al. proposed Spatial Pyramid Pooling
Networks (SPPNet) [11]. SPPNet computes the feature proposals only once,
and then pools proposed features in arbitrary regions. Following this
approach a repeated computation of the convolutional features is avoided.
Proposed network achieved 59.2% mAP on VOC07 data set detecting objects
more than 20 times faster than R-CNN.
Next generation of R-CNN called Fast R-CNN allowed training of the
detector and a bounding box regressor under the same network
configurations. Incredible 70.0% mAP on VOC07 data set was achieved
while detecting objects more than 20 times faster than R-CNN [12].
Faster R-CNN proposed by S. Ren et al. [13] was the first near real-time
deep learning detector. Introduced Region Proposal Network (RPN)
enabled almost cost-free region proposals. Faster R-CNN achieved 73.2%
mAP on VOC07 dataset providing nearly 17 frames per second (fps)
detection speed.
Next major milestone in object detection was established by the
introduction of one-stage detectors. The key idea of a one-stage detector is to
avoid region proposals that consume time and perform detection in one step.
You Only Look Once (YOLO) proposed by R. Joseph et al. [14] in 2015
improved detection speed. YOLO achieved 45 fps at an accuracy of 63.4%
mAP on VOC07 data set. By reducing the amount of convolutional layers the
fastest version of YOLO achieved 155 fps at an accuracy of 52.7% mAP
evaluated on the same data set.
Second one-stage detector to appear was the Single Shot MultiBox
Detector (SSD). W. Liu et al. proposed an SSD approach in 2015[15]. Main
contribution is related to the introduction of the multi-reference and multi-
resolution detection techniques. These techniques significantly improved
detection accuracy of small objects. SSD achieved speed of 59 fps at an
accuracy of 76.8% mAP evaluated on VOC07 data set.
One stage detectors could not compete in accuracy with two-stage
detectors until Y. Lin et al. discovered that the foreground-background class
imbalance experienced during training was a reason for low accuracy.
Researchers proposed RetinaNet as a solution in 2017 [16]. RetinaNet utilizes focal loss
that puts more focus on misclassified background examples during training.
Proposed solution enables one-stage detectors to achieve accuracy comparable to two-stage detectors while maintaining very high detection speed
[9].
R. Joseph et al. continued to improve YOLO architecture which resulted
in the appearance of YOLOv2 [17] and YOLOv3 [18]. Proposed improvements
increased the detection accuracy while keeping a very high detection
speed.
YOLOv3 outperformed most of the real-time detection algorithms with comparable
accuracy and speed. Additional modification of YOLOv3 known as YOLOv3-
tiny made it possible to achieve detection in applications where hardware capabilities are limited [18].
2.3 Resolution
The amount of information that an image can hold depends on the amount
of pixels. High resolution images such as HD (1920×1080×3) and 4K (3840×2160×3) provide more information that can be useful for classification.
The effects of varying resolution on performance of CNN based image
classification methods have been investigated by S.P. Kannojia and G. Jaiswal
[1]. Experimental results showed that degradation in image resolution
from higher to lower reduces classification accuracy. Additionally the
classification accuracy was shown to be lower for down-scaled images that
represent more complex structures as shown in figure 2.3. This study
highlighted the
importance of resolution on subsequent classification tasks.
Dai et al. confirmed that generated SR images improve detection and
segmentation compared to low resolution ground truth images. Presented
results showed strong correlation between detection accuracy and quality
of generated images [19].
Figure 2.3: Information loss due to downsampling. More complex objects lose more information (Effects of varying resolution on performance of CNN based image classification: an experimental study [1])

Such classical methods as Nearest-Neighbor interpolation, Bicubic interpolation, Fourier-based interpolation and Edge-directed interpolation were widely
applied for high resolution image generation before appearance of more
sophisticated CNNs methods. Many of these algorithms provide satisfying
visual quality while struggling on subsequent tasks of object classification
[20]. A simple explanation is that these algorithms cannot enhance classification-critical features.
Appearance of CNNs based SR generators resulted in significant
improvement of generated images.
Super Resolution Convolutional Network (SRCNN) proposed by C.
Dong et al. directly learns an end-to-end mapping between low and high-
resolution images. SRCNN is a single deep neural network that achieves high
computational speed and significant boost in peak signal to noise ratio (PSNR)
compared to classical SR generators [21].
Deeply-recursive convolutional network (DRCN) proposed by J. Kim et
al. evaluates recursive-supervision and skip-connection in order to simplify
training and boost convergence. The deep structure of DRCN resulted in a higher PSNR and structural similarity (SSIM) score than the earlier introduced SRCNN [22]. For the same reason, the achieved computational speed is lower for DRCN than for SRCNN.
Another improvement of generated images was achieved by introducing the Deep Recursive Residual Network (DRRN). The proposed model utilizes an enhanced residual unit structure that is learned recursively within a recursive block. This approach mitigates the difficulty of training very deep networks. DRRN outperformed both SRCNN and DRCN on several extensive benchmark experiments, resulting in high PSNR and SSIM scores [23].
The Deep Back-Projection Network (DBPN) and Dense Deep Back-Projection Network for super resolution (D-DBPN) proposed by M. Haris et al. utilize deep projection units that can be easily stacked upon each other, building a
deeper architecture. Images generated with four projection units achieved
a greater PSNR score than the earlier proposed DRRN [3].
Many other single-image and multi-image SR generation models have
been proposed recently. However, most of them target visual quality
improvement as the primary objective by utilizing mean squared error, PSNR or
SSIM as loss functions.
To address the problem a CNNs based Feature Driven Super Resolution
(FDSR) for object detection has been introduced [20]. The proposed architecture is based on the DBPN model mentioned above. FDSR utilizes two
distinct loss functions, reconstruction loss based on PSNR metric and
feature-driven loss based on features extracted from generated and
ground truth images. Wang et al. reported that FDSR results in a higher
visual quality and classification accuracy compared to other bench-marking
algorithms [20].
2.4 Illumination
A human eye can recognise an object by its geometrical representation. To
interpret geometry we use our ability to analyze differences in colors and
light intensity. Similarly, in computer vision a rapid color change could
indicate a certain geometry of an observed object. The amount of light
reflected by an object and captured by the camera sensors will have a
major impact on how accurately the geometry could be interpreted.
One of the biggest challenges in image classification and object
detection is lighting bias [24, 2]. In real world applications images could be
collected in different circumstances (dark night or bright sunny day). It
usually leads to over- or underexposed images that are hard to generalise
and classify. Moreover, a combination of a dark environment and a bright
source of light is a common scenario. One simple example is a car driving at
night. All three examples are presented in figure 2.4.
Methods for low-light image enhancement can be divided into three categories. The first category of methods is built upon histogram equalization with additional modifications. The second category is based on
the Retinex theory that assumes that an image is a combination of reflection
and illumination [4]. Unfortunately in more critical cases like dark night
scenes they both provide poor results resulting in appearance of undesirable
artifacts such as saturated pixels [2].

Figure 2.4: Overexposed image A, underexposed image B and both cases in one image C (Detecting objects under challenging illumination conditions [2])
The third category is DL based methods which recently achieved great
success in low-light image processing.
Some of the proposed methods like Low-Light Networks (LLNet) [25] and
Low-Light Convolutional Neural Networks (LLCNN) [26] achieve impressive results
with image denoising and brightness/contrast enhancement. However, both
methods utilize visual loss metrics (PSNR and SSIM) instead of focusing
on features critical for subsequent detection and classification tasks.
This problem was addressed by Feifan Lv et al. by introducing a Multi-
Branch Low-Light Enhancement Network (MBLLEN). Proposed model extracts
features from different levels through convolutional processes. Extracted
features are thereafter enhanced and the final image is generated by fusion.
MBLLEN utilizes three distinct loss functions: Region loss for enhancement of underexposed regions, Structure loss for visual quality improvement and Context loss for enrichment of feature content. The resulting network showed improved performance on detection and classification tasks compared to other methods [4].
2.5 Data
DL based algorithms rely on a large amount of data in order to provide accurate results [24]. The Big Data term is widely applied to deep learning based
systems. In case of CV the main idea is to create a collection of images that
will represent objects in a general way.
Data collection is an application specific procedure and it takes a lot of
effort to prepare a data set that will satisfy requirements. The majority of
open source datasets for object detection are focused on detection under satisfying illumination conditions, while data sets for detection at night are
limited.
The Caltech Pedestrian dataset is a collection of nearly 250k images
that provides a large amount of annotated pedestrians [27]. All images were
taken in California at good illumination conditions by a monocular camera
attached to a car.
The KAIST dataset contains both night and day RGB and thermal images
with the focus on pedestrian detection at night. The collected images
represent city pedestrians during one season [28]. Despite being one of
few available datasets focusing on night scene detection the diversity of
included images is limited.
The LOL dataset is a paired dataset with 500 low-light and normal-light
image pairs. All images were taken in an environment where the source of
light has been controlled manually. LOL data set was mostly collected in an
indoor environment and doesn’t include annotations [29]. Additionally, the
amount of classes that the dataset represents is very limited.
NightOwls dataset is a collection of night images taken in several cities
across Europe. The dataset includes 279k fully annotated images with pedestrians, cyclists and motorcyclists, covering different seasons and weather conditions [30].
2.6 Augmentation
Image augmentation is a way to expand data size and improve
generalisation through data manipulation techniques.
Augmentation methods could be divided in two categories: classic
augmentation and DL based augmentation.
Classic augmentation methods include geometric transformations,
color space augmentations, kernel filters, mixing images and random
erasing [24]. DL based augmentation methods include adversarial training
and Generative Adversarial Networks (GAN) based data augmentations.
In some cases augmentation methods could result in label loss (non-label preserving transformation). This fact should be taken into
consideration when choosing appropriate augmentation methods.
Downsampling
Most CNNs based image processing algorithms are designed and optimized
for a certain dimension of the input images. The downsampling based
augmentation helps to control the size of the data. Moreover, in case of
supervised SR generation the downsampling could be used to create a
paired dataset of ground truth and downsampled images.
Flipping
Flipping is the easiest augmentation method that has been used widely.
Horizontal axis flipping is more common than vertical axis flipping. In
handwritten digit images a vertical flip of number 6 will result in number 9.
Hence in this scenario augmentation is non-label preserving.
Colour space augmentation
Images are represented by tensor height×width×channels. Red-Green-Blue
(RGB) type images have 3 color channels and a variety of images could
be created by channel manipulation. Histogram equalization and Gamma
correction methods are two classic examples of colour space augmentation that can be used as contrast enhancement techniques for synthesis of underexposed images.
Cropping
Cropping is a process of patch extraction. This method is widely used in
images of various resolutions resulting in patches of equal size. Random
cropping is a popular method used in combination with unsupervised
learning. However it should be considered as a non-label preserving
augmentation in cases when labeled data is used. Cropping is a simple and
effective method to expand data size by converting high resolution images to
several low resolution images.
Rotation
Rotation is a stepwise form of flipping.
It is performed by rotating an
image clockwise or anticlockwise (usually expressed in degrees). As it was
mentioned above for some applications rotation is a non-label preserving
method. Rotations in a range of -20/+20 degrees are considered as safe
and useful. Augmentation through rotation helps to avoid rotational bias.
Translation
Image translation is used to generalise data and avoid positioning bias.
Images are translated right and left, up and down. This augmentation is
spatially preserving meaning that the initial size of the image remains the
same. New spaces that appear due to translation are usually filled with
zero values or Gaussian noise.
Noise Injection
Noise injection is a process of adding noise to the image. Different kinds
of noise such as Gaussian or Poisson noise could be generated and applied
to images. This augmentation method forces networks to learn more
robust features which usually improves data generalisation.
Random Erasing
Object occlusion issues could be resolved by using random erasing. The idea
is to erase random parts of an object to simulate occlusions. The process is performed with an A×B mask. The erased area is usually filled with Gaussian
Noise. Random erasing helps to create a more general model that analyzes
the whole object instead of focusing on its specific parts.
Mixing Images
Mixing several images together by averaging their pixel values is a way to train
a robust algorithm. This process begins with a random image selection. RGB
channels of selected images are averaged and images are combined (added
or subtracted) to create a single image. Label of the first randomly selected
image is used as a label for a combined image. This augmentation method
has been proven to reduce classification error [31].
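A minimal NumPy sketch of this mixing step for a pair of images is given below; the equal-weight average and the reuse of the first image's label follow the description above, while the array-based input format is an assumption of the sketch:

import numpy as np

def mix_images(image_a: np.ndarray, label_a, image_b: np.ndarray):
    # Average the RGB values of two randomly selected images and keep the first label.
    mixed = (image_a.astype(np.float64) + image_b.astype(np.float64)) / 2.0
    return mixed.astype(np.uint8), label_a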
GAN data augmentation
Inspired by Adversarial Training, a similar approach is conducted by using
Generative Adversarial Networks (GAN). GAN was first introduced as a
realistic image generator. Later on GAN have been used for high resolution
image generation, texture generation and human face generation [32].
Flexibility of GAN makes them suitable for image augmentation purposes.
Similar to Adversarial training, GANs architecture consists of two rival
networks Generator and Discriminator. Generator generates artificial
images from existing dataset and tries to fool the Discriminator.
Discriminator, on the other hand, is trying to understand whether these images are real or artificial. Both networks are learning throughout the process [33], resulting in more and more realistic generated images.
Chapter 3
Method
The purpose of this chapter is to provide an overview of the research
method used in this thesis. Section 3.1 describes the research process. In
section 3.2.3 the details and motivation of chosen detection block are
covered. Thereafter in section 3.2.1 the selected super resolution
generation (SR) block is reviewed. In section 3.2.2 the low-light
enhancement (LL) block is presented. Section
3.4 covers analysis of selected datasets. In section 3.3 the selected
augmentation methods are reviewed. Finally, in section 3.5 the
experimental design is covered.
3.1 Research process
The research process can be divided into 6 phases. At first the problem of
object detection in low-resolution, low-light images was formulated.
Thereafter hypotheses and reasonable objectives for the given problem have
been defined. Further a literature analysis has been performed in order to
collect existing knowledge. This analysis was used to define and select CNNs
based algorithms for detection, SR generation and low-light enhancement.
After that two extensive open source datasets have been reviewed and
analyzed. Further, appropriate augmentation methods for synthetic low-resolution, low-light image generation have been selected. Finally a set of
experiments was planned. Figure 3.1 illustrates all phases of the research
process.
Figure 3.1: Research process.
3.2 Selected CNNs models
SR image generation and LL image enhancement result in an increased
detection performance [34],[20],[4]. Both enhancement methods have
been investigated individually resulting in a higher mAP score than
corresponding accuracy achieved on ground truth images.
It is however unclear if a combination of SR image generation and LL
image enhancement will benefit robust object detection. In order to
investigate the impact of resolution and illumination on detection accuracy
the following CNNs models have been selected.
3.2.1 D-DBPN
SR image generation is a process of converting a low resolution input image
into an output image of higher resolution. The process is performed through
a supervised learning approach given paired low resolution and high
resolution images as a training dataset.
The D-DBPN architecture is the backbone network of FDSR and incorporates mutually connected up- and down-sampling layers [3]. These layers
represent different types of image degradation along with high resolution
components. Proposed architecture is shown in figure 3.2. The process
is initialized by feature extraction from a given low resolution input
image. Further the extracted features are optimized by back projection
stages and final reconstruction is performed by concatenation of all resulting projections.
Figure 3.2: The architecture of D-DBPN [3]
Visual quality of SR generated images is restored through iterative up-
and down-projection units defined as shown in figure 3.3. The key idea is
to use a concatenated feature map generated by all previous projection
units and feed it as input to the next units. All layers are followed by
parametric rectified linear activation function (PReLUs).
Figure 3.3: A: Up-projection unit and B: down-projection unit for iterative SR
generation [3]
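As an illustration of the projection units described above, the following is a minimal PyTorch sketch of an up-projection unit; the kernel size 8, stride 4 and padding 2 are assumed here to correspond to the 4x scale factor in [3], and the down-projection unit is obtained by swapping the roles of the convolution and transposed convolution. This is a sketch rather than the reference D-DBPN implementation.

import torch.nn as nn

class UpProjectionUnit(nn.Module):
    """Sketch of a single DBPN-style up-projection unit (assumed 4x settings)."""
    def __init__(self, channels: int, kernel: int = 8, stride: int = 4, padding: int = 2):
        super().__init__()
        self.up_1 = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel, stride, padding), nn.PReLU())
        self.down = nn.Sequential(
            nn.Conv2d(channels, channels, kernel, stride, padding), nn.PReLU())
        self.up_2 = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel, stride, padding), nn.PReLU())

    def forward(self, lr_features):
        hr_0 = self.up_1(lr_features)    # initial mapping to HR feature space
        lr_0 = self.down(hr_0)           # back-projection to LR feature space
        residual = lr_0 - lr_features    # back-projection error
        hr_1 = self.up_2(residual)       # error mapped back to HR space
        return hr_0 + hr_1               # corrected HR feature map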
D-DBPN is optimized with MSE loss by comparing the ground truth
image with the generated image. Reconstruction MSE loss function formed
after a predefined set of up- and down-projection units is defined as shown
in 3.1.
$$\mathrm{MSE}^{c}_{Rec} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( G_w(I^{LR})_{i,j} - I^{HR}_{i,j} \right)^2$$

$$\mathrm{MSE}^{total}_{Rec} = \mathrm{MSE}^{R}_{Rec} + \mathrm{MSE}^{G}_{Rec} + \mathrm{MSE}^{B}_{Rec} \tag{3.1}$$

Here $I^{LR}$ is the low resolution image, $I^{HR}$ is the high resolution image, $G_w(I^{LR})$ is the image generated by the SR model, $w$ are the parameters of the network, and $c$ corresponds to the channels R, G and B.
The quality of generated images is evaluated by the PSNR metric that is
performed as shown in 3.2.
$$\mathrm{PSNR} = 10 \log_{10} \frac{(MAX_I)^2}{\mathrm{MSE}^{total}_{Rec}} \tag{3.2}$$
Here MAXI = 255 is the maximum intensity that each pixel could
achieve in images given with RGB format.
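As a minimal illustration, the following NumPy sketch follows the per-channel definitions of equations 3.1 and 3.2; the summation of the channel-wise MSE values is taken directly from 3.1 and is an assumption of this sketch rather than a reference implementation:

import numpy as np

def psnr(generated: np.ndarray, ground_truth: np.ndarray, max_intensity: float = 255.0) -> float:
    # Per-channel MSE over an HxWx3 image pair, summed over R, G and B (eq. 3.1).
    diff = generated.astype(np.float64) - ground_truth.astype(np.float64)
    mse_per_channel = np.mean(diff ** 2, axis=(0, 1))
    mse_total = float(np.sum(mse_per_channel))
    # PSNR with MAX_I = 255 (eq. 3.2).
    return 10.0 * np.log10(max_intensity ** 2 / mse_total)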
D-DBPN provides SR image generation with scale factors 2x, 4x and 8x
as described in the original paper [3]. Input images are split into patches of 32x32 pixels and processed individually. The resulting SR patches are stitched together according to their initial locations.
3.2.2 MBLLEN
Object detection in over- or underexposed images is a challenging task. A standard procedure of many image enhancement models is to
improve contrast and brightness by using visual quality as a metric.
However this approach doesn’t take into account features that are critical
for classification and detection purposes.
Feifan Lv et al. addressed this problem by introducing a multi-branch
low-light enhancement network (MBLLEN) [4]. The structure of the proposed
network is illustrated in figure 3.4.
The key idea is to extract features from different levels by performing
subsequent convolutions in Feature Extraction Module (FEM). Feature
enhancement corresponding to each convolutional layer in FEM is performed by
Enhancement Module (EM). Finally all enhanced layers are fused together by
using the Fusion Module (FM). The combined network is optimized through three distinct loss functions: Region, Structure and Context loss.
Figure 3.4: The architecture of MBLLEN [4]

Region loss is designed to pay more attention to low-light regions within an image by selecting a partition of darkest pixels. Following this
idea, enhancement of underexposed regions is prioritized while minimizing
enhancement of overexposed regions. The reason for such priority is that
images taken with a light source pointed towards the camera provide very
little details about objects behind this light source. Therefore it is more
important to enhance everything around overexposed regions and make
some assumptions about what could possibly be a reason for overexposed
regions (a car at night, a person with a flashlight etc).
In MBLLEN identification of underexposed regions is achieved by choosing
the top 40% darkest pixels among all pixels. Region loss is defined as:
$$L_{Low} = w_L \frac{1}{m_L n_L} \sum_{i=1}^{m_L} \sum_{j=1}^{n_L} \left\| E_L(i,j) - G_L(i,j) \right\| \tag{3.3}$$

$$L_{High} = w_H \frac{1}{m_H n_H} \sum_{i=1}^{m_H} \sum_{j=1}^{n_H} \left\| E_H(i,j) - G_H(i,j) \right\| \tag{3.4}$$

$$L_{Region} = L_{Low} + L_{High} \tag{3.5}$$

Here $E_L$ and $G_L$ are the underexposed regions of the enhanced and ground truth images, and $E_H$ and $G_H$ are the remaining parts of the images.
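To make the region loss concrete, the following NumPy sketch selects the darkest 40% of pixels and applies a larger weight to them, in the spirit of equations 3.3 to 3.5; the weight values and the choice to select dark pixels based on the ground truth intensity are assumptions of this sketch, not the exact settings of [4]:

import numpy as np

def region_loss(enhanced: np.ndarray, ground_truth: np.ndarray,
                dark_fraction: float = 0.4, w_low: float = 4.0, w_high: float = 1.0) -> float:
    # Per-pixel brightness of the ground truth image, used to find the darkest regions.
    intensity = ground_truth.mean(axis=-1)
    threshold = np.quantile(intensity, dark_fraction)
    dark_mask = intensity <= threshold
    # Mean absolute error per pixel between enhanced and ground truth images.
    abs_err = np.abs(enhanced.astype(np.float64) - ground_truth.astype(np.float64)).mean(axis=-1)
    loss_low = w_low * abs_err[dark_mask].mean()      # eq. 3.3, underexposed regions
    loss_high = w_high * abs_err[~dark_mask].mean()   # eq. 3.4, remaining regions
    return loss_low + loss_high                       # eq. 3.5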
Over- and underexposed capture usually causes object degradation such as motion blur, shades and reflections. The Structure loss of MBLLEN is based on the SSIM metric and is designed to address this problem by improving the visual quality of generated images. The structure loss is defined as:

$$L_{Structure} = \frac{1}{N} \sum_{p \in img} \left( 1 - \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \right) \tag{3.6}$$

Here $\mu_x$ and $\mu_y$ are averaged pixel values, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances, $\sigma_{xy}$ is the covariance, and $C_1$ and $C_2$ are constants that prevent division by zero.

Context loss is designed to enhance classification critical features by utilizing a pretrained VGG-19 feature extraction network. The reasoning is that if the generated and ground truth images are similar, then the corresponding feature maps given by the extraction network will also be similar. Given this idea, the context loss is defined as:

$$L_{Context} = \frac{1}{W_{i,j} H_{i,j} C_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \sum_{z=1}^{C_{i,j}} \left\| \phi_{i,j}(E)_{x,y,z} - \phi_{i,j}(G)_{x,y,z} \right\| \tag{3.7}$$

Here $\phi_{i,j}(E)$ and $\phi_{i,j}(G)$ are the feature maps of the generated and ground truth image, $W_{i,j}$, $H_{i,j}$ and $C_{i,j}$ correspond to the width, height and channel dimensions, and $i$ and $j$ stand for the block index and convolutional layer index in the VGG-19 network. The combined loss function shown in 3.8 is very flexible, and the effect of each particular term can be adjusted through the weights $\alpha$, $\beta$ and $\gamma$:

$$L_{total} = \alpha L_{Region} + \beta L_{Structure} + \gamma L_{Context} \tag{3.8}$$
3.2.3 YOLOv3
YOLO – You Only Look Once, is a one step predictor that requires a single
forward pass to analyze images. This approach makes YOLOv3 a very fast
real-time object detection algorithm.
The flexibility of the network allows the use of YOLOv3 in combination with other image processing algorithms. Additionally, there are a few variants of YOLOv3 that differ in inference cost and detection accuracy. In
table 3.1 the comparison of detection accuracy and computational speed of the
respective YOLOv3 model is presented.
Model         fps   mAP
YOLOv3-320    45    51.5
YOLOv3-416    35    55.3
YOLOv3-608    20    57.9
YOLOv3-tiny   220   33.1
YOLOv3-spp    20    60.6

Table 3.1: YOLOv3 performance on the COCO Dataset

The new backbone designed for YOLOv3, Darknet-53, is significantly larger compared to previous versions and has 53 convolutional layers. The detection module consists of a further 53 layers, which combined results in a fully convolutional
106 layers deep network [18]. The increased amount of layers allows us to
calculate predictions in three different scales as illustrated in figure 3.5.
Figure 3.5: The architecture of YOLOv3
The structure of YOLOv3 allows detection to be solved as a single regression problem. The process is initialized by splitting an image into an S × S grid
(at three different scales S is equal to 13, 26 and 52). Thereafter a set of
anchor boxes is generated within each grid cell. Finally for each box the
network outputs class probabilities and selects the bounding box that
overlaps a ground truth object more than any other bounding box. The first detection, performed on a downsampled image with a 13×13 grid, is illustrated in figure 3.6.
YOLOv3 utilizes residual layers with skip connections and upsampling
blocks in order to concatenate generated feature maps from previous
layers with upsampled ones. Fusion of concatenated feature maps is
achieved by 1x1 convolutional layer.
Figure 3.6: YOLOv3 detection process. A: S×S grid generation, B: Anchor box generation, C: Bounding box prediction

Proposed model is optimized by a loss function that consists of 3 parts. Localization loss as shown in 3.9 defines how well the generated bounding box correlates with the ground truth bounding box.

$$L_{loc} = \gamma_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \gamma_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \tag{3.9}$$
Here x,y,w and h are the parameters of the ground truth bounding box.
x̂ , ŷ , ŵ and ĥ are the parameters of generated bounding boxes. S corresponds
to the grid cell index and B is the number of anchor boxes. γcoord is the
importance weight.
Confidence loss as shown in 3.10 reflects the confidence that an object
is in the grid cell and the generated anchor is responsible for its prediction.
It also penalizes predictions for objects that have been detected but are not
in a cell.
$$L_{conf} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2 + \gamma_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} \left( C_i - \hat{C}_i \right)^2 \tag{3.10}$$

Here $C_i$ is the ground truth class, $\hat{C}_i$ is the predicted class and $\gamma_{noobj}$ is the importance weight.
The last part of the loss function is the classification loss, which reflects the difference between the actual class probabilities and the predicted class probabilities, as shown in 3.11.

L_{class} = \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2 \quad (3.11)
The network learns to define bounding boxes with respect to the generated anchor boxes, and it is therefore critical to select proper anchor dimensions. J. Redmon et al. suggest using K-means clustering on the dimensions of the training set bounding boxes in order to determine which box shapes occur most frequently.
The proposed architecture resulted in more accurate predictions for small background objects and significantly improved alignment of bounding boxes with ground truth objects.
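A minimal sketch of this anchor clustering step is given below, assuming scikit-learn is available. Note that the YOLO authors cluster with an IoU-based distance, whereas this sketch uses plain Euclidean distance on the (width, height) pairs.

import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_dimensions(box_dims, n_anchors=9, seed=0):
    # box_dims: array of shape (N, 2) holding the widths and heights of
    # the training set bounding boxes.
    km = KMeans(n_clusters=n_anchors, random_state=seed, n_init=10)
    km.fit(box_dims)
    anchors = km.cluster_centers_
    # Sort anchors by area so they can be assigned to the three scales.
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

# Example with random box dimensions standing in for a real dataset:
boxes = np.random.uniform(10, 300, size=(500, 2))
print(cluster_anchor_dimensions(boxes))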
3.3 Selected augmentation methods
As has been mentioned in previous sections, the data preparation process is essential for DL based methods. In order to select a proper augmentation method, the nature of the required effect should be investigated. This section covers the main augmentation methods that have been utilized for data preparation.
3.3.1 Bicubic downsampling
SR image generation is a supervised procedure that requires a set of ground truth images along with corresponding low-resolution images. Low resolution images can be obtained by a bicubic downsampling (bicubic interpolation) procedure with the required scale factor. Compared to other downsampling methods, bicubic downsampling yields smoother interpolated edges that reflect background objects more naturally.
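A minimal sketch of this operation, assuming OpenCV is available, is shown below; the file name is purely illustrative.

import cv2

def bicubic_downsample(image, scale=4):
    # Downsample an image by the given scale factor using bicubic interpolation.
    h, w = image.shape[:2]
    return cv2.resize(image, (w // scale, h // scale),
                      interpolation=cv2.INTER_CUBIC)

# img = cv2.imread("frame.png")           # hypothetical ground truth image
# lr = bicubic_downsample(img, scale=4)   # 4x smaller low-resolution image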
3.3.2 Gamma correction
Gamma correction is a method for adjusting the brightness and intensity of an image through a non-linear transformation. While widely applied for visual quality improvement of images, this method can also be used to simulate night or near-night environments. Gamma correction can be seen as a relationship between an input image Image_in and the resulting output image Image_out, where the output is proportional to the input raised to the power of 1/γ:

Image_{out} = 255 \cdot \left( \frac{Image_{in}}{255} \right)^{1/\gamma} \quad (3.12)
The relationship between input intensity and output intensity with respect to different γ values is illustrated in figure 3.7.
Figure 3.7: Effect of different γ values on output intensity. Here γ = 1 corresponds to the original image, while γ > 1 and γ < 1 result in decreased and increased illumination intensity, respectively.
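A minimal NumPy sketch following the form of equation 3.12 is given below.

import numpy as np

def gamma_correct(image, gamma):
    # image: uint8 array with values in [0, 255]; the output is proportional
    # to the normalized input raised to the power 1/gamma (equation 3.12).
    normalized = image.astype(np.float32) / 255.0
    corrected = 255.0 * np.power(normalized, 1.0 / gamma)
    return np.clip(corrected, 0, 255).astype(np.uint8)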
3.3.3 Cropping
Cropping is a convenient way to expand the data size by extracting multiple patches from a single image. In the case of a limited data size, where each data sample is an image of size A×B, a larger set of smaller images can be generated by extracting N patches of size A/N × B/N. This method is considered label-preserving provided that the annotation information is adjusted accordingly.
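A minimal NumPy sketch of non-overlapping patch extraction is given below; the annotation bookkeeping needed to keep the augmentation label-preserving is omitted.

import numpy as np

def extract_patches(image, patch_h, patch_w):
    # Split an image into non-overlapping patches of size patch_h x patch_w.
    patches = []
    h, w = image.shape[:2]
    for top in range(0, h - patch_h + 1, patch_h):
        for left in range(0, w - patch_w + 1, patch_w):
            patches.append(image[top:top + patch_h, left:left + patch_w])
    return patches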
3.3.4 Poisson noise injection
Due to the limitations of RGB cameras, the quality of digital images may degrade. This degradation is partly rooted in the discrete nature of electric charge: the number of photons collected by the light sensitive sensors in the camera follows a Poisson distribution. For this reason, the photon noise in images taken in dark environments can be modelled as Poisson noise. Poisson noise is signal dependent, and the signal-to-noise ratio grows with the square root of the number of photons captured. Underexposed images suffer from insufficient signal amplitude, and as a result more noise can be observed. Poisson noise injection with a specified peak value can be used as an augmentation method to imitate the effect of underexposure in images.
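A minimal NumPy sketch of such an injection is given below. The peak value of 200 matches the value used for data synthesis later in this project, but the exact noise model used in the actual implementation may differ.

import numpy as np

def add_poisson_noise(image, peak=200):
    # Lower peak values correspond to fewer collected photons and hence
    # a noisier, more underexposed-looking image.
    rng = np.random.default_rng()
    scaled = image.astype(np.float32) / 255.0 * peak
    noisy = rng.poisson(scaled).astype(np.float32) / peak * 255.0
    return np.clip(noisy, 0, 255).astype(np.uint8)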
3.4 Selected data
NightOwls and Caltech Pedestrian are two extensive datasets that have been created to improve the accuracy of pedestrian detection [27], [30]. A few samples taken from both datasets are presented in figure 3.8.
Figure 3.8: Images from A: Caltech Pedestrian, B: NightOwls data sets.
The NightOwls dataset is a collection of images and video frames used for pedestrian detection at night. The dataset consists of 279k frames in 40 video sequences recorded at night across 3 countries by an industry-standard camera, covering different seasons and weather conditions [30]. All the frames are fully annotated with 3 different classes (pedestrian, bicycle driver, motorbike driver). Low illumination, reflections, blur, noise and changing contrast make this dataset a perfect candidate for low-light enhancement purposes.
In contrast, the Caltech Pedestrian dataset is a collection of high contrast, low noise daytime images [27]. The dataset consists of 250k fully annotated images taken in California.
Both datasets were collected by a camera attached to a car, and annotation is performed according to the Caltech (VBB) annotation format [27]. Occluded objects along with foreground and background pedestrians make these datasets suitable for the purposes of the project.
Table 3.2 summarizes the main characteristics of both datasets.

Dataset     Images  Image size  Annotated pedestrians  Object ID
NightOwls   281k    1024×640    55k                    Yes
Caltech     250k    640×480     350k                   Yes
Table 3.2: Main characteristics of the NightOwls and Caltech Pedestrian datasets.
3.5 Experimental design
In order to investigate detection accuracy on SR generated and low-light enhanced images, a set of experiments was designed. These experiments are illustrated in figure 3.9.
Detection was evaluated on ground truth images (normal resolution and ground truth illumination) and on 4x down-sampled LR images in order to provide a reliable reference point for the investigation. Further, the impact of SR generation and of low-light enhancement on detection accuracy was investigated separately. Thereafter, a combination of SR generation with LL enhancement was tested and detection for the respective experiment was evaluated.
Figure 3.9: Planned experiments
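The two cascade orders evaluated in the combined experiments can be summarized by the sketch below, where sr_model and ll_model are hypothetical callables wrapping the trained SR and LL enhancement networks.

def sr_then_ll(image, sr_model, ll_model):
    # Super-resolution first, then low-light enhancement.
    return ll_model(sr_model(image))

def ll_then_sr(image, sr_model, ll_model):
    # Reversed order: low-light enhancement first, then super-resolution.
    return sr_model(ll_model(image))

# Detection with YOLOv3 is then run on the image returned by either cascade.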
Chapter 4
Evaluation
This chapter covers details of data collection and sampling along with
training procedures for the selected CNNs models. Evaluation metrics PSNR
and mAP are covered at the end of this chapter.
4.1 Data preparation
The two datasets selected for the project consist of video sequences, which implies that many frames have an almost identical background. In order to minimize bias, the initial datasets have been sampled with respect to pedestrian identification number.
The Caltech Pedestrian dataset has 2300 images with unique pedestrians, from which 2000 have been selected as training images and the other 300 as test images. The NightOwls dataset has been sampled in a similar manner: 2000 training and 300 test low-light images with unique pedestrians have been collected. Thereafter, the selected images from the NightOwls dataset were cropped with respect to pedestrian ID and bounding box location in order to match the image resolution of the Caltech Pedestrian data (see table 3.2). The obtained images represent the ground truth low-light and normal-light datasets.
Low-resolution data used for SR generation was obtained by bicubic downsampling of the ground truth low-light and normal-light datasets with scaling factor 4x.
The LL enhancement model trains on paired low-light and normal-light images. These images have been prepared by applying gamma correction and Poisson noise injection to the normal-light ground truth data.
The process of data preparation is illustrated in figure 4.1.

Figure 4.1: Selected images from NightOwls and Caltech Pedestrian datasets. Low resolution training data for SR generation has been prepared by bicubic downsampling. Low-light training data for LL enhancement has been prepared by gamma correction and Poisson noise injection.

Annotations for downsampled and cropped images have been adjusted accordingly, and new annotation files have been created in order to evaluate detection accuracy with the selected detector.
4.2 Training and evaluation metrics
This section describes the training process and the evaluation of detection accuracy. All code used in this thesis is an original or modified version implemented with the Keras or PyTorch frameworks. Training of the selected CNNs models was performed on a GTX 970 graphics card with utilization of CUDA.
4.2.1 YOLOv3
An implementation of YOLOv3 called “keras-yolo3: Training and Detecting Objects with YOLO3” was used as the source code for detection purposes. The code is provided under an open source MIT license. The given implementation allows the use of pre-trained YOLO models as well as transfer learning for developing YOLOv3 models on new datasets.
The original weights trained on the COCO dataset and provided by J. Redmon et al. have been used in order to evaluate detection accuracy on generated SR and low-light enhanced images.
Detection precision is evaluated using the average precision (AP) metric, considering only the one class present in the collected dataset. AP is defined as the area under the precision-recall curve and reflects prediction performance over the whole dataset.
Precision measures the percentage of correct predictions and is defined as shown in 4.1.

Precision = \frac{TruePositive}{TruePositive + FalsePositive} \quad (4.1)

Recall, on the other hand, measures the percentage of all positive instances that have been found. Recall is calculated as shown in 4.2.

Recall = \frac{TruePositive}{TruePositive + FalseNegative} \quad (4.2)
True Positive and False Positive detections are defined using the Intersection over Union (IoU) metric, as shown in figure 4.2. IoU evaluates how much the predicted bounding box overlaps with the ground truth bounding box. In this project, IoU = 0.5 was used as the threshold.
Figure 4.2: Intersection over union (IoU)
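A minimal sketch of the IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format is given below; the box format is an assumption, as the annotation format used in practice may differ.

def iou(box_a, box_b):
    # Boxes are given as (x1, y1, x2, y2) corner coordinates.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a True Positive when its IoU with a ground truth
# box reaches the threshold used in this project:
# is_true_positive = iou(pred_box, gt_box) >= 0.5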
4.2.2 D-DBPN
An open GitHub repository for D-DBPN provided by M. Haris et al. contains the original implementation done in PyTorch [3]. This implementation is used as the source code for SR generation in this project.
Due to the limited performance of the available hardware, the training settings described in the original publication have been modified. The training settings suggested by the authors and the training settings evaluated in this project are presented in table 4.1.
D-DBPN            Official                      Evaluated
Learning rate     0.0001                        0.0001
Learning decay    10 for each 500k iterations   10 for each 500k iterations
Optimizer         Adam                          Adam
β1                0.9                           0.9
β2                0.999                         0.999
Batch size        20                            5
LR patch size     32x32                         32x32
Projection units  7                             7
Epochs            200                           20
Table 4.1: D-DBPN training settings
A scale factor of 4x was used for SR image generation. The quality of generated images was evaluated with the PSNR metric, as explained in section 3.2.1.
4.2.3 MBLLEN
An open GitHub repository for MBLLEN provided by Feifan Lv et al. contains the original implementation done in Keras. This implementation is used as the source code for low-light enhancement in this project [4].
The training settings suggested by the authors and the training settings evaluated in this project are presented in table 4.2. The quality of generated images was evaluated by calculating MSE and PSNR as covered in section 3.2.1.
MBLLEN          Official                 Evaluated
Learning rate   0.001                    0.001
Learning decay  5% for each new epoch    5% for each new epoch
Optimizer       Adam                     Adam
β1              0.9                      0.9
β2              0.999                    0.999
epsilon         10^-8                    10^-8
Batch size      16                       4
Epochs          200                      100
Table 4.2: MBLLEN training settings
Chapter 5
Results and Analysis
A set of experiments described in section 3.5 has been prepared and
evaluated in order to understand the impact of low resolution and
illumination on detection accuracy. This section presents the results
obtained by training selected CNNs models.
5.1 SR generation
Data for the SR generation module based on the D-DBPN network was prepared by downsampling 4000 training and 600 test images collected from the NightOwls and Caltech Pedestrian datasets, as described in section 4.1. The downsampled training images were given as input data to D-DBPN. Training was performed with the following settings: 7 projection units, 20 epochs with a learning rate of 0.0001 and a decay of 10% for each 500k iterations, Adam optimizer with β1 = 0.9 and β2 = 0.999, batch size 5 and LR patch size 32x32.
D-DBPN is a very deep network with a large filter size in the projection units (8x8 for scale factor 4x) [3]. For this reason the possibility to perform training with a larger batch size was limited.
The averaged PSNR score and MSE loss evaluated on the 600 downsampled test images, computed for each epoch, are illustrated in figure 5.1.
After 20 epochs D-DBPN achieves a PSNR score of 28.7118, which is a reasonable result considering the limited number of epochs the model was trained for. According to the original publication, the final PSNR score achieved by D-DBPN after 200 epochs is 31.8 [3].
A few patches representing pedestrians were extracted from ground truth, downsampled and SR generated images in order to visualize the achieved results.
Figure 5.1: Resolution reconstruction progress for test data expressed as
PSNR metric and MSE loss for each epoch.
These patches are presented in figure 5.2. The generated images are less sharp than the original ones. This is, however, an expected result considering the subsequent downsampling and upsampling operations. Despite this, D-DBPN provides a good basis for SR image generation, which can be used as a method to support detection of low resolution background objects. In section 5.3 the impact of low resolution on detection accuracy is covered more thoroughly.
Figure 5.2: From left to right: Downsampled, generated SR and ground truth
images
5.2 LL enhancement
Dataset for MBLLEN was prepared by selecting 2000 training and 300 test
images with unique pedestrians from the Caltech Pedestrian dataset.
These images were additionally downsampled with the scaling factor 4x.
Normal resolution and low resolution images have been combined to form
test and training datasets. The synthesized low-light images were obtained
by gamma correction and noise injection as was described in 4.1.
The LL enhancement module based on the MBLLEN network has been trained with the following settings: 100 epochs with a learning rate decay of 5% after each epoch, batch size 4, step epoch 200, Adam optimizer with β1 = 0.9, β2 = 0.999 and ϵ = 10−8. The weights ωLow = 4 and ωHigh = 1 for the region loss have been used as suggested by Feifan Lv et al. [4].
The averaged PSNR score and loss, evaluated on 600 normal and low resolution synthesized low-light test images and computed for each epoch, are illustrated in figure 5.3. After 100 epochs, images generated by MBLLEN achieve a PSNR score of 24.6718. The original implementation, evaluated on 16925 images from the PASCAL VOC dataset with almost identical training settings, achieves a PSNR score of 26.57 [4].
Figure 5.3: Illumination reconstruction progress for test data expressed
as PSNR metric and combined loss for each epoch.
Figure 5.4 illustrates achieved enhancement on both normal and low
resolution test samples. MBLLEN successfully enhanced brightness and
contrast in both cases. However denoising was more carefully performed
in normal resolution images.
Figure 5.4: From left to right: Synthesized low-light, enhanced and ground
truth images
5.3 Detection
Object detection has been evaluated by YOLOv3 with the original weights, as explained in section 4.2. YOLOv3 was intentionally not trained on the selected data, in order to highlight differences in detection accuracy evaluated on ground truth and generated images.
The two datasets selected for this project provide annotations for pedestrians. For this reason the AP score was calculated with respect to pedestrian classification accuracy and bounding box alignment.
Low and normal resolution
Initial evaluation of detection accuracy was performed on the test dataset
that consists of 600 normal resolution day and night images selected from
NightOwls and Caltech Pedestrian datasets. Normal resolution data has
been further downsampled in order to define detection accuracy for low-
resolution images.
Normal resolution images achieve 0.6442 AP while downsampled images attain a score of 0.4359 AP. The original mAP that YOLOv3 achieves on the COCO dataset is 51.5 [18]; this, however, is the result of multi-class detection, while the data selected for this project provides annotations exclusively for pedestrians. Figure 5.5 illustrates detected pedestrians in low and normal resolution images.
Figure 5.5: Green: Low resolution and normal resolution detection, Red:
ground truth pedestrians
YOLOv3 successfully detected pedestrians in normal resolution images while failing to detect the same pedestrians in the downsampled images. These results illustrate a strong correlation between image resolution and detection accuracy.
SR resolution
The impact of resolution has been further investigated by using the downsampled dataset for SR image generation. SR images generated by D-DBPN resulted in a 0.5164 AP score, which is lower than the accuracy achieved on normal resolution images but higher than the corresponding accuracy for low resolution images. Detection performed on generated SR images is presented in figure 5.6.
Figure 5.6: Detection performed on the generated SR image provided by D-
DBPN. Green: detected pedestrians, Red: ground truth pedestrians
Provided results indicate that SR generation is helpful to achieve more
precise detection in low resolution images. Quantitative evaluation
performed on low resolution and SR generated images shows that PSNR
metric and mAP accuracy are correlated.
LL enhancement
Further, the impact of LL enhancement on detection accuracy has been investigated. The normal resolution test dataset, including day and night images collected from NightOwls and Caltech Pedestrian, has been enhanced using the MBLLEN model. Detection performed by YOLOv3 then resulted in a 0.6725 AP score, which is higher than the initial accuracy obtained on the ground truth data. This experiment highlights the importance of appropriate illumination for subsequent detection tasks. Figure 5.7 illustrates detection performed on ground truth and LL enhanced images.
Differences between the ground truth and enhanced images are almost undetectable to the human eye. However, even a small improvement results in increased detection accuracy in noisy images taken in a dark environment.
Figure 5.7: A: ground truth images. B: Detection performed on ground
truth images. C: Detection performed on LL enhanced images. Green:
detected pedestrians, Red: ground truth pedestrians
Provided results indicate that LL enhancement used as the
preprocessing step is helpful to achieve more precise detection in low-light
images.
Combined SR resolution and LL enhancement
Results achieved by SR generation and LL enhancement highlight the dependence
between detection accuracy and quality of analyzed images.
Both D-DBPN and MBLLEN networks generate images that result in
increased detection performance. However, so far these networks have
been tested individually. Two experiments have been designed in order to
evaluate detection accuracy on images generated by a combination of
these networks.
The first experiment was performed with a setup of SR generation and subsequent LL enhancement, performed by D-DBPN and MBLLEN respectively. This setup achieved an accuracy of 0.5317 AP. For the second experiment the setup was reversed: images were first LL enhanced and thereafter processed by the SR model. The reversed setup resulted in 0.4662 AP. Both setups were evaluated on the low resolution test dataset obtained by bicubic downsampling.
Figure 5.8 illustrates detection results obtained by cascaded networks.
Figure 5.8: A: Image resolution was first increased by D-DBPN model and
then illumination was enhanced by MBLLEN model. B: Reversed setup.
Illumination was first enhanced by MBLLEN model and thereafter D-DBPN
increased resolution.
5.4 Summary
All results achieved on ground truth images and on images generated by MBLLEN and D-DBPN are presented below.
Among the non-enhanced images, detection performed on normal resolution images achieves the highest AP score, while detection performed on low resolution images achieves the lowest. This result is expected and can be explained by the loss of information during the downsampling procedure.
Detection performed on enhanced, normal resolution images achieves
the highest score among all experiments.
SR generated images result in a better AP score compared to low
resolution images. However this score is not as good as the result achieved
on normal resolution images. SR generation helped to restore some
features that have been lost during downsampling while failing to restore
others.
Detection performed on subsequent SR generation and LL
enhancement achieved two distinct scores. It was more beneficial to first
apply SR generation and after that perform LL enhancement.
Table 5.1 summarizes detection accuracy of the respective experiment.
Detection performed on:                   AP
Normal resolution images                  0.6442
Low resolution images                     0.4359
Normal resolution, LL enhanced images     0.6725
SR generated images                       0.5164
SR generated, LL enhanced images          0.5317
LL enhanced, SR generated images          0.4662
Table 5.1: AP of the respective experiment
Chapter 6
Discussion
Results presented in chapter 5 provide observations regarding training of
selected CNNs models and detection performed on generated images.
These observations along with possible reasons for achieved results are
covered in this chapter.
6.1 SR generation
The SR module based on D-DBPN has been trained with a limited batch size and a limited number of training epochs. The main reason is the complexity and depth of the D-DBPN architecture, which makes it hard to perform training with the available hardware. The limited batch size can partially explain the oscillations in PSNR and MSE loss observed throughout training.
The main side effect of D-DBPN is the modular structure of reconstructed
images. As has been explained in section 3.2.1 the D-DBPN provides SR
generation by splitting input images into several patches. Thereafter the
resolution of each patch is increased individually and resulting high
resolution patches are connected together. The connection boundaries of
these patches tend to become less visible throughout training as can be
seen in figure 6.1.
However, due to the limited number of training epochs, some boundaries can still be observed in the generated SR images. It is unclear how this side effect impacts object detection performance, and further investigation is required.

Figure 6.1: Patch boundaries observed in generated images. From left to right, SR generated images for epoch 5 and 20.
6.2 LL enhancement
Training of the LL enhancement module based on the MBLLEN network has been performed on synthesized low-light data. Ground truth images were subjected to Poisson noise injection and gamma correction: Poisson noise with a peak value of 200 was injected into the normal light images, and gamma correction was performed using γ = 2.5. This is, however, a very rough approximation of real night scenes. This fact, along with the limited amount of data, could explain why training converges already after 30 epochs, resulting in the generation of bright but still noisy images. It is probable that the model learned how to deal with the specific values of gamma correction and injected noise while struggling to handle other, more complicated scenarios. A more general way to represent low-light data is to inject noise and apply gamma correction randomly, with the Poisson noise peak value and the gamma correction γ value drawn from a distribution over a predefined range.
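A minimal sketch of such a randomized synthesis is given below; the uniform sampling distribution and the parameter ranges are illustrative assumptions, since this project itself used the fixed values γ = 2.5 and peak = 200.

import numpy as np

def synthesize_low_light(image, rng=None,
                         exponent_range=(1.5, 3.5), peak_range=(100, 400)):
    # A pixel value in [0, 1] raised to an exponent > 1 is darkened;
    # lowering the Poisson peak makes the result noisier.
    rng = rng or np.random.default_rng()
    exponent = rng.uniform(*exponent_range)
    peak = rng.uniform(*peak_range)
    img = image.astype(np.float32) / 255.0
    darkened = np.power(img, exponent)            # gamma-style darkening
    noisy = rng.poisson(darkened * peak) / peak   # signal-dependent noise
    return np.clip(noisy * 255.0, 0, 255).astype(np.uint8)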
LL enhancement performed on low and normal resolution images provided another important observation: denoising was more successful in normal resolution images. One probable explanation is that the Poisson noise, which was applied pixel-wise, distorted the features of low resolution images to a higher degree. This result could be connected to the structure of the loss function, and more specifically to the feature loss that aims to enhance high level features. It is possible that feature enhancement is more difficult to achieve in low resolution images that contain a lot of noise.
6.3 Detection
6.3.1 Normal vs low resolution
Normal resolution images achieve higher detection accuracy compared to low resolution images, as shown in table 5.1. This phenomenon has been investigated in the literature, and the detection results achieved on low resolution data are therefore expected.
The structure of the YOLOv3 network provides detection at three different scales, which significantly improves detection performance for small background objects. However, the results achieved indicate that the selected detector cannot provide sufficient detection of all annotated pedestrians in low resolution images. This implies that an SR generation step could be considered as a way to support robust object detection.
6.3.2 SR generation
The D-DBPN model has been used to generate images of higher resolution. The question, however, was whether the structure of D-DBPN, optimized with the PSNR metric, is sufficient for detection purposes. Generated SR images have been processed by the YOLOv3 detector and the corresponding AP score was evaluated. The achieved results showed that SR generation improved detection accuracy, resulting in a higher classification score and better alignment of bounding boxes.
In some cases it was easier to detect pedestrians positioned in the background than pedestrians positioned in the foreground. One such example, with detection performed on a low resolution and an SR generated image, is illustrated in figure 6.2. The reason for this behaviour could be related to the nature of the SR patch generation provided by D-DBPN, a process described in section 6.1.

Figure 6.2: From left to right: low resolution and SR generated image. The pedestrian positioned in the background has been successfully detected while the pedestrian in the foreground is ignored.
6.3.3 LL enhancement
LL enhancement has been performed on low-light and normal light images of normal resolution. Detection performed on LL enhanced, normal resolution images achieved the highest accuracy, as presented in table 5.1. The boost in detection performance is observed more frequently in enhanced low-light images. That is due to the region loss function incorporated into the MBLLEN network, which puts more weight on dark regions. The region loss implies that normal-light images will not be enhanced as much as low-light images.
This experiment proved that synthesized low-light images obtained with gamma correction and Poisson noise injection can be used as a training dataset in the absence of real data. However, the nature of underexposed scenes is much more complicated, and further investigation of low-light data generation is required.
6.3.4 Subsequent LL enhancement and SR generation
The final experiments with subsequent SR generation and LL enhancement resulted in two distinct detection scores, as presented in section 5.3. Both experiments improved detection accuracy compared to detection performed on low-resolution data. However, images generated by the SR+LL combination achieve a higher AP score than those generated by the LL+SR combination.
This result stems from the inability of MBLLEN to perform denoising on low resolution images, as explained in section 6.2. The remaining noise is simply amplified by SR generation, which leads to poor detection performance. These results highlight the importance of denoising methods for low-light images.
Another possible explanation is the structure of the loss function provided by the MBLLEN model. The feature loss function incorporated into the MBLLEN network tries to enhance detection critical features. These features are very relevant for SR generated images, which generally lack sharpness. This could be the reason why images that have been SR generated first and thereafter LL enhanced resulted in the highest AP score among all experiments performed on low resolution data.
6.4 Ethics and sustainability
CNNs based object detection systems decrease the risk of injuries in traffic and lower production energy consumption while increasing production performance. Despite all the positive impacts of CV systems, there are many concerns regarding ethics and transparency. These concerns are related to the increased usage of detection algorithms for people tracking and face recognition. Many of the available datasets for pedestrian detection are collected without the consent of the people present in the images. Additionally, there are serious concerns regarding the usage of object detection for military purposes.
Chapter 7
Conclusions and Future work
7.1 Conclusions
This thesis covered the problem of object detection in low resolution
images taken in various illumination conditions. The main goal was to
improve object detection by SR generation and LL enhancement. Images
generated by the SR module based on D-DBPN network and LL
enhancement module based on MBLLEN network have been used to
evaluate detection performance provided by the detector YOLOv3.
Several experiments have been performed on normal resolution, low-
resolution and generated data. Corresponding results have been analysed.
Achieved results confirm that CNNs based models for SR generation and
LL enhancement improve detection accuracy in general. These models can
be further combined and used as a cascade in order to boost detection
accuracy.
Performed analysis provided answers to the research questions and
highlighted possibilities for further improvement formulated as future work.
How do ML methodologies and structure of CNNs benefit robust object
detection and tracking in low resolution images taken in dark environments?
CNNs models for SR generation and LL enhancement provide opportunities
to improve resolution and enhance contrast and brightness in low
resolution, underexposed images. Images with increased resolution and
enhanced contrast boost detection performance and achieve higher AP score
compared to the low resolution, low-light data. The flexibility of CNNs models allows them to utilize distinct loss functions, which can optimize performance on several levels.
As has been discussed in chapter 6, a loss function based on feature extraction is critical for subsequent detection tasks. Hence it is important to enhance high level features while improving the overall visual quality of the generated images.
Additionally CNNs based models can be trained on datasets that
include noise which provides additional opportunities for low-light image
denoising.
What data and data augmentation methods can be used to support
training?
Data collection and preparation is the essence of CNNs based models. In the case of SR generation, the training data can be obtained by downsampling ground truth images. The training data for LL enhancement, however, is almost impossible to collect: following the supervised learning approach, paired data representing the same scene under different illumination conditions is required. Augmentations such as Poisson noise injection and gamma correction can be used to simulate a night environment.
Are CNNs based models able to improve detection accuracy by cascading
SR and LL modules? In what order should this cascading be performed?
LL enhancement performed on normal resolution images achieves better denoising compared to enhancement performed on low resolution images. For this reason it is preferable to first increase the resolution of the analyzed images and perform enhancement afterwards, in order to avoid amplification of the noise present in the images.
The effect of amplified noise is more visible when low resolution images are first enhanced and thereafter SR generated. The detection performance achieved with the LL+SR cascade validates these observations, resulting in a lower AP score.
A cascade of SR+LL models resulted in the highest AP score achieved on low resolution, low-light data, which illustrates the possibilities of cascaded CNNs models.
7.2 Limitations
Detection accuracy evaluation has been performed on a single class dataset. The achieved AP score for the respective experiment reflects detection accuracy with respect to pedestrians only. A more extensive evaluation performed on multi-class data could yield different results.
Despite the increased detection accuracy achieved on generated SR images, there is still a large gap compared to the detection accuracy achieved on normal resolution data. The D-DBPN model needs to be trained for more epochs and with a larger dataset in order to provide a more accurate reconstruction.
LL enhancement provided by MBLLEN cannot eliminate all noise present in the images, resulting in saturated pixels.
7.3 Future work
SR generation provided by D-DBPN is optimized according to the PSNR metric. However, considering the detection task, it is more important to restore detection critical features than to achieve satisfying visual quality. One possible improvement is the implementation of a feature driven loss that would help to reconstruct high level features and support detection [20]. Another reasonable improvement is the implementation of the task driven loss described in [34]. A task driven loss for the purposes of this project could be based on detection accuracy, with optimization performed with respect to the attained AP score.
The ability to perform appropriate denoising while improving contrast
and brightness is critical for robust detection tasks. The training dataset
synthesized for MBLLEN is a rough approximation of the complex nature of
underexposed images. A further analysis of low-light noise and appropriate
methods for noise modeling should be performed.
Limited low-light data is the main factor that affects the ability of MBLLEN to provide proper enhancement. There are only a few available datasets that consist of paired low- and normal light images that can be used as training data, and the size of these datasets is usually limited to several hundred images. A more extensive image collection with similar qualities could lead to a better generalized model. Such images could be taken by adjusting the exposure time of the camera.
References
[1] G. J. Suresh Prasad Kannojia, “Effects of varying resolution on
performance of cnn based image classification an experimental
study,” International Journal of Computer Sciences and Engineering,
pp. 1–6, 2018. doi: 10.3115/1073083.1073135. [Online].
Available:
https://www.ijcseonline.org/pdf_paper_view.php?paper_id=
2890&78-IJCSE-04853.pdf
[2] Y. Atoum, “Detecting objects under challenging illumination conditions,” Dissertation, Michigan State University, p. 128, 2018. [Online]. Available: https://d.lib.msu.edu/etd/6986
[3] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1664–1673, 2018. doi: 10.1109/CVPR.2018.00179. [Online]. Available: https://ieeexplore.ieee.org/document/8578277
[4] F. Lv, F. Lu, J. Wu, and C. Lim, “Mbllen: Low-light image/video enhancement using cnns,” British Machine Vision Conference, p. 13, 2018. [Online]. Available: http://bmvc2018.org/contents/papers/0700.pdf
[5] R. Szeliski, Computer Vision Algorithms and Applications, 1st ed., ser.
Texts in Computer Science, 2011. ISBN 1-84882-935-3
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” Commun. ACM, vol. 60,
no. 6, p. 84–90, May 2017. doi: 10.1145/3065386. [Online]. Available:
https://doi.org/10.1145/3065386
[7] R. Keeling, R. Chhatwal, N. Huber-Fliflet, J. Zhang, F. Wei, H. Zhao, S. Ye, and H. Qin, “Empirical comparisons of cnn with other learning algorithms for text classification in legal document review,” 2019 IEEE International Conference on Big Data, pp. 2038–2042, 2019. doi: 10.1109/BigData47090.2019.9006248. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9006248
[8] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” ArXiv e-prints, p. 12, 2015. [Online]. Available: https://www.researchgate.net/publication/285164623_An_Introduction_to_Convolutional_Neural_Networks
[9] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20
years: A survey,” ArXiv e-prints, p. 39, 2019. [Online]. Available:
http://arxiv.org/abs/1905.05055
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, p. 8, 2013. doi: 10.1109/CVPR.2014.81. [Online]. Available: https://ieeexplore.ieee.org/document/6909475
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep
convolutional networks for visual recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–
1916, 2015. doi: 10.1109/TPAMI.2015.2389824. [Online]. Available:
https://ieeexplore.ieee.org/document/7005506
[12] R. Girshick, “Fast r-cnn,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, 2015. doi: 10.1109/ICCV.2015.169. [Online]. Available: https://ieeexplore.ieee.org/document/7410526
[13] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–
1149, 2017. doi: 10.1109/TPAMI.2016.2577031. [Online]. Available:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7485869
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788. doi: 10.1109/CVPR.2016.91. [Online]. Available: https://ieeexplore.ieee.org/document/7780460
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” Springer International Publishing, pp. 21–37, 2016. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46448-0_2
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020. doi: 10.1109/TPAMI.2018.2858826. [Online]. Available: https://ieeexplore.ieee.org/document/8417976
[17] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,”
2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6517–6525, 2017. doi: 10.1109/CVPR.2017.690. [Online].
Available: https://ieeexplore.ieee.org/document/8100173
[18] A. Farhadi and J. Redmon, “Yolov3: An incremental improvement,” Tech report, p. 6, 2018. [Online]. Available: https://pjreddie.com/media/files/papers/YOLOv3.pdf
[19] D. Dai, Y. Wang, Y. Chen, and L. Van Gool, “Is image super-resolution helpful for other vision tasks?” IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9, 2016. doi: 10.1109/WACV.2016.7477613. [Online]. Available: https://ieeexplore.ieee.org/document/7477613
[20] B. Wang, T. Lu, and Y. Zhang, “Feature-driven super-resolution for object detection,” International Conference on Control, Robotics and Cybernetics (CRC), pp. 211–215, 2020. doi: 10.1109/CRC51253.2020.9253468. [Online]. Available: https://ieeexplore.ieee.org/document/9253468
[21] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016. doi: 10.1109/TPAMI.2015.2439281. [Online]. Available: https://ieeexplore.ieee.org/document/7115171
[22] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1637–1645, 2016. doi: 10.1109/CVPR.2016.181. [Online]. Available: https://ieeexplore.ieee.org/document/7780550
[23] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2790–2798. doi: 10.1109/CVPR.2017.298. [Online]. Available: https://ieeexplore.ieee.org/document/8099781
[24] C. Shorten and T. M. Khoshgoftaar, “A survey on image data
augmentation for deep learning,” Journal of big data, vol. 6, no. 1,
pp. 1–48, 2019. doi: 10.1186/s40537-019-0197-0. [Online]. Available:
https://doi.org/10.1186/s40537-019-0197-0
[25] K. G. Lore, A. Akintayo, and S. Sarkar, “Llnet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognition, vol. 61, pp. 650–662, 2017. doi: 10.1016/j.patcog.2016.06.008. [Online]. Available: https://doi.org/10.1016/j.patcog.2016.06.008
[26] L. Tao, C. Zhu, G. Xiang, Y. Li, H. Jia, and X. Xie, “Llcnn: A convolutional neural network for low-light image enhancement,” in 2017 IEEE Visual Communications and Image Processing (VCIP), 2017, pp. 1–4. doi: 10.1109/VCIP.2017.8305143. [Online]. Available: https://ieeexplore.ieee.org/document/8305143
[27] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection:
An evaluation of the state of the art,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–
761, 2012. doi: 10.1109/TPAMI.2011.155. [Online]. Available:
https://ieeexplore.ieee.org/document/5975165
[28] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedestrian detection: Benchmark dataset and baseline,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1037–1045. doi: 10.1109/CVPR.2015.7298706. [Online]. Available: https://ieeexplore.ieee.org/document/7298706
[29] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” pp. 1–12, 2018. [Online]. Available: https://www.researchgate.net/publication/327033239_Deep_Retinex_Decomposition_for_Low-Light_Enhancement