AI CURRENTS
AI RESEARCH YOU SHOULD KNOW ABOUT





AI Currents is a report by AI researcher Libby Kinsey
and Exponential View which aims to showcase and
explain the most important research in Artificial
Intelligence and Machine Learning.
Breakthroughs in AI come at an ever more rapid pace.
We, at Exponential View, wanted to build a bridge
between the newest research and its relevance for
businesses, product leaders and data science teams across
the world.
AI Currents is our first attempt at this: to distill the most
important papers of the previous calendar quarter into a
digestible and accessible guide for the practitioner or unit
head. It is not an easy task, with more than 60,000 AI
papers published last year.
We’re grateful to Libby Kinsey for stepping up to the
task!
We hope you enjoy it.
Azeem Azhar
Founder, Exponential View
After twelve years working in technology and VC, I went
back to university to study machine learning at UCL. It
was 2014, DeepMind had recently been acquired by
Google, Andrew Ng’s introductory course on machine
learning was already an online sensation, yet mostly,
machine learning hadn’t hit the mainstream.
Upon graduating, I felt super-powered for all of a few
minutes, and then the truth hit me that I didn’t know
how to do anything… real. I’d focused so much on
learning the maths and implementing the algorithms that
I had a lot of context to catch up on.
Since then, I’ve been trying to parse what is happening in
the world of research into what that means for
commercial opportunity, societal impact and widespread
adoption. I work in a variety of roles, as technologist,
advisor, and analyst, for startups, larger organisations and
government.
In this first edition of AI Currents, I’ve selected five
recent papers and one bigger theme that I think are
interesting from this wider perspective. There’s a mix of
topics: some sit squarely in the ‘fundamental research’
category but are advancing towards capabilities at the
heart of artificial intelligence, some examine the
complexities of deployment, and others address more
application-focused concerns.
Research in AI is moving at such a fast pace that it’s
impossible to cover everything. What I hope this report
does is to slow down, to take papers that are interesting in
their own right but that also act as exemplars for some of
the ways researchers talk about their work or the claims
they sometimes make, and to respectfully consider them
outside of their primary, academic audience. Your time is
precious and the report is already long enough!
Azeem, Marija and I hope that you enjoy this report, that
you learn something and that you’ll let us know what you
think.
Libby Kinsey
AI Researcher
2019: THE YEAR TRANSFORMERS HIT THE BIG
TIME
Not based on a single paper but a chance to look back at a
step-change in natural language understanding and its
journey from experimental architecture to widely available
building block deployed in hyper-scale settings.
About
Language understanding has long been a focus of AI
research, exemplified by Turing’s famous empirical test
for machine intelligence. We can argue whether success in
a Turing-type test actually constitutes ‘intelligence’, but
it’s clear that the practical applications of machines that
(in some sense) understand language are numerous. Such
machines could answer questions, translate from one
language to another, summarise lengthy documents or
conduct automated reasoning; and they could interact
more naturally with, or learn more readily from, humans.
Progress towards language understanding experienced a
leap with the introduction of the ‘Transformer’ by Google
in 2017. The Transformer is a deep learning architecture
that was designed to increase performance on natural
language tasks in an efficient way. Deep neural networks
for learning from text previously used layers based on
local convolutions and recurrence. These analyse words in
the context of a few surrounding words and an
approximate compression of words further away. The
Transformer model combines point-wise convolutions with
an attention mechanism, in particular self-attention,
which allows words to be analysed in a much
wider context – whole surrounding sentences or
paragraphs or more. With it, the team beat previous
state-of-the-art models on English-to-French and
English-to-German translation benchmarks, at a fraction
of the training cost.
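To make ‘self-attention’ concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a Transformer layer. The dimensions and matrices are illustrative stand-ins for learned parameters, not the authors’ code.
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X          : (seq_len, d_model) array, one row of embeddings per token
    Wq, Wk, Wv : learned projections (d_model, d_k), random here for illustration
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scores every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1: where a token 'looks'
    return weights @ V                        # context-weighted mixture of values

# Toy example: 5 tokens, 16-dim embeddings, an 8-dim attention head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8)
```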
Research interest in attention ignited by the Transformer
paper was hot throughout 2018, producing a deluge of
architectural and training refinements and expansion to
many different languages and tasks. But it was the release
of work using simplified structures at a much greater scale
(particularly Google’s BERT and OpenAI’s GPT-2) that
lit up 2019. What this allowed was the ability to
efficiently (everything’s relative) process truly huge
amounts of text so as to learn some general attributes of
language. This knowledge can then be used as a good
starting point for finessing specific language
understanding tasks, and thereby it can vastly reduce the
amount of labelled training data required – and achieve
state-of-the-art results. The first step is known as ‘pre-
training’, the second as ‘fine-tuning’.
Throughout 2019, the likes of Google, OpenAI,
Microsoft and Facebook released pre-trained models.
Now anyone can download one and fine-tune it for their
particular task, so avoiding the prodigious expense of
training from scratch. Hugging Face has collated all of
these pre-trained models under a unified API together
with everything needed to get started. In addition, in only
two years, this novel architecture has gone from being
effectively a prototype to large-scale deployments, such as
in search at Microsoft Bing and Google Search.
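As a hedged illustration of that ‘download a pre-trained model and fine-tune it’ workflow, here is a sketch using the Hugging Face transformers library; the model name and toy sentences are placeholders, not a recommendation.
```python
# pip install transformers torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Use a pre-trained BERT as-is: fill in a masked word using what it learned in pre-training.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("AI research is moving at a [MASK] pace.")[0]["token_str"])

# Or load the same weights as the starting point for fine-tuning on a labelled task
# (here a two-class text classifier); only the small task head starts from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("This report is a digestible guide to recent AI research.", return_tensors="pt")
print(model(**inputs).logits.shape)   # torch.Size([1, 2]) – logits for the two (untrained) classes
```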
Why is it interesting?
The computational parallelism that the Transformer
model allows originally made for faster training times
than the methods that had previously been in vogue.
However, since then, transformers have become
synonymous with massive compute budgets as researchers
test the limits of giant architectures and abundant
training data. The training data is abundant because it’s
just text – often scraped from the internet – and it doesn’t
need labelling as the language models are trained in a
self-supervised manner to predict words contained in the
text.
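A toy sketch of that self-supervised objective: hide some words and use the originals as the targets, so the ‘labels’ come from the text itself. Real systems such as BERT mask subword tokens and add further refinements; this is only an illustration.
```python
import random

def make_masked_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (inputs, targets): targets hold the original word wherever we masked it."""
    random.seed(seed)
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)      # the 'label' is simply the hidden word
        else:
            inputs.append(tok)
            targets.append(None)     # no prediction needed at unmasked positions
    return inputs, targets

text = "breakthroughs in ai come at an ever more rapid pace".split()
print(make_masked_example(text))
```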
If more compute plus more data equals better language
models, then the only limitation to progress is computing
resource, which makes it the purview of only a very small
number of well-resourced entities. This raises concerns
about who can participate in fundamental research and
who will ultimately benefit from it.
Having said that, there’s much to celebrate about how
open research has been to date and how so many pre-
trained language models have been made available for use.
The option of downloading a pre-trained language model
and fine-tuning it for specific applications (instead of
building a task-specific architecture and training it end-
to-end) massively reduces the difficulty of creating
cutting-edge applications. For example, it took about 250
chip-days to pre-train the full-sized version of BERT, but
it requires only a few chip-hours to fine-tune it to a
particular task.
Before we get too carried away, though, there is a reason
to question just what deep language models such as the
Transformer are learning and what they can be expected
to do. Simplistically, these models work by learning
associations between words from how they are used. In
the words of John Firth, ‘You shall know a word by the
company it keeps.’ This results in language generation
that is fluent but not typically coherent. The models have
in some sense acquired knowledge, perhaps about the
structure of language, but they haven’t necessarily learned
about the structure of the world – knowledge that is
required for reasoning and to make inferences, or to
answer common-sense questions.
A naive look at research results might obscure this fact,
since research performance is steadily improving against
benchmark tasks in reading comprehension, question
answering, text summarisation and so on. But benchmark
scores can improve without true understanding, because the
tasks and evaluation metrics used in research only
imperfectly capture what we want from language
understanding systems and are themselves subject to
ongoing interrogation and refinement.
That’s one charge against transformers, that they haven’t
learned anything about language that can be equated with
intelligence or understanding. Gary Marcus is typically
forthright on this topic. Another charge is that their
successes have been such that they have diverted attention
and funding from other approaches, potentially less
resource-hungry ones. Stephen Merity’s ‘Single Headed
Attention RNN: Stop Thinking With Your Head’ paper
is an amusing examination of this point. Perhaps better
evaluation metrics, along with less-hyped results, could’ve
helped avoid too much focus on one approach, but papers
like his do offer glimmers of hope to those not affiliated
with FAANG budgets that alternative approaches are
worthwhile.
Finally, there’s been no end of hoo-ha about the potential
misuse of transformers, specifically the risk of mass-
production of high-quality ‘fake news’ and propaganda, or
of propagating biases learned from the text corpuses they
train on. OpenAI staged the release of their language
model, finally releasing the largest, best-performing one
only in November last year, after conducting a year-long
conversation about what responsible publication looked
like. No one would think to do this if the text outputted
by transformer models weren’t so convincing.
So, yes, there are limitations to what transformers can do,
but they’ve really made a splash, they’ve accelerated from
experimental architecture to production at remarkable
speed, and anyone can take advantage of their power by
downloading a pre-trained model and fine-tuning it for
their specific uses.
More:
A Transformer reading list:
The original Transformer paper: ‘Attention Is All
You Need’ (paper / blog)
Google’s BERT paper: ‘BERT: Pre-Training of Deep
Bidirectional Transformers for Language
Understanding’ (paper / blog)
OpenAI’s GPT-2 paper: ‘Language Models Are
Unsupervised Multitask Learners’ (paper / blog)
An examination of model performance: ‘What Is
BERT Actually Learning? Probing Neural Network
Comprehension of Natural Language Arguments’
(paper)
Gary Marcus’s talk at the NeurIPS 2019 workshop
on context and compositionality in biological and
artificial systems, ‘Deep Understanding, the Next
Challenge for AI’ (video)
An argument against too much focus on
transformers: ‘Single Headed Attention RNN: Stop
Thinking With Your Head’ (paper)
Starting to probe how transformer models actually
work: ‘How Much Knowledge of Language Does
BERT Actually Learn? Revealing the Dark Secrets of
BERT’ (paper / blog)
OpenAI’s GPT-2 1.5bn parameter pre-trained model
is released (blog)
Google AI’s MEENA 2.6bn parameter chatbot:
‘Towards a Human-Like Open-Domain Chatbot’
(paper / blog)
‘Turing-NLG: A 17bn parameter Language Model
by Microsoft’ (blog)
Try generating some text here: Write with
Transformer: Get a modern neural network to auto-
complete your thoughts
Deep Learning for Symbolic Mathematics
G. Lample and F. Charton / December 2019 / paper
Abstract
Neural networks have a reputation for being better at
solving statistical or approximate problems than at
performing calculations or working with symbolic data.
In this paper, we show that they can be surprisingly good
at more elaborated tasks in mathematics, such as symbolic
integration and solving differential equations. We propose
a syntax for representing mathematical problems, and
methods for generating large datasets that can be used to
train sequence-to-sequence models. We achieve results
that outperform commercial Computer Algebra Systems
such as Matlab or Mathematica.
About
In this work, the authors generate a dataset of
mathematical problem-solution pairs and train a deep
neural network to learn to solve them. The particular
problems in question are function integration and
ordinary differential equations of the first and second
order, which many of us will remember learning to solve
using various techniques and tricks that required quite a
lot of symbol manipulation, experience and memory (I
seem to have forgotten all of it – such a waste of all that
study).
By converting the problem-solution pairs into sequences,
the authors were able to use a transformer network and
hence to take advantage of the attention mechanisms that
have been found so useful in processing text sequences.
They evaluated their trained model on a held-out test set,
deeming the solution ‘correct’ if any one of the top 10
candidate solutions outputted by the transformer was
correct. In practice, more than one may be correct as they
found that many outputs were equivalent after
simplification, or differed only by a constant.
The trained model was able not only to learn how to solve
these particular mathematical problems but also to
outperform the commercial systems (so-called Computer
Algebra Systems) that use complex algorithms to do the
same thing, albeit with those systems constrained to a
time cut-off.
Why is it interesting?
In Section 1, we saw how transformers (sequence
modelling with attention) have become the dominant
approach to language modelling in the last couple of
years. They’ve also been applied with success to other
domains that use sequential data such as in protein
sequencing and in reinforcement learning where a
sequence of actions is taken. What’s more surprising is
their use here, with mathematical expressions. On the
face of it, these aren’t sequences and ought not to be
susceptible to a ‘pattern-matching’ approach like this.
What I mean by ‘pattern-matching’ is the idea that the
transformer learns associations from a large dataset of
examples rather than understanding how to solve
differential equations and calculate integrals analytically.
It was really non-obvious to me that this approach could
work (despite prior work; see ‘More’ below). It’s one thing
to accept that it’s possible to convert mathematical
expressions into sequence representations; quite another
to think that deep learning can do hard maths!
Maths has been the subject of previous work in deep
learning, but that focused on the intuitively easier
problem of developing classifiers to predict whether a
given solution is ‘correct’ or not. Actually generating a
correct solution represents a step change in utility.
The paper attracted quite a lot of comment on social
media and on the OpenReview platform, highlighting
concerns about whether the dataset that the authors built
introduces any favourable learning biases or whether it
satisfactorily covers all potential cases. There were also
questions about the fairness of comparing the top-10
accuracy of this system against time-limited Computer
Algebra Systems (the benchmark analytical approaches)
that had only one shot at a solution, but I think that the
authors addressed these well in their revised paper.
It’s a stretch to imagine that a deep learning approach will
replace Computer Algebra Systems, but it could readily
support them. It would provide candidate answers to
problems that those systems fail (or take too long) to
compute, since candidate solutions can be readily
checked. Since the cost of checking solutions is low,
having 10 candidates need not be considered an
impediment. I’d love to see some commentary on this
from any of the Computer Algebra Systems publishers,
but I haven’t been able to find anything yet.
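To see why checking candidates is cheap, here is a toy SymPy sketch (not the authors’ pipeline): verifying a proposed antiderivative only requires differentiating it, which is mechanical and fast.
```python
import sympy as sp

x = sp.symbols("x")
integrand = x * sp.cos(x)                 # the problem: find an antiderivative of x*cos(x)
candidate = x * sp.sin(x) + sp.cos(x)     # a candidate answer a model might output

# Differentiate the candidate and compare with the integrand: zero difference means correct.
assert sp.simplify(sp.diff(candidate, x) - integrand) == 0
print("candidate verified")
```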
What other seemingly ‘symbolic’ problems will prove
susceptible to a pattern-matching approach?
Will we see the impact of this in the next year?
It’s possible that we will see commercial Computer
Algebra Systems integrating deep learning approaches
like this one in the next year, but the biggest impact is
likely to be in inspiring the application of transformers to
other domains in which there is readily available (or
synthesizable) labelled data that can be transformed into
sequences. In language translation, the sequence [je suis
étudiant] maps to the sequence [I am a student]. The
mathematical expression 2 + 3x² can be written as a
sequence in normal Polish (prefix) notation, [ + 2 * 3 pow x 2 ],
with derivative [ * 6 x ].
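As a sketch of what such a sequence representation can look like in code, here is a SymPy-based routine that flattens an expression tree into prefix (Polish-notation) tokens; the tokenisation details are illustrative and differ from the paper’s exact scheme.
```python
import sympy as sp

def to_prefix(expr):
    """Flatten a SymPy expression tree into a prefix (Polish-notation) token list."""
    if expr.is_Atom:                 # numbers and symbols are leaves
        return [str(expr)]
    op = {sp.Add: "+", sp.Mul: "*", sp.Pow: "pow"}.get(type(expr), type(expr).__name__)
    tokens = [op]
    for arg in expr.args:            # recurse over the operator's arguments
        tokens.extend(to_prefix(arg))
    return tokens

x = sp.symbols("x")
expr = 2 + 3 * x**2
print(to_prefix(expr))               # e.g. ['+', '2', '*', '3', 'pow', 'x', '2']
print(to_prefix(sp.diff(expr, x)))   # e.g. ['*', '6', 'x']
```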
More:
Signal: Accepted as a ‘spotlight’ talk at ICLR2020
Who to follow: Paper authors @GuillaumeLample,
@f_charton (Facebook AI Research)
Other things to read:
Neural Programming involves training neural networks to
learn programs, mathematics or logic from data. Some
prior work:
This paper was one of three awarded ‘best paper’ at
ICLR in 2017 and trains a neural network to do
grade-school addition, among other things: ‘Making
Neural Programming Architectures Generalize via
Recursion’ (paper)
This ICLR 2018 paper trained neural networks to do
equation verification and equation completion:
‘Combining Symbolic Expressions and Blackbox
Function Evaluations in Neural Programs’ (paper)
In NeurIPS 2018, Trask et al. proposed a new
module designed to learn systematic numerical
computation that can be used within any neural
network: ‘Neural Arithmetic Logic Units’ (paper)
In ICLR 2019, Saxton et al. presented a new
synthetic dataset to evaluate the mathematical
reasoning ability of sequence-to-sequence models and
trained several models on a wide range of problems:
‘Analysing Mathematical Reasoning Abilities of
Neural Models’ (paper)
Selective Brain Damage: Measuring the
Disparate Impact of Model Pruning
S. Hooker, A. Courville, Y. Dauphin, and A. Frome /
November 2019 / paper / blog
Abstract
Neural network pruning techniques have demonstrated
that it is possible to remove the majority of weights in a
network with surprisingly little degradation to test set
accuracy. However, this measure of performance conceals
significant differences in how different classes and images
are impacted by pruning. We find that certain examples,
which we term pruning identified exemplars (PIEs), and
classes are systematically more impacted by the
introduction of sparsity. Removing PIE images from the
test-set greatly improves top-1 accuracy for both pruned
and non-pruned models. These hard-to-generalize-to
images tend to be mislabelled, of lower image quality,
depict multiple objects or require fine-grained
classification. These findings shed light on previously
unknown trade-offs, and suggest that a high degree of
caution should be exercised before pruning is used in
sensitive domains.
About
A trained neural network consists of a model architecture
and a set of weights (the learned parameters of the
model). These are typically large (they can be very large –
the largest of OpenAI’s pre-trained GPT-2 language
models referred to in Section 1 is 6.2GB!). The size
inhibits their storage and transmission and limits where
they can be deployed. In resource-constrained settings,
such as ‘at the edge’, compact models are clearly
preferable.
With this in mind, methods to compress models have
been developed. ‘Model pruning’ is one such method, in
which some of the neural network’s weights are removed
(set to zero) and hence do not need to be stored (reducing
memory requirements) and do not contribute to
computation at run time (reducing energy consumption
and latency). Rather surprisingly, numerous experiments
have shown that removing weights in this way has
negligible effect on the performance of the model. The
inspiration behind this approach is the human brain,
which loses ‘50% of all synapses between the ages of two
and ten’ in a process called synaptic pruning that improves
‘efficiency by removing redundant neurons and
strengthening synaptic connections that are most useful
for the environment’.
Because it’s ‘puzzling’ that neural networks are so robust
to high levels of pruning, the authors of this paper probe
what is actually lost. They find that although global
degradation of a pruned network may be almost
negligible, certain inputs or classes are disproportionately
impacted, and this can have knock-on effects for other
objectives such as fairness. They call this ‘selective brain
damage’.
The pruning method that the authors use is ‘magnitude’
pruning, which is easy to understand and to implement
(weights are successively removed during training if they
are below a certain magnitude until a model sparsity
target is reached) and very commonly used. The same
formal evaluation methodology can be extended to other
pruning techniques.
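A minimal NumPy sketch of the core of magnitude pruning, applied once to a single weight matrix; in practice (for instance in TensorFlow’s Model Optimization toolkit) sparsity is usually increased gradually during training rather than in one shot.
```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries so that roughly `sparsity` of them are removed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # magnitude of the k-th smallest weight
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0      # pruned weights need not be stored or computed
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
print(magnitude_prune(W, sparsity=0.75))           # roughly 75% of entries are now zero
```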
Why is it interesting?
The paper was rejected for ICLR2020. The discussion on
OpenReview suggested that it was obvious that pruning
would result in non-uniform degradation of performance
and that, having proven this to be the case, the authors
did not provide a solution.
It may indeed be obvious to the research community, but
I wonder whether it is still obvious when it comes to
production. Pruning is already a standard library function
for compression. For instance, a magnitude pruning
algorithm is part of Tensorflow’s Model Optimization
toolkit, and pruning is one of the optimisation strategies
implemented by semiconductor IP and embedded device
manufacturers to automate compression of models for use
on their technologies. What this paper highlights is that a
naive use of pruning in production, one that looks only at
overall model performance, might have negative
implications for robustness and fairness objectives.
Will we see the impact of this in the next year?
I hope we’ll see much more discussion like this in the
next year. As machine learning models increasingly
become part of complex supply chains and integrations,
we will require new methods to ensure that optimisations
such as pruning do not quietly undermine robustness and
fairness objectives.
More:
Signal: Rejected for ICLR2020
Who to follow: @sarahookr
Other things to read:
One of the first papers (published in 1990) to
investigate whether a version of synaptic pruning in
human brains might work for artificial neural
networks: ‘Optimal Brain Damage’ (paper)
This is the paper that introduced me to the idea that
one could remove 90% or more of weights without
losing accuracy, at NeurIPS 2015: ‘Learning Both
Weights and Connections for Efficient Neural
Networks’ (paper)
The ‘magnitude’ pruning algorithm, workshop track
ICLR2018: ‘To Prune, or Not to Prune: Exploring
the Efficacy of Pruning for Model Compression’
(paper)
Explainable Machine Learning in Deployment
U. Bhatt, A. Xiang et al. / 10 December 2019 / paper
Abstract
Explainable machine learning offers the potential to
provide stakeholders with insights into model behavior by
using various methods such as feature importance scores,
counterfactual explanations, or influential training data.
Yet there is little understanding of how organizations use
these methods in practice. This study explores how
organizations view and use explainability for stakeholder
consumption. We find that, currently, the majority of
deployments are not for end users affected by the model
but rather for machine learning engineers, who use
explainability to debug the model itself. There is thus a
gap between explainability in practice and the goal of
transparency, since explanations primarily serve internal
stakeholders rather than external ones. Our study
synthesizes the limitations of current explainability
techniques that hamper their use for end users. To
facilitate end user interaction, we develop a framework for
establishing clear goals for explainability. We end by
discussing concerns raised regarding explainability.
About
The goal of this research was to study who is actually
using ‘explainability’ techniques and how they are using
them. The team conducted interviews with ‘roughly fifty
people in approximately thirty organisations’. Twenty of
these were data scientists not currently using
explainability tools; the other thirty were individuals in
organisations that have deployed explainability
techniques.
The authors supply local definitions of terms such as
‘explainability’, ‘transparency’ and ‘trustworthiness’ for
clarity and because they realised that they needed a shared
language for the interviews, which were with individuals
from a range of backgrounds – data science, academia and
civil society. Therefore, ‘[e]xplainability refers to attempts
to provide insights into a model’s behavior’,
‘[t]ransparency refers to attempts to provide stakeholders
(particularly external stakeholders) with relevant
information about how the model works’, while
‘[t]rustworthiness refers to the extent to which
stakeholders can reasonably trust a model’s outputs’.
The authors found that where explanation techniques
were in use, they were normally in the service of
providing data scientists with insight to debug their
systems or to ‘sanity check’ model outputs with domain
experts rather than to help those affected by a model
output to understand it. The types of technique favoured
were those that were easy to implement rather than
potentially the most illuminating; and causal explanations
were desired but unavailable.
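As a hedged illustration of the kind of quick, easy-to-implement technique interviewees favoured for debugging, here is a permutation feature-importance sketch with scikit-learn; the dataset and model are placeholders, not anything from the study.
```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Fit any model, then ask which input features its predictions actually rely on.
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")   # accuracy drop when this feature is shuffled
```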
Why is it interesting?
Explainable machine learning is a busy area of research,
driven by regulations such as Europe’s GDPR and by its
centrality to the many attempts to articulate what
responsible AI means – either as an explicit goal or as an enabler
for higher-level principles such as justice, fairness and
autonomy (explanations should facilitate informed
consent and meaningful recourse for the subjects of
algorithmic decision-making).
What this paper finds is that the explanation techniques
that are currently available fall short of what is required in
practice, substantively and technically. The authors offer
some suggestions of what organisations should do and
where research should focus next, to achieve the aim of
building ‘trustworthy explainability solutions’. Their
approach of interviewing practitioners – i.e. ‘user-centred
design’ – is one I like and would like to see more of
(the Holstein et al. paper (see ‘More’) is another notable
recent example). It allows for a much broader
consideration of utility than simple evaluation metrics –
in this case, asking how easy a given solution is to use or
to scale, and how useful the explanations it gives are for a
given type of stakeholder – and highlights that there is
often a gap between research evaluation criteria and
deployment needs.
Complementary work has sought to understand what
constitutes an explanation in a given context. I
recommend looking up Project ExplAIn by the UK’s
Information Commissioner’s Office (ICO) and The Alan
Turing Institute, which sought to establish norms for the
type of explanation required in given situations via
Citizen Juries and put together practical guidance on AI
explanation.
Starting with the type of explanation that is expected
allows us to ask ourselves whether it is actually feasible.
This is why it has been argued that there are some
domains in which black box algorithms ought never to be
deployed (see Cynthia Rudin under ‘More’). The finding
that organisations want causal explanations is illustrative
of this concern. Deep learning algorithms are good at
capturing correlations between phenomena, but not at
establishing which caused the other. Attempts to
integrate causality into machine learning are an exciting
frontier of research (see Bernhard Schölkopf under
‘More’).
As such, the paper is a welcome check on the proliferation
of software libraries and commercial services that claim to
offer explanation solutions and from which it’s easy to
imagine that this problem is essentially solved. As lead
author Umang Bhatt said when he presented the paper at
FAT* in January, ‘Express scepticism about anyone
claiming to be providing explanations.’
Will we see the impact of this in the next year?
Like research into machine learning algorithms
themselves, research into explainability has been subject
to cycles of hype and correction, and this is already
leading to more nuanced discussions which should benefit
everyone.
More
Signal: Accepted for FAT*2020
Who to follow: @umangsbhatt, @alicexiang, @RDBinns,
@ICOnews, @turinginst
Other things to read:
Holstein et al.: ‘Improving Fairness in Machine
Learning Systems: What Do Industry Practitioners
Need?’ (paper)
At NeurIPS 2018’s Critiquing and Correcting Trends
in Machine Learning workshop, Cynthia Rudin
argued, ‘Stop Explaining Black Box Machine
Learning Models for High Stakes Decisions and Use
Interpretable Models Instead’ (paper)
Bernhard Schölkopf: ‘Causality for Machine
Learning’ (paper)
The Alan Turing Institute and the Information
Commissioner’s Office (ICO) Project ExplAIn
(Interim report here, Guidance (draft) here). Final
guidance due to be released later in 2020.
International Evaluation of an AI System for
Breast Cancer Screening
S. Mayer McKinney et al. / January 2020 / paper
(viewable, not downloadable without subscription)
Abstract
Screening mammography aims to identify breast cancer at
earlier stages of the disease, when treatment can be more
successful. Despite the existence of screening programmes
worldwide, the interpretation of mammograms is affected
by high rates of false positives and false negatives. Here
we present an artificial intelligence (AI) system that is
capable of surpassing human experts in breast cancer
prediction. To assess its performance in the clinical
setting, we curated a large representative dataset from the
UK and a large enriched dataset from the USA. We show
an absolute reduction of 5.7% and 1.2% (USA and UK)
in false positives and 9.4% and 2.7% in false negatives.
We provide evidence of the ability of the system to
generalize from the UK to the USA. In an independent
study of six radiologists, the AI system outperformed all
of the human readers: the area under the receiver
operating characteristic curve (AUC-ROC) for the AI
system was greater than the AUC-ROC for the average
radiologist by an absolute margin of 11.5%. We ran a
simulation in which the AI system participated in the
double-reading process that is used in the UK, and found
that the AI system maintained non-inferior performance
and reduced the workload of the second reader by 88%.
This robust assessment of the AI system paves the way
for clinical trials to improve the accuracy and efficiency of
breast cancer screening.
About
This paper from DeepMind made a splash at the
beginning of the year when it was published in Nature
and the story was widely picked up by mainstream media
outlets. It describes the successful use of an AI system to
identify the presence of breast cancer from mammograms
and compares its performance favourably against expert
human radiologists in the US and the UK.
The paper is the culmination of slow and careful work
from a multidisciplinary team with the involvement of
patients and clinicians. It relies on annotated
mammography data (via Cancer Research UK and a US
hospital) collected over extended time periods, since the
‘ground truth’ (whether cancer was actually present or
not) requires information from subsequent screening
events.
DeepMind submits that the performance that it achieved
suggests that AI systems might have a role as a decision
support tool, either to reduce the reliance on expert
human labour (such as in the UK setting, which currently
requires the consensus view from two radiologists) or to
flag suspicious regions for review by experts. It is possible
that use of AI detectors could help to detect cancers
earlier than is currently the case and to catch cancers that
are missed, but clinical studies are required to test this
out.
Why is it interesting?
Detecting cancer earlier and more reliably is a hugely
emotional topic and one with news value beyond the
technical media. It’s refreshing to see a counterpoint to
the negative press that has attended fears of AI-driven
targeting, deep fakes and tech-accelerated discrimination
in recent months. I, for one, am starving for evidence of
applications of AI with positive real-world impact. But
does the research justify the hype?
The first thing to note is that this kind of approach to AI
and mammography is not novel; it’s of established interest
in academia and the commercial sector. For instance, a
team from NYU published similar work last summer
comparing neural network performance against
radiologists, and London’s Kheiron Medical is engaged
with clinicians in the NHS to evaluate whether their
‘model is suitable for consideration as an independent
reader in double-read screening programmes’.
DeepMind’s reputation and effective PR department are
perhaps such that the media is more likely to notice its
results than those of others.
Where AI performance is evaluated against clinicians, we
should also be a little careful. The AI system has the
advantage of being able to select the decision threshold
for determining whether cancer is present that best
showcases its abilities. Even with that advantage, it
performed no better overall than the two-reader system
used in the UK (though there were some interesting cases
where the AI system spotted things that humans did not).
This suggests an economic argument for use, but it isn’t
yet representative of a step-change in capability.
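A toy sketch of the threshold point, using made-up scores rather than the study’s data: AUC-ROC summarises performance across all thresholds, but a deployed system must pick one operating point, and each choice trades catching more cancers against recalling more healthy people.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and scores from a hypothetical classifier (for illustration only).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.35 + rng.normal(0.4, 0.25, size=1000), 0, 1)

print("AUC-ROC:", round(roc_auc_score(y_true, y_score), 3))

# Each threshold gives a different trade-off between true positives and false positives.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for t in (0.3, 0.5, 0.7):
    i = np.argmin(np.abs(thresholds - t))
    print(f"threshold ~{t}: TPR={tpr[i]:.2f}, FPR={fpr[i]:.2f}")
```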
There’s still a very long way to go from here to
deployment. First, as the authors note, understanding the
‘full extent to which this technology can benefit patient
care’ will require clinical studies. That means evaluation
of performance in clinically realistic settings, across
representative patient cohorts and in randomised
controlled trials. Then, if the evidence supports
deployment, there are some non-trivial updates to service
design and technology integration required to incorporate
AI into large screening programmes. One might say that
DeepMind has demonstrated that we are at the end of the
beginning.
Scepticism has also been expressed about the use of AI
for mammography at all. There are three major concerns:
1) that research (or perhaps derivative media reports)
might be overstating the importance of results since these
systems have not been tested for robustness and
generalisability in real-world settings; 2) that using AI
will distort outcomes if the wrong questions are asked;
and 3) questions about how the data were obtained, who owns the
models and who benefits.
DeepMind has clearly taken great care in its experiments
and in what it reports, but it has not released its code, and
the supplementary information that it has released is very
light on the model design and implementation details
that would be required to reproduce the experiments it
reports. The US dataset used in the study is not
publicly available, and we do not know the details of the
licence under which access to the UK OPTIMAM dataset
was granted. We don’t know what DeepMind intends to
do with its models, either; thus, we don’t have enough
information to conduct a cost–benefit analysis.
In this light, an uncharitable judgement would be that the
paper published in Nature is more like a white paper than
a research one. However, I am optimistic that research
conducted in step with engagement around the concerns
outlined above will ultimately prove a net positive.
Will we see the impact of this in the next year?
It will take longer than a year to see the results of this
work in any clinically realistic setting. We do appear to be
at an inflexion point for AI radiology generally, with
many companies making progress and starting to move
into trials. This reflects in part the suitability of
radiology for machine learning (since it is tech-driven
with large amounts of data), but is not necessarily
evidence of demand.
More
Signal: Published in Nature; all over the mainstream
press; subject of social media discussion
Who to follow: @DrHughHarvey, @EricTopol,
@DeepMind, @KheironMedical, @screenpointmed and
@Lunit_AI
Other things to read:
Lunit AI (and collaborators) (2020): ‘Changes in
Cancer Detection and False-Positive Recall in
Mammography Using Artificial Intelligence: A
Retrospective, Multireader Study’ (paper)
NYU (2019): ‘Deep Neural Networks Improve
Radiologists’ Performance in Breast Cancer
Screening’ (paper)
Nico Karssemeijer and co-authors (2016): ‘A
Comparison between a Deep Convolutional Neural
Network and Radiologists for Classifying Regions of
Interest in Mammography’ (paper)
View of current status of radiology AI:
From a radiologist’s perspective: ‘RSNA 2019
Roundup’ (blog)
From a machine learning perspective (paywall):
‘Artificial Intelligence for Mammography and Digital
Breast Tomosynthesis: Current Concepts and Future
Perspectives’ (paper)
Mastering Atari, Go, Chess and Shogi by
Planning with a Learned Model
J. Schrittwieser et al. / November 2019 / paper / poster
Abstract
Constructing agents with planning capabilities has long
been one of the main challenges in the pursuit of artificial
intelligence. Tree-based planning methods have enjoyed
huge success in challenging domains, such as chess and
Go, where a perfect simulator is available. However, in
real-world problems the dynamics governing the
environment are often complex and unknown. In this
work we present the MuZero algorithm which, by
combining a tree-based search with a learned model,
achieves superhuman performance in a range of
challenging and visually complex domains, without any
knowledge of their underlying dynamics. MuZero learns
a model that, when applied iteratively, predicts the
quantities most directly relevant to planning: the reward,
the action-selection policy, and the value function. When
evaluated on 57 different Atari games - the canonical
video game environment for testing AI techniques, in
which model-based planning approaches have historically
struggled - our new algorithm achieved a new state of the
art. When evaluated on Go, chess and shogi, without any
knowledge of the game rules, MuZero matched the
superhuman performance of the AlphaZero algorithm
that was supplied with the game rules.
About
Reinforcement learning is the subfield of machine
learning concerned with learning through interaction
with the environment. Given an environment and a goal,
a reinforcement learning algorithm tries out (lots of) actions
in order to learn which ones allow it to achieve the goal,
or to maximise its reward.
There are two principal categories of reinforcement
learning: model-based and model-free. As you would
expect, a model-based algorithm has a model of the
environment. That is, let’s say the objective is to learn to
play chess. Then using model-based learning, the
algorithm knows what the legal moves are, given a state
of play, so it can plan (or ‘look ahead’) accordingly to
optimise its next move – high-performance planning.
Model-based reinforcement learning typically works well
for logically complex problems, such as chess, in which
the rules are known or where the environment can be
accurately simulated.
In model-free reinforcement learning, the optimal actions
are learnt directly from interaction with the environment.
The learner encounters a state that it’s seen before, but it
does not know what the allowed moves are, only that one
action resulted in a better outcome than another. Model-
free reinforcement learning tends not to work well in
domains that require precision planning, but it is the
state-of-the-art for domains that are difficult to precisely
define or simulate, such as visually rich Atari games.
In this paper, DeepMind presents MuZero, a new
approach to model-based reinforcement learning that
combines the benefits of both high-performance planning
and model-free reinforcement learning by learning a
representation of the environment. The representation it
learns is not the ‘actual’ environment but a pared-down
version that needs only to ‘represent state in whatever way
is relevant to predicting current and future values and
policies. Intuitively, the agent can invent, internally, the
rules or dynamics that lead to most accurate planning.’ It
matches the performance of existing model-based
approaches in games such as chess, Go and Shogi and
also achieves state-of-the-art performance in many Atari
games.
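A heavily simplified sketch of the three learned functions the paper describes and of planning inside the learned model. The stand-in ‘networks’ here are random linear maps and the look-ahead is greedy, whereas MuZero uses deep networks and Monte Carlo tree search; nothing here is DeepMind’s code.
```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 8, 4
H = rng.normal(size=(16, STATE_DIM))                  # stand-in for the representation network
G = rng.normal(size=(STATE_DIM + N_ACTIONS, STATE_DIM))
P = rng.normal(size=(STATE_DIM, N_ACTIONS))

def representation(observation):
    """Raw observation -> internal, learned state (not the 'real' environment state)."""
    return np.tanh(observation @ H)

def dynamics(state, action):
    """Internal state + action -> next internal state and predicted reward."""
    one_hot = np.eye(N_ACTIONS)[action]
    next_state = np.tanh(np.concatenate([state, one_hot]) @ G)
    return next_state, float(next_state.sum())        # toy reward head

def prediction(state):
    """Internal state -> policy over actions and a value estimate."""
    logits = state @ P
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, float(state.mean())                # toy value head

def plan(observation, depth=3):
    """Greedy look-ahead entirely inside the learned model; the real rules are never consulted."""
    state = representation(observation)
    total_return = 0.0
    for _ in range(depth):
        policy, _ = prediction(state)
        action = int(np.argmax(policy))
        state, reward = dynamics(state, action)
        total_return += reward
    return total_return

print(plan(rng.normal(size=16)))
```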
Why is it interesting?
DeepMind has been making headlines since 2013 when it
first used deep reinforcement learning to play ’70s Atari
games such as Pong and Space Invaders. In a series of
papers since then, DeepMind has improved its
performance in the computer games domain (achieving
superhuman performance in many Atari games and
StarCraft II) and also smashed records in complex
planning games such as chess, Go and Shogi (Japanese
chess) that have previously been tackled with ‘brute force’
(that is to say, rules plus processing power, rather than
learning).
DeepMind’s researchers show with this latest paper that
the same algorithm can be used effectively in both
domains – planning and visual ones – where before
different learning architectures were required. It does this
by taking DeepMind’s AlphaZero architecture (which
achieved superhuman performance in chess, Go and
Shogi in 2017) and adding the capability to learn its own
model of the environment. This makes MuZero a general
purpose reinforcement learning approach.
To be clear, MuZero, which was not supplied with the
games’ rules, matched the performance of AlphaZero,
which was. It also achieved state-of-the-art performance
on nearly all of the Atari-57 dataset of games by some
margin. That’s an impressive arc of achievement from
2013 to now.
But DeepMind has always had a bigger picture in mind
than success in games and simulated environments, and
that is to be able to use deep reinforcement learning in
complex, real-world systems, enabling it to model the
economy, the environment, the weather, pandemics and
so on. MuZero takes us one step closer to being able to
apply reinforcement learning methods in such situations
where we are not even sure of the environment dynamics.
In the authors’ words, ‘our method does not require any
knowledge of the game rules or environment dynamics,
potentially paving the way towards the application of
powerful learning methods to a host of real-world
domains for which there exists no perfect simulator’.
This transition from constrained research problems to
real-world applicability may still be a long way off, but we
can already see distinct research problems that would
extend MuZero’s capabilities on this path. At the
moment, MuZero works for deterministic environments
with discrete actions. This means that when an action is
chosen, the effect on the environment is always the same:
if I use my games console control to move right two steps,
my game avatar moves right two steps, for instance. In
many reinforcement learning problems, this is not true,
and we instead have stochastic environments with more
realistic and continuous actions.
Similarly, MuZero was extraordinarily effective at most of
the Atari games it played, but it was really challenged on
a couple, notably Montezuma’s Revenge, which deep
reinforcement learning algorithms always struggle with and which
requires long-term planning.
I look forward to seeing the progress against these and
other challenges, bringing the dream of scaling
reinforcement learning to large-scale, complex
environments that much closer.
Will we see the impact of this in the next year?
Deep reinforcement learning has lagged other machine
learning techniques in transferring from ‘lab to live’. We
have seen applications in autonomous transportation (e.g.
Wayve) and robotics (e.g. Covariant), but in principle, the
ability to adapt to an environment over time to maximise
a reward should have many applications. Research like
MuZero brings these closer.
More
Signal: NeurIPS Deep RL workshop 2019 (video)
Who to follow: @OpenAI, @DeepMind
Other resources:
Select DeepMind deep reinforcement learning papers:
(2013) The original Atari playing deep reinforcement
learning model: ‘Playing Atari with Deep
Reinforcement Learning’ (paper)
(2015) Deep Q-Networks achieve human-like
performance on 49 Atari 2600 games (paywall):
‘Human-Level Control through Deep Reinforcement
Learning’ (paper)
(2015) AlphaGo beats the European Go champion
Fan Hui, five games to zero (paywall): ‘Mastering the
Game of Go with Deep Neural Networks and Tree
Search’ (paper, blog)
(2017) AlphaGo Zero learns from self-play:
‘Mastering the Game of Go Without Human
Knowledge’ (paper, blog)
(2017) A generalised form of AlphaGo Zero,
AlphaZero, achieves superhuman performance in
high-performance planning scenarios: ‘A General
Reinforcement Learning Algorithm that Masters
Chess, Shogi and Go Through Self-Play’ (paper,
blog)
(2019) (paywall): AlphaStar achieves superhuman
performance at StarCraft II: ‘Grandmaster Level in
StarCraft II Using Multi-Agent Reinforcement
Learning’ (paper, blog)
Other papers:
A previous paper with a method that integrates
model-free and model-based RL methods into a
single neural network: Junhyuk Oh, Satinder Singh
and Honglak Lee (2017): ‘Value Prediction Network’
(paper)
Model-based reinforcement learning with a robot
arm: A. Zhang et al. (ICLR2019), ‘SOLAR: Deep
Structured Representations for Model-Based
Reinforcement Learning’ (paper, blog)
Complexity, Information and AI session at CogX
2019 with Thore Graepel, Eric Beinhocker and César
Hidalgo (video)
Libby Kinsey
Libby is an AI researcher and practitioner. She spent ten
years as a VC investing in technology start-ups, and is co-
founder of UK AI ecosystem promoter Project Juno.
Libby is a Dean's List graduate in Machine Learning
from University College London, and has most recently
focused her efforts on working with organisations large
and small, public and private, to build AI capabilities
responsibly.
Azeem Azhar
Azeem is an award-winning entrepreneur, analyst,
strategist and investor. He produces Exponential View, the
leading newsletter and podcast on the impact of
technology on our future economy and society.
Marija Gavrilov
Marija leads business operations at Exponential View.
She is also a producer of the Exponential View podcast
(Harvard Business Review Presents network).
Contact: aicurrents@exponentialview.co