On the causality view of <Context-Aware Learning for Neural Machine Translation>

posted Mar 31, 2019, 4:59 PM by KyungHyun Cho   [ updated Apr 1, 2019, 7:10 AM ]

[Notice: what unfortunate timing! This post is definitely NOT an April Fools' joke.]

Sebastien Jean and I had a paper titled <context-aware learning for neural machine translation> rejected from NAACL'19, which is perhaps understandable because we did not report any substantial gain in the BLEU score. As I finally found some time to read Pearl's <Book of Why> due to a personal reason (yes, personal reasons sometimes can help), I thought I would write a short note on how the idea in this paper was originally motivated. As I was never educated in causal inference or learning, I have been scared of using the term "causal" in any of my papers so far, and this paper was no exception. I feel that this intentional avoidance of the term may have made the paper more obscure, and perhaps it's not a bad idea to use a blog post (and time from that personal reason) to write out my original motivation without worrying about academic scrutiny.

Let me focus on building a translation model that takes as input both the current source sentence X and the previous source sentence C, and outputs the translation Y of the current source sentence X, although there is no reason to restrict C to be only the single immediately preceding sentence. Let's introduce a variable Z that represents all that we do not observe directly, such as the world state, the author's intention and the actual meaning behind the text. You can also think of Z as including both benign and detrimental common sense, such as "bananas are always yellow" (when I was in Rwanda just a few weeks ago, I learned this to be false; see the picture on the left, which I took in Kimironko Market, Kigali) and "presidents are often male; e.g., Monsieur President vs. Madame President", ...

If I were to draw a causal diagram, following Pearl, one version would look like below, where I used a dashed circle to explicitly indicate that Z is not observed:
The document, a part of which is represented by C and X, is created from (caused by) Z. The current sentence X is also caused by Z but not necessarily by its preceding sentence C. This is one assumption that I am not comfortable with, but it can be understood generously if we consider that in many cases we can more easily reorder sentences in a paragraph than reorder words in a sentence. Once we know both X and C, the translation Y of the source sentence X is determined (caused) by the source sentence X and the previous sentence C. Why is there an arrow from C to Y that bypasses X? This is due to the difference between the source and target languages. Consider, for instance, translating from a language without gendered pronouns into one with gendered pronouns.

Based on this diagram, what we next want to know in this "context-aware neural machine translation" is the effect of the previous sentence C on the translation Y of the source sentence X. 

Now, a fair warning before we proceed: because I have only given <Book of Why> a quick read, I may be completely off here.

Let's consider the two paths from C to Y in the diagram above: C->Y and C<-Z->X->Y. The first path corresponds to the direct effect of C on Y, and the second can be thought of as a path with a mediator X. The effect of the cause C on Y will be some function of these two, and if all the relationships are linear, the total effect of C on Y will be the sum of the effects along these two paths.

Now, obviously, it would be best if we could somehow estimate the coefficient (a set of neural net parameters) associated with each arrow in the diagram above. Then we could compute the total effect exactly, and that would be the end of the story of causal inference. Unfortunately, other than the coefficients of the two arrows (C->Y and X->Y), which can be estimated from data by fitting a neural machine translation system, it looks pretty unrealistic for us to estimate the parameters of Z->C and Z->X.

This is where we move away from causal inference and toward machine learning (in particular, machine translation). Instead of trying to estimate those coefficients and infer the causal effect of C on Y, our goal is now to train a neural machine translation system to maximally exploit the effect of C on Y. That is, we train a context-aware neural machine translation system such that the context C maximally influences (causes) Y in addition to the source sentence X, according to the causal diagram above.
Under this goal, the second path C<-Z->X->Y (the path colored blue above) is of interest, as this path contains the two arrows whose coefficients we don't know how to estimate. We notice that this path is blocked by the confounder Z, which we neither observe nor control for (though it could be an interesting exercise in the future to control for Z by finely partitioning a corpus). One classical technique in this case is to run a randomized trial on C, which effectively cuts the arrow from Z to C.
This cut indicates that the choice of C no longer depends on Z. In the case of training a context-aware neural machine translation system, this can be thought of as replacing the previous sentence with a randomly drawn sentence from a large corpus (though it is not at all clear what the distribution should be, and we discuss a few alternatives in Sec. 4.3). Then, by contrasting the effect of C and X on Y with that of a randomly drawn C and X on Y, we can measure the effect of C on Y. This can be expressed in an equation:

Here, r(C) is a randomized context, and we use the conditional log-probability of Y given X and C (or r(C)) as the causal effect (score) s(Y|X,C). This formulation naturally lends itself to a new regularization term that encourages the context-aware neural machine translation system to maximize the effect of the context C on Y. We use the margin loss together with this causal effect at three different levels (minibatch, sentence and token). Here, let me write out the sentence-level regularization term:
Minimizing this term literally maximizes the causal effect of C on Y until it is at least as large as some predefined threshold (δ) multiplied by the length of each sentence.
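In code, the sentence-level term can be sketched in a couple of lines (a minimal sketch; the function and argument names are mine, not the paper's, and the two scores are the summed conditional log-probabilities described above):

```python
def context_regularizer(logp_true_ctx, logp_rand_ctx, target_len, delta=1.0):
    """Sentence-level margin loss on the causal effect of the context.

    logp_true_ctx: log p(Y | X, C), summed over the target tokens
    logp_rand_ctx: log p(Y | X, r(C)), with a randomly drawn context r(C)

    The causal effect s(Y|X,C) - s(Y|X,r(C)) is pushed up until it reaches
    delta * target_len, after which the loss vanishes.
    """
    effect = logp_true_ctx - logp_rand_ctx
    return max(0.0, delta * target_len - effect)
```

Averaged over randomized contexts and added to the usual log-likelihood objective, this term penalizes a model whose predictions do not degrade when the true context is swapped out.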

We call this regularization technique "context-aware learning" (or context-aware regularization), as I was actively avoiding the term "causal" anywhere. Indeed, this technique helps in the sense that the final, trained neural machine translation system actually degrades when a wrong context is provided, as opposed to a usual context-aware translation system, which is often trained without considering this causal effect. Compare (c) and (d) below while contrasting the columns "Normal" and "Context-Marginalized". We also observed some improvement even when the correct context was given (Normal), but the reviewers were not impressed.

As you may have noticed, this approach is agnostic to the underlying machine translation system. As long as you can train the underlying system with the proposed regularization term, this framework carries over very naturally. It is furthermore decoupled from the actual problem of machine translation. The proposed approach can be applied to any other problem where we have a set of input modalities, of which some are only weakly correlated with the output but are known to cause it.

Phew, there you go! I'm glad that I found some time today to fulfill my deep desire to say "causality" out loud. 

PS1. I had another ill-fated attempt to apply this framework to generic supervised (unsupervised) learning and explain it without mentioning anything about causality or randomized trials: Though, I cannot tell whether Adji Dieng noticed this :)

PS2. The diagram above is slightly less satisfying, as there is no arrow from C to X. A natural next step would be the following:
We would probably want to randomize both C and X, though I am pretty sure there must be better ways to do so.

PS3. While this paper was under review at NAACL'19, I saw a talk by Natasha Jaques, who visited NYU. Her work nicely incorporated counterfactual analysis (now individual-level causal inference) into learning a set of coordinating neural net agents, in a manner similar to my paper. Definitely worth a read: Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning.

Lecture note <Brief Introduction to Machine Learning without Deep Learning>

posted Jul 16, 2017, 12:01 PM by KyungHyun Cho   [ updated Jul 16, 2017, 12:07 PM ]

This past Spring (2017), I taught the undergrad <Intro to Machine Learning> course. This was not only the first time for me to teach <Intro to Machine Learning> but also the first time for me to teach an undergrad course (!) This course was taught a year earlier by David Sontag, who has now moved to MIT. Obviously, I thought about re-using David's materials as they were, which you can find on his course web page. These materials are really great, and the coverage of various topics in ML is simply amazing. I highly recommend all the materials on that web page. All the things you need to know in order to become a certified ML scientist can be found there.

I, however, felt that this great coverage might not be appropriate for an undergrad intro course, and also that I wasn't qualified to talk about many of those topics without spending a substantial amount of time studying them myself first. Then, what could/should I do? Yes, I decided to re-create the whole course with two questions in mind. First, what's the minimal set of ML knowledge necessary for an undergrad to (1) grasp at least the high-level view of machine learning and (2) use ML in practice after they graduate? Second, what are the topics in ML that I could teach well without having to pretend to know them in depth? With these two questions in mind, as in the previous year for the NLP course, I started to write a lecture note as the semester went on. At the end of the day (or semester), I feel like I've taken a step in the right direction, though with much to be improved in the future.

I started with classification. Perceptron and logistic regression were introduced as examples showing the difference between traditional computer science (design an algorithm that solves a problem) and machine learning (design an algorithm that finds an algorithm for solving any given problem). I then moved on to defining the (linear) support vector machine as a way to introduce various loss functions and regularization. I gave up on teaching kernel SVMs due to the time constraint, though. Logistic regression was then generalized to multi-class logistic regression with softmax.
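The loss-function view above can be made concrete in a few lines (a sketch of my own, not course material): both classifiers reduce to a loss on the margin m = y·(w·x), and only the shape of that loss differs.

```python
import math

def hinge_loss(margin):
    # Linear SVM: zero loss once the margin exceeds 1, linear penalty otherwise.
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    # Logistic regression: a smooth loss that is never exactly zero,
    # but has the same asymptotic slope for badly misclassified points.
    return math.log(1.0 + math.exp(-margin))
```

Adding an L2 penalty on w to either loss gives the regularized objectives discussed in class.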

For teaching how to deal with problems that are not linearly separable, I decided on an unorthodox approach. I started with a nearest-neighbour classifier, extended it into a radial basis function network with fixed basis vectors, and then into an adaptive basis function network, which I dubbed deep learning (which is true, by the way). At this point, I think I lost about half of the class, but the other half, I believe, was able to follow the logic, based on their performance in the final exam. I should've talked about kernel methods here, but well, it's not like I can spend the whole semester solely on classification.
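The middle step of that progression, the radial basis function network with fixed basis vectors, can be sketched as follows (my own illustration, not the course code; `gamma` and the plain-Python feature map are assumptions):

```python
import math

def rbf_features(x, centers, gamma=1.0):
    """Map x to one Gaussian similarity per fixed basis vector (center)."""
    return [math.exp(-gamma * sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
            for c in centers]
```

With the centers fixed to the training points, a linear classifier on these features behaves like a soft nearest-neighbour classifier; letting the learner adapt the centers (and gamma) turns this into an adaptive basis function network, i.e. a one-hidden-layer neural network.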

Then I moved on to regression. Here I focused on introducing probabilistic ML. To do so, I had to spend two hours recapping probability itself. I introduced Bayesian linear regression and discussed how it relates to linear regression with a Gaussian prior on the weight vector. This naturally led to a discussion of how to do Bayesian supervised learning. I wanted to show them Gaussian process regression, but again, there wasn't enough time.
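The correspondence mentioned above can be written out in one line (a standard derivation, with my own choice of symbols): with a Gaussian likelihood $y_n \sim \mathcal{N}(w^\top x_n, \sigma^2)$ and a Gaussian prior $w \sim \mathcal{N}(0, \alpha^2 I)$, the MAP estimate of the weight vector is exactly $\ell_2$-regularized (ridge) linear regression:

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w} \; \sum_{n} \log \mathcal{N}(y_n \mid w^\top x_n, \sigma^2)
    + \log \mathcal{N}(w \mid 0, \alpha^2 I)
  = \arg\min_{w} \; \sum_{n} (y_n - w^\top x_n)^2
    + \frac{\sigma^2}{\alpha^2} \lVert w \rVert^2 .
```

The full Bayesian treatment keeps the entire posterior over $w$ rather than just this point estimate, which is what distinguishes Bayesian linear regression from plain ridge regression.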

For unsupervised learning, I again took an unorthodox route by putting (almost) everything under matrix factorization (X=WZ) with a reconstruction cost and varying constraints. PCA and NMF were discussed in depth under this view, and sparse coding and ICA were briefly introduced. k-means clustering was also introduced as a variant of matrix factorization, and the hard EM algorithm was (informally) derived from minimizing the reconstruction error with the constraint that the code vectors (Z) be one-hot. This whole matrix factorization view was then extended to deep autoencoders and to (metric) multi-dimensional scaling. Surprisingly, students were much more engaged with unsupervised learning than with supervised learning, and at this point, I had regained the half of the class I lost when I was teaching them nonlinear classifiers.
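The k-means-as-matrix-factorization view can be sketched in a few lines (my own illustration, not the course code): with one-hot code vectors, minimizing the reconstruction error alternates between a hard assignment step (E) and a prototype-mean update (M).

```python
import random

def kmeans(X, k, iters=20):
    """Hard EM for X ≈ WZ with one-hot code vectors (a sketch).

    E-step: each code vector becomes one-hot on the nearest prototype.
    M-step: each prototype (column of W) becomes the mean of its cluster.
    Both steps lower the reconstruction error sum_n ||x_n - W z_n||^2.
    """
    W = [list(w) for w in random.sample(X, k)]  # initial prototypes
    z = [0] * len(X)
    for _ in range(iters):
        # E-step: index of the nearest prototype (the one-hot position)
        z = [min(range(k),
                 key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, W[j])))
             for x in X]
        # M-step: recompute each prototype as the mean of its assigned points
        for j in range(k):
            members = [x for x, zj in zip(X, z) if zj == j]
            if members:
                W[j] = [sum(col) / len(members) for col in zip(*members)]
    return W, z
```

Relaxing the one-hot constraint on Z recovers the other factorizations in the same family (e.g., unconstrained Z with orthogonal W gives PCA).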

The course ended with a final lecture in which I briefly introduced policy gradient. This was again done in a rather unorthodox way, by viewing RL as a sequence of classifiers. I'm quite sure RL researchers would cry over my atrocity here, but well, I thought this was a more intuitive way of introducing RL to a bunch of undergrad students with highly varying backgrounds. Though, now that I think about it, it may have been better simply to play them the RL intro lecture by Joelle Pineau:

Anyways, you can find a draft of my lecture note (which will forever be a draft until I retire from the university) at 

Any suggestion or PR is welcome at 

However, do not expect them to be incorporated quickly, as I'm only planning to revise it next Spring (2018).

During the course, I showed the students the following talks here and there to motivate them (and to give myself some time to breathe):

to arXiv or not to arXiv

posted Feb 12, 2016, 3:56 PM by KyungHyun Cho   [ updated Feb 12, 2016, 4:36 PM ]

I believe it is a universal phenomenon: when you're swamped with work, you suddenly feel the irresistible urge to do something else. This is one of those something else.

Back in January (2016), right after the submission deadline of NAACL'16, Chris Dyer famously (?) posted on his Facebook wall, "to arxiv or not to arxiv, that is the increasingly annoying question." This question of whether "to arxiv or not to arxiv" a conference submission that has not yet gone through peer review has indeed become a thorny issue in the field of machine learning and the wider research community around it, including natural language processing.

Perhaps one of the strongest proponents of "to arXiv" is Yann LeCun at NYU & Facebook. In his "Proposal for a new publishing model in Computer Science," he argues that "[m]any computer [s]cience researchers are complaining that our emphasis on highly selective conference publications, and our double-blind reviewing system stifles innovation and slow[s] the rate of progress of [s]cience and technology." This is a valid concern, as we have observed that the rate of progress in computer science has largely overtaken the speed of the publication process. Furthermore, as the focus (and assessment) has moved from journals to so-called top-tier conferences, more and more papers get stuck in the purgatory of submit-review-reject-resubmit. Although conferences almost always guarantee faster decision making, theirs is a binary decision without much possibility of revision. The only way to salvage a rejected paper is to wait for another conference in the same year, or for the same conference in a subsequent year. Throughout this process, it is quite common for the content and ideas of a submission to become stale, thus slowing down scientific progress.1

Of course, at the same time, there are many issues with this approach of "to arXiv," in contrast to the more traditional double-blind peer reviewing system ("not to arXiv"). Nowadays we see a flood of conference submissions on arXiv a day or two after the submission deadline of a conference, at least in the field of machine learning, or more specifically deep learning. Unfortunately, I must say that there are quite a few low-quality submissions among them. Why are so many low-quality submissions being made public? After all, no author wants to be associated with a submission that is half-baked and incomplete.

One potential reason I see is the severe competition among researchers from all corners of the globe. Nobody wants to be scooped simply because they forgot to upload their submission to arXiv before their competitors did. Pushed by this anxiety over being scooped, authors often end up putting out a rather half-baked manuscript. Or maybe authors are simply naive in thinking that one can always update her manuscript on arXiv with a newer version. Combined with an open reviewing system, such as that of ICLR, we see a surge of half-baked submissions on arXiv once or twice every year, and this has been spreading to other conferences as well as other fields.2

Why is it an issue at all? Because it wastes many people's time. We see an interesting title popping up in our Google Scholar updates or in someone's tweet, and as researchers, we cannot ignore that submission, be it accepted at some conference or not. And after reading the paper for 10-30 minutes, we realize, "well, I should wait a few months for the next version!" Also, the frequent lack of thorough empirical validation may mislead readers toward a wrong conclusion.

But, again, I'm not trying either to advocate or to oppose the idea of "to arXiv" in this post.3 Instead, I'm here to share the result of an informal survey I ran right after reading Chris' FB post. The goal of the survey was to see how many people follow the "to arXiv" or "not to arXiv" paradigm and to what degree they do so. The poll was completely anonymous and was done using the Facebook app <Polls for Pages>.4 It was rather informal, and the questions were changed slightly once at the beginning of the survey. Also, it's quite heavily biased, as most of the participants are people close to me, meaning that they work on either deep learning or (statistical) natural language processing. In other words, take the result of this poll with a grain of salt.

In total, 203 people participated, and they were either machine learning or natural language processing researchers. Among them, 64.5% said their major area of research is machine learning, and the rest natural language processing. 

The participants were asked first whether they upload their conference "submission" to arXiv. About two thirds of the participants answered that they do.
When I drew this pie chart, I noticed a striking resemblance to the chart showing the portion of machine learning researchers among the participants. Is it possible that all the machine learning researchers post their submissions to arXiv but no NLP researchers do? It turned out that the answer was "no."
Among ML researchers
Among NLP researchers

But still, I was able to see a stark difference between machine learning researchers and NLP researchers. While 75.6% of machine learning researchers said they upload their submissions to arXiv, less than 50% of NLP researchers did so. I believe this reflects the fact that the "to arXiv" model has recently been strongly advocated by some machine learning researchers, such as Yann LeCun and Yoshua Bengio.

The second question was on "when" they uploaded their submissions to arXiv.5
The respondents were quite divided between "to arXiv right away", "to arXiv after the deadline", and "to arXiv after the paper's accepted." One lesson is that an absolute majority of the respondents want to put their papers up regardless of "official" publication (in proceedings).

Now, aren't you curious how much this trend depends upon the field of research? First up, machine learning!
Whoa! More than half of the machine learning respondents said they upload their conference submissions to arXiv before any formal feedback on them. Furthermore, it shows that more than 80% of the machine learning researchers make their papers available online well before the actual conference, meaning that if anyone's determined enough, she can read most of the machine learning papers far in advance of the actual conferences (of course, you can't drink beer with the authors, which is kind of a deal breaker for me..)

How much does it differ if we only consider NLPers?
Surprise, surprise! We see a radically different picture here. Only about a fifth of all the NLP respondents said they upload their submissions before any formal feedback. Nearly half of the NLPers wait until the decision is made on the submission, before they arXiv it.  Also, nearly a quarter of them do not actively use arXiv for conference submissions.

Now, what have we learned from this? What have I learned from this? What have you learned from this? I have learned quite a lot of interesting things from this survey, but my dinner time's approaching too fast..

One thing for sure is that it would be extremely interesting to conduct this type of survey, in a much more rigorous way, at some point this year, and to do follow-up studies every year or two for the next decade. This would be an extremely valuable study that may help us build a better publication model for research.

So, my conclusion? It was $50 well spent.

The data (anonymized), along with the python script I used to draw those pie charts (it was my first time and I don't recommend it), is available at

1 There is also the issue of malicious reviewers, or, more mildly put, subconscious bias working against some submissions, but I won't try to open this can of worms in this post.

2 I am guilty of this myself and do not in any sense intend to blame anyone. I view this as a systematic issue rather than an issue of an individual.

3 I will perhaps make another post some day on this, but not today, tomorrow nor this year.

4 Which was a pretty bad idea, because it turned out that I had to pay $50 in order to see the responses from more than 50 respondents.. :(

5 I assumed every researcher has a good intention of having their paper made public once it's published regardless of whether to arXiv or not. Therefore, "probably not" should be understood as "probably not uploading a manuscript that was published in another medium/venue to a preprint server such as arXiv." 

Lecture Note for <NLP with Distributed Representation> on arXiv Now

posted Nov 25, 2015, 5:50 PM by KyungHyun Cho   [ updated Nov 25, 2015, 6:01 PM ]

On the same day I moved to NYC at the end of August, I had coffee with Hal Daume III. Among the many things we talked about, I just had to ask Hal for advice on teaching, as my very first full-semester course was about to start. One of the first questions I asked was whether he had lecture slides all ready, now that it's been some years since he started teaching.

His response was that there were no slides! No slides? I was shocked for a moment. Though, now that I think about it, most of the lectures I attended during my undergrad were in fact given as chalkboard lectures.

I can understand that there are many advantages to chalkboard lectures, and most of them go to the students. The slow pace of a chalkboard lecture likely (but not necessarily) fits better with the pace of understanding what's going on in the lecture room than simply flipping through slides does. Also, it becomes nearly impossible for a lecturer to skip anything, as every board starts empty.

I took this as a challenge (though I'm sure Hal never meant it as a challenge in the first place). Also, I naively thought that the time I would need to spend preparing 100 slides would be much larger than the time needed to prepare a chalkboard lecture. After all, I've been talking about this NLP-with-DL material over and over, and those talks successfully landed me a job.

One piece of advice from Hal was that it is better to keep a record or note of what I will teach or have taught, so that I can reuse it over and over. In hindsight, it was perhaps not advice but simply his personal regret (+ a hint that I shouldn't do chalkboard lectures..)

Sticking to this advice, I decided to write a lecture note of roughly 10 pages each week. Since I cannot even remember the last time I hand-wrote any text, I decided to use latex. So far so good, except that it turned out to be an amazingly time-consuming job. Writing 10 pages each week has never felt so difficult (and I used the default latex article class, which has gigantic margins..)

About a month into the semester, I found this amazing review article (or lecture note, I'd say) by Yoav Goldberg. If only Yoav had uploaded it to arXiv a month and a half earlier! The course was already more than a third of the way into the semester, and I couldn't suddenly ask the students to switch from my (ongoing) lecture note to Yoav's. Why? Two reasons: (1) my lecture note had deviated quite far from Yoav's, and (2) my ego wouldn't let me declare my failure at writing a lecture note myself in front of the whole class.

Anyways, I continued writing the lecture note, and this Monday I gave the last lecture. I thought of cleaning it up quite significantly, adding more material and even putting in some exercises, but you know.. I'm way too exhausted to do even one of those now. I decided to put the latest version, as of Monday evening, on arXiv, and it showed up today:

I must confess that this lecture note is likely to be full of errors (both major and minor). Also, I had to skip quite a few exciting new topics due to the time constraint (if only the semester were twice as long! nope). I kindly ask for your understanding.. I mean, it's been rough.

Any future plans for this lecture note? Hopefully I will convince the Center for Data Science at NYU of the importance of this course, and they'll let me teach the very same course next year. In that case, I will likely clean it up more, fix all those errors, update some of the later chapters, and, this time for real, add some exercise problems. Wish me luck!

Oh, right! Before finishing this post, I'd like to thank all the students and non-students who came to the lectures, and the two TAs, Kelvin and Sebastien, who've been an awesome help.

Brief Summary of the Panel Discussion at DL Workshop @ICML 2015

posted Jul 12, 2015, 12:27 PM by KyungHyun Cho   [ updated Jul 14, 2015, 6:04 AM ]

The finale of the Deep Learning Workshop at ICML 2015 was the panel discussion on the future of deep learning. After a couple of weeks of extensive discussion and exchanges of emails among the workshop organizers, we invited six panelists: Yoshua Bengio (University of Montreal), Neil Lawrence (University of Sheffield), Juergen Schmidhuber (IDSIA), Demis Hassabis (Google DeepMind), Yann LeCun (Facebook, NYU) and Kevin Murphy (Google). As the recent deep learning revolution has come from both academia and industry, we tried our best to balance the panel so that the audience could hear from experts on both sides. Before I say anything more, I would like to thank the panelists for accepting the invitation!

Max Welling (University of Amsterdam) moderated the discussion, and personally, I found his moderation to be perfect. A very tight schedule of one hour, with six amazing panelists, on the grand topic of the future of deep learning; I cannot think of anyone who could've done a better job than Max. On behalf of all the other organizers (note that Max is also one of the workshop organizers), I thank him a lot!

Now that the panel discussion is over, I'd like to leave a brief note here on what I heard from the six panelists. Unfortunately, only as the panel discussion began did I realize that I didn't have a notepad with me.. I furiously went through my backpack and found a paper I needed to review. In other words, due to the lack of space, my record here is likely neither precise nor extensive.

I'm writing this on the plane, so forgive me for any errors below (or above). I wanted to write it down before the heat from the discussion cooled down. Also, almost everything inside quotation marks is not an exact quote but a paraphrased one.

On the present and future of deep learning

Bengio began by noting that natural language processing (NLP) has not been revolutionized by deep learning, though there has been huge progress during the last year. He believes NLP has the potential to become the next big thing for deep learning. Also, he wants more effort invested in unsupervised learning, a sentiment echoed by LeCun, Hassabis and Schmidhuber.

Interestingly, four out of the six panelists, LeCun, Hassabis, Lawrence and Murphy, all pointed to medicine/healthcare as a next big thing for deep/machine learning. Some of the areas they expressed interest in were medical image analysis (LeCun) and drug discovery (Hassabis). Regarding this, I believe Lawrence is already pushing in this direction (DeepHealth, from his earlier talk on the same day), and it'll be interesting to contrast his approach with those from Google DeepMind and Facebook later.

LeCun and Hassabis both picked Q&A and natural language dialogue systems as next big things. I especially liked how LeCun put these in the context of incorporating reasoning, knowledge acquisition and planning into neural networks (or, as a matter of fact, any machine learning model). This was echoed by both Hassabis and Schmidhuber.

Schmidhuber and Hassabis identified sequential decision making as a next important research topic. Schmidhuber's example of Capuchin monkeys was both inspiring and fun (not only because he mistakenly pronounced it as "cappuccino monkey"). In order to pick a fruit at the top of a tree, a Capuchin monkey plans a sequence of sub-goals (e.g., walk to the tree, climb the tree, grab the fruit, …) effortlessly. Schmidhuber believes that we will have machines with animal-level intelligence (like a Capuchin smartphone?) in 10 years.

Slightly different from the other panelists, Lawrence and Murphy are more interested in transferring the recent success of deep learning to tasks/datasets that humans cannot solve well (let me just call these 'non-cognitive' tasks for now). Lawrence noted that the success of deep learning so far has largely been constrained to tasks humans can do effortlessly, but the future may lie with non-cognitive tasks. When it comes to these non-cognitive tasks, the interpretability of trained models will become more valuable, as Murphy noted.

Hierarchical planning, knowledge acquisition and the ability to perform non-cognitive tasks naturally lead to the idea of an automated laboratory, explained Murphy and Schmidhuber. In such an automated laboratory, a machine would actively plan its goals to expand its knowledge of the world (by observation and experiment) and to provide insights into the world (interpretability).

On the Industry vs. Academia

One surprising remark from LeCun was that he believes the gap between the infrastructure at industry labs and academic labs will shrink over time, not widen. This would be great, but I am more pessimistic than he is.

LeCun went on to explain the open research effort at Facebook AI Research (FAIR). According to him, there are three reasons why industry (not just FAIR) should push open science: (1) this is how research advances in general, (2) it makes a company more attractive to prospective employees/researchers and (3) there is competition among companies in research, and this is the way to stay ahead of others.

To my surprise, according to Hassabis, Google DeepMind (DeepMind from here on) and FAIR have agreed to share a research software framework based on Torch. I vaguely remember hearing something about this being under discussion some weeks or months ago, but apparently it has happened. I believe this will further speed up research at both FAIR and DeepMind. Though, it remains to be seen whether it will be beneficial to other research facilities (like universities) for the two places with the highest concentration of deep learning researchers in the world to share and use the same code base.

Hassabis, Lawrence, Murphy and Bengio all believe that the enormous resources available in industry labs are not necessarily an issue for academic labs. Lawrence pointed out that, other than the data-driven companies (think of Google and Facebook), most companies in the world are suffering from the abundance of data rather than enjoying it, which opens a large opportunity for researchers in academic labs. Murphy compared research in academia these days to the Russians during the space race between the US and Russia: the lack of resources may prove useful, or even necessary, for algorithmic breakthroughs, which Bengio and Hassabis still consider important. Furthermore, Hassabis suggested finding tasks or problems, such as games, where one can readily generate artificial data.

Schmidhuber's answer was the most distinctive. He believes that the code for a truly working AI agent will be so simple and short that eventually high school students will play around with it. In other words, there is no need to worry about industry monopolizing AI and its research. Nothing to worry at all!

On the Hype and the Potential Second NN Winter

As he has been asked about overhyping every time he is interviewed by a journalist, LeCun started on this topic. Overhyping is dangerous, said LeCun, and there are four contributing factors: (1) self-deluded academics who need funding, (2) startup founders who need funding, (3) program managers of funding agencies who manage funding and (4) failed journalism (which probably also needs funding/salary). Recently in the field of deep learning, the fourth factor has played a major role, and surprisingly, not all news articles have been the product of the PR machines at Google and Facebook. LeCun would prefer that journalists call researchers before writing potential nonsense.

LeCun and Bengio believe that a potential solution, both to avoid overhyping and to speed up progress in research, is an open review system, where (real) scientists/researchers put their work online and publicly comment on it, letting people see both the upsides and downsides of a paper (and why this paper alone won't cause the singularity). Pushing it further, Murphy pointed out the importance of open-sourcing research software, with which other people can more easily understand the weaknesses or limitations of newly proposed methods. He also noted that it is important for authors themselves to clearly state the limitations of their approaches whenever they write a paper. Of course, this requires what Leon Bottou asked for in his plenary talk (reviewers should encourage the discussion of limitations, not kill the paper because of them).

Similarly, Lawrence proposed that we, researchers and scientists, should slowly but surely approach the public more. If we can't trust journalists, then we may need to do it ourselves. A good example he pointed to is the “Talking Machines” podcast by Ryan Adams and Katherine Gorman.

Hassabis agrees that overhyping is dangerous, but also believes that there will be no third AI/NN winter: we now understand better what caused the previous AI/NN winters, and we are better at not promising too much. If I may add my own opinion here, I agree with Hassabis, especially because neural networks are now widely deployed in commercial applications (think of Google Voice), which makes another NN winter even less likely (I mean, it works!)

Schmidhuber also agrees with the other panelists that there won't be another AI/NN winter, but for yet another reason: the advances in hardware technology toward "more RNN-like (hence brain-like) architectures," where "a small 3D volume with lots of processors are connected by many short and few long wires."

One comment from Murphy was my favourite: 'it is simply human nature.'

On AI Fear and Singularity

Apparently Hassabis of DeepMind has been at the core of the recent AI fear voiced by prominent figures such as Elon Musk, Stephen Hawking and Bill Gates. Hassabis introduced AI to Musk, which may have alarmed him. In recent months, however, Hassabis has reassured Musk, and also had a three-hour-long chat with Hawking about this. According to him, Hawking is less worried now. Still, he emphasized that we must prepare for, not fear, the future.

Murphy finds this kind of AI fear and discussion of the singularity a huge distraction. There are so many other major problems in the world that require more immediate attention, such as climate change and spreading inequality. This kind of AI fear is simply oversold speculation and needs to stop, with which both Bengio and LeCun agree. Similarly, Lawrence does not find the fear of AI the right problem to worry about. Rather, he is more concerned with the issues of digital oligarchy and data inequality.

One interesting remark from LeCun was that we must be careful to distinguish intelligence from human qualities. Most of the problematic human behaviours, because of which many fear human-like AI, are caused by human qualities, not intelligence. An intelligent machine need not inherit those human qualities.

Schmidhuber had a very distinctive view on this matter. He believes that we will see a community of AI agents consisting of both smart ones and dumb ones. They will be more interested in each other (just as ten-year-old girls are more interested in and hang out with other ten-year-old girls, and Capuchin monkeys hang out with other Capuchin monkeys) and may not be very interested in humans. Furthermore, he believes AI agents will be significantly smarter than humans (or rather than himself) without those human qualities he does not like about himself, which is in line with LeCun's remark.

Questions from the Audience

Unfortunately, I was carrying around the microphone during this time and consequently couldn't take any notes. There were excellent questions (for instance, from Tijmen Tieleman) and excellent responses from the panelists. If anyone reads this and remembers those questions and answers, please share them in the comment section.

One question I remember came from Tieleman. He asked the panelists for their opinions on active learning/exploration as an option for efficient unsupervised learning. Schmidhuber and Murphy responded, and I really liked their answer. In short (or as far as I can trust my memory), active exploration will happen naturally as a consequence of rewarding better explanation of the world. Knowledge of the surrounding world and its accumulation should be rewarded, and to maximize this reward, an agent or an algorithm will actively explore its surroundings (even without supervision). According to Murphy, this may reflect how babies learn so quickly without much supervision, or even without many unsupervised examples: their active exploration compensates for the lack of unsupervised examples by letting them collect high-quality ones.

I had the honor of asking the last question, directed mainly at Hassabis, LeCun and Murphy: what would their companies do if they (accidentally or intentionally) built a truly working AI agent (in whatever sense)? Would they conceal it, thinking that the world is not ready for it? Would they keep it secret because of potential opportunities for commercialization? Let me give a brief summary of their responses (as I remember them; again, I couldn't write them down at the time).

All of them said that it won't happen like that (one accident resulting in a thinking machine). Because of this, LeCun does not find it concerning: it will happen gradually as a result of the joint efforts of many scientists in both industry and academia. Hassabis holds a similar view, and also couldn't imagine that this kind of discovery, had it happened, could be contained (it would probably be the biggest leak in human history). However, he argued for getting ready for a future where we, humans, will have access to truly thinking machines, a sentiment I share. Murphy agreed with both LeCun and Hassabis. He and LeCun also remarked on the recently released movie Ex Machina (which is, by the way, my favourite this year so far): it's a beautifully filmed movie, but nothing like that will happen.

I agree with all the points they made. There was, however, another reason behind my question, which was unfortunately not discussed (undoubtedly due to the time constraint). Once we have algorithms or machines that are “thinking”, and say the most important pieces were developed at a couple of commercial companies (like the ones Hassabis, LeCun and Murphy work for), who will have the right to those crucial components? Will they belong to those companies or to individuals? Will they have to be made public (something like a universal right to artificial intelligence)? And, most importantly, who will decide any of this?


Obviously, there is no conclusion. It is an ongoing effort, and I, or we the organizers, hope that this panel discussion has been successful at shedding at least a bit of light on the path toward the future of deep learning as well as general artificial intelligence (though Lawrence pointed out the absurdity of this term by quoting Zoubin Ghahramani: 'if a bird flying is flying, then is a plane flying artificial flying?').

But let me point out a few things that I've personally found very interesting and inspiring:

(1) Unsupervised learning as reinforcement learning and an automated laboratory: instead of taking into account every single unlabeled example as it comes, we should let a model selectively consider a subset of unlabeled examples to maximize a reward defined by the amount of accumulated knowledge.

(2) Overhyping can largely be avoided by the active participation of researchers in distributing the latest results and ideas, rather than by letting non-experts explain them to non-experts. Podcasting, open reviewing and blogging may help, but there's probably no one right answer here.

(3) I don't think there was any agreement on industry vs. academia. However, I felt that the academic panelists as well as the industry panelists all agree that each side has its own role (sometimes overlapping) toward a single grand goal.

(4) Deep learning has been successful at what humans are good at (e.g., vision and speech), and in the future we as researchers should also explore tasks/datasets that humans are not particularly good at (or only become good at after years and years of special training). In this sense, medicine/health care seems to be one area in which most of the panelists are interested and probably are investing.

When it comes to the format of the panel discussion, I liked it in general, but of course, as usual with anything, there were a few unsatisfactory things. The most unsatisfactory was the time constraint (one hour) we set ourselves. We gathered six amazing panelists who have so much to share with the audience and the world, but on average only ten minutes per panelist was available. As one of the organizers, this was partly my fault in planning. It would have been even better if the panel discussion had been scheduled to last the whole day, with more panelists, more topics and more audience involvement (at least, I would have loved it!). But, of course, a three-day-long workshop has been way out of our league.

Another thing I think can be improved is the one-time nature of the discussion. It may be possible to make this kind of panel discussion a yearly event, co-located with a workshop or even held online. As many of the panelists pointed out, this can help us (and others) avoid overhyping our research results or the future of the whole field of machine learning, and it would be a great way to reach a much larger audience, including both senior and junior researchers as well as the informed/interested public. Maybe I, or you who are reading this, should email the hosts of “Talking Machines” and suggest it.

Comment from Juergen Schmidhuber

Schmidhuber read this post and emailed me his comment to clarify a few things. With his permission, I am posting his comment here as it is:

Thanks for your summary! I think it would be good to publish the precise transcript. Let me offer a few clarifications for now:

1. Why no additional NN winter? The laws of physics force our hardware to become more and more 3D-RNN-like (and brain-like): densely packed processors connected by many short and few long wires, e.g., Nature seems to dictate such 3D architectures, and that’s why both fast computers and brains are the way they are.  That is, even without any biological motivation, RNN algorithms will become even more important - no new NN winter in sight.

2. On AI fear: I didn’t say "Nothing to worry at all!” I just said we may hope for some sort of protection from supersmart AIs of the far future through their widespread lack of interest in us, like in this comment:
And in the near future there will be intense commercial pressure to make very friendly, not so smart AIs that keep their users happy. Unfortunately, however, a child-like AI could also be trained by sick humans to become a child soldier, which sounds horrible. So I’d never say "Nothing to worry at all!” Nevertheless, silly goal conflicts between robots and humans in famous SF movie plots (Matrix, Terminator) don’t make any sense.


[UPDATE 13 July 2015 7.10AM: rephrased Schmidhuber's reply on the AI/NN winter as requested by Schmidhuber]
[UPDATE 14 July 2015 9.03AM: Schmidhuber's comments added]
