

NYU Center for Data Science: What is intelligence?

posted Dec 23, 2019, 7:13 AM by KyungHyun Cho   [ updated Dec 23, 2019, 10:32 AM ]

A few weeks ago there was an open house at the NYU Center for Data Science intended for NYU faculty members. As one of the early members of the Center (I know! Already!), I was given an opportunity to share with the audience why I joined the Center and my experience there so far. Although I'm much more familiar with giving a research talk using a set of slides, I decided to try something new and give a talk without any slides. Of course, this was totally new to me, and I couldn't help but prepare a script in advance. I didn't really stick to the script during my talk, but I thought it wouldn't be a bad idea to share it with a broader community beyond NYU.


What is intelligence?

It turns out that there is a great list of speakers scheduled after my quick lightning talk, covering a broad set of topics spanning mathematics, computer science, the natural sciences, healthcare and medicine, all the way to law. Each speaker will without a doubt tell us about the latest and greatest research in the direction they pursue and how it is connected to data science and, perhaps more broadly, artificial intelligence.

For me, I'm going through a bit of a research identity crisis at the moment, so I thought I would spend a brief moment talking about why I decided to join the NYU Center for Data Science as one of the earliest so-called core faculty members in 2015.

My background is in computer science; I received all of my degrees in it. The reason I decided to pursue computer science was simple: I was fascinated by the idea that we can pose and answer the question, “What is computation?” This seemingly straightforward question has many implications. First, it turns the abstract notion of computation into a scientifically well-founded concept that we can characterize and study. Second, this investigation into what computation is has led to practical solutions to many problems that were not even straightforward to define before computation itself was defined. What started as a formal, mathematical journey into understanding computation had already become a major scientific discipline touching every corner of society by the time I started my undergraduate years. Look around and think of what you do every day, both personally and professionally. It is nearly impossible nowadays to find a single activity that does not involve the outcome of computer science, and computer science continues to make progress in answering the question: “What is computation?”

Then what is the next question we should, and must, ask? In my opinion, it is this: “What is intelligence?” or, perhaps equivalently, “What is knowledge?”

This question asks what key concepts are needed to define a sophisticated problem and its solution, how these concepts can be scientifically and rigorously defined and characterized, and how they should be combined and searched through for us to automatically find a solution, an algorithm, to a complicated, real-world problem. In answering this question, two things have emerged as crucial components: learning and data.

“Learning” refers in this context to a process by which we automatically construct an algorithm to solve a problem. In other words, it is a meta-algorithm that automatically builds a new algorithm. This “learning” process relies heavily on the availability of “data”, be it collected by humans, by other algorithms or by the learning process itself. From data, it identifies the underlying rules and regularities that can be exploited to solve a problem efficiently and effectively. This is precisely why we refer to this whole new discipline as data science.

We study the mathematical and computational aspects surrounding this core concept of “data” behind intelligence and knowledge. What is the correct way to characterize data? What is the correct way to automate the collection of data in order to maximize the effectiveness and efficiency of “learning”? What is the correct way for a learning algorithm to maximally extract underlying rules and regularities from data to construct an algorithm to solve a problem? All these questions point to the ultimate question of what intelligence is and what knowledge is, and, along the way, they help us solve many real-world problems based on data and learning.

This is why I decided to join the Center for Data Science, in addition to computer science, in 2015, even when the center was in its early years. I have not had a single moment of regret since joining CDS, especially looking at the trajectory we have been taking.

Now let me tell you briefly about my own research in this context. One particular aspect of intelligence that sets us humans apart from other seemingly intelligent animals, such as other mammals and insects, is our use of sophisticated language. This use of sophisticated language presents a unique opportunity to push the boundaries of our experience. Although none of us in this room (I am quite certain) has ever been to Antarctica, we somehow all know that there are penguins in Antarctica. Although none of us in this room (I am 100% certain) has ever been to ancient Greece in person, we somehow know a lot about ancient Greece, probably more than the average ancient Greek who lived it. Both of these are possible because we use language to share experiences and broaden our boundaries both spatially and temporally, which sets us clearly apart from any other intelligent being on this earth. Together with our unique level of intelligence, this makes me believe we must study language carefully in order to answer the question “What is intelligence?”

There are two parts to studying and designing learning algorithms for natural language. One is to build learning algorithms that focus on extracting the underlying semantics of language in order to solve problems that require in-depth knowledge expressed in text. This direction is pursued mainly by Sam Bowman and He He at NYU CDS, and I will skip it here. The other is to build learning algorithms that know how to generate well-formed text, and this is my main research direction.

The problem of text generation belongs to the wider category of structured output prediction. In structured output prediction, the set of possible outcomes is very large; technically speaking, its size grows exponentially with the input size. In other words, it is not possible for a learning algorithm to naively test each and every possible configuration; the learning algorithm must extract and exploit underlying structures that are often not apparent. Once a good set of regularities has been extracted, learning provides us with an efficient algorithm to rapidly search this exponentially large space for a good configuration or sentence.
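As a concrete illustration (this is the standard chain-rule factorization, not anything specific to my own work), an autoregressive model sidesteps this exponential blow-up by decomposing the probability of a sentence y = (y₁, …, y_T) given an input x one token at a time:

```latex
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)
```

Each factor is a small next-token distribution, so generating a sentence becomes a sequence of T tractable decisions rather than a single search over an exponentially large set.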

One particular approach I have been exploring since 2014 is neural autoregressive models with attention, which have become the de facto standard, not only in academia but also in industry, for building machine translation, speech recognition and speech synthesis systems. This approach has recently been found, by others as well as by my own group, to be generally applicable to any structured object generation, where structured objects refer to generic graphs. One quick example is conditional molecule design. Together with Prof. Kang of SKKU, who was visiting the NYU Center for Data Science on his sabbatical, I was able to demonstrate the effectiveness of recurrent nets with an attention mechanism and latent variables in the controllable generation of molecular hypotheses. This effort, which started in late 2017, has now been expanded to using graph neural networks (about which I believe Joan Bruna will tell you a more exciting story) to better capture the graph-likeness of molecules and proteins.

We are a very long way from answering the question “What is intelligence?”, or “What is knowledge?”, in a rigorous manner. We have barely taken a step toward this goal, and if the history of any scientific discipline is any indication, it will take many correct and incorrect steps over decades, if not centuries, before we can claim to have caught even a glimpse of the answer.

One thing that is certain, however, is that we have been successfully building an environment here at the Center for Data Science, bringing in and hiring people with the expertise necessary to advance toward answering this ultimate question. My research has certainly benefited from having a diverse set of colleagues of world-class caliber. I have designed and proposed a unified framework for online learning algorithms for recurrent networks together with Cristina Savin, which will be a crucial component for building an intelligent agent that lives indefinitely. I have worked with Sam Bowman to better understand and characterize these language-understanding neural networks. I have studied the applicability of deep learning to physics and biology by working with Kyle Cranmer and Rich Bonneau. I have been building a deep learning based diagnostic system for early-stage breast cancer screening together with Krzysztof Geras, who was a postdoc at the NYU Center for Data Science and is now an assistant professor of Radiology. I have even had the pleasure of investigating the impact of uncertainty-aware word embeddings in political science together with Arthur Spirling.

Thanks for listening to me, and I’ll be happy to chat more about any of these topics as well as how my experience with CDS has been so far. 

A short note on <Rebooting AI> by Marcus & Davis

posted Oct 12, 2019, 12:21 PM by KyungHyun Cho   [ updated Oct 12, 2019, 2:05 PM ]

Disclaimer: I received a hard copy of <Rebooting AI> from the publisher, although I had by then already purchased the Kindle version of the book on Amazon. I only gave the book a quick look on my flight between UIUC and NYC, and I wrote this brief note on the flight back to NYC from Chicago. I also felt it would be good to have even a short note by a machine learning researcher to balance all the praise from “Noam Chomsky, Steven Pinker, Garry Kasparov” and others.

<Rebooting AI> is a well-written piece (somewhat hastily) summarizing the current state of artificial intelligence (or perhaps more like machine learning) in terms of both research and deployment. Anyone who has not been in the field will appreciate the effort of the authors in gathering various recent (and old) findings that succinctly describe what we can and should expect from the current technology, and what we cannot. To me, and perhaps to some of my colleagues in the field of deep learning (and, slightly more broadly, machine learning), which is often the target of the authors' skepticism (to be fair, the authors demonstrate healthy skepticism toward every other existing technology in machine learning and artificial intelligence as well), the book feels relatively light despite the grand reception it has received from various folks on social media.

Why do I feel this way? Perhaps it is because I can sort the failure modes of the current technology, presented in this book as surprising findings, into two categories. The first category consists almost exclusively of what has been reported by machine learning researchers. That is, unlike what I felt the book was implying (either implicitly or explicitly), it is machine learning researchers who are at the frontier of discovering, investigating and trying their best to address these weaknesses of the current technology. The second category consists of failures found largely by the authors themselves, manually playing around with (or more seriously testing) some of the products and demos that boast of employing the latest technology. Whether this limited interaction (everyone has only 24 hours a day, without exception) is enough depends on what kind of argument these failure cases are used to support, and I see some cases in this book that I find refreshing, as the examples clearly demonstrate weak aspects of those systems. It is, however, the empirical side of me that finds it a bit less satisfying to see a scientific argument made from a few manually selected examples. In summary, contrary to the authors' implication, these problems are known and are being actively discovered by AI researchers (in particular ML researchers), and we are actively seeking to tackle them, although it is rare for journalists or pundits to talk about this compared to fancier news, e.g., Silicon Valley acquisitions, mergers or funding of supposedly-AI companies.

Yet another reason might be that the book does not really provide a clear, verified (or even verifiable) way to “reboot” AI, or even a way to think about approaching the problem of AI. In short, there were too many “seems to”, “will need to”, “should”, “is pretty clear” and other uncertain, perhaps risk-avoiding terms in the book whenever the authors tried to argue for the importance (or rather necessity) of a certain direction or method they “pretty clearly” believe a general AI system “seems to” require. The empirical side of me struck again and again whenever I ran into these statements; that is, if we can neither prove something somewhat rigorously nor demonstrate it empirically and convincingly, my scientific trust in the argument tends to go down. In the latter case (empirical demonstration) especially, how convincing the demonstration is correlates almost directly with my trust, and sadly I could not find much of that in this book. For instance, I was much more convinced of the importance of common sense, which the authors emphasize over and over, by Yejin Choi of UW, who showed me, over beer in Chicago two days ago, her latest work on natural-language-based learning of common sense, than by the arguments in this book. This is of course not to say that the authors' proposals and arguments are incorrect or entirely unconvincing. It is just that, as I mentioned earlier, they feel lighter than what I would have expected from the title <Rebooting AI> and the weight of the authors, Gary Marcus and Ernest Davis, both of whom I know in person.

This brief note on what I thought of <Rebooting AI> has concentrated mostly on the first part (which arguably takes up most of the book), which is mainly about the technological side of AI. For me, it was more enjoyable to read the second (and last) part, which discusses the true dangers and consequences of AI, as perceived by the authors, beyond the usual straw-man argument about humanity's extinction by super-intelligence. I wonder what researchers in AI safety or the ethical use of AI/ML think of this second part. Would they also find it too light, as I found the first part, though without sacrificing correctness? If so, that would ironically imply that the authors have done a commendable job of summarizing various latest developments (and non-developments) in AI/ML, while nicely blending in their own views and research, so as to pique the interest of bystanders and educate them to the point that they are aware of these developments and their potential consequences and concerns.

<Rebooting AI> reads a bit too light for my taste, but that is almost certainly due to my own involvement in the field of AI as a researcher and educator. Taking a small step back from my current position, I believe it was necessary, and perhaps timely, for some book to succinctly summarize both the upsides and downsides of the current state of AI for laypersons (as in anyone not necessarily following the non-stop flood of academic papers in the field of AI), and it is not easy to imagine a better person (or a better team of people) than Gary and Ernie.

In short, I would recommend <Rebooting AI> to my parents (once the Korean translation becomes available), although they might feel sad that my name was not mentioned even once when the improvement in Google Translate was described ;). If your parents are not AI researchers, I would suggest you recommend it to them as well. I would not, however, find it necessary for AI researchers themselves to read this book, unless you want a short but interesting discussion of trustworthy AI toward the end. Of course, if you want to have a Twitter or Facebook debate with Gary, I guess it would not hurt to give the book a quick look (although I do not find it too necessary).

Discrepancy between GD-by-GD and GD-by-SGD

posted Sep 23, 2019, 7:00 AM by KyungHyun Cho   [ updated Sep 23, 2019, 8:20 AM ]

The ICLR deadline is approaching, and of course it is time to write a short blog post that has absolutely nothing to do with any of my manuscripts in preparation. I would like to thank Ed Grefenstette, Tim Rocktäschel and Phu Mon Htut for fruitful discussions.

Let's consider the following meta-optimization objective function:

  J(θ₀) = L'(D'; θ'),  where θ' = θ₀ − η ∇L(D; θ₀),

which we want to minimize w.r.t. θ₀. It has become popular recently, thanks to the success of MAML and its earlier and more recent variants, to use gradient descent to minimize such a meta-optimization objective function. The gradient can be written down as*

  dJ/dθ₀ = (I − η ∇²L(D; θ₀)) ∇L'(D'; θ'),

where θ' is the updated parameter set. In this derivation, what we see is that the gradient w.r.t. the original parameter set θ₀ is propagated from the outer objective function L' via θ', which was computed using the gradient of the inner objective function L w.r.t. θ evaluated at the original parameter set θ₀.

So far so good. But what if the inner optimization procedure were stochastic?

That is, what if the meta-optimization objective function were

  J(θ₀) = L'(D'; θ'),  where θ' = θ₀ − η E_z[∇L(D, z; θ₀)],

where z is used to absorb any stochasticity present in this gradient descent procedure? For instance, z could be used to sample a subset of D to build a minibatch gradient. After all, this is often what we do in deep learning, rather than the full-batch, deterministic gradient descent shown above.

In this case, the gradient of the meta-objective function w.r.t. θ₀ looks slightly different from the above:*

  dJ/dθ₀ = (I − η E_z[∇²L(D, z; θ₀)]) ∇L'(D'; θ').

What is really important to notice here is that there are suddenly two expectations rather than just one: one inside θ' and one in the Jacobian term of the gradient. This makes a huge difference, because we now need two independent sets of samples of z to estimate the meta-objective gradient w.r.t. θ₀ without bias.

How would this be implemented in practice? We first draw one minibatch and update θ₀ to get θ'. We then draw another, independent minibatch and update θ₀ to get θ'' (notice the double prime!). We draw a validation minibatch D' to evaluate θ' using the meta-objective function L', and backprop down to θ' (using that same validation minibatch). We then switch to θ'' and backprop through it to θ₀. In other words, we use two separate paths below θ' for the forward and backward passes, which is quite different from the usual practice.
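The two-path procedure just described can be sketched in a few lines of NumPy. This is my own toy construction (a quadratic inner loss and a squared-error outer loss; names such as `meta_grad` and `inner_hess` are mine, not from any library): the forward minibatch defines θ', while an independent backward minibatch defines the Jacobian path.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_grad(theta, X):
    # gradient of the inner loss L(X; theta) = mean over x in X of 0.5 * (x . theta)^2
    return X.T @ (X @ theta) / len(X)

def inner_hess(X):
    # Hessian of the same inner loss; note that it depends on the minibatch
    return X.T @ X / len(X)

def meta_grad(theta0, data, val_target, lr=0.1, batch_size=8):
    """One-step meta-gradient estimated with two independent minibatches:
    the first defines theta' (forward pass), the second defines the
    Jacobian, i.e. the theta'' path described in the text (backward pass)."""
    b1 = data[rng.choice(len(data), batch_size, replace=False)]  # forward minibatch
    b2 = data[rng.choice(len(data), batch_size, replace=False)]  # backward minibatch

    theta_p = theta0 - lr * inner_grad(theta0, b1)   # theta'
    # outer loss L'(theta') = 0.5 * ||theta' - val_target||^2, so dL'/dtheta' is:
    g_outer = theta_p - val_target
    # dtheta''/dtheta0 = I - lr * Hessian, estimated on the independent minibatch
    jac = np.eye(len(theta0)) - lr * inner_hess(b2)
    return jac @ g_outer

# toy usage
data = rng.normal(size=(256, 4))
theta0 = rng.normal(size=4)
g = meta_grad(theta0, data, val_target=np.zeros(4))
```

Reusing b1 for both θ' and the Jacobian would correlate the two estimates; drawing b2 independently is exactly the "two separate paths" trick, and the same idea carries over to backprop-through-SGD in autodiff frameworks.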

What does this imply? It implies that correct meta-objective optimization looks for a θ₀ that is robust to the optimization trajectory taken due to the inherent stochasticity of SGD. In order to do so, it must consider what would have happened had a different optimization trajectory been used, and this can be estimated well by using separate minibatches for the forward and backward passes. I believe Ferenc Huszár made a similar argument in the “What is missing? Stochasticity” section of his recent blog post.

An interesting question here is what z is and what kind of distribution we should impose on it. For instance, can we fold the choice of optimization algorithm into z, in addition to other sources of stochasticity such as data ordering and dropout? If so, can we extend MAML to find an initialization that is robust not only to the stochasticity of a selected optimization algorithm but also to the choice of optimization algorithm itself?

(*) I am being massively sloppy with scalars, vectors, matrices, gradients and Jacobians; my apologies in advance. You can simply think of everything as scalars, and the whole argument still largely holds.

Sharing some good news and some bad news

posted Jun 17, 2019, 5:55 PM by KyungHyun Cho   [ updated Jun 18, 2019, 6:48 AM ]

I have some news, both good and bad, to share with everyone around me, because I have always been a big fan of transparency, and also because I have recently realized that it can easily become awkward when those who know the news and those who don't are in the same place with me. Let me begin.

The story containing all this news starts sometime in mid-2017, when I finally decided to apply for permanent residence (a green card) after spending three years here in the US. As I am already in the US, the process consists of two stages. In the first stage, I, or more precisely my employer, New York University, petition for my permanent residence; this is followed by a second stage, in which I switch my status from non-immigrant alien to permanent resident. The first stage proceeded pretty smoothly, and as part of preparing documents for the second stage (adjustment of status), I had to get a medical checkup. I thought it was a good chance to get a health checkup anyway, because I cannot even remember when I last had a regular one. It was probably in 2008, when I came back to the university to finish my undergraduate education and had to submit a medical checkup report to qualify for a place in the university dorm. So, it had been almost exactly ten years.

The checkup in April 2018 went pretty smoothly as well. Although I was not in the greatest shape, I was found to be relatively healthy, as in no apparent disease or abnormality. There was, however, one anomaly in my thyroid hormone level. My primary care physician thus recommended that I take a low dose of levothyroxine, as he suspected I had so-called Hashimoto's disease (where the thyroid is "under"-active and does not produce enough hormone, which often leads to exhaustion and so on). He told me it was nothing to worry about, as about 10% of the population has this issue, and all that is needed is to take the pill for the rest of one's life. I was not happy when I heard it, but I became happy pretty quickly, as I felt much better once I started taking Synthroid. I had not known I was getting exhausted so easily before that, because I thought life was just supposed to be that exhausting, ever since I was in middle school.

Six months passed, and in a follow-up exam early this year (January 2019), my PCP (Dr. M for now; I am not revealing the doctors' names for their privacy) told me my hormone level was now normal and that there was no need to either increase or decrease the dose. He then suggested I have an ultrasound as a precaution. He told me it was fine not to have one, but that there was no harm in doing so. No harm in it; why wouldn't I do it? Furthermore, my mom had suggested it earlier, to which I had reacted by saying, "If my doc tells me to do so, I will, but until then, I won't have an unnecessary ultrasound or any other exam." Now Dr. M had told me to do so, and I did not hesitate to move forward. He referred me to NYU Radiology, and I was fortunate enough to get my ultrasound scheduled within a week.

As an avid user of Google and the Internet in general, I had to look it up a bit. What can be found by ultrasound? What is the most realistic scenario? What I learned back then was that a male in his early 30s should not have anything in his thyroid that shows up on ultrasound. It was essentially a zero-probability event that the ultrasound would reveal anything in my thyroid. Furthermore, as Hashimoto's disease was known not to be associated with any nodule or anomaly in the thyroid, there was no reason for me even to worry about the whole thing. On the day of the appointment, I went to the hospital, had my thyroid ultrasound and came back to the office. It is quite usual for these exams to take a couple to a few days to be read, and I was not expecting much when I arrived back at the office. (How do I know this? Because I went through a severe medical issue 2.5 years ago, which turned out to have nothing to do with the issue I am about to tell you about.)

Somehow, within two to three hours, Dr. M called me. He told me that, "to his surprise", the ultrasound had revealed a small nodule on the right side of my thyroid, and that it was recommended that I have it biopsied (fine needle aspiration, FNA). The only federal grant I have received is from the NIH, on reducing the recall rate of breast cancer screening using deep learning, and from preparing the grant proposal and listening to Dr. Linda Moy and others at NYU Radiology, I knew that patients go through enormous stress when recalled for further exams, even if the chance of anything malignant is often less than a few percent. Until the moment my doctor told me to get a biopsy, I do not think I had ever "felt" this stress of recalled patients. Although Dr. M assured me that 90% or so of biopsied thyroid nodules turn out to be benign and fewer than 5% malignant, my mind was completely fixated on that <5% chance. Again, the doctor referred me to pathology & radiology at the NYU Langone medical center, and I was fortunately scheduled for an FNA within a week (to be precise, I had the FNA exactly a week after the ultrasound).

I did some homework and confirmed what Dr. M had told me. It is pretty rare for a thyroid nodule to be malignant, although this chance was a bit meaningless in my case, as it was already pretty rare for a male in his early-to-mid 30s to have any thyroid nodule to start with. Essentially, there was an issue of exposure bias, and I fell into a rare bucket that makes it difficult to predict what would actually happen. Nevertheless, I decided to (try to) stay calm and show up for the biopsy, although I must say I could not really focus on anything but googling thyroid cancer and anomalies the entire week. An FNA involves just the patient and two doctors, one from pathology and one from radiology. Guided by ultrasound performed in real time by the radiologist, the pathologist (is that the right term?) collects cells from the nodule using fine needles. Because it is FNA, there is almost no pain, and it takes only about 20 minutes in total. Mine was done even more quickly, and I was back at my office within an hour of arriving at the hospital.

Pathology takes a few days, up to a week or so, if all the tests are performed. My natural expectation was to hear about the result a couple of days later, after some essential tests were done. As you can now see the pattern, this expectation was completely shattered: I received a call from my primary care physician about two hours after the biopsy. He told me that, again "to his surprise", the nodule had turned out to be malignant. It was, however, nothing to worry about: the tumour was pretty small, appeared to be confined entirely to the right side of the thyroid, and had not metastasized beyond it. He referred me to a surgeon (Dr. P for now) at the NYU Langone hospital and told me he was *the* best thyroid surgeon I could find, worth waiting many months for if I could get surgery from him. That was a relief, and it has turned out to be true (up to now, at the very least).

You may have heard somewhere that thyroid cancer (or, more specifically, a common form of it) is the best cancer one can have. Some call it the only treatable cancer, and the prognosis is indeed excellent relative to other types of cancer. Even so, it is really not great, and still pretty distressing, to be diagnosed with thyroid cancer. This is worsened by a few facts: (1) no one can really know how aggressive or spread out any cancer is until one looks at it physically (i.e., surgically, with pathology thereafter), (2) one rare (5%) form of thyroid cancer (anaplastic cancer) is effectively a death sentence, and (3) it is still "cancer". It was right at the beginning of a new semester (I truly thank my department chair, center director and colleagues, who kindly understood my situation and helped me so much along the way), there was the ICML deadline (I must thank my co-authors so much, because I effectively checked out from this point on, worrying and googling about thyroid cancer), and so many other things were happening. This really did not help.

On top of all these matters, I was confronted with perhaps the biggest challenge I have ever encountered: breaking the news to my mom, when my relationship with my parents has almost strictly been one of sharing good news only. When it comes to bad news, it has been “don't ask, don't tell.” I was suddenly in a situation where I had to share bad news, as I thought it was probably wise to tell my parents that their son had been diagnosed with cancer, and I was not really sure how to do so. In the end, I just called mom and broke the news to her. She is a strong person and took it much more calmly than I expected, at least on the phone, but it was one of the most difficult moments of my life. So, if you are somewhat young, as in your teens or twenties, listen to me and practice sharing not only good news but also bad news with your parents. It will definitely help later, when you really have to share bad news with them.

It is a bit weird to talk about luck when I had just been diagnosed with thyroid cancer, but I was lucky to be scheduled for a consultation with Dr. P within two weeks. I learned a few days later that, other than that appointment, the earliest slot available was in April or so. Starting from the medical checkup mandated by USCIS for permanent residence, through Dr. M's somewhat offhand recommendation of an ultrasound, to the early scheduling of the ultrasound and biopsy, a series of lucky breaks showed up during this past year, except for the thyroid cancer itself.

Dr. P is one of the most confident people I have ever met. He is professional, enthusiastic and confident, which magically made this patient (i.e., me) suddenly optimistic about the whole ordeal. He looked carefully into the ultrasound and biopsy reports and told me there was nothing to worry about other than the usual risk involved in any surgery (in particular, neck surgery). He suggested I have the entire thyroid removed, even though the cancer seemed to be confined to the right side, since the thyroid was not working well anyway and removal would reduce the chance of recurrence (no thyroid, no thyroid cancer). I agreed, and we decided to go ahead with a total thyroidectomy. As Dr. P had a trip to Korea scheduled for February (how ironic!), I was scheduled for surgery in mid-March. I wanted the surgery as soon as possible and consulted Dr. M, who reminded me once again that it was worth waiting a few months to have Dr. P operate on me.

The first thing I had to sort out after scheduling was what to do about my planned teaching at the African Master's Program in Machine Intelligence (AMMI). My original plan had been to fly to Rwanda, teach a full week and have a few days of sightseeing, but because the surgery was scheduled for a week after the course started, I certainly had to give up on sightseeing, and I was not even sure whether I could make it back in time for the surgery. Kigali, Rwanda is pretty far from NYC, and the flight connections were not great. There was one route that could fly me back to NYC from Kigali in slightly under 20 hours, but it had a 50-minute layover in Nairobi. Would I be able to make that connection? If I missed it, it looked like I would have to stay in Nairobi for at least six to eight hours before finding another flight to JFK (via yet another airport). That would put a lot of pressure on my surgery schedule.

I called Kenya Airways and asked them if it was possible to make this connection. They said it was certainly possible, based on which I decided to go ahead with this plan. I planned to fly to Rwanda on Saturday, arrive at Kigali on Sunday (evening), teach three days non-stop from Monday to Wednesday, give a final exam Thursday morning, grade the exam in the afternoon and fly back to NYC in the early evening. Let me pause a bit and say thanks to all the teaching assistants (Roberta Raileanu, Sreyas Mohan and Ilya Kulikov) who flew all the way to Kigali with me to teach this intensive course. Without them this course would not have been possible. I'd also like to thank all the students at AMMI, who were present and attentive the whole time even with this extreme schedule I had to follow due to the surgery, and Moustapha and Teta, the co-directors of the program, for accommodating the sudden change to the lecture schedule. I couldn't believe I was even remotely thinking of cancelling the whole thing because of the surgery, because this was one of *the* best experiences I've had in my career as a professor. This trip to and teaching at Kigali warrants a separate blog post, so let me fast-forward to the last day in Kigali. Though, let me say one thing: I cannot believe that I missed this chance to see wild mountain gorillas! I will have to teach again and sincerely hope AMMI invites me over again next year.

Because Kenya Airways told me it was possible to make the connection within 50 minutes, I expected the gates to at least be in the same terminal, or at the very least to be within walking distance of each other. My expectation was completely shattered when I learned that I had to run from the gate to the end of the arrival terminal, wait for an airport train, take it to another terminal, and then run to the departure gate at pretty much the end of that terminal (because it's a US-bound flight, the gate needs a separate security check and is often near the end of the terminal.) I ran. I really ran all the way. I barely made it, and I was literally the last one to board the plane. The doors were closed shortly after I boarded. So, yes, it was indeed possible to make the connection, although they forgot to tell me "barely".

The second matter was my teaching. I had planned quite a few trips, and those were easy to cancel. I learned that the term "cancer" is pretty magical in that when one asks to be excused because of "cancer", one is excused from literally anything immediately, often with sincere comments (according to Keunwoo, I started to enjoy this excuse a bit too much at some point, with which I cannot disagree.) It was however not necessarily the case with a full-semester course. Until my surgery, I was obviously capable of teaching and also wanted to teach, because that would be one routine thing (this is my third time teaching this course) that distracts me from thinking about thyroid cancer, surgery and what comes after. The question was more about after the surgery. By some coincidence, the surgery was scheduled in the week of spring break, but I expected to be out for about 2-3 weeks. I talked with the department chair (Prof. Denis Zorin), who kindly encouraged me to find substitute lecturers and assured me the department would take care of all the arrangements and that I needn't worry about it at all. I thus arranged four weeks of substitute teaching, three of which were taken care of by Alfredo, a postdoc of Yann's. Students loved his lectures, and I sincerely thank him for it. I feel bad about having had to introduce this discontinuity into the course, which was mentioned a couple of times in the course evaluation, but well.. what could I do? I was about to go through neck surgery to remove my thyroid (cancer).

Especially in the first few weeks, or more like between the diagnosis and the surgery, I could not let go of trying to figure out why I had thyroid cancer. This is pretty useless, especially when it's thyroid cancer, as it is one of those few cancer types that are known to be less affected by environmental factors (except for one!) From my quick (but frequent) searches, thyroid cancer seems to be largely attributed to one of two factors; (1) genetics and (2) radioactive exposure. My initial suspicion was fixated on the second factor. I was born and raised in a country with the highest density of nuclear power plants in the world, and from my experience of interacting with some of the people from the government-owned company that takes care of all the nuclear power plants in Korea, my confidence in how well those plants were built, maintained and transparently operated has always been pretty low. Perhaps all South Koreans, including me, are susceptible to thyroid cancer because of all those nuclear power plants. But, then, as far as I know, I am really the only one among my friends of similar age who has been diagnosed with thyroid cancer. I then remembered that I once accompanied those folks from Korea on their tour of the nuclear waste facility construction site in Finland as an interpreter. Perhaps that's where I was exposed to some radioactive iodine. But then I remembered that it was a construction site holding only a minimal amount of radioactive waste, and that it was probably safer there than in the wilderness in terms of radiation. Or, maybe, it's because Helsinki is only about 1,000km away from Chernobyl, although that's just impossible because radioactive iodine has a half-life of about 8 days. It's 33 years too late for that to be the cause of my thyroid cancer. In the end, it turned out that one of my uncles was diagnosed with thyroid cancer and had surgery a few years back, which largely confirms it's genetic.
That uncle and my family are not really on good terms (or more like on we-are-not-a-family-anymore terms), and I just didn't know about it until now. In short, there is no way to tell for sure, but it looks increasingly likely that it's genetic.

One thing that has been impossible for me, and has bothered me so much since the diagnosis, is writing and, to a certain degree, reading. To write a paper, what I need (and probably what anyone needs) is focus. I don't listen to music, I don't talk to people and I don't do anything extra when I write and read. This cancer diagnosis unfortunately filled up more than half, if not most, of my mind, and it has been impossible (at least until quite recently) to fully focus on writing and reading. I sit down, start writing or revising what has been written so far, and immediately my mind springs to thyroid cancer. What if it turned out to be anaplastic cancer? What if it had spread beyond the thyroid? What if it had spread beyond the neck and to the lungs? What if the surgery goes wrong and I lose my voice? So many unnecessary, low-chance scenarios swirled around inside my mind constantly, which was understandable because it would not be the first time an extremely low-chance event happened to me (ramsay hunt syndrome in my early 30s and now thyroid cancer in my mid 30s. I'm both looking forward to and super worried about my late 30s.) This inability to write or read was pretty brutal, because those two activities form a (super-)majority of my job after all. If I can't really write or read effectively, who am I to start with?

The surgery was a day surgery with an overnight stay at the hospital. I had to fast from midnight and came to the hospital in the morning. My mom flew all the way from Seoul to take care of me after the surgery. Thanks, mom. My mom and I went to the hospital on the day of the surgery. I was prepared by the nurses: changing into a gown, wearing a hair (head?) cover, putting everything into a locker and so on. The nurses were extremely helpful and kind, and I really didn't feel any frustration, although I was about to have my first surgery under general anaesthesia. I never imagined my first surgery would be a cancer surgery and that it would be in NYC. Dr. P came to the waiting area, marked a planned incision line on my neck and told me with his usual confidence that I did not need to worry and that he would do his best to remove my thyroid and any visible cancerous parts from my neck. Although my mom does not really speak English (she reads English quite well and speaks pretty okay, but listening is always trouble, just like for any other Korean who learned English in Korea,) she told me, shortly after Dr. P left, that she felt like she understood what he said, that she could feel his confidence and that she felt a bit more comfortable herself. Dr. P told me, as he had in the first consulting session in January, that the surgery was expected to take about 2.5 to 3 hours, and a nurse told me to tell my mom to come back to the waiting room in about two hours to be called in after the surgery was done. I did, and she went downstairs to have a sandwich for lunch and met Prof. Gene Kim, who is my collaborator at NYU Radiology and is Korean. I'd like to thank Gene for kindly taking the time to talk with my mom when she was pretty much alone in the hospital cafeteria, not speaking the language and waiting for her son's surgery.

A nurse walked me to the operating room. What did I know about operating rooms? The first thing I noticed when I entered was that it was pretty cold inside. I was asked to lie down on the operating table (bed?) with my arms wide open and my neck bent backward. On my way to the table, I said "hi" to the surgery team, and they actually said "hi" back and even said "welcome". It was a surreal situation, but this welcoming atmosphere was pretty nice and calmed me down quite a bit. The anaesthesiologist put a needle in my left hand and told me he was starting the anaesthesia. When he asked me if anything was uncomfortable, I told him I felt a weird metallic taste in my mouth. He smiled and told me I would not be the first one to tell him that, and that's the last thing I remember before I suddenly woke up (or more like opened my eyes).

I entered the operating room around 10.30am expecting to be out by 1.30pm or so. The first thing I saw when I opened my eyes, with my neck still bent backward, was the monitor showing various curves and numbers. From that monitor I saw that it was 3.30pm, and out of curiosity and concern, I tried to ask those who were pushing me somewhere why it took so much longer than expected. I couldn't get any answer, and I cannot tell whether I was even able to ask the question (I may very well have just imagined asking it.) I didn't stick to this question any further, as there was an urgent matter to take care of: I had to pee. I had to pee so much. I told a nurse (I think) that I desperately needed to pee, and the nurse gave me a portable plastic urinal so that I could pee in the bed. It unfortunately did not work, and as soon as my bed stopped in one corner of the waiting area, I stood up and started to pee, during which the nurse hurriedly closed the curtain (I feel terrible that I started to pee into the portable urinal even before the curtain was closed, but I just couldn't hold it any longer.)

It took my mom quite some time to find me in the waiting area. The surgery had taken 2-2.5 hours longer than expected, and due to a language mismatch (or more like a name-pronunciation mismatch), it took the nurses some time to locate my mom and bring her to the waiting area. My mom is really a strong person. She was a bit frustrated because of the unexpectedly long surgery but was happy to see me lying in my bed in one piece. She was also relieved that I was able to talk as usual and be snarky as usual. After a couple of hours, the surgery team stopped by and gave the okay for me to be moved to a patient room in the same building.

The wing in which these patient rooms are located is a pretty new one. It was still under active construction when I joined NYU and was completed only recently. A nice feature of this new wing is that all the patient rooms are single rooms. Instead of building rooms with different capacities, each and every patient room in this wing is for one patient at a time. Each of them is pretty small, but in my opinion this is better than building rooms that hold many patients each, because privacy is pretty much guaranteed. I only heard about this specific design and the rationale behind it recently. I never imagined I would spend a night in one of these patient rooms, but here I was.

My experience overnight at the hospital was extremely pleasant (some of which may be due to percocet, I must admit.) The attending nurse overnight was super kind and nice not only to me but also to my mom. Although I had just had a five-hour-long surgery, I did not feel any discomfort staying at the hospital. Of course, I was on opioids and lying comfortably on the bed, as opposed to my mom, who was not on opioids and was barely able to lie down on a large chair next to the bed. Again, thanks, mom, for flying all the way here and taking care of me. The nurse monitored me every four hours, checked my pain level (it wasn't too bad compared to the pain I had when I suffered from ramsay hunt syndrome, which was as if someone were hammering a nail into my head, but it was still considerable) and gave me the appropriate painkillers and other medications. The night passed by quite rapidly (at least for me.)

In the morning, part of the surgery team stopped by and took the "staples" out of the incision on my neck. I couldn't see it, because I had to look up in order to give the doctors room to take them out. I didn't feel much, just a bit of pinching every time a staple was taken out. According to my mom, it was a sight of wonder. It had been less than 20 hours since my surgery, and somehow the incision, which was held together by the staples, had already closed. My mom and I both blame ourselves for not having taken a picture of the incision on my neck before and after the staples were removed. Sometime later, Dr. P stopped by and explained (again with his confidence and excellent voice) to us why the surgery went longer and how it went. When he opened up my neck and started looking around inside, he noticed that the lymph nodes appeared cancerous. He immediately took out a few samples of lymph nodes and sent them for pathology on the spot. While waiting for pathology's opinion, he began removing the thyroid. As pathology confirmed that the lymph nodes were cancerous, Dr. P removed 18 lymph nodes that looked suspicious. According to Dr. P, he removed all he saw and sent those samples to pathology for final confirmation. I was told to go home and come back for a follow-up visit in a week. Based on the wondrous sight of the staple removal and the confident explanation by Dr. P, my mom was quite satisfied with the whole surgery and experience (of course, not with the fact that her son had thyroid cancer.)

The first day at home was pretty uneventful. I largely rested without doing anything. It wasn't physical pain that forced me to rest on the first day, but the fear that any activity might tear open the incision that was barely healing. I really did nothing but eating, web surfing and lying down. One thing that was unusual was that I had to use an incentive spirometer once in a while to inflate my lungs to their normal level. It was on the second day that all the pain suddenly appeared, likely because the effect of the general anesthesia had worn off completely. It was not pain in the incision area nor inside my neck, but largely muscle pain in my shoulders, back and neck. It was just impossible for me to hold my head up straight, because it felt so heavy. I couldn't raise my arms more than halfway, because they were so heavy. All I could do without any pain was lie down and look at the ceiling. These muscle aches were probably due to my posture during the surgery. My neck was bent backward, and consequently my back and arms were bent backward, for five hours non-stop. It all struck back as soon as the anesthesia wore off. Percocet helped a bit, but was not perfect. It took about two days for these muscle aches to go away.

The follow-up visit to Dr. P's office was on the Monday a week later. At the visit, Dr. P first took off the bandage over the incision, which I had kept on until then, even when taking a shower. Voila, the incision had closed completely with minimal scarring. Dr. P had somehow magically aligned the scar perfectly with my neck creases. It will leave a scar, but I've never been a pretty one to start with and am super satisfied with how the surgery went. Dr. P then showed me the pathology report (and gave me a copy) and explained it in detail, which is what I really like about my experience of US health care so far: doctors explain things really well and carefully. According to pathology, what I had/have is papillary thyroid cancer, which is the most common type of thyroid cancer and responds to radioactive iodine treatment. This is good news, but the slightly bad news is that it was metastatic and exhibited a so-called tall cell feature, both of which are indicators of an aggressive thyroid cancer. It had indeed metastasized to the lymph nodes (11 out of the 18 lymph nodes removed during the operation were found cancerous,) but fortunately it was in its early stage and did not look like it had spread beyond the neck (the boundaries were clean). It was also found that the cancer was not only on the right side but had already spread to the left side (a microcarcinoma in the left side.) And, what was most hilarious to me was that Hashimoto's (thyroiditis) was also confirmed. I asked Dr. P whether there was any connection between Hashimoto's and the cancer, and his answer was a firm no. It was just a coincidence.

In terms of the surgery, it was successfully done (as was later confirmed and told to me over and over by other doctors.) As the final step (of intervention, not of monitoring, which I will have to do for many years,) I was referred to an endocrinologist, Dr. A, in the same hospital, who would order a radioactive iodine treatment, help me find the new, right dose of synthroid and monitor me for the next few years. Before I delve into explaining radioactive iodine treatment, let me tell you how lucky I was, once again, with scheduling. I called Dr. A's office to make an appointment. I was told, sadly, that there was only one spot available before September, and that spot was in the second week of May, overlapping with ICLR'19, which I had really looked forward to. Of course, as Douwe told me when I shared this with him; "priority, man, priority", I took the spot. I could've asked to be put on a waiting list and waited for another appointment to be cancelled, which would have been a reasonable strategy for any other treatment and with any other doctor. I didn't, out of a hunch that this was *the* spot for me. The reason for the complete absence of any appointment slot in the summer, I learned while talking with Dr. A at my consulting appointment, turned out to be that she was planning to go on maternity leave for three months starting from mid-June. In other words, if I hadn't taken that spot in order to attend ICLR'19 and had instead asked to be on the waiting list, this blog post would've had to wait until this fall or even winter. Furthermore, because Dr. A wanted to make sure that she saw through my treatment before going on maternity leave, she scheduled my radioactive iodine treatment as early as possible, which I greatly appreciate. So, yes, yet another stroke of luck in scheduling.

At the very beginning I said I wanted to share a few pieces of news, both good *and* bad. The bad news was obviously the thyroid cancer. Let me give you one piece of good news at this point. Early last Fall (2018), the department suggested I go up for tenure. It was earlier than I expected, but I gladly and gratefully agreed to prepare my docket and (perhaps a bit hastily) submitted it, as I have been professionally happy and extremely satisfied at NYU. I was told evaluation takes quite some time, as it involves three stages; (1) at the Department of Computer Science (and the Center for Data Science), (2) at the Courant Institute and (3) at the provost's office. Since this is a big deal, I could've, or perhaps should've, been anxious about how the whole evaluation was going and at which level my docket was sitting. Unfortunately or fortunately, I did not have the luxury of worrying about my tenure case from this January onward, thanks to thyroid cancer. It may sound ridiculous or obnoxious, but at some point in January, I totally forgot about my tenure case for an understandable reason.

After the surgery, I gradually ramped up the number of hours I spent at the office, starting from about a week after. Before the one-week mark, however, I generally stayed home resting or took a light stroll, but nothing more than that. Since I was really just lying down playing games or surfing the web, I was checking my email almost in real-time (though I did not reply to most of the emails I received back then, because I still couldn't focus enough to reply to any that required careful attention.) Then one email arrived whose title was "FAS Promotion and Tenure Review". Yes, I was officially granted tenure and promoted to associate professor effective September 1, 2019. Okay, okay, I am not tenured nor an associate professor yet for the next 2.5 months, but it was pretty nice to receive this letter in my bed, resting after the five-hour-long thyroid cancer surgery. It was even better because I received it at a moment when nothing felt good or optimistic and I was least expecting any good surprise (although the surgery went well according to Dr. P, the exhaustion from the surgery and the stress made it difficult for me to feel optimistic.)

Let’s go back to thyroid cancer. In the week of ICLR’19, when all my friends, colleagues and former colleagues were enjoying the conference in New Orleans, I was in an NYC that felt weirdly empty. I joked about how the city was empty because everyone had gone to New Orleans to attend ICLR. I went to see Dr. A. She had read my file, discussed it with Dr. P and had a treatment plan in mind already when I entered her office. She told me that my cancer was quite aggressive, as evident from the pathology report, but that it was found early and was well removed by Dr. P. She wanted me to have a radioactive iodine treatment with a medium-to-high dose of I-131 radioiodine as early as possible, so that she could make sure my treatment went successfully before she left on maternity leave. In order to do so, she told me to start a low-iodine diet immediately to prepare my remaining thyroid cells for radioactive iodine therapy, which I will talk about a bit more shortly. She explained to me all the details of radioactive iodine therapy and that the treatment would be given by “Nuclear Medicine” in Radiology. I called Nuclear Medicine to schedule the therapy, and it was scheduled for the week of June 3, based on which I sadly had to cancel all my planned travels to ICML’19 and NAACL’19. But again, as Douwe said, “priority, man, priority.”

So, what is this radioactive iodine therapy? Obviously I’m no medical doctor (though I am a doctor of science in computer science,) and my knowledge is pretty limited. In other words, take whatever I say about this therapy with a grain of salt. The thyroid is one of the few organs that actively consumes iodine. There are a few other organs that require iodine (I believe the parathyroid may?), but the thyroid is by far the biggest consumer of iodine in our body. When consuming iodine, the thyroid does not distinguish between different isotopes of iodine. It consumes I-127, which is the naturally occurring, non-radioactive isotope, but it also consumes radioactive isotopes (radioiodine) such as I-123 and I-131 (the latter created as a by-product of nuclear fission). This creates an opportunity to design a therapy that specifically targets thyroid cells. We can use, for instance, I-123, which is much less radioactive than I-131 and has a shorter half-life of approximately 13 hours, to scan the body for any cell or organ that takes in iodine, such as thyroid cells. I-131, which is much more radioactive and has a longer half-life of approximately 8 days, kills any thyroid cell that consumes it if the right dose is given. This property is used to destroy any remnant thyroid cells after a total thyroidectomy, and this procedure is called radioactive iodine therapy (or treatment). I still cannot believe how we (humans) figured this out and can now use it regularly to treat thyroid cancer and hyperthyroidism. Modern medicine is just full of wonders, and this spring I’ve experienced several of them, including neck surgery, synthetic thyroid hormones, radioactive iodine therapy and of course opioids (I understand addiction is an issue, but when under severe pain, such as that after tumor removal or ramsay hunt syndrome, opioids are often the only way to sleep and function.)
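Since half-lives come up repeatedly here, a quick back-of-the-envelope sketch (my own illustration, nothing medical) of how fast each isotope fades: the fraction remaining after time t is simply 0.5 raised to the power t divided by the half-life.

```python
def remaining_fraction(elapsed_hours: float, half_life_hours: float) -> float:
    """Fraction of a radioisotope left after `elapsed_hours`,
    given its half-life: N(t)/N(0) = 0.5 ** (t / T_half)."""
    return 0.5 ** (elapsed_hours / half_life_hours)

# I-123 (half-life ~13 hours) vs I-131 (~8 days = 192 hours), after one day:
i123_left = remaining_fraction(24, 13)   # roughly 0.28, i.e., mostly decayed
i131_left = remaining_fraction(24, 192)  # roughly 0.92, i.e., mostly still around
```

This, roughly, is why I-123 is the convenient choice for a quick scan, while I-131 sticks around long enough to destroy the remnant thyroid cells.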

Once a patient takes in I-131, they become pretty badly radioactive. This is not really an issue for the patient themself, as this radiation is precisely what kills off the remaining thyroid cells, cancerous or not. Radiation is, however, an issue for others who have their thyroids intact, as I-131 will destroy their thyroid cells, leading to hypothyroidism and, in the worst (but very rare) case, thyroid cancer. Small children/babies and pregnant women are especially at risk, and the patient is asked to quarantine themself at home, avoiding the public (because you never know who's out in public), for approximately a week. With a higher dose, it could be longer, but generally, according to Dr. F of Nuclear Medicine, a week is enough for the radiation level to decay significantly. Though, I decided to quarantine myself a couple of days more, because there's a bit of a baby epidemic in my social circle, with new babies popping out pretty consistently every few months. The last thing I want to hear is that one of my friends' babies needs to take synthroid for the rest of their life because I happened to stand next to them a bit too long while I was radioactive. Thyroid cancer would not work as well as an excuse in that case.

In order to have this radioactive iodine therapy, I had to see three groups of people at the hospital. The first group is obviously a group of medical doctors. The second group consists of radiation safety officers (the best job title ever!) They are officers at the Radiology department who are responsible for the safe handling of radioactive materials. Before the therapy began, I sat down with one officer for about 20 minutes and listened to all the safety measures that I had to take while quarantining myself at home. Those measures include (1) always sit when you pee, (2) flush at least twice each time, (3) use disposable utensils, plates and cups, (4) do not taste food while cooking, (5) wear socks and slippers at home, (6) wrap your mattress with a plastic cover and throw the cover away afterwards, (7) keep laundry separate inside a plastic bag and wash it twice, but only a week after the quarantine is done, (8) do not go out in public and linger, (9) throw away bar soaps and toothbrushes, (10) do not use public transportation for the first three to seven days, and so on. On the third day of the therapy, when the actual I-131 was taken, another radiation safety officer reminded me of these safety measures once more and measured my radioactivity level before I left the hospital.

The third group consists of the imaging technicians, with whom I interacted most. They are in charge of operating the various scanning machines at the hospital. Because I had to take several whole-body scans before and after taking I-131, it was them I saw most often and spent the most time with. On the second day (Tuesday) of the therapy, I received the second shot of thyrogen, which significantly increases the amount of thyroid stimulating hormone (TSH) to excite remnant thyroid cells, and then took one particular radioisotope of iodine, I-123, which is mildly radioactive and has a half-life of approximately 12-13 hours. The I-123 was absorbed quickly by these excited thyroid cells, allowing the whole-body scan to detect the trace of gamma rays emitted from those thyroid cells as part of the radioactive decay. This is used to determine the degree to which the thyroid cancer has spread beyond the neck area and also to determine the dose of I-131. The scan was done first thing on the third day (Wednesday), and based on the scan, the dose was finalized (to be slightly lower than was discussed earlier) at 125 mCi. The technician brought a gigantic lead case in which another lead case contained yet another lead case that contained two I-131 capsules. Before those capsules were taken out, I was warned not to take time to look at them and wonder about them: just pour them into your mouth and swallow them as quickly as possible.

I did my homework: one of the first things I did when I learned that I had to go through radioactive iodine therapy was to purchase a Geiger counter. If you've watched any movie or TV series that involves nuclear meltdown or apocalypse, you've already seen a Geiger counter. Of course, in the movies they tend to use gigantic, industry-grade Geiger counters, while in reality what you buy off of Amazon is a smartphone-sized Geiger counter with an LCD screen and WiFi, although I haven't figured out why my Geiger counter has WiFi. Like the scanners at the hospital, it measures the level of radioactivity by detecting the alpha, beta and gamma rays resulting from the decay of radioactive isotopes. The Geiger counter came with a small reference card that tells you what you should do according to the reading. For instance, a counts-per-minute (CPM) reading under 50 corresponds to the normal level of background radiation, and there is no need to take any action. If it's over 50 but still under 100, there is no need to panic, but it's a good idea to "check the reading regularly." If it's over 1,000, you must "leave the area ASAP and find out why." When the CPM is over 2,000, you must "evacuate immediately and report to government."

Already after taking the I-123, the Geiger counter showed over 6,000 CPM when placed near my stomach, as that's where the I-123 capsule dissolved. As the half-life of I-123 is quite short and my body got rid of it via sweat and pee, the reading dropped rapidly overnight. The surprise came after I took the I-131, which was, as expected, much more radioactive than the I-123; the dose of I-131 I took was also much higher than that of the I-123, whose primary purpose was scanning. I walked home as instructed after taking the I-131 on the third day, which took roughly 30 minutes, and almost immediately measured the radioactivity level with the Geiger counter. Approximately an hour after the intake, the CPM was over 800,000, and I instantly understood why the radiation safety officers kept emphasizing the importance of quarantining myself at home, away from anyone, as well as away from the apartment walls facing neighboring apartments (because gamma rays don't care much about walls, unlike us.) The level decayed quite rapidly over the next 2-3 days, although I am still more radioactive than the usual background radiation. I made a home video clip of approximately 5-10 minutes each day to share the progress with my parents, my brother and some close friends who were genuinely worried about me. I am still debating with myself whether to make them publicly available.
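That the reading fell much faster than the 8-day physical half-life alone would suggest is, as far as I understand, because decay and bodily elimination (the sweat and pee above) act in parallel: their rates add, giving a shorter effective half-life. A tiny sketch, with a purely made-up biological half-life since actual clearance varies from person to person:

```python
def effective_half_life(physical_days: float, biological_days: float) -> float:
    """Physical decay and biological elimination run in parallel,
    so their rates add: 1/T_eff = 1/T_phys + 1/T_bio."""
    return 1.0 / (1.0 / physical_days + 1.0 / biological_days)

# I-131 physical half-life: ~8 days. The 1-day biological half-life below
# is an illustrative assumption, not my actual clearance rate.
t_eff = effective_half_life(8.0, 1.0)  # ~0.89 days
```

Under that assumption, the effective half-life is under a day, which would explain a rapid drop over the first 2-3 days even though the isotope itself decays slowly.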

This high level of radioactivity turned out to be largely a non-issue, because it's the Geiger counter that reacts violently to it, not my body (except for the first two days, when I really didn't feel well.) A bigger issue with radioactive iodine therapy was the three-week-long low-iodine diet I had to go through, starting approximately 2.5 weeks before the actual therapy and ending 2-3 days after taking the I-131. Unless you live right next to a nuclear power plant or recently watched the TV series Chernobyl, it's very likely that you've never thought about iodine, just like me until this therapy. Only when I learned I should not consume iodine did I realize how much of the food I consume and like contains a high level of iodine. For instance, I was discouraged from having any seafood, any dairy products including milk and cheese, any soy-bean-based products and, perhaps most importantly, any salt (even non-iodized salt tends to contain a small amount of iodine.) This implies two things: first, there is effectively no Korean food I could/should have, and second, I could not eat out during this period, because you never know what they use to prepare a dish at a restaurant. The only silver lining was that "beer, wine or any alcoholic beverage" was fine to have during this horrendous diet. In other words, I survived off of a lot of beer and wine, a lot of fruit, a lot of salad without any dressing but olive oil, a bit of whole grain pasta and a bit of fresh meat "without" any salt. Without any salt, a lot of the things you cook start to taste like cardboard boxes. To escape this tasteless reality, I had to rely on beer and wine. Let me thank Douwe and Keunwoo, who were kindly and always there to drink beer and wine with me throughout this difficult period and also to successfully discourage me from having any iodine-rich dish at a bar.
I shouldn’t forget to thank one more person, Kat, who taught me two amazing life lessons that helped me survive this period: (1) if you boil chopped-up tomatoes long enough, they become a great sauce even without any salt, and (2) a modern microwave (including mine at home) has a defrosting feature. The second lesson in particular has forever changed my life. Thanks, Kat!

As I write this down, I am realizing that the whole radioactive iodine therapy seems rather dull, boring and uneventful. It is perhaps because I was pretty well prepared and aware of the various aspects of the therapy before and after the intake of I-131. That is all thanks to Prof. S (I’m using this abbreviation to protect his privacy), to whom I was introduced and with whom I had a chance to talk right after my surgery and before the radioactive iodine therapy. Prof. S had a similar experience roughly a decade ago. He was diagnosed with thyroid cancer, and it had already spread throughout the neck, affecting his voice. He had *two* surgeries, one of which took *ten* hours, and went through radioactive iodine therapy afterwards. He kindly invited me to his office and patiently told me about his earlier experience and what I should expect and prepare for. Since this past January, I have learned that it is the uncertainty that kills, and these words from Prof. S greatly reduced the uncertainty associated with the surgery and radioactive iodine therapy and helped me feel much more comfortable and relaxed. Thanks, Prof. S.

On the seventh day after the intake of I-131, I was asked to come back to the hospital (Nuclear Medicine) for another whole-body scan to determine the spread of thyroid cancer (by observing the spread of I-131, as described above) and whether I would need another radioactive iodine therapy. From my research, I learned that it is pretty rare to need another therapy in a condition similar to mine, but it’s not like I’ve been following the usual, more probable routes when it comes to health and career (I mean, how many professors at US universities have degrees from Korea and Finland and did a postdoc in Quebec?). It was best put by Prof. Julia Kempe, who is the Director of the Center for Data Science and has been amazingly supportive from day one of my diagnosis: “your life is not statistically samplable.”

A whole-body scan, along with a more focused scan of the neck area, takes about an hour of lying on the scanner table (or is it a bed?), which is surprisingly comfortable and cozy. Because you’re wearing nothing but a hospital gown and because the room temperature is kept low (presumably for the scanner), the technician puts a warm blanket on you. There is a low-frequency background noise coming from the scanner, but otherwise absolutely no other noise. It is a pretty cozy atmosphere, and I always took a nap during the first half of the scan. Even then, after 40-50 minutes and a couple of whole-body scans, it becomes a bit boring and frustrating to lie on the table without moving.

After the scan on the seventh day, I was asked to call Nuclear Medicine the next morning, as it was already late and the attending radiologist would only have time to read my scan the next day. Because I was anxious to know the outcome and to hear the confirmation that I could finally break free from self-quarantine, I called Nuclear Medicine first thing in the morning at 8am. I was told I had called way too early (okay, understandably) and should expect to hear from them around 10am. My phone rang almost exactly at 10am, and I was told I needed to come back in for another scan, as the image from the previous scan was not clear enough to draw a conclusion from. From my experience so far this year, “recall” has never been a good sign, and I almost felt like I was falling from an infinitely tall building, or something like that. In other words, it was bad news.

Before I continue, this seems to be a good moment to share another piece of good news in order to balance out the ratio between good and bad news. I am thrilled to announce that I was invited by Sainaa to give a talk in *Mongolia* as a part of the summer school on deep learning there. As I began preparing for my trip, I learned that Mongolia is one of the few countries that requires South Koreans to obtain a visa in advance of arrival, i.e., I had to apply for a visa soon. As the consulate general of Mongolia in the US is in DC, and because I was not able to travel on public transportation for more than a short duration without harming others with my newly acquired radioactivity, I sent the application package (including my passport) to the Mongolian consulate via UPS. Although their homepage said visa processing takes approximately two days and at most a week, I had not received my passport back with the Mongolian visa after a week, and I was becoming a bit anxious about it. On my way to the hospital on the seventh day, however, I noticed a slip in my mailbox and picked up my passport with a shiny new Mongolian visa. Good news: I’m all ready to travel to Ulaanbaatar!

With this mix of (potentially) bad and (really) good news, I went back to the hospital for the second scan the next day. The technician told me it is quite common for a patient to come back for a second scan, because two consecutive scans give radiologists much more information about how radioiodine is spreading throughout the body. This was comforting and understandable, and I appreciated his explanation (unfortunately, I did not get his name), although it would’ve been even better had I been told this in the morning. The scan was done after an hour, and the image was delivered to the attending radiologist for confirmation. About 10 minutes later, another technician came back and told me that this time I needed a 3-D body scan. She may have read some frustration on my face (though I think I did a good job of maintaining a rather smiley and cheerful facial expression) and kindly told me it was not because the radiologist saw some anomaly but because he wanted to be extra careful and certain in reading these scans. Her kind words helped, but I was still pretty anxious about what the ultimate outcome would be. Another radioactive iodine therapy would imply that thyroid cells had spread to some parts of my body that were not easily reachable by the usual dose of radioactive iodine. What would that mean for me and my eventual recovery from thyroid cancer?

It took another hour or so to take a 3-D scan of my neck area and the whole body. The same machine, a slightly different mode of operation, but exactly the same for me as a patient: lie down and don’t move. The technician told me to wait in the room, while he checked with the attending radiologist whether this new image (together with the two 2-D scans) was enough, and I was told to change back into my clothes. Now that I think back, I’m quite sure it took less than 5 minutes for the technician to leave the room and come back, but it felt a lot longer than 5 minutes. To distract my mind from thyroid cancer, I even read the gigantic poster hung on the wall, authored by Dr. F of Nuclear Medicine, on whether and how 3-D PET scans improve over 2-D scans for recurrent melanoma. It was comforting to see that my doctor was active in research as well, although the improvement seemed relatively modest. If it were a submission to an ML conference, R2 would’ve recommended rejection.

Five minutes later, the technician came back and told me it was done. I had to ask what he meant by “it was done” and whether I should come back sometime later to talk to the radiologist. He simply told me there was no need to come back to Nuclear Medicine for another scan for now and that I would hear back if there were any reason to come in again. It sounded like good news, because it meant there would be no need for another round of radioactive iodine therapy. I somehow couldn’t believe it, because this would imply the end of the initial treatment of my thyroid cancer (of course, I will still have to come back frequently for follow-up monitoring). Almost exactly six months after the thyroid ultrasound in January, the treatment was almost at an end, although I still need a blood test for my thyroid hormone level next week and another post-op follow-up visit to Dr. P’s office next month.

Two days later (Saturday), I received an email from the patient portal of NYU Langone Medical Center saying that a new test result was available. I logged in to see it. Although I was expecting a good result, given that I had not been recalled for yet another scan, my heart was still pounding fast. After all, how often have I heard good vs. bad test results from the hospital? The odds have never been on my side when it comes to test results. The comment section at the top of the report said

“Dear Mr. Cho

All looks good

Dr A”


A couple of hours after reading the test result, and more specifically that positive comment from Dr. A, I received a message from my collaborator, Krzysztof, and our co-supervisee, Nan Wu, that our extended abstract, a shortened version of <Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening> submitted to the AI for Social Good Workshop at ICML’19, had received the best paper award. When I started talking with radiologists at NYU Langone Medical Center and decided to devote some time and resources to applying deep learning to breast cancer screening in early 2016, I never imagined that I would become a cancer patient myself and have a thyroidectomy and radioactive iodine therapy at the very same department. Only recently, thanks to a gigantic effort by Krzysztof and his team, did we begin to see positive outcomes from our research, and it coincided with my thyroid cancer. How ironic!

This blog post started with my application for permanent residence in mid-2017, which led to the eventual discovery and removal of aggressive thyroid cancer. I was asked to show up for an interview two weeks after the surgery and received the green card in mid-April. Together with my approved tenure case, a removed thyroid (along with 18 lymph nodes) and a green card, I guess it’s a pretty undramatic way to wrap up the latest stage (tenure-track) of my career.

Thanks to..

The challenge has been more psychological than physical. It started with me blaming everything and everyone that seemed to have even remotely caused thyroid cancer, all the way to giving up totally and lying in my bed for a couple of days doing nothing (or playing games). I tried to keep it to myself initially, but I learned that it was impossible to do so (because there are people who interact with me daily and depend on me operating as usual) and also that it helped to talk with people about it. Julia, Denis and Rob, my three bosses at NYU CDS, NYU Courant CS and FAIR respectively, were amazingly supportive of me going through this ordeal, providing me with administrative slack as well as psychological support.

Douwe must’ve been the first one outside my family to learn about this, and he’s been extremely supportive, listening to my gloomy rants and drinking beer with me whenever I needed it. Keunwoo is another who checked in on me every weekend (and often on weekdays) and was ready to come to the village to drink beer or just to sit and chat with me. Hal checked in on me regularly and even spent a couple of evenings drinking a lot of sake and beer with me. If I remember correctly, Hal “donated” a bunch of booze to Mila (formerly LISA) at the end of 2013 (or the beginning of 2014) and told me he did not drink as much booze as he used to as an undergrad. Somehow he went out of his way and had some nice beer and sake with me. One day Rich suddenly stopped by my office and dropped off a bunch of books and comics for my self-quarantine period. I haven’t gotten around to reading all of them (yet!), but <Dispatches> was pretty awesome. Orhan and Li checked in on me regularly and insisted on stopping by NYC, which I had to decline in order to avoid harming them with my radioactivity. I’m looking forward to having them over in NYC soon. It was extremely pleasant and comforting to talk with and hear from Prof. S about his experience, because it greatly reduced the uncertainty that was killing me (probably more so than thyroid cancer was). Gene kindly talked with my mom while she was waiting for my surgery to finish, and according to her, it was hugely comforting to talk with him at the hospital cafeteria “in Korean”. Thanks, Gene! Heng, thanks for all your encouragement. No one will ever know how much each of your texts meant to me.

The members of the CILVR group, faculty, postdocs and students alike, have been extremely kind and supportive. They sent me a potted plant to wish me a fast recovery after my surgery, and it is the first plant I’ve ever owned that I haven’t killed within a week. The plant is still growing pretty nicely (perhaps a bit too much; I will need to move it to a larger pot soon) and sits next to the window in my apartment. And, lastly, all the doctors, nurses, technicians and radiation safety officers at the NYU Langone hospital have been amazing: they were kind, nice, patient and thorough every step of the way. I will still need to see them frequently for many years, but it has so far been a pleasant experience and I’m certain it will continue to be.

Oh, shoot. I almost forgot to mention my mom. My mom flew all the way from Seoul to take care of me during and after the surgery, just like she did the last time I was quite sick. Somehow all her trips to NYC have been pretty gloomy, as they were all taking-care-of-my-son trips. Mom, thanks, and next time I’ll make sure you visit me in NYC while I’m feeling well.

BERT has a Mouth and must Speak, but it is not an MRF

posted May 28, 2019, 7:03 AM by KyungHyun Cho   [ updated May 28, 2019, 7:20 AM ]

It was pointed out by our colleagues at NYU, Chandel, Joseph and Ranganath, that there is an error in the recent technical report <BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model> written by Alex Wang and me. The mistake is entirely mine, not Alex’s. There is an upcoming paper by Chandel, Joseph and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.

In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index t, as shown below:
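(The equation image did not survive here; the following is my hedged reconstruction from the report, where X_{\setminus t} denotes X with x_t replaced by [MASK], 1(x_t) is the one-hot vector of the token x_t, and f_θ(·)_t is BERT's logit vector at position t.)

```latex
\log \phi_t(X) = \mathbb{1}(x_t)^\top f_\theta(X_{\setminus t})_t
```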
Based on this formulation, I mistakenly thought that x_t would not be affected by the other log-potentials, i.e., log φ_{t'}(X) for t' ≠ t. This is clearly not true, because x_t is used as an input to BERT f_θ in each of those log-potentials.

In other words, the following equation (Eq. 3 in the technical report) is not a conditional distribution of the MRF defined with the log-potential above:
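(Again the equation image is missing; a hedged reconstruction of Eq. 3, which is simply BERT's masked-token softmax, in the same notation as above:)

```latex
p(x_t \mid X_{\setminus t}) = \frac{\exp\!\left(\mathbb{1}(x_t)^\top f_\theta(X_{\setminus t})_t\right)}{\sum_{x'} \exp\!\left(\mathbb{1}(x')^\top f_\theta(X_{\setminus t})_t\right)}
```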

It is, however, a conditional distribution of the t-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be a key point of Chandel, Joseph and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.

Apologies to everyone who read our technical report and thought of BERT as an MRF. It is a generative model and must speak, but it is not an MRF. Sincere apologies again.

Are we ready for self-driving cars?

posted May 4, 2019, 9:45 AM by KyungHyun Cho   [ updated May 4, 2019, 10:54 AM ]

Last Monday (April 29), I had the awesome experience of being invited to and participating in a debate event organized by Review and Debates at NYU. Having been born and raised in South Korea, I can confidently tell you that I cannot remember a single moment when I participated in any kind of formal debate, nor a single occasion on which I was taught how to make an argument for or against any specific topic. My mom often tells me I paint way too gloomy a picture of the Korean K-12 education I had, but it is true that our education system (at least when I was in it) never encouraged students to express their opinions and never taught us how to do so properly.

I digress. Back to my point: it was my first time participating in such an event, and I was quite nervous. To make matters worse, I had been travelling for four days before the debate and had absolutely no time to prepare, except for an hour in this airport, another hour in that airport, another hour on this plane and yet another hour on that plane. Also, I was asked to oppose the statement "we are not ready for self-driving cars," which frankly sounded a lot like biting a poisoned apple. My opponent in this debate was a dear colleague, Prof. Vasant Dhar, which did not help with my anxiety, because he is a great speaker, debater and thought leader in the field of data science and automated decision making.

I was supported by an amazing student from NYU Abu Dhabi, Muhammed Ali, whom I met only right before the debate. Muhammed is only a sophomore at NYU AD, majoring in Philosophy and Economics. I was impressed by his quick grasp of the points made by the participants, including me, Vasant and Vasant's student supporter, Ankita Sethi, and by his logical and enthusiastic responses to, or support of, them.

It was a great joy to participate. Although Muhammed and I eventually lost to Vasant and Ankita, decided by how many in the audience changed their minds over the course of the debate, I learned a lot from listening to Vasant's and Ankita's arguments, and I learned how awesome the students here are at thinking about and discussing these societal issues. Lucky y'all studying here.

Anyhow, I at least prepared an opening argument before the debate and would like to share it with you here. I've also attached at the end another, follow-up piece that I prepared but could not find an appropriate moment to read out during the debate. Apologies in advance, as both pieces are pretty rough; they were just scrap material I prepared for making a speech.

What do you think? Are we ready for self-driving cars? 

(Of course, don't forget this is the opening argument of the team that argued we are ready for self-driving cars, and that lost the debate.)


There is a misconception that a new technology is introduced into a static, frozen society. Under this misconception, no new technology is ever ready. The reality is that society and a new technology interact with each other and evolve together to become more compatible with each other. There is no perfect time to introduce a new technology, especially when it is disruptive, revolutionary and potentially has a long-term effect on society. In this sense, we are either ready from the very beginning or never ready, and I do not believe the central question of this debate is debatable to start with.

Rather, I believe we must carefully dissect the statement “we are ready for self-driving cars” along multiple axes. 

First, we must consider different subsets of the society. Within this axis, we must further consider different ways in which the society could be partitioned. 

One particular axis I want to emphasize is the primary purpose and use case of motor vehicles. Some use cars to drive themselves or their families around. Some use cars to drive passengers picked up from the street, such as taxi drivers. Some use cars to deliver various products from one place to another. Some use trucks to deliver products from one location to a remote one, often taking days to arrive at the destination.

Across these use cases, we notice a large variety of drivers according to their skills, their dedication, the number of hours spent driving and other, perhaps unnoticeable, properties. Furthermore, there are large differences across types of driving. For instance, we cannot easily compare the experience of driving a taxi in New York City against that of driving a long-haul truck across North America. Driving your kids to school in a suburban area cannot be compared to driving from Downtown Manhattan to JFK during rush hour.

Of course, these are only a small number of the use cases of motor vehicles. With these fine-grained categories of “driving”, which use case do we have in mind when we ask ourselves whether we are ready for self-driving cars?

Let us consider a hypothetical situation in which we let a machine drive on its own only on highways. For instance, a human driver would drive a long-distance truck to the city boundary, a machine would take over control, drive the truck all the way across the continent and give control back to potentially another human driver in the destination city. Certainly, in this case, the difficulty of self-driving is significantly lower than what is required for a machine to drive in a crowded city. Would we declare ourselves ready for this situation? If so, we are perhaps already ready for self-driving cars.

The main topic of this debate is thus ill-defined, and the validity of the statement is subjective at best. In this regard, I argue that we are ready for self-driving cars, because there are certainly scenarios under which self-driving could be deployed and used.

Second, we must consider different levels and axes of self-driving technology. The Department of Transportation and similar governmental authorities around the world have already begun investigating and defining what we mean by self-driving technology. What they, along with many other policy makers, policy advisors and scientists, have realized is that self-driving technology is not a single chunk of machinery. It is rather a large set of relevant technologies, such as sensor technology, motor technology, control algorithms and software integration, that together constitute self-driving technology. In other words, we must consider the different ways in which subsets of these technologies are combined to form one instance of self-driving technology.

At this point, we must ask ourselves a few questions in order to clarify what we mean by self-driving technology before asking whether we are ready for it.

Is automatic transmission considered autonomous driving technology relative to manual transmission? 

Is self-parking considered autonomous driving technology? If so, is society ready for this feature? Ford, BMW, Volkswagen, … all have actual products on the market that support it.

At what point did we begin considering ourselves ready for some of the technologies above? Similarly to the earlier argument I made, it is an ill-posed problem to ask whether we are ready for “self-driving cars”. If there is any hope of answering this question, which needs to be better specified, perhaps the criterion should be whether such a technology, or its variants, has already been deployed and used successfully in society. If we agree that this is a reasonable criterion, I argue that we are, and have been, ready for self-driving cars for a long time.

In fact, what I argue here is that we have already been ready for these new technologies, as this readiness to adopt new technologies and even new ways of thinking is precisely what differentiates us from other species.


At this point, I hope I have been able to convince at least some of you that it is impossible to argue for or against our readiness to adopt self-driving cars. We have been ready for these kinds of technologies for many decades and have gradually been implementing them in the real world. A self-driving “evolution”, not revolution, has been underway ever since the successful demonstration of a self-driving van by Pomerleau in Pittsburgh in the late ’80s and early ’90s, and it became much more evident exactly a week ago when Tesla demonstrated their latest self-driving capability.

I do not, however, want to give you the impression that everything has been solved and that our roads will be full of self-driving cars as soon as in a couple of years. Nor am I arguing that we should just wait and see what happens.

Each and every one of us is ready for self-driving cars on the road, but we do not necessarily agree on the timeline of their deployment, on how we would and should regulate them, just as we have done with various other technologies (including human-driven cars!), or on their future. These decisions cannot all be made before self-driving cars hit the road (though they already have) and will have to be newly created and constantly amended over time as we deploy more and more of them across society.

An interesting example of a similar spirit happened decades ago in Sweden, which switched overnight from driving on the left (as in London) to driving on the right (as in New York City). This decision was made many years after cars had become ubiquitous on the streets of Stockholm, and it was based on the government’s conclusions from observing how cars were used and driven on the roads. What this implies is that the question was not whether Swedes were ready for the technology, but what Swedes could agree was a better way of using the new technology of “human-driven” automobiles.

We call autonomous driving a new technology despite its age. We talk about our readiness and the technology’s readiness because it is a new technology. We are worried and excited about it. At the very root of these different “feelings” about autonomous driving lies the lack of an informed imagination of how the new technology would change our society and how our society would change the technology.

Can we then wait until we have a clearer picture of the future? Unfortunately, I do not believe that is possible. Our ability to look far into the future is simply not there. In other words, we are incapable of taking the optimal action, and we must acknowledge that. We do, however, have the ability to look into the near future and adapt ourselves and our society accordingly. And there are a few problems about which we can make educated guesses and which we are hopefully working toward fixing, such as the value-alignment problem, liability issues that could deepen societal inequality, and economic incentives (for manufacturers and operators) that could lead to economic problems. All of these were correctly and appropriately raised by Vasant and Ankita earlier today.

The question, then, is whether we should keep self-driving cars off the roads because we would never be ready for them (though, again, self-driving cars are already on the road), or whether we should embrace the reality that self-driving cars will be deployed increasingly over time and that we are ready to adapt ourselves and our society to be in harmony with them. I believe we are ready. The next and perhaps better question to ask and debate among ourselves is: “are we preparing ourselves to adapt ourselves and our society to the ever-increasing adoption of autonomous technology for our benefit?”

On the causality view of <Context-Aware Learning for Neural Machine Translation>

posted Mar 31, 2019, 4:59 PM by KyungHyun Cho   [ updated Apr 1, 2019, 7:10 AM ]

[Notice: what an unfortunate timing! This post is definitely NOT an april fool's joke.]

Sebastien Jean and I had a paper titled <Context-Aware Learning for Neural Machine Translation> rejected from NAACL'19, perhaps understandably, because we did not report any substantial gain in the BLEU score. As I finally found some time to read Pearl's <Book of Why> due to a personal reason (yes, personal reasons can sometimes help), I thought I would write a short note on how the idea in this paper was originally motivated. As I was never educated in causal inference or causal learning, I have been scared of using the term "causal" in any of my papers, and this paper was no exception. I feel this intentional avoidance of the term may have made the paper more obscure, and perhaps it's not a bad idea to use a blog post (and the time afforded by that personal reason) to write out my original motivation without worrying about academic scrutiny.

Let me focus on building a translation model that takes as input both the current source sentence X and the previous source sentence C, and outputs the translation Y of the current source sentence X, although there is no reason to restrict C to be only the single immediately preceding sentence. Let's introduce a variable Z that represents all that we do not observe directly, such as the world state, the author's intention and the actual meaning behind the text. You can think of Z as also including both benign and detrimental common sense, such as "bananas are always yellow" (when I was in Rwanda just a few weeks ago, I learned this is false; see the picture on the left, which I took in Kimironko Market, Kigali) and "presidents are often male; e.g., Monsieur President vs. Madame President", ...

If I were to draw a causal diagram, following Pearl, one version would look like below, where I used a dashed circle to explicitly indicate that Z is not observed:
The document, a part of which is represented by C and X, is created from (caused by) Z. The current sentence X is also caused by Z, but not necessarily by its preceding sentence C. This is one assumption I am not comfortable with, but it can be understood generously if we consider that in many cases we can more easily reorder sentences in a paragraph than we can reorder words in a sentence. Once we know both X and C, the translation Y of the source sentence X is determined (caused) by the source sentence X and the previous sentence C. Why is there an arrow from C to Y that bypasses X? This is due to the difference between the source and target languages; consider, for example, translating from a language without gendered pronouns into one with gendered pronouns.

Based on this diagram, what we next want to know in this "context-aware neural machine translation" is the effect of the previous sentence C on the translation Y of the source sentence X. 

Now, a fair warning before we proceed: because I have only given <Book of Why> a quick read, I may be completely off here.

Let's consider two paths from C to Y in the diagram above: C->Y and C<-Z->X->Y. The first path corresponds to the direct effect of C on Y, and the second path could be thought of as a path with a mediator X. The effect of the cause C on Y will be some function of these two, and if all the relationships are linear, the sum of the effects from these two paths will be the total effect of C on Y. 

Now, obviously, it would be best if we could somehow estimate the coefficient (a set of neural net parameters) associated with each arrow in the diagram above. Then we could compute the total effect exactly, and that would be the end of the story for causal inference. Unfortunately, other than the coefficients of the two arrows C->Y and X->Y, which can be estimated from data by fitting a neural machine translation system, it looks pretty unrealistic to estimate the parameters of Z->C and Z->X.

This is where we move away from causal inference and toward machine learning (in particular, machine translation). Instead of trying to estimate those coefficients and infer the causal effect of C on Y, our goal is now to train a neural machine translation system to maximally exploit the effect of C on Y. That is, we train a context-aware neural machine translation system such that the context C maximally influences (causes) Y in addition to the source sentence X, according to the causal diagram above.
Under this goal, the second path C<-Z->X->Y (the path colored blue above) is of interest, as this path contains the two arrows whose coefficients we don't know how to estimate. We notice that this path passes through the confounder Z, which we neither observe nor control for (though it could be an interesting future exercise to control for Z by finely partitioning a corpus.) One classical technique in this case is to run a randomized trial on C, which effectively cuts the arrow from Z to C. 
This cut means that the choice of C no longer depends on Z. In the case of training a context-aware neural machine translation system, this can be thought of as replacing the previous sentence with a randomly drawn sentence from a large corpus (though it is not at all clear what the distribution should be, and we discuss a few alternatives in Sec. 4.3.) Then, by contrasting the effect of C and X on Y with that of a randomly drawn C and X on Y, we can measure the effect of C on Y. This can be expressed in an equation:

Here, r(C) is a randomized context, and we use the conditional log-probability of Y given X and C (or r(C)) as the causal effect (score) s(Y|X,C). This formulation naturally lends itself to a new regularization term that encourages the context-aware neural machine translation system to maximize the effect of the context C on Y. We use the margin loss together with this causal effect at three different levels (minibatch, sentence and token). Here, let me write out the sentence-level regularization term:
Minimizing this term literally maximizes the causal effect of C on Y until it is at least as large as some predefined threshold (δ) multiplied by the length of each sentence.
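In case the equation doesn't reproduce well here, a small sketch of the sentence-level term as I've described it (the function name and arguments are mine, not from the paper; the two scores are the conditional log-probabilities s(Y|X,C) and s(Y|X,r(C)) under the model):

```python
def sentence_level_reg(log_p_correct, log_p_random, length, delta=1.0):
    """Margin loss: zero once the causal effect
    s(Y|X,C) - s(Y|X,r(C)) is at least delta * |Y|."""
    effect = log_p_correct - log_p_random
    return max(0.0, delta * length - effect)
```

The term vanishes once the model scores the translation sufficiently higher under the true context than under a randomized one; otherwise it pushes the gap up.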

We call this regularization technique "context-aware learning" (or context-aware regularization), as I was actively avoiding the term "causal" anywhere. Indeed, this technique helps, in the sense that the final, trained neural machine translation system actually degrades when a wrong context is provided, as opposed to a usual context-aware translation system, which is often trained without considering this causal effect. Compare (c) and (d) below while contrasting the columns "Normal" and "Context-Marginalized". We also observed some improvement even when the correct context was given (Normal), but the reviewers were not impressed.  

As you may have noticed, this approach is agnostic to the underlying machine translation system. As long as you can train the underlying system with the proposed regularization term, the framework carries over naturally. It is furthermore decoupled from the actual problem of machine translation: the proposed approach can be applied to any other problem where we have a set of input modalities, some of which are only weakly correlated with the output but are known to cause it. 

Phew, there you go! I'm glad that I found some time today to fulfill my deep desire to say "causality" out loud. 

PS1. I had another ill-fated attempt to apply this framework to generic supervised (unsupervised) learning and explain it without mentioning anything about causality or randomized trials: Though, I cannot tell whether Adji Dieng noticed this :)

PS2. the diagram above is slightly less satisfying, as there is no arrow from C to X. A natural next step would be the following:
We would probably want to randomize both C and X. Though, I am pretty sure there are better ways to do so.

PS3. While this paper was under review at NAACL'19, I saw a talk by Natasha Jaques who visited NYU. Her work nicely incorporated counterfactual analysis (now at the individual-level causal inference) to learning a set of coordinating neural net agents in a similar manner as my paper. Definitely worth a read: Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning.

Lecture note <Brief Introduction to Machine Learning without Deep Learning>

posted Jul 16, 2017, 12:01 PM by KyungHyun Cho   [ updated Jul 16, 2017, 12:07 PM ]

This past Spring (2017), I taught the undergrad <Intro to Machine Learning> course. This was not only the first time for me to teach <Intro to Machine Learning> but also the first time for me to teach an undergrad course (!) This course was taught a year earlier by David Sontag, who has since moved to MIT. Obviously, I thought about re-using David's materials as they were, which you can find at These materials are really great, and their coverage of various topics in ML is simply amazing. I highly recommend all the materials on that web page. Everything you need to know in order to become a certified ML scientist can be found there.

I, however, felt that this broad coverage might not be appropriate for an undergrad intro course, and also that I wasn't qualified to talk about many of those topics without first spending a substantial amount of time studying them myself. So, what could/should I do? Yes, I decided to re-create the whole course with two questions in mind. First, what is the minimal set of ML knowledge an undergrad needs in order to (1) grasp at least a high-level view of machine learning and (2) use ML in practice after graduating? Second, which topics in ML could I teach well without having to pretend to know things I don't know in depth? With these two questions in mind, as with the NLP course the previous year, I started writing a lecture note as the semester went on. At the end of the day (or semester), I feel I've taken a step in the right direction, though with much to improve in the future.

I started with classification. Perceptron and logistic regression were introduced as examples showing the difference between traditional computer science (design an algorithm that solves a problem) and machine learning (design an algorithm that finds an algorithm for solving any given problem). I then moved on to defining the (linear) support vector machine as a way to introduce various loss functions and regularization, though I gave up on teaching kernel SVMs due to time constraints. Logistic regression was then generalized to multi-class logistic regression with softmax.
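For the curious, here is a minimal numpy sketch of one gradient step of multi-class logistic regression with softmax (toy data and made-up names, just to illustrate the kind of thing covered):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# one gradient step of multi-class logistic regression on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 examples, 4 features
y = rng.integers(0, 3, size=100)     # 3 classes
W = np.zeros((4, 3))

probs = softmax(X @ W)
onehot = np.eye(3)[y]
grad = X.T @ (probs - onehot) / len(X)  # gradient of mean cross-entropy
W -= 0.1 * grad                          # one step of gradient descent
```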

For teaching how to deal with problems that are not linearly separable, I decided on an unorthodox approach. I started with a nearest-neighbour classifier, extended it into a radial basis function network with fixed basis vectors, and then into an adaptive basis function network, which I dubbed deep learning (which is true, by the way.) At this point, I think I lost about half of the class, but the other half, I believe, was able to follow the logic, judging from their performance in the final exam. I should've talked about kernel methods here, but well, it's not like I can spend the whole semester solely on classification.
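The basis-function step in that progression can be sketched as follows (a hypothetical helper, not the class code): with the training points as centers and gamma taken to infinity, picking the class of the largest feature recovers the nearest-neighbour classifier, while letting the centers adapt gives the adaptive basis function network.

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    """phi[i, j] = exp(-gamma * ||x_i - c_j||^2), the fixed RBF basis."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# a linear classifier on top of these features is an RBF network;
# making `centers` learnable turns it into an adaptive basis function network
```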

Then, I moved on to regression. Here I focused on introducing probabilistic ML. To do so, I had to spend two hours recapping probability itself. I introduced Bayesian linear regression and discussed how it relates to linear regression with a Gaussian prior on the weight vector. This naturally led to a discussion of how to do Bayesian supervised learning. I wanted to show them Gaussian process regression, but again, there wasn't enough time.
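For reference, the weight posterior in Bayesian linear regression with a Gaussian prior N(0, alpha^{-1} I) and Gaussian observation noise of precision beta has a closed form; a minimal numpy sketch (names are mine), whose posterior mean coincides with ridge regression with lambda = alpha/beta:

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=1.0):
    """Posterior N(m, S) over weights for prior N(0, alpha^{-1} I)
    and likelihood y ~ N(X w, beta^{-1} I)."""
    d = X.shape[1]
    S_inv = alpha * np.eye(d) + beta * X.T @ X  # posterior precision
    S = np.linalg.inv(S_inv)
    m = beta * S @ X.T @ y                      # posterior mean
    return m, S
```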

For unsupervised learning, I again took an unorthodox route by putting (almost) everything under matrix factorization (X=WZ) with a reconstruction cost and varying constraints. PCA and NMF were discussed in depth under this framework, and sparse coding and ICA were briefly introduced. k-means clustering was also introduced as a variant of matrix factorization, and the hard EM algorithm was (informally) derived by minimizing the reconstruction error under the constraint that the code vectors (Z) are one-hot. This whole matrix factorization view was then extended to deep autoencoders and to (metric) multi-dimensional scaling. Surprisingly, students were much more engaged with unsupervised learning than with supervised learning, and at this point, I had regained the half of the class I lost while teaching nonlinear classifiers.
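The k-means-as-matrix-factorization view can be sketched like this (a toy sketch, not the course code): minimizing ||X - ZW||^2 with each row of Z constrained to be one-hot yields exactly the hard EM alternation, where the E-step picks the nearest centroid and the M-step solves least squares for W, i.e., per-cluster means.

```python
import numpy as np

def kmeans_mf(X, k, iters=20, seed=0):
    """Minimize ||X - Z W||^2 with rows of Z constrained to be one-hot."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), k, replace=False)]  # k x d centroids
    for _ in range(iters):
        # E-step: the one-hot code picks the nearest centroid
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
        z = d2.argmin(axis=1)
        # M-step: least-squares W given one-hot Z = per-cluster mean
        for j in range(k):
            if (z == j).any():
                W[j] = X[z == j].mean(axis=0)
    return z, W
```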

The course ended with a final lecture in which I briefly introduced policy gradient. This was again done in a rather unorthodox way, by viewing RL as a sequence of classifiers. I'm quite sure RL researchers would cry over my atrocity here, but well, I thought this was a more intuitive way of introducing RL to a bunch of undergrads with highly varying backgrounds. Though, now that I think about it, it may have been better simply to play them the RL intro lecture by Joelle Pineau:
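To give a flavor of what I mean by treating the policy as a classifier, here is a tiny sketch (a toy illustration, not the class material) of exact policy-gradient ascent on a two-armed bandit, where the policy is just a softmax "classifier" over actions:

```python
import numpy as np

# exact policy-gradient ascent on a 2-armed bandit
r = np.array([0.2, 0.8])   # expected reward of each arm
logits = np.zeros(2)       # parameters of the softmax policy
for _ in range(500):
    p = np.exp(logits - logits.max()); p /= p.sum()
    J = p @ r                  # expected reward under the policy
    grad = p * (r - J)         # dJ/dlogits for a softmax policy
    logits += 1.0 * grad       # gradient ascent step

p = np.exp(logits - logits.max()); p /= p.sum()
# the policy concentrates on the better arm (index 1)
```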

Anyways, you can find a draft of my lecture note (which will forever be a draft until I retire from the university) at 

Any suggestion or PR is welcome at 

However, do not expect them to be incorporated quickly, as I'm only planning to revise it next Spring (2018).

During the course, I showed the students the following talks here and there to motivate them (and to give myself some time to breathe):

to arXiv or not to arXiv

posted Feb 12, 2016, 3:56 PM by KyungHyun Cho   [ updated Feb 12, 2016, 4:36 PM ]

I believe it is a universal phenomenon: when you're swamped with work, you suddenly feel the irresistible urge to do something else. This is one of those something else.

Back in January (2016), right after the submission deadline of NAACL'16, Chris Dyer famously (?) posted on his Facebook wall, "to arxiv or not to arxiv, that is the increasingly annoying question." This question of whether "to arxiv or not to arxiv" a conference submission that has not yet gone through peer review has indeed become a thorny issue in the field of machine learning and the wider research community around it, including natural language processing.

Perhaps one of the strongest proponents of "to arXiv" is Yann LeCun of NYU & Facebook. In his "Proposal for a new publishing model in Computer Science," he argues that "[m]any computer [s]cience researchers are complaining that our emphasis on highly selective conference publications, and our double-blind reviewing system stifles innovation and slow the rate of progress of [s]cience and technology." This is a valid concern, as we have observed that the rate of progress in computer science has largely overtaken the speed of the publication process. Furthermore, as the focus (and assessment) has moved from journals to so-called top-tier conferences, more and more papers get stuck in the purgatory of submit-review-reject-resubmit. Although conferences almost always guarantee faster decisions, the decision is binary, without much possibility of revision. The only way to salvage a rejected paper is to wait for another conference in the same year, or for the same conference in a subsequent year. Throughout this process, the content and ideas of a submission quite often become stale, thus slowing scientific progress.1

Of course, at the same time, there are many issues with this "to arXiv" approach, compared to the more traditional double-blind peer-review system ("not to arXiv.") Nowadays we see a flood of conference submissions on arXiv a day or two after the submission deadline of one conference, at least in the field of machine learning, or more specifically deep learning. Unfortunately, I must say that quite a few of them are low-quality submissions. Why are so many low-quality submissions being made public? After all, no author wants to be associated with a submission that is half-baked and incomplete.

One potential reason I see is the severe competition among researchers from all corners of the globe. Nobody wants to be scooped simply because they forgot to upload their submission to arXiv before their competitors did. Pushed by this anxiety over being scooped, authors often end up putting out a rather half-baked manuscript. Or maybe authors are simply being naive, thinking that one can always update one's manuscript on arXiv with a newer version. Combined with open reviewing systems, such as that of ICLR, we see a surge of half-baked submissions on arXiv once or twice a year, and this has been spreading to other conferences as well as other fields.2

Why is this an issue at all? Because it wastes many people's time. We see an interesting title pop up in our Google Scholar My Updates or in someone's tweet, and as researchers, we cannot ignore that submission, be it accepted at some conference or not. And after reading the paper for 10-30 minutes, we realize, "well, I should wait a few months for the next version!" Also, the frequent lack of thorough empirical validation may mislead readers toward a wrong conclusion.

But, again, I'm not trying to either advocate or oppose the idea of "to arXiv" in this post.3 Instead, I'm here to share the result of an informal survey I ran right after reading Chris' FB posting. The goal of the survey was to see how many people follow the "to arXiv" or the "not to arXiv" paradigm, and to what degree. The poll was completely anonymous and was done using the Facebook App <Polls for Pages>.4 It was rather informal, and the questions were slightly changed once at the beginning of the survey. Also, it's quite heavily biased, as most of the participants are people close to me, meaning that they work on either deep learning or (statistical) natural language processing. In other words, take the result of this poll with a grain of salt. 

In total, 203 people participated, and they were either machine learning or natural language processing researchers. Among them, 64.5% said their major area of research is machine learning, and the rest natural language processing. 

The participants were asked first whether they upload their conference "submission" to arXiv. About two thirds of the participants answered that they do.
When I drew this pie chart, I noticed a striking resemblance to the chart showing the proportion of machine learning researchers among the participants. Is it possible that all the machine learning researchers post their submissions to arXiv while no NLP researchers do? It turned out that the answer was "no."
(Two pie charts: one among ML researchers, one among NLP researchers.)

But still, I saw a stark difference between the machine learning researchers and the NLP researchers. While 75.6% of machine learning researchers said they upload their submissions to arXiv, less than 50% of NLP researchers did so. I believe this reflects the fact that the "to arXiv" model has recently been strongly advocated by machine learning researchers such as Yann LeCun and Yoshua Bengio.

The second question was on "when" they uploaded their submissions to arXiv.5
The respondents were quite divided among "to arXiv right away", "to arXiv after the deadline", and "to arXiv after the paper is accepted." One lesson is that an absolute majority of the respondents want to put their papers on arXiv regardless of "official" publication (in proceedings.) 

Now, aren't you curious how much this trend depends upon the field of research? First up, machine learning!
Whoah! More than half of the machine learning respondents said they upload a conference submission to arXiv before any formal feedback on it. Furthermore, more than 80% of the machine learning researchers make their papers available online well before the actual conference, meaning that anyone determined enough can read most machine learning papers far in advance of the conferences (of course, you can't drink beer with the authors, which is kind of a deal breaker for me..)

How much does it differ if we only consider NLPers?
Surprise, surprise! We see a radically different picture here. Only about a fifth of the NLP respondents said they upload their submissions before any formal feedback. Nearly half of the NLPers wait until a decision is made on the submission before they arXiv it. Also, nearly a quarter of them do not actively use arXiv for conference submissions.

Now, what have we learned from this? What have I learned from this? What have you learned from this? I have learned quite a lot of interesting things from this survey, but my dinner time's approaching too fast..

One thing is for sure: it would be extremely interesting to conduct this type of survey, in a much more rigorous way, at some point this year, and to do follow-up studies every year or two for the next decade. This would be an extremely valuable study that might help us build a better publication model for research.

So, my conclusion? It was $50 well spent.

The data (anonymized) along with a python script I used to draw those pie charts (it was my first time and I don't recommend it) is available at

1 There is also an issue of malicious reviewers, or, more mildly put, subconscious bias working against some submissions, but I won't try to touch this can of worms in this post.

2 I am guilty of this myself and do not in any sense intend to blame anyone. I view this as a systematic issue rather than an issue of an individual.

3 I will perhaps make another post some day on this, but not today, tomorrow nor this year.

4 Which was a pretty bad idea, because it turned out that I had to pay $50 in order for me to see the response from more than 50 respondents.. :(

5 I assumed every researcher has a good intention of having their paper made public once it's published regardless of whether to arXiv or not. Therefore, "probably not" should be understood as "probably not uploading a manuscript that was published in another medium/venue to a preprint server such as arXiv." 

Lecture Note for <NLP with Distributed Representation> on arXiv Now

posted Nov 25, 2015, 5:50 PM by KyungHyun Cho   [ updated Nov 25, 2015, 6:01 PM ]

On the same day I moved to NYC at the end of August, I had coffee with Hal Daume III. Among the many things we talked about, I just had to ask Hal for advice on teaching, as my very first full-semester course was about to start. One of the first questions I asked was whether he had his lecture slides all ready, now that it's been some years since he started teaching. 

His response was that there were no slides! No slides? I was shocked for a moment. Though, now that I think about it, most of the lectures I attended during my undergrad were in fact chalkboard lectures. 

I can understand that chalkboard lectures have many advantages, and most of them go to the students. The slow pace of a chalkboard lecture likely (but not necessarily) fits the pace of understanding what's going on in the lecture room better than simply flipping through slides does. Also, it becomes nearly impossible for a lecturer to skip anything, as every board starts empty. 

I took this as a challenge (though I'm sure Hal never meant it as one.) Also, I naively thought that the time I'd spend preparing 100 slides would be much greater than the time I'd spend preparing a chalkboard lecture. After all, I've been talking about NLP with DL over and over, and those talks successfully landed me a job. 

One piece of advice from Hal was that it is better to keep a record or note of what I will teach or have taught, so that I can reuse it over and over. In hindsight, it was perhaps not advice but simply his personal regret (+ a hint that I shouldn't do chalkboard lectures..)

Sticking to this advice, I decided to write a lecture note of roughly 10 pages each week. Since I can't even remember the last time I hand-wrote any text, I decided to use LaTeX. So far so good, except that it turned out to be an amazingly time-consuming job. Writing 10 pages each week has never felt so difficult (and I used the default LaTeX article class, which has gigantic margins..) 

About a month into the semester, I found this amazing review article (or lecture note, I'd say) by Yoav Goldberg. If only Yoav had uploaded it to arXiv a month and a half earlier! The course was already more than a third of the way through the semester, and I couldn't suddenly ask the students to switch from my (ongoing) lecture note to Yoav's. Why? Two reasons: (1) my lecture note had deviated quite far from Yoav's, and (2) my ego wouldn't let me declare, in front of the whole class, my failure at writing a lecture note myself.

Anyways, I continued writing the lecture note, and this Monday I gave the last lecture. I thought of cleaning it up significantly, adding more material and even putting in some exercises, but you know.. I'm way too exhausted to do even one of those now. I decided to put the latest version, as of Monday evening, on arXiv, and it showed up today:

I must confess that this lecture note is likely full of errors (both major and minor.) Also, I had to skip quite a few exciting new topics due to time constraints (if only the semester were twice as long! nope.) I kindly ask for your understanding.. I mean, it's been rough.

Any future plans for this lecture note? Hopefully I will convince the Center for Data Science at NYU of the importance of this course, and they'll let me teach the very same course next year. In that case, I will likely clean it up further, fix all those errors, update some of the later chapters, and, this time for real, add some exercise problems. Wish me luck!

Oh, right! Before finishing this post, I'd like to thank all the students and non-students who came to the lectures, and the two TAs, Kelvin and Sebastien, who have been an awesome help.
