
BERT has a Mouth and must Speak, but it is not an MRF

posted May 28, 2019, 7:03 AM by KyungHyun Cho   [ updated May 28, 2019, 7:20 AM ]
It was pointed out by our colleagues at NYU, Chandel, Joseph and Ranganath, that there is an error in the recent technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" written by Alex Wang and me. The mistake was entirely mine, not Alex's. There is an upcoming paper by Chandel, Joseph and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.

In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index t as

log φ_t(X) = 1(x_t)^T f_θ(X_{\t}),

where 1(x_t) is the one-hot vector of the t-th token, f_θ is BERT, and X_{\t} denotes the sequence X with x_t masked out. Based on this formulation, I mistakenly thought that x_t would not be affected by the other log-potentials, i.e., log φ_{t'≠t}(X). This is clearly not true, because x_t is clearly used as an input to the BERT f_θ in every other log-potential: the masked sequence X_{\t'} with t'≠t still contains x_t.

In other words, the following equation (Eq. 3 in the technical report)

p(x_t | X_{\t}) = exp(1(x_t)^T f_θ(X_{\t})) / Σ_{x'} exp(1(x')^T f_θ(X_{\t}))

is not a conditional distribution of the MRF defined with the log-potential above.
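To see the mismatch concretely, here is a toy numeric sketch (not BERT itself; the vocabulary size, sequence length and log-potential values are all made up for illustration): two log-potentials over length-2 sequences, where the second potential depends on the first token, so the true MRF conditional differs from the distribution obtained by normalizing a single log-potential.

```python
import math

# Toy numbers: a 2-symbol vocabulary and sequences X = (x0, x1) of length 2.
V = 2

# Hand-picked log-potentials log phi_t(X). Crucially, each phi_t may read the
# full sequence, just as a potential built on top of BERT's f_theta would:
# here phi_1 depends on x0.
def log_phi(t, X):
    if t == 0:
        return 0.0                        # phi_0 is flat in this toy example
    return 2.0 if X[0] == 1 else 0.0      # phi_1 varies with x0

def log_joint(X):
    return sum(log_phi(t, X) for t in range(2))

# True MRF conditional p(x0 | x1), obtained by normalising the full joint.
def true_cond(x1):
    scores = [math.exp(log_joint((a, x1))) for a in range(V)]
    z = sum(scores)
    return [s / z for s in scores]

# The "conditional" obtained by normalising phi_0 alone, which wrongly
# ignores that phi_1 also varies with x0.
def single_potential_cond(x1):
    scores = [math.exp(log_phi(0, (a, x1))) for a in range(V)]
    z = sum(scores)
    return [s / z for s in scores]

print(true_cond(0))              # ~[0.119, 0.881]
print(single_potential_cond(0))  # [0.5, 0.5] -- not the MRF conditional
```

Because phi_1 rewards sequences with x0 = 1, the true conditional is heavily skewed toward x0 = 1, while normalizing phi_0 alone gives a uniform distribution.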

It is, however, a conditional distribution over the t-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be key to Chandel, Joseph and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.
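That a set of conditionals need not come from any joint distribution can be checked on a tiny example (the two conditional tables below are invented for illustration, unrelated to BERT): for binary variables, any joint consistent with p(x | y) must factor as p(x | y) · m(y), so we can sweep the single free parameter m and verify that no choice reproduces the second table.

```python
# Two conditional tables for binary variables x and y, invented for this
# illustration. p_x_given_y[(x, y)] = p(x | y); p_y_given_x[(y, x)] = p(y | x).
p_x_given_y = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
p_y_given_x = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.5}

# Any joint q(x, y) consistent with p(x | y) must factor as p(x | y) * m(y),
# where m is the marginal of y. Given m, compute the q(y | x) it induces.
def induced_p_y_given_x(m0):
    m = {0: m0, 1: 1.0 - m0}
    q = {(x, y): p_x_given_y[(x, y)] * m[y] for x in (0, 1) for y in (0, 1)}
    out = {}
    for x in (0, 1):
        z = q[(x, 0)] + q[(x, 1)]
        for y in (0, 1):
            out[(y, x)] = q[(x, y)] / z
    return out

# Largest mismatch between the induced q(y | x) and the target p(y | x).
def worst_error(m0):
    induced = induced_p_y_given_x(m0)
    return max(abs(induced[k] - p_y_given_x[k]) for k in p_y_given_x)

# Sweep the single free parameter m(y = 0): no choice makes the error vanish,
# so no joint distribution has these two tables as its conditionals.
best = min(worst_error(i / 1000) for i in range(1, 1000))
print(best)  # stays well above zero
```

The sweep's minimum error stays bounded away from zero, so these two perfectly valid conditional tables are jointly incompatible, which is exactly the situation the learned per-position conditionals may be in.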

Apologies to everyone who read our technical report and thought of BERT as an MRF. It is a generative model and must speak, but it is not an MRF. Sincere apologies again.