ACL 2019 Conference Summary

Computer Science
Machine Learning

February 10, 2020

My colleague Ananda and I attended the ACL 2019 conference in the enchanting city of Florence. All the accepted papers can be accessed here. Here is a summary of the interesting trends and the specific research work that caught my eye at the conference. A note of thanks to my employer, Zoho, for sponsoring us to attend.

I wrote this summary many months ago and forgot to post it. Better late than never, I guess.

Grammatical Error Correction

The system description papers for this competition were presented as posters at the conference.

  • The competition had three tracks. Restricted track: only the organizer-provided human-labelled parallel data (pairs of erroneous and corrected sentences) can be used, with no restriction on synthetic data. Unrestricted track: any data, including private data, can be used. Low-resource track: no human-labelled data can be used.
  • Interestingly, the Track 1 (restricted) submission from the winning team (Edinburgh + Microsoft) also beat the Track 2 (unrestricted) submissions, without using any data beyond what the restricted track allows.
  • Overall, the technique shared by the top-performing teams was generating synthetic data by corrupting well-formed, grammatical sentences from news, books and Wikipedia.
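
To make the corruption idea concrete, here is a minimal sketch of how (noisy, clean) training pairs could be produced from clean text. It is only an illustration and not the pipeline of any participating team: the function names, the three noise operations, the probability `p` and the `confusions` table are all my own assumptions.

```python
import random

def corrupt(tokens, rng, p=0.15, confusions=None):
    """Return a noisy copy of `tokens`.

    Each position is corrupted with probability `p` by one of three
    illustrative operations: delete the word, swap it with the next
    word, or replace it with an entry from a confusion table.
    """
    confusions = confusions or {}
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if rng.random() < p:
            op = rng.choice(["delete", "swap", "substitute"])
            if op == "delete":
                i += 1
                continue
            if op == "swap" and i + 1 < len(tokens):
                out.extend([tokens[i + 1], tok])
                i += 2
                continue
            if op == "substitute" and tok in confusions:
                out.append(rng.choice(confusions[tok]))
                i += 1
                continue
        # No corruption applied (or the chosen operation was not applicable).
        out.append(tok)
        i += 1
    return out

def make_pairs(sentences, seed=0, **kwargs):
    """Build (noisy source, clean target) pairs for training a corrector."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        noisy = corrupt(sentence.split(), rng, **kwargs)
        pairs.append((" ".join(noisy), sentence))
    return pairs

if __name__ == "__main__":
    clean = ["she goes to school every day"]
    # Hypothetical confusion table; real systems derive these from data.
    table = {"goes": ["go", "going"], "to": ["too", "at"]}
    for noisy, target in make_pairs(clean, seed=3, p=0.3, confusions=table):
        print(noisy, "->", target)
```

The clean sentence is always kept as the target, so a seq2seq model trained on such pairs learns to map the corrupted text back to the original.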

Multi-Lingual Models

Multilingual models are a hot area of research now. Early work on using a single model to perform tasks in multiple languages has shown promising results.

Rise of Automated Metrics

Until recently, we compared model outputs with human-written sentences for translation, summarization etc. This can artificially penalize models that generate sentences with an equivalent meaning but different words. There are a couple of papers that train models to score the quality of the output and then use these model scores as the reward for reinforcement learning. (FYI, reinforcement learning is only used for fine-tuning; none of the seq2seq models can be trained from scratch using it.)
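
A tiny example of the penalty in question. The sketch below uses a unigram-overlap F1 that I wrote as a simplified stand-in for word-overlap metrics; it is not BLEU or ROUGE, and the sentences are made up.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """F1 over shared lowercase words between a hypothesis and a reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the movie was really good"
paraphrase = "the film was excellent"        # same meaning, different words
word_salad = "good really was movie the"     # same words, no meaning

print(unigram_f1(word_salad, reference))            # 1.0
print(round(unigram_f1(paraphrase, reference), 2))  # 0.44
```

The scrambled sentence gets a perfect score while the faithful paraphrase gets less than half, which is the kind of mismatch a learned scoring model is meant to avoid.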

Statistical Evaluation

If we have two architectures and a couple of datasets, how do we show empirically that one is better than the other? Some of the questions are how to compare two models on the same dataset, across multiple datasets, and across various hyperparameter configurations. The problem with applying frequentist tests to metrics such as accuracy, F1-score etc. is that assumptions such as independent and identically distributed (IID) samples cannot be made for deep learning datasets. So we cannot assume that the score a model gets on one dataset is “independent” of its score on another dataset. Statistical tests that don’t assume an underlying distribution are needed. Statistical methods and tests for this are being developed, and some were presented at the conference.
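
As an illustration of a test that does not assume a particular distribution for the scores, here is a sketch of a paired sign-flip permutation test for two models evaluated on the same examples. This is a standard textbook procedure that I picked for the example, not necessarily one of the methods presented at the conference, and it still assumes the test examples are exchangeable. The function name and defaults are my own.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on per-example scores.

    Under the null hypothesis that the two models are interchangeable,
    the sign of each per-example difference is arbitrary, so we flip
    signs at random and count how often the mean difference is at least
    as large in magnitude as the observed one.
    Returns (observed mean difference, approximate p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per test example")
    if not scores_a:
        raise ValueError("need at least one pair of scores")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

if __name__ == "__main__":
    # Hypothetical per-example correctness (1 = correct) for two models.
    model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    model_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
    diff, p = paired_permutation_test(model_a, model_b)
    print(f"mean difference = {diff:.2f}, p = {p:.3f}")
```

Because the test works on per-example differences, both models must be scored on exactly the same test examples.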

Bayesian Methods

  • I attended a very detailed tutorial on this topic. The presenter summarized the evolution of research in this area and the current papers. Here’s a link to the detailed slides for fellow Bayesians.

Analyzing Neural Nets and Interpretability

There is an entire sub-field of research devoted to analyzing and interpreting neural networks.


“BERT-ology” papers explore what linguistic structures pre-trained models like BERT learn.

BlackBoxNLP Workshop

An entire workshop devoted to analyzing what neural networks learn.

Formal Languages Workshop

An entire small workshop devoted to finding out which formal languages (finite-state automata, etc.) neural networks can learn. For example, can we reduce an RNN to a weighted finite-state machine (which is far more interpretable, amenable to theory, etc.)? Although this area sounds exciting to me, I was unable to attend as I was in another workshop. Slides from Noah Smith’s talk on Rational Recurrences at this workshop.

Neuroscience and NLP

Neuroscience labs have started to use deep learning. Interesting research at the intersection of NLP and neuroscience, correlating ANN representations with brain signals, was presented.

Language Emergence in Multi-Agent systems

In this frontier, people try to train models to solve some task by communicating through symbols. Researchers analyze the properties of the language the agents use to solve the task and how it compares with the properties of human language.

Conversational AI

  • Neural models for selecting conversations from past history, detecting intent and slot filling are all increasingly being deployed by companies.
  • PolyAI (a startup in Singapore shipping conversational AI) shared three interesting papers. Their slides are also interesting.
  • On a related note, Baidu is doing impressive research and engineering on meeting transcription. They have a stack that does speech-to-text, translates the text as it is spoken (a problem that needed separate research, as the text would be incomplete), detects English phrases being spoken (code-switching) and then runs NLP over the transcribed text.

Machine Translation


  • Lots of new work on adapting translation models for low-resource languages.
  • Unsupervised translation and multilingual translation models are a few of the active areas of research.
  • Unbabel, a YC-funded startup building translation systems, shared lots of interesting and important results. Slides from their talk. This company employs a hybrid system where human translators do “post-edits” on machine translations, and some of their systems work in real time.

Contextual Search using Neural Representations at scale

This paper demonstrated a system that does dense vector search over the entire Wikipedia for open-domain QA.
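
For readers new to the idea, dense vector search boils down to embedding the query and the candidates into the same vector space and returning the candidates with the highest similarity. Below is a brute-force sketch of that core step with made-up three-dimensional vectors. The paper’s actual system is far more elaborate; the function names and toy embeddings here are purely my own assumptions, and at Wikipedia scale one would need an approximate nearest-neighbour index rather than a linear scan.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def build_index(doc_vectors):
    """Pre-normalize every document vector once, ahead of query time."""
    return {doc_id: normalize(vec) for doc_id, vec in doc_vectors.items()}

def search(index, query_vec, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

if __name__ == "__main__":
    # Hypothetical embeddings; a real system would get these from an encoder.
    index = build_index({
        "florence": [0.9, 0.1, 0.0],
        "bert": [0.0, 0.2, 0.9],
        "rnn": [0.1, 0.8, 0.3],
    })
    for doc_id, score in search(index, [0.0, 0.1, 1.0], k=2):
        print(doc_id, round(score, 3))
```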

Scaling search on neural vectors to do question answering on the entire Wikipedia on CPU -

Demo -