ACL 2020
Recursive Template-based Frame Generation for Task Oriented Dialog
- NLU Systems - Intent and Slots (ATIS example).
- ATIS has a shallow hierarchy (e.g. from_loc.city_name), but these hierarchies are usually ignored, which limits sharing of data and statistical strength across labels.
- In the restaurant domain, we can apply hierarchy to slots - orderitems.[item].{quantity, name, size} (a frame sketch follows this list).
- Contributions
- Recursive Hierarchical Frame Based Representation that captures complex relationships between slot labels and intents.
- Formulate Frame Generation as a template-based tree-decoding task (use Pointer Mechanism to copy slot values from input utterance).
- Tree-based loss function with global supervision and optimize jointly for all loss functions end-to-end.
- Approach
- Encoder - this paper used BERT as encoder but can be any.
- Slot Decoder - predicts the general form of the slot (city_name, month_name) at each step.
- Template-based Tree Decoder - decode hierarchical representation of slots (introduces NT (non-terminal) concept).
- Pointer Network - to predict positions for every terminal pointing to a specific token in the user sentence.
- Global Context - Tree decoder tends to repeat nodes since representations may remain similar from parent to child and we overcome this by providing global supervision.
- Evaluation
- ATIS and Simulated Restaurant Ordering Dataset
- EM and F1
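For concreteness, here is a rough sketch (my own, not taken from the paper) of the kind of nested frame the restaurant example above would produce:

```python
# Hypothetical hierarchical frame for "two large pepperoni pizzas and a coke".
# Slot names follow the orderitems.[item].{quantity, name, size} pattern from the notes.
frame = {
    "intent": "place_order",
    "orderitems": [
        {"item": {"quantity": "two", "name": "pepperoni pizza", "size": "large"}},
        {"item": {"quantity": "a", "name": "coke"}},
    ],
}
```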
Unknown Intent Detection using Gaussian Mixture Model with an application to Zero-shot Intent Classification
- Problem Formulation
- Given a training set of seen classes, can we train a model to predict {seen classes, unknown}.
- Can we take it a step further and classify the unknown utterances into specific unseen intents (zero-shot)?
- Gaussian Mixture Loss
- Better than softmax loss
- Large-margin Gaussian Mixture Cross-entropy loss (margin is adaptive and is the distance between utterance and class feature centroid)
- Semantic enhancement via class description
- We extract features from the class description (one word or a sentence)
- We assign this to be the class centroid.
- Identify unknown intent for generalized zero-shot learning
- Local Outlier Factor (LOF) - an unsupervised density-based anomaly detection method (usage sketch after this list).
- The intuition is that objects that have a substantially lower relative density value than their neighbors are considered to be outliers.
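A minimal usage sketch of LOF for flagging unknown intents, assuming utterance features already come from the trained encoder (scikit-learn's LocalOutlierFactor is used here for illustration; the paper's exact setup may differ):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Placeholder features: rows are encoded utterances from the seen intent classes.
train_feats = np.random.randn(500, 64)
test_feats = np.random.randn(10, 64)   # new utterances, possibly unknown intents

# novelty=True lets LOF score unseen points against the training-set density.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(train_feats)

pred = lof.predict(test_feats)         # -1 = outlier (treat as "unknown"), +1 = inlier
```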
Multi-Domain Dialog Acts and Response Co-Generation
- DST + NLG
- Hierarchical Dialog Acts (domains, actions, slots)
- Previous work represents dialog acts as one-hot vectors.
- This paper converts dialog act prediction into a generation problem, so the act representation comes from the sequence model.
- Act Generator and Response Generator share same encoder and input.
- Adopt uncertainty loss (Kendall et al. 2018) to weight the two losses (see the sketch below)
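For reference, the uncertainty weighting of Kendall et al. (2018), presumably used here to balance the act and response generation losses, has roughly this form (my paraphrase of the standard formulation, not copied from the paper):

```latex
\mathcal{L}(\theta, \sigma_{a}, \sigma_{r}) =
  \frac{1}{2\sigma_{a}^{2}} \mathcal{L}_{act}(\theta)
+ \frac{1}{2\sigma_{r}^{2}} \mathcal{L}_{resp}(\theta)
+ \log \sigma_{a} + \log \sigma_{r}
```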
A Contextual Hierarchical Attention Network with Adaptive Objective for DST
- Slot Imbalance
- Focal loss instead of Cross entropy loss?
- Adaptive objective
- CHAN Approach
- BERT encodes slots and values (each treated as a sentence; this encoder is fixed)
- Concat user and system utterance for every turn and calculate attention (slot-word attention). This is trainable.
- These are concatenated and fed to the context encoder, a transformer encoder network.
- Slot-turn attention - also multi-head, to calculate attention across turns.
- State Transition Prediction
- Predict v_t == v_t-1 for every slot.
XiaoIce
- Introduction
- Designed to be an AI companion
- Based on empathetic computing, integrating IQ and EQ
- Optimized for expected conversation-turns per session (CPS)
- aimed to pass time-sharing test
- Design Principle
- IQ: set of skills to keep up with the user and complete tasks
- EQ: empathy and social skills
- Personality: consistent
- Social Chat
- Hierarchical decision making
- Top-level (select dialog skills) and Low level (choose primitive actions).
- Optimize CPS
- Architecture and Core Conversation Engine
- Hybrid system
- 3 layers - UX layer, Conversation Engine layer, Data Layer
- Conversation Engine
- Dialog Manager
- Empathetic computing
- Core Chat
- Skills
- 230 skills released since 2014
- Core Chat
- General chat skill and domain chat skills
- Candidate Generation and Ranking
- Ethics Concerns
- Privacy
- Control
- Set the right expectation (XiaoIce is a bot)
A Generative Model for Joint NLU & NLG
- NLU - takes natural language to semantic representation
- NLG - is the reverse of NLU
- JUG
- latent variable z
- NLU is p(y | z, x)
- NLG is p(x | z, y)
- Objectives
- optimize the joint probability of x and y (a rough objective sketch follows)
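The notes are terse here, so the following is only a generic variational-style sketch of what optimizing the joint probability of x and y with a shared latent z usually looks like; the paper's exact objective and factorization may differ:

```latex
\log p(x, y) \;\geq\;
\mathbb{E}_{q(z \mid x, y)}\big[\log p(y \mid z, x) + \log p(x \mid z, y)\big]
\;-\; \mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z)\big)
```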
What does BERT with Vision Look At? (VisualBERT)
- VisualBERT
- Image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision.
- Pretrained with 2 tasks - MLM and sentence-image prediction task
- What does BERT with Vision learn during pre-training?
- Entity Grounding
- Some attention heads map entity to image regions
- Accuracy peaks in higher layers (10 and 11)
- Model deepens its understanding of image as layers go.
- Syntactic Grounding
- Non-entity words attend to image regions (for example “wearing” attends to “man region”)
- Different attention heads specialize in different relations.
Image Chat: Engaging Grounded Conversations
- The goal is to have machines engage with humans in conversations.
- Communication grounded in images is one of the ways to achieve this.
- Given (image, style trait) -> write the message in the conversation
- Models
- Retrieval-based and Generative Dialog Models
- Image Encoder
- ResNet152
- ResNeXt-IG-3.5B
- Style Encoder
- Embed style into N-dimensional vector
- Dialog History Encoder
- Transformer-based encoder
- Reddit dataset (next utterance)
- Retrieval model (TransResNet-RET)
- Multimodal combiner (sum or attention)
- Final Score is dot product of combined representation with Candidate Representation
- Training has 499 negative candidates.
- Generative model (TransResNet-GEN)
- Dialog Encoder
- Jointly encodes the style and dialog history
- Dialog Decoder
- Concatenation of the image encoding and the dialog encoder output
- standard Seq2Seq Transformer
- Beam search with beam size 2 and trigram blocking at inference time.
Adversarial NLI: A New Benchmark for NLU
- Are current NLU models as good as their high performance on standard benchmarks suggests?
- They are vulnerable to adversaries.
- The models are brittle, and general NLU is far from achieved despite SoTA results.
- HAMLET (Human and Model in the Loop Enabled Training)
- Write examples
- Get model feedback
- Verify examples and make splits
- Retrain
- Adversarial NLI
- ANLI is smaller than SNLI and MNLI but more useful and robust
- Error rates decrease as we progress through rounds (A1, A2, etc.)
- Model error rates halved with just 3 rounds
BART
- See this post
Enabling Language Models To Fill in the Blanks
- Editing and Revising
- We often write in a non-linear manner.
- Existing auto-complete system only considers the preceding text.
- Connecting Ideas
- Writing novel and connecting ideas.
- This is Text Infilling
- Arbitrary number of blanks
- Each blank has arbitrary number of words
- Enable LMs to perform task of infilling
- GPT-2 only models text left to right
- BERT must know the exact number of tokens to fill in
- ILM (Infilling by Language Modeling); the training format is sketched below
- Works better than GPT2.
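A sketch of the infilling training format, based on my recollection of the ILM paper's examples (the exact special tokens may differ):

```python
# The LM is trained on the masked text followed by the answers for each blank,
# so a standard left-to-right model learns to fill in blanks.
masked = "She ate [blank] for [blank]."
answers = ["leftover pasta", "lunch"]

training_example = masked + " [sep] " + " [answer] ".join(answers) + " [answer]"
# -> "She ate [blank] for [blank]. [sep] leftover pasta [answer] lunch [answer]"
```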
BLEURT: Learning Robust Metrics for Text Generation
- NLG (Translation, Abstractive Summarization and Data-Document Generation)
- How do we evaluate NLG?
- Human evaluation :(
- Automatic Metric?
- Given (candidate sentence, reference sentence) - can we come up with a score?
- BLEU (NGram overlap)
- no paraphrasing
- no synonyms
- WMT Metrics task
- Hybrid Metrics
- BERTScore (BERT representation dot product)
- more robust
- E2E Metrics
- Take pair of sentences as input and produce rating
- BEER, RUSE
- more flexible
- BLEURT
- transformer model with regression objective
- 4 steps (2 pre-training and 2 fine-tuning)
- Pre-training: BERT plus synthetic sentence pairs (a toy generation sketch follows this list)
- Random substitutions
- Back-translation
- Random Deletions
- 15 existing metrics (ROUGE, BERTScore, entailment, sentence BLEU)
- Open sourced (GitHub)
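A toy sketch of how the random-perturbation part of the synthetic pre-training pairs could be produced (my own illustration; the back-translation signal mentioned above is omitted):

```python
import random

def perturb(sentence, vocab, p_del=0.1, p_sub=0.1):
    """Create a noisy variant of a sentence via random deletions and substitutions."""
    out = []
    for tok in sentence.split():
        r = random.random()
        if r < p_del:
            continue                          # random deletion
        elif r < p_del + p_sub:
            out.append(random.choice(vocab))  # random substitution
        else:
            out.append(tok)
    return " ".join(out)

reference = "the cat sat on the mat"
candidate = perturb(reference, vocab=["dog", "ran", "under", "sofa"])
# (reference, candidate) pairs like this are then scored with existing metrics
# to provide pre-training signals for the regression model.
```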
Dialogue Dodecathlon: Open Domain Knowledge and Image Grounded Conversational Agents
- Introduction
- Goal
- Conv Agent that can have multiple skills (empathy, knowledge, personable and engaging).
- Should work in multiple modalities
- multitask
- DodecaDialog
- 12 sub-tasks
- Get to know you when you first talk to it (ConvAI2)
- Discuss everyday topics (DailyDialog, pushshift.io, Twitter)
- Speak knowledgeably at depth (Wizard of Wikipedia, Ubuntu)
- Answer questions on such topics (ELI5)
- Demonstrate empathy (Empathetic Dialogues, LIGHT)
- Discuss Images (Image Chat, IGC)
- Models
- Generative BERT Baseline
- only fine-tuned for text based tasks
- Image + Seq2Seq
- ResNeXt Pretrained on 3.5B IG Images
Reverse Engineering Configurations of Neural Text Generation Models
- Models leave detectable artifacts
- Do some modeling choices leave behind more artifacts than others?
- Can we distinguish between text generation models based on text generated alone?
- Which model configurations leave behind the most detectable artifacts?
- Given generated text, we try to predict the configurations.
- Configurations
- Top-K and Top-p nucleus sampling
- Length of initial conditional text
- Model size (base, large, mega)
Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations
- Summary
- light-weight model for dialog slot-filling
- also present a unique dataset
- MultiWOZ has categorical slots, and ATIS has single-turn interactions
- Restaurants8k
- GitHub available
- 5 slots (time, date, number of people, first name, last name)
- Models
- Subword vectors + CNN + CRF
- Read about ConveRT
Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling
- Background
- cross-domain slot filling
- 0 or few labels in target domain
- Concept Tagger (Bapna et al 2017)
- Conducts slot filling for each slot type
- Coach Framework
- Coarsely learn the pattern of entities (only 3 tags: B, I, O)
- In the second step, we predict the specific slot type using slot descriptions and a similarity matrix (see the sketch after this list)
- Template Regularization
- create templates and fill with wrong slots
- SNIPS and CoNLL 2003
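A minimal sketch of the fine step as I read it: each entity span found by the coarse B/I/O step is assigned the slot type whose description embedding it is most similar to (all names and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def assign_slot_types(entity_reprs, slot_desc_reprs):
    """entity_reprs: (num_entities, d) span representations from the coarse step.
    slot_desc_reprs: (num_slot_types, d) embeddings of the slot descriptions.
    Returns the index of the most similar slot type for each entity span."""
    sims = F.cosine_similarity(
        entity_reprs.unsqueeze(1),     # (num_entities, 1, d)
        slot_desc_reprs.unsqueeze(0),  # (1, num_slot_types, d)
        dim=-1,
    )                                  # (num_entities, num_slot_types)
    return sims.argmax(dim=-1)
```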
MuTual: A dataset for Multi-Turn dialog reasoning
- Reasoning in chit-chat dialog
- chatbots tend to select the best surface matching candidates
- Listening Comprehension
- High school listening comprehension
- Choose best answer from 3 options
Conversational Word Embedding for Retrieval-Based Dialog System
- Word representation
- Static
- Contextual
- Previous word embedding methods for conversation
- single sentence
- single vector space (both post and reply are in same vector space)
- PR-Embedding
- from conversation pairs in 2 vector spaces (post and reply)
- 2 embedding matrices
- Sentence level learning
- Dot Product, CNN and Max-pool
Multimodal Transformer for Multimodal Machine Translation
- Goal
- Input is (Image, Source language text)
- Output is target language text
- Problem
- Image is not as important
- Text is more important
- We should consider relative importance between them
- Methodology
- Extract features from image and fed into transformer.
- Perform multimodal self attention
- only query contains multimodal
- key and value only contain text
- see image for details; a rough attention sketch follows below
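A rough sketch of that attention pattern: queries come from the concatenated text + image sequence, while keys and values come from the text only (single head, illustrative projections, not the paper's exact parameterization):

```python
import torch

def multimodal_self_attention(text_h, image_h, w_q, w_k, w_v):
    """text_h: (text_len, d) text states; image_h: (num_regions, d) projected image features.
    w_q, w_k, w_v: (d, d) projection matrices."""
    queries_in = torch.cat([text_h, image_h], dim=0)  # queries see both modalities
    q = queries_in @ w_q                              # (text_len + num_regions, d)
    k = text_h @ w_k                                  # keys from text only
    v = text_h @ w_v                                  # values from text only
    attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                   # (text_len + num_regions, d)
```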
Slot Consistent NLG for Task-oriented Dialog Systems with Iterative Rectification Network
- Motivation
- NLG (convert meaningful representation to natural language)
- use Dialog Acts
- NLG pipeline: dialog act → template → lexicalization → natural language
- Delexicalization
- replace all slot values by their corresponding slot type in DA
- Hallucination
- misplacement error of unseen slots
- missing slots
- Approach
- Slot Extraction function and Slot consistent objective
- g(f(x)) = g(x)
- Model
- Iterative process
- Generate word/copy of template elements
- Obtain training samples from a buffer that consists of mistaken NLG results
Stanza - A Python NLP Toolkit for Many Human Languages
- Performance of existing toolkits is compromised when supporting multiple languages
- very limited multilingual linguistic resources
- Models (traditional toolkits have hand-written rules)
- Stanza
- Full support for Universal dependencies v2
- Fully neural pipeline
- currently supports 66 human languages (usage sketch below)
- Python client interface to CoreNLP library
- Stanza Neural Components
- Tokenization / Sentence Segmentation
- character sequence tagging (split at current position?)
- Multi-word token expansion
- expands to syntactic words
- Seq2Seq (Dictionary as L1)
- POS Tagger
- same as dependency parser below
- Lemmatization
- same as Multi-word token expansion
- Dependency Parsing
- Head and dependent representation for each word
- Bi-affine Transform
- NER
- Neural BiLSTM-CRF tagger enhanced with character LMs (FLAIR, Akbik et al)
- Evaluation
- Better than spaCy in accuracy and on par with FLAIR
- slower than spaCy (10x slower!!)
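For orientation, basic Stanza usage looks roughly like this (from memory of the documented API; exact argument names may have changed between versions):

```python
import stanza

stanza.download("en")  # fetch the English models once
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse,ner")

doc = nlp("Stanza was presented at ACL 2020.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.lemma, word.head, word.deprel)
```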
Photon: A Robust Cross-Domain Text-to-SQL System
- Live Demo at http://naturalsql.com/
- Natural Language to SQL
- accurately map NL input to executable SQL queries
- work across different databases
- robustness (don’t know is better than mistakes)
- support user interaction
- Photon
- SotA neural text to SQL
- Novel Confusion detection approach
- Template based response generation for user interaction
- Text to SQL Semantic Parsing
- Spider dataset
- Seq2Seq Encoder-Decoder
- Encoder (NL + Schema)
- Pointer Generator Decoder + Cross entropy loss
- Confusion Detection
- UTran-SQL dataset
- back-translation and adversarial filtering
- Binary classification and confusion span detector
Efficient DST by selectively overwriting memory
- DST is defined by a set of (slot, value) pairs.
- Pre-defined ontology based vs open-vocab based.
- This paper is about the latter.
- MultiWoz dataset (2.1)
- Selectively overwriting memory
- Carryover or update from previous turn?
- State Operation
- Carryover, delete, don’t care or update
- DST is divided into 2 tasks -
- state operation prediction and slot value generation (sketched below)
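A toy sketch of how the predicted state operations update the previous turn's state (operation names follow the list above; the value generator itself is abstracted behind a callable):

```python
def apply_operations(prev_state, operations, generate_value):
    """prev_state: dict slot -> value from the previous turn.
    operations: dict slot -> 'carryover' | 'delete' | 'dontcare' | 'update'.
    generate_value: callable that runs the slot-value generator for a slot."""
    new_state = {}
    for slot, op in operations.items():
        if op == "carryover" and slot in prev_state:
            new_state[slot] = prev_state[slot]
        elif op == "dontcare":
            new_state[slot] = "dontcare"
        elif op == "update":
            new_state[slot] = generate_value(slot)
        # 'delete' (or carryover of an unfilled slot) leaves the slot out
    return new_state
```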
Improving Low-Resource NER using Joint Sentence and Token Labeling
- Low-Resource
- Thai, Vietnamese and Indonesian
- Domain - e-commerce
- 2k annotations per language, which leads to overfitting
- BiLSTM CRF
- Improving
- Introduce a fake (auxiliary) task of predicting the sentence category
- Max-pooling over the final hidden representations, fed through a linear layer to predict the category.
- This reduces overfitting and helps generalization.
- Attention
- squeezes out a little extra performance.
Soft Gazetteers for Low-Resource NER
- Integrating handcrafted features with neural models is useful for NER
- Gazetteer features
- very limited for low-resource languages
- Soft Gazetteer
- English KBs only
- can use entity linking
- Features
- top-3 candidate scores
- top-3 type-wise counts
- top-30 type-wise counts
- margins between top-4 candidates
- BiLSTM CRF
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
- The goal is to accelerate BERT inference
- Early exiting
- using fewer layers based on confidence
- Have N classifiers for each of the layers
- Fine-Tuning
- Stage 1: same as BERT fine-tuning
- Stage 2: only fine-tune the classifiers; transformer layers are frozen
- Inference
- Exit at a lower layer when its classifier is confident enough (low entropy); see the sketch below
- Experiments
- Layers 1,7 and 9 are most important
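A simplified sketch of the entropy-based early exit at inference time (the threshold, pooling, and classifier interfaces here are placeholders, not DeeBERT's actual code):

```python
import torch

def early_exit_inference(hidden, layers, classifiers, entropy_threshold=0.3):
    """hidden: (1, seq_len, d) embeddings for a single example.
    layers / classifiers: per-layer transformer blocks and exit classifiers."""
    for layer, classifier in zip(layers, classifiers):
        hidden = layer(hidden)
        probs = torch.softmax(classifier(hidden[:, 0]), dim=-1)  # [CLS]-style pooling
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:
            return probs.argmax(dim=-1)                          # confident: exit early
    return probs.argmax(dim=-1)                                  # fall through to the last layer
```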
TriggerNER: Learning with Entity Triggers as Explanations for NER
- Triggers are phrases in the sentence that give clues about the entities
- Provide trigger annotations for CoNLL 2003
- Trigger Matching Networks
- BiLSTM + Structured Self-Attention Layer
- Sentence Representation - Mean Pooling
- Trigger Representation - only trigger words
- Contrastive loss between sentence and trigger representation for soft matcher
- Classification loss
- We use the trigger representation for global attention in BiLSTM for classification
- Results
- With 17% of the data, TMN matches a BiLSTM-CRF trained on 60% of the data
Climbing Towards NLU: On Meaning, Form and Understanding in the Age of Data
- Any system trained only on linguistic form cannot in principle learn meaning
- What is meaning?
- we easily conflate ‘form’ and ‘meaning’
- Form: what we see (marks on a page, pixels or bytes)
- Meaning: the relationship between linguistic form and something external to language (pairs of expressions and communicative intents)
- NLU: Given an expression e, in a context, recover the communicative intent i.
- Babies learn language
- Exposure to a language via TV or radio alone is not sufficient
- Interaction allows for joint attention
- Thought Experiments
- Java (Learn from source code and predict output at test time)
- Island (Bear GPT-2 example)
- Bottom-up vs Top-down
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
- How do you check if a model works?
- If model is SotA, do you ship it in production?
- Checklist
- Capabilities - NER, negation, sentiment
- Minimum Functionality (MFT), Invariance (INV), and Directional Expectation (DIR) tests
- Writing tests at scale: Tooling
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
- Prior methods learn one model per website
- New representation that allow a single model to extract from any website
- Overview
- Wrapper Induction (requires training data per website)
- Distantly supervised approaches (use a KB to automatically create training data)
- Output (subject,relationship,object)
- Graph attention network: build rich representations of text and visual layout using a GNN
- Methodology
- Page Layout Graph
- OpenIE and ClosedIE
- Pre-training (3 way classification of object, relation, other)
Prta: A system to support the analysis of Propaganda techniques in the News
- Left-leaning vs right-leaning
- Dataset Prta with live demo
Efficient Intent Detection with Dual Sentence Encoders
- Characteristics of a good intent detector
- accurate
- easily adaptable to specific domains
- not reliant on large amounts of data
- computationally efficient
- Datasets
- HWU (64 intents, 25K examples)
- CLINC (150 intents, 23K examples)
- Banking dataset (77 intents, 13K examples)
- Intent Detection Models
- NN (Sentence Repr) + FF Layer + Softmax
- ELMo, BERT, USE, ConveRT
ClarQ: A large-scale and diverse dataset for Clarification Question Generation
- Example
- Tell me more about ACL?
- Do you mean the ligament or the awesome conference?
- Benefits
- Resolve ambiguities
- Better user engagement
- Drive Long Term Conversation
- Limitations
- Data Source problem
- Web QA and StackExchange
Multilingual USE for Semantic Retrieval
- How are they trained?
- Multi-task Dual Encoder (Chidambaram et al)
- Retrieval QA, NLI and Translation Ranking
- Encoder architectures
- CNN
- Transformer
- Purpose
- Features for any downstream task
- Highlight for Semantic Text Retrieval
- Use ANN Search
Learning Robust Models for E-Commerce Product Search
- Query-Item mismatch
- For example, when query is “running shoes for men”, we return “men’s waterproof hiking shoes”
- Using fixed embeddings of query and title to compute similarity often fails to capture nuances
- Limited amount of data
- Attention on right key words in query and title
- Synthesis of hard positive samples
- Attention-based LSTM classifier
- standard architecture
- Query Generator
- Given the item I and query Q as encoder input, can we generate Q' with the decoder such that (I, Q') is a mismatch?
TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories
- 3 goals
- Taxonomy aware neural network for attribute value extraction
- Single model for all categories
- Category-specific attribute values with conditional self attention
- Use hierarchical taxonomy
- TXtract
- Train a separate neural net for each category?
- Expensive and prone to overfitting
- Merge hierarchical into flat categories
- not effective
- Category encoder
- generate category embeddings
- use this for conditional self attention (as query)
- Wrong category assignments
- first classify category
- Taxonomy aware loss function (predict both category and ancestors)
MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices
- Requirements
- Small model size and fast inference
- Task agnostic
- Good performance
- MobileBERT is trained by layer-to-layer imitation of its teacher, IB-BERT
- Bottleneck
- deep but much thinner (bottleneck)
- Stacked FeedForward
- In the original transformer, the MHA:FFN parameter ratio is 1:2
- With the bottleneck, MobileBERT's ratio would exceed 1:2, so feed-forward layers are stacked to restore the balance
- No Layer Normalization
- Use ReLU activation
- Hard to train deep and thin network
Improving Slot Filling by Utilizing Contextual Information
- Multi-Task
- Increase Mutual Info between word and context representation
- Predict word label using only its context
- Predict set of SF labels in the given sentence from the sentence representation
- Model Architecture
- standard BiLSTM with CRF
Learning to Classify Intents and Slot Labels Given a Handful of Examples
- Fine-Tuning
- Prototypical Networks
- Centroid for each class
- Return class which is closest to the given utterance
- softmax over negative distances to the centroids (a sketch follows this list)
- Model Agnostic Meta Learning (MAML)
- Optimization based
- Parameter initialization from which the model can quickly fine-tune on a small number of examples
- MAML fine-tunes on the support examples and optimizes the initial parameters based on the loss on the query set
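A compact sketch of the prototypical-network rule mentioned above: class centroids from the support set, then a softmax over negative distances (shapes and names are illustrative):

```python
import torch

def prototype_logits(support_emb, support_labels, query_emb, num_classes):
    """support_emb: (n_support, d); support_labels: (n_support,) ints;
    query_emb: (n_query, d). Returns (n_query, num_classes) logits."""
    centroids = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                              # (num_classes, d) class centroids
    dists = torch.cdist(query_emb, centroids) ** 2  # squared Euclidean distances
    return -dists                                   # softmax over these gives class probabilities
```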
Curriculum Learning for NLU
- Split dataset into N buckets - train on 1 and predict on the rest.
- Use this to estimate the difficulty level of each bucket (a sketch follows this list).
- Perform curriculum learning
- Start with easiest and slowly expand to difficult buckets
- Good gain on GLUE
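A sketch of the bucket-difficulty procedure as described in these notes (illustrative only; the actual procedure in the paper may differ):

```python
def order_buckets_by_difficulty(buckets, train_fn, eval_fn):
    """buckets: list of dataset shards. Train a probe model on one bucket,
    evaluate it on every bucket, and use the scores as a rough difficulty signal."""
    probe = train_fn(buckets[0])
    scores = [eval_fn(probe, bucket) for bucket in buckets]  # higher score = easier
    order = sorted(range(len(buckets)), key=lambda i: -scores[i])
    return [buckets[i] for i in order]                       # easiest bucket first
```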
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
- Generate questions from summaries
- Seq2Seq with beam search decoding
- Get answers from both summary and actual article
- SQuAD 2.0
- Check if they both match
- Better metric than ROUGE (correlation with human evaluation); a pipeline sketch follows
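A high-level sketch of the QA-based consistency check (the question generator, QA model, and agreement function are placeholders passed in as callables):

```python
def factual_consistency_score(summary, article, gen_questions, answer_fn, agree_fn):
    """gen_questions: summary -> list of questions (seq2seq QG model).
    answer_fn: (question, text) -> answer span (SQuAD-style QA model).
    agree_fn: (ans1, ans2) -> similarity in [0, 1], e.g. token-level F1."""
    questions = gen_questions(summary)
    if not questions:
        return 0.0
    scores = []
    for q in questions:
        ans_from_summary = answer_fn(q, summary)
        ans_from_article = answer_fn(q, article)
        scores.append(agree_fn(ans_from_summary, ans_from_article))
    return sum(scores) / len(scores)
```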
Intermediate-Task Transfer Learning with Pre-trained Language Models: When and Why Does It Work?
- Transfer learning works on related auxiliary tasks
- STILTs training
- finetune on intermediate task before finetuning on target task
- e.g., MRC on a large generic corpus before fine-tuning on MRC over a scientific corpus
- Is there a correlation between “skills” learned from an intermediate task and transferability on a specific target task?
- Probing Tasks
- TLDR
- common sense datasets (HellaSWAG, Cosmos QA and CommonsenseQA) transfer positively to downstream tasks
- Higher level semantic abilities have a higher correlation with target task performance