Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

Apr 29, 2020 NLG

Introduction

The goal is to answer natural language queries with code snippets, much like the code snippets Google shows in its web answers.
Query: Open a file "f.txt" in write mode.
Answer: f = open('f.txt', 'w')

Architecture

The model is pre-trained on data extracted automatically from external knowledge resources, such as existing API documentation, before being fine-tuned on a small manually curated dataset.
We implement this approach on top of TranX (Yin and Neubig, 2018), a state-of-the-art syntax-based method for code generation, with additional hypothesis reranking (Yin and Neubig, 2019).
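
The pre-train-then-fine-tune order described above can be sketched as follows. This is only an illustration of the schedule, not the TranX training code: `two_stage_training`, the `train_epoch` callback, and the epoch counts are hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (natural-language intent, code snippet)


def two_stage_training(
    external_pairs: List[Pair],
    curated_pairs: List[Pair],
    train_epoch: Callable[[str, List[Pair]], None],
    pretrain_epochs: int = 2,   # assumed value, for illustration only
    finetune_epochs: int = 3,   # assumed value, for illustration only
) -> None:
    """Pre-train on noisy external pairs, then fine-tune on curated pairs."""
    for _ in range(pretrain_epochs):
        train_epoch("pretrain", external_pairs)
    for _ in range(finetune_epochs):
        train_epoch("finetune", curated_pairs)


# Stand-in trainer that only records which stage saw how many pairs.
log: List[Tuple[str, int]] = []
two_stage_training(
    external_pairs=[("open a file", "open(f)"), ("sort a list", "sorted(xs)")],
    curated_pairs=[("open file f.txt for writing", "f = open('f.txt', 'w')")],
    train_epoch=lambda stage, pairs: log.append((stage, len(pairs))),
)
```

The point of the ordering is that the small curated set is seen last, so the noisier external data shapes the model first and the clean data has the final say.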

Mined NL Code Pairs

Yin et al. (2018) propose training a classifier to decide whether an NL-code pair is valid, resulting in a large but noisy parallel corpus of NL intents and source code snippets.
The probability assigned by the classifier can serve as a confidence score for the quality of each automatically mined NL-code pair. We use these mined pairs as a first source of external knowledge.
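
One simple way to use that confidence is to keep only the highest-scoring mined pairs. The sketch below assumes each mined pair carries its classifier probability; `MinedPair`, `top_k_by_confidence`, and the sample probabilities are hypothetical and made up for the example.

```python
from typing import List, NamedTuple


class MinedPair(NamedTuple):
    intent: str
    snippet: str
    prob: float  # classifier probability that the pair is valid


def top_k_by_confidence(pairs: List[MinedPair], k: int) -> List[MinedPair]:
    """Keep the k mined pairs the classifier is most confident about."""
    return sorted(pairs, key=lambda p: p.prob, reverse=True)[:k]


# Toy data: the probabilities are invented, not taken from the mined corpus.
mined = [
    MinedPair("open a file for writing", "f = open('f.txt', 'w')", 0.92),
    MinedPair("open a file for writing", "import os", 0.11),
    MinedPair("parse a json string", "json.loads(s)", 0.78),
]
selected = top_k_by_confidence(mined, k=2)
```

A larger `k` gives more training data but lets in more noise, so the cutoff is a trade-off to tune rather than a fixed rule.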

Re-sampling API knowledge

The motivation is that some libraries have extensive documentation but see little actual usage (curses, for example), while others, such as json, are used far more often.
To mitigate this problem, we propose a retrieval-based re-sampling method to close the gap between the API documentation and the actual NL-code pairs we want to model.
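
A minimal sketch of the re-sampling idea, under stated assumptions: real NL intents are used as queries against the API documentation, each document's retrieval count is recorded, and documents are then sampled with probability proportional to a softmax over those counts. The word-overlap retriever, the `temperature` parameter, and all function names here are assumptions made for illustration; an actual system would use a proper retrieval engine.

```python
import math
import random
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, docs: List[str], top_n: int = 1) -> List[int]:
    """Hypothetical retriever: rank docs by word overlap with the query."""
    q = tokenize(query)
    overlap = [len(q & tokenize(d)) for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: overlap[i], reverse=True)
    return [i for i in ranked[:top_n] if overlap[i] > 0]


def retrieval_counts(queries: List[str], docs: List[str], top_n: int = 1) -> Counter:
    """Count how often each doc is retrieved by the real NL intents."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(retrieve(query, docs, top_n))
    return counts


def sampling_distribution(counts: Counter, n_docs: int, temperature: float = 1.0) -> List[float]:
    """Softmax over retrieval counts; higher temperature flattens the distribution."""
    weights = [math.exp(counts.get(i, 0) / temperature) for i in range(n_docs)]
    total = sum(weights)
    return [w / total for w in weights]


def resample(docs: List[str], probs: List[float], k: int, seed: int = 0) -> List[str]:
    """Draw k docs (with replacement) according to the sampling distribution."""
    return random.Random(seed).choices(docs, weights=probs, k=k)


docs = [
    "json.dumps serialize object to a json string",
    "curses.initscr initialize the curses library",
    "os.remove delete a file path",
]
queries = ["convert dict to json string", "serialize object to json", "delete a file"]

counts = retrieval_counts(queries, docs)
probs = sampling_distribution(counts, len(docs))
pretraining_docs = resample(docs, probs, k=5)
```

In this toy run the json entry is retrieved most often and the curses entry never, so the re-sampled pre-training data leans toward the APIs people actually ask about.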

Results

Results are reported as BLEU scores on the CoNaLa dataset.
Man: trained only on the manually curated CoNaLa data.
Man + Mined: trained on both CoNaLa and the mined data.
API: additionally uses API documentation for training.
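
For readers unfamiliar with the metric, here is a simplified corpus-level BLEU over token sequences (n-gram precision up to 4-grams, one reference per example, no smoothing). It is a sketch only: the whitespace tokenization is an assumption, and it is not the official CoNaLa evaluation script, which applies its own code tokenization.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references: List[List[str]], hypotheses: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU with a single reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped matches: each hypothesis n-gram counts at most as often
            # as it appears in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "f = open ( 'f.txt' , 'w' )".split()
exact = "f = open ( 'f.txt' , 'w' )".split()
wrong_mode = "f = open ( 'f.txt' , 'r' )".split()

score_exact = corpus_bleu([reference], [exact])
score_wrong_mode = corpus_bleu([reference], [wrong_mode])
```

Note that the second hypothesis differs in a single token yet opens the file in the wrong mode, which is a reminder that BLEU measures surface overlap rather than whether the generated code is correct.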

[Figure: NL code generation results]

Kaushik Rangadurai
