OntoNotes 5.0 Dataset

Dec 27, 2019 NER

This blog post lists the steps needed to train on the OntoNotes 5.0 dataset.

Download the dataset from LDC (Linguistic Data Consortium).
Because the raw LDC release is hard to parse directly, download the CoNLL-2012 skeleton files from here (http://conll.cemantix.org/2012/data.html). Make sure to download both the skeleton data and the scripts.
Unzip all the skeleton files and scripts.
Convert the skeleton files to CoNLL files by running this command (it requires Python 2 and doesn't work in Python 3):
```
./conll-2012/v3/scripts/skeleton2conll.sh -D ../ontonotes-release-5.0/data/files/data conll-2012
```
Use the code here to convert the gold CoNLL files to NER format.
You might still need to add DocStart markers to train with spaCy.
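
The last two steps can be sketched roughly as follows. This is not the conversion code linked above, just a minimal illustration written for Python 3. It assumes the gold CoNLL files follow the CoNLL-2012 column layout, with the word in column 4 and the bracketed named-entity tag (for example `(PERSON*` ... `*)`) in column 11, and that each document starts with a `#begin document` line. The function names and the default `-DOCSTART- -X- O O` marker (borrowed from the CoNLL-2003 convention) are assumptions made for this sketch, so check what your spaCy version actually expects.

```python
def bracket_to_bio(tags):
    """Convert bracketed entity tags such as '(PERSON*', '*', '*)' to BIO tags."""
    bio = []
    current = None  # label of the entity we are currently inside, if any
    for tag in tags:
        if tag.startswith("("):
            label = tag.strip("()*")
            bio.append("B-" + label)
            # A tag like '(GPE)' opens and closes the entity on the same token.
            current = None if tag.endswith(")") else label
        elif current is not None:
            bio.append("I-" + current)
            if tag.endswith(")"):
                current = None
        else:
            bio.append("O")
    return bio


def conll_to_ner(lines, word_col=3, ner_col=10, doc_marker="-DOCSTART- -X- O O"):
    """Turn CoNLL-2012 style lines into 'word tag' lines with document markers.

    word_col and ner_col are zero-based column indexes (assumed layout).
    Sentences are separated by an empty string in the returned list.
    """
    out = []
    words, tags = [], []

    def flush():
        # Emit the sentence collected so far, followed by a blank separator.
        if words:
            for word, tag in zip(words, bracket_to_bio(tags)):
                out.append(f"{word} {tag}")
            out.append("")
            words.clear()
            tags.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("#begin document"):
            flush()
            out.append(doc_marker)
            out.append("")
        elif line.startswith("#end document") or not line:
            flush()
        else:
            cols = line.split()
            words.append(cols[word_col])
            tags.append(cols[ner_col])
    flush()
    return out
```

To use it on a real file, read the file's lines, pass them to `conll_to_ner`, and write the result back out joined with newlines. If your files use a different column order, adjust `word_col` and `ner_col` accordingly.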
