OntoNotes 5.0 Dataset
This blog post lists the steps needed to train on the OntoNotes 5.0 dataset.
- Download the dataset from LDC (Linguistic Data Consortium)
- The raw LDC release is hard to parse directly, so download the skeleton files from http://conll.cemantix.org/2012/data.html. Make sure to download both the skeleton data and the scripts.
- Unzip all the skeleton files and scripts.
- Convert the skeleton files to CoNLL files by running the following command (it requires Python 2 and doesn’t work with Python 3):
./conll-2012/v3/scripts/skeleton2conll.sh -D ../ontonotes-release-5.0/data/files/data conll-2012
- Use the code here to convert the gold CoNLL files to NER format.
- We might still need to add DocStart lines (document-boundary markers) to train with spaCy.
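
The last two steps can be sketched in Python roughly as follows. This is a minimal illustration, not the conversion code referred to above. It assumes the gold CoNLL files follow the CoNLL-2012 column layout (word in column 3, named-entity bracket annotation in column 10, counting from 0), that one "token TAG" pair per line is an acceptable NER format, and that the document marker should be the line "-DOCSTART- -X- O O". The function name conll2012_to_ner is made up for this example.

```python
def conll2012_to_ner(lines, word_col=3, ner_col=10):
    """Turn CoNLL-2012 lines into 'token TAG' lines with BIO tags.

    Assumption: entities are bracketed as "(TYPE*" ... "*)" for
    multi-token spans and "(TYPE)" for single tokens.
    """
    out = []
    open_label = None  # entity type of the span currently open, if any
    for raw in lines:
        line = raw.strip()
        if line.startswith("#begin document"):
            # Assumed DocStart marker, followed by a blank line.
            out.append("-DOCSTART- -X- O O")
            out.append("")
            open_label = None
            continue
        if line.startswith("#end document"):
            continue
        if not line:
            # Blank line = sentence boundary; avoid doubled blanks.
            if out and out[-1] != "":
                out.append("")
            open_label = None
            continue
        cols = line.split()
        word, ann = cols[word_col], cols[ner_col]
        label = ann.strip("()*")
        if "(" in ann:
            tag = "B-" + label
            open_label = label
        elif open_label is not None:
            tag = "I-" + open_label
        else:
            tag = "O"
        if ")" in ann:
            open_label = None
        out.append(f"{word} {tag}")
    return out


if __name__ == "__main__":
    # Tiny hand-written sample in the assumed layout, not real corpus data.
    sample = [
        "#begin document (doc_0000); part 000",
        "doc_0000 0 0 In IN (TOP(S(PP* - - - Speaker#1 * * -",
        "doc_0000 0 1 Hong NNP (NP* - - - Speaker#1 (GPE* * -",
        "doc_0000 0 2 Kong NNP *) - - - Speaker#1 *) * -",
        "#end document",
    ]
    print("\n".join(conll2012_to_ner(sample)))
```

To process a real file, pass the open file object as lines and write the returned list out joined by newlines. If the column positions in your files differ, adjust word_col and ner_col accordingly.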