TAP Corpus
From HLT@INESC-ID
Getting the corpus
git clone ssh://ssh.l2f.inesc-id.pt/afs/l2f/home/filcab/git/tap.git
The instructions below assume the PDFs are in "originals/UP*…".
(You can run the following commands to link the PDFs there.)
cd tap
mkdir originals
cd originals
lndir /afs/l2f/corpora/up-magazine/originals
Getting the corpus aligned
Prerequisite: Stanford CoreNLP installed at tap/stanford-corenlp
Run:
./everything
If the PDFs are not in originals/, pass their directory as an argument: ./everything <directory>
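Before starting a run, it can help to confirm that the prerequisite and the PDF directory are in place. The following is a minimal sketch, not part of the tap repository: the function name tap_preflight is hypothetical, and it assumes the PDFs are files whose names start with UP, as in "originals/UP*…".

```shell
#!/bin/sh
# Hypothetical preflight helper; NOT part of the tap repository.
# Usage: tap_preflight <tap-checkout> [<pdf-directory>]
tap_preflight() {
    tap_dir=$1
    # Default to originals/ inside the checkout, as ./everything does
    # when called without an argument.
    pdf_dir=${2:-$tap_dir/originals}

    if [ ! -d "$tap_dir/stanford-corenlp" ]; then
        echo "missing Stanford CoreNLP at $tap_dir/stanford-corenlp" >&2
        return 1
    fi
    if [ ! -d "$pdf_dir" ]; then
        echo "missing PDF directory $pdf_dir" >&2
        return 1
    fi
    # Assumption: the PDFs are named UP*. If nothing matches, the
    # pattern stays unexpanded and the -e test below fails.
    set -- "$pdf_dir"/UP*
    if [ ! -e "$1" ]; then
        echo "no UP* files found in $pdf_dir" >&2
        return 1
    fi
    echo "ok: $# file(s) in $pdf_dir"
}
```

A possible use from inside the checkout is `tap_preflight . && ./everything`, or `tap_preflight . <directory> && ./everything <directory>` when the PDFs live elsewhere.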
Aligned sentences are stored in aligned/. The tagged corpus is stored in tagged/.
The aligned and tagged corpora are available at: /afs/l2f/home/filcab/tap-corpus
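To check that a run actually produced output, one option is to count the files under aligned/ and tagged/. This is only a sketch: the helper names count_files and report_outputs are hypothetical, and it assumes nothing about the file layout inside the two directories beyond their containing regular files.

```shell
#!/bin/sh
# Hypothetical helpers; NOT part of the tap repository.

# Count regular files below a directory (recursively).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Report how many files each output directory holds.
# Usage: report_outputs [<tap-checkout>]   (defaults to the current directory)
report_outputs() {
    base=${1:-.}
    for d in aligned tagged; do
        if [ -d "$base/$d" ]; then
            echo "$d: $(count_files "$base/$d") file(s)"
        else
            echo "$d: missing"
        fi
    done
}
```

Running `report_outputs` from the checkout after `./everything` finishes prints one line per output directory.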