TAP Corpus

From HLT@INESC-ID


Getting the corpus

git clone ssh://ssh.l2f.inesc-id.pt/afs/l2f/home/filcab/git/tap.git

The steps below assume the PDFs are in "originals/UP*…"

(You can run the following commands to put the PDFs there.)

cd tap
mkdir originals
cd originals
lndir /afs/l2f/corpora/up-magazine/originals
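After linking, it can help to confirm that the links actually resolve, since a symlink into /afs only works while that path is reachable. The helper below is a minimal sketch, not part of the tap repository: the name count_pdfs is hypothetical, and it assumes the files end in .pdf, which this page does not state explicitly.

```shell
# Hypothetical helper (not part of the tap repository).
# Counts PDF files under a directory, following the symlinks that lndir creates.
# Assumption: the files end in .pdf (matched case-insensitively).
# Dangling links are not counted, because find -L still reports them as links
# rather than as regular files.
count_pdfs() {
    find -L "$1" -type f -iname '*.pdf' | wc -l | tr -d ' '
}
```

Run from the tap directory, `count_pdfs originals` prints the number of PDFs that resolve. A result of 0 suggests the lndir step did not work or the source directory is not reachable.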


Getting the corpus aligned

Prerequisites: Stanford CoreNLP at tap/stanford-corenlp;

l2fstring available.

Running

./everything

If the PDFs are not in originals/, run: ./everything <directory>
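Before starting a run, a quick check of the layout described on this page can catch a missing directory early. The function below is a sketch under stated assumptions: the name preflight is hypothetical, it must be called from inside the tap checkout, and it only tests that the stanford-corenlp directory and the PDF directory exist. It does not check l2fstring, because this page does not say how l2fstring is made available.

```shell
# Hypothetical preflight check (not part of the tap repository).
# Run from inside the tap checkout.
# Usage: preflight [pdf_directory]   (defaults to originals, as on this page)
preflight() {
    pdf_dir="${1:-originals}"
    # Stanford CoreNLP is expected at tap/stanford-corenlp.
    if [ ! -d stanford-corenlp ]; then
        echo "missing: stanford-corenlp directory" >&2
        return 1
    fi
    # The PDFs are expected in originals/ or in the directory given as argument.
    if [ ! -d "$pdf_dir" ]; then
        echo "missing: PDF directory $pdf_dir" >&2
        return 1
    fi
    return 0
}
```

For example, `preflight && ./everything` runs the pipeline only when both directories are present. With a custom location, pass the same directory to both commands.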

Aligned sentences are stored in aligned/. The tagged corpus is in tagged/.

The aligned and tagged corpora are available at: /afs/l2f/home/filcab/tap-corpus