TAP Corpus

From HLT@INESC-ID

Revision as of 14:31, 29 November 2011 by Filcab (Instructions)

Getting the corpus

git clone ssh://ssh.l2f.inesc-id.pt/afs/l2f/home/filcab/git/tap.git

The instructions below assume the PDFs are in "originals/UP*…"

(The following commands link the PDFs into that directory.)

cd tap
mkdir originals
cd originals
lndir /afs/l2f/corpora/up-magazine/originals
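
As a quick sanity check after the commands above, something like the following sketch can confirm that the PDFs are visible under originals/. The function name count_pdfs is hypothetical, and the sketch assumes the files use a lowercase .pdf extension, which this page does not state.

```shell
# Hypothetical helper: count the PDF files visible under a directory.
# Symlinks created by lndir are counted too, because find matches them by name.
# Assumption: the files end in lowercase ".pdf".
count_pdfs() {
    dir="${1:-originals}"
    find "$dir" -name '*.pdf' | wc -l | tr -d ' '
}
```

Run from inside tap/, `count_pdfs originals` prints the number of PDFs found. A result of 0 suggests the lndir step did not link anything.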


Getting the corpus aligned

Prerequisite: Stanford CoreNLP installed at tap/stanford-corenlp
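
The prerequisite can be checked before starting a run. The sketch below only tests that the directory exists; it does not verify the CoreNLP installation itself. The function name check_corenlp is hypothetical, and the default path assumes the current directory is tap/.

```shell
# Hypothetical helper: succeed only if the CoreNLP directory exists.
# Assumption: run from inside tap/, so the default path is ./stanford-corenlp.
check_corenlp() {
    corenlp_dir="${1:-stanford-corenlp}"
    if [ -d "$corenlp_dir" ]; then
        echo "found $corenlp_dir"
    else
        echo "missing $corenlp_dir" >&2
        return 1
    fi
}
```

For example, `check_corenlp && ./everything` would start the run only when the directory is present.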

Run:

./everything

If the PDFs are not in originals/, run: ./everything <directory>

Aligned sentences are stored in aligned/. The tagged corpus is stored in tagged/.
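
To see whether a run produced output, a minimal sketch like the one below can count the files in both directories. The function name report_outputs is hypothetical, and nothing is assumed about the file names or formats inside aligned/ and tagged/, which this page does not describe.

```shell
# Hypothetical helper: report how many regular files each output directory holds.
# Takes the corpus checkout directory as an optional argument (default: current directory).
report_outputs() {
    base="${1:-.}"
    for d in aligned tagged; do
        if [ -d "$base/$d" ]; then
            n=$(find "$base/$d" -type f | wc -l | tr -d ' ')
            echo "$d: $n files"
        else
            echo "$d: missing"
        fi
    done
}
```

Run from inside tap/, `report_outputs` prints one line per directory, so an empty or missing output directory is easy to spot.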