TAP Corpus

From HLT@INESC-ID

Latest revision as of 15:28, 29 November 2011

Getting the corpus

git clone ssh://ssh.l2f.inesc-id.pt/afs/l2f/home/filcab/git/tap.git

The scripts assume the PDFs are in "originals/UP*…"

(You can run the following commands to put the PDFs there.)

cd tap
mkdir originals
cd originals
lndir /afs/l2f/corpora/up-magazine/originals


Getting the corpus aligned

Prerequisites: Stanford CoreNLP at tap/stanford-corenlp, and l2fstring available.
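
Before running, it may help to confirm that the CoreNLP directory is where the scripts expect it. The following is a minimal sketch, not part of the repository: the function name check_corenlp is hypothetical, and the default path is taken to be relative to the directory that contains the tap checkout. It does not check for l2fstring, since this page does not say how that is installed.

```shell
# Hypothetical helper (not part of the tap repository).
# Checks that the Stanford CoreNLP directory exists.
# Usage: check_corenlp [path]   (default: tap/stanford-corenlp)
check_corenlp() {
    dir="${1:-tap/stanford-corenlp}"
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir" >&2
        return 1
    fi
}
```

If you are already inside the tap directory, pass the path explicitly: check_corenlp stanford-corenlp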

Running

./everything

If the PDFs are not in originals/, pass their location as an argument: ./everything <directory>
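
The two invocations can be wrapped in one small function, sketched below under some assumptions: run_everything is a hypothetical name, it must be called from the tap directory, and the only behavior it relies on is what this page states (./everything with no argument reads originals/, and accepts a directory otherwise).

```shell
# Hypothetical wrapper (not part of the tap repository).
# Runs ./everything on the default originals/ directory, or on the
# directory given as the first argument.
run_everything() {
    pdf_dir="${1:-originals}"
    if [ ! -d "$pdf_dir" ]; then
        echo "PDF directory not found: $pdf_dir" >&2
        return 1
    fi
    if [ "$pdf_dir" = "originals" ]; then
        ./everything
    else
        ./everything "$pdf_dir"
    fi
}
```

The directory check only catches a missing path early. It does not verify that the directory actually contains the UP* PDFs.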

Aligned sentences are stored in aligned/. The tagged corpus is in tagged/.
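
A quick way to see whether a run produced output is to count the files in both directories. This is only a sketch: count_files is a hypothetical helper, and nothing is assumed about the file names or formats inside aligned/ and tagged/, which this page does not describe.

```shell
# Hypothetical sanity check (not part of the tap repository).
# Counts regular files under a directory, recursively.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Report on the two output directories named above.
for d in aligned tagged; do
    if [ -d "$d" ]; then
        echo "$d: $(count_files "$d") files"
    else
        echo "$d: not found (has ./everything been run?)"
    fi
done
```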

The aligned and tagged corpora are available at: /afs/l2f/home/filcab/tap-corpus