TAP Corpus
From HLT@INESC-ID
Latest revision as of 15:28, 29 November 2011
Getting the corpus
git clone ssh://ssh.l2f.inesc-id.pt/afs/l2f/home/filcab/git/tap.git
The scripts assume the PDFs are in "originals/UP*…"
(The following commands put the PDFs there.)
cd tap
mkdir originals
cd originals
lndir /afs/l2f/corpora/up-magazine/originals
Getting the corpus aligned
Prerequisites: Stanford CoreNLP at tap/stanford-corenlp; l2fstring available.
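Before running, it may help to confirm that the expected directories are in place. The following is a minimal sketch, not part of the repository: require_dirs is a hypothetical helper, and the paths in the example comment are assumptions about where the checkout lives. It does not check l2fstring, since this page does not say how that is installed.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of the tap repository).
# Returns 0 only if every directory given as an argument exists;
# names each missing directory on stderr.
require_dirs() {
    rc=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (assumed paths, relative to the parent of the checkout):
# require_dirs tap/stanford-corenlp tap/originals && echo "ready"
```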
Running
./everything
If the pdfs are not at originals/, run: ./everything <directory>
Aligned sentences are stored in aligned/. The tagged corpus is at tagged/.
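A quick way to check that a run produced output is to count the files in each directory. This is a sketch under the assumption that the results are regular files somewhere under aligned/ and tagged/; the file naming is not documented here, and count_outputs is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the number of regular files under a directory.
# Prints 0 if the directory does not exist.
count_outputs() {
    dir=$1
    if [ -d "$dir" ]; then
        # tr strips the padding some wc implementations add
        find "$dir" -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Example, run from the directory containing aligned/ and tagged/:
# echo "aligned: $(count_outputs aligned)  tagged: $(count_outputs tagged)"
```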
The aligned and tagged corpora are available at: /afs/l2f/home/filcab/tap-corpus