# Getting Started
This documentation might be sparse, but it should be enough to get you started using the code. If you have questions that aren't answered here, please ask via the issue tracker, and I will update the documentation to answer them.
This documentation is for version 3.0 of the PRA code. The parameter specification has changed a bit since prior versions of the code, and what you see here will not match what the code expects if you’re using a prior version. There’s also the possibility that things will change (parameters either added, removed, or moved) in the future, so if you want to be certain that the code you’re using matches the documentation here, be sure to check out the tagged version 3.0 of the PRA code.
## Compiling and running the code
This PRA code was originally written in Java, using GraphChi-java as the engine for performing random walks on a graph. I have since switched to Scala as my main development language, so I expect that most new code in this repository will be written in Scala, and much of the codebase has already migrated from Java to Scala.
Scala uses `sbt` as its main build tool. You can download `sbt` [here](https://www.scala-sbt.org/). Once you have `sbt` installed and in your shell's `PATH`, you can run the following commands to clone the repository and run the tests, verifying that things are working correctly:
```
git clone https://github.com/matt-gardner/pra
cd pra
sbt test
```
If the output ends with something like the following, you can be confident that everything is set up properly:
```
/home/mg1/clone/pra$ sbt test
[info] Loading global plugins from /usr0/home/mg1/.sbt/0.13/plugins
[info] Loading project definition from /usr0/home/mg1/clone/pra/project
[info] Set current project to pra (in build file:/usr0/home/mg1/clone/pra/)
[... a lot of output from the tests ...]
[info] Passed: Total 65, Failed 0, Errors 0, Passed 65
[success] Total time: 2 s, completed Sep 29, 2014 5:11:42 PM
```
If you're getting an error, it's possible there's a bug in the most recent code; you could try checking out a tagged version (such as v3.0, with `git checkout v3.0`), as tagged versions should be relatively stable. (One of the tests is also a bit flaky, so if you get just one error in `SubgraphFeatureGeneratorSpec`, just re-run the tests and it should pass.)
If the tests succeed, you can actually run the code using `sbt run`. There are two different main methods in the code, however, so this will give you a list of possible main methods (the order you see may be different, and any given commit might have more or fewer than what is listed here):
/home/mg1/clone/pra$ sbt "run /home/mg1/pra/"
[info] Loading global plugins from /usr0/home/mg1/.sbt/0.13/plugins
[info] Loading project definition from /usr0/home/mg1/clone/pra/project
[info] Set current project to pra (in build file:/usr0/home/mg1/clone/pra/)
Multiple main classes detected, select one to run:
[1] edu.cmu.ml.rtw.pra.experiments.ExperimentScorer
[2] edu.cmu.ml.rtw.pra.experiments.ExperimentRunner
Enter number:
These `Experiment*` classes are Scala drivers that examine a directory structure and run whichever experiments are specified by the files they find (more on that later). The base directory they examine is the first argument passed to the program, hence the `/home/mg1/pra/` argument in the example command above. To pass arguments with `sbt`, you need to surround the whole command in quotes (`"run /home/mg1/pra/"`), or just type `run /home/mg1/pra/` after getting an `sbt` interactive console.
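If you already know which main class you want, `sbt`'s `runMain` task lets you skip the interactive prompt by naming the class explicitly; for example, to run `ExperimentRunner` directly:

```
sbt "runMain edu.cmu.ml.rtw.pra.experiments.ExperimentRunner /home/mg1/pra/"
```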
## Main Experiment Classes
There are two main experiment classes available, and each is described in its own section below. In what follows, I will repeatedly refer to `$pra_base`, which is the directory you pass in as the first argument to each of these classes (`/home/mg1/pra/` in the snippet above). Each of these methods also takes an optional second parameter that filters the experiments; see below for more details.
### ExperimentRunner
This code will run experiments. It looks for a directory called `experiment_specs/` under `$pra_base`, and recursively searches that directory for any files ending in `.json`. Each such file specifies an experiment. When an experiment is run, its results will show up in `$pra_base/results/`. Running this method will run all experiments that do not already have a corresponding directory under `results/` (so if, e.g., there was an error in your experiment and you need to re-run it, remove its directory in `results/`). If you want to run an experiment multiple times to test for variability in the algorithm, duplicate the `.json` file under different names, like `test_run1.json`, `test_run2.json`, etc.
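Concretely, both of those operations are just file manipulation. This sketch assumes a spec named `test_run.json`, and assumes that each results directory mirrors its spec file's name (minus the extension):

```
# Force ExperimentRunner to re-run an experiment by removing its results:
rm -r $pra_base/results/test_run

# Run the same specification a second time, to check variability:
cp $pra_base/experiment_specs/test_run.json \
   $pra_base/experiment_specs/test_run2.json
```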
The hierarchical structure makes it easy to organize focused experiments. For example, you might create a directory called `experiment_specs/tuning/`, with `.json` files `l1_.05,l2_.1.json`, `l1_.1,l2_.1.json`, or whatever you wish. Then when you run the experiments, all of the results will appear under `$pra_base/results/tuning/`. You can filter which experiments to run with the second parameter to this main method, so, e.g., `run /home/mg1/pra/ tuning` would only run the experiments with `tuning` in their name. (You can add as many filters as you like; the results of each filter are merged, so multiple filters act as an OR, not an AND.)
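To make the correspondence concrete, the tuning setup just described would produce a layout something like this (again assuming that results directories mirror the spec file names):

```
$pra_base/
  experiment_specs/
    tuning/
      l1_.05,l2_.1.json
      l1_.1,l2_.1.json
  results/
    tuning/
      l1_.05,l2_.1/
      l1_.1,l2_.1/
```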
`ExperimentRunner` will attempt to create any inputs that it needs from the specification you give it. For example, PRA needs a graph as input, and a graph is made up of some number of relation sets. You specify in the `.json` file how you want the graph made, and `ExperimentRunner` will check to see if it already exists, and create it if not. And so, as you can imagine, the `.json` file has a lot of potential options. For more information on the available options, see the experiment spec format. For more information on the output of `ExperimentRunner`, see the results directory contents.
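To give a flavor of what these files look like, here is an abbreviated, purely illustrative sketch; the key names below are placeholders I have made up for this example, and the experiment spec format page is the authoritative reference:

```json
{
  "graph": {
    "name": "my_graph",
    "relation sets": ["path/to/some_relation_set"]
  },
  "split": "my_split",
  "pra parameters": {
    "l1 weight": 0.05,
    "l2 weight": 0.1
  }
}
```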
`ExperimentRunner` is designed to allow for several instances running in parallel. So if you have some long set of experiment specifications, and you have a big enough machine to handle 4 experiments running at the same time, you can start 4 instances of this process, and things should just work. If you run into a problem when trying to do this, it's a bug and you should let me know.
### ExperimentScorer
After running experiments, this code will look through `$pra_base/results/`, compute metrics for all of the experiments there, and output a table of results, along with significance tests. You can also filter the experiments scored, so `run /home/mg1/pra/ tuning` would only show experiments with `tuning` in their name in the table. If there is a lot of output for an experiment, scoring the results can take some time, so this code caches computed metrics in `$pra_base/results/saved_metrics.tsv`. The cache uses timestamps to see whether the results have been updated, so you shouldn't ever have to touch the `saved_metrics.tsv` file, but if you're getting odd errors with `ExperimentScorer`, you might try deleting it.
I have plans to make the displayed output configurable with command-line options, but that's not done yet. If you want some other metric displayed in the table, just edit `sortResultsBy_` and `displayMetrics_` in `ExperimentScorer.scala`.
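The edit is small; the following shows the shape of it, though the values here are hypothetical and the actual declarations in `ExperimentScorer.scala` may use different types:

```scala
// Hypothetical edit inside ExperimentScorer.scala; check the real
// declarations before copying these lines.
val sortResultsBy_ = Seq("-MAP")         // sort the table by MAP, descending
val displayMetrics_ = Seq("MAP", "MRR")  // metrics shown as table columns
```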
The scorer is also extensible. You can add a `MetricComputer` relatively easily to compute your own metric, and call `scoreExperiments` with your customized list of `MetricComputer`s. As an example of why this is useful: I used this PRA code as the basis for a simple question answering system, and so I wrote a `MetricComputer` that goes through the list of questions, evaluates each SVO triple in the question given PRA model output, and gives an accuracy score on the question set. The result hooked right into this code as an additional metric that could be displayed alongside the standard MAP and MRR metrics computed by default. If you want to use this functionality and have trouble getting it to work, let me know. The example mentioned above can be found here.
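As a rough illustration of the kind of logic such a `MetricComputer` would contain, here is a minimal, self-contained sketch. The actual `MetricComputer` trait and the `scoreExperiments` signature live in the repository (see `ExperimentScorer.scala`); the file name and column layout assumed below are made up for this example:

```scala
import scala.io.Source

// Standalone sketch of a question-accuracy metric. In real use, logic like
// this would live inside a MetricComputer implementation passed to
// scoreExperiments; check ExperimentScorer.scala for the actual interface.
object QuestionAccuracySketch {

  // Assumed (hypothetical) file layout: one question per line, tab-separated,
  // with the model's predicted answer in column 2 and the gold answer in
  // column 3.
  def accuracy(scoresFile: String): Double = {
    val lines = Source.fromFile(scoresFile).getLines.toSeq
    val correct = lines.count { line =>
      val fields = line.split("\t")
      fields(1) == fields(2)
    }
    correct.toDouble / lines.size
  }

  def main(args: Array[String]) {
    println(s"question accuracy: ${accuracy(args(0))}")
  }
}
```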