Additional queries for SOFA
The virtual machine available for download includes the data and required libraries to perform experiments on the TPC-H query described in our paper. Due to licensing issues, we are not allowed to ship the other data sets and some required libraries within a single system.
Repeatability
To repeat our experiments, you need to create the following data sets from the additional data:
Meteor query | Data set | Data size |
---|---|---|
medline.meteor | Pubmed | 10 million randomly selected citations |
topic_detection.meteor | Wikipedia (English) | 100,000 full-text articles |
bankrupt.meteor | Wikipedia (English) | 50,000 articles from 2010 and 2012 each |
persons.meteor | Wikipedia (English) | 100,000 full-text articles |
persons_locations.meteor | Wikipedia (English) | 100,000 full-text articles |
tpch.meteor | TPC-H | 100 GB |
Scalability experiments:
Meteor query | Data set | Data sizes |
---|---|---|
topic_detection.meteor | Wikipedia (English) | 20 GB, 200 GB, 1 TB, 2 TB |
persons.meteor | Wikipedia (English) | 12 GB, 60 GB, 120 GB |
tpch.meteor | TPC-H | 2 GB, 20 GB, 200 GB, 1 TB, 2 TB |
0. Hardware and software requirements
Stratosphere expects the cluster to consist of one master node and one or more worker nodes. Before you set up the system, make sure your computing environment has the following configuration:
- a compute cluster (in our experiments, we used a cluster of 28 nodes),
- 1 TB HDD and 24 GB main memory per node,
- gigabit LAN connection between each of the nodes,
- Oracle Java, version 1.6.x or 1.7.x installed, and
- ssh installed (sshd must be running to use the Stratosphere scripts that manage remote components)
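As a quick, optional sanity check on each node, you can verify the Java version and that sshd is running, e.g.:
java -version        # should report version 1.6.x or 1.7.x
pgrep sshd >/dev/null && echo "sshd is running"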
1. Prepare the data sets
(a) Pubmed data set
Obtain a free license for the data set from the U.S. National Library of Medicine; see http://www.nlm.nih.gov/databases/journal.html for details. Download the data set in XML format and transform it into a JSON data set consisting of multiple JSON records, each with at least the following attributes:
Attribute name | Data type, description |
---|---|
pmid | integer, corresponds to the Pubmed ID of an article |
text | string, corresponds to the concatenated title and abstract of an article |
mesh | array of strings, corresponds to the associated MeSH terms of an article |
year | integer, corresponds to the year of publication |
A record in the required format should look like this:
{ "pmid": 12345,
"text": "This is a sample medline text",
"mesh": ["Aged", "Adolescent",...,"Female"],
"year": 2000
}
For efficient record processing, we suggest storing several citations per file, i.e., as an array of JSON records. In our experiments, we used a chunk size of 5000 citations per file.
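For illustration, a chunk file holding several citations as one JSON array could then look like this (all values here are made-up placeholders):
[
  { "pmid": 12345, "text": "This is a sample medline text", "mesh": ["Aged", "Female"], "year": 2000 },
  { "pmid": 12346, "text": "Another sample medline text", "mesh": ["Adolescent"], "year": 2001 }
]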
(b) Wikipedia data set
Download two dumps of Wikipedia articles from two different points in time, preferably from 2011 and 2013. The respective data sets can be found at http://dumps.wikimedia.org/archive/ (2010 and older) and http://dumps.wikimedia.org/backup-index.html (2011 until now). Transform the data into a JSON data set consisting of multiple JSON records, each with at least the following attributes:
Attribute name | Data type, description |
---|---|
id | integer, corresponds to the Wikipedia UID of an article |
url | string, URL of the article |
title | string, corresponds to the title text of an article |
timestamp | string, corresponds to the timestamp of the latest version of the article |
author | string |
text | string, corresponds to the content of the article. Any wiki markup must be removed in advance |
A record in the required format should look like this:
{ "id": 12345,
"url": "http://en.wikipedia.org/wiki?curid=12345",
"title": "Test entry",
"timestamp": "2010-11-15T15:11:26Z",
"author": "Some Author",
"text": "This is a wikipedia article with removed markdown."
}
For efficient record processing, we suggest storing several articles per file, i.e., as an array of JSON records. In our experiments, we used a chunk size of 1000 articles per file.
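As a rough sketch of the download step above (the file naming pattern is assumed from the current English Wikipedia dump pages; <DATE> is a placeholder for the dump date you selected, and older archived dumps may be organized differently):
wget http://dumps.wikimedia.org/enwiki/<DATE>/enwiki-<DATE>-pages-articles.xml.bz2
bunzip2 enwiki-<DATE>-pages-articles.xml.bz2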
2. Prepare required libraries
- Download Linnaeus in version 2.0 from http://linnaeus.sourceforge.net/
- Unpack the archive and place the file linnaeus/bin/linnaeus.jar inside the VM folder /home/sofa/experiments/lib/
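Assuming the Linnaeus 2.0 release was downloaded as a tar.gz archive (the exact archive name on the download page may differ), these steps look roughly like this:
tar -xzf linnaeus-2.0.tar.gz
cp linnaeus/bin/linnaeus.jar /home/sofa/experiments/lib/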
3. Obtain IE library and model files
Due to licensing issues with the implemented IE operators and the heavy space consumption of the required model files for the entity and relation recognition operators, these files are not included in the virtual machine. However, you can get both components upon request by contacting Astrid Rheinländer (rheinlae 'at' informatik 'dot' hu 'minus' berlin 'dot' de). Once you have obtained both files, place them in the folder /home/sofa/experiments/lib/.
4. Run IE queries for plan enumeration experiments
Run the additional queries using the following commands:
/home/sofa/experiments/bin/meteor-client.sh medline.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/meteor-client.sh topic_detection.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/meteor-client.sh bankrupt.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/meteor-client.sh persons.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/meteor-client.sh persons_locations.meteor --[enumerate|optimize|competitors]
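The bracket notation means that exactly one of the three modes is chosen per run; for example, running the Medline query with the --optimize flag looks like this:
/home/sofa/experiments/bin/meteor-client.sh medline.meteor --optimize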
5. Set up the cluster for scalability and distributed experiments
- Cluster configuration
In order to start/stop remote processes, the master node requires access via ssh to the worker nodes. It is most convenient to use ssh's public key authentication for this. To set up public key authentication, log on to the master as the user who will later execute all the Stratosphere components. Important: The same user (i.e., a user with the same user name) must also exist on all worker nodes. We will refer to this user as sofa. Using the super user root is highly discouraged for security reasons. Once you are logged in to the master node as the desired user, you must generate a new public/private key pair. The following command will create a new public/private key pair in the .ssh directory inside the home directory of the user sofa:

ssh-keygen -b 2048 -P '' -f ~/.ssh/id_rsa

See the ssh-keygen man page for more details. Note that the private key is not protected by a passphrase. Next, copy/append the content of the file .ssh/id_rsa.pub to your authorized_keys file:

cat .ssh/id_rsa.pub >> .ssh/authorized_keys

Finally, the authorized_keys file must be copied to every worker node of your cluster. You can do this by repeatedly typing in

scp .ssh/authorized_keys <worker>:~/.ssh/

and replacing <worker> with the host name of the respective worker node. After having finished the copy process, you should be able to log on to each worker node from your master node via ssh without a password.
- Setting JAVA_HOME on each node
Stratosphere requires the JAVA_HOME environment variable to be set on the master and all worker nodes and to point to the directory of your Java installation. You can set this variable in /home/sofa/experiments/conf/stratosphere-conf.yaml via the env.java.home key. Alternatively, add the following line to your shell profile:

export JAVA_HOME=/path/to/java_home/
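A minimal sketch of the alternative configuration entry in stratosphere-conf.yaml (the path is a placeholder for your Java installation):
env.java.home: /path/to/java_home/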
- Configure Stratosphere
- Configure Master and Worker nodes
Edit /home/sofa/experiments/conf/stratosphere-conf.yaml. Set the jobmanager.rpc.address key to point to your master node.
Furthermore, define the maximum amount of main memory the JVM is allowed to allocate on each node by setting the jobmanager.heap.mb and taskmanager.heap.mb keys. The values are given in MB. To repeat our experiments, please set the jobmanager heap size to 1000 MB and the taskmanager heap size to 22528 MB.
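For reference, a minimal sketch of the corresponding entries in /home/sofa/experiments/conf/stratosphere-conf.yaml (the master host name is a placeholder for your own cluster):
jobmanager.rpc.address: <master node IP or host name>
jobmanager.heap.mb: 1000
taskmanager.heap.mb: 22528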
Finally, you must provide a list of all nodes in your cluster which shall be used as worker nodes. Each worker node will later run a TaskManager. Edit the file conf/slaves and enter the IP address or host name of each worker node:

vi /home/sofa/experiments/conf/slaves

Each entry must be separated by a new line, as in the following example:

<worker 1>
<worker 2>
...
<worker n>

The directory /home/sofa/experiments and its subdirectories must be available on every worker under the same path. You can use a shared NFS directory, or copy the entire experiments directory to every worker node. Note that in the latter case, all configuration and code updates need to be synchronized to all nodes.
- Configure network buffers
Network buffers are a critical resource for the communication layers. They are used to buffer records before transmission over a network, and to buffer incoming data before dissecting it into records and handing them to the application. A sufficient number of network buffers is critical to achieve a good throughput.
Since the intra-node parallelism is typically the number of cores, and more than 4 repartitioning or broadcasting channels are rarely active in parallel, the required number frequently boils down to #cores^2 * #nodes * 4. To support a cluster of the same size as in our original experiments (28 nodes with 6 cores each), you should use roughly 4032 network buffers for optimal throughput. Each network buffer has a default size of 32 KiBytes, so in this example the system would allocate roughly 126 MiBytes for network buffers. The number and size of network buffers can be configured with the parameters taskmanager.network.numberOfBuffers and taskmanager.network.bufferSizeInBytes.
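Expressed as entries in stratosphere-conf.yaml, this setting would look roughly as follows for our cluster (the second line simply restates the 32 KiByte default):
taskmanager.network.numberOfBuffers: 4032
taskmanager.network.bufferSizeInBytes: 32768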
- Configure temporary I/O directories
Although Stratosphere aims to process as much data in main memory as possible, it is not uncommon that more data needs to be processed than memory is available. The system's runtime is designed to write temporary data to disk to handle these situations. The taskmanager.tmp.dirs parameter specifies a list of directories into which temporary files are written. The paths of the directories need to be separated by a colon character. If the taskmanager.tmp.dirs parameter is not explicitly specified, Stratosphere writes temporary data to the temporary directory of the operating system, such as /tmp on Linux systems.
- Install and configure HDFS
Similar to the Stratosphere system, HDFS runs in a distributed fashion. HDFS consists of a NameNode, which manages the distributed file system's metadata, while the actual data is stored by one or more DataNodes. For our experiments, we require that the NameNode runs on the master node while all worker nodes run an HDFS DataNode.
- Download and unpack Hadoop
Download Hadoop version 1.X from http://hadoop.apache.org/releases.html to your master node and extract the Hadoop archive:

tar -xvzf hadoop*
- Configure HDFS
Change into the Hadoop directory and edit the Hadoop environment configuration file:
cd hadoop-*
vi conf/hadoop-env.sh

Uncomment and modify the following line in the file according to the path of your Java installation:

export JAVA_HOME=/path/to/java_home/

Save the changes and open the HDFS configuration file:

vi conf/hdfs-site.xml

The following excerpt shows a minimal configuration which is required to make HDFS work. More information on how to configure HDFS can be found in the HDFS User Guide.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER:50040/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>DATAPATH</value>
  </property>
</configuration>

Replace MASTER with the IP address or the host name of your master node running the NameNode. DATAPATH must be replaced with the path to the directory in which the actual HDFS data shall be stored on each worker node. Make sure that the user 'sofa' has sufficient permissions to read and write in that directory.
After having saved the HDFS configuration file, open the DataNode configuration file:

vi conf/slaves

Enter the IP address or host name of each worker node that shall act as a DataNode. Each entry must be separated by a line break:

<worker 1>
<worker 2>
...
<worker n>
- Initialize HDFS
Initialize the HDFS by typing in the following command:
bin/hadoop namenode -format

Note that this command deletes all data previously stored in HDFS. Since we have installed a fresh HDFS, it should be safe to answer the confirmation with yes.
- Hadoop directory on DataNodes
Make sure that the Hadoop directory is available to all worker nodes which are intended to act as DataNodes. Similar to Stratosphere, all nodes must find this directory under the same path. To accomplish this, you can either use a shared network directory (e.g., an NFS share) or copy the directory to all nodes. Note that in the latter case, all configuration and code updates need to be synchronized to all nodes.
- Start HDFS
Enter the following commands on the NameNode:
cd hadoop-*
bin/start-dfs.sh

Please see the Hadoop quick start guide for troubleshooting.
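To verify that the NameNode and all DataNodes came up, you can, for instance, print the HDFS cluster report (a standard Hadoop 1.x administration command):
bin/hadoop dfsadmin -report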
- Move data into HDFS
Move all data you want to experiment with into HDFS using the command
bin/hadoop fs -put <localsrc> <HDFS_dest>

In this command, <localsrc> corresponds to the path of your files in the local Unix file system and <HDFS_dest> corresponds to the URI of the destination in HDFS. A complete guide to the HDFS file system commands can be found in the HDFS users guide.
- Change queries for use with HDFS
The queries contained in the virtual machine use the local file system rather than HDFS. For each query you want to run, change the file paths of the input and output files to the respective HDFS URIs where your experimental data is stored (i.e., URIs of the form hdfs://MASTER:50040/path/to/data, matching the fs.default.name setting configured above).
- Start the system in cluster mode
To start the Stratosphere system in cluster mode, execute the following commands on the master node:

/home/sofa/experiments/bin/start-cluster.sh
/home/sofa/experiments/bin/sopremo-server.sh start
- Execute queries and optimize with SOFA as described in Section 4.
- Shutdown system
When finished, shut down the system and HDFS using the following commands:

/home/sofa/experiments/bin/stop-cluster.sh
/home/sofa/experiments/bin/sopremo-server.sh stop
cd ~/hadoop-*
bin/stop-dfs.sh
Contact
- Astrid Rheinländer, rheinlae 'at' informatik 'dot' hu-berlin 'dot' de
- Ulf Leser, leser 'at' informatik 'dot' hu-berlin 'dot' de